Computing Reviews
Website replica detection with distant supervision
Carvalho C., Moura E., Veloso A., Ziviani N. Information Retrieval 21(4): 253-272, 2018. Type: Article
Date Reviewed: Jan 18, 2019

Detecting duplicate web pages is of great importance to search engines because duplicates are very costly to index. The work of Carvalho et al. advances the technology for detecting web page duplicates, with the potential to improve search engine performance. Large-scale experiments show that the proposed algorithms are time-efficient and effective at detecting duplicates.

Their algorithm first learns classifiers using expectation-maximization (EM) methods with both positive and negative samples. One novel property of the algorithm is that the classifiers are updated incrementally, which reduces running time. Another is the use of multiple features, such as edit distance, host name matching, full path matching, and IP address matching on four octets (ip4) or three octets (ip3); combining multiple features improves the quality of the classifiers. Once the classifiers are built, the classification algorithm computes likelihood scores and ranks the candidate duplicate URLs. A Pareto-based technique, borrowed from economics, further enhances the quality of the classification.
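
To make the feature set concrete, the following minimal Python sketch (an illustration, not the authors' implementation) shows how such pairwise features might be computed for two candidate URLs; the function names and the exact feature definitions are assumptions.

from urllib.parse import urlparse

def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def pair_features(url_a: str, url_b: str, ip_a: str, ip_b: str) -> dict:
    # Assumed feature set: URL edit distance, host name match, full path
    # match, and IP match on four (ip4) or three (ip3) octets.
    # URLs are assumed to include a scheme, e.g., http://.
    pa, pb = urlparse(url_a), urlparse(url_b)
    oa, ob = ip_a.split("."), ip_b.split(".")
    return {
        "edit_distance": edit_distance(url_a, url_b),
        "host_match": pa.hostname == pb.hostname,
        "path_match": pa.path == pb.path,
        "ip4_match": oa == ob,          # all four octets equal
        "ip3_match": oa[:3] == ob[:3],  # same first three octets
    }

A classifier would then consume such feature vectors, one per candidate pair.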

The authors used a large dataset to verify their algorithms. They collected 250 million URLs by crawling the web. The data was subsequently narrowed down to 172,004 websites that shared at least one fingerprint with another website, which generated about 111 million replica candidates. From this collection, the authors drew the 50,000 positive samples and 800,000 negative samples used for learning.
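
The fingerprint-based candidate generation can be sketched in a few lines (again an illustrative assumption, not the paper's pipeline): index sites by the fingerprints they contain, then pair every two sites that share one.

from collections import defaultdict
from itertools import combinations

def replica_candidates(site_fingerprints):
    # site_fingerprints: dict mapping a site to its set of content fingerprints.
    by_fp = defaultdict(set)
    for site, fps in site_fingerprints.items():
        for fp in fps:
            by_fp[fp].add(site)
    pairs = set()
    for sites in by_fp.values():
        # Every pair of sites sharing this fingerprint is a replica candidate.
        pairs.update(combinations(sorted(sites), 2))
    return pairs

# Example: sites A and B share fingerprint f1, so (A, B) is a candidate.
print(replica_candidates({"A": {"f1", "f2"}, "B": {"f1"}, "C": {"f3"}}))
# -> {('A', 'B')}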

The experimental results are very encouraging. Removing the duplicates reduces the size of the collection by 19 percent, with a false positive rate of only 0.005. If combined with URL-level algorithms, the reduction can increase to 21 percent.

Because the learning is automatic, the proposed algorithms are applicable to real-world websites and could improve the quality of search results.

Reviewer: Xiannong Meng | Review #: CR146388 (1904-0130)
Categories: World Wide Web (WWW) (H.3.4); Abuse and Crime Involving Computers (K.4.1)
Other reviews under "World Wide Web (WWW)":
Intranet document management
Bannan J., Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1997. Type: Book (9780201873795)
Feb 1 1998
Developing databases for the Web and intranets
Rodley J., Coriolis Group Books, Scottsdale, AZ, 1997. Type: Book (9781576100516)
Jun 1 1998
1001 programming resources
Edward J. J., Jamsa Press, Houston, TX, 1996. Type: Book (9781884133503)
Apr 1 1998
