Detecting duplicate web pages is of great importance for search engines, because duplicates waste crawling, indexing, and storage resources. The work of Carvalho et al. advances the state of the art in detecting web page duplicates, with the potential to improve search engine performance. Through large-scale experiments, the authors show that their proposed algorithms are both time-efficient and effective at detecting duplicates.
Their algorithm first learns the classifiers via expectation–maximization (EM), using both positive and negative samples. One novelty of the algorithm is that it updates the classifiers incrementally, which reduces running time. Another is its use of multiple features, such as edit distance, hostname matching, full-path matching, and IP address matching on all four octets (ip4) or the first three octets (ip3); combining multiple features improves the quality of the classifiers. Once the classifiers are built, the classification algorithm computes a likelihood score for each candidate pair and ranks the duplicate URLs. A Pareto-style ranking borrowed from economics further enhances the quality of the classification.
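To make the feature set concrete, the sketch below computes pairwise features of the kinds named above for two URLs. The exact feature definitions (and the example URLs and IP addresses) are illustrative assumptions, not the authors' implementation:

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

def url_pair_features(url_a, url_b, ip_a, ip_b):
    """Illustrative pairwise features for duplicate-candidate URLs.

    These definitions are assumptions for illustration, not the
    feature functions used in the paper under review.
    """
    a, b = urlparse(url_a), urlparse(url_b)
    # String similarity of the full URLs (1.0 = identical),
    # standing in for an edit-distance feature.
    edit_sim = SequenceMatcher(None, url_a, url_b).ratio()
    # Binary exact-match features on hostname and full path.
    host_match = a.hostname == b.hostname
    path_match = a.path == b.path
    # IP-octet features: ip4 compares all four octets,
    # ip3 only the first three (same /24 network).
    oct_a, oct_b = ip_a.split("."), ip_b.split(".")
    ip4 = oct_a == oct_b
    ip3 = oct_a[:3] == oct_b[:3]
    return {"edit_sim": round(edit_sim, 3),
            "host_match": host_match,
            "path_match": path_match,
            "ip4": ip4,
            "ip3": ip3}

features = url_pair_features(
    "http://www.example.com/news/index.html",
    "http://example.com/news/index.html",
    "93.184.216.34",
    "93.184.216.40")
```

A classifier would then map such feature vectors to a likelihood score, and candidate pairs would be ranked by that score.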
The authors verified their algorithms on a large dataset of 250 million URLs collected by crawling the web. The data was then narrowed down to 172,004 websites that shared at least one fingerprint with another website, yielding about 111 million duplicate candidates. From this collection, the authors drew the 50,000 positive and 800,000 negative samples used for learning.
The experimental results are very encouraging: removing the duplicates reduces the size of the collection by 19 percent, with a false-positive rate of only 0.005. Combined with URL-level algorithms, the reduction increases to 21 percent.
Because the learning is automatic, the proposed algorithms are applicable to real-world web collections and could improve the quality of search.