Detecting duplicate web pages is of great importance for search engines, because duplicates waste crawling, indexing, and storage resources. The work of Carvalho et al. advances the state of the art in detecting web page duplicates, with the potential to improve search engine performance. Through large-scale experiments, the authors show that their proposed algorithms are both time-efficient and effective at detecting duplicates.
Their algorithm first learns the classifiers via expectation–maximization (EM), using both positive and negative samples. One novelty of the algorithm is that it updates the classifiers incrementally, which reduces running time. Another is its use of multiple features, such as edit distance, hostname matching, full-path matching, and IP address matching on all four octets (ip4) or the first three octets (ip3); combining multiple features improves the quality of the classifiers. Once the classifiers are built, the classification algorithm computes a likelihood score for each candidate pair and ranks the duplicate URLs. A Pareto-style ranking borrowed from economics further enhances the quality of the classification.
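To make the feature set concrete, the sketch below computes pairwise features of the kinds named above for two URLs. The exact feature definitions (and the example URLs and IP addresses) are illustrative assumptions, not the authors' implementation:

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

def url_pair_features(url_a, url_b, ip_a, ip_b):
    """Illustrative pairwise features for duplicate-candidate URLs.

    These definitions are assumptions for illustration, not the
    feature functions used in the paper under review.
    """
    a, b = urlparse(url_a), urlparse(url_b)
    # String similarity of the full URLs (1.0 = identical),
    # standing in for an edit-distance feature.
    edit_sim = SequenceMatcher(None, url_a, url_b).ratio()
    # Binary exact-match features on hostname and full path.
    host_match = a.hostname == b.hostname
    path_match = a.path == b.path
    # IP-octet features: ip4 compares all four octets,
    # ip3 only the first three (same /24 network).
    oct_a, oct_b = ip_a.split("."), ip_b.split(".")
    ip4 = oct_a == oct_b
    ip3 = oct_a[:3] == oct_b[:3]
    return {"edit_sim": round(edit_sim, 3),
            "host_match": host_match,
            "path_match": path_match,
            "ip4": ip4,
            "ip3": ip3}

features = url_pair_features(
    "http://www.example.com/news/index.html",
    "http://example.com/news/index.html",
    "93.184.216.34",
    "93.184.216.40")
```

A classifier would then map such feature vectors to a likelihood score, and candidate pairs would be ranked by that score.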
The authors verified their algorithms on a large dataset of 250 million URLs collected by crawling the web. The data was then narrowed down to 172,004 websites that shared at least one fingerprint with another website, yielding about 111 million duplicate candidates. From this collection, the authors drew the 50,000 positive and 800,000 negative samples used for learning.
The experimental results are very encouraging: removing the duplicates reduces the size of the collection by 19 percent, with a false-positive rate of only 0.005. Combined with URL-level algorithms, the reduction increases to 21 percent.
Because the learning is automatic, the proposed algorithms are applicable to real-world web collections and could improve the quality of search.