The Internet has become an integral part of the infrastructure of modern society, and the web now hosts more than one billion websites. To locate webpages relevant to their interests, people commonly rely on search engines. While web search engines are very helpful to users, web spammers try to manipulate ranking algorithms in order to raise the position of their pages in search results. Web spam wastes not only users’ time, but also search engine resources. In the worst case, it can lead users to malicious content that installs malware on the victim’s machine.
Web spam detection methods have been under development for about two decades. Spam detection algorithms can be categorized as link-based or content-based. Under the assumption that all pages under a spam host are spam, the technique presented in this paper operates at the host level. Starting from a distrust seed (or a set of distrust seeds), the proposed algorithm propagates distrust through the host network, iteratively updating the normalized distrust score of each node. All distrust scores are initialized to zero, except those of the distrust seeds. The authors report experiments on two datasets: WEBSPAM-UK 2006 and WEBSPAM-UK 2007. Applying distrust seed propagation to three existing web spam detection algorithms, they claim to identify 17.73 percent and 8.59 percent more spam hosts on the two test datasets than without it. These improvements are noteworthy.
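For readers who want a concrete picture of seed-based distrust propagation, the following is a minimal sketch, not the authors' algorithm. It assumes an Anti-TrustRank-style scheme: distrust flows backward along links (a host that links to a distrusted host inherits part of its distrust), a damping factor of 0.85 is used, and any distrust that cannot be passed on is returned to the seeds so that scores always sum to one. The function name, parameters, and toy host names are hypothetical.

```python
def propagate_distrust(links, seeds, damping=0.85, iterations=50):
    """Spread distrust from seed hosts over a host-level link graph.

    links: dict mapping each host to the list of hosts it links to.
    seeds: set of hosts assumed to be spam (the distrust seeds).
    Returns a dict of distrust scores that sum to 1.
    The update rule is an illustrative assumption, not the paper's.
    """
    if not seeds:
        raise ValueError("at least one distrust seed is required")

    # Collect every host that appears as a source, a target, or a seed.
    hosts = set(links) | set(seeds)
    for targets in links.values():
        hosts.update(targets)

    # inlinkers[h] lists the hosts that link to h (reverse adjacency).
    inlinkers = {h: [] for h in hosts}
    for src, targets in links.items():
        for dst in set(targets):
            inlinkers[dst].append(src)

    # Scores start at zero everywhere except the seeds.
    seed_score = {h: (1.0 / len(seeds) if h in seeds else 0.0) for h in hosts}
    score = dict(seed_score)

    for _ in range(iterations):
        nxt = {h: 0.0 for h in hosts}
        leaked = 0.0
        for h in hosts:
            sources = inlinkers[h]
            if sources:
                # Split h's distrust equally among the hosts linking to it.
                share = damping * score[h] / len(sources)
                for src in sources:
                    nxt[src] += share
            else:
                # Nobody links to h, so its distrust has nowhere to go.
                leaked += damping * score[h]
        # Return the restart mass and the leaked mass to the seeds,
        # which keeps the scores normalized at every iteration.
        for h in hosts:
            nxt[h] += (1.0 - damping + leaked) * seed_score[h]
        score = nxt
    return score


# Toy host graph: "a" links to the seed, "b" links to "a",
# and "c" and "d" are unconnected to the seed.
toy_links = {"a": ["spam"], "b": ["a"], "c": ["d"], "d": []}
toy_scores = propagate_distrust(toy_links, {"spam"})
```

In this toy graph, distrust decreases with distance from the seed, and hosts with no path to the seed keep a score of zero. In practice, a threshold on the resulting scores, or their combination with an existing detector, would decide which hosts to flag; that step is outside this sketch.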
This paper presents a specific technique, distrust seed propagation, for web spam detection. It is worth reading, especially for practitioners who work in the field. The battle between web spam and its detection will never stop, just like the endless development of spears and shields with increasing sophistication. Beyond web spam, there are email spam, phishing, and other attacks carried out over the Internet. Surveys consistently indicate that Internet spam is a major problem. Antispam algorithms are very much needed, not only for efficiency and productivity, but also for Internet security and privacy.