Identifying Spam Web Pages Based on Content Similarity Sole Pera CS 653 – Term paper project
Overview Introduction Previous Work Methodology o Word-similarity o Hidden Content o Phrase-similarity Initial Results Conclusion
Introduction: Information is available on the Web. However, 14% of the Web consists of spam Web pages. Spam Web pages: o Web pages that receive an unjustifiably favorable relevance or high ranking, regardless of their true value. o Attempt to deceive a search engine’s relevancy ranking algorithm. Serious retrieval problem: o Quality of Web search is affected. o Search engines’ reputation is damaged. o Users’ trust in the retrieval process is weakened.
Previous Work Content Analysis: o [Ntoulas et al.] introduce and combine several heuristics based on the content of a Web page (number of words in a page, average length of words, fraction of visible content). Link Analysis: o [Becchetti et al.] and [Benczur et al.] consider links to and from a given Web page in order to determine whether it is spam.
Methodology Focus on the title and the body of a Web page in order to determine whether the page is spam: o In legitimate Web pages, the title and the body are closely related. o In spam Web pages, the title and the body are usually not related.
Methodology Computing the title-body similarity: o Word-correlation factors, computed using Wikipedia documents: o Degree of resemblance between t (a word in a title) and B (the body of a Web page): o Degree of similarity between the words in the title and the words in the body of a Web page: Status of a Web page:
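The title-body comparison can be sketched as follows. This is a minimal illustration, not the formulas from the slide: the table of word-correlation factors is made up, and both aggregation steps (strongest correlation per title word, then the mean over the title) and the threshold value are assumptions. The names `correlation`, `resemblance`, `title_body_similarity`, and `is_spam` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical table of word-correlation factors. The slides compute these
# from Wikipedia documents; any values stored here are illustrative only.
WordCorr = Dict[Tuple[str, str], float]


def correlation(table: WordCorr, a: str, b: str) -> float:
    """Look up a symmetric word-correlation factor; identical words score 1."""
    if a == b:
        return 1.0
    return table.get((a, b), table.get((b, a), 0.0))


def resemblance(table: WordCorr, t: str, body: List[str]) -> float:
    """ASSUMED aggregation: strongest correlation between t and any body word."""
    return max((correlation(table, t, w) for w in body), default=0.0)


def title_body_similarity(table: WordCorr, title: List[str], body: List[str]) -> float:
    """ASSUMED aggregation: mean resemblance over all title words."""
    if not title:
        return 0.0
    return sum(resemblance(table, t, body) for t in title) / len(title)


def is_spam(similarity: float, threshold: float = 0.5) -> bool:
    """Flag a page whose title and body are weakly related.

    The default threshold is a placeholder, not a value from the slides.
    """
    return similarity < threshold
```

A page whose title words have no correlated words in the body scores 0 and is flagged under any positive threshold.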
Methodology Fraction of Hidden Content: o Proportion of markup content of a given Web page (spam Web pages tend to contain less markup than legitimate Web pages): o Threshold value to determine the status of a Web page:
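One way to estimate the proportion of markup is to compare the length of the text a parser reports as character data with the length of the raw page. The sketch below uses Python's standard `html.parser`; the character-count definition and the name `markup_fraction` are assumptions, since the slide's exact formula and threshold are not reproduced here.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the character data found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Note: the contents of <script> and <style> also arrive here,
        # so this simple collector counts them as text, not markup.
        self.chunks.append(data)


def markup_fraction(html: str) -> float:
    """ASSUMED definition: share of characters that are not character data."""
    if not html:
        return 0.0
    collector = _TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    text_len = sum(len(chunk) for chunk in collector.chunks)
    return (len(html) - text_len) / len(html)
```

A page would then be compared against a threshold on this fraction; the threshold itself has to be tuned on labeled data.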
Methodology Phrase similarity value o Use the Odds measure to determine the phrase-correlation factor (based on the word-correlation factor): o Phrase similarity threshold value
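The Odds measure maps a value p in [0, 1) to p / (1 - p). How the word-correlation factors are combined before the Odds transform is not shown on the slide, so the sketch below assumes two phrases of equal length, aligned word by word, whose word-correlation factors are multiplied. Both that combination and the names `odds` and `phrase_correlation` are hypothetical.

```python
import math
from typing import Sequence


def odds(p: float) -> float:
    """Standard odds of a value p in [0, 1]: p / (1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 1.0:
        return math.inf  # avoid division by zero for a perfect match
    return p / (1.0 - p)


def phrase_correlation(word_factors: Sequence[float]) -> float:
    """ASSUMED combination: odds of the product of aligned word factors.

    word_factors holds one word-correlation factor per aligned word pair.
    """
    if not word_factors:
        return 0.0
    return odds(math.prod(word_factors))
```

Under this assumption a single weakly correlated word pair pulls the whole phrase score down, because the factors are multiplied before the transform.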
Overall Spam Detection Approach
Experimental Results WEBSPAM-UK2006: 77.9 million classified (spam, non-spam, borderline) Web pages. Accuracy and error rate, using phrase similarity:
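Accuracy and error rate follow their usual definitions: the fraction of pages classified correctly, and its complement. The sketch below assumes binary labels (True for spam, False for non-spam) and leaves the handling of borderline pages out; the function names are illustrative.

```python
from typing import Sequence


def accuracy(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Fraction of pages whose predicted label matches the true label."""
    if len(predicted) != len(actual):
        raise ValueError("label sequences must have the same length")
    if not actual:
        raise ValueError("at least one labeled page is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def error_rate(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Fraction of pages classified incorrectly (1 - accuracy)."""
    return 1.0 - accuracy(predicted, actual)
```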
Experimental Results Enhancement of the phrase similarity approach: o Method A: only phrase similarity. o Method B: phrase similarity as well as hidden content.
Experimental Results Our performance (in terms of F-Measure) with respect to other known spam-detection approaches.
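For reference, the balanced F-measure is the harmonic mean of precision and recall. The sketch below computes it from counts of true positives, false positives, and false negatives, with spam treated as the positive class; that choice and the function name are assumptions, and the counts passed in are up to the caller.

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure: 2 * P * R / (P + R), with spam as the positive class."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)  # flagged pages that really are spam
    recall = tp / (tp + fn)     # spam pages that were actually flagged
    return 2 * precision * recall / (precision + recall)
```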
Conclusion By using the phrases (words) in the title and body of a Web page, as well as the fraction of hidden content, we achieve 92% accuracy. Computationally inexpensive: can be incorporated into existing search engines to enhance Web search.
Questions