Published by Jade Brook Stokes. Modified over 8 years ago.
1
Identifying Spam Web Pages Based on Content Similarity Sole Pera CS 653 – Term paper project
2
Overview Introduction Previous Work Methodology o Word-similarity o Hidden Content o Phrase-similarity Initial Results Conclusion
3
Introduction: Information is available on the Web. However, 14% of the Web consists of spam Web pages. Spam Web pages: o Web pages that receive an unjustifiably favorable relevance or high ranking, regardless of their true value. o Attempt to deceive a search engine’s relevancy ranking algorithm. Serious retrieval problem: o Quality of Web search is affected. o Search engines’ reputation is damaged. o Users’ trust in the retrieval process is weakened.
4
Previous Work Content Analysis: o [Ntoulas et al., 2006] introduce and combine several heuristics based on the content of a Web page (number of words in a page, average length of words, fraction of visible content). Link Analysis: o [Becchetti et al., 2006] and [Benczur et al., 2005] consider links to and from a given Web page in order to determine whether it is spam.
5
Methodology Focus on the title and the body of a Web page in order to determine whether the page is spam: o In legitimate Web pages, the title and the body are closely related. o In spam Web pages, the title and the body are usually not related.
6
Methodology Computing the title-body similarity: o Word-correlation factors, computed using Wikipedia documents: o Degree of resemblance between t (a word in a title) and B (the body of a Web page): o Degree of similarity between the words in the title and the words in the body of a Web page: Status of a Web page:
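The formulas on this slide did not survive in the transcript, so the following is only a minimal sketch of how a title-body similarity of this kind could be computed. Everything in it is an assumption for illustration: the `CORRELATION` table stands in for the Wikipedia-derived word-correlation factors, the noisy-OR form of `resemblance`, the averaging in `title_body_similarity`, and the `threshold` value are not taken from the presentation.

```python
from math import prod

# Hypothetical word-correlation factors in [0, 1]. The presentation derives
# the real ones from Wikipedia documents; these values are made up.
CORRELATION = {
    ("cheap", "discount"): 0.6,
    ("cheap", "price"): 0.5,
    ("flights", "airline"): 0.7,
}


def correlation(a, b):
    """Symmetric lookup: identical words score 1.0, unknown pairs score 0.0."""
    if a == b:
        return 1.0
    return CORRELATION.get((a, b), CORRELATION.get((b, a), 0.0))


def resemblance(title_word, body_words):
    """Assumed noisy-OR combination: 1 - prod(1 - cf(t, b)) over body words b."""
    return 1.0 - prod(1.0 - correlation(title_word, b) for b in body_words)


def title_body_similarity(title_words, body_words):
    """Assumed aggregate: mean resemblance of the title words to the body."""
    if not title_words:
        return 0.0
    total = sum(resemblance(t, body_words) for t in title_words)
    return total / len(title_words)


def is_spam(title_words, body_words, threshold=0.5):
    """Flag the page as spam when the title and body are weakly related."""
    return title_body_similarity(title_words, body_words) < threshold
```

With this table, a title such as `["cheap", "flights"]` against a body containing `"discount"`, `"price"`, and `"airline"` scores well above the illustrative threshold, while a title with no correlated body words scores zero and is flagged.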
7
Methodology Fraction of Hidden Content: o Proportion of markup content of a given Web page (spam Web pages tend to contain less markup than legitimate Web pages): o Threshold value to determine the status of a Web page:
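One way to approximate the proportion of markup in a page is sketched below with Python's standard `html.parser`. The definition used here (one minus the share of characters that are visible text) is an assumption, as are the names `markup_fraction` and `looks_like_spam`. The threshold from the slide is not in the transcript, so it is left as a required parameter.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def markup_fraction(html):
    """Assumed definition: share of characters that are not visible text."""
    if not html:
        return 0.0
    parser = _TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    visible = sum(len(chunk) for chunk in parser.chunks)
    return 1.0 - visible / len(html)


def looks_like_spam(html, threshold):
    """Per the slide, spam pages tend to contain less markup than legitimate ones."""
    return markup_fraction(html) < threshold
```

Character counts are a crude proxy. A word-based or byte-based ratio would be an equally reasonable choice under the same assumption.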
8
Methodology Phrase similarity value o Use the Odds measure to determine the phrase-correlation factor (based on the word-correlation factor): o Phrase similarity threshold value
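The odds of an event with probability p is p / (1 - p). The sketch below shows one way a phrase-correlation factor could be built from word-correlation factors with that measure. Treating the product of the word factors as p is an assumption made for illustration, since the slide's formula is missing from the transcript, and the function names are hypothetical.

```python
from math import prod


def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return p / (1.0 - p)


def phrase_correlation(word_factors):
    """Assumed: use the product of the word-correlation factors as p, then take odds."""
    return odds(prod(word_factors))
```

Under this assumption a phrase pair whose words are all strongly correlated gets a large factor, and a single weakly correlated word pair pulls the whole phrase factor down sharply.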
9
Overall Spam Detection Approach
10
Experimental Results WEBSPAM-UK2006: 77.9 million Web pages, classified as spam, non-spam, or borderline. Accuracy – Error Rate, using phrase similarity:
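For reference, accuracy and error rate can be computed from predicted and actual labels as in the sketch below. It uses the standard definitions and hypothetical function names. It does not reproduce the presentation's evaluation setup or figures.

```python
def accuracy(predicted, actual):
    """Fraction of pages whose predicted label matches the actual label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def error_rate(predicted, actual):
    """Complement of accuracy: fraction of misclassified pages."""
    return 1.0 - accuracy(predicted, actual)
```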
11
Experimental Results Enhancement of the phrase similarity approach: o Method A: only phrase similarity. o Method B: phrase similarity as well as hidden content.
12
Experimental Results Our performance (in terms of F-Measure) with respect to other known spam-detection approaches.
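The F-measure combines precision and recall into a single score. The sketch below assumes the balanced form (F1), since the slide does not say which weighting was used. The counts are passed in directly, and the parameter names are illustrative.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if true_pos == 0:
        return 0.0  # no correct spam detections: precision and recall are 0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, the score stays low unless both precision and recall are high, which makes it a stricter summary than accuracy when spam pages are a minority of the collection.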
13
Conclusion By using the phrases (words) in the title and body of a Web page, as well as the fraction of hidden content, we achieve 92% accuracy. Computationally inexpensive: can be incorporated into existing search engines to enhance Web searches.
14
Questions