1
Web Spam
2
Outline
  Motivation
  Introduction to Web Spam
  Web Spam Techniques
  Web Spam Detection
  Conclusions
4
Motivation
Increased exposure on the World Wide Web may yield significant financial gains.
E-commerce is growing rapidly: projected to reach $329 billion by 2010, 13% of all US retail sales.
More traffic, more money; a large fraction of traffic comes from search engines.
Ways to increase search engine referrals:
  Place ads
  Provide genuinely better content
  Create Web spam …
6
Web Spam
Giving an exact definition of Web spam is not an easy task. Essentially, we know Web spam when we see it. Here are some examples.
7
Defining Web Spam
Spamming = misleading search engines to obtain a higher-than-deserved ranking.
Ranking depends on relevance and importance:
  Relevance is usually measured through the textual similarity between the query and a page. Pages can be given a query-specific, numeric relevance score; the higher the number, the more relevant the page is to the query.
  Importance refers to the global popularity of a page, as often inferred from the link structure or perhaps other indicators.
8
Why Web Spam is Bad
Bad for users:
  Makes it harder to satisfy information need
  Leads to frustrating search experience
Bad for search engines:
  Wastes bandwidth, CPU cycles, storage space
  Pollutes corpus (infinite number of spam pages!)
  Distorts ranking of results
10
Web Spam Techniques
Two categories of techniques are associated with web spam:
  Boosting techniques (term-based, link-based): achieve high relevance and/or importance for some pages
  Hiding techniques: hide the adopted boosting techniques from the eyes of human web users
11
Techniques/Boosting
Used to increase ranking (hypertext boosting).
Term boosting: raises relevance for one or many queries; target: TF-IDF variants.
Link boosting: raises importance; target: inlink/outlink count.
Link-based ranking schemes being targeted:
  Assign global hub and authority scores to each page
  Use incoming link information to assign global importance scores
12
Techniques/Boosting/Term
Where spam terms are placed, and how search engines treat each field:
  Body: the simplest and most popular technique, as old as search engines.
  Title: search engines give a higher weight to terms that appear in the title.
  Meta tags: heavily spammed, so search engines give them low priority or ignore them completely.
  Anchor text: assumed to offer a summary of the pointed-to document, so it gets a higher weight.
  URL: the URL of a page is broken into a set of terms that are used to determine the relevance of the page.
Example of meta tag, title, and body spam:
  <html>
  <head>
  <meta name="keywords" content="buy,cheap,cameras,Lens,accessories,nikon,canon">
  <title>free,free,free, cheap</title>
  </head>
  <body> Our customers agree that we are the best online retailer of cameras! … </body>
  </html>
Example of anchor text and URL spam:
  <html>
  …A great <a href="buy-canon-rebel-20d-lens-case.camerasx.com"> free,great deals,cheap,inexpensive,cheap,free</a> store.
  </html>
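To make the effect of term spamming concrete, here is a minimal sketch of one simple TF-IDF variant (raw term frequency times log inverse document frequency). The function name `tfidf_score`, the toy corpus, and the choice of weighting are assumptions made for illustration; real search engines use more elaborate, field-weighted variants.

```python
import math
from collections import Counter


def tfidf_score(query, doc, corpus):
    """Score `doc` for `query` with raw-TF x log-IDF (one simple TF-IDF variant)."""
    doc_terms = Counter(doc.lower().split())
    n_docs = len(corpus)
    score = 0.0
    for term in query.lower().split():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for d in corpus if term in d.lower().split())
        if df == 0:
            continue  # term unseen in the corpus contributes nothing
        score += doc_terms[term] * math.log(n_docs / df)
    return score


# Hypothetical three-page corpus: one honest page, one keyword-stuffed page.
honest = "we sell cameras and lenses"
stuffed = "cheap cameras cheap cameras cheap cameras buy cheap cameras"
other = "weather report for today"
corpus = [honest, stuffed, other]

honest_score = tfidf_score("cheap cameras", honest, corpus)
stuffed_score = tfidf_score("cheap cameras", stuffed, corpus)
```

Under this unnormalized variant, repeating the query terms raises the stuffed page's score above the honest page's, which is the behavior term spamming exploits.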
13
Techniques/Boosting/Link
Outgoing links to well-known pages: the page appears to provide useful resources, BUT it also has links to spam pages.
Spammers can control a large number of sites and create arbitrary link structures.
Buying expired domains takes advantage of the false relevance/importance conveyed by the pool of old links.
A group of spammers sets up a link exchange structure in which their sites point to each other.
Post messages containing links on blogs, forums, and wikis; sites that allow webmasters to post links to their own sites may end up hosting spam links.
14
Techniques/Hiding
Serve different web pages to users and web crawlers.
Redirect by script:
  <script type="text/javascript"><!-- location.replace("target.html") //--> </script>
Redirect by meta refresh:
  <meta http-equiv="refresh" content="0;url=plush.com">
Hide text by matching the background color:
  <body bgcolor="red"> <font color="red">hidden text</font> … </body>
Hide an element with a style rule:
  <div style="visibility:hidden">You can't see me!</div>
Hide a link behind a tiny image:
  <a href="target.html"><img src="tinyimg.gif"></a>
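As a rough illustration of how some of the hiding patterns above could be spotted in markup, the following sketch uses Python's standard `html.parser`. The class name `HidingSignals`, the function `find_hiding_signals`, and the signal labels are hypothetical; the sketch only covers meta refresh, `visibility:hidden`, and font color equal to the body color, and it does not handle script-based redirection or external stylesheets.

```python
from html.parser import HTMLParser


class HidingSignals(HTMLParser):
    """Collects a few simple markup signals associated with hiding techniques."""

    def __init__(self):
        super().__init__()
        self.signals = []
        self.body_color = None

    def handle_starttag(self, tag, attrs):
        # Attribute values may be None (e.g. bare attributes), so normalize them.
        a = {name: (value or "") for name, value in attrs}
        if tag == "meta" and a.get("http-equiv", "").lower() == "refresh":
            self.signals.append("meta-refresh")
        style = a.get("style", "").replace(" ", "").lower()
        if "visibility:hidden" in style:
            self.signals.append("hidden-style")
        if tag == "body":
            # Accept either attribute spelling for the page background color.
            color = a.get("bgcolor") or a.get("background")
            self.body_color = color.lower() if color else None
        if tag == "font" and self.body_color:
            if a.get("color", "").lower() == self.body_color:
                self.signals.append("text-matches-background")


def find_hiding_signals(html_text):
    parser = HidingSignals()
    parser.feed(html_text)
    return parser.signals


page = (
    '<body bgcolor="red"><font color="red">hidden text</font>'
    '<div style="visibility:hidden">You cannot see me</div>'
    '<meta http-equiv="refresh" content="0;url=plush.com"></body>'
)
signals = find_hiding_signals(page)
```

These signals are only hints: legitimate pages also use redirects and hidden elements, so a match should feed a classifier rather than decide on its own.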
16
How do we detect spam?
Detecting techniques: content-based, link-based, cloaking-based, other
17
Content-based Detection
18
Related Work
D. Fetterly, M. Manasse, and M. Najork. Spam, Damn Spam, and Statistics. [WebDB 2004]
A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. [WWW 2006]
…
20
Content-based Detection
Features examined:
  Number of words in the page and title
  Average word length
  Amount of anchor text
  Compression rate
  Fraction of page drawn from globally popular words
  Fraction of globally popular words
(Detecting spam web pages through content analysis, WWW 2006)
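The features listed above can be computed with a few lines of text processing. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function name `content_features`, the tokenizer, the feature names, and the tiny `popular_words` list are all hypothetical, and `body_text` is assumed to already include the anchor text.

```python
import re


def tokenize(text):
    """Lowercase and split into alphanumeric word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def content_features(title, body_text, anchor_text, popular_words):
    words = tokenize(body_text)
    n = len(words)
    popular = set(popular_words)
    return {
        "num_words": n,
        "num_title_words": len(tokenize(title)),
        "avg_word_length": sum(len(w) for w in words) / n if n else 0.0,
        # Share of the page's words that sit inside anchor text.
        "anchor_fraction": len(tokenize(anchor_text)) / n if n else 0.0,
        # Share of the page's words that are globally popular words.
        "page_fraction_popular": sum(w in popular for w in words) / n if n else 0.0,
        # Share of the popular-word list that appears on the page.
        "popular_fraction_covered": len(popular & set(words)) / len(popular) if popular else 0.0,
    }


features = content_features(
    title="free free cheap",
    body_text="buy cheap cameras and the best cheap lens",
    anchor_text="cheap lens",
    popular_words=["the", "and", "of", "a"],
)
```

Each value is a candidate input feature for a classifier; none of them is decisive on its own.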
21
Number of Words in <title>
22
Distribution of Word-counts in <title>
Spam is more likely in pages with more words in the title.
23
Compressibility of a Page
24
zipRatio of a page
25
Distribution of zipRatios
Spam is more likely in pages with a high zipRatio.
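A compression ratio of this kind can be sketched as uncompressed size divided by compressed size. The function name `zip_ratio` and the use of `zlib` are assumptions for illustration; the compressor used in the cited work may differ, so absolute values are not comparable with the published plots.

```python
import zlib


def zip_ratio(text):
    """Uncompressed byte length divided by zlib-compressed byte length."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


# Highly repetitive (keyword-stuffed) text compresses far better than ordinary prose.
repetitive = "cheap cameras " * 200
ordinary = "Our customers agree that we are the best online retailer of cameras!"

repetitive_ratio = zip_ratio(repetitive)
ordinary_ratio = zip_ratio(ordinary)
```

Pages built by repeating the same phrases are highly redundant, which is why they tend to show a high ratio.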
26
Combine heuristics
Use the previously presented metrics as features for a classifier.
Results are shown for a decision tree.
27
Decision Tree
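As a minimal sketch of the idea, the code below learns a depth-1 decision tree (a decision stump) on a single feature by picking the threshold with the fewest training errors. The function name `train_stump`, the rule "spam if value > threshold", and the toy labeled data are invented for illustration; the slides do not give the tree or its thresholds, and a real tree would split on many features recursively.

```python
def train_stump(samples):
    """Learn a threshold for the rule `spam if value > threshold`.

    `samples` is a list of (feature_value, is_spam) pairs. Candidate thresholds
    are midpoints between consecutive distinct feature values.
    """
    values = sorted({value for value, _ in samples})
    if len(values) < 2:
        raise ValueError("need at least two distinct feature values")
    best_threshold, best_errors = None, None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        errors = sum((value > threshold) != is_spam for value, is_spam in samples)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


# Made-up training data: (zipRatio-like feature, labeled as spam?).
training = [(1.5, False), (2.0, False), (2.5, False), (4.0, True), (6.0, True)]
threshold = train_stump(training)


def predict(value):
    return value > threshold
```

A full decision tree applies the same split search recursively to each side of the chosen threshold and across all features.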
28
Link-based Detection
29
Related Work
B. Davison. Recognizing nepotistic links on the Web. 2000
Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating Web Spam with TrustRank. [VLDB 2004]
B. Wu and B. Davison. Identifying Link Farm Spam Pages. [WWW 2005]
R. Baeza-Yates, C. Castillo, and V. Lopez. PageRank Increase under Different Collusion Topologies. [AIRWeb 2005]
V. Krishnan and R. Raj. Web Spam Detection with Anti-Trust-Rank. [AIRWeb 2006]
L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Using Rank Propagation and Probabilistic Counting for Link Based Spam Detection. [WebKDD 2006]
B. Wu, V. Goel, and B. Davison. Topical TrustRank: Using Topicality to Combat Web Spam. [WWW 2006]
Z. Gyongyi, P. Berkhin, H. Garcia-Molina, and J. Pedersen. Link Spam Detection Based on Mass Estimation. [VLDB 2006]
31
Link Spam Detection Based on Mass Estimation [VLDB 2006]
Our target: detect pages that achieve high PageRank through link spamming.
32
PageRank Contribution
37
Spam Mass: Definition
Absolute mass: amount (part) of PageRank coming from spam.
Relative mass: fraction of PageRank coming from spam; more useful in practice.
Example: absolute mass = $p_0^{-} = 5$; relative mass = $p_0^{-} / p_0 = 5/7$.
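The example numbers can be worked through as follows, assuming (as the ratio 5/7 implies) that the page's total PageRank is 7, and using the decomposition of PageRank into a part contributed by good pages and a part contributed by spam pages. The superscripts + and - denote the good and spam parts.

```latex
\begin{align*}
p_0 &= p_0^{+} + p_0^{-}, \qquad p_0 = 7,\quad p_0^{-} = 5
      \;\Rightarrow\; p_0^{+} = 2, \\
\text{absolute mass} &= p_0^{-} = 5, \\
\text{relative mass} &= \frac{p_0^{-}}{p_0} = \frac{5}{7} \approx 0.71 .
\end{align*}
```

A relative mass near 1 means almost all of the page's PageRank comes from spam, regardless of how large the PageRank is in absolute terms.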
38
Spam Mass: Estimation
Approximate the set of good nodes by a subset called the good core.
40
Spam Mass: Algorithm
Create the good core.
Compute PageRank scores $p_i$ and $p_i^{+}$.
For all pages i with large PageRank:
  Compute the estimated relative mass $m_i = (p_i - p_i^{+}) / p_i$.
  Mark the page as spam if $m_i$ > threshold.
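The algorithm can be sketched on a toy graph as follows. This is an illustration under stated assumptions rather than the paper's implementation: p_i is taken to be PageRank with a uniform random jump, p_i^+ is taken to be PageRank with the random jump restricted to the good core, and the graph, the function names, the damping factor 0.85, and both thresholds are hypothetical.

```python
def pagerank(out_links, jump, damping=0.85, iterations=100):
    """Unnormalized (linear) PageRank by fixed-point iteration with a given jump vector."""
    score = dict(jump)
    for _ in range(iterations):
        nxt = {node: (1.0 - damping) * jump[node] for node in out_links}
        for src, targets in out_links.items():
            if not targets:
                continue  # pages without outlinks pass nothing on
            share = damping * score[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        score = nxt
    return score


def relative_mass(out_links, good_core):
    n = len(out_links)
    uniform = {node: 1.0 / n for node in out_links}
    core_only = {node: (1.0 / n if node in good_core else 0.0) for node in out_links}
    p = pagerank(out_links, uniform)         # p_i
    p_plus = pagerank(out_links, core_only)  # p_i^+ (core-based estimate)
    mass = {node: (p[node] - p_plus[node]) / p[node] for node in out_links}
    return p, mass


def detect_link_spam(out_links, good_core, min_pagerank, mass_threshold):
    p, mass = relative_mass(out_links, good_core)
    return {node for node in out_links
            if p[node] >= min_pagerank and mass[node] > mass_threshold}


# Toy graph: two good pages, a target page "t", and three spam pages boosting it.
graph = {
    "g1": ["g2", "t"], "g2": ["g1"], "t": [],
    "s1": ["t"], "s2": ["t"], "s3": ["t"],
}
core = {"g1", "g2"}
scores, mass = relative_mass(graph, core)
flagged = detect_link_spam(graph, core, min_pagerank=0.1, mass_threshold=0.5)
```

On this graph only "t" has a PageRank above the hypothetical cutoff, and most of its score comes from pages outside the good core, so it is the only page flagged. The spam pages themselves have relative mass close to 1 but are skipped because their PageRank is small.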
41
Hiding-based Detection
42
Related Work
M. Najork. System and method for identifying cloaked web servers, June U.S. Patent number 6,910,077.
B. Wu and B. D. Davison. Cloaking and redirection: A preliminary study. [AIRWeb 2005]
B. Wu and B. D. Davison. Detecting Semantic Cloaking on the Web. [WWW 2006]
44
Cloaking and redirection: A preliminary study [AIRWeb 2005]
Motivation
C: a page from the crawler's perspective. B: a page from the browser's perspective.
Web pages may be updated frequently, so two copies of the same page can differ without any cloaking.
Compare several copies: if the difference between C1 and B1 is bigger than the difference between C1 and C2, this is taken as sufficient evidence that the page is cloaking.
45
Detecting Cloaking
[Figure: C1 is compared with C2 and with B1, once on terms and once on links; a threshold is selected for each comparison.]
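The comparison can be sketched for terms as follows. The difference measure used here (size of the symmetric difference of the term sets) and the names `term_difference` and `looks_cloaked` are assumptions for illustration; the study's own measure and threshold values are not given on the slides. The same shape of check could be applied to the sets of links.

```python
import re


def terms(page_text):
    """Set of lowercase alphanumeric terms on a page."""
    return set(re.findall(r"[a-z0-9]+", page_text.lower()))


def term_difference(page_a, page_b):
    """Number of terms that appear in exactly one of the two copies."""
    return len(terms(page_a) ^ terms(page_b))


def looks_cloaked(c1, c2, b1, threshold=0):
    """Flag cloaking when crawler-vs-browser differs more than crawler-vs-crawler.

    c1, c2: two copies fetched as a crawler; b1: a copy fetched as a browser.
    `threshold` is a hypothetical margin that absorbs ordinary page updates.
    """
    return term_difference(c1, b1) > term_difference(c1, c2) + threshold


c1 = "cheap cameras buy cheap lens"
c2 = "cheap cameras buy cheap lens today"
b1 = "welcome to our family photo blog"

cloaked = looks_cloaked(c1, c2, b1, threshold=2)
not_cloaked = looks_cloaked(c1, c2, c2, threshold=2)
```

Using the second crawler copy as a baseline is what keeps frequently updated but honest pages from being flagged.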
46
Other Detection methods
47
Related Work
C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: web spam detection using the web topology. [SIGIR 2007]
S. Webb, J. Caverlee, and C. Pu. Characterizing web spam using content and HTTP session analysis. [CEAS 2007]
S. Webb, J. Caverlee, and C. Pu. Predicting Web Spam with HTTP Session Information. [CIKM 2008]
…
49
Web Topology Detection
Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages.
Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000].
Spam tends to be clustered on the Web (black on figure).
(Know your neighbors: web spam detection using the web topology, SIGIR 2007)
50
Clustering
If the majority of a cluster is predicted to be spam, then we change the prediction for all hosts in the cluster to spam. The inverse holds true too.
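The majority rule can be sketched as a post-processing step over a classifier's per-host predictions. The function name `smooth_by_cluster`, the host names, and the decision to leave exact ties unchanged are assumptions for illustration; how the clusters are obtained is outside this sketch.

```python
from collections import defaultdict


def smooth_by_cluster(predictions, cluster_of):
    """Overwrite each host's spam prediction with its cluster's strict majority.

    predictions: host -> bool (True means predicted spam)
    cluster_of:  host -> cluster id
    Clusters with an exact tie keep their original predictions (an assumption).
    """
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)

    smoothed = dict(predictions)
    for hosts in members.values():
        spam_votes = sum(predictions[h] for h in hosts)
        if 2 * spam_votes > len(hosts):
            label = True      # majority spam: relabel the whole cluster as spam
        elif 2 * spam_votes < len(hosts):
            label = False     # majority nonspam: the inverse case
        else:
            continue
        for h in hosts:
            smoothed[h] = label
    return smoothed


predictions = {"a": True, "b": True, "c": False, "d": False, "e": False, "f": True}
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
smoothed = smooth_by_cluster(predictions, cluster_of)
```

In the toy data, host "c" is relabeled as spam because two of the three hosts in its cluster are predicted spam, and host "f" is relabeled as nonspam for the opposite reason.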
52
Conclusions
Covered: an introduction to web spam, two categories of spamming techniques, and detection techniques.
Above all, although there are many techniques to detect web spam, spam is still widespread.
53
Thank you Q&A