Web Spam 2008.12.20
Outline: Motivation; Introduction to Web Spam; Web Spam Techniques; Web Spam Detection; Conclusions
Motivation
Increased exposure on the World Wide Web may yield significant financial gains.
E-commerce is growing rapidly: projected to reach $329 billion by 2010, 13% of all US retail sales.
More traffic means more money, and a large fraction of traffic comes from search engines.
Ways to increase search engine referrals: place ads; provide genuinely better content; create Web spam …
Web Spam
Giving an exact definition of Web spam is not an easy task. Essentially, we know Web spam when we see it. Here are some examples.
Defining Web Spam
Spamming = misleading search engines to obtain a higher-than-deserved ranking.
Ranking combines relevance and importance.
Relevance is usually measured through the textual similarity between the query and a page. Pages can be given a query-specific, numeric relevance score; the higher the number, the more relevant the page is to the query.
Importance refers to the global popularity of a page, as often inferred from the link structure or perhaps other indicators.
Why Web Spam Is Bad
Bad for users: makes it harder to satisfy an information need; leads to a frustrating search experience.
Bad for search engines: wastes bandwidth, CPU cycles, and storage space; pollutes the corpus (infinite number of spam pages!); distorts the ranking of results.
Web Spam Techniques
Two categories of techniques are associated with Web spam: boosting (term-based and link-based) and hiding.
Boosting techniques: achieve high relevance and/or importance for some pages.
Hiding techniques: hide the adopted boosting techniques from the eyes of human Web users.
Techniques/Boosting
Boosting techniques are used to increase ranking. Hypertext boosting takes two forms:
Term boosting: raises relevance (for one or many queries); target: TF-IDF variants.
Link boosting: raises importance; target: inlink/outlink counts.
Link-based ranking algorithms being targeted: HITS assigns global hub and authority scores to each page; PageRank uses incoming link information to assign global importance scores.
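Techniques/Boosting/Term
Search engines weight terms differently depending on the field of the page in which they appear, and spammers target each field.
Body spam: the simplest and most popular technique, as old as search engines themselves.
Title spam: search engines give a higher weight to terms that appear in the title.
Meta tag spam: because of heavy spamming, search engines give meta tags low priority or ignore them completely.
Anchor text spam: anchor text is supposed to offer a summary of the pointed-to document, so it gets a higher weight.
URL spam: some search engines break the URL of a page into a set of terms used to determine the relevance of the page.
Example (meta tag, title, and body):
  <html>
  <head>
  <meta name="keywords" content="buy, cheap, cameras, Lens, accessories, nikon, canon">
  <title>free, free, free, cheap</title>
  </head>
  <body> Our customers agree that we are the best online retailer of cameras! … </body>
  </html>
Example (anchor text and URL):
  <html> … A great <a href="buy-canon-rebel-20d-lens-case.camerasx.com">free, great deals, cheap, inexpensive, cheap, free</a> store. </html>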
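To make the effect of term boosting concrete, here is a minimal sketch of a TF-IDF style relevance score, assuming the simplest variant: term frequency relative to page length, multiplied by the log inverse document frequency. The function name `tfidf_score` and the whitespace tokenization are illustrative assumptions, not what any particular search engine does.

```python
import math
from collections import Counter


def tfidf_score(query, doc, corpus):
    """Sum TF(t) * IDF(t) over the query terms that occur in doc.

    Assumption: doc is one of the strings in corpus, and terms are
    obtained by lowercasing and splitting on whitespace.
    """
    doc_terms = doc.lower().split()
    counts = Counter(doc_terms)
    score = 0.0
    for term in set(query.lower().split()):
        if counts[term] == 0:
            continue  # term absent from this page: contributes nothing
        # Document frequency: number of corpus pages containing the term.
        df = sum(1 for other in corpus if term in other.lower().split())
        if df == 0:
            continue  # only possible when doc is not part of corpus
        tf = counts[term] / len(doc_terms)
        idf = math.log(len(corpus) / df)
        score += tf * idf
    return score
```

Under this scoring, a page that simply repeats the query terms gets a higher term frequency and therefore a higher relevance score than an honest page that mentions them once, which is exactly what term spamming exploits.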
Techniques/Boosting/Link
Outgoing links: add outgoing links to well-known pages.
Incoming links:
Honey pot: pages that provide useful resources BUT also have links to spam pages.
Spam farm: spammers can control a large number of sites and create arbitrary link structures.
Expired domains: buy expired domains to take advantage of the false relevance/importance conveyed by the pool of old links.
Link exchange: a group of spammers set up a link exchange structure in which their sites point to each other.
Blogs, forums, wikis: post messages containing links.
Web directories: directories that allow webmasters to post links to their sites may end up listing spam links.
Techniques/Hiding
Cloaking: serve different web pages to users and web crawlers.
Redirection:
  <script type="text/javascript"><!-- location.replace("target.html") //--></script>
  <meta http-equiv="refresh" content="0;url=plush.com">
Content hiding:
  <body bgcolor="red"> <font color="red">hidden text</font> … </body>
  <div style="visibility:hidden">You can't see me!</div>
  <a href="target.html"><img src="tinyimg.gif"></a>
How do we detect spam?
Detection techniques: content-based, link-based, cloaking-based, other.
Content-based Detection
Related Work
D. Fetterly, M. Manasse, and M. Najork. Spam, Damn Spam, and Statistics. [WebDB 2004]
A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. [WWW 2006]
…
Content-based Detection
Features: number of words in the page and in the title; average word length; amount of anchor text; compression rate; fraction of the page drawn from globally popular words; fraction of globally popular words.
(Source: Detecting spam web pages through content analysis, WWW 2006)
Number of Words in <title>
Distribution of Word-counts in <title>: spam is more likely in pages with more words in the title.
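A few of the content features listed above can be sketched with the standard-library HTML parser. This is an illustration only: the feature names, the whitespace tokenization, and the decision to count title words separately from page words are assumptions, not the feature definitions used in the cited paper.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects words, keeping title and anchor text in separate lists."""

    def __init__(self):
        super().__init__()
        self.words = []         # words outside <title>
        self.title_words = []   # words inside <title>
        self.anchor_words = []  # words inside <a> ... </a>
        self._in_title = 0
        self._in_anchor = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title += 1
        elif tag == "a":
            self._in_anchor += 1

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title -= 1
        elif tag == "a" and self._in_anchor:
            self._in_anchor -= 1

    def handle_data(self, data):
        tokens = data.split()
        if self._in_title:
            self.title_words.extend(tokens)
        else:
            self.words.extend(tokens)
            if self._in_anchor:
                self.anchor_words.extend(tokens)


def content_features(html):
    """Return a dict of simple content features for one page."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    n = len(collector.words)
    return {
        "page_words": n,
        "title_words": len(collector.title_words),
        "avg_word_length": sum(len(w) for w in collector.words) / n if n else 0.0,
        "anchor_fraction": len(collector.anchor_words) / n if n else 0.0,
    }
```

Each feature on its own is only a weak signal; the values are meant to be fed into a classifier rather than thresholded by hand.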
Compressibility of a Page
zipRatio of a page
Distribution of zipRatios: spam is more likely in pages with a high zipRatio.
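The slides do not spell out how zipRatio is computed. A minimal sketch, assuming it is the uncompressed size divided by the compressed size and using `zlib` (DEFLATE) as a stand-in compressor:

```python
import zlib


def zip_ratio(page_text):
    """Uncompressed size divided by compressed size (assumed definition).

    Highly repetitive pages, such as keyword-stuffed ones, compress well
    and therefore get a high ratio.
    """
    raw = page_text.encode("utf-8")
    if not raw:
        return 0.0  # convention chosen here for empty pages
    return len(raw) / len(zlib.compress(raw))
```

A page that repeats the same phrase hundreds of times yields a much larger ratio than ordinary prose, which matches the observation that spam is more likely among pages with a high zipRatio.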
Combine heuristics: use the previously presented metrics as features for a classifier. Results are shown for a decision tree.
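As a rough illustration of turning the metrics into a classifier, the sketch below trains a decision stump, that is, a decision tree with a single split. It is a deliberately reduced stand-in for a full decision-tree learner; the function names and the "greater than threshold means spam" direction are assumptions made for this example.

```python
def train_stump(rows, labels):
    """Pick the (feature, threshold) pair with the fewest training errors.

    rows   -- list of dicts mapping feature name to numeric value
    labels -- list of bools, True meaning spam
    The stump predicts spam when row[feature] > threshold.
    Returns (feature, threshold, errors), or None if no split exists.
    """
    best = None
    for feature in rows[0]:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            errors = sum(
                (row[feature] > threshold) != label
                for row, label in zip(rows, labels)
            )
            if best is None or errors < best[2]:
                best = (feature, threshold, errors)
    return best


def predict(stump, row):
    """Apply a trained stump to one feature dict; True means spam."""
    feature, threshold, _ = stump
    return row[feature] > threshold
```

A real decision tree would apply the same split search recursively to each side of the split, which lets it combine several features instead of relying on one.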
Decision Tree
Link-based Detection
Related Work
B. Davison. Recognizing nepotistic links on the Web. 2000
Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating Web Spam with TrustRank. [VLDB 2004]
B. Wu and B. Davison. Identifying Link Farm Spam Pages. [WWW 2005]
R. Baeza-Yates, C. Castillo, and V. Lopez. PageRank Increase under Different Collusion Topologies. [AIRWeb 2005]
V. Krishnan and R. Raj. Web Spam Detection with Anti-Trust-Rank. [AIRWeb 2006]
L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Using Rank Propagation and Probabilistic Counting for Link Based Spam Detection. [WebKDD 2006]
B. Wu, V. Goel, and B. Davison. Topical TrustRank: Using Topicality to Combat Web Spam. [WWW 2006]
Z. Gyongyi, P. Berkhin, H. Garcia-Molina, and J. Pedersen. Link Spam Detection Based on Mass Estimation. [VLDB 2006]
Our target: detect pages that achieve high PageRank through link spamming.
(Source: Link Spam Detection Based on Mass Estimation, VLDB 2006)
PageRank Contribution
Spam Mass: Definition
Absolute mass: the amount (part) of PageRank coming from spam.
Relative mass: the fraction of PageRank coming from spam; more useful in practice.
Example for page 0: absolute mass $= p_0^{-} = 5$; relative mass $= p_0^{-} / p_0 = 5/7$.
Spam Mass: Estimation
Approximate the set of good nodes by a subset called the good core.
Spam Mass: Algorithm
Create the good core.
Compute PageRank scores $p_i$ and $p_i^{+}$, where $p_i^{+}$ is the part of the PageRank of page $i$ contributed by the good core.
For all pages $i$ with large PageRank: compute the estimated relative mass $m_i = (p_i - p_i^{+}) / p_i$, and mark the page as spam if $m_i$ > threshold.
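The algorithm can be sketched in a few lines, under some stated assumptions: PageRank is computed by plain iteration of the linear form p = c * T^T p + (1 - c) * v, the core-based score uses a jump vector that is non-zero only on the good core, and any rescaling of that jump vector is left out. The function names, the damping value, and both thresholds are illustrative choices.

```python
def pagerank(out_links, jump, damping=0.85, iterations=100):
    """Iterate p = damping * T^T p + (1 - damping) * jump.

    out_links maps every node to the list of nodes it links to (every
    target must also be a key); jump maps every node to its random-jump
    weight and does not have to sum to 1.
    """
    p = dict(jump)
    for _ in range(iterations):
        nxt = {node: (1.0 - damping) * jump[node] for node in out_links}
        for node, targets in out_links.items():
            if targets:
                share = damping * p[node] / len(targets)
                for target in targets:
                    nxt[target] += share
        p = nxt
    return p


def spam_mass(out_links, good_core, damping=0.85):
    """Return {node: (pagerank, estimated relative spam mass)}."""
    n = len(out_links)
    uniform = {node: 1.0 / n for node in out_links}
    # Jump only to good-core nodes: yields the core-based score p_i^+.
    core_only = {node: (1.0 / n if node in good_core else 0.0)
                 for node in out_links}
    p = pagerank(out_links, uniform, damping)
    p_plus = pagerank(out_links, core_only, damping)
    return {node: (p[node], (p[node] - p_plus[node]) / p[node])
            for node in out_links}


def detect_link_spam(out_links, good_core, min_pagerank, threshold):
    """Pages with large PageRank whose relative mass exceeds threshold."""
    scores = spam_mass(out_links, good_core)
    return sorted(node for node, (rank, mass) in scores.items()
                  if rank >= min_pagerank and mass > threshold)
```

Restricting the check to pages with large PageRank matters: low-ranked pages outside the good core also get a high estimated mass, but they are not the pages that benefit from link spamming.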
Hiding-based Detection
Related Work
M. Najork. System and method for identifying cloaked web servers. U.S. Patent number 6,910,077, June 21, 2005.
B. Wu and B. D. Davison. Cloaking and redirection: A preliminary study. [AIRWeb 2005]
B. Wu and B. D. Davison. Detecting Semantic Cloaking on the Web. [WWW 2006]
Cloaking and redirection: A preliminary study [AIRWeb 2005]
Motivation
C: a page from the crawler's perspective. B: the same page from the browser's perspective.
Web pages may be updated frequently, so two copies can differ without any cloaking.
Compare: if the difference between C1 and B1 is bigger than the difference between C1 and C2, this is evidence that the page is cloaking.
Detecting Cloaking
Compare C1 with C2 and C1 with B1, on both terms and links, and choose a threshold for each comparison.
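The term comparison can be sketched as follows. This is a simplified illustration: the difference measure (size of the symmetric difference of the term sets), the regex-based tag stripping, and the default threshold are assumptions, not the measure or threshold used in the cited study. A link comparison would follow the same pattern with sets of link targets instead of terms.

```python
import re


def terms(html_text):
    """Lowercased set of alphanumeric terms, with tags crudely stripped."""
    text = re.sub(r"<[^>]+>", " ", html_text)
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def term_difference(page_a, page_b):
    """Number of terms that appear in exactly one of the two copies."""
    return len(terms(page_a) ^ terms(page_b))


def looks_cloaked(c1, c2, b1, threshold=3):
    """Flag a page when crawler-vs-browser differs far more than
    crawler-vs-crawler.

    c1, c2 -- two copies fetched as a crawler
    b1     -- one copy fetched as a browser
    threshold -- hypothetical margin; it has to be tuned on labeled data
    """
    return term_difference(c1, b1) - term_difference(c1, c2) > threshold
```

Subtracting the crawler-to-crawler difference is what separates cloaking from ordinary page churn: a frequently updated page changes between any two fetches, whoever is asking.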
Other Detection Methods
Related Work
C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: web spam detection using the web topology. [SIGIR 2007]
S. Webb, J. Caverlee, and C. Pu. Characterizing web spam using content and HTTP session analysis. [CEAS 2007]
S. Webb, J. Caverlee, and C. Pu. Predicting Web Spam with HTTP Session Information. [CIKM 2008]
…
Web Topology Detection
Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages.
Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000].
Spam tends to be clustered on the Web (black in the figure).
(Source: Know your neighbors: web spam detection using the web topology, SIGIR 2007)
Clustering
If the majority of a cluster is predicted to be spam, then we change the prediction for all hosts in the cluster to spam. The inverse holds true too.
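The majority rule can be sketched directly. The assumptions here are that every host has already been assigned to a cluster by some graph clustering step, that predictions are plain booleans, and that a tied cluster keeps its original predictions; the function name `smooth_by_cluster` is made up for this example.

```python
from collections import defaultdict


def smooth_by_cluster(predictions, cluster_of):
    """Relabel every host with the majority prediction of its cluster.

    predictions -- dict host -> bool (True means predicted spam)
    cluster_of  -- dict host -> cluster id
    """
    members = defaultdict(list)
    for host in predictions:
        members[cluster_of[host]].append(host)

    smoothed = dict(predictions)
    for hosts in members.values():
        spam_votes = sum(1 for host in hosts if predictions[host])
        if 2 * spam_votes == len(hosts):
            continue  # tie: keep the original predictions
        majority_is_spam = 2 * spam_votes > len(hosts)
        for host in hosts:
            smoothed[host] = majority_is_spam
    return smoothed
```

The rule relies on the earlier observation that spam tends to be clustered: a lone nonspam prediction inside a mostly spam cluster is more likely a classifier error than a genuine exception.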
Conclusions
Covered: an introduction to Web spam, two categories of spamming techniques, and detection techniques.
Above all, although there are many techniques to detect Web spam, spam is still widespread.
Thank you Q&A