Web Spam 2008.12.20.

1 Web Spam

2 Outline Motivation Introduction to Web Spam Web Spam techniques
Web Spam Detection Conclusions

4 Motivation
Increased exposure on the World Wide Web may yield significant financial gains.
E-commerce is growing rapidly: projected to reach $329 billion by 2010, 13% of all US retail sales.
More traffic means more money.
A large fraction of traffic comes from search engines.
Ways to increase search engine referrals: place ads; provide genuinely better content; create Web spam …

5 Outline Motivation Introduction to Web Spam Web Spam Techniques
Web Spam Detection Conclusions

6 Web Spam
Giving an exact definition of Web spam is not an easy task. Essentially, we know Web spam when we see it. Here are some examples.

7 Defining Web Spam
Spamming = misleading search engines to obtain a higher-than-deserved ranking.
Ranking combines relevance and importance.
Relevance is usually measured through the textual similarity between the query and a page. Pages can be given a query-specific, numeric relevance score; the higher the number, the more relevant the page is to the query.
Importance refers to the global popularity of a page, as often inferred from the link structure or perhaps other indicators.

8 Why Web Spam is Bad
Bad for users: makes it harder to satisfy an information need; leads to a frustrating search experience.
Bad for search engines: wastes bandwidth, CPU cycles, and storage space; pollutes the corpus (infinite number of spam pages!); distorts the ranking of results.

9 Outline Motivation Introduction to Web Spam Web Spam Techniques
Web Spam Detection Conclusions References

10 Web Spam Techniques
Two categories of techniques are associated with Web spam.
Boosting techniques (term-based or link-based): used to achieve high relevance and/or importance for some pages.
Hiding techniques: used to hide the adopted boosting techniques from the eyes of human Web users.

11 Techniques/Boosting
Used to increase ranking (hypertext boosting).
Term boosting: raises relevance for one or many queries; target: TF-IDF variants.
Link boosting: raises importance; target: inlink/outlink counts, which matter to algorithms that assign global hub and authority scores to each page or that use incoming link information to assign global importance scores.
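A minimal sketch of why term repetition targets TF-IDF-style scoring. It assumes a toy three-page corpus and one basic tf × idf formulation (raw term counts, logarithmic idf); the function name `tf_idf_score` and the sample pages are hypothetical, and real engines use normalized variants that damp this effect.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_tokens, corpus):
    """Sum of tf * idf over the query terms for one document (toy variant)."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        tf = counts[term]                              # raw term frequency
        df = sum(1 for doc in corpus if term in doc)   # document frequency
        if tf == 0 or df == 0:
            continue
        score += tf * math.log(1 + n_docs / df)
    return score

# Hypothetical pages: one honest, one stuffed with query terms, one unrelated.
honest = "we sell cameras and lenses at fair prices".split()
stuffed = "cheap cameras cheap cameras cheap cameras buy cheap cameras".split()
other = "weather report for the weekend".split()
corpus = [honest, stuffed, other]

query = ["cheap", "cameras"]
honest_score = tf_idf_score(query, honest, corpus)
stuffed_score = tf_idf_score(query, stuffed, corpus)
print(stuffed_score > honest_score)
```

With unnormalized term frequency, every extra repetition of a query term adds to the score, which is what term boosting exploits.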

12 Techniques/Boosting/Term
Body terms: the simplest and most popular technique, as old as search engines.
Title terms: search engines give a higher weight to terms that appear in the title.
Meta tags: because of heavy spamming, search engines give them low priority or ignore them completely.
<html> <head> <meta name="keywords" content="buy,cheap,cameras,Lens,accessories,nikon,canon"> <title>free,free,free, cheap</title> </head> <body> Our customers agree that we are the best online retailer of cameras! </body> </html>
URL terms: the URL of a page is broken into a set of terms that are used to determine the relevance of the page.
Anchor text: it offers a summary of the pointed-to document, so it is given a higher weight.
<html> …A great <a href="buy-canon-rebel-20d-lens-case.camerasx.com"> free,great deals,cheap,inexpensive,cheap,free</a> store. </html>

13 Techniques/Boosting/Link
Outgoing links: a page links to well-known pages that provide useful resources, but it also links to spam pages.
Spammers can control a large number of sites and create arbitrary link structures.
Spammers buy expired domains to take advantage of the false relevance/importance conveyed by the pool of old links.
A group of spammers sets up a link exchange structure in which their sites point to each other.
Spammers post messages containing links to blogs, forums, and wikis; pages that allow webmasters to post links to their sites may also collect spam links.

14 Techniques/Hiding
Different web pages are served to users and to web crawlers.
Redirection by script: <script type="text/javascript"><!-- location.replace("target.html") //--> </script>
Hidden text in the background color: <body bgcolor="red"> <font color="red">hidden text</font> </body>
Redirection by meta refresh: <meta http-equiv="refresh" content="0;url=plush.com">
Hidden element: <div style="visibility:hidden">You can’t see me!</div>
Tiny image link: <a href="target.html"><img src="tinyimg.gif"></a>

15 Outline Motivation Introduction to Web Spam Web Spam Taxonomy
Web Spam Detection Conclusions

16 How do we detect spam?
Detection techniques: content-based, link-based, cloaking-based, other.

17 Content-based Detection
Detection techniques: content-based, link-based, cloaking-based, other.

18 Related Work
D. Fetterly, M. Manasse, and M. Najork. Spam, Damn Spam, and Statistics. [WebDB 2004]
A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. [WWW 2006]

20 Content-based Detection
Features: number of words in the page and in the title; average word length; amount of anchor text; compression rate; fraction of the page drawn from globally popular words; fraction of globally popular words.
(Detecting spam web pages through content analysis, WWW 2006)
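A rough sketch of how some of the listed content features might be computed for a single page. It assumes the page is already split into plain-text title and body, uses a simple regular-expression tokenizer, and takes the list of globally popular words as given; `content_features` and its dictionary keys are hypothetical names. Anchor text and compression rate are left out here.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def content_features(title, body, popular_words):
    """Compute a few per-page content features (illustrative only)."""
    title_words = tokenize(title)
    body_words = tokenize(body)
    n = len(body_words)
    popular = set(popular_words)
    return {
        "words_in_page": n,
        "words_in_title": len(title_words),
        "avg_word_length": sum(map(len, body_words)) / n if n else 0.0,
        # Share of the page's words that are globally popular words.
        "frac_page_popular": sum(w in popular for w in body_words) / n if n else 0.0,
        # Share of the popular-word list that appears on the page.
        "frac_popular_covered": len(popular & set(body_words)) / len(popular) if popular else 0.0,
    }

features = content_features(
    title="free free cheap cameras",
    body="the best cameras and the best lenses",
    popular_words=["the", "and", "of"],
)
print(features)
```

Each value becomes one input to a classifier; none of them is meant to decide spam on its own.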

21 Number of Words in <title>

22 Distribution of Word-counts in <title>
Spam is more likely in pages with more words in the title.

23 Compressibility of a Page

24 zipRatio of a page

25 Distribution of zipRatios
Spam is more likely in pages with a high zipRatio.
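A small sketch of the compressibility feature, assuming zipRatio means the uncompressed size of the page divided by its compressed size and using gzip from the Python standard library. The function name `zip_ratio` and the two sample texts are made up for illustration.

```python
import gzip
import random

def zip_ratio(text: str) -> float:
    """Uncompressed size divided by gzip-compressed size (assumed definition)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(gzip.compress(raw))

# A keyword-stuffed page repeats the same phrases, so it compresses very well.
repetitive = "cheap cameras free deals " * 200

# Pseudo-random letters stand in for a page with little repetition.
rng = random.Random(0)
varied = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz ") for _ in range(5000))

print(zip_ratio(repetitive), zip_ratio(varied))
```

Highly repetitive content yields a much larger ratio than varied content, which is why the ratio can serve as a spam signal.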

26 Combine heuristics
Use the previously presented metrics as features for a classifier. Results are shown for a decision tree.

27 Decision Tree

28 Link-based Detection
Detection techniques: content-based, link-based, cloaking-based, other.

29 Related Work
B. Davison. Recognizing nepotistic links on the Web. 2000
Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating Web Spam with TrustRank. [VLDB 2004]
B. Wu and B. Davison. Identifying Link Farm Spam Pages. [WWW 2005]
R. Baeza-Yates, C. Castillo, and V. Lopez. PageRank Increase under Different Collusion Topologies. [AIRWeb 2005]
V. Krishnan and R. Raj. Web Spam Detection with Anti-Trust-Rank. [AIRWeb 2006]
L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Using Rank Propagation and Probabilistic Counting for Link Based Spam Detection. [WebKDD 2006]
B. Wu, V. Goel, and B. Davison. Topical TrustRank: Using Topicality to Combat Web Spam. [WWW 2006]
Z. Gyongyi, P. Berkhin, H. Garcia-Molina, and J. Pedersen. Link Spam Detection Based on Mass Estimation. [VLDB 2006]

31 Our target
Detect pages that achieve high PageRank through link spamming.
(Link Spam Detection Based on Mass Estimation, VLDB 2006)

32 PageRank Contribution
[Slides 32 to 36: figures illustrating PageRank contribution.]

37 Spam Mass: Definition
Absolute mass: the amount (part) of a page's PageRank that comes from spam.
Relative mass: the fraction of a page's PageRank that comes from spam; more useful in practice.
Example: absolute mass = p_0^- = 5; relative mass = p_0^- / p_0 = 5/7.

38 Spam Mass: Estimation
Approximate the set of good nodes by a subset called the good core.

40 Spam Mass: Algorithm
Create the good core.
Compute the PageRank scores p_i and p_i^+, where p_i^+ is the part of p_i contributed by the good core.
For all pages i with large PageRank: compute the estimated relative mass m_i = (p_i - p_i^+) / p_i, and mark the page as spam if m_i > threshold.
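The spam mass algorithm can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses a linear, unnormalized PageRank computed by power iteration, obtains the good-core score by restricting the random jump to core nodes without rescaling, and invents the toy graph, the damping factor, and both thresholds.

```python
def pagerank(out_links, jump, damping=0.85, iterations=100):
    """Linear PageRank: p = damping * T^T p + (1 - damping) * jump."""
    p = dict(jump)
    for _ in range(iterations):
        nxt = {node: (1 - damping) * jump[node] for node in out_links}
        for src, targets in out_links.items():
            if targets:  # dangling nodes simply pass nothing on
                share = damping * p[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
        p = nxt
    return p

def relative_spam_mass(out_links, good_core, damping=0.85):
    """Return (p, m), with m_i = (p_i - p_i_plus) / p_i for every node i."""
    n = len(out_links)
    uniform = {node: 1.0 / n for node in out_links}
    core_jump = {node: (1.0 / n if node in good_core else 0.0) for node in out_links}
    p = pagerank(out_links, uniform, damping)
    p_plus = pagerank(out_links, core_jump, damping)  # good-core contribution
    mass = {node: (p[node] - p_plus[node]) / p[node] for node in out_links}
    return p, mass

# Hypothetical graph: two known-good hosts support "good_page",
# while three spam hosts boost "target".
out_links = {
    "g1": ["good_page"], "g2": ["good_page"], "good_page": ["g1"],
    "s1": ["target"], "s2": ["target"], "s3": ["target"], "target": [],
}
p, mass = relative_spam_mass(out_links, good_core={"g1", "g2"})

MIN_PAGERANK = 0.05    # assumed cutoff for "large PageRank"
MASS_THRESHOLD = 0.5   # assumed relative-mass threshold
flagged = [node for node in out_links
           if p[node] >= MIN_PAGERANK and mass[node] > MASS_THRESHOLD]
print(flagged)
```

The PageRank cutoff keeps the spam hosts themselves out of the result: their relative mass is high, but their PageRank is too small to be worth flagging, so only the boosted target remains.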

41 Hiding-based Detection
Detection techniques: content-based, link-based, cloaking-based, other.

42 Related Work
M. Najork. System and method for identifying cloaked web servers, June. U.S. Patent number 6,910,077.
B. Wu and B. D. Davison. Cloaking and redirection: A preliminary study. [AIRWeb 2005]
B. Wu and B. D. Davison. Detecting Semantic Cloaking on the Web. [WWW 2006]

44 Motivation
C: a page from the crawler's perspective. B: a page from the browser's perspective.
Web pages may be updated frequently, so two copies of the same page can differ even without cloaking. Compare several copies instead: if the difference between C1 and B1 is bigger than the difference between C1 and C2, this is taken as sufficient evidence that the page is cloaking.
(Cloaking and redirection: A preliminary study, AIRWeb 2005)

45 Detecting Cloaking
Compare C1 with C2 and C1 with B1, once on terms and once on links, and select a threshold for each comparison.
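A minimal sketch of the term-based comparison. It assumes that the difference between two copies is the number of terms in the symmetric difference of their term sets, and that a page is flagged when the crawler-versus-browser difference exceeds the crawler-versus-crawler difference by more than a threshold. The function names, the sample pages, and the threshold value are hypothetical.

```python
def term_difference(page_a: str, page_b: str) -> int:
    """Number of terms that appear in exactly one of the two copies."""
    terms_a = set(page_a.lower().split())
    terms_b = set(page_b.lower().split())
    return len(terms_a ^ terms_b)

def looks_cloaked(c1: str, c2: str, b1: str, threshold: int = 3) -> bool:
    """Flag a page when C1 vs B1 differs much more than C1 vs C2."""
    crawler_vs_browser = term_difference(c1, b1)
    crawler_vs_crawler = term_difference(c1, c2)  # ordinary page churn
    return crawler_vs_browser - crawler_vs_crawler > threshold

# Two crawler copies (C1, C2) and one browser copy (B1) of a suspicious page.
c1 = "cheap cameras free deals buy now"
c2 = "cheap cameras free deals buy today"
b1 = "welcome to our casino play online"
print(looks_cloaked(c1, c2, b1))

# A frequently updated but honest page changes equally for everyone.
print(looks_cloaked("news headline one story",
                    "news headline two story",
                    "news headline three story"))
```

The same structure would apply to the link-based comparison by replacing the term sets with the sets of links extracted from each copy.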

46 Other Detection methods
Detection techniques: content-based, link-based, cloaking-based, other.

47 Related Work
C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the web topology. [SIGIR 2007]
S. Webb, J. Caverlee, and C. Pu. Characterizing web spam using content and HTTP session analysis. [CEAS 2007]
S. Webb, J. Caverlee, and C. Pu. Predicting Web Spam with HTTP Session Information. [CIKM 2008]

49 Web Topology Detection
Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages.
Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000].
Spam tends to be clustered on the Web (black in the figure).
(Know your neighbors: Web Spam Detection using the Web Topology, SIGIR 2007)

50 Clustering
If the majority of a cluster is predicted to be spam, then we change the prediction for all hosts in the cluster to spam. The inverse holds true too.
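The majority rule can be sketched as below, assuming that per-host predictions and a host-to-cluster assignment are already available from earlier steps. The function name `smooth_by_cluster` and the sample hosts are hypothetical, and leaving tied clusters unchanged is a choice made for this sketch.

```python
from collections import defaultdict

def smooth_by_cluster(predictions, cluster_of):
    """Relabel every host with the majority prediction of its cluster.

    predictions: host -> True if predicted spam, else False
    cluster_of:  host -> cluster identifier
    """
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)

    smoothed = dict(predictions)
    for hosts in members.values():
        spam_votes = sum(predictions[h] for h in hosts)
        if spam_votes * 2 == len(hosts):
            continue  # tie: keep the original predictions (assumption)
        majority_is_spam = spam_votes * 2 > len(hosts)
        for h in hosts:
            smoothed[h] = majority_is_spam
    return smoothed

predictions = {"a.example": True, "b.example": True, "c.example": False,
               "d.example": False, "e.example": False, "f.example": True}
cluster_of = {"a.example": 1, "b.example": 1, "c.example": 1,
              "d.example": 2, "e.example": 2, "f.example": 2}
smoothed = smooth_by_cluster(predictions, cluster_of)
print(smoothed)
```

In cluster 1 the lone nonspam prediction flips to spam, and in cluster 2 the lone spam prediction flips to nonspam.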

51 Outline Motivation Introduction to Web Spam Web Spam Taxonomy
Web Spam Detection Conclusions

52 Conclusions
Introduction to Web spam. Two categories of spamming techniques. Detection techniques.
Above all, although there are many techniques to detect Web spam, spam is still widespread.

53 Thank you Q&A

