Web Spam 2008.12.20.

Outline
- Motivation
- Introduction to Web Spam
- Web Spam Techniques
- Web Spam Detection
- Conclusions

Motivation
- Increased exposure on the World Wide Web may yield significant financial gains.
- E-commerce is growing rapidly: projected to reach $329 billion by 2010, 13% of all US retail sales.
- More traffic means more money.
- A large fraction of traffic comes from search engines.
- Ways to increase search engine referrals: place ads, provide genuinely better content, create Web spam, …

Web Spam
Giving an exact definition of Web spam is not an easy task. Essentially, we know web spam when we see it. Here are some examples.

Defining Web Spam
- Spamming = misleading search engines to obtain a higher-than-deserved ranking.
- Ranking depends on two factors: relevance and importance.
- Relevance is usually measured through the textual similarity between the query and a page. Pages can be given a query-specific, numeric relevance score; the higher the number, the more relevant the page is to the query.
- Importance refers to the global popularity of a page, as often inferred from the link structure or perhaps other indicators.

Why Web Spam is Bad
Bad for users:
- Makes it harder to satisfy an information need
- Leads to a frustrating search experience
Bad for search engines:
- Wastes bandwidth, CPU cycles, and storage space
- Pollutes the corpus (infinite number of spam pages!)
- Distorts the ranking of results

Web Spam Techniques
Two categories of techniques are associated with web spam:
- Boosting techniques (term-based and link-based): achieve high relevance and/or importance for some pages.
- Hiding techniques: hide the adopted boosting techniques from the eyes of human web users.

Techniques/Boosting
Boosting is used to increase ranking.
- Term boosting: targets relevance (for one or many queries). Target algorithms: TF-IDF variants.
- Link boosting: targets importance. Target: inlink/outlink counts.
- Targeted link-based ranking algorithms either assign global hub and authority scores to each page (as HITS does) or use incoming link information to assign global importance scores (as PageRank does).
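The slide mentions ranking algorithms that use incoming link information to assign global importance scores. A minimal sketch of one such algorithm (PageRank by power iteration) may help show why adding inlinks boosts a target page. The toy graph, the node names, and the damping factor of 0.85 are assumptions made for illustration and do not come from the slides.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    nodes = sorted(set(links) | {d for targets in links.values() for d in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: (1 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for d in targets:
                    nxt[d] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                for d in nodes:
                    nxt[d] += damping * rank[v] / n
        rank = nxt
    return rank


# Hypothetical toy web: "target" has no inlinks at first.
honest = {"a": ["b"], "b": ["a"], "target": ["a"]}
# The same web after a spammer adds five farm pages that all link to "target".
farm = dict(honest, **{f"farm{i}": ["target"] for i in range(5)})

before = pagerank(honest)["target"]
after = pagerank(farm)["target"]
print(before, after)
```

In this toy graph the target's score rises once the farm pages point at it, which is the effect link boosting relies on.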

Techniques/Boosting/Term
Term spamming targets the text fields that search engines use to determine relevance:
- Body: the simplest and most popular technique, as old as search engines.
- Title: search engines give a higher weight to terms that appear in the title.
- Meta tags: because of heavy spamming, search engines give them low priority or ignore them completely.
- Anchor text: it offers a summary of the pointed-to document, so it gets a higher weight.
- URL: the URL of a page is broken into a set of terms that are used to determine the relevance of the page.

Example page:
  <html>
  <head>
  <meta name="keywords" content="buy,cheap,cameras,Lens,accessories,nikon,canon">
  <title>free,free,free, cheap</title>
  </head>
  <body> Our customers agree that we are the best online retailer of cameras! … </body>
  </html>

Example anchor text and URL:
  <html> …A great <a href="http://buy-canon-rebel-20d-lens-case.camerasx.com">free,great deals,cheap,inexpensive,cheap,free</a> store. </html>
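To illustrate why repeating terms can raise a relevance score, here is a rough sketch of a TF-IDF style scorer. The scoring formula (a smoothed IDF), the function name `tfidf_score`, and the three toy documents are assumptions chosen for illustration; real search engines use more elaborate variants.

```python
import math
from collections import Counter


def tfidf_score(query_terms, doc_tokens, corpus):
    """Sum of tf * idf over the query terms for one tokenized document."""
    if not doc_tokens:
        return 0.0
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        df = sum(1 for doc in corpus if term in doc)  # document frequency
        tf = counts[term] / len(doc_tokens)           # term frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1   # smoothed inverse doc frequency
        score += tf * idf
    return score


# Hypothetical documents: one ordinary page, one keyword-stuffed page, one unrelated page.
honest = "our store sells cameras and lenses at fair prices".split()
stuffed = "cheap cameras cheap cameras cheap cameras buy cheap cameras".split()
other = "weather report for the coming weekend".split()
corpus = [honest, stuffed, other]

query = ["cheap", "cameras"]
honest_score = tfidf_score(query, honest, corpus)
stuffed_score = tfidf_score(query, stuffed, corpus)
print(honest_score, stuffed_score)
```

The stuffed page scores higher for the query simply because the query terms make up most of its tokens.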

Techniques/Boosting/Link
- Outgoing links to well-known pages: the page provides useful resources, but also has links to spam pages.
- Spammers can control a large number of sites and create arbitrary link structures.
- Buy expired domains to take advantage of the false relevance/importance conveyed by the pool of old links.
- A group of spammers sets up a link exchange structure, so that their sites point to each other.
- Post messages (containing links) to blogs, forums, and wikis.
- Sites that allow webmasters to post links to their sites may receive spam links.

Techniques/Hiding
- Cloaking: serving different web pages to users and web crawlers.
- Redirecting the browser to another page:
  <script type="text/javascript"><!--
  location.replace("target.html")
  //--></script>
  <meta http-equiv="refresh" content="0;url=plush.com">
- Hiding text and links:
  <body bgcolor="red"> <font color="red">hidden text</font> … </body>
  <div style="visibility:hidden">You can't see me!</div>
  <a href="target.html"><img src="tinyimg.gif"></a>

How do we detect spam?
Detection techniques:
- Content-based
- Link-based
- Cloaking-based
- Other

Content-based Detection

Related Work
- D. Fetterly, M. Manasse and M. Najork. Spam, Damn Spam, and Statistics. [WebDB 2004]
- A. Ntoulas, M. Najork, M. Manasse and D. Fetterly. Detecting Spam Web Pages through Content Analysis. [WWW 2006]
- …

Content-based Detection
Features:
- Number of words in the page and in the title
- Average word length
- Amount of anchor text
- Compression rate
- Fraction of the page drawn from globally popular words
- Fraction of globally popular words that appear in the page
Source: Detecting Spam Web Pages through Content Analysis, WWW 2006
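As a rough sketch of how several of these content features could be computed for one page, consider the function below. The tokenizer, the feature names, the choice to express the amount of anchor text as a fraction of page words, and the tiny popular-word list are all assumptions made for illustration, not the definitions used in the cited paper.

```python
import re


def tokenize(text):
    """Lowercase and split into simple word tokens (an illustrative tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def content_features(title, body, anchor_texts, popular_words):
    words = tokenize(body)
    title_words = tokenize(title)
    anchor_words = [w for a in anchor_texts for w in tokenize(a)]
    popular = set(popular_words)
    n = len(words)
    return {
        "num_words": n,
        "num_title_words": len(title_words),
        "avg_word_length": sum(len(w) for w in words) / n if n else 0.0,
        # Share of page words that sit inside anchor text.
        "anchor_fraction": len(anchor_words) / n if n else 0.0,
        # Share of page words that are globally popular words.
        "fraction_page_popular": sum(w in popular for w in words) / n if n else 0.0,
        # Share of the popular-word list that appears somewhere in the page.
        "fraction_popular_covered": len(popular & set(words)) / len(popular) if popular else 0.0,
    }


features = content_features(
    title="free free cheap",
    body="the cheap cameras and the lenses",
    anchor_texts=["cheap cameras"],
    popular_words=["the", "and", "of"],
)
print(features)
```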

Number of Words in <title>
Distribution of word counts in <title>: spam is more likely in pages with more words in the title.
Source: Detecting Spam Web Pages through Content Analysis, WWW 2006

Compressibility of a Page
zipRatio of a page
Distribution of zipRatios: spam is more likely in pages with a high zipRatio.
Source: Detecting Spam Web Pages through Content Analysis, WWW 2006
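The slides do not define zipRatio. A minimal sketch, assuming it means the uncompressed size of the page divided by its compressed size, shows why repetitive (keyword-stuffed) content gets a high value. The use of `zlib` and the two sample texts are assumptions for illustration.

```python
import random
import string
import zlib


def zip_ratio(text):
    """Uncompressed size divided by compressed size (assumed definition)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


# Highly repetitive text, as produced by keyword stuffing.
repetitive = "cheap cameras free " * 200

# Pseudo-random text of the same length, standing in for less redundant content.
rng = random.Random(0)
varied = "".join(rng.choices(string.ascii_lowercase + " ", k=len(repetitive)))

print(zip_ratio(repetitive), zip_ratio(varied))
```

Redundant pages compress much better, so their ratio is far higher than that of less repetitive text.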

Combine Heuristics
- Use the previously presented metrics as features for a classifier.
- Results are shown for a decision tree.

Decision Tree
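The tree itself was a figure and is not in the transcript. As a purely illustrative sketch of how a decision tree turns feature values into a spam/non-spam label, the toy tree below uses hypothetical feature names and thresholds (`zip_ratio` > 4.0, `num_title_words` > 20); they are assumptions, not the learned tree from the cited work.

```python
def classify(features, tree):
    """Walk a nested-dict decision tree until a leaf label is reached."""
    node = tree
    while isinstance(node, dict):
        value = features[node["feature"]]
        node = node["high"] if value > node["threshold"] else node["low"]
    return node


# Hypothetical tree: thresholds are made up for illustration only.
TOY_TREE = {
    "feature": "zip_ratio",
    "threshold": 4.0,
    "high": "spam",
    "low": {
        "feature": "num_title_words",
        "threshold": 20,
        "high": "spam",
        "low": "non-spam",
    },
}

print(classify({"zip_ratio": 6.0, "num_title_words": 5}, TOY_TREE))
print(classify({"zip_ratio": 2.0, "num_title_words": 5}, TOY_TREE))
```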

Link-based Detection

Related Work
- B. Davison. Recognizing Nepotistic Links on the Web. 2000.
- Z. Gyongyi, H. Garcia-Molina and J. Pedersen. Combating Web Spam with TrustRank. [VLDB 2004]
- B. Wu and B. Davison. Identifying Link Farm Spam Pages. [WWW 2005]
- R. Baeza-Yates, C. Castillo and V. Lopez. PageRank Increase under Different Collusion Topologies. [AIRWeb 2005]
- V. Krishnan and R. Raj. Web Spam Detection with Anti-Trust-Rank. [AIRWeb 2006]
- L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates. Using Rank Propagation and Probabilistic Counting for Link-Based Spam Detection. [WebKDD 2006]
- B. Wu, V. Goel and B. Davison. Topical TrustRank: Using Topicality to Combat Web Spam. [WWW 2006]
- Z. Gyongyi, P. Berkhin, H. Garcia-Molina and J. Pedersen. Link Spam Detection Based on Mass Estimation. [VLDB 2006]

Our Target
Detect pages that achieve high PageRank through link spamming.
Source: Link Spam Detection Based on Mass Estimation, VLDB 2006

PageRank Contribution
Source: Link Spam Detection Based on Mass Estimation, VLDB 2006

Spam Mass: Definition
- Absolute mass: the amount (part) of PageRank coming from spam.
- Relative mass: the fraction of PageRank coming from spam. More useful in practice.
- Example: absolute mass = $p_0^- = 5$; relative mass = $p_0^- / p_0 = 5/7$.
Source: Link Spam Detection Based on Mass Estimation, VLDB 2006
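The example numbers can be worked through explicitly. This sketch assumes that the PageRank $p_0$ of the example page splits into a part contributed by good nodes, written $p_0^+$ here, and a part contributed by spam nodes, $p_0^-$. The value $p_0 = 7$ is inferred from the ratio $5/7$ on the slide.

```latex
\begin{align*}
p_0 &= p_0^+ + p_0^- = 2 + 5 = 7 \\
\text{absolute mass} &= p_0^- = 5 \\
\text{relative mass} &= \frac{p_0^-}{p_0} = \frac{5}{7} \approx 0.71
\end{align*}
```

Under this reading, about 71% of the example page's PageRank comes from spam.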

Spam Mass: Estimation
Approximate the set of good nodes by a subset called the good core.
Source: Link Spam Detection Based on Mass Estimation, VLDB 2006

Spam Mass: Algorithm
1. Create the good core.
2. Compute PageRank scores $p_i$ and $p_i^+$.
3. For all pages $i$ with large PageRank:
   a. Compute the estimated relative mass $m_i = (p_i - p_i^+) / p_i$.
   b. Mark the page as spam if $m_i$ > threshold.
Source: Link Spam Detection Based on Mass Estimation, VLDB 2006
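The algorithm can be sketched as follows. This is a simplified illustration under stated assumptions: PageRank is computed in an unnormalized linear form, $p_i$ uses a uniform random jump over all pages, and $p_i^+$ uses a jump restricted to the good core. The toy graph, the threshold of 0.5, and the PageRank cutoff of 0.05 are hypothetical values, not the settings from the cited paper.

```python
def linear_pagerank(links, jump, damping=0.85, iterations=50):
    """Iterate p = damping * (link contributions) + (1 - damping) * jump, without renormalizing."""
    p = dict(jump)
    for _ in range(iterations):
        nxt = {v: (1 - damping) * jump[v] for v in jump}
        for src, targets in links.items():
            for dst in targets:
                nxt[dst] += damping * p[src] / len(targets)
        p = nxt
    return p


def spam_mass_detect(links, nodes, good_core, threshold=0.5, min_pagerank=0.05):
    n = len(nodes)
    uniform = {v: 1.0 / n for v in nodes}
    core_only = {v: (1.0 / n if v in good_core else 0.0) for v in nodes}
    p = linear_pagerank(links, uniform)         # p_i
    p_plus = linear_pagerank(links, core_only)  # p_i^+
    flagged = set()
    for v in nodes:
        if p[v] < min_pagerank:                 # only consider pages with large PageRank
            continue
        relative_mass = (p[v] - p_plus[v]) / p[v]
        if relative_mass > threshold:
            flagged.add(v)
    return flagged


# Hypothetical graph: two good pages support t_good, three spam pages support t_spam.
nodes = ["g1", "g2", "t_good", "s1", "s2", "s3", "t_spam"]
links = {"g1": ["t_good"], "g2": ["t_good"],
         "s1": ["t_spam"], "s2": ["t_spam"], "s3": ["t_spam"]}
flagged = spam_mass_detect(links, nodes, good_core={"g1", "g2"})
print(flagged)
```

In this toy graph only `t_spam` is flagged: none of its PageRank can be traced back to the good core, while `t_good` receives most of its score from core pages.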

Hiding-based Detection

Related Work
- M. Najork. System and Method for Identifying Cloaked Web Servers. U.S. Patent 6,910,077, June 21, 2005.
- B. Wu and B. D. Davison. Cloaking and Redirection: A Preliminary Study. [AIRWeb 2005]
- B. Wu and B. D. Davison. Detecting Semantic Cloaking on the Web. [WWW 2006]

Motivation
- C: a page from the crawler's perspective. B: a page from the browser's perspective.
- Web pages may be updated frequently, so a difference between one crawler copy and one browser copy does not by itself prove cloaking.
- Compare: if the difference between C1 and B1 is bigger than the difference between C1 and C2, this is sufficient evidence that the page is cloaking.
Source: Cloaking and Redirection: A Preliminary Study, AIRWeb 2005

Detecting Cloaking
- Compare C1 with C2, and C1 with B1.
- Make the comparison on terms and on links.
- Choose a threshold for each comparison.
Source: Cloaking and Redirection: A Preliminary Study, AIRWeb 2005
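The comparison of crawler and browser copies can be sketched for the term case as below. The difference measure (number of terms present in one copy but not the other), the function names, the default threshold of 0, and the sample page texts are assumptions for illustration; the cited study may define its measures differently.

```python
def term_difference(page_a, page_b):
    """Number of distinct terms that appear in exactly one of the two copies."""
    terms_a = set(page_a.lower().split())
    terms_b = set(page_b.lower().split())
    return len(terms_a ^ terms_b)


def looks_cloaked(c1, c2, b1, threshold=0):
    """Flag the page if crawler-vs-browser differs more than crawler-vs-crawler."""
    return term_difference(c1, b1) - term_difference(c1, c2) > threshold


# Hypothetical copies: C1 and C2 fetched as a crawler, B1 fetched as a browser.
c1 = "buy cheap cameras free deals"
c2 = "buy cheap cameras free offers"
b1_cloaked = "welcome to our casino play now"
b1_dynamic = "buy cheap cameras free sale"

print(looks_cloaked(c1, c2, b1_cloaked))
print(looks_cloaked(c1, c2, b1_dynamic))
```

The second call illustrates a page that merely changes between fetches: the browser copy differs from C1 no more than C2 does, so it is not flagged. The same pattern could be applied to the sets of links on each copy.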

Other Detection Methods

Related Work
- C. Castillo, D. Donato, A. Gionis, V. Murdock and F. Silvestri. Know Your Neighbors: Web Spam Detection Using the Web Topology. [SIGIR 2007]
- S. Webb, J. Caverlee and C. Pu. Characterizing Web Spam Using Content and HTTP Session Analysis. [CEAS 2007]
- S. Webb, J. Caverlee and C. Pu. Predicting Web Spam with HTTP Session Information. [CIKM 2008]
- …

Web Topology Detection
- Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages.
- Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000].
- Spam tends to be clustered on the Web (black on figure).
Source: Know Your Neighbors: Web Spam Detection Using the Web Topology, SIGIR 2007

Clustering
If the majority of a cluster is predicted to be spam, then we change the prediction for all hosts in the cluster to spam. The inverse holds true too.
Source: Know Your Neighbors: Web Spam Detection Using the Web Topology, SIGIR 2007
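The majority rule described above can be sketched as a small post-processing step over a classifier's per-host predictions. The strict-majority test, the handling of ties (left unchanged), and the host and cluster names are simplifying assumptions for illustration; the cited paper may apply different thresholds.

```python
from collections import defaultdict


def smooth_by_cluster(predictions, cluster_of):
    """Relabel every host in a cluster with the cluster's majority prediction.

    predictions: {host: True if predicted spam, else False}
    cluster_of:  {host: cluster id}
    """
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)

    smoothed = dict(predictions)
    for hosts in members.values():
        spam_votes = sum(1 for h in hosts if predictions[h])
        if spam_votes * 2 > len(hosts):
            label = True       # majority spam: mark the whole cluster as spam
        elif spam_votes * 2 < len(hosts):
            label = False      # majority non-spam: the inverse case
        else:
            continue           # tie: leave the individual predictions alone
        for h in hosts:
            smoothed[h] = label
    return smoothed


# Hypothetical hosts grouped into two clusters.
predictions = {"a.example": True, "b.example": True, "c.example": False,
               "d.example": False, "e.example": False, "f.example": True}
cluster_of = {"a.example": 1, "b.example": 1, "c.example": 1,
              "d.example": 2, "e.example": 2, "f.example": 2}
smoothed = smooth_by_cluster(predictions, cluster_of)
print(smoothed)
```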

Conclusions
- Introduction to web spam
- Two categories of spamming techniques
- Detection techniques
- Above all, although there are many techniques to detect web spam, spam is still widespread.

Thank you Q&A