Know your Neighbors: Web Spam Detection Using the Web Topology. Presented by Soumo Gorai. Authors: Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa Murdock(1), Fabrizio Silvestri(2). (1) Yahoo! Research Barcelona – Catalunya, Spain; (2) ISTI-CNR – Pisa, Italy. ACM SIGIR, 25 July 2007, Amsterdam

Soumo’s Biography: 4th-year CS major, graduating May 2008. Interesting about me: lived in India, Australia, and the U.S. CS interests: databases, HCI, web programming, networking, graphics, gaming.

Here’s all that you can find on the web…

Here’s just some of what really is out there…

And more…

Why so many different things? There is fierce competition for your attention! The Web makes publication easy, for personal pages as well as commercial sites, advertisements, and economic activity… and with that ease comes lots and lots of spam!

What’s Spam?!

Hidden Text

Only hidden text? Here’s a whole fake search engine!!!

Why is Spam bad? Costs:
- For users: lower precision for some queries
- For search engines: wasted storage space, network resources, and processing cycles
- For publishers: resources invested in cheating and not in improving their contents
Every undeserved gain in ranking for a spammer is a loss of search precision for the search engine.

How Do We Detect Spam?
- Machine Learning/Training
- Link-based Detection
- Content-based Detection
- Using Links and Contents
- Using Web-based Topology

Machine Learning/Training

ML Challenges
Machine Learning challenges:
- Instances are not really independent (they form a graph)
- The training set is relatively small
Information Retrieval challenges:
- It is hard to find out which features are relevant
- It is hard for search engines to provide labeled data
- Even if they do, it will not reflect a consensus on what is Web spam

Link-based Detection. Single-level link farms can be detected by searching for groups of nodes that share their out-links [Gibson et al., 2005].

Why use it? Link-based features:
- Degree-related measures
- PageRank
- TrustRank [Gyöngyi et al., 2004]
- Truncated PageRank [Becchetti et al., 2006]: similar to PageRank, but it discounts the contribution that a page's close neighbors make to its score. The Truncated PageRank score is a useful feature for spam detection because spam pages generally try to reinforce their PageRank by linking to each other.
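The idea behind Truncated PageRank can be sketched as follows: express PageRank as a sum over random-walk path lengths, and simply skip the first T levels, so a page cannot boost itself through its immediate link neighborhood. This is a minimal illustration on a toy out-link dictionary with assumed parameters (damping 0.85, T = 2), not the authors' implementation.

```python
def truncated_pagerank(out_links, alpha=0.85, T=2, iters=20):
    """Sketch of Truncated PageRank: accumulate the PageRank power
    series, but ignore the contribution of paths of length <= T."""
    nodes = list(out_links)
    n = len(nodes)
    x = {v: 1.0 / n for v in nodes}       # walk distribution after t steps
    score = {v: 0.0 for v in nodes}
    for t in range(1, iters + 1):
        nxt = {v: 0.0 for v in nodes}
        for v, outs in out_links.items():
            if outs:
                share = x[v] / len(outs)  # split mass over out-links
                for w in outs:
                    nxt[w] += share
            else:                          # dangling node: spread uniformly
                for w in nodes:
                    nxt[w] += x[v] / n
        x = nxt
        if t > T:                          # truncation: drop short paths
            for v in nodes:
                score[v] += (1 - alpha) * (alpha ** t) * x[v]
    return score
```

Setting T = 0 recovers (up to normalization) ordinary PageRank, which is why the two scores can be compared as a spam signal.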

Degree-based Measures. Related to in-degree and out-degree:
- Edge reciprocity (the number of links that are reciprocal)
- Assortativity (the ratio between the degree of a particular page and the average degree of its neighbors)
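These two measures can be sketched directly from an out-link dictionary; a minimal illustration, assuming out-degree only for the assortativity neighbors (the actual features are computed on the full host graph).

```python
def reciprocity(out_links):
    """Fraction of directed links (u, v) whose reverse link (v, u) exists."""
    edges = {(u, v) for u, outs in out_links.items() for v in outs}
    if not edges:
        return 0.0
    return sum((v, u) in edges for (u, v) in edges) / len(edges)

def assortativity(v, out_links):
    """Ratio between the degree of page v and the average degree of its
    neighbors (out-degrees only, for simplicity)."""
    neigh = out_links.get(v, [])
    if not neigh:
        return 0.0
    avg = sum(len(out_links.get(w, [])) for w in neigh) / len(neigh)
    return len(neigh) / avg if avg else 0.0
```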

TrustRank / PageRank. TrustRank: an algorithm that starts from a hand-picked set of trusted nodes and propagates trust along links, so a page's TrustRank score reflects how closely it is connected to known trusted pages. Derived features: the ratio between TrustRank and PageRank, and the number of home pages. Cons: this alone is not sufficient, as there are many false positives.
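TrustRank can be sketched as a biased PageRank whose teleportation jumps only to the trusted seed set, so trust decays with link distance from the seeds. A minimal sketch on a toy graph with an assumed damping factor of 0.85; dangling mass is simply dropped for brevity.

```python
def trustrank(out_links, trusted_seeds, alpha=0.85, iters=20):
    """Sketch of TrustRank: PageRank with teleportation restricted to
    a hand-picked set of trusted seed hosts."""
    nodes = list(out_links)
    seed_mass = 1.0 / len(trusted_seeds)
    t = {v: (seed_mass if v in trusted_seeds else 0.0) for v in nodes}
    for _ in range(iters):
        # restart mass goes only to the trusted seeds
        nxt = {v: (1 - alpha) * (seed_mass if v in trusted_seeds else 0.0)
               for v in nodes}
        for v, outs in out_links.items():
            if outs:
                share = alpha * t[v] / len(outs)
                for w in outs:
                    nxt[w] += share
        t = nxt
    return t
```

The TrustRank/PageRank ratio feature mentioned above would then be this score divided by the ordinary PageRank score of the same host.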

Content-based Detection. Most of the features reported in [Ntoulas et al., 2006]:
- Number of words in the page and title
- Average word length
- Fraction of anchor text
- Fraction of visible text
- Compression rate
- Corpus precision and corpus recall
- Query precision and query recall
- Independent trigram likelihood
- Entropy of trigrams
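Two of these features are easy to sketch. The compression rate exploits the fact that keyword-stuffed pages are highly repetitive and compress well; the trigram entropy measures how uniform the page's n-gram distribution is. A minimal illustration using character trigrams for brevity (the paper's features are computed over word n-grams).

```python
import math
import zlib
from collections import Counter

def compression_rate(text):
    """Compressed size over raw size; repetitive (spammy) text
    compresses much better, giving a low ratio."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

def trigram_entropy(text):
    """Shannon entropy of the character-trigram distribution;
    stuffed pages tend to have unusually low entropy."""
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(grams.values())
    return -sum((c / total) * math.log2(c / total)
                for c in grams.values())
```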

Corpus and Query
F: set of most frequent terms in the collection
Q: set of most frequent terms in a query log
P: set of terms in a page
Computed features:
- Corpus precision: the fraction of words (excluding stopwords) in a page that appear in the set of popular terms of the data collection.
- Corpus recall: the fraction of popular terms of the data collection that appear in the page.
- Query precision: the fraction of words in a page that appear in the set of the q most popular terms in the query log.
- Query recall: the fraction of the q most popular terms of the query log that appear in the page.
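The corpus measures above can be sketched in a few lines; the query measures are identical with the query-log term set Q in place of F. The stopword list here is a small illustrative subset, not the one used in the paper.

```python
STOPWORDS = {"the", "a", "of", "in", "and", "to"}  # illustrative subset

def corpus_precision(page_terms, frequent_terms):
    """Fraction of non-stopword words in the page that are in F,
    the most frequent terms of the collection."""
    words = [w for w in page_terms if w not in STOPWORDS]
    if not words:
        return 0.0
    return sum(w in frequent_terms for w in words) / len(words)

def corpus_recall(page_terms, frequent_terms):
    """Fraction of the popular terms F that appear in the page."""
    page = set(page_terms)
    return sum(t in page for t in frequent_terms) / len(frequent_terms)
```

Spam pages stuffed with popular terms tend to score abnormally high on both measures, which is what makes them discriminative.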

Visual Clues
Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
Figure: Histogram of the corpus precision in non-spam vs. spam pages.
Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.

Links AND Contents Detection. Why both? Link-based and content-based features capture different kinds of spam, so combining them yields a more accurate classifier than either alone.

Web Topology Detection. Pages topologically close to each other are more likely to have the same label (spam/non-spam) than random pairs of pages. Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000]. Spam tends to be clustered on the Web (shown in black in the figure).

Topological dependencies: in-links and out-links. Let S_OUT(x) be the fraction of spam hosts linked by host x out of all labeled hosts linked by host x. This figure shows the histogram of S_OUT for spam and non-spam hosts; we see that almost all non-spam hosts link mostly to non-spam hosts. Let S_IN(x) be the fraction of spam hosts that link to host x out of all labeled hosts that link to x. This figure shows the histograms of S_IN for spam and non-spam hosts; in this case there is a clear separation between spam and non-spam hosts.
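Both quantities can be sketched directly from a host-level out-link dictionary and a partial label map; a minimal illustration where `labels` holds 'spam'/'nonspam' for the labeled hosts only.

```python
def s_out(x, out_links, labels):
    """S_OUT(x): fraction of the labeled hosts linked by x that are spam."""
    labeled = [w for w in out_links.get(x, []) if w in labels]
    if not labeled:
        return 0.0
    return sum(labels[w] == "spam" for w in labeled) / len(labeled)

def s_in(x, out_links, labels):
    """S_IN(x): fraction of the labeled hosts linking to x that are spam."""
    in_hosts = [v for v, outs in out_links.items()
                if x in outs and v in labels]
    if not in_hosts:
        return 0.0
    return sum(labels[v] == "spam" for v in in_hosts) / len(in_hosts)
```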

Clustering: if the majority of hosts in a cluster are predicted to be spam, the prediction for all hosts in the cluster is changed to spam; conversely, if the majority are predicted non-spam, the whole cluster is relabeled non-spam.
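This post-processing step can be sketched as a majority vote per cluster; a minimal illustration using a single 0.5 threshold for simplicity, with predictions encoded as 1 = spam, 0 = non-spam.

```python
def smooth_by_cluster(pred, clusters, threshold=0.5):
    """Relabel every host in a cluster with the cluster's majority
    prediction: mostly spam -> all spam, otherwise all non-spam."""
    out = dict(pred)
    for members in clusters:
        spam_frac = sum(pred[h] for h in members) / len(members)
        label = 1 if spam_frac > threshold else 0
        for h in members:
            out[h] = label
    return out
```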

Article Critique
Pros:
- Gives detailed descriptions of various detection mechanisms.
- Integrates link and content attributes for building a system to detect Web spam.
Cons:
- Missing statistics and success rates for other content-based detection techniques.
- Some graphs had axis labels missing.
Extension: combine the regularization methods at hand (regularization being any method of preventing a model from overfitting the data) in order to improve the overall accuracy.

Summary
How do we detect spam?
- Machine Learning/Training
- Link-based Detection
- Content-based Detection
- Using Links and Contents
- Using Web-based Topology
Why is spam bad? Costs:
- For users: lower precision for some queries
- For search engines: wasted storage space, network resources, and processing cycles
- For publishers: resources invested in cheating and not in improving their contents
Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.