Know your Neighbors: Web Spam Detection using the Web Topology By Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock and Fabrizio Silvestri.

Presentation transcript:

Know your Neighbors: Web Spam Detection using the Web Topology
By Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock and Fabrizio Silvestri
Presented by Sovandy Hang, CS 4440, Fall 2007

Outline
- About me
- Introduction
- Keywords
- How the process works
- Conclusion
- Questions and answers

About Me
- 5th-year CS and IE major
- Graduating next summer
- Interest: Enterprise Resource Planning
- Think all software should be open source

Introduction
- Web search is a part of our lives.
- Many businesses rely on the web.
- There is a huge economic incentive for commercial websites to influence search results.
- Web spamming is cheap and often successful.
- Web spam degrades the quality of search engines.
- Web spam is annoying.

Keywords
- Web spam
- PageRank
- Spamdexing
- Spamicity
- Graph-based algorithm

Measurement Tool

How does it work?
Feature Extraction -> Classification -> Smoothing (Clustering, Propagation, Stacked Graphical Learning)

Feature Extraction
- The data set is obtained by using a web crawler.
- For each page, its links and its contents are obtained.
- From the data set, a full graph is built.
- For each host and page, certain features are computed.
- Link-based features are extracted from the host graph.
- Content-based features are extracted from individual pages.
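As a rough illustration of the step from crawled pages to a host graph, the sketch below collapses page-level links into host-level edges. The function name `build_host_graph`, the `(source_url, target_url)` input format, and the choice to count links and drop intra-host links are assumptions made for this example, not details taken from the paper.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_host_graph(page_links):
    """Collapse page-level links into a weighted host-level graph.

    page_links: iterable of (source_url, target_url) pairs
                (a hypothetical crawler output format).
    Returns {source_host: {target_host: number_of_page_links}}.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst in page_links:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        # Skip malformed URLs and links that stay inside one host.
        if not src_host or not dst_host or src_host == dst_host:
            continue
        graph[src_host][dst_host] += 1
    return {host: dict(targets) for host, targets in graph.items()}


links = [
    ("http://a.example/p1", "http://b.example/"),
    ("http://a.example/p2", "http://b.example/x"),
    ("http://a.example/p1", "http://a.example/p2"),
    ("http://b.example/", "http://c.example/"),
]
host_graph = build_host_graph(links)
```

Here `host_graph` keeps one weighted edge from `a.example` to `b.example` for the two page links between them, and the link inside `a.example` is ignored.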

Link-based Features
Some important link-based features are:
- Degree-related measures
- PageRank
- TrustRank
- Truncated PageRank
- Estimation of supporters
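To make two of the features listed above concrete, here is a minimal sketch of in-degree and PageRank on a small directed graph. The damping factor of 0.85, the fixed iteration count, and the uniform handling of nodes without out-links are common textbook choices assumed for this example; the function names are illustrative.

```python
def in_degree(graph):
    """Count incoming links per node for a graph given as {node: [successors]}."""
    degree = {v: 0 for v in graph}
    for targets in graph.values():
        for w in targets:
            degree[w] = degree.get(w, 0) + 1
    return degree


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a graph given as {node: [successors]}."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not graph.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            out = graph.get(v, [])
            for w in out:
                new_rank[w] += damping * rank[v] / len(out)
        rank = new_rank
    return rank


toy_graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
degrees = in_degree(toy_graph)
ranks = pagerank(toy_graph)
```

In this toy graph, node `c` receives three of the four links, so it has the largest in-degree and also ends up with the highest PageRank.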

Content-based Features
Some important content-based features are:
- Fraction of visible text
- Compression rate
- Corpus precision and corpus recall
- Query precision and query recall
- Independent trigram likelihood
- Entropy of trigrams
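Two of the content-based features above can be approximated in a few lines. This sketch assumes the compression rate is the compressed size divided by the raw size (the paper may define it the other way around) and that the trigram entropy is the Shannon entropy over word trigrams; both function names are made up for the example.

```python
import math
import zlib
from collections import Counter


def compression_rate(text):
    """Compressed size divided by raw size; repetitive text gives a low value."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


def trigram_entropy(text):
    """Shannon entropy (in bits) of the word-trigram distribution of the text."""
    words = text.lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    total = len(trigrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


stuffed = "cheap pills buy now " * 50
normal = ("Web search engines rank pages by combining content signals "
          "with the link structure of the graph")
```

Keyword-stuffed text such as `stuffed` compresses far better and has a much less varied trigram distribution than ordinary prose such as `normal`, which is what makes these measures useful as spam signals.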

Classification
- Create a base classifier from link-based and content-based features.
- Apply a cost-sensitive decision tree to classify spam and non-spam hosts.
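The idea of cost sensitivity can be shown with a deliberately simplified stand-in for a decision tree: a single-feature, one-split "stump" that picks its threshold by total misclassification cost rather than error count. The cost values and the function name `fit_cost_sensitive_stump` are assumptions for illustration only; the paper's classifier is a full decision tree over many features.

```python
def fit_cost_sensitive_stump(values, labels, spam_miss_cost=5.0, false_alarm_cost=1.0):
    """Pick the threshold t minimising total cost for the rule 'spam if value >= t'.

    values: one numeric feature per host; labels: True for spam hosts.
    Returns (best_threshold, total_cost_on_the_training_data).
    """
    candidates = sorted(set(values)) + [float("inf")]
    best_threshold, best_cost = None, float("inf")
    for t in candidates:
        cost = 0.0
        for v, is_spam in zip(values, labels):
            predicted_spam = v >= t
            if is_spam and not predicted_spam:
                cost += spam_miss_cost      # spam slipped through
            elif predicted_spam and not is_spam:
                cost += false_alarm_cost    # normal host flagged as spam
        if cost < best_cost:
            best_threshold, best_cost = t, cost
    return best_threshold, best_cost


values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [False, True, False, False, True, True]
equal_costs = fit_cost_sensitive_stump(values, labels, 1.0, 1.0)
spam_costly = fit_cost_sensitive_stump(values, labels, 5.0, 1.0)
```

With equal costs the stump accepts one missed spam host to avoid false alarms; once a missed spam host costs five times as much, it lowers the threshold so that every spam host is caught at the price of two false alarms.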

Smoothing
- Hosts are now labeled as spam or non-spam by the classifier.
- Smoothing improves on the base classifier's predictions.
- Some smoothing techniques are:
  - Clustering
  - Propagation
  - Stacked graphical learning

Smoothing (Cont.)
Smoothing is based on the topological dependencies of spam nodes:
- Links are not placed at random.
- Similar pages tend to link to each other more frequently than dissimilar pages.
Put another way, spam tends to be clustered on the Web:
- Non-spam nodes tend to be linked by very few spam nodes, and usually link to no spam nodes.
- Spam nodes are mainly linked by spam nodes.

Smoothing - Clustering
- Split the graph into many clusters.
- Use the METIS graph clustering algorithm.
- If the majority of nodes in a cluster are spam, then all hosts in the cluster are labeled spam.
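The majority rule on this slide can be sketched as follows, assuming the cluster assignment has already been produced by some graph partitioner (METIS itself is not called here). Leaving clusters without a spam majority unchanged is an assumption, since the slide only states the spam case; `smooth_by_cluster` is an illustrative name.

```python
from collections import defaultdict


def smooth_by_cluster(predictions, cluster_of):
    """Relabel whole clusters as spam when most of their hosts are predicted spam.

    predictions: {host: True if the base classifier predicted spam}
    cluster_of:  {host: cluster id}, e.g. the output of a graph partitioner
    """
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)
    smoothed = dict(predictions)
    for hosts in members.values():
        spam_votes = sum(1 for h in hosts if predictions[h])
        if spam_votes * 2 > len(hosts):  # strict majority is spam
            for h in hosts:
                smoothed[h] = True
        # Otherwise the base predictions are kept as they are (assumption).
    return smoothed


predictions = {"a": True, "b": True, "c": False, "d": False, "e": True, "f": False}
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
smoothed = smooth_by_cluster(predictions, cluster_of)
```

Host `c` is flipped to spam because two of the three hosts in its cluster are predicted spam, while the second cluster has no spam majority and is left untouched.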

Smoothing - Propagation
- Propagate predictions using random walks.
- Start from a node labeled as spam by the base classifier, then follow links forward or backward.
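One way to read the propagation step is as a random walk with restart, where the walker always restarts at hosts the base classifier labeled as spam. The sketch below assumes that reading; the damping factor, the iteration count, the dead-end handling, and the name `propagate_spamicity` are choices made for this example rather than details from the slides.

```python
def propagate_spamicity(graph, seeds, damping=0.85, iterations=50, backward=False):
    """Random walk with restart to the seed (predicted-spam) nodes.

    graph: {node: [successors]}; seeds: non-empty set of nodes labeled spam.
    backward=True follows links in reverse. Returns {node: score}.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    edges = {v: [] for v in nodes}
    for v, targets in graph.items():
        for w in targets:
            if backward:
                edges[w].append(v)
            else:
                edges[v].append(w)
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    score = dict(restart)
    for _ in range(iterations):
        new_score = {v: (1.0 - damping) * restart[v] for v in nodes}
        for v in nodes:
            if edges[v]:
                share = damping * score[v] / len(edges[v])
                for w in edges[v]:
                    new_score[w] += share
            else:
                # A walker stuck on a dead end restarts from the seeds.
                for s in seeds:
                    new_score[s] += damping * score[v] / len(seeds)
        score = new_score
    return score


web = {
    "spam1": ["spam2"],
    "spam2": ["spam1", "victim"],
    "good": ["other"],
    "other": ["good"],
}
forward_scores = propagate_spamicity(web, {"spam1"})
backward_scores = propagate_spamicity(web, {"victim"}, backward=True)
```

Hosts reachable from the seed accumulate score, while `good` and `other`, which are not connected to it, stay at zero. A threshold on the resulting score could then be used to relabel hosts.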

Smoothing - Stacked Graphical Learning
- It is a machine learning process.
- It creates extra features in addition to the content-based and link-based ones.
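A minimal sketch of how such an extra feature might be built, assuming it is the average spamicity that the base classifier predicted for a host's neighbors. The data layout and the name `add_neighbor_feature` are hypothetical; after this step a classifier would be retrained on the extended feature vectors.

```python
def add_neighbor_feature(features, spamicity, neighbors):
    """Append the average predicted spamicity of each host's neighbors.

    features:  {host: [feature values]} used by the base classifier
    spamicity: {host: base-classifier spam probability in [0, 1]}
    neighbors: {host: [hosts linked to or from it]}
    """
    extended = {}
    for host, values in features.items():
        linked = [g for g in neighbors.get(host, []) if g in spamicity]
        # Hosts without known neighbors get 0.0 (an arbitrary default).
        average = sum(spamicity[g] for g in linked) / len(linked) if linked else 0.0
        extended[host] = list(values) + [average]
    return extended


features = {"a": [1.0], "b": [2.0], "c": [3.0]}
spamicity = {"a": 0.9, "b": 0.7, "c": 0.1}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": []}
extended = add_neighbor_feature(features, spamicity, neighbors)
```

Each host keeps its original features and gains one more column summarizing how spammy its neighborhood looks, which lets the second-round classifier use the link structure directly.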

Conclusion
- The approach is based on the assumption that spammers tend to be linked together.
- Using both link-based and content-based features enhances detection quality.
- It can be used on web datasets of any size.
- The paper does not explain each step very well.

Useful Reading
- “Using Spam Farm to Boost PageRank” by Ye Du, Yaoyun Shi, Xin Zhao
- “Using Annotations in Enterprise Search” by Pavel A. Dmitriev, Nadav Eiron, Marcus Fontoura, Eugene Shekita

Questions?