Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam

Presentation transcript:

Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam
Steve Hookway
11/17/05

Motivation
- Black and blue ("nigritude ultramarine"): an SEO competition
- Identify SPAM pages and discount them in ranking
- Which techniques work best, and will they last?

SPAM vs Ham
- Spam
  - Link farms
  - Link exchange services
  - Guestbooks
- Ham
  - Dmoz

BadRank
- Google may make use of BadRank (BR):
  - Interleave crawling and page rank updating
  - When updating page rank, BR and a blacklist are considered
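The slide does not define BadRank. A minimal sketch, assuming the formulation commonly described in SEO write-ups (badness flows backwards from blacklisted pages to the pages that link to them, divided by the target's in-degree), might look like this. Whether any search engine actually uses it is unconfirmed, and the function name, the damping value and the seed scores are all assumptions.

```python
def bad_rank(out_links, blacklist, damping=0.85, iterations=50):
    """Propagate 'badness' backwards along links (illustrative sketch only).

    out_links: dict mapping page -> list of pages it links to.
    blacklist: set of pages assumed to be known spam (hypothetical seed set).
    """
    pages = set(out_links) | {t for ts in out_links.values() for t in ts}

    # Count inbound links, used to split a page's badness among its referrers.
    in_degree = {p: 0 for p in pages}
    for targets in out_links.values():
        for t in targets:
            in_degree[t] += 1

    # Blacklisted pages start with badness 1, everything else with 0.
    seed = {p: 1.0 if p in blacklist else 0.0 for p in pages}
    br = dict(seed)

    for _ in range(iterations):
        new = {}
        for p in pages:
            # A page inherits badness from the pages it links TO.
            inherited = sum(br[t] / in_degree[t] for t in out_links.get(p, []))
            new[p] = (1 - damping) * seed[p] + damping * inherited
        br = new
    return br


# Toy graph: "farm" links to a blacklisted page, "clean" does not.
graph = {"farm": ["spam"], "clean": ["other"]}
scores = bad_rank(graph, blacklist={"spam"})
```

In this toy graph "farm" picks up a positive score because it links to a blacklisted page, while "clean" stays at zero. A ranking function could then discount pages in proportion to that score.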

Representation
- Each page is represented by 89 features plus a tf-idf vector
- Three-block approach
  - Content based: term frequency, inverse document frequency
  - Features based on each page and aggregated
  - Features based collectively
- Labeled samples created
  - Ham: Dmoz
  - SPAM: manually identified
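The combined representation can be sketched as a tf-idf vector joined with a handful of link-based features. The slides do not list the 89 features, so the three link features below (`n_out`, `n_in`, `avg_pred_out`) are illustrative placeholders, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def link_features(page, out_links, in_links):
    """A few placeholder link features for one page (not the original 89)."""
    succ = out_links.get(page, [])
    pred = in_links.get(page, [])
    return {
        "n_out": float(len(succ)),
        "n_in": float(len(pred)),
        # Aggregated over neighbours: mean out-degree of the linking pages.
        "avg_pred_out": (
            sum(len(out_links.get(p, [])) for p in pred) / len(pred)
            if pred else 0.0
        ),
    }


def represent(page, tfidf, out_links, in_links):
    """Join text and link features into a single feature dict."""
    features = {"tfidf:" + t: w for t, w in tfidf.items()}
    features.update(link_features(page, out_links, in_links))
    return features


docs = [["cheap", "pills", "cheap", "links"], ["open", "directory", "links"]]
vecs = tfidf_vectors(docs)
out_links = {"a": ["b"], "c": ["b", "a"]}
in_links = {"b": ["a", "c"], "a": ["c"]}
rep = represent("b", vecs[0], out_links, in_links)
```

Terms that occur in every document (here `links`) get weight zero, so only distinguishing terms contribute to the text part of the vector.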

Experimental Results
- tf-idf is the most discriminative feature
- Using the combined representation is always better than using only the link-based features

Robustness
- An adversary obfuscates an increasing number of attributes
- A purely text-based classifier is immediately useless
- The combined classifier deteriorates more slowly
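The robustness argument can be illustrated with a toy linear scorer in which the adversary zeroes out the attributes that contribute most to the spam score. This is only a sketch: the weights, feature names and values are invented for illustration and are not taken from the experiments.

```python
def score(features, weights):
    """Linear spam score: higher means more spam-like."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def obfuscate(features, weights, k):
    """Copy of `features` with the k most incriminating attributes zeroed."""
    ranked = sorted(
        features,
        key=lambda name: weights.get(name, 0.0) * features[name],
        reverse=True,
    )
    hidden = set(ranked[:k])
    return {name: (0.0 if name in hidden else v) for name, v in features.items()}


# Toy numbers (assumptions): one text attribute and two link attributes.
text_weights = {"tfidf:casino": 2.0}
combined_weights = {
    "tfidf:casino": 2.0,
    "guestbook_in_links": 1.0,
    "link_exchange_ratio": 1.0,
}
text_page = {"tfidf:casino": 1.0}
full_page = {
    "tfidf:casino": 1.0,
    "guestbook_in_links": 1.0,
    "link_exchange_ratio": 1.0,
}

text_after_one = score(obfuscate(text_page, text_weights, 1), text_weights)
combined_after_one = score(obfuscate(full_page, combined_weights, 1), combined_weights)
combined_after_two = score(obfuscate(full_page, combined_weights, 2), combined_weights)
```

With these toy values, hiding a single attribute drives the text-only score to zero, whereas the combined score drops step by step as more attributes are hidden.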

Open Problems
- Collective classification
- Game theory
- "Google bombing"
- Dealing with a large dataset
- Deciding validity of references
- Click spam
- Stateless protocol provides no info on client

Conclusion
- Classify instances of SPAM
- Modify page rank
- A purely text-based classifier is easy to break
- Need to consider a variety of features