Improving Web Spam Classification using Rank-time Features September 25, 2008 TaeSeob Yun KAIST DATABASE & MULTIMEDIA LAB.

Contents
 Introduction
 Support Vector Machine
 Dataset
 Domain Separation
 Rank-time Features
 Evaluation
 Summary

Introduction
 World Wide Web (WWW)
   Definition
     An information space in which the items of interest, referred to as resources, are identified by global identifiers [IAN04]
   Description
     The Web holds too much information for users to sift through alone
     Web search engines are needed

Introduction
 Web Search Engine
   Definition
     A search engine designed to search for information on the World Wide Web [WIK08]
   Description
     Retrieves pages relevant to a user's query
     Ranking has become important
     Web spam interferes with web search engines

Web Spam (1/2)
 Definition
   A page that uses deceptive methods to improve its ranking [KRY07]
 Objective
   Mislead the ranking algorithms of web search engines
   Make profit by increasing the page's traffic
 Why web spam should be removed
   Users waste too much time searching for information
   Ranking on search engines is critical for making profit
   Spam consumes search engine resources

Web Spam (2/2)
 Types of web spam
   Link stuffing
   Keyword stuffing
   Cloaking
   Web farming
 When to remove web spam
   Crawl-time
   Index-time
   Rank-time
 How to remove web spam
   By training a machine learner – a Support Vector Machine (SVM)

Support Vector Machine (1/2)
 Definition
   A set of related supervised learning methods used for classification and regression [WIK08]
 Description
   Finds the separating hyperplane with maximal margin in the vector space
 (Figure: a hyperplane separating vectors v1 and v2 in n dimensions)

Support Vector Machine (2/2)
 Procedure
   Collect datasets
   Split the data into training datasets and a test dataset
   Train the machine with the training datasets
   Test the machine with the test dataset
 Problem
   The datasets must be collected first
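The train/test procedure above can be sketched with a toy linear SVM trained by stochastic sub-gradient descent on the hinge loss (Pegasos-style). The data and hyperparameters here are purely illustrative, not the paper's:

```python
def train_linear_svm(data, lam=0.01, epochs=200):
    """Train a linear SVM with hinge loss via stochastic sub-gradient
    descent (Pegasos-style). `data` is a list of (x, y) pairs with
    x a feature vector and y in {-1, +1}."""
    dim = len(data[0][0])
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)  # decaying learning rate
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # Shrink weights (regularization), then correct if the
            # margin constraint y * (w . x) >= 1 is violated.
            w = [(1 - eta * lam) * wi for wi in w]
            if margin < 1:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# Toy "spam" (+1) vs "non-spam" (-1) feature vectors, linearly separable
train_set = [([2.0, 2.0], 1), ([3.0, 1.0], 1), ([2.5, 3.0], 1),
             ([-2.0, -2.0], -1), ([-1.0, -3.0], -1), ([-3.0, -2.5], -1)]
w = train_linear_svm(train_set)
print([predict(w, x) for x, _ in train_set])
```

A real run would split the labeled URLs into training and test datasets first, as the slide says, and score the held-out set only.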

Dataset
 Definition
   A set of labeled sample data for training and testing
 Collection procedure
   Collect common query lists from the MSN Live search engine
   Have human judges label each top-10 result as spam, non-spam, or unknown
   Split the data into training datasets and a test dataset
 How the data is split matters
   Very important!
   This paper chooses domain separation

Domain Separation (1/6)
 Definition
   A split method that groups URLs by domain
 Procedure (in this paper)
   For each URL in the dataset:
     Compute a hash value of its domain
     If the hash value is new, assign it randomly to one of 5 files
     If the hash value has been seen before, put the URL into its assigned file
   Adjust the 5 files to similar sizes
 Why choose domain separation?
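A minimal sketch of the fold-assignment step, using a hypothetical `domain_fold` helper (not the paper's code). The paper assigns a newly seen domain hash to a file at random; this sketch derives the fold deterministically from the hash for reproducibility, which preserves the key property that every URL from the same domain lands in the same fold:

```python
import hashlib
from urllib.parse import urlparse

def domain_fold(url, assignment, n_folds=5):
    """Assign a URL to a fold based on its domain so that all URLs
    from one domain end up in the same fold."""
    domain = urlparse(url).netloc
    h = hashlib.md5(domain.encode()).hexdigest()
    if h not in assignment:
        # First time this domain is seen: pick a fold.
        # (The paper picks randomly; a deterministic hash works too.)
        assignment[h] = int(h, 16) % n_folds
    return assignment[h]

assignment = {}
urls = ["http://spam.example.com/a", "http://spam.example.com/b",
        "http://news.example.org/x"]
folds = [domain_fold(u, assignment) for u in urls]
print(folds)
```

Balancing the five files to similar sizes, as the slide's last step requires, would be done after this pass by moving whole domains (never individual URLs) between folds.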

Domain Separation (2/6)
 Domain-separated vs. randomly separated
 Opinion
   Domain-separated datasets are better
   Results trained on randomly separated datasets are WRONG!
   This is a general problem in machine learning, not specific to spam classification
 Reason
   If the dataset contains subsets that share features, a random split lets those features leak between training and test data
   In fact, some spammers buy a domain just to host spam pages, so it is common that every page on that domain is labeled spam
 How are domain-separated datasets made?

Domain Separation (3/6)
 Five-fold cross validation
 Definition
   The method used in this paper to train and test the SVM
 Procedure
   Choose one of the five domain-separated datasets as the test set
   Use the other four domain-separated datasets as training sets
   Train the SVM with the 4 training datasets
   Test the SVM with the test set
   Repeat for every combination of sets
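The five-fold loop can be sketched as below; `train_fn` and `test_fn` are placeholders standing in for the SVM training and testing steps:

```python
def five_fold_cv(folds, train_fn, test_fn):
    """Run five-fold cross validation over domain-separated folds.
    Each fold serves as the test set exactly once; the other four
    folds are concatenated into the training set."""
    scores = []
    for i in range(len(folds)):
        test_set = folds[i]
        train_set = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train_fn(train_set)
        scores.append(test_fn(model, test_set))
    return scores

# Dummy learner for illustration: the "model" is just the training-set
# size, and the "score" is model + test-set size.
folds = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
scores = five_fold_cv(folds,
                      train_fn=lambda tr: len(tr),
                      test_fn=lambda m, te: m + len(te))
print(scores)
```

Because the folds are domain-separated, no domain ever appears in both the training and test side of any iteration.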

Domain Separation (4/6)
 Result of domain separation
   31,300 URLs in total
   3,133 URLs labeled spam (9.99%)
 Problem
   A classifier that effectively learns feature vector → domain hash → label may appear wildly, and incorrectly, optimistic
   Left as future work

Domain Separation (5/6)
 Description
   No duplicated domains
   The sample consists of 25% spam
   Domain information could not be used
   Worst-case graph

Domain Separation (6/6)
 Description
   Additional features added
   The sample consists of 10% spam
   More difficult to detect than the 25%-spam sample
 Result
   Still slightly lower than random separation, but this is the worst case
   Note: domain information still could not be used

FEAT A (1/2)
 Description
   Rank-independent features
 FEAT A includes
   Domain-level features
   Page-level features
   Link information

FEAT A (2/2)
 Description
   Average precision of 60% at 10.8% recall
   On a sample consisting of 10% spam
   Not good enough
   Rank-time features will be added!
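Precision at a given recall level, as reported above, can be computed from a result list ranked by classifier score; the ranking below is made up for illustration, not the paper's data:

```python
def precision_at_recall(ranked_labels, target_recall):
    """Given labels ordered by descending classifier score
    (1 = spam, 0 = non-spam), return precision at the first rank
    where recall reaches `target_recall`."""
    total_spam = sum(ranked_labels)
    tp = 0
    for k, label in enumerate(ranked_labels, start=1):
        tp += label
        if total_spam and tp / total_spam >= target_recall:
            return tp / k  # precision at this cut-off
    return 0.0

# Hypothetical ranking of 10 results, 4 of which are spam
ranked = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
print(precision_at_recall(ranked, 0.5))
```

Reporting precision at a fixed recall (or vice versa) is what makes numbers like "60% precision at 10.8% recall" comparable across feature sets.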

Rank-time Features
 Definition
   Features available at rank time
 Motivation
   Every page has a feature vector
   The feature vectors of spam and non-spam pages are shaped differently
   Spammers cannot guess the distribution of non-spam feature vectors
 Consist of
   Query-independent features (FEAT B)
   Query-dependent features (FEAT Q)

FEAT B
 Definition
   Query-independent, rank-time features
 Description
   Page-level features
   Domain-level features
   Popularity features
   Time features

FEAT Q
 Definition
   Query-dependent, rank-time features
 Description
   Depend on the match between the query and document properties
   Examined for each returned result
 Future work
   Spam is labeled on the URL alone, not on the relevance of a URL to a query

Evaluation
 Micro-averaged over the five tests
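Micro-averaging pools the raw counts across the five test folds before dividing, rather than averaging per-fold precision and recall scores. A sketch with hypothetical per-fold counts:

```python
def micro_average(fold_counts):
    """Micro-average precision and recall by summing true positives,
    false positives, and false negatives over all folds first.
    `fold_counts` is a list of (tp, fp, fn) tuples, one per fold."""
    tp = sum(c[0] for c in fold_counts)
    fp = sum(c[1] for c in fold_counts)
    fn = sum(c[2] for c in fold_counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical (tp, fp, fn) counts for the five cross-validation folds
counts = [(8, 2, 4), (7, 3, 5), (9, 1, 3), (6, 4, 6), (10, 0, 2)]
p, r = micro_average(counts)
print(round(p, 3), round(r, 3))
```

Unlike macro-averaging, this weights every URL equally, so folds with more spam contribute proportionally more to the final figure.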

Summary
 Classifying web spam is an important problem
 Web spam can be classified by training an SVM
 Building the training datasets as domain-separated datasets is very important
 Rank-time features improve classification performance by as much as 25% in recall at a set precision

References
 [KRY07] Krysta M. Svore, Qiang Wu, Chris J.C. Burges, Aaswath Raman, "Improving Web Spam Classification using Rank-time Features", AIRWeb '07, May 8, 2007
 [IAN04] Ian Jacobs, "Architecture of the World Wide Web, Volume One", W3C Recommendation, Dec 15, 2004
 [WIK08] Wikipedia, "Web Search Engine" and "Support Vector Machine", retrieved Sep 25, 2008

[Appendix A] Receiver Operating Characteristic