Improving Web Spam Classification using Rank-time Features September 25, 2008 TaeSeob,Yun KAIST DATABASE & MULTIMEDIA LAB.


Contents
- Introduction
- Support Vector Machine
- Data Set
- Domain Separation
- Rank-time features
- Evaluation
- Summary

Introduction
- World Wide Web (WWW)
  - Definition
    - An information space in which the items of interest, referred to as resources, are identified by global identifiers [IAN04]
  - Description
    - Too much information
    - Needs web search engines

Introduction
- Web Search Engine
  - Definition
    - A search engine designed to search for information on the World Wide Web [WIK08]
  - Description
    - Retrieves pages relevant to the user's query
    - Ranking has become important
    - Web spam interferes with web search engines

Web Spam (1/2)
- Definition
  - A page that uses illegitimate methods to improve its ranking [KRY07]
- Objective
  - Mislead the search engine's ranking algorithm
  - Make a profit by increasing the page's traffic
- Why we should remove web spam
  - Users spend too much time searching for information
  - Ranking on search engines is critical for making a profit
  - Removing spam reduces the resources a search engine has to spend

Web Spam (2/2)
- Types of web spam
  - Link stuffing
  - Keyword stuffing
  - Cloaking
  - Web farming
- When to remove web spam
  - Crawl time
  - Index time
  - Rank time
- How to remove web spam
  - By training a machine: Support Vector Machine (SVM)

Support Vector Machine (1/2)
- Definition
  - A set of related supervised learning methods used for classification and regression [WIK08]
- Description
  - Finds the separating hyperplane with the maximal margin in the vector space
[Figure: points in an n-dimensional vector space, axes v1 and v2]
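
Note (textbook form, not taken from the slides): for training vectors x_i with labels y_i in {-1, +1}, the maximal-margin hyperplane is usually written as the optimization problem below. The symbols w, b, x_i, y_i and m are assumptions introduced here for illustration.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \dots, m
```

The margin between the two classes has width 2 / ||w||, so minimizing ||w|| maximizes the margin.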

Support Vector Machine (2/2)
- Procedure
  - Collect datasets
  - Split the data into training datasets and a test dataset
  - Train the machine with the training datasets
  - Test the machine with the test dataset
- Problem
  - We need to collect datasets
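
Note (not from the slides): the train-then-test procedure can be sketched with a tiny linear SVM. The trainer below is a stand-in (plain batch subgradient descent on the regularized hinge loss), not the SVM package used in the paper, and the two-dimensional, already-centered feature vectors are made up for illustration. Labels are assumed to be +1 for spam and -1 for non-spam.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.1, lam=0.01):
    """Batch subgradient descent on the L2-regularized hinge loss (stand-in trainer)."""
    dim = len(samples[0])
    n = len(samples)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        hinge_w = [0.0] * dim  # sum of y * x over margin violators
        hinge_b = 0.0          # sum of y over margin violators
        for x, y in zip(samples, labels):
            score = sum(wj * xj for wj, xj in zip(w, x)) + b
            if y * score < 1:  # inside the margin or misclassified
                hinge_w = [hj + y * xj for hj, xj in zip(hinge_w, x)]
                hinge_b += y
        w = [wj - lr * (lam * wj - hj / n) for wj, hj in zip(w, hinge_w)]
        b += lr * hinge_b / n
    return w, b


def predict(w, b, x):
    """Return +1 (spam) or -1 (non-spam) for one feature vector."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


def accuracy(w, b, samples, labels):
    """Fraction of samples whose predicted label matches the given label."""
    hits = sum(1 for x, y in zip(samples, labels) if predict(w, b, x) == y)
    return hits / len(samples)


# Hypothetical toy data: training set and a separate test set.
train_x = [[2, 2], [3, 2], [2, 3], [-2, -2], [-3, -2], [-2, -3]]
train_y = [1, 1, 1, -1, -1, -1]
test_x = [[4, 3], [-3, -4]]
test_y = [1, -1]

w, b = train_linear_svm(train_x, train_y)
print("test accuracy:", accuracy(w, b, test_x, test_y))
```

In practice an established SVM library would replace train_linear_svm; the point here is only the order of the steps: split, train on one part, test on the other.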

Dataset
- Definition
  - A set of labeled sample data for training and testing
- Collecting procedure
  - Collect lists of common queries from the MSN Live search engine
  - Have human judges label each of the top-10 results as spam, non-spam, or unknown
  - Split the dataset into training datasets and a test dataset
- Method for splitting the dataset
  - Very important!
  - We choose domain separation

Domain Separation (1/6)
- Definition
  - A method that splits the dataset according to domain
- Procedure (in this paper)
  - For each URL in the dataset
    - Compute a hash value from its domain
    - If the hash value is new, assign it randomly to one of 5 files
    - If the hash value has been seen before, put the URL into the file already assigned
  - Adjust the 5 files to similar sizes
- Why should we choose domain separation?
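
Note (not from the slides): a minimal sketch of the procedure above. It assumes the URL's host name stands in for the "domain" and uses MD5 as the hash; both are illustrative choices, as are the function names. The final size-adjustment step is left out because the slides do not say how it is done.

```python
import hashlib
import random
from urllib.parse import urlsplit


def domain_key(url):
    """Hash of the URL's host name (assumption: host name stands in for the domain)."""
    host = urlsplit(url).hostname or ""
    return hashlib.md5(host.lower().encode("utf-8")).hexdigest()


def domain_separate(urls, n_folds=5, seed=0):
    """Assign every URL to one of n_folds lists so that a domain never spans two lists."""
    rng = random.Random(seed)
    assigned = {}  # domain hash -> fold index
    folds = [[] for _ in range(n_folds)]
    for url in urls:
        key = domain_key(url)
        if key not in assigned:
            # First time this domain is seen: pick a fold at random.
            assigned[key] = rng.randrange(n_folds)
        # Domain seen before (or just assigned): reuse its fold.
        folds[assigned[key]].append(url)
    return folds


# Hypothetical URLs for illustration.
example_urls = [
    "http://a.example/page1",
    "http://a.example/page2",
    "http://b.example/x",
    "http://c.example/",
    "http://b.example/y",
]
for index, fold in enumerate(domain_separate(example_urls)):
    print(index, fold)
```

Because the fold is chosen per domain rather than per URL, all pages of one domain end up on the same side of any train/test split.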

Domain Separation (2/6)
- Domain-separated vs. randomly separated
- Opinion
  - Domain-separated datasets are better
  - Results obtained by training on a randomly separated dataset are WRONG!
  - This is a general classification problem in machine learning
- Reason
  - If the dataset contains subsets with their own characteristic features, we should take those features into account
  - In fact, some spammers buy a domain just to host spam pages, so it is common for every page in that domain to be labeled spam
- How do we make domain-separated datasets?

Domain Separation (3/6)
- Five-fold cross validation
  - Definition
    - The method used in this paper to train and test the SVM
  - Procedure
    - Choose one of the five domain-separated datasets as the test set
    - Use the other four domain-separated datasets as training datasets
    - Train the SVM with the 4 training datasets
    - Test the SVM with the test set
    - Repeat the steps above for every choice of test set
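
Note (not from the slides): the rotation of test and training sets can be sketched as a small generator. The function name and the placeholder items are assumptions for illustration; the folds would be the five domain-separated datasets.

```python
def five_fold_rounds(folds):
    """Yield (training_items, test_items), using each fold once as the test set."""
    for held_out in range(len(folds)):
        test_items = list(folds[held_out])
        train_items = [
            item
            for index, fold in enumerate(folds)
            if index != held_out
            for item in fold
        ]
        yield train_items, test_items


# Placeholder items standing in for labeled URLs.
example_folds = [["u1"], ["u2", "u3"], ["u4"], ["u5"], ["u6"]]
for train_items, test_items in five_fold_rounds(example_folds):
    print(len(train_items), "training items,", len(test_items), "test items")
```

Each round would train the SVM on train_items and test it on test_items, giving five test results in total.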

Domain Separation (4/6)
- The result of domain separation
  - Total of 31,300 URLs
  - 3,133 URLs labeled spam (about 10%)
- Problem
  - Learning a mapping from feature vector to subset (domain hash) to label may produce wildly and incorrectly optimistic results
  - Left as future work

Domain Separation (5/6)
- Description
  - No duplicated domains
  - Consists of 25% spam
  - Could not use domain information
  - Worst-case graph

Domain Separation (6/6)
- Description
  - Adds additional features
  - Consists of 10% spam
  - Harder to detect than with 25% spam
- Result
  - Still a little lower than with random separation, but this is the worst case
  - Note: still could not use domain information

FEAT A (1/2)
- Description
  - Rank-independent features
- FEAT A includes
  - Domain-level features
  - Page-level features
  - Link information

FEAT A (2/2)
- Description
  - Average precision 60% at 10.8% recall
  - Consists of 10% spam
  - Not so good
  - We will add rank-time features!
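
Note (not from the slides): a figure of the form "precision at a given recall" is commonly read off by ranking pages by classifier score and cutting the list where the target recall is first reached. The sketch below shows that reading; the slides do not say exactly how the number was computed, and the scores and labels are made up. Tied scores are not treated specially.

```python
def precision_at_recall(scores, labels, target_recall):
    """Precision at the first score cutoff whose recall reaches target_recall.

    labels: 1 = spam, 0 = non-spam; a higher score means "more likely spam".
    """
    total_spam = sum(labels)
    if total_spam == 0:
        raise ValueError("no spam examples in labels")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    for cutoff, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / total_spam >= target_recall:
            return true_pos / cutoff
    return 0.0  # only reached if target_recall is above 1


# Hypothetical classifier scores and labels.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
example_labels = [1, 0, 1, 0, 1]
print(precision_at_recall(example_scores, example_labels, 0.5))
```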

Rank-time Features
- Definition
  - Features used at rank time
- Motivation
  - Every page has a feature vector
  - The feature vectors of spam and non-spam pages are distributed differently
  - Spammers cannot guess the distribution of non-spam feature vectors
- Consist of
  - Query-independent features (FEAT B)
  - Query-dependent features (FEAT Q)

FEAT B
- Definition
  - Query-independent, rank-time features
- Description
  - Page-level features
  - Domain-level features
  - Popularity features
  - Time features

FEAT Q
- Definition
  - Query-dependent, rank-time features
- Description
  - Depend on the match between the query and document properties
  - Examined for each returned result
- Future work
  - Spam labels apply to the URL only, not to the relevance of a URL to a query

Evaluation
- Micro-averaged over the five tests
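
Note (not from the slides): micro-averaging pools the raw counts from the five test folds before computing precision and recall, instead of averaging five per-fold ratios. A minimal sketch, with made-up counts and a hypothetical function name:

```python
def micro_average(fold_counts):
    """Pool counts over folds, then compute precision and recall once.

    fold_counts: list of (true_positives, false_positives, false_negatives),
    one tuple per test fold, with spam as the positive class.
    """
    true_pos = sum(counts[0] for counts in fold_counts)
    false_pos = sum(counts[1] for counts in fold_counts)
    false_neg = sum(counts[2] for counts in fold_counts)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical counts for five test folds.
example_counts = [(8, 2, 12), (6, 4, 14), (10, 0, 10), (7, 3, 13), (9, 1, 11)]
precision, recall = micro_average(example_counts)
print(precision, recall)
```

With pooled counts, larger folds weigh more than smaller ones, which matters when the five files are only roughly equal in size.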

Summary
- Classification of web spam is an important problem
- We can classify web spam by training an SVM
- Building the training datasets as domain-separated datasets is very important
- Rank-time features improve classification performance by as much as 25% in recall at a set precision

References
- [KRY07] Svore, K. M., Wu, Q., Burges, C. J. C., Raman, A., "Improving Web Spam Classification using Rank-time Features", AIRWeb '07, May 8, 2007
- [IAN04] Jacobs, I., Walsh, N. (eds.), "Architecture of the World Wide Web, Volume One", W3C Recommendation, Dec 15, 2004
- [WIK08] "Web Search Engine", "Support Vector Machine", http://wikipedia.org, Sep 25, 2008

[Appendix A] Receiver Operating Characteristic
