Improving Web Spam Classification using Rank-time Features
September 25, 2008
TaeSeob, Yun
KAIST DATABASE & MULTIMEDIA LAB
Contents
Introduction, Support Vector Machine, Data Set, Domain Separation, Rank-time Features, Evaluation, Summary
Introduction: World Wide Web (WWW)
Definition: An information space in which the items of interest, referred to as resources, are identified by global identifiers [IAN04].
Description: The Web holds too much information to browse directly, so Web search engines are needed.
Introduction: Web Search Engine
Definition: A search engine designed to search for information on the World Wide Web [WIK08].
Description: A Web search engine retrieves pages relevant to a user's query. Ranking has become important, and Web spam interferes with Web search engines.
Web Spam (1/2)
Definition: A page that uses manipulative methods to improve its ranking [KRY07].
Objective: Mislead the ranking algorithms of Web search engines and make a profit by increasing a page's traffic.
Why Web spam should be removed: Users spend too much time searching for information. Ranking on search engines is critical for making a profit. Spam consumes search engine resources that removal would free.
Web Spam (2/2)
Types of Web spam: link stuffing, keyword stuffing, cloaking, web farming.
When to remove Web spam: at crawl time, at index time, or at rank time.
How to remove Web spam: by training a machine-learning classifier, here a Support Vector Machine (SVM).
Support Vector Machine (1/2)
Definition: A set of related supervised learning methods used for classification and regression [WIK08].
Description: An SVM finds the separating hyperplane with the maximal margin in the feature vector space.
[Figure: feature vectors in an n-dimensional space, axes v1 and v2]
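To make "maximal margin" concrete, here is a minimal sketch that measures the margin of a given hyperplane w.x + b = 0 on labeled points. The function name, the two sample points, and the two candidate hyperplanes are illustrative assumptions, not taken from the paper. An SVM searches for the hyperplane that maximizes this quantity; the sketch only compares two hand-picked candidates.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any labeled point to the hyperplane w.x + b = 0.

    Labels are +1 or -1. A positive result means every point is on its correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical toy data: one negative and one positive example on the v1 axis.
points = [(0.0, 0.0), (2.0, 0.0)]
labels = [-1, +1]

# Two candidate separating hyperplanes (both assumptions for illustration).
centered = geometric_margin((1.0, 0.0), -1.0, points, labels)    # v1 = 1.0
off_center = geometric_margin((1.0, 0.0), -0.5, points, labels)  # v1 = 0.5
```

Both hyperplanes separate the two points, but the centered one leaves a larger margin, so it is the one a maximal-margin criterion would prefer.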
Support Vector Machine (2/2)
Procedure:
1. Collect datasets.
2. Split the datasets into training datasets and a test dataset.
3. Train the machine on the training datasets.
4. Test the machine on the test dataset.
Problem: We need to collect datasets.
Dataset
Definition: A set of labeled sample data for training and testing.
Collection procedure:
1. Collect common query lists from the MSN Live search engine.
2. Have human judges label each of the top-10 results as spam, non-spam, or unknown.
3. Split the dataset into training datasets and a test dataset.
How the dataset is split is very important. We choose domain separation.
Domain Separation (1/6)
Definition: A method that splits a dataset according to the domains of its URLs.
Procedure (in this paper), for each URL in the dataset:
1. Calculate a hash value from the domain.
2. If the hash value is new, assign it randomly to one of 5 files.
3. If the hash value has been seen before, put the URL into the file already assigned to it.
4. Finally, adjust the 5 files to similar sizes.
Why should we choose domain separation?
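The procedure above can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation: the hostname is used as a stand-in for "domain", MD5 is an arbitrary choice of hash, the function names and the example.com style URLs are hypothetical, and the final size-balancing step is not shown.

```python
import hashlib
import random
from urllib.parse import urlparse

def domain_hash(url):
    """Hash the URL's hostname (assumption: hostname stands in for the domain)."""
    host = urlparse(url).hostname or ""
    return hashlib.md5(host.encode("utf-8")).hexdigest()

def assign_files(urls, n_files=5, seed=0):
    """Place every URL of the same domain into the same one of n_files lists."""
    rng = random.Random(seed)          # seeded so the sketch is repeatable
    file_of_hash = {}                  # domain hash -> index of its assigned file
    files = [[] for _ in range(n_files)]
    for url in urls:
        h = domain_hash(url)
        if h not in file_of_hash:      # new hash: pick a file at random
            file_of_hash[h] = rng.randrange(n_files)
        files[file_of_hash[h]].append(url)  # seen hash: reuse the assigned file
    return files

# Hypothetical URLs for illustration only.
urls = [
    "http://spam.example.com/a",
    "http://news.example.org/1",
    "http://spam.example.com/b",
    "http://shop.example.net/x",
    "http://news.example.org/2",
]
files = assign_files(urls)
```

Because the file is chosen per domain hash rather than per URL, no domain ever appears in more than one file.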
Domain Separation (2/6): Domain separated vs. randomly separated
Opinion: Domain-separated datasets are better. Results obtained by training on a randomly separated dataset are wrong. This is a general classification issue in machine learning.
Reason: If the dataset contains subsets that have features of their own, we should make use of those features. In fact, some spammers buy a domain in order to host spam pages, so it is common for all pages under that domain to be labeled spam.
How do we make domain-separated datasets?
Domain Separation (3/6): Five-fold cross validation
Definition: The method used in this paper to train and test the SVM.
Procedure:
1. Choose one of the five domain-separated datasets as the test set.
2. Use the other four domain-separated datasets as training datasets.
3. Train the SVM on the 4 training datasets.
4. Test the SVM on the test set.
5. Repeat the steps above for every choice of test set.
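The cross-validation loop above can be sketched as a generator that yields each train/test combination. The function name and the placeholder fold contents are assumptions for illustration; training and testing the SVM itself are left out.

```python
def five_fold_splits(folds):
    """Yield (train_set, test_set) once for each fold used as the test set."""
    for i, test_set in enumerate(folds):
        # Training data is every item from every fold except fold i.
        train_set = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train_set, test_set

# Hypothetical domain-separated folds (placeholders for labeled URLs).
folds = [["a1", "a2"], ["b1"], ["c1", "c2"], ["d1"], ["e1"]]
splits = list(five_fold_splits(folds))
```

Each item is used for testing exactly once and for training in the other four rounds.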
Domain Separation (4/6): Result of domain separation
Total: 31,300 URLs, of which 3,133 are labeled spam (about 10%).
Problem: A classifier that learns a mapping from feature vector to subset hash to label may produce wildly and incorrectly optimistic results. This is left as future work.
Domain Separation (5/6)
Description: No duplicated domains. The dataset consists of 25% spam. Domain information could not be used.
[Figure: worst-case graph]
Domain Separation (6/6)
Description: An additional feature is added. The dataset consists of 10% spam, which is more difficult to detect than 25% spam.
Result: Still slightly lower than with random separation, but this is the worst case.
Note: Domain information still could not be used.
FEAT A (1/2)
Description: Rank-independent features.
FEAT A includes: domain-level features, page-level features, link information.
FEAT A (2/2)
Description: Average precision of 60% at 10.8% recall on a dataset consisting of 10% spam. This is not good enough, so we will add rank-time features.
Rank-time Features
Definition: Features used at rank time.
Motivation: Every page has a feature vector. The feature vectors of spam and non-spam pages have different shapes. Spammers cannot guess the distribution of non-spam feature vectors.
Consist of: query-independent features (FEAT B) and query-dependent features (FEAT Q).
FEAT B
Definition: Query-independent, rank-time features.
Description: page-level features, domain-level features, popularity features, time features.
FEAT Q
Definition: Query-dependent, rank-time features.
Description: These features depend on the match between the query and document properties, and are examined for each returned result.
Future work: Spam labels are assigned to the URL only, not to the relevance of a URL to a query.
Evaluation
Results are micro-averaged over the five tests.
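As a hedged illustration of micro-averaging, one common reading is to pool the raw counts from all five tests before computing precision and recall, rather than averaging five per-test scores. The function name, the dictionary keys, and the counts below are made up for this sketch and are not results from the paper.

```python
def micro_average(fold_counts):
    """Pool true positives, false positives, and false negatives, then score once."""
    tp = sum(c["tp"] for c in fold_counts)
    fp = sum(c["fp"] for c in fold_counts)
    fn = sum(c["fn"] for c in fold_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative counts for five test folds (spam is the positive class).
fold_counts = [
    {"tp": 6, "fp": 2, "fn": 14},
    {"tp": 4, "fp": 1, "fn": 16},
    {"tp": 5, "fp": 3, "fn": 15},
    {"tp": 7, "fp": 2, "fn": 13},
    {"tp": 3, "fp": 2, "fn": 17},
]
precision, recall = micro_average(fold_counts)
```

With these made-up counts the pooled totals are 25 true positives, 10 false positives, and 75 false negatives, so larger folds weigh more than they would in a plain mean of per-fold scores.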
Summary
Classification of Web spam is an important problem.
We can classify Web spam by training an SVM.
Building the training datasets as domain-separated datasets is very important.
Rank-time features improve classification performance by as much as 25% in recall at a set precision.
References
[KRY07] Svore, K. M., Wu, Q., Burges, C. J. C., Raman, A., "Improving Web Spam Classification using Rank-time Features", AIRWeb '07, May 8, 2007
[IAN04] Jacobs, I., Walsh, N. (eds.), "Architecture of the World Wide Web, Volume One", W3C Recommendation, Dec 15, 2004
[WIK08] Wikipedia, "Web Search Engine", "Support Vector Machine", Sep 25, 2008
[Appendix A] Receiver Operating Characteristic