Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam. Steve Hookway, 11/17/05
Motivation: Black and blue, a competition. Identify SPAM pages and discount them in ranking. Which techniques work best, and will they last?
SPAM vs Ham: Spam: link farms, link exchange services, guestbooks. Ham: Dmoz.
BadRank: Google may make use of BadRank. Crawling and PageRank updating are interleaved; when PageRank is updated, BadRank (BR) and a blacklist are considered.
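A minimal sketch of how a BadRank-style score could be computed, assuming the commonly cited speculative form BR(A) = E(A)(1 - d) + d * sum(BR(T) / C(T)) over the pages T that A links to, where C(T) is the number of inbound links of T and E(A) is nonzero only for blacklisted pages. The slide only says Google "may" use such a score, so the formula, the damping factor, the toy graph and the function name `bad_rank` are all assumptions made for illustration.

```python
from collections import defaultdict

def bad_rank(out_links, blacklist, damping=0.85, iterations=50):
    """Propagate 'badness' backwards along links from blacklisted pages.

    out_links: dict mapping page -> list of pages it links to.
    blacklist: set of pages given the seed value E(page) = 1.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    # C(T): number of inbound links of each page T.
    in_degree = defaultdict(int)
    for targets in out_links.values():
        for t in targets:
            in_degree[t] += 1

    seed = {p: 1.0 if p in blacklist else 0.0 for p in pages}
    br = dict(seed)
    for _ in range(iterations):
        # Synchronous update: every page reads the previous iteration's values.
        br = {
            p: seed[p] * (1 - damping)
            + damping * sum(br[t] / in_degree[t] for t in out_links.get(p, []))
            for p in pages
        }
    return br

# Hypothetical five-page web: "promoter" links into a blacklisted link farm.
graph = {
    "farm": ["farm2"],
    "farm2": ["farm"],
    "promoter": ["farm", "news"],
    "news": ["blog"],
    "blog": ["news"],
}
scores = bad_rank(graph, blacklist={"farm"})
```

In this toy graph, pages that link toward the blacklisted page ("promoter", "farm2") pick up a positive score, while "news" and "blog" stay at zero. A ranking step could then discount PageRank for pages with a high score; how the two would actually be combined is not specified in the slides.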
Representation: Each page is represented by 89 features plus a tfidf vector. Three block approach: content based (term frequency, inverse document frequency); features based on each page and aggregated; features based collectively. Labeled samples created: Ham from Dmoz; SPAM manually identified.
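To make the combined representation concrete, the sketch below builds a tfidf vector per page and prepends a few link features. The weighting used here (raw term frequency times log(N / df)) is one common variant; the slides do not say which variant was used. The two link features (in-degree and out-degree) and their values are hypothetical stand-ins for the 89 features, which are not listed here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

def combined_representation(tfidf_vec, link_features):
    """Concatenate link-based features with the tfidf vector."""
    return list(link_features) + list(tfidf_vec)

pages = [
    "cheap pills cheap links",
    "open directory of science links",
    "science news and articles",
]
vocab, vecs = tfidf_vectors(pages)

# Hypothetical (in-degree, out-degree) pairs for the three pages.
link_feats = [(120, 300), (45, 12), (30, 8)]
rows = [combined_representation(v, f) for v, f in zip(vecs, link_feats)]
```

Each entry of `rows` is then a single feature vector that a classifier can be trained on, with the link features in the first positions and the text features after them.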
Experimental Results: tfidf is the most discriminative feature. Using the combined representation is always better than using only the link based features.
Robustness: An adversary obfuscates an increasing number of attributes. A purely text based classifier is immediately useless. The combined classifier deteriorates more slowly.
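The robustness setup can be pictured with a toy linear classifier: the adversary neutralizes the k attributes that contribute most to the spam score, and the score is recomputed as k grows. Everything in this sketch is invented for illustration, including the weights, the feature values, and the assumption that obfuscating an attribute means setting it to zero; it is not the classifier or the data from the experiments.

```python
def spam_score(weights, features):
    """Linear decision value; positive means 'spam'."""
    return sum(w * x for w, x in zip(weights, features))

def obfuscate(weights, features, k):
    """Zero out the k attributes that contribute most to the spam score."""
    contributions = [w * x for w, x in zip(weights, features)]
    order = sorted(range(len(features)), key=lambda i: contributions[i], reverse=True)
    hidden = set(order[:k])
    return [0.0 if i in hidden else x for i, x in enumerate(features)]

# Hypothetical weights: the first two attributes are text based, the rest link based.
weights = [2.0, 1.5, 0.8, 0.7, 0.6]
page = [1.0, 1.0, 1.0, 1.0, 1.0]

# Scores after the adversary hides k = 0, 1, 2 attributes.
text_only = [spam_score(weights[:2], obfuscate(weights[:2], page[:2], k)) for k in range(3)]
combined = [spam_score(weights, obfuscate(weights, page, k)) for k in range(3)]
```

With these made-up numbers the text-only score falls to zero once both text attributes are hidden, while the combined score is still positive because the link attributes remain.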
Open Problems: Collective classification; dealing with a large dataset; game theory; “Google Bombing”; deciding validity of references; click spam; stateless protocol provides no info on client.
Conclusion: Classify instances of SPAM. Modify page rank. A purely text-based classifier is easy to break. Need to consider a variety of features.