Published by Christina Copeland. Modified over 6 years ago.
1
Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam
Steve Hookway 11/17/05
2
Motivation
Black and blue: a competition
Identify SPAM pages and discount them in ranking
Which techniques work best, and will they last?
3
SPAM vs Ham
Spam: link farms, link exchange services, guestbooks
Ham: Dmoz
4
BadRank
Google may make use of BadRank:
Interleave crawling and PageRank updating
When updating PageRank, BadRank (BR) and the blacklist are considered
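The slide does not say how BadRank is computed. A minimal sketch, assuming the commonly circulated formulation in which "badness" flows backwards from blacklisted pages to the pages that link to them. The function name `bad_rank`, the damping value 0.85, and the toy graph are illustrative assumptions, not details from the talk.

```python
def bad_rank(out_links, blacklist, damping=0.85, iterations=50):
    """Propagate badness backwards along links (hypothetical helper).

    out_links: dict mapping each page to the list of pages it links to.
    blacklist: set of pages already known to be spam (seed badness 1.0).
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    # Number of inbound links of every page; a target's badness is
    # split evenly among the pages that link to it.
    in_count = {page: 0 for page in pages}
    for targets in out_links.values():
        for target in targets:
            in_count[target] += 1

    seed = {page: 1.0 if page in blacklist else 0.0 for page in pages}
    score = dict(seed)
    for _ in range(iterations):
        updated = {}
        for page in pages:
            inherited = sum(score[t] / in_count[t] for t in out_links.get(page, []))
            updated[page] = (1 - damping) * seed[page] + damping * inherited
        score = updated
    return score


# Toy graph (assumed for illustration): b -> a -> spam, c links nowhere.
toy_graph = {"a": ["spam"], "b": ["a"], "c": [], "spam": []}
toy_scores = bad_rank(toy_graph, blacklist={"spam"})
```

In this toy graph the page linking directly to the blacklisted page inherits more badness than the page two hops away, and a page with no path to spam stays at zero. A ranking function could then discount PageRank by this score; how the two are combined is left open here.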
5
Representation
Each page is represented by 89 features plus a tfidf vector
Three-block approach
Content based: term frequency, inverse document frequency
Features based on each page and aggregated
Features based collectively
Labeled samples created
Ham: Dmoz
SPAM: manually identified
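The content-based part of the representation can be illustrated with a small tfidf computation. This is a sketch under assumptions: the slides do not give the exact weighting, so the plain variant is used here (term frequency times the log of inverse document frequency), and `tfidf_vectors` and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document.

    documents: list of token lists, one per page.
    Weight = (count / document length) * log(N / document frequency).
    """
    n_docs = len(documents)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Toy pages (assumed for illustration only).
toy_pages = [["cheap", "pills", "cheap"], ["open", "directory", "pills"]]
toy_vectors = tfidf_vectors(toy_pages)
```

A term that occurs in every page gets weight zero, while a term concentrated in one page gets a positive weight. In the combined representation, such a vector would sit alongside the 89 other features.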
6
Experimental Results
tfidf is the most discriminative feature
Using the combined representation is always better than using only the link-based features
8
Robustness
Adversary obfuscates an increasing number of attributes
Purely text-based classifier is immediately useless
Combined classifier deteriorates more slowly
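The robustness experiment can be pictured as masking the first k attributes an adversary controls and checking how many spam pages are still flagged. This is a rough sketch, not the protocol used in the talk: `obfuscate`, `detection_rate`, the neutral value 0.0, the toy threshold classifier, and the toy feature vectors are all assumptions made for illustration.

```python
def obfuscate(features, attack_order, k, neutral=0.0):
    """Return a copy of `features` with the first k attacked attributes masked."""
    masked = list(features)
    for index in attack_order[:k]:
        masked[index] = neutral
    return masked


def detection_rate(classify, spam_pages, attack_order, k):
    """Fraction of spam pages still flagged after k attributes are obfuscated."""
    caught = sum(
        1 for page in spam_pages
        if classify(obfuscate(page, attack_order, k))
    )
    return caught / len(spam_pages)


def toy_classifier(features):
    """Stand-in classifier: flag a page when its feature sum reaches 2.0."""
    return sum(features) >= 2.0


# Toy spam feature vectors (assumed, not data from the study).
toy_spam = [[1.0, 1.0, 1.0], [1.0, 0.0, 1.0]]
rates = [detection_rate(toy_classifier, toy_spam, [0, 1, 2], k) for k in range(4)]
```

Plotting such rates against k for a text-only classifier and for a combined classifier would give the kind of comparison the slide describes.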
9
Open Problems
Collective classification
Dealing with a large dataset
Game theory
“Google Bombing”
Deciding validity of references
Click spam
Stateless protocol provides no info on client
10
Conclusion
Classify instances of SPAM
Modify PageRank
Purely text-based classifier is easy to break
Need to consider a variety of features