Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam

Steve Hookway 11/17/05

Motivation
  Black and blue: a competition
  Identify spam pages and discount them in ranking
  Which techniques work best, and will they last?

Spam vs Ham
  Spam
    Link farms
    Link exchange services
    Guestbooks
  Ham
    Dmoz

BadRank
  Google may make use of BadRank:
    Interleave crawling and page rank updating
    When updating page rank, BadRank (BR) and the blacklist are considered
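
The slide does not say how BR is computed. A minimal sketch, assuming the informal formulation often quoted in SEO write-ups, BR(A) = (1 - d) E(A) + d * sum of BR(T) / C(T) over the pages T that A links to, where E(A) is nonzero only for blacklisted pages and C(T) is the number of inbound links of T. This is not confirmed by Google or by the talk; the function name, the parameters and the damping value 0.85 are all illustrative.

```python
def bad_rank(out_links, seed_scores, damping=0.85, iterations=50):
    """Propagate 'badness' backwards along links (assumed formulation).

    out_links: page -> list of pages it links to.
    seed_scores: page -> E(page), nonzero only for blacklisted pages.
    """
    # Collect every page that appears as a source or a target.
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    # C(T): number of inbound links of each page.
    in_degree = {page: 0 for page in pages}
    for targets in out_links.values():
        for target in targets:
            in_degree[target] += 1

    scores = {page: seed_scores.get(page, 0.0) for page in pages}
    for _ in range(iterations):
        updated = {}
        for page in pages:
            # A page inherits badness from the pages it points to.
            inherited = sum(
                scores[target] / in_degree[target]
                for target in out_links.get(page, ())
            )
            updated[page] = (
                (1 - damping) * seed_scores.get(page, 0.0) + damping * inherited
            )
        scores = updated
    return scores
```

Under this assumption a page that links to a blacklisted page picks up a positive score, while pages with no path to the blacklist stay at zero.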

Representation
  Each page is represented by 89 features plus a tfidf vector
  Three block approach
    Content based: term frequency, inverse document frequency
    Features based on each page and aggregated
    Features based collectively
  Labeled samples created
    Ham: Dmoz
    Spam: manually identified
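
For the content-based block, a minimal sketch of a tfidf vector, assuming the textbook weighting tf = count / document length and idf = log(N / document frequency). The slides do not state which tfidf variant was used, and the function name is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        vector = {}
        for term, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            vector[term] = tf * idf
        vectors.append(vector)
    return vectors
```

A term that occurs in every document gets weight 0 under this weighting, so only terms that separate pages from one another contribute to the vector.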

Experimental Results
  tfidf is the most discriminative feature
  Using the combined representation is always better than using only the link based features

Robustness
  Adversary obfuscates an increasing number of attributes
  Purely text based classifier is immediately useless
  Combined classifier deteriorates more slowly
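
The robustness test can be sketched as a loop that hides one more attribute at each step and re-measures accuracy. This assumes that "obfuscating" an attribute means a spam page overwrites it with a ham-like value; the slides do not define the attack, and `predict`, `attack_order` and `disguise` are hypothetical names.

```python
def accuracy_under_obfuscation(predict, rows, labels, attack_order, disguise):
    """Accuracy after the adversary has overwritten 0, 1, 2, ... attributes.

    predict: callable mapping a feature dict to "spam" or "ham".
    attack_order: attribute names in the order the adversary hides them.
    disguise: attribute name -> value substituted on spam pages.
    """
    results = []
    for k in range(len(attack_order) + 1):
        hidden = attack_order[:k]
        correct = 0
        for row, label in zip(rows, labels):
            seen = dict(row)  # copy, so the original data is not modified
            if label == "spam":  # only spam pages are assumed to obfuscate
                for name in hidden:
                    seen[name] = disguise[name]
            correct += predict(seen) == label
        results.append(correct / len(rows))
    return results
```

Running this once with a text-only classifier and once with a combined one gives two accuracy curves that can be compared step by step.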

Open Problems
  Collective classification
    Dealing with a large dataset
  Game theory
  “Google Bombing”
    Deciding validity of references
  Click spam
    Stateless protocol provides no info on client

Conclusion
  Classify instances of spam
  Modify page rank
  Purely text-based classifier is easy to break
  Need to consider a variety of features
