Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam


1 Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam
Steve Hookway 11/17/05

2 Motivation
Black and blue: a competition
Identify SPAM pages and discount them in ranking
Which techniques work best, and will they last?

3 SPAM vs Ham
Spam: Link Farms, Link Exchange Services, Guestbooks
Ham: Dmoz

4 BadRank
Google may make use of BadRank (BR):
Interleave crawling and page rank updating
When updating page rank, BR and blacklist are considered
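The slide does not define how BR is computed. As a rough illustration only, the sketch below follows a commonly circulated description of BadRank as PageRank run in reverse: a page inherits badness from the pages it links to, seeded by a blacklist. The function name `bad_rank`, the damping value 0.85, and the toy graph are all assumptions made for this example, not details from the presentation or from Google.

```python
def bad_rank(out_links, blacklist, damping=0.85, iterations=50):
    """Propagate badness backwards along links (assumed formulation).

    out_links maps each page to the list of pages it links to.
    """
    pages = set(out_links) | {t for ts in out_links.values() for t in ts}

    # Number of inbound links per page, used to split a page's badness
    # among the pages that link to it.
    in_degree = {p: 0 for p in pages}
    for targets in out_links.values():
        for t in targets:
            in_degree[t] += 1

    # Seed term: 1 for blacklisted pages, 0 otherwise (an assumption).
    seed = {p: 1.0 if p in blacklist else 0.0 for p in pages}
    br = dict(seed)

    for _ in range(iterations):
        br = {
            p: (1 - damping) * seed[p]
            + damping * sum(br[t] / in_degree[t] for t in out_links.get(p, ()))
            for p in pages
        }
    return br


# Hypothetical toy graph: "farm" links to a blacklisted page, "blog" links to "farm".
graph = {"farm": ["spam"], "blog": ["farm", "news"], "news": []}
scores = bad_rank(graph, blacklist={"spam"})
```

In this toy graph the badness decays with link distance from the blacklisted page, so "farm" scores higher than "blog", and "news", which links to nothing bad, stays at zero.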

5 Representation
Each page represented by 89 features plus tfidf vector
Three block approach
  Content based: term frequency, inverse document frequency
  Features based on each page and aggregated
  Features based collectively
Labeled samples created
  Ham: Dmoz
  SPAM: Manually identified
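For readers unfamiliar with the tfidf part of the representation, here is a minimal sketch of one standard weighting: term frequency times the log of inverse document frequency. The slides do not say which tfidf variant was used, so the exact formula, the function name `tfidf_vectors`, and the two toy pages are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


# Hypothetical page texts, already tokenized.
pages = [
    "cheap links cheap pills".split(),
    "open directory of links".split(),
]
vectors = tfidf_vectors(pages)
```

A term that occurs on every page, such as "links" here, gets weight zero, while a term repeated on only one page, such as "cheap", gets the largest weight on that page.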

6 Experimental Results
tfidf is the most discriminative feature
Using the combined representation is always better than using only the link based features

8 Robustness
Adversary obfuscates an increasing number of attributes
Purely text based classifier is immediately useless
Combined classifier deteriorates more slowly

9 Open Problems
Collective Classification: dealing with a large dataset
Game Theory
“Google Bombing”: deciding validity of references
Click Spam: stateless protocol provides no info on client

10 Conclusion
Classify instances of SPAM
Modify page rank
Purely text-based classifier is easy to break
Need to consider a variety of features

