Published by Joshua Osborne. Modified over 9 years ago.
1
Blocking Blog Spam with Language Model Disagreement
Gilad Mishne (Amsterdam)
David Carmel (IBM Israel)
AIRWeb 2005
2
What is Blog Spam?
Bots post comments unrelated to the original blog post
Comments contain links to irrelevant sites
The links are used to fool Google into ranking the linked sites higher
3
Current Solutions
Require registration
Require solving a puzzle
Disallow HTML in comments
Disable comments on old posts
Filter by IP address
Limit the comment rate
4
Objective
Filter out blog spam
5
Approach
Compare the contents of the post with the contents of each comment
6
KL-Divergence Similarity
Use KL-divergence as a similarity score between a post and a comment
Lower score = higher similarity
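The score on this slide can be sketched in Python as follows. This is a minimal illustration, assuming unigram language models over a shared vocabulary with additive smoothing. The smoothing scheme, the constant `alpha`, the direction KL(post || comment), and all function names are assumptions made for the example; the slides do not specify them.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def unigram_model(tokens, vocab, alpha=0.1):
    """Additively smoothed unigram distribution over a shared vocabulary.

    alpha is a hypothetical smoothing constant, not a value from the slides.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) = sum over words of p(w) * log(p(w) / q(w)), in nats."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def post_comment_divergence(post, comment, alpha=0.1):
    """Divergence between the post model and the comment model (assumed direction)."""
    post_tokens, comment_tokens = tokenize(post), tokenize(comment)
    vocab = set(post_tokens) | set(comment_tokens)
    p = unigram_model(post_tokens, vocab, alpha)
    q = unigram_model(comment_tokens, vocab, alpha)
    return kl_divergence(p, q)


post = "I baked sourdough bread this weekend and the crust came out great"
on_topic = "Great crust! How long did you bake the bread?"
spam = "Cheap pills online casino poker best prices click here"

# The on-topic comment shares words with the post, so it scores lower.
print(post_comment_divergence(post, on_topic) < post_comment_divergence(post, spam))
```

Smoothing matters here because any word with zero probability under the comment model would make the divergence infinite; additive smoothing is only one simple way to avoid that.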
7
Clustering with Gaussian Mixture
Use clustering based on a Gaussian mixture
Cluster all comments of a post into 2 groups by KL-divergence value
The group with the higher KL-divergence values is the spam group
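The clustering step can be sketched as a two-component, one-dimensional Gaussian mixture fitted with EM over the divergence values of one post's comments. This is an illustrative sketch only: the initialisation, the iteration count, the variance floor `min_var`, and the function names are assumptions, and the example divergence values are made up.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(values, iterations=100, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns (weights, means, variances). Means start at the smallest and
    largest value; both variances start at the overall variance.
    """
    means = [min(values), max(values)]
    mean_all = sum(values) / len(values)
    var_all = max(sum((v - mean_all) ** 2 for v in values) / len(values), min_var)
    variances = [var_all, var_all]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in values:
            likes = [weights[k] * gaussian_pdf(v, means[k], variances[k]) for k in range(2)]
            total = sum(likes) or 1e-300  # guard against underflow to zero
            resp.append([like / total for like in likes])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # component is empty; keep its old parameters
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            spread = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(spread, min_var)
    return weights, means, variances


def label_spam(divergences):
    """Return True for comments assigned to the higher-mean (spam) component."""
    weights, means, variances = fit_two_gaussians(divergences)
    spam = 0 if means[0] > means[1] else 1
    labels = []
    for d in divergences:
        likes = [weights[k] * gaussian_pdf(d, means[k], variances[k]) for k in range(2)]
        labels.append(likes[spam] > likes[1 - spam])
    return labels


# Hypothetical divergence values for the comments of a single post.
divergences = [0.8, 0.9, 1.0, 1.1, 3.9, 4.0, 4.2, 4.1]
print(label_spam(divergences))
```

Because the mixture is fitted per post, nothing has to be learned in advance; the trade-off is that a post whose comments are all spam, or all legitimate, is still split into two groups.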
8
Limitations
Spammers can cheat the system by using words similar to the post in their comments
Posts and comments are often too short to estimate a reliable language model
  – follow the links
9
Experiment Corpus
50 random blog posts with 1024 comments
At least 3 comments per post
32% of comments are valid
68% of comments are spam
10
Sample Spam Comments
11
Result
Baseline: classify as spam with 68% probability
Threshold multiplier: adjust the classification boundary
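The two items on this slide can be illustrated with a short Python sketch. It assumes that the baseline labels each comment as spam independently at random with probability 0.68, and that the threshold multiplier scales a divergence cutoff separating the two groups. Neither detail is spelled out on the slide, and the function names, the cutoff value, and the sample divergences are hypothetical.

```python
def baseline_expected_accuracy(spam_rate):
    """Expected accuracy of labelling each comment spam with probability spam_rate.

    A spam comment is labelled correctly with probability spam_rate, and a
    valid comment with probability 1 - spam_rate.
    """
    return spam_rate ** 2 + (1 - spam_rate) ** 2


def classify_with_multiplier(divergences, threshold, multiplier=1.0):
    """Flag a comment as spam when its divergence exceeds threshold * multiplier."""
    cutoff = threshold * multiplier
    return [d > cutoff for d in divergences]


# Random baseline on a corpus where 68% of comments are spam.
print(round(baseline_expected_accuracy(0.68), 4))

# Hypothetical divergences and cutoff; only the multiplier changes.
sample = [0.5, 1.5, 2.5]
print(classify_with_multiplier(sample, threshold=1.0, multiplier=1.0))
print(classify_with_multiplier(sample, threshold=1.0, multiplier=2.0))
```

With a positive cutoff, a multiplier above 1 flags fewer comments (fewer legitimate comments blocked, more spam let through), while a multiplier below 1 flags more.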
12
Conclusion
No training
No hand-coded rules
Still working on
  – following the links to the websites