Blocking Blog Spam with Language Model Disagreement Gilad Mishne (Amsterdam) David Carmel (IBM Israel) AIRWeb 2005.


1 Blocking Blog Spam with Language Model Disagreement Gilad Mishne (Amsterdam) David Carmel (IBM Israel) AIRWeb 2005

2 What is Blog Spam? Bots post comments unrelated to the original blog post. The comments contain links to irrelevant sites. The links are there to fool Google into ranking those sites higher.

3 Current Solutions: require registration; require solving a puzzle; disallow HTML in comments; disallow comments on old posts; filter by IP; limit the comment rate.

4 Objective: Filter out blog spam.

5 Approach: Compare the contents of the post with the contents of each comment.

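6 KL-Divergence Similarity: Use the KL-divergence between the language models of the post and the comment as a similarity score. KL-divergence measures how much two distributions differ, so a lower score means higher similarity.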
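A minimal sketch of the score on slide 6, under assumptions the slides do not state: whitespace tokenization, unigram language models, additive smoothing with a hypothetical `alpha` parameter, and the direction KL(post || comment). The function names are illustrative only and are not taken from the authors' system.

```python
import math
from collections import Counter


def unigram_model(text, vocab, alpha=0.5):
    """Smoothed unigram distribution over `vocab` (additive smoothing is an assumption)."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def post_comment_divergence(post, comment, alpha=0.5):
    """Lower value = comment language is closer to the post language."""
    vocab = set(post.lower().split()) | set(comment.lower().split())
    p = unigram_model(post, vocab, alpha)
    q = unigram_model(comment, vocab, alpha)
    return kl_divergence(p, q)


post = "cats sleep on the warm sofa"
on_topic = post_comment_divergence(post, "my cats sleep on the sofa too")
off_topic = post_comment_divergence(post, "buy cheap pills online now")
print(on_topic < off_topic)  # the unrelated comment diverges more from the post
```

Smoothing is needed because KL-divergence is undefined when the comment model assigns zero probability to a word that occurs in the post; any smoothing scheme that avoids zeros would serve the same purpose here.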

7 Clustering with Gaussian Mixture: Fit a Gaussian mixture to the KL-divergence values of all comments on a post and cluster the comments into 2 groups. The group with the higher KL-divergence values is the spam group.
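The clustering step on slide 7 could look roughly like the sketch below: a two-component, one-dimensional Gaussian mixture fitted with plain EM, where the component with the larger mean is treated as spam. The initialization, the variance floor, the iteration count and the 0.5 responsibility cut-off are all assumptions made for illustration, not details from the presentation.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibilities(x, weights, means, variances):
    """Posterior probability of each of the two components given x."""
    dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
    total = sum(dens) or 1e-300  # guard against underflow to zero
    return [d / total for d in dens]


def fit_two_gaussians(values, iterations=100):
    """EM for a 1-D mixture of two Gaussians; returns (weights, means, variances)."""
    lo, hi = min(values), max(values)
    means = [lo, hi]  # assumed initialization: the two extremes
    spread = max((hi - lo) ** 2 / 4.0, 1e-6)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        resp = [responsibilities(x, weights, means, variances) for x in values]  # E-step
        for k in range(2):  # M-step
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)  # variance floor keeps the pdf finite
    return weights, means, variances


def label_spam(kl_values):
    """True for comments assigned to the higher-divergence component."""
    weights, means, variances = fit_two_gaussians(kl_values)
    spam = means.index(max(means))
    return [responsibilities(x, weights, means, variances)[spam] > 0.5 for x in kl_values]


kl_values = [0.2, 0.3, 0.25, 0.35, 2.1, 2.4, 2.2, 2.6]
labels = label_spam(kl_values)
print(labels)
```

Because the mixture is fitted separately for each post, no labeled training data is involved; the decision boundary comes only from the spread of divergence values among that post's comments.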

8 Limitations: Spammers can cheat the system by using words similar to the post in their comments. Posts and comments are often too short to extract a language model from –follow the links

9 Experiment Corpus: 50 random blog posts with 1024 comments; at least 3 comments per post; 32% of comments are valid; 68% of comments are spam.

10 Sample Spam Comments

11 Result: Baseline: classify each comment as spam at random with 68% probability. Threshold multiplier: adjusts the classification boundary.
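To make the two terms on slide 11 concrete, here is a hedged sketch. The first function computes the accuracy one would expect from a random baseline that guesses spam at the corpus spam rate. The second shows one possible reading of the threshold multiplier, in which a decision threshold on the KL-divergence is scaled by a constant factor; the slide does not define the multiplier precisely, so that reading and both function names are assumptions.

```python
def baseline_expected_accuracy(spam_rate):
    """Expected accuracy when guessing 'spam' with probability equal to the spam rate."""
    # Correct when a spam comment is guessed spam, or a valid one is guessed valid.
    return spam_rate ** 2 + (1.0 - spam_rate) ** 2


def classify_with_multiplier(kl_values, threshold, multiplier=1.0):
    """Flag a comment as spam when its divergence exceeds threshold * multiplier."""
    boundary = threshold * multiplier
    return [kl > boundary for kl in kl_values]


accuracy = baseline_expected_accuracy(0.68)
strict = classify_with_multiplier([0.5, 1.5, 2.5], threshold=1.0, multiplier=1.0)
lenient = classify_with_multiplier([0.5, 1.5, 2.5], threshold=1.0, multiplier=2.0)
print(round(accuracy, 4), strict, lenient)
```

Under this reading, a multiplier above 1 moves the boundary upward, so fewer comments are flagged and fewer valid comments are lost, at the cost of letting more spam through.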

12 Conclusion: No training needed. No hand-coded rules. Still working on –following the links to the linked websites

