Spam Detection. Jingrui He, 10/08/2007.

1 Spam Detection Jingrui He 10/08/2007

2 Spam Types
- Email spam: unsolicited commercial email
- Blog spam: unwanted comments in blogs
- Splogs: fake blogs to boost PageRank

3 From a Learning Point of View
- Spam detection: classification problem (ham vs. spam)
- Feature extraction: A Learning Approach to Spam Detection based on Social Networks. H.Y. Lam and D.Y. Yeung
- Fast classifier: Relaxed Online SVMs for Spam Filtering. D. Sculley, G.M. Wachman

4 A Learning Approach to Spam Detection based on Social Networks H.Y. Lam and D.Y. Yeung CEAS 2007

5 Problem Statement
- n email accounts
- Sender set; receiver set
- Labeled sender set: a subset of the sender set
- Goal: assign a score to each remaining (unlabeled) sender account

6 System Flow Chart

7 Social Network from Logs
- Directed graph
- Directed edge: an email sent from one account to another
- Edge weight: the number of emails sent from the source account to the destination account
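The graph construction on this slide can be sketched as follows. This is a minimal illustration, assuming the log is available as (sender, receiver) pairs; the function name `build_email_graph` and the sample log are hypothetical and not taken from the paper.

```python
from collections import defaultdict

def build_email_graph(log):
    """Build a directed, weighted graph from (sender, receiver) log entries.

    Returns a dict mapping (sender, receiver) -> number of emails sent
    along that directed edge.
    """
    weights = defaultdict(int)
    for sender, receiver in log:
        weights[(sender, receiver)] += 1
    return dict(weights)

# Hypothetical log entries, invented for illustration only.
log = [("a", "b"), ("a", "b"), ("b", "a"), ("a", "c")]
graph = build_email_graph(log)
```

Storing the graph as an edge-to-weight dictionary keeps the later feature computations to simple scans over the edges.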

8 System Flow Chart

9 Features from Email Social Networks
- In-count / out-count: the sum of incoming / outgoing edge weights
- In-degree / out-degree: the number of email accounts that a node receives emails from / sends emails to
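The four count and degree features can be computed directly from the edge weights. The sketch below assumes the graph is stored as a dict from (sender, receiver) to email count; the function name and the sample weights are hypothetical.

```python
def count_and_degree_features(weights, node):
    """Compute in/out-count and in/out-degree for one account.

    weights maps (sender, receiver) -> number of emails sent.
    """
    in_count = sum(w for (s, r), w in weights.items() if r == node)
    out_count = sum(w for (s, r), w in weights.items() if s == node)
    in_degree = len({s for (s, r) in weights if r == node})
    out_degree = len({r for (s, r) in weights if s == node})
    return {
        "in_count": in_count,
        "out_count": out_count,
        "in_degree": in_degree,
        "out_degree": out_degree,
    }

# Hypothetical edge weights, invented for illustration only.
weights = {("a", "b"): 2, ("b", "a"): 1, ("a", "c"): 1}
features_a = count_and_degree_features(weights, "a")
```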

10 Features from Email Social Networks
- Communication Reciprocity (CR): the percentage of interactive neighbors that a node has
- Defined using the set of accounts that received emails from the node and the set of accounts that sent emails to the node
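The formula for CR did not survive in this transcript. One plausible reading, assumed here, is the fraction of accounts a node has emailed that have also emailed it back. The sketch below implements that reading; the function name, the zero fallback for accounts with no recipients, and the sample weights are all assumptions.

```python
def communication_reciprocity(weights, node):
    """Fraction of a node's recipients that also sent email to the node.

    Assumed definition: |sent_to & received_from| / |sent_to|.
    """
    sent_to = {r for (s, r) in weights if s == node}
    received_from = {s for (s, r) in weights if r == node}
    if not sent_to:
        return 0.0  # assumption: no recipients means no reciprocity
    return len(sent_to & received_from) / len(sent_to)

# Hypothetical edge weights, invented for illustration only.
weights = {("a", "b"): 2, ("b", "a"): 1, ("a", "c"): 1}
cr_a = communication_reciprocity(weights, "a")
cr_b = communication_reciprocity(weights, "b")
```

Under this reading, an account that sends to many addresses and rarely gets a reply receives a low CR.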

11 Features from Email Social Networks
- Communication Interaction Average (CIA): the level of interaction between a sender and each of the corresponding recipients
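The CIA formula is also missing from the transcript. The sketch below assumes CIA is the average, over a sender's recipients, of the ratio of emails received back to emails sent. That definition, the function name, and the sample weights are assumptions for illustration.

```python
def communication_interaction_average(weights, node):
    """Average reply ratio between a sender and each of its recipients.

    Assumed definition: mean over recipients r of
    weight(r -> node) / weight(node -> r).
    """
    recipients = {r for (s, r) in weights if s == node}
    if not recipients:
        return 0.0  # assumption: an account with no recipients scores zero
    ratios = [weights.get((r, node), 0) / weights[(node, r)] for r in recipients]
    return sum(ratios) / len(ratios)

# Hypothetical edge weights, invented for illustration only.
weights = {("a", "b"): 2, ("b", "a"): 1, ("a", "c"): 1}
cia_a = communication_interaction_average(weights, "a")
```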

12 Features from Email Social Networks
- Clustering Coefficient (CC): friends-of-friends relationship between email accounts
- Defined using the number of neighbors of a node and the number of connections between those neighbors
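A common way to compute a clustering coefficient from these two quantities is shown below. The sketch assumes the undirected form, 2E / (n(n - 1)), where n is the number of neighbors and E the number of connections among them, and treats an edge in either direction as a connection. The slide's exact formula is not in the transcript, so this choice and the names used are assumptions.

```python
from itertools import combinations

def clustering_coefficient(weights, node):
    """Undirected clustering coefficient of one account.

    weights maps (sender, receiver) -> email count; direction is ignored.
    """
    edges = {frozenset(pair) for pair in weights if pair[0] != pair[1]}
    neighbors = {
        other
        for edge in edges if node in edge
        for other in edge if other != node
    }
    n = len(neighbors)
    if n < 2:
        return 0.0  # fewer than two neighbors: no pairs to connect
    links = sum(
        1 for u, v in combinations(neighbors, 2) if frozenset((u, v)) in edges
    )
    return 2 * links / (n * (n - 1))

# Hypothetical edge weights, invented for illustration only.
weights = {("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 3, ("a", "d"): 1}
cc_a = clustering_coefficient(weights, "a")
```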

13 System Flow Chart

14 Preprocessing
- Sender feature vector
- Weighted features: problematic?

15 System Flow Chart

16 Assigning Spam Score
- Similarity-weighted k-NN method
- Gaussian similarity
- Similarity-weighted mean of the k-NN scores, taken over the set of k nearest neighbors
- Score scaling
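The scoring step can be sketched as a similarity-weighted k-NN. The code below assumes a Gaussian similarity of the form exp(-d^2 / (2 sigma^2)) on feature vectors and labels of 1 for spam and 0 for legitimate senders. The bandwidth `sigma`, the function names, and the sample data are hypothetical, and the score-scaling step is left out because its formula is not in the transcript.

```python
import math

def gaussian_similarity(u, v, sigma=1.0):
    """Gaussian similarity between two feature vectors (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def knn_score(query, labeled, k=3, sigma=1.0):
    """Similarity-weighted mean of the scores of the k nearest neighbors.

    labeled is a list of (feature_vector, score) pairs, where score is
    assumed to be 1.0 for spam senders and 0.0 for legitimate senders.
    """
    scored = [(gaussian_similarity(query, x, sigma), y) for x, y in labeled]
    nearest = sorted(scored, key=lambda t: t[0], reverse=True)[:k]
    total = sum(sim for sim, _ in nearest)
    if total == 0:
        return 0.5  # assumption: fall back to a neutral score
    return sum(sim * y for sim, y in nearest) / total

# Hypothetical labeled senders, invented for illustration only.
labeled = [([0.0, 0.0], 0.0), ([0.1, 0.0], 0.0), ([5.0, 5.0], 1.0), ([5.1, 5.0], 1.0)]
spam_like = knn_score([5.0, 5.0], labeled, k=2)
ham_like = knn_score([0.0, 0.0], labeled, k=2)
```

Weighting each neighbor by its similarity lets close neighbors dominate the score, so the result is less sensitive to the exact choice of k than a plain majority vote.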

17 Experiments
- Enron dataset: 9150 senders
- To get: legitimate Enron senders (email transactions within the Enron email domain); 5000 generated spam accounts; 120 senders from each class
- Results averaged over 100 runs

18 Number of Nearest Neighbors

19 Feature Weights (CC)

20 Feature Weights (CIA)

21 Feature Weights (CR)

22 Feature Weights
- In/out-count and in/out-degree: the smaller the better
- Final weights: in/out-count and in/out-degree: 1; CR: 1; CIA: 10; CC: 15

23 Conclusion
- Legitimacy score: no content needed
- Can be combined with content-based filters
- More sophisticated classifiers: SVM, boosting, etc.
- Classifiers using combined features

24 Relaxed Online SVMs for Spam Filtering D. Sculley and G.M. Wachman SIGIR 2007

25 Anti-Spam Controversy
- Support vector machines (SVMs)
- Academic researchers: statistically robust; state-of-the-art performance
- Practitioners: training cost quadratic in the number of training examples; impractical!
- Solution: relaxed online SVMs

26 Background: SVMs
- Data set
- Class label: 1 for spam; -1 for ham
- Classifier
- To find the classifier, minimize an objective that trades off maximizing the margin against minimizing the loss function, weighted by a tradeoff parameter
- Constraints: one per example, each with a slack variable
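The formulas on this slide are missing from the transcript. For reference, the standard soft-margin SVM formulation, which matches the labels on the slide, is given below. The symbols w, b, ξ_i, and n are the conventional ones and are assumed here rather than recovered from the slide.

```latex
% Classifier (linear case)
f(x) = w \cdot x + b

% Objective: the first term maximizes the margin, the second minimizes
% the loss; C is the tradeoff parameter
\min_{w,\, b,\, \xi} \quad \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i

% Constraints, with slack variables \xi_i
\text{subject to} \quad y_i \,(w \cdot x_i + b) \ge 1 - \xi_i,
\qquad \xi_i \ge 0, \qquad i = 1, \dots, n
```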

27 Online SVMs

28 Tuning the Tradeoff Parameter C
- SpamAssassin data set: 6034 examples
- Large C preferred

29 Email Spam and SVMs
- TREC05P-1: 92189 messages
- TREC06P: 37822 messages

30 Blog Comment Spam and SVMs
- Leave-one-out cross-validation
- 50 blog posts; 1024 comments

31 Splogs and SVMs
- Leave-one-out cross-validation
- 1380 examples

32 Computational Cost
- Online SVMs: quadratic training time

33 Relaxed Online SVMs (ROSVM)
- Objective function of SVMs
- Large C preferred: minimizing training error is more important than maximizing the margin
- ROSVM: full margin maximization is not necessary, so relax this requirement

34 Three Ways to Relax SVMs (1)
- Only optimize over the recent p examples
- Dual form of SVMs
- Constraints: variables for examples outside the recent p are held at the last value found for them
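The dual problem and its constraints are missing from the transcript. A sketch of what the slide likely showed is given below: the standard SVM dual over the t examples seen so far, plus an extra constraint that freezes the variables of older examples. The constants c_i (the last value found for α_i while x_i was among the most recent p examples) are notation assumed here for illustration.

```latex
% Standard dual form of the soft-margin SVM over t examples
\max_{\alpha} \quad \sum_{i=1}^{t} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{t} \sum_{j=1}^{t}
    \alpha_i \alpha_j \, y_i y_j \, \langle x_i, x_j \rangle

\text{subject to} \quad 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{t} \alpha_i y_i = 0

% Relaxation: only the p most recent examples are optimized;
% older variables are frozen
\alpha_i = c_i \quad \text{for } i \le t - p
```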

35 Three Ways to Relax SVMs (2)
- Only update on actual errors
- Original online SVMs: update when y f(x) < 1
- ROSVM: update when y f(x) < M, with 0 ≤ M ≤ 1
- M = 0: mistake-driven online SVMs
- No significant degradation in performance
- Significantly reduced cost

36 Three Ways to Relax SVMs (3)
- Reduce the number of iterations in iterative SVMs
- SMO: repeated passes over the training set to minimize the objective function
- Parameter T: the maximum number of iterations
- T = 1: little impact on performance
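The three relaxations can be combined in one online training loop. The sketch below shows only the control flow: a window of the p most recent examples, an update triggered when y f(x) falls below a threshold M, and at most T optimization passes per update. It swaps in a plain hinge-loss subgradient step on a linear model in place of the SMO solver, so it is not the authors' algorithm; the class name, learning rate, and sample data are hypothetical.

```python
from collections import deque

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

class RelaxedOnlineLinearClassifier:
    """Control-flow sketch of the three relaxations (not an SMO solver)."""

    def __init__(self, dim, p=100, margin=0.8, max_iters=1, lr=0.1):
        self.w = [0.0] * dim
        self.window = deque(maxlen=p)  # relaxation 1: keep only p recent examples
        self.margin = margin           # relaxation 2: update only if y*f(x) < M
        self.max_iters = max_iters     # relaxation 3: at most T passes per update
        self.lr = lr

    def score(self, x):
        return dot(self.w, x)

    def observe(self, x, y):
        """Process one example (y is +1 for spam, -1 for ham).

        Returns True if the model was updated, False if the example
        was classified with enough margin and skipped.
        """
        if y * self.score(x) >= self.margin:
            return False
        self.window.append((x, y))
        for _ in range(self.max_iters):
            for xi, yi in self.window:
                if yi * dot(self.w, xi) < 1.0:
                    # Hinge-loss subgradient step (stand-in for SMO).
                    self.w = [wj + self.lr * yi * xij
                              for wj, xij in zip(self.w, xi)]
        return True

# Hypothetical two-feature examples, invented for illustration only.
clf = RelaxedOnlineLinearClassifier(dim=2, p=2, margin=0.8, max_iters=1, lr=0.5)
first = clf.observe([1.0, 0.0], 1)
second = clf.observe([0.0, 1.0], -1)
third = clf.observe([1.0, 0.0], 1)
```

Each of the three parameters bounds a different part of the per-example cost: `p` caps how many examples are revisited, `margin` caps how often an update runs at all, and `max_iters` caps the work done inside one update.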

37 Testing Reduced Size

38 Testing Reduced Iterations

39 Testing Reduced Updates

40 Online SVMs and ROSVM
- ROSVM: email spam, blog comment spam, and splog data sets

