Spam Detection Jingrui He 10/08/2007
Spam Types
  Email spam: unsolicited commercial email
  Blog spam: unwanted comments in blogs
  Splogs: fake blogs created to boost PageRank
From the Learning Point of View
  Spam detection is a classification problem (ham vs. spam)
  Feature extraction: "A Learning Approach to Spam Detection based on Social Networks", H.Y. Lam and D.Y. Yeung
  Fast classifier: "Relaxed Online SVMs for Spam Filtering", D. Sculley and G.M. Wachman
A Learning Approach to Spam Detection based on Social Networks H.Y. Lam and D.Y. Yeung CEAS 2007
Problem Statement
  n email accounts
  Sender set and receiver set
  Labeled sender set: a subset of the sender set with known labels
  Goal: assign a score to each remaining sender account
System Flow Chart
Social Network from Logs
  Directed graph: one node per email account
  Directed edge: email sent from one account to another
  Edge weight: the number of emails sent along that edge
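The graph construction on this slide can be sketched in a few lines. This is a minimal illustration, assuming the log is a list of (sender, recipient) pairs and that the edge weight is simply the email count; the function name `build_email_graph` and the sample log are hypothetical.

```python
from collections import defaultdict


def build_email_graph(log):
    """Build a directed, weighted graph from (sender, recipient) log entries.

    Returns a dict mapping (sender, recipient) -> edge weight, where the
    weight is assumed to be the number of emails sent along that edge.
    """
    weights = defaultdict(int)
    for sender, recipient in log:
        weights[(sender, recipient)] += 1
    return dict(weights)


# Hypothetical log: each tuple is one email transaction.
log = [
    ("alice", "bob"),
    ("alice", "bob"),
    ("bob", "alice"),
    ("spammer", "alice"),
]
graph = build_email_graph(log)
```

Storing edges as a dictionary keyed by (source, destination) keeps the direction explicit, which the features on the following slides rely on.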
Features from Social Networks
  In-count / out-count: the sum of incoming / outgoing edge weights
  In-degree / out-degree: the number of accounts that a node receives emails from / sends emails to
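As a rough sketch, the four count and degree features can be read directly off an edge-weight dictionary. The graph layout ((source, destination) -> email count), the function name, and the toy data are assumptions made for illustration.

```python
def count_degree_features(graph, node):
    """Compute in/out-count and in/out-degree for one node.

    graph maps (source, destination) -> number of emails on that edge.
    """
    in_count = sum(w for (src, dst), w in graph.items() if dst == node)
    out_count = sum(w for (src, dst), w in graph.items() if src == node)
    in_degree = len({src for (src, dst) in graph if dst == node})
    out_degree = len({dst for (src, dst) in graph if src == node})
    return {
        "in_count": in_count,
        "out_count": out_count,
        "in_degree": in_degree,
        "out_degree": out_degree,
    }


# Hypothetical toy graph.
graph = {("alice", "bob"): 2, ("bob", "alice"): 1, ("spammer", "alice"): 1}
features = count_degree_features(graph, "alice")
```

Counts sum the weights, while degrees only count distinct neighbors, so an account that sends many emails to a single recipient has a high out-count but an out-degree of 1.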
Features from Social Networks
  Communication Reciprocity (CR): the percentage of interactive neighbors that a node has
  Defined using two sets: the accounts that received emails from the node, and the accounts that sent emails to the node
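The exact CR formula did not survive on the slide. One plausible reading, taken here as an assumption, is the fraction of a node's recipients that also wrote back to it. The function name and the toy graph are hypothetical.

```python
def communication_reciprocity(graph, node):
    """Fraction of the node's recipients that also sent email to the node.

    Assumed formula: |sent_to & received_from| / |sent_to|.
    graph maps (source, destination) -> number of emails on that edge.
    """
    sent_to = {dst for (src, dst) in graph if src == node}
    received_from = {src for (src, dst) in graph if dst == node}
    if not sent_to:
        # Convention chosen for this sketch: no recipients means CR = 0.
        return 0.0
    return len(sent_to & received_from) / len(sent_to)


# Hypothetical toy graph.
graph = {("alice", "bob"): 2, ("bob", "alice"): 1, ("spammer", "alice"): 1}
cr_alice = communication_reciprocity(graph, "alice")
cr_spammer = communication_reciprocity(graph, "spammer")
```

Under this reading, an account whose recipients never reply gets a CR of 0, which is the behavior one would expect from a bulk sender.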
Features from Social Networks
  Communication Interaction Average (CIA): the level of interaction between a sender and each of the corresponding recipients
Features from Social Networks
  Clustering Coefficient (CC): friends-of-friends relationship between accounts
  Defined using the number of neighbors of a node and the number of connections between those neighbors
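The CIA formula is also missing from the slide. The sketch below assumes it is the average, over a sender's recipients, of the ratio between emails received from a recipient and emails sent to that recipient. That ratio, the function name, and the toy graph are assumptions.

```python
def communication_interaction_average(graph, node):
    """Average reply ratio between a sender and each of its recipients.

    Assumed formula: mean over recipients r of w(r -> node) / w(node -> r).
    graph maps (source, destination) -> number of emails (always >= 1).
    """
    recipients = {dst for (src, dst) in graph if src == node}
    if not recipients:
        # Convention chosen for this sketch: no recipients means CIA = 0.
        return 0.0
    ratios = [graph.get((r, node), 0) / graph[(node, r)] for r in recipients]
    return sum(ratios) / len(recipients)


# Hypothetical toy graph.
graph = {("alice", "bob"): 2, ("bob", "alice"): 1, ("spammer", "alice"): 1}
cia_alice = communication_interaction_average(graph, "alice")
cia_spammer = communication_interaction_average(graph, "spammer")
```

Unlike CR, this measure uses the edge weights, so it separates a sender who gets occasional replies from one who holds balanced conversations.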
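A minimal sketch of the clustering coefficient, assuming the standard undirected definition 2E / (k(k-1)), where k is the number of neighbors and E is the number of connections between them. Treating the email graph as undirected for this feature is an assumption, and the names and toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def clustering_coefficient(graph, node):
    """Standard undirected clustering coefficient of one node.

    graph maps (source, destination) -> weight; direction and weight
    are ignored here, which is a simplification made for this sketch.
    """
    neighbors = defaultdict(set)
    for src, dst in graph:
        if src != dst:
            neighbors[src].add(dst)
            neighbors[dst].add(src)
    nbrs = neighbors[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Count pairs of neighbors that are themselves connected.
    links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy graph: a talks to b, c and d; only b and c talk to each other.
graph = {("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 1, ("a", "d"): 1}
cc_a = clustering_coefficient(graph, "a")
```

Here node a has three neighbors and one connection among them, so the coefficient is 2 * 1 / (3 * 2) = 1/3.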
Preprocessing
  Sender feature vector
  Weighted features
  Problematic?
Assigning Spam Score
  Similarity-weighted k-NN method
  Gaussian similarity
  Similarity-weighted mean of the k-NN scores, taken over the set of k nearest neighbors
  Score scaling
Experiments
  Enron dataset: 9150 senders
  To get legitimate Enron senders: email transactions within the Enron domain
  5000 generated spam accounts
  120 senders from each class
  Results averaged over 100 runs
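The scoring step can be sketched as follows. This is an illustration under stated assumptions: the Gaussian similarity is taken as exp(-||x - y||^2 / (2 sigma^2)), labeled scores are 1.0 for spam and 0.0 for ham, and the final score scaling step is omitted because its form is not given on the slide. All names and data are hypothetical.

```python
import math


def gaussian_similarity(x, y, sigma=1.0):
    """Gaussian similarity between two feature vectors (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def knn_score(query, labeled, k=3, sigma=1.0):
    """Similarity-weighted mean of the scores of the k nearest neighbors.

    labeled is a list of (feature_vector, score) pairs.
    """
    scored = [(gaussian_similarity(query, x, sigma), s) for x, s in labeled]
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(sim for sim, _ in nearest)
    if total == 0.0:
        # All similarities underflowed; fall back to a neutral score
        # (a choice made for this sketch only).
        return 0.5
    return sum(sim * s for sim, s in nearest) / total


# Hypothetical labeled senders: 0.0 = ham, 1.0 = spam.
labeled = [((0.0, 0.0), 0.0), ((0.0, 1.0), 0.0), ((5.0, 5.0), 1.0), ((5.0, 6.0), 1.0)]
score_near_ham = knn_score((0.1, 0.1), labeled, k=2)
score_near_spam = knn_score((5.0, 5.5), labeled, k=2)
```

Weighting by similarity means a distant neighbor that happens to fall among the k nearest contributes very little to the score.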
Number of Nearest Neighbors
Feature Weights (CC)
Feature Weights (CIA)
Feature Weights (CR)
Feature Weights
  In/out-count and in/out-degree: the smaller the weight, the better
  Final weights
    In/out-count and in/out-degree: 1
    CR: 1
    CIA: 10
    CC: 15
Conclusion
  Legitimacy score: no content needed
  Can be combined with content-based filters
  More sophisticated classifiers: SVM, boosting, etc.
  Classifiers using combined features
Relaxed Online SVMs for Spam Filtering D. Sculley and G.M. Wachman SIGIR 2007
Anti-Spam Controversy
  Support Vector Machines (SVMs)
  Academic researchers: statistically robust; state-of-the-art performance
  Practitioners: training cost is quadratic in the number of training examples; impractical!
  Solution: Relaxed Online SVMs
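One way to apply the final weights is to scale each feature before computing similarities. The sketch below assumes plain multiplication and does not reproduce any normalization the preprocessing step may include. The feature names, the function name, and the sample sender are hypothetical; the weight values are the ones listed on the slide.

```python
# Final weights from the slide, keyed by hypothetical feature names.
FINAL_WEIGHTS = {
    "in_count": 1.0,
    "out_count": 1.0,
    "in_degree": 1.0,
    "out_degree": 1.0,
    "cr": 1.0,
    "cia": 10.0,
    "cc": 15.0,
}


def apply_feature_weights(features, weights=FINAL_WEIGHTS):
    """Return the weighted feature vector, in the key order of weights.

    Assumes weighting is a plain per-feature multiplication.
    """
    return [features[name] * weights[name] for name in weights]


# Hypothetical sender features.
sender = {
    "in_count": 2.0,
    "out_count": 2.0,
    "in_degree": 2.0,
    "out_degree": 1.0,
    "cr": 1.0,
    "cia": 0.5,
    "cc": 0.5,
}
weighted = apply_feature_weights(sender)
```

With these weights, a difference in CC or CIA moves two senders much further apart than the same difference in the count or degree features.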
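Background: SVMs
  Data set: $\{(x_i, y_i)\}_{i=1}^{n}$
  Class label $y_i$: 1 for spam; -1 for ham
  Classifier: $f(x) = \langle w, x \rangle + b$
  To find $w$ and $b$, minimize: $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$
  Constraints: $y_i f(x_i) \ge 1 - \xi_i$, $\xi_i \ge 0$
  $\xi_i$: slack variable
  First term: maximizing the margin; second term: minimizing the loss function
  $C$: tradeoff parameter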
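To make the tradeoff concrete, the sketch below evaluates the standard soft-margin objective, with each slack variable written as a hinge loss. It assumes a linear classifier f(x) = w.x + b; the function name and the toy data are hypothetical.

```python
def svm_objective(w, b, examples, labels, C):
    """Soft-margin SVM objective: 0.5 * ||w||^2 + C * sum of slacks.

    The slack of example i is max(0, 1 - y_i * f(x_i)).
    """
    margin_term = 0.5 * sum(wi * wi for wi in w)
    total_slack = 0.0
    for x, y in zip(examples, labels):
        fx = sum(wi * xi for wi, xi in zip(w, x)) + b
        total_slack += max(0.0, 1.0 - y * fx)
    return margin_term + C * total_slack


# Hypothetical toy data: the third example lies inside the margin.
examples = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
labels = [1, -1, 1]
objective = svm_objective([1.0, 0.0], 0.0, examples, labels, C=10.0)
```

Here only the third example has non-zero slack (0.5), so the objective is 0.5 + 10 * 0.5 = 5.5. A larger C makes that one margin violation dominate the objective.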
Online SVMs
Tuning the Tradeoff Parameter C
  SpamAssassin data set: 6034 examples
  Large C preferred
Email Spam and SVMs
  TREC05P-1: messages
  TREC06P: messages
Blog Comment Spam and SVMs
  Leave-one-out cross-validation
  50 blog posts; 1024 comments
Splogs and SVMs
  Leave-one-out cross-validation
  1380 examples
Computational Cost
  Online SVMs: quadratic training time
Relaxed Online SVMs (ROSVM)
  Objective function of SVMs: $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$
  Large C preferred: minimizing training error is more important than maximizing the margin
  ROSVM: full margin maximization is not necessary, so relax this requirement
Three Ways to Relax SVMs (1)
  Only optimize over the recent p examples
  Dual form of SVMs, with constraints
  For an example outside the recent p, the dual variable is fixed at the last value found for it when it was still among the recent p
Three Ways to Relax SVMs (2)
  Only update on actual errors
  Original online SVMs: update when $y_i f(x_i) < 1$
  ROSVM: update when $y_i f(x_i) < m$, with $0 \le m \le 1$
  m = 0: mistake-driven online SVMs
  No significant degradation in performance
  Significantly reduced cost
Three Ways to Relax SVMs (3)
  Reduce the number of iterations in iterative SVMs
  SMO: repeated passes over the training set to minimize the objective function
  Parameter T: the maximum number of iterations
  T = 1: little impact on performance
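The three relaxations can be combined in one small sketch. This is not the authors' implementation: it assumes a linear kernel with no bias term, and it swaps SMO for a simple dual coordinate ascent step. The class name and the toy stream are hypothetical; p, m, T and C play the roles described on the slides.

```python
from collections import deque


class RelaxedOnlineLinearSVM:
    """Sketch of a relaxed online SVM (linear kernel, no bias term)."""

    def __init__(self, dim, C=100.0, p=100, m=0.5, T=1):
        self.w = [0.0] * dim
        self.C, self.m, self.T = C, m, T
        # Relaxation 1: only the most recent p examples are optimized.
        # Examples that fall out keep their last alpha inside self.w.
        self.window = deque(maxlen=p)  # entries: [x, y, alpha]

    def decision(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def _optimize_window(self):
        # Relaxation 3: at most T passes over the window.
        for _ in range(self.T):
            for item in self.window:
                x, y, alpha = item
                q = sum(xi * xi for xi in x)
                if q == 0.0:
                    continue
                grad = y * self.decision(x) - 1.0
                new_alpha = min(max(alpha - grad / q, 0.0), self.C)
                delta = new_alpha - alpha
                if delta != 0.0:
                    self.w = [wi + delta * y * xi for wi, xi in zip(self.w, x)]
                    item[2] = new_alpha

    def learn(self, x, y):
        """Process one labeled example; return True if an update ran."""
        # Relaxation 2: skip the update unless y * f(x) < m.
        if y * self.decision(x) >= self.m:
            return False
        self.window.append([list(x), y, 0.0])
        self._optimize_window()
        return True


# Hypothetical linearly separable stream: +1 = spam, -1 = ham.
stream = [((2.0, 1.0), 1), ((-2.0, -1.0), -1), ((1.0, 2.0), 1), ((-1.0, -2.0), -1)]
model = RelaxedOnlineLinearSVM(dim=2, C=100.0, p=2, m=0.5, T=1)
updated = [model.learn(x, y) for x, y in stream]
```

On this toy stream only the first example triggers an update. The remaining three already satisfy y f(x) >= m, so they are classified at no training cost, which is the saving the second relaxation targets.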
Testing Reduced Size
Testing Reduced Iterations
Testing Reduced Updates
Online SVMs and ROSVM
  ROSVM on the email spam, blog comment spam, and splog data sets