Spam Email Detection
Ethan Grefe
December 13, 2013
Motivation
- Spam email constantly clutters inboxes
- It is commonly removed using rule-based filters
- Spam messages often share very similar characteristics
- This allows spam to be detected using machine learning
  - Naïve Bayes classifiers
  - Support vector machines (SVMs)
SVM Solution
- Used training data from the CSDMC2010 SPAM corpus
  - 4327 labeled emails
  - 2949 non-spam messages (HAM)
  - 1378 spam messages (SPAM)
- Extracted features from the subject and body of each email
- Used the resulting feature vectors to train an SVM classifier in Matlab
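The first step above, pulling the subject and body out of a raw email, can be sketched as follows. The original work was done in Matlab and its parsing code is not shown, so this is only an illustration in Python using the standard `email` module. The function name `subject_and_body` and the choice to keep every `text/*` part are assumptions.

```python
from email import message_from_string


def subject_and_body(raw):
    """Return (subject, body_text) for one raw RFC 822 message string."""
    msg = message_from_string(raw)
    subject = str(msg.get("Subject", ""))

    parts = []
    # walk() yields the message itself and, for multipart mail, every subpart.
    for part in msg.walk():
        # Assumption: keep all text parts (plain and HTML), skip attachments.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer-decoded
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset name in the header: fall back to latin-1.
            parts.append(payload.decode("latin-1"))
    return subject, "\n".join(parts)
```

Keeping the HTML parts undecoded matters here because one of the features on the next slide measures how much HTML an email contains.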
Email Features
- Features were determined by research and observation
- Best results were obtained with the following features
  - Percentage of letters that are capitalized
  - Types of punctuation used
  - Average length of a word
  - Amount of HTML in the email
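The four features listed above could be turned into a numeric vector roughly as shown below. The slides do not say how each feature was computed, so every concrete choice here is an assumption: the tracked punctuation marks (`TRACKED_PUNCT`), encoding "types of punctuation" as presence flags, measuring HTML as the fraction of characters inside tags, and computing capitalization and word length on the text with tags removed.

```python
import re

# Hypothetical set of punctuation marks to track; not taken from the slides.
TRACKED_PUNCT = "!?$"
TAG_RE = re.compile(r"<[^>]+>")


def extract_features(subject, body):
    """Return [capital_fraction, avg_word_length, html_fraction, *punct_flags]."""
    text = subject + "\n" + body
    # Assumption: capitals and word length are measured on visible text only.
    visible = TAG_RE.sub(" ", text)

    letters = [c for c in visible if c.isalpha()]
    capital_fraction = (
        sum(1 for c in letters if c.isupper()) / len(letters) if letters else 0.0
    )

    words = re.findall(r"[A-Za-z]+", visible)
    avg_word_length = sum(len(w) for w in words) / len(words) if words else 0.0

    # Fraction of all characters that sit inside HTML tags.
    tag_chars = sum(len(tag) for tag in TAG_RE.findall(text))
    html_fraction = tag_chars / len(text) if text else 0.0

    # One flag per tracked mark: 1.0 if it appears anywhere, else 0.0.
    punct_flags = [1.0 if p in visible else 0.0 for p in TRACKED_PUNCT]

    return [capital_fraction, avg_word_length, html_fraction] + punct_flags
```

Each email then maps to one fixed-length vector, which is the input form an SVM classifier expects.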
Classifier Results
- Trained on a random 35% of the emails
- Tested the SVM classifier on the remaining 65%
- Trained the SVM using three different kernel functions

Kernel Function   Spam Classification Rate   Ham Classification Rate   Total Classification Rate
RBF               80.06%                     92.33%                    86.20%
Linear            78.69%                     80.66%                    79.67%
Quadratic         82.75%                     84.85%                    83.80%
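The evaluation protocol above (random 35% for training, the rest for testing, per-class rates) can be sketched as follows. The reported totals are consistent with the unweighted mean of the two per-class rates, for example (80.06 + 92.33) / 2 ≈ 86.20, so the sketch assumes that definition. The function names, the fixed seed, and the `"spam"`/`"ham"` label strings are illustrative assumptions, and the SVM training itself is not reproduced here.

```python
import random


def split_indices(n, train_fraction=0.35, seed=0):
    """Randomly split range(n) into (train, test) index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for repeatability
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]


def classification_rates(y_true, y_pred):
    """Return (spam_rate, ham_rate, total_rate) as fractions in [0, 1].

    Each class rate is the share of that class labeled correctly.
    Assumption: the total is the unweighted mean of the two class rates.
    """
    rates = {}
    for label in ("spam", "ham"):
        pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == label]
        if not pairs:
            raise ValueError("no examples with label %r" % label)
        rates[label] = sum(1 for t, p in pairs if p == t) / len(pairs)
    total = (rates["spam"] + rates["ham"]) / 2
    return rates["spam"], rates["ham"], total
```

With the 4327 emails of the corpus, `split_indices(4327)` yields 1514 training and 2813 test indices.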
Possible Improvements
- Use Naïve Bayes to classify emails using word frequency
- Obtain a wider variety of input features
- Test other types of learning algorithms
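As an illustration of the first improvement, a word-frequency Naïve Bayes classifier might look like the sketch below. It was not part of the presented work. The multinomial model, the add-one (Laplace) smoothing, the tokenizer, and the class name `WordNaiveBayes` are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


class WordNaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)  # label -> number of emails
        self.word_counts = {}              # label -> Counter of words
        for text, label in zip(texts, labels):
            self.word_counts.setdefault(label, Counter()).update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        # Words never seen in training carry no evidence and are skipped.
        words = [w for w in tokenize(text) if w in self.vocab]
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # log P(label) + sum over words of log P(word | label)
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Working in log space avoids numeric underflow when an email contains many words.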