
1 EMAIL SPAM DETECTION USING MACHINE LEARNING Lydia Song, Lauren Steimle, Xiaoxiao Xu

2 Outline
 - Introduction to Project
 - Pre-processing
 - Dimensionality Reduction
 - Brief discussion of different algorithms: K-nearest neighbors, decision tree, logistic regression, Naïve Bayes
 - Preliminary results
 - Conclusion

3 Spam Statistics
 - Spam averaged 69.9% of email traffic in February 2014
 [Chart: percentage of spam in email traffic]
 Source: https://www.securelist.com/en/analysis/204792328/Spam_report_February_2014

4 Spam vs. Ham
 - Spam = unwanted communication
 - Ham = normal communication

5 Pre-processing
 [Figure: an example spam email shown in a web browser, alongside the corresponding raw file in the data set]

6 Pre-processing
 1. Remove meaningless words
 2. Create a "bag of words" used in the data set
 3. Combine similar words
 4. Create a feature matrix
 [Figure: feature matrix with rows Email 1 … Email m and bag-of-words columns such as "history", "last", …, "service"]

7 Pre-processing Example
 Original email: "Your history shows that your last order is ready for refilling. Thank you, Sam Mcfarland, Customer Services"
 tokens = ['your', 'history', 'shows', 'that', 'your', 'last', 'order', 'is', 'ready', 'for', 'refilling', 'thank', 'you', 'sam', 'mcfarland', 'customer', 'services']
 filtered_words = ['history', 'last', 'order', 'ready', 'refilling', 'thank', 'sam', 'mcfarland', 'customer', 'services']
 bag of words = ['history', 'last', 'order', 'ready', 'refill', 'thank', 'sam', 'mcfarland', 'custom', 'service']
 [Figure: feature matrix with rows Email 1 … Email m and stemmed columns "histori", "last", …, "servi"]
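
The three cleaning steps on this slide can be sketched in a few lines. This is a minimal illustration only: the stopword list and the suffix-stripping "stemmer" below are made-up stand-ins (a real pipeline would use something like NLTK's stopword corpus and Porter stemmer, which produce stems like "histori" and "servic").

```python
import re

# Hypothetical tiny stopword list, for illustration only.
STOPWORDS = {"your", "shows", "that", "is", "for", "you"}

def crude_stem(word):
    # Strip a few common suffixes -- a very rough stand-in for Porter stemming.
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenize
    filtered = [t for t in tokens if t not in STOPWORDS]  # remove meaningless words
    return [crude_stem(t) for t in filtered]              # combine similar words

preprocess("Your history shows that your last order is ready")
# -> ['history', 'last', 'ord', 'ready']
```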

8 Dimensionality Growth
 - Each additional email adds roughly 100-150 new features

9 Dimensionality Reduction
 - Require that a word appear in at least x% of all emails for it to be kept as a feature
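
This cutoff amounts to filtering the vocabulary by document frequency. A small sketch (the email data here is made up for illustration):

```python
def frequent_words(emails, min_fraction):
    """Keep only words appearing in at least min_fraction of all emails."""
    doc_freq = {}
    for words in emails:
        for w in set(words):  # count each word at most once per email
            doc_freq[w] = doc_freq.get(w, 0) + 1
    cutoff = min_fraction * len(emails)
    return {w for w, n in doc_freq.items() if n >= cutoff}

emails = [["free", "offer", "money"],
          ["meeting", "money"],
          ["free", "money", "money"]]
frequent_words(emails, 0.5)  # words in at least half the emails
# -> {'free', 'money'}
```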

10 Dimensionality Reduction - Hashing Trick
 - Before hashing: 70 x 9403 feature matrix
 - After hashing: 70 x 1024 feature matrix
 [Figure: hash table mapping strings to integer bucket indices]
 Source: Jorge Stolfi, http://en.wikipedia.org/wiki/File:Hash_table_5_0_1_1_1_1_1_LL.svg#filelinks
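
The hashing trick maps each word to one of a fixed number of buckets, so the feature dimension is capped regardless of vocabulary size (at the cost of occasional collisions). A sketch, using 8 buckets rather than the slide's 1024 for readability:

```python
import hashlib

def hashed_features(words, n_buckets=1024):
    """Bag-of-words vector of fixed length n_buckets via the hashing trick."""
    vec = [0] * n_buckets
    for w in words:
        # md5 gives a hash that is stable across runs (Python's built-in
        # hash() is salted per process, so it is unsuitable here).
        idx = int(hashlib.md5(w.encode()).hexdigest(), 16) % n_buckets
        vec[idx] += 1
    return vec

v = hashed_features(["free", "money", "free"], n_buckets=8)
# len(v) is always 8; the total word count (3) is preserved.
```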

11 Outline
 - Introduction to Project
 - Pre-processing
 - Dimensionality Reduction
 - Brief discussion of different algorithms: K-nearest neighbors, decision tree, logistic regression, Naïve Bayes
 - Preliminary results
 - Conclusion

12 K-Nearest Neighbors
 - Goal: classify an unlabeled sample into one of C classes
 - Idea: to determine the label of an unknown sample x, look at x's k nearest neighbors in the training set
 Image from MIT OpenCourseWare
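
The idea above fits in a few lines: compute distances to every training sample, take the k closest, and vote. The points and labels below are made-up toy data:

```python
from collections import Counter
import math

def knn_classify(x, samples, labels, k=3):
    """Label x by majority vote among its k nearest training samples."""
    dists = sorted((math.dist(x, s), lab) for s, lab in zip(samples, labels))
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]

samples = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels  = ["ham", "ham", "ham", "spam", "spam", "spam"]
knn_classify((4.5, 5.0), samples, labels, k=3)  # -> "spam"
```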

13 Decision Tree
 - Convert training data into a tree structure
 - Root node: the first decision node
 - Decision node: an if-then decision based on features of the training sample
 - Leaf node: contains a class label
 Image from MIT OpenCourseWare
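
To make the node types concrete, here is a hypothetical hand-built tree (not one learned from the project's data): decision nodes test a word count against a threshold, and leaves hold class labels.

```python
# Hand-built illustrative tree: the root is the first decision node,
# inner dicts are decision nodes, and strings are leaf nodes (labels).
tree = {
    "feature": "free", "threshold": 1,             # root / first decision node
    "yes": {"feature": "meeting", "threshold": 0,  # decision node
            "yes": "ham", "no": "spam"},           # leaf nodes
    "no": "ham",                                   # leaf node
}

def classify(node, word_counts):
    # Walk decision nodes (dicts) until a leaf (string) is reached.
    while isinstance(node, dict):
        value = word_counts.get(node["feature"], 0)
        node = node["yes"] if value > node["threshold"] else node["no"]
    return node

classify(tree, {"free": 3})  # -> "spam"
```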

14 Logistic Regression
 - "Regression" over training examples
 - Transform the continuous output to a prediction of 1 or 0 using the standard logistic function g(z) = 1 / (1 + e^(-z))
 - Predict spam if g(θᵀx) ≥ 0.5, i.e. if θᵀx ≥ 0
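
The prediction rule can be sketched directly. The weight vector theta and feature vector x below are made-up illustrative numbers, not the project's fitted model:

```python
import math

def sigmoid(z):
    # The standard logistic function g(z) = 1 / (1 + e^(-z)).
    return 1.0 / (1.0 + math.exp(-z))

def predict_spam(theta, x):
    z = sum(t_i * x_i for t_i, x_i in zip(theta, x))  # theta . x
    return 1 if sigmoid(z) >= 0.5 else 0              # spam iff g(theta.x) >= 0.5

predict_spam([-4.0, 2.0, 1.0], [1, 2, 1])  # z = 1, g(1) ~ 0.73 -> predicts 1
```

Note that since g is monotone and g(0) = 0.5, thresholding g(θᵀx) at 0.5 is the same as thresholding θᵀx at 0.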

15 Naïve Bayes
 - Use Bayes' theorem: P(H | e) = P(e | H) P(H) / P(e)
 - Hypothesis (H): spam or not spam
 - Event (e): a word occurs
 - Example: the probability that an email is spam given that the word "free" appears in it
 - "Naïve": assume the feature values are independent of each other
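
Working the slide's "free" example through Bayes' theorem with made-up illustrative probabilities (not estimated from the project's data):

```python
# P(spam | "free") = P("free" | spam) * P(spam) / P("free"),
# where P("free") is expanded over both hypotheses (total probability).
p_spam = 0.7             # prior: fraction of emails that are spam (illustrative)
p_free_given_spam = 0.3  # "free" appears in 30% of spam (illustrative)
p_free_given_ham = 0.01  # "free" appears in 1% of ham (illustrative)

p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)
p_spam_given_free = p_free_given_spam * p_spam / p_free
# With these numbers, seeing "free" pushes the spam probability from
# the prior 0.7 up to about 0.986.
```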

16 Outline
 - Introduction to Project
 - Pre-processing
 - Dimensionality Reduction
 - Brief discussion of different algorithms: K-nearest neighbors, decision tree, logistic regression, Naïve Bayes
 - Preliminary results
 - Conclusion

17 Preliminary Results
 - 250 emails in the training set, 50 in the testing set
 - Use 15% as the "percentage of emails" cutoff
 - Performance measures:
   - Accuracy: % of predictions that were correct
   - Recall: % of spam emails that were predicted correctly
   - Precision: % of emails classified as spam that were actually spam
   - F-score: harmonic mean of precision and recall
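
The four measures above can be computed directly from the label lists. The actual/predicted labels below are a made-up toy example, not the project's results:

```python
def scores(actual, predicted, positive="spam"):
    """Accuracy, precision, recall, and F-score for the positive class."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f_score

actual    = ["spam", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham", "ham", "spam", "spam"]
scores(actual, predicted)  # -> (0.6, 0.666..., 0.666..., 0.666...)
```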

18 "Percentage of Emails" Performance
 [Charts: performance versus the "percentage of emails" cutoff, for linear regression and logistic regression]

19 Preliminary Results
 [Results table not captured in the transcript]

20 Next Steps
 - Implement SVM: Matlab vs. Weka
 - Hashing trick: try different numbers of buckets
 - Regularization

21 Thank you! Any questions?

