Email Spam Detection Using Machine Learning
Lydia Song, Lauren Steimle, Xiaoxiao Xu
Outline
- Introduction to project
- Pre-processing
- Dimensionality reduction
- Brief discussion of different algorithms: k-nearest neighbors, decision tree, logistic regression, Naïve Bayes
- Preliminary results
- Conclusion
Spam Statistics
Spam averaged 69.9% of email traffic in February 2014.
Source: https://www.securelist.com/en/analysis/204792328/Spam_report_February_2014
[Figure: percentage of spam in email traffic over time]
Spam vs. Ham
Spam = unwanted communication; ham = normal, legitimate communication.
Pre-processing
[Figures: an example spam email as rendered in a web browser, and the corresponding raw file in the data set]
Pre-processing
1. Remove meaningless words
2. Create a "bag of words" used in the data set
3. Combine similar words
4. Create a feature matrix
[Diagram: feature matrix with emails 1..m as rows and bag-of-words entries such as "history", "last", ..., "service" as columns]
Pre-processing Example
Original email: "Your history shows that your last order is ready for refilling. Thank you, Sam Mcfarland, Customer Services"
tokens = ['your', 'history', 'shows', 'that', 'your', 'last', 'order', 'is', 'ready', 'for', 'refilling', 'thank', 'you', 'sam', 'mcfarland', 'customer', 'services']
filtered_words = ['history', 'last', 'order', 'ready', 'refilling', 'thank', 'sam', 'mcfarland', 'customer', 'services']
bag of words = ['history', 'last', 'order', 'ready', 'refill', 'thank', 'sam', 'mcfarland', 'custom', 'service']
[Diagram: feature matrix with emails 1..m as rows and stems such as 'histori', 'last', ..., 'servi' as columns]
A code sketch of these steps follows.
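As a rough Python sketch of steps 1-4 (the slides don't name the tools used, so NLTK's English stopword list and Porter stemmer are assumptions here, and the exact stems may differ from those shown above):

```python
# Sketch of the pre-processing pipeline. Assumes NLTK is installed and
# the stopword corpus has been fetched with nltk.download('stopwords').
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess(email_text):
    # 1. Lowercase and tokenize into alphabetic words.
    tokens = re.findall(r'[a-z]+', email_text.lower())
    # 2. Remove meaningless (stop) words such as 'your', 'that', 'is'.
    filtered_words = [t for t in tokens if t not in STOP_WORDS]
    # 3. Combine similar words by stemming, e.g. 'refilling' -> 'refill'.
    return [STEMMER.stem(t) for t in filtered_words]

def feature_matrix(emails):
    # 4. Bag of words: one row per email, one column per stem (word counts).
    docs = [preprocess(e) for e in emails]
    bag = sorted({w for d in docs for w in d})
    col = {w: j for j, w in enumerate(bag)}
    X = [[0] * len(bag) for _ in docs]
    for i, d in enumerate(docs):
        for w in d:
            X[i][col[w]] += 1
    return bag, X
```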
Dimensionality Growth
Each additional email adds roughly 100-150 new words (features) to the vocabulary, so the feature matrix grows quickly with the size of the data set.
Dimensionality Reduction
Require that a word appear in at least x% of all emails before it is counted as a feature; rarer words are dropped (see the sketch below).
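A minimal sketch of that rule (the function name and the 15% default are illustrative; the slides only specify the cutoff idea):

```python
def frequent_words(docs, min_fraction=0.15):
    # Keep a word as a feature only if it appears in at least
    # min_fraction of all emails (e.g. 15%, the cutoff used later).
    n = len(docs)
    doc_freq = {}
    for d in docs:
        for w in set(d):            # count each word once per email
            doc_freq[w] = doc_freq.get(w, 0) + 1
    return sorted(w for w, c in doc_freq.items() if c / n >= min_fraction)
```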
Dimensionality Reduction: Hashing Trick
Before hashing: 70 x 9403 dimensions. After hashing: 70 x 1024 dimensions.
Each string is hashed to an integer index into a fixed-size hash table.
[Diagram: hash table mapping strings to integer indices. Source: Jorge Stolfi, http://en.wikipedia.org/wiki/File:Hash_table_5_0_1_1_1_1_1_LL.svg#filelinks]
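A sketch of the trick, assuming a generic stable hash (MD5 here; the slides don't say which hash function was used):

```python
import hashlib
import numpy as np

def hashed_vector(tokens, n_buckets=1024):
    # Map each token through a stable hash to one of n_buckets columns,
    # so the feature matrix has a fixed width (e.g. 9403 -> 1024) no
    # matter how large the vocabulary grows. Colliding words share a column.
    vec = np.zeros(n_buckets)
    for tok in tokens:
        idx = int(hashlib.md5(tok.encode('utf-8')).hexdigest(), 16) % n_buckets
        vec[idx] += 1.0
    return vec
```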
Outline
- Introduction to project
- Pre-processing
- Dimensionality reduction
- Brief discussion of different algorithms: k-nearest neighbors, decision tree, logistic regression, Naïve Bayes
- Preliminary results
- Conclusion
K-Nearest Neighbors
Goal: classify an unknown test sample into one of C classes.
Idea: to determine the label of an unknown sample x, look at the labels of x's k nearest neighbors in the training set and take a majority vote.
[Image from MIT OpenCourseWare]
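A toy illustration of the idea (scikit-learn and the tiny data here are stand-ins, not the project's actual setup):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical word-count rows (columns = stems); 1 = spam, 0 = ham.
X_train = np.array([[3, 0, 1], [0, 2, 0], [2, 1, 0], [0, 3, 1]])
y_train = np.array([1, 0, 1, 0])

# Label an unknown sample x by a majority vote among its k nearest
# training samples (k = 3 here).
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict(np.array([[2, 0, 1]])))   # -> [1], i.e. spam
```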
Decision Tree
Convert the training data into a tree structure:
- Root node: the first decision node
- Decision node: an if-then test on a feature of the sample
- Leaf node: contains a class label
[Image from MIT OpenCourseWare]
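Illustrative only (scikit-learn assumed; the feature names are the example stems from earlier, not the real vocabulary):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Same hypothetical data as the k-NN sketch: word counts, 1 = spam, 0 = ham.
X_train = np.array([[3, 0, 1], [0, 2, 0], [2, 1, 0], [0, 3, 1]])
y_train = np.array([1, 0, 1, 0])

tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X_train, y_train)
# Print the learned if-then decision nodes and leaf labels.
print(export_text(tree, feature_names=['histori', 'last', 'servic']))
```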
Logistic Regression
Perform a "regression" over the training examples, then transform the continuous output into a prediction of 1 or 0 using the standard logistic function σ(z) = 1 / (1 + e^(-z)).
Predict spam if h(x) = 1 / (1 + e^(-θᵀx)) ≥ 0.5, i.e. if θᵀx ≥ 0.
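The decision rule, sketched directly (the weights are made up for illustration; in practice θ is learned from the training set):

```python
import numpy as np

def sigmoid(z):
    # Standard logistic function: squashes theta^T x into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_spam(x, theta):
    # Predict spam (1) when h(x) = sigmoid(theta^T x) >= 0.5,
    # which is equivalent to theta^T x >= 0.
    return int(sigmoid(np.dot(theta, x)) >= 0.5)

theta = np.array([0.8, -1.2, 0.5])                # hypothetical weights
print(predict_spam(np.array([2, 0, 1]), theta))   # -> 1 (spam)
```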
Naïve Bayes
Use Bayes' theorem: P(H | e) = P(e | H) P(H) / P(e), where
- Hypothesis (H): spam or not spam
- Event (e): a word occurs
For example, the probability that an email is spam given that the word "free" appears in it.
"Naïve": assume the feature values are independent of each other, so their probabilities can simply be multiplied.
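The "free" example worked numerically (all probabilities below are illustrative, not measured from the data set):

```python
def p_spam_given_word(p_word_given_spam, p_spam, p_word_given_ham):
    # Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word),
    # where P(word) = P(word|spam)P(spam) + P(word|ham)P(ham).
    p_ham = 1.0 - p_spam
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word

# Say 'free' appears in 20% of spam and 1% of ham, and 70% of mail is spam:
print(p_spam_given_word(0.20, 0.70, 0.01))  # ~0.979
```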
Outline
- Introduction to project
- Pre-processing
- Dimensionality reduction
- Brief discussion of different algorithms: k-nearest neighbors, decision tree, logistic regression, Naïve Bayes
- Preliminary results
- Conclusion
Preliminary Results
250 emails in the training set, 50 in the testing set; 15% used as the "percentage of emails" cutoff.
Performance measures (see the sketch below):
- Accuracy: % of predictions that were correct
- Recall: % of spam emails that were predicted correctly
- Precision: % of emails classified as spam that were actually spam
- F-score: harmonic mean of precision and recall
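These measures follow from the confusion counts; a small sketch with spam = 1 as the positive class:

```python
def scores(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)   # spam caught
    fp = sum(t == 0 and p == 1 for t, p in pairs)   # ham flagged as spam
    fn = sum(t == 1 and p == 0 for t, p in pairs)   # spam missed
    tn = sum(t == 0 and p == 0 for t, p in pairs)   # ham passed through
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f_score

print(scores([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (0.6, 0.667, 0.667, 0.667)
```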
"Percentage of Emails" Performance
[Figures: classifier performance versus the "percentage of emails" cutoff, for linear regression and logistic regression]
Preliminary Results
Next Steps
- Implement SVM: MATLAB vs. Weka
- Hashing trick: try different numbers of buckets
- Regularization
Thank you! Any questions?