Published by River Amor. Modified over 10 years ago.
1
Machine Learning Basics with Applications to Email Spam Detection. UGR Project - Haoyu Li, Brittany Edwards, Wei Zhang, under Xiaoxiao Xu and Arye Nehorai
2
General background information about the process of machine learning
3
The process of email detection: motivation of this project; pre-processing of data; classifier models; evaluation of classifiers
4
Motivation of this project: spam email is a nuisance for every personal email account. 60% of January 2004 emails were spam. Fraud and phishing. Spam vs. ham email.
5
Our Goal
6
Spam Email example
7
Ham Email example
8
The process of email detection: motivation of this project; pre-processing of data; classifier models; evaluation of classifiers
9
Pre-processing of data: convert capital letters to lowercase; remove numbers and extra white space; remove punctuation; remove stop-words; delete terms with length greater than 20.
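The pre-processing steps listed on this slide could be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `preprocess`, the tiny `STOP_WORDS` set, and the choice to replace punctuation with spaces are all assumptions made for the example.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; the slides do not give one).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "you", "for"}
MAX_TERM_LENGTH = 20  # terms longer than this are deleted, as on the slide

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Return the list of cleaned terms for one email body."""
    text = text.lower()                      # convert capital letters to lowercase
    text = re.sub(r"\d+", " ", text)         # remove numbers
    text = text.translate(_PUNCT_TO_SPACE)   # remove punctuation
    terms = text.split()                     # splitting also drops extra white space
    # Remove stop-words and delete terms with length greater than 20.
    return [t for t in terms if t not in STOP_WORDS and len(t) <= MAX_TERM_LENGTH]
```

For example, `preprocess("WIN $1000 NOW!!!  Click the link to claim.")` yields `["win", "now", "click", "link", "claim"]`.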
10
Pre-processing of data: original email
11
Pre-processing of data: after pre-processing
12
Pre-processing of data: extract terms
13
Pre-processing of data: reduce terms. Keep word length < 20.
14
The process of email detection: motivation of this project; pre-processing of data; classifier models; evaluation of classifiers
15
Different classification methods: K Nearest Neighbor (KNN); Naive Bayes classifier; logistic regression; decision tree analysis
16
What is K Nearest Neighbor? Use the k "closest" samples (nearest neighbors) to perform classification.
17
What is K Nearest Neighbor
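As a rough illustration of the idea, a k-nearest-neighbor vote over term-count vectors might look like the sketch below. The Euclidean distance, the default `k=3`, and the function names are assumptions for the example; the slides do not say which distance measure or value of k the project used.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label `query` by majority vote among its k closest training samples."""
    # Pair every training sample's distance to the query with its label,
    # then sort so that the closest samples come first.
    distances = sorted(
        (euclidean(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]
```

Here each vector would hold the counts of the terms in the term list for one email, and the labels would be "spam" or "ham".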
18
Initial outcome and strategies for improvement: KNN accuracy was ~64%, which is very low. The KNN classifier does not fit our project. The term list is still too large. Try a different classification method and see whether its evaluation results are better than the KNN results. Continue to reduce the size of the term list by removing terms that are not meaningful.
19
Steps for improvement: removed sparsity; reduced the length threshold; created a hashtable; used an alternative classifier, the Naive Bayes classifier
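The slides do not define "remove sparsity". One common reading is to drop terms that occur in only a small fraction of the emails, which shrinks the term list. The sketch below assumes that reading; the function name `remove_sparse_terms` and the `min_doc_fraction` threshold are hypothetical.

```python
from collections import Counter


def remove_sparse_terms(documents, min_doc_fraction=0.05):
    """Drop terms that appear in fewer than `min_doc_fraction` of the documents.

    `documents` is a list of term lists (one list per email).
    Returns the filtered documents and the set of terms that were kept.
    """
    # Document frequency: in how many emails does each term occur at least once?
    doc_freq = Counter()
    for terms in documents:
        doc_freq.update(set(terms))

    min_docs = min_doc_fraction * len(documents)
    kept = {term for term, n in doc_freq.items() if n >= min_docs}
    filtered = [[t for t in terms if t in kept] for terms in documents]
    return filtered, kept
```

A higher `min_doc_fraction` gives a shorter term list at the cost of discarding rarer terms.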
20
Hashtable: calculate a hash key for each term in the term list. When a collision occurs, use separate chaining.
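A hash table with separate chaining can be sketched as below: each bucket holds a small list of entries, so terms whose hash keys collide simply share a bucket. The class name, the polynomial string hash, and the bucket count are illustrative assumptions, not details taken from the project.

```python
class ChainedHashTable:
    """Maps terms to counts, resolving collisions by separate chaining."""

    def __init__(self, num_buckets=8):
        # Each bucket is a chain (list) of [term, count] entries.
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, term):
        # Simple polynomial string hash (an illustrative choice).
        h = 0
        for ch in term:
            h = (h * 31 + ord(ch)) % len(self.buckets)
        return h

    def add(self, term, count=1):
        """Increase the count stored for `term`, inserting it if needed."""
        bucket = self.buckets[self._index(term)]
        for entry in bucket:
            if entry[0] == term:
                entry[1] += count
                return
        # Not found in the chain: append a new entry (this handles collisions).
        bucket.append([term, count])

    def get(self, term, default=0):
        """Return the count stored for `term`, or `default` if it is absent."""
        for key, value in self.buckets[self._index(term)]:
            if key == term:
                return value
        return default
```

In Python the built-in `dict` already does this job; the explicit class is only meant to make the chaining step visible.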
21
Naive Bayes classifier
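For reference, a Naive Bayes classifier picks the class with the highest prior probability times the product of per-term likelihoods, treating terms as independent given the class. The sketch below assumes the multinomial variant with add-one (Laplace) smoothing; the slides do not say which variant the project used, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Collect the counts needed for a multinomial Naive Bayes model."""
    class_docs = Counter(labels)          # number of emails per class
    term_counts = defaultdict(Counter)    # per-class term frequencies
    vocabulary = set()
    for terms, label in zip(documents, labels):
        term_counts[label].update(terms)
        vocabulary.update(terms)
    return class_docs, term_counts, vocabulary


def predict_naive_bayes(model, terms):
    """Return the class with the highest log-posterior for `terms`."""
    class_docs, term_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        score = math.log(n_docs / total_docs)            # log prior
        total_terms = sum(term_counts[label].values())
        for term in terms:
            if term not in vocabulary:
                continue                                 # skip unseen terms
            # Add-one smoothing avoids zero probabilities.
            likelihood = (term_counts[label][term] + 1) / (total_terms + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Working in log space avoids numerical underflow when an email contains many terms.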
22
Secondary results: correctness increased from 62% to 82.36%
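The slides do not state how correctness was measured. Assuming it means the fraction of test emails that were labeled correctly (plain accuracy), it could be computed as in this small sketch; the function name `accuracy` is hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label matches the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

With a spam rate as high as the one cited earlier in the deck, accuracy alone can be misleading, so per-class error rates are worth checking alongside it.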
23
Suggestions for further improvement: revise pre-processing; apply additional classifiers
24
Thank you. Questions?