Machine Learning Basics with Applications to Email Spam Detection. UGR Project - Haoyu Li, Brittany Edwards, Wei Zhang, under Xiaoxiao Xu and Arye Nehorai.

1 Machine Learning Basics with Applications to Email Spam Detection. UGR Project - Haoyu Li, Brittany Edwards, Wei Zhang, under Xiaoxiao Xu and Arye Nehorai

2 General background information about the process of machine learning

3 The process of email detection
- Motivation of this project
- Pre-processing of data
- Classifier models
- Evaluation of classifiers

4 Motivation of this project
- Spam email is a nuisance for every personal email account
- 60% of the emails sent in January 2004 were spam
- Fraud and phishing
- Spam vs. ham email

5 Our Goal

6 Spam Email example

7 Ham Email example

8 The process of email detection
- Motivation of this project
- Pre-processing of data
- Classifier models
- Evaluation of classifiers

9 Pre-processing of data
- Convert capital letters to lowercase
- Remove numbers and extra white space
- Remove punctuation
- Remove stop-words
- Delete terms with length greater than 20
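The pre-processing steps listed on this slide can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the function name `preprocess`, the tiny `STOP_WORDS` set, and the choice to replace punctuation with spaces are all assumptions, since the slides do not say which tools or stop-word list were used.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; the slides do not say
# which stop-word list the project used.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "is", "in", "for", "you", "your"}
MAX_TERM_LENGTH = 20  # terms longer than this are deleted, as on the slide

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Return the list of cleaned terms for one email body."""
    text = text.lower()                     # capital letters -> lowercase
    text = re.sub(r"\d+", " ", text)        # remove numbers
    text = text.translate(PUNCT_TO_SPACE)   # remove punctuation
    terms = text.split()                    # split() also collapses extra white space
    return [t for t in terms
            if t not in STOP_WORDS and len(t) <= MAX_TERM_LENGTH]


if __name__ == "__main__":
    print(preprocess("WIN $1000 NOW!!!  Click the link, to claim your prize."))
```

Replacing punctuation with a space rather than deleting it keeps words such as "link,to" from being glued together; whether the original project did the same is not stated.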

10 Pre-processing of data Original Email

11 Pre-processing of data After pre-processing

12 Pre-processing of data Extract Terms
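One common way to extract terms is to build a term list over all emails and then describe each email by how often each term occurs in it. The sketch below assumes that representation; the slides do not show the actual data structure, and the names `build_term_list` and `to_count_vector` are hypothetical.

```python
from collections import Counter


def build_term_list(documents):
    """Return the sorted list of distinct terms over all tokenized emails."""
    return sorted({term for doc in documents for term in doc})


def to_count_vector(doc, term_list):
    """Return one count per entry of term_list for a single tokenized email."""
    counts = Counter(doc)
    return [counts[term] for term in term_list]


if __name__ == "__main__":
    emails = [["win", "prize", "win"], ["meeting", "prize"]]
    terms = build_term_list(emails)
    print(terms)
    print([to_count_vector(doc, terms) for doc in emails])
```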

13 Pre-processing of data Reduce Terms: keep word length < 20

14 The process of email detection
- Motivation of this project
- Pre-processing of data
- Classifier models
- Evaluation of classifiers

15 Different classification methods
- K Nearest Neighbor (KNN)
- Naive Bayes classifier
- Logistic regression
- Decision tree analysis

16 What is K Nearest Neighbor? Use the k "closest" samples (nearest neighbors) to perform classification.
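A minimal sketch of the idea, assuming each email is already a numeric vector (for example term counts) and that "closest" means smallest Euclidean distance. The slides do not state the distance measure or the value of k the project used, so both are assumptions here, as are the function names.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training samples."""
    distances = sorted(
        (euclidean(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    vectors = [[2, 0], [3, 1], [0, 2], [0, 3]]
    labels = ["spam", "spam", "ham", "ham"]
    print(knn_predict(vectors, labels, [2, 1], k=3))
```

With two classes, an odd k avoids tied votes.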

17 What is K Nearest Neighbor

18 Initial outcome and strategies for improvement
- KNN accuracy was ~64%, which is very low
- The KNN classifier does not fit our project
- The term list is still too large
- Try a different classification method and see whether its evaluation results are better than the KNN results
- Continue to reduce the size of the term list by removing terms that are not meaningful

19 Steps for improvement
- Removed sparsity
- Reduced the length threshold
- Created a hashtable
- Used an alternative classifier: the Naive Bayes classifier
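The slides do not define how sparsity was removed. One plausible reading, sketched below purely as an assumption, is to drop terms that appear in only a small fraction of the emails. The function name `remove_sparse_terms` and the `min_doc_fraction` parameter are hypothetical.

```python
from collections import Counter


def remove_sparse_terms(documents, min_doc_fraction=0.05):
    """Drop terms that occur in fewer than min_doc_fraction of the emails.

    Returns the filtered emails and the sorted list of kept terms.
    """
    doc_count = len(documents)
    doc_frequency = Counter()
    for doc in documents:
        doc_frequency.update(set(doc))  # count each term once per email
    kept = {term for term, n in doc_frequency.items()
            if n / doc_count >= min_doc_fraction}
    filtered = [[term for term in doc if term in kept] for doc in documents]
    return filtered, sorted(kept)


if __name__ == "__main__":
    emails = [["win", "prize"], ["win", "cash"], ["win", "meeting"], ["meeting", "notes"]]
    print(remove_sparse_terms(emails, min_doc_fraction=0.5))
```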

20 Hashtable
- Calculate a hash key for each term in the term list
- When a collision occurs, resolve it with separate chaining
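A hashtable with separate chaining keeps a list (a chain) in every bucket, so terms whose hash keys collide simply share a bucket. The sketch below stores a count per term; the class name, the polynomial hash function, and the bucket count are assumptions, because the slides do not give those details.

```python
class ChainedHashTable:
    """Term -> count table that resolves collisions by separate chaining."""

    def __init__(self, num_buckets=8):
        # Each bucket is a chain: a list of [term, count] entries.
        self.buckets = [[] for _ in range(num_buckets)]

    def _hash_key(self, term):
        # Simple polynomial string hash (an assumption, not the project's hash).
        h = 0
        for ch in term:
            h = (h * 31 + ord(ch)) % len(self.buckets)
        return h

    def increment(self, term, amount=1):
        """Add amount to the term's count, appending to the chain if it is new."""
        bucket = self.buckets[self._hash_key(term)]
        for entry in bucket:
            if entry[0] == term:
                entry[1] += amount
                return
        bucket.append([term, amount])

    def get(self, term, default=0):
        """Return the term's count, or default if the term is absent."""
        for key, value in self.buckets[self._hash_key(term)]:
            if key == term:
                return value
        return default


if __name__ == "__main__":
    table = ChainedHashTable(num_buckets=1)  # one bucket forces collisions
    for term in ["win", "prize", "win"]:
        table.increment(term)
    print(table.get("win"), table.get("prize"), table.get("missing"))
```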

21 Naive Bayes classifier
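As a rough illustration, a Naive Bayes classifier scores each class by its prior probability times the probability of each term given that class, treating the terms as independent. The sketch below uses the multinomial variant with add-one (Laplace) smoothing and log probabilities. The slides do not say which variant or smoothing the project used, so these choices and the function names are assumptions.

```python
import math
from collections import Counter


def train_naive_bayes(documents, labels):
    """Collect per-class email counts, per-class term counts, and the vocabulary."""
    class_docs = Counter(labels)
    term_counts = {label: Counter() for label in class_docs}
    for doc, label in zip(documents, labels):
        term_counts[label].update(doc)
    vocabulary = {term for doc in documents for term in doc}
    return class_docs, term_counts, vocabulary


def predict_naive_bayes(model, doc):
    """Return the class with the highest log-posterior score for a tokenized email."""
    class_docs, term_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for term in doc:
            if term not in vocabulary:
                continue  # ignore terms never seen in training
            # Add-one smoothing keeps unseen (term, class) pairs from zeroing the score.
            likelihood = (term_counts[label][term] + 1) / (total_terms + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    emails = [["win", "prize", "now"], ["win", "cash"],
              ["meeting", "tomorrow"], ["project", "meeting", "notes"]]
    labels = ["spam", "spam", "ham", "ham"]
    model = train_naive_bayes(emails, labels)
    print(predict_naive_bayes(model, ["win", "prize"]))
```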

22 Secondary results: accuracy increased from 62% to 82.36%

23 Suggestions for further improvement
- Revise pre-processing
- Apply additional classifiers

24 Thank you. Questions?

