Published by August Hancock. Modified over 9 years ago.
Group 2 R95922027 李庭閣 R95922034 孔垂玖 R95922081 許守傑 R95942129 鄭力維
Outline: Experiment setting; Feature extraction; Model training; Hybrid model; Conclusion; References
Selected online corpus: Enron. Preprocessing: removing HTML tags and factoring out important headers. The corpus has six folders, enron1 to enron6, containing 13,496 spam mails and 15,045 ham mails in total.
1. Transmitted time of the mail
2. Number of receivers
3. Existence of attachments
4. Existence of images in the mail
5. Existence of cited URLs in the mail
6. Symbols in the mail title
7. Mail body
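As a rough illustration of features 1 to 5, the sketch below reads them from a raw RFC 822 message with Python's standard `email` package. The function name `structural_features`, the dictionary keys, and the sample message are hypothetical; the slides do not say how the features were actually extracted, and counting only image MIME parts (not HTML image tags) is an assumption.

```python
import re
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)

def structural_features(raw_mail):
    """Extract simple header and body features from one raw mail (illustrative only)."""
    msg = message_from_string(raw_mail)
    # Feature 2: number of receivers listed in To and Cc.
    receivers = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    # Feature 1: hour of transmission, in the time zone stated by the sender.
    # A malformed Date header would raise here; a real pipeline should catch that.
    date = msg.get("Date")
    hour = parsedate_to_datetime(date).hour if date else None
    has_attachment = False
    has_image = False
    text_parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        if part.get_content_maintype() == "image":
            has_image = True          # Feature 4 (assumption: image MIME parts only)
        if part.get_filename():
            has_attachment = True     # Feature 3: a part that carries a file name
        elif part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            text_parts.append(payload.decode("utf-8", errors="replace"))
    body = "\n".join(text_parts)
    return {
        "hour": hour,
        "num_receivers": len(receivers),
        "has_attachment": has_attachment,
        "has_image": has_image,
        "has_url": bool(URL_RE.search(body)),  # Feature 5
    }

sample = "\n".join([
    "From: a@example.com",
    "To: b@example.com, c@example.com",
    "Date: Mon, 01 Jan 2007 03:15:00 +0000",
    "Subject: Hi!",
    "",
    "Visit http://example.com now",
])
features = structural_features(sample)
```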
Transmitted time: spam shows a non-uniform distribution. Number of receivers: spam has only a single receiver.
        Attachment   Image     URL
Spam    0.0307%      0.6816%   30.779%
Ham     7.3712%      0%        7.0521%
Symbols in the mail title:
Title absence: spam senders add titles now.
Arabic numerals: almost equal probability in spam and ham (date, ID).
Non-alphanumeric characters and punctuation marks (probability of being spam mail, feature showing rate):
  ~ ^ | * % [] ! ? =    appear more often in spam: 0.91128% in spam
  \ / ; &               appear more often in ham: 0.18216% in ham
Mail body: build the internal structure of words. We use an NLP tool called TreeTagger for word stemming. Given the stemmed words that appear in each mail, we build a sparse-format vector to represent the "semantics" of the mail.
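The sparse representation can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `toy_stem` is a hypothetical suffix-stripping stand-in for TreeTagger, and the vocabulary layout and sample mails are assumptions.

```python
from collections import Counter

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer such as TreeTagger."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_vocab(mails):
    """Assign each distinct stem an integer index in order of first appearance."""
    vocab = {}
    for mail in mails:
        for token in mail.lower().split():
            vocab.setdefault(toy_stem(token), len(vocab))
    return vocab

def sparse_vector(mail, vocab):
    """Map a mail to {stem index: count}; stems absent from the mail are omitted."""
    counts = Counter(toy_stem(t) for t in mail.lower().split())
    return {vocab[s]: c for s, c in counts.items() if s in vocab}

mails = ["buy cheap pills", "meeting rescheduled buying pills"]
vocab = build_vocab(mails)
vectors = [sparse_vector(m, vocab) for m in mails]
```

Only non-zero entries are stored, which keeps each mail's vector small even when the vocabulary is large.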
Given a bag of words (x_1, x_2, x_3, ..., x_n), Naïve Bayes is powerful for document classification.
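A minimal sketch of a bag-of-words Naïve Bayes classifier is given below. It assumes the multinomial variant with add-one (Laplace) smoothing, which the slides do not specify; the function names and the tiny training set are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate log priors and smoothed log likelihoods from tokenized mails."""
    classes = sorted(set(labels))
    vocab = {w for d in docs for w in d}
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d)
    loglik = {}
    for c in classes:
        total = sum(counts[c].values()) + len(vocab)  # add-one smoothing denominator
        loglik[c] = {w: math.log((counts[c][w] + 1) / total) for w in vocab}
    return prior, loglik

def predict_nb(model, doc):
    """Return the class with the highest posterior score; unseen words are ignored."""
    prior, loglik = model
    scores = {
        c: prior[c] + sum(loglik[c][w] for w in doc if w in loglik[c])
        for c in prior
    }
    return max(scores, key=scores.get)

docs = [
    ["cheap", "pill", "buy"],
    ["buy", "cheap", "watch"],
    ["meet", "agenda", "monday"],
    ["project", "meet", "report"],
]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)
guess_a = predict_nb(model, ["cheap", "watch", "buy"])
guess_b = predict_nb(model, ["meet", "report"])
```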
Create a word-document (mail) matrix with SRILM. For every pair of mails (columns), a similarity value can be calculated.
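The idea can be sketched in plain Python rather than SRILM. The slides do not name the similarity measure, so cosine similarity is used here as an assumption; the helper names and sample mails are hypothetical.

```python
import math

def word_document_matrix(docs):
    """Rows are words (sorted), columns are mails, entries are occurrence counts."""
    vocab = sorted({w for d in docs for w in d})
    matrix = [[d.count(w) for d in docs] for w in vocab]
    return vocab, matrix

def column(matrix, j):
    """Return the j-th column, i.e. the count vector of mail j."""
    return [row[j] for row in matrix]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [["buy", "cheap", "pill"], ["buy", "cheap", "watch"], ["meet", "agenda"]]
vocab, matrix = word_document_matrix(docs)
sim01 = cosine(column(matrix, 0), column(matrix, 1))  # mails sharing two of three words
sim02 = cosine(column(matrix, 0), column(matrix, 2))  # mails with no words in common
```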
With K = 1, the KNN classification model shows the best accuracy.
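To show how the choice of K changes the decision, here is a small KNN sketch. It assumes cosine similarity over dense count vectors and a simple majority vote among the K nearest mails; the names and toy vectors are hypothetical and unrelated to the reported experiment.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_predict(train_vectors, train_labels, query, k=1):
    """Label the query by majority vote among its k most similar training mails."""
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: cosine(train_vectors[i], query),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

train_vectors = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1]]
train_labels = ["spam", "spam", "ham"]
query = [0, 0, 1, 1]
label_k1 = knn_predict(train_vectors, train_labels, query, k=1)
label_k3 = knn_predict(train_vectors, train_labels, query, k=3)
```

In this toy case the single nearest neighbour is a ham mail, while K = 3 lets two less similar spam mails outvote it.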
Maximize the entropy and minimize the Kullback-Leibler distance between the model and the real distribution. The elements of the word-document matrix are converted to binary values {0, 1}.
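For reference, the standard conditional maximum entropy model has the exponential form below. The slides do not write out the formula; the binary feature definition f_i is an assumption chosen to match the {0, 1} matrix entries.

```latex
p_\lambda(y \mid x) = \frac{\exp\left(\sum_i \lambda_i f_i(x, y)\right)}{Z_\lambda(x)},
\qquad
Z_\lambda(x) = \sum_{y'} \exp\left(\sum_i \lambda_i f_i(x, y')\right),
\qquad
f_i(x, y) =
\begin{cases}
1 & \text{if word } i \text{ appears in mail } x \text{ and the label is } y \\
0 & \text{otherwise}
\end{cases}
```

Fitting the weights \lambda_i by maximum likelihood is equivalent to minimizing the Kullback-Leibler divergence between the empirical distribution and the model.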
Binary: use a binary value {0, 1} to indicate whether a word appears or not.
Normalized: count the occurrences of each word and divide each count by the word's maximum occurrence count.
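The two weightings can be compared with a short sketch. It assumes that "maximum occurrence count" means each word's largest count over all mails; the slides leave this open, and the function names and sample mails are hypothetical.

```python
from collections import Counter

def count_matrix(docs, vocab):
    """One row per mail, one column per vocabulary word, entries are raw counts."""
    return [[Counter(d)[w] for w in vocab] for d in docs]

def binary_features(counts):
    """1 if the word appears in the mail, 0 otherwise."""
    return [[1 if c > 0 else 0 for c in row] for row in counts]

def normalized_features(counts):
    """Divide each count by that word's maximum count over all mails (assumption)."""
    max_per_word = [max(col) for col in zip(*counts)]
    return [
        [c / m if m else 0.0 for c, m in zip(row, max_per_word)]
        for row in counts
    ]

docs = [["buy", "buy", "cheap"], ["buy", "meet", "meet", "meet", "meet"]]
vocab = ["buy", "cheap", "meet"]
counts = count_matrix(docs, vocab)
binary = binary_features(counts)
normalized = normalized_features(counts)
```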
The accuracy of the NN-based hybrid model is always the highest.
The voting model averages the classification results, which slightly improves the filter's performance. However, voting can sometimes reduce accuracy when the majority misjudges a mail. Two committees were tested:
1. KNN + Naïve Bayes + Maximum Entropy
2. Naïve Bayes + Maximum Entropy + SVM
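A committee vote can be sketched as below. The three keyword rules are hypothetical stand-ins for the trained classifiers (they are not KNN, Naïve Bayes, Maximum Entropy, or SVM), and plain majority voting is assumed as the combination rule.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most committee members."""
    return Counter(predictions).most_common(1)[0][0]

def committee_predict(classifiers, mail):
    """Ask every classifier for a label and combine the answers by majority vote."""
    return majority_vote([clf(mail) for clf in classifiers])

# Hypothetical toy classifiers standing in for trained models.
def url_rule(mail):
    return "spam" if "http" in mail else "ham"

def exclamation_rule(mail):
    return "spam" if "!" in mail else "ham"

def keyword_rule(mail):
    return "spam" if "cheap" in mail.lower() else "ham"

committee = [url_rule, exclamation_rule, keyword_rule]
vote_spam = committee_predict(committee, "Cheap pills! http://example.com")
vote_plain = committee_predict(committee, "Meeting at noon")
# A legitimate mail that two of three members misjudge: the majority is wrong.
vote_misjudged = committee_predict(committee, "Agenda attached! See http://example.com")
```

The last call illustrates the drawback noted above: when two members make the same mistake, the one correct member is outvoted.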
Seven features are shown to discriminate mail type, including:
Transmitted time and number of receivers
Attachment, image, and URL
Non-alphanumeric characters and punctuation marks
Five popular machine learning methods prove suitable for spam filtering, including Naïve Bayes, KNN, and SVM.
Two ways of combining models are tested: committee-based voting and a single neural network.
[1] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian Approach to Filtering Junk E-Mail," in Proc. AAAI 1998, Jul. 1998.
[2] A Plan for Spam: http://www.paulgraham.com/spam.html
[3] Enron Corpus: http://www.aueb.gr/users/ion/
[4] TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
[5] Maximum Entropy: http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html
[6] SRILM: http://www.speech.sri.com/projects/srilm/
[7] SVM: http://svmlight.joachims.org/