Published by Jody Shields. Modified over 8 years ago.
1
Project Presentation
B92902041 王 立
B92902051 陳俊甫
B92902092 張又仁
B92902095 李佳穎
2
Outline
Spam filtering technology
Personalization issues
Statistics
Our approach
3
Spam filtering technology
1. Basic structured text filters
2. Whitelist/verification filters
3. Distributed blacklists (Pyzor)
4. Rule-based rankings (SpamAssassin)
5. Bayesian word distribution filters
6. Bayesian trigram filters
4
Table 1. Quantitative accuracy of spam filtering techniques
(each cell: correctly identified vs. incorrectly identified)

Technique       Good corpus            Spam corpus
"The Truth"     1851 vs. 0             1916 vs. 0
Trigram model   1849 vs. 2             1774 vs. 142
Word model      1847 vs. 4             1819 vs. 97
SpamAssassin    1846 vs. 5             1558 vs. 358
Pyzor           1847 vs. 0 (4 err)     943 vs. 971 (2 err)
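The counts in Table 1 are easier to compare as rates. The following is a minimal sketch: the dictionary layout and the function name `rates` are illustrative and not part of the original project, and Pyzor's "err" messages are simply left out of the denominators, which is an assumption made here.

```python
# Counts copied from Table 1 as (correctly identified, incorrectly identified).
# Pyzor's "err" messages are not counted (an assumption made for this sketch).
TABLE_1 = {
    "Trigram model": {"good": (1849, 2), "spam": (1774, 142)},
    "Word model": {"good": (1847, 4), "spam": (1819, 97)},
    "SpamAssassin": {"good": (1846, 5), "spam": (1558, 358)},
    "Pyzor": {"good": (1847, 0), "spam": (943, 971)},
}


def rates(counts):
    """Return (false-positive rate, fraction of spam caught) for one technique."""
    good_ok, good_wrong = counts["good"]
    spam_ok, spam_wrong = counts["spam"]
    false_positive_rate = good_wrong / (good_ok + good_wrong)
    spam_caught_rate = spam_ok / (spam_ok + spam_wrong)
    return false_positive_rate, spam_caught_rate


for name, counts in TABLE_1.items():
    fp, caught = rates(counts)
    print(f"{name:14s} false positives {fp:.2%}  spam caught {caught:.2%}")
```

Read this way, the two Bayesian models keep false positives low while catching most of the spam corpus, which is the contrast the table is meant to show.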
5
Bayesian filtering
The first two use the Bayesian method.
Pantel and Lin's Bayesian filter (1998): 92% of spam caught, 1.16% false positives
Bayesian filtering was not widely used at the beginning. Why?
6
Bayesian filtering (cont.)
Someone we found later: Jonathan Zdziarski
The main problem with previous work was a false-positive rate that was too high.
Bayesian filtering in 2002: 99.5% of spam caught, 0.03% false positives
Why so different?
7
Possible reasons
1. Too little training data: 160 spam and 466 non-spam mails.
2. Message headers were ignored.
3. Tokens were stemmed, which reduces words in a harmful way.
4. Using all tokens works worse than using only the 15 most significant.
5. No bias against false positives.
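Reason 5 can be made concrete with a per-token probability that is biased toward legitimate mail. The sketch below follows the scheme described in "A Plan for Spam" (listed in the references) as we understand it: occurrences in good mail are counted twice, rare tokens are skipped, and the result is clamped. The function name and argument names are our own, and the constants (2, 5, 0.01, 0.99) should be treated as assumptions rather than values used in this project.

```python
def token_spam_probability(good_count, bad_count, n_good, n_bad):
    """Estimate P(spam | token); return None when the token is too rare.

    good_count / bad_count: occurrences of the token in good and spam mail.
    n_good / n_bad: number of good and spam messages (both assumed > 0).
    """
    g = 2 * good_count  # bias against false positives: good mail counts double
    b = bad_count
    if g + b < 5:  # too few observations to trust the estimate
        return None
    good_freq = min(1.0, g / n_good)
    bad_freq = min(1.0, b / n_bad)
    p = bad_freq / (good_freq + bad_freq)
    return max(0.01, min(0.99, p))  # never fully certain in either direction
```

For example, a token seen 5 times in 100 good mails and 10 times in 100 spam mails scores 0.5 with the bias, where an unbiased estimate would give about 0.67.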
8
Personalization issues
Advantages of personalization:
1. Makes filters more effective
2. Lets users decide what their own spam filter blocks
3. Makes it hard for spammers to tune their mail against the filter
9
Statistics
The fifteen most interesting words in this spam, with their probabilities, are:
madam            0.99
promotion        0.99
republic         0.99
shortest         0.047225013
mandatory        0.047225013
standardization  0.07347802
sorry            0.08221981
supported        0.09019077
people's         0.09019077
enter            0.9075001
quality          0.8921298
organization     0.12454646
investment       0.8568143
very             0.14758544
valuable         0.82347786
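The fifteen probabilities above can be combined into one score for the whole message. The sketch below uses the combination rule from "A Plan for Spam" (listed in the references): the product of the probabilities divided by that product plus the product of their complements. The function names are our own, and "most interesting" is taken to mean "farthest from 0.5", which is an assumption consistent with that essay.

```python
from math import prod  # requires Python 3.8 or later


def most_interesting(probabilities, n=15):
    """Keep the n probabilities that lie farthest from the neutral value 0.5."""
    return sorted(probabilities, key=lambda p: abs(p - 0.5), reverse=True)[:n]


def combine(probabilities):
    """Combine per-token spam probabilities into one message-level score."""
    spam = prod(probabilities)
    ham = prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)


# Probabilities copied from the slide, in the order listed.
SLIDE_PROBABILITIES = [
    0.99, 0.99, 0.99, 0.047225013, 0.047225013,
    0.07347802, 0.08221981, 0.09019077, 0.09019077, 0.9075001,
    0.8921298, 0.12454646, 0.8568143, 0.14758544, 0.82347786,
]

score = combine(most_interesting(SLIDE_PROBABILITIES))
print(f"combined spam score: {score:.4f}")
```

For this message the score comes out at roughly 0.90, so the three strongly spammy words outweigh the larger number of mildly innocent ones.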
10
Our approach
Data set
Sparse format
Machine learning (training)
Machine learning (testing)
11
Data set
Source: http://iit.demokritos.gr/skel/i-config/downloads/
Ling-spam
PU1
PU123
Enron-spam
12
Ling-spam
Collected from a mailing list ("Ling-spam")
481 spam messages and 2412 non-spam messages
Topics of the legitimate mails are similar to one another.
May be good for training, but does not generalize well enough.
4 versions of the corpus:
  with or without a lemmatiser
  with or without a stop-list
13
Example
Subject: want best economical hunt vacation life ?
want best hunt camp vacation life, felton 's hunt camp wild wonderful west virginium. $ 50. 0 per day pay room three home cook meal ( pack lunch want stay wood noon ) cozy accomodation. reserve space. follow season book 1998 : buck season - nov. 23 - dec. 5 doe season - announce ( please call ) muzzel loader ( deer ) - dec. 14 - dec. 19 archery ( deer ) - oct. 17 - dec. 31 turkey sesson - oct. 24 - nov. 14 e - mail us 110734. 2622 @ compuserve. com
14
Features
"Words" as features
  A word is a sequence of letters, digits, and some symbols
Only the subject and body fields are considered
CJK text is not supported for now
Features are collected from spam messages only
The feature set is unlimited in size
Only features that appear often enough are used
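The feature rules above can be sketched as a tokenizer plus a frequency cut-off. Everything specific here is an assumption for illustration: the slide does not say which symbols are allowed (apostrophe, dollar sign, and hyphen are used below), what "often enough" means (`min_count=5` is a placeholder), whether counts are per occurrence or per message, or how feature indices are assigned.

```python
import re
from collections import Counter

# Letters, digits, and a few symbols; the exact symbol set is an assumption.
# ASCII only, so CJK text produces no tokens.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text):
    """Split subject-plus-body text into lower-cased word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def build_vocabulary(spam_messages, min_count=5):
    """Map each token seen at least min_count times in spam to an index.

    Indices follow descending frequency (an illustrative choice).
    """
    counts = Counter()
    for message in spam_messages:
        counts.update(tokenize(message))
    frequent = [tok for tok, c in counts.most_common() if c >= min_count]
    return {tok: index for index, tok in enumerate(frequent)}
```

Because the vocabulary has no size cap, the threshold alone controls how many features survive.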
15
Example of features
104 please
104 free
103 our
95 mail
91 address
86 send
81 one
80 information
77 us
77 list
74 receive
74 name
73 money
…
Collected from the spam messages in the lemm_stop section
16
Sparse format
Some results from lemm_stop/part1:
0, 2:1, 3:1, 4:1, 5:1, 6:1, 10:1, 12:1, 15:1, 16:1, 20:1, …
0, 0:1, 4:1, 5:1, 6:1, 7:1, 8:1, 12:1, 16:1, 20:1, 22:1, …
0, 0:1, 4:1, 5:1, 7:1, 8:1, 11:1, 13:1, 25:1, 41:1, 53:1, …
0, 0:1, 4:1, 5:1, 6:1, 8:1, 9:1, 11:1, 12:1, 13:1, 14:1, …
1, 0:1, 3:1, 6:1, 10:1, 17:1, 18:1, 23:1, 26:1, 28:1, …
1, 3:1, 4:1, 5:1, 6:1, 8:1, 9:1, 11:1, 13:1, 14:1, 15:1, …
1, 0:1, 1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, …
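Each line above appears to hold a class label followed by `index:1` pairs for the features present in one message. The sketch below writes and reads that layout under this reading. The helper names are hypothetical, and the slide does not say which label value stands for spam, so the label is passed through unchanged.

```python
def to_sparse_line(label, tokens, vocabulary):
    """Encode one message as 'label, i:1, j:1, ...' with sorted feature indices.

    vocabulary maps token -> feature index; unknown tokens are ignored and
    repeated tokens are recorded once (binary presence features).
    """
    indices = sorted({vocabulary[t] for t in tokens if t in vocabulary})
    return ", ".join([str(label)] + [f"{i}:1" for i in indices])


def parse_sparse_line(line):
    """Decode a sparse line back into (label, set of active feature indices)."""
    fields = [field.strip() for field in line.split(",")]
    label = int(fields[0])
    active = {int(field.split(":")[0]) for field in fields[1:] if field}
    return label, active
```

Storing only the active indices keeps each record short even when the feature set is large, since most words are absent from any single message.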
17
Training methods
1. Naïve Bayes
2. k-NN, with k = 3 or less
3. CART decision tree
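As an illustration of the first method only, here is a minimal Bernoulli naïve Bayes classifier over binary presence features such as those in the sparse format. It is a sketch under our own assumptions (add-one smoothing, labels 0 and 1, function names `train_naive_bayes` and `predict`), not the implementation used in the project.

```python
import math


def train_naive_bayes(examples, n_features):
    """Fit a Bernoulli naive Bayes model.

    examples: list of (label, set of active feature indices), label in {0, 1}.
    Returns {label: (log prior, [P(feature present | label), ...])}.
    """
    doc_counts = {0: 0, 1: 0}
    feature_counts = {0: [0] * n_features, 1: [0] * n_features}
    for label, active in examples:
        doc_counts[label] += 1
        for i in active:
            feature_counts[label][i] += 1
    total = len(examples)
    model = {}
    for label in (0, 1):
        # Add-one (Laplace) smoothing keeps every probability strictly in (0, 1).
        log_prior = math.log((doc_counts[label] + 1) / (total + 2))
        probs = [(feature_counts[label][i] + 1) / (doc_counts[label] + 2)
                 for i in range(n_features)]
        model[label] = (log_prior, probs)
    return model


def predict(model, active):
    """Return the label with the highest log posterior for one message."""
    scores = {}
    for label, (log_prior, probs) in model.items():
        score = log_prior
        for i, p in enumerate(probs):
            score += math.log(p) if i in active else math.log(1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)
```

Working in log space avoids numeric underflow when the number of features is large.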
18
Training and testing
Ling-spam is split into 10 parts
Use 9 parts for training
Use 1 part for testing
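The 9-to-1 split can be sketched as follows. The slide does not say whether the held-out part is rotated, so rotating through all 10 parts (10-fold cross-validation) is an assumption here, and the generator name `leave_one_part_out` is hypothetical.

```python
def leave_one_part_out(parts):
    """Yield (train, test) pairs, holding out each part once for testing.

    parts: list of lists of examples, e.g. the 10 parts of Ling-spam.
    """
    for held_out in range(len(parts)):
        train = [example
                 for i, part in enumerate(parts) if i != held_out
                 for example in part]
        test = list(parts[held_out])
        yield train, test
```

Taking only the first pair from the generator gives a single 9-to-1 split, if no rotation is wanted.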
19
References
Spam filtering technology: http://www-128.ibm.com/developerworks/linux/library/l-spamf.html
Better Bayesian Filtering: http://www.paulgraham.com/better.html
A Plan for Spam: http://www.paulgraham.com/spam.html