Slide 1: A Study of Supervised Spam Detection Applied to Eight Months of Personal E-Mail
Gordon Cormack and Thomas Lynam
Presented by Hui Fang
Slide 2: Feel free to interrupt when you have any questions or comments!
Slide 3: Detour: Some Background About Email Spam
Some slides are adapted from the Tutorial on Junk Mail Filtering by Geoff Hulten and Joshua Goodman.
Slide 4: What is Spam?
– Typical legal definition: unsolicited commercial email from someone without a pre-existing business relationship
– Definition mostly used: whatever the users think
Slide 5: Unofficial Statistics of Spam (Feb. 3 to Feb. 12)

  My Email Account    Yahoo    UIUC
  Number of Emails      250     150
  Number of Spams       214      12
  Spam Rate           85.6%      8%

Spam is inconvenient, annoying, and wasteful of computer resources. Its volume threatens to overwhelm our ability to recognize useful messages.
Slide 6: Spam Detection
Is this just text categorization? What are the special challenges?
[Slide images: examples of ham and spam]
Slide 7: Text Classification Alone Is Not Enough
Spammers now often try to obscure text, so special features are necessary:
– E.g., subject line vs. body text
– E.g., mail in the middle of the night is more likely to be spam than mail in the middle of the day
…
Slide 8: Weather Report Guy
Content in image: "Weather, Sunny, High 82, Low 81, Favorite…"
(The message content is rendered inside an image, so a text-only filter sees little to classify.)
Slide 9: Secret Decoder Ring Dude
Another spam that looks easy. Is it?
Slide 10: Secret Decoder Ring Dude
Obscured via character encoding and HTML word breaking: "Pharmacy Produc t s"
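To make the two tricks concrete, here is a minimal sketch (the obscured string is a hypothetical example, not from the paper) of HTML-entity encoding and tag/comment word breaking, plus the kind of normalization a filter might apply before tokenizing:

```python
# Hypothetical obscured spam text: entities hide letters, and an HTML
# comment plus an empty tag split the word "Products" apart.
import html
import re

obscured = "Ph&#97;rm&#97;cy Produc<!-- x -->t<span></span>s"

def normalize(raw: str) -> str:
    text = html.unescape(raw)               # decode &#97; -> 'a', etc.
    text = re.sub(r"<!--.*?-->", "", text)  # drop HTML comments used to split words
    text = re.sub(r"<[^>]+>", "", text)     # drop remaining tags
    return text

print(normalize(obscured))  # -> "Pharmacy Products"
```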
Slide 11: Diploma Guy – Word Obscuring
"Dlpmoia Pragorm Caerte a mroe prosoeprus"

Slide 12: Diploma Guy – Word Obscuring
"Dipmloa Paogrrm Cterae a more presporous"

Slide 13: Diploma Guy – Word Obscuring
"Dimlpoa Pgorram Cearte a more poosperrus"

Slide 14: Diploma Guy – Word Obscuring
"Dpmloia Pragorm Caetre a more prorpeosus"

Slide 15: Diploma Guy – Word Obscuring
"Dlpmoia Pragorm Carete a mroe prorpseous"
Slide 16: More of Diploma Guy
Diploma Guy is good at what he does.
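Each variant above scrambles the same pitch, "Diploma Program Create a more prosperous …", keeping each word's first and last letters fixed so humans can still read it while exact word matching fails. A minimal sketch (our illustration, not the spammer's actual tool) of that transposition:

```python
# Shuffle each word's interior letters, keeping the first and last
# letters in place, as in the "Diploma Guy" examples.
import random

def obscure_word(word: str) -> str:
    if len(word) <= 3:
        return word  # too short to scramble
    middle = list(word[1:-1])
    random.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

message = "Diploma Program Create a more prosperous"
print(" ".join(obscure_word(w) for w in message.split()))
# e.g. "Dpilmoa Porargm Cetrae a mroe poprseorus"
```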
Slide 17: One Solution to Spam Detection
Machine learning: learn spam versus good.
Slide 18: Naïve Bayes
Want: P(spam | w1, ..., wn), the probability a message is spam given its words w1, ..., wn.
Use Bayes Rule:
P(spam | w1,...,wn) = P(w1,...,wn | spam) * P(spam) / P(w1,...,wn)
Assume independence: the probability of each word is independent of the others, so
P(w1,...,wn | spam) = P(w1 | spam) * P(w2 | spam) * ... * P(wn | spam)
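To make the slide concrete, here is a minimal bag-of-words Naïve Bayes sketch with Laplace smoothing; it illustrates the technique, not any particular filter's implementation:

```python
# Minimal Naive Bayes spam filter over bag-of-words features.
import math
from collections import Counter

class NaiveBayes:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.msg_counts = {"spam": 0, "ham": 0}

    def train(self, words, label):
        self.msg_counts[label] += 1
        self.word_counts[label].update(words)

    def score(self, words):
        # log P(label) + sum of log P(word | label), per the independence assumption
        total = sum(self.msg_counts.values())
        vocab = len(self.word_counts["spam"] | self.word_counts["ham"])
        scores = {}
        for label in ("spam", "ham"):
            n = sum(self.word_counts[label].values())
            logp = math.log(self.msg_counts[label] / total)
            for w in words:
                # add-one (Laplace) smoothing for unseen words
                logp += math.log((self.word_counts[label][w] + 1) / (n + vocab))
            scores[label] = logp
        return scores["spam"] - scores["ham"]  # > 0 means spam is more likely

nb = NaiveBayes()
nb.train(["free", "money", "now"], "spam")
nb.train(["meeting", "at", "noon"], "ham")
print(nb.score(["free", "meeting"]))
```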
Slide 19: A Bayesian Approach to Filtering Junk E-Mail (1998 – Sahami, Dumais, Heckerman, Horvitz)
– One of the first papers on using machine learning to combat spam
– Used Naïve Bayes
– Feature space: words, phrases, domain-specific features
– Evaluation data: ~1700 messages, ~88% spam, from a volunteer's private e-mail
Slide 20: A Bayesian Approach to Filtering Junk E-Mail (1998 – Sahami, Dumais, Heckerman, Horvitz)
Hand-crafted features:
– 35 phrases: 'Free Money', 'Only $', 'be over 21'
– 20 domain-specific features (sketched in code below): domain type of sender (.edu, .com, etc.), sender name resolution (internal mail), has attachments, time received, percent of non-alphanumeric characters in subject
This is the best collection of heuristics discussed in the literature.
– Without them: spam precision 97.1%, spam recall 94.3%
– With them: spam precision 100%, spam recall 98.3%
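A hedged sketch of extracting the domain-specific features the slide lists, using Python's standard email module; this is not Sahami et al.'s code, the function name is ours, and treating any multipart message as "has attachments" is a rough proxy:

```python
import email.utils

def domain_features(msg):
    # msg: an email.message.Message instance (assumption for this sketch)
    sender = email.utils.parseaddr(msg.get("From", ""))[1]
    tld = sender.rsplit(".", 1)[-1].lower() if "." in sender else ""
    subject = msg.get("Subject", "") or ""
    non_alnum = sum(1 for ch in subject if not ch.isalnum() and not ch.isspace())
    when = email.utils.parsedate_tz(msg.get("Date", "") or "")
    return {
        "sender_tld": tld,                              # .edu, .com, etc.
        "has_attachment": msg.get_content_maintype() == "multipart",  # rough proxy
        "received_hour": when[3] if when else None,     # hour of day, 0-23
        "pct_non_alnum_subject": non_alnum / len(subject) if subject else 0.0,
    }
```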
Slide 21: A Plan for Spam (2002 – P. Graham)
– Widely cited in the open source community
– Uses a heavily tuned version of Naïve Bayes
– Feature space: words in header and body
– Feature selection: ~23,000 features, all that appeared more than 5 times (see the sketch below)
– Evaluation data: ~8000 messages from the author, ~50% spam
– Results: spam precision 100%, spam recall 99.5%
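The cutoff "appeared more than 5 times" is a simple frequency filter over the corpus vocabulary; a minimal sketch of that selection step (our illustration, not Graham's code):

```python
from collections import Counter

def select_features(tokenized_messages, min_count=5):
    # Keep only tokens seen more than min_count times across the corpus.
    counts = Counter(tok for msg in tokenized_messages for tok in msg)
    return {tok for tok, n in counts.items() if n > min_count}
```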
Slide 22: Algorithms Used in Spam Detection
– Naïve Bayes is reported to do very well
– More complex algorithms offer some gain
Slide 23: Which Algorithm is Best?
Very difficult to tell:
– No consistently used good data set
– No standard evaluation measures
This is the focus of the paper.
Slide 24: End of Detour
Slide 25: Overview of the Paper
A Study of Supervised Spam Detection Applied to Eight Months of Personal E-Mail
– Presents several evaluation measures for spam detection
– Compares the methods of six open-source spam filters
– Analyzes the experimental results
Slide 26: Problem: Supervised Spam Detection
Slide 27: Methods
The methods of six open-source spam filters:
– SpamAssassin
– Bogofilter
– CRM-114
– DSPAM
– SpamBayes
– SpamProbe
Slide 28: Data
One person's e-mail over eight months (Aug. 2003 to March 2004), stored in the order received.
49,086 messages with judgements:
– 9,038 (18.4%) ham
– 40,048 (81.6%) spam
Slide 29: Evaluation Measures (1)

                  Judgement
  Filter result   Ham   Spam
  Ham              a     b
  Spam             c     d

– a: ham correctly classified [true negative]
– b: spam misclassified as ham [false negative]
– c: ham misclassified as spam [false positive]
– d: spam correctly classified [true positive]

Accuracy: (a+d)/(a+b+c+d)
Spam recall: d/(b+d)
Spam precision: d/(d+c)
Ham misclassification rate: c/(a+c)
Spam misclassification rate: b/(b+d)
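These formulas translate directly into code; a minimal sketch with toy counts (not the paper's numbers):

```python
# Compute the slide's measures from the four confusion-matrix counts:
# a: true ham, b: missed spam, c: ham flagged as spam, d: caught spam.
def spam_metrics(a, b, c, d):
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "spam_recall": d / (b + d),
        "spam_precision": d / (d + c),
        "ham_misclassification_rate": c / (a + c),
        "spam_misclassification_rate": b / (b + d),
    }

print(spam_metrics(a=90, b=5, c=2, d=80))  # toy example counts
```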
Slide 30: Evaluation Measures (2)
– Ham/spam tradeoff curve, i.e. the ROC curve
– Single ham/spam tradeoff score: the area under the ROC curve, i.e. the probability that a random spam message will receive a higher score than a random ham message
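The probabilistic reading of the ROC area translates directly into a (quadratic-time) sketch: over all spam/ham pairs, count how often the spam message outscores the ham message, with ties counting half:

```python
# AUC as the probability that a random spam scores above a random ham.
def roc_auc(spam_scores, ham_scores):
    wins = 0.0
    for s in spam_scores:
        for h in ham_scores:
            if s > h:
                wins += 1.0
            elif s == h:
                wins += 0.5  # ties count half
    return wins / (len(spam_scores) * len(ham_scores))

print(roc_auc([0.9, 0.8, 0.4], [0.3, 0.5]))  # 0.8333...
```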
Slide 31: Evaluation Measures (3)
Ham/spam learning curve
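A hedged sketch of how such a curve can be produced under an online train-then-test protocol: classify each message in the order received, then reveal the true judgement and train on it, tracking the cumulative error rate as the filter learns. The classify/train method names are assumptions, not any filter's actual interface:

```python
def learning_curve(filter_, messages, labels):
    errors, curve = 0, []
    for i, (msg, label) in enumerate(zip(messages, labels), start=1):
        if filter_.classify(msg) != label:  # classify before seeing the label
            errors += 1
        filter_.train(msg, label)           # then learn from the judgement
        curve.append(errors / i)            # cumulative error rate so far
    return curve
```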
Slide 32: Misclassification by Genre
Not all types of ham are equal:
– Some are more likely to be misclassified
– Some are more likely to be missed if filtered
– Some are more valuable
Spam can similarly be classified by genre.
Slide 33: Conclusion
– Presents several possible evaluation measures for spam detection
– Compares several spam detection methods
– Provides analysis of the experimental results
However, it would be more interesting to compare the performance of different algorithms (e.g. NB vs. SVM).
Slide 34: The End
Thank you!