CPE/CSC 481: Knowledge-Based Systems
Spam Filtering
Dr. Franz J. Kurfess
Computer Science Department, Cal Poly
© 2003 Franz J. Kurfess

Course Overview
- Introduction
- Knowledge Representation
  - Semantic Nets, Frames, Logic
- Reasoning and Inference
  - Predicate Logic, Inference Methods, Resolution
- Reasoning with Uncertainty
  - Probability, Bayesian Decision Making
- Expert System Design
  - ES Life Cycle
- CLIPS Overview
  - Concepts, Notation, Usage
- Pattern Matching
  - Variables, Functions, Expressions, Constraints
- Expert System Implementation
  - Salience, Rete Algorithm
- Expert System Examples
- Conclusions and Outlook

Overview: Spam Filtering
- Motivation
- Objectives
- Chapter Introduction
  - Spam
  - Terminology
- Dealing with Spam
  - Laws and Regulations
  - Filtering via Keywords
  - Filtering via Rules
  - Learning
- Spam and Bayes
  - Binary Classification of Documents
  - N-ary Classification
- Implementation
  - SpamBayes Project
  - Related Projects
- Important Concepts and Terms
- Summary

Logistics
- Introductions
- Course Materials
  - textbooks (see below)
  - lecture notes
    - PowerPoint slides will be available on my Web page
  - handouts
  - Web page
    - http://www.csc.calpoly.edu/~fkurfess
- Term Project
- Lab and Homework Assignments
- Exams
- Grading

Bridge-In

Pre-Test

Motivation
- dealing with spam “manually” is very time-consuming, tedious, and prone to errors
- various methods have been tried to “filter” spam, with varying success
- early results with Bayesian approaches look very promising

Objectives
- to be familiar with the terminology
  - spam
  - Bayesian approaches
- to understand
  - elementary methods for handling spam automatically
  - more advanced methods
  - scenarios and applications for those methods
  - important characteristics
    - differences between methods, advantages, disadvantages, performance, typical scenarios
- to evaluate the suitability of approaches for specific tasks
  - binary classification
  - n-ary classification
- to be able to apply Bayesian filtering to
  - spam
  - similar problems

Evaluation Criteria

Spam
- broadly: any email that is not wanted by the recipient
  - similar to paper “junk” mail
  - easily recognized by recipients
- unsolicited bulk email
  - not requested by the recipients
  - automatically sent out to a large number of recipients
- “optional” characteristics
  - disguised or forged sender, return addresses and email forwarding information
  - questionable contents
    - illegal, unethical, fraudulent, ...
  - hidden activities
    - acknowledgement of receipt, spyware (“Web bugs”), virus

Terminology
- spam terms
  - spam: negative (bad stuff)
  - ham: positive (good stuff)
- filtering terms
  - false negative
    - spam incorrectly classified as ham
    - spam “gets through”
  - false positive
    - ham incorrectly classified as spam
    - valid messages are blocked
  - corpus
    - body of documents (email messages)
  - hapax, hapax legomenon
    - word that occurs only once in a specific message
  - sample or training set
    - messages used to train the system
  - test set
    - messages used to evaluate the system
http://spambayes.sourceforge.net/
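
The two error types defined above can be counted mechanically once a test set has been labeled. The following is a minimal sketch, not taken from the slides; the function name `error_counts` and the sample labels are assumptions made for illustration.

```python
from collections import Counter

def error_counts(actual, predicted):
    """Count filtering errors from two parallel lists of 'spam'/'ham' labels."""
    counts = Counter()
    for truth, guess in zip(actual, predicted):
        if truth == "spam" and guess == "ham":
            counts["false_negative"] += 1   # spam "gets through"
        elif truth == "ham" and guess == "spam":
            counts["false_positive"] += 1   # valid message is blocked
        else:
            counts["correct"] += 1
    return counts

# Hypothetical labels for a five-message test set.
actual    = ["spam", "spam", "ham", "ham",  "ham"]
predicted = ["spam", "ham",  "ham", "spam", "ham"]
print(error_counts(actual, predicted))
```

In this made-up sample, one spam message slips through and one valid message is blocked.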

Filtering Spam
- Keywords
- Rules
- Learning

Keywords
- identify keywords that frequently occur in spam
- simple and efficient
  - all incoming messages are checked for the occurrence of these keywords
  - if a message contains any or several of them, it is blocked
  - the list of keywords can be modified easily
- not very accurate
  - many false positives
    - legitimate messages that happen to include “forbidden” words
  - many false negatives
    - can be easily circumvented
- used in some early email filtering and Web blocking tools
  - little to moderate success
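
A keyword filter of the kind described above can be sketched in a few lines. This is only an illustration: the keyword list, the function name `keyword_filter`, and the `min_hits` parameter are assumptions, not part of any particular tool.

```python
import re

# Hypothetical keyword list; a real list would be maintained by the user.
BLOCKED_KEYWORDS = {"viagra", "lottery", "free"}

def keyword_filter(message, keywords=BLOCKED_KEYWORDS, min_hits=1):
    """Return True (block) if the message contains at least min_hits keywords."""
    words = set(re.findall(r"[a-z0-9]+", message.lower()))
    return len(words & keywords) >= min_hits

print(keyword_filter("You won the LOTTERY"))    # blocked
print(keyword_filter("Feel free to call me"))   # blocked: a false positive
print(keyword_filter("Win the l0ttery today"))  # passes: a false negative
```

The second and third calls show both weaknesses named on the slide: a legitimate message that happens to contain a "forbidden" word is blocked, and a trivially altered spelling gets through.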

Rules
- characteristics of spam messages are described through if ... then rules
  - not too complicated, moderately efficient
- characteristics can be combined
  - not only keywords
  - also formatting, headers
- more accurate
  - fewer false positives
    - allows a better description of spam messages
  - fewer false negatives
    - somewhat more difficult to circumvent
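
As a rough sketch of the rule-based approach, each rule below is an if ... then test on one characteristic (keywords, formatting, headers), and a message is treated as spam only when several rules fire together. The three rules, their names, and the `min_rules` threshold are hypothetical examples, not rules from an actual filter.

```python
def rule_all_caps_subject(msg):
    """Header rule: the subject line is written entirely in capitals."""
    letters = [c for c in msg["subject"] if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)

def rule_html_only(msg):
    """Formatting rule: the body starts directly with an HTML tag."""
    return msg["body"].lstrip().lower().startswith("<html")

def rule_keyword_and_link(msg):
    """Combined rule: a keyword counts only together with a link."""
    body = msg["body"].lower()
    return "free" in body and "http://" in body

RULES = [
    ("all-caps subject", rule_all_caps_subject),
    ("HTML-only body", rule_html_only),
    ("keyword plus link", rule_keyword_and_link),
]

def fired_rules(msg):
    return [name for name, rule in RULES if rule(msg)]

def is_spam(msg, min_rules=2):
    """Classify as spam if at least min_rules rules fire (assumed threshold)."""
    return len(fired_rules(msg)) >= min_rules

spam = {"subject": "FREE OFFER!!!",
        "body": "<html>Get it free at http://example.com</html>"}
ham = {"subject": "Lunch on Friday?",
       "body": "Feel free to bring the slides."}
print(fired_rules(spam), is_spam(spam))
print(fired_rules(ham), is_spam(ham))
```

Because the keyword "free" only counts in combination with a link, the legitimate message is no longer blocked, which illustrates why combined characteristics tend to produce fewer false positives than bare keywords.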

Learning
- samples of good (ham) and bad (spam) messages are given to the system before it is deployed
- the system analyses various criteria, and tries to determine which criteria are most valuable for the distinction
- used earlier for general email categorization
  - assignment of messages to folders
  - suggestion of actions to be performed (e.g. reply, delete, forward)
  - spam was not a problem at that time

Spam and Bayes
- Binary Classification of Documents
  - two bins: spam, ham
  - sometimes an implicit “undecided” bin is used
- N-ary Classification
  - uses n bins
    - “sure spam”, “probably spam”, “maybe spam”, “unclear”, “maybe ham”, “probably ham”, “sure ham”
- Related Approaches
  - neural networks instead of Bayesian filtering
    - essentially also use statistical techniques
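
One simple way to picture the difference between the binary and the n-ary scheme is to map a spam score between 0 (ham) and 1 (spam) onto bins. The sketch below assumes such a score is already available; the cut-off values are invented for illustration, and only the bin names come from the slide.

```python
import bisect

# Bin names from the slide, ordered from ham to spam.
LABELS = ["sure ham", "probably ham", "maybe ham", "unclear",
          "maybe spam", "probably spam", "sure spam"]
# Hypothetical cut-offs between neighbouring bins.
CUTOFFS = [0.05, 0.20, 0.40, 0.60, 0.80, 0.95]

def binary_bin(score, threshold=0.5):
    """Two bins: everything at or above the threshold is spam."""
    return "spam" if score >= threshold else "ham"

def nary_bin(score):
    """Seven bins: pick the label whose interval contains the score."""
    return LABELS[bisect.bisect_right(CUTOFFS, score)]

for score in (0.01, 0.5, 0.85, 0.99):
    print(score, binary_bin(score), nary_bin(score))
```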

Binary Classification of Documents
- documents are parsed, and tokens extracted
  - pieces of the message that may serve as classification criteria
  - determined by the developer
- the number of occurrences for each token is calculated
  - done for two corpora: one ham, one spam
  - results in two tables with occurrences of tokens in ham and spam
- a third table is created that maps each token to the probability that a message containing it is spam rather than ham
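
The three tables can be sketched as follows. The probability formula is a simplified variant loosely modeled on the per-token calculation in [Graham, 2003a]; the whitespace tokenization, the clamping bounds 0.01 and 0.99, and all function names are assumptions made for this illustration.

```python
from collections import Counter

def token_table(messages):
    """Count how often each token occurs in a corpus of messages."""
    table = Counter()
    for text in messages:
        table.update(text.lower().split())   # naive whitespace tokens
    return table

def spam_probabilities(ham_msgs, spam_msgs):
    """Build the third table: token -> probability that it indicates spam."""
    ham_table = token_table(ham_msgs)
    spam_table = token_table(spam_msgs)
    probs = {}
    for token in set(ham_table) | set(spam_table):
        ham_freq = min(1.0, ham_table[token] / len(ham_msgs))
        spam_freq = min(1.0, spam_table[token] / len(spam_msgs))
        p = spam_freq / (ham_freq + spam_freq)
        probs[token] = max(0.01, min(0.99, p))   # avoid certainties of 0 or 1
    return probs

# Tiny hypothetical corpora.
ham = ["meeting at noon", "lunch at noon tomorrow"]
spam = ["free money now", "free offer at noon"]
probs = spam_probabilities(ham, spam)
for token in ("free", "meeting", "noon"):
    print(token, round(probs[token], 4))
```

Tokens seen only in spam end up at the upper bound, tokens seen only in ham at the lower bound, and tokens seen in both corpora land in between according to their relative frequency.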

Calculation of Probabilities
- Tokenizer
- Scoring
- Training
- Testing
http://spambayes.sourceforge.net/

Tokenizer
- breaks up a mail message into a series of tokens
  - usually words or word stems
  - sometimes complete phrases
  - may consider non-textual elements
    - message headers, HTML constructs, images, comments
- it can be difficult to identify meaningful tokens
  - message body tokens
  - embedded URLs
  - message headers
  - correlation between different types of clues
http://spambayes.sourceforge.net/
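
A minimal tokenizer along these lines might look as follows. It is a sketch, not the SpamBayes tokenizer: it assumes a single-part plain-text message, and the token character set and the `Header*word` / `Url*word` marking (borrowed from the token names shown on the Token Probabilities slide) are simplifications chosen for illustration.

```python
import email
import re

WORD = re.compile(r"[A-Za-z0-9$'!-]+")        # assumed token characters
URL = re.compile(r"https?://[^\s\"'<>]+")

def tokenize(raw_message):
    """Split a single-part message into header, URL, and body tokens."""
    msg = email.message_from_string(raw_message)
    tokens = []
    # Header tokens are marked with the header name, e.g. "Subject*FREE".
    for header in ("Subject", "From", "To"):
        value = msg.get(header, "")
        tokens += [f"{header}*{word}" for word in WORD.findall(value)]
    body = msg.get_payload()                   # a str for single-part messages
    # Words inside embedded URLs are marked separately.
    for url in URL.findall(body):
        tokens += [f"Url*{part}" for part in WORD.findall(url)]
    # Remaining body text; case is preserved on purpose.
    tokens += WORD.findall(URL.sub(" ", body))
    return tokens

raw = ("From: offers@example.com\n"
       "Subject: FREE gift\n"
       "\n"
       "Get it free! at http://example.com/free\n")
print(tokenize(raw))
```

Keeping the case and the location of a word means that "FREE" in a subject line, "free!" in the body, and "free" inside a URL become distinct tokens that can each receive their own probability.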

Scoring
- assigns a number to each message
  - 0: definite ham
  - 1: definite spam
- most difficult and sensitive part of the system
  - incorrect scores
    - false positives
    - false negatives
  - unjustified confidence
    - scores are mostly close to 0 and 1, and rarely in between
- improvements through using two separate probabilities
  - ham probability
  - spam probability
  - allows better treatment of unknown cases as “unsure”
  - substantial reduction of false positives and false negatives
http://spambayes.sourceforge.net/
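
The contrast between a single combined score and two separate indicators can be illustrated with a small sketch. The first function multiplies token probabilities in the naive Bayesian way; the second computes a spam indicator and a ham indicator from geometric means, loosely following the idea discussed in [Robinson, 2002]. This is not the actual SpamBayes scoring code; the cut-offs 0.2 and 0.9 and all names are assumptions, and token probabilities are assumed to lie strictly between 0 and 1.

```python
import math

def naive_score(probs):
    """Combine token spam probabilities into one score (naive Bayes style)."""
    spam = math.prod(probs)
    ham = math.prod(1 - p for p in probs)
    return spam / (spam + ham)

def two_sided_score(probs):
    """Compute separate spam and ham indicators, then a combined score."""
    n = len(probs)
    spam_ind = 1 - math.prod(1 - p for p in probs) ** (1 / n)
    ham_ind = 1 - math.prod(probs) ** (1 / n)
    score = (1 + (spam_ind - ham_ind) / (spam_ind + ham_ind)) / 2
    return spam_ind, ham_ind, score

def classify(probs, ham_cutoff=0.2, spam_cutoff=0.9):
    """Three outcomes: ham, spam, or unsure (cut-offs are hypothetical)."""
    score = two_sided_score(probs)[2]
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"

mixed = [0.99, 0.99, 0.01, 0.6]    # two strong spam clues, one strong ham clue
print(naive_score(mixed))           # very close to 1 despite the ham clue
print(two_sided_score(mixed))       # both indicators are high
print(classify(mixed))              # falls into the "unsure" band
```

For a message with conflicting clues, the naive product is pushed close to 1, whereas the two indicators are both high, so the combined score stays in the middle and the message can be treated as "unsure".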

Training
- presentation of examples for ham and spam
  - generates the probabilities used by the scoring system to assign values to new messages
- corpus size
  - usually the larger, the better
  - too large may lead to overtraining
  - the number of ham and spam examples should be roughly equal
- corpus quality
  - representative samples are very valuable
  - better quality can make up for lack of quantity
  - avoid misleading cues
    - e.g. recent spam vs. old ham; tags added by the mail system
http://spambayes.sourceforge.net/

Testing
- messages categorized as ham or spam are used for testing the performance of the system
  - frequently the existing collection of categorized messages is divided into a training and a testing set
- intuitive insights often don’t work well
  - HTML tags
  - exclamation marks in the header
  - MESSAGES WRITTEN IN CAPITALS
- cross-validation
  - formal technique that systematically divides the corpus into various combinations of training and test sets
http://spambayes.sourceforge.net/
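
Cross-validation can be sketched as a k-fold split: the corpus is divided into k parts, and each part serves once as the test set while the remaining parts form the training set. The function name `k_fold_splits` and the placeholder corpus are assumptions; in practice the corpus would be shuffled first so that each fold contains a mix of ham and spam.

```python
def k_fold_splits(corpus, k):
    """Yield (training_set, test_set) pairs for k-fold cross-validation."""
    folds = [corpus[i::k] for i in range(k)]   # k interleaved slices
    for i in range(k):
        test_set = folds[i]
        training_set = [msg for j, fold in enumerate(folds) if j != i
                        for msg in fold]
        yield training_set, test_set

corpus = [f"msg{i}" for i in range(10)]        # placeholder messages
for training_set, test_set in k_fold_splits(corpus, 5):
    print(len(training_set), test_set)
```

Every message is used for testing exactly once and never appears in the training set of the same round.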

Results
- performance results are notoriously difficult to compare
  - message corpus
  - training methods
  - threshold
    - cut-off value for spam
  - “magic numbers”
    - parameters adjusted by the developer or user

Selected Results
- based on Paul Graham’s article “Better Bayesian Filtering” [Graham, 2003b]
- 99.75% filtering rate on 1750 messages over 1 month
- 4 false negatives: spam got through
  - usage of mostly legitimate words
  - neutral text with an innocent-sounding URL
- 3 false positives: ham got blocked
  - newsletters sent through commercial emailers
  - almost spam
  - email that happens to have features typically associated with spam
    - ALL CAPITALS, in-line images, URLs

Token Probabilities
based on Paul Graham’s article “Better Bayesian Filtering” [Graham, 2003b]

Token          Spam probability
Subject*FREE   0.9999
free!!         0.9999
To*free        0.9998
Subject*free   0.9782
free!          0.9199
Free           0.9198
Url*free       0.9091
FREE           0.8747
From*free      0.7636
free           0.6546

N-ary Classification
- more than two categories
- similar techniques as in the binary approach
- can be substantially more complex

Related Approaches
- collaborative filtering
  - many people categorize messages as spam, and submit them to a central system
  - also should have ham samples
  - may “wash out” individual differences
- neural networks
  - similar concepts, but different learning methods

Implementation
- SpamBayes Project [SpamBayes]
  - stand-alone filter
  - plug-in for some popular mail programs
- Related Projects
  - SpamAssassin http://spamassassin.org/
    - combines statistical techniques, rules, black-lists, collaborative filtering
  - see Paul Graham’s list of spam filters at http://www.paulgraham.com/filters.html

Future Work
- extension to more sophisticated tokens
  - phrases
  - letters replaced by visually similar symbols
    - e.g. o/0, l/1
  - separators inserted between characters
    - spam -> s p a m, s-p-a-m
- combination with other approaches
  - blacklists, whitelists, rule-based systems, ...
- genetic algorithms
  - construction of filters through evolution
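
The two obfuscation tricks listed above could be countered by normalizing text before it is tokenized. The following sketch is one possible approach under simple assumptions: only the substitutions 0 -> o and 1 -> l from the slide are undone, only spaces, hyphens, and dots are treated as inserted separators, and the name `normalize` is hypothetical. It will occasionally merge or alter legitimate text, for example a word that really contains the digit 1.

```python
import re

# Only the look-alike pairs mentioned on the slide.
LOOKALIKES = str.maketrans({"0": "o", "1": "l"})
# Three or more single characters separated by space, hyphen, or dot.
SPACED = re.compile(r"\b(?:\w[ \-.]){2,}\w\b")

def normalize(text):
    """Undo simple character-level obfuscation before tokenizing."""
    # "s p a m" or "s-p-a-m" -> "spam"
    text = SPACED.sub(lambda m: re.sub(r"[ \-.]", "", m.group()), text)
    words = []
    for word in text.split():
        # Leave pure numbers such as "2003" untouched.
        if any(c.isalpha() for c in word):
            word = word.translate(LOOKALIKES)
        words.append(word)
    return " ".join(words)

print(normalize("s-p-a-m l0ttery in 2003"))
print(normalize("cheap s p a m"))
```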

References
[Graham, 2003a] Paul Graham, A Plan for Spam. http://www.paulgraham.com/spam.html, August 2002.
[Graham, 2003b] Paul Graham, Better Bayesian Filtering. http://www.paulgraham.com/better.html, January 2003.
[SpamBayes] SpamBayes: Bayesian anti-spam classifier written in Python. http://spambayes.sourceforge.net/, visited Feb. 2003.
[Robinson, 2002] Gary Robinson's Rants: Spam Detection. http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html, Dec. 2002. [A revised version is to appear in the March 2003 issue of the Linux Journal, http://www.linuxjournal.com/.]
[Giarratano & Riley 1998]

Important Concepts and Terms
- agenda
- backward chaining
- common-sense knowledge
- conflict resolution
- expert system (ES)
- expert system shell
- explanation
- forward chaining
- inference
- inference mechanism
- If-Then rules
- knowledge
- knowledge acquisition
- knowledge base
- knowledge-based system
- knowledge representation
- Markov algorithm
- matching
- Post production system
- problem domain
- production rules
- reasoning
- RETE algorithm
- rule
- working memory

Summary: Spam Filtering