© 2003 Franz J. Kurfess Spam Filtering 1 CPE/CSC 481: Knowledge-Based Systems Dr. Franz J. Kurfess Computer Science Department Cal Poly
Course Overview
- Introduction
- Knowledge Representation
  - Semantic Nets, Frames, Logic
- Reasoning and Inference
  - Predicate Logic, Inference Methods, Resolution
- Reasoning with Uncertainty
  - Probability, Bayesian Decision Making
- Expert System Design
  - ES Life Cycle
- CLIPS Overview
  - Concepts, Notation, Usage
- Pattern Matching
  - Variables, Functions, Expressions, Constraints
- Expert System Implementation
  - Salience, Rete Algorithm
- Expert System Examples
- Conclusions and Outlook
Overview Spam Filtering
- Motivation
- Objectives
- Chapter Introduction
  - Spam
  - Terminology
- Dealing with Spam
  - Laws and Regulations
  - Filtering via Keywords
  - Filtering via Rules
  - Learning
- Spam and Bayes
  - Binary Classification of Documents
  - N-ary Classification
- Implementation
  - SpamBayes Project
  - Related Projects
- Important Concepts and Terms
- Summary
Logistics
- Introductions
- Course Materials
  - textbooks (see below)
  - lecture notes
    - PowerPoint slides will be available on my Web page
  - handouts
  - Web page
- Term Project
- Lab and Homework Assignments
- Exams
- Grading
Bridge-In
Pre-Test
Motivation
- dealing with spam “manually” is very time-consuming, tedious, and prone to errors
- various methods have been tried to “filter” spam, with varying success
- early results with Bayesian approaches look very promising
Objectives
- to be familiar with the terminology
  - spam
  - Bayesian approaches
- to understand
  - elementary methods for handling spam automatically
  - more advanced methods
  - scenarios and applications for those methods
  - important characteristics
    - differences between methods, advantages, disadvantages, performance, typical scenarios
- to evaluate the suitability of approaches for specific tasks
  - binary classification
  - n-ary classification
- to be able to apply Bayesian filtering
  - spam
  - similar problems
Evaluation Criteria
Spam
- broadly: any e-mail that is not wanted by the recipient
  - similar to paper “junk” mail
  - easily recognized by recipients
- unsolicited bulk e-mail
  - not requested by the recipients
  - automatically sent out to a large number of recipients
- “optional” characteristics
  - disguised or forged sender, return addresses, and forwarding information
  - questionable contents
    - illegal, unethical, fraudulent, ...
  - hidden activities
    - acknowledgement of receipt, spyware (“Web bugs”), viruses
Terminology
- spam terms
  - spam: negative in the everyday sense (bad stuff)
  - ham: positive in the everyday sense (good stuff)
- filtering terms (here a “positive” is a message the filter flags as spam)
  - false negative
    - spam incorrectly classified as ham
    - spam “gets through”
  - false positive
    - ham incorrectly classified as spam
    - valid messages are blocked
  - corpus
    - body of documents (e-mail messages)
  - hapax, hapax legomenon
    - word that occurs only once, e.g. in a specific message
  - sample or training set
    - messages used to train the system
  - test set
    - messages used to evaluate the system
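
The filtering terms above can be made concrete by counting outcomes on a labelled test set. The following is a minimal sketch, not part of the original slides; the function name `confusion_counts` and the sample labels are made up for illustration, and spam is treated as the class the filter tries to detect.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count filter outcomes; 'spam' is the class the filter tries to detect."""
    counts = Counter()
    for truth, guess in zip(actual, predicted):
        if truth == "spam" and guess == "spam":
            counts["true_positive"] += 1
        elif truth == "ham" and guess == "spam":
            counts["false_positive"] += 1   # a valid message is blocked
        elif truth == "spam" and guess == "ham":
            counts["false_negative"] += 1   # spam "gets through"
        else:
            counts["true_negative"] += 1
    return counts

# Hypothetical labels for five test messages.
actual    = ["spam", "spam", "ham", "ham",  "spam"]
predicted = ["spam", "ham",  "ham", "spam", "spam"]
counts = confusion_counts(actual, predicted)
print(dict(counts))
```

For a mail filter, false positives are usually the more costly error, since a blocked legitimate message may never be seen by the recipient.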
Filtering Spam
- Keywords
- Rules
- Learning
Keywords
- identify keywords that frequently occur in spam
- simple and efficient
  - all incoming messages are checked for the occurrence of these keywords
  - if a message contains one or several of them, it is blocked
  - the list of keywords can be modified easily
- not very accurate
  - many false positives
    - legitimate messages that happen to include “forbidden” words
  - many false negatives
    - can be easily circumvented
- used in some early filtering and Web blocking tools
  - little to moderate success
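
A keyword filter of the kind described above might look like the sketch below. The keyword list, the function names, and the `min_hits` parameter are hypothetical and chosen only for illustration; real tools of that era differed in detail.

```python
import re

# Hypothetical keyword list; a real deployment would maintain its own.
SPAM_KEYWORDS = {"viagra", "lottery", "winner"}

def keyword_hits(message, keywords=SPAM_KEYWORDS):
    """Return the set of listed keywords that occur in the message."""
    words = set(re.findall(r"[a-z0-9']+", message.lower()))
    return words & keywords

def is_blocked(message, min_hits=1):
    """Block the message if it contains at least min_hits keywords."""
    return len(keyword_hits(message)) >= min_hits

print(is_blocked("You are the lottery winner"))        # spam, blocked
print(is_blocked("Our chess club winner is Anna"))     # ham, but blocked
print(is_blocked("You are a w1nner of the l0ttery"))   # spam, not blocked
```

The second call shows a false positive (a legitimate message containing a "forbidden" word) and the third a false negative (a trivially disguised keyword), which are the two weaknesses listed on the slide.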
Rules
- characteristics of spam messages are described through if ... then rules
- not too complicated, moderately efficient
  - characteristics can be combined
    - not only keywords
    - also formatting, headers
- more accurate than keyword filtering
  - fewer false positives
    - allows a better description of spam messages
  - fewer false negatives
    - somewhat more difficult to circumvent
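
One way to sketch such if ... then rules is as weighted conditions over a parsed message, so that keywords, formatting, and headers can be combined. Everything in this example (the `Rule` class, the four rules, their weights, and the threshold) is an assumption made for illustration and is not taken from any particular filter.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]   # the "if" part
    weight: float                       # contribution when the rule fires

# Hypothetical rules combining keyword, formatting, and header clues.
RULES = [
    Rule("subject_all_caps", lambda m: m["subject"].isupper(), 1.5),
    Rule("mentions_free", lambda m: "free" in m["body"].lower(), 1.0),
    Rule("html_only", lambda m: m.get("content_type") == "text/html", 0.5),
    Rule("missing_sender", lambda m: not m.get("from"), 2.0),
]

def rule_score(message, rules=RULES):
    """Return the total weight and the names of all rules that fired."""
    fired = [r for r in rules if r.condition(message)]
    return sum(r.weight for r in fired), [r.name for r in fired]

def is_spam(message, threshold=2.0):
    """The "then" part: flag the message once enough evidence accumulates."""
    score, _ = rule_score(message)
    return score >= threshold

promo = {"subject": "FREE MONEY NOW", "body": "Get your free prize",
         "from": "x@example.com", "content_type": "text/plain"}
note = {"subject": "Lunch tomorrow?", "body": "Feel free to bring a friend",
        "from": "a@example.com", "content_type": "text/plain"}
print(rule_score(promo), is_spam(promo))
print(rule_score(note), is_spam(note))
```

Because a single keyword no longer decides the outcome, the second message is not flagged even though it contains "free".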
Learning
- samples of good (ham) and bad (spam) messages are given to the system before it is deployed
- the system analyses various criteria, and tries to determine which criteria are most valuable for the distinction
- used earlier for general e-mail categorization
  - assignment of messages to folders
  - suggestion of actions to be performed (e.g. reply, delete, forward)
  - spam was not a problem at that time
Spam and Bayes
- Binary Classification of Documents
  - two bins: spam, ham
  - sometimes an implicit “undecided” bin is used
- N-ary Classification
  - uses n bins
    - “sure spam”, “probably spam”, “maybe spam”, “unclear”, “maybe ham”, “probably ham”, “sure ham”
- Related Approaches
  - neural networks instead of Bayesian filtering
    - essentially also use statistical techniques
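
If a filter produces a single spam score between 0 and 1, the seven bins named above could be derived from it with a set of cut-off values. This is only a sketch of that idea; the cut-offs and the function name `bin_for` are hypothetical, and an n-ary classifier may just as well be trained with one class per bin.

```python
from bisect import bisect_right

# Bins ordered from lowest to highest spam score.
BINS = ["sure ham", "probably ham", "maybe ham", "unclear",
        "maybe spam", "probably spam", "sure spam"]

# Hypothetical cut-off values separating neighbouring bins.
CUTOFFS = [0.05, 0.2, 0.4, 0.6, 0.8, 0.95]

def bin_for(score):
    """Map a spam score in [0, 1] to one of the seven bins."""
    return BINS[bisect_right(CUTOFFS, score)]

for s in (0.0, 0.3, 0.5, 0.99):
    print(s, bin_for(s))
```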
Binary Classification of Documents
- documents are parsed, and tokens extracted
  - pieces of the message that may serve as classification criteria
  - determined by the developer
- the number of occurrences for each token is calculated
  - done for two corpora: one ham, one spam
  - results in two tables with occurrences of tokens in ham and spam
- a third table is created that reflects the probability of a message being ham or spam
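
The three tables described above can be sketched as follows. This is a simplified illustration, not the exact formula of any cited filter: tokens are plain whitespace-separated words, each token's spam probability is its relative frequency in spam divided by the sum of its relative frequencies in both corpora, and the result is clamped to an assumed range of 0.01 to 0.99 so that no token counts as absolute proof. The tiny corpora are made up, and both are assumed to be non-empty.

```python
from collections import Counter

def count_tokens(corpus):
    """Table of token -> number of occurrences across a corpus."""
    counts = Counter()
    for message in corpus:
        counts.update(message.lower().split())
    return counts

def spam_probabilities(ham_counts, spam_counts, floor=0.01, ceiling=0.99):
    """Third table: token -> estimated probability that it indicates spam."""
    ham_total = sum(ham_counts.values())
    spam_total = sum(spam_counts.values())
    table = {}
    for token in set(ham_counts) | set(spam_counts):
        ham_rate = ham_counts[token] / ham_total      # 0 if never seen in ham
        spam_rate = spam_counts[token] / spam_total   # 0 if never seen in spam
        p = spam_rate / (spam_rate + ham_rate)
        table[token] = min(ceiling, max(floor, p))    # clamp (assumed bounds)
    return table

# Hypothetical miniature corpora.
ham_corpus = ["meeting at noon", "lunch at noon tomorrow"]
spam_corpus = ["free money now", "free lunch offer"]

table = spam_probabilities(count_tokens(ham_corpus), count_tokens(spam_corpus))
for token in ("free", "lunch", "noon"):
    print(token, round(table[token], 3))
```

A token seen in both corpora, such as "lunch" here, ends up near 0.5 and therefore carries little evidence either way.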
Calculation of Probabilities
- Tokenizer
- Scoring
- Training
- Testing
Tokenizer
- breaks up a mail message into a series of tokens
  - usually words or word stems
  - sometimes complete phrases
  - may consider non-textual elements
    - message headers, HTML constructs, images, comments
- it can be difficult to identify meaningful tokens
  - message body tokens
  - embedded URLs
  - message headers
  - correlation between different types of clues
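
A tokenizer along these lines can be sketched with Python's standard `email` package. The sketch assumes a plain-text, single-part message; the regular expressions, the choice of headers, and the `Header*word` and `Url*word` prefixes (modelled on the token notation used later in these slides) are illustrative assumptions rather than the behaviour of a specific filter.

```python
import re
from email import message_from_string

WORD = re.compile(r"[A-Za-z0-9$'!-]+")   # assumed definition of a "word"
URL = re.compile(r"https?://\S+")

def tokenize(raw_message, headers=("Subject", "From", "To")):
    """Split a plain-text message into header, URL, and body tokens."""
    msg = message_from_string(raw_message)
    tokens = []
    # Header words are marked with their header name, e.g. Subject*FREE.
    for header in headers:
        for word in WORD.findall(msg.get(header, "")):
            tokens.append(f"{header}*{word}")
    body = msg.get_payload()   # a string for non-multipart messages
    # Words inside embedded URLs get their own prefix.
    for url in URL.findall(body):
        for word in WORD.findall(url):
            tokens.append(f"Url*{word}")
    # Remaining body words, with the URLs removed first.
    tokens.extend(WORD.findall(URL.sub(" ", body)))
    return tokens

raw = ("Subject: FREE offer\n"
       "From: ads@example.com\n"
       "\n"
       "Click http://example.com/free now!\n")
tokens = tokenize(raw)
print(tokens)
```

Keeping case and trailing exclamation marks, and marking where a word occurred, lets the later stages treat "FREE" in a subject line differently from "free" in the body.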
Scoring
- assigns a number to each message
  - 0: definite ham
  - 1: definite spam
- most difficult and sensitive part of the system
  - incorrect scores
    - false positives
    - false negatives
  - unjustified confidence
    - scores are mostly close to 0 and 1, and rarely in between
- improvements through using two separate probabilities
  - ham probability
  - spam probability
  - allows better treatment of unknown cases as “unsure”
  - substantial reduction of false positives and false negatives
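
As an illustration of scoring, the sketch below combines per-token spam probabilities with the naive Bayesian formula p1·p2·… / (p1·p2·… + (1−p1)(1−p2)·…) and then applies two cut-offs to obtain an "unsure" band. It shows the simple single-score variant, not the refinement with separate ham and spam probabilities mentioned above. The probability table, the default of 0.4 for unknown tokens, and the cut-off values are assumptions for the example, and the token probabilities are assumed to lie strictly between 0 and 1.

```python
import math

def combine(probabilities):
    """Combine per-token spam probabilities into one score in (0, 1)."""
    spam_evidence = math.prod(probabilities)
    ham_evidence = math.prod(1.0 - p for p in probabilities)
    return spam_evidence / (spam_evidence + ham_evidence)

def score(tokens, table, unknown=0.4):
    """Score a message; tokens missing from the table get an assumed default."""
    return combine([table.get(t, unknown) for t in set(tokens)])

def classify(score_value, ham_cutoff=0.2, spam_cutoff=0.9):
    """Turn a score into ham, spam, or unsure using assumed cut-offs."""
    if score_value >= spam_cutoff:
        return "spam"
    if score_value <= ham_cutoff:
        return "ham"
    return "unsure"

# Hypothetical token probabilities.
table = {"free": 0.99, "money": 0.9, "noon": 0.01}

for tokens in (["free", "money"], ["noon"], ["free", "noon"], ["hello"]):
    s = score(tokens, table)
    print(tokens, round(s, 4), classify(s))
```

Two moderately strong clues already push the score very close to 1, which illustrates why such scores cluster near the extremes.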
Training
- presentation of examples for ham and spam
  - generates the probabilities used by the scoring system to assign values to new messages
- corpus size
  - usually the larger, the better
  - too large a corpus may lead to overtraining
  - the number of ham and spam examples should be roughly equal
- corpus quality
  - representative samples are very valuable
  - better quality can make up for lack of quantity
  - avoid misleading cues
    - e.g. recent spam vs. old ham; tags added by the mail system
Testing
- messages categorized as ham or spam are used for testing the performance of the system
  - frequently the existing collection of categorized messages is divided into a training and a testing set
- intuitive insights often don’t work well
  - HTML tags
  - exclamation marks in the header
  - MESSAGES WRITTEN IN CAPITALS
- cross-validation
  - formal technique that systematically divides the corpus into various combinations of training and test sets
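
Cross-validation as described above can be sketched with a k-fold split: the corpus is cut into k parts, and each part serves once as the test set while the remaining parts form the training set. The function name `k_fold_splits` is made up for this example, and in practice the corpus would be shuffled first so that each fold contains a representative mix of ham and spam.

```python
def k_fold_splits(messages, k):
    """Yield (training set, test set) pairs for k-fold cross-validation."""
    folds = [messages[i::k] for i in range(k)]   # k roughly equal parts
    for i in range(k):
        test = folds[i]
        train = [m for j, fold in enumerate(folds) if j != i for m in fold]
        yield train, test

# Ten placeholder message ids standing in for a categorized corpus.
corpus = list(range(10))
for train, test in k_fold_splits(corpus, 5):
    print("train:", train, "test:", test)
```

Averaging the error counts over all k runs gives a more reliable estimate than a single training and test split.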
Results
- performance results are notoriously difficult to compare because they depend on
  - message corpus
  - training methods
  - threshold
    - cut-off value for spam
  - “magic numbers”
    - parameters adjusted by the developer or user
Selected Results
based on Paul Graham’s article “Better Bayesian Filtering” [Graham, 2003b]
- filtering rate on 1750 messages over 1 month
- 4 false negatives: spam got through
  - usage of mostly legitimate words
  - neutral text with an innocent-sounding URL
- 3 false positives: ham got blocked
  - newsletters sent through commercial e-mailers
    - almost spam
  - e-mail that happens to have features typically associated with spam
    - ALL CAPITALS, in-line images, URLs
Token Probabilities
based on Paul Graham’s article “Better Bayesian Filtering” [Graham, 2003b]
- Subject*FREE
- free!!
- To*free
- Subject*free
- free!
- Free
- Url*free
- FREE
- From*free
- free
N-ary Classification
- more than two categories
- similar techniques as in the binary approach
- can be substantially more complex
Related Approaches
- collaborative filtering
  - many people categorize messages as spam, and submit them to a central system
  - also should have ham samples
  - may “wash out” individual differences
- neural networks
  - similar concepts, but different learning methods
Implementation
- SpamBayes Project [SpamBayes]
  - stand-alone filter
  - plug-in for some popular mail programs
- Related Projects
  - SpamAssassin
    - combines statistical techniques, rules, black-lists, collaborative filtering
  - see Paul Graham’s list of spam filters
Future Work
- extension to more sophisticated tokens
  - phrases
  - letters replaced by visually similar symbols
    - e.g. o/0, l/1
  - separators inserted between characters
    - spam -> s p a m, s-p-a-m
- combination with other approaches
  - blacklists, whitelists, rule-based systems, ...
- genetic algorithms
  - construction of filters through evolution
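
The two disguises mentioned above (look-alike symbols and inserted separators) could be undone by a normalization step before tokenizing. The sketch below is one possible approach under simple assumptions: only the two substitutions named on the slide are reversed, and only spaces, hyphens, and dots count as separators. It is deliberately naive; for example, it would also rewrite digits in legitimate numbers.

```python
import re

# Only the look-alikes named on the slide: 0 for o, 1 for l.
LOOKALIKES = str.maketrans({"0": "o", "1": "l"})

# Three or more single characters joined by a space, hyphen, or dot.
SPACED = re.compile(r"\b(?:\w[ \-.]){2,}\w\b")

def normalize(text):
    """Undo simple obfuscation before the text is tokenized."""
    # Join spaced-out words such as "s p a m" or "s-p-a-m".
    joined = SPACED.sub(lambda m: re.sub(r"[ \-.]", "", m.group()), text)
    # Map look-alike digits back to letters and fold case.
    return joined.translate(LOOKALIKES).lower()

for sample in ("s p a m", "s-p-a-m", "buy S.P.A.M now", "mi11ion d0llar l0an"):
    print(repr(sample), "->", repr(normalize(sample)))
```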
References
[Graham, 2003a] Paul Graham, A Plan for Spam. August
[Graham, 2003b] Paul Graham, Better Bayesian Filtering. January
[SpamBayes] SpamBayes: Bayesian anti-spam classifier written in Python. Visited Feb
[Robinson, 2002] Gary Robinson's Rants: Spam Detection. etection.html, Dec [A revised version is to appear in the March 2003 issue of the Linux Journal]
[Giarratano & Riley 1998]
Important Concepts and Terms
- agenda
- backward chaining
- common-sense knowledge
- conflict resolution
- expert system (ES)
- expert system shell
- explanation
- forward chaining
- inference
- inference mechanism
- If-Then rules
- knowledge
- knowledge acquisition
- knowledge base
- knowledge-based system
- knowledge representation
- Markov algorithm
- matching
- Post production system
- problem domain
- production rules
- reasoning
- RETE algorithm
- rule
- working memory
Summary Spam Filtering