© 2003 Franz J. Kurfess Spam Filtering 1 CPE/CSC 481: Knowledge-Based Systems Dr. Franz J. Kurfess Computer Science Department Cal Poly.

Course Overview
- Introduction
- Knowledge Representation
  - Semantic Nets, Frames, Logic
- Reasoning and Inference
  - Predicate Logic, Inference Methods, Resolution
- Reasoning with Uncertainty
  - Probability, Bayesian Decision Making
- Expert System Design
  - ES Life Cycle
- CLIPS Overview
  - Concepts, Notation, Usage
- Pattern Matching
  - Variables, Functions, Expressions, Constraints
- Expert System Implementation
  - Salience, Rete Algorithm
- Expert System Examples
- Conclusions and Outlook

Overview Spam Filtering
- Motivation
- Objectives
- Chapter Introduction
  - Spam
  - Terminology
- Dealing with Spam
  - Laws and Regulations
  - Filtering via Keywords
  - Filtering via Rules
  - Learning
- Spam and Bayes
  - Binary Classification of Documents
  - N-ary Classification
- Implementation
  - SpamBayes Project
  - Related Projects
- Important Concepts and Terms
- Summary

Logistics
- Introductions
- Course Materials
  - textbooks (see below)
  - lecture notes
    - PowerPoint slides will be available on my Web page
  - handouts
  - Web page
    - http://www.csc.calpoly.edu/~fkurfess
- Term Project
- Lab and Homework Assignments
- Exams
- Grading

Motivation
- dealing with spam “manually” is very time-consuming, tedious, and prone to errors
- various methods have been tried to “filter” spam, with varying success
- early results with Bayesian approaches look very promising

Objectives
- to be familiar with the terminology
  - spam
  - Bayesian approaches
- to understand
  - elementary methods for handling spam automatically
  - more advanced methods
  - scenarios and applications for those methods
  - important characteristics
    - differences between methods, advantages, disadvantages, performance, typical scenarios
- to evaluate the suitability of approaches for specific tasks
  - binary classification
  - n-ary classification
- to be able to apply Bayesian filtering
  - spam
  - similar problems

Spam
- broadly: any email that is not wanted by the recipient
  - similar to paper “junk” mail
  - easily recognized by recipients
- unsolicited bulk email
  - not requested by the recipients
  - automatically sent out to a large number of recipients
- “optional” characteristics
  - disguised or forged sender, return addresses and email forwarding information
  - questionable contents
    - illegal, unethical, fraudulent, ...
  - hidden activities
    - acknowledgement of receipt, spyware (“Web bugs”), virus

Terminology
- spam terms
  - spam: negative (bad stuff)
  - ham: positive (good stuff)
- filtering terms
  - false negative
    - spam incorrectly classified as ham; spam “gets through”
  - false positive
    - ham incorrectly classified as spam; valid messages are blocked
  - corpus
    - body of documents (email messages)
  - hapax, hapax legomenon
    - word that occurs only once in a specific message or corpus
  - sample or training set
    - messages used to train the system
  - test set
    - messages used to evaluate the system
http://spambayes.sourceforge.net/

Keywords
- identify keywords that frequently occur in spam
- simple and efficient
  - all incoming messages are checked for the occurrence of these keywords
  - if a message contains one or several of them, it is blocked
  - the list of keywords can be modified easily
- not very accurate
  - many false positives
    - legitimate messages that happen to include “forbidden” words
  - many false negatives
    - can be easily circumvented
- used in some early email filtering and Web blocking tools
  - little to moderate success
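
The keyword approach can be sketched in a few lines of Python. This is only an illustration: the keyword list, the function names, and the `min_hits` parameter are assumptions made for the example and do not come from any particular filtering tool.

```python
import re

# Hypothetical keyword list; a real filter would maintain its own.
SPAM_KEYWORDS = {"free", "viagra", "winner"}


def keyword_hits(message, keywords=SPAM_KEYWORDS):
    """Return the listed keywords that occur as whole words in the message."""
    words = set(re.findall(r"[a-z0-9']+", message.lower()))
    return words & set(keywords)


def is_blocked(message, keywords=SPAM_KEYWORDS, min_hits=1):
    """Block the message if it contains at least min_hits listed keywords."""
    return len(keyword_hits(message, keywords)) >= min_hits
```

The sketch also shows both weaknesses named on the slide: `is_blocked("Feel free to call me tomorrow")` blocks a legitimate message (false positive), while `is_blocked("Claim your fr3e prize")` lets a trivially disguised keyword through (false negative).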

Rules
- characteristics of spam messages are described through if ... then rules
- not too complicated, moderately efficient
  - characteristics can be combined
    - not only keywords
    - also formatting, headers
- more accurate
  - fewer false positives
    - allows a better description of spam messages
  - fewer false negatives
    - somewhat more difficult to circumvent
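
A rule-based filter might look roughly like the following sketch, in which each rule is an if ... then test over headers and body and several rules must fire before a message is treated as spam. The rule names, weights, and threshold are hypothetical values chosen for illustration, not taken from an existing rule set.

```python
import re

# Each rule: (name, predicate over (headers, body), weight). All hypothetical.
RULES = [
    ("subject_all_caps", lambda h, b: h.get("Subject", "").isupper(), 2.0),
    ("many_exclamations", lambda h, b: h.get("Subject", "").count("!") >= 3, 1.5),
    ("html_body", lambda h, b: "<html" in b.lower(), 1.0),
    ("mentions_free", lambda h, b: re.search(r"\bfree\b", b, re.I) is not None, 1.0),
]


def rule_score(headers, body, rules=RULES):
    """Return the summed weight of all rules that fire, plus their names."""
    fired = [(name, weight) for name, pred, weight in rules if pred(headers, body)]
    return sum(weight for _, weight in fired), [name for name, _ in fired]


def is_spam(headers, body, threshold=3.0):
    """Treat the message as spam if the combined rule weight reaches the threshold."""
    score, _ = rule_score(headers, body)
    return score >= threshold
```

Because the decision depends on a combination of characteristics, a legitimate message that merely contains the word "free" fires only one rule and stays below the threshold.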

Learning
- samples of good (ham) and bad (spam) messages are given to the system before it is deployed
  - the system analyses various criteria, and tries to determine which criteria are most valuable for the distinction
- used earlier for general email categorization
  - assignment of messages to folders
  - suggestion of actions to be performed (e.g. reply, delete, forward)
  - spam was not a problem at that time

Spam and Bayes
- Binary Classification of Documents
  - two bins: spam, ham
  - sometimes an implicit “undecided” bin is used
- N-ary Classification
  - uses n bins
    - “sure spam”, “probably spam”, “maybe spam”, “unclear”, “maybe ham”, “probably ham”, “sure ham”
- Related Approaches
  - neural networks instead of Bayesian filtering
    - essentially also use statistical techniques
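
One simple way to obtain the seven bins listed above is to cut a single spam score between 0 and 1 into intervals. The sketch below assumes such a score already exists; the bin edges are made-up values for illustration, and a real n-ary classifier could instead be trained on n separate corpora.

```python
from bisect import bisect_right

# Hypothetical cut points on a 0..1 spam score; six edges give seven bins.
BIN_EDGES = [0.05, 0.2, 0.4, 0.6, 0.8, 0.95]
BIN_LABELS = [
    "sure ham", "probably ham", "maybe ham", "unclear",
    "maybe spam", "probably spam", "sure spam",
]


def bin_for(score):
    """Map a spam score in [0, 1] to one of the seven labelled bins."""
    return BIN_LABELS[bisect_right(BIN_EDGES, score)]
```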

Binary Classification of Documents
- documents are parsed, and tokens extracted
  - pieces of the message that may serve as classification criteria
  - determined by the developer
- the number of occurrences of each token is counted
  - done for two corpora: one ham, one spam
  - results in two tables with occurrences of tokens in ham and spam
- a third table is created that maps each token to the probability that a message containing it is spam
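
The three tables can be sketched as follows. The probability formula here is a simplified relative-frequency estimate in the spirit of [Graham, 2003a]; it is an assumption for illustration and omits refinements such as weighting ham counts or ignoring rare tokens. Messages are assumed to be already tokenized, and the function names and clamp values are hypothetical.

```python
from collections import Counter


def count_tokens(messages):
    """Count token occurrences across a corpus given as a list of token lists."""
    counts = Counter()
    for tokens in messages:
        counts.update(tokens)
    return counts


def probability_table(ham_msgs, spam_msgs, floor=0.01, ceil=0.99):
    """Build the third table: token -> estimated probability of spam."""
    ham_counts = count_tokens(ham_msgs)    # table 1: occurrences in ham
    spam_counts = count_tokens(spam_msgs)  # table 2: occurrences in spam
    n_ham, n_spam = max(len(ham_msgs), 1), max(len(spam_msgs), 1)
    table = {}
    for token in set(ham_counts) | set(spam_counts):
        ham_rate = min(1.0, ham_counts[token] / n_ham)
        spam_rate = min(1.0, spam_counts[token] / n_spam)
        p = spam_rate / (ham_rate + spam_rate)
        # Clamp so that a single token never yields absolute certainty.
        table[token] = min(ceil, max(floor, p))
    return table
```

With two ham messages `["meeting", "tomorrow"]` and `["free", "lunch", "tomorrow"]` and two spam messages `["free", "offer"]` and `["free", "money"]`, the token `free` ends up at about 0.67, while `offer` is clamped to 0.99 and `meeting` to 0.01.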

Tokenizer
- breaks up a mail message into a series of tokens
  - usually words or word stems
  - sometimes complete phrases
  - may consider non-textual elements
    - message headers, HTML constructs, images, comments
- it can be difficult to identify meaningful tokens
  - message body tokens
  - embedded URLs
  - message headers
  - correlation between different types of clues
http://spambayes.sourceforge.net/
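
A very small tokenizer along these lines is sketched below. It uses Python's standard `email` module and marks header and URL tokens with a prefix such as `Subject*` or `Url*`, following the notation of the token table in [Graham, 2003b]. The choice of headers, the regular expressions, and the handling of multipart messages (ignored here) are simplifying assumptions; this is not the SpamBayes tokenizer.

```python
import re
from email import message_from_string

WORD = re.compile(r"[A-Za-z0-9$'!-]+")  # keeps case and exclamation marks
URL = re.compile(r"https?://\S+")


def tokenize(raw_message):
    """Split a raw RFC 822 style message into header, URL, and body tokens."""
    msg = message_from_string(raw_message)
    tokens = []
    for header in ("Subject", "From", "To"):
        value = msg.get(header, "")
        tokens += [f"{header}*{word}" for word in WORD.findall(value)]
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""  # multipart is skipped
    for url in URL.findall(body):
        tokens += [f"Url*{part}" for part in re.findall(r"[A-Za-z0-9]+", url)]
    tokens += WORD.findall(URL.sub(" ", body))
    return tokens
```

Keeping the context in the token means that `free` in the subject line, in a URL, and in the body are counted separately, so each can receive its own probability.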

Scoring
- assigns a number to each message
  - 0: definite ham
  - 1: definite spam
- most difficult and sensitive part of the system
  - incorrect scores
    - false positives
    - false negatives
  - unjustified confidence
    - scores are mostly close to 0 or 1, and rarely in between
- improvements through using two separate probabilities
  - ham probability
  - spam probability
  - allows better treatment of unknown cases as “unsure”
  - substantial reduction of false positives and false negatives
http://spambayes.sourceforge.net/
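
As a rough illustration of scoring, the sketch below combines per-token probabilities with the naive Bayesian rule described in [Graham, 2003a]: the product of the probabilities divided by that product plus the product of their complements. The default of 0.4 for unseen tokens and the limit of 15 tokens mirror values described in that article but should be treated as tunable assumptions. The "unsure" band uses two hypothetical cut-offs on a single score, which is simpler than the two-probability scheme mentioned on the slide.

```python
import math


def spam_score(tokens, table, unknown=0.4, max_tokens=15):
    """Combine per-token spam probabilities into one score between 0 and 1."""
    probs = [min(0.99, max(0.01, table.get(t, unknown))) for t in set(tokens)]
    # Keep only the most "interesting" tokens: those farthest from neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    probs = probs[:max_tokens]
    # Work in log space to avoid underflow when multiplying small numbers.
    log_spam = sum(math.log(p) for p in probs)
    log_ham = sum(math.log(1.0 - p) for p in probs)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


def classify(score, ham_cutoff=0.2, spam_cutoff=0.9):
    """Map a score to 'ham', 'spam', or 'unsure' using two cut-offs."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"
```

Two strongly spammy tokens already push the score very close to 1, which illustrates the "unjustified confidence" problem: the combined score tends toward the extremes even with little evidence.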

Training
- presentation of examples for ham and spam
  - generates the probabilities used by the scoring system to assign values to new messages
- corpus size
  - usually the larger, the better
  - too large may lead to overtraining
  - the number of ham and spam examples should be roughly equal
- corpus quality
  - representative samples are very valuable
  - better quality can make up for lack of quantity
  - avoid misleading cues
    - e.g. recent spam vs. old ham; tags added by the mail system
http://spambayes.sourceforge.net/

Testing
- messages categorized as ham or spam are used for testing the performance of the system
  - frequently the existing collection of categorized messages is divided into a training and a testing set
- intuitive insights often don’t work well
  - HTML tags
  - exclamation marks in the header
  - MESSAGES WRITTEN IN CAPITALS
- cross-validation
  - formal technique that systematically divides the corpus into various combinations of training and test sets
http://spambayes.sourceforge.net/
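
Cross-validation can be sketched as a k-fold split: the corpus is divided into k parts, and each part serves once as the test set while the remaining parts form the training set. The function name and the default k = 5 are assumptions for the example; in practice the corpus would usually be shuffled first.

```python
def k_fold_splits(messages, k=5):
    """Yield (training_set, test_set) pairs, one for each of the k folds."""
    folds = [messages[i::k] for i in range(k)]  # every k-th message per fold
    for i in range(k):
        test = folds[i]
        train = [m for j, fold in enumerate(folds) if j != i for m in fold]
        yield train, test
```

Every message is used for testing exactly once and for training k - 1 times, so the reported performance does not depend on one particular division of the corpus.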

Results
- performance results are notoriously difficult to compare
  - message corpus
  - training methods
  - threshold
    - cut-off value for spam
  - “magic numbers”
    - parameters adjusted by the developer or user
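
To see why the threshold alone makes results hard to compare, the following sketch counts false positives and false negatives for the same scored messages at different cut-off values. The input format, a list of (score, is_spam) pairs, and the function names are assumptions for the example.

```python
def error_counts(scored_messages, threshold=0.9):
    """Count outcomes for (score, is_spam) pairs at a given spam cut-off."""
    counts = {"spam_caught": 0, "false_negative": 0,
              "ham_passed": 0, "false_positive": 0}
    for score, is_spam in scored_messages:
        flagged = score >= threshold
        if is_spam:
            counts["spam_caught" if flagged else "false_negative"] += 1
        else:
            counts["false_positive" if flagged else "ham_passed"] += 1
    return counts


def sweep(scored_messages, thresholds=(0.5, 0.9, 0.99)):
    """Report the error counts for several cut-off values side by side."""
    return {t: error_counts(scored_messages, t) for t in thresholds}
```

Lowering the cut-off catches more spam but blocks more ham, and raising it does the opposite, so two filters are only comparable when corpus, training method, and threshold are all stated.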

Selected Results
- based on Paul Graham’s article “Better Bayesian Filtering” [Graham, 2003b]
- 99.75% filtering rate on 1750 messages over 1 month
- 4 false negatives: spam got through
  - usage of mostly legitimate words
  - neutral text with an innocent-sounding URL
- 3 false positives: ham got blocked
  - newsletters sent through commercial emailers
  - almost spam
    - email that happens to have features typically associated with spam
    - ALL CAPITALS, in-line images, URLs

Token Probabilities
- based on Paul Graham’s article “Better Bayesian Filtering” [Graham, 2003b]

Token          Probability
Subject*FREE   0.9999
free!!         0.9999
To*free        0.9998
Subject*free   0.9782
free!          0.9199
Free           0.9198
Url*free       0.9091
FREE           0.8747
From*free      0.7636
free           0.6546

Related Approaches
- collaborative filtering
  - many people categorize messages as spam, and submit them to a central system
  - also should have ham samples
  - may “wash out” individual differences
- neural networks
  - similar concepts, but different learning methods

Implementation
- SpamBayes Project [SpamBayes]
  - stand-alone filter
  - plug-in for some popular mail programs
- Related Projects
  - SpamAssassin http://spamassassin.org/
    - combines statistical techniques, rules, black-lists, collaborative filtering
  - see Paul Graham’s list of spam filters at http://www.paulgraham.com/filters.html

Future Work
- extension to more sophisticated tokens
  - phrases
  - letters replaced by visually similar symbols
    - e.g. o/0, l/1
  - separators inserted between characters
    - spam -> s p a m, s-p-a-m
- combination with other approaches
  - blacklists, whitelists, rule-based systems, ...
- genetic algorithms
  - construction of filters through evolution
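
The two obfuscation tricks mentioned above could be countered by a normalization step before tokenizing, roughly as sketched here. The look-alike map covers only the two pairs named on the slide, the separator set is an assumption, and the whole step is a heuristic: for example, it will also merge a genuine run of single letters, and lower-casing discards case information that some tokenizers deliberately keep.

```python
import re

# Look-alike digits mapped back to letters (only the pairs named above).
LOOKALIKES = str.maketrans({"0": "o", "1": "l"})

# Three or more single characters joined by one separator each, e.g. s-p-a-m.
SPACED = re.compile(
    r"(?<![A-Za-z0-9])(?:[A-Za-z0-9][ .\-_*]){2,}[A-Za-z0-9](?![A-Za-z0-9])"
)


def collapse_separators(text):
    """Rejoin words whose characters were split by separators."""
    return SPACED.sub(lambda m: re.sub(r"[ .\-_*]", "", m.group(0)), text)


def normalize_token(token):
    """Undo digit-for-letter swaps, but leave pure numbers such as 100 alone."""
    has_letter = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    if has_letter and has_digit:
        token = token.translate(LOOKALIKES)
    return token.lower()


def normalize(text):
    """Return normalized lower-case tokens for a piece of message text."""
    words = re.findall(r"[A-Za-z0-9]+", collapse_separators(text))
    return [normalize_token(word) for word in words]
```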

References
- [Graham, 2003a] Paul Graham, A Plan for Spam. http://www.paulgraham.com/spam.html, August 2002.
- [Graham, 2003b] Paul Graham, Better Bayesian Filtering. http://www.paulgraham.com/better.html, January 2003.
- [SpamBayes] SpamBayes: Bayesian anti-spam classifier written in Python. http://spambayes.sourceforge.net/, visited Feb. 2003.
- [Robinson, 2002] Gary Robinson's Rants: Spam Detection. http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html, Dec. 2002. [A revised version is to appear in the March 2003 issue of the Linux Journal, http://www.linuxjournal.com/.]
- [Giarratano & Riley 1998]

Important Concepts and Terms
- agenda
- backward chaining
- common-sense knowledge
- conflict resolution
- expert system (ES)
- expert system shell
- explanation
- forward chaining
- inference
- inference mechanism
- If-Then rules
- knowledge
- knowledge acquisition
- knowledge base
- knowledge-based system
- knowledge representation
- Markov algorithm
- matching
- Post production system
- problem domain
- production rules
- reasoning
- RETE algorithm
- rule
- working memory
