Bayesian Spam Filter By Joshua Spaulding
Statement of Problem “Spam now accounts for more than half of all messages sent and imposes huge productivity costs…By 2007, Spam-stopping should grow to a $2.4 Billion Business.” Technology Review 8/03
Objective Using Bayes’ rule, I will attempt to classify an email message as spam or non-spam (ham). I will use a corpus of spam and ham to determine the probability that a new message is spam given the tokens in the message.
Definition of Spam Unsolicited automated email
Bayes’ Rule P(A|B) = P(B|A)P(A) / P(B) P(A|B) is the conditional probability that event A occurs given that event B has occurred; P(B|A) is the conditional probability of event B occurring given that event A has occurred; P(A) is the probability of event A occurring; P(B) is the probability of event B occurring.
Bayes’ Rule P(spam|token) = P(token|spam)P(spam) / P(token) P(spam|token) – probability that an email is spam given a token; P(token|spam) – probability that the token exists given that the email is spam; P(spam) – probability of an email being spam; P(token) – probability of the token appearing in an email.
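A minimal sketch of how the rule above could be evaluated from corpus counts. The class and method names are hypothetical, and two things are assumptions not stated on the slides: each probability is estimated as a fraction of messages, and P(token) is expanded with the law of total probability as P(token|spam)P(spam) + P(token|ham)P(ham).

```java
final class TokenProbability {
    /**
     * Estimates P(spam|token) from message counts (hypothetical helper).
     * spamWithToken / hamWithToken: messages in each corpus containing the token.
     * spamTotal / hamTotal: total messages in each corpus.
     */
    static double spamGivenToken(int spamWithToken, int spamTotal,
                                 int hamWithToken, int hamTotal) {
        if (spamTotal <= 0 || hamTotal <= 0) {
            throw new IllegalArgumentException("both corpora must be non-empty");
        }
        double pSpam = (double) spamTotal / (spamTotal + hamTotal);
        double pHam = 1.0 - pSpam;
        double pTokenGivenSpam = (double) spamWithToken / spamTotal;
        double pTokenGivenHam = (double) hamWithToken / hamTotal;
        // Assumption: P(token) via the law of total probability.
        double pToken = pTokenGivenSpam * pSpam + pTokenGivenHam * pHam;
        if (pToken == 0.0) {
            return 0.5; // token never seen in either corpus: treat as neutral
        }
        return pTokenGivenSpam * pSpam / pToken;
    }
}
```

For example, a token found in 40 of 100 spam messages and 10 of 100 ham messages gives P(spam|token) = 0.8 under these assumptions.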
Project Design (orig) Read in a large text file containing 1000 spam emails. Read in a large text file containing 1000 ham emails. Create a file for each corpus consisting of each token and its number of occurrences in the corpus. I’ll then create another file with each token and the probability that an email containing it is spam, using Bayes’ rule. When an email arrives I will parse it. I will look up the probability that the email is spam given each token. I’ll then combine all the probabilities to determine the probability that the email is spam.
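The slides do not say how the per-token probabilities are combined. One common choice, shown here only as an assumption, is the naive Bayes combination that treats tokens as independent and spam and ham as equally likely a priori: p = (p1·p2·…·pn) / (p1·p2·…·pn + (1−p1)(1−p2)…(1−pn)). The class and method names are hypothetical.

```java
final class CombineProbabilities {
    /**
     * Combines per-token spam probabilities into one message-level estimate.
     * Assumption: tokens are independent and the spam/ham priors are equal.
     */
    static double combine(double[] tokenProbabilities) {
        double spamProduct = 1.0; // product of p_i
        double hamProduct = 1.0;  // product of (1 - p_i)
        for (double p : tokenProbabilities) {
            spamProduct *= p;
            hamProduct *= 1.0 - p;
        }
        double denominator = spamProduct + hamProduct;
        // If both products are zero the evidence is contradictory: stay neutral.
        return denominator == 0.0 ? 0.5 : spamProduct / denominator;
    }
}
```

With very long messages the products can underflow to zero, so a fuller version would probably sum logarithms or keep only the most informative tokens.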
Project Design Create a Narl model from 100 spam and 100 ham emails contained in two separate CSV files, using Narl’s built-in Excel Model function (Corpus.narl). Parse the body slot from Corpus.narl, create word nodes and calculate the probability (kb.narl). Examine the incoming text body, tokenize it and create nodeNames. If a nodeName is already in the kb, look up its probability. Otherwise assign a probability value of “0.5”.
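The lookup step can be sketched in plain Java, independent of Narl. This is an illustration only: the knowledge base is modeled as an ordinary Map, the delimiter set and lower-casing of tokens are assumptions, and the class and method names are hypothetical. Only the StringTokenizer approach and the 0.5 default come from the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;

final class BodyLookup {
    // Probability assigned to tokens that are not in the knowledge base.
    static final double UNKNOWN_TOKEN_PROBABILITY = 0.5;

    /**
     * Tokenizes a message body and returns one spam probability per token.
     * knowledgeBase stands in for kb.narl (assumption: token -> P(spam|token)).
     */
    static List<Double> lookup(String body, Map<String, Double> knowledgeBase) {
        List<Double> probabilities = new ArrayList<>();
        // Assumed delimiters: whitespace and common punctuation.
        StringTokenizer tokenizer = new StringTokenizer(body, " \t\n\r\f.,;:!?\"()");
        while (tokenizer.hasMoreTokens()) {
            String nodeName = tokenizer.nextToken().toLowerCase(Locale.ROOT);
            probabilities.add(
                knowledgeBase.getOrDefault(nodeName, UNKNOWN_TOKEN_PROBABILITY));
        }
        return probabilities;
    }
}
```

Given a knowledge base containing "free" at 0.9 and "meeting" at 0.1, the body "FREE offer, meeting!" yields the probabilities 0.9, 0.5 and 0.1 in that order.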
Model
node
Word Node
Issues Text is unknown and often incomplete. Java data structures (Vector, StringTokenizer) and floating-point operations. Unfamiliar with Narl.
Enhancements Read slots other than body. Read data in from another format. Gain more knowledge about the email. Better error handling. Read emails as they enter the mail server. Regular expression matching of StringTokenizer. Performance tuning with more data. Take advantage of Narl functionality?
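One possible reading of the regular-expression enhancement is to match tokens directly with java.util.regex rather than splitting on delimiters with StringTokenizer. The sketch below assumes a token is a run of letters, digits, dollar signs, apostrophes or hyphens. That pattern, along with the class and method names, is hypothetical and not taken from the project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexTokenizer {
    // Assumed token shape: letters, digits, '$', apostrophes and hyphens.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z0-9$'-]+");

    /** Returns the lower-cased tokens found in the text, in order. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }
}
```

Matching what a token is, instead of listing every delimiter, keeps strings such as "$1000" intact while discarding runs of punctuation such as "!!!".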
Demonstration
Questions?