Bayesian Filtering
Team Glyph: Debbie Bridygham, Pravesvuth Uparanukraw, Ronald Ko, Rihui Luo, Thuong Luu
Background
- A strong need exists to identify “bad” items in a population and remove them
  - Examples: spam, unsolicited IMs, etc.
- Filtering often results in an “arms race” that requires rapid response
- An “arms race” favors inherently adaptive methods over others
Benefits of Filters
- Less unwanted traffic, thus less wasted space on clients & servers
- Greater use of internet services due to reduced customer frustration
- Some protection against dangerous traffic: scams, phishing attacks, viruses, etc.
Downsides of Filtering
- Excluding even one legitimate item (a false positive) is less desirable than letting 10 or more illegitimate items pass
- Reducing the percentage of undesirable traffic often causes legitimate traffic to be excluded as well
Cost of Filtering
- Manual filtering has become prohibitively expensive
- Maintenance of static filters costs time & money
- Time spent maintaining keywords or updating software delays response
- The “arms race” often results in ever-escalating costs
Methodologies
- Manual filtering is prohibitive in terms of time
- Static filtering based on heuristics and keywords does not adapt except via manual updates
- Bayesian filtering is dynamic, adapting with each new item scanned and/or marked
What is Bayesian Filtering?
- Uses a naïve Bayes classifier, which applies Bayes' theorem
- The classifier allows items to be adaptively categorized using probabilities & has a low rate of false positives
- Best-known use is in spam filtering; often credited to initial work by Paul Graham (“A Plan for Spam”) in 2002
Naïve Bayes Classifier
- Uses Bayes' theorem with the assumption that the probabilities are independent (rarely true), thus “naïve”
- The classifier can start with initial assumptions, i.e., probabilities that words occur in legitimate or illegitimate messages
- It is trained over time and adapts
- If the final probability reaches some threshold, an item is rejected
- Superior to keyword filtering
Bayes' Theorem
- First presented in 1763, based on work by mathematician Thomas Bayes
- Pr(A|B) = Pr(B|A) · Pr(A) / Pr(B)
- Specifies the relationship between conditional probabilities
- Currently has practical use in many fields
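As a rough illustration of the formula on this slide, the sketch below computes Pr(A|B) directly and obtains Pr(B) from the law of total probability. The function name `bayes_posterior` and all numbers are made up for illustration; they are not taken from the deck.

```python
def bayes_posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return Pr(A|B) = Pr(B|A) * Pr(A) / Pr(B)."""
    if p_b <= 0:
        raise ValueError("Pr(B) must be positive")
    return p_b_given_a * p_a / p_b


# Hypothetical inputs: A = "message is spam", B = "message contains a given word".
p_spam = 0.6                # assumed prior Pr(A)
p_word_given_spam = 0.2     # assumed Pr(B|A)
p_word_given_ham = 0.01     # assumed Pr(B|not A)

# Law of total probability: Pr(B) = Pr(B|A)Pr(A) + Pr(B|not A)Pr(not A)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

posterior = bayes_posterior(p_word_given_spam, p_spam, p_word)
print(f"Pr(spam | word) = {posterior:.3f}")
```

With these assumed inputs, seeing the word raises the probability of spam from the prior of 0.6 to roughly 0.97.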
Bayesian Filtering Usage
- Uses user input to develop individual statistics
- The probability matrix changes over time based on scanned messages and user decisions
- The matrix is used to calculate the probability that a message is unwanted
- The matrix adapts quickly to new input, which gives surprisingly good results
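The adaptive behaviour described on this slide can be sketched as a small table of per-word counts that is updated every time a user marks a message. The class `AdaptiveWordStats`, its method names, the neutral default of 0.5 for unseen words, and the sample messages are all assumptions made for illustration, not a description of any particular product.

```python
from collections import Counter


class AdaptiveWordStats:
    """Per-user word statistics that change with every marked message."""

    def __init__(self):
        self.spam_docs = 0
        self.ham_docs = 0
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def train(self, text: str, is_spam: bool) -> None:
        # Count each distinct word once per message.
        words = set(text.lower().split())
        if is_spam:
            self.spam_docs += 1
            self.spam_counts.update(words)
        else:
            self.ham_docs += 1
            self.ham_counts.update(words)

    def spam_probability(self, word: str, default: float = 0.5) -> float:
        # Fall back to a neutral value when there is no evidence yet.
        if self.spam_docs == 0 or self.ham_docs == 0:
            return default
        word = word.lower()
        spam_rate = self.spam_counts[word] / self.spam_docs
        ham_rate = self.ham_counts[word] / self.ham_docs
        if spam_rate + ham_rate == 0:
            return default
        return spam_rate / (spam_rate + ham_rate)


stats = AdaptiveWordStats()
stats.train("free prize guarantee", is_spam=True)
stats.train("meeting agenda attached", is_spam=False)
before = stats.spam_probability("prize")

# The user marks a legitimate message that happens to mention "prize".
stats.train("you won the prize draw at the office party", is_spam=False)
after = stats.spam_probability("prize")
print(before, after)
```

A single user decision is enough to move the estimate for “prize” away from certainty, which is the kind of quick adaptation the slide refers to.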
Example Matrix
Example
- Suppose the word “guarantee” occurs in 500 of 2000 spam messages, but in only 5 of 1000 non-spam messages
- The probability of spam for this word is then (500 / 2000) / ((500 / 2000) + (5 / 1000)) ≈ 0.98
- This probability is combined with those of other words in the message to compute the probability that the entire message is spam
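The arithmetic in this example can be checked with a short sketch. The per-word formula follows the slide; the combination rule shown (product of the word probabilities divided by that product plus the product of their complements) is one common choice, the one described in Graham's “A Plan for Spam”, and it assumes independent words. The two extra word probabilities, 0.90 and 0.20, are invented for illustration.

```python
import math


def word_spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Per-word spam probability from occurrence rates, as on the slide."""
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    return spam_rate / (spam_rate + ham_rate)


def combine(probabilities):
    """Combine per-word probabilities, treating words as independent.

    Assumes no mix of exact 0.0 and 1.0 values, which would make the
    denominator zero.
    """
    spam_product = math.prod(probabilities)
    ham_product = math.prod(1 - p for p in probabilities)
    return spam_product / (spam_product + ham_product)


# "guarantee": 500 of 2000 spam messages, 5 of 1000 non-spam messages.
p_guarantee = word_spam_probability(500, 2000, 5, 1000)

# Hypothetical probabilities for two other words in the same message.
message_score = combine([p_guarantee, 0.90, 0.20])
print(f"word: {p_guarantee:.2f}, message: {message_score:.3f}")
```

Note that `math.prod` requires Python 3.8 or later.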
Bayesian Poisoning
- Attempts to fool Bayesian filtering (BF) systems by adding irrelevant words (often hidden)
- Type I attacks attempt to get messages through the filter
  - They can be active or passive; active attacks produce feedback to the sender via a “Web Bug” or other means
- Type II attacks attempt to cause false positives, i.e., to force desirable messages to be rejected
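To see why padding a message with irrelevant words can help it get through, the sketch below scores the same spam message with and without three “innocent” padding words. The combination rule assumes independent words, and the word probabilities and the 0.9 rejection threshold are invented for illustration; a real filter's values would depend on its own training data.

```python
import math


def combine(probabilities):
    """Combine per-word spam probabilities, treating words as independent."""
    spam_product = math.prod(probabilities)
    ham_product = math.prod(1 - p for p in probabilities)
    return spam_product / (spam_product + ham_product)


THRESHOLD = 0.9  # assumed rejection threshold

# Hypothetical per-word probabilities for the spammy words in a message.
spam_words = [0.98, 0.95]
# Hypothetical padding words that the filter currently considers innocent.
padding_words = [0.10, 0.10, 0.10]

clean_score = combine(spam_words)
poisoned_score = combine(spam_words + padding_words)

print(f"without padding: {clean_score:.3f} rejected={clean_score >= THRESHOLD}")
print(f"with padding:    {poisoned_score:.3f} rejected={poisoned_score >= THRESHOLD}")
```

The padding only works while the filter still rates those words as innocent; once padded messages are marked as spam, the same words start counting against the sender.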
Poisoning Effectiveness
- Passive attacks are rarely effective, as filters are individual and the sender gets no feedback
- Active attacks can be highly effective initially, if systems access “Web Bugs”
- All attacks lose effectiveness as the filter adjusts to incoming traffic
Products that use Bayesian Filtering
AlienCamel, DSPAM, Eudora, eXpurgate, Junk-Out, Mozilla, Pegasus Mail, POPFile, Postini, SeaMonkey, SpamAssassin, SpamBayes, SpamProbe, Thunderbird
Summary
- BF adapts to individual needs
- BF is highly effective
- BF adapts more quickly than other solutions
- BF is resistant to “poisoning”
References
[1] Sahami, M., et al. “A Bayesian Approach to Filtering Junk E-Mail”, 1998
[2] Graham, Paul. “A Plan for Spam”, 2002
[3] Graham-Cumming, John. “Does Bayesian poisoning exist?”, 2006
[4] “Naive Bayes Classifier”, Wikipedia, 2007
[5] “Bayes Theorem”, Wikipedia, 2007