CSC 380 Algorithm Project Presentation
Spam Detection Algorithms
Kyle McCombs, Bridget Kelly
Objective
Create a text-filtering algorithm that can accurately and efficiently identify spam emails based on data collected from past spam emails.
Background
spam: email that is not wanted; email that is sent to large numbers of people and that consists mostly of advertising; unsolicited, usually commercial, email sent to a large number of addresses
Spam is estimated to account for anywhere from 70–95% of all email
Method
Create a word bank by parsing the body of the spam emails in the database
– Our methods disregard the sender address and subject line
Each word is associated with its frequency of appearance across all emails evaluated during the learning phase
Use this data to evaluate emails with one of two methods:
– Naïve Bayes classifier
– Markov model
Naïve Bayes Classifier - Background
One of the oldest and most popular methods of spam detection; first known use in 1996
A common text-classification method utilizing features from the "bag of words" model
– Disregards grammar and word order, but not multiplicity
Assumes independence among features: the value of any particular feature is unrelated to the presence or absence of any other feature
Tailored to a specific user
Offers a low false-positive rate
Naïve Bayes Classifier - Process
Each word has a probability of appearing in a spam email
– The training phase builds these probabilities (e.g. the user marking an email as spam)
The probabilities of the individual words are used to compute the probability that an email containing a particular set of words is spam or not
If this probability meets a certain threshold, the email is classified as spam
Naïve Bayes Classifier - Process
Considering one word's effect on an email being spam:
Pr(S|W) – probability an email is spam knowing it contains word W
Pr(W|S) – probability that word W appears in spam emails
Pr(S) – probability any given message is spam
Pr(W|H) – probability that word W appears in non-spam (ham) emails
Pr(H) – probability any given message is not spam
Bayes' theorem gives:
Pr(S|W) = Pr(W|S)·Pr(S) / (Pr(W|S)·Pr(S) + Pr(W|H)·Pr(H))
Should we use Pr(S) = .8, Pr(H) = .2? Pr(S) = .9, Pr(H) = .1? (based on recent statistics)
Most Bayesian spam software makes no assumption about incoming email, taking Pr(S) = Pr(H) = .5, so the formula can be simplified to:
Pr(S|W) = Pr(W|S) / (Pr(W|S) + Pr(W|H))
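The single-word computation above can be sketched in a few lines (function names are ours, not from the project code):

```python
def pr_spam_given_word(pr_w_given_s, pr_w_given_h, pr_s=0.5, pr_h=0.5):
    """Bayes' rule for one word W: Pr(S|W)."""
    return (pr_w_given_s * pr_s) / (pr_w_given_s * pr_s + pr_w_given_h * pr_h)

def pr_spam_given_word_unbiased(pr_w_given_s, pr_w_given_h):
    """With the unbiased assumption Pr(S) = Pr(H) = 0.5, the priors cancel."""
    return pr_w_given_s / (pr_w_given_s + pr_w_given_h)
```

With Pr(W|S) = 0.8 and Pr(W|H) = 0.2, both forms give Pr(S|W) = 0.8, illustrating that the simplification changes nothing when the priors are equal.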
Naïve Bayes Classifier - Process
Combining individual probabilities:
p = probability the email in question is spam
p1, …, pn = probabilities of each word appearing in a spam email
n = number of words being evaluated
p = (p1·p2·…·pn) / (p1·p2·…·pn + (1−p1)(1−p2)·…·(1−pn))
*The multiplication shown here is actually done as addition in the log domain because the numbers involved are very small
Compare p to a determined threshold:
if p is below the threshold, the email is not classified as spam
if p is equal to or above the threshold, the email is classified as spam
Naïve Bayes Classifier - Results
15,000 spam emails evaluated during the learning phase
Average classifier value of emails in the learning phase used as threshold
– 2.86% success rate in testing (86/3000 emails could be confidently identified as spam)
Median – a better summary statistic for data that is not normally distributed
– 52.03% success rate when using the median value as threshold (1561/3000)
The SAS output shown on the right displays results from a PROC UNIVARIATE procedure run on a data set containing the Bayes classifier values for the 15,000 emails in the learning set. The data is highly skewed, and three different normality tests confirm that it is not normally distributed. This evidence suggests that a model considering the individual probability of every word within an email is not the best fit for our data.
Naïve Bayes Classifier - Results
Only consider the 15 most "interesting" (highest) probabilities for each email in the classifier
Neutral words (words associated with a low spam probability) should not affect the statistical significance of highly incriminating words, no matter how many there are
97.13% success rate (2914/3000 spam emails correctly identified) – using the average Bayes value from the learning set as threshold
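The "most interesting words" filter described above reduces to selecting the top-k per-word probabilities before combining them; a sketch (our naming, following the slide's definition of "interesting" as highest probability):

```python
def most_interesting(word_probs, k=15):
    """Keep only the k highest per-word spam probabilities so that
    many neutral, low-probability words cannot dilute the score
    contributed by a few highly incriminating words."""
    return sorted(word_probs, reverse=True)[:k]
```

The selected list would then be passed to the combining formula from the previous slide in place of the full word list.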
Markov Model - Background
Models the statistical behavior of spam emails; widely used in current spam-classification systems.
In essence, a Bayes filter works on single words alone, while a Markovian filter works on phrases or possibly whole sentences.
Markov Model - Process
Training – analyze a training set of emails that are all known to be spam
Examining adjacent words 'A' and 'B', compute the frequency with which word 'B' follows word 'A', for every word in the body of an email.
If word 'A' is followed by a period, question mark, or exclamation point, skip it.
Markov Model - Process
Calculate and store the average occurrence rate of word 'B' following word 'A', for every word in each email in the training set:
avgPer('A' → 'B') = the rate at which 'B' follows 'A' within a single email
Summing all of the per-email rates of 'B' following 'A' and dividing by the total number of emails in the training set gives the final average rate that word 'B' followed word 'A' in the training set:
Final Avg. Occurrence('B' follows 'A') = (avgPer₁('A' → 'B') + … + avgPerₙ('A' → 'B')) / n, where n = number of emails in the training set
Using a weighted directed graph, store each word encountered as a vertex, with an edge between adjacent words weighted by the average rate of occurrence over all spam emails in the training set.
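The training steps above can be sketched as follows. This is our illustration, not the project's code; in particular, we assume the per-email rate is count('B' follows 'A') / count('A' as the first word of a pair), and that punctuation appears as separate tokens:

```python
from collections import defaultdict

TERMINATORS = {".", "?", "!"}

def per_email_rates(words):
    """Rate at which each word B follows each word A within one email.
    A pair is skipped when 'A' is followed by ., ?, or ! (sentence end)."""
    follows = defaultdict(lambda: defaultdict(int))
    first_counts = defaultdict(int)
    for a, b in zip(words, words[1:]):
        if b in TERMINATORS:
            continue  # word 'A' ends a sentence; skip it
        follows[a][b] += 1
        first_counts[a] += 1
    return {(a, b): count / first_counts[a]
            for a, inner in follows.items() for b, count in inner.items()}

def train(emails):
    """Weighted directed graph stored as a dict keyed by (A, B);
    the weight is the per-email rate averaged over all training emails."""
    totals = defaultdict(float)
    for words in emails:
        for edge, rate in per_email_rates(words).items():
            totals[edge] += rate
    n = len(emails)
    return {edge: total / n for edge, total in totals.items()}
```

A dict keyed by word pairs is one simple way to represent the weighted directed graph; an adjacency-list structure would serve equally well.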
Markov Model - Process
Classification: when "grading" an email in question,
Examine adjacent words
Look up the corresponding edge weight in the graph (the average rate that one word follows another in the training collection)
Accumulate these weights per email and calculate the average weight as a final grade for the email
If this grade is greater than or equal to a determined threshold, consider the email spam; if less, consider it not spam
If an edge does not exist (two words were never adjacent in the training collection), it is skipped, having no effect on the overall grade
Skip common words that could potentially be frequent in both spam and non-spam emails (e.g. the, this, I, etc.)
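The grading procedure can be sketched as below (our naming; the tiny stop-word list is purely illustrative, and `graph` is the pair-keyed weight dict produced during training):

```python
STOPWORDS = {"the", "this", "i", "a", "an"}  # illustrative stop list

def grade(words, graph, threshold):
    """Average the edge weights found for adjacent word pairs.
    Missing edges and pairs involving stop words are skipped.
    Returns (grade, is_spam), where is_spam tests grade >= threshold."""
    weights = []
    for a, b in zip(words, words[1:]):
        if a in STOPWORDS or b in STOPWORDS:
            continue  # common words appear in both spam and non-spam
        if (a, b) in graph:
            weights.append(graph[(a, b)])
    score = sum(weights) / len(weights) if weights else 0.0
    return score, score >= threshold
```

Note the edge case: an email whose pairs all miss the graph gets a grade of 0 and is therefore classified as not spam, consistent with skipped edges having no effect.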
Markov Model - Results
3000 spam emails evaluated during the learning phase
1000 test spam emails used in the testing set
Average classifier grade of emails in the learning phase used as threshold
920 spam emails correctly identified as spam – a 92% success rate