CSC 380 Algorithm Project Presentation Spam Detection Algorithms Kyle McCombs Bridget Kelly

Objective Create a text-filtering algorithm that can accurately and efficiently identify spam emails based on data collected from past spam emails.

Background spam - email that is not wanted: email that is sent to large numbers of people and that consists mostly of advertising; unsolicited, usually commercial email sent to a large number of addresses. Spam is estimated to account for anywhere from 70 – 95% of all emails.

Method Create a word bank by parsing through the bodies of the spam emails in the database – our methods disregard the sender address and subject line. Each word is associated with a frequency of appearance within all emails evaluated during the learning phase (see the sketch below). Use this data to evaluate emails with one of two methods: – Naïve Bayes classifier – Markov model
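As a rough sketch of this word bank (not the project's actual code), the frequency counting might look like the following in Python; spam_bodies is an assumed list of email-body strings pulled from the database, and the tokenizer is illustrative:

    from collections import Counter
    import re

    def build_word_bank(spam_bodies):
        """Count how often each word appears across the bodies of spam emails."""
        bank = Counter()
        for body in spam_bodies:
            # Lowercase and keep letter runs; sender address and subject line
            # are never seen here, matching the method described above.
            bank.update(re.findall(r"[a-z']+", body.lower()))
        return bank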

Naïve Bayes Classifier - Background One of the oldest and most popular methods of spam detection, with its first known use in 1996. A common text-classification method utilizing features from the "bag of words" model, which disregards grammar and word order but not multiplicity. Assumes independence among features: the value of any particular feature is unrelated to the presence or absence of any other feature. Tailored to a specific user, and offers a low false-positive detection rate.

Naïve Bayes Classifier - Process Each word has a probability of appearing in a spam email – the training phase builds these probabilities (e.g., from a user marking an email as spam). The probabilities of individual words are used to compute the probability that an email containing a particular set of words is spam or not. If this probability meets a certain threshold, the email is determined to be spam.

Naïve Bayes Classifier - Process Considering one word's effect on an email being spam:
Pr(S|W) – probability an email is spam given that it contains word W
Pr(W|S) – probability that word W appears in spam emails
Pr(S) – probability any given message is spam
Pr(W|H) – probability that word W appears in non-spam (ham) emails
Pr(H) – probability any given message is not spam
Bayes' rule gives Pr(S|W) = Pr(W|S)Pr(S) / (Pr(W|S)Pr(S) + Pr(W|H)Pr(H)). Should the priors be Pr(S) = .8 and Pr(H) = .2? Pr(S) = .9 and Pr(H) = .1 (based on recent statistics)? Most Bayesian spam software makes no prior assumptions about incoming email, treating Pr(S) = Pr(H) = .5, so the formula can be simplified to:
Pr(S|W) = Pr(W|S) / (Pr(W|S) + Pr(W|H))
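A minimal sketch of the per-word computation under that equal-priors simplification; spam_bank, ham_bank, and the email counts are hypothetical names for learning-phase tallies (the method described above trains only on spam, so the ham-side counts are an assumption here):

    def word_spam_probability(word, spam_bank, ham_bank, n_spam, n_ham):
        """Pr(S|W) under the equal-priors simplification Pr(S) = Pr(H) = 0.5."""
        pr_w_s = spam_bank.get(word, 0) / n_spam  # Pr(W|S): rate in spam emails
        pr_w_h = ham_bank.get(word, 0) / n_ham    # Pr(W|H): rate in ham emails
        if pr_w_s + pr_w_h == 0:
            return 0.5  # word never seen in training: treat as neutral
        return pr_w_s / (pr_w_s + pr_w_h)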

Naïve Bayes Classifier - Process Combining individual probabilities:
p = (p1 p2 … pn) / (p1 p2 … pn + (1 − p1)(1 − p2) … (1 − pn))
where p = probability the email in question is spam, pi = probability of the i-th word appearing in a spam email, and n = number of words being evaluated. *The multiplication shown here is actually done as addition in the log domain because the numbers involved are very small. Compare p to a determined threshold: if p is below the threshold, the email cannot be classified as spam; if p is equal to or above the threshold, it is classified as spam.
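A sketch of this combining step, done in the log domain as the note above suggests; per_word_probs would be the Pr(S|W) values from the previous slide:

    import math

    def combine_probabilities(per_word_probs):
        """p = (p1...pn) / (p1...pn + (1-p1)...(1-pn)), computed via log-sums."""
        eps = 1e-12  # keep probabilities away from exact 0 or 1
        log_p = sum(math.log(max(p, eps)) for p in per_word_probs)
        log_q = sum(math.log(max(1.0 - p, eps)) for p in per_word_probs)
        diff = log_q - log_p
        if diff > 700:   # avoid overflow in exp; p is effectively 0 here
            return 0.0
        return 1.0 / (1.0 + math.exp(diff))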

Naïve Bayes Classifier - Results 15,000 spam emails evaluated during the learning phase. Average classifier value of the emails in the learning phase used as the threshold – 2.86% success rate in testing (86/3000 emails could be confidently identified as spam). Median – a better summary statistic for data that is not normally distributed – 52.03% success rate when using the median value as the threshold (1561/3000). SAS output (a PROC UNIVARIATE procedure run on a data set containing the Bayes classifier values for the 15,000 emails in the learning set; not reproduced here) shows that this data is highly skewed, and three different normality tests support that it is not normally distributed. This evidence suggests that the model considering individual probabilities of every word within an email is not the best fit for our data.

Naïve Bayes Classifier - Results Only consider the 15 most "interesting" (highest) probabilities for each email in the classifier (see the sketch below). Neutral words (words associated with a low spam probability) should not affect the statistical significance of highly incriminating words, no matter how many there are. 97.13% success rate (2914/3000 spam emails correctly identified) – using the average Bayes value from the learning set as the threshold.
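The selection step itself is simple; a sketch with k = 15 as described, feeding into the combining function from earlier:

    def interesting_probabilities(per_word_probs, k=15):
        """Keep only the k most incriminating (highest) per-word probabilities."""
        return sorted(per_word_probs, reverse=True)[:k]

    # p = combine_probabilities(interesting_probabilities(per_word_probs))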

Markov Model - Background Models the statistical behaviors of spam emails. Widely used in current spam classification systems. In essence, a Bayes filter works on single words alone, while a Markovian filter works on phrases or possibly whole sentences.

Markov Model - Process Training – Analyze a training set of emails that are all known to be spam. Examining adjacent words 'A' and 'B', compute the frequency with which word 'B' follows word 'A', for every word in the body of an email. If word 'A' is followed by a period, question mark, or exclamation point, skip it.

Markhov Model - Process Calculate and store the average occurrence rate of word ‘B’ following word ‘A’, for every word in each in training set. avgPer (‘A’  ’B’) = Summing all of the average occurrence rates of ‘B’ following ‘A’ and dividing by the total number of s in the training set, results in the final average rate that word ‘B’ followed word ‘A’ in the training set. Final Avg. Occurrence (‘B’ Follows ‘A’) = 1 + … + avgPer (‘A’  ’B’) n Number of s in Training Set Using a weighted directed graph, store each word encountered as a vertex, with edges between adjacent words containing the average rate of occurrence in all spam s from training set.

Markov Model - Process Classification: When "grading" an email in question, examine adjacent words and look up the corresponding edge weight in the graph (the average rate that a word follows another word in the training collection). Accumulate these weights for the email and calculate the average weight as its final grade. If this grade is greater than or equal to a determined threshold, consider the email spam; if less, consider it not spam. If an edge does not exist (two words were never adjacent in the training collection), it is skipped, having no effect on the overall grade. Skip common words that could potentially be frequent in both spam and non-spam emails (e.g., the, this, I, etc.). A grading sketch follows below.
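A minimal grading sketch reusing the hypothetical graph and tokenizer from the training sketch; the stop-word list is illustrative only:

    import re

    STOP_WORDS = {"the", "this", "i", "a", "an", "and"}  # illustrative list

    def grade_email(body, graph):
        """Average known edge weight over adjacent-word pairs in the email."""
        tokens = [t for t in re.findall(r"[a-z']+", body.lower())
                  if t not in STOP_WORDS]
        weights = [graph[(a, b)] for a, b in zip(tokens, tokens[1:])
                   if (a, b) in graph]  # unseen pairs are simply skipped
        return sum(weights) / len(weights) if weights else 0.0

    # is_spam = grade_email(body, graph) >= threshold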

Markov Model - Results 3000 spam emails evaluated during the learning phase; 1000 test spam emails used in the testing set. Average classifier grade of the emails in the learning phase used as the threshold. 920 spam emails correctly identified as spam: a 92% success rate.