SPAM FILTERING. By Ankur Khator (01005028), Gaurav Sharma (01005029), Arpit Mathur (01D05014)

What is Spam? "Junk email" or "unsolicited commercial email". Spam filtering is a special case of text classification with only 2 classes: Spam and Non-spam.

Various Approaches. Bayesian learning: a probabilistic model for spam filtering using a bag-of-words representation. RIPPER algorithm: context-sensitive rule learning. Boosting: improving accuracy by combining weaker hypotheses.

Term Vectors
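The term-vector table from this slide is not in the transcript. As a stand-in, a minimal Python sketch of the bag-of-words representation mentioned above, turning each mail into a binary term vector over a small made-up vocabulary:

# Minimal sketch (illustrative vocabulary and messages, not from the slides).
def term_vector(text, vocabulary):
    """Return a binary bag-of-words vector: 1 if the word occurs, else 0."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

vocabulary = ["free", "money", "quick", "rabbit", "meeting"]
mails = ["Free money quick", "The quick rabbit rests", "Meeting at noon"]
for mail in mails:
    print(mail, "->", term_vector(mail, vocabulary))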

Naive Bayes for Spam. We seek a model for P(Y=1 | X1=x1, X2=x2, ..., Xd=xd). From Bayes' theorem:
P(Y=1 | X1=x1, ..., Xd=xd) = P(Y=1) · P(X1=x1, ..., Xd=xd | Y=1) / P(X1=x1, ..., Xd=xd)
P(Y=0 | X1=x1, ..., Xd=xd) = P(Y=0) · P(X1=x1, ..., Xd=xd | Y=0) / P(X1=x1, ..., Xd=xd)

Justification for using Bayes' theorem: because of the sparseness of the data, P(B|A) can be determined more easily and accurately than P(A|B).

Naive Bayes for Spam (contd.) Assume conditional independence: P(X1=x1, ..., Xd=xd | Y=k) = ∏i P(Xi=xi | Y=k). Also assume binary features: Xi = 1 if word i occurs at least once in the mail, and Xi = 0 otherwise.

Naive Bayes for Spam (contd.) The log ratios log( P(Xi=xi | Y=1) / P(Xi=xi | Y=0) ) are referred to as weights of evidence. An inconsistency arises when some probability estimate is zero, so smooth the estimates by adding a small positive constant to both the numerator and the denominator of each probability estimate.
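A minimal Python sketch of the model described on these slides, assuming binary word features, conditional independence, and a small smoothing constant; all names and the value of the constant are illustrative:

import math

ALPHA = 1.0  # smoothing constant added to numerator/denominator (illustrative value)

def train(spam_vectors, ham_vectors):
    """Estimate P(X_i = 1 | Y) for each word, with smoothing."""
    def word_probs(vectors):
        n, d = len(vectors), len(vectors[0])
        return [(sum(v[i] for v in vectors) + ALPHA) / (n + 2 * ALPHA)
                for i in range(d)]
    prior_spam = len(spam_vectors) / (len(spam_vectors) + len(ham_vectors))
    return prior_spam, word_probs(spam_vectors), word_probs(ham_vectors)

def weight_of_evidence(x_i, p_spam_i, p_ham_i):
    """log P(X_i = x_i | spam) - log P(X_i = x_i | legitimate)."""
    if x_i == 1:
        return math.log(p_spam_i) - math.log(p_ham_i)
    return math.log(1 - p_spam_i) - math.log(1 - p_ham_i)

def score(x, prior_spam, p_spam, p_ham):
    """Log-odds of spam: log prior ratio plus the summed weights of evidence."""
    s = math.log(prior_spam) - math.log(1 - prior_spam)
    for x_i, ps, ph in zip(x, p_spam, p_ham):
        s += weight_of_evidence(x_i, ps, ph)
    return s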

Classifying a new mail with the text "The quick rabbit rests": the weights of evidence sum to a score of 2.63, which corresponds to a spam probability of 0.93.
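The two numbers on this slide are consistent with the score being the log-odds of spam (that reading is my assumption): applying the logistic function to 2.63 gives 0.93.

import math

score = 2.63                                  # summed weights of evidence from the slide
probability = 1.0 / (1.0 + math.exp(-score))  # logistic transform of the log-odds
print(round(probability, 2))                  # 0.93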

Threshold: a lower threshold gives a higher false positive rate; a higher threshold gives a higher false negative rate, which is preferred (losing legitimate mail is worse than letting some spam through).
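A tiny sketch of the thresholding rule, with an illustrative threshold value:

THRESHOLD = 1.0  # illustrative value; raising it trades false positives for false negatives

def classify(score, threshold=THRESHOLD):
    return "spam" if score > threshold else "legitimate"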

Non-Linear Classification. A linear classifier ignores the effect of a word's context on its meaning, which is unrealistic. Building a linear classifier that tests for more complex features, such as simultaneous occurrences of words, has a high computation cost. Non-linear classification is the solution.

RIPPER. The learned hypothesis is a disjunction of different contexts, and each context is a conjunction of simple terms. For example, a context for w1 is: w2 belongs to the data and w3 belongs to the data, i.e. for the context to be true, w1 must occur together with w2 and w3. Three components of the RIPPER algorithm:

Rule Learning:
Spam ← 'spam' ∈ Subject
Spam ← 'Free' ∈ Subject AND 'spam' ∈ Subject
Spam ← 'Gift!!' ∈ Subject AND 'Click' ∈ Subject
The learned rule set is the disjunction of the three rules above. There is also an initial set of rules.
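A minimal sketch of how such a rule set can be represented and applied: each rule is a conjunction of "word appears in the Subject" tests, and the rule set is their disjunction. This is an illustration, not Cohen's implementation:

# Each rule is a set of words that must all appear in the Subject line,
# mirroring the three example rules on the slide.
rules = [
    {"spam"},             # Spam <- 'spam' in Subject
    {"free", "spam"},     # Spam <- 'Free' in Subject AND 'spam' in Subject
    {"gift!!", "click"},  # Spam <- 'Gift!!' in Subject AND 'Click' in Subject
]

def is_spam(subject, rules):
    words = set(subject.lower().split())
    return any(rule.issubset(words) for rule in rules)

print(is_spam("Click here for your Gift!!", rules))  # True
print(is_spam("Project meeting agenda", rules))      # False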

Constructing the Rule Set. The initial rule set is constructed using a greedy strategy, based on IREP (Incremental Reduced Error Pruning). To construct a new rule, the dataset is partitioned into two parts, a growing (training) set and a pruning set. A single condition is added to the rule at a time.

Simplification and Optimization. At every step, the condition that most increases the density of positive examples covered is added; adding stops when the clause covers no negative example or there is no positive gain. After this, pruning, i.e. simplification, is done, again following a greedy strategy.
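A simplified Python sketch of the grow step just described: one condition is added at a time, greedily picking the word that most increases the density of positive examples covered, until no negative example is covered or there is no gain. This is a simplification, not the full IREP/RIPPER procedure:

def grow_rule(grow_set, vocabulary):
    """grow_set: list of (set_of_words, label) pairs, label 1 = spam, 0 = legitimate.
    vocabulary: set of candidate words to test."""
    rule = set()
    covered = grow_set
    while covered:
        pos = sum(y for _, y in covered)
        if pos == len(covered):                  # covers no negative example: stop
            break
        best_word, best_density, best_covered = None, pos / len(covered), covered
        for w in vocabulary - rule:
            cov = [(words, y) for words, y in covered if w in words]
            if cov:
                d = sum(y for _, y in cov) / len(cov)
                if d > best_density:
                    best_word, best_density, best_covered = w, d, cov
        if best_word is None:                    # no positive gain: stop
            break
        rule.add(best_word)                      # add a single condition
        covered = best_covered
    return rule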

Reaching Sufficient Rules. During pruning, the condition whose deletion maximizes a scoring function of U+ and U−, the positive and negative examples covered on the pruning set, is deleted. Rules keep being added as long as there is information gain, i.e. each new rule covers positive examples; but if the data is noisy, the number of rules grows large.

MDL. Several heuristics are applied to solve this problem; MDL (Minimum Description Length) is one of them. After each rule is added, the total description length of the current rule set plus the examples is calculated. Rule addition stops when this length is more than d bits larger than the shortest description length found so far.
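A minimal sketch of this stopping test; the description-length computation itself is a placeholder, since the slide does not give RIPPER's actual encoding of rules and exceptions:

D_BITS = 64  # stop once the rule set is this many bits worse than the best seen (illustrative)

def keep_adding_rules(lengths_so_far, d_bits=D_BITS):
    """lengths_so_far: total description length recorded after each rule was added."""
    return lengths_so_far[-1] <= min(lengths_so_far) + d_bits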

AdaBoost. It is easy to find rules of thumb that are often correct (e.g. if 'buy now' occurs in the message, predict 'spam'), but hard to find a single rule that is very accurate. AdaBoost helps here: it is a general method for converting rough rules of thumb into a highly accurate prediction rule by concentrating on the hard examples.

Pictorially

Algorithm
Input: S = {(x_i, y_i)}, i = 1..m, with labels y_i ∈ {-1, +1}
Initialize D_1(i) = 1/m for all i
For t = 1 to T:
  h_t = WeakLearner(S, D_t)
  Choose β_t = ½ ln((1 − ε_t) / ε_t), where ε_t is the weighted error of h_t (proven to minimize training error for the 2-class case) [2]
  Update D_{t+1}(i) = D_t(i) · exp(−β_t y_i h_t(x_i)) and normalize
Final hypothesis: f(x) = Σ_t β_t h_t(x)
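A minimal Python sketch of the algorithm above, using one-feature decision stumps over binary word features as the weak learner (that choice of weak learner is mine, not from the slides):

import math

def weak_learner(X, y, D):
    """Pick the (feature, sign) stump with the lowest weighted error."""
    m, d = len(X), len(X[0])
    best = None
    for j in range(d):
        for sign in (+1, -1):
            preds = [sign if X[i][j] == 1 else -sign for i in range(m)]
            err = sum(D[i] for i in range(m) if preds[i] != y[i])
            if best is None or err < best[0]:
                best = (err, j, sign)
    err, j, sign = best
    return (lambda x, j=j, sign=sign: sign if x[j] == 1 else -sign), err

def adaboost(X, y, T=10):
    """X: binary feature vectors, y: labels in {-1, +1}."""
    m = len(X)
    D = [1.0 / m] * m                                 # D_1(i) = 1/m
    hypotheses = []
    for _ in range(T):
        h, eps = weak_learner(X, y, D)
        eps = min(max(eps, 1e-10), 1 - 1e-10)         # guard against 0/1 error
        beta = 0.5 * math.log((1 - eps) / eps)        # minimizes training error
        D = [D[i] * math.exp(-beta * y[i] * h(X[i])) for i in range(m)]
        Z = sum(D)
        D = [w / Z for w in D]                        # normalize
        hypotheses.append((beta, h))
    return lambda x: sum(b * h(x) for b, h in hypotheses)   # f(x) = sum_t beta_t h_t(x)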

Example

Accuracy
Weighted accuracy measure: (λ·L⁻ + S⁺) / (λ·L + S), where λ is a strictness measure, L = number of legitimate messages, S = number of spam messages, L⁻ = number of legitimate messages classified as legitimate, S⁺ = number of spam messages classified as spam.
Improving accuracy: increase λ, or introduce a threshold θ so that an example is classified positive only if f(x) > θ (the default is zero).
Recall: correctly predicted spam out of the number of spam messages in the corpus. Precision: correctly classified spam out of the number of messages predicted as spam.
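The same measures computed from raw counts; a small sketch with names of my own choosing:

def weighted_accuracy(L, S, L_minus, S_plus, lam=1.0):
    """L/S: number of legitimate/spam messages; L_minus/S_plus: correctly classified ones."""
    return (lam * L_minus + S_plus) / (lam * L + S)

def recall(S, S_plus):
    return S_plus / S                # correctly predicted spam / spam in the corpus

def precision(S_plus, predicted_spam):
    return S_plus / predicted_spam   # correctly classified spam / predicted as spam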

Results on the corpus PU1 [1]: recall, precision and accuracy for boosted decision trees of depth 1 and depth 5 at various θ and λ settings (the numeric values of the original table are not preserved in the transcript).

Pros and Cons
Pros: fast and simple; no parameters to tune; flexible, can be combined with any learning algorithm; no knowledge of the WeakLearner is needed; error decreases exponentially; robust to overfitting.
Cons: data driven, requires lots of data; performance depends on the WeakLearner; may fail if the WeakLearner is too weak.

Conclusion. RIPPER works better than Naïve Bayes as a text categorization algorithm (the advantage is larger with more classes); the two are comparable for spam filtering (2 classes). Boosting performs better than any weak learner it builds on.

References
[1] Xavier Carreras and Lluís Màrquez. Boosting Trees for Anti-Spam Email Filtering.
[2] Robert E. Schapire. The Boosting Approach to Machine Learning: An Overview. In MSRI Workshop on Nonlinear Estimation and Classification.
[3] David Madigan. Statistics and the War on Spam.
[4] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos. An Evaluation of Naive Bayesian Anti-Spam Filtering. In Proc. of the Workshop on Machine Learning in the New Information Age.
[5] William W. Cohen and Yoram Singer. Context-sensitive Learning Methods for Text Categorization. SIGIR 1996.