Presentation transcript:

Copyright 2004, David D. Lewis

(Naive) Bayesian Text Classification for Spam Filtering
David D. Lewis, Ph.D.
Ornarose, Inc. & David D. Lewis Consulting
Presented at the ASA Chicago Chapter Spring Conference, Loyola University, May 7, 2004

Menu
–Spam
–Spam Filtering
–Classification for Spam Filtering
–Classification
–Bayesian Classification
–Naive Bayesian Classification
–Naive Bayesian Text Classification
–Naive Bayesian Text Classification for Spam Filtering
–(Feature Extraction for) Spam Filtering
–Text Classification (for Marketing)
–(Better) Bayesian Classification

Spam
–Unsolicited bulk email
  –or, in practice, whatever you don't want
–Large fraction of all email sent
  –Brightmail est. 64%, Postini est. 77%
  –Still growing
–Est. cost to US businesses exceeded $30 billion in 2003

Approaches to Spam Control
–Economic (email pricing, ...)
–Legal (CAN-SPAM, ...)
–Societal pressure (trade groups, ...)
–Securing infrastructure (email servers, ...)
–Authentication (challenge/response, ...)
–Filtering

Spam Filtering
–Intensional (feature-based) vs. extensional (whitelist/blacklist-based)
–Applied at sender vs. receiver
–Applied at client vs. mail server vs. ISP

Statistical Classification
1. Define classes of objects
2. Specify a probability distribution model connecting classes to observable features
3. Fit the parameters of the model to data
4. Observe features on inputs and compute the probability of class membership
5. Assign the object to a class
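Steps 4 and 5 have a standard one-line formalization (added here for clarity; it is not on the original slide): assign an input with observed features x to its most probable class,

    \hat{c} = \arg\max_{c} P(c \mid x)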

[Diagram: a CLASSIFIER pipeline with three stages: Feature Extraction, Classifier, Interpreter.]

Classification for Spam Filtering
–Define classes: spam vs. legitimate email
–Extract features from header, content
–Train classifier
–Classify message and process:
  –Block message, insert tag, put in folder, etc.

Two Classes of Classifier
–Generative: Naive Bayes, LDA, ...
  –Model joint distribution of class and features
  –Derive class probability by Bayes rule
–Discriminative: logistic regression, CART, ...
  –Model conditional distribution of class given known feature values
  –Model directly estimates class probability
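The contrast has a standard two-line summary (added for clarity). A generative model fits P(x | c) and P(c), then inverts with Bayes rule; a discriminative model such as logistic regression parameterizes P(c | x) directly:

    P(c \mid x) = \frac{P(x \mid c)\,P(c)}{\sum_{c'} P(x \mid c')\,P(c')} \quad \text{(generative)}

    P(\text{spam} \mid x) = \frac{1}{1 + e^{-(\mathbf{w}\cdot\mathbf{x} + b)}} \quad \text{(discriminative)}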

Bayesian Classification (1)
1. Define classes
2. Specify probability model
2b. And prior distribution over parameters
3. Find posterior distribution of model parameters, given data
4. Compute class probabilities using posterior distribution (or an element of it)
5. Classify object
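In symbols (a standard restatement of steps 3 and 4, added for clarity; \theta denotes the model parameters and D the training data):

    p(\theta \mid D) \propto p(D \mid \theta)\,p(\theta)

    P(c \mid x, D) = \int P(c \mid x, \theta)\,p(\theta \mid D)\,d\theta \;\approx\; P(c \mid x, \hat{\theta})

The approximation on the right uses a single element of the posterior, e.g. its mode \hat{\theta}.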

Bayesian Classification (2) = "Naive"/"Idiot"/"Simple" Bayes
–A particular generative model
  –Assumes independence of observable features within each class of messages
  –Bayes rule used to compute class probability
–Might or might not use a prior on model parameters
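To make the model concrete, here is a minimal multinomial Naive Bayes sketch written for this transcript (not code from the talk; the add-one smoothing constant and the toy messages are illustrative assumptions):

    import math
    from collections import Counter

    def train_nb(docs, labels):
        """Fit Naive Bayes: per-class token counts plus class frequencies.
        docs is a list of token lists; labels gives each doc's class."""
        counts = {c: Counter() for c in set(labels)}
        class_totals = Counter(labels)
        for tokens, c in zip(docs, labels):
            counts[c].update(tokens)
        vocab = {t for ctr in counts.values() for t in ctr}
        return counts, class_totals, vocab

    def log_posterior(tokens, counts, class_totals, vocab, alpha=1.0):
        """Unnormalized log P(class | tokens) with add-alpha smoothing.
        The sum over tokens is exactly the naive independence assumption."""
        n = sum(class_totals.values())
        scores = {}
        for c, ctr in counts.items():
            total = sum(ctr.values())
            score = math.log(class_totals[c] / n)  # log prior
            for t in tokens:
                p = (ctr[t] + alpha) / (total + alpha * len(vocab))
                score += math.log(p)               # per-token log-likelihood
            scores[c] = score
        return scores

    # Toy usage
    docs = [["buy", "cheap", "pills"], ["meeting", "agenda", "monday"]]
    model = train_nb(docs, ["spam", "legit"])
    print(log_posterior(["buy", "pills", "monday"], *model))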

Naive Bayes for Text Classification - History
–Maron (JACM, 1961): automated indexing
–Mosteller and Wallace (1964): author identification
–van Rijsbergen, Robertson, Sparck Jones, Croft, Harper (early 1970s): search engines
–Sahami, Dumais, Heckerman, Horvitz (1998): spam filtering

Bayesian Classification (3)
–Graham's "A Plan for Spam"
  –And its mutant offspring...
–Naive Bayes-like classifier with weird parameter estimation
–Widely used in spam filters
  –Classic Naive Bayes superior when appropriately used

NB & Friends: Advantages
–Simple to implement
  –No numerical optimization, matrix algebra, etc.
–Efficient to train and use
  –Fitting = computing means of feature values
  –Easy to update with new data
  –Equivalent to a linear classifier, so fast to apply
–Binary or polytomous
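The "equivalent to a linear classifier" point deserves one line of algebra (a standard derivation, added for clarity). For binary features with p_j = P(x_j = 1 | spam) and q_j = P(x_j = 1 | legit), the log odds are linear in x:

    \log\frac{P(\text{spam} \mid x)}{P(\text{legit} \mid x)} = \sum_j x_j \log\frac{p_j(1 - q_j)}{q_j(1 - p_j)} + \sum_j \log\frac{1 - p_j}{1 - q_j} + \log\frac{P(\text{spam})}{P(\text{legit})}

so applying the classifier is just a thresholded dot product with precomputed weights.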

NB & Friends: Advantages
–Independence allows parameters to be estimated on different data sets, e.g.
  –Estimate content features from messages with headers omitted
  –Estimate header features from messages with content missing

NB & Friends: Advantages
–Generative model
  –Comparatively good effectiveness with small training sets
  –Unlabeled data can be used in parameter estimation (in theory)

NB & Friends: Disadvantages
–Independence assumption wrong
  –Absurd estimates of class probabilities
  –Threshold must be tuned, not set analytically
–Generative model
  –Generally lower effectiveness than discriminative techniques (e.g. logistic regression)
  –Improving parameter estimates can hurt classification effectiveness
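Because the probability estimates are uncalibrated, the threshold is usually picked empirically. A minimal sketch of one way to do it (illustrative only; the cost values are assumptions): sweep candidate thresholds over validation-set scores and keep the cheapest.

    def tune_threshold(scores, labels, fp_cost=10.0, fn_cost=1.0):
        """Pick the score threshold minimizing total misclassification cost.
        scores: classifier log-odds per message; labels: True if spam."""
        best_t, best_cost = None, float("inf")
        for t in sorted(set(scores)):
            cost = sum(fp_cost for s, y in zip(scores, labels) if s >= t and not y) \
                 + sum(fn_cost for s, y in zip(scores, labels) if s < t and y)
            if cost < best_cost:
                best_t, best_cost = t, cost
        return best_t

    # Toy usage on four validation messages (two legit, two spam)
    print(tune_threshold([-2.0, -0.5, 1.2, 3.4], [False, False, True, True]))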

Feature Extraction
–Convert message to feature vector
–Header: sender, recipient, routing, ...
  –Possibly break up domain names
–Text
  –Words, phrases, character strings
  –Become binary or numeric features
–URLs, HTML tags, images, ...
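A minimal sketch of what such an extractor might look like (written for this transcript; the regexes and feature names are illustrative assumptions, not from the talk):

    import re
    from email import message_from_string

    def extract_features(raw_message):
        """Map a raw message to a sparse binary feature dict."""
        msg = message_from_string(raw_message)
        features = {}
        # Header features, breaking the sender's domain name into parts
        for part in re.split(r"[@.]", msg.get("From", "").lower()):
            if part:
                features["from:" + part] = 1
        # Text features: lowercase word tokens from the body
        payload = msg.get_payload()
        body = payload if isinstance(payload, str) else ""
        for word in re.findall(r"[a-z']+", body.lower()):
            features["word:" + word] = 1
        # Structural features: URLs and HTML tags
        features["has_url"] = int("http://" in body or "https://" in body)
        features["has_html"] = int(bool(re.search(r"<[a-z]+[^>]*>", body, re.I)))
        return features

    sample = "From: x@example.com\nSubject: hi\n\nBuy pills at http://example.net"
    print(extract_features(sample))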

[Example spam message, annotated. Headers: From: Sam Elegy; Subject: "you can buy ...". Callouts mark: spam-like content in image form; irrelevant legitimate content, doubling as a hash buster; typographic variations; a randomly generated sender name and address.]

Defeating Feature Extraction
–Misspellings, character set choice, HTML games: mislead extraction of words
–Put content in images
–Forge headers (to avoid identification, but also interferes with classification)
–Innocuous content to mimic distribution in nonspam
–Hashbusters (zyArh73Gf) clog dictionaries

Survival of the Fittest
–Filter designers get to see spam
–Spammers use spam filters
–Unprecedented arms race for a statistical field
–Countermeasures mostly target feature extraction, not modeling assumptions

Miscellany
1. Getting legitimate bulk mail past spam filters
2. Other uses of text classification in marketing
3. Frontiers in Bayesian classification

Getting Legit Bulk Email Past Filters
–Test against several filters
  –Send to accounts on multiple ISPs
  –Multiple client-based filters if particularly concerned
–Coherent content, correctly spelled
–Non-tricky headers and markup
–Avoid spam keywords where possible
–Don't use spammer tricks

Text Classification in Marketing
–Routing incoming email
  –Responses to promotions
  –Detect opportunities for selling
  –(Automated response sometimes possible)
–Analysis of text/mixed data on customers
  –e.g. customer or CSR comments
–Content analysis
  –Focus groups, email, chat, blogs, news, ...

Better Bayesian Classification
–Discriminative
  –Logistic regression with informative priors
  –Sharing strength across related problems
  –Calibration and confidence of predictions
–Generative
  –Bayesian networks/graphical models
  –Use of unlabeled and partially labeled data
–Hybrid

BBR
–Logistic regression with informative priors
  –Gaussian prior = ridge logistic regression
  –Laplace prior = lasso logistic regression
–Sparse data structures & fast optimizer
  –10^4 cases, 10^5 predictors, few seconds!
–Accuracy competitive with SVMs
–Free for research use
–Joint work with Madigan & Genkin (Rutgers)
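BBR is Lewis, Madigan, and Genkin's own package; as a rough latter-day analogue (an illustrative sketch, not BBR's API), L2- and L1-penalized logistic regression in scikit-learn correspond to the Gaussian and Laplace priors:

    from sklearn.linear_model import LogisticRegression

    # Gaussian prior on the weights <=> L2 (ridge) penalty
    ridge = LogisticRegression(penalty="l2", C=1.0)

    # Laplace prior <=> L1 (lasso) penalty; drives many weights to exactly zero,
    # giving the sparse classifiers that matter at 10^5 predictors
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)

    X = [[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 1]]  # toy binary word features
    y = [1, 0, 1, 0]                                   # 1 = spam
    lasso.fit(X, y)
    print(lasso.coef_)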

Gaussian vs. Laplace Prior
[Figure: the two prior densities, Gaussian and Laplace, plotted over a weight value.]
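For reference (standard densities, added for clarity), the two priors on a weight w are

    p(w) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{w^2}{2\sigma^2}\right) \quad \text{(Gaussian)}

    p(w) = \frac{\lambda}{2}\exp(-\lambda\lvert w\rvert) \quad \text{(Laplace)}

The Laplace prior's sharp peak at zero is what makes the posterior mode sparse, i.e. the lasso behavior.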

Future of Spam Filtering
–More attention to training data selection, personalization
–Image processing
–Robustness against word variations
–More linguistic sophistication
–Replacing naive Bayes with better learners
–Keep hoping for economic cure

Summary
–By volume, spam filtering is easily the biggest application of text classification
  –Possibly the biggest application of supervised learning
–Filters have helped a lot
  –Naive Bayes is just a starting point
–Other interesting applications of Bayesian classification