Copyright 2004, David D. Lewis (Naive) Bayesian Text Classification for Spam Filtering David D. Lewis, Ph.D. Ornarose, Inc. & David D. Lewis Consulting.


1 Copyright 2004, David D. Lewis
(Naive) Bayesian Text Classification for Spam Filtering
David D. Lewis, Ph.D.
Ornarose, Inc. & David D. Lewis Consulting
www.daviddlewis.com
Presented at ASA Chicago Chapter Spring Conference, Loyola Univ., May 7, 2004.

2 Menu
– Spam
– Spam Filtering
– Classification for Spam Filtering
– Classification
– Bayesian Classification
– Naive Bayesian Classification
– Naive Bayesian Text Classification
– Naive Bayesian Text Classification for Spam Filtering
– (Feature Extraction for) Spam Filtering
– Text Classification (for Marketing)
– (Better) Bayesian Classification

3 Spam
– Unsolicited bulk email, or, in practice, whatever email you don't want
– Large fraction of all email sent: Brightmail est. 64%, Postini est. 77%, and still growing
– Est. cost to US businesses exceeded $30 billion in 2003

4 Approaches to Spam Control
– Economic (email pricing, ...)
– Legal (CAN-SPAM, ...)
– Societal pressure (trade groups, ...)
– Securing infrastructure (email servers, ...)
– Authentication (challenge/response, ...)
– Filtering

5 Spam Filtering
– Intensional (feature-based) vs. extensional (white/blacklist-based)
– Applied at sender vs. receiver
– Applied at email client vs. mail server vs. ISP

6 Statistical Classification
1. Define classes of objects
2. Specify probability distribution model connecting classes to observable features
3. Fit parameters of model to data
4. Observe features on inputs and compute probability of class membership
5. Assign object to a class
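
The five steps above can be run end to end on a toy problem. This is a minimal sketch, not anything from the talk: the single binary feature ("message contains the word free"), the training data, and the class names are all made up for illustration.

```python
# Step 1: define the classes.
classes = ["spam", "legit"]

# Toy training set (hypothetical): (feature value, class label) pairs.
data = [(1, "spam"), (1, "spam"), (0, "spam"),
        (1, "legit"), (0, "legit"), (0, "legit"), (0, "legit"), (0, "legit")]

# Steps 2-3: model the feature as a Bernoulli per class; fit by counting.
def fit(data):
    prior, cond = {}, {}
    for c in classes:
        rows = [x for x, y in data if y == c]
        prior[c] = len(rows) / len(data)       # P(class)
        cond[c] = sum(rows) / len(rows)        # P(feature = 1 | class)
    return prior, cond

# Steps 4-5: observe the feature, apply Bayes rule, pick the larger posterior.
def classify(x, prior, cond):
    joint = {c: prior[c] * (cond[c] if x else 1 - cond[c]) for c in classes}
    z = sum(joint.values())
    post = {c: j / z for c, j in joint.items()}
    return max(post, key=post.get), post

prior, cond = fit(data)
label, post = classify(1, prior, cond)
```

On this toy data, a message containing the feature is classified as spam (posterior 2/3), and one without it as legit.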

7 [Diagram: message → Feature Extraction → Classifier → Interpreter]

8 Classification for Spam Filtering
– Define classes: spam vs. nonspam
– Extract features from header, content
– Train classifier
– Classify message and process: block message, insert tag, put in folder, etc.

9 Two Classes of Classifier
– Generative: Naive Bayes, LDA, ...
  – Model joint distribution of class and features
  – Derive class probability by Bayes rule
– Discriminative: logistic regression, CART, ...
  – Model conditional distribution of class given known feature values
  – Model directly estimates class probability

10 Bayesian Classification (1)
1. Define classes
2. Specify probability model
2b. And prior distribution over parameters
3. Find posterior distribution of model parameters, given data
4. Compute class probabilities using posterior distribution (or element of it)
5. Classify object

11 Bayesian Classification (2) = "Naive"/"Idiot"/"Simple" Bayes
– A particular generative model
  – Assumes independence of observable features within each class of messages
  – Bayes rule used to compute class probability
– Might or might not use a prior on model parameters
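
A minimal sketch of the multinomial word-count variant of naive Bayes (a Bernoulli variant is equally common). The training messages are made up, and add-one (Laplace) smoothing stands in for the optional prior on parameters mentioned above:

```python
import math
from collections import Counter

# Toy training data (hypothetical).
train = [
    ("buy cheap meds now", "spam"),
    ("cheap meds cheap meds", "spam"),
    ("meeting agenda attached", "legit"),
    ("lunch meeting tomorrow", "legit"),
]

def fit(train):
    vocab = {w for text, _ in train for w in text.split()}
    prior, counts = {}, {}
    for text, c in train:
        prior[c] = prior.get(c, 0) + 1
        counts.setdefault(c, Counter()).update(text.split())
    prior = {c: k / len(train) for c, k in prior.items()}
    totals = {c: sum(cnt.values()) for c, cnt in counts.items()}
    return vocab, prior, counts, totals

def log_posterior(text, model):
    vocab, prior, counts, totals = model
    scores = {}
    for c in prior:
        s = math.log(prior[c])
        for w in text.split():
            if w in vocab:  # ignore never-seen words
                # Independence assumption: per-word probabilities multiply.
                # Add-one (Laplace) smoothing plays the role of the prior.
                s += math.log((counts[c][w] + 1) / (totals[c] + len(vocab)))
        scores[c] = s
    return scores

model = fit(train)
scores = log_posterior("cheap meds", model)
label = max(scores, key=scores.get)
```

Working in log space avoids underflow when many word probabilities are multiplied together.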

12 Naive Bayes for Text Classification – History
– Maron (JACM, 1961) – automated indexing
– Mosteller and Wallace (1964) – author identification
– Van Rijsbergen, Robertson, Sparck Jones, Croft, Harper (early 1970s) – search engines
– Sahami, Dumais, Heckerman, Horvitz (1998) – spam filtering

13 Bayesian Classification (3)
– Graham's A Plan for Spam, and its mutant offspring...
– Naive Bayes-like classifier with weird parameter estimation
– Widely used in spam filters
  – Classic Naive Bayes superior when appropriately used
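
The "weird parameter estimation" can be sketched roughly. This is an approximate reconstruction from Graham's published essay, not from these slides: good-mail counts are doubled, token probabilities are clamped to [0.01, 0.99], and only the most extreme tokens are combined.

```python
def token_prob(word, good, bad, ngood, nbad):
    g = 2 * good.get(word, 0)          # doubled "good" count
    b = bad.get(word, 0)
    if g + b == 0:
        return 0.4                     # default for never-seen words
    p = min(1.0, b / nbad) / (min(1.0, g / ngood) + min(1.0, b / nbad))
    return max(0.01, min(0.99, p))     # clamp away from 0 and 1

def spam_score(words, good, bad, ngood, nbad, keep=15):
    # Keep only the tokens whose probabilities are furthest from 0.5.
    probs = sorted((token_prob(w, good, bad, ngood, nbad) for w in set(words)),
                   key=lambda p: abs(p - 0.5), reverse=True)[:keep]
    num = den = 1.0
    for p in probs:
        num *= p
        den *= 1 - p
    return num / (num + den)           # flag as spam if the score is high

# Toy counts (hypothetical): word -> occurrences; 4 good and 4 bad messages.
good = {"meeting": 3, "report": 2}
bad = {"cheap": 3, "meds": 2}
score = spam_score(["cheap", "meds"], good, bad, 4, 4)
```

The doubling and clamping are ad hoc biases rather than principled estimates, which is what the slide contrasts with classic naive Bayes.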

14 NB & Friends: Advantages
– Simple to implement: no numerical optimization, matrix algebra, etc.
– Efficient to train and use
  – Fitting = computing means of feature values
  – Easy to update with new data
  – Equivalent to a linear classifier, so fast to apply
– Binary or polytomous
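
The "equivalent to a linear classifier" point can be checked directly: for binary features, the naive Bayes log posterior-odds equals a fixed bias plus a weighted sum over the features that are present. The parameter values below are made up for illustration.

```python
import math

p_spam = {"free": 0.8, "meeting": 0.1}    # P(feature present | spam)
p_legit = {"free": 0.2, "meeting": 0.6}   # P(feature present | legit)
prior_spam = 0.5

def log_odds(present):
    """Direct naive Bayes computation of log P(spam|x) / P(legit|x)."""
    s = math.log(prior_spam / (1 - prior_spam))
    for f in p_spam:
        if f in present:
            s += math.log(p_spam[f] / p_legit[f])
        else:
            s += math.log((1 - p_spam[f]) / (1 - p_legit[f]))
    return s

def linear_score(present):
    """The same value written as bias + per-feature weights."""
    bias = math.log(prior_spam / (1 - prior_spam)) + sum(
        math.log((1 - p_spam[f]) / (1 - p_legit[f])) for f in p_spam)
    w = {f: math.log(p_spam[f] / p_legit[f])
            - math.log((1 - p_spam[f]) / (1 - p_legit[f])) for f in p_spam}
    return bias + sum(w[f] for f in present if f in w)
```

Applying the classifier is therefore just a sparse dot product over the features in the message, which is why it is fast.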

15 NB & Friends: Advantages
– Independence allows parameters to be estimated on different data sets, e.g.
  – Estimate content features from messages with headers omitted
  – Estimate header features from messages with content missing

16 NB & Friends: Advantages
– Generative model
  – Comparatively good effectiveness with small training sets
  – Unlabeled data can be used in parameter estimation (in theory)

17 NB & Friends: Disadvantages
– Independence assumption wrong
  – Absurd estimates of class probabilities
  – Threshold must be tuned, not set analytically
– Generative model
  – Generally lower effectiveness than discriminative techniques (e.g. logistic regression)
  – Improving parameter estimates can hurt classification effectiveness

18 Feature Extraction
– Convert message to feature vector
– Header: sender, recipient, routing, ...
  – Possibly break up domain names
– Text: words, phrases, character strings
  – Become binary or numeric features
– URLs, HTML tags, images, ...
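
A hedged sketch of this extraction step: header fields, domain-name parts, a URL indicator, and body word counts all become entries in one feature dictionary. The sample message and the feature-name scheme (`hdr:`, `dompart:`, `word:`) are illustrative, not from the talk.

```python
import re

raw = """From: promo@deals.example.com
Subject: you can buy now
Visit http://deals.example.com/offer today"""

def extract_features(raw):
    feats = {}
    body_lines = []
    for line in raw.splitlines():
        m = re.match(r"(From|To|Subject):\s*(.*)", line)
        if m:
            name, value = m.groups()
            feats["hdr:" + name.lower()] = value
            # Break up domain names, as the slide suggests.
            for dom in re.findall(r"@([\w.-]+)", value):
                for part in dom.split("."):
                    feats["dompart:" + part] = 1
        else:
            body_lines.append(line)
    body = " ".join(body_lines)
    if re.search(r"https?://\S+", body):
        feats["has_url"] = 1
        body = re.sub(r"https?://\S+", " ", body)  # don't tokenize URLs as words
    for w in re.findall(r"[a-z]+", body.lower()):
        feats["word:" + w] = feats.get("word:" + w, 0) + 1
    return feats

feats = extract_features(raw)
```

The resulting dictionary is the sparse feature vector a classifier would consume.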

19 [image slide; no transcript text]

20 [image slide; no transcript text]

21 Example spam message:
From: Sam Elegy aj6xfdou7@yahoo.com
To: ddlewis4@att.net
Subject: you can buy V!@gra
Annotations on the example:
– Spamlike content in image form
– Irrelevant legit content; doubles as hash buster
– Typographic variations
– Randomly generated name and email

22 Defeating Feature Extraction
– Misspellings, character set choice, HTML games: mislead extraction of words
– Put content in images
– Forge headers (to avoid identification, but also interferes with classification)
– Innocuous content to mimic distribution in nonspam
– Hashbusters (zyArh73Gf) clog dictionaries
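
One possible countermeasure to typographic tricks like "V!@gra" is to normalize common character substitutions before tokenizing, so variants map to the same feature. This sketch is illustrative only; the substitution table is made up and far from complete.

```python
# Map common look-alike characters back to letters before tokenizing.
SUBS = str.maketrans({"!": "i", "@": "a", "0": "o", "1": "l", "3": "e", "$": "s"})

def normalize(word):
    return word.lower().translate(SUBS)
```

Spammers respond by varying the substitutions, which is part of the arms race the next slide describes.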

23 Survival of the Fittest
– Filter designers get to see spam
– Spammers use spam filters
– Unprecedented arms race for a statistical field
– Countermeasures mostly target feature extraction, not modeling assumptions

24 Miscellany
1. Getting legitimate bulk mail past spam filters
2. Other uses of text classification in marketing
3. Frontiers in Bayesian classification

25 Getting Legit Bulk Email Past Filters
– Test email against several filters
  – Send to accounts on multiple ISPs
  – Multiple client-based filters if particularly concerned
– Coherent content, correctly spelled
– Non-tricky headers and markup
– Avoid spam keywords where possible
– Don't use spammer tricks

26 Text Classification in Marketing
– Routing incoming email
  – Responses to promotions
  – Detect opportunities for selling
  – (Automated response sometimes possible)
– Analysis of text/mixed data on customers, e.g. customer or CSR comments
– Content analysis: focus groups, email, chat, blogs, news, ...

27 Better Bayesian Classification
– Discriminative
  – Logistic regression with informative priors
  – Sharing strength across related problems
  – Calibration and confidence of predictions
– Generative
  – Bayesian networks/graphical models
  – Use of unlabeled and partially labeled data
– Hybrid

28 BBR
– Logistic regression w/ informative priors
  – Gaussian = ridge logistic regression
  – Laplace = lasso logistic regression
– Sparse data structures & fast optimizer: 10^4 cases, 10^5 predictors, few seconds!
– Accuracy competitive with SVMs
– Free for research use: www.stat.rutgers.edu/~madigan/BBR/
– Joint work w/ Madigan & Genkin (Rutgers)
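
The prior/penalty correspondence can be sketched with plain MAP logistic regression: a Gaussian prior on the weights is equivalent to the ridge (L2) penalty, while a Laplace prior gives the lasso (L1) penalty and sparse weights. The toy data, learning rate, and step count below are made up, and BBR itself uses a far faster specialized optimizer than this gradient descent.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit_map(X, y, lam=0.1, lr=0.5, steps=500):
    """Maximize log-likelihood + log Gaussian prior by gradient descent."""
    d = len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [lam * wj for wj in w]              # gradient of Gaussian prior
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j]             # gradient of log loss
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

# Column 0 is a constant bias feature; column 1 is a spam-indicating feature.
X = [[1, 1], [1, 1], [1, 0], [1, 0]]
y = [1, 1, 0, 0]
w = fit_map(X, y)
p_spam_if_present = sigmoid(w[0] + w[1])
```

Because the data are perfectly separated by the indicator feature, the prior is what keeps the weights finite; with no penalty they would grow without bound.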

29 Gaussian vs. Laplace Prior [figure: density curves of the Gaussian and Laplace priors]

30 Future of Spam Filtering
– More attention to training data selection, personalization
– Image processing
– Robustness against word variations
– More linguistic sophistication
– Replacing naive Bayes with better learners
– Keep hoping for economic cure

31 Summary
– By volume, spam filtering is easily the biggest application of text classification, and possibly of supervised learning
– Filters have helped a lot
  – Naive Bayes is just a starting point
– Other interesting applications of Bayesian classification

