Naïve Bayes Advanced Statistical Methods in NLP Ling 572 January 17, 2012

Roadmap
- Motivation: example: spam tagging
- Naïve Bayes model: Bayesian structure; the naïve assumption
- Training & decoding
- Variants
- Issues

Bayesian Spam Filtering Task: Given an email message, classify it as 'spam' or 'not spam'. A form of automatic text categorization. Features?

Doc1 Western Union Money Transfer One Bishops Square Akpakpa E1 6AO, Cotonou Benin Republic Website: info/selectCountry.asP Phone: Attention Beneficiary, This to inform you that the federal ministry of finance Benin Republic has started releasing scam victim compensation fund mandated by United Nation Organization through our office. I am contacting you because our agent have sent you the first payment of $5,000 for your compensation funds total amount of $ USD (Five hundred thousand united state dollar) We need your urgent response so that we shall release your payment information to you. You can call our office hot line for urgent attention( )

Doc2 Hello! my dear. How are you today and your family? I hope all is good, kindly pay Attention and understand my aim of communicating you today through this Letter, My names is Saif al-Islam al-Gaddafi the Son of former Libyan President. i was born on 1972 in Tripoli Libya,By Gaddafi’s second wive. I want you to help me clear this fund in your name which i deposited in Europe please i would like this money to be transferred into your account before they find it. the amount is ,000 million GBP British Pounds sterling through a

Doc4 from: REMINDER: If you have not received a PIN number to vote in the elections and have not already contacted us, please contact either Drago Radev or Priscilla Rasmussen right away. Everyone who has not received a pin but who has contacted us already will get a new pin over the weekend. Anyone who still wants to join for 2011 needs to do this by Monday (November 7th) in order to be eligible to vote. And, if you do have your PIN number and have not voted yet, remember every vote

What are good features?

Possible Features
Words! A feature for each word:
- Binary: presence/absence
- Integer: occurrence count
Particular word types: money/sex terms, e.g. the pattern [Vv].*gr.* (obfuscated 'viagra'-like spellings)
Errors: spelling, grammar
Images
Header info
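To make the word features concrete, here is a minimal extraction sketch in Python (the function name, the tokenization, and the <VIAGRA_LIKE> pattern feature are illustrative choices, not part of the lecture):

```python
import re

def extract_features(text):
    """Map an email's text to a set of binary (presence/absence) word features."""
    words = [re.sub(r"\W+", "", w) for w in text.lower().split()]
    features = {w for w in words if w}          # presence/absence only
    # Hypothetical pattern feature for obfuscated 'viagra'-like spellings
    if any(re.match(r"[Vv].*gr.*", w) for w in text.split()):
        features.add("<VIAGRA_LIKE>")
    return features

print(extract_features("URGENT: claim your $500,000 compensation fund now"))
```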

Classification w/ Features
Probabilistic framework: higher probability of spam given the features than of !spam: P(spam|f) > P(!spam|f).
Combining features? Could use existing models: decision trees, nearest neighbor.
Alternative: consider the association of each feature with spam/!spam, then combine.

ML Questions: Reminder
Modeling: What is the model structure? Why is it called 'Naïve Bayes'? What assumptions does it make? What types of parameters are learned? How many parameters?
Training: How are model parameters learned from data?
Decoding: How is the model used to classify new data?

Probabilistic Model
Given an instance x with features f_1…f_k, find the class with highest probability.
Formally, x = (f_1, f_2, …, f_k); find c* = argmax_c P(c|x).
Applying Bayes' rule: c* = argmax_c P(x|c)P(c)/P(x)
Since P(x) is constant across classes, maximizing gives: c* = argmax_c P(x|c)P(c)

Naïve Bayes Model
So far, just Bayes' rule. Key question: how do we handle/combine features?
Consider just P(x|c): P(x|c) = P(f_1, f_2, …, f_k | c)
Can we simplify? (Remember n-grams.) Assume conditional independence of the features given the class:
P(f_1, f_2, …, f_k | c) = Π_j P(f_j | c)

Naïve Bayes Model
The Naïve Bayes model assumes the features f are independent of each other, given the class C.
[Diagram: class node c with arrows to feature nodes f_1, f_2, f_3, …, f_k]
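Written out, the structure in the diagram corresponds to the standard Naïve Bayes derivation (restated here in LaTeX for reference; this is just the algebra from the previous slides):

```latex
\begin{aligned}
c^* &= \operatorname*{argmax}_c P(c \mid f_1,\dots,f_k)
     = \operatorname*{argmax}_c \frac{P(f_1,\dots,f_k \mid c)\,P(c)}{P(f_1,\dots,f_k)} \\
    &= \operatorname*{argmax}_c P(c)\,P(f_1,\dots,f_k \mid c)
     = \operatorname*{argmax}_c P(c) \prod_{j=1}^{k} P(f_j \mid c)
\end{aligned}
```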

Model Parameters
Our model: c* = argmax_c P(c) Π_j P(f_j|c)
How many types of parameters? Two:
- P(c): class priors
- P(f_j|c): class-conditional feature probabilities
How many parameters in total? |priors| + |conditional probabilities| = |C| + |F||C|, i.e. |C| + |V||C| if the features are words in a vocabulary V.
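A toy parameter count under this breakdown (the class count and vocabulary size below are made-up numbers):

```python
num_classes = 2        # e.g., spam vs. not-spam
vocab_size = 50_000    # assumed vocabulary size

priors = num_classes                     # one P(c) per class
conditionals = num_classes * vocab_size  # one P(f_j | c) per word-class pair
print(priors + conditionals)             # 100,002 parameters in total
```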

Training
Parameter estimation from labeled training data, by maximum likelihood estimation (the standard relative-frequency estimates):
Class priors: P(c_j) = N(c_j) / N, the fraction of training instances labeled c_j.
Conditional probabilities: P(f_k|c_j) = count(f_k, c_j) / Σ_t count(f_t, c_j), the relative frequency of feature f_k among all features seen with class c_j.
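A minimal counting-based MLE training sketch following these estimates (names like train_mle are illustrative; docs is assumed to be a list of (features, label) pairs):

```python
from collections import Counter, defaultdict

def train_mle(docs):
    """docs: list of (features, label) pairs; features is an iterable of feature values."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)   # feature_counts[c][f] = count(f, c)
    for features, label in docs:
        class_counts[label] += 1
        feature_counts[label].update(features)

    n_docs = sum(class_counts.values())
    priors = {c: class_counts[c] / n_docs for c in class_counts}
    conditionals = {
        c: {f: n / sum(counts.values()) for f, n in counts.items()}
        for c, counts in feature_counts.items()
    }
    return priors, conditionals
```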

Training: MLE issues?
What is the probability of a feature never seen with class c_i? Zero. What happens then? The whole product Π_j P(f_j|c_i) becomes zero.
Solutions? Smoothing: Laplace smoothing, Good-Turing, Witten-Bell, interpolation, backoff, …

What are Zero Counts?
Some of those zeros are really zeros: things that really can't or shouldn't happen. On the other hand, some of them are just rare events: if the training corpus had been a little bigger, they would have had a count (probably a count of 1!).
Zipf's Law (the long-tail phenomenon): a small number of events occur with high frequency, while a large number of events occur with low frequency. You can quickly collect statistics on the high-frequency events, but you might have to wait an arbitrarily long time to get valid statistics on the low-frequency events.

Laplace Smoothing
Add 1 to all counts (aka add-one smoothing). Let V be the size of the vocabulary and N the size of the corpus.
Unigram MLE: P(w_i) = C(w_i)/N
Laplace-smoothed: P_Laplace(w_i) = (C(w_i) + 1) / (N + V)
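The same class-conditional estimates with add-one smoothing, so unseen features get a small nonzero probability. This sketch assumes feature_counts[c] is a Counter as in the training sketch above and vocab is the full feature vocabulary:

```python
def laplace_conditionals(feature_counts, vocab):
    """feature_counts[c][f] = count(f, c); returns P(f|c) with add-one smoothing."""
    conditionals = {}
    for c, counts in feature_counts.items():
        total = sum(counts.values())
        # Every vocabulary item gets (count + 1) / (total + |V|), including unseen ones.
        conditionals[c] = {f: (counts[f] + 1) / (total + len(vocab)) for f in vocab}
    return conditionals
```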

Decoding
Maximum a posteriori (MAP) decision rule. Given our model and an instance x = {f_1, f_2, …, f_k}:
classify(x) = classify(f_1, f_2, …, f_k)
            = argmax_c P(c|x)
            = argmax_c P(x|c)P(c)
            = argmax_c Π_j P(f_j|c) P(c)
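A decoding sketch in log space (sums of logs instead of products, to avoid underflow with many features). It assumes the priors and conditionals tables from the training sketches above; the fixed unseen_logprob floor is a stand-in for proper smoothing:

```python
import math

def classify(features, priors, conditionals, unseen_logprob=-20.0):
    """MAP decision rule: argmax_c [ log P(c) + sum_j log P(f_j | c) ]."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for f in features:
            p = conditionals[c].get(f)
            # Crude fallback for unseen features; in practice use smoothed estimates.
            score += math.log(p) if p else unseen_logprob
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```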

Naïve Bayes Spam Tagging (Pantel & Lin)
Represent a message x as a set of features f_1, f_2, …, f_n.
Features: words; alternatively, n-gram sequences (skipped here); stemmed (?)

Naïve Bayes Spam Tagging II
Estimating term conditional probabilities: add-delta smoothing.
Selecting good features: exclude terms W s.t.
- N(W|Spam) + N(W|NotSpam) < (count threshold), or
- … <= P(W|Spam) / (P(W|Spam) + P(W|NotSpam)) <= 0.55
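A sketch of the two ideas on this slide: add-delta smoothing for the term conditionals, and a filter that drops rare or uninformative terms. The count threshold and the 0.45 lower bound are assumed placeholders (those values are missing from the slide); only the 0.55 upper bound comes from the slide, and delta is a free parameter:

```python
def add_delta(count, total, vocab_size, delta=0.1):
    """Add-delta smoothed estimate of P(w | class)."""
    return (count + delta) / (total + delta * vocab_size)

def keep_term(n_spam, n_ham, p_spam, p_ham, min_count=4, lo=0.45, hi=0.55):
    """Keep a term unless it is too rare or its spam ratio is near chance.
    min_count and lo are assumed values; the slide gives hi = 0.55."""
    if n_spam + n_ham < min_count:
        return False
    ratio = p_spam / (p_spam + p_ham)
    return not (lo <= ratio <= hi)
```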

Experimentation (Pantel & Lin)
Training: 160 spam, 466 non-spam. Test: 277 spam, 346 non-spam.
230,449 training words; filtering reduces the spam terms to 3,848.

Results (Pantel & Lin)
False positives: 1.16%. False negatives: 8.3%. Overall error: 4.33%.
A simple, effective approach.

Naïve Bayes for Text Classification

Naïve Bayes and Text Classification
The Naïve Bayes model has been applied to many text classification (TC) applications:
- Spam tagging: many widespread systems (e.g., SpamAssassin)
- Authorship classification
- General document categorization, e.g. the McCallum & Nigam paper: WebKB, Reuters, 20 Newsgroups, …

Features
General text classification features: words. Referred to as 'bag of words' models: no word-order information.
Number of features: |V|. Features: w_t, t in {1, …, |V|}.

Questions
How do we model word features? Should we model the absence of features directly, in addition to their presence?

Naïve Bayes Models in Detail (McCallum & Nigam, 1998)
Alternate event models for Naïve Bayes text classification:
- Multivariate Bernoulli event model: binary independence model; features treated as binary, counts ignored
- Multinomial event model: a unigram language model

Multivariate Bernoulli Event Model
Bernoulli distribution. A Bernoulli trial is a statistical experiment with exactly two mutually exclusive outcomes, each with a constant probability of occurrence. The typical example is a coin flip: heads or tails, with P(X=heads) = p and P(X=tails) = 1 - p.

Multivariate Bernoulli Event Text Model
Each document is the result of |V| independent Bernoulli trials: for each word in the vocabulary, does the word appear in the document?
From the general Naïve Bayes perspective, each word w_t corresponds to two outcomes, w_t present or w_t absent; in each document, exactly one of the two occurs, so there are always |V| elements in a document.

Training
MLE (the standard multivariate-Bernoulli estimate, with B_it and P(c_j|d_i) as defined on the next slide):
P(w_t|c_j) = Σ_i B_it P(c_j|d_i) / Σ_i P(c_j|d_i)
Laplace smoothed:
P(w_t|c_j) = (1 + Σ_i B_it P(c_j|d_i)) / (2 + Σ_i P(c_j|d_i))

Notation
Let B_it = 1 if w_t appears in document d_i, and 0 otherwise.
P(c_j|d_i) = 1 if document d_i is of class c_j, and 0 otherwise.

Testing
MAP decision rule: classify(d) = argmax_c P(c) P(d|c) = argmax_c P(c) Π_t P(f_t|c)
For the multivariate Bernoulli model this becomes (with B_t = 1 if w_t appears in d, 0 otherwise):
classify(d) = argmax_c P(c) Π_t [ B_t P(w_t|c) + (1 - B_t)(1 - P(w_t|c)) ]
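A compact sketch of the multivariate Bernoulli event model in this style, with Laplace-smoothed estimates and the presence/absence product at test time. It is written for hard labels (so P(c_j|d_i) is 0 or 1); vocab is assumed to be a set of vocabulary words, and all names are illustrative:

```python
import math
from collections import Counter, defaultdict

def train_bernoulli(docs, vocab):
    """docs: list of (words, label) pairs. Returns class priors and P(w_t | c_j)."""
    class_docs = Counter()
    word_doc_counts = defaultdict(Counter)   # word_doc_counts[c][w] = #docs of class c containing w
    for words, label in docs:
        class_docs[label] += 1
        for w in set(words) & vocab:
            word_doc_counts[label][w] += 1
    n = sum(class_docs.values())
    priors = {c: class_docs[c] / n for c in class_docs}
    # Laplace-smoothed per-class word presence probabilities: (1 + count) / (2 + #docs in class)
    cond = {c: {w: (1 + word_doc_counts[c][w]) / (2 + class_docs[c]) for w in vocab}
            for c in class_docs}
    return priors, cond

def classify_bernoulli(words, priors, cond, vocab):
    """argmax_c P(c) * prod_t [B_t P(w_t|c) + (1 - B_t)(1 - P(w_t|c))], computed in log space."""
    present = set(words) & vocab
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in vocab:
            p = cond[c][w]
            score += math.log(p if w in present else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)
```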