Naïve Bayes
Advanced Statistical Methods in NLP, Ling572
January 19, 2012

Roadmap
- Naïve Bayes
  - Multivariate Bernoulli event model (recap)
  - Multinomial event model
  - Analysis
- HW#3

Naïve Bayes Models in Detail (McCallum & Nigam, 1998)
- Alternate models for Naïve Bayes text classification:
  - Multivariate Bernoulli event model
    - Binary independence model: features treated as binary; counts ignored
  - Multinomial event model
    - Unigram language model

Multivariate Bernoulli Event Text Model
- Each document is the result of |V| independent Bernoulli trials: for each word in the vocabulary, does the word appear in the document?
- From the general Naïve Bayes perspective:
  - Each word corresponds to two outcomes: w_t present or w_t absent
  - In each document, exactly one of the two occurs
  - So every document always has |V| elements
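For reference, the document likelihood under this model, as given in McCallum & Nigam (1998), is

  P(d_i | c_j) = \prod_{t=1}^{|V|} [ B_it P(w_t | c_j) + (1 - B_it)(1 - P(w_t | c_j)) ]

where B_it = 1 if word w_t appears in document d_i and B_it = 0 otherwise.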

Training & Testing Laplace smoothed training: MAP decision rule classification: P(c) 5

Multinomial Event Model

Multinomial Distribution
- Trial: select a word according to its probability
- Possible outcomes: {w_1, w_2, ..., w_|V|}
- A document is viewed as the result of one trial for each position
  - P(word = w_i) = p_i, with \sum_i p_i = 1
- The model gives the joint distribution over counts: P(X_1 = x_1, X_2 = x_2, ..., X_|V| = x_|V|)
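For reference, the multinomial probability mass function, with n = \sum_t x_t trials, is

  P(X_1 = x_1, ..., X_|V| = x_|V|) = \frac{n!}{x_1! \cdots x_|V|!} \prod_{t=1}^{|V|} p_t^{x_t}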

Example
- Consider a vocabulary V with only three words: a, b, c
- Document d_i contains only 2 word instances
- For each position: P(w=a) = p_1, P(w=b) = p_2, P(w=c) = p_3
- What is the probability that we see 'a' once and 'b' once in d_i?
(Due to F. Xia)

Example (cont'd)
- How many possible sequences? 3^2 = 9
  - Sequences: aa, ab, ac, bb, ba, bc, ca, cb, cc
- How many sequences with one 'a' and one 'b'? n!/(x_1! ... x_|V|!) = 2!/(1! 1! 0!) = 2
- Probability of the sequence 'ab': p_1 * p_2
- Probability of the sequence 'ba': p_2 * p_1
- So the probability of seeing 'a' once and 'b' once is 2 * p_1 * p_2
(Due to F. Xia)
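A quick way to sanity-check this kind of calculation in Python (not part of the original slides; the values p_1 = p_2 = p_3 = 1/3 are assumed purely for illustration):

from math import factorial

def multinomial_prob(counts, probs):
    # P(X_1=x_1, ..., X_k=x_k) = n!/(x_1! ... x_k!) * prod_i p_i^x_i
    n = sum(counts)
    coef = factorial(n)
    for x in counts:
        coef //= factorial(x)
    prob = 1.0
    for x, p in zip(counts, probs):
        prob *= p ** x
    return coef * prob

# One 'a', one 'b', zero 'c', with assumed p_1 = p_2 = p_3 = 1/3:
# the multinomial coefficient is 2, so the result is 2 * (1/3) * (1/3) = 0.222...
print(multinomial_prob([1, 1, 0], [1/3, 1/3, 1/3]))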

Multinomial Event Model
- A document is a sequence of word events drawn from vocabulary V
- Assume document length is independent of class
- Assume (Naïve Bayes) that words are independent of context
- Define N_it = # of occurrences of w_t in document d_i
- Then compute P(d_i | c_j) under the multinomial event model, as shown below
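The document likelihood for this model, as given in McCallum & Nigam (1998), is

  P(d_i | c_j) = P(|d_i|) |d_i|! \prod_{t=1}^{|V|} \frac{P(w_t | c_j)^{N_it}}{N_it!}

The P(|d_i|) and factorial terms do not depend on the class, so they can be dropped when comparing classes.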

Training
- P(c_j | d_i) = 1 if document d_i is of class c_j, and 0 otherwise
- So the parameter estimates reduce to (smoothed) counts over the training documents, as spelled out below
- Contrast this with the multivariate Bernoulli estimates
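The Laplace-smoothed estimates for the multinomial model (McCallum & Nigam, 1998) are

  P(w_t | c_j) = (1 + \sum_i N_it P(c_j | d_i)) / (|V| + \sum_s \sum_i N_is P(c_j | d_i))

  P(c_j) = (\sum_i P(c_j | d_i)) / |D|

i.e. word probabilities are smoothed counts of w_t within class-c_j documents, and the prior is the fraction of training documents labeled c_j.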

Testing
- To classify a document d_i, compute: argmax_c P(c) P(d_i | c)
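Plugging in the multinomial likelihood and dropping the class-independent length and factorial terms, the decision rule becomes

  argmax_c P(c) \prod_{t=1}^{|V|} P(w_t | c)^{N_it}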

Two Naïve Bayes Models
- Multivariate Bernoulli event model: models binary presence/absence of word features
- Multinomial event model: models counts of word features (unigram language model)
- In experiments on a range of text classification corpora, the multinomial model usually outperforms the multivariate Bernoulli model (McCallum & Nigam, 1998)

Thinking about Performance
- Naïve Bayes makes a conditional independence assumption
  - Clearly unrealistic, but performance is often good. Why?
  - Classification is based on the sign, not the magnitude, of the score: the direction of classification is usually right even when the estimated probabilities are off
- Multivariate Bernoulli vs. multinomial: why does the multinomial perform better?
  - It captures additional information: presence/absence plus frequency
- What if we wanted to include other types of features?
  - Multivariate Bernoulli: another feature is just another Bernoulli trial
  - Multinomial: can't mix feature types within a single multinomial distribution

Model Comparison

              Multivariate Bernoulli         Multinomial event
  Features    binary presence/absence        # of occurrences
  Trial       each word in the vocabulary    each position in the document
  P(c)        (formulas: see the training slides above)
  P(w|c)      (formulas: see the training slides above)
  Testing     (formulas: see the testing slides above)

Naïve Bayes: Strengths
Advantages:
- Conceptual simplicity
- Training efficiency
- Testing efficiency
- Scales fairly well to large data
- Performs multiclass classification
- Can provide n-best outputs

Naïve Bayes: Weaknesses
Disadvantages:
- Theoretical foundation is weak: ragingly inaccurate independence assumption
- Decent accuracy, but outperformed by more sophisticated classifiers

HW#3: Naïve Bayes Classification
- Experiment with the Mallet Naïve Bayes learner
- Implement the multivariate Bernoulli event model
- Implement the multinomial event model
- Compare with binary variables
- Analyze results

Notes
- Use add-delta smoothing (vs. add-one)
- Beware numerical underflow: log probs are your friend
  - Also converts exponents into multipliers
- Look out for repeated computation
  - Precompute normalization denominators, e.g. for multinomial P(w|c), compute the denominator once for each c
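A minimal sketch of how these notes fit together for the multinomial model (the representation of documents as {word: count} dicts, the function names, and the delta value are illustrative assumptions, not part of the assignment spec):

import math
from collections import defaultdict

def train_multinomial(docs, labels, vocab, delta=0.5):
    # docs: list of {word: count} dicts; labels: parallel list of class names
    classes = set(labels)
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    count = {c: defaultdict(float) for c in classes}
    total = defaultdict(float)
    for doc, c in zip(docs, labels):
        for w, n in doc.items():
            count[c][w] += n
            total[c] += n
    # Precompute each class's normalization denominator once (add-delta smoothing).
    log_denom = {c: math.log(total[c] + delta * len(vocab)) for c in classes}
    log_cond = {c: {w: math.log(count[c][w] + delta) - log_denom[c] for w in vocab}
                for c in classes}
    return log_prior, log_cond

def classify(doc, log_prior, log_cond):
    # Score each class in log space: log P(c) + sum_w N_w * log P(w|c);
    # words outside the vocabulary are ignored.
    scores = {c: log_prior[c] + sum(n * log_cond[c].get(w, 0.0)
                                    for w, n in doc.items())
              for c in log_prior}
    return max(scores, key=scores.get)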

Efficiency: MVB
(the efficiency derivation appears only as equations on the original slides)
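Those equations are not recoverable from the transcript, but a standard way to make the multivariate Bernoulli likelihood efficient, and plausibly what these slides derive, is to factor it as

  P(d_i | c_j) = \prod_{t=1}^{|V|} (1 - P(w_t | c_j)) \times \prod_{w_t \in d_i} \frac{P(w_t | c_j)}{1 - P(w_t | c_j)}

The first product does not depend on the document, so it can be precomputed once per class; scoring a document then requires only a pass over the words it actually contains rather than over the whole vocabulary.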