Naïve Bayes
Advanced Statistical Methods in NLP, Ling572, January 19, 2012

1 Naïve Bayes
Advanced Statistical Methods in NLP, Ling572, January 19, 2012

2 Roadmap
- Naïve Bayes: multivariate Bernoulli event model (recap)
- Multinomial event model
- Analysis
- HW#3

3 Naïve Bayes Models in Detail (McCallum & Nigam, 1998)
Alternate event models for Naïve Bayes text classification:
- Multivariate Bernoulli event model: the binary independence model; features are treated as binary, counts are ignored
- Multinomial event model: a unigram language model

4 Multivariate Bernoulli Event Text Model
Each document is the result of |V| independent Bernoulli trials: for each word in the vocabulary, does the word appear in the document?
From the general Naïve Bayes perspective, each word corresponds to two variables, w_t and its complement w̄_t; in each document, exactly one of w_t or w̄_t appears. A document therefore always has |V| elements.

5 Training & Testing
Laplace-smoothed training:
P(w_t | c_j) = (1 + Σ_i B_it P(c_j | d_i)) / (2 + Σ_i P(c_j | d_i))
where B_it = 1 if w_t appears in document d_i and 0 otherwise. With hard labels, this is (1 + # of class-c_j documents containing w_t) / (2 + # of class-c_j documents).
MAP decision rule classification:
c* = argmax_c P(c) ∏_{t=1}^{|V|} [ B_t P(w_t | c) + (1 − B_t)(1 − P(w_t | c)) ]
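To make the recap concrete, here is a minimal Python sketch of Laplace-smoothed multivariate Bernoulli training and MAP classification. All function and variable names are ours, not from the slides, and the hard-label case P(c_j | d_i) ∈ {0, 1} is assumed.

```python
import math

def train_bernoulli(docs, labels, vocab):
    """docs: list of sets of words; labels: parallel list of class names."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    cond = {}
    for c in classes:
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        # Laplace smoothing: (1 + # docs containing w) / (2 + # docs in class)
        cond[c] = {w: (1 + sum(w in d for d in class_docs)) / (2 + len(class_docs))
                   for w in vocab}
    return prior, cond

def classify_bernoulli(doc, prior, cond, vocab):
    """MAP decision: every vocabulary word contributes, present or absent."""
    best, best_score = None, -math.inf
    for c in prior:
        score = math.log(prior[c])
        for w in vocab:
            p = cond[c][w]
            score += math.log(p) if w in doc else math.log(1.0 - p)
        if score > best_score:
            best, best_score = c, score
    return best
```

Note that classification iterates over the entire vocabulary, since absent words also contribute (1 − P) factors; slides 57–59 return to the cost of this.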

6 Multinomial Event Model

7–11 Multinomial Distribution
Trial: select a word according to its probability. Possible outcomes: {w_1, w_2, …, w_|V|}.
A document is viewed as the result of one trial for each position, with P(word = w_i) = p_i and Σ_i p_i = 1.
The outcome counts then follow the multinomial distribution:
P(X_1 = x_1, X_2 = x_2, …, X_|V| = x_|V|) = (n! / (x_1! x_2! ⋯ x_|V|!)) ∏_i p_i^{x_i}, where n = Σ_i x_i
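As a sketch, the pmf above in Python (names are ours):

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(X_1=x_1, ..., X_|V|=x_|V|) for outcome counts x_i and probabilities p_i."""
    n = sum(counts)
    coeff = factorial(n) // prod(factorial(x) for x in counts)  # n! / (x_1! ... x_|V|!)
    return coeff * prod(p ** x for x, p in zip(counts, probs))

# e.g. multinomial_pmf([1, 1, 0], [0.5, 0.3, 0.2]) == 0.3, the 2*p1*p2 of the example below
```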

12–15 Example (due to F. Xia)
Consider a vocabulary V with only three words: a, b, c.
Document d_i contains only 2 word instances.
For each position: P(w = a) = p_1, P(w = b) = p_2, P(w = c) = p_3.
What is the probability that we see 'a' once and 'b' once in d_i?

16–22 Example (cont'd) (due to F. Xia)
How many possible sequences? 3^2 = 9: aa, ab, ac, ba, bb, bc, ca, cb, cc.
How many sequences with one 'a' and one 'b'? n! / (x_1! ⋯ x_|V|!) = 2! / (1! · 1! · 0!) = 2.
The probability of the sequence 'ab' is p_1 p_2, and the probability of 'ba' is p_2 p_1 = p_1 p_2.
So the probability of seeing 'a' once and 'b' once is 2 p_1 p_2.
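Checking the worked example numerically, with hypothetical values for p_1, p_2, p_3:

```python
from itertools import product

p = {'a': 0.5, 'b': 0.3, 'c': 0.2}   # hypothetical values for p1, p2, p3

# Enumerate all 3^2 = 9 two-word sequences; keep those with one 'a' and one 'b'.
total = sum(p[s[0]] * p[s[1]]
            for s in product('abc', repeat=2)
            if sorted(s) == ['a', 'b'])

assert abs(total - 2 * p['a'] * p['b']) < 1e-12   # matches 2 * p1 * p2
print(total)   # 0.3
```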

23–28 Multinomial Event Model
A document is a sequence of word events drawn from vocabulary V.
Assume document length is independent of class, and assume (Naïve Bayes) that words are independent of context.
Define N_it = # of occurrences of w_t in document d_i.
Then under the multinomial event model:
P(d_i | c_j) = P(|d_i|) · |d_i|! · ∏_{t=1}^{|V|} P(w_t | c_j)^{N_it} / N_it!
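A direct transcription of this likelihood in log space; the length prior P(|d_i|) is omitted since it does not depend on class, and the names are ours:

```python
from math import lgamma

def log_multinomial_likelihood(word_counts, cond_log_prob):
    """log P(d|c), omitting the class-independent length prior P(|d|).
    word_counts: {word: N_it}; cond_log_prob: {word: log P(w|c)}."""
    n = sum(word_counts.values())
    ll = lgamma(n + 1)                                       # log |d_i|!
    for w, cnt in word_counts.items():
        ll += cnt * cond_log_prob[w] - lgamma(cnt + 1)       # + N_it log P(w_t|c) - log N_it!
    return ll
```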

29–33 Training
P(c_j | d_i) = 1 if document d_i is of class c_j, and 0 otherwise. So:
P(c_j) = Σ_i P(c_j | d_i) / |D|
and, with Laplace smoothing over the vocabulary:
P(w_t | c_j) = (1 + Σ_i N_it P(c_j | d_i)) / (|V| + Σ_{s=1}^{|V|} Σ_i N_is P(c_j | d_i))
Contrast this with the multivariate Bernoulli: here we count word occurrences, not documents containing the word, and the denominator smooths over |V| outcomes rather than 2.
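A minimal sketch of multinomial training with add-delta smoothing (delta = 1 gives the Laplace estimate above; all names are ours):

```python
from collections import Counter

def train_multinomial(docs, labels, vocab, delta=1.0):
    """docs: list of word lists; labels: parallel list of class names.
    delta=1.0 gives the Laplace (add-one) estimate on the slide."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        counts[c].update(w for w in doc if w in vocab)
    cond = {}
    for c in classes:
        denom = delta * len(vocab) + sum(counts[c].values())  # computed once per class
        cond[c] = {w: (delta + counts[c][w]) / denom for w in vocab}
    return prior, cond
```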

34–35 Testing
To classify a document d_i, compute:
argmax_c P(c) P(d_i | c) = argmax_c P(c) ∏_{t=1}^{|V|} P(w_t | c)^{N_it}
The length prior and the multinomial coefficient are constant across classes, so they drop out of the argmax.
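The corresponding decision rule in log space, reusing the model from the training sketch above:

```python
import math

def classify_multinomial(doc, prior, cond):
    """argmax_c of log P(c) + sum over document positions of log P(w|c)."""
    scores = {}
    for c in prior:
        s = math.log(prior[c])
        for w in doc:                      # one term per position: contributes N_t log P(w_t|c) in total
            if w in cond[c]:               # skip out-of-vocabulary words
                s += math.log(cond[c][w])
        scores[c] = s
    return max(scores, key=scores.get)
```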

36–38 Two Naïve Bayes Models
- Multivariate Bernoulli event model: models binary presence/absence of word features.
- Multinomial event model: models counts of word features; a unigram model.
In experiments on a range of different text classification corpora, the multinomial model usually outperforms the multivariate Bernoulli (McCallum & Nigam, 1998).

39–42 Thinking about Performance
Naïve Bayes: the conditional independence assumption is clearly unrealistic, but performance is often good. Why? Classification is based on the sign, not the magnitude, of the score, and the direction of classification is usually right.
Multivariate Bernoulli vs. multinomial: why does the multinomial perform better? It captures additional information: presence/absence plus frequency.
What if we wanted to include other types of features? In the multivariate model, a new feature is just another Bernoulli trial; the multinomial model can't mix distributions.

43–50 Model Comparison
- Features: multivariate Bernoulli uses binary presence/absence; multinomial uses # of occurrences.
- Trial: multivariate Bernoulli, one trial per word in the vocabulary; multinomial, one trial per position in the document.
- P(c): the same in both models, the fraction of training documents belonging to class c.
- P(w_t|c): multivariate Bernoulli, (1 + # of class-c documents containing w_t) / (2 + # of class-c documents); multinomial, (1 + count of w_t in class-c documents) / (|V| + total word count in class-c documents).
- Testing: multivariate Bernoulli, argmax_c P(c) ∏_t [B_t P(w_t|c) + (1 − B_t)(1 − P(w_t|c))]; multinomial, argmax_c P(c) ∏_t P(w_t|c)^{N_t}.

51–52 Naïve Bayes: Strengths
Advantages:
- Conceptual simplicity
- Training efficiency
- Testing efficiency
- Scales fairly well to large data
- Performs multiclass classification
- Can provide n-best outputs

53–54 Naïve Bayes: Weaknesses
Disadvantages:
- Weak theoretical foundation: a ragingly inaccurate independence assumption
- Decent accuracy, but outperformed by more sophisticated methods

55 HW#3
Naïve Bayes classification:
- Experiment with the Mallet Naïve Bayes learner
- Implement the multivariate Bernoulli event model
- Implement the multinomial event model
- Compare with binary variables
- Analyze results

56 Notes
- Use add-delta smoothing (vs. add-one)
- Beware numerical underflow: log probs are your friend, and they also convert exponents into multipliers (see the sketch below)
- Look out for repeated computation: precompute normalization denominators, e.g. for multinomial P(w|c), compute the denominator once for each c
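A quick illustration of why log probabilities matter, with hypothetical numbers:

```python
import math

p = 1e-5          # a typical smoothed word probability
n_words = 100     # a short document

naive = p ** n_words               # 1e-500 underflows to 0.0 in a float
logged = n_words * math.log(p)     # the exponent becomes a multiplier
print(naive)      # 0.0
print(logged)     # -1151.29...
```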

57–59 Efficiency
MVB: see the sketch below.
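The bodies of these slides are not recoverable from the transcript; what follows is a hedged sketch of one standard efficiency trick for the multivariate Bernoulli model, not necessarily the one these slides showed. Naïvely, classifying a document costs O(|V|) per class because absent words also contribute (1 − P(w_t|c)) factors; precomputing the all-absent score per class reduces per-document work to the words actually present. This assumes the Bernoulli model from the earlier sketch.

```python
import math

def precompute_absent(cond, vocab):
    """Per class, the log-score of a document containing no vocabulary words."""
    return {c: sum(math.log(1.0 - cond[c][w]) for w in vocab) for c in cond}

def classify_bernoulli_fast(doc, prior, cond, absent_score):
    best, best_s = None, -math.inf
    for c in prior:
        s = math.log(prior[c]) + absent_score[c]
        for w in doc:                                  # only words actually present
            if w in cond[c]:
                p = cond[c][w]
                s += math.log(p) - math.log(1.0 - p)   # swap the absent term for the present term
        if s > best_s:
            best, best_s = c, s
    return best
```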

