
1 Maximum Entropy Model (I) LING 572 Fei Xia Week 5: 02/05-02/07/08

2 MaxEnt in NLP
The maximum entropy principle has a long history. The MaxEnt algorithm was introduced to the NLP field by Berger et al. (1996).
Used in many NLP tasks: tagging, parsing, PP attachment, …

3 Reference papers
(Ratnaparkhi, 1997), (Berger et al., 1996), (Ratnaparkhi, 1996), (Klein and Manning, 2003)
People often choose different notations.

4 Notation

                            Input   Output   The pair
(Berger et al., 1996)       x       y        (x, y)
(Ratnaparkhi, 1997)         b       a        x
(Ratnaparkhi, 1996)         h       t        (h, t)
(Klein and Manning, 2003)   d       c        (c, d)

We follow the notation in (Berger et al., 1996).

5 Outline
Overview
The Maximum Entropy Principle
Modeling**
Decoding
Training*
Case study: POS tagging

6 The Overview

7 Joint vs. Conditional models
Given training data {(x, y)}, we want to build a model to predict y for new x's. For each model, we need to estimate the parameters θ.
Joint (generative) models estimate P(x, y) by maximizing the likelihood P(X, Y | θ)
 - Ex: n-gram models, HMM, Naïve Bayes, PCFG
 - Choosing weights is trivial: just use relative frequencies.
Conditional (discriminative) models estimate P(y|x) by maximizing the conditional likelihood P(Y | X, θ)
 - Ex: MaxEnt, SVM, etc.
 - Choosing weights is harder.

8 Naïve Bayes Model
[Diagram: class node C with feature nodes f1, f2, …, fn as its children]
Assumption: each fm is conditionally independent of fn given C.

9 The conditional independence assumption
fm and fn are conditionally independent given c: P(fm | c, fn) = P(fm | c)
Counter-examples in the text classification task:
 - P(“Iowa” | politics) != P(“Iowa” | politics, “caucus”)
 - P(“delegate” | politics) != P(“delegate” | politics, “primary”)
Q: How to deal with correlated features?
A: Many models, including MaxEnt, do not assume that features are conditionally independent.

10 Naïve Bayes highlights
Choose c* = arg max_c P(c) ∏_k P(f_k | c)
Two types of model parameters:
 - Class prior: P(c)
 - Conditional probability: P(f_k | c)
The number of model parameters: |C| + |C||V|
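As an illustration of the decision rule above, here is a minimal Python sketch (the class names, features, and probability values are made up for illustration, not taken from the course):

```python
import math

# Hypothetical parameters for a two-class text classifier (illustrative values only).
prior = {"politics": 0.6, "sports": 0.4}
cond = {
    ("caucus", "politics"): 0.02, ("caucus", "sports"): 0.001,
    ("score", "politics"): 0.005, ("score", "sports"): 0.03,
}

def nb_classify(features):
    """Return arg max_c P(c) * prod_k P(f_k | c), computed in log space for stability."""
    best_class, best_score = None, float("-inf")
    for c, p_c in prior.items():
        score = math.log(p_c) + sum(math.log(cond[(f, c)]) for f in features)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(nb_classify(["caucus", "score"]))   # -> politics
```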

11 P(f | c) in NB

        f1          f2          …    fj
c1      P(f1|c1)    P(f2|c1)    …    P(fj|c1)
c2      P(f1|c2)    …           …    …
…
ci      P(f1|ci)    …           …    P(fj|ci)

Each cell is a weight for a particular (class, feat) pair.

12 Weights in NB and MaxEnt
In NB:
 - P(f | c) are probabilities (i.e., in [0, 1])
 - P(f | c) are multiplied at test time
In MaxEnt:
 - the weights are real numbers: they can be negative
 - the weights are added at test time

13 The highlights in MaxEnt
Training: to estimate the weights λ_j
Testing: to calculate P(y|x)
f_j(x, y) is a feature function, which normally corresponds to a (feature, class) pair.

14 Main questions
What is the maximum entropy principle?
What is a feature function?
Modeling: Why does P(y|x) have the form it does?
Training: How do we estimate λ_j?

15 Outline
Overview
The Maximum Entropy Principle
Modeling**
Decoding
Training*
Case study

16 The maximum entropy principle

17 The maximum entropy principle
Related to Occam's razor and other similar justifications for scientific inquiry: make the minimum assumptions about unseen data.
Also: Laplace's Principle of Insufficient Reason: when one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely.

18 Maximum Entropy
Why maximum entropy?
 - Maximize entropy = minimize commitment
Model all that is known and assume nothing about what is unknown.
 - Model all that is known: satisfy a set of constraints that must hold
 - Assume nothing about what is unknown: choose the most “uniform” distribution → choose the one with maximum entropy

19 Ex1: Coin-flip example (Klein & Manning, 2003)
Toss a coin: p(H) = p1, p(T) = p2.
Constraint: p1 + p2 = 1
Question: what’s p(x)? That is, what is the value of p1?
Answer: choose the p that maximizes H(p)
[Plot: H as a function of p1]

20 Coin-flip example (cont)
[Plot: H as a function of (p1, p2), with the constraint line p1 + p2 = 1; adding the constraint p1 = 0.3 picks out a single point]
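To make the coin example concrete, here is a small Python sketch (mine, not from the slides) that evaluates H(p) over a grid of p1 values under the single constraint p1 + p2 = 1; the entropy peaks at the uniform distribution p1 = 0.5:

```python
import math

def entropy(p1):
    """H(p) = -sum_x p(x) log2 p(x) for the two-outcome distribution (p1, 1 - p1)."""
    h = 0.0
    for p in (p1, 1.0 - p1):
        if p > 0:
            h -= p * math.log2(p)
    return h

# Scan candidate values of p1 on a grid; the maximum is at p1 = 0.5 (H = 1 bit).
best_h, best_p1 = max((entropy(i / 100), i / 100) for i in range(1, 100))
print(best_p1, best_h)   # -> 0.5 1.0
```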

21 Ex2: An MT example (Berger et al., 1996)
Possible translations for the word “in”: {dans, en, à, au cours de, pendant}
Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
Intuitive answer: 1/5 for each of the five translations

22 An MT example (cont)
Constraints: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1, and p(dans) + p(en) = 3/10
Intuitive answer: p(dans) = p(en) = 3/20; p(à) = p(au cours de) = p(pendant) = 7/30

23 An MT example (cont)
Constraints: the two above, plus p(dans) + p(à) = 1/2
Intuitive answer: ?? (no longer obvious; we need a principled way to choose p)

24 Ex3: POS tagging (Klein and Manning, 2003)

25 Ex3 (cont)

26 Ex4: Overlapping features (Klein and Manning, 2003)
[Table: four cell probabilities p1, p2, p3, p4]

27 Ex4 (cont)
[Table: cell probabilities p1, p2, 2/3 - p1, 1/3 - p2]

28 Ex4 (cont)
[Table: cell probabilities p1, 2/3 - p1, 2/3 - p1, p1 - 1/3]

29 Reading #3 (Q1): Let P(X=i) be the probability of getting an i when rolling a die. What is the maximum entropy distribution P(x) if the following is true?
(a) P(X=1) + P(X=2) = 1/2
    Answer: 1/4, 1/4, 1/8, 1/8, 1/8, 1/8
(b) P(X=1) + P(X=2) = 1/2 and P(X=6) = 1/3
    Answer: 1/4, 1/4, 1/18, 1/18, 1/18, 1/3
(In each case, entropy is maximized by spreading the probability mass uniformly within each constrained group and over the remaining outcomes.)

30 (Q2) In the text classification task, |V| is the number of features and |C| is the number of classes. How many feature functions are there?
Answer: |C| * |V|

31 The MaxEnt Principle summary
Goal: among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p).
Q1: How to represent constraints?
Q2: How to find such distributions?

32 Outline
Overview
The Maximum Entropy Principle
Modeling**
Decoding
Training*
Case study

33 Modeling

34 The Setting
From the training data, collect (x, y) pairs:
 - x ∈ X: the observed data
 - y ∈ Y: the thing to be predicted (e.g., a class in a classification problem)
 - Ex: in a text classification task, x is a document and y is the category of the document
The goal is to estimate P(y|x).

35 The basic idea
Goal: to estimate p(y|x)
Choose p(x, y) with maximum entropy (or “uncertainty”) subject to the constraints (or “evidence”).

36 The outline for modeling
Feature function
Calculating the expectation of a feature function
The forms of P(x,y) and P(y|x)

37 Feature function

38 The definition
A feature function is a binary-valued function on events.
A feature function corresponds to a (feature, class) pair (t, c):
    f_j(x, y) = 1 if and only if (t is present in x) and (y == c); otherwise f_j(x, y) = 0.
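A minimal sketch of such a binary feature function in Python (the word and class names here are hypothetical):

```python
def make_feature_function(t, c):
    """Build f_j for the (feature, class) pair (t, c):
    f_j(x, y) = 1 iff t is present in x and y == c, else 0."""
    def f_j(x, y):
        return 1 if (t in x) and (y == c) else 0
    return f_j

# Example: fires only when "caucus" occurs in a document labeled "politics".
f = make_feature_function("caucus", "politics")
print(f({"caucus", "iowa"}, "politics"))   # -> 1
print(f({"caucus", "iowa"}, "sports"))     # -> 0
```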

39 The weights in NB
[Empty matrix: classes c1 … ci by features f1 … fk, filled in on the next slide]

40 The weights in NB

        f1          f2          …    fj
c1      P(f1|c1)    P(f2|c1)    …    P(fj|c1)
c2      P(f1|c2)    …           …    …
…
ci      P(f1|ci)    …           …    P(fj|ci)

Each cell is a weight for a particular (class, feat) pair.

41 The matrix in MaxEnt

        t1             t2         …    tk
c1      f1             f2         …    fk
c2      f(k+1)         f(k+2)     …    f(2k)
…
ci      f(k*(i-1)+1)   …          …    f(k*i)

Each feature function fj corresponds to a (feat, class) pair.

42 The weights in MaxEnt

        t1      t2      …    tk
c1      λ1      λ2      …    λk
c2      …       …       …    …
…
ci      …       …       …    λ(k*i)

Each feature function fj has a weight λj.

43 Feature function summary
A feature function in MaxEnt corresponds to a (feat, class) pair.
The number of feature functions in MaxEnt is approximately |C| * |F|.
A MaxEnt trainer learns the weights for the feature functions.

44 The outline for modeling
Feature function
Calculating the expectation of a feature function
The forms of P(x,y) and P(y|x)

45 The expectation
Ex1:
 - Flip a coin: if it is a head, you will get 100 dollars; if it is a tail, you will lose 50 dollars.
 - What is the expected return? P(X=H) * 100 + P(X=T) * (-50)
Ex2:
 - If the outcome is x_i, you will receive v_i dollars.
 - What is the expected return? Σ_i P(X=x_i) * v_i

46 Empirical expectation
Denoted as the expectation under the empirical distribution p̃.
Ex1: Toss a coin four times and get H, T, H, and H.
The average return: (100 - 50 + 100 + 100) / 4 = 62.5
Empirical distribution: p̃(H) = 3/4, p̃(T) = 1/4
Empirical expectation: 3/4 * 100 + 1/4 * (-50) = 62.5

47 Model expectation
Ex1: Toss a coin four times and get H, T, H, and H.
A model: p(H) = 1/2, p(T) = 1/2
Model expectation: 1/2 * 100 + 1/2 * (-50) = 25

48 Some notations
Training data: {(x_i, y_i)}
Empirical distribution: p̃(x, y)
A model: p(y|x)
Model expectation of f_j
Empirical expectation of f_j
The j-th feature: f_j
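The formulas on this slide did not survive the transcript; in the notation of (Berger et al., 1996), the two expectations of a feature function f_j are usually defined as follows (a reconstruction, not copied from the slide):

```latex
\tilde{E}(f_j) = \sum_{x,y} \tilde{p}(x,y)\, f_j(x,y)
\qquad\qquad
E(f_j) = \sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f_j(x,y)
```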

49 Empirical expectation

50 An example
Training data:
x1 c1 t1 t2 t3
x2 c2 t1 t4
x3 c1 t4
x4 c3 t1 t3

Raw counts:
      t1   t2   t3   t4
c1    1    1    1    1
c2    1    0    0    1
c3    1    0    1    0

51 An example
Training data:
x1 c1 t1 t2 t3
x2 c2 t1 t4
x3 c1 t4
x4 c3 t1 t3

Empirical expectation:
      t1    t2    t3    t4
c1    1/4   1/4   1/4   1/4
c2    1/4   0/4   0/4   1/4
c3    1/4   0/4   1/4   0/4

52 Calculating empirical expectation

Let N be the number of training instances
for each instance x:
    let y be the true class label of x
    for each feature t in x:
        empirical_expect[t][y] += 1/N
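A runnable Python version of the pseudocode above, using the toy training data from slide 50 (the function and variable names are mine, not the assignment's):

```python
from collections import defaultdict

# Toy training data from slide 50: (class label, features present in the instance).
data = [
    ("c1", ["t1", "t2", "t3"]),
    ("c2", ["t1", "t4"]),
    ("c1", ["t4"]),
    ("c3", ["t1", "t3"]),
]

def empirical_expectation(data):
    """empirical_expect[(t, y)] = count(t, y) / N, as in the pseudocode above."""
    n = len(data)
    expect = defaultdict(float)
    for y, features in data:          # y is the true class label of x
        for t in features:
            expect[(t, y)] += 1.0 / n
    return expect

emp = empirical_expectation(data)
print(emp[("t1", "c1")])   # -> 0.25, matching the table on slide 51
```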

53 Model expectation

54 An example
Suppose P(y|x_i) = 1/3 for every y.
Training data:
x1 c1 t1 t2 t3
x2 c2 t1 t4
x3 c1 t4
x4 c3 t1 t3

“Raw” counts:
      t1       t2    t3    t4
c1    1/3*3    1/3   2/3   2/3
c2    1/3*3    1/3   2/3   2/3
c3    1/3*3    1/3   2/3   2/3

55 Calculating model expectation

Let N be the number of training instances
for each instance x:
    calculate P(y|x) for every y ∈ Y
    for each feature t in x:
        for each y ∈ Y:
            model_expect[t][y] += 1/N * P(y|x)
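A matching sketch for the model expectation on the same toy data; with the uniform model P(y|x) = 1/3 from slide 54, it reproduces that slide's "raw" counts divided by N (again, the names are mine):

```python
from collections import defaultdict

data = [
    ("c1", ["t1", "t2", "t3"]),
    ("c2", ["t1", "t4"]),
    ("c1", ["t4"]),
    ("c3", ["t1", "t3"]),
]
classes = ["c1", "c2", "c3"]

def model_expectation(data, classes, p_y_given_x):
    """model_expect[(t, y)] = (1/N) * sum over instances x containing t of P(y|x)."""
    n = len(data)
    expect = defaultdict(float)
    for _, features in data:
        for t in features:
            for y in classes:
                expect[(t, y)] += p_y_given_x(y, features) / n
    return expect

# Uniform model P(y|x) = 1/3 for every y, as on slide 54.
mod = model_expectation(data, classes, lambda y, x: 1.0 / 3)
print(round(mod[("t1", "c1")], 4))   # -> 0.25  (= (1/3 * 3) / 4)
```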

56 The outline for modeling
Feature function
Calculating the expectation of a feature function
The forms of P(x,y) and P(y|x)

57 Constraints
Model expectation = Empirical expectation
Why impose such constraints?
 - The MaxEnt principle: model what is known
 - To maximize the conditional likelihood: see Slides #24-28 in (Klein and Manning, 2003)

58 The conditional likelihood (**)
Given the data (X, Y), the conditional likelihood is a function of the parameters λ.
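The formula itself is missing from the transcript; the conditional log-likelihood that this slide presumably shows is standardly written as:

```latex
L(\lambda) \;=\; \log \prod_{x,y} p(y \mid x)^{\tilde{p}(x,y)}
           \;=\; \sum_{x,y} \tilde{p}(x,y)\, \log p(y \mid x)
```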

59 The effect of adding constraints
 - Lower the entropy
 - Raise the likelihood of the data
 - Bring the distribution further away from uniform
 - Bring the distribution closer to the data

60 Restating the problem
The task: find p* in P, the set of distributions that satisfy the constraints, such that p* maximizes H(p).
Objective function: H(p)
Constraints: model expectation = empirical expectation for each feature function
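Since the slide's formulas are not preserved here, this is the standard way the constrained problem is written out (with P defined by the expectation constraints above):

```latex
p^{*} = \arg\max_{p \in P} H(p),
\qquad
P = \bigl\{\, p \;:\; E(f_j) = \tilde{E}(f_j), \; j = 1, \dots, k \,\bigr\}
```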

61 Questions
Is P empty?
Does p* exist?
Is p* unique?
What is the form of p*?
How to find p*?

62 What is the form of p*? (Ratnaparkhi, 1997)
Theorem: if p* ∈ P ∩ Q, then p* is the distribution in P that maximizes H(p). Furthermore, p* is unique.

63 Two equivalent forms
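The two forms did not survive the transcript; they are presumably the product parameterization of (Ratnaparkhi, 1997) and the exponential (log-linear) parameterization of (Berger et al., 1996), which are equivalent with α_j = e^{λ_j}:

```latex
p(y \mid x) = \frac{1}{Z(x)} \prod_{j} \alpha_j^{\,f_j(x,y)}
\qquad\qquad
p(y \mid x) = \frac{1}{Z(x)} \exp\Bigl( \sum_{j} \lambda_j f_j(x,y) \Bigr)
```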

64 Modeling summary
Goal: find p* in P, which maximizes H(p).
It can be proved that, when p* exists:
 - it is unique
 - it maximizes the conditional likelihood of the training data
 - it is a model in Q

65 Outline
Overview
The Maximum Entropy Principle
Modeling**
Decoding
Training*
Case study: POS tagging

66 Decoding

67 Decoding
Z is a constant w.r.t. y

        t1      t2      …    tk
c1      λ1      λ2      …    λk
c2      …       …       …    …
…
ci      …       …       …    λ(k*i)

68 The procedure for calculating P(y | x)

Z = 0
for each y ∈ Y:
    sum = 0
    for each feature t present in x:
        sum += the weight for (t, y)
    result[y] = exp(sum)
    Z += result[y]
for each y ∈ Y:
    P(y|x) = result[y] / Z
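A direct Python translation of the procedure above (a sketch with hypothetical weights, not the Hw5 reference solution):

```python
import math

# Hypothetical weights lambda_(t, y), one per (feature, class) pair.
weights = {
    ("t1", "c1"): 0.5, ("t2", "c1"): -0.2,
    ("t1", "c2"): 0.1, ("t2", "c2"): 0.7,
}
classes = ["c1", "c2"]

def decode(x):
    """Compute P(y|x) = exp(sum over features t in x of weight(t, y)) / Z for each y."""
    result, z = {}, 0.0
    for y in classes:
        s = sum(weights.get((t, y), 0.0) for t in x)
        result[y] = math.exp(s)
        z += result[y]
    return {y: v / z for y, v in result.items()}

print(decode(["t1", "t2"]))   # class probabilities for c1 and c2, summing to 1
```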

69 MaxEnt summary so far
Idea: choose the p* that maximizes entropy while satisfying all the constraints.
p* is also the model within a model family that maximizes the conditional likelihood of the training data.
MaxEnt handles overlapping features well.
In general, MaxEnt achieves good performance on many NLP tasks.
Next: training, for which there are many methods (e.g., GIS, IIS, L-BFGS).

70 Hw5

71 Q1: run Mallet MaxEnt learner
The format of the model file:

FEATURES FOR CLASS c1
0.3243
t1 0.245
t2 0.491
…
FEATURES FOR CLASS c2
0.3243
t1 -30.412
t2 1.349
…

72 Q2: write a MaxEnt decoder
The formula for P(y|x):
λ_0 is the weight for the default feature.
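The formula is not preserved in the transcript; given the note about λ_0, it is presumably the decoding rule of slide 67 with the class's default-feature weight added to the sum (a reconstruction):

```latex
P(y \mid x) = \frac{\exp\bigl( \lambda_0^{(y)} + \sum_{j} \lambda_j f_j(x,y) \bigr)}{Z(x)}
```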

73 Q3: Write functions to calculate expectation

74 Q4-Q6: The stoplight example
Which model? Bernoulli or multinomial? Why?
What features?
 - Q4: f1 and f2
 - Q5: f1, f2, and f3
 - Q6: f3 only

