
1 MaxEnt: Training, Smoothing, Tagging. Advanced Statistical Methods in NLP, Ling572, February 7, 2012

2 Roadmap: MaxEnt training; smoothing; case study: POS tagging (redux); beam search

3 Training

4 Training: Learn the λs from training data.

5 Training: Learn the λs from training data. Challenge: usually can't solve analytically; employ numerical methods.

6 Training: Learn the λs from training data. Challenge: usually can't solve analytically; employ numerical methods. Main techniques: Generalized Iterative Scaling (GIS; Darroch & Ratcliff, '72), Improved Iterative Scaling (IIS; Della Pietra et al., '95), L-BFGS, ...

7 Generalized Iterative Scaling (GIS) Setup: GIS requires the constraint Σ_j f_j(x,y) = C for all (x,y), where C is a constant.

8 Generalized Iterative Scaling (GIS) Setup: GIS requires the constraint Σ_j f_j(x,y) = C for all (x,y), where C is a constant. If it does not hold, set C = max_{x,y} Σ_j f_j(x,y).

9 Generalized Iterative Scaling (GIS) Setup: GIS requires the constraint Σ_j f_j(x,y) = C for all (x,y), where C is a constant. If it does not hold, set C = max_{x,y} Σ_j f_j(x,y).

10 Generalized Iterative Scaling (GIS) Setup: GIS requires the constraint Σ_j f_j(x,y) = C for all (x,y), where C is a constant. If it does not hold, set C = max_{x,y} Σ_j f_j(x,y) and add a correction feature function f_{k+1}(x,y) = C − Σ_{j=1..k} f_j(x,y).

11 Generalized Iterative Scaling (GIS) Setup: GIS requires the constraint Σ_j f_j(x,y) = C for all (x,y), where C is a constant. If it does not hold, set C = max_{x,y} Σ_j f_j(x,y) and add a correction feature function f_{k+1}(x,y) = C − Σ_{j=1..k} f_j(x,y). GIS also requires at least one active feature for every event; default feature functions solve this problem.

12 GIS Iteration: Compute the empirical expectation of each feature.

13 GIS Iteration: Compute the empirical expectation of each feature. Initialization: λ_j^(0); set to 0 or some other value.

14 GIS Iteration: Compute the empirical expectation of each feature. Initialization: λ_j^(0); set to 0 or some other value. Iterate until convergence; for each j:

15 GIS Iteration: Compute the empirical expectation of each feature. Initialization: λ_j^(0); set to 0 or some other value. Iterate until convergence; for each j: compute p(y|x) under the current model.

16 GIS Iteration: Compute the empirical expectation of each feature. Initialization: λ_j^(0); set to 0 or some other value. Iterate until convergence; for each j: compute p(y|x) under the current model; compute the model expectation under the current model.

17 GIS Iteration: Compute the empirical expectation of each feature. Initialization: λ_j^(0); set to 0 or some other value. Iterate until convergence; for each j: compute p(y|x) under the current model; compute the model expectation under the current model; update the model parameter by the weighted ratio of the empirical and model expectations.

18 GIS Iteration: Compute the empirical expectation of each feature, (1/N) Σ_i f_j(x_i, y_i).

19 GIS Iteration: Compute the empirical expectation of each feature, (1/N) Σ_i f_j(x_i, y_i). Initialization: λ_j^(0); set to 0 or some other value.

20 GIS Iteration: Compute the empirical expectation of each feature, (1/N) Σ_i f_j(x_i, y_i). Initialization: λ_j^(0); set to 0 or some other value. Iterate until convergence: compute the current model distribution.

21 GIS Iteration: Compute the empirical expectation of each feature, (1/N) Σ_i f_j(x_i, y_i). Initialization: λ_j^(0); set to 0 or some other value. Iterate until convergence: compute p^(n)(y|x) = exp(Σ_j λ_j^(n) f_j(x,y)) / Z(x).

22 GIS Iteration: Compute the empirical expectation of each feature, (1/N) Σ_i f_j(x_i, y_i). Initialization: λ_j^(0); set to 0 or some other value. Iterate until convergence: compute p^(n)(y|x) = exp(Σ_j λ_j^(n) f_j(x,y)) / Z(x); compute the model expectation of each feature, (1/N) Σ_i Σ_y p^(n)(y|x_i) f_j(x_i, y).

23 GIS Iteration: Compute the empirical expectation of each feature, (1/N) Σ_i f_j(x_i, y_i). Initialization: λ_j^(0); set to 0 or some other value. Iterate until convergence: compute p^(n)(y|x) = exp(Σ_j λ_j^(n) f_j(x,y)) / Z(x); compute the model expectation of each feature, (1/N) Σ_i Σ_y p^(n)(y|x_i) f_j(x_i, y); update λ_j^(n+1) = λ_j^(n) + (1/C) log(empirical expectation of f_j / model expectation of f_j) for each j.
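
To make the update concrete, here is a minimal GIS training sketch in Python. It is not the course's reference implementation; `data`, `feature_funcs`, and `labels` are assumed inputs, the correction feature is only noted in a comment, and convergence is checked with a simple log-likelihood threshold.

```python
import math

def train_gis(data, feature_funcs, labels, iters=100, tol=1e-4):
    """Minimal GIS sketch. data: list of (x, y) events; feature_funcs: list of
    functions f_j(x, y) returning counts; labels: the set of classes."""
    k, N = len(feature_funcs), len(data)

    # GIS constant C: max total feature count over events; a correction feature
    # f_{k+1}(x, y) = C - sum_j f_j(x, y) would make every sum exactly C (omitted here).
    C = max(sum(f(x, y) for f in feature_funcs) for x, _ in data for y in labels)

    # Empirical expectations: (1/N) * sum_i f_j(x_i, y_i)
    empirical = [sum(f(x, y) for x, y in data) / N for f in feature_funcs]

    lam, prev_ll = [0.0] * k, float("-inf")
    for _ in range(iters):
        model_exp, ll = [0.0] * k, 0.0
        for x, gold in data:
            # p(y|x) under the current model
            scores = {y: math.exp(sum(lam[j] * feature_funcs[j](x, y) for j in range(k)))
                      for y in labels}
            Z = sum(scores.values())
            for y in labels:
                p = scores[y] / Z
                for j in range(k):
                    model_exp[j] += p * feature_funcs[j](x, y) / N
            ll += math.log(scores[gold] / Z) / N
        # GIS update: lambda_j += (1/C) * log(empirical / model expectation)
        for j in range(k):
            if empirical[j] > 0 and model_exp[j] > 0:
                lam[j] += math.log(empirical[j] / model_exp[j]) / C
        if abs(ll - prev_ll) < tol:      # threshold-based convergence check
            break
        prev_ll = ll
    return lam
```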

24 Convergence: These methods have convergence guarantees.

25 Convergence: These methods have convergence guarantees. However, full convergence may take a very long time.

26 Convergence: These methods have convergence guarantees. However, full convergence may take a very long time. In practice, a threshold is frequently used, e.g. stop when the change in log-likelihood between iterations falls below a small value.

27 Convergence: These methods have convergence guarantees. However, full convergence may take a very long time. In practice, a threshold is frequently used, e.g. stop when the change in log-likelihood between iterations falls below a small value.

28 Calculating LL(p): LL = 0; for each sample x in the training data: let y be the true label of x; prob = p(y|x); LL += (1/N) * log(prob).
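
A direct translation of the pseudocode above, assuming a hypothetical `model.prob(y, x)` interface that returns p(y|x):

```python
import math

def avg_log_likelihood(model, data):
    """Average log-likelihood LL(p) over the N training samples in `data`."""
    N = len(data)
    ll = 0.0
    for x, y in data:                          # y is the true label of x
        ll += math.log(model.prob(y, x)) / N   # model.prob is an assumed interface
    return ll
```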

29 Running Time: For each iteration, the running time is:

30 Running Time: For each iteration, the running time is O(NPA), where N is the number of training instances, P is the number of classes, and A is the average number of active features per instance (x,y).

31 L-BFGS: Limited-memory version of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method.

32 L-BFGS: Limited-memory version of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method. A quasi-Newton method for unconstrained optimization.

33 L-BFGS: Limited-memory version of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method. A quasi-Newton method for unconstrained optimization. Good for optimization problems with many variables.

34 L-BFGS: Limited-memory version of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method. A quasi-Newton method for unconstrained optimization. Good for optimization problems with many variables. The "algorithm of choice" for MaxEnt and related models.

35 L-BFGS References: Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage". Mathematics of Computation 35: 773–782. Liu, D. C. and Nocedal, J. (1989). "On the Limited Memory BFGS Method for Large Scale Optimization". Mathematical Programming B 45 (3): 503–528.

36 L-BFGS References: Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage". Mathematics of Computation 35: 773–782. Liu, D. C. and Nocedal, J. (1989). "On the Limited Memory BFGS Method for Large Scale Optimization". Mathematical Programming B 45 (3): 503–528. Implementations: Java, Matlab, Python via SciPy, R, etc.; see the Wikipedia page.
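
As a sketch of using an off-the-shelf L-BFGS implementation, the snippet below minimizes a MaxEnt negative log-likelihood with SciPy's `scipy.optimize.minimize` (method "L-BFGS-B"). The feature tensor `F`, the label array `gold`, and their shapes are toy assumptions, not data from the lecture.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (assumed shapes): F[i, y, j] = f_j(x_i, y); gold[i] = index of true label.
rng = np.random.default_rng(0)
F = rng.random((50, 3, 10))            # 50 instances, 3 classes, 10 features
gold = rng.integers(0, 3, size=50)

def neg_log_likelihood(lam):
    scores = F @ lam                              # (N, Y) unnormalized log scores
    log_Z = np.logaddexp.reduce(scores, axis=1)   # log partition function per instance
    return -(scores[np.arange(len(gold)), gold] - log_Z).sum()

def gradient(lam):
    scores = F @ lam
    p = np.exp(scores - np.logaddexp.reduce(scores, axis=1, keepdims=True))
    model_exp = np.einsum('iy,iyj->j', p, F)      # model expected feature counts
    emp_exp = F[np.arange(len(gold)), gold].sum(axis=0)
    return model_exp - emp_exp                    # gradient of the negative LL

result = minimize(neg_log_likelihood, np.zeros(10), jac=gradient, method="L-BFGS-B")
print(result.x)                                   # fitted feature weights
```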

37 Smoothing (based on Klein & Manning, 2003; F. Xia)

38 Smoothing: Problems of scale:

39 Smoothing: Problems of scale: Large numbers of features; some NLP problems in MaxEnt have on the order of 1M features, so storage can be a problem.

40 Smoothing: Problems of scale: Large numbers of features; some NLP problems in MaxEnt have on the order of 1M features, so storage can be a problem. Sparseness problems: ease of overfitting.

41 Smoothing: Problems of scale: Large numbers of features; some NLP problems in MaxEnt have on the order of 1M features, so storage can be a problem. Sparseness problems: ease of overfitting. Optimization problems: feature weights can grow without bound and take a long time to converge.

42 Smoothing: Consider the coin-flipping problem: three empirical distributions and the corresponding models. (Figure from K&M '03.)

43 Need for Smoothing: Two problems. (From K&M '03)

44 Need for Smoothing: Two problems. Optimization: the optimal value of λ is ∞, and it is slow to optimize. (From K&M '03)

45 Need for Smoothing: Two problems. Optimization: the optimal value of λ is ∞, and it is slow to optimize. With no smoothing, the learned distribution is just as spiky as the empirical one (K&M '03).

46 Possible Solutions

47 Possible Solutions: Early stopping, feature selection, regularization.

48 Early Stopping: Prior use of early stopping:

49 Early Stopping: Prior use of early stopping: decision tree heuristics.

50 Early Stopping: Prior use of early stopping: decision tree heuristics. Similarly here: stop training after a few iterations; the λs will have increased, but only boundedly. This guarantees bounded, finite training time.
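
One way to realize early stopping is sketched below; it swaps the slide's "stop after a few iterations" for the common variant that monitors held-out performance. The `model.one_iteration` and `model.accuracy` methods are assumed, not a real API.

```python
def train_with_early_stopping(model, train, dev, max_iters=100, patience=3):
    """Early-stopping sketch: run at most max_iters iterations and stop once
    held-out accuracy has not improved for `patience` iterations."""
    best, since_best = float("-inf"), 0
    for _ in range(max_iters):
        model.one_iteration(train)   # e.g. one GIS update of all lambdas (assumed API)
        score = model.accuracy(dev)  # assumed API
        if score > best:
            best, since_best = score, 0
        else:
            since_best += 1
            if since_best >= patience:
                break                # bounded, finite training time
    return model
```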

51 Feature Selection Approaches:

52 Feature Selection Approaches: Heuristic: drop features based on fixed thresholds, e.g. number of occurrences.

53 Feature Selection Approaches: Heuristic: drop features based on fixed thresholds, e.g. number of occurrences. Wrapper methods: add feature selection to the training loop.

54 Feature Selection Approaches: Heuristic: drop features based on fixed thresholds, e.g. number of occurrences. Wrapper methods: add feature selection to the training loop. Heuristic approaches are simple and reduce the feature set, but can harm performance.
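
A minimal sketch of the heuristic count-cutoff approach; `events` is assumed to be a list of (feature list, label) pairs, and the default threshold of 10 mirrors the cutoff used in the POS-tagging case study later.

```python
from collections import Counter

def select_features(events, min_count=10):
    """Count-cutoff feature selection: keep features occurring at least min_count
    times in the training events (each event is a (feature list, label) pair)."""
    counts = Counter(feat for feats, _label in events for feat in feats)
    return {feat for feat, c in counts.items() if c >= min_count}
```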

55 Regularization: In statistics and machine learning, regularization is any method of preventing overfitting of data by a model. (From K&M '03, F. Xia)

56 Regularization: In statistics and machine learning, regularization is any method of preventing overfitting of data by a model. Typical examples of regularization in statistical machine learning include ridge regression, the lasso, and the L2 norm in support vector machines. (From K&M '03, F. Xia)

57 Regularization: In statistics and machine learning, regularization is any method of preventing overfitting of data by a model. Typical examples of regularization in statistical machine learning include ridge regression, the lasso, and the L2 norm in support vector machines. In this case, we change the objective function: log P(Y,λ|X) = log P(λ) + log P(Y|X,λ). (From K&M '03, F. Xia)

58 Prior: Possible prior distributions: uniform, exponential.

59 Prior: Possible prior distributions: uniform, exponential. Gaussian prior: P(λ_j) = (1 / sqrt(2πσ_j²)) exp(−(λ_j − μ_j)² / (2σ_j²)).

60 Prior: Possible prior distributions: uniform, exponential. Gaussian prior: P(λ_j) = (1 / sqrt(2πσ_j²)) exp(−(λ_j − μ_j)² / (2σ_j²)). Objective: log P(Y,λ|X) = log P(λ) + log P(Y|X,λ).

61 Without a prior, we maximize P(Y|X,λ); with a Gaussian prior, we maximize P(Y,λ|X), i.e. the log-likelihood minus Σ_j (λ_j − μ_j)² / (2σ_j²) plus a constant. In practice, μ = 0 and 2σ² = 1.
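
A sketch of how the Gaussian prior changes the objective in code: wrap an unregularized negative log-likelihood and its gradient (assumed callables, e.g. the ones from the earlier L-BFGS sketch) with an L2 penalty. With μ = 0 and 2σ² = 1, the penalty is simply the sum of squared λs.

```python
import numpy as np
from scipy.optimize import minimize

def penalized_objective(lam, nll, grad, sigma2=0.5):
    """Gaussian-prior (L2) regularization: add sum((lam - mu)^2 / (2 sigma^2))
    to the negative log-likelihood; here mu = 0 and 2*sigma^2 = 1 (sigma2 = 0.5)."""
    value = nll(lam) + np.sum(lam ** 2) / (2.0 * sigma2)
    gradient = grad(lam) + lam / sigma2
    return value, gradient

# Usage with L-BFGS: minimize() accepts a function returning (value, gradient)
# when jac=True, e.g.
# result = minimize(penalized_objective, np.zeros(10),
#                   args=(neg_log_likelihood, gradient), jac=True, method="L-BFGS-B")
```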

62 L1 and L2 Regularization

63 Smoothing: POS Example

64 Advantages of Smoothing: Smooths distributions.

65 Advantages of Smoothing: Smooths distributions. Moves weight onto more informative features.

66 Advantages of Smoothing: Smooths distributions. Moves weight onto more informative features. Enables effective use of larger numbers of features.

67 Advantages of Smoothing: Smooths distributions. Moves weight onto more informative features. Enables effective use of larger numbers of features. Can speed up convergence.

68 Summary: Training. There are many training methods, e.g. Generalized Iterative Scaling (GIS). Smoothing: early stopping, feature selection, regularization. Regularization changes the objective function by adding a prior; a common choice is a Gaussian prior. Note that maximizing the posterior is no longer equivalent to maximum entropy.

69 MaxEnt POS Tagging

70 Notation (Ratnaparkhi, 1996): h = history (the x in our earlier notation), i.e. the word and tag history; t = tag (the y).

71 POS Tagging Model: P(t_1, …, t_n | w_1, …, w_n), where the history for word i is h_i = {w_i, w_{i-1}, w_{i-2}, w_{i+1}, w_{i+2}, t_{i-1}, t_{i-2}}.

72 MaxEnt Feature Set

73 Example Features for 'about': exclude features seen < 10 times.

74 Training: GIS training time is O(NTA), where N is the training set size, T the number of tags, and A the average number of features active for an event (h,t). About 24 hours on a 1996 machine.

75 Finding Features: In training, where do features come from? Where do features come from in testing?
   instance   | w-1   | w0    | w-1 w0     | w+1   | t-1 | y
   x1 (Time)  |       | Time  |            | flies | BOS | N
   x2 (flies) | Time  | flies | Time flies | like  | N   | N
   x3 (like)  | flies | like  | flies like | an    | N   | V

76 Finding Features (continued, same table as above): in testing, the tag features (t-1, etc.) come from the classifier's own predictions for the prior words; the word features come directly from the sentence.
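
A sketch of extracting Ratnaparkhi-style history features for one position; the template names and the BOS/EOS padding are illustrative, not the exact feature strings from the paper.

```python
def history_features(words, tags, i):
    """History features for position i. `tags` holds the tags already assigned to
    words[0..i-1] (gold tags in training, the decoder's own hypotheses in testing)."""
    def at(seq, j):
        return seq[j] if 0 <= j < len(seq) else ("BOS" if j < 0 else "EOS")
    return [
        "w0=" + words[i],
        "w-1=" + at(words, i - 1),
        "w-2=" + at(words, i - 2),
        "w+1=" + at(words, i + 1),
        "w+2=" + at(words, i + 2),
        "t-1=" + at(tags, i - 1),
        "t-2t-1=" + at(tags, i - 2) + "_" + at(tags, i - 1),
    ]
```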

77 Decoding Goal: Identify the highest-probability tag sequence.

78 Decoding Goal: Identify the highest-probability tag sequence. Issues: features include tags from previous words, which are not immediately available.

79 Decoding Goal: Identify the highest-probability tag sequence. Issues: features include tags from previous words, which are not immediately available. The model uses tag history, so just knowing the highest-probability preceding tag is insufficient.

80 Beam Search Intuition: Breadth-first search explores all paths, but lots of paths are (pretty obviously) bad, so why explore bad paths? Restrict attention to the (apparently) best paths. Approach: perform breadth-first search, but...

81 Beam Search Intuition: Breadth-first search explores all paths, but lots of paths are (pretty obviously) bad, so why explore bad paths? Restrict attention to the (apparently) best paths. Approach: perform breadth-first search, but retain only the k 'best' paths so far, where k is the beam width.

82-86 Beam Search, k=3: worked example tagging the sentence "time flies like an arrow" (beam diagrams not preserved in the transcript).

87 Beam Search: W = {w_1, w_2, …, w_n} is the test sentence.

88 Beam Search: W = {w_1, w_2, …, w_n} is the test sentence; s_ij is the j-th highest-probability tag sequence up to and including word w_i.

89 Beam Search: W = {w_1, w_2, …, w_n} is the test sentence; s_ij is the j-th highest-probability tag sequence up to and including word w_i. Generate tags for w_1, keep the top k, and set s_1j accordingly.

90 Beam Search: W = {w_1, w_2, …, w_n} is the test sentence; s_ij is the j-th highest-probability tag sequence up to and including word w_i. Generate tags for w_1, keep the top k, and set s_1j accordingly. For i = 2 to n:

91 Beam Search: W = {w_1, w_2, …, w_n} is the test sentence; s_ij is the j-th highest-probability tag sequence up to and including word w_i. Generate tags for w_1, keep the top k, and set s_1j accordingly. For i = 2 to n: Extension: add tags for w_i to each s_(i-1)j.

92 Beam Search: W = {w_1, w_2, …, w_n} is the test sentence; s_ij is the j-th highest-probability tag sequence up to and including word w_i. Generate tags for w_1, keep the top k, and set s_1j accordingly. For i = 2 to n: Extension: add tags for w_i to each s_(i-1)j. Beam selection: sort the sequences by probability and keep only the top k sequences.

93 Beam Search: W = {w_1, w_2, …, w_n} is the test sentence; s_ij is the j-th highest-probability tag sequence up to and including word w_i. Generate tags for w_1, keep the top N, and set s_1j accordingly. For i = 2 to n: for each s_(i-1)j, form the feature vector for w_i and keep the top N tags for w_i. Beam selection: sort the sequences by probability and keep only the top sequences, using the pruning on the next slide. Return the highest-probability sequence s_n1.

94 Beam Search Pruning and Storage: W is the beam width. For each node, store the tag for w_i and the probability of the sequence so far, prob_ij. For each candidate j, keep node s_ij if prob_ij is in the top K and prob_ij is sufficiently high, e.g. lg(prob_ij) + W >= lg(max_prob).
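
Putting the pieces together, a compact beam-search decoder sketch; `tags_for` (the tag dictionary) and `score` (the MaxEnt model's p(t | history)) are assumed helpers, and the extra width-based pruning from the slide above is only indicated in a comment.

```python
import math

def beam_search(words, tags_for, score, k=5):
    """Beam-search decoder sketch. tags_for(word) returns candidate tags; score(words,
    tags, i, t) returns p(t | history at i). Keeps at most k partial sequences per word."""
    beam = [([], 0.0)]                        # (tag sequence so far, log probability)
    for i, w in enumerate(words):
        extended = []
        for seq, logp in beam:                # extension step
            for t in tags_for(w):
                extended.append((seq + [t], logp + math.log(score(words, seq, i, t))))
        extended.sort(key=lambda item: item[1], reverse=True)
        beam = extended[:k]                   # beam selection: keep the top k
        # Optional extra pruning: also drop items with logp < best_logp - W.
    return beam[0][0]                         # highest-probability tag sequence
```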

95 Decoding: Tag dictionary: a known word returns the tags seen with that word in training; an unknown word returns all tags. Beam width = 5. Running time: O(NTAB), with N, T, and A as before and B the beam width.
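
A sketch of the tag dictionary described above, returning a `tags_for` function that can be passed to the beam-search sketch; the input format for `training_sentences` is an assumption.

```python
from collections import defaultdict

def build_tag_dictionary(training_sentences, all_tags):
    """Tag dictionary: a known word returns the tags it was seen with in training;
    an unknown word returns all tags. training_sentences is assumed to be a list
    of sentences, each a list of (word, tag) pairs."""
    seen = defaultdict(set)
    for sentence in training_sentences:
        for word, tag in sentence:
            seen[word].add(tag)

    def tags_for(word):
        return seen[word] if word in seen else set(all_tags)

    return tags_for
```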

96 POS Tagging: Overall accuracy: 96.3+%. Unseen-word accuracy: 86.2%. Comparable to HMM or TBL tagging accuracy. Provides a probabilistic framework and is better able to model different information sources. The topline accuracy is 96-97% because of consistency issues in the annotation.

97 Beam Search: Beam search decoding is a variant of breadth-first search that at each layer keeps only the top k sequences. Advantages:

98 Beam Search: Beam search decoding is a variant of breadth-first search that at each layer keeps only the top k sequences. Advantages: Efficient in practice: a beam of 3-5 is near optimal; empirically, the beam explores only 5-10% of the search space, i.e. it prunes 90-95%.

99 Beam Search: Beam search decoding is a variant of breadth-first search that at each layer keeps only the top k sequences. Advantages: Efficient in practice: a beam of 3-5 is near optimal; empirically, the beam explores only 5-10% of the search space, i.e. it prunes 90-95%. Simple to implement: just extensions plus sorting, no dynamic programming.

100 Beam Search: Beam search decoding is a variant of breadth-first search that at each layer keeps only the top k sequences. Advantages: Efficient in practice: a beam of 3-5 is near optimal; empirically, the beam explores only 5-10% of the search space, i.e. it prunes 90-95%. Simple to implement: just extensions plus sorting, no dynamic programming. Running time:

101 Beam Search: Beam search decoding is a variant of breadth-first search that at each layer keeps only the top sequences. Advantages: Efficient in practice: a beam of 3-5 is near optimal; empirically, the beam explores only 5-10% of the search space, i.e. it prunes 90-95%. Simple to implement: just extensions plus sorting, no dynamic programming. Disadvantage: not guaranteed to be optimal (or complete).

102 MaxEnt POS Tagging: Part-of-speech tagging by classification. Feature design: word and tag context features; orthographic features for rare words.

103 MaxEnt POS Tagging: Part-of-speech tagging by classification. Feature design: word and tag context features; orthographic features for rare words. Sequence classification problems: tag features depend on prior classifications.

104 MaxEnt POS Tagging: Part-of-speech tagging by classification. Feature design: word and tag context features; orthographic features for rare words. Sequence classification problems: tag features depend on prior classifications. Beam search decoding: efficient but inexact; near optimal in practice.

