MaxEnt: Training, Smoothing, Tagging
Advanced Statistical Methods in NLP (Ling572), February 7, 2012
Roadmap
- MaxEnt: training
- Smoothing
- Case study: POS tagging (redux)
- Beam search

Training
Training
- Learn the λs from the training data
- Challenge: usually can't solve analytically, so employ numerical methods
- Main techniques:
  - Generalized Iterative Scaling (GIS; Darroch & Ratcliff, 1972)
  - Improved Iterative Scaling (IIS; Della Pietra et al., 1995)
  - L-BFGS, ...
Generalized Iterative Scaling
GIS setup:
- GIS requires the constraint that, for every event (x, y), sum_{j=1..k} f_j(x, y) = C, where C is a constant
- If not, set C = max_{x,y} sum_{j=1..k} f_j(x, y) and add a correction feature function f_{k+1}:
  f_{k+1}(x, y) = C - sum_{j=1..k} f_j(x, y)
- GIS also requires at least one active feature for any event; default feature functions solve this problem
GIS Iteration
- Compute the empirical expectation of each feature
- Initialization: set each λ_j^(0) to 0 or some other value
- Iterate until convergence; for each j:
  - Compute p(y|x) under the current model
  - Compute the model expectation of f_j under the current model
  - Update the model parameters by the weighted ratio of the empirical and model expectations
GIS Iteration
- Compute the empirical expectation: E_p̃[f_j] = (1/N) sum_{i=1..N} f_j(x_i, y_i)
- Initialization: λ_j^(0) = 0 (or some other value)
- Iterate until convergence; at iteration n:
  - Compute p^(n)(y|x) = exp( sum_j λ_j^(n) f_j(x, y) ) / Z(x)
  - Compute the model expectation: E_{p^(n)}[f_j] = (1/N) sum_{i=1..N} sum_y p^(n)(y|x_i) f_j(x_i, y)
  - Update: λ_j^(n+1) = λ_j^(n) + (1/C) log( E_p̃[f_j] / E_{p^(n)}[f_j] ) (see the code sketch below)
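A minimal sketch of this loop in Python, assuming dense NumPy feature vectors, a GIS constant C computed from the data, and that every feature fires at least once on the observed training events (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def train_gis(features, labels, num_classes, num_iters=100, tol=1e-6):
    """Toy GIS trainer. `features[i][y]` is a NumPy vector of feature
    values f_j(x_i, y); `labels[i]` is the observed class for instance i.
    Assumes every feature fires at least once on the observed data."""
    N = len(features)
    J = len(features[0][0])
    # GIS constant C: maximum total feature count over all (x, y) events
    C = max(float(feats_y.sum()) for x in features for feats_y in x)
    lam = np.zeros(J)

    # Empirical expectation: average feature values on the observed (x_i, y_i)
    emp = np.zeros(J)
    for i, y in enumerate(labels):
        emp += features[i][y]
    emp /= N

    for _ in range(num_iters):
        # Model expectation under the current parameters lambda^(n)
        model = np.zeros(J)
        for i in range(N):
            scores = np.array([lam @ features[i][y] for y in range(num_classes)])
            p = np.exp(scores - scores.max())
            p /= p.sum()                      # p^(n)(y | x_i)
            for y in range(num_classes):
                model += p[y] * features[i][y]
        model /= N
        # GIS update: lambda_j += (1/C) * log(E_emp[f_j] / E_model[f_j])
        step = np.log(emp / model) / C
        lam += step
        if np.abs(step).max() < tol:          # crude convergence threshold
            break
    return lam
```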
Convergence
- These methods have convergence guarantees
- However, full convergence may take a very long time
- Frequently a convergence threshold is used instead
Calculating LL(p) (a runnable version follows below)
  LL = 0
  for each sample x in the training data:
      let y be the true label of x
      prob = p(y|x)
      LL += (1/N) * log(prob)
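The same computation as runnable Python, where `p` is assumed to be a callable returning the model probability p(y|x) (names are illustrative):

```python
import math

def avg_log_likelihood(data, p):
    """Average conditional log-likelihood; `data` is a list of (x, y) pairs
    and `p(y, x)` returns the model probability p(y | x)."""
    N = len(data)
    return sum(math.log(p(y, x)) for x, y in data) / N
```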
Running Time
- Each iteration runs in time O(NPA), where:
  - N: number of training instances
  - P: number of classes
  - A: average number of active features for an instance (x, y)
L-BFGS
- Limited-memory version of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method
- A quasi-Newton method for unconstrained optimization
- Good for optimization problems with many variables
- The "algorithm of choice" for MaxEnt and related models
L-BFGS
- References:
  - Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage." Mathematics of Computation 35: 773–782.
  - Liu, D. C. and Nocedal, J. (1989). "On the Limited Memory BFGS Method for Large Scale Optimization." Mathematical Programming B 45(3): 503–528.
- Implementations: Java, MATLAB, Python (via SciPy; see the sketch below), R, etc.; see the Wikipedia page
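As an illustration of the SciPy route, here is a small sketch that fits weights for a toy binary problem by minimizing a negative log-likelihood with L-BFGS; the data and objective are stand-ins, not the course's setup:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: three instances with two features each, binary labels.
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
y = np.array([0, 1, 1])

def neg_log_likelihood(w):
    """Negative conditional log-likelihood of a simple binary log-linear model."""
    scores = X @ w
    p1 = 1.0 / (1.0 + np.exp(-scores))        # p(y = 1 | x)
    p = np.where(y == 1, p1, 1.0 - p1)
    return -np.log(p).sum()

result = minimize(neg_log_likelihood, x0=np.zeros(2), method="L-BFGS-B")
print(result.x)                               # fitted weights
```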
Smoothing
(Based on Klein & Manning, 2003; F. Xia)
Smoothing
- Problems of scale:
  - Large numbers of features: some NLP MaxEnt problems have on the order of 1M features, so storage can be a problem
  - Sparseness problems: easy to overfit
  - Optimization problems: the feature space can be nearly infinite, and training can take a long time to converge
Smoothing
- Consider the coin-flipping problem: three empirical distributions and the models fit to them (from K&M '03)
Need for Smoothing
- Two problems (from K&M '03):
  - Optimization: the optimal value of λ is ∞, and it is slow to optimize
  - With no smoothing, the learned distribution is just as spiky as the empirical one (K&M '03)
Possible Solutions
- Early stopping
- Feature selection
- Regularization
Early Stopping
- Prior use of early stopping: decision tree heuristics
- Similarly here: stop training after a few iterations
  - The λs will have increased
  - Guarantees bounded, finite training time
Feature Selection
- Approaches:
  - Heuristic: drop features based on fixed thresholds, e.g. number of occurrences (sketched below)
  - Wrapper methods: add feature selection to the training loop
- Heuristic approaches are simple and reduce the feature count, but can harm performance
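A count-threshold filter along these lines might look like the following sketch (the event representation and the threshold value are assumptions, not from the slides):

```python
from collections import Counter

def filter_features(training_events, min_count=10):
    """Keep only features that occur at least `min_count` times.

    `training_events` is assumed to be an iterable of events, each
    represented as a list of active feature names."""
    counts = Counter(feat for event in training_events for feat in event)
    return {feat for feat, c in counts.items() if c >= min_count}
```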
Regularization
- In statistics and machine learning, regularization is any method of preventing overfitting of data by a model
- Typical examples of regularization in statistical machine learning include ridge regression, the lasso, and the L2 norm in support vector machines
- In this case, we change the objective function:
  log P(Y, λ|X) = log P(λ) + log P(Y|X, λ)
(From K&M '03, F. Xia)
Prior
- Possible prior distributions: uniform, exponential
- Gaussian prior: P(λ_j) = (1 / sqrt(2πσ_j^2)) exp( -(λ_j - μ_j)^2 / (2σ_j^2) )
- With this prior: log P(Y, λ|X) = log P(λ) + log P(Y|X, λ)
- Instead of maximizing P(Y|X, λ), maximize P(Y, λ|X)
- In practice, μ = 0 and 2σ^2 = 1 (see the sketch below)
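A sketch of how the Gaussian prior changes the objective and its gradient, assuming the model supplies its own log-likelihood and gradient functions (the callable names here are hypothetical placeholders):

```python
import numpy as np

def penalized_objective(lam, log_likelihood, grad_log_likelihood,
                        mu=0.0, two_sigma_sq=1.0):
    """Gaussian-prior objective log P(Y, lambda | X) and its gradient.
    `log_likelihood` / `grad_log_likelihood` are placeholders for the
    model's own log P(Y | X, lambda) and its gradient."""
    penalty = np.sum((lam - mu) ** 2) / two_sigma_sq   # -log P(lambda), up to a constant
    value = log_likelihood(lam) - penalty
    grad = grad_log_likelihood(lam) - 2.0 * (lam - mu) / two_sigma_sq
    return value, grad
```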
L1 and L2 Regularization
- L1 penalty: sum_j |λ_j|; L2 penalty: sum_j λ_j^2 (the Gaussian prior above corresponds to an L2 penalty)
Smoothing: POS Example
Advantages of Smoothing
- Smooths distributions
- Moves weight onto more informative features
- Enables effective use of larger numbers of features
- Can speed up convergence
Summary: Training
- Many training methods, e.g. Generalized Iterative Scaling (GIS)
- Smoothing: early stopping, feature selection, regularization
- Regularization: change the objective function by adding a prior
- Common prior: Gaussian prior
- Maximizing the posterior is no longer equivalent to maximum entropy
MaxEnt POS Tagging
Notation (Ratnaparkhi, 1996)
- h: history (word and tag context); plays the role of x
- t: tag; plays the role of y
POS Tagging Model
- P(t_1, ..., t_n | w_1, ..., w_n) ≈ prod_{i=1..n} p(t_i | h_i),
  where h_i = {w_i, w_{i-1}, w_{i-2}, w_{i+1}, w_{i+2}, t_{i-1}, t_{i-2}} (a history-building sketch follows below)
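A small sketch of building the history h_i for one position, under the assumption that the sentence is a list of words and the tags for earlier positions are already available (names are illustrative):

```python
def make_history(words, tags, i):
    """h_i = {w_i, w_{i-1}, w_{i-2}, w_{i+1}, w_{i+2}, t_{i-1}, t_{i-2}}.
    `tags` holds the tags already assigned to positions 0 .. i-1."""
    def w(k):
        if k < 0:
            return "BOS"
        return words[k] if k < len(words) else "EOS"
    def t(k):
        return tags[k] if k >= 0 else "BOS"
    return {
        "w0": w(i), "w-1": w(i - 1), "w-2": w(i - 2),
        "w+1": w(i + 1), "w+2": w(i + 2),
        "t-1": t(i - 1), "t-2": t(i - 2),
    }
```

For example, make_history("time flies like an arrow".split(), ["N"], 1) describes the history for "flies".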
MaxEnt Feature Set

Example Feature for 'about'
- Exclude features seen < 10 times
Training
- GIS training time: O(NTA)
  - N: training set size
  - T: number of tags
  - A: average number of features active for an event (h, t)
- About 24 hours on a 1996 machine
Finding Features
- In training, where do features come from?
- Where do features come from in testing? The tag features come from the classification of the prior words.

  instance     w-1    w0     w-1 w0       w+1    t-1   y
  x1 (Time)    -      Time   -            flies  BOS   N
  x2 (flies)   Time   flies  Time flies   like   N     N
  x3 (like)    flies  like   flies like   an     N     V
Decoding
- Goal: identify the highest-probability tag sequence
- Issues:
  - Features include tags from previous words, which are not immediately available
  - The model uses tag history, so just knowing the highest-probability preceding tag is insufficient
Beam Search
- Intuition:
  - Breadth-first search explores all paths
  - Lots of paths are (pretty obviously) bad; why explore bad paths?
  - Restrict the search to the (apparently) best paths
- Approach: perform breadth-first search, but retain only the k 'best' paths so far
  - k: beam width
Beam Search, k=3
- Worked example over the sentence "time flies like an arrow"
Beam Search
- W = {w_1, w_2, ..., w_n}: the test sentence
- s_ij: the j-th highest-probability tag sequence up to and including word w_i
- Algorithm:
  - Generate tags for w_1, keep the top N, and set the s_1j accordingly
  - for i = 2 to n:
    - Extension: for each s_(i-1)j, form the feature vector for w_i and keep the top N tags for w_i
    - Beam selection: sort the resulting sequences by probability and keep only the top sequences, using the pruning on the next slide
  - Return the highest-probability sequence s_n1
Beam Search
- Pruning and storage:
  - W = beam width
  - For each node, store the tag for w_i and the probability of the sequence so far:
    prob_ij = prod_{m=1..i} p(t_m | h_m)
  - For each candidate j, keep the node s_ij if prob_ij is in the top K and prob_ij is sufficiently high, e.g. lg(prob_ij) + W >= lg(max_prob) (a decoder sketch follows below)
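A compact sketch of this decoder using log probabilities; `prob(i, history_tags, tag)` is a hypothetical callable standing in for p(t_i | h_i), and for simplicity the same N is used for the per-hypothesis tag cutoff and the beam (not the course implementation):

```python
import math

def beam_search(words, tagset, prob, topN=5, beam_width=5.0):
    """Beam-search tagger. `beam_width` plays the role of W from the
    slide, here in natural-log units."""
    beams = [([], 0.0)]                       # (tag sequence so far, log prob)
    for i in range(len(words)):
        candidates = []
        for tags, logp in beams:
            # Extension: score each tag for w_i given this hypothesis,
            # keeping only the topN tags per hypothesis
            scored = sorted(((t, math.log(prob(i, tags, t))) for t in tagset),
                            key=lambda x: x[1], reverse=True)[:topN]
            for t, lp in scored:
                candidates.append((tags + [t], logp + lp))
        # Beam selection: sort by probability, prune sequences that fall
        # too far below the best one, and keep only the top few
        candidates.sort(key=lambda x: x[1], reverse=True)
        best = candidates[0][1]
        beams = [c for c in candidates if c[1] + beam_width >= best][:topN]
    return beams[0][0]                        # highest-probability sequence
```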
Decoding
- Tag dictionary (sketched below):
  - Known word: return the tags seen with the word in training
  - Unknown word: return all tags
- Beam width = 5
- Running time: O(NTAB)
  - N, T, A as before; B: beam width
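A tag dictionary of this kind can be sketched as follows (the corpus format and names are assumptions):

```python
from collections import defaultdict

class TagDictionary:
    """Known words return the tags seen with them in training;
    unknown words return the full tagset."""
    def __init__(self, tagged_corpus):
        # tagged_corpus: iterable of (word, tag) pairs
        self.seen = defaultdict(set)
        self.all_tags = set()
        for word, tag in tagged_corpus:
            self.seen[word].add(tag)
            self.all_tags.add(tag)

    def tags_for(self, word):
        return self.seen.get(word, self.all_tags)
```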
POS Tagging
- Overall accuracy: 96.3+%
- Unseen word accuracy: 86.2%
- Comparable to HMM tagging accuracy or TBL
- Provides a probabilistic framework that is better able to model different information sources
- Topline accuracy is around 96-97%, due to consistency issues
Beam Search
- Beam search decoding: a variant of breadth-first search that keeps only the top k sequences at each layer
- Advantages:
  - Efficient in practice: a beam of 3-5 is near optimal
  - Empirically, the beam explores only 5-10% of the search space, pruning 90-95%
  - Simple to implement: just extensions + sorting, no dynamic programming
- Disadvantage: not guaranteed optimal (or complete)
MaxEnt POS Tagging
- Part-of-speech tagging by classification
  - Feature design: word and tag context features; orthographic features for rare words
- Sequence classification problems: tag features depend on prior classifications
- Beam search decoding: efficient but inexact; near optimal in practice