MaxEnt: Training, Smoothing, Tagging
Advanced Statistical Methods in NLP, Ling572
February 7
Roadmap
- MaxEnt: training
- Smoothing
- Case study: POS tagging (redux)
- Beam search
Training
- Learn the λs from training data
- Challenge: usually can't be solved analytically, so employ numerical methods
- Main techniques:
  - Generalized Iterative Scaling (GIS; Darroch & Ratcliff, 1972)
  - Improved Iterative Scaling (IIS; Della Pietra et al., 1995)
  - L-BFGS, ...
Generalized Iterative Scaling (GIS)
Setup:
- GIS requires the constraint Σ_{j=1}^{k} f_j(x, y) = C for every (x, y), where C is a constant
- If this does not hold, set C = max_{x,y} Σ_{j=1}^{k} f_j(x, y) and add a correction feature function f_{k+1}(x, y) = C - Σ_{j=1}^{k} f_j(x, y)
- GIS also requires at least one active feature for every event; default feature functions solve this problem
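As a concrete illustration, a minimal sketch (not from the slides) of computing C and the correction feature; the sparse representation of an event's features as (feature_index, value) pairs is an assumption for this example.

```python
# Minimal sketch: GIS constant C and the correction feature value for one event.
# Assumes each event's active features are a list of (feature_index, value) pairs.
def gis_constant(events):
    """C = max over all events (x, y) of the summed feature values."""
    return max(sum(v for _, v in feats) for feats in events)

def correction_feature(feats, C):
    """f_{k+1}(x, y) = C - sum_j f_j(x, y), so every event's features sum to C."""
    return C - sum(v for _, v in feats)
```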
GIS Iteration
- Compute the empirical expectation of each feature
- Initialization: λ_j^{(0)}; set to 0 or some other value
- Iterate until convergence, for each j:
  - Compute p(y|x) under the current model
  - Compute the model expectation of each feature under the current model
  - Update the model parameters by a weighted ratio of the empirical and model expectations
GIS Iteration (with formulas)
- Compute the empirical expectation: E_{~p}[f_j] = (1/N) Σ_{i=1}^{N} f_j(x_i, y_i)
- Initialization: λ_j^{(0)}; set to 0 or some other value
- Iterate until convergence:
  - Compute p^{(n)}(y|x) = exp(Σ_j λ_j^{(n)} f_j(x, y)) / Z(x), where Z(x) = Σ_y exp(Σ_j λ_j^{(n)} f_j(x, y))
  - Compute the model expectation: E_{p^{(n)}}[f_j] = (1/N) Σ_{i=1}^{N} Σ_y p^{(n)}(y|x_i) f_j(x_i, y)
  - Update: λ_j^{(n+1)} = λ_j^{(n)} + (1/C) log( E_{~p}[f_j] / E_{p^{(n)}}[f_j] )
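A minimal sketch of the loop above (not the original implementation), assuming a small dense representation where feats[i, y, j] holds the value of feature j for instance i with candidate label y, gold[i] is the index of the true label, and every feature has nonzero empirical expectation:

```python
import numpy as np

def gis_train(feats, gold, C, num_iters=100):
    """One GIS run: feats has shape (N, Y, k); gold has shape (N,)."""
    N, Y, k = feats.shape
    lam = np.zeros(k)                                     # lambda_j^(0) = 0
    emp = feats[np.arange(N), gold].sum(axis=0) / N       # empirical expectation
    for _ in range(num_iters):
        scores = feats @ lam                              # sum_j lambda_j f_j, shape (N, Y)
        probs = np.exp(scores - scores.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)         # p^(n)(y | x_i)
        model = np.einsum('ny,nyk->k', probs, feats) / N  # model expectation
        lam += np.log(emp / model) / C                    # GIS update
    return lam
```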
Convergence
- These methods have convergence guarantees
- However, full convergence may take a very long time
- Frequently a threshold is used instead: stop when the change between iterations falls below it
Calculating LL(p)
LL = 0
for each sample x in the training data:
    let y be the true label of x
    prob = p(y|x)
    LL += 1/N * log(prob)
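A runnable version of the pseudo-code above, as a sketch (the function p and the data layout are assumptions):

```python
import math

def avg_log_likelihood(data, p):
    """data: list of (x, y) pairs with y the true label; p(y, x) returns p(y | x)."""
    N = len(data)
    return sum(math.log(p(y, x)) for x, y in data) / N
```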
Running Time
For each iteration the running time is O(NPA), where:
- N: number of training instances
- P: number of classes
- A: average number of active features for an instance (x, y)
L-BFGS
- Limited-memory version of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method
- Quasi-Newton method for unconstrained optimization
- Good for optimization problems with many variables
- "Algorithm of choice" for MaxEnt and related models
L-BFGS
References:
- Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage". Mathematics of Computation 35: 773–782.
- Liu, D. C.; Nocedal, J. (1989). "On the Limited Memory BFGS Method for Large Scale Optimization". Mathematical Programming B 45 (3): 503–528.
Implementations: Java, Matlab, Python via SciPy, R, etc.; see the Wikipedia page
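Since the slide mentions a SciPy implementation, here is a minimal sketch (an assumption about how one might wire it up, not code from the slides) that fits MaxEnt weights by minimizing the negative conditional log-likelihood with SciPy's L-BFGS, reusing the dense feats/gold layout from the GIS sketch; the gradient is left to SciPy's numerical approximation for brevity:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(lam, feats, gold):
    """Negative conditional log-likelihood of a MaxEnt model with weights lam."""
    scores = feats @ lam                                  # shape (N, Y)
    log_z = np.logaddexp.reduce(scores, axis=1)           # log partition per instance
    gold_scores = scores[np.arange(len(gold)), gold]
    return -(gold_scores - log_z).sum()

def fit_lbfgs(feats, gold):
    k = feats.shape[2]
    result = minimize(neg_log_likelihood, np.zeros(k),
                      args=(feats, gold), method='L-BFGS-B')
    return result.x
```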
Smoothing
(Based on Klein & Manning, 2003; F. Xia)
Smoothing
Problems of scale:
- Large numbers of features: some NLP MaxEnt problems have on the order of 1M features; storage can be a problem
- Sparseness problems: ease of overfitting
- Optimization problems: feature weights can grow toward infinity, and training can take a long time to converge
Smoothing
Consider the coin-flipping problem: three empirical distributions and the models fit to them (from K&M '03)
Need for Smoothing
Two problems (from K&M '03):
- Optimization: the optimal value of λ is ∞, and it is slow to optimize toward
- No smoothing: the learned distribution is just as spiky as the empirical one
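To see why the optimum sits at infinity, consider the degenerate coin case (my reading of the K&M example, assuming a single indicator feature that fires on heads): the conditional likelihood of all-heads data increases monotonically in λ, so the unregularized optimum drives λ to infinity and the model toward the same spike as the empirical distribution.

```latex
p_{\lambda}(\mathrm{H}) = \frac{e^{\lambda}}{e^{\lambda} + e^{0}},
\qquad
\log L(\lambda) = N_{\mathrm{H}} \log p_{\lambda}(\mathrm{H})
\quad\text{increases toward } 0 \text{ as } \lambda \to \infty
```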
Possible Solutions
- Early stopping
- Feature selection
- Regularization
Early Stopping
- Prior use of early stopping: decision tree heuristics
- Similarly here: stop training after a few iterations
  - The λs will have increased, but only by a bounded amount
  - Guarantees bounded, finite training time
Feature Selection
Approaches:
- Heuristic: drop features based on fixed thresholds, e.g., number of occurrences
- Wrapper methods: add feature selection to the training loop
Heuristic approaches are simple and reduce the feature count, but could harm performance
Regularization
- In statistics and machine learning, regularization is any method of preventing a model from overfitting the data.
- Typical examples in statistical machine learning include ridge regression, the lasso, and the L2 norm in support vector machines.
- In this case, we change the objective function: log P(Y, λ | X) = log P(λ) + log P(Y | X, λ)
(From K&M '03, F. Xia)
Prior
- Possible prior distributions: uniform, exponential
- Gaussian prior: P(λ_j) = (1 / sqrt(2πσ_j^2)) exp( -(λ_j - μ_j)^2 / (2σ_j^2) )
- log P(Y, λ | X) = log P(λ) + log P(Y | X, λ)
- Instead of maximizing P(Y | X, λ), maximize the posterior P(Y, λ | X)
- In practice, μ = 0 and 2σ^2 = 1
L1 and L2 Regularization
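As a sketch of the standard penalized objectives (the usual formulations, not taken from the slides), the L2 penalty corresponds to the Gaussian prior above with μ = 0, and the L1 penalty to a Laplacian prior:

```latex
\text{L2:}\quad \max_{\lambda}\; \sum_{i=1}^{N} \log p_{\lambda}(y_i \mid x_i) \;-\; \sum_{j} \frac{\lambda_j^{2}}{2\sigma^{2}}
\qquad
\text{L1:}\quad \max_{\lambda}\; \sum_{i=1}^{N} \log p_{\lambda}(y_i \mid x_i) \;-\; \alpha \sum_{j} \lvert \lambda_j \rvert
```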
Smoothing: POS Example
Advantages of Smoothing
- Smooths distributions
- Moves weight onto more informative features
- Enables effective use of larger numbers of features
- Can speed up convergence
Summary: Training
- Many training methods: Generalized Iterative Scaling (GIS), IIS, L-BFGS, ...
- Smoothing: early stopping, feature selection, regularization
- Regularization: change the objective function by adding a prior
- Common prior: Gaussian prior
- Maximizing the posterior is no longer equivalent to maximum entropy
MaxEnt POS Tagging
Notation (Ratnaparkhi, 1996)
- h: the history (word and tag context); corresponds to x
- t: the tag; corresponds to y
POS Tagging Model
P(t_1, ..., t_n | w_1, ..., w_n) ≈ Π_{i=1}^{n} p(t_i | h_i), where h_i = {w_i, w_{i-1}, w_{i-2}, w_{i+1}, w_{i+2}, t_{i-1}, t_{i-2}}
MaxEnt Feature Set
Example: Features for 'about'
Exclude features seen < 10 times
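A sketch of history-based feature extraction in the spirit of Ratnaparkhi's templates (the template strings and helper names here are illustrative assumptions; the real tagger also adds prefix/suffix, hyphen, number, and capitalization features for rare words):

```python
def extract_features(words, tags, i):
    """Features for position i. words: the sentence; tags: tags already assigned to w_1..w_{i-1}."""
    def w(k):  # word at offset k, padded at sentence boundaries
        return words[i + k] if 0 <= i + k < len(words) else ('BOS' if i + k < 0 else 'EOS')
    def t(k):  # previously assigned tag at offset k (k < 0)
        return tags[i + k] if 0 <= i + k < len(tags) else 'BOS'
    return [
        f'w0={w(0)}',
        f'w-1={w(-1)}', f'w-2={w(-2)}',
        f'w+1={w(+1)}', f'w+2={w(+2)}',
        f't-1={t(-1)}', f't-2t-1={t(-2)}+{t(-1)}',
    ]

# e.g. extract_features(['Time', 'flies', 'like', 'an', 'arrow'], ['N'], 1)
# yields features for 'flies' using the history definition above, with t-1=N.
```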
Training
- GIS training time: O(NTA)
  - N: training set size
  - T: number of tags
  - A: average number of features active for an event (h, t)
- About 24 hours on a '96 machine
Finding Features
- In training, where do features come from?
- Where do features come from in testing?
  - Tag features come from the classification of the prior word

              w-1      w0      w-1 w0        w+1     t-1    y
  x1 (Time)            Time                  flies   BOS    N
  x2 (flies)  Time     flies   Time flies    like    N      N
  x3 (like)   flies    like    flies like    an      N      V
Decoding
Goal: identify the highest-probability tag sequence
Issues:
- Features include tags from previous words, which are not immediately available
- The model uses tag history, so just knowing the highest-probability preceding tag is insufficient
Beam Search
Intuition:
- Breadth-first search explores all paths, but lots of paths are (pretty obviously) bad; why explore bad paths?
- Restrict attention to the (apparently) best paths
Approach: perform breadth-first search, but retain only the k 'best' paths so far (k: beam width)
Beam Search, k=3: worked example on 'time flies like an arrow' (figure)
Beam Search
- W = {w_1, w_2, ..., w_n}: test sentence
- s_{ij}: j-th highest probability sequence up to and including word w_i
- Generate tags for w_1, keep the top k, set s_{1j} accordingly
- for i = 2 to n:
  - Extension: add tags for w_i to each s_{(i-1)j}
  - Beam selection: sort sequences by probability and keep only the top k sequences
Beam Search (detail)
- W = {w_1, w_2, ..., w_n}: test sentence; s_{ij}: j-th highest probability sequence up to and including word w_i
- Generate tags for w_1, keep the top N, set s_{1j} accordingly
- for i = 2 to n:
  - For each s_{(i-1)j}, form the feature vector for w_i and keep the top N tags for w_i
  - Beam selection: sort sequences by probability and keep only the top sequences, using the pruning on the next slide
- Return the highest-probability sequence s_{n1}
Beam Search (pruning and storage)
- W = beam width
- For each node, store the tag for w_i and the probability of the sequence so far: prob_{i,j} = Π_{k=1}^{i} p(t_k | h_k)
- For each candidate j, s_{i,j}: keep the node if prob_{i,j} is in the top K and prob_{i,j} is sufficiently high, e.g. lg(prob_{i,j}) + W >= lg(max_prob)
Decoding
- Tag dictionary:
  - known word: returns the tags seen with that word in training
  - unknown word: returns all tags
- Beam width = 5
- Running time: O(NTAB), with N, T, A as before and B the beam width
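A minimal sketch of the beam-search decoder described above (illustrative, not Ratnaparkhi's code); p_tag and tag_dict are assumed helpers, and the slide's single beam width W is split here into a top-k cutoff plus a log-probability margin:

```python
import math

def beam_decode(words, p_tag, tag_dict, topk=5, W=5.0):
    """p_tag(words, tags, i) -> {tag: p(t_i | h_i)}; tag_dict(word) -> candidate tags."""
    beam = [([], 0.0)]                          # (tag sequence, log probability)
    for i, word in enumerate(words):
        candidates = []
        for tags, logp in beam:
            probs = p_tag(words, tags, i)       # model distribution for this history
            for tag in tag_dict(word):
                candidates.append((tags + [tag], logp + math.log(probs[tag])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        best = candidates[0][1]
        # beam selection: top-k sequences whose log prob is within W of the best
        beam = [c for c in candidates[:topk] if c[1] + W >= best]
    return beam[0][0]                           # highest-probability sequence s_n1
```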
POS Tagging
- Overall accuracy: 96.3+%; unseen word accuracy: 86.2%
- Comparable to HMM or TBL tagging accuracy
- Provides a probabilistic framework that is better able to model different information sources
- Topline accuracy is 96-97%, limited by annotation consistency issues
Beam Search
- Beam search decoding: a variant of breadth-first search that keeps only the top k sequences at each layer
- Advantages:
  - Efficient in practice: a beam of 3-5 is near optimal
  - Empirically, the beam covers 5-10% of the search space, pruning 90-95%
  - Simple to implement: just extension + sorting, no dynamic programming
  - Running time: O(NTAB), as above
- Disadvantage: not guaranteed to be optimal (or complete)
MaxEnt POS Tagging
- Part-of-speech tagging by classification:
  - Feature design: word and tag context features; orthographic features for rare words
- Sequence classification problems: tag features depend on prior classifications
- Beam search decoding: efficient but inexact; near optimal in practice