
1 MaxEnt: Training, Smoothing, Tagging. Advanced Statistical Methods in NLP, Ling572, February 7, 2012

2 Roadmap: MaxEnt training; smoothing; case study: POS tagging (redux); beam search

3 Training

4 Training: Learn the λs from training data.

5 Training: Learn the λs from training data. Challenge: usually can't solve analytically; employ numerical methods.

6 Training: Learn the λs from training data. Challenge: usually can't solve analytically; employ numerical methods. Main techniques: Generalized Iterative Scaling (GIS; Darroch & Ratcliff, '72), Improved Iterative Scaling (IIS; Della Pietra et al., '95), L-BFGS, ...

7 Generalized Iterative Scaling (GIS) Setup: GIS requires the constraint Σ_j f_j(x,y) = C for all (x,y), where C is a constant.

8 Generalized Iterative Scaling (GIS) Setup: GIS requires the constraint Σ_j f_j(x,y) = C for all (x,y), where C is a constant. If it does not hold, set C = max_{x,y} Σ_j f_j(x,y).

9 Generalized Iterative Scaling (GIS) Setup: GIS requires the constraint Σ_j f_j(x,y) = C for all (x,y), where C is a constant. If it does not hold, set C = max_{x,y} Σ_j f_j(x,y).

10 Generalized Iterative Scaling (GIS) Setup: GIS requires the constraint Σ_j f_j(x,y) = C for all (x,y), where C is a constant. If it does not hold, set C = max_{x,y} Σ_j f_j(x,y) and add a correction feature function f_{k+1}(x,y) = C − Σ_{j=1..k} f_j(x,y).

11 Generalized Iterative Scaling (GIS) Setup: GIS requires the constraint Σ_j f_j(x,y) = C for all (x,y), where C is a constant. If it does not hold, set C = max_{x,y} Σ_j f_j(x,y) and add a correction feature function f_{k+1}(x,y) = C − Σ_{j=1..k} f_j(x,y). GIS also requires at least one active feature for every event; default feature functions solve this problem.

12 GIS Iteration: Compute the empirical expectation of each feature.

13 GIS Iteration: Compute the empirical expectation of each feature. Initialization: λ_j^(0); set to 0 or some other value.

14 GIS Iteration: Compute the empirical expectation of each feature. Initialization: λ_j^(0); set to 0 or some other value. Iterate until convergence; for each j:

15 GIS Iteration: Compute the empirical expectation of each feature. Initialization: λ_j^(0); set to 0 or some other value. Iterate until convergence; for each j: compute p(y|x) under the current model.

16 GIS Iteration: Compute the empirical expectation of each feature. Initialization: λ_j^(0); set to 0 or some other value. Iterate until convergence; for each j: compute p(y|x) under the current model; compute the model expectation under the current model.

17 GIS Iteration: Compute the empirical expectation of each feature. Initialization: λ_j^(0); set to 0 or some other value. Iterate until convergence; for each j: compute p(y|x) under the current model; compute the model expectation under the current model; update the model parameter by the weighted ratio of the empirical and model expectations.

18 GIS Iteration: Compute the empirical expectation of each feature, (1/N) Σ_i f_j(x_i, y_i).

19 GIS Iteration: Compute the empirical expectation of each feature, (1/N) Σ_i f_j(x_i, y_i). Initialization: λ_j^(0); set to 0 or some other value.

20 GIS Iteration: Compute the empirical expectation of each feature, (1/N) Σ_i f_j(x_i, y_i). Initialization: λ_j^(0); set to 0 or some other value. Iterate until convergence: compute the current model distribution.

21 GIS Iteration: Compute the empirical expectation of each feature, (1/N) Σ_i f_j(x_i, y_i). Initialization: λ_j^(0); set to 0 or some other value. Iterate until convergence: compute p^(n)(y|x) = exp(Σ_j λ_j^(n) f_j(x,y)) / Z(x).

22 GIS Iteration: Compute the empirical expectation of each feature, (1/N) Σ_i f_j(x_i, y_i). Initialization: λ_j^(0); set to 0 or some other value. Iterate until convergence: compute p^(n)(y|x) = exp(Σ_j λ_j^(n) f_j(x,y)) / Z(x); compute the model expectation of each feature, (1/N) Σ_i Σ_y p^(n)(y|x_i) f_j(x_i, y).

23 GIS Iteration: Compute the empirical expectation of each feature, (1/N) Σ_i f_j(x_i, y_i). Initialization: λ_j^(0); set to 0 or some other value. Iterate until convergence: compute p^(n)(y|x) = exp(Σ_j λ_j^(n) f_j(x,y)) / Z(x); compute the model expectation of each feature, (1/N) Σ_i Σ_y p^(n)(y|x_i) f_j(x_i, y); update λ_j^(n+1) = λ_j^(n) + (1/C) log(empirical expectation of f_j / model expectation of f_j) for each j.
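
To make the update concrete, here is a minimal GIS training sketch in Python. It is not the course's reference implementation; `data`, `feature_funcs`, and `labels` are assumed inputs, the correction feature is only noted in a comment, and convergence is checked with a simple log-likelihood threshold.

```python
import math

def train_gis(data, feature_funcs, labels, iters=100, tol=1e-4):
    """Minimal GIS sketch. data: list of (x, y) events; feature_funcs: list of
    functions f_j(x, y) returning counts; labels: the set of classes."""
    k, N = len(feature_funcs), len(data)

    # GIS constant C: max total feature count over events; a correction feature
    # f_{k+1}(x, y) = C - sum_j f_j(x, y) would make every sum exactly C (omitted here).
    C = max(sum(f(x, y) for f in feature_funcs) for x, _ in data for y in labels)

    # Empirical expectations: (1/N) * sum_i f_j(x_i, y_i)
    empirical = [sum(f(x, y) for x, y in data) / N for f in feature_funcs]

    lam, prev_ll = [0.0] * k, float("-inf")
    for _ in range(iters):
        model_exp, ll = [0.0] * k, 0.0
        for x, gold in data:
            # p(y|x) under the current model
            scores = {y: math.exp(sum(lam[j] * feature_funcs[j](x, y) for j in range(k)))
                      for y in labels}
            Z = sum(scores.values())
            for y in labels:
                p = scores[y] / Z
                for j in range(k):
                    model_exp[j] += p * feature_funcs[j](x, y) / N
            ll += math.log(scores[gold] / Z) / N
        # GIS update: lambda_j += (1/C) * log(empirical / model expectation)
        for j in range(k):
            if empirical[j] > 0 and model_exp[j] > 0:
                lam[j] += math.log(empirical[j] / model_exp[j]) / C
        if abs(ll - prev_ll) < tol:      # threshold-based convergence check
            break
        prev_ll = ll
    return lam
```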

24 Convergence: These methods have convergence guarantees.

25 Convergence: These methods have convergence guarantees. However, full convergence may take a very long time.

26 Convergence: These methods have convergence guarantees. However, full convergence may take a very long time. In practice, a threshold is frequently used, e.g. stop when the change in log-likelihood between iterations falls below a small value.

27 Convergence: These methods have convergence guarantees. However, full convergence may take a very long time. In practice, a threshold is frequently used, e.g. stop when the change in log-likelihood between iterations falls below a small value.

28 Calculating LL(p): LL = 0; for each sample x in the training data: let y be the true label of x; prob = p(y|x); LL += (1/N) * log(prob).
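
A direct translation of the pseudocode above, assuming a hypothetical `model.prob(y, x)` interface that returns p(y|x):

```python
import math

def avg_log_likelihood(model, data):
    """Average log-likelihood LL(p) over the N training samples in `data`."""
    N = len(data)
    ll = 0.0
    for x, y in data:                          # y is the true label of x
        ll += math.log(model.prob(y, x)) / N   # model.prob is an assumed interface
    return ll
```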

29 Running Time: For each iteration, the running time is:

30 Running Time: For each iteration, the running time is O(NPA), where N is the number of training instances, P is the number of classes, and A is the average number of active features per instance (x,y).

31 L-BFGS: Limited-memory version of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method.

32 L-BFGS: Limited-memory version of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method. A quasi-Newton method for unconstrained optimization.

33 L-BFGS: Limited-memory version of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method. A quasi-Newton method for unconstrained optimization. Good for optimization problems with many variables.

34 L-BFGS: Limited-memory version of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method. A quasi-Newton method for unconstrained optimization. Good for optimization problems with many variables. The "algorithm of choice" for MaxEnt and related models.

35 L-BFGS References: Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage". Mathematics of Computation 35: 773–782. Liu, D. C. and Nocedal, J. (1989). "On the Limited Memory BFGS Method for Large Scale Optimization". Mathematical Programming B 45 (3): 503–528.

36 L-BFGS References: Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage". Mathematics of Computation 35: 773–782. Liu, D. C. and Nocedal, J. (1989). "On the Limited Memory BFGS Method for Large Scale Optimization". Mathematical Programming B 45 (3): 503–528. Implementations: Java, Matlab, Python via SciPy, R, etc.; see the Wikipedia page.
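
As a sketch of using an off-the-shelf L-BFGS implementation, the snippet below minimizes a MaxEnt negative log-likelihood with SciPy's `scipy.optimize.minimize` (method "L-BFGS-B"). The feature tensor `F`, the label array `gold`, and their shapes are toy assumptions, not data from the lecture.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (assumed shapes): F[i, y, j] = f_j(x_i, y); gold[i] = index of true label.
rng = np.random.default_rng(0)
F = rng.random((50, 3, 10))            # 50 instances, 3 classes, 10 features
gold = rng.integers(0, 3, size=50)

def neg_log_likelihood(lam):
    scores = F @ lam                              # (N, Y) unnormalized log scores
    log_Z = np.logaddexp.reduce(scores, axis=1)   # log partition function per instance
    return -(scores[np.arange(len(gold)), gold] - log_Z).sum()

def gradient(lam):
    scores = F @ lam
    p = np.exp(scores - np.logaddexp.reduce(scores, axis=1, keepdims=True))
    model_exp = np.einsum('iy,iyj->j', p, F)      # model expected feature counts
    emp_exp = F[np.arange(len(gold)), gold].sum(axis=0)
    return model_exp - emp_exp                    # gradient of the negative LL

result = minimize(neg_log_likelihood, np.zeros(10), jac=gradient, method="L-BFGS-B")
print(result.x)                                   # fitted feature weights
```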

37 Smoothing (based on Klein & Manning, 2003; F. Xia)

38 Smoothing: Problems of scale:

39 Smoothing: Problems of scale: Large numbers of features; some NLP problems in MaxEnt have on the order of 1M features, so storage can be a problem.

40 Smoothing: Problems of scale: Large numbers of features; some NLP problems in MaxEnt have on the order of 1M features, so storage can be a problem. Sparseness problems: ease of overfitting.

41 Smoothing: Problems of scale: Large numbers of features; some NLP problems in MaxEnt have on the order of 1M features, so storage can be a problem. Sparseness problems: ease of overfitting. Optimization problems: feature weights can grow without bound and take a long time to converge.

42 Smoothing: Consider the coin-flipping problem: three empirical distributions and the corresponding models. (Figure from K&M '03.)

43 Need for Smoothing: Two problems. (From K&M '03)

44 Need for Smoothing: Two problems. Optimization: the optimal value of λ is ∞, and it is slow to optimize. (From K&M '03)

45 Need for Smoothing: Two problems. Optimization: the optimal value of λ is ∞, and it is slow to optimize. With no smoothing, the learned distribution is just as spiky as the empirical one (K&M '03).

46 Possible Solutions

47 Possible Solutions: Early stopping, feature selection, regularization.

48 Early Stopping: Prior use of early stopping:

49 Early Stopping: Prior use of early stopping: decision tree heuristics.

50 Early Stopping: Prior use of early stopping: decision tree heuristics. Similarly here: stop training after a few iterations; the λs will have increased, but only boundedly. This guarantees bounded, finite training time.
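
One way to realize early stopping is sketched below; it swaps the slide's "stop after a few iterations" for the common variant that monitors held-out performance. The `model.one_iteration` and `model.accuracy` methods are assumed, not a real API.

```python
def train_with_early_stopping(model, train, dev, max_iters=100, patience=3):
    """Early-stopping sketch: run at most max_iters iterations and stop once
    held-out accuracy has not improved for `patience` iterations."""
    best, since_best = float("-inf"), 0
    for _ in range(max_iters):
        model.one_iteration(train)   # e.g. one GIS update of all lambdas (assumed API)
        score = model.accuracy(dev)  # assumed API
        if score > best:
            best, since_best = score, 0
        else:
            since_best += 1
            if since_best >= patience:
                break                # bounded, finite training time
    return model
```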

51 Feature Selection Approaches:

52 Feature Selection Approaches: Heuristic: drop features based on fixed thresholds, e.g. number of occurrences.

53 Feature Selection Approaches: Heuristic: drop features based on fixed thresholds, e.g. number of occurrences. Wrapper methods: add feature selection to the training loop.

54 Feature Selection Approaches: Heuristic: drop features based on fixed thresholds, e.g. number of occurrences. Wrapper methods: add feature selection to the training loop. Heuristic approaches are simple and reduce the feature set, but can harm performance.
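
A minimal sketch of the heuristic count-cutoff approach; `events` is assumed to be a list of (feature list, label) pairs, and the default threshold of 10 mirrors the cutoff used in the POS-tagging case study later.

```python
from collections import Counter

def select_features(events, min_count=10):
    """Count-cutoff feature selection: keep features occurring at least min_count
    times in the training events (each event is a (feature list, label) pair)."""
    counts = Counter(feat for feats, _label in events for feat in feats)
    return {feat for feat, c in counts.items() if c >= min_count}
```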

55 Regularization: In statistics and machine learning, regularization is any method of preventing overfitting of data by a model. (From K&M '03, F. Xia)

56 Regularization: In statistics and machine learning, regularization is any method of preventing overfitting of data by a model. Typical examples of regularization in statistical machine learning include ridge regression, the lasso, and the L2 norm in support vector machines. (From K&M '03, F. Xia)

57 Regularization: In statistics and machine learning, regularization is any method of preventing overfitting of data by a model. Typical examples of regularization in statistical machine learning include ridge regression, the lasso, and the L2 norm in support vector machines. In this case, we change the objective function: log P(Y,λ|X) = log P(λ) + log P(Y|X,λ). (From K&M '03, F. Xia)

58 Prior: Possible prior distributions: uniform, exponential.

59 Prior: Possible prior distributions: uniform, exponential. Gaussian prior: P(λ_j) = (1 / sqrt(2πσ_j²)) exp(−(λ_j − μ_j)² / (2σ_j²)).

60 Prior: Possible prior distributions: uniform, exponential. Gaussian prior: P(λ_j) = (1 / sqrt(2πσ_j²)) exp(−(λ_j − μ_j)² / (2σ_j²)). Objective: log P(Y,λ|X) = log P(λ) + log P(Y|X,λ).

61 Without a prior, we maximize P(Y|X,λ); with a Gaussian prior, we maximize P(Y,λ|X), i.e. the log-likelihood minus Σ_j (λ_j − μ_j)² / (2σ_j²) plus a constant. In practice, μ = 0 and 2σ² = 1.
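
A sketch of how the Gaussian prior changes the objective in code: wrap an unregularized negative log-likelihood and its gradient (assumed callables, e.g. the ones from the earlier L-BFGS sketch) with an L2 penalty. With μ = 0 and 2σ² = 1, the penalty is simply the sum of squared λs.

```python
import numpy as np
from scipy.optimize import minimize

def penalized_objective(lam, nll, grad, sigma2=0.5):
    """Gaussian-prior (L2) regularization: add sum((lam - mu)^2 / (2 sigma^2))
    to the negative log-likelihood; here mu = 0 and 2*sigma^2 = 1 (sigma2 = 0.5)."""
    value = nll(lam) + np.sum(lam ** 2) / (2.0 * sigma2)
    gradient = grad(lam) + lam / sigma2
    return value, gradient

# Usage with L-BFGS: minimize() accepts a function returning (value, gradient)
# when jac=True, e.g.
# result = minimize(penalized_objective, np.zeros(10),
#                   args=(neg_log_likelihood, gradient), jac=True, method="L-BFGS-B")
```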

62 L1 and L2 Regularization

63 Smoothing: POS Example

64 Advantages of Smoothing: Smooths distributions.

65 Advantages of Smoothing: Smooths distributions. Moves weight onto more informative features.

66 Advantages of Smoothing: Smooths distributions. Moves weight onto more informative features. Enables effective use of larger numbers of features.

67 Advantages of Smoothing: Smooths distributions. Moves weight onto more informative features. Enables effective use of larger numbers of features. Can speed up convergence.

68 Summary: Training. There are many training methods, e.g. Generalized Iterative Scaling (GIS). Smoothing: early stopping, feature selection, regularization. Regularization changes the objective function by adding a prior; a common choice is a Gaussian prior. Note that maximizing the posterior is no longer equivalent to maximum entropy.

69 MaxEnt POS Tagging

70 Notation (Ratnaparkhi, 1996): h = history (the x in our earlier notation), i.e. the word and tag history; t = tag (the y).

71 POS Tagging Model: P(t_1, …, t_n | w_1, …, w_n), where the history for word i is h_i = {w_i, w_{i-1}, w_{i-2}, w_{i+1}, w_{i+2}, t_{i-1}, t_{i-2}}.

72 MaxEnt Feature Set

73 Example Features for 'about': exclude features seen < 10 times.

74 Training: GIS training time is O(NTA), where N is the training set size, T the number of tags, and A the average number of features active for an event (h,t). About 24 hours on a 1996 machine.

75 Finding Features: In training, where do features come from? Where do features come from in testing?
   instance   | w-1   | w0    | w-1 w0     | w+1   | t-1 | y
   x1 (Time)  |       | Time  |            | flies | BOS | N
   x2 (flies) | Time  | flies | Time flies | like  | N   | N
   x3 (like)  | flies | like  | flies like | an    | N   | V

76 Finding Features (continued, same table as above): in testing, the tag features (t-1, etc.) come from the classifier's own predictions for the prior words; the word features come directly from the sentence.
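
A sketch of extracting Ratnaparkhi-style history features for one position; the template names and the BOS/EOS padding are illustrative, not the exact feature strings from the paper.

```python
def history_features(words, tags, i):
    """History features for position i. `tags` holds the tags already assigned to
    words[0..i-1] (gold tags in training, the decoder's own hypotheses in testing)."""
    def at(seq, j):
        return seq[j] if 0 <= j < len(seq) else ("BOS" if j < 0 else "EOS")
    return [
        "w0=" + words[i],
        "w-1=" + at(words, i - 1),
        "w-2=" + at(words, i - 2),
        "w+1=" + at(words, i + 1),
        "w+2=" + at(words, i + 2),
        "t-1=" + at(tags, i - 1),
        "t-2t-1=" + at(tags, i - 2) + "_" + at(tags, i - 1),
    ]
```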

77 Decoding Goal: Identify the highest-probability tag sequence.

78 Decoding Goal: Identify the highest-probability tag sequence. Issues: features include tags from previous words, which are not immediately available.

79 Decoding Goal: Identify the highest-probability tag sequence. Issues: features include tags from previous words, which are not immediately available. The model uses tag history, so just knowing the highest-probability preceding tag is insufficient.

80 Beam Search Intuition: Breadth-first search explores all paths, but lots of paths are (pretty obviously) bad, so why explore bad paths? Restrict attention to the (apparently) best paths. Approach: perform breadth-first search, but...

81 Beam Search Intuition: Breadth-first search explores all paths, but lots of paths are (pretty obviously) bad, so why explore bad paths? Restrict attention to the (apparently) best paths. Approach: perform breadth-first search, but retain only the k 'best' paths so far, where k is the beam width.

82-86 Beam Search, k=3: worked example tagging the sentence "time flies like an arrow" (beam diagrams not preserved in the transcript).

87 Beam Search: W = {w_1, w_2, …, w_n} is the test sentence.

88 Beam Search: W = {w_1, w_2, …, w_n} is the test sentence; s_ij is the j-th highest-probability tag sequence up to and including word w_i.

89 Beam Search: W = {w_1, w_2, …, w_n} is the test sentence; s_ij is the j-th highest-probability tag sequence up to and including word w_i. Generate tags for w_1, keep the top k, and set s_1j accordingly.

90 Beam Search: W = {w_1, w_2, …, w_n} is the test sentence; s_ij is the j-th highest-probability tag sequence up to and including word w_i. Generate tags for w_1, keep the top k, and set s_1j accordingly. For i = 2 to n:

91 Beam Search: W = {w_1, w_2, …, w_n} is the test sentence; s_ij is the j-th highest-probability tag sequence up to and including word w_i. Generate tags for w_1, keep the top k, and set s_1j accordingly. For i = 2 to n: Extension: add tags for w_i to each s_(i-1)j.

92 Beam Search: W = {w_1, w_2, …, w_n} is the test sentence; s_ij is the j-th highest-probability tag sequence up to and including word w_i. Generate tags for w_1, keep the top k, and set s_1j accordingly. For i = 2 to n: Extension: add tags for w_i to each s_(i-1)j. Beam selection: sort the sequences by probability and keep only the top k sequences.

93 Beam Search: W = {w_1, w_2, …, w_n} is the test sentence; s_ij is the j-th highest-probability tag sequence up to and including word w_i. Generate tags for w_1, keep the top N, and set s_1j accordingly. For i = 2 to n: for each s_(i-1)j, form the feature vector for w_i and keep the top N tags for w_i. Beam selection: sort the sequences by probability and keep only the top sequences, using the pruning on the next slide. Return the highest-probability sequence s_n1.

94 Beam Search Pruning and Storage: W is the beam width. For each node, store the tag for w_i and the probability of the sequence so far, prob_ij. For each candidate j, keep node s_ij if prob_ij is in the top K and prob_ij is sufficiently high, e.g. lg(prob_ij) + W >= lg(max_prob).
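
Putting the pieces together, a compact beam-search decoder sketch; `tags_for` (the tag dictionary) and `score` (the MaxEnt model's p(t | history)) are assumed helpers, and the extra width-based pruning from the slide above is only indicated in a comment.

```python
import math

def beam_search(words, tags_for, score, k=5):
    """Beam-search decoder sketch. tags_for(word) returns candidate tags; score(words,
    tags, i, t) returns p(t | history at i). Keeps at most k partial sequences per word."""
    beam = [([], 0.0)]                        # (tag sequence so far, log probability)
    for i, w in enumerate(words):
        extended = []
        for seq, logp in beam:                # extension step
            for t in tags_for(w):
                extended.append((seq + [t], logp + math.log(score(words, seq, i, t))))
        extended.sort(key=lambda item: item[1], reverse=True)
        beam = extended[:k]                   # beam selection: keep the top k
        # Optional extra pruning: also drop items with logp < best_logp - W.
    return beam[0][0]                         # highest-probability tag sequence
```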

95 Decoding: Tag dictionary: a known word returns the tags seen with that word in training; an unknown word returns all tags. Beam width = 5. Running time: O(NTAB), with N, T, and A as before and B the beam width.
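
A sketch of the tag dictionary described above, returning a `tags_for` function that can be passed to the beam-search sketch; the input format for `training_sentences` is an assumption.

```python
from collections import defaultdict

def build_tag_dictionary(training_sentences, all_tags):
    """Tag dictionary: a known word returns the tags it was seen with in training;
    an unknown word returns all tags. training_sentences is assumed to be a list
    of sentences, each a list of (word, tag) pairs."""
    seen = defaultdict(set)
    for sentence in training_sentences:
        for word, tag in sentence:
            seen[word].add(tag)

    def tags_for(word):
        return seen[word] if word in seen else set(all_tags)

    return tags_for
```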

96 POS Tagging: Overall accuracy: 96.3+%. Unseen-word accuracy: 86.2%. Comparable to HMM or TBL tagging accuracy. Provides a probabilistic framework and is better able to model different information sources. The topline accuracy is 96-97% because of consistency issues in the annotation.

97 Beam Search: Beam search decoding is a variant of breadth-first search that at each layer keeps only the top k sequences. Advantages:

98 Beam Search: Beam search decoding is a variant of breadth-first search that at each layer keeps only the top k sequences. Advantages: Efficient in practice: a beam of 3-5 is near optimal; empirically, the beam explores only 5-10% of the search space, i.e. it prunes 90-95%.

99 Beam Search: Beam search decoding is a variant of breadth-first search that at each layer keeps only the top k sequences. Advantages: Efficient in practice: a beam of 3-5 is near optimal; empirically, the beam explores only 5-10% of the search space, i.e. it prunes 90-95%. Simple to implement: just extensions plus sorting, no dynamic programming.

100 Beam Search: Beam search decoding is a variant of breadth-first search that at each layer keeps only the top k sequences. Advantages: Efficient in practice: a beam of 3-5 is near optimal; empirically, the beam explores only 5-10% of the search space, i.e. it prunes 90-95%. Simple to implement: just extensions plus sorting, no dynamic programming. Running time:

101 Beam Search: Beam search decoding is a variant of breadth-first search that at each layer keeps only the top sequences. Advantages: Efficient in practice: a beam of 3-5 is near optimal; empirically, the beam explores only 5-10% of the search space, i.e. it prunes 90-95%. Simple to implement: just extensions plus sorting, no dynamic programming. Disadvantage: not guaranteed to be optimal (or complete).

102 MaxEnt POS Tagging: Part-of-speech tagging by classification. Feature design: word and tag context features; orthographic features for rare words.

103 MaxEnt POS Tagging: Part-of-speech tagging by classification. Feature design: word and tag context features; orthographic features for rare words. Sequence classification problems: tag features depend on prior classifications.

104 MaxEnt POS Tagging: Part-of-speech tagging by classification. Feature design: word and tag context features; orthographic features for rare words. Sequence classification problems: tag features depend on prior classifications. Beam search decoding: efficient but inexact; near optimal in practice.

