MaxEnt: Training, Smoothing, Tagging
Advanced Statistical Methods in NLP, Ling572
February 7
Roadmap
- MaxEnt: training
- Smoothing
- Case study: POS tagging (redux)
- Beam search
Training
- Learn the λs from training data
- Challenge: usually can't be solved analytically, so employ numerical methods
- Main techniques:
  - Generalized Iterative Scaling (GIS; Darroch & Ratcliff, 1972)
  - Improved Iterative Scaling (IIS; Della Pietra et al., 1995)
  - L-BFGS, ...
Generalized Iterative Scaling (GIS)
Setup:
- GIS requires the constraint Σ_{j=1}^{k} f_j(x, y) = C for every (x, y), where C is a constant
- If this does not hold, set C = max_{x,y} Σ_{j=1}^{k} f_j(x, y) and add a correction feature function f_{k+1}(x, y) = C - Σ_{j=1}^{k} f_j(x, y)
- GIS also requires at least one active feature for every event; default feature functions solve this problem
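As a concrete illustration, a minimal sketch (not from the slides) of computing C and the correction feature; the sparse representation of an event's features as (feature_index, value) pairs is an assumption for this example.

```python
# Minimal sketch: GIS constant C and the correction feature value for one event.
# Assumes each event's active features are a list of (feature_index, value) pairs.
def gis_constant(events):
    """C = max over all events (x, y) of the summed feature values."""
    return max(sum(v for _, v in feats) for feats in events)

def correction_feature(feats, C):
    """f_{k+1}(x, y) = C - sum_j f_j(x, y), so every event's features sum to C."""
    return C - sum(v for _, v in feats)
```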
GIS Iteration
- Compute the empirical expectation of each feature
- Initialization: λ_j^{(0)}; set to 0 or some other value
- Iterate until convergence, for each j:
  - Compute p(y|x) under the current model
  - Compute the model expectation of each feature under the current model
  - Update the model parameters by a weighted ratio of the empirical and model expectations
GIS Iteration (with formulas)
- Compute the empirical expectation: E_{~p}[f_j] = (1/N) Σ_{i=1}^{N} f_j(x_i, y_i)
- Initialization: λ_j^{(0)}; set to 0 or some other value
- Iterate until convergence:
  - Compute p^{(n)}(y|x) = exp(Σ_j λ_j^{(n)} f_j(x, y)) / Z(x), where Z(x) = Σ_y exp(Σ_j λ_j^{(n)} f_j(x, y))
  - Compute the model expectation: E_{p^{(n)}}[f_j] = (1/N) Σ_{i=1}^{N} Σ_y p^{(n)}(y|x_i) f_j(x_i, y)
  - Update: λ_j^{(n+1)} = λ_j^{(n)} + (1/C) log( E_{~p}[f_j] / E_{p^{(n)}}[f_j] )
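A minimal sketch of the loop above (not the original implementation), assuming a small dense representation where feats[i, y, j] holds the value of feature j for instance i with candidate label y, gold[i] is the index of the true label, and every feature has nonzero empirical expectation:

```python
import numpy as np

def gis_train(feats, gold, C, num_iters=100):
    """One GIS run: feats has shape (N, Y, k); gold has shape (N,)."""
    N, Y, k = feats.shape
    lam = np.zeros(k)                                     # lambda_j^(0) = 0
    emp = feats[np.arange(N), gold].sum(axis=0) / N       # empirical expectation
    for _ in range(num_iters):
        scores = feats @ lam                              # sum_j lambda_j f_j, shape (N, Y)
        probs = np.exp(scores - scores.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)         # p^(n)(y | x_i)
        model = np.einsum('ny,nyk->k', probs, feats) / N  # model expectation
        lam += np.log(emp / model) / C                    # GIS update
    return lam
```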
Convergence
- These methods have convergence guarantees
- However, full convergence may take a very long time
- Frequently a threshold is used instead: stop when the change between iterations falls below it
Calculating LL(p)
LL = 0
for each sample x in the training data:
    let y be the true label of x
    prob = p(y|x)
    LL += 1/N * log(prob)
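A runnable version of the pseudo-code above, as a sketch (the function p and the data layout are assumptions):

```python
import math

def avg_log_likelihood(data, p):
    """data: list of (x, y) pairs with y the true label; p(y, x) returns p(y | x)."""
    N = len(data)
    return sum(math.log(p(y, x)) for x, y in data) / N
```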
Running Time
For each iteration the running time is O(NPA), where:
- N: number of training instances
- P: number of classes
- A: average number of active features for an instance (x, y)
L-BFGS
- Limited-memory version of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method
- Quasi-Newton method for unconstrained optimization
- Good for optimization problems with many variables
- "Algorithm of choice" for MaxEnt and related models
L-BFGS
References:
- Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage". Mathematics of Computation 35: 773–782.
- Liu, D. C.; Nocedal, J. (1989). "On the Limited Memory BFGS Method for Large Scale Optimization". Mathematical Programming B 45 (3): 503–528.
Implementations: Java, Matlab, Python via SciPy, R, etc.; see the Wikipedia page
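Since the slide mentions a SciPy implementation, here is a minimal sketch (an assumption about how one might wire it up, not code from the slides) that fits MaxEnt weights by minimizing the negative conditional log-likelihood with SciPy's L-BFGS, reusing the dense feats/gold layout from the GIS sketch; the gradient is left to SciPy's numerical approximation for brevity:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(lam, feats, gold):
    """Negative conditional log-likelihood of a MaxEnt model with weights lam."""
    scores = feats @ lam                                  # shape (N, Y)
    log_z = np.logaddexp.reduce(scores, axis=1)           # log partition per instance
    gold_scores = scores[np.arange(len(gold)), gold]
    return -(gold_scores - log_z).sum()

def fit_lbfgs(feats, gold):
    k = feats.shape[2]
    result = minimize(neg_log_likelihood, np.zeros(k),
                      args=(feats, gold), method='L-BFGS-B')
    return result.x
```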
Smoothing
(Based on Klein & Manning, 2003; F. Xia)
Smoothing
Problems of scale:
- Large numbers of features: some NLP MaxEnt problems have on the order of 1M features; storage can be a problem
- Sparseness problems: ease of overfitting
- Optimization problems: feature weights can grow toward infinity, and training can take a long time to converge
Smoothing
Consider the coin-flipping problem: three empirical distributions and the models fit to them (from K&M '03)
Need for Smoothing
Two problems (from K&M '03):
- Optimization: the optimal value of λ is ∞, and it is slow to optimize toward
- No smoothing: the learned distribution is just as spiky as the empirical one
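To see why the optimum sits at infinity, consider the degenerate coin case (my reading of the K&M example, assuming a single indicator feature that fires on heads): the conditional likelihood of all-heads data increases monotonically in λ, so the unregularized optimum drives λ to infinity and the model toward the same spike as the empirical distribution.

```latex
p_{\lambda}(\mathrm{H}) = \frac{e^{\lambda}}{e^{\lambda} + e^{0}},
\qquad
\log L(\lambda) = N_{\mathrm{H}} \log p_{\lambda}(\mathrm{H})
\quad\text{increases toward } 0 \text{ as } \lambda \to \infty
```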
Possible Solutions
- Early stopping
- Feature selection
- Regularization
Early Stopping
- Prior use of early stopping: decision tree heuristics
- Similarly here: stop training after a few iterations
  - The λs will have increased, but only by a bounded amount
  - Guarantees bounded, finite training time
Feature Selection
Approaches:
- Heuristic: drop features based on fixed thresholds, e.g., number of occurrences
- Wrapper methods: add feature selection to the training loop
Heuristic approaches are simple and reduce the feature count, but could harm performance
Regularization
- In statistics and machine learning, regularization is any method of preventing a model from overfitting the data.
- Typical examples in statistical machine learning include ridge regression, the lasso, and the L2 norm in support vector machines.
- In this case, we change the objective function: log P(Y, λ | X) = log P(λ) + log P(Y | X, λ)
(From K&M '03, F. Xia)
Prior
- Possible prior distributions: uniform, exponential
- Gaussian prior: P(λ_j) = (1 / sqrt(2πσ_j^2)) exp( -(λ_j - μ_j)^2 / (2σ_j^2) )
- log P(Y, λ | X) = log P(λ) + log P(Y | X, λ)
- Instead of maximizing P(Y | X, λ), maximize the posterior P(Y, λ | X)
- In practice, μ = 0 and 2σ^2 = 1
L1 and L2 Regularization
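As a sketch of the standard penalized objectives (the usual formulations, not taken from the slides), the L2 penalty corresponds to the Gaussian prior above with μ = 0, and the L1 penalty to a Laplacian prior:

```latex
\text{L2:}\quad \max_{\lambda}\; \sum_{i=1}^{N} \log p_{\lambda}(y_i \mid x_i) \;-\; \sum_{j} \frac{\lambda_j^{2}}{2\sigma^{2}}
\qquad
\text{L1:}\quad \max_{\lambda}\; \sum_{i=1}^{N} \log p_{\lambda}(y_i \mid x_i) \;-\; \alpha \sum_{j} \lvert \lambda_j \rvert
```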
Smoothing: POS Example
Advantages of Smoothing
- Smooths distributions
- Moves weight onto more informative features
- Enables effective use of larger numbers of features
- Can speed up convergence
Summary: Training
- Many training methods: Generalized Iterative Scaling (GIS), IIS, L-BFGS, ...
- Smoothing: early stopping, feature selection, regularization
- Regularization: change the objective function by adding a prior
- Common prior: Gaussian prior
- Maximizing the posterior is no longer equivalent to maximum entropy
MaxEnt POS Tagging
Notation (Ratnaparkhi, 1996)
- h: the history (word and tag context); corresponds to x
- t: the tag; corresponds to y
POS Tagging Model
P(t_1, ..., t_n | w_1, ..., w_n) ≈ Π_{i=1}^{n} p(t_i | h_i), where h_i = {w_i, w_{i-1}, w_{i-2}, w_{i+1}, w_{i+2}, t_{i-1}, t_{i-2}}
MaxEnt Feature Set
Example: Features for 'about'
Exclude features seen < 10 times
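A sketch of history-based feature extraction in the spirit of Ratnaparkhi's templates (the template strings and helper names here are illustrative assumptions; the real tagger also adds prefix/suffix, hyphen, number, and capitalization features for rare words):

```python
def extract_features(words, tags, i):
    """Features for position i. words: the sentence; tags: tags already assigned to w_1..w_{i-1}."""
    def w(k):  # word at offset k, padded at sentence boundaries
        return words[i + k] if 0 <= i + k < len(words) else ('BOS' if i + k < 0 else 'EOS')
    def t(k):  # previously assigned tag at offset k (k < 0)
        return tags[i + k] if 0 <= i + k < len(tags) else 'BOS'
    return [
        f'w0={w(0)}',
        f'w-1={w(-1)}', f'w-2={w(-2)}',
        f'w+1={w(+1)}', f'w+2={w(+2)}',
        f't-1={t(-1)}', f't-2t-1={t(-2)}+{t(-1)}',
    ]

# e.g. extract_features(['Time', 'flies', 'like', 'an', 'arrow'], ['N'], 1)
# yields features for 'flies' using the history definition above, with t-1=N.
```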
Training
- GIS training time: O(NTA)
  - N: training set size
  - T: number of tags
  - A: average number of features active for an event (h, t)
- About 24 hours on a '96 machine
Finding Features
- In training, where do features come from?
- Where do features come from in testing?
  - Tag features come from the classification of the prior word

              w-1      w0      w-1 w0        w+1     t-1    y
  x1 (Time)            Time                  flies   BOS    N
  x2 (flies)  Time     flies   Time flies    like    N      N
  x3 (like)   flies    like    flies like    an      N      V
Decoding
Goal: identify the highest-probability tag sequence
Issues:
- Features include tags from previous words, which are not immediately available
- The model uses tag history, so just knowing the highest-probability preceding tag is insufficient
Beam Search
Intuition:
- Breadth-first search explores all paths, but lots of paths are (pretty obviously) bad; why explore bad paths?
- Restrict attention to the (apparently) best paths
Approach: perform breadth-first search, but retain only the k 'best' paths so far (k: beam width)
Beam Search, k=3: worked example on 'time flies like an arrow' (figure)
Beam Search
- W = {w_1, w_2, ..., w_n}: test sentence
- s_{ij}: j-th highest probability sequence up to and including word w_i
- Generate tags for w_1, keep the top k, set s_{1j} accordingly
- for i = 2 to n:
  - Extension: add tags for w_i to each s_{(i-1)j}
  - Beam selection: sort sequences by probability and keep only the top k sequences
Beam Search (detail)
- W = {w_1, w_2, ..., w_n}: test sentence; s_{ij}: j-th highest probability sequence up to and including word w_i
- Generate tags for w_1, keep the top N, set s_{1j} accordingly
- for i = 2 to n:
  - For each s_{(i-1)j}, form the feature vector for w_i and keep the top N tags for w_i
  - Beam selection: sort sequences by probability and keep only the top sequences, using the pruning on the next slide
- Return the highest-probability sequence s_{n1}
Beam Search (pruning and storage)
- W = beam width
- For each node, store the tag for w_i and the probability of the sequence so far: prob_{i,j} = Π_{k=1}^{i} p(t_k | h_k)
- For each candidate j, s_{i,j}: keep the node if prob_{i,j} is in the top K and prob_{i,j} is sufficiently high, e.g. lg(prob_{i,j}) + W >= lg(max_prob)
Decoding
- Tag dictionary:
  - known word: returns the tags seen with that word in training
  - unknown word: returns all tags
- Beam width = 5
- Running time: O(NTAB), with N, T, A as before and B the beam width
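A minimal sketch of the beam-search decoder described above (illustrative, not Ratnaparkhi's code); p_tag and tag_dict are assumed helpers, and the slide's single beam width W is split here into a top-k cutoff plus a log-probability margin:

```python
import math

def beam_decode(words, p_tag, tag_dict, topk=5, W=5.0):
    """p_tag(words, tags, i) -> {tag: p(t_i | h_i)}; tag_dict(word) -> candidate tags."""
    beam = [([], 0.0)]                          # (tag sequence, log probability)
    for i, word in enumerate(words):
        candidates = []
        for tags, logp in beam:
            probs = p_tag(words, tags, i)       # model distribution for this history
            for tag in tag_dict(word):
                candidates.append((tags + [tag], logp + math.log(probs[tag])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        best = candidates[0][1]
        # beam selection: top-k sequences whose log prob is within W of the best
        beam = [c for c in candidates[:topk] if c[1] + W >= best]
    return beam[0][0]                           # highest-probability sequence s_n1
```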
POS Tagging
- Overall accuracy: 96.3+%; unseen word accuracy: 86.2%
- Comparable to HMM or TBL tagging accuracy
- Provides a probabilistic framework that is better able to model different information sources
- Topline accuracy is 96-97%, limited by annotation consistency issues
Beam Search
- Beam search decoding: a variant of breadth-first search that keeps only the top k sequences at each layer
- Advantages:
  - Efficient in practice: a beam of 3-5 is near optimal
  - Empirically, the beam covers 5-10% of the search space, pruning 90-95%
  - Simple to implement: just extension + sorting, no dynamic programming
  - Running time: O(NTAB), as above
- Disadvantage: not guaranteed to be optimal (or complete)
MaxEnt POS Tagging
- Part-of-speech tagging by classification:
  - Feature design: word and tag context features; orthographic features for rare words
- Sequence classification problems: tag features depend on prior classifications
- Beam search decoding: efficient but inexact; near optimal in practice