ELN – Natural Language Processing

Maximum Entropy Model. ELN – Natural Language Processing. Slides by: Fei Xia

History. The concept of maximum entropy can be traced back along multiple threads to Biblical times. It was introduced to NLP by Berger et al. (1996) and has since been used in many NLP tasks: MT, tagging, parsing, PP attachment, LM, …

Outline
- Modeling: intuition, basic concepts, …
- Parameter training
- Feature selection
- Case study

Reference papers: (Ratnaparkhi, 1997), (Ratnaparkhi, 1996), (Berger et al., 1996), (Klein and Manning, 2003). Note that these papers use different notations.

Modeling

The basic idea. Goal: estimate p. Choose the p with maximum entropy (or "uncertainty") subject to the constraints (or "evidence").

Setting. From training data, collect (a, b) pairs:
- a: the thing to be predicted (e.g., a class in a classification problem)
- b: the context
Example (POS tagging): a = NN, b = the words in a window and the previous two tags.
Learn the probability of each pair: p(a, b).

Features in POS tagging (Ratnaparkhi, 1996). Each feature pairs a condition on the context (a.k.a. the history) with an allowable class ti = T:
- If wi is not rare: wi = X & ti = T
- If wi is rare: X is a prefix of wi (|X| ≤ 4) & ti = T; X is a suffix of wi (|X| ≤ 4) & ti = T; wi contains a number & ti = T; wi contains an uppercase character & ti = T; wi contains a hyphen & ti = T
- For all wi: ti-1 = X & ti = T; ti-2 ti-1 = X Y & ti = T; wi-1 = X & ti = T; wi-2 = X & ti = T; wi+1 = X & ti = T; wi+2 = X & ti = T

Maximum Entropy. Why maximum entropy? Maximizing entropy = minimizing commitment: model all that is known and assume nothing about what is unknown.
- Model all that is known: satisfy a set of constraints that must hold.
- Assume nothing about what is unknown: choose the most "uniform" distribution, i.e., the one with maximum entropy.

(Maximum) Entropy. Entropy is the uncertainty of a distribution. Quantifying uncertainty ("surprise"): an event x with probability px carries surprise log(1/px). Entropy is the expected surprise under p: H(p) = - Σx px log px. [Figure: H for a coin as a function of p(HEADS), peaking at the uniform distribution.]
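A minimal sketch of this computation in plain Python (illustrative, not part of the original slides):

```python
import math

def entropy(p):
    """Entropy H(p) = -sum_x p(x) * log p(x), in nats; zero-probability terms contribute 0."""
    return -sum(px * math.log(px) for px in p if px > 0)

# A fair coin maximizes entropy among all two-outcome distributions.
print(entropy([0.5, 0.5]))   # ~0.693
print(entropy([0.3, 0.7]))   # ~0.611, lower than the uniform case
```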

Ex1: Coin-flip example (Klein & Manning, 2003). Toss a coin: p(H) = p1, p(T) = p2. Constraint: p1 + p2 = 1. Question: what is your estimate of p = (p1, p2)? Answer: choose the p that maximizes H(p), i.e., p1 = p2 = 1/2. [Figure: H as a function of p1; adding a constraint such as p1 = 0.3 pins the solution to a single point.]

Convexity. H(p) = - Σx p(x) log p(x) is concave: each term -p log p is concave, and a sum of concave functions is concave. The feasible region defined by the (linear) constraints is a convex set, so maximizing entropy over it has a single global optimum. The maximum likelihood exponential-model (dual) formulation is likewise a convex optimization problem.

Ex2: An MT example (Berger et al., 1996). The possible translations of the word "in" are {dans, en, à, au cours de, pendant}.
Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
Intuitive answer: p(dans) = 1/5, p(en) = 1/5, p(à) = 1/5, p(au cours de) = 1/5, p(pendant) = 1/5

An MT example (cont). Constraints: p(dans) + p(en) = 1/5 and p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
Intuitive answer: p(dans) = 1/10, p(en) = 1/10, p(à) = 8/30, p(au cours de) = 8/30, p(pendant) = 8/30

An MT example (cont). Constraints: p(dans) + p(en) = 1/5, p(dans) + p(à) = 1/2, and p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
Intuitive answer: p(dans) = ??, p(en) = ??, p(à) = ??, p(au cours de) = ??, p(pendant) = ?? With overlapping constraints there is no longer an obvious guess; this is exactly the situation that maximizing entropy under the constraints resolves.
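For this last case the maximum-entropy answer can be found numerically. A small sketch, assuming SciPy is available (the variable ordering is dans, en, à, au cours de, pendant):

```python
import numpy as np
from scipy.optimize import minimize

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)       # avoid log(0)
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},       # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 1/5},   # p(dans) + p(en) = 1/5
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 1/2},   # p(dans) + p(à) = 1/2
]
res = minimize(neg_entropy, x0=np.full(5, 0.2),
               bounds=[(0, 1)] * 5, constraints=constraints)
print(res.x)   # the maximum-entropy distribution; no longer guessable by symmetry
```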

Ex3: POS tagging (Klein and Manning, 2003). Suppose the event space is the tag set NN, NNS, NNP, NNPS, VBZ, VBD, and the empirical data give counts of 3 (NN), 5 (NNS), 11 (NNP), 13 (NNPS), with the remaining 4 counts on VBZ and VBD (36 tokens in total). Maximizing H alone pushes each unnormalized value toward 1/e. Since we want probabilities, we add the constraint E[NN, NNS, NNP, NNPS, VBZ, VBD] = 1, and the maximum-entropy solution is the uniform distribution: 1/6 for each tag.

Ex3 (cont). Too uniform! N* tags are more common than V* tags, so we add the feature fN = {NN, NNS, NNP, NNPS} with E[fN] = 32/36; the solution then gives 8/36 to each N* tag and 2/36 to each of VBZ and VBD. Proper nouns are more frequent than common nouns, so we also add fP = {NNP, NNPS} with E[fP] = 24/36; the solution becomes NN: 4/36, NNS: 4/36, NNP: 12/36, NNPS: 12/36, VBZ: 2/36, VBD: 2/36. We could keep refining the model, e.g., by adding a feature to distinguish singular vs. plural nouns, or verb types.

Ex4: overlapping features (Klein and Manning, 2003). Maxent models handle overlapping features: unlike a Naive Bayes model, there is no double counting. They do not, however, automatically model feature interactions.

Modeling the problem Objective function: H(p) Goal: Among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p). Question: How to represent constraints?

Features. A feature (a.k.a. a feature function or indicator function) is a binary-valued function on events: fj : ε → {0, 1}, where ε = A × B. A is the set of possible classes (e.g., tags in POS tagging) and B is the space of contexts (e.g., neighboring words/tags in POS tagging). Example: fj(a, b) = 1 if a = DET and the current word in b is "the", and 0 otherwise.
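A quick sketch of such indicator features in Python; the specific features here are hypothetical examples, not the ones used in the papers:

```python
# Each feature maps an event (class a, context b) to 0 or 1.
def f_det_the(a, b):
    """Fires when the predicted tag is DET and the current word is 'the'."""
    return 1 if a == "DET" and b.get("w_i") == "the" else 0

def f_nn_after_dt(a, b):
    """Fires when the previous tag is DT and the predicted tag is NN."""
    return 1 if a == "NN" and b.get("t_prev") == "DT" else 0

event = ("DET", {"w_i": "the", "t_prev": "<s>"})
print(f_det_the(*event))       # 1
print(f_nn_after_dt(*event))   # 0
```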

Some notation:
- Finite training sample of events: S = {x1, …, xN}, where each event is x = (a, b).
- Observed probability of x in S: p~(x), the relative frequency of x in S.
- The model p's probability of x: p(x).
- The j-th feature: fj(x).
- Observed expectation of fj (the normalized empirical count of fj): E_p~[fj] = Σx p~(x) fj(x).
- Model expectation of fj: E_p[fj] = Σx p(x) fj(x).

Constraints: the model's feature expectation must equal the observed feature expectation, E_p[fj] = E_p~[fj], for every feature fj. How do we calculate E_p~[fj]?

Training data → observed events: each training instance contributes one event (a, b), and E_p~[fj] is simply the count of fj over these events divided by the number of events.
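A small sketch of that computation over a hypothetical toy training set (illustrative only):

```python
# (tag, context) events collected from training data.
events = [("DT", {"w": "the"}), ("NN", {"w": "dog"}),
          ("DT", {"w": "the"}), ("VBZ", {"w": "barks"})]

features = {
    "f_dt_the": lambda a, b: 1 if a == "DT" and b["w"] == "the" else 0,
    "f_is_nn":  lambda a, b: 1 if a == "NN" else 0,
}

# Observed expectation of each feature: its average value over the events.
n = len(events)
observed = {name: sum(f(a, b) for a, b in events) / n
            for name, f in features.items()}
print(observed)   # {'f_dt_the': 0.5, 'f_is_nn': 0.25}
```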

Restating the problem. The task: find p* such that p* = argmax over p in P of H(p), where P = {p | E_p[fj] = E_p~[fj], j = 1, …, k}. Equivalently, minimize the objective function -H(p) subject to the constraints E_p[fj] - E_p~[fj] = 0; the normalization requirement Σx p(x) = 1 can be added as one more (always-on) feature constraint.

Questions Is P empty? Does p* exist? Is p* unique? What is the form of p*? How to find p*?

What is the form of p*? (Ratnaparkhi, 1997). Let Q be the family of models of exponential (log-linear) form, p(x) = π Πj αj^fj(x) with αj > 0 and π a normalizing constant. Theorem: if p* ∈ P ∩ Q, then p* = argmax over p in P of H(p). Furthermore, p* is unique.

Using Lagrange multipliers: minimize A(p), i.e., -H(p) plus one multiplier term per constraint, set the derivative with respect to each p(x) to 0, and solve; the solution has the exponential form above.
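Written out, with one common sign convention for the multipliers, the derivation looks roughly like this:

```latex
A(p) = -H(p) - \sum_j \lambda_j \bigl(E_p[f_j] - E_{\tilde p}[f_j]\bigr)
             - \lambda_0 \Bigl(\sum_x p(x) - 1\Bigr)

\frac{\partial A}{\partial p(x)} = \log p(x) + 1 - \sum_j \lambda_j f_j(x) - \lambda_0 = 0
\quad\Longrightarrow\quad
p(x) = e^{\lambda_0 - 1}\,\exp\Bigl(\sum_j \lambda_j f_j(x)\Bigr)
```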

Two equivalent forms of the model: p(x) = π Πj αj^fj(x) and p(x) = (1/Z) exp(Σj λj fj(x)), with αj = e^λj and Z = 1/π the normalization constant.
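As a concrete sketch, the conditional version of this model can be computed as follows (function and argument names are mine, not from the slides):

```python
import math

def maxent_prob(a, b, classes, features, weights):
    """p(a | b) = exp(sum_j w_j * f_j(a, b)) / Z(b) for a log-linear model.
    `features` is a list of functions f_j(a, b) -> {0, 1};
    `weights` is the matching list of lambdas."""
    def unnormalized(cls):
        return math.exp(sum(w * f(cls, b) for w, f in zip(weights, features)))
    z = sum(unnormalized(cls) for cls in classes)   # normalization constant Z(b)
    return unnormalized(a) / z
```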

Relation to Maximum Likelihood. The log-likelihood of the empirical distribution p~ as predicted by a model q is defined as L(q) = Σx p~(x) log q(x). Theorem: if p* = argmax over q in Q of L(q), then p* = argmax over p in P of H(p). Furthermore, p* is unique.

Summary (so far). Goal: find the p* in P that maximizes H(p). It can be proved that when p* exists, it is unique. The model p* in P with maximum entropy is also the model in Q that maximizes the likelihood of the training sample.

Summary (cont). Adding constraints (features) (Klein and Manning, 2003):
- lowers the maximum entropy
- raises the maximum likelihood of the data
- brings the distribution further from uniform
- brings the distribution closer to the data

Parameter estimation

Algorithms Generalized Iterative Scaling (GIS): (Darroch and Ratcliff, 1972) Improved Iterative Scaling (IIS): (Della Pietra et al., 1995)

GIS: setup. Requirements for running GIS: the model must obey the exponential form and the constraints E_p[fj] = E_p~[fj]. An additional constraint: the features must sum to a constant C for every event, Σj fj(x) = C. If this does not hold, set C = maxx Σj fj(x) and add a new correction feature fk+1(x) = C - Σj fj(x).

GIS algorithm:
- Compute the observed expectations dj = E_p~[fj], j = 1, …, k+1.
- Initialize the weights λj (any values, e.g., 0).
- Repeat until convergence: for each j, compute the model expectation E_p(n)[fj] under the current model p(n), then update λj ← λj + (1/C) log( dj / E_p(n)[fj] ).
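A compact sketch of GIS for a small conditional model (plain Python; the interface is illustrative and assumes a correction feature has already been added so the feature sum is constant):

```python
import math

def train_gis(data, classes, features, iters=100):
    """data: list of (a, b) events; features: list of f_j(a, b) -> {0, 1}.
    Assumes sum_j f_j(a, b) == C for every event (correction feature included)."""
    C = sum(f(*data[0]) for f in features)   # constant feature sum
    n = len(data)
    d = [sum(f(a, b) for a, b in data) / n for f in features]   # observed E[f_j]
    lam = [0.0] * len(features)

    def p_cond(a, b):
        scores = {c: math.exp(sum(l * f(c, b) for l, f in zip(lam, features)))
                  for c in classes}
        return scores[a] / sum(scores.values())

    for _ in range(iters):
        # Model expectations, approximated over the observed contexts.
        e = [sum(p_cond(c, b) * f(c, b) for _, b in data for c in classes) / n
             for f in features]
        lam = [l + (1.0 / C) * math.log(dj / ej) if dj > 0 and ej > 0 else l
               for l, dj, ej in zip(lam, d, e)]
    return lam
```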

Approximation for calculating the model's feature expectation: instead of summing over all possible events, use E_p[fj] ≈ (1/N) Σi Σa p(a | bi) fj(a, bi), i.e., sum only over the contexts bi seen in the training data.

Properties of GIS:
- L(p(n+1)) >= L(p(n)): the likelihood never decreases.
- The sequence is guaranteed to converge to p*.
- Convergence can be very slow.
- The running time of each iteration is O(NPA), where N is the training set size, P is the number of classes, and A is the average number of features that are active for a given event (a, b).

Feature selection

Feature selection. One approach: throw in many features (from manually specified feature templates) and let the machine select the weights. Problem: too many features. An alternative is a greedy algorithm: start with an empty set S and add one feature at each iteration.

Notation. With the feature set S, the best model is pS. After adding a feature f, the set becomes S ∪ {f} and the best model is pS∪f. The gain in the log-likelihood of the training data is ΔL(S, f) = L(pS∪f) - L(pS).

Feature selection algorithm (Berger et al., 1996):
- Start with S empty; thus pS is uniform.
- Repeat until the gain is small enough: for each candidate feature f, compute the model pS∪f using IIS and calculate the log-likelihood gain; then choose the feature with maximal gain and add it to S.
Problem: too expensive (see the sketch below).
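The sketch below spells out the greedy loop; `train_model` and `log_likelihood` are placeholders for IIS training and the training-data log-likelihood, which are assumed to be defined elsewhere:

```python
def select_features(candidates, train_model, log_likelihood, min_gain=1e-3):
    """Greedy forward selection: repeatedly add the candidate feature whose
    addition most improves the training-data log-likelihood."""
    selected = []
    best_ll = log_likelihood(train_model(selected))   # no features: uniform model
    while candidates:
        # Gain of each remaining candidate (expensive: one retraining per candidate).
        gains = {f: log_likelihood(train_model(selected + [f])) - best_ll
                 for f in candidates}
        f_best = max(gains, key=gains.get)
        if gains[f_best] < min_gain:
            break
        selected.append(f_best)
        best_ll += gains[f_best]
        candidates = [f for f in candidates if f is not f_best]
    return selected
```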

Approximating gains (Berger et al., 1996): instead of recalculating all the weights for each candidate, calculate only the weight of the new feature, keeping the existing weights fixed.

Training a MaxEnt Model.
Scenario #1:
- Define feature templates.
- Create the feature set.
- Determine the optimum feature weights via GIS or IIS.
Scenario #2:
- Define feature templates.
- Create the candidate feature set S.
- At every iteration, choose the feature from S with maximal gain and determine its weight (or choose the top-n features and their weights).

Case study

POS tagging (Ratnaparkhi, 1996). Notation variation: fj(a, b), with a the class and b the context, becomes fj(hi, ti), where hi is the history for the i-th word and ti is its tag. History: hi = {wi, wi-1, wi-2, wi+1, wi+2, ti-2, ti-1}. Training data: treat it as a list of (hi, ti) pairs. How many pairs are there? One per word token.

Using a MaxEnt model:
- Modeling: define the conditional probability of a tag given its history, p(t | h).
- Training: define feature templates, create the feature set, and determine the optimum feature weights via GIS or IIS.
- Decoding: find the best tag sequence for a new sentence.

Modeling

Training step 1: define feature templates. Each template pairs a condition on the history hi with the tag ti:
- If wi is not rare: wi = X & ti = T
- If wi is rare: X is a prefix of wi (|X| ≤ 4) & ti = T; X is a suffix of wi (|X| ≤ 4) & ti = T; wi contains a number & ti = T; wi contains an uppercase character & ti = T; wi contains a hyphen & ti = T
- For all wi: ti-1 = X & ti = T; ti-2 ti-1 = X Y & ti = T; wi-1 = X & ti = T; wi-2 = X & ti = T; wi+1 = X & ti = T; wi+2 = X & ti = T

Step 2: create the feature set. Collect all the features from the training data, then throw away features that appear fewer than 10 times.
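A small sketch of this step (the template and threshold handling are illustrative):

```python
from collections import Counter

def build_feature_set(events, templates, min_count=10):
    """Instantiate feature templates on (history, tag) events and keep only
    the features seen at least `min_count` times."""
    counts = Counter()
    for h, t in events:
        for template in templates:
            counts.update(template(h, t))
    return {feat for feat, c in counts.items() if c >= min_count}

# Hypothetical template: current word paired with the tag.
word_tag = lambda h, t: [f"w={h['w']}&t={t}"]
events = [({"w": "the"}, "DT")] * 12 + [({"w": "dog"}, "NN")] * 3
print(build_feature_set(events, [word_tag]))   # only the 'the'/DT feature survives
```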

Step 3: determine the feature weights with GIS. Training time: each iteration is O(NTA), where N is the training set size, T is the number of allowable tags, and A is the average number of features that are active for a given (h, t). How many features are there?

Decoding: beam search.
- Generate tags for w1, find the top N, and set s1j accordingly, j = 1, 2, …, N.
- For i = 2 to n (n is the sentence length):
  - For j = 1 to N: generate tags for wi given s(i-1)j as the previous tag context, and append each tag to s(i-1)j to make a new sequence.
  - Find the N highest-probability sequences generated above and set sij accordingly, j = 1, …, N.
- Return the highest-probability sequence sn1.
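A minimal sketch of the same beam search in Python; `score` stands in for the trained model's log p(tag | context) and is an assumption of this example:

```python
import math

def beam_search(words, tagset, score, beam_size=5):
    """Keep the `beam_size` best partial tag sequences at each position."""
    beam = [(0.0, [])]                        # (log probability, tag sequence)
    for w in words:
        candidates = []
        for logp, seq in beam:
            for t in tagset:
                candidates.append((logp + score(w, seq, t), seq + [t]))
        beam = sorted(candidates, key=lambda x: x[0], reverse=True)[:beam_size]
    return max(beam, key=lambda x: x[0])[1]

# Toy usage with a uniform dummy scorer (replace with the real MaxEnt model).
dummy = lambda w, prev_tags, t: math.log(1.0 / 3)
print(beam_search(["time", "flies"], ["NN", "VBZ", "DT"], dummy, beam_size=2))
```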

Beam search. Beam inference: at each position keep the top k complete sequences; extend each sequence in each local way; the extensions compete for the k slots at the next position.
Advantages:
- Fast: beam sizes of 3-5 are as good, or almost as good, as exact inference in many cases.
- Easy to implement (no dynamic programming required).
Disadvantage:
- Inexact: the globally best sequence can fall off the beam.

Viterbi search. Viterbi inference: dynamic programming or memoization; requires a small window of state influence (e.g., only the past two states are relevant).
Advantage:
- Exact: the globally best sequence is returned.
Disadvantage:
- Harder to implement long-distance state-state interactions (but beam inference tends not to allow long-distance resurrection of sequences anyway).

Decoding (cont). Tags for words: for known words, use a tag dictionary; for unknown words, try all possible tags. Ex: "time flies like an arrow". Running time: O(NTAB), where N is the sentence length, T is the tagset size, A is the average number of features active for a given event, and B is the beam size.

Experiment results: Error rate

Comparison with other learners: vs. HMMs, MaxEnt uses more context; vs. SDTs (statistical decision trees), MaxEnt does not split the data; vs. TBL (transformation-based learning), MaxEnt is statistical and provides probability distributions.

MaxEnt summary.
- Concept: choose the p* that maximizes entropy while satisfying all the constraints.
- Max likelihood: p* is also the model, within the exponential model family, that maximizes the log-likelihood of the training data.
- Training: GIS or IIS, both of which can be slow.
- MaxEnt handles overlapping features well.
- In general, MaxEnt achieves good performance on many NLP tasks.