1 CSC 594 Topics in AI – Natural Language Processing
Spring 2018
13. Maximum Entropy and Log-linear Models
(Most slides adapted from the CSCI-GA.2590 lecture by Ralph Grishman at NYU)

2 Maximum Entropy Models and Feature Engineering
CSCI-GA.2590 – Lecture 6B
Ralph Grishman

3 So Far …
So far we have relied primarily on HMMs as our models for language phenomena:
- simple and fast to train and to use
- effective for POS tagging (one POS → one state)
- can be made effective for name tagging (can capture context) by splitting states, but further splitting could lead to sparse data problems

4 We want …
We want to have a more flexible means of capturing our linguistic intuition that certain conditions lead to the increased likelihood of certain outcomes:
- that a name on a ‘common first name’ list increases the chance that this is the beginning of a person name
- that being in a sports story increases the chance of team (organization) names
Maximum entropy modeling (logistic regression) provides one mathematically well-founded method for combining such features in a probabilistic model.

5 Maximum Entropy
The features provide constraints on the model. We'd like to have a probability distribution which, outside of these constraints, is as uniform as possible: one that has the maximum entropy among all models which satisfy these constraints.

6 Indicator Functions
Suppose we have a tagging task, where we want to assign a tag t to a word w based on the 'context' h of w (the words around w, including w itself). In other words, we want to compute p(h, t). We will specify a set of K features in the form of binary-valued indicator functions fi(h, t). Example:
f1(h, t) = 1 if the preceding word in h is "to" and t = "VB"
         = 0 otherwise
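To make this concrete, here is a minimal sketch of such an indicator function in Python; the dict-based context representation (a token list plus a position) is an assumption for illustration, not something specified in the slides.

```python
# Minimal sketch of a binary indicator feature (hypothetical context format).
# h is assumed to be a dict with:
#   h["words"]: the list of tokens in the sentence
#   h["i"]:     the position of the word being tagged
def f1(h, t):
    """1 if the preceding word is 'to' and the candidate tag is 'VB', else 0."""
    i = h["i"]
    prev_word = h["words"][i - 1] if i > 0 else None
    return 1 if (prev_word == "to" and t == "VB") else 0

# Example: tagging "plans to leave" at position 2 with candidate tag VB
h = {"words": ["plans", "to", "leave"], "i": 2}
print(f1(h, "VB"))   # 1
print(f1(h, "NN"))   # 0
```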

7 POS Features
[Figure: POS feature examples, from Jurafsky and Martin, Speech and Language Processing]

8 Sentiment Features
[Figure: sentiment feature examples, from Jurafsky and Martin, Speech and Language Processing]


12 A log-linear model
We will use a log-linear model:
p(h, t) = (1/Z) ∏ i=1 to K  αi^fi(h, t)
where αi is the weight for feature i, and Z is a normalizing constant. If αi > 1, the feature makes the outcome t more likely; if αi < 1, the feature makes the outcome t less likely.
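The following sketch evaluates a toy log-linear model of this form; the two feature functions and their weights are made up, and for simplicity the normalization is done over the candidate tags for a fixed context (the conditional form actually used when classifying), rather than over all (h, t) pairs.

```python
# Sketch: evaluating a log-linear model p(h, t) = (1/Z) * prod_i alpha_i ** f_i(h, t)
# for a toy tag set; feature functions and weights are illustrative assumptions.

TAGS = ["VB", "NN"]

def f1(h, t):  # preceding word is "to" and tag is VB
    i = h["i"]
    return 1 if (i > 0 and h["words"][i - 1] == "to" and t == "VB") else 0

def f2(h, t):  # word ends in "s" and tag is NN
    return 1 if (h["words"][h["i"]].endswith("s") and t == "NN") else 0

FEATURES = [f1, f2]
ALPHAS = [4.0, 2.5]   # illustrative weights; alpha > 1 boosts the outcome

def p(h, t):
    """p(h, t) normalized over the candidate tags for this context."""
    def unnormalized(tag):
        score = 1.0
        for alpha, f in zip(ALPHAS, FEATURES):
            score *= alpha ** f(h, tag)
        return score
    z = sum(unnormalized(tag) for tag in TAGS)   # normalizing constant Z
    return unnormalized(t) / z

h = {"words": ["plans", "to", "leave"], "i": 2}
print(p(h, "VB"))   # > 0.5, since f1 fires for VB
```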

13 Logistic Regression Model
[Figure: the logistic regression formulation, from Jurafsky and Martin, Speech and Language Processing]

14 The goal of the learning procedure is to determine the values of the αi's so that the expected value of each fi under the model,
Σh,t p(h, t) fi(h, t),
is equal to its average value over the training set of N words (whose contexts are h1, ..., hN and whose tags are t1, ..., tN):
(1/N) Σj fi(hj, tj)
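The sketch below spells out the two quantities being equated, using the conditional form of the model expectation over the training contexts; the function names and the data layout (a list of (context, tag) pairs) are illustrative assumptions.

```python
# Sketch: the constraint that training enforces, for one feature f_i.
# training_pairs is assumed to be a list of (context, gold_tag) pairs, and
# p a trained model as in the earlier sketch (normalized over tags per context).

def empirical_expectation(f, training_pairs):
    """(1/N) * sum_j f(h_j, t_j) over the labeled training data."""
    return sum(f(h, t) for h, t in training_pairs) / len(training_pairs)

def model_expectation(f, training_pairs, p, tags):
    """Average of sum_t p(h_j, t) * f(h_j, t) over the training contexts."""
    total = 0.0
    for h, _gold in training_pairs:
        total += sum(p(h, t) * f(h, t) for t in tags)
    return total / len(training_pairs)

# At convergence the two values should match for every feature f_i.
```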

15 Sentiment Features w/ Weights
[Figure: the sentiment features with example weights 1.9, 0.9, 0.7, -0.8, from Jurafsky and Martin, Speech and Language Processing]

16 Training
Training a ME model involves finding the αi's. Unlike HMM training, there is no closed-form solution; an iterative solver is required. The first ME packages used generalized iterative scaling. Faster solvers such as BFGS and L-BFGS are now available.
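For instance, a MaxEnt model can be trained as a multinomial logistic regression with an off-the-shelf L-BFGS solver. The sketch below uses scikit-learn on a tiny made-up feature matrix; the feature encoding and data are illustrative choices, not part of the slides.

```python
# Sketch: training a MaxEnt (multinomial logistic regression) model with L-BFGS
# via scikit-learn. The tiny feature matrix and labels are made-up examples.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is a vector of binary indicator features for one training word.
X = np.array([
    [1, 0, 0],   # e.g. f1=1: previous word is "to"
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
])
y = np.array(["VB", "NN", "NN", "VB"])   # gold tags

# L-BFGS is the iterative solver; C controls regularization strength.
clf = LogisticRegression(solver="lbfgs", C=1.0)
clf.fit(X, y)

# The learned weights correspond to log(alpha_i) in the log-linear model.
print(clf.classes_, clf.coef_)
print(clf.predict_proba([[1, 0, 0]]))   # distribution over tags
```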

17 Overfitting and Regularization
If a feature appears only a few times, and by chance each time with the same outcome, it will get a high weight, possibly leading to poor predictions on the test data:
- this is an example of overfitting: not enough data to train many features
- a simple solution is a threshold: a minimum count of a feature–outcome pair (see the sketch below)
- a fancier approach is regularization: favoring solutions with smaller weights, even if the result is not as good a fit to the training data
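A minimal sketch of the count-threshold idea, assuming the training data has already been reduced to (feature, outcome) pairs; the threshold value and feature names are illustrative.

```python
# Sketch: drop feature-outcome pairs seen fewer than MIN_COUNT times,
# so rare (possibly accidental) features cannot receive large weights.
from collections import Counter

MIN_COUNT = 3   # illustrative threshold

def prune_rare_features(feature_outcome_pairs):
    """Keep only (feature, outcome) pairs that occur at least MIN_COUNT times."""
    counts = Counter(feature_outcome_pairs)
    return {pair for pair, c in counts.items() if c >= MIN_COUNT}

pairs = [("prev=to", "VB")] * 5 + [("word=stocks", "NN")] * 1
print(prune_rare_features(pairs))   # {('prev=to', 'VB')}
```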

18 Using MaxEnt
MaxEnt is typically used as a multi-class classifier. We are given a set of training data, where each datum is labeled with a set of features and a class (tag). Each feature–class pair constitutes an indicator function. We train a classifier using this data, computing the αi's. We can then classify new data by selecting the class (tag) which maximizes p(h, t).

19 Using MaxEnt
Typical training data format:
f1 f2 f3 … outcome
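A small sketch of reading data in this kind of format, where each whitespace-separated line lists feature names followed by the outcome in the last column; the example feature names in the comment are made up.

```python
# Sketch: reading whitespace-separated training lines of the form
#   f1 f2 f3 ... outcome
# where everything before the last column is a feature name.
def read_training_data(path):
    data = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.split()
            if not fields:
                continue
            *features, outcome = fields
            data.append((set(features), outcome))
    return data

# Example line: "prev=to word=leave suffix=ve VB"
```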

20 Discriminative Models: Maximum Entropy Markov Models (MEMMs)
Exponential model. Given training set X with label sequence Y:
- Train a model θ that maximizes P(Y|X, θ)
- For a new data sequence x, the predicted label y maximizes P(y|x, θ)
- Notice the per-state normalization

21 Maximum Entropy Markov Model (MEMM)
- a type of Hidden Markov Model (a sequence model)
- next-state probabilities P(ti | ti-1, wi) computed by a MaxEnt model
- the MaxEnt model has access to the entire sentence, but only to the immediate prior state (not to earlier states): a first-order HMM
- use Viterbi for tagging: time is still O(s²n), but with a larger constant factor for the MaxEnt evaluation (a Viterbi sketch follows this list)
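As a concrete illustration, here is a short Viterbi sketch for MEMM decoding. The local model is passed in as a function p_next(prev_tag, i, words) that returns a distribution over tags; that interface is an assumption for the sketch, not any particular package's API.

```python
# Sketch: Viterbi decoding for an MEMM. p_next(prev_tag, i, words) is assumed
# to return a dict {tag: P(tag | prev_tag, words, i)} from a trained MaxEnt model.
import math

def viterbi(words, tags, p_next, start_tag="<S>"):
    n = len(words)
    # best[i][t] = (log probability, previous tag) of the best path ending in t at i
    best = [dict() for _ in range(n)]
    for t in tags:
        p = p_next(start_tag, 0, words).get(t, 0.0)
        best[0][t] = (math.log(p) if p > 0 else float("-inf"), None)
    for i in range(1, n):
        for t in tags:
            candidates = []
            for prev in tags:
                p = p_next(prev, i, words).get(t, 0.0)
                lp = best[i - 1][prev][0] + (math.log(p) if p > 0 else float("-inf"))
                candidates.append((lp, prev))
            best[i][t] = max(candidates)          # O(s^2 n) overall
    # Backtrace from the best final tag.
    tag = max(tags, key=lambda t: best[n - 1][t][0])
    path = [tag]
    for i in range(n - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))
```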

22 Feature Engineering
The main task when using a MaxEnt classifier is to select an appropriate set of features (a sketch of a simple extractor follows this list):
- words in the immediate neighborhood are typical basic features: wi-1, wi, wi+1
- patterns constructed for rule-based taggers are likely candidates: wi+1 is an initial
- membership on word lists: wi is a common first name (from the Census)
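A minimal sketch of such a feature extractor; the feature names and the tiny first-name set (standing in for the Census list) are illustrative assumptions.

```python
# Sketch: extracting neighborhood and word-list features for position i.
# COMMON_FIRST_NAMES stands in for a list such as the Census first-name list.
COMMON_FIRST_NAMES = {"john", "mary", "robert"}

def extract_features(words, i):
    feats = set()
    feats.add("w_i=" + words[i])
    if i > 0:
        feats.add("w_i-1=" + words[i - 1])
    if i + 1 < len(words):
        feats.add("w_i+1=" + words[i + 1])
        nxt = words[i + 1]
        if len(nxt) == 2 and nxt[0].isupper() and nxt.endswith("."):
            feats.add("w_i+1_is_initial")
    if words[i].lower() in COMMON_FIRST_NAMES:
        feats.add("w_i_is_common_first_name")
    return feats

print(extract_features(["meet", "John", "F.", "Kennedy"], 1))
```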

23 FE and log-linear models
- the MaxEnt model combines features multiplicatively
- you may want to include the conjunction of features as a separate feature
- treat bigrams as separate features: wi-1 × wi (sketched below)
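A tiny sketch of adding such a bigram conjunction feature on top of the individual word features; the feature-naming scheme is an assumption.

```python
# Sketch: a bigram conjunction feature combining w_{i-1} and w_i into one feature.
def add_bigram_feature(feats, words, i):
    if i > 0:
        feats.add("w_i-1=%s&w_i=%s" % (words[i - 1], words[i]))
    return feats

print(add_bigram_feature(set(), ["New", "York"], 1))
# {'w_i-1=New&w_i=York'}
```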

24 Combining MaxEnt classifiers
One can even treat the output of individual classifiers as features (“system combination”), potentially producing better performance than any individual classifier. Systems can be weighted based on:
- overall accuracy
- the confidence of individual classifications (margin = probability of most likely class – probability of second most likely class); a sketch of margin-weighted combination follows
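A brief sketch of computing the margin and using it to weight a simple vote across classifiers; the margin-weighted voting scheme is one illustrative way to combine systems, not something prescribed by the slides.

```python
# Sketch: margin-weighted voting over several classifiers' probability outputs.
from collections import defaultdict

def margin(probs):
    """probs: dict {class: probability}. Margin = top probability - runner-up."""
    top_two = sorted(probs.values(), reverse=True)[:2]
    return top_two[0] - (top_two[1] if len(top_two) > 1 else 0.0)

def combine(outputs):
    """outputs: list of {class: probability} dicts, one per classifier."""
    votes = defaultdict(float)
    for probs in outputs:
        best = max(probs, key=probs.get)
        votes[best] += margin(probs)       # weight each vote by its confidence
    return max(votes, key=votes.get)

print(combine([{"PER": 0.7, "ORG": 0.3}, {"ORG": 0.55, "PER": 0.45}]))  # PER
```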

25 HMM vs MEMM
HMM: a generative model
[Diagram: state sequence si-1 → si → si+1, with each state generating its word wi]

26 HMM vs MEMM
MEMM: a discriminative model
[Diagram: state sequence si-1 → si → si+1, with each state conditioned on its word wi]

27 HMMs vs. MEMMs (II)
HMM: αt(s) = the probability of producing o1, …, ot and being in s at time t.
MEMM: αt(s) = the probability of being in s at time t given o1, …, ot.
HMM: δt(s) = the probability of the best path for producing o1, …, ot and being in s at time t.
MEMM: δt(s) = the probability of the best path that reaches s at time t given o1, …, ot.

28 MaxEnt vs. Neural Network
MaxEnt:
- simple form for combining inputs (log-linear)
- developer must define the set of features to be used as inputs
Neural Network:
- much richer form for combining inputs
- can use simpler inputs (in the limiting case, words)
- useful features generated internally as part of training

29 CRF
- MEMMs are subject to label bias, particularly if there are states with only one outgoing arc
- this problem is avoided by conditional random fields (CRFs), but at a cost of higher training and decoding times
- linear-chain CRFs reduce decoding time but still have high training times

30 Random Field

31 Conditional Random Fields (CRFs)
- CRFs have all the advantages of MEMMs without the label bias problem
- an MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
- a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence (a brute-force sketch of this global normalization follows this list)
- undirected acyclic graph
- allows some transitions to “vote” more strongly than others, depending on the corresponding observations
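To make the contrast with per-state normalization concrete, here is a brute-force sketch of a linear-chain CRF's globally normalized distribution over whole label sequences. The feature functions and weights are toy assumptions, and enumerating all label sequences is only feasible for very short inputs; real CRF training and decoding use dynamic programming instead.

```python
# Sketch: a linear-chain CRF scores an entire label sequence and normalizes
# once over all possible sequences (global normalization), unlike an MEMM,
# which normalizes at every state. Brute-force enumeration, toy features.
import math
from itertools import product

LABELS = ["PER", "O"]

def score(labels, words):
    """Unnormalized log-score of a whole label sequence given the observations."""
    s = 0.0
    for i, (w, y) in enumerate(zip(words, labels)):
        if w[0].isupper() and y == "PER":                    # toy vertex feature g_k
            s += 1.5
        if i > 0 and labels[i - 1] == "PER" and y == "PER":  # toy edge feature f_k
            s += 0.8
    return s

def p_sequence(labels, words):
    """P(labels | words) with a single normalization Z(x) over all sequences."""
    z = sum(math.exp(score(list(seq), words))
            for seq in product(LABELS, repeat=len(words)))
    return math.exp(score(labels, words)) / z

words = ["John", "Smith", "left"]
print(p_sequence(["PER", "PER", "O"], words))
```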

32 Definition of CRFs
- X is a random variable over data sequences to be labeled
- Y is a random variable over corresponding label sequences

33 Example of CRFs

34 Graphical comparison among HMMs, MEMMs and CRFs
[Figure: the graphical structures of an HMM, an MEMM, and a CRF]

35 Conditional Distribution
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the fundamental theorem of random fields is:
p_θ(y | x) ∝ exp( Σ_{e∈E, k} λk fk(e, y|e, x) + Σ_{v∈V, k} μk gk(v, y|v, x) )
where:
- x is a data sequence
- y is a label sequence
- v is a vertex from the vertex set V = set of label random variables
- e is an edge from the edge set E over V
- fk and gk are given and fixed; gk is a Boolean vertex feature, fk is a Boolean edge feature
- k ranges over the features
- λk and μk are parameters to be estimated
- y|e is the set of components of y defined by edge e
- y|v is the set of components of y defined by vertex v

36 Conditional Distribution (cont’d)
CRFs use the observation-dependent normalization Z(x) for the conditional distributions:
p_θ(y | x) = (1/Z(x)) exp( Σ_{e∈E, k} λk fk(e, y|e, x) + Σ_{v∈V, k} μk gk(v, y|v, x) )
where Z(x) is a normalization over the data sequence x

