CSC 594 Topics in AI – Natural Language Processing

CSC 594 Topics in AI – Natural Language Processing, Spring 2018
13. Maximum Entropy and Log-linear Models
(Most slides adapted from the CSCI-GA.2590 lecture by Ralph Grishman at NYU)

Maximum Entropy Models and Feature Engineering
CSCI-GA.2590 – Lecture 6B
Ralph Grishman

So Far …
So far we have relied primarily on HMMs as our models for language phenomena:
- simple and fast to train and to use
- effective for POS tagging (one POS → one state)
- can be made effective for name tagging (can capture context) by splitting states, but further splitting could lead to sparse data problems

We want …
We want to have a more flexible means of capturing our linguistic intuition that certain conditions lead to an increased likelihood of certain outcomes:
- that a name on a 'common first name' list increases the chance that this is the beginning of a person name
- that being in a sports story increases the chance of team (organization) names
Maximum entropy modeling (logistic regression) provides one mathematically well-founded method for combining such features in a probabilistic model.

Maximum Entropy
The features provide constraints on the model. We'd like to have a probability distribution which, outside of these constraints, is as uniform as possible -- one which has the maximum entropy among all models which satisfy these constraints.
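Formally (a standard statement of the principle, filled in here for completeness rather than copied from the slide), the model maximizes entropy subject to matching the empirical feature expectations, and the solution has exactly the log-linear form introduced on the next slides:

\max_{p} \; H(p) = -\sum_{h,t} p(h,t)\,\log p(h,t)
\quad\text{subject to}\quad
\sum_{h,t} p(h,t)\, f_i(h,t) = \tilde{E}[f_i], \;\; i = 1,\dots,K

whose solution takes the exponential form p(h,t) = \frac{1}{Z} \prod_{i=1}^{K} \alpha_i^{\,f_i(h,t)}.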

Indicator Functions
Suppose we have a tagging task, where we want to assign a tag t to a word w based on the 'context' h of w (the words around w, including w itself). In other words, we want to compute p(h, t). We will specify a set of K features in the form of binary-valued indicator functions f_i(h, t).
Example:
f_1(h, t) = 1 if the preceding word in h is "to" and t = "VB"
          = 0 otherwise
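A minimal sketch of this indicator function in Python. Representing the context h as a dict with a 'prev_word' key is an assumption made for illustration, not something specified on the slide:

def f1(h, t):
    """Fires when the previous word is 'to' and the proposed tag is VB."""
    # h: context, assumed here to be a dict such as {'prev_word': 'to', 'word': 'run'}
    # t: candidate tag for the current word
    return 1 if h.get('prev_word') == 'to' and t == 'VB' else 0

# Example: f1({'prev_word': 'to', 'word': 'run'}, 'VB') -> 1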

POS Features
(slide figure from Speech and Language Processing - Jurafsky and Martin)

Sentiment Features
(slide figure from Speech and Language Processing - Jurafsky and Martin)

A Log-linear Model
We will use a log-linear model
  p(h, t) = (1/Z) Π_{i=1..K} α_i^{f_i(h, t)}
where α_i is the weight for feature i, and Z is a normalizing constant.
If α_i > 1, the feature makes the outcome t more likely;
if α_i < 1, the feature makes the outcome t less likely.
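A small sketch of how this probability could be computed. The names features, alphas, and tags are hypothetical; features is a list of binary indicator functions like f1 above and alphas is a parallel list of weights. Here Z normalizes over the candidate tags for a fixed context (the conditional form actually used for classification later); the slide's Z normalizes the joint, but the argmax over tags is the same:

def p(h, t, features, alphas, tags):
    """Log-linear p(t | h) over a fixed set of candidate tags."""
    def unnorm(tag):
        score = 1.0
        for f, alpha in zip(features, alphas):
            if f(h, tag):          # binary feature fires
                score *= alpha     # multiply in alpha_i^1
        return score
    Z = sum(unnorm(tag) for tag in tags)   # normalizing constant
    return unnorm(t) / Z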

Logistic Regression Model
(slide figure from Speech and Language Processing - Jurafsky and Martin)

The goal of the learning procedure is to determine the values of the α_i's so that the expected value of each f_i under the model,
  Σ_{h,t} p(h, t) f_i(h, t),
is equal to its average value over the training set of N words (whose contexts are h_1, ..., h_N and whose tags are t_1, ..., t_N):
  (1/N) Σ_{j=1..N} f_i(h_j, t_j)
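A sketch of checking this constraint empirically, reusing the hypothetical p(), features, alphas, and tags from the earlier sketches; training_data is an assumed list of (h, t) pairs, and the model expectation is written for the conditional form of the model used above:

def empirical_average(f, training_data):
    """(1/N) sum_j f(h_j, t_j) over the training set."""
    return sum(f(h, t) for h, t in training_data) / len(training_data)

def model_expectation(f, training_data, features, alphas, tags):
    """Expected value of f under the conditional model, averaged over training contexts."""
    N = len(training_data)
    return sum(p(h, t, features, alphas, tags) * f(h, t)
               for h, _ in training_data for t in tags) / N

# At convergence, model_expectation(f_i, ...) ~= empirical_average(f_i, ...) for every feature f_i.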

Sentiment Features with Weights
(slide figure from Speech and Language Processing - Jurafsky and Martin; example feature weights shown: 1.9, 0.9, 0.7, -0.8)

Training
Training a ME model involves finding the α_i's. Unlike HMM training, there is no closed-form solution; an iterative solver is required. The first ME packages used generalized iterative scaling. Faster solvers such as BFGS and L-BFGS are now available.
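As a concrete (hedged) illustration, scikit-learn's LogisticRegression fits this kind of regularized log-linear model with L-BFGS; the feature dicts and tags below are made up for the example:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one feature dict per token, plus its gold tag.
X_dicts = [{'prev_word': 'to', 'word': 'run'},
           {'prev_word': 'the', 'word': 'run'}]
y = ['VB', 'NN']

vec = DictVectorizer()                 # turns feature dicts into indicator vectors
X = vec.fit_transform(X_dicts)

clf = LogisticRegression(solver='lbfgs', C=1.0)   # L-BFGS solver, L2 regularization
clf.fit(X, y)

print(clf.predict_proba(vec.transform([{'prev_word': 'to', 'word': 'run'}])))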

Overfitting and Regularization
If a feature appears only a few times, and by chance each time with the same outcome, it will get a high weight, possibly leading to poor predictions on the test data.
- This is an example of overfitting: not enough data to train many features.
- A simple solution is a threshold: a minimum count of a feature-outcome pair.
- A fancier approach is regularization: favoring solutions with smaller weights, even if the result is not as good a fit to the training data.
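Writing w_i = log α_i (a reparameterization assumed here for presentation; the slides keep the α_i form), the L2-regularized training objective trades data fit against weight size:

\max_{w}\;\sum_{j=1}^{N} \log p_w(t_j \mid h_j)\;-\;\frac{\lambda}{2}\sum_{i=1}^{K} w_i^{2}

Larger λ pushes weights toward zero, which is what counteracts the rare-feature overfitting described above.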

Using MaxEnt
MaxEnt is typically used as a multi-class classifier. We are given a set of training data, where each datum is labeled with a set of features and a class (tag). Each feature-class pair constitutes an indicator function. We train a classifier on this data, computing the α's. We can then classify new data by selecting the class (tag) which maximizes p(h, t).

Using MaxEnt
Typical training data format:
f1 f2 f3 … outcome
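For example (a hypothetical line in this format, with invented feature names, not taken from the slides), one token of POS training data might look like:

prev=to  word=run  allcaps=0  VB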

Discriminative Models: Maximum Entropy Markov Models (MEMMs)
- Exponential model
- Given training set X with label sequence Y: train a model θ that maximizes P(Y|X, θ)
- For a new data sequence x, the predicted label y maximizes P(y|x, θ)
- Notice the per-state normalization

MEMM: Maximum Entropy Markov Model
- a type of Hidden Markov Model (a sequence model)
- next-state probabilities P(t_i | t_{i-1}, w_i) computed by a MaxEnt model
- the MaxEnt model has access to the entire sentence, but only to the immediate prior state (not to earlier states): a first-order HMM
- use Viterbi for tagging; time is still O(s²n), but with a larger factor for the MaxEnt evaluation
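A minimal Viterbi sketch for MEMM decoding. Here local_prob(prev_tag, word, tag) stands in for the MaxEnt next-state model P(t_i | t_{i-1}, w_i); local_prob, the tag set, and the START symbol are assumed names, not part of the slides:

def viterbi_memm(words, tags, local_prob, start='START'):
    """Find the most likely tag sequence under an MEMM.

    local_prob(prev_tag, word, tag) returns P(tag | prev_tag, word).
    """
    # best[t] = (score of best path ending in tag t, that path as a list of tags)
    best = {t: (local_prob(start, words[0], t), [t]) for t in tags}
    for word in words[1:]:
        new_best = {}
        for t in tags:
            # choose the predecessor tag that maximizes the extended path score
            prev, (score, path) = max(
                ((p, best[p]) for p in tags),
                key=lambda item: item[1][0] * local_prob(item[0], word, t))
            new_best[t] = (score * local_prob(prev, word, t), path + [t])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])[1]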

Feature Engineering
The main task when using a MaxEnt classifier is to select an appropriate set of features:
- words in the immediate neighborhood are typical basic features: w_{i-1}, w_i, w_{i+1}
- patterns constructed for rule-based taggers are likely candidates: e.g., w_{i+1} is an initial
- membership on word lists: e.g., w_i is a common first name (from the Census)
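A sketch of a feature extractor producing such features as a dict (compatible with the DictVectorizer sketch above); the name list is a tiny hypothetical stand-in for the Census list:

COMMON_FIRST_NAMES = {'john', 'mary'}   # hypothetical stand-in for the Census list

def features(sentence, i):
    """Feature dict for token i of a tokenized sentence."""
    w = sentence[i]
    return {
        'w_prev': sentence[i - 1] if i > 0 else '<S>',
        'w_curr': w,
        'w_next': sentence[i + 1] if i + 1 < len(sentence) else '</S>',
        # next token looks like an initial, e.g. "J."
        'next_is_initial': i + 1 < len(sentence)
                           and len(sentence[i + 1]) == 2
                           and sentence[i + 1][1] == '.',
        'common_first_name': w.lower() in COMMON_FIRST_NAMES,
    }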

FE and Log-linear Models
- A MaxEnt model combines features multiplicatively
- You may want to include the conjunction of features as a separate feature
- e.g., treat bigrams as separate features: w_{i-1} × w_i
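For instance, such a bigram conjunction feature could be added to the hypothetical feature dict sketched above like this:

feats = features(sentence, i)
feats['bigram'] = feats['w_prev'] + '|' + feats['w_curr']   # conjunction w_{i-1} x w_i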

Combining MaxEnt Classifiers
One can even treat the outputs of individual classifiers as features ("system combination"), potentially producing better performance than any individual classifier. Weight the systems based on:
- overall accuracy
- confidence of individual classifications (margin = probability of most likely class – probability of second most likely class)
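A small sketch of computing this margin from a classifier's predicted class probabilities (probs is an assumed 1-D array, such as one row of the predict_proba output above):

import numpy as np

def margin(probs):
    """Probability of the most likely class minus probability of the runner-up."""
    top_two = np.sort(probs)[-2:]        # two largest probabilities, ascending
    return top_two[1] - top_two[0]

# margin(np.array([0.7, 0.2, 0.1])) -> 0.5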

HMM vs MEMM
HMM: a generative model
[diagram: state chain s_{i-1} → s_i → s_{i+1}, with arrows from each state to the word it emits (w_i, w_{i+1})]

HMM vs MEMM
MEMM: a discriminative model
[diagram: state chain s_{i-1} → s_i → s_{i+1}, with arrows from each word (w_i, w_{i+1}) into its state, i.e. states are conditioned on the observations]

HMMs vs. MEMMs (II)
- α_t(s): in an HMM, the probability of producing o_1, ..., o_t and being in s at time t; in an MEMM, the probability of being in s at time t given o_1, ..., o_t.
- δ_t(s): in an HMM, the probability of the best path for producing o_1, ..., o_t and being in s at time t; in an MEMM, the probability of the best path that reaches s at time t given o_1, ..., o_t.
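The corresponding forward recurrences (standard forms consistent with these definitions, not spelled out on the slide) differ only in where the observation enters:

\text{HMM:}\quad \alpha_t(s) = P(o_t \mid s)\,\sum_{s'} \alpha_{t-1}(s')\,P(s \mid s')
\qquad
\text{MEMM:}\quad \alpha_t(s) = \sum_{s'} \alpha_{t-1}(s')\,P(s \mid s', o_t)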

MaxEnt vs. Neural Network
MaxEnt:
- simple form for combining inputs (log-linear)
- developer must define the set of features to be used as inputs
Neural Network:
- much richer form for combining inputs
- can use simpler inputs (in the limiting case, words)
- useful features are generated internally as part of training

CRF
- MEMMs are subject to label bias, particularly if there are states with only one outgoing arc
- This problem is avoided by conditional random fields (CRFs), but at a cost of higher training and decoding times
- Linear-chain CRFs reduce decoding time but still have high training times

Random Field

Conditional Random Fields (CRFs)
- CRFs have all the advantages of MEMMs without the label bias problem
- An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
- A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
- Undirected acyclic graph
- Allows some transitions to "vote" more strongly than others, depending on the corresponding observations

Definition of CRFs
- X is a random variable over data sequences to be labeled
- Y is a random variable over corresponding label sequences

Example of CRFs

Graphical comparison among HMMs, MEMMs and CRFs
[diagram comparing the graphical structures of an HMM, an MEMM, and a CRF]

Conditional Distribution
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the fundamental theorem of random fields is:
  p_θ(y | x) ∝ exp( Σ_{e∈E, k} λ_k f_k(e, y|_e, x) + Σ_{v∈V, k} μ_k g_k(v, y|_v, x) )
where:
- x is a data sequence
- y is a label sequence
- v is a vertex from the vertex set V = the set of label random variables
- e is an edge from the edge set E over V
- f_k and g_k are given and fixed; g_k is a Boolean vertex feature, f_k is a Boolean edge feature
- k is the number of features
- λ_k and μ_k are the parameters to be estimated
- y|_e is the set of components of y defined by edge e
- y|_v is the set of components of y defined by vertex v

Conditional Distribution (cont'd)
CRFs use the observation-dependent normalization Z(x) for the conditional distributions:
  p_θ(y | x) = (1/Z(x)) exp( Σ_{e∈E, k} λ_k f_k(e, y|_e, x) + Σ_{v∈V, k} μ_k g_k(v, y|_v, x) )
where Z(x) is a normalization over the data sequence x.
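For the common linear-chain case (a standard specialization stated here for concreteness, not taken from the slide), edge and vertex features can be folded into one set and Z(x) sums over all possible label sequences:

p_\lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{t=1}^{T}\sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t)\Big),
\qquad
Z(x) = \sum_{y'} \exp\Big(\sum_{t=1}^{T}\sum_{k} \lambda_k\, f_k(y'_{t-1}, y'_t, x, t)\Big)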