Maximum Entropy Models and Feature Engineering CSCI-GA.2591

Maximum Entropy Models and Feature Engineering CSCI-GA.2591 Ralph Grishman, NYU

So Far …
So far we have relied primarily on HMMs as our models for language phenomena:
- simple and fast to train and to use
- effective for POS tagging (one POS → one state)
- can be made effective for name tagging (can capture context) by splitting states
- but further splitting could lead to sparse data problems

We want …
We want a more flexible means of capturing our linguistic intuition that certain conditions lead to an increased likelihood of certain outcomes:
- that a name on a 'common first name' list increases the chance that this is the beginning of a person name
- that being in a sports story increases the chance of team (organization) names
Maximum entropy modeling (logistic regression) provides one mathematically well-founded method for combining such features in a probabilistic model.

Maximum Entropy
The features provide constraints on the model. We'd like a probability distribution which, outside of these constraints, is as uniform as possible -- one that has the maximum entropy among all models which satisfy these constraints.

Indicator Functions
Suppose we have a tagging task, where we want to assign a tag t to a word w based on the 'context' h of w (the words around w, including w itself). In other words, we want to compute p(h, t). We will specify a set of K features in the form of binary-valued indicator functions f_i(h, t).
Example:
f_1(h, t) = 1 if the preceding word in h is "to" and t = "VB"
          = 0 otherwise
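A minimal sketch of this indicator function in Python, assuming the context h is represented as a pair (words, i): the sentence plus the position of the word being tagged (this representation is an assumption, not part of the slides).

```python
def f1(h, t):
    """1 if the preceding word in h is 'to' and the proposed tag is 'VB', else 0."""
    words, i = h
    return 1 if i > 0 and words[i - 1] == "to" and t == "VB" else 0

# Example: the feature fires for "go" in "I want to go", but not for "want".
print(f1((["I", "want", "to", "go"], 3), "VB"))  # 1
print(f1((["I", "want", "to", "go"], 1), "VB"))  # 0
```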

A log-linear model
We will use a log-linear model
p(h, t) = (1/Z) ∏_{i=1..K} α_i^{f_i(h, t)}
where α_i is the weight for feature i, and Z is a normalizing constant.
If α_i > 1, the feature makes the outcome t more likely;
if α_i < 1, the feature makes the outcome t less likely.
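A sketch of this product form, assuming binary-valued features; for simplicity Z is computed here by normalizing over the tag set for a fixed context h (the form used at classification time), rather than over all (h, t) pairs.

```python
def unnormalized_score(alphas, features, h, t):
    score = 1.0
    for alpha, f in zip(alphas, features):
        score *= alpha ** f(h, t)   # alpha_i contributes only when f_i fires
    return score

def p(alphas, features, h, t, tagset):
    Z = sum(unnormalized_score(alphas, features, h, t2) for t2 in tagset)
    return unnormalized_score(alphas, features, h, t) / Z

# Tiny demo with one hypothetical feature (fires only for tag "VB") and weight 2:
feats = [lambda h, t: 1 if t == "VB" else 0]
print(p([2.0], feats, None, "VB", ["VB", "NN"]))   # 2/(2+1) ≈ 0.667
```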

The goal of the learning procedure is to determine the values of the α_i's so that the expected value of each f_i under the model,
Σ_{h,t} p(h, t) f_i(h, t),
is equal to its average value over the training set of N words (whose contexts are h_1, ..., h_N and whose tags are t_1, ..., t_N):
(1/N) Σ_{j=1..N} f_i(h_j, t_j)

Training
Training a ME model involves finding the α_i's. Unlike HMM training, there is no closed-form solution; an iterative solver is required. We seek weights that maximize the conditional log likelihood
Σ_{(x,y)} log p(y | x)
where x are the features, y the labels, and the sum is over the training corpus. The first ME packages used generalized iterative scaling. Faster solvers such as BFGS and L-BFGS are now available.
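One way to fit such a model in practice (not the lecture's own code): represent each training event as a dictionary of indicator features and let scikit-learn's logistic regression, with its L-BFGS solver, find the weights. The feature names and tags below are made-up toy data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

train_feats = [
    {"prev=to": 1, "curr=go": 1},
    {"prev=the": 1, "curr=go": 1},
]
train_tags = ["VB", "NN"]

vec = DictVectorizer()
X = vec.fit_transform(train_feats)          # sparse matrix of indicator features

clf = LogisticRegression(solver="lbfgs", max_iter=1000)
clf.fit(X, train_tags)

# Unseen feature names are simply ignored by the vectorizer at test time.
print(clf.predict(vec.transform([{"prev=to": 1, "curr=run": 1}])))  # e.g. ['VB']
```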

Overfitting and Regularization
If a feature appears only a few times, and by chance each time with the same outcome, it will get a high weight, possibly leading to poor predictions on the test data.
- This is an example of overfitting: not enough data to train many features.
- A simple solution is a threshold: a minimum count of a feature-outcome pair.
- A fancier approach is regularization: favoring solutions with smaller weights, even if the result is not as good a fit to the training data.
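A minimal sketch of the count-threshold idea, under the assumption that training data is a list of (feature_names, tag) pairs: only instantiate an indicator function for a (feature, outcome) pair seen at least min_count times.

```python
from collections import Counter

def select_indicator_functions(training, min_count=3):
    """Return the set of (feature name, tag) pairs frequent enough to keep."""
    counts = Counter(
        (name, tag) for names, tag in training for name in names
    )
    return {pair for pair, c in counts.items() if c >= min_count}
```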

Regularization
Regularization introduces an additional term R(α) into the objective function which penalizes large weights:
- L1 regularization uses a Manhattan distance norm (sum of the absolute values of the α_i)
- L2 regularization uses a Euclidean norm (sum of the squares of the α_i)
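With scikit-learn's logistic regression (one common MaxEnt implementation, used here only as an illustration), the regularizer is selected with the `penalty` argument and its strength with `C`; C is the inverse of the penalty weight, so a smaller C means stronger regularization.

```python
from sklearn.linear_model import LogisticRegression

l2_model = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs")
l1_model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")  # L1 needs this solver
```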

Using MaxEnt
MaxEnt is typically used as a multi-class classifier. We are given a set of training data, where each datum is labeled with a set of features and a class (tag). Each feature-class pair constitutes an indicator function. We train a classifier using this data, computing the α's. We can then classify new data by selecting the class (tag) which maximizes p(h, t).

Using MaxEnt
Typical training data format:
f1 f2 f3 … outcome
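A sketch of reading this format, assuming one training event per line, whitespace-separated, with the outcome in the last column (the file name is hypothetical).

```python
def read_events(path):
    events = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            *features, outcome = fields   # everything before the last field is a feature
            events.append((features, outcome))
    return events

# events = read_events("train.dat")   # [(['f1', 'f2', 'f3'], 'outcome'), ...]
```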

Maximum Entropy Markov Model
MEMM = Maximum Entropy Markov Model:
- a type of Hidden Markov Model (a sequence model)
- next-state probabilities P(t_i | t_{i-1}, w_i) computed by a MaxEnt model
- the MaxEnt model has access to the entire sentence, but only to the immediate prior state (not to earlier states): a first-order HMM
- use Viterbi for tagging
- time is still O(s²n), but with a larger factor for the MaxEnt evaluation
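A sketch of Viterbi decoding for an MEMM, assuming the trained MaxEnt model is wrapped in a hypothetical function p_next(prev_tag, word, tag) = P(tag | prev_tag, word). The loops make the O(s²n) cost explicit; the p_next call is the larger per-cell factor mentioned above.

```python
import math

def viterbi(words, tagset, p_next, start="<s>"):
    # best[tag] = (log prob of the best path ending in tag, that path)
    best = {start: (0.0, [])}
    for w in words:
        new_best = {}
        for t in tagset:
            score, path = max(
                (prev_score + math.log(p_next(prev_t, w, t)), prev_path + [t])
                for prev_t, (prev_score, prev_path) in best.items()
            )
            new_best[t] = (score, path)
        best = new_best
    return max(best.values())[1]   # tag sequence of the highest-scoring path
```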

Feature Engineering
The main task when using a MaxEnt classifier is to select an appropriate set of features:
- words in the immediate neighborhood are typical basic features: w_{i-1}, w_i, w_{i+1}
- patterns constructed for rule-based taggers are likely candidates: w_{i+1} is an initial
- membership on word lists: w_i is a common first name (from the Census)
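A sketch of these kinds of features for position i in a sentence; the tiny name list below is only a stand-in for the Census list, and the feature names are made up.

```python
COMMON_FIRST_NAMES = {"john", "mary", "robert"}   # hypothetical, tiny sample

def token_features(words, i):
    feats = {
        "w_prev=" + (words[i - 1] if i > 0 else "<s>"): 1,
        "w_curr=" + words[i]: 1,
        "w_next=" + (words[i + 1] if i + 1 < len(words) else "</s>"): 1,
    }
    nxt = words[i + 1] if i + 1 < len(words) else ""
    if len(nxt.rstrip(".")) == 1 and nxt[0].isupper():
        feats["next_is_initial"] = 1          # e.g. the "B." in "John B. Smith"
    if words[i].lower() in COMMON_FIRST_NAMES:
        feats["common_first_name"] = 1
    return feats
```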

FE and log-linear models
The MaxEnt model combines features multiplicatively, so you may want to include the conjunction of features as a separate feature:
- treat bigrams as separate features: w_{i-1} × w_i
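A small sketch of such a conjunction feature: since the model only multiplies per-feature weights, the pair (w_{i-1}, w_i) is given its own indicator (the feature name scheme is an assumption).

```python
def add_bigram_feature(feats, words, i):
    prev = words[i - 1] if i > 0 else "<s>"
    feats["bigram=" + prev + "_" + words[i]] = 1   # conjunction of two word features
    return feats
```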

Combining MaxEnt classifiers
One can even treat the output of individual classifiers as features ("system combination"), potentially producing better performance than any individual classifier.
- Weight the systems based on their overall accuracy and on the confidence of individual classifications (margin = probability of the most likely class – probability of the second most likely class).
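A sketch of that margin, assuming each classifier reports a dictionary mapping classes to probabilities.

```python
def margin(probs):
    """Probability of the most likely class minus that of the second most likely."""
    top_two = sorted(probs.values(), reverse=True)[:2]
    return top_two[0] - (top_two[1] if len(top_two) > 1 else 0.0)

print(margin({"PER": 0.7, "ORG": 0.2, "LOC": 0.1}))   # 0.5
```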

HMM vs MEMM
HMM: a generative model
[diagram: a chain of states s_{i-1} → s_i → s_{i+1}, with each state s_i generating its word w_i]

HMM vs MEMM
MEMM: a discriminative model
[diagram: a chain of states s_{i-1} → s_i → s_{i+1}, with each state s_i conditioned on the previous state and on the current word w_i]

MaxEnt vs. Neural Network
MaxEnt:
- simple form for combining inputs (log linear)
- developer must define the set of features to be used as inputs
Neural Network:
- much richer form for combining inputs
- can use simpler inputs (in the limiting case, words)
- useful features are generated internally as part of training

CRF
- MEMMs are subject to label bias, particularly if there are states with only one outgoing arc.
- This problem is avoided by conditional random fields (CRFs), but at the cost of higher training and decoding times.
- Linear-chain CRFs reduce decoding time but still have high training times.