
1 LING 696B: Maximum-Entropy and Random Fields

2 Review: two worlds
Statistical models and OT seem to ask different questions about learning.
UG: what is possible/impossible?
- Hard-coded generalizations
- Combinatorial optimization (sorting)
Statistical: among the things that are possible, what is likely/unlikely?
- Soft-coded generalizations
- Numerical optimization
Marriage of the two?

3 Review: two worlds
OT: relate possible/impossible patterns in different languages through constraint reranking
Stochastic OT: consider a distribution over all possible grammars to generate variation
Today: model the frequency of input/output pairs (among the possible ones) directly, using a powerful model

4 Maximum entropy and OT
Imaginary data:

  /bap/   P(.)   *[+voice]   Ident(#voi)
  bab     .5     2
  pap     .5                 1

Stochastic OT: let *[+voice] >> Ident(voice) and Ident(voice) >> *[+voice], each 50% of the time
Maximum Entropy (using positive weights):
  p([bab]|/bap/) = (1/Z) exp{-(2*w1)}
  p([pap]|/bap/) = (1/Z) exp{-(w2)}
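A minimal sketch (not from the slides) of how these candidate probabilities are computed from violation vectors and weights; the specific weight values are illustrative assumptions:

import math

# Hypothetical violation profiles for the two candidates of /bap/,
# as (count of *[+voice], count of Ident) from the tableau above.
violations = {"bab": (2, 0), "pap": (0, 1)}

# Illustrative non-negative constraint weights (w1 for *[+voice], w2 for Ident).
w = (1.0, 2.0)

def harmony(viol, w):
    """Weighted sum of violations; higher means a worse candidate."""
    return sum(v * wk for v, wk in zip(viol, w))

# Unnormalized scores exp{-harmony}, then normalize by Z.
scores = {cand: math.exp(-harmony(v, w)) for cand, v in violations.items()}
Z = sum(scores.values())
probs = {cand: s / Z for cand, s in scores.items()}
print(probs)  # {'bab': 0.5, 'pap': 0.5} whenever 2*w1 == w2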

5 Maximum entropy
Why have Z?
- Need it to be a conditional distribution: p([bab]|/bap/) + p([pap]|/bap/) = 1
- So Z = exp{-(2*w1)} + exp{-(w2)} (the same for all candidates) -- called a normalization constant
- Z can quickly become difficult to compute when the number of candidates is large
- Very similar proposal in Smolensky, 86
How to get w1, w2?
- Learned from data (by calculating gradients)
- Need: frequency counts, violation vectors (same as stochastic OT)
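A self-contained learning sketch under the same toy data: gradient ascent on the conditional log-likelihood, where the gradient for each weight is the model-expected minus the observed violation count. The starting weights, learning rate, and iteration count are illustrative assumptions:

import math

observed = {"bab": 0.5, "pap": 0.5}            # observed relative frequencies (the imaginary data)
violations = {"bab": (2, 0), "pap": (0, 1)}    # (*[+voice], Ident) violation counts

def predict(w):
    scores = {c: math.exp(-sum(v * wk for v, wk in zip(viol, w)))
              for c, viol in violations.items()}
    Z = sum(scores.values())                   # the normalization constant
    return {c: s / Z for c, s in scores.items()}

# Gradient of the log-likelihood w.r.t. w_k: E_model[violations_k] - E_observed[violations_k]
w = [3.0, 1.0]                                 # deliberately poor starting weights
eta = 0.5
for _ in range(200):
    p = predict(w)
    for k in range(len(w)):
        expected = sum(p[c] * violations[c][k] for c in violations)
        empirical = sum(observed[c] * violations[c][k] for c in violations)
        w[k] += eta * (expected - empirical)
print(w, predict(w))   # converges to weights with 2*w1 == w2, reproducing the 50/50 split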

6 Maximum entropy
Why the exp{.}? It is like taking a maximum, but “soft” -- easy to differentiate and optimize
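A small illustrative example (not from the slides) of the "soft maximum": as the weights are scaled up, the exp{-harmony} distribution concentrates on the single best candidate, approaching a hard, OT-like choice. The harmony values are made up:

import math

def softmax_probs(harmonies):
    """exp{-harmony}, normalized: a 'soft' version of picking the lowest-penalty candidate."""
    scores = [math.exp(-h) for h in harmonies]
    Z = sum(scores)
    return [s / Z for s in scores]

harmonies = [2.0, 1.0]                 # candidate penalties (weighted violation counts)
for scale in (1, 2, 10):
    print(scale, softmax_probs([scale * h for h in harmonies]))
# As the scale grows, almost all probability ends up on the lowest-penalty candidate.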

7 Maximum entropy and OT
Inputs are violation vectors: e.g. x = (2,0) and (0,1)
Outputs are one of K winners -- essentially a classification problem
Violating a constraint works against the candidate: prob ~ exp{-(x1*w1 + x2*w2)}
Crucial difference: candidates are ordered by a single score, not by lexicographic order

  /bap/   P(.)   *[+voice]   Ident(voice)
  bab     .5     2
  pap     .5                 1

8 Maximum entropy
Ordering discrete outputs from input vectors is a common problem:
- Also called logistic regression (recall Nearey)
Explaining the name: let P = p([bab]|/bap/); then
  log[P/(1-P)] = w2 - 2*w1
The right-hand side is a linear regression; the left-hand side is the logistic (log-odds) transform.
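A worked version of that step, using the two-candidate probabilities and the Z from slide 5, written out in LaTeX notation:

\[
P = \frac{e^{-2w_1}}{Z}, \qquad 1-P = \frac{e^{-w_2}}{Z}, \qquad Z = e^{-2w_1} + e^{-w_2}
\]
\[
\log\frac{P}{1-P} = \log\frac{e^{-2w_1}}{e^{-w_2}} = -2w_1 + w_2 = w_2 - 2w_1 .
\]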

9 The power of Maximum Entropy
Max-Ent/logistic regression is widely used in many areas with interacting, correlated inputs
- Recall Nearey: phones, diphones, ...
- NLP: tagging, labeling, parsing ... (anything with a discrete output)
Easy to learn: there is only a global maximum, and optimization is efficient
Isn't this the greatest thing in the world?
- Need to understand the story behind the exp{} (in a few minutes)

10 Demo: Spanish diminutives
Data from Arbisi-Kelm
Constraints: ALIGN(TE,Word,R), MAX-OO(V), DEP-IO and BaseTooLittle

11 Stochastic OT and Max-Ent Is better fit always a good thing?

12 Stochastic OT and Max-Ent Is better fit always a good thing? Should model-fitting become a new fashion in phonology?

13 The crucial difference
What are the possible distributions of p(.|/bap/) in this case?

  /bap/   P(.)   *[+voice]   Ident(voice)
  bab            2
  pap                        1
  bap            1
  pab            1           1

14 The crucial difference
What are the possible distributions of p(.|/bap/) in this case?
Max-Ent considers a much wider range of distributions

  /bap/   P(.)   *[+voice]   Ident(voice)
  bab            2
  pap                        1
  bap            1
  pab            1           1

15 What is Maximum Entropy anyway?
Jaynes, 57: the most ignorant state corresponds to the distribution with the most entropy
Given a die, which distribution has the largest entropy?

16 What is Maximum Entropy anyway?
Jaynes, 57: the most ignorant state corresponds to the distribution with the most entropy
Given a die, which distribution has the largest entropy?
Add constraints to the distribution: the average of some feature functions is assumed to be fixed at its observed value:
  Σ_x p(x) * f_k(x) = (1/N) Σ_i f_k(x_i)   <-- observed value
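A small illustrative sketch (not from the slides) of the die example with one constraint: find the maximum-entropy distribution over the six faces whose mean is pinned to a hypothetical observed value of 4.5. Anticipating the exponential form stated on the next slide, the solution is p(x) ~ exp{lam*x}, so we only need to solve for the single multiplier lam:

import math

faces = [1, 2, 3, 4, 5, 6]
target_mean = 4.5   # hypothetical observed average of the feature f(x) = x

def mean_under(lam):
    """Mean of the tilted distribution p(x) proportional to exp(lam * x)."""
    weights = [math.exp(lam * x) for x in faces]
    Z = sum(weights)
    return sum(x * w for x, w in zip(faces, weights)) / Z

# Bisection on lam: mean_under() increases monotonically from 1 to 6 as lam grows.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean_under(mid) < target_mean:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2
Z = sum(math.exp(lam * x) for x in faces)
p = [math.exp(lam * x) / Z for x in faces]
print(lam, p)   # with no constraint (lam = 0) this would be the uniform 1/6 distribution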

17 What is Maximum Entropy anyway?
Examples of features: violations, word counts, N-grams, co-occurrences, ...
The constraints change the shape of the maximum-entropy distribution
Solve the constrained optimization problem
This leads to p(x) ~ exp{Σ_k w_k * f_k(x)}
Very general (see later), many choices of f_k
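A sketch of why the solution takes this exponential form (the standard Lagrange-multiplier argument, not spelled out on the slide), in LaTeX notation:

\[
\max_{p} \; -\sum_x p(x)\log p(x)
\quad \text{s.t.} \quad
\sum_x p(x) f_k(x) = \bar{f}_k \;\; (k = 1,\dots,K),
\qquad \sum_x p(x) = 1
\]
Setting the derivative of the Lagrangian with respect to each \(p(x)\) to zero:
\[
-\log p(x) - 1 + \sum_k w_k f_k(x) + \mu = 0
\;\;\Rightarrow\;\;
p(x) = \frac{1}{Z}\exp\Big\{\sum_k w_k f_k(x)\Big\},
\qquad Z = \sum_x \exp\Big\{\sum_k w_k f_k(x)\Big\}.
\]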

18 The basic intuition
Begin as "ignorant" as possible (with maximum entropy), so long as the chosen distribution matches certain "descriptions" of the empirical data (the statistics of the f_k(x))
Approximation property: any distribution can be approximated by a max-ent distribution with a sufficient number of features (Cramér and Wold)
Common practice in NLP
This is better seen as a "descriptive" model

19 Going towards Markov random fields
Maximum entropy applied to conditional/joint distributions: p(y|x) or p(x,y) ~ exp{Σ_k w_k * f_k(x,y)}
There can be many creative ways of extracting features f_k(x,y)
One way is to let a graph structure guide the calculation of features, e.g. neighborhoods/cliques
Known as a Markov network / Markov random field

20 Conditional random field
Impose a chain-structured graph and assign features to the edges
Still a max-ent model, same calculation
- f(x_i, y_i): features on the edges linking each input position to its label
- m(y_i, y_i+1): features on the edges linking adjacent labels
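A minimal brute-force sketch of the chain computation (the feature functions, segments, and weights are all made up for illustration; real CRFs compute Z with forward-backward dynamic programming rather than enumeration): score a label sequence by summing the f and m features along the chain, then normalize over all label sequences.

import itertools
import math

labels = ["p", "b", "a"]             # toy surface segments
x = ["b", "a", "p"]                  # toy underlying form

def f(x_i, y_i):
    """Node (faithfulness-like) feature: reward keeping the underlying segment."""
    return 1.0 if x_i == y_i else 0.0

def m(y_i, y_next):
    """Edge (markedness-like) feature: penalize adjacent voiced stops (toy example)."""
    return -1.0 if (y_i == "b" and y_next == "b") else 0.0

def score(x, y, w_f=1.0, w_m=1.0):
    s = sum(w_f * f(xi, yi) for xi, yi in zip(x, y))
    s += sum(w_m * m(y[i], y[i + 1]) for i in range(len(y) - 1))
    return s

# Brute-force normalization over all label sequences (fine at toy sizes).
all_y = list(itertools.product(labels, repeat=len(x)))
Z = sum(math.exp(score(x, y)) for y in all_y)
y_star = max(all_y, key=lambda y: score(x, y))
print(y_star, math.exp(score(x, y_star)) / Z)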

21 Wilson's idea
Isn't this a familiar picture in phonology?
- x: underlying form, y: surface form
- f(x_i, y_i) -- Faithfulness
- m(y_i, y_i+1) -- Markedness

22 The story of smoothing
In Max-Ent models the weights can get very large and "over-fit" the data (see demo)
It is common to penalize (smooth) this with a new objective function:
  new objective = old objective + parameter * magnitude of the weights
Wilson's claim: this smoothing parameter has to do with substantive bias in phonological learning
- Constraints that force less similarity --> a higher penalty for them to change value
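A sketch of the penalized objective, reusing the slide-5 toy data with an added L2 (Gaussian-prior) penalty. The per-constraint smoothing values sigma_k are illustrative assumptions; the idea that this parameter can differ across constraints is where a substantive bias could enter, but the numbers here are not Wilson's.

import math

# penalized objective = log-likelihood - sum_k w_k**2 / (2 * sigma_k**2)
observed = {"bab": 0.5, "pap": 0.5}
violations = {"bab": (2, 0), "pap": (0, 1)}   # (*[+voice], Ident) violation counts
sigmas = [1.0, 0.5]                            # heavier smoothing on the second constraint

def predict(w):
    scores = {c: math.exp(-sum(v * wk for v, wk in zip(viol, w)))
              for c, viol in violations.items()}
    Z = sum(scores.values())
    return {c: s / Z for c, s in scores.items()}

w = [3.0, 1.0]
eta = 0.1
for _ in range(500):
    p = predict(w)
    for k in range(len(w)):
        expected = sum(p[c] * violations[c][k] for c in violations)
        empirical = sum(observed[c] * violations[c][k] for c in violations)
        grad = (expected - empirical) - w[k] / sigmas[k] ** 2
        w[k] += eta * grad
print(w, predict(w))   # with the 50/50 toy data, the penalty pulls both weights back toward zero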

23 Wilson’s model fitting to the velar palatalization data