600.465 - Intro to NLP - J. Eisner: A MAXENT viewpoint


Intro to NLP - J. Eisner, slide 1: A MAXENT viewpoint

Intro to NLP - J. Eisner, slide 2: Information-Theoretic Scheme: the MAXENT Principle
- Input: given statistical correlation or lineal-path functions. Obtain: microstructures that satisfy the given properties.
- Constraints are viewed as expectations of features over a random field.
- The problem is viewed as finding the distribution whose ensemble properties match those that are given.
- Since the problem is ill-posed, we choose the distribution that has the maximum entropy.
- Additional statistical information is available using this scheme.

Intro to NLP - J. Eisner, slide 3: The MAXENT Principle
The principle of maximum entropy (MAXENT) states that, amongst the probability distributions that satisfy our incomplete information about the system, the probability distribution that maximizes entropy is the least-biased estimate that can be made. It agrees with everything that is known but carefully avoids anything that is unknown. (E. T. Jaynes, 1957)
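Stated as an optimization (a standard formulation, not spelled out on the slide), the principle picks, from all distributions consistent with the known expectations, the one of maximum entropy:

```latex
\hat{p} \;=\; \arg\max_{p} H(p)
        \;=\; \arg\max_{p} \Bigl( -\sum_{x} p(x)\,\log p(x) \Bigr)
\quad\text{subject to}\quad
\sum_{x} p(x) = 1
\quad\text{and}\quad
\mathbb{E}_{p}[f_i] \;=\; \sum_{x} p(x)\, f_i(x) \;=\; \bar{f}_i
\quad\text{for each known feature } f_i .
```

Here each f_i(x) is a feature of the outcome x and \bar{f}_i is its known value, for instance an empirically observed fraction; these are the "constraints viewed as expectations of features" from the previous slide.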

Intro to NLP - J. Eisner, slide 4: Maximum Entropy
- Suppose there are 10 classes, A through J.
- I don't give you any other information.
- Question: Given message m, what is your guess for p(C | m)?
- Suppose I tell you that 55% of all messages are in class A.
- Question: Now what is your guess for p(C | m)?
- Suppose I also tell you that 10% of all messages contain Buy, and 80% of these are in class A or C.
- Question: Now what is your guess for p(C | m), if m contains Buy?
- OUCH!

Intro to NLP - J. Eisner, slide 5: Maximum Entropy
[Table: columns are the classes A-J; rows are messages containing Buy vs. Other messages]
- Column A sums to 0.55 ("55% of all messages are in class A")

Intro to NLP - J. Eisner, slide 6: Maximum Entropy
[Table: columns A-J; rows Buy / Other]
- Column A sums to 0.55
- Row Buy sums to 0.1 ("10% of all messages contain Buy")

Intro to NLP - J. Eisner, slide 7: Maximum Entropy
[Table: columns A-J; rows Buy / Other]
- Column A sums to 0.55
- Row Buy sums to 0.1
- The (Buy, A) and (Buy, C) cells sum to 0.08 ("80% of the 10%")
- Given these constraints, fill in the cells "as equally as possible": maximize the entropy (related to cross-entropy, perplexity). Entropy = -Σ p log p, summed over all cells; it is largest if the probabilities are evenly distributed.
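A quick sketch of the last claim (not from the slides): the entropy of an evenly spread distribution is larger than that of a skewed one.

```python
# Entropy of a few distributions over four cells: even spread wins.
import math

def entropy(probs):
    """H(p) = -sum p log2 p, in bits; zero-probability cells contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: evenly spread
print(entropy([0.70, 0.10, 0.10, 0.10]))  # about 1.36 bits: skewed
print(entropy([1.00, 0.00, 0.00, 0.00]))  # 0.0 bits: all mass on one cell
```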

Intro to NLP - J. Eisner, slide 8: Maximum Entropy
[Table: columns A-J; rows Buy / Other, now filled in]
- Column A sums to 0.55
- Row Buy sums to 0.1
- The (Buy, A) and (Buy, C) cells sum to 0.08 ("80% of the 10%")
- Given these constraints, fill in the cells "as equally as possible": maximize the entropy
- Now p(Buy, C) = .029 and p(C | Buy) = .29
- We got a compromise: p(C | Buy) < p(A | Buy) < .55

Intro to NLP - J. Eisner, slide 9: Generalizing to More Features
[Table: columns A, B, C, D, E, F, G, H, …; rows now distinguish messages containing Buy vs. Other, and messages priced < Rs.100 vs. Other]

Intro to NLP - J. Eisner, slide 10: What we just did
- For each feature ("contains Buy"), see what fraction of the training data has it.
- Many distributions p(c, m) would predict these fractions (including the unsmoothed one, where all mass goes to feature combinations we've actually seen).
- Of these, pick the distribution that has max entropy.
- Amazing Theorem: this distribution has the form p(m, c) = (1/Z(λ)) exp(Σ_i λ_i f_i(m, c)), so it is log-linear. In fact it is the same log-linear distribution that maximizes Π_j p(m_j, c_j) as before!
- This gives another motivation for our log-linear approach.
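A minimal sketch of that connection (not Eisner's code; the message types and the three feature functions are invented for illustration): a joint log-linear model p(m, c) = exp(Σ_i λ_i f_i(m, c)) / Z(λ), fit by gradient ascent on log-likelihood. The gradient for each weight is the observed training fraction of the feature minus the model's expected value, so at convergence the model reproduces the training fractions, which are exactly the constraints in the maximum-entropy view.

```python
# Sketch: fit p(m, c) = exp(sum_i lambda_i * f_i(m, c)) / Z(lambda) over a tiny space.
import math
import itertools

MESSAGES = ["buy", "other"]       # hypothetical message types
CLASSES = list("ABCDEFGHIJ")      # the 10 classes from the slides

def features(m, c):
    """Binary features f_i(m, c); the particular features are illustrative."""
    return [
        1.0 if c == "A" else 0.0,                        # class is A
        1.0 if m == "buy" else 0.0,                      # message contains Buy
        1.0 if m == "buy" and c in ("A", "C") else 0.0,  # contains Buy and class in {A, C}
    ]

def distribution(lam):
    """p(m, c) for every cell of the table, normalized by Z(lambda)."""
    score = {(m, c): math.exp(sum(l * f for l, f in zip(lam, features(m, c))))
             for m, c in itertools.product(MESSAGES, CLASSES)}
    Z = sum(score.values())
    return {cell: s / Z for cell, s in score.items()}

def fit(train, steps=2000, lr=0.5):
    """Gradient ascent on average log-likelihood.

    d/d lambda_i = (observed fraction of f_i) - (expected f_i under the model),
    so at convergence the model matches the observed feature fractions.
    """
    k = len(features(*train[0]))
    observed = [sum(features(m, c)[i] for m, c in train) / len(train) for i in range(k)]
    lam = [0.0] * k
    for _ in range(steps):
        p = distribution(lam)
        expected = [sum(p[cell] * features(*cell)[i] for cell in p) for i in range(k)]
        lam = [l + lr * (o - e) for l, o, e in zip(lam, observed, expected)]
    return lam

# Usage: lam = fit([("buy", "A"), ("other", "A"), ("other", "B"), ...])
```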

Intro to NLP - J. Eisner, slide 11: Overfitting
- If we have too many features, we can choose weights to model the training data perfectly.
- If we have a feature that only appears in spam training, not ling training, it will get weight +∞ to maximize p(spam | feature) at 1.
- These behaviors overfit the training data.
- They will probably do poorly on test data.
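To see why such a weight runs off to infinity (a standard argument, filled in here rather than taken from the slide), suppose the feature with weight λ fires only on (m, spam) pairs for messages seen in spam training and never fires on any ling training message. For a spam training message m where it fires,

```latex
p(\text{spam}\mid m) \;=\;
\frac{\exp\bigl(\lambda\cdot 1 + \text{other features' terms}\bigr)}
     {\exp\bigl(\lambda\cdot 1 + \text{other features' terms}\bigr)
      \;+\; \exp\bigl(\text{other features' terms for ling}\bigr)}
\;\longrightarrow\; 1
\quad\text{as } \lambda \to +\infty,
```

while the ling training messages, on which the feature never fires, are unaffected by λ; so the training likelihood keeps improving as λ grows without bound.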

Intro to NLP - J. Eisner, slide 12: Solutions to Overfitting
1. Throw out rare features. Require every feature to occur > 4 times, and > 0 times with ling, and > 0 times with spam.
2. Only keep 1000 features. Add one at a time, always greedily picking the one that most improves performance on held-out data.
3. Smooth the observed feature counts.
4. Smooth the weights by using a prior: max_λ p(λ | data) = max_λ p(λ, data) = p(λ) p(data | λ), where we decree p(λ) to be high when most weights are close to 0 (see the sketch after this list).
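One common concrete choice for option 4 (an assumption here; the slide does not name the prior) is an independent Gaussian prior with mean 0 and variance σ² on each weight λ_i, which makes the MAP objective an L2-regularized log-likelihood (up to an additive constant that does not affect the argmax):

```latex
\hat{\lambda} \;=\; \arg\max_{\lambda}\; p(\lambda)\, p(\text{data}\mid\lambda)
            \;=\; \arg\max_{\lambda}\;
            \Bigl( \sum_{j} \log p(m_j, c_j \mid \lambda)
                   \;-\; \sum_{i} \frac{\lambda_i^{2}}{2\sigma^{2}} \Bigr).
```

Large weights are penalized, so the spam-only feature from the previous slide no longer runs off to infinity; a smaller σ² means stronger smoothing toward the uniform, maximum-entropy solution.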