Maximum Entropy Model & Generalized Iterative Scaling
Arindam Bose
CS 621 – Artificial Intelligence
27th August, 2007

Statistical Classification
- In statistical classification problems, the task is to find the probability of a class "a" occurring with a context "b", i.e. P(a, b)
- The context depends on the nature of the task; e.g., in NLP tasks the context may consist of several words and their associated syntactic labels

Training Data
- Gather information from the training data
- A large training set will contain some information about the co-occurrence of a's and b's
- This information is never enough to completely specify P(a, b) for all possible (a, b) pairs

Task formulation
- Find a method that uses the sparse evidence about a's and b's to reliably estimate a probability model P(a, b)
- Principle of Maximum Entropy: the correct distribution P(a, b) is the one that maximizes entropy, or "uncertainty", subject to the constraints
- The constraints represent the evidence

The Philosophy
- Make inferences on the basis of partial information without biasing the assignment, which would amount to arbitrary assumptions of information that we do not have
- Maximize the entropy H(p) while remaining consistent with the evidence
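As a compact statement of the principle (the entropy expression was an image on the original slide; this is a reconstruction in the standard form):

```latex
p^{*} = \arg\max_{p \in \mathcal{P}} H(p),
\qquad
H(p) = -\sum_{a,b} p(a,b)\,\log p(a,b)
```

where \mathcal{P} is the set of distributions that satisfy the evidence (feature-expectation) constraints introduced on the next slide.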

Representing evidence
- Encode useful facts as features and impose constraints on the expectations of these features
- A feature is a binary-valued function, e.g.
  g(f, h) = 1 if current_token_capitalized(h) = true and f = location_start
          = 0 otherwise
- Given k features, the constraints have the form E_p[g_i] = E_p̃[g_i] for i = 1, …, k, i.e. the model's expectation of each feature should match the observed (empirical) expectation
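A minimal Python sketch of such a feature function (the context representation and the names below mirror the slide's example but are hypothetical):

```python
def g_capitalized_location_start(f, h):
    """Binary feature: fires when the current token of the context h is
    capitalized and the proposed class f is 'location_start'."""
    current_token = h["current_token"]            # h assumed to be a dict-like context
    is_capitalized = current_token[:1].isupper()
    return 1 if (is_capitalized and f == "location_start") else 0

# The feature fires only for the matching (class, context) combination.
print(g_capitalized_location_start("location_start", {"current_token": "Paris"}))  # 1
print(g_capitalized_location_start("other", {"current_token": "paris"}))           # 0
```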

Maximum Entropy Model
- The maximum entropy solution allows computation of P(f|h) for any f (a possible future/class) given any h (a possible history/context)
- The "history" or "context" is the conditioning data which enables the decision

The Model
- In the model produced by M.E. estimation, every feature g_i has an associated parameter α_i
- The conditional probability is calculated as
  P(f|h) = (1/Z(h)) · Π_i α_i^{g_i(f,h)},  where  Z(h) = Σ_f Π_i α_i^{g_i(f,h)}
- The M.E. estimation technique guarantees that for every feature g_i, the expected value of g_i according to the M.E. model will equal the empirical expectation of g_i in the training corpus
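A small sketch of this computation (function and variable names are illustrative; the product form follows the equation above):

```python
def conditional_prob(f, h, features, alphas, classes):
    """P(f|h) = (1/Z(h)) * prod_i alpha_i ** g_i(f, h), with Z(h) summing over all classes."""
    def unnormalized(cls):
        score = 1.0
        for g, alpha in zip(features, alphas):
            score *= alpha ** g(cls, h)            # alpha_i ** g_i(f, h)
        return score

    z = sum(unnormalized(cls) for cls in classes)  # Z(h): normalization constant
    return unnormalized(f) / z

# Toy usage with a single illustrative feature and weight.
features = [lambda f, h: 1 if (h["current_token"][:1].isupper() and f == "location_start") else 0]
alphas = [3.0]
classes = ["location_start", "other"]
print(conditional_prob("location_start", {"current_token": "Paris"}, features, alphas, classes))  # 0.75
```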

Generalized Iterative Scaling
- Generalized Iterative Scaling (GIS) finds the parameters α_i of the distribution P
- GIS requires the constraint that Σ_{i=1..k} g_i(f, h) = C for all (f, h), for some constant C
- If this does not hold, choose C = max_{f,h} Σ_{i=1..k} g_i(f, h) and add a correctional feature g_{k+1} such that g_{k+1}(f, h) = C - Σ_{i=1..k} g_i(f, h)
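A sketch of that correction step (here C is taken as the maximum over the training histories and candidate classes, which is one common choice; names are illustrative):

```python
def add_correction_feature(features, samples, classes):
    """Return (C, features + [g_correction]) so that the active-feature count of
    every (f, h) pair sums to the constant C required by GIS."""
    def feature_sum(f, h):
        return sum(g(f, h) for g in features)

    # C: maximum number of active features over classes and training histories.
    C = max(feature_sum(f, h) for (_, h) in samples for f in classes)

    def g_correction(f, h):
        return C - feature_sum(f, h)   # non-binary "slack" feature

    return C, features + [g_correction]
```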

The GIS procedure
- Initialize α_i^(0) = 1 for all i, then update on each iteration n:
  α_i^(n+1) = α_i^(n) · ( E_p̃[g_i] / E_{p^(n)}[g_i] )^{1/C}
- It can be proven that the probability sequence whose parameters are defined by this procedure converges to a unique and positive solution
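A compact Python sketch of the whole procedure under the assumptions above (binary features already augmented with the correction feature; model expectations use the training-sample approximation described two slides below; every feature is assumed to fire at least once in the training data):

```python
def train_gis(features, samples, classes, C, iterations=100):
    """Generalized Iterative Scaling (sketch).
    samples: list of observed (f, h) pairs; features: list of g_i(f, h) functions.
    Returns one weight alpha_i per feature."""
    n = len(samples)
    alphas = [1.0] * len(features)                          # alpha_i^(0) = 1

    # Empirical expectations: E_p~[g_i] = (1/N) * sum_j g_i(f_j, h_j)
    observed = [sum(g(f, h) for (f, h) in samples) / n for g in features]

    def cond_prob(f, h):
        def unnorm(cls):
            score = 1.0
            for g, a in zip(features, alphas):
                score *= a ** g(cls, h)
            return score
        z = sum(unnorm(cls) for cls in classes)
        return unnorm(f) / z

    for _ in range(iterations):
        # Model expectations, approximated over training histories only.
        model = [0.0] * len(features)
        for (_, h) in samples:
            for f in classes:
                p = cond_prob(f, h)
                for i, g in enumerate(features):
                    model[i] += p * g(f, h) / n
        # GIS update: alpha_i <- alpha_i * (E_p~[g_i] / E_p(n)[g_i]) ** (1/C)
        alphas = [a * (obs / mod) ** (1.0 / C)
                  for a, obs, mod in zip(alphas, observed, model)]
    return alphas
```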

Computation
- Each iteration requires computation of the empirical expectation E_p̃[g_i] and the model expectation E_{p^(n)}[g_i]
- Given the training sample {(f_1, h_1), …, (f_N, h_N)}, calculation of E_p̃[g_i] is straightforward
- The computation of the model's feature expectation E_{p^(n)}[g_i] can be intractable
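The straightforward part as a one-function sketch (just a relative frequency over the training sample; names are illustrative):

```python
def empirical_expectation(g, samples):
    """E_p~[g] = (1/N) * sum over observed pairs (f_j, h_j) of g(f_j, h_j)."""
    return sum(g(f, h) for (f, h) in samples) / len(samples)
```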

Computation of E_p[g_i]
- We have E_p[g_i] = Σ_{h,f} P(h, f) · g_i(f, h)
- By Bayes' rule, P(h, f) = P(h) · P(f|h)
- We use an approximation, summing over only the histories h_j that occur in the training sample rather than over all possible histories:
  E_p[g_i] ≈ (1/N) Σ_{j=1..N} Σ_f P(f|h_j) · g_i(f, h_j)
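A sketch of this approximation, assuming a cond_prob(f, h) function like the one sketched under "The Model":

```python
def model_expectation(g, samples, classes, cond_prob):
    """Approximate E_p[g] as (1/N) * sum_j sum_f P(f|h_j) * g(f, h_j),
    summing over training histories h_j rather than all possible histories."""
    total = 0.0
    for (_, h) in samples:            # the observed class is ignored; only h is used
        for f in classes:
            total += cond_prob(f, h) * g(f, h)
    return total / len(samples)
```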

Termination and Running Time
- Terminate after a fixed number of iterations (e.g. 100) or when the change in log-likelihood becomes negligible
- The running time of each iteration is dominated by the computation of the model expectations E_{p^(n)}[g_i], which is O(NPA), where N is the training set size, P is the number of possible classes, and A is the average number of features active for a (f, h) pair