Feature Selection & Maximum Entropy
Advanced Statistical Methods in NLP
Ling 572, January 26
Roadmap
- Feature selection and weighting
  - Feature weighting
  - Chi-square feature selection
  - Chi-square feature selection example
  - HW #4
- Maximum Entropy
  - Introduction: Maximum Entropy Principle
  - Maximum Entropy NLP examples
Feature Selection Recap
- Problem: curse of dimensionality
  - Data sparseness, computational cost, overfitting
- Solution: dimensionality reduction
  - New feature set r' s.t. |r'| < |r|
- Approaches (global & local):
  - Feature extraction: new features in r' are transformations of features in r
  - Feature selection:
    - Wrapper techniques
    - Feature scoring
Feature Weighting
For text classification, typical weights include:
- Binary: weights in {0,1}
- Term frequency (tf): # occurrences of t_k in document d_i
- Inverse document frequency (idf): df_k = # of docs in which t_k appears; N = # of docs
  - idf_k = log(N / (1 + df_k))
- tfidf = tf * idf
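A minimal sketch of these weights in Python, following the formulas above; the toy corpus and variable names are illustrative, not from the slides:

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (illustrative data).
docs = [["budget", "cuts", "vote"],
        ["game", "score", "goal"],
        ["budget", "vote", "vote"]]

N = len(docs)
# Document frequency df_k: number of documents containing term t_k.
df = Counter()
for d in docs:
    df.update(set(d))

def tfidf(doc):
    """tf*idf weights for one document, with idf_k = log(N / (1 + df_k))."""
    tf = Counter(doc)  # raw term frequency in this document
    return {t: tf[t] * math.log(N / (1 + df[t])) for t in tf}

print(tfidf(docs[0]))
```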
Chi Square
- Tests for presence/absence of a relation between random variables
- Bivariate analysis: tests 2 random variables
- Can test strength of the relationship
- (Strictly speaking) doesn't test direction
Chi Square Example (due to F. Xia)
Can gender predict shoe choice?
- A: male/female (features)
- B: shoe choice; classes: {sandal, sneaker, leather shoe, boot, other}
- Observed counts: rows Male/Female, columns sandal/sneaker/leather shoe/boot/other (table on the next slide)
Comparing Distributions (due to F. Xia)
Observed distribution (O):

            sandal  sneaker  leather shoe  boot  other  Total
  Male         6       17        ...        ...   ...     50
  Female      13        5        ...        ...   ...     50
  Total       19       22        ...        ...   ...    100

Expected distribution (E), computed from the row and column totals:

            sandal  sneaker  leather shoe  boot  other  Total
  Male        9.5      11        ...        ...   ...     50
  Female      9.5      11        ...        ...   ...     50
  Total       19       22        ...        ...   ...    100
Computing Chi Square
- Expected value for a cell = row_total * column_total / table_total
- X^2 = (6 - 9.5)^2 / 9.5 + (17 - 11)^2 / 11 + ... (summed over all cells)
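A small sketch of this computation in Python. The first two columns and the row totals match the table above; the remaining three columns are hypothetical filler added only so the example runs:

```python
def chi_square(observed):
    """X^2 statistic for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected count for cell (i, j)
            x2 += (o - e) ** 2 / e
    return x2

# Rows: Male, Female; columns: sandal, sneaker, leather shoe, boot, other.
# Only the first two columns come from the slides; the rest are illustrative.
observed = [[6, 17, 10, 12, 5],
            [13, 5, 14, 8, 10]]
print(chi_square(observed))
```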
Calculating X^2
- Tabulate the contingency table of observed values: O
- Compute row and column totals
- Compute the table of expected values from the row/column totals, assuming no association
- Compute X^2
For a 2x2 Table
O:

           !c_i    c_i
  !t_k      a       b
  t_k       c       d

E:

           !c_i            c_i             Total
  !t_k   (a+b)(a+c)/N    (a+b)(b+d)/N       a+b
  t_k    (c+d)(a+c)/N    (c+d)(b+d)/N       c+d
  Total      a+c             b+d             N
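For the 2x2 case the statistic also collapses to a closed form (a standard identity, not shown on the slides); as a sketch in LaTeX:

```latex
\chi^2(t_k, c_i) \;=\; \frac{N\,(ad - bc)^2}{(a+b)\,(c+d)\,(a+c)\,(b+d)}
```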
X^2 Test
- Test whether the random variables are independent
- Null hypothesis: the 2 R.V.s are independent
- Compute the X^2 statistic: X^2 = sum over cells of (O - E)^2 / E
- Compute degrees of freedom: df = (# rows - 1)(# cols - 1)
  - Shoe example: df = (2-1)(5-1) = 4
- Test the probability of the X^2 statistic value (X^2 table)
- If the probability is low (below some significance level), reject the null hypothesis
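As a sketch, the table lookup can be replaced by the chi-square survival function in SciPy; the statistic value below is a placeholder, since the full shoe table isn't given:

```python
from scipy.stats import chi2

x2_stat = 12.6                  # placeholder X^2 value for illustration
df = (2 - 1) * (5 - 1)          # shoe example: 2 rows, 5 columns -> df = 4
p_value = chi2.sf(x2_stat, df)  # P(X^2 >= x2_stat) under the null hypothesis

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject independence")
else:
    print(f"p = {p_value:.4f} >= {alpha}: cannot reject independence")
```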
Requirements for the X^2 Test
- Events assumed independent and identically distributed
- Outcomes must be mutually exclusive
- Use raw frequencies, not percentages
- Sufficient values per cell: expected counts > 5
X^2 Example
- Shared task evaluation: Topic Detection and Tracking (TDT)
- Sub-task: Topic Tracking
  - Given a small number of exemplar documents (1-4) that define a topic
  - Create a model that allows tracking of the topic, i.e. find all subsequent documents on this topic
- Exemplars: 1-4 newswire articles
Challenges
- Many news articles look alike
  - Create a profile (feature representation) that highlights terms strongly associated with the current topic and differentiates it from all other topics
- Not all documents are labeled
  - Only a small subset belong to topics of interest
  - Must differentiate from other topics AND 'background'
Approach
X^2 feature selection (one feature set per topic to be tracked; sketched below):
- Assume terms have a binary representation
- Positive class: term occurrences from the topic's exemplar docs
- Negative class: term occurrences from other classes' exemplars and 'earlier' uncategorized docs
- Compute X^2 for terms
- Retain the terms with the highest X^2 scores (keep top N terms)
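A rough sketch of this selection step, assuming binary term presence and two document sets (positive exemplars vs. negative/background docs); function and variable names are illustrative:

```python
def term_chi_square(term, pos_docs, neg_docs):
    """2x2 X^2 score for one term, using binary presence/absence."""
    a = sum(term in d for d in pos_docs)   # positive docs containing the term
    b = len(pos_docs) - a                  # positive docs without the term
    c = sum(term in d for d in neg_docs)   # negative docs containing the term
    d = len(neg_docs) - c                  # negative docs without the term
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def select_top_terms(pos_docs, neg_docs, top_n=50):
    """Keep the top-N terms by X^2 score for one topic."""
    vocab = set().union(*pos_docs)         # candidate terms from the exemplars
    scored = sorted(vocab,
                    key=lambda t: term_chi_square(t, pos_docs, neg_docs),
                    reverse=True)
    return scored[:top_n]

# Tiny illustrative usage with documents represented as sets of tokens.
pos = [{"election", "budget", "vote"}, {"budget", "tax", "vote"}]
neg = [{"game", "score"}, {"budget", "game"}, {"weather", "storm"}]
print(select_top_terms(pos, neg, top_n=3))
```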
Tracking Approach
- Build a vector space model
- Feature weighting: tf*idf with some modifications
- Distance measure: cosine similarity
- For each topic, select documents scoring above a threshold
- Result: improved retrieval
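A minimal sketch of the scoring step, assuming documents have already been mapped to sparse tf*idf weight dictionaries; the threshold value is illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as term -> weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)

def track(topic_profile, documents, threshold=0.2):
    """Return the documents whose similarity to the topic profile exceeds the threshold."""
    return [doc for doc in documents if cosine(topic_profile, doc) > threshold]
```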
HW #4
Topic: Feature selection for kNN
- Build a kNN classifier using Euclidean distance and cosine similarity
- Write a program to compute X^2 on a data set
- Use X^2 at different significance levels to filter features
- Compare the effects of different feature filtering on kNN classification
Maximum Entropy
- "MaxEnt": popular machine learning technique for NLP
- First uses in NLP circa 1996 (Rosenfeld, Berger)
- Applied to a wide range of tasks: sentence boundary detection (MxTerminator, Ratnaparkhi), POS tagging (Ratnaparkhi, Berger), topic segmentation (Berger), language modeling (Rosenfeld), prosody labeling, etc.
Readings & Comments
- Several readings: (Berger, 1996), (Ratnaparkhi, 1997), (Klein & Manning, 2003) tutorial
- Note: some of these are very 'dense'
  - Don't spend huge amounts of time on every detail
  - Take a first pass before class, review after lecture
- Going forward: techniques get more complex
  - Goal: understand the basic model and concepts
  - Training is especially complex; we'll discuss it, but not implement it
Notation Note
Notation is not entirely consistent across readings. We'll use: input = x; output = y; pair = (x,y)
- Consistent with Berger, 1996
- Ratnaparkhi, 1996: input = h; output = t; pair = (h,t)
- Klein & Manning, '03: input = d; output = c; pair = (c,d)
Joint vs. Conditional Models
Assuming some training data {(x,y)}, we need to learn a model Θ s.t. given a new x, we can predict label y.
Different types of models:
- Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ)
  - Most models so far: n-gram, Naïve Bayes, HMM, etc.
  - Conceptually easy to compute weights: relative frequency
- Conditional (aka discriminative) models estimate P(y|x) by maximizing P(Y|X,Θ)
  - Models going forward: MaxEnt, SVM, CRF, ...
  - Computing weights is more complex
Naïve Bayes Model
The Naïve Bayes model assumes features f are independent of each other, given the class c.
[Graphical model: class node c with arrows to feature nodes f_1, f_2, f_3, ..., f_k]
Naïve Bayes Model
- Makes the assumption of conditional independence of features given the class
- However, this is generally unrealistic:
  - P("cuts"|politics) = p_cuts
  - But is P("cuts"|politics,"budget") = p_cuts? Generally not.
- We would like a model that doesn't make this assumption
Model Parameters
- Our model: c* = argmax_c P(c) Π_j P(f_j|c)
- Two types of parameters:
  - P(c): class priors
  - P(f_j|c): class-conditional feature probabilities
- |C| + |V||C| parameters in total, if features are words in vocabulary V
Weights in Naïve Bayes

            c_1           c_2           c_3          ...   c_k
  f_1     P(f_1|c_1)    P(f_1|c_2)    P(f_1|c_3)     ...   P(f_1|c_k)
  f_2     P(f_2|c_1)    P(f_2|c_2)      ...
  ...
  f_|V|   P(f_|V||c_1)     ...
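A compact sketch of prediction with these weights, done in log space to avoid underflow; the priors and conditional tables here are hypothetical toy values:

```python
import math

# Hypothetical toy parameters: class priors and class-conditional word probabilities.
priors = {"politics": 0.6, "sports": 0.4}
cond = {
    "politics": {"cuts": 0.05, "budget": 0.04, "game": 0.001},
    "sports":   {"cuts": 0.01, "budget": 0.002, "game": 0.06},
}

def nb_predict(words):
    """argmax_c [log P(c) + sum_j log P(f_j|c)], with a small floor for unseen words."""
    def score(c):
        return math.log(priors[c]) + sum(math.log(cond[c].get(w, 1e-6)) for w in words)
    return max(priors, key=score)

print(nb_predict(["budget", "cuts"]))   # -> "politics"
```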
Weights in Naïve Bayes and Maximum Entropy
- Naïve Bayes: the weights P(f|y) are probabilities in [0,1]:
  P(y|x) ∝ P(y) Π_j P(f_j|y)
- MaxEnt: weights are real numbers of any magnitude and sign:
  P(y|x) = (1/Z(x)) exp(Σ_j λ_j f_j(x,y))
MaxEnt Overview
- Prediction: P(y|x) = (1/Z(x)) exp(Σ_j λ_j f_j(x,y)), where Z(x) = Σ_y' exp(Σ_j λ_j f_j(x,y'))
- f_j(x,y): binary feature function, indicating presence of feature j in instance x with class y
- λ_j: feature weights, learned in training
- Prediction: compute P(y|x) for each y, pick the highest
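A minimal sketch of this computation, assuming one weight λ_j per (feature, class) pair and binary feature functions; the weight values below are hypothetical:

```python
import math

classes = ["politics", "sports"]
# Hypothetical weights lambda_j, indexed by (feature, class); any real value is allowed.
weights = {("cuts", "politics"): 1.2, ("budget", "politics"): 0.8,
           ("cuts", "sports"): -0.3, ("game", "sports"): 1.5}

def maxent_posterior(features):
    """P(y|x) = exp(sum_j lambda_j f_j(x,y)) / Z(x), for binary features present in x."""
    scores = {c: math.exp(sum(weights.get((f, c), 0.0) for f in features))
              for c in classes}
    z = sum(scores.values())               # normalizer Z(x)
    return {c: s / z for c, s in scores.items()}

post = maxent_posterior({"cuts", "budget"})
print(post, max(post, key=post.get))
```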
Weights in MaxEnt

            c_1    c_2    c_3    ...   c_k
  f_1       λ_1    λ_8    ...
  f_2       λ_2    ...
  ...
  f_|V|     ...    λ_6
Maximum Entropy Principle
- Intuitively: model all that is known, and assume as little as possible about what is unknown
- Maximum entropy = minimum commitment
- Related to concepts like Occam's razor
- Laplace's "Principle of Insufficient Reason": when one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely
Example I (K&M 2003)
- Consider a coin flip with entropy H(X)
- What values of P(X=H), P(X=T) maximize H(X)?
  - P(X=H) = P(X=T) = 1/2: with no prior information, the best guess is a fair coin
- What if you know P(X=H) = 0.3?
  - Then P(X=T) = 0.7
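As a quick check in LaTeX (a standard derivation, not from the slides), the unconstrained case works out as follows:

```latex
\begin{align*}
H(p) &= -p \log p - (1-p)\log(1-p), \qquad p = P(X{=}H)\\
\frac{dH}{dp} &= -\log p + \log(1-p) = 0 \;\Rightarrow\; p = 1-p \;\Rightarrow\; p = \tfrac{1}{2}\\
\text{With the constraint } P(X{=}H) &= 0.3:\quad P(X{=}T) = 1 - 0.3 = 0.7
\end{align*}
```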
Example II: MT (Berger, 1996)
- Task: English-to-French machine translation; specifically, translating 'in'
- Suppose we've seen 'in' translated as: {dans, en, à, au cours de, pendant}
- Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
- If there is no other constraint, what is the maxent model?
  - p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5
Example II: MT (Berger, 1996), continued
- What if we find out that the translator uses dans or en 30% of the time?
  - Add constraint: p(dans) + p(en) = 3/10
  - Now what is the maxent model?
    - p(dans) = p(en) = 3/20
    - p(à) = p(au cours de) = p(pendant) = 7/30
- What if we also know the translator picks à or dans 50% of the time?
  - Add new constraint: p(à) + p(dans) = 0.5
  - Now what is the maxent model? Not intuitively obvious... (see the numerical sketch below)
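One way to see the answer numerically (a sketch, not the slides' method, which develops the analytic exponential-family solution): maximize the entropy directly under the three constraints with SciPy.

```python
import numpy as np
from scipy.optimize import minimize

words = ["dans", "en", "a", "au cours de", "pendant"]

def neg_entropy(p):
    """Negative entropy; minimizing this maximizes H(p). Clipping avoids log(0)."""
    q = np.clip(p, 1e-12, 1.0)
    return np.sum(q * np.log(q))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},    # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},  # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[2] + p[0] - 0.5},  # p(a) + p(dans) = 1/2
]

x0 = np.full(5, 0.2)  # start from the uniform model
result = minimize(neg_entropy, x0, bounds=[(0, 1)] * 5, constraints=constraints)
print(dict(zip(words, np.round(result.x, 3))))
```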
Example III: POS (K&M, 2003)
- Problem: too uniform. What else do we know?
- Nouns are more common than verbs
  - So f_N = {NN, NNS, NNP, NNPS}, and E[f_N] = 32/36
- Also, proper nouns are more frequent than common nouns, so E[f_{NNP,NNPS}] = 24/36
- Etc.