Feature Selection & Maximum Entropy
Advanced Statistical Methods in NLP, Ling 572, January 26, 2012
Roadmap
- Feature selection and weighting
- Feature weighting
- Chi-square feature selection
- Chi-square feature selection example
- HW #4
- Maximum Entropy
- Introduction: Maximum Entropy Principle
- Maximum Entropy NLP examples
Feature Selection Recap
- Problem: curse of dimensionality
  - Data sparseness, computational cost, overfitting
- Solution: dimensionality reduction
  - New feature set r' s.t. |r'| < |r|
- Approaches:
  - Global & local approaches
  - Feature extraction: new features in r' are transformations of features in r
  - Feature selection:
    - Wrapper techniques
    - Feature scoring
Feature Weighting
For text classification, typical weights include:
- Binary: weights in {0,1}
- Term frequency (tf): # occurrences of t_k in document d_i
- Inverse document frequency (idf): idf_k = log(N / (1 + df_k)), where df_k is the # of docs in which t_k appears and N is the total # of docs
- tf-idf: tf * idf
Chi Square
- Tests for presence/absence of a relation between random variables
- Bivariate analysis: tests 2 random variables
- Can test strength of relationship
- (Strictly speaking) doesn't test direction
Chi Square Example (due to F. Xia)
- Can gender predict shoe choice?
- A: male/female (features)
- B: shoe choice; classes: {sandal, sneaker, ...}

Observed counts:

            sandal  sneaker  leather shoe  boot  other
  Male         6       17         13         9     5
  Female      13        5          7        16     9
Comparing Distributions (due to F. Xia)
Observed distribution (O):

            sandal  sneaker  leather shoe  boot  other
  Male         6       17         13         9     5
  Female      13        5          7        16     9

Expected distribution (E), with row and column totals:

            sandal  sneaker  leather shoe  boot  other  Total
  Male        9.5      11         10       12.5    7      50
  Female      9.5      11         10       12.5    7      50
  Total       19       22         20       25     14     100
Computing Chi Square
- Expected value for a cell = row_total * column_total / table_total
- X^2 = sum over cells of (O - E)^2 / E
- X^2 = (6 - 9.5)^2/9.5 + (17 - 11)^2/11 + ... = 14.026
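A short sketch of this computation, which reproduces the 14.026 from the shoe-choice table:

```python
def chi_square(observed):
    """Pearson's chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total   # expected count for this cell
            x2 += (o - e) ** 2 / e
    return x2

# shoe-choice example from the slides
observed = [
    [6, 17, 13, 9, 5],    # male
    [13, 5, 7, 16, 9],    # female
]
print(round(chi_square(observed), 3))   # ~14.03
```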
Calculating X^2
- Tabulate the contingency table of observed values: O
- Compute row and column totals
- Compute the table of expected values, given row/column totals, assuming no association
- Compute X^2
For a 2x2 Table
Observed (O):

           !c_i   c_i
  !t_k      a      b
   t_k      c      d

Expected (E):

              !c_i            c_i          Total
  !t_k    (a+b)(a+c)/N    (a+b)(b+d)/N      a+b
   t_k    (c+d)(a+c)/N    (c+d)(b+d)/N      c+d
  Total       a+c             b+d            N
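For the 2x2 case the statistic also reduces to a closed form that is commonly used when scoring a term t_k against a class c_i. This is a standard identity, not spelled out on the slide, and it gives the same value as summing (O - E)^2 / E over the four cells:

```latex
% Pearson chi-square for a 2x2 contingency table with cells a, b, c, d and N = a + b + c + d
\[
  \chi^2(t_k, c_i) = \frac{N\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}
\]
```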
X^2 Test
- Tests whether two random variables are independent
- Null hypothesis: the 2 R.V.s are independent
- Compute the X^2 statistic
- Compute degrees of freedom: df = (# rows - 1)(# cols - 1)
  - Shoe example: df = (2-1)(5-1) = 4
- Look up the probability of the X^2 statistic value in a X^2 table
- If the probability is low (below some significance level), reject the null hypothesis
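A sketch of the full test with SciPy (assuming scipy is available); chi2_contingency computes the statistic, degrees of freedom, p-value, and expected table directly:

```python
from scipy.stats import chi2_contingency

observed = [
    [6, 17, 13, 9, 5],    # male
    [13, 5, 7, 16, 9],    # female
]
stat, p_value, dof, expected = chi2_contingency(observed)
print(stat)      # ~14.03
print(dof)       # (2-1)*(5-1) = 4
print(p_value)   # ~0.007, so reject independence at the 0.05 level
print(expected)  # the E table from the earlier slide
```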
Requirements for the X^2 Test
- Events assumed independent and drawn from the same distribution
- Outcomes must be mutually exclusive
- Use raw frequencies, not percentages
- Sufficient counts per cell (commonly, expected counts > 5)
X^2 Example
- Shared task evaluation: Topic Detection and Tracking (TDT)
- Sub-task: Topic Tracking
  - Given a small number of exemplar documents (1-4) defining a topic
  - Create a model that allows tracking of the topic, i.e. find all subsequent documents on this topic
- Exemplars: 1-4 newswire articles, 300-600 words each
Challenges
- Many news articles look alike
  - Create a profile (feature representation) that highlights terms strongly associated with the current topic and differentiates it from all other topics
- Not all documents are labeled
  - Only a small subset belong to topics of interest
  - Must differentiate from other topics AND from the 'background'
Approach
X^2 feature selection:
- Assume terms have a binary representation
- Positive class: term occurrences from the topic's exemplar docs
- Negative class: term occurrences from other classes' exemplars and 'earlier' uncategorized docs
- Compute X^2 for each term
- Retain the terms with the highest X^2 scores (keep the top N terms)
- Create one feature set per topic to be tracked
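A sketch of per-topic chi-square term selection under these assumptions (binary term occurrence; positive docs are the topic's exemplars, negative docs are everything else). Names such as select_terms and top_n are illustrative, not from the original system:

```python
from collections import Counter

def select_terms(pos_docs, neg_docs, top_n=100):
    """Rank terms by 2x2 chi-square between term presence and the positive class."""
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    N = n_pos + n_neg
    pos_df = Counter()            # positive-class docs containing each term
    neg_df = Counter()            # negative-class docs containing each term
    for doc in pos_docs:
        pos_df.update(set(doc))
    for doc in neg_docs:
        neg_df.update(set(doc))
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a = pos_df[term]          # term present, positive class
        b = neg_df[term]          # term present, negative class
        c = n_pos - a             # term absent, positive class
        d = n_neg - b             # term absent, negative class
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = 0.0 if denom == 0 else N * (a * d - b * c) ** 2 / denom
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```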
Tracking Approach
- Build a vector space model
- Feature weighting: tf*idf with some modifications
- Distance measure: cosine similarity
- For each topic, select documents scoring above a threshold
- Result: improved retrieval
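A minimal sketch of the tracking step, assuming each document and topic profile is already a sparse term-weight dict (e.g., tf-idf as above); the threshold value here is illustrative, not from the system:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)

def track(topic_profile, doc_vectors, threshold=0.2):
    """Return indices of documents whose similarity to the topic exceeds the threshold."""
    return [i for i, d in enumerate(doc_vectors)
            if cosine(topic_profile, d) >= threshold]
```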
HW #4
- Topic: feature selection for kNN
- Build a kNN classifier using Euclidean distance and cosine similarity
- Write a program to compute X^2 on a data set
- Use X^2 at different significance levels to filter features
- Compare the effects of different feature filtering on kNN classification
Maximum Entropy
Maximum Entropy
- "MaxEnt": popular machine learning technique for NLP
- First uses in NLP circa 1996 (Rosenfeld, Berger)
- Applied to a wide range of tasks: sentence boundary detection (MxTerminator, Ratnaparkhi), POS tagging (Ratnaparkhi, Berger), topic segmentation (Berger), language modeling (Rosenfeld), prosody labeling, etc.
Readings & Comments
- Several readings: (Berger, 1996), (Ratnaparkhi, 1997), and the (Klein & Manning, 2003) tutorial
- Note: some of these are very 'dense'
  - Don't spend huge amounts of time on every detail
  - Take a first pass before class, review after lecture
- Going forward: techniques get more complex
  - Goal: understand the basic model and concepts
  - Training is especially complex; we'll discuss it, but not implement it
Notation Note
The readings are not entirely consistent. We'll use: input = x, output = y, pair = (x,y), consistent with Berger, 1996.
- Ratnaparkhi, 1996: input = h, output = t, pair = (h,t)
- Klein & Manning, 2003: input = d, output = c, pair = (c,d)
Joint vs. Conditional Models
- Assuming some training data {(x,y)}, we need to learn a model Θ s.t. given a new x, we can predict the label y.
- Different types of models:
  - Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ)
    - Most models so far: n-gram, Naïve Bayes, HMM, etc.
    - Conceptually easy to compute weights: relative frequency
  - Conditional (aka discriminative) models estimate P(y|x) by maximizing P(Y|X,Θ)
    - Models going forward: MaxEnt, SVM, CRF, ...
    - Computing weights is more complex
Naïve Bayes Model
The Naïve Bayes model assumes the features f are independent of each other, given the class C.
[Diagram: class node c with arrows to feature nodes f_1, f_2, f_3, ..., f_k]
Naïve Bayes Model
- Makes the assumption of conditional independence of features given the class
- However, this is generally unrealistic:
  - P("cuts"|politics) = p_cuts
  - What about P("cuts"|politics, "budget")? Is it still = p_cuts?
- We would like a model that doesn't make this assumption
Model Parameters
- Our model: c* = argmax_c P(c) Π_j P(f_j|c)
- Two types of parameters:
  - P(c): class priors
  - P(f_j|c): class-conditional feature probabilities
- |C| + |V|·|C| parameters in total, if features are words in vocabulary V
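A minimal sketch of this decision rule using log probabilities, assuming the priors and conditionals have already been estimated; the dict-based representation and the toy numbers are illustrative:

```python
import math

def nb_predict(features, priors, cond):
    """argmax_c log P(c) + sum_j log P(f_j | c)."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        # small floor for unseen features stands in for real smoothing
        score = math.log(prior) + sum(math.log(cond[c].get(f, 1e-10)) for f in features)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

priors = {"politics": 0.5, "sports": 0.5}
cond = {
    "politics": {"cuts": 0.02, "budget": 0.03, "game": 0.001},
    "sports":   {"cuts": 0.001, "budget": 0.001, "game": 0.05},
}
print(nb_predict(["budget", "cuts"], priors, cond))   # politics
```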
Weights in Naïve Bayes

            c_1          c_2          c_3         ...   c_k
  f_1    P(f_1|c_1)   P(f_1|c_2)   P(f_1|c_3)     ...  P(f_1|c_k)
  f_2    P(f_2|c_1)   P(f_2|c_2)      ...
  ...
  f_|V|  P(f_|V||c_1)     ...
Weights in Naïve Bayes and Maximum Entropy
- Naïve Bayes: the weights P(f|y) are probabilities in [0,1], and
  P(y|x) = P(y) Π_j P(f_j|y) / P(x)
- MaxEnt: weights are real numbers of any magnitude and sign, and
  P(y|x) = exp(Σ_j λ_j f_j(x,y)) / Z(x)
MaxEnt Overview
- Prediction: P(y|x) = exp(Σ_j λ_j f_j(x,y)) / Z(x)
- f_j(x,y): binary feature function, indicating presence of feature j in instance x with class y
- λ_j: feature weights, learned in training
- Prediction: compute P(y|x) for each y, pick the highest
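A sketch of this prediction step, with binary feature functions encoded as (feature, class) pairs; the weight values are made up for illustration, not learned:

```python
import math

def maxent_predict(features, classes, weights):
    """P(y|x) = exp(sum_j lambda_j f_j(x,y)) / Z(x) for binary indicator features."""
    scores = {y: math.exp(sum(weights.get((f, y), 0.0) for f in features))
              for y in classes}
    Z = sum(scores.values())                  # normalizer over classes
    probs = {y: s / Z for y, s in scores.items()}
    return max(probs, key=probs.get), probs

# illustrative weights lambda_{(feature, class)}; any sign and magnitude is allowed
weights = {("cuts", "politics"): 1.2, ("budget", "politics"): 0.8,
           ("cuts", "sports"): -0.3, ("game", "sports"): 1.5}
print(maxent_predict(["budget", "cuts"], ["politics", "sports"], weights))
```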
Weights in MaxEnt
One weight per (feature, class) pair:

            c_1    c_2    c_3    ...   c_k
  f_1       λ_1    λ_8    ...
  f_2       λ_2    ...
  ...
  f_|V|            λ_6    ...
Maximum Entropy Principle
- Intuitively: model all that is known, and assume as little as possible about what is unknown
- Maximum entropy = minimum commitment
- Related to concepts like Occam's razor
- Laplace's "Principle of Insufficient Reason": when one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely
Example I (K&M 2003)
- Consider a coin flip, with entropy H(X) = -P(X=H) log P(X=H) - P(X=T) log P(X=T)
- What values of P(X=H), P(X=T) maximize H(X)?
  - P(X=H) = P(X=T) = 1/2
  - With no prior information, the best guess is a fair coin
- What if you know P(X=H) = 0.3?
  - P(X=T) = 0.7
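A quick numeric check of this: binary entropy peaks at p = 0.5, and fixing P(X=H) = 0.3 forces the remaining mass onto tails.

```python
import math

def entropy(ps):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit, the maximum for two outcomes
print(entropy([0.3, 0.7]))   # ~0.881 bits
print(entropy([0.9, 0.1]))   # ~0.469 bits, further from uniform
```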
Example II: MT (Berger, 1996)
- Task: English-to-French machine translation; specifically, translating 'in'
- Suppose we've seen 'in' translated as: {dans, en, à, au cours de, pendant}
- Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
- If there is no other constraint, what is the maxent model?
  - p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5
Example II: MT (Berger, 1996)
- What if we find out that the translator uses dans or en 30% of the time?
  - Constraint: p(dans) + p(en) = 3/10
  - Now what is the maxent model?
    - p(dans) = p(en) = 3/20
    - p(à) = p(au cours de) = p(pendant) = 7/30
- What if we also know the translator picks à or dans 50% of the time?
  - Add a new constraint: p(à) + p(dans) = 0.5
  - Now what is the maxent model?
    - Not intuitively obvious...
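With overlapping constraints the solution really does have to be computed. A sketch that maximizes entropy numerically under those three constraints, assuming SciPy's SLSQP optimizer; the variable ordering is dans, en, à, au cours de, pendant:

```python
import numpy as np
from scipy.optimize import minimize

def neg_entropy(p):
    """Negative Shannon entropy (minimizing this maximizes entropy)."""
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},     # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},   # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[2] + p[0] - 0.5},   # p(à) + p(dans) = 1/2
]
x0 = np.full(5, 0.2)                                      # start from uniform
result = minimize(neg_entropy, x0, method="SLSQP",
                  bounds=[(0.0, 1.0)] * 5, constraints=constraints)
# roughly: dans ~0.19, en ~0.11, à ~0.31, au cours de ~0.19, pendant ~0.19
print(dict(zip(["dans", "en", "à", "au cours de", "pendant"],
               np.round(result.x, 4))))
```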
Example III: POS (K&M, 2003)
[Slide content (POS tag distribution tables) not captured in this export]
Example III
- Problem: too uniform
- What else do we know?
  - Nouns are more common than verbs, so for f_N = {NN, NNS, NNP, NNPS}, E[f_N] = 32/36
  - Also, proper nouns are more frequent than common nouns, so E[f_{NNP, NNPS}] = 24/36
  - Etc.