1 LING 696B: Maximum-Entropy and Random Fields
2 Review: two worlds
Statistical models and OT seem to ask different questions about learning:
- UG: what is possible/impossible? Hard-coded generalizations; combinatorial optimization (sorting)
- Statistical: among the things that are possible, what is likely/unlikely? Soft-coded generalizations; numerical optimization
- A marriage of the two?
3 Review: two worlds
- OT: relate possible/impossible patterns in different languages through constraint reranking
- Stochastic OT: consider a distribution over all possible grammars to generate variation
- Today: model the frequency of input/output pairs (among the possible) directly, using a powerful model
4 Maximum entropy and OT
Imaginary data:

  /bap/    P(.)   *[+voice]   Ident(voice)
  [bab]    .5     2
  [pap]    .5                 1

- Stochastic OT: let *[+voice] >> Ident(voice) and Ident(voice) >> *[+voice] 50% of the time each
- Maximum-Entropy (using positive weights):
  p([bab]|/bap/) = (1/Z) exp{-(2*w1)}
  p([pap]|/bap/) = (1/Z) exp{-(w2)}
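A minimal sketch of the calculation on this slide, assuming illustrative weight values for w1 and w2 (they are not fitted to anything); it just turns the two violation vectors into the normalized probabilities p(.|/bap/).

```python
# Max-Ent probabilities for the toy /bap/ tableau.
# w1 (*[+voice]) and w2 (Ident(voice)) are hypothetical weights for illustration.
import math

w1, w2 = 1.0, 2.0

# violation vectors: (*[+voice], Ident(voice))
violations = {"bab": (2, 0), "pap": (0, 1)}

# unnormalized scores: exp{-(weighted sum of violations)}
scores = {cand: math.exp(-(v1 * w1 + v2 * w2))
          for cand, (v1, v2) in violations.items()}

Z = sum(scores.values())              # normalization constant
probs = {cand: s / Z for cand, s in scores.items()}

print(probs)                          # the two probabilities sum to 1
```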
5 Maximum entropy
Why have Z? The result needs to be a conditional distribution:
  p([bab]|/bap/) + p([pap]|/bap/) = 1
So Z = exp{-(2*w1)} + exp{-(w2)}, the same for all candidates -- called a normalization constant.
- Z can quickly become difficult to compute when the number of candidates is large
- Very similar proposal in Smolensky (1986)
How to get w1, w2?
- Learned from data (by calculating gradients)
- Need: frequency counts and violation vectors (same as stochastic OT)
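A sketch of the "calculating gradients" step, under assumptions not on the slide: made-up frequency counts (a 50/50 split over 10 tokens), arbitrary positive starting weights, and plain gradient ascent on the conditional log-likelihood with a fixed learning rate. For each constraint, the gradient is the model's expected violation count minus the observed average.

```python
# Fitting w1, w2 from frequency counts and violation vectors (illustrative setup).
import math

candidates = {"bab": (2, 0), "pap": (0, 1)}   # violation vectors
counts     = {"bab": 5, "pap": 5}             # hypothetical observed frequencies
N = sum(counts.values())

w = [1.0, 0.5]                                # arbitrary positive starting weights
lr = 0.1

def distribution(w):
    scores = {c: math.exp(-(f[0] * w[0] + f[1] * w[1]))
              for c, f in candidates.items()}
    Z = sum(scores.values())                  # normalization constant
    return {c: s / Z for c, s in scores.items()}

for step in range(1000):
    p = distribution(w)
    for k in range(2):
        expected = sum(p[c] * candidates[c][k] for c in candidates)
        observed = sum(counts[c] * candidates[c][k] for c in candidates) / N
        # raise a weight when the model violates that constraint more than the data does
        w[k] += lr * (expected - observed)

print(w)                 # converges to weights with 2*w1 = w2
print(distribution(w))   # reproduces the 50/50 split
```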
6 Maximum entropy
Why use exp{.}? It is like taking the maximum, but "soft" -- easy to differentiate and optimize.
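Not from the slides, just a quick illustration of the "soft maximum" point: exponentiating and normalizing gives every option some probability, and scaling the scores up pushes the result toward a hard argmax. (In the OT setting the scores would be the negated weighted violation counts.)

```python
# exp{.} as a smooth, differentiable stand-in for "pick the best score".
import math

def soft_choice(scores, scale=1.0):
    exps = [math.exp(scale * s) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

scores = [1.0, 2.0, 1.5]
print(soft_choice(scores, scale=1))    # every option gets some probability
print(soft_choice(scores, scale=10))   # nearly all mass on the best-scoring option
```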
7 Maximum entropy and OT
- Inputs are violation vectors, e.g. x = (2,0) and (0,1)
- Outputs are one of K winners -- essentially a classification problem
- Violating a constraint works against the candidate: prob ~ exp{-(x1*w1 + x2*w2)}
- Crucial difference: candidates are ordered by one score, not by a lexicographic order

  /bap/    P(.)   *[+voice]   Ident(voice)
  [bab]    .5     2
  [pap]    .5                 1
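A toy contrast of the two orderings, with a hypothetical ranking and hypothetical weights: OT compares the violation tuples lexicographically (here the tuple order encodes *[+voice] >> Ident(voice)), while Max-Ent collapses them into one weighted penalty, so two violations of a cheap constraint can outweigh one violation of an expensive one.

```python
# One-score ordering (Max-Ent) vs. lexicographic ordering (OT), toy example.
violations = {"bab": (2, 0), "pap": (0, 1)}   # (*[+voice], Ident(voice))

# OT-style: violation tuples compare lexicographically under the assumed ranking
ot_order = sorted(violations, key=lambda c: violations[c])

# Max-Ent-style: one real-valued penalty per candidate (hypothetical weights)
w = (1.0, 2.5)
penalty = {c: sum(wk * vk for wk, vk in zip(w, violations[c])) for c in violations}
maxent_order = sorted(violations, key=lambda c: penalty[c])

print(ot_order)       # ['pap', 'bab']: any *[+voice] violation is fatal under the ranking
print(maxent_order)   # ['bab', 'pap']: 2 cheap violations cost less than 1 expensive one
```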
8 Maximum entropy
Ordering discrete outputs from input vectors is a common problem:
- Also called Logistic Regression (recall Nearey)
- Explaining the name: let P = p([bab]|/bap/); then
  log[P/(1-P)] = w2 - 2*w1
  (the left side is the logistic transform; the right side is a linear regression on the violations)
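A quick numerical check of the identity on the slide, with arbitrary weight values: for the two-candidate tableau, the log-odds of [bab] reduce to w2 - 2*w1.

```python
# Verifying log[P/(1-P)] = w2 - 2*w1 for the two-candidate case.
import math

w1, w2 = 0.7, 1.3                  # hypothetical weights
s_bab = math.exp(-2 * w1)          # unnormalized score of [bab]
s_pap = math.exp(-w2)              # unnormalized score of [pap]
P = s_bab / (s_bab + s_pap)        # P = p([bab] | /bap/)

print(math.log(P / (1 - P)))       # equals w2 - 2*w1 ...
print(w2 - 2 * w1)                 # ... up to floating-point error
```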
9 The power of Maximum Entropy
Max-Ent/logistic regression is widely used in many areas with interacting, correlated inputs:
- Recall Nearey: phones, diphones, …
- NLP: tagging, labeling, parsing … (anything with a discrete output)
- Easy to learn: only a global maximum, so optimization is efficient
Isn't this the greatest thing in the world? We need to understand the story behind the exp{} (in a few minutes).
10 Demo: Spanish diminutives
Data from Arbisi-Kelm
Constraints: ALIGN(TE,Word,R), MAX-OO(V), DEP-IO, and BaseTooLittle
11 Stochastic OT and Max-Ent
Is a better fit always a good thing?
12 Stochastic OT and Max-Ent
Is a better fit always a good thing? Should model-fitting become a new fashion in phonology?
13 The crucial difference
What are the possible distributions of p(.|/bap/) in this case?

  /bap/    P(.)   *[+voice]   Ident(voice)
  [bab]           2
  [pap]                       1
  [bap]           1
  [pab]           1           1
14 The crucial difference
What are the possible distributions of p(.|/bap/) in this case?
Max-Ent considers a much wider range of distributions.

  /bap/    P(.)   *[+voice]   Ident(voice)
  [bab]           2
  [pap]                       1
  [bap]           1
  [pab]           1           1
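A brute-force sketch of the contrast, under the (assumed) reading that stochastic OT here varies only the ranking of these two constraints: each ranking picks a single winner, so stochastic OT can only mix probability between those winners, while Max-Ent gives every candidate some probability and traces out a two-parameter family of distributions as the weights vary. The weight grid is arbitrary.

```python
# Which distributions over the four candidates can each model reach?
import itertools, math

violations = {"bab": (2, 0), "pap": (0, 1), "bap": (1, 0), "pab": (1, 1)}

# OT side: each ranking of the two constraints has exactly one winner.
for ranking in [(0, 1), (1, 0)]:
    winner = min(violations, key=lambda c: tuple(violations[c][k] for k in ranking))
    print("ranking", ranking, "-> winner:", winner)

# Max-Ent side: every candidate gets nonzero probability for any finite weights.
for w1, w2 in itertools.product([0.5, 2.0], repeat=2):
    scores = {c: math.exp(-(v[0] * w1 + v[1] * w2)) for c, v in violations.items()}
    Z = sum(scores.values())
    print((w1, w2), {c: round(s / Z, 3) for c, s in scores.items()})
```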
15 What is Maximum Entropy anyway?
Jaynes (1957): the most ignorant state corresponds to the distribution with the most entropy.
Given a die, which distribution has the largest entropy? (The uniform one.)
16 What is Maximum Entropy anyway?
Jaynes (1957): the most ignorant state corresponds to the distribution with the most entropy.
Given a die, which distribution has the largest entropy?
Now add constraints to the distribution: the average of some feature functions is assumed to be fixed:
  sum_x p(x)*f_k(x) = observed value of f_k
17 What is Maximum Entropy anyway?
- Examples of features: violations, word counts, N-grams, co-occurrences, …
- The constraints change the shape of the maximum-entropy distribution
- Solve the constrained optimization problem
- This leads to p(x) ~ exp{sum_k w_k*f_k(x)}
- Very general (see later); many choices of f_k
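A small numerical illustration of the constrained problem, with an assumed target: among all distributions over a die's six faces whose mean is fixed at 4.5, the maximum-entropy one has the exponential form p(x) ~ exp{w*f(x)} with f(x) = x, and the single weight can be found by bisection on the mean constraint.

```python
# Max-ent distribution over a die with a fixed mean (target value is illustrative).
import math

faces = [1, 2, 3, 4, 5, 6]
target_mean = 4.5                      # the "observed value" of the feature f(x) = x

def maxent_dist(w):
    scores = [math.exp(w * x) for x in faces]
    Z = sum(scores)
    return [s / Z for s in scores]

# The mean increases monotonically in w, so bisection finds the right weight.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    mean = sum(x * p for x, p in zip(faces, maxent_dist(mid)))
    lo, hi = (mid, hi) if mean < target_mean else (lo, mid)

p = maxent_dist((lo + hi) / 2)
print(p)                                     # skewed toward the high faces
print(sum(x * q for x, q in zip(faces, p)))  # ~4.5, as constrained
```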
18 The basic intuition
- Be as "ignorant" as possible (maximum entropy), so long as the chosen distribution matches certain "descriptions" of the empirical data (the statistics of the f_k(x))
- Approximation property: any distribution can be approximated by a max-ent distribution with a sufficient number of features (Cramér and Wold)
- Common practice in NLP
- This is better seen as a "descriptive" model
19 Going towards Markov random fields
- Maximum entropy applied to a conditional/joint distribution: p(y|x) or p(x,y) ~ exp{sum_k w_k*f_k(x,y)}
- There can be many creative ways of extracting features f_k(x,y)
- One way is to let a graph structure guide the calculation of features, e.g. by neighborhood/clique
- Known as a Markov network / Markov random field
20 Conditional random field
- Impose a chain-structured graph and assign features to its edges
- Still a max-ent model, same calculation
- Features: f(x_i, y_i) linking inputs to outputs, and m(y_i, y_{i+1}) linking adjacent outputs
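A minimal linear-chain sketch in the spirit of this picture, not Wilson's actual model: toy node features f(x_i, y_i) and edge features m(y_i, y_{i+1}) with made-up forms, labels, and weights, normalized by brute-force enumeration over all label sequences (a real CRF would use the forward algorithm for this).

```python
# Tiny chain-structured max-ent model: node + edge features over label sequences.
import itertools, math

labels = ["p", "b"]                    # possible surface segments (toy)
x = ["b", "p", "b"]                    # underlying sequence (toy)

def f(xi, yi):                         # node feature: penalize changing the segment
    return 0.0 if xi == yi else -1.0

def m(yi, yj):                         # edge feature: penalize adjacent voiced pairs
    return -1.0 if yi == "b" and yj == "b" else 0.0

def score(y, w_f=1.0, w_m=1.0):
    node = sum(f(xi, yi) for xi, yi in zip(x, y))
    edge = sum(m(y[i], y[i + 1]) for i in range(len(y) - 1))
    return math.exp(w_f * node + w_m * edge)

# Brute-force normalization over all label sequences (fine for tiny examples).
seqs = list(itertools.product(labels, repeat=len(x)))
Z = sum(score(y) for y in seqs)
best = max(seqs, key=score)
print(best, score(best) / Z)           # most probable surface sequence and its probability
```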
21 Wilson’s idea
Isn't this a familiar picture in phonology?
- m(y_i, y_{i+1}) -- Markedness, over adjacent positions of the surface form y
- f(x_i, y_i) -- Faithfulness, between the underlying form x and the surface form y
22 The story of smoothing
- In Max-Ent models, the weights can get very large and “over-fit” the data (see demo)
- Common to penalize (smooth) this with a new objective function:
  new objective = old objective + parameter * magnitude of weights
- Wilson's claim: this smoothing parameter has to do with substantive bias in phonological learning
- Constraints that force less similarity --> a higher penalty for them to change value
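A sketch of the penalized objective, written here as an L2 (Gaussian-prior) penalty with a per-constraint strength, since Wilson ties the amount of smoothing to substantive bias; the penalty form, the strengths, and the numbers are illustrative assumptions, and the sign convention assumes a log-likelihood being maximized (equivalent to adding the penalty to a loss).

```python
# "new objective = old objective + parameter * magnitude of weights",
# here with one penalty strength per constraint (illustrative values).
def penalized_objective(log_likelihood, weights, strengths):
    # larger strength -> that weight is held closer to zero, i.e. the constraint
    # needs more evidence before its weight moves away from its starting value
    penalty = sum(s * w * w for w, s in zip(weights, strengths))
    return log_likelihood - penalty

# e.g. two constraints, the second biased to resist large weights
print(penalized_objective(-12.3, [0.4, 0.8], strengths=[0.1, 1.0]))
```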
23 Wilson’s model fitting to the velar palatalization data