
Slide 1: LING 696B: Maximum-Entropy and Random Fields

Slide 2: Review: two worlds

Statistical models and OT seem to ask different questions about learning.
- UG: what is possible/impossible? Hard-coded generalizations; combinatorial optimization (sorting).
- Statistical: among the things that are possible, what is likely/unlikely? Soft-coded generalizations; numerical optimization.
- Marriage of the two?

Slide 3: Review: two worlds

- OT: relate possible/impossible patterns in different languages through constraint reranking.
- Stochastic OT: consider a distribution over all possible grammars to generate variation.
- Today: model the frequency of input/output pairs (among the possible) directly, using a powerful model.

Slide 4: Maximum entropy and OT

Imaginary data:

/bap/    P(.)   *[+voice]   Ident(voice)
[bab]    .5     2
[pap]    .5                 1

- Stochastic OT: let *[+voice] >> Ident(voice) hold 50% of the time and Ident(voice) >> *[+voice] the other 50%.
- Maximum entropy (using positive weights):
  p([bab] | /bap/) = (1/Z) * exp{-(2*w1)}
  p([pap] | /bap/) = (1/Z) * exp{-(w2)}
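
To make the formulas above concrete, here is a minimal sketch (not from the course materials) that computes the two candidate probabilities from their violation vectors; the weight values are arbitrary illustrations.

```python
import math

# Minimal sketch: MaxEnt probabilities for the /bap/ tableau above.
# The weights passed in at the bottom are arbitrary illustrative values.
def maxent_probs(violations, weights):
    """violations: {candidate: [counts per constraint]}; returns p(candidate | input)."""
    scores = {c: math.exp(-sum(w * v for w, v in zip(weights, viol)))
              for c, viol in violations.items()}
    Z = sum(scores.values())              # normalization constant (see next slide)
    return {c: s / Z for c, s in scores.items()}

tableau = {"bab": [2, 0],                 # two *[+voice] violations
           "pap": [0, 1]}                 # one Ident(voice) violation
print(maxent_probs(tableau, [0.35, 0.69]))
```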

Slide 5: Maximum entropy

- Why have Z? The model needs to be a conditional distribution: p([bab] | /bap/) + p([pap] | /bap/) = 1.
- So Z = exp{-(2*w1)} + exp{-(w2)} (the same for all candidates) -- called the normalization constant.
- Z can quickly become difficult to compute when the number of candidates is large.
- A very similar proposal appears in Smolensky (1986).
- How to get w1, w2? They are learned from data (by calculating gradients).
- Needed: frequency counts and violation vectors (the same ingredients as for stochastic OT).
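
As a sketch of the gradient-based learning mentioned above (my own minimal implementation, not the one used in class): the gradient of the negative log-likelihood for each constraint is the observed violation rate minus the model's expected violation rate. The frequencies below are made up (.2/.8 rather than the slide's .5/.5) just so the learner has something to fit, and the weights are kept nonnegative in line with the "positive weights" assumption.

```python
import math

# Minimal gradient-descent sketch for fitting MaxEnt-OT constraint weights.
tableau = {"bab": [2, 0], "pap": [0, 1]}      # violation vectors per constraint
observed = {"bab": 0.2, "pap": 0.8}           # hypothetical candidate frequencies

def probs(w):
    scores = {c: math.exp(-sum(wk * v for wk, v in zip(w, viol)))
              for c, viol in tableau.items()}
    Z = sum(scores.values())                  # normalization constant
    return {c: s / Z for c, s in scores.items()}

w, lr = [0.0, 0.0], 0.5
for _ in range(500):
    p = probs(w)
    for k in range(len(w)):
        obs = sum(observed[c] * tableau[c][k] for c in tableau)   # observed violations
        exp_ = sum(p[c] * tableau[c][k] for c in tableau)         # expected violations
        w[k] = max(0.0, w[k] - lr * (obs - exp_))                 # gradient step, weights kept >= 0
print(w, probs(w))   # the fitted probabilities approach the observed frequencies
```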

Slide 6: Maximum entropy

Why exp{.}? It is like taking a maximum, but “soft” -- easy to differentiate and optimize.
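
A small numeric sketch (the penalty values are arbitrary) of what “soft” means here: as the penalties are scaled up, exp{-penalty} concentrates on the single best candidate, but the mapping stays smooth and differentiable at every scale.

```python
import math

# Illustrative penalties only: exp{-penalty} acts like a "soft" argmin.
penalties = {"A": 1.0, "B": 2.0, "C": 4.0}

def soft_choice(penalties, scale):
    scores = {c: math.exp(-scale * p) for c, p in penalties.items()}
    Z = sum(scores.values())
    return {c: round(s / Z, 3) for c, s in scores.items()}

for scale in (0.1, 1.0, 10.0):
    print(scale, soft_choice(penalties, scale))
# scale 0.1 -> nearly uniform; scale 10 -> almost all mass on "A", the minimum-penalty candidate
```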

Slide 7: Maximum entropy and OT

- Inputs are violation vectors, e.g. x = (2, 0) and x = (0, 1).
- Outputs are one of K winners -- essentially a classification problem.
- Violating a constraint works against the candidate: prob ~ exp{-(x1*w1 + x2*w2)}.
- Crucial difference: candidates are ordered by a single score, not by lexicographic order.

/bap/    P(.)   *[+voice]   Ident(voice)
[bab]    .5     2
[pap]    .5                 1
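
The contrast between lexicographic ordering and a single weighted score can be seen in a toy comparison (the candidates, ranking, and weights below are invented for illustration): several violations of a low-ranked constraint can gang up under the weighted score, but never under a strict ranking.

```python
# Toy contrast: OT compares violation vectors lexicographically under a strict
# ranking, while MaxEnt collapses them into one weighted penalty score.
ranking = [0, 1]                  # constraint 0 outranks constraint 1
weights = [1.0, 0.6]              # weights for the same two constraints

cand_A = (1, 0)                   # one violation of the higher-ranked constraint
cand_B = (0, 3)                   # several violations of the lower-ranked one

def ot_winner(a, b):
    for c in ranking:             # compare highest-ranked constraint first
        if a[c] != b[c]:
            return "A" if a[c] < b[c] else "B"
    return "tie"

def weighted_winner(a, b):
    score = lambda v: sum(w * x for w, x in zip(weights, v))   # single penalty score
    return "A" if score(a) < score(b) else "B"

print(ot_winner(cand_A, cand_B))        # B: it satisfies the top-ranked constraint
print(weighted_winner(cand_A, cand_B))  # A: three low-ranked violations gang up against B
```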

Slide 8: Maximum entropy

- Ordering discrete outputs from input vectors is a common problem; it is also called logistic regression (recall Nearey).
- Explaining the name: let P = p([bab] | /bap/); then log[P / (1 - P)] = w2 - 2*w1 (a linear expression on the right-hand side, reached through the logistic transform on the left).
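
A quick numerical check of the identity above, with hypothetical weight values:

```python
import math

# Hypothetical weights; the two printed numbers coincide, confirming the identity.
w1, w2 = 0.4, 1.1
Z = math.exp(-2 * w1) + math.exp(-w2)
P = math.exp(-2 * w1) / Z                 # P = p([bab] | /bap/)
print(math.log(P / (1 - P)))              # log-odds of [bab]
print(w2 - 2 * w1)                        # the same value, about 0.3
```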

Slide 9: The power of maximum entropy

- Max-Ent / logistic regression is widely used in many areas with interacting, correlated inputs.
- Recall Nearey: phones, diphones, ...
- NLP: tagging, labeling, parsing ... (anything with a discrete output).
- Easy to learn: there is only a global maximum, and optimization is efficient.
- Isn't this the greatest thing in the world? We still need to understand the story behind the exp{} (in a few minutes).

Slide 10: Demo: Spanish diminutives

- Data from Arbisi-Kelm.
- Constraints: ALIGN(TE, Word, R), MAX-OO(V), DEP-IO and BaseTooLittle.

Slide 11: Stochastic OT and Max-Ent

Is a better fit always a good thing?

Slide 12: Stochastic OT and Max-Ent

- Is a better fit always a good thing?
- Should model-fitting become a new fashion in phonology?

Slide 13: The crucial difference

What are the possible distributions p(. | /bap/) in this case?

/bap/    P(.)   *[+voice]   Ident(voice)
[bab]           2
[pap]                       1
[bap]           1
[pab]           1           1

Slide 14: The crucial difference

- What are the possible distributions p(. | /bap/) in this case?
- Max-Ent considers a much wider range of distributions.

/bap/    P(.)   *[+voice]   Ident(voice)
[bab]           2
[pap]                       1
[bap]           1
[pab]           1           1
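
To see the point concretely, here is a rough sketch (the weight grid is arbitrary) that sweeps over weight settings and prints the candidate distributions the Max-Ent model can reach for this tableau:

```python
import math

# Sweep over (w1, w2) and see what range of candidate distributions the MaxEnt
# model can reach. Violation vectors follow the table; the weight grid is arbitrary.
tableau = {"bab": (2, 0), "pap": (0, 1), "bap": (1, 0), "pab": (1, 1)}

def probs(w1, w2):
    scores = {c: math.exp(-(w1 * v[0] + w2 * v[1])) for c, v in tableau.items()}
    Z = sum(scores.values())
    return {c: round(s / Z, 3) for c, s in scores.items()}

for w1 in (0.0, 1.0, 3.0):
    for w2 in (0.0, 1.0, 3.0):
        print((w1, w2), probs(w1, w2))
# Different weight settings shift probability mass among all four candidates,
# including ones that no single strict ranking would ever select as a winner.
```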

Slide 15: What is maximum entropy anyway?

- Jaynes (1957): the most ignorant state corresponds to the distribution with the most entropy.
- Given a die, which distribution has the largest entropy?

Slide 16: What is maximum entropy anyway?

- Jaynes (1957): the most ignorant state corresponds to the distribution with the most entropy.
- Given a die, which distribution has the largest entropy?
- Add constraints to the distributions: the average of some feature functions is assumed to be fixed at its observed value, i.e. E_p[f_k(x)] = the observed average of f_k.
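
The die makes a compact worked example (the target mean below is my own choice): among all distributions over the six faces whose mean is pinned to an observed value, the maximum-entropy one takes an exponential form, and its single multiplier can be found by bisection.

```python
import math

# Maximum entropy on {1,...,6} subject to a fixed mean: the solution has the
# form p(i) proportional to exp(lam * i); solve for lam so the mean comes out right.
faces = range(1, 7)

def dist(lam):
    weights = [math.exp(lam * i) for i in faces]
    Z = sum(weights)
    return [w / Z for w in weights]

def mean(p):
    return sum(i * pi for i, pi in zip(faces, p))

target = 4.5                       # hypothetical observed average roll
lo, hi = -10.0, 10.0
for _ in range(100):               # bisection: mean(dist(lam)) increases with lam
    mid = (lo + hi) / 2
    if mean(dist(mid)) < target:
        lo = mid
    else:
        hi = mid
print(dist((lo + hi) / 2))         # skewed toward high faces, but as "flat" as the constraint allows
# With target = 3.5 the answer is the uniform distribution, the global entropy maximum.
```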

Slide 17: What is maximum entropy anyway?

- Examples of features: violations, word counts, N-grams, co-occurrences, ...
- The constraints change the shape of the maximum-entropy distribution.
- Solving the constrained optimization problem leads to p(x) ~ exp{Σ_k w_k * f_k(x)}.
- This form is very general (see later); there are many choices of f_k.

Slide 18: The basic intuition

- Stay as “ignorant” as possible (maximum entropy), as long as the chosen distribution matches certain “descriptions” of the empirical data (the statistics of the f_k(x)).
- Approximation property: any distribution can be approximated by a max-ent distribution with a sufficient number of features (Cramér and Wold).
- This is common practice in NLP.
- It is better seen as a “descriptive” model.

Slide 19: Going towards Markov random fields

- Maximum entropy applied to a conditional or joint distribution: p(y|x) or p(x, y) ~ exp{Σ_k w_k * f_k(x, y)}.
- There can be many creative ways of extracting features f_k(x, y).
- One way is to let a graph structure guide the calculation of features, e.g. through neighborhoods/cliques.
- This is known as a Markov network or Markov random field.

Slide 20: Conditional random field

- Impose a chain-structured graph and assign features to the edges.
- It is still a max-ent model, with the same calculation: node features f(x_i, y_i) and transition features m(y_i, y_{i+1}).
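
A compact linear-chain sketch of that calculation (the labels, observations, and feature tables are toy values of my own, not Wilson's model): the probability of a label sequence is exp{score}/Z, where the score sums the node features f(x_i, y_i) and the edge features m(y_i, y_{i+1}).

```python
import math
from itertools import product

labels = ["A", "B"]
x = ["x1", "x2", "x3"]                                   # an arbitrary observed sequence

# Hypothetical feature weights, indexed by (observation, label) and (label, label).
f = {("x1", "A"): 1.0, ("x1", "B"): 0.2,
     ("x2", "A"): 0.3, ("x2", "B"): 0.8,
     ("x3", "A"): 0.5, ("x3", "B"): 0.5}
m = {("A", "A"): 0.6, ("A", "B"): 0.1,
     ("B", "A"): 0.1, ("B", "B"): 0.6}

def score(x, y):
    s = sum(f[(xi, yi)] for xi, yi in zip(x, y))              # node terms
    s += sum(m[(y[i], y[i + 1])] for i in range(len(y) - 1))  # edge terms
    return s

# Z by brute-force enumeration; a forward pass would compute the same in linear time.
Z = sum(math.exp(score(x, y)) for y in product(labels, repeat=len(x)))

y = ("A", "A", "B")
print(math.exp(score(x, y)) / Z)                              # p(y | x) under this toy CRF
```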

Slide 21: Wilson's idea

Isn't this a familiar picture in phonology?
- m(y_i, y_{i+1}) -- Markedness (over the surface form)
- f(x_i, y_i) -- Faithfulness (linking the underlying form to the surface form)

Slide 22: The story of smoothing

- In Max-Ent models the weights can get very large and “over-fit” the data (see demo).
- It is common to penalize (smooth) this with a new objective function: new objective = old objective + parameter * magnitude of the weights.
- Wilson's claim: this smoothing parameter has to do with substantive bias in phonological learning.
- Constraints that force less similarity --> a higher penalty for them to change value.
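
A minimal sketch of the penalized objective (a generic L2 penalty on the weights; Wilson's substantively biased, per-constraint penalties are not reproduced here). With categorical data the unpenalized fit keeps improving as a weight grows without bound; the penalty term gives the objective a finite optimum.

```python
import math

tableau = {"bab": [2, 0], "pap": [0, 1]}
observed = {"bab": 0.0, "pap": 1.0}               # categorical (overfitting-prone) data
lam = 0.1                                         # smoothing parameter

def probs(w):
    scores = {c: math.exp(-sum(wk * v for wk, v in zip(w, viol)))
              for c, viol in tableau.items()}
    Z = sum(scores.values())
    return {c: s / Z for c, s in scores.items()}

def objective(w, lam):
    p = probs(w)
    nll = -sum(observed[c] * math.log(p[c]) for c in tableau if observed[c] > 0)
    return nll + lam * sum(wk ** 2 for wk in w)   # old objective + penalty on weight size

for w1 in (0.5, 1.0, 2.0, 5.0):
    print(w1, round(objective([w1, 0.0], 0.0), 3), round(objective([w1, 0.0], lam), 3))
# Unpenalized, the objective keeps improving as w1 grows;
# penalized, it has a finite optimum near w1 = 1.
```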

Slide 23: Wilson's model fitted to the velar palatalization data

