1
Maximum Likelihood-Maximum Entropy Duality: Session 1. Pushpak Bhattacharyya. Scribed by Aditya Joshi. Presented in the NLP-AI talk on 14th January, 2014.
2
A phenomenon/event could be a linguistic process such as POS tagging or sentiment prediction. A model uses data in order to "predict" future observations w.r.t. the phenomenon. (Diagram: Data/Observation, Phenomenon/Event, Model.)
3
Notations. X: x_1, x_2, x_3, ..., x_m (m observations). A: a random variable with n possible outcomes a_1, a_2, a_3, ..., a_n. E.g. one coin throw: a_1 = 0, a_2 = 1; one dice throw: a_1 = 1, a_2 = 2, a_3 = 3, a_4 = 4, a_5 = 5, a_6 = 6.
4
Goal: estimate P(a_i) = P_i. There are two paths: ML (maximum likelihood) and ME (maximum entropy). Are they equivalent?
5
Calculating probability from data. Suppose in X: x_1, x_2, x_3, ..., x_m (m observations), outcome a_i occurs f(a_i) = f_i times. E.g. dice: if the outcomes are 1 1 2 3 1 5 3 4 2 1, then f(1) = 4, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1, f(6) = 0 and m = 10. Hence P_1 = 4/10, P_2 = 2/10, P_3 = 2/10, P_4 = 1/10, P_5 = 1/10, P_6 = 0/10.
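As a quick illustration of these counts, here is a minimal Python sketch (not part of the original talk) that recovers the same frequencies and probabilities from the ten dice outcomes:

```python
# Illustrative only: empirical probabilities P_i = f_i / m for the dice data above.
from collections import Counter

rolls = [1, 1, 2, 3, 1, 5, 3, 4, 2, 1]
m = len(rolls)                                  # m = 10 observations
f = Counter(rolls)                              # f(a_i): count of each outcome
P = {a: f.get(a, 0) / m for a in range(1, 7)}   # P_i = f_i / m
print(P)   # {1: 0.4, 2: 0.2, 3: 0.2, 4: 0.1, 5: 0.1, 6: 0.0}
```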
6
In general, the task is: get θ, the probability vector, from X.
7
MLE: θ* = argmax_θ Pr(X; θ). With the i.i.d. (independent and identically distributed) assumption, Pr(X; θ) = ∏_{i=1}^{m} Pr(x_i; θ), so θ* = argmax_θ ∏_{i=1}^{m} Pr(x_i; θ). Grouping equal factors, this product is P(a_1)^{f_1} P(a_2)^{f_2} ... P(a_n)^{f_n}.
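The effect of the i.i.d. assumption can be seen numerically. The sketch below (illustrative only, not from the talk) computes the log of this product likelihood for the dice data of the previous slide under two candidate θ vectors; the empirical estimate gives the higher log-likelihood:

```python
# Illustrative only: log Pr(X; theta) = sum_i log theta[x_i] under the i.i.d. assumption.
import math

rolls = [1, 1, 2, 3, 1, 5, 3, 4, 2, 1]

def log_likelihood(theta, data):
    # Sum of per-observation log-probabilities (log of the i.i.d. product).
    return sum(math.log(theta[x]) for x in data)

empirical = {1: 0.4, 2: 0.2, 3: 0.2, 4: 0.1, 5: 0.1, 6: 0.0}  # outcome 6 never observed, so log(0) is never taken
uniform   = {a: 1 / 6 for a in range(1, 7)}

print(log_likelihood(empirical, rolls))   # ~ -14.7 (higher)
print(log_likelihood(uniform, rolls))     # ~ -17.9 (lower)
```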
8
What is known about θ: Σ_{i=1}^{n} P_i = 1 and P_i ≥ 0 for all i. Introducing entropy: H(θ) = -Σ_{i=1}^{n} P_i ln P_i (the entropy of the distribution).
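A small helper makes this definition concrete (an illustrative sketch, not from the talk; the entropy function is my own naming):

```python
# Illustrative only: H(theta) = -sum_i P_i ln P_i, with 0 * log 0 treated as 0.
import math

def entropy(P):
    return -sum(p * math.log(p) for p in P if p > 0)

print(entropy([1 / 6] * 6))                       # ln 6 ~ 1.792, the maximum for n = 6
print(entropy([0.4, 0.2, 0.2, 0.1, 0.1, 0.0]))    # ~ 1.47, lower than the uniform case
```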
9
Some intuition: an example with dice. Outcomes = 1, 2, 3, 4, 5, 6, with P(1) + P(2) + P(3) + ... + P(6) = 1, and Entropy(dice) = H(θ) = -Σ_{i=1}^{6} P(i) ln P(i). Now, there is a principle called Laplace's principle of insufficient reason (the principle of indifference).
10
The best estimate for the dice: P(1) = P(2) = ... = P(6) = 1/6. We will now prove it assuming NO KNOWLEDGE about the dice except that it has six outcomes, each with probability ≥ 0 and Σ P_i = 1.
11
What does “best” mean? “Best” means most consistent with the situation: these P_i values should be such that they maximize the entropy.
12
Optimization formulation. Maximize -Σ_{i=1}^{6} P(i) log P(i) subject to: Σ_{i=1}^{6} P(i) = 1 and P(i) ≥ 0 for i = 1 to 6.
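This constrained maximization can also be checked numerically. The following sketch (my illustration, using SciPy's general-purpose minimize with an equality constraint; not anything used in the talk) maximizes the entropy of a 6-outcome distribution and lands on the uniform vector:

```python
# Illustrative only: maximize H(p) over the 6-outcome simplex by minimizing -H(p).
import numpy as np
from scipy.optimize import minimize

n = 6

def neg_entropy(p):
    # Minimizing sum p_i ln p_i is the same as maximizing H(p) = -sum p_i ln p_i.
    p = np.clip(p, 1e-12, 1.0)      # avoid log(0)
    return np.sum(p * np.log(p))

constraints = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}]  # sum P(i) = 1
bounds = [(0.0, 1.0)] * n                                         # P(i) >= 0
p0 = np.random.dirichlet(np.ones(n))                              # arbitrary feasible start

res = minimize(neg_entropy, p0, bounds=bounds, constraints=constraints)
print(np.round(res.x, 3))   # ~ [0.167 0.167 0.167 0.167 0.167 0.167]
```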
13
Solving the optimization (1/2). Using Lagrange multipliers, the optimization can be written as: Q = -Σ_{i=1}^{6} P(i) log P(i) - λ (Σ_{i=1}^{6} P(i) - 1) - Σ_{i=1}^{6} β_i P(i). For now, let us ignore the last term. We will come to it later.
14
Solving the optimization (2/2). Differentiating Q w.r.t. P(i) (ignoring the β term), we get ∂Q/∂P(i) = -log P(i) - 1 - λ. Equating to zero: log P(i) + 1 + λ = 0, so log P(i) = -(1 + λ) and P(i) = e^{-(1+λ)}. Since the right-hand side does not depend on i, this means that to maximize entropy, every P(i) must be equal.
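The same stationarity condition can be checked symbolically; the sketch below (illustrative, assuming SymPy; not part of the talk) solves ∂Q/∂P(i) = 0 and then applies the sum-to-one constraint:

```python
# Illustrative only: symbolic check that the stationary point is P(i) = 1/6.
import sympy as sp

p = sp.symbols('p', positive=True)
lam = sp.symbols('lam', real=True)

# Stationarity condition from the slide: -log P(i) - 1 - lam = 0.
p_star = sp.solve(sp.Eq(-sp.log(p) - 1 - lam, 0), p)[0]   # exp(-(1 + lam))
# Every P(i) equals this same constant, so sum P(i) = 1 becomes 6 * p_star = 1.
lam_star = sp.solve(sp.Eq(6 * p_star, 1), lam)[0]
print(p_star.subs(lam, lam_star))   # 1/6
```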
15
This shows that P(1) = P(2) = ... = P(6). But P(1) + P(2) + ... + P(6) = 1. Therefore P(1) = P(2) = ... = P(6) = 1/6.
16
Introducing data into the notion of entropy. Now, we introduce data: X: x_1, x_2, x_3, ..., x_m (m observations); A: a_1, a_2, a_3, ..., a_n (n outcomes). E.g. for a coin, in the absence of data, P(H) = P(T) = 1/2 (as shown in the previous proof). However, if data X is observed as follows: Obs-1: H H T H T T H H H T (m = 10, n = 2), then P(H) = 6/10, P(T) = 4/10. Obs-2: T T H T H T H T T T (m = 10, n = 2), then P(H) = 3/10, P(T) = 7/10. Which of these is a valid estimate?
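Numerically, the three coin estimates above have decreasing entropy; a short illustrative check (not from the talk):

```python
# Illustrative only: entropy of the no-data uniform estimate versus the two observed datasets.
import math

def entropy(P):
    return -sum(p * math.log(p) for p in P if p > 0)

print(entropy([0.5, 0.5]))    # no data: ln 2 ~ 0.693 (maximum for a coin)
print(entropy([0.6, 0.4]))    # Obs-1: ~ 0.673
print(entropy([0.3, 0.7]))    # Obs-2: ~ 0.611
```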
17
Change in entropy: entropy reduces as data is observed! (Diagram: E_max (uniform distribution), E_1 (P(H) = 6/10), E_2 (P(H) = 3/10).)
18
Start of duality. Maximizing entropy in this situation is the same as minimizing the `entropy reduction' distance, i.e. minimizing the "relative entropy". (Diagram: E_max (maximum entropy: uniform distribution), E_data (entropy when observations are made), with the gap between them marked as the entropy reduction.)
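For the coin example, the `entropy reduction' E_max - E_data and the relative entropy (KL divergence) from the observed distribution to the uniform one coincide; the sketch below (illustrative, not from the talk) checks this numerically:

```python
# Illustrative only: for a uniform reference Q, D(P || Q) = ln n - H(P) = E_max - E_data.
import math

def entropy(P):
    return -sum(p * math.log(p) for p in P if p > 0)

def kl(P, Q):
    # Relative entropy D(P || Q) = sum_i P_i ln(P_i / Q_i).
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

P_data  = [0.6, 0.4]          # coin distribution from Obs-1
uniform = [0.5, 0.5]

print(entropy(uniform) - entropy(P_data))   # entropy reduction ~ 0.0201
print(kl(P_data, uniform))                  # same value ~ 0.0201
```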
19
Concluding remarks. Thus, in this discussion of ML-ME duality, we will show that MLE minimizes the relative-entropy distance from the uniform distribution. Question: the entropy corresponds to probability vectors, and the distance between them could instead be measured by squared (Euclidean) distance. Why relative entropy?