Presentation on theme: "If we measured a distribution P, what is the tree-dependent distribution P_t that best approximates P?" — Presentation transcript:

1

2 If we measured a distribution P, what is the tree-dependent distribution P_t that best approximates P? Search space: all possible trees. Goal: from all possible trees, find the one closest to P. Distance measurement: the Kullback–Leibler cross-entropy measure. Operators/procedure: how to move through this search space (described in the following slides).

3 Problem definition. X_1, …, X_n are random variables; P is unknown. Given independent samples x_1, …, x_s drawn from the distribution P, estimate P. Solution 1 – independence: assume X_1, …, X_n are independent, so P(x) = Π_i P(x_i). Solution 2 – trees: P(x) = Π_i P(x_i | x_j), where x_j is the parent of x_i in some tree.
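
For illustration (ours, not from the slides), a toy sketch of evaluating the two factorizations for binary variables; the example tree, tables, and names are all assumptions:

```python
import numpy as np

# Hypothetical example: four binary variables X1..X4.
# Solution 1 (full independence): P(x) = prod_i P(x_i)
marginals = {1: [0.6, 0.4], 2: [0.7, 0.3], 3: [0.5, 0.5], 4: [0.9, 0.1]}

def p_independent(x):
    return np.prod([marginals[i][x[i]] for i in marginals])

# Solution 2 (tree): P(x) = prod_i P(x_i | x_{j(i)}), j(i) = parent of i in some tree.
# Assumed tree: 1 is the root; 2 and 3 hang off 1; 4 hangs off 3.
parent = {2: 1, 3: 1, 4: 3}
cond = {  # cond[i][parent_value][child_value] = P(x_i = child_value | parent = parent_value)
    2: [[0.8, 0.2], [0.4, 0.6]],
    3: [[0.6, 0.4], [0.3, 0.7]],
    4: [[0.95, 0.05], [0.5, 0.5]],
}

def p_tree(x):
    p = marginals[1][x[1]]            # root marginal
    for i, j in parent.items():
        p *= cond[i][x[j]][x[i]]      # P(x_i | x_parent)
    return p

x = {1: 0, 2: 1, 3: 0, 4: 0}
print(p_independent(x), p_tree(x))
```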

4 Kullback–Leibler cross-entropy measure. For probability distributions P and Q of a discrete random variable, the K–L divergence of Q from P is defined as D_KL(P ‖ Q) = Σ_x P(x) log [ P(x) / Q(x) ]. From the definition it can be seen that D_KL(P ‖ Q) = H(P, Q) − H(P), where H(P, Q) is called the cross entropy of P and Q, and H(P) is the entropy of P. The measure is non-negative (by Gibbs' inequality).
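
A minimal sketch (ours, not from the slides) of the K–L divergence for two discrete distributions, using natural logarithms and the convention 0·log 0 = 0:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete distributions.

    Terms with p[x] == 0 contribute 0 by convention; q must be > 0 wherever p > 0.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Example: a biased coin measured against a fair coin.
print(kl_divergence([0.01, 0.99], [0.5, 0.5]))   # > 0
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))     # 0.0 when P == Q
```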

5 Entropy is a measure of uncertainty. Fair coin: H(½, ½) = −½ log2(½) − ½ log2(½) = 1 bit (i.e., we need 1 bit to convey the outcome of a coin flip). Biased coin: H(1/100, 99/100) = −1/100 log2(1/100) − 99/100 log2(99/100) ≈ 0.08 bit. As P(heads) → 1, the information in the actual outcome → 0: H(0, 1) = H(1, 0) = 0 bits, i.e., no uncertainty is left in the source (using the convention 0 · log2(0) = 0).
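
The coin numbers above can be checked with a few lines (a Python sketch, ours):

```python
import numpy as np

def entropy_bits(probs):
    """Shannon entropy in bits, with the convention 0 * log2(0) = 0."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy_bits([0.5, 0.5]))     # 1.0 bit   (fair coin)
print(entropy_bits([0.01, 0.99]))   # ~0.0808 bit (biased coin)
print(entropy_bits([1.0, 0.0]))     # 0.0 bits  (no uncertainty)
```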

6 Optimization task. Init: fix the structure of some tree t. Assign probabilities: what conditional probabilities P_t(x|y) would yield the best approximation of P? Procedure: vary the structure of t over all possible spanning trees. Goal: among all trees with assigned probabilities, which one is the closest to P?

7 What probabilities to assign? Theorem 1: If we force the probabilities along the branches of the tree t to coincide with those computed from P, we get the best t-dependent approximation of P.

8 How to vary over all trees? How do we move in the search space? Theorem 2: The distance measure (Kullback–Leibler) is minimized by assigning the best distribution on any maximum-weight spanning tree, where the weight on the branch (x, y) is given by the mutual information measure I(x; y).

9 Mutual information measures how much knowing one of the variables reduces our uncertainty about the other. In the extreme case where one variable determines the other, the mutual information equals the uncertainty contained in Y (or X) alone, namely the entropy of Y (or X). Mutual information is a measure of dependence: it is non-negative (I(X;Y) ≥ 0) and symmetric (I(X;Y) = I(Y;X)).
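
A minimal sketch (ours) of the mutual information of two discrete variables computed from their joint table, in natural-log units; the function name is an assumption:

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} P(x,y) * log( P(x,y) / (P(x) P(y)) ), joint given as a 2-D table."""
    pxy = np.asarray(joint, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)   # marginal P(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal P(y)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

# Independent variables -> I = 0; identical variables -> I = H(X).
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # log(2) ≈ 0.693 nats
```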

10 The algorithm. Find a maximum-weight spanning tree with edge weights given by the mutual information I(X_i; X_j) = Σ P(x_i, x_j) log [ P(x_i, x_j) / (P(x_i) P(x_j)) ]. Compute P_t: select an arbitrary root node and compute the conditional probabilities P(x_i | x_j) along the branches from P.
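
Putting the steps together, here is a compact sketch of the procedure (our code, not from the slides): it assumes the pairwise joint tables P(X_i, X_j) have already been estimated, weights every pair by mutual information, builds a maximum-weight spanning tree with Kruskal's method, and orients it from an arbitrary root.

```python
import numpy as np
from itertools import combinations
from collections import defaultdict, deque

def mutual_information(pxy):
    """I(X;Y) from a 2-D joint probability table, in nats."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

def chow_liu_tree(pairwise_joints, n_vars):
    """Return a list of directed edges (parent, child) of the learned tree.

    pairwise_joints[(i, j)] is the estimated joint table of X_i and X_j, i < j.
    """
    # 1. Edge weights = pairwise mutual information.
    edges = [(mutual_information(pairwise_joints[(i, j)]), i, j)
             for i, j in combinations(range(n_vars), 2)]

    # 2. Maximum-weight spanning tree (Kruskal with union-find).
    comp = list(range(n_vars))
    def find(u):
        while comp[u] != u:
            comp[u] = comp[comp[u]]
            u = comp[u]
        return u
    tree = defaultdict(list)
    for w, i, j in sorted(edges, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:
            comp[ri] = rj
            tree[i].append(j)
            tree[j].append(i)

    # 3. Pick an arbitrary root (node 0) and orient the branches away from it.
    directed, seen, queue = [], {0}, deque([0])
    while queue:
        u = queue.popleft()
        for v in tree[u]:
            if v not in seen:
                seen.add(v)
                directed.append((u, v))   # u is the parent of v
                queue.append(v)
    return directed
```

The conditional tables P(x_i | x_{j(i)}) along the returned branches can then be read off from the corresponding pairwise joints, as Theorem 1 prescribes.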

11 Illustration of CL-tree learning on four variables A, B, C, D. Pairwise mutual-information weights and pairwise joint distributions:
  Pair   MI weight   joint distribution
  AB     0.3126      (0.56, 0.11, 0.02, 0.31)
  AC     0.0229      (0.51, 0.17, 0.17, 0.15)
  AD     0.0172      (0.53, 0.15, 0.19, 0.13)
  BC     0.0230      (0.44, 0.14, 0.23, 0.19)
  BD     0.0183      (0.46, 0.12, 0.26, 0.16)
  CD     0.2603      (0.64, 0.04, 0.08, 0.24)
The maximum-weight spanning tree keeps the branches A–B, C–D, and B–C.
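
As a quick check (ours, not part of the slide), a greedy maximum-spanning-tree pass over exactly these weights reproduces that tree:

```python
# Edge weights (pairwise mutual information) from the illustration slide.
weights = {('A', 'B'): 0.3126, ('A', 'C'): 0.0229, ('A', 'D'): 0.0172,
           ('B', 'C'): 0.0230, ('B', 'D'): 0.0183, ('C', 'D'): 0.2603}

# Greedy Kruskal: take edges in decreasing weight, skip any that closes a cycle.
component = {v: v for v in 'ABCD'}
def find(v):
    while component[v] != v:
        v = component[v]
    return v

tree = []
for (u, v), w in sorted(weights.items(), key=lambda kv: -kv[1]):
    ru, rv = find(u), find(v)
    if ru != rv:
        component[ru] = rv
        tree.append((u, v, w))

print(tree)   # [('A', 'B', 0.3126), ('C', 'D', 0.2603), ('B', 'C', 0.023)]
```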

12 Proof of Theorem 1. Claim (restated): if we force the probabilities along the branches of the tree t to coincide with those computed from P, we get the best t-dependent approximation of P.

13 By Gibbs' inequality, the expression Σ_x P(x) log P'(x) is maximized over P' when P'(x) = P(x); hence the whole expression is maximal when P'(x_i | x_j) = P(x_i | x_j). Q.E.D.
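
For reference, a standard statement of Gibbs' inequality used in this step (our LaTeX rendering, not from the slide):

```latex
\sum_{x} P(x)\,\log P'(x) \;\le\; \sum_{x} P(x)\,\log P(x),
\quad \text{with equality iff } P' = P;
\quad \text{equivalently} \quad
D_{KL}(P \,\|\, P') = \sum_x P(x)\,\log\frac{P(x)}{P'(x)} \;\ge\; 0 .
```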

16 Proof of Theorem 2. Recall: the Kullback–Leibler distance is minimized by assigning the best distribution on any maximum-weight spanning tree, where the weight on the branch (x, y) is the mutual information I(x; y). From Theorem 1 we know which branch probabilities to assign; substituting them and applying Bayes' rule expands D_KL(P, P_t) into the decomposition below, so that the tree maximizing the total branch weight minimizes D_KL.
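
The decomposition referred to above, reconstructed in LaTeX following the standard Chow-Liu derivation (ours; j(i) denotes the parent of X_i in the tree t):

```latex
D_{KL}(P \,\|\, P_t)
  \;=\; \sum_{\mathbf{x}} P(\mathbf{x}) \,\log \frac{P(\mathbf{x})}{\prod_i P\!\left(x_i \mid x_{j(i)}\right)}
  \;=\; -\sum_{i} I\!\left(X_i ;\, X_{j(i)}\right) \;+\; \sum_{i} H(X_i) \;-\; H(X_1,\dots,X_n).
```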

18 Proof of Theorem 2 (continued). In that decomposition, the second and third terms, Σ_i H(X_i) and H(X_1, …, X_n), are independent of the tree t, and D(P, P_t) is non-negative (Gibbs' inequality). Thus, minimizing the distance D(P, P_t) is equivalent to maximizing the sum of branch weights Σ_i I(X_i; X_{j(i)}). Q.E.D.

19 Chow-Liu (CL) results. If the distribution P is tree-structured, CL finds the correct tree. If P is not tree-structured, CL finds the tree-structured Q with minimal KL divergence, argmin_Q KL(P; Q). Even though there are on the order of 2^(n log n) possible trees, CL finds the best one in polynomial time, O(n² [m + log n]).

20 Chow-Liu trees – summary. Approximation of a joint distribution with a tree-structured distribution [Chow and Liu 68]. Learning the structure and the probabilities:
– Compute individual and pairwise marginal distributions for all pairs of variables.
– Compute the mutual information (MI) for each pair of variables.
– Build a maximum spanning tree for a complete graph with the variables as nodes and the MIs as edge weights.
Properties:
– Efficient: O(#samples × (#variables)² × (#values per variable)²).
– Optimal among tree-structured approximations.
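
A small sketch (ours) of the first step the summary lists, estimating the pairwise joint tables from raw samples by relative frequencies; the function name and array layout are assumptions:

```python
import numpy as np
from itertools import combinations

def pairwise_joint_tables(samples, n_values):
    """Estimate P(X_i, X_j) for all pairs from an (n_samples, n_vars) array of
    discrete values in {0, ..., n_values-1}, using relative-frequency counts."""
    samples = np.asarray(samples)
    n_vars = samples.shape[1]
    joints = {}
    for i, j in combinations(range(n_vars), 2):
        table = np.zeros((n_values, n_values))
        for xi, xj in zip(samples[:, i], samples[:, j]):
            table[xi, xj] += 1
        joints[(i, j)] = table / len(samples)
    return joints

# These tables feed the mutual-information / maximum-spanning-tree step above;
# the overall cost is O(#samples * #variables^2 * #values^2), matching the slide.
```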

21 References. S. Kullback (1959). Information Theory and Statistics. John Wiley and Sons, NY. Chow, C. K. and Liu, C. N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, IT-14(3):462–467.

