Presentation transcript: "Problem Introduction, Chow's Problem, Solution, Example, Proof of Correctness"

1 Problem Introduction – Chow's Problem – Solution – Example – Proof of Correctness

2 If we measured a distribution P, what is the tree-dependent distribution P^t that best approximates P?
– Search space: all possible spanning trees
– Goal: from all possible spanning trees, find the one closest to P
– Distance measurement: the Kullback–Leibler cross-entropy measure
– Operators/procedure

4 Why use trees? Problem definition:
– X = (X1, …, Xn), each variable taking one of r values; the distribution P is unknown
– Given independent samples x1, …, xs drawn from P, estimate P
Solution 1 – full joint: calculate the relative frequency of each x among the observations x1, …, xs; requires storing r^n values
Solution 2 – independence: assume X1, …, Xn are independent, so P(x) = Π P(xi); requires storing only nr values
Solution 3 – trees: P(x) = Π P(xi | xj), where xj is the parent of xi in some orientation of the tree
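To make the storage comparison concrete, here is a small back-of-the-envelope sketch for a hypothetical case of n = 10 binary variables (r = 2); counting raw table entries rather than free parameters is an assumption made purely for illustration.

```python
# Rough count of stored table entries for the three options on this slide,
# for a hypothetical example with n = 10 variables, each taking r = 2 values.
n, r = 10, 2

full_joint   = r ** n                 # Solution 1: one entry per joint configuration
independence = n * r                  # Solution 2: one marginal table per variable
tree         = r + (n - 1) * r * r    # Solution 3: root marginal + one r-by-r conditional table per edge

print(full_joint, independence, tree)   # 1024, 20, 38
```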

5 Recap: given the measured distribution P, find the tree-dependent distribution P^t that best approximates it; the search space is all possible spanning trees and the distance is the Kullback–Leibler cross-entropy measure.

6 Kullback–Leibler cross-entropy measure. For probability distributions P and Q of a discrete random variable, the K–L divergence of Q from P is defined as
D_KL(P ‖ Q) = Σ_x P(x) log [ P(x) / Q(x) ].
It can be seen from the definition that
D_KL(P ‖ Q) = H(P, Q) − H(P),
where H(P, Q) is the cross entropy of P and Q, and H(P) is the entropy of P. The divergence is a non-negative measure (by Gibbs' inequality).
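As a concrete illustration of the definition above, here is a minimal Python sketch (not part of the slides) that computes the K–L divergence of two discrete distributions stored as dictionaries; the function name and the example numbers are made up.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log2(P(x) / Q(x)) for discrete distributions
    given as dicts mapping outcomes to probabilities (assumes Q(x) > 0 wherever P(x) > 0)."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {'a': 0.5, 'b': 0.5}
q = {'a': 0.9, 'b': 0.1}
print(kl_divergence(p, q))   # ~0.737 bits; kl_divergence(q, p) gives a different value -> not symmetric
print(kl_divergence(p, p))   # 0.0 -- zero exactly when P == Q (Gibbs' inequality)
```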

7 Entropy. For V with distribution [p(V = 1), p(V = 0)]:
H(V) = −Σ_i P(V = v_i) log₂ P(V = v_i)
≈ the number of bits needed to obtain full information, i.e. the average surprise of the result of one "trial" of V. Entropy ≈ a measure of uncertainty.

8 Examples of entropy
– Fair coin: H(½, ½) = −½ log₂(½) − ½ log₂(½) = 1 bit (i.e., 1 bit is needed to convey the outcome of a coin flip)
– Biased coin: H(1/100, 99/100) = −1/100 log₂(1/100) − 99/100 log₂(99/100) ≈ 0.08 bit
– As P(heads) → 1, the information in the actual outcome → 0: H(0, 1) = H(1, 0) = 0 bits, i.e., no uncertainty left in the source (using the convention 0 · log₂ 0 = 0)
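A minimal sketch reproducing the coin examples above, assuming the same base-2 logarithm and the 0 · log₂ 0 = 0 convention; the `entropy` helper is illustrative, not from the slides.

```python
import math

def entropy(probs):
    """H = -sum p*log2(p) over a discrete distribution given as a list of probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)   # convention: 0*log2(0) = 0

print(entropy([0.5, 0.5]))       # 1.0 bit   (fair coin)
print(entropy([0.01, 0.99]))     # ~0.081 bit (heavily biased coin)
print(entropy([0.0, 1.0]))       # 0.0 bits  (no uncertainty left)
```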

9 Recap: find the tree-dependent distribution P^t closest to P over all possible spanning trees, using the Kullback–Leibler cross-entropy measure as the distance.

10 Optimization task
– Init: fix the structure of some tree t
– Assign probabilities: what conditional probabilities P^t(x|y) would yield the best approximation of P?
– Procedure: vary the structure of t over all possible spanning trees
– Goal: among all trees with their assigned probabilities, which is closest to P?

12 What probabilities should we assign? Theorem 1: if we force the conditional probabilities along the branches of the tree t to coincide with those computed from P, we get the best t-dependent approximation of P.

13 Optimization task (recap): fix the structure of some tree t, ask what conditional probabilities P^t(x|y) would yield the best approximation of P, vary the structure of t over all possible spanning trees, and find which tree (with its probabilities) is closest to P.

14 How do we vary t, i.e., how do we move in the search space? Theorem 2: the Kullback–Leibler distance is minimized by assigning the best distribution on any maximum-weight spanning tree, where the weight on the branch (x, y) is given by the mutual information measure I(X; Y).

15 Mutual information. The mutual information of two random variables is a quantity that measures the mutual dependence of the two variables:
I(X; Y) = Σ_{x,y} P(x, y) log [ P(x, y) / (P(x) P(y)) ].
Intuitively, mutual information measures the information that X and Y share.

16 Mutual information (continued)
– It measures how much knowing one of the variables reduces our uncertainty about the other
– In the extreme case where knowing X determines Y (and vice versa), the mutual information equals the uncertainty contained in Y (or X) alone, namely the entropy of Y (or X)
– Mutual information is a measure of dependence
– Mutual information is non-negative (I(X; Y) ≥ 0) and symmetric (I(X; Y) = I(Y; X))
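A short sketch of the quantities just listed, for a joint distribution over two discrete variables stored as a dictionary; the toy probabilities are made up, and the second call simply checks the symmetry property I(X;Y) = I(Y;X).

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} P(x,y) * log2( P(x,y) / (P(x)P(y)) ),
    with the joint given as a dict {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Toy joint distribution over two binary variables (made-up numbers):
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(mutual_information(joint))                                       # > 0: X and Y are dependent
print(mutual_information({(x, y): p for (y, x), p in joint.items()}))  # same value: I(X;Y) = I(Y;X)
```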

17 Mutual information – example. Input data: p1(1) = 0.53, p2(1) = 0.42, p3(1) = 0.39.

X1 X2 X3   P(X1,X2,X3)   P(X1)P(X2)P(X3)
0  0  0    0.15          0.166
0  0  1    0.14          0.106
0  1  0    0             0.12
0  1  1    0.18          0.077
1  0  0    0.29          0.188
1  0  1    0             0.12
1  1  0    0.17          0.135
1  1  1    0.07          0.087

Mutual information matrix (rows/columns X1, X2, X3; diagonal not defined):
  –      0.004  0.242
  0.004  –      0.093
  0.242  0.093  –
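The matrix above can be reproduced from the joint table by marginalizing out the third variable for each pair; the sketch below (not from the slides) does exactly that and prints each pairwise mutual information in bits.

```python
import math
from itertools import combinations

# The joint distribution from the slide, P(X1, X2, X3), indexed by (x1, x2, x3).
joint = {(0,0,0): 0.15, (0,0,1): 0.14, (0,1,0): 0.00, (0,1,1): 0.18,
         (1,0,0): 0.29, (1,0,1): 0.00, (1,1,0): 0.17, (1,1,1): 0.07}

def pairwise_mi(joint, i, j):
    """I(X_i; X_j) in bits, computed from the full joint by marginalization."""
    pij, pi, pj = {}, {}, {}
    for x, p in joint.items():
        pij[(x[i], x[j])] = pij.get((x[i], x[j]), 0.0) + p
        pi[x[i]] = pi.get(x[i], 0.0) + p
        pj[x[j]] = pj.get(x[j], 0.0) + p
    return sum(p * math.log2(p / (pi[a] * pj[b]))
               for (a, b), p in pij.items() if p > 0)

for i, j in combinations(range(3), 2):
    print(f"I(X{i+1};X{j+1}) = {pairwise_mi(joint, i, j):.3f}")
# I(X1;X2) = 0.004, I(X1;X3) = 0.242, I(X2;X3) = 0.093 -- matching the matrix on the slide
```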

18 The algorithm
– Use Kruskal's algorithm to find a maximum-weight spanning tree, with the weight of edge (Xi, Xj) given by the mutual information I(Xi; Xj)
– Compute P^t: select an arbitrary root node and set P^t(x) = Π P(xi | x_parent(i)), with the root's factor being its marginal, using the conditionals computed from P
A sketch of this procedure appears below.
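The following is a minimal end-to-end illustration (not the authors' code): it assumes the data are given as a list of tuples of discrete values, estimates the pairwise mutual informations from counts, and runs Kruskal's algorithm by hand with a union-find; `chow_liu_tree` and its arguments are made-up names. P^t is then obtained by rooting the returned tree at any node and multiplying the conditionals along its edges (orienting edges away from the root, as on slide 42).

```python
import math
from collections import Counter
from itertools import combinations

def chow_liu_tree(samples, n_vars):
    """Sketch of the procedure on this slide, assuming `samples` is a list of tuples of
    discrete values (one entry per variable). Returns the edges (i, j, weight) of a
    maximum-weight spanning tree, with mutual information as the edge weight."""
    m = len(samples)

    def mutual_info(i, j):
        joint = Counter((s[i], s[j]) for s in samples)
        pi = Counter(s[i] for s in samples)
        pj = Counter(s[j] for s in samples)
        return sum((c / m) * math.log2((c / m) / ((pi[a] / m) * (pj[b] / m)))
                   for (a, b), c in joint.items())

    # Kruskal on the complete graph, heaviest edges first (maximum spanning tree).
    edges = sorted(((mutual_info(i, j), i, j) for i, j in combinations(range(n_vars), 2)),
                   reverse=True)
    parent = list(range(n_vars))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:              # skip edges that would close a cycle
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

# usage (toy data): chow_liu_tree([(0, 0, 1), (1, 1, 0), (0, 0, 1), (1, 1, 1)], n_vars=3)
```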

19–26 [Figure slides: a step-by-step run of Kruskal's algorithm on a six-node example graph (nodes A–F, integer edge weights); the heaviest remaining edge is added at each step, and on slide 24 a candidate edge is rejected because it would close a cycle.]

27 [Figure: the resulting maximum-weight spanning tree over nodes A–F.]

28 Illustration of CL-tree learning. For four binary variables A, B, C, D, the slide lists each pair's joint probabilities and its mutual-information edge weight:

Pair  Pairwise joint probabilities  MI weight
AB    (0.56, 0.11, 0.02, 0.31)      0.3126
AC    (0.51, 0.17, 0.17, 0.15)      0.0229
AD    (0.53, 0.15, 0.19, 0.13)      0.0172
BC    (0.44, 0.14, 0.23, 0.19)      0.0230
BD    (0.46, 0.12, 0.26, 0.16)      0.0183
CD    (0.64, 0.04, 0.08, 0.24)      0.2603

With these weights the maximum-weight spanning tree over A, B, C, D keeps the edges AB, CD, and BC. [The original slide also showed the four-node graph and the learned tree as figures.]

29 Theorem 1 (restated): if we force the conditional probabilities along the branches of the tree t to coincide with those computed from P, we get the best t-dependent approximation of P.

30 Proof: write out D(P, P^t) and separate the terms involving the tree's conditional probabilities. By Gibbs' inequality, an expression of the form Σ P(x) log P'(x) is maximized when P'(x) = P(x); hence the whole expression is maximal when P^t(xi | xj) = P(xi | xj). Q.E.D.
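The formulas on this slide were images and did not survive in the transcript; the LaTeX below reconstructs the standard argument, writing x_{j(i)} for the parent of x_i in the oriented tree (the root's factor is its unconditioned marginal).

```latex
% Reconstruction of the missing derivation (standard Chow-Liu argument).
\begin{align*}
D(P,P^{t})
  &= \sum_{x} P(x)\,\log\frac{P(x)}{P^{t}(x)}
   = -H(P) \;-\; \sum_{x} P(x)\sum_{i=1}^{n}\log P^{t}\!\left(x_i \mid x_{j(i)}\right) \\
  &= -H(P) \;-\; \sum_{i=1}^{n}\;\sum_{x_i,\,x_{j(i)}} P\!\left(x_i,x_{j(i)}\right)\log P^{t}\!\left(x_i \mid x_{j(i)}\right).
\end{align*}
% For each i and each fixed parent value x_{j(i)}, Gibbs' inequality gives
%   \sum_{x_i} P(x_i \mid x_{j(i)}) \log P^{t}(x_i \mid x_{j(i)})
%     \le \sum_{x_i} P(x_i \mid x_{j(i)}) \log P(x_i \mid x_{j(i)}),
% with equality iff P^{t}(x_i \mid x_{j(i)}) = P(x_i \mid x_{j(i)}),
% so D(P,P^{t}) is minimized by matching the branch probabilities to P.
```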

31 Theorem 1 relies on Gibbs' inequality: for probability distributions P and P', Σ_x P(x) log P'(x) ≤ Σ_x P(x) log P(x), with equality if and only if P' = P.

33 Theorem 2: the Kullback–Leibler distance is minimized by assigning the best distribution on any maximum-weight spanning tree, where the weight on the branch (x, y) is the mutual information I(X; Y). From Theorem 1, the optimal P^t uses the conditionals P(xi | xj) computed from P. After this assignment, applying Bayes' rule to each branch factor lets D(P, P^t) be rewritten in terms of the mutual information of each branch (see the decomposition reconstructed below).

35 In that decomposition, the second and third terms are independent of the choice of tree t, and D(P, P^t) is non-negative (Gibbs' inequality). Thus minimizing the distance D(P, P^t) is equivalent to maximizing the sum of branch weights. Q.E.D.
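The decomposition these two slides refer to was shown as images; the following LaTeX reconstructs the standard identity obtained by substituting the Theorem 1 assignment and applying Bayes' rule to each branch factor (same parent notation x_{j(i)} as above).

```latex
% After setting P^t(x_i | x_{j(i)}) = P(x_i | x_{j(i)}) and writing
%   \log P(x_i \mid x_{j(i)}) = \log\frac{P(x_i, x_{j(i)})}{P(x_i)\,P(x_{j(i)})} + \log P(x_i),
% the distance becomes
\begin{equation*}
D(P,P^{t}) \;=\; -\sum_{i=1}^{n} I\!\left(X_i ; X_{j(i)}\right)
              \;+\; \sum_{i=1}^{n} H(X_i) \;-\; H(X_1,\dots,X_n),
\end{equation*}
% so only the first sum depends on the tree, and minimizing D(P,P^t)
% amounts to maximizing the total branch weight \sum_i I(X_i ; X_{j(i)}).
```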

36 Chow-Liu (CL) results
– If the distribution P is tree-structured, CL finds the CORRECT tree
– If P is NOT tree-structured, CL finds the tree-structured Q with minimal KL divergence: argmin_Q KL(P; Q)
– Even though there are n^(n−2) = 2^Θ(n log n) possible spanning trees, CL finds the best one in polynomial time, O(n²[m + log n]) for m samples

37 Chow-Liu trees – summary
Approximation of a joint distribution with a tree-structured distribution [Chow and Liu 68]
Learning the structure and the probabilities:
– Compute individual and pairwise marginal distributions for all pairs of variables
– Compute the mutual information (MI) for each pair of variables
– Build a maximum spanning tree for the complete graph with the variables as nodes and the MIs as edge weights
Properties:
– Efficient: O(#samples × (#variables)² × (#values per variable)²)
– Optimal (among tree-structured approximations)

38 References
– Kullback, S. (1959). Information Theory and Statistics. John Wiley and Sons, NY.
– Peng, H. C., Long, F., and Ding, C. (2005). Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238.
– Cormen, T. H., Leiserson, C. E., and Rivest, R. L. (1990). Introduction to Algorithms. MIT Press.
– Chow, C. K. and Liu, C. N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, IT-14(3):462–467.

41 Chow-Liu tree learning algorithm (1)
For each pair of variables Xi, Xj:
– Compute the empirical distribution: P̂(xi, xj) = Count(xi, xj) / M over the M samples
– Compute the mutual information: Î(Xi; Xj) = Σ_{xi,xj} P̂(xi, xj) log [ P̂(xi, xj) / (P̂(xi) P̂(xj)) ]
Define a graph:
– Nodes X1, …, Xn
– Edge (i, j) gets weight Î(Xi; Xj)
Find a maximal spanning tree; pick a node for the root and let the tree dangle from it (orient edges away from the root)

42 Chow-Liu tree learning algorithm (2)
Optimal tree BN:
– …
– Compute the maximum-weight spanning tree
– Directions in the BN: pick any node as root; breadth-first search defines the directions. It doesn't matter which node is picked.
Score equivalence: if G and G' are I-equivalent, then their scores are the same
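A small sketch of the orientation step just described, assuming the spanning tree is given as a list of undirected edges (i, j); `orient_tree` is an illustrative name, not an API from any particular library.

```python
from collections import deque, defaultdict

def orient_tree(edges, root=0):
    """Turn undirected spanning-tree edges (i, j) into parent pointers by breadth-first
    search from an arbitrarily chosen root. Returns {node: parent}, root maps to None."""
    adj = defaultdict(list)
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:       # not visited yet: u becomes v's parent
                parent[v] = u
                queue.append(v)
    return parent

# e.g. orient_tree([(0, 1), (1, 2), (1, 3)]) -> {0: None, 1: 0, 2: 1, 3: 1}
```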

43 TRUE or FALSE? The projection of P on any given tree is unique?

44 TRUE or FALSE? The maximum-weight spanning tree (MSWT) is not unique?

45 Improving on Chow-Liu trees
– Tree edges with low MI add little to the approximation
– Observations from the previous time point can be more relevant than those from the current one
– Idea: build a Chow-Liu tree that is allowed to include variables from both the current and the previous time point

46 Conditional Chow-Liu forests
Extension of Chow-Liu trees to conditional distributions:
– Approximation of a conditional multivariate distribution with a tree-structured distribution
– Uses MI to build maximum spanning trees (a forest)
– Variables of two consecutive time points serve as nodes
– All nodes corresponding to the earlier time point are considered connected before the tree construction
– Same asymptotic complexity as Chow-Liu trees: O(#samples × (#variables)² × (#values per variable)²)
– Optimal
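The construction just described can be sketched as a small variation of Kruskal's algorithm: place every earlier-time-point node in a single connected component before adding any edge, so that only edges touching the current time point can be selected. This is an interpretation of the slide's description, not the authors' code; the node labels and the `weights` argument (a dict of precomputed mutual informations) are assumptions.

```python
def conditional_chow_liu_forest(current_nodes, previous_nodes, weights):
    """Kruskal over MI-weighted edges, with all earlier-time-point nodes pre-connected.
    `weights` is assumed to map node pairs (u, v) to mutual information values, where
    every node named in it also appears in current_nodes or previous_nodes."""
    prev = list(previous_nodes)
    parent = {u: u for u in prev + list(current_nodes)}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    # Pre-connect all earlier-time-point nodes: edges among them can never be selected.
    for u in prev[1:]:
        parent[find(u)] = find(prev[0])

    forest = []
    for (u, v), w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:                        # skip edges that would close a cycle
            parent[ru] = rv
            forest.append((u, v, w))
    return forest

# usage sketch: conditional_chow_liu_forest(["A", "B", "C"], ["A'", "B'", "C'"], mi_weights)
# where mi_weights is a hypothetical dict like {("A", "B"): 0.3126, ("A'", "A"): 0.1207, ...}
```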

48 Example of CCL-forest learning. Pairwise joint probabilities and MI weights for variables A, B, C at the current time point and A', B', C' at the previous time point:

Pair  Pairwise joint probabilities  MI weight
AB    (0.56, 0.11, 0.02, 0.31)      0.3126
AC    (0.51, 0.17, 0.17, 0.15)      0.0229
BC    (0.44, 0.14, 0.23, 0.19)      0.0230
A'A   (0.57, 0.11, 0.11, 0.21)      0.1207
A'B   (0.51, 0.17, 0.07, 0.25)      0.1253
A'C   (0.54, 0.14, 0.14, 0.18)      0.0623
B'A   (0.52, 0.07, 0.16, 0.25)      0.1392
B'B   (0.48, 0.10, 0.11, 0.31)      0.1700
B'C   (0.47, 0.11, 0.21, 0.21)      0.0559
C'A   (0.48, 0.20, 0.20, 0.12)      0.0033
C'B   (0.41, 0.26, 0.17, 0.16)      0.0030
C'C   (0.53, 0.14, 0.14, 0.19)      0.0625

[The original slide also showed the two-time-slice graph and the resulting conditional forest as figures.]

