Learning Tree Structures


If we measured a distribution P, what is the tree-dependent distribution Pt that best approximates P? Which Pt will be closest to P?
Search space: all possible trees.
Goal: from all possible trees, find the one closest to P.
Distance measure: Kullback–Leibler divergence.
Operators/procedure: how to move through the search space (defined in the following slides).

Problem definition
X1, …, Xn are discrete random variables; the joint distribution P is unknown.
Given independent samples x(1), …, x(s) drawn from P, estimate P.
Possible solution – the best tree: P(x) = Π_i P(xi | xj(i)), where xj(i) is the parent of xi in some tree t.
Such a tree-dependent distribution requires only r(r−1)(n−1) + (r−1) parameters: r(r−1) parameters for each of the n−1 link matrices and r−1 parameters for the root node (r is the number of values each variable takes).
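
As a quick sanity check of this parameter count, here is a minimal sketch (assuming n variables that each take r values; the function names are mine, not from the slides):

def tree_param_count(n, r):
    # (r - 1) parameters for the root marginal plus
    # r * (r - 1) parameters for each of the n - 1 link matrices.
    return (r - 1) + (n - 1) * r * (r - 1)

def full_joint_param_count(n, r):
    # An unrestricted joint distribution over n r-ary variables.
    return r ** n - 1

print(tree_param_count(10, 4))        # 111
print(full_joint_param_count(10, 4))  # 1048575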

Kullback–Leibler divergence
For probability distributions P and Q of a discrete random variable, the KL divergence of Q from P is defined as
DKL(P‖Q) = Σ_x P(x) log [ P(x) / Q(x) ].
Rewriting the definition of the Kullback–Leibler divergence yields
DKL(P‖Q) = H(P, Q) − H(P),
where H(P, Q) = −Σ_x P(x) log Q(x) is the cross entropy of P and Q, and H(P) is the entropy of P.
It is a nonnegative measure: DKL(P‖Q) ≥ 0, with equality iff P = Q.
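
A small numerical check of the identity DKL(P‖Q) = H(P, Q) − H(P); a sketch with made-up three-valued distributions:

from math import log2

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
print(kl_divergence(P, Q))               # ~0.0365 bits
print(cross_entropy(P, Q) - entropy(P))  # same value, up to floating point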

Entropy is a measure of uncertainty
Fair coin: H(½, ½) = −½ log2(½) − ½ log2(½) = 1 bit (i.e., we need 1 bit to convey the outcome of the coin flip).
Biased coin: H(1/100, 99/100) = −1/100 log2(1/100) − 99/100 log2(99/100) ≈ 0.08 bit.
As P(heads) → 1, the information in the actual outcome → 0.
H(0, 1) = H(1, 0) = 0 bits, i.e., no uncertainty left in the source (we take 0 · log2(0) = 0).
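
The coin numbers above can be reproduced directly; a minimal sketch:

from math import log2

def entropy_bits(*probs):
    # The p > 0 guard implements the convention 0 * log2(0) = 0.
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy_bits(0.5, 0.5))    # 1.0 bit
print(entropy_bits(0.01, 0.99))  # ~0.0808 bits
print(entropy_bits(0.0, 1.0))    # 0.0 bits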

A two-phase optimization task
1. Assign probabilities: for a given tree t, which conditional probabilities Pt(x|y) yield the best approximation of P?
2. Vary the structure: the procedure varies the structure of t over all possible spanning trees.
Goal: among all trees with their best-assigned probabilities, find the one closest to P in terms of KL divergence.

What probabilities to assign? Pt(x|y) = P(x|y)
Theorem 1 (Chow & Liu 1968; see Pearl's textbook, Section 8.2.1): Given a fixed tree t, setting the probabilities along the branches of t to coincide with the conditional probabilities computed from P yields the distribution Pt that minimizes the KL divergence from P.

How to vary over all trees? How do we move in the search space? Maximum weight spanning tree.
Theorem 2 (Chow & Liu 1968; see Pearl's textbook, Section 8.2.1): The Kullback–Leibler divergence is minimized, across all trees, by a maximum weight spanning tree in which the weight of edge (x, y) is the mutual information I(x; y).
1. Create the maximum weight spanning tree t.
2. Project P onto it (the best way you can, per Theorem 1).
3. Done!

Mutual information
Mutual information is a measure of dependence between two random variables:
I(X; Y) = Σ_{x,y} P(x, y) log [ P(x, y) / ( P(x) P(y) ) ].
Mutual information is nonnegative, I(X; Y) ≥ 0, and symmetric, I(X; Y) = I(Y; X).
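
A minimal sketch of the definition, assuming a joint probability table joint[x][y]; the 2×2 tables below are made up for illustration:

from math import log2

def mutual_information(joint):
    """Mutual information I(X; Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    mi = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * log2(pxy / (px[x] * py[y]))
    return mi

# Independent variables give I = 0; a perfect copy gives I = H(X) = 1 bit.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0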

The algorithm
1. Find a maximum weight spanning tree with edge weights given by the mutual information I(Xi; Xj) computed from the data.
2. Compute Pt: select an arbitrary root node and set each branch probability to the conditional Pt(xi | xj(i)) = P(xi | xj(i)) computed from P.
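
The whole procedure fits in a short sketch. This is illustrative only, not the authors' code: it assumes the data is a list of equal-length tuples of discrete values, estimates pairwise mutual information from counts, and builds the maximum weight spanning tree with Kruskal's algorithm.

from collections import Counter
from math import log

def empirical_mi(data, i, j):
    """Empirical mutual information (in nats) between columns i and j of the data."""
    n = len(data)
    cij = Counter((row[i], row[j]) for row in data)
    ci = Counter(row[i] for row in data)
    cj = Counter(row[j] for row in data)
    return sum((c / n) * log(c * n / (ci[xi] * cj[xj]))
               for (xi, xj), c in cij.items())

def chow_liu_edges(data, n_vars):
    """Edges of a maximum weight spanning tree over the variables, with edge
    weights given by empirical mutual information (Kruskal's algorithm)."""
    weights = sorted(((empirical_mi(data, i, j), i, j)
                      for i in range(n_vars) for j in range(i + 1, n_vars)),
                     reverse=True)
    parent = list(range(n_vars))              # union-find forest
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]     # path halving
            u = parent[u]
        return u
    tree = []
    for w, i, j in weights:
        ri, rj = find(i), find(j)
        if ri != rj:                          # keep the edge only if it joins two components
            parent[ri] = rj
            tree.append((i, j))
    return tree

Rooting the resulting tree at any node and reading off the empirical conditionals P(xi | xparent) then completes the projection (Theorem 1).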

Illustration of CL-tree learning
Four binary variables A, B, C, D. The slide figure showed the pairwise joint distributions (over the four value combinations) and the resulting mutual information weight for each pair, together with the complete graph over A, B, C, D and the learned spanning tree:
Pair   Joint distribution           Mutual information
AB     (0.56, 0.11, 0.02, 0.31)     0.3126
AC     (0.51, 0.17, 0.17, 0.15)     0.0229
AD     (0.53, 0.15, 0.19, 0.13)     0.0172
BC     (0.44, 0.14, 0.23, 0.19)     0.0230
BD     (0.46, 0.12, 0.26, 0.16)     0.0183
CD     (0.64, 0.04, 0.08, 0.24)     0.2603
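
A quick check of the edge selection on the illustration's weights (a sketch that re-runs only the greedy spanning-tree step on the six pairwise scores above):

# Pairwise mutual-information weights taken from the illustration.
weights = {('A', 'B'): 0.3126, ('A', 'C'): 0.0229, ('A', 'D'): 0.0172,
           ('B', 'C'): 0.0230, ('B', 'D'): 0.0183, ('C', 'D'): 0.2603}
parent = {v: v for v in 'ABCD'}
def find(v):
    while parent[v] != v:
        v = parent[v]
    return v
tree = []
for (u, v), w in sorted(weights.items(), key=lambda kv: -kv[1]):
    ru, rv = find(u), find(v)
    if ru != rv:                  # keep the edge only if it joins two components
        parent[ru] = rv
        tree.append((u, v))
print(tree)                       # [('A', 'B'), ('C', 'D'), ('B', 'C')]

So with these weights the learned structure is the chain A–B–C–D.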

Theorem 1: If we force the probabilities along the branches of the tree t to coincide with those computed from P, we get the best t-dependent approximation of P.

Theorem 1 (proof): Expand DKL(P, Pt); the only part that depends on the branch probabilities is the cross-entropy term. By Gibbs' inequality, an expression of the form Σ_x P(x) log P'(x) is maximized when P'(x) = P(x), so each branch term, and hence the whole expression, is maximal when P'(xi | xj) = P(xi | xj). Q.E.D.
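
The equations on the original slides were images; the following is a standard reconstruction of the argument, with x_{j(i)} denoting the parent of x_i in t and P^t the tree-dependent distribution:

\begin{align*}
D_{\mathrm{KL}}(P \,\|\, P^t)
  &= \sum_{\mathbf{x}} P(\mathbf{x}) \log P(\mathbf{x})
     - \sum_{\mathbf{x}} P(\mathbf{x}) \log P^t(\mathbf{x}) \\
  &= -H(P) - \sum_{i=1}^{n} \sum_{x_i,\, x_{j(i)}}
     P\bigl(x_i, x_{j(i)}\bigr) \log P^t\bigl(x_i \mid x_{j(i)}\bigr).
\end{align*}

For each fixed value of x_{j(i)}, Gibbs' inequality gives

\[
\sum_{x_i} P\bigl(x_i \mid x_{j(i)}\bigr) \log P^t\bigl(x_i \mid x_{j(i)}\bigr)
  \;\le\;
\sum_{x_i} P\bigl(x_i \mid x_{j(i)}\bigr) \log P\bigl(x_i \mid x_{j(i)}\bigr),
\]

with equality iff P^t(x_i | x_{j(i)}) = P(x_i | x_{j(i)}); this choice therefore minimizes the divergence.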

The proof of Theorem 1 rests on Gibbs' inequality: for any two probability distributions p and q over the same discrete domain, Σ_x p(x) log q(x) ≤ Σ_x p(x) log p(x), with equality if and only if q = p (equivalently, DKL(p‖q) ≥ 0).

Theorem 2: The distance measure (Kullback–Leibler divergence) is minimized by assigning the best distribution on a maximum weight spanning tree, where the weight of branch (x, y) is the mutual information I(x; y).
Proof: From Theorem 1, the optimal assignment is Pt(xi | xj(i)) = P(xi | xj(i)), which minimizes DKL(P, Pt) for a fixed tree. Substituting this assignment and applying Bayes' rule expands the divergence into a sum of per-branch terms (see the derivation after the next slide).

Theorem 2 (conclusion): After the assignment, DKL(P, Pt) = − Σ_i I(Xi; Xj(i)) + Σ_i H(Xi) − H(X1, …, Xn). The second and third terms are independent of the choice of tree t, and DKL(P, Pt) is nonnegative (Gibbs' inequality). Thus, minimizing the distance DKL(P, Pt) is equivalent to maximizing the sum of branch weights Σ_i I(Xi; Xj(i)). Q.E.D.
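
Again the slide equations were images; a standard reconstruction of the expansion, after setting P^t(x_i | x_{j(i)}) = P(x_i | x_{j(i)}) and using P(x_i | x_{j(i)}) = [P(x_i, x_{j(i)}) / (P(x_i) P(x_{j(i)}))] · P(x_i):

\begin{align*}
D_{\mathrm{KL}}(P \,\|\, P^t)
  &= -H(X_1,\dots,X_n)
     - \sum_{i} \sum_{x_i,\, x_{j(i)}} P\bigl(x_i, x_{j(i)}\bigr)
       \left[ \log \frac{P\bigl(x_i, x_{j(i)}\bigr)}{P(x_i)\, P\bigl(x_{j(i)}\bigr)}
              + \log P(x_i) \right] \\
  &= -\sum_{i} I\bigl(X_i; X_{j(i)}\bigr) + \sum_{i} H(X_i) - H(X_1,\dots,X_n).
\end{align*}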

Chow-Liu results
If the distribution P is tree-structured, Chow-Liu finds a correct tree-structured distribution.
If the distribution P is not tree-structured, Chow-Liu finds the tree-structured distribution Q that minimizes the KL divergence: argmin_Q DKL(P‖Q).
Even though there are 2^Ω(n log n) candidate trees, Chow-Liu finds a best one in polynomial time, O(n²(m + log n)) for n variables and m samples.
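
To get a feel for the gap between brute force and the Chow-Liu procedure (a rough sketch; Cayley's formula gives n^(n−2) labelled trees):

from math import log2

n, m = 20, 1000
num_trees = n ** (n - 2)               # Cayley's formula: labelled trees on n nodes
chow_liu_work = n * n * (m + log2(n))  # rough O(n^2 (m + log n)) operation count
print(num_trees)                       # 262144000000000000000000
print(round(chow_liu_work))            # 401729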

References
Kullback, S. (1959). Information Theory and Statistics. John Wiley and Sons, New York.
Chow, C. K. and Liu, C. N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, IT-14(3):462-467.
See also the summary in Chapter 8.2.1 of Pearl's textbook.

Scores for general DAGs
Minimize DKL(P, PG) = − Σ_i [ I(Xi; Pai^G) − H(Xi) ] − H(X1, …, Xn); equivalently, maximize the log-likelihood of the structure G, which amounts to maximizing Σ_i I(Xi; Pai^G), since the entropy terms do not depend on G.
Problem – overfitting: the "best solution" can always be chosen to be the complete graph; it fits the data best but generalizes poorly.
Solution: Bayesian scores, BIC, or MDL.
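
As an illustration of a penalized decomposable score, here is a minimal BIC-style sketch for binary data; the function names and the toy data are mine, not from the slides:

from collections import Counter
from math import log

def family_loglik(data, child, parents):
    """Log-likelihood contribution of one family: sum over samples of
    log P_hat(x_child | x_parents), with P_hat the empirical conditionals."""
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    par = Counter(tuple(row[p] for p in parents) for row in data)
    return sum(c * log(c / par[pa]) for (pa, _), c in joint.items())

def bic_score(data, parent_sets, r=2):
    """BIC = log-likelihood - (log n / 2) * number of free parameters.
    parent_sets maps each variable index to a tuple of parent indices."""
    n = len(data)
    ll = sum(family_loglik(data, i, ps) for i, ps in parent_sets.items())
    n_params = sum((r - 1) * r ** len(ps) for ps in parent_sets.values())
    return ll - 0.5 * log(n) * n_params

# Example: score the empty graph vs. the chain A -> B -> C on toy binary data.
data = [(0, 0, 0), (1, 1, 1), (0, 0, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1)]
print(bic_score(data, {0: (), 1: (), 2: ()}))      # independent model
print(bic_score(data, {0: (), 1: (0,), 2: (1,)}))  # chain model scores higher here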