Learning Tree Structures


If we measured a distribution P, what is the tree-dependent distribution Pt that best approximates P? Which Pt will be closest to P?
Search space: all possible trees.
Goal: from all possible trees, find the one closest to P.
Distance measure: Kullback–Leibler divergence.
Operators/procedure: how to move through the search space (defined in the following slides).

Problem definition
X1, …, Xn are discrete random variables; the joint distribution P is unknown.
Given independent samples x(1), …, x(s) drawn from P, estimate P.
Possible solution – the best tree: P(x) = Π_i P(xi | xj(i)), where xj(i) is the parent of xi in some tree t.
Such a tree-dependent distribution requires only r(r−1)(n−1) + (r−1) parameters: r(r−1) parameters for each of the n−1 link matrices and r−1 parameters for the root node (r is the number of values each variable takes).
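
As a quick sanity check of this parameter count, here is a minimal sketch (assuming n variables that each take r values; the function names are mine, not from the slides):

def tree_param_count(n, r):
    # (r - 1) parameters for the root marginal plus
    # r * (r - 1) parameters for each of the n - 1 link matrices.
    return (r - 1) + (n - 1) * r * (r - 1)

def full_joint_param_count(n, r):
    # An unrestricted joint distribution over n r-ary variables.
    return r ** n - 1

print(tree_param_count(10, 4))        # 111
print(full_joint_param_count(10, 4))  # 1048575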

Kullback–Leibler divergence
For probability distributions P and Q of a discrete random variable, the KL divergence of Q from P is defined as
DKL(P‖Q) = Σ_x P(x) log [ P(x) / Q(x) ].
Rewriting the definition of the Kullback–Leibler divergence yields
DKL(P‖Q) = H(P, Q) − H(P),
where H(P, Q) = −Σ_x P(x) log Q(x) is the cross entropy of P and Q, and H(P) is the entropy of P.
It is a nonnegative measure: DKL(P‖Q) ≥ 0, with equality iff P = Q.
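
A small numerical check of the identity DKL(P‖Q) = H(P, Q) − H(P); a sketch with made-up three-valued distributions:

from math import log2

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
print(kl_divergence(P, Q))               # ~0.0365 bits
print(cross_entropy(P, Q) - entropy(P))  # same value, up to floating point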

Entropy is a measure of uncertainty
Fair coin: H(½, ½) = −½ log2(½) − ½ log2(½) = 1 bit (i.e., we need 1 bit to convey the outcome of the coin flip).
Biased coin: H(1/100, 99/100) = −1/100 log2(1/100) − 99/100 log2(99/100) ≈ 0.08 bit.
As P(heads) → 1, the information in the actual outcome → 0.
H(0, 1) = H(1, 0) = 0 bits, i.e., no uncertainty left in the source (we take 0 · log2(0) = 0).
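
The coin numbers above can be reproduced directly; a minimal sketch:

from math import log2

def entropy_bits(*probs):
    # The p > 0 guard implements the convention 0 * log2(0) = 0.
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy_bits(0.5, 0.5))    # 1.0 bit
print(entropy_bits(0.01, 0.99))  # ~0.0808 bits
print(entropy_bits(0.0, 1.0))    # 0.0 bits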

A two-phase optimization task
1. Assign probabilities: for a given tree t, which conditional probabilities Pt(x|y) yield the best approximation of P?
2. Vary the structure: the procedure varies the structure of t over all possible spanning trees.
Goal: among all trees with their best-assigned probabilities, find the one closest to P in terms of KL divergence.

What probabilities to assign? Pt(x|y) = P(x|y)
Theorem 1 (Chow & Liu 1968; see Pearl's textbook, Section 8.2.1): Given a fixed tree t, setting the probabilities along the branches of t to coincide with the conditional probabilities computed from P yields the distribution Pt that minimizes the KL divergence from P.

How to vary over all trees? How do we move in the search space? Maximum weight spanning tree.
Theorem 2 (Chow & Liu 1968; see Pearl's textbook, Section 8.2.1): The Kullback–Leibler divergence is minimized, across all trees, by a maximum weight spanning tree in which the weight of edge (x, y) is the mutual information I(x; y).
1. Create the maximum weight spanning tree t.
2. Project P onto it (the best way you can, per Theorem 1).
3. Done!

Mutual information
Mutual information is a measure of dependence between two random variables:
I(X; Y) = Σ_{x,y} P(x, y) log [ P(x, y) / ( P(x) P(y) ) ].
Mutual information is nonnegative, I(X; Y) ≥ 0, and symmetric, I(X; Y) = I(Y; X).
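
A minimal sketch of the definition, assuming a joint probability table joint[x][y]; the 2×2 tables below are made up for illustration:

from math import log2

def mutual_information(joint):
    """Mutual information I(X; Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    mi = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * log2(pxy / (px[x] * py[y]))
    return mi

# Independent variables give I = 0; a perfect copy gives I = H(X) = 1 bit.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0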

The algorithm
1. Find a maximum weight spanning tree with edge weights given by the mutual information I(Xi; Xj) computed from the data.
2. Compute Pt: select an arbitrary root node and set each branch probability to the conditional Pt(xi | xj(i)) = P(xi | xj(i)) computed from P.
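
The whole procedure fits in a short sketch. This is illustrative only, not the authors' code: it assumes the data is a list of equal-length tuples of discrete values, estimates pairwise mutual information from counts, and builds the maximum weight spanning tree with Kruskal's algorithm.

from collections import Counter
from math import log

def empirical_mi(data, i, j):
    """Empirical mutual information (in nats) between columns i and j of the data."""
    n = len(data)
    cij = Counter((row[i], row[j]) for row in data)
    ci = Counter(row[i] for row in data)
    cj = Counter(row[j] for row in data)
    return sum((c / n) * log(c * n / (ci[xi] * cj[xj]))
               for (xi, xj), c in cij.items())

def chow_liu_edges(data, n_vars):
    """Edges of a maximum weight spanning tree over the variables, with edge
    weights given by empirical mutual information (Kruskal's algorithm)."""
    weights = sorted(((empirical_mi(data, i, j), i, j)
                      for i in range(n_vars) for j in range(i + 1, n_vars)),
                     reverse=True)
    parent = list(range(n_vars))              # union-find forest
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]     # path halving
            u = parent[u]
        return u
    tree = []
    for w, i, j in weights:
        ri, rj = find(i), find(j)
        if ri != rj:                          # keep the edge only if it joins two components
            parent[ri] = rj
            tree.append((i, j))
    return tree

Rooting the resulting tree at any node and reading off the empirical conditionals P(xi | xparent) then completes the projection (Theorem 1).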

Illustration of CL-tree learning
Four binary variables A, B, C, D. The slide figure showed the pairwise joint distributions (over the four value combinations) and the resulting mutual information weight for each pair, together with the complete graph over A, B, C, D and the learned spanning tree:
Pair   Joint distribution           Mutual information
AB     (0.56, 0.11, 0.02, 0.31)     0.3126
AC     (0.51, 0.17, 0.17, 0.15)     0.0229
AD     (0.53, 0.15, 0.19, 0.13)     0.0172
BC     (0.44, 0.14, 0.23, 0.19)     0.0230
BD     (0.46, 0.12, 0.26, 0.16)     0.0183
CD     (0.64, 0.04, 0.08, 0.24)     0.2603
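
A quick check of the edge selection on the illustration's weights (a sketch that re-runs only the greedy spanning-tree step on the six pairwise scores above):

# Pairwise mutual-information weights taken from the illustration.
weights = {('A', 'B'): 0.3126, ('A', 'C'): 0.0229, ('A', 'D'): 0.0172,
           ('B', 'C'): 0.0230, ('B', 'D'): 0.0183, ('C', 'D'): 0.2603}
parent = {v: v for v in 'ABCD'}
def find(v):
    while parent[v] != v:
        v = parent[v]
    return v
tree = []
for (u, v), w in sorted(weights.items(), key=lambda kv: -kv[1]):
    ru, rv = find(u), find(v)
    if ru != rv:                  # keep the edge only if it joins two components
        parent[ru] = rv
        tree.append((u, v))
print(tree)                       # [('A', 'B'), ('C', 'D'), ('B', 'C')]

So with these weights the learned structure is the chain A–B–C–D.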

Theorem 1: If we force the probabilities along the branches of the tree t to coincide with those computed from P, we get the best t-dependent approximation of P.

Theorem 1 (proof): Expand DKL(P, Pt); the only part that depends on the branch probabilities is the cross-entropy term. By Gibbs' inequality, an expression of the form Σ_x P(x) log P'(x) is maximized when P'(x) = P(x), so each branch term, and hence the whole expression, is maximal when P'(xi | xj) = P(xi | xj). Q.E.D.
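
The equations on the original slides were images; the following is a standard reconstruction of the argument, with x_{j(i)} denoting the parent of x_i in t and P^t the tree-dependent distribution:

\begin{align*}
D_{\mathrm{KL}}(P \,\|\, P^t)
  &= \sum_{\mathbf{x}} P(\mathbf{x}) \log P(\mathbf{x})
     - \sum_{\mathbf{x}} P(\mathbf{x}) \log P^t(\mathbf{x}) \\
  &= -H(P) - \sum_{i=1}^{n} \sum_{x_i,\, x_{j(i)}}
     P\bigl(x_i, x_{j(i)}\bigr) \log P^t\bigl(x_i \mid x_{j(i)}\bigr).
\end{align*}

For each fixed value of x_{j(i)}, Gibbs' inequality gives

\[
\sum_{x_i} P\bigl(x_i \mid x_{j(i)}\bigr) \log P^t\bigl(x_i \mid x_{j(i)}\bigr)
  \;\le\;
\sum_{x_i} P\bigl(x_i \mid x_{j(i)}\bigr) \log P\bigl(x_i \mid x_{j(i)}\bigr),
\]

with equality iff P^t(x_i | x_{j(i)}) = P(x_i | x_{j(i)}); this choice therefore minimizes the divergence.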

The proof of Theorem 1 rests on Gibbs' inequality: for any two probability distributions p and q over the same discrete domain, Σ_x p(x) log q(x) ≤ Σ_x p(x) log p(x), with equality if and only if q = p (equivalently, DKL(p‖q) ≥ 0).

Theorem 2: The distance measure (Kullback–Leibler divergence) is minimized by assigning the best distribution on a maximum weight spanning tree, where the weight of branch (x, y) is the mutual information I(x; y).
Proof: From Theorem 1, the optimal assignment is Pt(xi | xj(i)) = P(xi | xj(i)), which minimizes DKL(P, Pt) for a fixed tree. Substituting this assignment and applying Bayes' rule expands the divergence into a sum of per-branch terms (see the derivation after the next slide).

Theorem 2 (conclusion): After the assignment, DKL(P, Pt) = − Σ_i I(Xi; Xj(i)) + Σ_i H(Xi) − H(X1, …, Xn). The second and third terms are independent of the choice of tree t, and DKL(P, Pt) is nonnegative (Gibbs' inequality). Thus, minimizing the distance DKL(P, Pt) is equivalent to maximizing the sum of branch weights Σ_i I(Xi; Xj(i)). Q.E.D.
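
Again the slide equations were images; a standard reconstruction of the expansion, after setting P^t(x_i | x_{j(i)}) = P(x_i | x_{j(i)}) and using P(x_i | x_{j(i)}) = [P(x_i, x_{j(i)}) / (P(x_i) P(x_{j(i)}))] · P(x_i):

\begin{align*}
D_{\mathrm{KL}}(P \,\|\, P^t)
  &= -H(X_1,\dots,X_n)
     - \sum_{i} \sum_{x_i,\, x_{j(i)}} P\bigl(x_i, x_{j(i)}\bigr)
       \left[ \log \frac{P\bigl(x_i, x_{j(i)}\bigr)}{P(x_i)\, P\bigl(x_{j(i)}\bigr)}
              + \log P(x_i) \right] \\
  &= -\sum_{i} I\bigl(X_i; X_{j(i)}\bigr) + \sum_{i} H(X_i) - H(X_1,\dots,X_n).
\end{align*}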

Chow-Liu results
If the distribution P is tree-structured, Chow-Liu finds a correct tree-structured distribution.
If the distribution P is not tree-structured, Chow-Liu finds the tree-structured distribution Q that minimizes the KL divergence: argmin_Q DKL(P‖Q).
Even though there are 2^Ω(n log n) candidate trees, Chow-Liu finds a best one in polynomial time, O(n²(m + log n)) for n variables and m samples.
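
To get a feel for the gap between brute force and the Chow-Liu procedure (a rough sketch; Cayley's formula gives n^(n−2) labelled trees):

from math import log2

n, m = 20, 1000
num_trees = n ** (n - 2)               # Cayley's formula: labelled trees on n nodes
chow_liu_work = n * n * (m + log2(n))  # rough O(n^2 (m + log n)) operation count
print(num_trees)                       # 262144000000000000000000
print(round(chow_liu_work))            # 401729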

References
Kullback, S. (1959). Information Theory and Statistics. John Wiley and Sons, New York.
Chow, C. K. and Liu, C. N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, IT-14(3):462-467.
See also the summary in Chapter 8.2.1 of Pearl's textbook.

Scores for general DAGs
Minimize DKL(P, PG) = − Σ_i [ I(Xi; Pai^G) − H(Xi) ] − H(X1, …, Xn); equivalently, maximize the log-likelihood of the structure G, which amounts to maximizing Σ_i I(Xi; Pai^G), since the entropy terms do not depend on G.
Problem – overfitting: the "best solution" can always be chosen to be the complete graph; it fits the data best but generalizes poorly.
Solution: Bayesian scores, BIC, or MDL.
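
As an illustration of a penalized decomposable score, here is a minimal BIC-style sketch for binary data; the function names and the toy data are mine, not from the slides:

from collections import Counter
from math import log

def family_loglik(data, child, parents):
    """Log-likelihood contribution of one family: sum over samples of
    log P_hat(x_child | x_parents), with P_hat the empirical conditionals."""
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    par = Counter(tuple(row[p] for p in parents) for row in data)
    return sum(c * log(c / par[pa]) for (pa, _), c in joint.items())

def bic_score(data, parent_sets, r=2):
    """BIC = log-likelihood - (log n / 2) * number of free parameters.
    parent_sets maps each variable index to a tuple of parent indices."""
    n = len(data)
    ll = sum(family_loglik(data, i, ps) for i, ps in parent_sets.items())
    n_params = sum((r - 1) * r ** len(ps) for ps in parent_sets.values())
    return ll - 0.5 * log(n) * n_params

# Example: score the empty graph vs. the chain A -> B -> C on toy binary data.
data = [(0, 0, 0), (1, 1, 1), (0, 0, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1)]
print(bic_score(data, {0: (), 1: (), 2: ()}))      # independent model
print(bic_score(data, {0: (), 1: (0,), 2: (1,)}))  # chain model scores higher here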