If we have measured a distribution P, what is the tree-dependent distribution P^t that best approximates P?
Search space: all possible trees
Goal: from all possible trees, find the one closest to P
Distance measure: the Kullback–Leibler cross-entropy measure
Operators/Procedure: vary the tree structure and the conditional probabilities assigned to its branches (described in the following slides)

Problem definition
X_1, …, X_n are random variables; P is unknown.
Given independent samples x_1, …, x_s drawn from the distribution P, estimate P.
Solution 1 - independence: assume X_1, …, X_n are independent, so P(x) = Π_i P(x_i).
Solution 2 - trees: P(x) = Π_i P(x_i | x_{j(i)}), where x_{j(i)} is the parent of x_i in some tree over the variables.
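For reference, a LaTeX rendering of the two factorizations above; the parent index j(i) and the root convention P(x_i | x_0) := P(x_i) are notational assumptions, since the equations on the original slide were images:

```latex
% Solution 1: full independence
P(x_1,\dots,x_n) \approx \prod_{i=1}^{n} P(x_i)

% Solution 2: tree-dependent factorization (x_{j(i)} is the parent of x_i in tree t)
P^{t}(x_1,\dots,x_n) = \prod_{i=1}^{n} P\bigl(x_i \mid x_{j(i)}\bigr)
```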

Kullback–Leibler cross-entropy measure
For probability distributions P and Q of a discrete random variable, the K–L divergence of Q from P is defined as
D_KL(P || Q) = Σ_x P(x) log( P(x) / Q(x) ).
From the definition it can be seen that
D_KL(P || Q) = H(P, Q) − H(P),
where H(P, Q) = −Σ_x P(x) log Q(x) is the cross entropy of P and Q, and H(P) = −Σ_x P(x) log P(x) is the entropy of P.
The K–L divergence is a non-negative measure (by Gibbs' inequality).
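A minimal sketch of these three quantities in Python; the function names and the choice of base-2 logarithms are my own, not from the slides:

```python
import numpy as np

def entropy(p):
    """H(P) = -sum_x P(x) log2 P(x), with 0 * log2(0) taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log2 Q(x)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(q[nz]))

def kl_divergence(p, q):
    """D_KL(P || Q) = H(P, Q) - H(P) = sum_x P(x) log2(P(x) / Q(x))."""
    return cross_entropy(p, q) - entropy(p)

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, q))   # strictly positive; it is 0 only when P == Q (Gibbs' inequality)
```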

Entropy is a measure of uncertainty.
Fair coin: H(½, ½) = −½ log2(½) − ½ log2(½) = 1 bit (i.e., we need 1 bit to convey the outcome of a coin flip).
Biased coin: H(1/100, 99/100) = −1/100 log2(1/100) − 99/100 log2(99/100) ≈ 0.08 bit.
As P(heads) → 1, the information in the actual outcome → 0.
H(0, 1) = H(1, 0) = 0 bits, i.e., no uncertainty is left in the source (taking 0 · log2(0) = 0).
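A quick numerical check of this arithmetic (a throwaway snippet, not part of the original slides):

```python
import numpy as np

for p in ([0.5, 0.5], [0.01, 0.99], [0.0, 1.0]):
    p = np.asarray(p)
    nz = p > 0                                  # convention: 0 * log2(0) = 0
    print(-np.sum(p[nz] * np.log2(p[nz])))      # 1.0, ~0.0808, 0.0 bits
```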

Optimization task
Init: fix the structure of some tree t.
Assign probabilities: which conditional probabilities P^t(x|y) yield the best approximation of P?
Procedure: vary the structure of t over all possible spanning trees.
Goal: among all trees with their assigned probabilities, which is closest to P?

What probabilities should we assign?
Theorem 1: if we force the conditional probabilities along the branches of the tree t to coincide with those computed from P, we get the best t-dependent approximation of P.

How do we vary over all trees? How do we move in the search space?
Theorem 2: the distance measure (Kullback–Leibler) is minimized by assigning the best distribution to any maximum-weight spanning tree, where the weight on the branch (X, Y) is defined by the mutual information measure.

Mutual information measures how much knowing one of the variables reduces our uncertainty about the other. In the extreme case where one variable is a deterministic function of the other, the mutual information equals the uncertainty contained in Y (or X) alone, namely the entropy of Y (or X).
Mutual information is a measure of dependence:
I(X; Y) = Σ_{x,y} P(x, y) log( P(x, y) / (P(x) P(y)) ).
It is non-negative (I(X; Y) ≥ 0) and symmetric (I(X; Y) = I(Y; X)).
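A small sketch of this computation from a joint probability table; the helper name and the 2×2 example values are my own, for illustration only:

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} P(x,y) log2( P(x,y) / (P(x) P(y)) ) for a joint table P(x,y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal P(x), column vector
    py = joint.sum(axis=0, keepdims=True)   # marginal P(y), row vector
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px * py)[nz])))

# Example: a weakly dependent pair of binary variables
joint_xy = [[0.40, 0.10],
            [0.10, 0.40]]
print(mutual_information(joint_xy))   # > 0; it would be 0 if X and Y were independent
```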

The algorithm
Find a maximum-weight spanning tree with branch weights given by the mutual information I(X_i; X_j).
Compute P^t: select an arbitrary root node and set the conditional probability along each branch to the one computed from P, i.e. P^t(x_i | x_{j(i)}) = P(x_i | x_{j(i)}), with P^t(x_root) = P(x_root).
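A compact sketch of the whole procedure for discrete data given as a samples-by-variables integer array; the function names, the use of Kruskal's algorithm for the maximum spanning tree, and the return format are assumptions of this sketch, not prescribed by the slides:

```python
import numpy as np
from itertools import combinations

def pairwise_mi(data, i, j):
    """Empirical mutual information I(X_i; X_j), in bits, estimated from samples."""
    xi, xj = data[:, i], data[:, j]
    mi = 0.0
    for a in np.unique(xi):
        for b in np.unique(xj):
            p_ab = np.mean((xi == a) & (xj == b))
            p_a, p_b = np.mean(xi == a), np.mean(xj == b)
            if p_ab > 0:
                mi += p_ab * np.log2(p_ab / (p_a * p_b))
    return mi

def chow_liu_edges(data):
    """Edges (i, j, weight) of a maximum-weight spanning tree with MI weights (Kruskal)."""
    n_vars = data.shape[1]
    edges = sorted(((pairwise_mi(data, i, j), i, j)
                    for i, j in combinations(range(n_vars), 2)), reverse=True)
    parent = list(range(n_vars))            # union-find forest for cycle detection
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u
    tree = []
    for w, i, j in edges:                   # heaviest edges first
        ri, rj = find(i), find(j)
        if ri != rj:                        # adding (i, j) creates no cycle
            parent[ri] = rj
            tree.append((i, j, w))
    return tree
```

Given the selected edges, the conditionals P^t(x_i | x_{j(i)}) are then read off as empirical conditional frequencies after rooting the tree at an arbitrary node, exactly as Theorem 1 prescribes.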

Illustration of CL-Tree Learning
Pairwise joint distributions over the four variables A, B, C, D (four joint probabilities per pair, in the order given on the slide):
AB: (0.56, 0.11, 0.02, 0.31)
AC: (0.51, 0.17, 0.17, 0.15)
AD: (0.53, 0.15, 0.19, 0.13)
BC: (0.44, 0.14, 0.23, 0.19)
BD: (0.46, 0.12, 0.26, 0.16)
CD: (0.64, 0.04, 0.08, 0.24)
[Figure: the resulting Chow-Liu tree over the nodes A, B, C, D is shown on the slide but is not recoverable from the transcript.]
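As a usage example, the snippet below computes the mutual information of each pair from the 2×2 joints above and greedily builds the maximum-weight spanning tree; it assumes the four numbers are the joint probabilities in the order (0,0), (0,1), (1,0), (1,1), which the transcript does not state:

```python
import numpy as np

pairs = {
    ("A", "B"): [0.56, 0.11, 0.02, 0.31],
    ("A", "C"): [0.51, 0.17, 0.17, 0.15],
    ("A", "D"): [0.53, 0.15, 0.19, 0.13],
    ("B", "C"): [0.44, 0.14, 0.23, 0.19],
    ("B", "D"): [0.46, 0.12, 0.26, 0.16],
    ("C", "D"): [0.64, 0.04, 0.08, 0.24],
}

def mi_from_joint(p):
    """Mutual information (bits) of a 2x2 joint given as [p00, p01, p10, p11]."""
    joint = np.asarray(p, dtype=float).reshape(2, 2)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px * py)[nz])))

weights = {pair: mi_from_joint(p) for pair, p in pairs.items()}

# Kruskal's maximum spanning tree over the four nodes, with a tiny union-find
parent = {v: v for v in "ABCD"}
def find(v):
    while parent[v] != v:
        v = parent[v]
    return v
tree = []
for (u, v), w in sorted(weights.items(), key=lambda kv: -kv[1]):
    if find(u) != find(v):
        parent[find(u)] = find(v)
        tree.append((u, v, round(w, 3)))
print(tree)   # the three branches of the learned tree and their MI weights
```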

Theorem 1: if we force the probabilities along the branches of the tree t to coincide with those computed from P, we get the best t-dependent approximation of P.
Proof:
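The equations on the original proof slides were images; the block below is a reconstruction of the standard expansion of the divergence that the Gibbs argument on the next slide refers to (the parent notation x_{j(i)} is assumed):

```latex
D_{\mathrm{KL}}\left(P \,\|\, P^{t}\right)
  = \sum_{x} P(x)\log P(x)
    - \sum_{x} P(x) \sum_{i=1}^{n} \log P^{t}\!\left(x_i \mid x_{j(i)}\right)
  = -H(P)
    - \sum_{i=1}^{n} \sum_{x_i,\, x_{j(i)}} P\!\left(x_i, x_{j(i)}\right)
      \log P^{t}\!\left(x_i \mid x_{j(i)}\right)
```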

By Gibbs' inequality, each inner sum over x_i is maximized when the approximating distribution coincides with the true one; hence the whole expression is maximal, and the divergence minimal, when P^t(x_i | x_{j(i)}) = P(x_i | x_{j(i)}). Q.E.D.

Theorem 2: the Kullback–Leibler distance is minimized by assigning the best distribution to any maximum-weight spanning tree, where the weight on the branch (X, Y) is defined by the mutual information measure.
Proof: from Theorem 1, we substitute P^t(x_i | x_{j(i)}) = P(x_i | x_{j(i)}). After this assignment, and an application of Bayes' rule, the divergence decomposes into a sum of branch weights (to be maximized) plus terms that do not depend on the tree.
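Again the original equations were slide images; a reconstruction of the resulting decomposition (with x_{j(i)} the parent of x_i and the root's "conditional" taken to be its marginal):

```latex
D_{\mathrm{KL}}\left(P \,\|\, P^{t}\right)
  = -\sum_{i=1}^{n} I\!\left(X_i; X_{j(i)}\right)
    + \sum_{i=1}^{n} H\!\left(X_i\right)
    - H\!\left(X_1,\dots,X_n\right),
\quad\text{where}\quad
I\!\left(X_i; X_{j(i)}\right)
  = \sum_{x_i,\, x_{j(i)}} P\!\left(x_i, x_{j(i)}\right)
    \log \frac{P\!\left(x_i, x_{j(i)}\right)}{P\!\left(x_i\right) P\!\left(x_{j(i)}\right)}
```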

In this decomposition, the second and third terms are independent of t, and D(P, P^t) is non-negative (Gibbs' inequality). Thus, minimizing the distance D(P, P^t) is equivalent to maximizing the sum of branch weights Σ_i I(X_i; X_{j(i)}). Q.E.D.

Chow-Liu (CL) Results
If the distribution P is tree-structured, CL finds the correct tree.
If P is not tree-structured, CL finds the tree-structured Q with minimal KL divergence, argmin_Q KL(P; Q).
Even though there are 2^Θ(n log n) candidate trees, CL finds the best one in polynomial time, O(n^2 [m + log n]), where m is the number of samples.
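The count of candidate trees follows from Cayley's formula, a standard fact not spelled out on the slide: there are n^{n-2} labeled spanning trees on n nodes, and

```latex
n^{\,n-2} = 2^{(n-2)\log_2 n} = 2^{\Theta(n \log n)}
```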

Chow-Liu Trees - Summary
Approximation of a joint distribution with a tree-structured distribution [Chow and Liu 68].
Learning the structure and the probabilities:
– Compute individual and pairwise marginal distributions for all pairs of variables
– Compute the mutual information (MI) for each pair of variables
– Build a maximum spanning tree for the complete graph with the variables as nodes and the MIs as edge weights
Properties:
– Efficient: O(#samples × (#variables)^2 × (#values per variable)^2)
– Optimal: yields the tree-structured distribution closest to P in KL divergence

References
Kullback, S. (1959). Information Theory and Statistics. John Wiley and Sons, NY.
Chow, C. K. and Liu, C. N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, IT-14(3):462–467.