Learning Bayesian Networks
Most slides by Nir Friedman, some by Dan Geiger.

2. Known Structure -- Incomplete Data
- Network structure is specified.
- Data contains missing values.
- We consider assignments to the missing values.
[Figure: the Inducer takes the fixed structure E, B -> A and a data table over E, B, A containing '?' entries, and outputs the CPT P(A | E, B), e.g. entries .9/.1.]

3. Learning Parameters from Incomplete Data
Incomplete data:
- Posterior distributions can become interdependent.
- Consequence:
  - ML parameters cannot be computed separately for each multinomial.
  - The posterior is not a product of independent posteriors.
[Figure: plate model with parameters θ_X, θ_Y|X=H, θ_Y|X=T and observed pairs X[m], Y[m].]
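To see why the parameters become coupled, consider a tiny network X -> Y. A record in which X is missing contributes a likelihood term that mixes both Y-parameters, so the log-likelihood no longer decomposes per parameter. The sketch below is illustrative only; the variable names, parameter values, and records are not from the slides.

```python
import numpy as np

# Toy network X -> Y, binary with values 'H'/'T'.
# Parameters (hypothetical values, chosen only for illustration):
theta_X   = 0.6   # P(X=H)
theta_Y_H = 0.3   # P(Y=H | X=H)
theta_Y_T = 0.4   # P(Y=H | X=T)

def record_likelihood(x, y):
    """Likelihood of one record; x may be None (missing)."""
    def joint(xv, yv):
        px = theta_X if xv == 'H' else 1 - theta_X
        py_h = theta_Y_H if xv == 'H' else theta_Y_T
        py = py_h if yv == 'H' else 1 - py_h
        return px * py
    if x is None:                       # marginalize over the missing X
        return sum(joint(xv, y) for xv in ('H', 'T'))
    return joint(x, y)

# Complete record: factorizes into a theta_X term and ONE Y-parameter term.
print(record_likelihood('H', 'H'))      # theta_X * theta_Y_H
# Incomplete record: a SUM involving BOTH theta_Y_H and theta_Y_T,
# so its log does not split into separate per-parameter terms.
print(record_likelihood(None, 'H'))     # theta_X*theta_Y_H + (1-theta_X)*theta_Y_T
```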

4. Learning Parameters from Incomplete Data (cont.)
- In the presence of incomplete data, the likelihood can have multiple global maxima.
- Example: we can rename the values of a hidden variable H; if H has two values, the likelihood has two global maxima.
- Similarly, local maxima are also replicated.
- Many hidden variables => a serious problem.
[Figure: two-node network H -> Y with H hidden.]

5. Expectation Maximization (EM)
- A general-purpose method for learning from incomplete data.
Intuition:
- If we had access to the counts, we could estimate the parameters.
- However, missing values do not allow us to perform the counts.
- So "complete" the counts using the current parameter assignment.

6. Expectation Maximization (EM)
[Figure: a data table over X, Y, Z with some Y values missing ('?'), a current model over X, Y, Z giving, e.g., P(Y=H | X=H, Z=T, θ) = 0.3 and P(Y=H | X=T, Z=T, θ) = 0.4, and the resulting table of expected counts N(X, Y). These numbers are placed for illustration; they have not been computed.]
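A minimal sketch of the "complete the counts" idea in this slide's setting: records over X, Z with Y sometimes missing, and a current model P(Y | X, Z). Each incomplete record contributes fractional counts according to the current posterior over Y. The probabilities and records below are made up for illustration; they are not the slide's numbers.

```python
from collections import defaultdict

# Current model: P(Y='H' | X, Z) -- made-up numbers for illustration.
p_y_h = {('H', 'T'): 0.3, ('T', 'T'): 0.4, ('H', 'H'): 0.5, ('T', 'H'): 0.5}

# Records (x, y, z); y is None when missing.
data = [('H', 'H', 'T'), ('T', None, 'T'), ('H', None, 'T'), ('T', 'T', 'H')]

expected_counts = defaultdict(float)   # expected N(X, Y)
for x, y, z in data:
    if y is not None:                  # observed: add a whole count
        expected_counts[(x, y)] += 1.0
    else:                              # missing: split the count by P(Y | X, Z, theta)
        p = p_y_h[(x, z)]
        expected_counts[(x, 'H')] += p
        expected_counts[(x, 'T')] += 1.0 - p

print(dict(expected_counts))
```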

7. EM (cont.)
[Figure: the EM loop. An initial network (G, θ0) over X1, X2, X3, hidden H, and Y1, Y2, Y3, together with the training data, is used in the E-step to compute expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H); the M-step reparameterizes to obtain the updated network (G, θ1); then reiterate.]

8. MLE from Incomplete Data
- Finding the MLE parameters is a nonlinear optimization problem.
- Expectation Maximization (EM): use the "current point" to construct an alternative function (which is "nice" to optimize).
- Guarantee: the maximum of the new function scores better than the current point.
[Figure: the likelihood surface L(θ|D) with EM's surrogate function constructed at the current point.]

9. EM in Practice
Initial parameters:
- Random parameter setting
- "Best" guess from another source
Stopping criteria:
- Small change in the likelihood of the data
- Small change in parameter values
Avoiding bad local maxima:
- Multiple restarts
- Early "pruning" of unpromising restarts
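A minimal sketch of the practical recipe on this slide: random restarts, a likelihood-change stopping criterion, and early pruning of unpromising runs. The function em_step is a hypothetical placeholder for one E-step plus M-step of whatever model is being fit; it is not something defined in the slides.

```python
import numpy as np

def run_em(em_step, init_params, data, tol=1e-4, max_iter=200):
    """Iterate em_step until the log-likelihood improves by less than tol."""
    params, ll = init_params, -np.inf
    for _ in range(max_iter):
        params, new_ll = em_step(params, data)   # one E-step + M-step, returns (params, log P(data|params))
        if new_ll - ll < tol:                    # small change in likelihood -> stop
            return params, new_ll
        ll = new_ll
    return params, ll

def em_with_restarts(em_step, random_init, data, n_restarts=10, prune_after=5):
    """Multiple random restarts; prune runs that look unpromising after a few iterations."""
    candidates = []
    for _ in range(n_restarts):
        params, ll = run_em(em_step, random_init(), data, max_iter=prune_after)
        candidates.append((ll, params))
    # Keep only the most promising runs and let them converge fully.
    best = sorted(candidates, key=lambda c: c[0], reverse=True)[:3]
    finished = [run_em(em_step, p, data) for _, p in best]
    return max(finished, key=lambda c: c[1])
```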

10. The Setup of the EM Algorithm
We start with a likelihood function parameterized by θ. The observed quantity is denoted X = x; it is often a vector x1, ..., xL of observations (e.g., evidence for some nodes in a Bayesian network). The hidden quantity is a vector Y = y (e.g., states of unobserved variables in a Bayesian network). The quantity y is defined such that, if it were known, the likelihood of the completed data point P(x, y | θ) would be easy to maximize.
The log-likelihood of an observation x has the form:
  log P(x | θ) = log P(x, y | θ) - log P(y | x, θ)
(because P(x, y | θ) = P(x | θ) P(y | x, θ)).

11. The Goal of the EM Algorithm
The log-likelihood of an observation x has the form:
  log P(x | θ) = log P(x, y | θ) - log P(y | x, θ)
The goal: starting with a current parameter vector θ', EM aims to find a new vector θ such that P(x | θ) > P(x | θ'), with the highest possible difference.
The result: after enough iterations, EM reaches a local maximum of the likelihood P(x | θ).
For independent points (x_i, y_i), i = 1, ..., m, we can similarly write:
  Σ_i log P(x_i | θ) = Σ_i log P(x_i, y_i | θ) - Σ_i log P(y_i | x_i, θ)
We stick to one observation in the derivation, recalling that all derived equations can be adapted by summing over the data points.

12. The Mathematics Involved
Recall that the expectation of a random variable Y with pdf p(y) is given by E[Y] = Σ_y y p(y). The expectation of a function L(Y) is given by E[L(Y)] = Σ_y L(y) p(y).
A slightly harder example:
  E_θ'[log P(x, y | θ)] = Σ_y P(y | x, θ') log P(x, y | θ)
This quantity is denoted Q(θ | θ').
The expectation operator E is linear: for two random variables X, Y and constants a, b,
  E[aX + bY] = a E[X] + b E[Y]

13. The Mathematics Involved (cont.)
Starting with log P(x | θ) = log P(x, y | θ) - log P(y | x, θ), multiplying both sides by P(y | x, θ'), and summing over y yields
  log P(x | θ) = Σ_y P(y | x, θ') log P(x, y | θ) - Σ_y P(y | x, θ') log P(y | x, θ)
The first sum is E_θ'[log P(x, y | θ)] = Q(θ | θ'). We now observe that
  Δ = log P(x | θ) - log P(x | θ') = Q(θ | θ') - Q(θ' | θ') + Σ_y P(y | x, θ') log [ P(y | x, θ') / P(y | x, θ) ]
The last sum is a relative entropy and hence ≥ 0. So choosing θ* = argmax_θ Q(θ | θ') maximizes the lower bound Q(θ | θ') - Q(θ' | θ') on the difference Δ and guarantees Δ ≥ 0; repeating this process leads to a local maximum of log P(x | θ).
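The step that makes EM work, restated as a displayed inequality in LaTeX (no new assumptions, just the argument above): dropping the nonnegative relative-entropy term gives a lower bound on Δ, so any θ that improves Q also improves the likelihood.

```latex
\Delta \;=\; \log P(x\mid\theta) - \log P(x\mid\theta')
       \;\ge\; Q(\theta\mid\theta') - Q(\theta'\mid\theta'),
\qquad\text{hence}\qquad
Q(\theta^{*}\mid\theta') \ge Q(\theta'\mid\theta')
\;\Longrightarrow\; P(x\mid\theta^{*}) \ge P(x\mid\theta').
```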

14. The EM Algorithm Itself
Input: a likelihood function P(x, y | θ) parameterized by θ.
Initialization: fix an arbitrary starting value θ'.
Repeat:
  E-step: compute Q(θ | θ') = E_θ'[log P(x, y | θ)]
  M-step: θ' <- argmax_θ Q(θ | θ')
Until Δ = log P(x | θ) - log P(x | θ') < ε
Comment: in the M-step one can actually choose any θ for which Q(θ | θ') > Q(θ' | θ') (and hence Δ > 0). This change yields the so-called Generalized EM algorithm. It is important when the argmax is hard to compute.
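A self-contained sketch of the loop above on a standard toy problem (not one from the slides): a mixture of two biased coins. Each record is the number of heads in 10 flips of one coin, but which coin was used is hidden; EM alternates between computing the posterior responsibility of each coin for each record (E-step) and re-estimating the biases and mixing weight from the expected counts (M-step). All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate toy data: each record = number of heads in 10 flips of a hidden coin.
true_bias = np.array([0.8, 0.3])             # hidden coin biases (illustrative)
coin = rng.integers(0, 2, size=200)          # hidden assignments
heads = rng.binomial(10, true_bias[coin])    # observed head counts

def em_coins(heads, n=10, iters=50):
    pi, p = 0.5, np.array([0.6, 0.4])        # arbitrary starting point
    for _ in range(iters):
        # E-step: responsibility of coin 0 for each record under current parameters.
        lik0 = pi * p[0]**heads * (1 - p[0])**(n - heads)
        lik1 = (1 - pi) * p[1]**heads * (1 - p[1])**(n - heads)
        r0 = lik0 / (lik0 + lik1)
        # M-step: re-estimate from expected counts.
        pi = r0.mean()
        p = np.array([(r0 * heads).sum() / (r0 * n).sum(),
                      ((1 - r0) * heads).sum() / ((1 - r0) * n).sum()])
    return pi, p

print(em_coins(heads))   # the biases should approach (0.8, 0.3), up to label swap
```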

15. Haplotyping
[Figure: two hidden chains H1, ..., HL (one per parental haplotype) and observed genotypes G1, ..., GL; each Gi depends on the states Hi of both chains.]
Every Gi is an unordered pair of letters {aa, ab, bb}. The source of one letter is the first chain and the source of the other letter is the second chain. Which letter comes from which chain? (Is it paternal or maternal DNA?)

16. Expectation Maximization (EM)
- In practice, EM converges rather quickly at the start but converges slowly near the (possibly local) maximum.
- Hence, EM is often run for a few iterations and then gradient-ascent steps are applied.

17. MLE from Incomplete Data
- Finding the MLE parameters is a nonlinear optimization problem.
- Gradient ascent: follow the gradient of the likelihood w.r.t. the parameters.
[Figure: the likelihood surface L(θ|D) with gradient-ascent steps.]

18. MLE from Incomplete Data
Both ideas (EM and gradient ascent):
- find local maxima only;
- require multiple restarts to approximate the global maximum.

19. Gradient Ascent
Main result (Theorem GA):
  ∂ log P(D | θ) / ∂θ_{x_i, pa_i} = Σ_m P(x_i, pa_i | o[m], θ) / θ_{x_i, pa_i}
- Requires computing P(x_i, pa_i | o[m], θ) for all i, m.
- Inference replaces taking derivatives.

20. Gradient Ascent (cont.)
Proof:
  ∂ log P(D | θ) / ∂θ_{x_i, pa_i} = Σ_m ∂ log P(o[m] | θ) / ∂θ_{x_i, pa_i} = Σ_m [1 / P(o[m] | θ)] ∂P(o[m] | θ) / ∂θ_{x_i, pa_i}
How do we compute ∂P(o[m] | θ) / ∂θ_{x_i, pa_i}?

21. Gradient Ascent (cont.)
For a single observation o, split the evidence as o = (o^d, o^nd), where o^d is the evidence on the descendants of X_i and o^nd is the rest, and sum over the values x'_i, pa'_i of X_i and its parents:
  P(o | θ) = Σ_{x'_i, pa'_i} P(o^d | x'_i, pa'_i, o^nd, θ) P(x'_i | pa'_i) P(pa'_i, o^nd | θ)
Since P(x'_i | pa'_i) = θ_{x'_i, pa'_i}, the parameter θ_{x_i, pa_i} appears only in the term with x'_i = x_i, pa'_i = pa_i, so
  ∂P(o | θ) / ∂θ_{x_i, pa_i} = P(o^d | x_i, pa_i, o^nd, θ) P(pa_i, o^nd | θ) = P(x_i, pa_i, o | θ) / θ_{x_i, pa_i}

22. Gradient Ascent (cont.)
Putting it all together we get
  ∂ log P(D | θ) / ∂θ_{x_i, pa_i} = Σ_m P(x_i, pa_i, o[m] | θ) / [ P(o[m] | θ) θ_{x_i, pa_i} ] = Σ_m P(x_i, pa_i | o[m], θ) / θ_{x_i, pa_i}
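A small numeric check of Theorem GA on a two-node network X -> Y. The network, its parameter values, and the records below are made up for this sketch, not taken from the slides. CPT entries are treated as independent table entries, which is how the derivative in the theorem is taken; the analytic gradient Σ_m P(x_i, pa_i | o[m], θ) / θ_{x_i, pa_i} is compared against a finite difference of the log-likelihood.

```python
import numpy as np

# Tiny network X -> Y (binary). All numbers are illustrative.
p_x = {1: 0.6, 0: 0.4}                        # P(X=x)
cpt = {(1, 1): 0.3, (0, 1): 0.7,              # cpt[(y, x)] = theta_{Y=y | X=x}
       (1, 0): 0.7, (0, 0): 0.3}

def prob_evidence(record, cpt):
    """P(o | theta): sum the joint over all completions consistent with the record."""
    return sum(p_x[x] * cpt[(y, x)]
               for x in (0, 1) if record.get('X', x) == x
               for y in (0, 1) if record.get('Y', y) == y)

records = [{'X': 1, 'Y': 1}, {'Y': 1}, {'X': 0, 'Y': 0}, {'Y': 0}]  # some X values missing

# Theorem GA for the entry theta_{Y=1, X=1}:
#   d log P(D|theta) / d theta_{Y=1,X=1} = sum_m P(Y=1, X=1 | o[m], theta) / theta_{Y=1,X=1}
analytic = sum(
    (p_x[1] * cpt[(1, 1)] if r.get('X', 1) == 1 and r.get('Y', 1) == 1 else 0.0)
    / prob_evidence(r, cpt)
    for r in records) / cpt[(1, 1)]

# Finite-difference check: perturb only the (Y=1, X=1) entry, holding the others fixed.
def log_lik(cpt):
    return sum(np.log(prob_evidence(r, cpt)) for r in records)

eps = 1e-6
up, dn = dict(cpt), dict(cpt)
up[(1, 1)] += eps
dn[(1, 1)] -= eps
numeric = (log_lik(up) - log_lik(dn)) / (2 * eps)
print(analytic, numeric)   # the two should agree to several decimal places
```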