1 Learning Bayesian Networks
Most slides by Nir Friedman, some by Dan Geiger
2 Known Structure -- Incomplete Data
[Figure: an inducer takes the network structure E -> A <- B together with data that contains missing values (entries marked "?") and outputs the CPT P(A | E, B), with entries such as 0.9 and 0.1.]
- Network structure is specified
- Data contains missing values
- We consider assignments to the missing values of E, B, A
3 Learning Parameters from Incomplete Data
Incomplete data:
- Posterior distributions can become interdependent
- Consequence:
  - ML parameters cannot be computed separately for each multinomial
  - The posterior is not a product of independent posteriors
[Figure: plate model with parameters θ_X, θ_{Y|X=H}, θ_{Y|X=T} and data points X[m], Y[m].]
4 Learning Parameters from Incomplete Data (cont.)
- In the presence of incomplete data, the likelihood can have multiple global maxima
- Example:
  - We can rename the values of the hidden variable H
  - If H has two values, the likelihood has two global maxima
- Similarly, local maxima are also replicated
- Many hidden variables pose a serious problem
[Figure: network H -> Y with H hidden.]
5 Expectation Maximization (EM)
- A general-purpose method for learning from incomplete data
Intuition:
- If we had access to counts, we could estimate the parameters
- However, missing values do not allow us to compute the counts
- "Complete" the counts using the current parameter assignment
6 Expectation Maximization (EM)
[Figure: a data table over X, Y, Z with H/T values and some entries missing ("?"), the current model (a network over X, Y, Z with parameters θ), and the resulting expected counts N(X, Y) obtained by completing the missing entries, e.g. P(Y=H | X=H, Z=T, θ) = 0.3 and P(Y=H | X=T, Z=T, θ) = 0.4. These numbers are placed for illustration; they have not been computed.]
7 EM (cont.)
[Figure: the EM loop. Start from an initial network (G, θ0) over X1, X2, X3, a hidden variable H, and Y1, Y2, Y3. E-step: compute the expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H) from the training data. M-step: reparameterize to obtain the updated network (G, θ1). Reiterate.]
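To make the E-step/M-step loop concrete, here is a minimal Python sketch (not the lecture's code; the network, the data, and the starting parameters are made up for illustration). It runs EM on a tiny network H -> Y1, H -> Y2 in which the class variable H is never observed: the E-step computes expected counts from the posterior P(H | y1, y2, θ), and the M-step re-estimates the CPTs from those expected counts.

```python
# Minimal EM sketch for a tiny Bayesian network H -> Y1, H -> Y2 with H hidden.
# All numbers below are illustrative, not taken from the lecture.
import numpy as np

# Toy data: rows are (y1, y2) observations; the class H is hidden in every row.
data = np.array([[1, 1], [1, 1], [1, 0], [0, 0], [0, 0], [0, 1], [1, 1], [0, 0]])

# Arbitrary initial parameters (theta_0 in the slides).
p_h = 0.6                      # P(H = 1)
p_y1 = np.array([0.3, 0.7])    # P(Y1 = 1 | H = 0), P(Y1 = 1 | H = 1)
p_y2 = np.array([0.4, 0.8])    # P(Y2 = 1 | H = 0), P(Y2 = 1 | H = 1)

for _ in range(50):
    # E-step: "complete" the data by computing P(H = 1 | y1, y2, theta)
    # for every record; these are the expected counts of H.
    lik_h1 = (p_h
              * np.where(data[:, 0] == 1, p_y1[1], 1 - p_y1[1])
              * np.where(data[:, 1] == 1, p_y2[1], 1 - p_y2[1]))
    lik_h0 = ((1 - p_h)
              * np.where(data[:, 0] == 1, p_y1[0], 1 - p_y1[0])
              * np.where(data[:, 1] == 1, p_y2[0], 1 - p_y2[0]))
    resp = lik_h1 / (lik_h1 + lik_h0)          # expected count of H = 1 per record

    # M-step: maximum-likelihood re-estimation from the expected counts.
    p_h = resp.mean()
    p_y1 = np.array([((1 - resp) * data[:, 0]).sum() / (1 - resp).sum(),
                     (resp * data[:, 0]).sum() / resp.sum()])
    p_y2 = np.array([((1 - resp) * data[:, 1]).sum() / (1 - resp).sum(),
                     (resp * data[:, 1]).sum() / resp.sum()])

print("P(H=1) =", round(p_h, 3),
      "P(Y1=1|H) =", p_y1.round(3),
      "P(Y2=1|H) =", p_y2.round(3))
```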
8 MLE from Incomplete Data
- Finding the MLE parameters: a nonlinear optimization problem
[Figure: the likelihood surface L(θ|D) as a function of θ.]
Expectation Maximization (EM): use the "current point" to construct an alternative function (which is "nice").
Guarantee: the maximum of the new function scores better than the current point.
9 EM in Practice
Initial parameters:
- Random parameter setting
- "Best" guess from another source
Stopping criteria:
- Small change in the likelihood of the data
- Small change in the parameter values
Avoiding bad local maxima (see the sketch below):
- Multiple restarts
- Early "pruning" of unpromising ones
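These practical points can be wired into a small driver. The sketch below is illustrative only: the inner one-parameter Bernoulli-mixture model, its data, and all names are stand-ins I chose to keep the snippet self-contained. It shows the control flow of several random restarts, each run stopped when the likelihood change falls below a threshold, and the best-scoring run kept; a real application would plug in the network's own E/M update and would also prune runs whose likelihood lags early.

```python
# Restart/stopping wrapper around an EM routine (illustrative stand-in model).
import math
import random

P1, P0 = 0.9, 0.2          # fixed component parameters P(y=1|h=1), P(y=1|h=0)
data = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]

def log_lik(pi):
    return sum(math.log(pi * (P1 if y else 1 - P1) + (1 - pi) * (P0 if y else 1 - P0))
               for y in data)

def em_update(pi):
    # E-step: responsibility of component h=1 per point; M-step: new mixing weight.
    resp = [pi * (P1 if y else 1 - P1) /
            (pi * (P1 if y else 1 - P1) + (1 - pi) * (P0 if y else 1 - P0))
            for y in data]
    return sum(resp) / len(resp)

def run_em(pi, tol=1e-6, max_iter=200):
    ll = log_lik(pi)
    for _ in range(max_iter):
        pi = em_update(pi)
        new_ll = log_lik(pi)
        if new_ll - ll < tol:      # stopping criterion: small change in likelihood
            break
        ll = new_ll
    return pi, new_ll

random.seed(0)
# Multiple random restarts; keep the run with the best final log-likelihood.
best = max((run_em(random.random()) for _ in range(5)), key=lambda t: t[1])
print("best mixing weight %.3f with log-likelihood %.3f" % best)
```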
10 The setup of the EM algorithm
We start with a likelihood function parameterized by θ.
The observed quantity is denoted X = x. It is often a vector x_1, …, x_L of observations (e.g., evidence for some nodes in a Bayesian network).
The hidden quantity is a vector Y = y (e.g., states of unobserved variables in a Bayesian network). The quantity y is defined such that, if it were known, the likelihood of the completed data point P(x, y | θ) would be easy to maximize.
The log-likelihood of an observation x has the form:
  log P(x | θ) = log P(x, y | θ) - log P(y | x, θ)
(because P(x, y | θ) = P(x | θ) P(y | x, θ)).
11 The goal of the EM algorithm
The log-likelihood of an observation x has the form:
  log P(x | θ) = log P(x, y | θ) - log P(y | x, θ)
The goal: starting with a current parameter vector θ', EM's goal is to find a new vector θ such that P(x | θ) > P(x | θ'), with the highest possible difference.
The result: after enough iterations, EM reaches a local maximum of the likelihood P(x | θ).
For independent points (x_i, y_i), i = 1, …, m, we can similarly write:
  Σ_i log P(x_i | θ) = Σ_i log P(x_i, y_i | θ) - Σ_i log P(y_i | x_i, θ)
We will stick to one observation in our derivation, recalling that all derived equations can be modified by summing over the observations.
12 The Mathematics involved
Recall that the expectation of a random variable Y with pdf P(y) is given by E[Y] = Σ_y y p(y). The expectation of a function L(Y) is given by E[L(Y)] = Σ_y L(y) p(y).
A slightly harder example to comprehend:
  E_θ'[log p(x, y | θ)] = Σ_y p(y | x, θ') log p(x, y | θ)
This quantity is denoted Q(θ | θ').
The expectation operator E is linear: for two random variables X, Y and constants a, b,
  E[aX + bY] = a E[X] + b E[Y]
13 The Mathematics involved (cont.)
Starting with log P(x | θ) = log P(x, y | θ) - log P(y | x, θ), multiplying both sides by P(y | x, θ'), and summing over y yields
  log P(x | θ) = Σ_y P(y | x, θ') log P(x, y | θ) - Σ_y P(y | x, θ') log P(y | x, θ)
The first term is E_θ'[log p(x, y | θ)] = Q(θ | θ').
We now observe that
  Δ = log P(x | θ) - log P(x | θ')
    = Q(θ | θ') - Q(θ' | θ') + Σ_y P(y | x, θ') log [ P(y | x, θ') / P(y | x, θ) ]
The last term is a relative entropy and hence ≥ 0. So choosing θ* = argmax_θ Q(θ | θ') guarantees Δ ≥ Q(θ* | θ') - Q(θ' | θ') ≥ 0, and repeating this process leads to a local maximum of log P(x | θ).
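The identity above is easy to check numerically. The sketch below is not from the lecture: the two-value hidden variable, its fixed emission table, and the chosen parameter values are arbitrary. It evaluates both sides of the decomposition for a single observation and confirms that the relative-entropy term is non-negative.

```python
# Numeric check of:  log P(x|t) - log P(x|t') = Q(t|t') - Q(t'|t') + KL(P(y|x,t') || P(y|x,t))
import math

EMIT = {1: 0.9, 0: 0.3}                      # P(x=1 | y), assumed known and fixed

def joint(x, y, t):                          # P(x, y | t), with P(y=1) = t
    py = t if y == 1 else 1 - t
    px_given_y = EMIT[y] if x == 1 else 1 - EMIT[y]
    return py * px_given_y

def marginal(x, t):                          # P(x | t)
    return joint(x, 0, t) + joint(x, 1, t)

def posterior(y, x, t):                      # P(y | x, t)
    return joint(x, y, t) / marginal(x, t)

def Q(t, t_prime, x):                        # E_{t'}[ log P(x, y | t) ]
    return sum(posterior(y, x, t_prime) * math.log(joint(x, y, t)) for y in (0, 1))

x, t_prime, t = 1, 0.4, 0.7
delta = math.log(marginal(x, t)) - math.log(marginal(x, t_prime))
kl = sum(posterior(y, x, t_prime) *
         math.log(posterior(y, x, t_prime) / posterior(y, x, t)) for y in (0, 1))
print(delta, Q(t, t_prime, x) - Q(t_prime, t_prime, x) + kl)   # the two numbers agree
print(kl >= 0)                                                 # relative entropy is non-negative
```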
14 The EM algorithm itself
Input: a likelihood function p(x, y | θ) parameterized by θ.
Initialization: fix an arbitrary starting value θ'.
Repeat:
  E-step: compute Q(θ | θ') = E_θ'[log P(x, y | θ)]
  M-step: θ' ← argmax_θ Q(θ | θ')
Until Δ = log P(x | θ) - log P(x | θ') < ε
Comment: at the M-step one can actually choose any θ as long as Q(θ | θ') - Q(θ' | θ') > 0. This change yields the so-called Generalized EM algorithm. It is important when the argmax is hard to compute.
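The following is a literal rendering of this algorithm box on a deliberately tiny toy model (my own illustration, not the lecture's code): the hidden y is Bernoulli(θ), the emission P(x | y) is fixed and known, Q(θ | θ') is the E-step quantity summed over the data, and the M-step takes argmax_θ Q(θ | θ') by a coarse grid search. Replacing the argmax with any θ that merely increases Q would give the Generalized EM variant mentioned in the comment.

```python
# EM with an explicit Q function and an argmax M-step (toy one-parameter model).
import math

EMIT = {1: 0.9, 0: 0.3}                            # P(x=1 | y), assumed known
xs = [1, 1, 0, 1, 0, 1, 1, 1]                      # observed data

def joint(x, y, t):                                # P(x, y | t), with P(y=1) = t
    return (t if y == 1 else 1 - t) * (EMIT[y] if x == 1 else 1 - EMIT[y])

def log_marginal(t):                               # log P(D | t)
    return sum(math.log(joint(x, 0, t) + joint(x, 1, t)) for x in xs)

def Q(t, t_prime):                                 # E-step quantity, summed over the data
    total = 0.0
    for x in xs:
        post1 = joint(x, 1, t_prime) / (joint(x, 0, t_prime) + joint(x, 1, t_prime))
        total += post1 * math.log(joint(x, 1, t)) + (1 - post1) * math.log(joint(x, 0, t))
    return total

t_prime, eps = 0.5, 1e-8
while True:
    # M-step: argmax_t Q(t | t') over a grid of candidate parameters.
    t_new = max((k / 1000 for k in range(1, 1000)), key=lambda t: Q(t, t_prime))
    delta = log_marginal(t_new) - log_marginal(t_prime)
    t_prime = t_new
    if delta < eps:                                # stop on a small likelihood change
        break
print("estimated P(y=1) =", t_prime)
```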
15 Haplotyping
[Figure: two hidden haplotype chains H_1, …, H_L (paternal and maternal) with observed genotypes G_1, …, G_L, where each G_i is generated from the corresponding states of the two chains.]
Every G_i is an unordered pair of letters {aa, ab, bb}. The source of one letter is the first chain and the source of the other letter is the second chain. Which letter comes from which chain? (Is it paternal or maternal DNA?)
16 Expectation Maximization (EM)
- In practice, EM converges rather quickly at the start but converges slowly near the (possibly local) maximum.
- Hence, EM is often run for a few iterations and then Gradient Ascent steps are applied.
17 MLE from Incomplete Data
- Finding the MLE parameters: a nonlinear optimization problem
Gradient Ascent: follow the gradient of the likelihood w.r.t. the parameters.
[Figure: the likelihood surface L(θ|D) as a function of θ, with gradient steps.]
18 MLE from Incomplete Data
Both ideas:
- Find local maxima only.
- Require multiple restarts to find an approximation to the global maximum.
19 Gradient Ascent
- Main result (Theorem GA):
  ∂ log P(D | θ) / ∂ θ_{x_i, pa_i} = Σ_m P(x_i, pa_i | o[m], θ) / θ_{x_i, pa_i}
- Requires the computation of P(x_i, pa_i | o[m], θ) for all i, m.
- Inference replaces taking derivatives.
20 Gradient Ascent (cont.)
Proof:
  ∂ log P(D | θ) / ∂ θ_{x_i, pa_i} = Σ_m ∂ log P(o[m] | θ) / ∂ θ_{x_i, pa_i}
                                   = Σ_m [ 1 / P(o[m] | θ) ] ∂ P(o[m] | θ) / ∂ θ_{x_i, pa_i}
How do we compute ∂ P(o[m] | θ) / ∂ θ_{x_i, pa_i} ?
21 Gradient Ascent (cont.)
Writing the observation probability as a sum over the values of X_i and its parents,
  P(o | θ) = Σ_{x'_i, pa'_i} P(x'_i, pa'_i, o | θ)
           = Σ_{x'_i, pa'_i} P(o | x'_i, pa'_i, θ) θ_{x'_i, pa'_i} P(pa'_i | θ)
Since the remaining factors do not depend on the entry θ_{x_i, pa_i}, differentiating with respect to it keeps only the term with x'_i = x_i, pa'_i = pa_i:
  ∂ P(o | θ) / ∂ θ_{x_i, pa_i} = P(o | x_i, pa_i, θ) P(pa_i | θ)
                               = P(x_i, pa_i, o | θ) / θ_{x_i, pa_i}
                               = P(x_i, pa_i | o, θ) P(o | θ) / θ_{x_i, pa_i}
22 Gradient Ascent (cont.)
Putting it all together, we get
  ∂ log P(D | θ) / ∂ θ_{x_i, pa_i} = Σ_m P(x_i, pa_i | o[m], θ) / θ_{x_i, pa_i}
which is exactly Theorem GA.
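The end result can be checked numerically on a two-node network. The sketch below is illustrative: the network X -> Y, its tables, and the data with some Y values missing are made up. It evaluates the gradient formula for the single table entry θ_{Y=1, X=1} and compares it with a finite-difference estimate of ∂ log P(D | θ) / ∂θ, treating the entry as a free parameter exactly as in the derivation (no renormalization of the CPT column).

```python
# Check:  d log P(D|theta) / d theta_{x_i, pa_i} = sum_m P(x_i, pa_i | o[m], theta) / theta_{x_i, pa_i}
import math
from copy import deepcopy

tx = [0.4, 0.6]                        # P(X = 0), P(X = 1)
ty = [[0.8, 0.2], [0.3, 0.7]]          # ty[x][y] = P(Y = y | X = x)
data = [(1, 1), (1, None), (0, 0), (1, 0), (0, None), (1, 1)]   # (x, y); None = missing

def prob_obs(obs, tx, ty):
    """P(o[m] | theta): sum over all completions consistent with the observation."""
    x_obs, y_obs = obs
    return sum(tx[x] * ty[x][y]
               for x in ([x_obs] if x_obs is not None else (0, 1))
               for y in ([y_obs] if y_obs is not None else (0, 1)))

def log_lik(tx, ty):
    return sum(math.log(prob_obs(obs, tx, ty)) for obs in data)

# Analytic gradient w.r.t. the single entry theta_{Y=1, X=1} = ty[1][1].
grad = 0.0
for obs in data:
    x_obs, y_obs = obs
    consistent = (x_obs in (1, None)) and (y_obs in (1, None))
    joint = tx[1] * ty[1][1] if consistent else 0.0         # P(X=1, Y=1, o[m] | theta)
    grad += (joint / prob_obs(obs, tx, ty)) / ty[1][1]      # P(X=1, Y=1 | o[m]) / theta

# Finite-difference check on the same entry, leaving all other entries untouched.
h = 1e-6
ty_plus = deepcopy(ty)
ty_plus[1][1] += h
numeric = (log_lik(tx, ty_plus) - log_lik(tx, ty)) / h
print(grad, numeric)      # the two values agree up to O(h)
```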