Learning Bayesian networks. Slides by Nir Friedman.

Learning Bayesian networks: an Inducer takes Data + Prior information and produces a network over the variables E, R, B, A, C together with its conditional probability tables, e.g. a table for P(A | E, B). (Figure: inducer diagram with an example CPT; individual table entries are not recoverable from the transcript.)

Known Structure -- Incomplete Data
- Network structure is specified: E, B, A.
- Data contains missing values.
- We consider assignments to the missing values.
(Figure: the Inducer takes a data table with '?' entries and fills in the CPT P(A | E, B), whose entries are unknown before learning.)

Known Structure / Complete Data
Given:
- a network structure G, and
- a choice of parametric family for P(X_i | Pa_i).
Goal: learn the parameters of the network from complete data, i.e. construct a network that is "closest" to the probability distribution that generated the data.

Maximum Likelihood Estimation in Binomial Data
The likelihood of the data is $L(\theta : D) = \theta^{N_H} (1-\theta)^{N_T}$. Applying the MLE principle we get
$$\hat{\theta} = \frac{N_H}{N_H + N_T}$$
which coincides with what one would expect. Example: $(N_H, N_T) = (3, 2)$; the MLE estimate is $3/5 = 0.6$.
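
As a small illustration (not part of the original slides), the binomial MLE is just the observed fraction of heads:

```python
# Minimal sketch: MLE for binomial (thumbtack) data is N_H / (N_H + N_T).

def binomial_mle(n_heads: int, n_tails: int) -> float:
    """Return the maximum-likelihood estimate of P(heads)."""
    return n_heads / (n_heads + n_tails)

print(binomial_mle(3, 2))  # 0.6, matching the (N_H, N_T) = (3, 2) example above
```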

Learning Parameters for a Bayesian Network (E, B, A, C)
Training data has the form:
$$D = \langle E[1], B[1], A[1], C[1] \rangle, \ldots, \langle E[M], B[M], A[M], C[M] \rangle$$

Learning Parameters for a Bayesian Network (E, B, A, C)
Since we assume i.i.d. samples, the likelihood function is
$$L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta)$$

Learning Parameters for a Bayesian Network (E, B, A, C)
By definition of the network, we get
$$L(\Theta : D) = \prod_m P(E[m] : \Theta)\, P(B[m] : \Theta)\, P(A[m] \mid E[m], B[m] : \Theta)\, P(C[m] \mid A[m] : \Theta)$$

Learning Parameters for a Bayesian Network (E, B, A, C)
Rewriting terms, we get
$$L(\Theta : D) = \prod_m P(E[m] : \Theta) \cdot \prod_m P(B[m] : \Theta) \cdot \prod_m P(A[m] \mid E[m], B[m] : \Theta) \cdot \prod_m P(C[m] \mid A[m] : \Theta)$$
i.e. a product of local likelihoods, one per family (a variable and its parents).

General Bayesian Networks
Generalizing to any Bayesian network: using i.i.d. samples and the network factorization,
$$L(\Theta : D) = \prod_m P(x_1[m], \ldots, x_n[m] : \Theta) = \prod_m \prod_i P(x_i[m] \mid pa_i[m] : \Theta_i) = \prod_i L_i(\Theta_i : D)$$
The likelihood decomposes according to the structure of the network.
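
A minimal sketch (not from the slides) of this decomposition for complete data: the log-likelihood is a sum of one term per sample and per family. The data layout (`records` as dicts, `cpts[var]` keyed by (value, parent values), `parents` such as {"E": (), "B": (), "A": ("E", "B"), "C": ("A",)}) is an assumption made for illustration.

```python
import math

def log_likelihood(records, parents, cpts):
    """Complete-data log-likelihood of a Bayesian network, summed family by family."""
    total = 0.0
    for record in records:                       # product over samples m (sum of logs) ...
        for var, pa in parents.items():          # ... and over families i
            pa_vals = tuple(record[p] for p in pa)
            total += math.log(cpts[var][(record[var], pa_vals)])
    return total
```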

General Bayesian Networks (Cont.)
Complete data ⇒ decomposition ⇒ independent estimation problems. If the parameters of each family are not related, they can be estimated independently of each other. (This is not true in genetic linkage analysis.)

Learning Parameters: Summary
- For multinomials we collect sufficient statistics, which are simply the counts N(x_i, pa_i).
- Parameter estimation:
  MLE: $\hat{\theta}_{x_i \mid pa_i} = \dfrac{N(x_i, pa_i)}{N(pa_i)}$
  Bayesian (Dirichlet prior): $\hat{\theta}_{x_i \mid pa_i} = \dfrac{N(x_i, pa_i) + \alpha(x_i, pa_i)}{N(pa_i) + \alpha(pa_i)}$
- Bayesian methods also require a choice of priors.
- Both MLE and Bayesian estimates are asymptotically equivalent and consistent.
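
A hedged sketch of both estimators for a single conditional distribution P(X_i | pa_i), given the counts for one fixed parent configuration. The value names and the Dirichlet hyperparameters `alpha` are illustrative assumptions, not taken from the slides.

```python
def estimate_cpd(counts, alpha=None):
    """counts: value -> N(x_i, pa_i) for one parent configuration.
    With alpha=None return the MLE; otherwise the Dirichlet-smoothed estimate."""
    if alpha is None:                                   # MLE: N(x_i, pa_i) / N(pa_i)
        total = sum(counts.values())
        return {x: n / total for x, n in counts.items()}
    total = sum(counts.values()) + sum(alpha.values())  # Dirichlet prior adds pseudo-counts
    return {x: (counts[x] + alpha[x]) / total for x in counts}

print(estimate_cpd({"a1": 3, "a0": 1}))                             # {'a1': 0.75, 'a0': 0.25}
print(estimate_cpd({"a1": 3, "a0": 1}, alpha={"a1": 1, "a0": 1}))   # {'a1': 0.666..., 'a0': 0.333...}
```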

Known Structure -- Incomplete Data
- Network structure is specified: E, B, A.
- Data contains missing values.
- We consider assignments to the missing values.
(Figure: the Inducer takes a data table with '?' entries and fills in the CPT P(A | E, B), whose entries are unknown before learning.)

Learning Parameters from Incomplete Data
With incomplete data:
- Posterior distributions over parameters can become interdependent.
- Consequence: ML parameters cannot be computed separately for each multinomial, and the posterior is not a product of independent posteriors.
(Figure: plate model with parameters $\theta_X$, $\theta_{Y \mid X=H}$, $\theta_{Y \mid X=T}$ and variables X[m], Y[m].)

Learning Parameters from Incomplete Data (cont.)
- In the presence of incomplete data, the likelihood can have multiple global maxima.
- Example: we can rename the values of a hidden variable H. If H has two values, the likelihood has two global maxima.
- Similarly, local maxima are also replicated.
- Many hidden variables make this a serious problem.
(Figure: a two-node network H -> Y with H hidden.)

MLE from Incomplete Data
- Finding MLE parameters is a nonlinear optimization problem over the likelihood surface $L(\Theta \mid D)$.
- Gradient ascent: follow the gradient of the likelihood with respect to the parameters.
(Figure: likelihood surface with gradient steps.)

L(  |D) Expectation Maximization (EM): Use “current point” to construct alternative function (which is “nice”) Guaranty: maximum of new function is better scoring than the current point MLE from Incomplete Data u Finding MLE parameters: nonlinear optimization problem 

MLE from Incomplete Data
Both ideas find local maxima only, and require multiple restarts to approximate the global maximum.

Gradient Ascent
Main result (Theorem GA):
$$\frac{\partial \log P(D \mid \Theta)}{\partial \theta_{x_i, pa_i}} = \frac{1}{\theta_{x_i, pa_i}} \sum_m P(x_i, pa_i \mid o[m], \Theta)$$
This requires computing $P(x_i, pa_i \mid o[m], \Theta)$ for all $i, m$: inference replaces taking derivatives.

Gradient Ascent (cont.)
How do we compute $\frac{\partial \log P(D \mid \Theta)}{\partial \theta_{x_i, pa_i}}$?
Proof:
$$\frac{\partial \log P(D \mid \Theta)}{\partial \theta_{x_i, pa_i}} = \sum_m \frac{\partial \log P(o[m] \mid \Theta)}{\partial \theta_{x_i, pa_i}} = \sum_m \frac{1}{P(o[m] \mid \Theta)} \frac{\partial P(o[m] \mid \Theta)}{\partial \theta_{x_i, pa_i}}$$

Gradient Ascent (cont.)
For a single sample, sum over the values of the family $(X_i, Pa_i)$ and use the network factorization:
$$\frac{\partial P(o[m] \mid \Theta)}{\partial \theta_{x_i, pa_i}} = \sum_{x'_i, pa'_i} \frac{\partial}{\partial \theta_{x_i, pa_i}} P(x'_i, pa'_i, o[m] \mid \Theta) = \sum_{x'_i, pa'_i} \frac{\partial}{\partial \theta_{x_i, pa_i}} \left[ P(o[m] \mid x'_i, pa'_i, \Theta)\, P(x'_i \mid pa'_i, \Theta)\, P(pa'_i \mid \Theta) \right]$$
Since $P(x'_i \mid pa'_i, \Theta) = \theta_{x'_i, pa'_i}$, only the term with $x'_i = x_i$ and $pa'_i = pa_i$ depends on $\theta_{x_i, pa_i}$, and its derivative with respect to $\theta_{x_i, pa_i}$ is 1; hence
$$\frac{\partial P(o[m] \mid \Theta)}{\partial \theta_{x_i, pa_i}} = P(o[m] \mid x_i, pa_i, \Theta)\, P(pa_i \mid \Theta) = \frac{P(x_i, pa_i, o[m] \mid \Theta)}{\theta_{x_i, pa_i}}$$

Gradient Ascent (cont.)
Putting it all together we get
$$\frac{\partial \log P(D \mid \Theta)}{\partial \theta_{x_i, pa_i}} = \sum_m \frac{P(x_i, pa_i, o[m] \mid \Theta)}{P(o[m] \mid \Theta)\, \theta_{x_i, pa_i}} = \frac{1}{\theta_{x_i, pa_i}} \sum_m P(x_i, pa_i \mid o[m], \Theta)$$
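
A hedged sketch of how one gradient entry could be assembled from Theorem GA. The `family_posterior` callable is a placeholder for whatever inference routine computes $P(x_i, pa_i \mid o[m], \Theta)$ (e.g. variable elimination); it is an assumption of this sketch, not an existing API.

```python
def gradient_entry(theta_value, observations, family_posterior, x_i, pa_i):
    """Partial derivative of log P(D | Theta) w.r.t. the single CPT entry theta_{x_i, pa_i}."""
    total = 0.0
    for o_m in observations:
        # P(x_i, pa_i | o[m], Theta): obtained by inference, not by symbolic differentiation
        total += family_posterior(x_i, pa_i, o_m)
    return total / theta_value
```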

Expectation Maximization (EM)
A general-purpose method for learning from incomplete data.
Intuition:
- If we had access to the counts, we could estimate the parameters.
- However, missing values do not allow us to perform the counts.
- So we "complete" the counts using the current parameter assignment.

Expectation Maximization (EM)
Data: a table over X, Y, Z with values H/T, where some entries are missing ('?').
Current model: a network over X, Y, Z giving, for example, $P(Y=H \mid X=H, Z=T, \Theta) = 0.3$ and $P(Y=H \mid X=T, Z=T, \Theta) = 0.4$.
Expected counts: rows with missing values contribute fractionally to N(X, Y) according to these probabilities. (These numbers are placed for illustration; they have not been computed.)
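
A hedged sketch of the count completion this slide illustrates: rows where Y is missing contribute fractional counts weighted by the current model's posterior over Y. The row layout and the `p_y_heads` callable are assumptions standing in for the model's inference, not the slides' code.

```python
from collections import defaultdict

def expected_counts_xy(rows, p_y_heads):
    """rows: dicts with keys 'X', 'Y', 'Z'; a missing Y is stored as None.
    p_y_heads(x, z) returns P(Y=H | X=x, Z=z, Theta) under the current model."""
    counts = defaultdict(float)
    for r in rows:
        if r["Y"] is not None:
            counts[(r["X"], r["Y"])] += 1.0      # fully observed row: count of 1
        else:
            p_h = p_y_heads(r["X"], r["Z"])      # current-model posterior for the missing Y
            counts[(r["X"], "H")] += p_h         # fractional "completed" counts
            counts[(r["X"], "T")] += 1.0 - p_h
    return counts

# e.g. with P(Y=H | X=T, Z=T, Theta) = 0.4, a row (X=T, Y=?, Z=T)
# adds 0.4 to N(X=T, Y=H) and 0.6 to N(X=T, Y=T).
```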

EM (cont.)
Start with training data over X_1, X_2, X_3, Y_1, Y_2, Y_3 (H is hidden) and an initial network (G, Θ^0).
E-Step: compute the expected counts N(X_1), N(X_2), N(X_3), N(H, X_1, X_2, X_3), N(Y_1, H), N(Y_2, H), N(Y_3, H).
M-Step: reparameterize to obtain the updated network (G, Θ^1).
Reiterate.
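
A hedged skeleton of this E-step / M-step loop. The helpers are assumptions: `expected_counts` stands for a count-completion routine like the one sketched above (it requires inference in (G, Θ)), and `reestimate` stands for the MLE or Dirichlet estimator from the summary slide. Parameters are assumed to be stored in a flat dict of CPT entries.

```python
def em(data, theta0, expected_counts, reestimate, n_iter=50, tol=1e-6):
    """Generic EM loop: E-step completes the counts, M-step reparameterizes, then reiterate."""
    theta = theta0
    for _ in range(n_iter):
        stats = expected_counts(data, theta)     # E-step: expected sufficient statistics
        new_theta = reestimate(stats)            # M-step: new CPT entries from the counts
        if max(abs(new_theta[k] - theta[k]) for k in theta) < tol:
            theta = new_theta
            break                                # parameters stopped changing
        theta = new_theta                        # reiterate with the updated network
    return theta
```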

Expectation Maximization (EM)
- In practice, EM converges rather quickly at the start but slowly near the (possibly local) maximum.
- Hence, EM is often run for a few iterations and then gradient ascent steps are applied.

Final Homework
Question 1: Develop an algorithm that, given a pedigree as input, provides the most probable haplotype of each individual in the pedigree. Use the Bayesian network model of Superlink to formulate the problem exactly as a query. Specify the algorithm at length, discussing as many details as you can, and analyze its efficiency. Devote time to clear notation and presentation.
Question 2: Specialize the formula given in Theorem GA for $\theta$ in genetic linkage analysis. In particular, assume exactly 3 loci: Marker 1, Disease 2, Marker 3, with $\theta$ being the recombination fraction between loci 2 and 1 and $0.1 - \theta$ being the recombination fraction between loci 3 and 2.
1. Specify the formula for a pedigree with two parents and two children.
2. Extend the formula to arbitrary pedigrees. Note that $\theta$ is the same in many local probability tables.