Learning Bayesian networks

Learning Bayesian networks. Slides by Nir Friedman.

Learning Bayesian networks. [Figure: an Inducer takes Data + Prior information and outputs a Bayesian network over E, R, B, A, C together with its conditional probability tables, e.g. P(A | E, B).]

Known Structure -- Incomplete Data. [Figure: records over E, B, A with missing values, e.g. <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, ..., <?,Y,Y>; the Inducer must fill in the CPT P(A | E, B).] Network structure is specified. Data contains missing values. We consider assignments to missing values.

Known Structure / Complete Data. Given a network structure G and a choice of parametric family for P(Xi | Pai), learn the parameters of the network from complete data. Goal: construct a network that is "closest" to the probability distribution that generated the data.

Maximum Likelihood Estimation in Binomial Data. Applying the MLE principle we get the estimate θ = N_H / (N_H + N_T), which coincides with what one would expect. [Figure: the likelihood L(θ : D) plotted as a function of θ.] Example: (N_H, N_T) = (3, 2); the MLE estimate is 3/5 = 0.6.
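
To make the estimate concrete, here is a minimal Python sketch (not part of the original slides) that computes the binomial MLE directly from the counts N_H and N_T:

# Minimal sketch: maximum-likelihood estimate of P(heads) from observed counts.
def binomial_mle(n_heads, n_tails):
    total = n_heads + n_tails
    if total == 0:
        raise ValueError("need at least one observation")
    return n_heads / total

# The slide's example: (N_H, N_T) = (3, 2) gives 3/5 = 0.6.
print(binomial_mle(3, 2))  # 0.6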

Learning Parameters for a Bayesian Network. [Example network: E and B are parents of A; A is the parent of C.] Training data has the form D = { <E[1], B[1], A[1], C[1]>, ..., <E[M], B[M], A[M], C[M]> }.

Learning Parameters for a Bayesian Network. Since we assume i.i.d. samples, the likelihood function is L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ).

Learning Parameters for a Bayesian Network. By the definition of the network, we get L(Θ : D) = ∏_m P(E[m] : Θ) P(B[m] : Θ) P(A[m] | E[m], B[m] : Θ) P(C[m] | A[m] : Θ).

Learning Parameters for a Bayesian Network. Rewriting terms, we get L(Θ : D) = [∏_m P(E[m] : Θ)] [∏_m P(B[m] : Θ)] [∏_m P(A[m] | E[m], B[m] : Θ)] [∏_m P(C[m] | A[m] : Θ)], i.e., a separate product for each conditional probability table.
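
As an illustration of this decomposition, the following Python sketch (not part of the original slides; the record format and dictionary-based CPTs are assumptions made for the example) computes the log-likelihood of the E, B, A, C network as a sum of one term per family:

import math

# Sketch: log-likelihood of the example network (E and B are parents of A,
# A is the parent of C), decomposed into one term per family.
def log_likelihood(data, p_e, p_b, p_a_given_eb, p_c_given_a):
    ll = 0.0
    for e, b, a, c in data:                      # one i.i.d. sample per record
        ll += math.log(p_e[e])                   # family of E (no parents)
        ll += math.log(p_b[b])                   # family of B (no parents)
        ll += math.log(p_a_given_eb[(e, b)][a])  # family of A, parents E and B
        ll += math.log(p_c_given_a[a][c])        # family of C, parent A
    return ll

Because each parameter appears in exactly one of these terms, maximizing the likelihood reduces to maximizing each family's term separately, which is the point of the next two slides.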

General Bayesian Networks. Generalizing to any Bayesian network: L(Θ : D) = ∏_m P(x[m] : Θ) (i.i.d. samples) = ∏_m ∏_i P(x_i[m] | pa_i[m] : Θ_i) (network factorization) = ∏_i L_i(Θ_i : D). The likelihood decomposes according to the structure of the network.

General Bayesian Networks (Cont.) Complete Data ⇒ Decomposition ⇒ Independent Estimation Problems. If the parameters for each family are not related, then they can be estimated independently of each other. (Not true in genetic linkage analysis.)

Learning Parameters: Summary. For multinomial CPTs we collect sufficient statistics, which are simply the counts N(x_i, pa_i). Parameter estimation: MLE: θ_{x_i | pa_i} = N(x_i, pa_i) / N(pa_i); Bayesian (Dirichlet prior): θ_{x_i | pa_i} = (N(x_i, pa_i) + α(x_i, pa_i)) / (N(pa_i) + α(pa_i)). Bayesian methods also require a choice of priors. Both MLE and Bayesian estimates are asymptotically equivalent and consistent.
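
A small Python sketch (not part of the original slides; the Counter-based layout and the ('y', 'n') value domain are assumptions made for the example) of both estimators for a single CPT, P(A | E, B), from complete data:

from collections import Counter

# Sketch: estimate P(A | E, B) from complete (e, b, a) records.
# alpha = 0 gives the MLE; alpha > 0 acts as a Dirichlet pseudo-count per entry.
def estimate_cpt(data, alpha=0.0, values=("y", "n")):
    joint = Counter((e, b, a) for e, b, a in data)   # N(a, pa)
    parent = Counter((e, b) for e, b, _ in data)     # N(pa)
    cpt = {}
    for pa in parent:
        denom = parent[pa] + alpha * len(values)     # N(pa) + alpha(pa)
        cpt[pa] = {a: (joint[pa + (a,)] + alpha) / denom for a in values}
    return cpt

data = [("y", "n", "y"), ("y", "n", "n"), ("n", "n", "n"), ("y", "y", "y")]
print(estimate_cpt(data))             # MLE
print(estimate_cpt(data, alpha=1.0))  # Bayesian estimate, uniform Dirichlet prior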

Known Structure -- Incomplete Data. [Figure: records over E, B, A with missing values, e.g. <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, ..., <?,Y,Y>; the Inducer must fill in the CPT P(A | E, B).] Network structure is specified. Data contains missing values. We consider assignments to missing values.

Learning Parameters from Incomplete Data. [Figure: the two-node network X -> Y with parameters θ_X, θ_{Y|X=H}, θ_{Y|X=T} and a plate over the samples X[m], Y[m].] Incomplete data: the posterior distributions over the parameters can become interdependent. Consequence: ML parameters cannot be computed separately for each multinomial, and the posterior is not a product of independent posteriors.

Learning Parameters from Incomplete Data (cont.). In the presence of incomplete data, the likelihood can have multiple global maxima. Example: we can rename the values of a hidden variable H; if H has two values, the likelihood has two global maxima. Similarly, local maxima are also replicated. Many hidden variables ⇒ a serious problem. [Figure: network H -> Y with H hidden.]

MLE from Incomplete Data. Finding the MLE parameters is a nonlinear optimization problem. Gradient Ascent: follow the gradient of the likelihood with respect to the parameters. [Figure: the likelihood L(Θ | D) plotted as a function of Θ.]

MLE from Incomplete Data. Finding the MLE parameters is a nonlinear optimization problem. Expectation Maximization (EM): use the "current point" to construct an alternative function (which is "nice"). Guarantee: the maximum of the new function scores better than the current point.

MLE from Incomplete Data. Both ideas find local maxima only, and require multiple restarts to approximate the global maximum.

Gradient Ascent: Main result. Theorem GA: ∂ log P(D | Θ) / ∂ θ_{x_i, pa_i} = (1 / θ_{x_i, pa_i}) Σ_m P(x_i, pa_i | o[m], Θ). This requires computing P(x_i, pa_i | o[m], Θ) for all i and m; inference replaces taking derivatives.

Gradient Ascent (cont). Proof: ∂ log P(D | Θ) / ∂ θ_{x_i, pa_i} = Σ_m ∂ log P(o[m] | Θ) / ∂ θ_{x_i, pa_i} = Σ_m (1 / P(o[m] | Θ)) ∂ P(o[m] | Θ) / ∂ θ_{x_i, pa_i}. How do we compute ∂ P(o[m] | Θ) / ∂ θ_{x_i, pa_i}?

Gradient Ascent (cont). Since P(o | Θ) = Σ_{x_i', pa_i'} P(o | x_i', pa_i', Θ) θ_{x_i', pa_i'} P(pa_i' | Θ), where the factors P(o | x_i', pa_i', Θ) and P(pa_i' | Θ) do not depend on θ_{x_i, pa_i} and ∂ θ_{x_i', pa_i'} / ∂ θ_{x_i, pa_i} = 1 exactly when (x_i', pa_i') = (x_i, pa_i) (and 0 otherwise), we get ∂ P(o | Θ) / ∂ θ_{x_i, pa_i} = P(o | x_i, pa_i, Θ) P(pa_i | Θ).

Gradient Ascent (cont). Putting it all together, we get ∂ log P(D | Θ) / ∂ θ_{x_i, pa_i} = Σ_m P(o[m] | x_i, pa_i, Θ) P(pa_i | Θ) / P(o[m] | Θ) = Σ_m P(x_i, pa_i | o[m], Θ) / θ_{x_i, pa_i}, which is the statement of Theorem GA.
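
The following Python sketch (not part of the original slides) shows how Theorem GA would be used inside one gradient-ascent step. The inference routine posterior(i, x, pa, obs, theta), returning P(x_i = x, Pa_i = pa | o[m], Θ), is a hypothetical placeholder, and normalize_per_parent is a helper of my own: the slides only say that the required posteriors are obtained by inference in the network.

# Sketch: one gradient-ascent step on the log-likelihood, using Theorem GA.
def gradient_step(theta, data, posterior, lr=0.01):
    # theta: {i: {(x, pa): value}};  data: list of partial observations o[m].
    new_theta = {}
    for i, params in theta.items():
        grads = {}
        for (x, pa), value in params.items():
            # Theorem GA: d log P(D|Theta) / d theta_{x,pa}
            #           = sum_m P(x, pa | o[m], Theta) / theta_{x,pa}
            grads[(x, pa)] = sum(posterior(i, x, pa, obs, theta) for obs in data) / value
        # Unconstrained step, then renormalize each conditional distribution.
        updated = {k: max(v + lr * grads[k], 1e-9) for k, v in params.items()}
        new_theta[i] = normalize_per_parent(updated)
    return new_theta

def normalize_per_parent(params):
    # Renormalize {(x, pa): value} so that, for each pa, the values sum to 1.
    totals = {}
    for (x, pa), v in params.items():
        totals[pa] = totals.get(pa, 0.0) + v
    return {(x, pa): v / totals[pa] for (x, pa), v in params.items()}

The renormalization is only a crude way of keeping each conditional a proper probability distribution; a real implementation would handle this constraint more carefully (for example by reparameterizing), which the slides do not address.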

Expectation Maximization (EM). A general-purpose method for learning from incomplete data. Intuition: if we had access to the counts, we could estimate the parameters; however, missing values prevent us from computing the counts directly. Instead, we "complete" the counts using the current parameter assignment.

Expectation Maximization (EM). [Example: data over X, Y, Z with missing values. Data records (X, Y, Z): (H, ?, T), (T, ?, T), (H, H, ?), (H, T, T), (T, T, H). Using the current model Θ, e.g. P(Y=H | X=H, Z=T, Θ) = 0.3 and P(Y=H | X=T, Z=T, Θ) = 0.4, records with missing values are split fractionally between the possible completions, yielding expected counts N(X, Y): (H, H) = 1.3, (T, H) = 0.4, (H, T) = 1.7, (T, T) = 1.6. These numbers are placed for illustration; they have not been computed.]

EM (cont.). [Figure: the EM loop. Start from training data and an initial network (G, Θ0) over X1, X2, X3, a hidden variable H, and Y1, Y2, Y3. E-step (Computation): use the current network and the training data to compute the expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H). M-step (Reparameterize): re-estimate the parameters from these expected counts to obtain the updated network (G, Θ1). Reiterate.]
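
As a concrete, deliberately tiny illustration of the E-step/M-step loop, here is a Python sketch (not part of the original slides; the two-value domain {'H', 'T'} and the data layout are assumptions) of EM for the two-node network X -> Y from the earlier slide, where Y is sometimes unobserved:

# Sketch: EM for X -> Y with X always observed and Y possibly missing ('?').
def em(data, theta_y_given_x, n_iter=20):
    # data: list of (x, y) with y in {'H', 'T', '?'};
    # theta_y_given_x: {x: P(Y='H' | X=x)} initial guess.
    for _ in range(n_iter):
        # E-step: expected counts N(X, Y) under the current parameters.
        counts = {(x, y): 0.0 for x in ("H", "T") for y in ("H", "T")}
        for x, y in data:
            if y == "?":
                p_h = theta_y_given_x[x]        # P(Y=H | X=x, current theta)
                counts[(x, "H")] += p_h
                counts[(x, "T")] += 1.0 - p_h
            else:
                counts[(x, y)] += 1.0
        # M-step: re-estimate parameters from the expected counts,
        # exactly as if the data were complete.
        theta_y_given_x = {
            x: counts[(x, "H")] / (counts[(x, "H")] + counts[(x, "T")])
            for x in ("H", "T")
        }
    return theta_y_given_x

data = [("H", "H"), ("H", "?"), ("T", "T"), ("T", "?"), ("H", "T")]
print(em(data, {"H": 0.5, "T": 0.5}))

Each E-step distributes every record with a missing Y across Y = H and Y = T in proportion to the current P(Y | X), which is exactly the "completed counts" idea of the previous slides.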

Expectation Maximization (EM). In practice, EM converges rather quickly at the start but slowly near the (possibly local) maximum. Hence, EM is often run for a few iterations and then Gradient Ascent steps are applied.

Final Homework. Question 1: Develop an algorithm that, given a pedigree as input, provides the most probable haplotype of each individual in the pedigree. Use the Bayesian network model of Superlink to formulate the problem exactly as a query. Specify the algorithm at length, discussing as many details as you can. Analyze its efficiency. Devote time to illuminating notation and presentation. Question 2: Specialize the formula given in Theorem GA for θ in genetic linkage analysis. In particular, assume exactly 3 loci: Marker 1, Disease 2, Marker 3, with θ being the recombination fraction between loci 1 and 2 and 0.1 − θ being the recombination fraction between loci 2 and 3. Specify the formula for a pedigree with two parents and two children. Extend the formula to arbitrary pedigrees. Note that θ is the same in many local probability tables.