Learning Bayesian Networks

Presentation transcript:

Learning Bayesian Networks

Dimensions of Learning
- Model: Bayes net / Markov net
- Data: complete / incomplete
- Structure: known / unknown
- Objective: generative / discriminative

Learning Bayes nets from data
[Figure: a data table over variables X1 ... X9, together with prior/expert information, is fed into a Bayes-net learner, which outputs one or more Bayes nets.]

From thumbtacks to Bayes nets
The thumbtack problem can be viewed as learning the probability θ for a very simple Bayes net: a single binary variable X (heads/tails). Each toss contributes an observation X1, X2, ..., XN, all governed by the same parameter θ.
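For concreteness, this is the standard conjugate (Beta) update that the later slides rely on; the formula is not shown on the slide itself, and the notation (#h heads, #t tails) is mine:

```latex
% Posterior over the thumbtack parameter \theta after #h heads and #t tails
p(\theta \mid d) \;\propto\; \theta^{\#h}(1-\theta)^{\#t}\,
  \mathrm{Beta}(\theta \mid \alpha_h, \alpha_t)
\quad\Longrightarrow\quad
p(\theta \mid d) \;=\; \mathrm{Beta}(\theta \mid \alpha_h + \#h,\ \alpha_t + \#t)
```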

The next simplest Bayes net
Two binary variables, X (heads/tails) and Y (heads/tails), with no edge between them.

The next simplest Bayes net
[Figure: parameter node θX with observations X1 ... XN, and parameter node θY with observations Y1 ... YN, one (Xi, Yi) pair per case 1 ... N. Question mark: is there a dependence between θX and θY?]

The next simplest Bayes net
[Same figure, with no arc between θX and θY.] "Parameter independence": the two parameters are independent a priori.

The next simplest Bayes net
"Parameter independence" means the parameters can be learned separately: two separate thumbtack-like learning problems.

A bit more difficult...
Now add an arc X → Y. Three probabilities to learn:
- θ_X=heads
- θ_Y=heads|X=heads
- θ_Y=heads|X=tails

A bit more difficult...
[Figure: parameter nodes θX, θY|X=heads, and θY|X=tails over cases 1 and 2; each Yi is tied to θY|X=heads or θY|X=tails according to whether Xi came up heads or tails.]

A bit more difficult...
[Same figure as above.]

A bit more difficult...
[Same figure.] Question marks: are the three parameters dependent on one another?

A bit more difficult...
With parameter independence, these are 3 separate thumbtack-like problems.

In general ...
Learning probabilities in a Bayes net is straightforward if:
- the data are complete
- the local distributions are from the exponential family (binomial, Poisson, gamma, ...)
- parameter independence holds
- conjugate priors are used
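To illustrate how simple this becomes with complete data, here is a minimal sketch (not from the slides; the data values, variable names, and the Beta(1,1) pseudocount choice are my own) that estimates the three parameters of the X → Y network by counting:

```python
from collections import Counter

# Complete data: one observed (x, y) pair per case, each "heads" or "tails".
data = [("heads", "heads"), ("heads", "tails"),
        ("tails", "heads"), ("heads", "heads")]

ALPHA = 1.0  # Beta(1, 1) pseudocount per outcome (uniform conjugate prior)

x_counts = Counter(x for x, _ in data)
xy_counts = Counter(data)

def posterior_mean(heads, tails):
    """Posterior mean of a Beta(ALPHA + heads, ALPHA + tails) distribution."""
    return (ALPHA + heads) / (2 * ALPHA + heads + tails)

# Three separate thumbtack-like problems:
theta_x = posterior_mean(x_counts["heads"], x_counts["tails"])
theta_y_given_heads = posterior_mean(xy_counts[("heads", "heads")],
                                     xy_counts[("heads", "tails")])
theta_y_given_tails = posterior_mean(xy_counts[("tails", "heads")],
                                     xy_counts[("tails", "tails")])

print(theta_x, theta_y_given_heads, theta_y_given_tails)
```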

Incomplete data makes parameters dependent
[Figure: the same X → Y model, but with some Xi or Yi unobserved; the missing values couple θX, θY|X=heads, and θY|X=tails, so the parameters can no longer be learned separately.]

Solution: Use EM
- Initialize parameters ignoring the missing data
- E step: infer the missing values using the current parameters
- M step: estimate the parameters using the completed data
- Can also use gradient descent
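A minimal EM sketch for the X → Y network, under my own assumptions (binary variables, only some X values missing, fixed number of iterations); the data values and parameter names are illustrative, not from the slides:

```python
# EM for the X -> Y network when some X values are missing (None).
data = [("heads", "heads"), (None, "tails"), ("tails", "heads"),
        (None, "heads"), ("heads", "heads")]

theta_x, theta_y_h, theta_y_t = 0.5, 0.5, 0.5   # initial guesses

def p_y_given_x(y, x_is_heads):
    """P(Y = y | X) under the current parameters."""
    theta = theta_y_h if x_is_heads else theta_y_t
    return theta if y == "heads" else 1.0 - theta

for _ in range(50):
    n = len(data)
    n_xh = 0.0      # expected number of cases with X = heads
    n_xh_yh = 0.0   # expected number of cases with X = heads, Y = heads
    n_xt_yh = 0.0   # expected number of cases with X = tails, Y = heads
    # E step: for each case, w = P(X = heads | Y = y) under current parameters.
    for x, y in data:
        if x is None:
            num = theta_x * p_y_given_x(y, True)
            den = num + (1.0 - theta_x) * p_y_given_x(y, False)
            w = num / den
        else:
            w = 1.0 if x == "heads" else 0.0
        n_xh += w
        if y == "heads":
            n_xh_yh += w
            n_xt_yh += 1.0 - w
    # M step: maximum-likelihood estimates from the expected counts.
    theta_x = n_xh / n
    theta_y_h = n_xh_yh / n_xh
    theta_y_t = n_xt_yh / (n - n_xh)

print(theta_x, theta_y_h, theta_y_t)
```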

Learning Bayes-net structure
Given data, which model is correct?
[Figure: model 1 and model 2, two candidate structures over the variables X and Y.]

Bayesian approach
Given data d, which model is correct? Better: which model is more likely?
[Figure: data d is used to compute the posterior probability of model 1 and model 2.]

Bayesian approach: Model averaging
Given data d, weight each model by its posterior probability and average the models' predictions.
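For reference, the standard model-averaged prediction this slide alludes to (notation mine) is:

```latex
% Bayesian model averaging over candidate structures m
p(x \mid d) \;=\; \sum_{m} p(m \mid d)\, p(x \mid d, m),
\qquad
p(m \mid d) \;=\; \frac{p(m)\, p(d \mid m)}{\sum_{m'} p(m')\, p(d \mid m')}
```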

Bayesian approach: Model selection
Given data d, keep only the best (most probable) model. Reasons to select rather than average:
- explanation
- understanding
- tractability

To score a model, use Bayes' theorem
Given data d, the model score is the posterior p(m | d) ∝ p(m) p(d | m), where the likelihood p(d | m) = ∫ p(d | θ, m) p(θ | m) dθ is the "marginal likelihood": the data likelihood averaged over the parameter prior.

Thumbtack example
With the conjugate (Beta) prior, the marginal likelihood of the single heads/tails variable X has a closed form.
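The formula on this slide did not survive extraction; the standard Beta-binomial closed form, which I believe is what was shown (notation mine), is:

```latex
% Marginal likelihood of #h heads and #t tails under a Beta(\alpha_h, \alpha_t) prior
p(d) \;=\; \int_0^1 \theta^{\#h}(1-\theta)^{\#t}\,
           \mathrm{Beta}(\theta \mid \alpha_h, \alpha_t)\, d\theta
      \;=\; \frac{\Gamma(\alpha_h + \alpha_t)}{\Gamma(\alpha_h + \alpha_t + N)}
            \cdot
            \frac{\Gamma(\alpha_h + \#h)\,\Gamma(\alpha_t + \#t)}
                 {\Gamma(\alpha_h)\,\Gamma(\alpha_t)},
\qquad N = \#h + \#t
```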

More complicated graphs
For the X → Y network, the marginal likelihood factors into 3 separate thumbtack-like problems, one each for θ_X, θ_Y|X=heads, and θ_Y|X=tails.

Model score for a discrete Bayes net
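The slide's formula was lost in extraction; the standard closed-form marginal likelihood for a discrete Bayes net with Dirichlet priors (the BD score of Cooper and Herskovits / Heckerman et al.), in the usual notation, is:

```latex
% N_{ijk}: number of cases with X_i = k and the parents of X_i in configuration j
% \alpha_{ijk}: Dirichlet hyperparameter; \alpha_{ij} = \sum_k \alpha_{ijk}, N_{ij} = \sum_k N_{ijk}
p(d \mid m) \;=\; \prod_{i=1}^{n} \prod_{j=1}^{q_i}
  \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})}
  \prod_{k=1}^{r_i}
  \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}
```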

Computation of marginal likelihood
Efficient closed form if:
- local distributions are from the exponential family (binomial, Poisson, gamma, ...)
- parameter independence
- conjugate priors
- no missing data (including no hidden variables)

Structure search
Finding the BN structure with the highest score among structures with at most k parents is NP-hard for k > 1 (Chickering, 1995).
Heuristic methods:
- greedy search
- greedy search with restarts
- MCMC methods
[Flowchart: initialize structure; score all possible single changes; if any change improves the score, perform the best change and repeat; otherwise return the saved structure.]
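A minimal sketch of the greedy loop in the flowchart above, under my own assumptions: `score` and `single_edge_changes` are hypothetical helpers standing in for a decomposable structure score (e.g., BD) and a generator of legal single-edge add/delete/reverse moves.

```python
def greedy_structure_search(data, initial_structure, score, single_edge_changes):
    """Greedy hill climbing over Bayes-net structures.

    score(structure, data) -> float, higher is better (e.g., a BD or BIC score).
    single_edge_changes(structure) -> iterable of neighboring structures that
        differ by one edge addition, deletion, or reversal (and remain acyclic).
    """
    current = initial_structure
    current_score = score(current, data)
    while True:
        # Score all possible single changes.
        neighbors = [(score(s, data), s) for s in single_edge_changes(current)]
        if not neighbors:
            return current
        best_score, best = max(neighbors, key=lambda pair: pair[0])
        # Any change better? If not, return the saved structure.
        if best_score <= current_score:
            return current
        current, current_score = best, best_score
```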

Structure priors
1. All possible structures equally likely
2. Partial ordering, required / prohibited arcs
3. Prior(m) ∝ Similarity(m, prior BN)

Parameter priors
- All uniform: Beta(1, 1)
- Use a prior Bayes net

Parameter priors
Recall the intuition behind the Beta prior for the thumbtack:
- The hyperparameters α_h and α_t can be thought of as imaginary counts from our prior experience, starting from "pure ignorance"
- Equivalent sample size = α_h + α_t
- The larger the equivalent sample size, the more confident we are about the long-run fraction

Parameter priors
[Figure: a prior Bayes net over X1 ... X9 plus an equivalent sample size gives an imaginary count for any variable configuration, which in turn gives parameter priors for any Bayes-net structure over X1 ... Xn ("parameter modularity").]
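In symbols (my notation, not the slide's): given a prior network defining a joint distribution over X1 ... Xn and an equivalent sample size α, the Dirichlet hyperparameter for state k of Xi under parent configuration j is taken to be

```latex
% Imaginary counts derived from the prior network and equivalent sample size \alpha
\alpha_{ijk} \;=\; \alpha \cdot p\!\left(X_i = k,\ \mathrm{Pa}_i = j \mid \text{prior network}\right)
```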

Combining knowledge & data
[Figure: a prior network over X1 ... X9 plus an equivalent sample size, combined with a data table, yields improved network(s).]

Example: College Plans data (Heckerman et al. 1997)
Data on 5 variables that might influence high school students' decision to attend college:
- Sex: male or female
- SES: socioeconomic status (low, lower-middle, upper-middle, high)
- IQ: discretized into low, lower-middle, upper-middle, high
- PE: parental encouragement (low or high)
- CP: college plans (yes or no)
128 possible joint configurations.
Heckerman et al. computed the exact posterior over all 29,281 possible 5-node DAGs, except those in which Sex or SES has parents and/or CP has children (prior knowledge).