Learning Bayesian Networks (From David Heckerman’s tutorial)

Presentation transcript:

Learning Bayesian Networks (From David Heckerman’s tutorial)

Learning Bayes Nets From Data: a table of observed cases over the variables X1, ..., X9, together with prior/expert information, goes into a Bayes-net learner, which outputs Bayes net(s).

Overview
- Introduction to Bayesian statistics: learning a probability
- Learning probabilities in a Bayes net
- Learning Bayes-net structure

Learning Probabilities: Classical Approach
Simple case: flipping a thumbtack, which lands heads or tails. The true probability θ is unknown. Given i.i.d. data, estimate θ using an estimator with good properties: low bias, low variance, consistent (e.g., the ML estimate, #heads / N).
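A minimal sketch of the classical estimate (not from the slides; the data are hypothetical): the ML estimate of θ is simply the observed fraction of heads.

flips = ["h", "t", "h", "h", "t", "h"]            # hypothetical observed i.i.d. flips

n_heads = sum(1 for f in flips if f == "h")
n_tails = len(flips) - n_heads

theta_ml = n_heads / len(flips)                   # maximum-likelihood estimate of theta
print(f"#h={n_heads}, #t={n_tails}, ML estimate = {theta_ml:.3f}")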

Learning Probabilities: Bayesian Approach
Again the true probability θ is unknown, but now we maintain a Bayesian probability density p(θ) over θ ∈ [0, 1].

Bayesian Approach: use Bayes' rule to compute a new density for θ given data d:
p(θ | d) = p(θ) p(d | θ) / p(d), i.e. posterior ∝ prior × likelihood.
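As a minimal numerical sketch of this update (not from the slides; the uniform prior and the counts are hypothetical), the posterior can be approximated on a grid over θ:

import numpy as np

theta = np.linspace(0.0, 1.0, 1001)               # grid over the unknown probability
prior = np.ones_like(theta)                       # uniform prior p(theta)

n_heads, n_tails = 4, 2                           # hypothetical data d
likelihood = theta**n_heads * (1 - theta)**n_tails    # p(d | theta)

unnorm = prior * likelihood
posterior = unnorm / np.trapz(unnorm, theta)      # divide by p(d) so it integrates to 1

print("posterior mean of theta:", np.trapz(theta * posterior, theta))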

The Likelihood ("binomial distribution"): for data d with #h heads and #t tails, p(d | θ) = θ^#h (1 − θ)^#t.

Example: application of Bayes' rule to the observation of a single "heads". Prior: p(θ). Likelihood: p(heads | θ) = θ. Posterior: p(θ | heads) ∝ θ · p(θ). (The slide plots prior, likelihood, and posterior over θ ∈ [0, 1].)

A Bayes net for learning probabilities: the parameter θ is a parent of each toss X1, X2, ..., XN (toss 1, toss 2, ..., toss N).

Sufficient statistics: the counts (#h, #t) are sufficient statistics for the data.

The probability of heads on the next toss: p(heads on toss N+1 | d) = ∫ θ p(θ | d) dθ = E[θ | d], the posterior mean of θ.

Prior Distributions for θ
- Direct assessment
- Parametric distributions
  - Conjugate distributions (for convenience)
  - Mixtures of conjugate distributions

Conjugate Family of Distributions
Beta distribution: p(θ) = Beta(θ | α_h, α_t) ∝ θ^(α_h − 1) (1 − θ)^(α_t − 1).
Properties: after observing #h heads and #t tails, the posterior is Beta(θ | α_h + #h, α_t + #t), and the probability of heads on the next toss is (α_h + #h) / (α_h + α_t + #h + #t).
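A minimal sketch of the conjugate update using scipy (the library choice and the numbers are my own, not the slides'): the posterior is again a Beta, with the observed counts added to the hyperparameters.

from scipy import stats

alpha_h, alpha_t = 2.0, 2.0                       # hypothetical prior hyperparameters
n_heads, n_tails = 4, 2                           # hypothetical data

prior = stats.beta(alpha_h, alpha_t)
posterior = stats.beta(alpha_h + n_heads, alpha_t + n_tails)

print("prior mean:    ", prior.mean())            # alpha_h / (alpha_h + alpha_t)
print("posterior mean:", posterior.mean())        # (alpha_h + #h) / (alpha_h + alpha_t + N)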

Intuition
- The hyperparameters α_h and α_t can be thought of as imaginary counts from our prior experience, starting from "pure ignorance"
- Equivalent sample size = α_h + α_t
- The larger the equivalent sample size, the more confident we are about the true probability

Beta Distributions: the slide plots Beta(3, 2), Beta(1, 1), Beta(19, 39), and Beta(0.5, 0.5).

Assessment of a Beta Distribution
Method 1: Equivalent sample
- assess α_h and α_t, or
- assess α_h + α_t and α_h / (α_h + α_t)
Method 2: Imagined future samples

Generalization to m discrete outcomes ("multinomial distribution")
Dirichlet distribution: p(θ_1, ..., θ_m) = Dirichlet(θ | α_1, ..., α_m) ∝ ∏_k θ_k^(α_k − 1).
Properties: after observing counts N_1, ..., N_m, the posterior is Dirichlet(θ | α_1 + N_1, ..., α_m + N_m), and the probability that the next observation is outcome k is (α_k + N_k) / Σ_j (α_j + N_j).
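A minimal sketch of the Dirichlet case (the hyperparameters and counts are hypothetical): the posterior adds the observed counts to the prior counts, and normalizing the result gives the predictive distribution for the next observation.

import numpy as np

alpha = np.array([1.0, 1.0, 1.0])                 # hypothetical Dirichlet prior, m = 3 outcomes
counts = np.array([5, 2, 3])                      # hypothetical observed counts N_k

alpha_post = alpha + counts                       # posterior is Dirichlet(alpha + counts)
predictive = alpha_post / alpha_post.sum()        # p(next observation = k | data)

print("posterior hyperparameters:", alpha_post)
print("predictive distribution:  ", predictive)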

More generalizations (see, e.g., Bernardo + Smith, 1994)
Likelihoods from the exponential family:
- Binomial
- Multinomial
- Poisson
- Gamma
- Normal

Overview
- Intro to Bayesian statistics: learning a probability
- Learning probabilities in a Bayes net
- Learning Bayes-net structure

From thumbtacks to Bayes nets
The thumbtack problem can be viewed as learning the probability for a very simple BN with a single node X (heads/tails): θ is a parent of each toss X1, X2, ..., XN (toss 1, toss 2, ..., toss N).

The next simplest Bayes net: two variables X (heads/tails) and Y (heads/tails), i.e. two different thumbtacks, with no edge between them.

The next simplest Bayes net (continued): in the learning network, θ_X is a parent of X1, ..., XN and θ_Y is a parent of Y1, ..., YN, with one (X_i, Y_i) pair per case 1, ..., N. Is there an arc between θ_X and θ_Y?

The next simplest Bayes net (continued): assume there is no arc between θ_X and θ_Y, i.e. "parameter independence".

The next simplest Bayes net (continued): "parameter independence" ⇒ two separate thumbtack-like learning problems, as in the sketch below.
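A minimal sketch of the two independent problems (hypothetical data, Beta(1, 1) priors): with complete data and parameter independence, θ_X is updated from the X column alone and θ_Y from the Y column alone.

cases = [("h", "t"), ("t", "t"), ("h", "h"), ("h", "t")]   # hypothetical (X, Y) observations

alpha_h, alpha_t = 1.0, 1.0                       # Beta(1, 1) prior for each parameter

def beta_posterior_mean(values):
    # posterior mean of a heads-probability given a list of "h"/"t" outcomes
    n_h = sum(1 for v in values if v == "h")
    n_t = len(values) - n_h
    return (alpha_h + n_h) / (alpha_h + alpha_t + n_h + n_t)

theta_x = beta_posterior_mean([x for x, _ in cases])
theta_y = beta_posterior_mean([y for _, y in cases])
print("E[theta_X | d] =", theta_x, "  E[theta_Y | d] =", theta_y)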

A bit more difficult...
X → Y, both heads/tails. Three probabilities to learn:
- θ_X=heads
- θ_Y=heads|X=heads
- θ_Y=heads|X=tails

A bit more difficult... (continued): in the learning network, θ_X is a parent of every X_i, θ_Y|X=heads is a parent of the Y_i in cases where X_i = heads (e.g., case 1), and θ_Y|X=tails is a parent of the Y_i in cases where X_i = tails (e.g., case 2). Are there arcs among θ_X, θ_Y|X=heads, and θ_Y|X=tails? Assuming parameter independence (no arcs), learning reduces to 3 separate thumbtack-like problems, as in the sketch below.
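A minimal sketch of the three problems (hypothetical data, Beta(1, 1) priors): θ_X is learned from all X values, while each conditional parameter of Y is learned only from the cases with the matching X value.

cases = [("h", "t"), ("t", "t"), ("h", "h"), ("h", "t"), ("t", "h")]   # hypothetical (X, Y) pairs

alpha_h, alpha_t = 1.0, 1.0                       # Beta(1, 1) prior for each parameter

def posterior_mean(pairs, x_value=None):
    # X values themselves when x_value is None, else Y values from cases with X == x_value
    values = [x for x, _ in pairs] if x_value is None else [y for x, y in pairs if x == x_value]
    n_h = sum(1 for v in values if v == "h")
    n_t = len(values) - n_h
    return (alpha_h + n_h) / (alpha_h + alpha_t + n_h + n_t)

print("E[theta_X]            =", posterior_mean(cases))
print("E[theta_Y | X=heads]  =", posterior_mean(cases, "h"))
print("E[theta_Y | X=tails]  =", posterior_mean(cases, "t"))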

In general…
Learning probabilities in a BN is straightforward if:
- Local distributions are from the exponential family (binomial, Poisson, gamma, ...)
- Parameter independence
- Conjugate priors
- Complete data

Incomplete data makes parameters dependent: if some X_i or Y_i are unobserved in the network above, the posterior over θ_X, θ_Y|X=heads, and θ_Y|X=tails no longer factorizes into independent pieces.

Overview
- Intro to Bayesian statistics: learning a probability
- Learning probabilities in a Bayes net
- Learning Bayes-net structure

Learning Bayes-net structure
Given data, which model is correct? Model 1: X and Y with no edge. Model 2: X → Y.

Bayesian approach: given data d, ask not which model is correct but which is more likely, i.e. compare the posterior probabilities of model 1 (X, Y with no edge) and model 2 (X → Y).

Bayesian approach: Model Averaging. Given data d, weight model 1 (X, Y with no edge) and model 2 (X → Y) by their posterior probabilities and average their predictions.

Bayesian approach: Model Selection. Given data d, keep the best (most likely) model, for the sake of explanation, understanding, and tractability.

To score a model, use Bayes' rule. Given data d, the model score is p(m | d) ∝ p(m) p(d | m), where p(d | m) = ∫ p(d | θ, m) p(θ | m) dθ is the "marginal likelihood".

Thumbtack example: a single variable X (heads/tails) with conjugate prior Beta(α_h, α_t). The marginal likelihood has the closed form
p(d) = [Γ(α) / Γ(α + N)] · [Γ(α_h + #h) / Γ(α_h)] · [Γ(α_t + #t) / Γ(α_t)],
where α = α_h + α_t and N = #h + #t.
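A minimal sketch of this closed form (scipy's log-Gamma is my choice; the counts are hypothetical), working in log space for numerical stability.

from scipy.special import gammaln                 # log Gamma function

def log_marginal_likelihood(n_h, n_t, alpha_h=1.0, alpha_t=1.0):
    alpha, n = alpha_h + alpha_t, n_h + n_t
    return (gammaln(alpha) - gammaln(alpha + n)
            + gammaln(alpha_h + n_h) - gammaln(alpha_h)
            + gammaln(alpha_t + n_t) - gammaln(alpha_t))

print("log p(d) for 4 heads, 2 tails:", log_marginal_likelihood(4, 2))   # log(1/105)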

More complicated graphs: for X → Y (both heads/tails), the marginal likelihood factors into 3 separate thumbtack-like learning problems, one each for θ_X, θ_Y|X=heads, and θ_Y|X=tails.

Model score for a discrete BN: with complete data, Dirichlet priors, and parameter independence, the marginal likelihood has the closed form
p(d | m) = ∏_i ∏_j [Γ(α_ij) / Γ(α_ij + N_ij)] · ∏_k [Γ(α_ijk + N_ijk) / Γ(α_ijk)],
where i ranges over variables, j over configurations of X_i's parents, and k over states of X_i; N_ijk are the observed counts, α_ijk the Dirichlet hyperparameters, N_ij = Σ_k N_ijk, and α_ij = Σ_k α_ijk.
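A minimal sketch of this score (the nested-list layout for counts and hyperparameters is my own convention, and the example numbers are hypothetical):

import numpy as np
from scipy.special import gammaln

def log_bd_score(counts, alphas):
    # counts[i][j][k] and alphas[i][j][k]: variable i, parent configuration j, state k
    score = 0.0
    for n_i, a_i in zip(counts, alphas):
        for n_ij, a_ij in zip(n_i, a_i):
            n_ij = np.asarray(n_ij, dtype=float)
            a_ij = np.asarray(a_ij, dtype=float)
            score += gammaln(a_ij.sum()) - gammaln(a_ij.sum() + n_ij.sum())
            score += np.sum(gammaln(a_ij + n_ij) - gammaln(a_ij))
    return score

# Example: X -> Y, both binary, uniform Dirichlet(1, 1) priors.
counts = [[[3, 2]],                               # X: one (empty) parent configuration
          [[1, 2], [1, 1]]]                       # Y: one configuration per value of X
alphas = [[[1, 1]], [[1, 1], [1, 1]]]
print("log p(d | m) =", log_bd_score(counts, alphas))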

Computation of Marginal Likelihood
Efficient closed form if:
- Local distributions are from the exponential family (binomial, Poisson, gamma, ...)
- Parameter independence
- Conjugate priors
- No missing data (including no hidden variables)

Practical considerations
- The number of possible BN structures for n variables is super-exponential in n (see the count below)
- How do we find the best graph(s)?
- How do we assign structure and parameter priors to all possible graphs?
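To illustrate how fast the structure count grows, here is Robinson's recurrence for the number of labelled DAGs on n nodes, a(n) = Σ_{k=1..n} (−1)^(k+1) C(n, k) 2^(k(n−k)) a(n−k) (a standard result, not from the slides):

from math import comb

def num_dags(n, cache={0: 1}):
    # memoized Robinson recurrence for the number of labelled DAGs on n nodes
    if n not in cache:
        cache[n] = sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
                       for k in range(1, n + 1))
    return cache[n]

for n in range(1, 8):
    print(n, num_dags(n))    # 1, 3, 25, 543, 29281, 3781503, 1138779265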

Model search
- Finding the BN structure with the highest score among those structures with at most k parents is NP-hard for k > 1 (Chickering, 1995)
- Heuristic methods (a sketch of the greedy loop follows):
  - Greedy: initialize a structure, score all possible single changes, and if any change improves the score, perform the best change and repeat; otherwise return the saved structure
  - Greedy with restarts
  - MCMC methods
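A minimal sketch of the greedy loop described above; score and neighbors (which should generate single edge additions, deletions, and reversals that keep the graph acyclic) are hypothetical helpers, e.g. the BD score from the earlier sketch.

def greedy_search(initial_graph, data, score, neighbors):
    current = initial_graph
    current_score = score(current, data)
    while True:
        # score all possible single changes
        candidates = [(score(g, data), g) for g in neighbors(current)]
        if not candidates:
            break
        best_score, best_graph = max(candidates, key=lambda t: t[0])
        if best_score <= current_score:
            break                                  # no change is better: return saved structure
        current, current_score = best_graph, best_score   # perform best change
    return current, current_score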

Structure priors
1. All possible structures equally likely
2. Partial ordering, required / prohibited arcs
3. p(m) ∝ similarity(m, prior BN)

Parameter priors
- All uniform: Beta(1, 1)
- Use a prior BN

Parameter priors
Recall the intuition behind the Beta prior for the thumbtack:
- The hyperparameters α_h and α_t can be thought of as imaginary counts from our prior experience, starting from "pure ignorance"
- Equivalent sample size = α_h + α_t
- The larger the equivalent sample size, the more confident we are about the long-run fraction

Parameter priors: a prior BN over X1, ..., Xn plus an equivalent sample size give an imaginary count for any variable configuration, and hence parameter priors for any BN structure over X1, ..., Xn ("parameter modularity").
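A minimal sketch of one way to realize that recipe (the helper prior_joint, returning the probability of a variable/parent configuration under the prior network, is hypothetical): the imaginary count for a configuration is the equivalent sample size times its probability under the prior network, and reusing the same counts for every structure is what gives parameter modularity.

def dirichlet_hyperparameter(ess, prior_joint, configuration):
    # alpha for one (variable value, parent configuration) pair:
    # equivalent sample size * p(configuration | prior network)
    return ess * prior_joint(configuration)

# Hypothetical usage, with equivalent sample size 10:
# alpha = dirichlet_hyperparameter(10.0, prior_joint, {"X2": True, "X1": False})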

Combine user knowledge and data: a prior network over X1, ..., X9 with an equivalent sample size, together with data (a table of cases over X1, X2, X3, ...), yields improved network(s).