1 Param. Learning (MLE) Structure Learning The Good Graphical Models – 10708 Carlos Guestrin Carnegie Mellon University October 1 st, 2008 Readings: K&F:

Slides:

Advertisements

Similar presentations

Pattern Recognition and Machine Learning

Advertisements

1 Undirected Graphical Models Graphical Models – Carlos Guestrin Carnegie Mellon University October 29 th, 2008 Readings: K&F: 4.1, 4.2, 4.3, 4.4,

A Tutorial on Learning with Bayesian Networks

Parameter and Structure Learning Dhruv Batra, Recitation 10/02/2008.

1 Some Comments on Sebastiani et al Nature Genetics 37(4)2005.

Learning: Parameter Estimation

Midterm Review. The Midterm Everything we have talked about so far Stuff from HW I won’t ask you to do as complicated calculations as the HW Don’t need.

1 Graphical Models in Data Assimilation Problems Alexander Ihler UC Irvine Collaborators: Sergey Kirshner Andrew Robertson Padhraic Smyth.

Regulatory Network (Part II) 11/05/07. Methods Linear –PCA (Raychaudhuri et al. 2000) –NIR (Gardner et al. 2003) Nonlinear –Bayesian network (Friedman.

. Learning Bayesian networks Slides by Nir Friedman.

Learning with Bayesian Networks David Heckerman Presented by Colin Rickert.

Probabilistic Graphical Models Tool for representing complex systems and performing sophisticated reasoning tasks Fundamental notion: Modularity Complex.

Basics of Statistical Estimation. Learning Probabilities: Classical Approach Simplest case: Flipping a thumbtack tails heads True probability  is unknown.

. PGM: Tirgul 10 Parameter Learning and Priors. 2 Why learning? Knowledge acquisition bottleneck u Knowledge acquisition is an expensive process u Often.

5/25/2005EE562 EE562 ARTIFICIAL INTELLIGENCE FOR ENGINEERS Lecture 16, 6/1/2005 University of Washington, Department of Electrical Engineering Spring 2005.

Learning Bayesian Networks (From David Heckerman’s tutorial)

1 EM for BNs Graphical Models – Carlos Guestrin Carnegie Mellon University November 24 th, 2008 Readings: 18.1, 18.2, –  Carlos Guestrin.

Crash Course on Machine Learning

Recitation 1 Probability Review

Machine Learning CUNY Graduate Center Lecture 21: Graphical Models.

Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution.

Topic Models in Text Processing IR Group Meeting Presented by Qiaozhu Mei.

Learning Structure in Bayes Nets (Typically also learn CPTs here) Given the set of random variables (features), the space of all possible networks.

Bayesian Inference Ekaterina Lomakina TNU seminar: Bayesian inference 1 March 2013.

1 Bayesian Param. Learning Bayesian Structure Learning Graphical Models – Carlos Guestrin Carnegie Mellon University October 6 th, 2008 Readings:

CSE 446: Point Estimation Winter 2012 Dan Weld Slides adapted from Carlos Guestrin (& Luke Zettlemoyer)

1 Variable Elimination Graphical Models – Carlos Guestrin Carnegie Mellon University October 11 th, 2006 Readings: K&F: 8.1, 8.2, 8.3,

1 EM for BNs Kalman Filters Gaussian BNs Graphical Models – Carlos Guestrin Carnegie Mellon University November 17 th, 2006 Readings: K&F: 16.2 Jordan.

Ch 8. Graphical Models Pattern Recognition and Machine Learning, C. M. Bishop, Revised by M.-O. Heo Summarized by J.W. Nam Biointelligence Laboratory,

Readings: K&F: 11.3, 11.5 Yedidia et al. paper from the class website

Learning With Bayesian Networks Markus Kalisch ETH Zürich.

Slides for “Data Mining” by I. H. Witten and E. Frank.

Learning In Bayesian Networks. General Learning Problem Set of random variables X = {X 1, X 2, X 3, X 4, …} Training set D = { X (1), X (2), …, X (N)

CS498-EA Reasoning in AI Lecture #10 Instructor: Eyal Amir Fall Semester 2009 Some slides in this set were adopted from Eran Segal.

1 BN Semantics 1 Graphical Models – Carlos Guestrin Carnegie Mellon University September 15 th, 2008 Readings: K&F: 3.1, 3.2, –  Carlos.

CHAPTER 5 Probability Theory (continued) Introduction to Bayesian Networks.

1 Mean Field and Variational Methods finishing off Graphical Models – Carlos Guestrin Carnegie Mellon University November 5 th, 2008 Readings: K&F:

1 Parameter Learning 2 Structure Learning 1: The good Graphical Models – Carlos Guestrin Carnegie Mellon University September 27 th, 2006 Readings:

1 BN Semantics 2 – The revenge of d-separation Graphical Models – Carlos Guestrin Carnegie Mellon University September 20 th, 2006 Readings: K&F:

1 CMSC 671 Fall 2001 Class #20 – Thursday, November 8.

1 Learning P-maps Param. Learning Graphical Models – Carlos Guestrin Carnegie Mellon University September 24 th, 2008 Readings: K&F: 3.3, 3.4, 16.1,

Bayesian Optimization Algorithm, Decision Graphs, and Occam’s Razor Martin Pelikan, David E. Goldberg, and Kumara Sastry IlliGAL Report No May.

1 BN Semantics 2 – Representation Theorem The revenge of d-separation Graphical Models – Carlos Guestrin Carnegie Mellon University September 17.

1 Structure Learning (The Good), The Bad, The Ugly Inference Graphical Models – Carlos Guestrin Carnegie Mellon University October 13 th, 2008 Readings:

1 Variable Elimination Graphical Models – Carlos Guestrin Carnegie Mellon University October 15 th, 2008 Readings: K&F: 8.1, 8.2, 8.3,

1 BN Semantics 1 Graphical Models – Carlos Guestrin Carnegie Mellon University September 15 th, 2006 Readings: K&F: 3.1, 3.2, 3.3.

Bayesian Belief Network AI Contents t Introduction t Bayesian Network t KDD Data.

1 BN Semantics 3 – Now it’s personal! Parameter Learning 1 Graphical Models – Carlos Guestrin Carnegie Mellon University September 22 nd, 2006 Readings:

Oliver Schulte Machine Learning 726

CS 2750: Machine Learning Density Estimation

Oliver Schulte Machine Learning 726

Still More Uncertainty

Bayesian Networks: Motivation

CS498-EA Reasoning in AI Lecture #20

Readings: K&F: 15.1, 15.2, 15.3, 15.4, 15.5 K&F: 7 (overview of inference) K&F: 8.1, 8.2 (Variable Elimination) Structure Learning in BNs 3: (the good,

CSCI 5822 Probabilistic Models of Human and Machine Learning

Important Distinctions in Learning BNs

Parameter Learning 2 Structure Learning 1: The good

Readings: K&F: 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7 Markov networks, Factor graphs, and an unified view Start approximate inference If we are lucky… Graphical.

Topic Models in Text Processing

Unifying Variational and GBP Learning Parameters of MNs EM for BNs

BN Semantics 3 – Now it’s personal! Parameter Learning 1

Junction Trees 3 Undirected Graphical Models

Readings: K&F: 11.3, 11.5 Yedidia et al. paper from the class website

Variable Elimination Graphical Models – Carlos Guestrin

Generalized Belief Propagation

BN Semantics 2 – The revenge of d-separation

Learning Bayesian networks

Presentation transcript:

1 Param. Learning (MLE) Structure Learning The Good Graphical Models – Carlos Guestrin Carnegie Mellon University October 1 st, 2008 Readings: K&F: 16.1, 16.2, 17.1, 17.2, , –  Carlos Guestrin

2 Learning the CPTs x (1) … x (m) Data For each discrete variable X i

–  Carlos Guestrin Learning the CPTs x (1) … x (m) Data For each discrete variable X i WHY??????????

–  Carlos Guestrin Maximum likelihood estimation (MLE) of BN parameters – example Given structure, log likelihood of data: Flu Allergy Sinus Nose

–  Carlos Guestrin Maximum likelihood estimation (MLE) of BN parameters – General case Data: x (1),…,x (m) Restriction: x (j) [Pa Xi ] ! assignment to Pa Xi in x (j) Given structure, log likelihood of data:

–  Carlos Guestrin Taking derivatives of MLE of BN parameters – General case

–  Carlos Guestrin General MLE for a CPT Take a CPT: P(X|U) Log likelihood term for this CPT Parameter  X=x|U=u :

–  Carlos Guestrin Where are we with learning BNs? Given structure, estimate parameters  Maximum likelihood estimation  Later Bayesian learning What about learning structure?

–  Carlos Guestrin Learning the structure of a BN Constraint-based approach  BN encodes conditional independencies  Test conditional independencies in data  Find an I-map Score-based approach  Finding a structure and parameters is a density estimation task  Evaluate model as we evaluated parameters Maximum likelihood Bayesian etc. Data … Flu Allergy Sinus Headache Nose Possible structures Learn structure and parameters

–  Carlos Guestrin Remember: Obtaining a P-map? Given the independence assertions that are true for P  Obtain skeleton  Obtain immoralities From skeleton and immoralities, obtain every (and any) BN structure from the equivalence class Constraint-based approach:  Use Learn PDAG algorithm  Key question: Independence test

–  Carlos Guestrin Score-based approach Data … Flu Allergy Sinus Headache Nose Possible structures Score structure Learn parameters

–  Carlos Guestrin Information-theoretic interpretation of maximum likelihood Given structure, log likelihood of data: Flu Allergy Sinus Headache Nose

–  Carlos Guestrin Information-theoretic interpretation of maximum likelihood 2 Given structure, log likelihood of data: Flu Allergy Sinus Headache Nose

–  Carlos Guestrin Decomposable score Log data likelihood Decomposable score:  Decomposes over families in BN (node and its parents)  Will lead to significant computational efficiency!!!  Score(G : D) =  i FamScore(X i |Pa Xi : D)

Announcements Recitation tomorrow  Don’t miss it! HW2  Out today  Due in 2 weeks Projects!!!  Proposals due Oct. 8 th in class  Individually or groups of two  Details on course website  Project suggestions will be up soon!!! –  Carlos Guestrin

BN code release!!!! Pre-release of a C++ library for probabilistic inference and learning Features:  basic datastructures (random variables, processes, linear algebra)  distributions (Gaussian, multinomial,...)  basic graph structures (directed, undirected)  graphical models (Bayesian network, MRF, junction trees)  inference algorithms (variable elimination, loopy belief propagation, filtering) Limited amount of learning (IPF, Chow Liu, order-based search) Supported platforms:  Linux (tested on Ubuntu 8.04)  MacOS X (tested on 10.4/10.5)  limited Windows support Will be made available to the class early next week –  Carlos Guestrin

–  Carlos Guestrin How many trees are there? Nonetheless – Efficient optimal algorithm finds best tree

–  Carlos Guestrin Scoring a tree 1: I-equivalent trees

–  Carlos Guestrin Scoring a tree 2: similar trees

–  Carlos Guestrin Chow-Liu tree learning algorithm 1 For each pair of variables X i,X j  Compute empirical distribution:  Compute mutual information: Define a graph  Nodes X 1,…,X n  Edge (i,j) gets weight

–  Carlos Guestrin Chow-Liu tree learning algorithm 2 Optimal tree BN  Compute maximum weight spanning tree  Directions in BN: pick any node as root, breadth-first- search defines directions

–  Carlos Guestrin Can we extend Chow-Liu 1 Tree augmented naïve Bayes (TAN) [Friedman et al. ’97]  Naïve Bayes model overcounts, because correlation between features not considered  Same as Chow-Liu, but score edges with:

–  Carlos Guestrin Can we extend Chow-Liu 2 (Approximately learning) models with tree-width up to k  [Chechetka & Guestrin ’07]  But, O(n 2k+6 )

–  Carlos Guestrin What you need to know about learning BN structures so far Decomposable scores  Maximum likelihood  Information theoretic interpretation Best tree (Chow-Liu) Best TAN Nearly best k-treewidth (in O(N 2k+6 ))

–  Carlos Guestrin m Can we really trust MLE? What is better?  3 heads, 2 tails  30 heads, 20 tails  3x10 23 heads, 2x10 23 tails Many possible answers, we need distributions over possible parameters

–  Carlos Guestrin Bayesian Learning Use Bayes rule: Or equivalently:

–  Carlos Guestrin Bayesian Learning for Thumbtack Likelihood function is simply Binomial: What about prior?  Represent expert knowledge  Simple posterior form Conjugate priors:  Closed-form representation of posterior (more details soon)  For Binomial, conjugate prior is Beta distribution

–  Carlos Guestrin Beta prior distribution – P(  ) Likelihood function: Posterior:

–  Carlos Guestrin Posterior distribution Prior: Data: m H heads and m T tails Posterior distribution:

–  Carlos Guestrin Conjugate prior Given likelihood function P(D|  ) (Parametric) prior of the form P(  |  ) is conjugate to likelihood function if posterior is of the same parametric family, and can be written as:  P(  |  ’), for some new set of parameters  ’ Prior: Data: m H heads and m T tails (binomial likelihood) Posterior distribution:

–  Carlos Guestrin Using Bayesian posterior Posterior distribution: Bayesian inference:  No longer single parameter:  Integral is often hard to compute

–  Carlos Guestrin Bayesian prediction of a new coin flip Prior: Observed m H heads, m T tails, what is probability of m+1 flip is heads?

–  Carlos Guestrin Asymptotic behavior and equivalent sample size Beta prior equivalent to extra thumbtack flips:  As m → 1, prior is “forgotten” But, for small sample size, prior is important! Equivalent sample size:  Prior parameterized by  H,  T, or  m’ (equivalent sample size) and   Fix m’, change  Fix , change m’

–  Carlos Guestrin Bayesian learning corresponds to smoothing m=0 ) prior parameter m!1 ) MLE m

–  Carlos Guestrin Bayesian learning for multinomial What if you have a k sided coin??? Likelihood function if multinomial:  Conjugate prior for multinomial is Dirichlet:  Observe m data points, m i from assignment i, posterior: Prediction:

–  Carlos Guestrin Bayesian learning for two-node BN Parameters  X,  Y|X Priors:  P(  X ):  P(  Y|X ):

–  Carlos Guestrin Very important assumption on prior: Global parameter independence Global parameter independence:  Prior over parameters is product of prior over CPTs

–  Carlos Guestrin Global parameter independence, d-separation and local prediction Flu Allergy Sinus Headache Nose Independencies in meta BN: Proposition: For fully observable data D, if prior satisfies global parameter independence, then

–  Carlos Guestrin Within a CPT Meta BN including CPT parameters: Are  Y|X=t and  Y|X=f d-separated given D? Are  Y|X=t and  Y|X=f independent given D?  Context-specific independence!!! Posterior decomposes:

–  Carlos Guestrin Priors for BN CPTs (more when we talk about structure learning) Consider each CPT: P(X|U=u) Conjugate prior:  Dirichlet(  X=1|U=u,…,  X=k|U=u ) More intuitive:  “prior data set” D’ with m’ equivalent sample size  “prior counts”:  prediction:

–  Carlos Guestrin An example

–  Carlos Guestrin What you need to know about parameter learning MLE:  score decomposes according to CPTs  optimize each CPT separately Bayesian parameter learning:  motivation for Bayesian approach  Bayesian prediction  conjugate priors, equivalent sample size  Bayesian learning ) smoothing Bayesian learning for BN parameters  Global parameter independence  Decomposition of prediction according to CPTs  Decomposition within a CPT