Slide 1: Param. Learning (MLE), Structure Learning: The Good
Graphical Models – 10-708, Carlos Guestrin, Carnegie Mellon University
October 1st, 2008
Readings: K&F: 16.1, 16.2, 17.1, 17.2, 17.3.1, 17.4.1
Slides 2–3: Learning the CPTs
Data: x(1), …, x(m)
For each discrete variable X_i, estimate its CPT P(X_i | Pa_Xi) from the data by counting.
Why do the empirical counts give the maximum likelihood estimate? (A sketch of the counting estimate follows.)
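The slide's formula was an image that did not survive extraction; below is a minimal sketch of the counting estimate it refers to. The function name `mle_cpt` and the data layout (a list of dicts mapping variable name to value) are hypothetical, not from the lecture.

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """MLE of P(child | parents) from fully observed discrete data.

    Returns {(u, x): theta_{x|u}} where theta_{x|u} = Count(x, u) / Count(u).
    """
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent = Counter(tuple(row[p] for p in parents) for row in data)
    return {(u, x): c / parent[u] for (u, x), c in joint.items()}

# Toy example with the lecture's Flu/Sinus variables (the data values are made up):
data = [{"Flu": 1, "Sinus": 1}, {"Flu": 1, "Sinus": 0}, {"Flu": 0, "Sinus": 0}]
print(mle_cpt(data, child="Sinus", parents=["Flu"]))
# {((1,), 1): 0.5, ((1,), 0): 0.5, ((0,), 0): 1.0}
```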
Slide 4: Maximum likelihood estimation (MLE) of BN parameters – example
Given the structure, the log-likelihood of the data (equation on slide).
(Figure: example network over Flu, Allergy, Sinus, Nose.)
Slide 5: Maximum likelihood estimation (MLE) of BN parameters – general case
Data: x(1), …, x(m)
Notation: x(j)[Pa_Xi] denotes the restriction of x(j) to Pa_Xi, i.e. the assignment to X_i's parents in sample x(j).
Given the structure, the log-likelihood of the data is reconstructed below.
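The slide's equation was not extracted; in the notation above, the standard decomposition it refers to is:

```latex
\log P(\mathcal{D} \mid \theta_G, G)
  = \sum_{j=1}^{m} \sum_{i=1}^{n} \log P\bigl(x_i^{(j)} \mid x^{(j)}[\mathrm{Pa}_{X_i}],\ \theta\bigr)
  = \sum_{i=1}^{n} \sum_{j=1}^{m} \log \theta_{x_i^{(j)} \mid x^{(j)}[\mathrm{Pa}_{X_i}]}
```

so the log-likelihood splits into one term per CPT, and each term can be maximized independently.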
Slide 6: Taking derivatives of the MLE of BN parameters – general case
(Derivation on the slide; the resulting closed form is summarized on slide 7.)
Slide 7: General MLE for a CPT
Take a CPT P(X | U). The log-likelihood term for this CPT is a function of the parameters θ_{X=x | U=u}; its maximizer is reconstructed below.
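The slide's derivation did not survive extraction; the standard closed form it arrives at (maximize the CPT's log-likelihood term subject to each conditional distribution summing to one) is:

```latex
\max_{\theta}\ \sum_{u}\sum_{x} \mathrm{Count}(x, u)\,\log \theta_{x \mid u}
\quad \text{s.t.}\ \sum_{x} \theta_{x \mid u} = 1 \ \ \forall u
\qquad\Longrightarrow\qquad
\hat{\theta}_{x \mid u} = \frac{\mathrm{Count}(x, u)}{\mathrm{Count}(u)}
```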
Slide 8: Where are we with learning BNs?
Given the structure, estimate the parameters:
- maximum likelihood estimation
- Bayesian learning (later)
What about learning the structure?
Slide 9: Learning the structure of a BN
Constraint-based approach:
- a BN encodes conditional independencies
- test conditional independencies in the data
- find an I-map
Score-based approach:
- finding a structure and parameters is a density estimation task
- evaluate a model the same way we evaluated parameters: maximum likelihood, Bayesian, etc.
(Figure: from data over Flu, Allergy, Sinus, Headache, Nose, learn structure and parameters among the possible structures.)
Slide 10: Remember: obtaining a P-map?
Given the independence assertions that are true for P:
- obtain the skeleton
- obtain the immoralities
- from the skeleton and immoralities, obtain every (and any) BN structure from the equivalence class
Constraint-based approach: use the Learn-PDAG algorithm.
Key question: the independence test.
Slide 11: Score-based approach
(Figure: from data over Flu, Allergy, Sinus, Headache, Nose, score each possible structure and learn its parameters.)
Slide 12: Information-theoretic interpretation of maximum likelihood
Given the structure, the log-likelihood of the data (equation on slide; example network over Flu, Allergy, Sinus, Headache, Nose).
Slide 13: Information-theoretic interpretation of maximum likelihood, part 2
Continuing the derivation for the same example network; the end result is reconstructed below.
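The equations on slides 12–13 were not extracted; the standard result they build up to is that the per-sample maximum log-likelihood decomposes into empirical mutual-information and entropy terms:

```latex
\frac{1}{m}\,\log P(\mathcal{D} \mid \hat{\theta}, G)
  = \sum_{i} \hat{I}\bigl(X_i;\ \mathrm{Pa}^{G}_{X_i}\bigr)
  \;-\; \sum_{i} \hat{H}(X_i)
```

Only the mutual-information term depends on the graph, which makes the score decomposable and motivates the Chow-Liu algorithm below.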
Slide 14: Decomposable score
The log data likelihood is a decomposable score: it decomposes over the families of the BN (each node together with its parents), which will lead to significant computational efficiency.
Score(G : D) = Σ_i FamScore(X_i | Pa_Xi : D)
Slide 15: Announcements
- Recitation tomorrow; don't miss it!
- HW2 out today, due in two weeks.
- Projects: proposals due Oct. 8th in class, individually or in groups of two; details on the course website; project suggestions will be up soon.
Slide 16: BN code release
Pre-release of a C++ library for probabilistic inference and learning. Features:
- basic data structures (random variables, processes, linear algebra)
- distributions (Gaussian, multinomial, ...)
- basic graph structures (directed, undirected)
- graphical models (Bayesian networks, MRFs, junction trees)
- inference algorithms (variable elimination, loopy belief propagation, filtering)
- a limited amount of learning (IPF, Chow-Liu, order-based search)
Supported platforms: Linux (tested on Ubuntu 8.04), Mac OS X (tested on 10.4/10.5), limited Windows support.
Will be made available to the class early next week.
Slide 17: How many trees are there?
Exponentially many (n^(n-2) distinct undirected trees over n labeled nodes). Nonetheless, an efficient algorithm finds the optimal tree.
Slide 18: Scoring a tree, part 1: I-equivalent trees
(Worked example on the slide: I-equivalent trees receive the same likelihood score.)
Slide 19: Scoring a tree, part 2: similar trees
(Worked example on the slide.)
Slide 20: Chow-Liu tree learning algorithm, part 1
For each pair of variables X_i, X_j:
- compute the empirical pairwise distribution
- compute the empirical mutual information I(X_i; X_j)
Define a graph with nodes X_1, …, X_n, where edge (i, j) gets weight I(X_i; X_j) (see the sketch below).
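The slide's formulas were not extracted; this is a minimal sketch of the edge-weight computation, assuming discrete data given as a list of rows indexable by column. The function name `empirical_mutual_information` is mine, not from the lecture.

```python
import math
from collections import Counter

def empirical_mutual_information(data, i, j):
    """Empirical mutual information I_hat(X_i; X_j) from rows of discrete samples."""
    m = len(data)
    joint = Counter((row[i], row[j]) for row in data)
    marg_i = Counter(row[i] for row in data)
    marg_j = Counter(row[j] for row in data)
    mi = 0.0
    for (xi, xj), count in joint.items():
        p_xy = count / m
        mi += p_xy * math.log(p_xy / ((marg_i[xi] / m) * (marg_j[xj] / m)))
    return mi

# Chow-Liu edge weights: weight[(i, j)] = empirical_mutual_information(data, i, j)
# for every unordered pair i < j.
```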
Slide 21: Chow-Liu tree learning algorithm, part 2
Optimal tree BN:
- compute the maximum-weight spanning tree
- directions in the BN: pick any node as the root; a breadth-first search from the root defines the edge directions.
A sketch of both steps follows.
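A minimal sketch of the two steps, assuming the networkx library (any maximum-spanning-tree routine would do) and the mutual-information edge weights computed above; the function name `chow_liu_tree` is hypothetical.

```python
import networkx as nx

def chow_liu_tree(mi_weights, nodes, root):
    """mi_weights: {(i, j): I_hat(X_i; X_j)} over unordered pairs. Returns a directed tree."""
    g = nx.Graph()
    g.add_nodes_from(nodes)
    g.add_weighted_edges_from((i, j, w) for (i, j), w in mi_weights.items())
    undirected_tree = nx.maximum_spanning_tree(g)        # max-weight spanning tree
    directed = nx.DiGraph()
    directed.add_edges_from(nx.bfs_edges(undirected_tree, root))  # orient edges away from root
    return directed
```

Any choice of root gives an I-equivalent tree, which is why "pick any node as root" suffices.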
Slide 22: Can we extend Chow-Liu? (part 1)
Tree-augmented naive Bayes (TAN) [Friedman et al. '97]: the naive Bayes model over-counts evidence because correlations between features are not considered. TAN is the same as Chow-Liu, but scores edges with the conditional mutual information given the class (reconstructed below).
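The slide's scoring formula was an image; the standard TAN edge weight, the empirical mutual information between features conditioned on the class variable C, is:

```latex
\hat{I}(X_i;\ X_j \mid C)
  = \sum_{x_i, x_j, c} \hat{P}(x_i, x_j, c)\,
    \log \frac{\hat{P}(x_i, x_j \mid c)}{\hat{P}(x_i \mid c)\,\hat{P}(x_j \mid c)}
```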
Slide 23: Can we extend Chow-Liu? (part 2)
(Approximately) learning models with tree-width up to k [Chechetka & Guestrin '07], but at a cost of O(n^(2k+6)).
Slide 24: What you need to know about learning BN structures so far
- Decomposable scores: maximum likelihood; information-theoretic interpretation
- Best tree (Chow-Liu)
- Best TAN
- Nearly-best k-treewidth models (in O(n^(2k+6)))
Slide 25: Can we really trust the MLE?
Which estimate is better:
- 3 heads, 2 tails
- 30 heads, 20 tails
- 3×10^23 heads, 2×10^23 tails?
The MLE is 3/5 in every case. Many possible answers; we need distributions over possible parameters.
Slide 26: Bayesian learning
Use Bayes' rule: P(θ | D) = P(D | θ) P(θ) / P(D), or equivalently P(θ | D) ∝ P(D | θ) P(θ).
Slide 27: Bayesian learning for the thumbtack
The likelihood function is simply binomial: P(D | θ) = θ^(m_H) (1 − θ)^(m_T).
What about the prior?
- It represents expert knowledge.
- It should give a simple posterior form.
Conjugate priors give a closed-form representation of the posterior (more details soon); for the binomial, the conjugate prior is the Beta distribution.
Slide 28: Beta prior distribution – P(θ)
Prior: P(θ) = Beta(α_H, α_T) ∝ θ^(α_H − 1) (1 − θ)^(α_T − 1).
Likelihood function: binomial in (m_H, m_T). Posterior: derived on the next slide.
Slide 29: Posterior distribution
Prior: Beta(α_H, α_T). Data: m_H heads and m_T tails.
Posterior distribution: reconstructed below.
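A reconstruction of the algebra shown on the slide (standard Beta-binomial conjugacy):

```latex
P(\theta \mid \mathcal{D})
  \;\propto\; P(\mathcal{D} \mid \theta)\, P(\theta)
  \;\propto\; \theta^{m_H} (1-\theta)^{m_T} \cdot \theta^{\alpha_H - 1} (1-\theta)^{\alpha_T - 1}
  \;=\; \theta^{\alpha_H + m_H - 1} (1-\theta)^{\alpha_T + m_T - 1}
```

so P(θ | D) = Beta(α_H + m_H, α_T + m_T).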
Slide 30: Conjugate prior
Given a likelihood function P(D | θ), a (parametric) prior of the form P(θ | α) is conjugate to the likelihood if the posterior is of the same parametric family and can be written as P(θ | α′) for some new set of parameters α′.
Example: prior Beta(α_H, α_T); data m_H heads and m_T tails (binomial likelihood); posterior Beta(α_H + m_H, α_T + m_T).
Slide 31: Using the Bayesian posterior
Posterior distribution: P(θ | D).
Bayesian inference: there is no longer a single parameter; predictions average over θ, e.g. P(x | D) = ∫ P(x | θ) P(θ | D) dθ.
The integral is often hard to compute.
Slide 32: Bayesian prediction of a new coin flip
Prior: Beta(α_H, α_T). Having observed m_H heads and m_T tails, what is the probability that the (m+1)-st flip is heads?
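The slide's closed form was not extracted; the standard result for a Beta posterior is:

```latex
P(x_{m+1} = H \mid \mathcal{D})
  = \int_0^1 \theta \, P(\theta \mid \mathcal{D}) \, d\theta
  = \frac{\alpha_H + m_H}{\alpha_H + \alpha_T + m_H + m_T}
```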
Slide 33: Asymptotic behavior and equivalent sample size
A Beta prior is equivalent to having seen extra thumbtack flips: as m → ∞, the prior is "forgotten"; but for small sample sizes, the prior is important!
Equivalent sample size: the prior can be parameterized by (α_H, α_T), or by m′ = α_H + α_T (the equivalent sample size) together with the prior mean. One can fix m′ and vary the prior mean, or fix the prior mean and vary m′.
Slide 34: Bayesian learning corresponds to smoothing
With m = 0 data points the estimate is the prior parameter; as m → ∞ it converges to the MLE.
(Plot on slide: estimate as a function of m.)
Slide 35: Bayesian learning for a multinomial
What if you have a k-sided coin? The likelihood function is multinomial, and the conjugate prior for the multinomial is the Dirichlet. Observe m data points, m_i of them from assignment i; the posterior and the prediction rule are reconstructed below.
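The slide's formulas did not survive extraction; the standard Dirichlet-multinomial forms are:

```latex
P(\mathcal{D} \mid \theta) \propto \prod_{i=1}^{k} \theta_i^{m_i}, \qquad
P(\theta) = \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_k), \qquad
P(\theta \mid \mathcal{D}) = \mathrm{Dirichlet}(\alpha_1 + m_1, \ldots, \alpha_k + m_k),
```

```latex
P(x_{m+1} = i \mid \mathcal{D}) = \frac{\alpha_i + m_i}{\sum_{j=1}^{k} (\alpha_j + m_j)}
```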
Slide 36: Bayesian learning for a two-node BN
Parameters: θ_X and θ_{Y|X}. Priors: P(θ_X) and P(θ_{Y|X}) (specific forms on the slide).
Slide 37: A very important assumption on the prior: global parameter independence
Global parameter independence: the prior over the parameters is a product of priors over the individual CPTs, P(θ_G | G) = Π_i P(θ_{X_i | Pa_Xi} | G).
Slide 38: Global parameter independence, d-separation and local prediction
(Figure: meta-BN over Flu, Allergy, Sinus, Headache, Nose with a parameter node for each CPT.)
Independencies in the meta-BN follow from d-separation.
Proposition: for fully observable data D, if the prior satisfies global parameter independence, then the posterior factorizes the same way: P(θ_G | D) = Π_i P(θ_{X_i | Pa_Xi} | D).
Slide 39: Within a CPT
Meta-BN including the CPT parameters:
- Are θ_{Y|X=t} and θ_{Y|X=f} d-separated given D? No.
- Are θ_{Y|X=t} and θ_{Y|X=f} independent given D? Yes: context-specific independence!
So the posterior decomposes over the parent assignments: P(θ_{Y|X} | D) = P(θ_{Y|X=t} | D) P(θ_{Y|X=f} | D).
Slide 40: Priors for BN CPTs (more when we talk about structure learning)
Consider each CPT P(X | U=u). Conjugate prior: Dirichlet(α_{X=1|U=u}, …, α_{X=k|U=u}).
A more intuitive view: a "prior data set" D′ with equivalent sample size m′, whose "prior counts" are simply added to the data counts for prediction (see the sketch below).
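A minimal sketch of the smoothed (Dirichlet posterior-mean) CPT estimate this slide describes; the function name `bayesian_cpt` and the data layout (list of dicts) are hypothetical.

```python
from collections import Counter

def bayesian_cpt(data, child, parents, prior_counts):
    """Posterior-mean estimate of P(child | parents) with Dirichlet 'prior counts'.

    prior_counts: {(u, x): alpha_{x|u}}, playing the role of the prior data set D'.
    Prediction: P(x | u, D) = (alpha_{x|u} + Count(x, u)) / (sum_x' alpha_{x'|u} + Count(u)).
    """
    counts = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    total = Counter(prior_counts)
    total.update(counts)            # add empirical counts to the prior counts
    norm = Counter()
    for (u, _x), c in total.items():
        norm[u] += c
    return {(u, x): c / norm[u] for (u, x), c in total.items()}
```

With all prior counts set to zero this reduces to the MLE; larger prior counts smooth the estimate toward the prior, matching the "Bayesian learning corresponds to smoothing" slide.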
Slide 41: An example
(Worked example on the slide.)
Slide 42: What you need to know about parameter learning
MLE:
- the score decomposes according to the CPTs
- optimize each CPT separately
Bayesian parameter learning:
- motivation for the Bayesian approach
- Bayesian prediction
- conjugate priors, equivalent sample size
- Bayesian learning ⇒ smoothing
Bayesian learning for BN parameters:
- global parameter independence
- decomposition of prediction according to CPTs
- decomposition within a CPT