Slide 1: Param. Learning (MLE), Structure Learning: The Good
Graphical Models – 10-708, Carlos Guestrin, Carnegie Mellon University
October 1st, 2008
Readings: K&F: 16.1, 16.2, 17.1, 17.2, 17.3.1, 17.4.1
Slides 2–3: Learning the CPTs
Data: x(1), …, x(m)
For each discrete variable X_i, estimate its CPT P(X_i | Pa_Xi) from the data by counting.
Why do the empirical counts give the maximum likelihood estimate? (A sketch of the counting estimate follows.)
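The slide's formula was an image that did not survive extraction; below is a minimal sketch of the counting estimate it refers to. The function name `mle_cpt` and the data layout (a list of dicts mapping variable name to value) are hypothetical, not from the lecture.

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """MLE of P(child | parents) from fully observed discrete data.

    Returns {(u, x): theta_{x|u}} where theta_{x|u} = Count(x, u) / Count(u).
    """
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent = Counter(tuple(row[p] for p in parents) for row in data)
    return {(u, x): c / parent[u] for (u, x), c in joint.items()}

# Toy example with the lecture's Flu/Sinus variables (the data values are made up):
data = [{"Flu": 1, "Sinus": 1}, {"Flu": 1, "Sinus": 0}, {"Flu": 0, "Sinus": 0}]
print(mle_cpt(data, child="Sinus", parents=["Flu"]))
# {((1,), 1): 0.5, ((1,), 0): 0.5, ((0,), 0): 1.0}
```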
Slide 4: Maximum likelihood estimation (MLE) of BN parameters – example
Given the structure, the log-likelihood of the data (equation on slide).
(Figure: example network over Flu, Allergy, Sinus, Nose.)
Slide 5: Maximum likelihood estimation (MLE) of BN parameters – general case
Data: x(1), …, x(m)
Notation: x(j)[Pa_Xi] denotes the restriction of x(j) to Pa_Xi, i.e. the assignment to X_i's parents in sample x(j).
Given the structure, the log-likelihood of the data is reconstructed below.
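The slide's equation was not extracted; in the notation above, the standard decomposition it refers to is:

```latex
\log P(\mathcal{D} \mid \theta_G, G)
  = \sum_{j=1}^{m} \sum_{i=1}^{n} \log P\bigl(x_i^{(j)} \mid x^{(j)}[\mathrm{Pa}_{X_i}],\ \theta\bigr)
  = \sum_{i=1}^{n} \sum_{j=1}^{m} \log \theta_{x_i^{(j)} \mid x^{(j)}[\mathrm{Pa}_{X_i}]}
```

so the log-likelihood splits into one term per CPT, and each term can be maximized independently.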
Slide 6: Taking derivatives of the MLE of BN parameters – general case
(Derivation on the slide; the resulting closed form is summarized on slide 7.)
Slide 7: General MLE for a CPT
Take a CPT P(X | U). The log-likelihood term for this CPT is a function of the parameters θ_{X=x | U=u}; its maximizer is reconstructed below.
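The slide's derivation did not survive extraction; the standard closed form it arrives at (maximize the CPT's log-likelihood term subject to each conditional distribution summing to one) is:

```latex
\max_{\theta}\ \sum_{u}\sum_{x} \mathrm{Count}(x, u)\,\log \theta_{x \mid u}
\quad \text{s.t.}\ \sum_{x} \theta_{x \mid u} = 1 \ \ \forall u
\qquad\Longrightarrow\qquad
\hat{\theta}_{x \mid u} = \frac{\mathrm{Count}(x, u)}{\mathrm{Count}(u)}
```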
Slide 8: Where are we with learning BNs?
Given the structure, estimate the parameters:
- maximum likelihood estimation
- Bayesian learning (later)
What about learning the structure?
Slide 9: Learning the structure of a BN
Constraint-based approach:
- a BN encodes conditional independencies
- test conditional independencies in the data
- find an I-map
Score-based approach:
- finding a structure and parameters is a density estimation task
- evaluate a model the same way we evaluated parameters: maximum likelihood, Bayesian, etc.
(Figure: from data over Flu, Allergy, Sinus, Headache, Nose, learn structure and parameters among the possible structures.)
Slide 10: Remember: obtaining a P-map?
Given the independence assertions that are true for P:
- obtain the skeleton
- obtain the immoralities
- from the skeleton and immoralities, obtain every (and any) BN structure from the equivalence class
Constraint-based approach: use the Learn-PDAG algorithm.
Key question: the independence test.
Slide 11: Score-based approach
(Figure: from data over Flu, Allergy, Sinus, Headache, Nose, score each possible structure and learn its parameters.)
Slide 12: Information-theoretic interpretation of maximum likelihood
Given the structure, the log-likelihood of the data (equation on slide; example network over Flu, Allergy, Sinus, Headache, Nose).
Slide 13: Information-theoretic interpretation of maximum likelihood, part 2
Continuing the derivation for the same example network; the end result is reconstructed below.
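The equations on slides 12–13 were not extracted; the standard result they build up to is that the per-sample maximum log-likelihood decomposes into empirical mutual-information and entropy terms:

```latex
\frac{1}{m}\,\log P(\mathcal{D} \mid \hat{\theta}, G)
  = \sum_{i} \hat{I}\bigl(X_i;\ \mathrm{Pa}^{G}_{X_i}\bigr)
  \;-\; \sum_{i} \hat{H}(X_i)
```

Only the mutual-information term depends on the graph, which makes the score decomposable and motivates the Chow-Liu algorithm below.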
Slide 14: Decomposable score
The log data likelihood is a decomposable score: it decomposes over the families of the BN (each node together with its parents), which will lead to significant computational efficiency.
Score(G : D) = Σ_i FamScore(X_i | Pa_Xi : D)
Slide 15: Announcements
- Recitation tomorrow; don't miss it!
- HW2 out today, due in two weeks.
- Projects: proposals due Oct. 8th in class, individually or in groups of two; details on the course website; project suggestions will be up soon.
Slide 16: BN code release
Pre-release of a C++ library for probabilistic inference and learning. Features:
- basic data structures (random variables, processes, linear algebra)
- distributions (Gaussian, multinomial, ...)
- basic graph structures (directed, undirected)
- graphical models (Bayesian networks, MRFs, junction trees)
- inference algorithms (variable elimination, loopy belief propagation, filtering)
- a limited amount of learning (IPF, Chow-Liu, order-based search)
Supported platforms: Linux (tested on Ubuntu 8.04), Mac OS X (tested on 10.4/10.5), limited Windows support.
Will be made available to the class early next week.
Slide 17: How many trees are there?
Exponentially many (n^(n-2) distinct undirected trees over n labeled nodes). Nonetheless, an efficient algorithm finds the optimal tree.
Slide 18: Scoring a tree, part 1: I-equivalent trees
(Worked example on the slide: I-equivalent trees receive the same likelihood score.)
Slide 19: Scoring a tree, part 2: similar trees
(Worked example on the slide.)
Slide 20: Chow-Liu tree learning algorithm, part 1
For each pair of variables X_i, X_j:
- compute the empirical pairwise distribution
- compute the empirical mutual information I(X_i; X_j)
Define a graph with nodes X_1, …, X_n, where edge (i, j) gets weight I(X_i; X_j) (see the sketch below).
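The slide's formulas were not extracted; this is a minimal sketch of the edge-weight computation, assuming discrete data given as a list of rows indexable by column. The function name `empirical_mutual_information` is mine, not from the lecture.

```python
import math
from collections import Counter

def empirical_mutual_information(data, i, j):
    """Empirical mutual information I_hat(X_i; X_j) from rows of discrete samples."""
    m = len(data)
    joint = Counter((row[i], row[j]) for row in data)
    marg_i = Counter(row[i] for row in data)
    marg_j = Counter(row[j] for row in data)
    mi = 0.0
    for (xi, xj), count in joint.items():
        p_xy = count / m
        mi += p_xy * math.log(p_xy / ((marg_i[xi] / m) * (marg_j[xj] / m)))
    return mi

# Chow-Liu edge weights: weight[(i, j)] = empirical_mutual_information(data, i, j)
# for every unordered pair i < j.
```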
Slide 21: Chow-Liu tree learning algorithm, part 2
Optimal tree BN:
- compute the maximum-weight spanning tree
- directions in the BN: pick any node as the root; a breadth-first search from the root defines the edge directions.
A sketch of both steps follows.
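A minimal sketch of the two steps, assuming the networkx library (any maximum-spanning-tree routine would do) and the mutual-information edge weights computed above; the function name `chow_liu_tree` is hypothetical.

```python
import networkx as nx

def chow_liu_tree(mi_weights, nodes, root):
    """mi_weights: {(i, j): I_hat(X_i; X_j)} over unordered pairs. Returns a directed tree."""
    g = nx.Graph()
    g.add_nodes_from(nodes)
    g.add_weighted_edges_from((i, j, w) for (i, j), w in mi_weights.items())
    undirected_tree = nx.maximum_spanning_tree(g)        # max-weight spanning tree
    directed = nx.DiGraph()
    directed.add_edges_from(nx.bfs_edges(undirected_tree, root))  # orient edges away from root
    return directed
```

Any choice of root gives an I-equivalent tree, which is why "pick any node as root" suffices.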
Slide 22: Can we extend Chow-Liu? (part 1)
Tree-augmented naive Bayes (TAN) [Friedman et al. '97]: the naive Bayes model over-counts evidence because correlations between features are not considered. TAN is the same as Chow-Liu, but scores edges with the conditional mutual information given the class (reconstructed below).
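The slide's scoring formula was an image; the standard TAN edge weight, the empirical mutual information between features conditioned on the class variable C, is:

```latex
\hat{I}(X_i;\ X_j \mid C)
  = \sum_{x_i, x_j, c} \hat{P}(x_i, x_j, c)\,
    \log \frac{\hat{P}(x_i, x_j \mid c)}{\hat{P}(x_i \mid c)\,\hat{P}(x_j \mid c)}
```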
Slide 23: Can we extend Chow-Liu? (part 2)
(Approximately) learning models with tree-width up to k [Chechetka & Guestrin '07], but at a cost of O(n^(2k+6)).
Slide 24: What you need to know about learning BN structures so far
- Decomposable scores: maximum likelihood; information-theoretic interpretation
- Best tree (Chow-Liu)
- Best TAN
- Nearly-best k-treewidth models (in O(n^(2k+6)))
Slide 25: Can we really trust the MLE?
Which estimate is better:
- 3 heads, 2 tails
- 30 heads, 20 tails
- 3×10^23 heads, 2×10^23 tails?
The MLE is 3/5 in every case. Many possible answers; we need distributions over possible parameters.
Slide 26: Bayesian learning
Use Bayes' rule: P(θ | D) = P(D | θ) P(θ) / P(D), or equivalently P(θ | D) ∝ P(D | θ) P(θ).
Slide 27: Bayesian learning for the thumbtack
The likelihood function is simply binomial: P(D | θ) = θ^(m_H) (1 − θ)^(m_T).
What about the prior?
- It represents expert knowledge.
- It should give a simple posterior form.
Conjugate priors give a closed-form representation of the posterior (more details soon); for the binomial, the conjugate prior is the Beta distribution.
Slide 28: Beta prior distribution – P(θ)
Prior: P(θ) = Beta(α_H, α_T) ∝ θ^(α_H − 1) (1 − θ)^(α_T − 1).
Likelihood function: binomial in (m_H, m_T). Posterior: derived on the next slide.
Slide 29: Posterior distribution
Prior: Beta(α_H, α_T). Data: m_H heads and m_T tails.
Posterior distribution: reconstructed below.
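A reconstruction of the algebra shown on the slide (standard Beta-binomial conjugacy):

```latex
P(\theta \mid \mathcal{D})
  \;\propto\; P(\mathcal{D} \mid \theta)\, P(\theta)
  \;\propto\; \theta^{m_H} (1-\theta)^{m_T} \cdot \theta^{\alpha_H - 1} (1-\theta)^{\alpha_T - 1}
  \;=\; \theta^{\alpha_H + m_H - 1} (1-\theta)^{\alpha_T + m_T - 1}
```

so P(θ | D) = Beta(α_H + m_H, α_T + m_T).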
Slide 30: Conjugate prior
Given a likelihood function P(D | θ), a (parametric) prior of the form P(θ | α) is conjugate to the likelihood if the posterior is of the same parametric family and can be written as P(θ | α′) for some new set of parameters α′.
Example: prior Beta(α_H, α_T); data m_H heads and m_T tails (binomial likelihood); posterior Beta(α_H + m_H, α_T + m_T).
Slide 31: Using the Bayesian posterior
Posterior distribution: P(θ | D).
Bayesian inference: there is no longer a single parameter; predictions average over θ, e.g. P(x | D) = ∫ P(x | θ) P(θ | D) dθ.
The integral is often hard to compute.
Slide 32: Bayesian prediction of a new coin flip
Prior: Beta(α_H, α_T). Having observed m_H heads and m_T tails, what is the probability that the (m+1)-st flip is heads?
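The slide's closed form was not extracted; the standard result for a Beta posterior is:

```latex
P(x_{m+1} = H \mid \mathcal{D})
  = \int_0^1 \theta \, P(\theta \mid \mathcal{D}) \, d\theta
  = \frac{\alpha_H + m_H}{\alpha_H + \alpha_T + m_H + m_T}
```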
Slide 33: Asymptotic behavior and equivalent sample size
A Beta prior is equivalent to having seen extra thumbtack flips: as m → ∞, the prior is "forgotten"; but for small sample sizes, the prior is important!
Equivalent sample size: the prior can be parameterized by (α_H, α_T), or by m′ = α_H + α_T (the equivalent sample size) together with the prior mean. One can fix m′ and vary the prior mean, or fix the prior mean and vary m′.
Slide 34: Bayesian learning corresponds to smoothing
With m = 0 data points the estimate is the prior parameter; as m → ∞ it converges to the MLE.
(Plot on slide: estimate as a function of m.)
Slide 35: Bayesian learning for a multinomial
What if you have a k-sided coin? The likelihood function is multinomial, and the conjugate prior for the multinomial is the Dirichlet. Observe m data points, m_i of them from assignment i; the posterior and the prediction rule are reconstructed below.
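The slide's formulas did not survive extraction; the standard Dirichlet-multinomial forms are:

```latex
P(\mathcal{D} \mid \theta) \propto \prod_{i=1}^{k} \theta_i^{m_i}, \qquad
P(\theta) = \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_k), \qquad
P(\theta \mid \mathcal{D}) = \mathrm{Dirichlet}(\alpha_1 + m_1, \ldots, \alpha_k + m_k),
```

```latex
P(x_{m+1} = i \mid \mathcal{D}) = \frac{\alpha_i + m_i}{\sum_{j=1}^{k} (\alpha_j + m_j)}
```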
Slide 36: Bayesian learning for a two-node BN
Parameters: θ_X and θ_{Y|X}. Priors: P(θ_X) and P(θ_{Y|X}) (specific forms on the slide).
Slide 37: A very important assumption on the prior: global parameter independence
Global parameter independence: the prior over the parameters is a product of priors over the individual CPTs, P(θ_G | G) = Π_i P(θ_{X_i | Pa_Xi} | G).
Slide 38: Global parameter independence, d-separation and local prediction
(Figure: meta-BN over Flu, Allergy, Sinus, Headache, Nose with a parameter node for each CPT.)
Independencies in the meta-BN follow from d-separation.
Proposition: for fully observable data D, if the prior satisfies global parameter independence, then the posterior factorizes the same way: P(θ_G | D) = Π_i P(θ_{X_i | Pa_Xi} | D).
Slide 39: Within a CPT
Meta-BN including the CPT parameters:
- Are θ_{Y|X=t} and θ_{Y|X=f} d-separated given D? No.
- Are θ_{Y|X=t} and θ_{Y|X=f} independent given D? Yes: context-specific independence!
So the posterior decomposes over the parent assignments: P(θ_{Y|X} | D) = P(θ_{Y|X=t} | D) P(θ_{Y|X=f} | D).
Slide 40: Priors for BN CPTs (more when we talk about structure learning)
Consider each CPT P(X | U=u). Conjugate prior: Dirichlet(α_{X=1|U=u}, …, α_{X=k|U=u}).
A more intuitive view: a "prior data set" D′ with equivalent sample size m′, whose "prior counts" are simply added to the data counts for prediction (see the sketch below).
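A minimal sketch of the smoothed (Dirichlet posterior-mean) CPT estimate this slide describes; the function name `bayesian_cpt` and the data layout (list of dicts) are hypothetical.

```python
from collections import Counter

def bayesian_cpt(data, child, parents, prior_counts):
    """Posterior-mean estimate of P(child | parents) with Dirichlet 'prior counts'.

    prior_counts: {(u, x): alpha_{x|u}}, playing the role of the prior data set D'.
    Prediction: P(x | u, D) = (alpha_{x|u} + Count(x, u)) / (sum_x' alpha_{x'|u} + Count(u)).
    """
    counts = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    total = Counter(prior_counts)
    total.update(counts)            # add empirical counts to the prior counts
    norm = Counter()
    for (u, _x), c in total.items():
        norm[u] += c
    return {(u, x): c / norm[u] for (u, x), c in total.items()}
```

With all prior counts set to zero this reduces to the MLE; larger prior counts smooth the estimate toward the prior, matching the "Bayesian learning corresponds to smoothing" slide.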
Slide 41: An example
(Worked example on the slide.)
Slide 42: What you need to know about parameter learning
MLE:
- the score decomposes according to the CPTs
- optimize each CPT separately
Bayesian parameter learning:
- motivation for the Bayesian approach
- Bayesian prediction
- conjugate priors, equivalent sample size
- Bayesian learning ⇒ smoothing
Bayesian learning for BN parameters:
- global parameter independence
- decomposition of prediction according to CPTs
- decomposition within a CPT