BN Semantics 3 – Now it's personal! Parameter Learning 1
Graphical Models – 10708
Carlos Guestrin, Carnegie Mellon University
September 22nd, 2006
Readings: K&F: 3.4, 14.1, 14.2

Building BNs from independence properties
From d-separation we learned:
- Start from local Markov assumptions, obtain all independence assumptions encoded by the graph
- For most P's that factorize over G, I(G) = I(P)
- All of this discussion was for a given G that is an I-map for P
Now, given a P, how can I get a G?
- i.e., give me the independence assumptions entailed by P
- Many G are "equivalent"; how do I represent this?
Most of this discussion is not about practical algorithms, but about useful concepts that will be used by practical algorithms (practical algorithms next week).

Minimal I-maps
One option:
- G is an I-map for P
- G is as simple as possible
G is a minimal I-map for P if deleting any edge from G makes it no longer an I-map.

Obtaining a minimal I-map
Given a set of variables and conditional independence assumptions:
- Choose an ordering on the variables, e.g., X_1, …, X_n
- For i = 1 to n:
  - Add X_i to the network
  - Define the parents of X_i, Pa_{X_i}, in the graph as the minimal subset of {X_1, …, X_{i-1}} such that the local Markov assumption holds: X_i is independent of the rest of {X_1, …, X_{i-1}} given Pa_{X_i}
  - Define/learn the CPT P(X_i | Pa_{X_i})
Example variables: Flu, Allergy, SinusInfection, Headache.
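The procedure above maps directly to a small search over candidate parent sets. A minimal sketch in Python, assuming a hypothetical independence oracle is_independent(x, rest, parents) that answers conditional-independence queries about P (not something the slides provide):

from itertools import combinations

def minimal_imap(variables, is_independent):
    """Build parent sets for a minimal I-map of P under the given ordering."""
    parents = {}
    for i, x in enumerate(variables):
        predecessors = variables[:i]
        # Fallback: the full predecessor set always satisfies local Markov.
        found = list(predecessors)
        # Search candidate parent sets from smallest to largest; keep the
        # first one that makes X_i independent of the remaining predecessors.
        for size in range(len(predecessors) + 1):
            done = False
            for cand in combinations(predecessors, size):
                rest = [v for v in predecessors if v not in cand]
                if is_independent(x, rest, list(cand)):
                    found = list(cand)
                    done = True
                    break
            if done:
                break
        parents[x] = found
    return parents

# Hypothetical usage with the slide's variables and some oracle ci_oracle:
# structure = minimal_imap(["Flu", "Allergy", "SinusInfection", "Headache"], ci_oracle)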

Minimal I-map not unique (or minimal)
Same procedure, but the result depends on the chosen ordering:
- Choose an ordering on the variables, e.g., X_1, …, X_n
- For i = 1 to n:
  - Add X_i to the network
  - Define the parents of X_i, Pa_{X_i}, in the graph as the minimal subset of {X_1, …, X_{i-1}} such that the local Markov assumption holds: X_i is independent of the rest of {X_1, …, X_{i-1}} given Pa_{X_i}
  - Define/learn the CPT P(X_i | Pa_{X_i})
Different orderings of Flu, Allergy, SinusInfection, Headache can yield different minimal I-maps, some with many more edges than necessary.

Perfect maps (P-maps)
I-maps are not unique and often not simple enough.
Define the "simplest" G that is an I-map for P:
- A BN structure G is a perfect map for a distribution P if I(P) = I(G)
Our goal:
- Find a perfect map!
- Must address equivalent BNs

Inexistence of P-maps 1
XOR (this is a hint for the homework)

Inexistence of P-maps 2
(Slightly un-PC) swinging couples example

Obtaining a P-map
Given the independence assertions that are true for P:
- Assume that there exists a perfect map G*
- We want to find G*
Many structures may encode the same independencies as G*; when are we done?
- Find all equivalent structures simultaneously!

I-Equivalence
Two graphs G_1 and G_2 are I-equivalent if I(G_1) = I(G_2).
Equivalence class of BN structures:
- Mutually exclusive and exhaustive partition of graphs
How do we characterize these equivalence classes?

Skeleton of a BN
The skeleton of a BN structure G is an undirected graph over the same variables that has an edge X–Y for every X → Y or Y → X in G.
(Little) Lemma: Two I-equivalent BN structures must have the same skeleton.
(Figure: example BN over nodes A through K.)

What about V-structures?
V-structures are a key property of BN structure.
Theorem: If G_1 and G_2 have the same skeleton and the same V-structures, then G_1 and G_2 are I-equivalent.
(Figure: example BN over nodes A through K.)

Same V-structures not necessary
Theorem: If G_1 and G_2 have the same skeleton and the same V-structures, then G_1 and G_2 are I-equivalent.
Though sufficient, having the same V-structures is not necessary.

Immoralities & I-Equivalence
The key concept is not V-structures, but "immoralities" (unmarried parents):
- X → Z ← Y, with no arrow between X and Y
- Important pattern: X and Y are independent given their parents, but not given Z
- (If an edge exists between X and Y, we have a covered V-structure, not an immorality)
Theorem: G_1 and G_2 have the same skeleton and the same immoralities if and only if G_1 and G_2 are I-equivalent.
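The theorem gives a direct algorithmic test. A minimal sketch, assuming each DAG is represented as a dict mapping every node to the set of its parents (a representation chosen here for illustration, not taken from the slides):

from itertools import combinations

def skeleton(dag):
    return {frozenset((child, p)) for child, ps in dag.items() for p in ps}

def immoralities(dag):
    imm = set()
    skel = skeleton(dag)
    for z, ps in dag.items():
        for x, y in combinations(sorted(ps), 2):
            if frozenset((x, y)) not in skel:      # parents not connected
                imm.add((frozenset((x, y)), z))    # X -> Z <- Y is an immorality
    return imm

def i_equivalent(g1, g2):
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

# Example: X -> Z <- Y vs. X -> Z -> Y have the same skeleton but differ
# in immoralities, so they are not I-equivalent.
g_a = {"X": set(), "Y": set(), "Z": {"X", "Y"}}
g_b = {"X": set(), "Y": {"Z"}, "Z": {"X"}}
print(i_equivalent(g_a, g_b))  # False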

Obtaining a P-map
Given the independence assertions that are true for P:
- Obtain the skeleton
- Obtain the immoralities
From the skeleton and immoralities, obtain every (and any) BN structure from the equivalence class.

Identifying the skeleton 1
When is there an edge between X and Y? When is there no edge between X and Y?
There is no edge X–Y in the skeleton exactly when some subset U of the remaining variables makes X and Y independent given U.

Identifying the skeleton 2
Assume d is the max number of parents (d could be n).
For each X_i and X_j:
- E_ij ← true
- For each U ⊆ X − {X_i, X_j} with |U| ≤ 2d:
  - If (X_i ⊥ X_j | U), then E_ij ← false
- If E_ij is true, add edge X_i – X_j to the skeleton.
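A minimal sketch of this loop in Python, again assuming a hypothetical conditional-independence oracle is_independent(xi, xj, cond_set) for queries about P:

from itertools import combinations

def identify_skeleton(variables, is_independent, d):
    """Return the set of undirected edges {Xi, Xj} of the skeleton.

    d is an assumed bound on the number of parents; conditioning sets of
    size up to 2d are tested, as in the slide.
    """
    edges = set()
    for xi, xj in combinations(variables, 2):
        others = [v for v in variables if v not in (xi, xj)]
        has_edge = True
        for size in range(min(2 * d, len(others)) + 1):
            for cond in combinations(others, size):
                if is_independent(xi, xj, set(cond)):
                    has_edge = False   # some U separates Xi and Xj
                    break
            if not has_edge:
                break
        if has_edge:
            edges.add(frozenset((xi, xj)))
    return edges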

Identifying immoralities
Consider X – Z – Y in the skeleton; when should it be an immorality?
Must be X → Z ← Y (immorality):
- When X and Y are never independent given U, whenever Z ∈ U
Must not be X → Z ← Y (not an immorality):
- When there exists a U with Z ∈ U such that X and Y are independent given U
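These two rules fold into one check: X – Z – Y is an immorality iff no set containing Z makes X and Y independent. A minimal sketch, assuming the same hypothetical oracle and a skeleton given as a set of frozenset edges (as in the sketch above):

from itertools import combinations

def identify_immoralities(variables, skeleton_edges, is_independent):
    immoralities = set()
    for z in variables:
        neighbors = [v for v in variables if frozenset((v, z)) in skeleton_edges]
        for x, y in combinations(neighbors, 2):
            if frozenset((x, y)) in skeleton_edges:
                continue  # X and Y adjacent: covered V-structure, not an immorality
            others = [v for v in variables if v not in (x, y)]
            # Is there any conditioning set U containing Z that separates X and Y?
            separated_with_z = any(
                is_independent(x, y, set(cond))
                for size in range(len(others) + 1)
                for cond in combinations(others, size)
                if z in cond
            )
            if not separated_with_z:
                immoralities.add((frozenset((x, y)), z))
    return immoralities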

From immoralities and skeleton to BN structures
- Representing the BN equivalence class as a partially directed acyclic graph (PDAG)
- Immoralities force the direction on other BN edges
- Full (polynomial-time) procedure described in the reading

What you need to know
Minimal I-map:
- every P has one, but usually many
Perfect map:
- a better choice for BN structure
- not every P has one
- can find one (if it exists) by considering I-equivalence
- Two structures are I-equivalent if they have the same skeleton and the same immoralities

Announcements
I'll lead a special discussion session:
- Today, 2-3pm in NSH 1507
- We'll talk about the homework, especially the programming question.

Review: Bayesian Networks
- Compact representation for probability distributions
- Exponential reduction in the number of parameters
- Exploits independencies
Next – learn BNs:
- parameters
- structure
(Figure: example BN over Flu, Allergy, Sinus, Headache, Nose.)

Thumbtack – Binomial Distribution
P(Heads) = θ, P(Tails) = 1 − θ
Flips are i.i.d.:
- Independent events
- Identically distributed according to the Binomial distribution
Sequence D of α_H Heads and α_T Tails.

Maximum Likelihood Estimation
Data: observed set D of α_H Heads and α_T Tails
Hypothesis: Binomial distribution
Learning θ is an optimization problem:
- What's the objective function?
MLE: choose θ that maximizes the probability of the observed data:
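The objective follows from the i.i.d. and Binomial assumptions above; a reconstruction in LaTeX:

\[
\hat{\theta}_{\mathrm{MLE}}
  = \arg\max_{\theta} P(D \mid \theta)
  = \arg\max_{\theta} \theta^{\alpha_H} (1-\theta)^{\alpha_T}
  = \arg\max_{\theta} \ln P(D \mid \theta).
\]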

Your first learning algorithm
Set the derivative to zero:
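Carrying out that step with the log-likelihood (which has the same maximizer):

\[
\frac{d}{d\theta} \ln P(D \mid \theta)
  = \frac{d}{d\theta}\left[ \alpha_H \ln\theta + \alpha_T \ln(1-\theta) \right]
  = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0
  \quad\Longrightarrow\quad
  \hat{\theta} = \frac{\alpha_H}{\alpha_H + \alpha_T}.
\]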

Learning Bayes nets
The learning problem splits along two axes: known vs. unknown structure, and fully observable vs. missing data.
Data x^(1), …, x^(m) plus a structure yield the parameters: the CPTs P(X_i | Pa_{X_i}).

Learning the CPTs
Data: x^(1), …, x^(m)
For each discrete variable X_i, estimate its CPT from the data.

Learning the CPTs (continued)
Data: x^(1), …, x^(m)
For each discrete variable X_i, the estimate is a ratio of counts. But WHY is counting the maximum likelihood answer? (The next slides derive it; a counting sketch follows below.)
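A minimal sketch of the count-and-normalize estimate, assuming the data is a list of dicts from variable names to values (the representation and the toy data here are hypothetical, not from the slides):

from collections import Counter

def mle_cpt(data, child, parents):
    """Return MLE estimates theta_{x|u} = Count(x, u) / Count(u)."""
    joint = Counter()   # counts of (parent assignment, child value)
    parent = Counter()  # counts of parent assignment alone
    for sample in data:
        u = tuple(sample[p] for p in parents)
        joint[(u, sample[child])] += 1
        parent[u] += 1
    return {(u, x): c / parent[u] for (u, x), c in joint.items()}

# Hypothetical usage with the Flu/Allergy/Sinus variables from the slides:
data = [
    {"Flu": 1, "Allergy": 0, "Sinus": 1},
    {"Flu": 0, "Allergy": 0, "Sinus": 0},
    {"Flu": 1, "Allergy": 1, "Sinus": 1},
    {"Flu": 0, "Allergy": 1, "Sinus": 0},
]
print(mle_cpt(data, "Sinus", ["Flu", "Allergy"]))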

Maximum likelihood estimation (MLE) of BN parameters – example
Given the structure, write out the log likelihood of the data.
(Figure: example BN over Flu, Allergy, Sinus, Headache, Nose.)

Maximum likelihood estimation (MLE) of BN parameters – general case
Data: x^(1), …, x^(m)
Restriction (notation): x^(j)[Pa_{X_i}] → the assignment to Pa_{X_i} in x^(j)
Given the structure, the log likelihood of the data is:
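The expression can be reconstructed from the BN factorization; in LaTeX:

\[
\log P(D \mid \theta, G)
  = \sum_{j=1}^{m} \log P(x^{(j)} \mid \theta, G)
  = \sum_{j=1}^{m} \sum_{i=1}^{n} \log P\!\left(x_i^{(j)} \,\middle|\, x^{(j)}[\mathrm{Pa}_{X_i}]\right),
\]

so the log likelihood decomposes into one independent term per CPT.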

Taking derivatives of MLE of BN parameters – general case

General MLE for a CPT
Take a CPT: P(X | U)
Log likelihood term for this CPT, with parameter θ_{X=x | U=u}:
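Setting the derivative of that term to zero (with the constraint that the parameters sum to one over x) gives the count estimate:

\[
\hat{\theta}_{X=x \mid U=u}
  = \frac{\mathrm{Count}(X=x, U=u)}{\sum_{x'} \mathrm{Count}(X=x', U=u)}
  = \frac{\mathrm{Count}(x, u)}{\mathrm{Count}(u)}.
\]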

Parameter sharing (basics now, more later in the semester)
Suppose we want to model customers' ratings for books.
You know:
- features of customers, e.g., age, gender, income, …
- features of books, e.g., genre, awards, # of pages, has pictures, …
- ratings: each user rates a few books
A simple BN:

Using the recommender system
Answer a probabilistic question:

Learning the parameters of the recommender system BN
How many parameters do I have to learn? How many samples do I have?

Parameter sharing for the recommender system BN
Use the same parameters in many CPTs.
How many parameters do I have to learn? How many samples do I have?

MLE with simple parameter sharing
Estimating each shared parameter θ: because the log likelihood adds up the terms from every CPT that shares θ, the MLE pools the counts from all of those CPTs and then normalizes.
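A minimal sketch of that pooling, assuming one rating CPT P(Rating | user features, book features) shared across every (user, book) pair; the field names and the toy data are hypothetical, not from the slides:

from collections import Counter

def shared_cpt_mle(rating_events):
    """rating_events: iterable of (user_features, book_features, rating) tuples."""
    joint = Counter()
    context = Counter()
    for user_f, book_f, rating in rating_events:
        u = (user_f, book_f)
        joint[(u, rating)] += 1   # pooled Count(rating, u) over all users and books
        context[u] += 1           # pooled Count(u)
    return {(u, r): c / context[u] for (u, r), c in joint.items()}

# Hypothetical usage: one shared CPT estimated from three rating events.
events = [
    (("young", "low_income"), ("fiction", "short"), 5),
    (("young", "low_income"), ("fiction", "short"), 3),
    (("older", "high_income"), ("nonfiction", "long"), 4),
]
print(shared_cpt_mle(events))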

What you need to know about learning BNs thus far
Maximum likelihood estimation:
- decomposition of the score
- computing CPTs
Simple parameter sharing:
- why share parameters?
- computing the MLE for shared parameters