Slide 1 Directed Graphical Probabilistic Models: learning in DGMs. William W. Cohen, Machine Learning 10-601.

Slide 2 REVIEW OF DGMS

Another difficult problem: common-sense reasoning

Slide 5 Summary of Monday (1): Bayes nets. Many problems can be solved using the joint probability P(X1,…,Xn). Bayes nets describe a way to write the joint compactly. The slide's example is the Monty Hall network with nodes A (first guess), B (the money), C (the goat), D (stick or swap?), and E (second guess), with CPTs P(A), P(B), P(C|A,B), and P(E|A,C,D), plus the conditional independences the graph encodes. The factorization for this network is written out below.
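As a reminder of what "compactly writing the joint" means here, the joint factors into the CPTs the slide lists; for the Monty Hall network (assuming the edges implied by those CPTs, A,B → C and A,C,D → E) this is:

```latex
P(A,B,C,D,E) \;=\; P(A)\,P(B)\,P(D)\,P(C \mid A,B)\,P(E \mid A,C,D)
```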

Slide 6 Summary of Monday (2): d-separation. There are three ways a path from X to Y given evidence E can be blocked: a chain through Z with Z observed, a common cause (fork) Z with Z observed, and a common effect (collider) Z where neither Z nor any descendant of Z is observed. X is d-separated from Y given E iff all paths from X to Y are blocked given E. If X is d-separated from Y given E, then I<X,E,Y>, i.e., X is conditionally independent of Y given E. (The slide's figure illustrates all the ways paths can be blocked.)

Slide 7 Quiz 1: which are d-separated? (The slide shows small graphs over nodes X1, X2, X3, Z, and Y1, Y2, Y3; evidence nodes are filled.)

Slide 8 Quiz 2: which is an example of explaining away? (The slide shows small graphs over nodes X2, X3, Z, Y2, and Y3; evidence nodes are filled.)

Slide 9 Summary of Wednesday (1): inference "down". Using d-separation and the polytree assumption, the computation simplifies to a recursive call to P(E⁻|Y): requests for "belief due to evidential support" propagate down the tree, i.e., information about Pr(E⁻|X) flows up.

Slide 10 Summary of Wednesday (2): inference "up". A CPT lookup plus a recursive call to P(· |E⁺): requests for "belief due to causal evidence" propagate up the tree, using the evidence for each parent Uj that doesn't go through X, i.e., information about Pr(X|E⁺) flows down.

Slide 11 Summary of Wednesday (2.5): Markov blanket. The Markov blanket of a random variable A is the set of variables B1,…,Bk that A is not conditionally independent of (i.e., not d-separated from). For a DGM this includes A's parents, children, and "co-parents" (the other parents of its children); the co-parents are included because of explaining away.

Slide 12 Summary of Wednesday (3): sideways?! Similar ideas as before, but a more complex recursion: the Z's are NOT independent of X (because of explaining away), and they are neither strictly higher nor strictly lower in the tree.

Slide 13 Summary of Wednesday (4). We reduced P(X|E) to the product of two recursively calculated parts: P(X=x|E⁺), i.e., the CPT for X combined with the product of "forward" messages from its parents; and P(E⁻|X=x), i.e., a combination of "backward" messages from its children, CPTs, and P(Z|E_{Z\Yk}), a simpler instance of P(X|E). This can also be implemented by message passing (belief propagation); the messages are distributions, i.e., vectors.

Another difficult problem: common-sense reasoning. Have we solved the common-sense reasoning problem? Yes: we use directed graphical models. Semantics: how to specify them. Inference: how to use them. The last piece: learning, i.e., how to find the parameters.

LEARNING IN DGMS

Learning for Bayes nets. Input: a sample of the joint distribution, and the graph structure of the variables (for i=1,…,N, you know Xi and parents(Xi)). Output: estimated CPTs, e.g., P(C|A,B) and P(B) for the example network over A, B, C, D, E. Method (discrete variables): estimate each CPT independently, using an MLE or MAP estimate.

Learning for Bayes nets (continued). Method (discrete variables): estimate each CPT independently, using an MLE or MAP estimate for each entry. The MLE estimate for an entry such as P(C=c | A=a, B=b) is the ratio of counts in the sample: #(A=a, B=b, C=c) / #(A=a, B=b).

MAP estimates. The beta distribution acts as "pseudo-data": using a Beta prior is like hallucinating a few extra heads and a few extra tails before counting.

MAP estimates. The Dirichlet distribution works the same way: using a Dirichlet prior is like hallucinating αi extra examples of X=i for each value i.

Learning for Bayes nets (continued). Method (discrete variables): estimate each CPT independently, using an MLE or MAP estimate for each entry. The MAP estimate is the same count ratio as the MLE, but with the hallucinated Dirichlet pseudo-data added to the counts before normalizing. A code sketch follows.
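To make the counting concrete, here is a minimal sketch (an illustration, not code from the lecture) of estimating one discrete CPT from fully observed samples; the function name estimate_cpt and the (a, b, c) tuple format are hypothetical choices, and alpha plays the role of the Dirichlet pseudo-counts.

```python
# Minimal sketch: MLE and pseudo-count (MAP-style) estimation of one discrete CPT,
# P(C | A, B), from fully observed samples.
from collections import Counter

def estimate_cpt(samples, alpha=1.0):
    """samples: list of (a, b, c) tuples; alpha: pseudo-count per value of C.
    Returns a dict mapping (a, b) -> {c: P(C=c | A=a, B=b)}."""
    joint = Counter(samples)                            # #(A=a, B=b, C=c)
    parents = Counter((a, b) for a, b, _ in samples)    # #(A=a, B=b)
    c_values = sorted({c for _, _, c in samples})
    cpt = {}
    for (a, b), n_ab in parents.items():
        denom = n_ab + alpha * len(c_values)
        cpt[(a, b)] = {c: (joint[(a, b, c)] + alpha) / denom for c in c_values}
    return cpt

# alpha=0 gives the MLE (pure count ratios); alpha>0 adds Dirichlet-style pseudo-data.
data = [("hi", "y", "t"), ("hi", "y", "f"), ("hi", "n", "t"), ("lo", "n", "f")]
print(estimate_cpt(data, alpha=1.0)[("hi", "y")])
```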

Learning for Bayes nets (continued). The same MAP recipe applies to every CPT in the network, e.g., P(B) as well as P(C|A,B).

WAIT, THAT’S IT?

Learning for Bayes nets: estimate each CPT independently, using an MLE or MAP estimate. Actually, yes, that's it.

DGMS THAT DESCRIBE LEARNING ALGORITHMS

A network with a familiar generative story. For each document d in the corpus: pick a label yd from Pr(Y); then, for each word in d, pick a word xid from Pr(X|Y=yd). The slide's network has Y with children X1, X2, X3, X4. Example CPTs: Pr(Y=onion)=0.3 and Pr(Y=economist)=0.7, and for each Xi a table Pr(Xi=x|Y=y) over the vocabulary (aardvark, ai, …, zymurgy), e.g., Pr(X=aardvark|Y=onion)=0.034. The same word-given-label table is reused for X1, X2, X3, etc.: these are "tied parameters".

A network with a familiar generative story, as a plate diagram. For each document d in the corpus (of size D): pick a label yd from Pr(Y); for each of the Nd words in d, pick a word xid from Pr(X|Y=yd). In the plate diagram, Y and X sit inside a plate of size Nd nested in a plate of size D, and a single table Pr(X=x|Y=y) is shared for every X. A sketch of this generative story follows.
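A minimal sketch of this generative story (an illustration, not lecture code): the two labels and Pr(Y) come from the slide's example, while the three-word vocabulary and its probabilities are made up.

```python
# Sampling documents from the naive Bayes / plate-diagram generative story.
import random

pr_y = {"onion": 0.3, "economist": 0.7}                       # Pr(Y), from the slide
pr_x_given_y = {                                              # Pr(X | Y), tied across positions
    "onion": {"aardvark": 0.4, "ai": 0.4, "zymurgy": 0.2},
    "economist": {"aardvark": 0.1, "ai": 0.3, "zymurgy": 0.6},
}

def sample_document(n_words):
    y = random.choices(list(pr_y), weights=pr_y.values())[0]  # pick label y_d
    dist = pr_x_given_y[y]
    words = random.choices(list(dist), weights=dist.values(), k=n_words)  # pick each x_id
    return y, words

print(sample_document(5))
```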

Learning for Bayes nets → naïve Bayes. Method (discrete variables): estimate each CPT independently, using an MLE or MAP estimate for each entry; for the plate model this means estimating Pr(Y) and the shared Pr(X|Y) table from counts.

Slide 28 Inference for Bayes nets → naïve Bayes. Using d-separation and the polytree property, inference is a recursive call to P(E⁻|·): a simple way of propagating requests for "belief due to evidential support" down the tree, i.e., information about Pr(E⁻|X) flows up. (The figure shows Y with children X1 and X2.)

Slide 29 Inference for Bayes nets → naïve Bayes. Running this inference on the Y → X1, X2 network, we end up with a "soft" version of the naïve Bayes classification rule, written out below.
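For reference, the rule in question is the standard naïve Bayes posterior (written out here since the slide's formula is an image):

```latex
\Pr(Y=y \mid x_1,\dots,x_n) \;=\;
\frac{\Pr(Y=y)\prod_{i=1}^{n}\Pr(X_i=x_i \mid Y=y)}
     {\sum_{y'}\Pr(Y=y')\prod_{i=1}^{n}\Pr(X_i=x_i \mid Y=y')}
```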

Slide 30 Summary. With Bayes nets we can infer Pr(Y|X); we can learn CPT parameters from data (samples from the joint distribution); this gives us a generative model, which we can use to design a classifier. Special case: naïve Bayes! Ok, so what? Show me something new!

Slide 31 LEARNING IN DGMS I lied, there’s actually more. Hidden (or latent) variables

Slide 32 Hidden (or latent) variables. If we knew the parameters (the CPTs), we could infer the values of the hidden variables; but we don't. If we knew the values of the hidden variables, we could learn the parameters (the CPTs); but we don't.

Slide 33 Learning with Hidden Variables. What if some of your data is not completely observed? The example network is Z → X, Y, with data rows (Z, X, Y): (ugrad, <20, facebook), (ugrad, 20s, facebook), (grad, 20s, thesis), (grad, 20s, facebook), (prof, 30+, grants), (?, <20, facebook), (?, 30+, thesis); the last two rows have hidden Z. Method: 1. Estimate the parameters θ somehow or other. 2. Predict the values of the hidden variables using θ. 3. Add pseudo-data corresponding to these predictions, weighting each example by confidence in its correctness. 4. Re-estimate the parameters using the extended dataset (real + pseudo-data). 5. Repeat from step 2. This is expectation-maximization (MLE/MAP), aka EM; a code sketch follows.
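A minimal sketch of steps 1-5 for this toy example (an illustration, not lecture code): add-one smoothing stands in for the MAP step, and the variable and function names are my own.

```python
# EM for the discrete model Z -> X, Y where Z is hidden in some rows.
from collections import defaultdict

data = [("ugrad", "<20", "facebook"), ("ugrad", "20s", "facebook"),
        ("grad", "20s", "thesis"), ("grad", "20s", "facebook"),
        ("prof", "30+", "grants"),
        (None, "<20", "facebook"), (None, "30+", "thesis")]      # None = hidden Z
Z_VALS = ["ugrad", "grad", "prof"]
X_VALS = ["<20", "20s", "30+"]
Y_VALS = ["facebook", "thesis", "grants"]

def m_step(rows, alpha=1.0):
    """rows: list of ((z, x, y), weight). Re-estimate smoothed CPTs (steps 3-4)."""
    cz, czx, czy = defaultdict(float), defaultdict(float), defaultdict(float)
    for (z, x, y), w in rows:
        cz[z] += w
        czx[(z, x)] += w
        czy[(z, y)] += w
    n = sum(cz.values())
    p_z = {z: (cz[z] + alpha) / (n + alpha * len(Z_VALS)) for z in Z_VALS}
    p_x = {(z, x): (czx[(z, x)] + alpha) / (cz[z] + alpha * len(X_VALS))
           for z in Z_VALS for x in X_VALS}
    p_y = {(z, y): (czy[(z, y)] + alpha) / (cz[z] + alpha * len(Y_VALS))
           for z in Z_VALS for y in Y_VALS}
    return p_z, p_x, p_y

def e_step(theta, x, y):
    """Posterior over the hidden Z for one row given current parameters (step 2)."""
    p_z, p_x, p_y = theta
    score = {z: p_z[z] * p_x[(z, x)] * p_y[(z, y)] for z in Z_VALS}
    s = sum(score.values())
    return {z: v / s for z, v in score.items()}

# Step 1: initialize by splitting each hidden-Z row uniformly over Z_VALS.
rows = []
for z, x, y in data:
    if z is not None:
        rows.append(((z, x, y), 1.0))
    else:
        rows += [((zv, x, y), 1.0 / len(Z_VALS)) for zv in Z_VALS]

for _ in range(10):                                   # step 5: repeat
    theta = m_step(rows)                              # re-estimate from real + pseudo-data
    rows = []
    for z, x, y in data:
        if z is not None:
            rows.append(((z, x, y), 1.0))
        else:                                         # soft-complete the hidden rows
            rows += [((zv, x, y), w) for zv, w in e_step(theta, x, y).items()]

print({z: round(w, 2) for z, w in e_step(theta, "30+", "thesis").items()})
```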

Slide 34 Learning with Hidden Variables: Example. Network Z → X1, X2, with the data table above (two rows have hidden Z). Initial parameter estimates: P(Z) is uniform (undergrad 0.333, grad 0.333, prof 0.333); P(X2|Z) gives, for undergrad: facebook .6, thesis .2, grants .2; for grad: facebook .4, thesis .4, grants .2; for prof: facebook .25, thesis .25, grants .5. (The slide also shows a P(X1|Z) table over the values <20, 20s, and 30+.)

Slide 35 Learning with Hidden Variables: Example (continued). The same data table and initial parameter estimates as on the previous slide, with attention on the two hidden-Z rows: (?, <20, facebook) and (?, 30+, thesis).

Slide 36 Learning with Hidden Variables: Example (continued). The first hidden row, (?, <20, facebook), is replaced by pseudo-data rows (ugrad, <20, facebook), (grad, <20, facebook), and (prof, <20, facebook), each weighted by the model's confidence that Z takes that value; the second hidden row and the parameter tables are unchanged from the previous slide.

Slide 37 Learning with Hidden Variables: Example (continued). Both hidden rows have now been expanded into weighted pseudo-data: (ugrad/grad/prof, <20, facebook) and (ugrad/grad/prof, 30+, thesis). The parameter tables P(Z), P(X1|Z), and P(X2|Z) are then re-estimated from the extended dataset.

Slide 38 Learning with Hidden Variables: Example (continued). P(Z) is re-estimated from the extended dataset of real plus pseudo-data (the slide shows the updated P(Z) table for undergrad, grad, and prof).

Slide 39 Learning with Hidden Variables: Example (continued). The conditional tables are likewise re-estimated from the extended dataset (the slide shows the updated facebook/thesis/grants table for each of undergrad, grad, and prof).

Slide 40 Learning with hidden variables (recap). What if some of your data is not completely observed? Method: 1. Estimate the parameters somehow or other. 2. Predict the unknown values from your estimate. 3. Add pseudo-data corresponding to these predictions, weighting each example by confidence in its correctness. 4. Re-estimate the parameters using the extended dataset (real + pseudo-data). 5. Repeat from step 2. (Same Z → X1, X2 network and data table as before.)

Slide 41 Aside: why does this work? Notation: X is the known data, Z the hidden data, θ the parameters; Q(z) is any distribution with Q(z) > 0. Ignoring the prior (i.e., taking the MLE view), we want to choose θ to maximize the log-likelihood of X.

Slide 42 Aside: why does this work? (continued). Same setup (ignore the prior; Q(z) > 0 is a pdf), with Q coming from an initial estimate of θ. The slide notes: this isn't really right though; what we're actually doing is captured by the bound derived on the next slides.

Slide 43 Aside: Jensen's inequality. (The figure plots log(x) and the chord between (x1, log(x1)) and (x2, log(x2)).) Claim: log(q1 x1 + q2 x2) ≥ q1 log(x1) + q2 log(x2), where q1 + q2 = 1; this holds for any concave ("downward-concave") function, not just log(x). Further: log(E_Q[X]) ≥ E_Q[log(X)].

Slide 44 Aside: why does this work? (continued). Since log(E_Q[X]) ≥ E_Q[log(X)], and Q can be any estimate of θ, say θ0, while θ' depends on X and Z but not directly on Q, we have P(X, Z, θ'|Q) = P(θ'|X, Z, Q) · P(X, Z|Q). So plugging in pseudo-data weighted by Q and finding the MLE optimizes a lower bound on the log-likelihood; the bound is written out below.
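For reference, the bound being sketched is the standard EM lower bound (a reconstruction in the slides' notation, since the equations themselves are images):

```latex
\log P(X \mid \theta)
  \;=\; \log \sum_{z} Q(z)\,\frac{P(X, z \mid \theta)}{Q(z)}
  \;\ge\; \sum_{z} Q(z)\,\log \frac{P(X, z \mid \theta)}{Q(z)}
  \;=\; \mathbb{E}_{Q}\big[\log P(X, Z \mid \theta)\big] + H(Q)
```

Maximizing the expected complete-data log-likelihood (the weighted pseudo-data MLE) therefore maximizes a lower bound on log P(X|θ), and the bound is tight when Q(z) = P(z|X, θ).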

Slide 45 Aside: why does this work? (conclusion). Since log(E_Q[X]) ≥ E_Q[log(X)], plugging in pseudo-data weighted by Q and finding the MLE optimizes a lower bound on the log-likelihood.

Slide 46 Learning with hidden variables (Expectation-Maximization, EM). What if some of your data is not completely observed? Method: 1. Estimate the parameters somehow or other. 2. Predict the unknown values from your estimated parameters (Expectation step). 3. Add pseudo-data corresponding to these predictions, weighting each example by confidence in its correctness. 4. Re-estimate the parameters using the extended dataset (real + pseudo-data), finding the MLE or MAP values of the parameters (Maximization step). 5. Repeat from step 2. EM maximizes a lower bound on the log-likelihood and will converge to a local maximum.

Slide 47 Summary. With Bayes nets we can infer Pr(Y|X); we can learn CPT parameters from data (samples from the joint distribution); this gives us a generative model, which we can use to design a classifier. Special case: naïve Bayes! Ok, so what? Show me something new!

Slide 48 Semi-supervised naïve Bayes. Given: a pool of labeled examples L and a (usually larger) pool of unlabeled examples U. Option 1 for using L and U: ignore U and use supervised learning on L. Option 2: ignore the labels in L+U and use k-means etc. to find clusters, then label each cluster using L. Question: can you use both L and U to do better?

Learning for Bayes nets → naïve Bayes, semi-supervised. The plate diagram has two copies of the naïve Bayes model: one over the DL labeled documents (Y observed) and one over the DU unlabeled documents (Y drawn as an open circle, i.e., a hidden variable), sharing the same parameters. Algorithm: naïve Bayes plus EM; a sketch follows.
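A minimal sketch of "naïve Bayes plus EM" over the two pools (an illustration, not lecture code; the label set, toy documents, and add-one smoothing are assumptions):

```python
# Semi-supervised naive Bayes via EM: L is labeled (label, words) pairs, U is word lists.
from collections import defaultdict

LABELS = ["onion", "economist"]                      # assumed label set for the sketch

def train(c_y, c_yw, vocab):
    """M step: add-one-smoothed estimates of Pr(Y) and the shared Pr(X|Y) table."""
    n = sum(c_y.values())
    p_y = {y: (c_y[y] + 1) / (n + len(LABELS)) for y in LABELS}
    tot = {y: sum(c_yw[y].values()) for y in LABELS}
    p_w = {y: {w: (c_yw[y][w] + 1) / (tot[y] + len(vocab)) for w in vocab}
           for y in LABELS}
    return p_y, p_w

def posterior(theta, words):
    """E step for one document: Pr(Y | words) under the current parameters."""
    p_y, p_w = theta
    score = {y: p_y[y] for y in LABELS}
    for w in words:
        for y in LABELS:
            score[y] *= p_w[y][w]
    s = sum(score.values())
    return {y: v / s for y, v in score.items()}

def semi_supervised_nb(L, U, iters=10):
    vocab = {w for _, ws in L for w in ws} | {w for ws in U for w in ws}
    soft = [{y: 1.0 / len(LABELS) for y in LABELS} for _ in U]   # start U uniform
    for _ in range(iters):
        c_y = defaultdict(float)
        c_yw = {y: defaultdict(float) for y in LABELS}
        for y, ws in L:                              # labeled pool: weight 1
            c_y[y] += 1
            for w in ws:
                c_yw[y][w] += 1
        for q, ws in zip(soft, U):                   # unlabeled pool: soft counts
            for y, wt in q.items():
                c_y[y] += wt
                for w in ws:
                    c_yw[y][w] += wt
        theta = train(c_y, c_yw, vocab)              # M step
        soft = [posterior(theta, ws) for ws in U]    # E step
    return theta

L = [("onion", ["aardvark", "ai"]), ("economist", ["zymurgy", "ai"])]
U = [["aardvark", "aardvark"], ["zymurgy"]]
theta = semi_supervised_nb(L, U)
print(posterior(theta, ["aardvark"]))
```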

Slide 50

Slide 51

Slide 52

Slide 53 Ok, so what? Show me something else new!

K-Means: Algorithm. 1. Decide on a value for k. 2. Initialize the k cluster centers (randomly, if necessary). 3. Repeat until no object changes its cluster assignment: decide the cluster memberships of the N objects by assigning each to the nearest cluster centroid, then re-estimate the k cluster centers, assuming the memberships found above are correct. (Slides: Bhavana Dalvi.) A code sketch follows.
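A minimal sketch of this loop for points represented as tuples of floats (an illustration, not lecture code):

```python
# K-means: random init, then alternate nearest-centroid assignment and mean updates.
import math
import random

def kmeans(points, k, iters=100, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)                 # step 2: random init
    assign = [None] * len(points)
    for _ in range(iters):                             # step 3: repeat
        new_assign = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]                 # nearest-centroid memberships
        if new_assign == assign:                       # stop when nothing changes
            break
        assign = new_assign
        for j in range(k):                             # re-estimate centers as means
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assign

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9)]
print(kmeans(pts, k=2))
```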

Slide 55 Mixture of Gaussians. (Slides: Bhavana Dalvi.)

Slide 56 A Bayes network for a mixture of Gaussians. Plate diagram: Z → X inside a plate of M examples; Z is the hidden mixture component, and Pr(X|Z=k) is a Gaussian, not a multinomial.

Slide 57 A closely related Bayes network for a mixture of Gaussians. Plate diagram: Z → X, with X inside a plate of D dimensions nested in a plate of M examples; Z is the hidden mixture component, and each Pr(X|Z=k) is a univariate Gaussian (one per dimension), not a multinomial.

Slide 58 Gaussian Mixture Models (GMMs). Consider a mixture of K Gaussian components, with a mixture proportion (prior) for each component and a Gaussian mixture component Pr(X|Z=k); the density is written out below. This model can be used for unsupervised clustering; fit by AutoClass, it has been used to discover new kinds of stars in astronomical data, etc.
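The mixture density in standard notation (reconstructed here, since the slide's formula is an image):

```latex
p(x) \;=\; \sum_{k=1}^{K} \pi_k \,\mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0,\quad \sum_{k=1}^{K} \pi_k = 1
```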

Slide 59 Expectation-Maximization (EM). Start: "guess" the mean and covariance of each of the K Gaussians. Loop.

Slide 60

Slide 61 Expectation-Maximization (EM). Start: "guess" the centroid and covariance of each of the K clusters. Loop.

Slide 62 The Expectation-Maximization (EM) Algorithm. E step: guess the values of the Z's.

Slide 63 The Expectation-Maximization (EM) Algorithm. M step: update the parameter estimates.

Slide 64 EM Algorithm for GMM. E step: guess the values of the Z's. M step: update the parameter estimates. The standard update equations are given below for reference.
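The standard E and M updates for a GMM with responsibilities γ_ik (a reconstruction in the usual notation, not transcribed from the slide images):

```latex
\text{E step: }\;
\gamma_{ik} = \frac{\pi_k\,\mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
                   {\sum_{k'} \pi_{k'}\,\mathcal{N}(x_i \mid \mu_{k'}, \Sigma_{k'})}
\qquad
\text{M step: }\;
\pi_k = \frac{1}{M}\sum_i \gamma_{ik},\quad
\mu_k = \frac{\sum_i \gamma_{ik}\, x_i}{\sum_i \gamma_{ik}},\quad
\Sigma_k = \frac{\sum_i \gamma_{ik}\,(x_i-\mu_k)(x_i-\mu_k)^{\top}}{\sum_i \gamma_{ik}}
```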

Slide 65 K-means is a hard version of EM. In the K-means "E-step" we do hard assignment: each point goes entirely to its nearest centroid. In the K-means "M-step" we update the means as the weighted sum of the data, but now the weights are 0 or 1.

Slide 66 Soft vs. hard EM assignments. (The slide compares GMM soft assignments with K-means hard assignments side by side.)

Slide 67 Questions. Which model is this: one multivariate Gaussian, or two univariate Gaussians? Which does k-means resemble most?

Recap: learning in Bayes nets. We're learning a density estimator: it maps x to Pr(x|θ), and the data D is a set of unlabeled examples. When we learn a classifier (like naïve Bayes) we're doing it indirectly: one of the X's is a designated class variable, and we infer its value at test time.

Recap: learning in Bayes nets. Simple case: all variables are observed; we just estimate the CPTs from the data using a MAP estimate. Harder case: some variables are hidden; we can use EM: repeatedly alternate between θ and Z, finding expectations over Z with inference and maximizing θ given Z with a MAP estimate over the pseudo-data.

Recap: learning in Bayes nets. Simple case: all variables are observed; we just estimate the CPTs from the data using a MAP estimate. Harder case: some variables are hidden. Practical applications of this: medical diagnosis (an expert builds the structure, CPTs are learned from data), …

Recap: learning in Bayes nets. Special cases discussed today: supervised multinomial naïve Bayes, semi-supervised multinomial naïve Bayes, and unsupervised mixtures of Gaussians ("soft k-means"). Special case discussed next: hidden Markov models.

Some things to think about:

P(X|Y)       | Y observed           | Y mixed         | Y hidden
Multinomial  | naïve Bayes          | SS naïve Bayes  | mixture of multinomials
Gaussian     | Gaussian naïve Bayes |                 | mixture of Gaussians