1
Directed Graphical Probabilistic Models:
the sequel. William W. Cohen, Machine Learning, Feb 25
2
Directed Graphical Probabilistic Models: the son of the child of the bride of the sequel
William W. Cohen, Machine Learning 10-601, Feb 27 2008
3
Outline
- Quick recap
- An example of learning: given the structure, find CPTs from "fully observed" data
- Some interesting special cases of this
- Learning with hidden variables: expectation-maximization
- Handwave argument for why EM works
4
The story so far: Bayes nets
Many problems can be solved using the joint probability P(X1,…,Xn); Bayes nets describe a way to write the joint compactly. [Figure: the Monty Hall network, with nodes A "first guess", B "the money", C "the goat", D "stick or swap?", E "second guess", CPTs such as P(A), P(B), P(C|A,B), P(E|A,C,D), and a note that the missing edges encode conditional independence.]
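For reference, the factorization any Bayes net encodes (the standard form, stated here rather than copied from the slide):

```latex
P(X_1,\dots,X_n) \;=\; \prod_{i=1}^{n} P\big(X_i \mid \mathrm{parents}(X_i)\big)
```

Reading the CPTs off the slide above (and assuming D has no parents), the Monty Hall joint would factor as P(A) P(B) P(C|A,B) P(D) P(E|A,C,D).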
5
The story so far: d-separation
[Figure: X, Y, evidence E, and the three blocking configurations, each through an intermediate node Z.] There are three ways a path from X to Y can be blocked given evidence E: a chain through an observed node, a common cause that is observed, or a common effect ("v-structure") that is unobserved and has no observed descendants. X is d-separated from Y given E iff all paths from X to Y are blocked given E. If X is d-separated from Y given E, then I<X,E,Y>, i.e., X is conditionally independent of Y given E.
6
The story so far: “Explaining away”
[Table: for a v-structure X → E ← Y, the slide tabulates P(E|X,Y) and P(E,X,Y) for the settings of X and Y (values such as 0.96, 0.04, 0.24, 0.01) together with the posterior P(X,Y|E=1) (values such as 0.014 and 0.329): once E=1 is observed, X and Y are no longer independent.]
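The posterior column follows from Bayes' rule on the v-structure (using whatever priors P(X) and P(Y) the original slide assumed):

```latex
P(X, Y \mid E{=}1) \;=\; \frac{P(E{=}1 \mid X, Y)\,P(X)\,P(Y)}{P(E{=}1)},
```

so X and Y are marginally independent but become dependent once E is observed, which is exactly the "explaining away" effect.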
7
Recap: Inference in linear chain networks
[Figure: a linear chain X1 → … → Xj → … → Xn, split at Xj into a "forward" part and a "backward" part.] Instead of recursion you can use "message passing" (forward-backward, Baum-Welch)…
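For the HMM special case (hidden states Z_t emitting observations x_t), the forward and backward messages alluded to here take the standard form:

```latex
\alpha_t(z) = P(x_{1:t}, Z_t{=}z) = P(x_t \mid z)\sum_{z'} \alpha_{t-1}(z')\,P(z \mid z'),
\qquad
\beta_t(z) = P(x_{t+1:T} \mid Z_t{=}z) = \sum_{z'} P(z' \mid z)\,P(x_{t+1} \mid z')\,\beta_{t+1}(z'),
```

with β_T(z) = 1, and then P(Z_t=z | x_{1:T}) ∝ α_t(z)·β_t(z).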
8
Recap: Inference in polytrees
Reduce P(X|E) to the product of two recursively calculated parts: P(X=x|E+), i.e., the CPT for X combined with the product of "forward" messages from X's parents, and P(E-|X=x), i.e., the combination of "backward" messages from X's children, CPTs, and P(Z|EZ\Yk), a simpler instance of P(X|E). This can also be implemented by message passing (belief propagation).
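In symbols, this is the standard polytree decomposition (written here for reference):

```latex
P(X{=}x \mid E) \;\propto\; P(x \mid E^{+})\; P(E^{-} \mid x),
```

where E+ is the evidence connected to X through its parents and E− is the evidence connected to X through its children.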
9
Recap: Learning for Bayes nets
Input: a sample of the joint, plus the graph structure over the variables (for i = 1, …, N you know Xi and parents(Xi)). Output: estimated CPTs. Method (discrete variables): estimate each CPT independently, using MLE or MAP. [Figure: the example network over A, B, C, D, E with CPT tables such as P(B) and P(C|A,B).]
10
Recap: Learning for Bayes nets
Method (discrete variables): estimate each CPT independently, using MLE or MAP. [Figure: the same network and CPT tables, annotated with the MAP estimate.]
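A standard form of these estimates for the discrete case (assuming the MAP uses a symmetric Dirichlet prior with pseudocount m, which may differ in detail from the slide's formula):

```latex
\hat{P}_{\mathrm{MLE}}(X_i{=}x \mid \mathrm{pa}_i{=}u) = \frac{\#\{X_i{=}x,\ \mathrm{pa}_i{=}u\}}{\#\{\mathrm{pa}_i{=}u\}},
\qquad
\hat{P}_{\mathrm{MAP}}(X_i{=}x \mid \mathrm{pa}_i{=}u) = \frac{\#\{X_i{=}x,\ \mathrm{pa}_i{=}u\} + m}{\#\{\mathrm{pa}_i{=}u\} + m\,|\mathrm{Val}(X_i)|}
```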
11
Recap: A detailed example
[Figure: the network Z → X, Z → Y. Z ∈ {undergrad, grad, prof}, X is an age bracket (<20, 20s, 30+), Y is an activity (facebook, thesis, grants). The dataset D is a list of (Z, X, Y) triples such as (ugrad, <20, facebook), and the CPTs P(Z), P(X|Z), P(Y|Z) are to be estimated from D.]
12
A detailed example [Figure: the same network and dataset D; counting values of Z gives the first entries of P(Z), e.g. P(undergrad) = 0.375 and P(prof) = 0.250, while the P(X|Z) and P(Y|Z) tables are still empty.]
13
A detailed example [Figure: P(Z) as before (undergrad 0.375, prof 0.250); the P(X|Z) table is now partially filled in (entries such as .4, .2, .6, .25, .5), while P(Y|Z) is still empty.]
14
A detailed example. Now we're done learning: what can we do with this? Guess what your favorite professor is doing now? Given a new (x, y), compute P(prof|x,y), P(grad|x,y), P(ugrad|x,y) using Bayes net inference, and predict the most likely "label" Z. [Figure: the completed CPTs P(Z), P(X|Z), P(Y|Z), with entries such as P(undergrad) = 0.375 and P(facebook|undergrad) = .6.] Of course, we need to implement our Bayes net inference method first…
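For this particular structure (Z → X and Z → Y), the inference needed for prediction is just Bayes' rule:

```latex
P(Z{=}z \mid x, y) \;=\; \frac{P(z)\,P(x \mid z)\,P(y \mid z)}{\sum_{z'} P(z')\,P(x \mid z')\,P(y \mid z')}.
```

Predicting the most likely label is then the argmax over z of the numerator.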
15
A more interesting example
[Figure: a naive Bayes model for text. A class node C (e.g., SATIRE vs. NEWS) points to word nodes W1, W2, …, WN, and every word CPT P(Wi|C) is the same table (entries such as P(aardvark|SATIRE) = .154 and P(Basra|SATIRE) = .0001).] Parameters are "shared" or "tied" across positions, which is drawn compactly as a "plate": a single node Wi inside a box labeled N.
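Written out, tying the word CPTs gives the familiar naive Bayes joint (standard form, stated here for reference):

```latex
P(C, W_1, \dots, W_N) \;=\; P(C)\prod_{i=1}^{N} P(W_i \mid C),
\qquad
P(W_i{=}w \mid C{=}c) = \theta_{w \mid c} \ \text{ for every position } i.
```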
16
Some special cases of Bayes net learning
- Naïve Bayes
- HMMs for biology and information extraction
- Tree-augmented Naïve Bayes
17
Another interesting example
A phylogenomic analysis of the Actinomycetales mce operons
18
Another interesting example
[Figure: a chain-structured model for a DNA sequence. Hidden position states Z1 → Z2 → Z3 → Z4 → … each emit a nucleotide Xi; the emission CPTs P(Xi|Zi) are distributions over {A, T, C, G} (e.g., at pos1: A 0.05, T 0.2, G 0.7; at pos2: G 0.95; at pos3: T 0.58), the transitions are deterministic (P(Z2=pos2|Z1=pos1) = 1.0, P(Z3=pos3|Z2=pos2) = 1.0, …), and P(Z1) puts probability 1.0 on the start position.]
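The joint for this chain factorizes the same way as any Bayes net; for n positions it is the usual hidden-chain form:

```latex
P(Z_{1:n}, X_{1:n}) \;=\; P(Z_1)\prod_{i=2}^{n} P(Z_i \mid Z_{i-1})\;\prod_{i=1}^{n} P(X_i \mid Z_i).
```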
19
Another interesting example
[Figure: the same chain extended to a pattern G(T|A|G) in which one position is "optional": the transition table now branches (e.g., P(Z2=pos2|Z1=pos1) = 0.5 and P(Z2=pos4|Z1=pos1) = 0.5), and the emission CPTs are tied: P(X2|Z2=pos4) = P(X4|Z4=pos4), and more generally P(Xi|Zi=pos4) = P(Xj|Zj=pos4).]
20
Another interesting example
[Figure: the same chain with "optional" positions p1, p2, p3, p4, branching transitions of probability 0.5, and tied emission CPTs P(Xi|Zi=pos4) = P(Xj|Zj=pos4).] The model is specified by three tables: P(posj|posi) for all i, j (aka the transition probabilities), P(x|posi) for all x, i (aka the emission probabilities), and P(Z1=posi) (the initial/prior probabilities).
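A minimal sketch of inference with these three tables; the numpy-based forward pass, the function names, and the toy numbers are my own illustration, not code or values from the lecture:

```python
import numpy as np

def forward(init, trans, emit, observations):
    """Forward algorithm for an HMM given its three parameter tables.

    init:  init[i]     = P(Z1 = pos_i)                    (initial/prior probabilities)
    trans: trans[i, j] = P(Z_{t+1} = pos_j | Z_t = pos_i) (transition probabilities)
    emit:  emit[i, x]  = P(X_t = x | Z_t = pos_i)         (emission probabilities)
    observations: sequence of observation indices x_1, ..., x_T

    Returns P(x_1, ..., x_T), the likelihood of the observation sequence.
    """
    alpha = init * emit[:, observations[0]]       # alpha_1(i) = P(x_1, Z_1 = i)
    for x in observations[1:]:
        # alpha_t(j) = sum_i alpha_{t-1}(i) * trans[i, j] * emit[j, x_t]
        alpha = (alpha @ trans) * emit[:, x]
    return alpha.sum()

# Hypothetical toy numbers (not the slide's): 2 states, 4 symbols (A, T, C, G = 0..3).
init  = np.array([1.0, 0.0])
trans = np.array([[0.5, 0.5],
                  [0.0, 1.0]])
emit  = np.array([[0.05, 0.20, 0.05, 0.70],
                  [0.01, 0.02, 0.02, 0.95]])
print(forward(init, trans, emit, [3, 3]))         # P(observing G, G)
```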
21
Another interesting example
22
IE by text segmentation
Example: addresses and bibliography records.
Address: "4089 Whispering Pines Nobel Drive San Diego CA 92122", segmented into House number (4089), Building (Whispering Pines), Road (Nobel Drive), City (San Diego), State (CA), Zip (92122).
Citation: "P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115", segmented into Author, Year, Title, Journal, Volume, Page.
Author, title, year, … are like the "positions" in the previous example.
23
IE with Hidden Markov Models
HMMs for IE. Note: we know how to train this model from segmented citations. [Figure: an HMM with states such as Title, Author, Journal, Year; the edges carry transition probabilities (e.g., 0.6, 0.3, 0.1) and each state has emission probabilities over tokens (e.g., "dddd", "dd").] Probabilistic transitions and outputs make the model more robust to errors and slight variations.
24
[Figure: an emission table P(X|Z) for segment labels such as House number, Apt., and Road, with entries like "ave" .1 and "forbes" .05 under Road, and near-zero probabilities for out-of-place words such as "aardvark" and "zymurgy".] Basic idea of E/M: plug in expectations, update the thetas, continue…
25
Results: Comparative Evaluation
[Table: datasets used in the evaluation, with the number of instances and elements (fields) in each: IITB student addresses, 2388 instances, 17 elements; Company addresses, 769 instances, 6 elements; US addresses, 740 instances.] The Nested model does best in all three cases (from Borkar et al., 2001).
26
Learning with hidden variables
Hidden variables: what if some of your data is not completely observed? [Figure: the Z → X, Y network, with a data row whose Z value is missing ("?").] Method (expectation-maximization, aka EM):
1. Estimate parameters somehow or other.
2. Predict the unknown values from your estimate.
3. Add pseudo-data corresponding to these predictions, weighting each example by confidence in its correctness.
4. Re-estimate parameters using the extended dataset (real + pseudo-data).
5. Repeat starting at step 2…
Steps 2-3 are the "expectation" part; step 4 is the "maximization" part (MLE/MAP).
27
Learning with Hidden Variables: Example
[Figure: the Z → X, Y network and the dataset, which now includes a row with Z missing ("?", X = 30s), together with initial parameter estimates: P(Z=undergrad) = 0.333 and the P(X|Z) and P(Y|Z) tables (entries such as .4, .2, .6, .25, .5).]
28
Learning with Hidden Variables: Example
[Figure: the same network, dataset (including the row with Z = "?"), and initial parameter estimates as on the previous slide.]
29
Learning with Hidden Variables: Example
[Figure: the same example, with the dataset shown twice alongside the current estimates: the original rows plus the incomplete row being filled in with predicted values of Z.]
30
Learning with Hidden Variables: Example
[Figure: the extended dataset (real rows plus pseudo-data) and empty CPT tables P(Z), P(X|Z), P(Y|Z), ready to be re-estimated.]
31
Learning with Hidden Variables: Example
[Figure: the re-estimated P(Z) from the extended dataset: roughly .38 (undergrad), .35 (grad), .27 (prof).]
32
Learning with Hidden Variables: Example
[Figure: the re-estimated P(Y|Z) table from the extended dataset, with updated entries such as .24, .32, and .54.]
33
Learning with hidden variables
Hidden variables: what if some of your data is not completely observed? [Figure: the Z → X, Y network and the dataset with a missing Z, as before.] Method, recapped (a code sketch follows below):
1. Estimate parameters somehow or other.
2. Predict the unknown values from your estimate.
3. Add pseudo-data corresponding to these predictions, weighting each example by confidence in its correctness.
4. Re-estimate parameters using the extended dataset (real + pseudo-data).
5. Repeat starting at step 2…
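A minimal, runnable sketch of the loop above for the Z → (X, Y) example, using probability-weighted pseudo-data as in steps 2-4; the toy rows, the function names, and the smoothing constant are my own illustration, not taken from the lecture:

```python
from collections import defaultdict

# Toy (z, x, y) rows; z is None when hidden.  Hypothetical data, not the slide's dataset.
data = [("ugrad", "<20", "facebook"), ("grad", "20s", "thesis"),
        ("prof", "30+", "grants"), (None, "30+", "grants")]
Z_VALS = ["ugrad", "grad", "prof"]

def normalize(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def m_step(weighted_rows):
    """Re-estimate P(Z), P(X|Z), P(Y|Z) by MLE from (weight, z, x, y) pseudo-data."""
    cz = defaultdict(float)
    cxz = defaultdict(lambda: defaultdict(float))
    cyz = defaultdict(lambda: defaultdict(float))
    for w, z, x, y in weighted_rows:
        cz[z] += w
        cxz[z][x] += w
        cyz[z][y] += w
    return (normalize(cz),
            {z: normalize(c) for z, c in cxz.items()},
            {z: normalize(c) for z, c in cyz.items()})

def e_step(pz, pxz, pyz):
    """Expand each row into weighted pseudo-rows; a hidden z gets weight P(z | x, y)."""
    rows = []
    for z, x, y in data:
        if z is not None:
            rows.append((1.0, z, x, y))
        else:
            post = {zv: pz.get(zv, 0.0) * pxz[zv].get(x, 1e-6) * pyz[zv].get(y, 1e-6)
                    for zv in Z_VALS}
            total = sum(post.values())
            rows += [(p / total, zv, x, y) for zv, p in post.items()]
    return rows

# Step 1: initialize from the fully observed rows; then alternate steps 2-4 (E and M).
params = m_step([(1.0, z, x, y) for z, x, y in data if z is not None])
for _ in range(10):
    params = m_step(e_step(*params))
print(params[0])  # final estimate of P(Z)
```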
34
Why E/M works: the proof. Ignore the prior (work with the MLE). Let Q(z) be any pdf with Q(z) > 0, and write the log-likelihood by summing out the hidden z:
log P(X|θ) = log Σ_z P(X, z|θ) = log Σ_z Q(z) · P(X, z|θ) / Q(z).
35
Why E/M works: the proof (continued). As before: ignore the prior, Q(z) is a pdf with Q(z) > 0, and θ0 is an initial estimate of θ. The next step bounds log Σ_z Q(z) · P(X, z|θ)/Q(z) from below using Jensen's inequality (next slide).
36
Jensen's inequality. Claim: log(q1·x1 + q2·x2) ≥ q1·log(x1) + q2·log(x2), where q1 + q2 = 1 and q1, q2 ≥ 0. This holds for any concave function, not just log(x). Further: log(EQ[X]) ≥ EQ[log(X)]. [Figure: the log curve with the chord between (x1, log(x1)) and (x2, log(x2)); at q1·x1 + q2·x2 the curve lies above the chord, illustrating the inequality.]
37
Why E/M works: the proof (continued). Here Q is built from some estimate of θ, say θ0. Since log(EQ[X]) ≥ EQ[log(X)]:
log P(X|θ) = log Σ_z Q(z) · P(X, z|θ)/Q(z) ≥ Σ_z Q(z) · log( P(X, z|θ) / Q(z) ).
Also, θ' depends on X and Z but not directly on Q, so P(X, Z, θ'|Q) = P(θ'|X, Z, Q) · P(X, Z|Q). So, plugging in pseudo-data weighted by Q and finding the MLE optimizes a lower bound on the log-likelihood.