1
Directed Graphical Probabilistic Models:
the sequel. William W. Cohen, Machine Learning, Feb 25
2
Directed Graphical Probabilistic Models: the son of the child of the bride of the sequel
William W. Cohen, Machine Learning 10-601, Feb 27 2008
3
Outline
- Quick recap
- An example of learning: given the structure, find CPTs from "fully observed" data
- Some interesting special cases of this
- Learning with hidden variables: expectation-maximization
- Handwave argument for why EM works
4
The story so far: Bayes nets
Many problems can be solved using the joint probability P(X1,…,Xn); Bayes nets describe a way to write the joint compactly. [Figure: the Monty Hall network, with nodes A "first guess", B "the money", C "the goat", D "stick or swap?", E "second guess", CPTs such as P(A), P(B), P(C|A,B), P(E|A,C,D), and a note that the missing edges encode conditional independence.]
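For reference, the factorization any Bayes net encodes (the standard form, stated here rather than copied from the slide):

```latex
P(X_1,\dots,X_n) \;=\; \prod_{i=1}^{n} P\big(X_i \mid \mathrm{parents}(X_i)\big)
```

Reading the CPTs off the slide above (and assuming D has no parents), the Monty Hall joint would factor as P(A) P(B) P(C|A,B) P(D) P(E|A,C,D).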
5
The story so far: d-separation
[Figure: X, Y, evidence E, and the three blocking configurations, each through an intermediate node Z.] There are three ways a path from X to Y can be blocked given evidence E: a chain through an observed node, a common cause that is observed, or a common effect ("v-structure") that is unobserved and has no observed descendants. X is d-separated from Y given E iff all paths from X to Y are blocked given E. If X is d-separated from Y given E, then I<X,E,Y>, i.e., X is conditionally independent of Y given E.
6
The story so far: “Explaining away”
[Table: for a v-structure X → E ← Y, the slide tabulates P(E|X,Y) and P(E,X,Y) for the settings of X and Y (values such as 0.96, 0.04, 0.24, 0.01) together with the posterior P(X,Y|E=1) (values such as 0.014 and 0.329): once E=1 is observed, X and Y are no longer independent.]
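The posterior column follows from Bayes' rule on the v-structure (using whatever priors P(X) and P(Y) the original slide assumed):

```latex
P(X, Y \mid E{=}1) \;=\; \frac{P(E{=}1 \mid X, Y)\,P(X)\,P(Y)}{P(E{=}1)},
```

so X and Y are marginally independent but become dependent once E is observed, which is exactly the "explaining away" effect.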
7
Recap: Inference in linear chain networks
[Figure: a linear chain X1 → … → Xj → … → Xn, split at Xj into a "forward" part and a "backward" part.] Instead of recursion you can use "message passing" (forward-backward, Baum-Welch)…
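For the HMM special case (hidden states Z_t emitting observations x_t), the forward and backward messages alluded to here take the standard form:

```latex
\alpha_t(z) = P(x_{1:t}, Z_t{=}z) = P(x_t \mid z)\sum_{z'} \alpha_{t-1}(z')\,P(z \mid z'),
\qquad
\beta_t(z) = P(x_{t+1:T} \mid Z_t{=}z) = \sum_{z'} P(z' \mid z)\,P(x_{t+1} \mid z')\,\beta_{t+1}(z'),
```

with β_T(z) = 1, and then P(Z_t=z | x_{1:T}) ∝ α_t(z)·β_t(z).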
8
Recap: Inference in polytrees
Reduce P(X|E) to the product of two recursively calculated parts: P(X=x|E+), i.e., the CPT for X combined with the product of "forward" messages from X's parents, and P(E-|X=x), i.e., the combination of "backward" messages from X's children, CPTs, and P(Z|EZ\Yk), a simpler instance of P(X|E). This can also be implemented by message passing (belief propagation).
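In symbols, this is the standard polytree decomposition (written here for reference):

```latex
P(X{=}x \mid E) \;\propto\; P(x \mid E^{+})\; P(E^{-} \mid x),
```

where E+ is the evidence connected to X through its parents and E− is the evidence connected to X through its children.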
9
Recap: Learning for Bayes nets
Input: a sample of the joint, plus the graph structure over the variables (for i = 1, …, N you know Xi and parents(Xi)). Output: estimated CPTs. Method (discrete variables): estimate each CPT independently, using MLE or MAP. [Figure: the example network over A, B, C, D, E with CPT tables such as P(B) and P(C|A,B).]
10
Recap: Learning for Bayes nets
Method (discrete variables): estimate each CPT independently, using MLE or MAP. [Figure: the same network and CPT tables, annotated with the MAP estimate.]
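A standard form of these estimates for the discrete case (assuming the MAP uses a symmetric Dirichlet prior with pseudocount m, which may differ in detail from the slide's formula):

```latex
\hat{P}_{\mathrm{MLE}}(X_i{=}x \mid \mathrm{pa}_i{=}u) = \frac{\#\{X_i{=}x,\ \mathrm{pa}_i{=}u\}}{\#\{\mathrm{pa}_i{=}u\}},
\qquad
\hat{P}_{\mathrm{MAP}}(X_i{=}x \mid \mathrm{pa}_i{=}u) = \frac{\#\{X_i{=}x,\ \mathrm{pa}_i{=}u\} + m}{\#\{\mathrm{pa}_i{=}u\} + m\,|\mathrm{Val}(X_i)|}
```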
11
Recap: A detailed example
[Figure: the network Z → X, Z → Y. Z ∈ {undergrad, grad, prof}, X is an age bracket (<20, 20s, 30+), Y is an activity (facebook, thesis, grants). The dataset D is a list of (Z, X, Y) triples such as (ugrad, <20, facebook), and the CPTs P(Z), P(X|Z), P(Y|Z) are to be estimated from D.]
12
A detailed example [Figure: the same network and dataset D; counting values of Z gives the first entries of P(Z), e.g. P(undergrad) = 0.375 and P(prof) = 0.250, while the P(X|Z) and P(Y|Z) tables are still empty.]
13
A detailed example [Figure: P(Z) as before (undergrad 0.375, prof 0.250); the P(X|Z) table is now partially filled in (entries such as .4, .2, .6, .25, .5), while P(Y|Z) is still empty.]
14
A detailed example. Now we're done learning: what can we do with this? Guess what your favorite professor is doing now? Given a new (x, y), compute P(prof|x,y), P(grad|x,y), P(ugrad|x,y) using Bayes net inference, and predict the most likely "label" Z. [Figure: the completed CPTs P(Z), P(X|Z), P(Y|Z), with entries such as P(undergrad) = 0.375 and P(facebook|undergrad) = .6.] Of course, we need to implement our Bayes net inference method first…
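For this particular structure (Z → X and Z → Y), the inference needed for prediction is just Bayes' rule:

```latex
P(Z{=}z \mid x, y) \;=\; \frac{P(z)\,P(x \mid z)\,P(y \mid z)}{\sum_{z'} P(z')\,P(x \mid z')\,P(y \mid z')}.
```

Predicting the most likely label is then the argmax over z of the numerator.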
15
A more interesting example
[Figure: a naive Bayes model for text. A class node C (e.g., SATIRE vs. NEWS) points to word nodes W1, W2, …, WN, and every word CPT P(Wi|C) is the same table (entries such as P(aardvark|SATIRE) = .154 and P(Basra|SATIRE) = .0001).] Parameters are "shared" or "tied" across positions, which is drawn compactly as a "plate": a single node Wi inside a box labeled N.
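Written out, tying the word CPTs gives the familiar naive Bayes joint (standard form, stated here for reference):

```latex
P(C, W_1, \dots, W_N) \;=\; P(C)\prod_{i=1}^{N} P(W_i \mid C),
\qquad
P(W_i{=}w \mid C{=}c) = \theta_{w \mid c} \ \text{ for every position } i.
```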
16
Some special cases of Bayes net learning
- Naïve Bayes
- HMMs for biology and information extraction
- Tree-augmented Naïve Bayes
17
Another interesting example
A phylogenomic analysis of the Actinomycetales mce operons
18
Another interesting example
[Figure: a chain-structured model for a DNA sequence. Hidden position states Z1 → Z2 → Z3 → Z4 → … each emit a nucleotide Xi; the emission CPTs P(Xi|Zi) are distributions over {A, T, C, G} (e.g., at pos1: A 0.05, T 0.2, G 0.7; at pos2: G 0.95; at pos3: T 0.58), the transitions are deterministic (P(Z2=pos2|Z1=pos1) = 1.0, P(Z3=pos3|Z2=pos2) = 1.0, …), and P(Z1) puts probability 1.0 on the start position.]
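The joint for this chain factorizes the same way as any Bayes net; for n positions it is the usual hidden-chain form:

```latex
P(Z_{1:n}, X_{1:n}) \;=\; P(Z_1)\prod_{i=2}^{n} P(Z_i \mid Z_{i-1})\;\prod_{i=1}^{n} P(X_i \mid Z_i).
```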
19
Another interesting example
[Figure: the same chain extended to a pattern G(T|A|G) in which one position is "optional": the transition table now branches (e.g., P(Z2=pos2|Z1=pos1) = 0.5 and P(Z2=pos4|Z1=pos1) = 0.5), and the emission CPTs are tied: P(X2|Z2=pos4) = P(X4|Z4=pos4), and more generally P(Xi|Zi=pos4) = P(Xj|Zj=pos4).]
20
Another interesting example
[Figure: the same chain with "optional" positions p1, p2, p3, p4, branching transitions of probability 0.5, and tied emission CPTs P(Xi|Zi=pos4) = P(Xj|Zj=pos4).] The model is specified by three tables: P(posj|posi) for all i, j (aka the transition probabilities), P(x|posi) for all x, i (aka the emission probabilities), and P(Z1=posi) (the initial/prior probabilities).
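A minimal sketch of inference with these three tables; the numpy-based forward pass, the function names, and the toy numbers are my own illustration, not code or values from the lecture:

```python
import numpy as np

def forward(init, trans, emit, observations):
    """Forward algorithm for an HMM given its three parameter tables.

    init:  init[i]     = P(Z1 = pos_i)                    (initial/prior probabilities)
    trans: trans[i, j] = P(Z_{t+1} = pos_j | Z_t = pos_i) (transition probabilities)
    emit:  emit[i, x]  = P(X_t = x | Z_t = pos_i)         (emission probabilities)
    observations: sequence of observation indices x_1, ..., x_T

    Returns P(x_1, ..., x_T), the likelihood of the observation sequence.
    """
    alpha = init * emit[:, observations[0]]       # alpha_1(i) = P(x_1, Z_1 = i)
    for x in observations[1:]:
        # alpha_t(j) = sum_i alpha_{t-1}(i) * trans[i, j] * emit[j, x_t]
        alpha = (alpha @ trans) * emit[:, x]
    return alpha.sum()

# Hypothetical toy numbers (not the slide's): 2 states, 4 symbols (A, T, C, G = 0..3).
init  = np.array([1.0, 0.0])
trans = np.array([[0.5, 0.5],
                  [0.0, 1.0]])
emit  = np.array([[0.05, 0.20, 0.05, 0.70],
                  [0.01, 0.02, 0.02, 0.95]])
print(forward(init, trans, emit, [3, 3]))         # P(observing G, G)
```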
21
Another interesting example
22
IE by text segmentation
Example: addresses and bibliography records.
Address: "4089 Whispering Pines Nobel Drive San Diego CA 92122", segmented into House number (4089), Building (Whispering Pines), Road (Nobel Drive), City (San Diego), State (CA), Zip (92122).
Citation: "P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115", segmented into Author, Year, Title, Journal, Volume, Page.
Author, title, year, … are like the "positions" in the previous example.
23
IE with Hidden Markov Models
HMMs for IE. Note: we know how to train this model from segmented citations. [Figure: an HMM with states such as Title, Author, Journal, Year; the edges carry transition probabilities (e.g., 0.6, 0.3, 0.1) and each state has emission probabilities over tokens (e.g., "dddd", "dd").] Probabilistic transitions and outputs make the model more robust to errors and slight variations.
24
[Figure: an emission table P(X|Z) for segment labels such as House number, Apt., and Road, with entries like "ave" .1 and "forbes" .05 under Road, and near-zero probabilities for out-of-place words such as "aardvark" and "zymurgy".] Basic idea of E/M: plug in expectations, update the thetas, continue…
25
Results: Comparative Evaluation
[Table: datasets used in the evaluation, with the number of instances and elements (fields) in each: IITB student addresses, 2388 instances, 17 elements; Company addresses, 769 instances, 6 elements; US addresses, 740 instances.] The Nested model does best in all three cases (from Borkar et al., 2001).
26
Learning with hidden variables
Hidden variables: what if some of your data is not completely observed? [Figure: the Z → X, Y network, with a data row whose Z value is missing ("?").] Method (expectation-maximization, aka EM):
1. Estimate parameters somehow or other.
2. Predict the unknown values from your estimate.
3. Add pseudo-data corresponding to these predictions, weighting each example by confidence in its correctness.
4. Re-estimate parameters using the extended dataset (real + pseudo-data).
5. Repeat starting at step 2…
Steps 2-3 are the "expectation" part; step 4 is the "maximization" part (MLE/MAP).
27
Learning with Hidden Variables: Example
[Figure: the Z → X, Y network and the dataset, which now includes a row with Z missing ("?", X = 30s), together with initial parameter estimates: P(Z=undergrad) = 0.333 and the P(X|Z) and P(Y|Z) tables (entries such as .4, .2, .6, .25, .5).]
28
Learning with Hidden Variables: Example
[Figure: the same network, dataset (including the row with Z = "?"), and initial parameter estimates as on the previous slide.]
29
Learning with Hidden Variables: Example
[Figure: the same example, with the dataset shown twice alongside the current estimates: the original rows plus the incomplete row being filled in with predicted values of Z.]
30
Learning with Hidden Variables: Example
[Figure: the extended dataset (real rows plus pseudo-data) and empty CPT tables P(Z), P(X|Z), P(Y|Z), ready to be re-estimated.]
31
Learning with Hidden Variables: Example
[Figure: the re-estimated P(Z) from the extended dataset: roughly .38 (undergrad), .35 (grad), .27 (prof).]
32
Learning with Hidden Variables: Example
[Figure: the re-estimated P(Y|Z) table from the extended dataset, with updated entries such as .24, .32, and .54.]
33
Learning with hidden variables
Hidden variables: what if some of your data is not completely observed? [Figure: the Z → X, Y network and the dataset with a missing Z, as before.] Method, recapped (a code sketch follows below):
1. Estimate parameters somehow or other.
2. Predict the unknown values from your estimate.
3. Add pseudo-data corresponding to these predictions, weighting each example by confidence in its correctness.
4. Re-estimate parameters using the extended dataset (real + pseudo-data).
5. Repeat starting at step 2…
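A minimal, runnable sketch of the loop above for the Z → (X, Y) example, using probability-weighted pseudo-data as in steps 2-4; the toy rows, the function names, and the smoothing constant are my own illustration, not taken from the lecture:

```python
from collections import defaultdict

# Toy (z, x, y) rows; z is None when hidden.  Hypothetical data, not the slide's dataset.
data = [("ugrad", "<20", "facebook"), ("grad", "20s", "thesis"),
        ("prof", "30+", "grants"), (None, "30+", "grants")]
Z_VALS = ["ugrad", "grad", "prof"]

def normalize(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def m_step(weighted_rows):
    """Re-estimate P(Z), P(X|Z), P(Y|Z) by MLE from (weight, z, x, y) pseudo-data."""
    cz = defaultdict(float)
    cxz = defaultdict(lambda: defaultdict(float))
    cyz = defaultdict(lambda: defaultdict(float))
    for w, z, x, y in weighted_rows:
        cz[z] += w
        cxz[z][x] += w
        cyz[z][y] += w
    return (normalize(cz),
            {z: normalize(c) for z, c in cxz.items()},
            {z: normalize(c) for z, c in cyz.items()})

def e_step(pz, pxz, pyz):
    """Expand each row into weighted pseudo-rows; a hidden z gets weight P(z | x, y)."""
    rows = []
    for z, x, y in data:
        if z is not None:
            rows.append((1.0, z, x, y))
        else:
            post = {zv: pz.get(zv, 0.0) * pxz[zv].get(x, 1e-6) * pyz[zv].get(y, 1e-6)
                    for zv in Z_VALS}
            total = sum(post.values())
            rows += [(p / total, zv, x, y) for zv, p in post.items()]
    return rows

# Step 1: initialize from the fully observed rows; then alternate steps 2-4 (E and M).
params = m_step([(1.0, z, x, y) for z, x, y in data if z is not None])
for _ in range(10):
    params = m_step(e_step(*params))
print(params[0])  # final estimate of P(Z)
```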
34
Why E/M works: the proof. Ignore the prior (work with the MLE). Let Q(z) be any pdf with Q(z) > 0, and write the log-likelihood by summing out the hidden z:
log P(X|θ) = log Σ_z P(X, z|θ) = log Σ_z Q(z) · P(X, z|θ) / Q(z).
35
Why E/M works: the proof (continued). As before: ignore the prior, Q(z) is a pdf with Q(z) > 0, and θ0 is an initial estimate of θ. The next step bounds log Σ_z Q(z) · P(X, z|θ)/Q(z) from below using Jensen's inequality (next slide).
36
Jensen's inequality. Claim: log(q1·x1 + q2·x2) ≥ q1·log(x1) + q2·log(x2), where q1 + q2 = 1 and q1, q2 ≥ 0. This holds for any concave function, not just log(x). Further: log(EQ[X]) ≥ EQ[log(X)]. [Figure: the log curve with the chord between (x1, log(x1)) and (x2, log(x2)); at q1·x1 + q2·x2 the curve lies above the chord, illustrating the inequality.]
37
Why E/M works: the proof (continued). Here Q is built from some estimate of θ, say θ0. Since log(EQ[X]) ≥ EQ[log(X)]:
log P(X|θ) = log Σ_z Q(z) · P(X, z|θ)/Q(z) ≥ Σ_z Q(z) · log( P(X, z|θ) / Q(z) ).
Also, θ' depends on X and Z but not directly on Q, so P(X, Z, θ'|Q) = P(θ'|X, Z, Q) · P(X, Z|Q). So, plugging in pseudo-data weighted by Q and finding the MLE optimizes a lower bound on the log-likelihood.