
1 Directed Graphical Probabilistic Models: the sequel
William W. Cohen, Machine Learning, Feb 25

2 Directed Graphical Probabilistic Models: the son of the child of the bride of the sequel
William W. Cohen, Machine Learning 10-601, Feb 27 2008

3 Outline
- Quick recap
- An example of learning: given structure, find CPTs from "fully observed" data
- Some interesting special cases of this
- Learning with hidden variables: expectation-maximization
- Handwave argument for why EM works

4 The story so far: Bayes nets
Many problems can be solved using the joint probability P(X1,…,Xn). Bayes nets describe a way to compactly write the joint.
(Monty Hall example network: A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess, with CPT tables P(A), P(B), P(C|A,B), and P(E|A,C,D) shown on the slide. The graph encodes the conditional independence assumptions.)
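As a worked equation (a reconstruction assuming the wiring listed above, with D a root node; not copied from the slide), the factorization this network encodes is:

$$P(A,B,C,D,E) \;=\; P(A)\,P(B)\,P(C \mid A,B)\,P(D)\,P(E \mid A,C,D).$$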

5 The story so far: d-separation
X E Y There are three ways paths from X to Y given evidence E can be blocked. X is d-separated from Y given E iff all paths from X to Y given E are blocked…see there?  If X is d-separated from Y given E, then I<X,E,Y> Z Z Z
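A minimal sketch (assumed names, not the lecture's code) of the three per-triple blocking rules that d-separation is built from:

```python
# A single triple on a path can be a chain (X->Z->Y), a fork (X<-Z->Y),
# or a collider (X->Z<-Y). This returns whether that triple blocks the path.
def triple_blocked(kind, z_observed, z_descendant_observed=False):
    """kind: 'chain', 'fork', or 'collider'."""
    if kind in ("chain", "fork"):
        # Chains and forks are blocked exactly when the middle node Z is observed.
        return z_observed
    if kind == "collider":
        # A collider blocks unless Z or one of its descendants is observed
        # (observing the common effect "unblocks" the path: explaining away).
        return not (z_observed or z_descendant_observed)
    raise ValueError(kind)

# X is d-separated from Y given E iff every path has at least one blocked triple.
```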

6 The story so far: "Explaining away"
(The slide shows a numeric table over X, Y, E with columns P(E|X,Y), P(E,X,Y), and P(X,Y|E=1); entries include 0.96, 0.24, 0.04, 0.01, 0.014, and 0.329. The point: once the common effect E=1 is observed, the causes X and Y become dependent, and learning one "explains away" the other.)

7 Recap: Inference in linear chain networks
For a chain X1, …, Xj, …, Xn, instead of recursion you can use "message passing": a "forward" pass and a "backward" pass (forward-backward, as in Baum-Welch)….
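A minimal forward-backward sketch (illustrative, with assumed array conventions; not the lecture's code) for a discrete linear chain:

```python
import numpy as np

# trans[i, j] = P(Z_{t+1}=j | Z_t=i), emit[i, x] = P(X_t=x | Z_t=i),
# init[i] = P(Z_1=i); observations is a list of integer symbol indices.
def forward_backward(init, trans, emit, observations):
    T, K = len(observations), len(init)
    alpha = np.zeros((T, K))          # "forward" messages
    beta = np.ones((T, K))            # "backward" messages
    alpha[0] = init * emit[:, observations[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, observations[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[:, observations[t + 1]] * beta[t + 1])
    posterior = alpha * beta
    return posterior / posterior.sum(axis=1, keepdims=True)  # P(Z_t | all X)
```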

8 Recap: Inference in polytrees
Reduce P(X|E) to the product of two recursively calculated parts:
- P(X=x|E+), i.e., the CPT for X combined with the product of "forward" messages from X's parents;
- P(E-|X=x), i.e., a combination of "backward" messages from X's children, CPTs, and terms of the form P(Z|E_{Z\Yk}), each a simpler instance of P(X|E).
This can also be implemented by message passing (belief propagation). The combining equation is written out below.
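Written out (a standard form of the decomposition, consistent with the slide's two parts but not copied from it):

$$P(X=x \mid E) \;\propto\; P(x \mid E^{+})\;P(E^{-} \mid x),$$

where E+ is the evidence reachable through X's parents and E- the evidence reachable through X's children; the two factors are exactly the "forward" and "backward" parts above.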

9 Recap: Learning for Bayes nets
Input: a sample of the joint and the graph structure of the variables, i.e., for i = 1, …, N you know Xi and parents(Xi).
Output: estimated CPTs.
Method (discrete variables): estimate each CPT independently, using MLE or MAP.
(The slide shows the example network over A, B, C, D, E with its CPT tables, e.g. P(B) and P(C|A,B).)
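A minimal MLE sketch (assumed record format; not the lecture's code) for estimating one CPT by counting:

```python
from collections import Counter

# Estimate P(child | parents) from fully observed records given as dicts,
# e.g. {"A": 1, "B": 2, "C": 3}.
def mle_cpt(records, child, parents):
    joint = Counter()     # counts of (parent values, child value)
    margin = Counter()    # counts of parent values alone
    for r in records:
        pa = tuple(r[p] for p in parents)
        joint[(pa, r[child])] += 1
        margin[pa] += 1
    return {(pa, c): n / margin[pa] for (pa, c), n in joint.items()}
```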

10 Recap: Learning for Bayes nets
Method (discrete variables): Estimate each CPT independently Use a MLE or MAP MAP: A B B P(B) 1 0.33 2 3 C D A B C P(C|A,B) 1 2 0.5 3 1.0 E
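For example, with a symmetric pseudo-count α (a hypothetical smoothing parameter, standing in for whatever prior the slide's lost equations used), the MAP estimate of one CPT entry is:

$$\hat P(C=c \mid A=a, B=b) \;=\; \frac{\#\{A=a,\,B=b,\,C=c\} + \alpha}{\#\{A=a,\,B=b\} + \alpha\,\lvert \mathrm{vals}(C)\rvert}.$$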

11 Recap: A detailed example
Network: Z → X and Z → Y.
Training data D: records (Z, X, Y) with Z ∈ {ugrad, grad, prof}, X ∈ {<20, 20s, 30+}, Y ∈ {facebook, thesis, grants}, e.g. (ugrad, <20, facebook), (grad, 20s, thesis), (prof, 30+, grants), …
CPTs to estimate: P(Z), P(X|Z), P(Y|Z) (shown as empty tables on the slide).

12 A detailed example
From D we first estimate the prior, P(Z): undergr 0.375, prof 0.250 (the grad entry must be 0.375, since the column sums to 1). P(X|Z) and P(Y|Z) are not yet filled in.

13 A detailed example
P(Z) as before: undergr 0.375, grad 0.375, prof 0.250. The slide now fills in P(X|Z); the legible entries are P(<20|undergr) = .4, P(30+|undergr) = .2, a .6 entry in the grad row, and .25 and .5 in the prof row. P(Y|Z) is still empty.

14 A detailed example
Now we're done learning: what can we do with this?
- Guess what your favorite professor is doing now.
- Given a new (x, y), compute P(prof|x,y), P(grad|x,y), P(ugrad|x,y) using Bayes net inference.
- Given a new (x, y), predict the most likely "label" Z.
Learned tables (legible entries): P(Z): undergr 0.375, grad 0.375, prof 0.250; P(X|Z): undergr <20 = .4, 30+ = .2, grad row .6, prof row .25, .5; P(Y|Z): undergr facebk = .6, thesis = .2, grad row .4, prof row .25, .5.
Of course we need to implement our Bayes net inference method first… (a small sketch follows).
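A small inference sketch for this two-child network: P(Z|x,y) is proportional to P(Z)·P(x|Z)·P(y|Z). CPT entries marked "illustrative" are assumptions filled in only so the example runs (including which column a lone legible value belongs to); the .375/.25, .4/.2, .6, and .25/.5 values come from the slide.

```python
p_z = {"ugrad": 0.375, "grad": 0.375, "prof": 0.250}
p_x_given_z = {
    "ugrad": {"<20": 0.4, "20s": 0.4, "30+": 0.2},            # 20s entry illustrative
    "grad":  {"<20": 0.2, "20s": 0.6, "30+": 0.2},            # <20, 30+ illustrative
    "prof":  {"<20": 0.25, "20s": 0.25, "30+": 0.5},          # column placement assumed
}
p_y_given_z = {
    "ugrad": {"facebook": 0.6, "thesis": 0.2, "grants": 0.2},
    "grad":  {"facebook": 0.2, "thesis": 0.4, "grants": 0.4}, # partly illustrative
    "prof":  {"facebook": 0.25, "thesis": 0.25, "grants": 0.5},
}

def posterior_z(x, y):
    # Bayes net inference for this structure: score each z, then normalize.
    scores = {z: p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y] for z in p_z}
    total = sum(scores.values())
    return {z: s / total for z, s in scores.items()}

print(posterior_z("30+", "grants"))   # with these numbers, "prof" is most likely
```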

15 A more interesting example
A class node C with word children W1, W2, …, WN. Each P(Wi|C) is a table over the vocabulary for each class (aardvark .154, Basra .0001, …, zymurgy 0.0001 under SATIRE; .00001 under NEWS), and these parameters are "shared" or "tied": P(W1|C), P(W2|C), … are the same table. The tied structure is drawn compactly as a "plate": C with a single child Wi inside a plate replicated N times.
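Written as a factorization (the standard form this tied structure encodes):

$$P(C, W_1, \dots, W_N) \;=\; P(C)\,\prod_{i=1}^{N} P(W_i \mid C), \qquad \text{with } P(W_i \mid C) \text{ the same (tied) table for every } i.$$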

16 Some special cases of Bayes net learning
- Naïve Bayes
- HMMs for biology and information extraction
- Tree-augmented Naïve Bayes

17 Another interesting example
A phylogenomic analysis of the Actinomycetales mce operons

18 Another interesting example
A chain Z1 → Z2 → Z3 → Z4 → … with one emission Xi per Zi: a model of a DNA sequence. The hidden states are positions, and the transitions are deterministic: P(Z1) puts all its mass on the first position, and P(Z2 = pos2 | Z1 = pos1) = 1.0, P(Z3 = pos3 | Z2 = pos2) = 1.0, and so on. The emissions are per-position distributions over bases; the legible entries are P(X1|Z1 = pos1): A 0.05, T 0.2, G 0.7; P(X2|Z2 = pos2): A 0.01, C 0.02, G 0.95; P(X3|Z3 = pos3): A 0.2, T 0.58, C 0.02.

19 Another interesting example
Same chain, but now the pattern has an "optional" part, G(T|A|G): an extra state pos4 is added, and its emission tables are tied: P(X2|Z2 = pos4) = P(X4|Z4 = pos4), and more generally P(Xi|Zi = pos4) = P(Xj|Zj = pos4) for all i, j. Transitions become probabilistic: from pos1 the chain goes to pos2 or to pos4 with probability 0.5 each, and the pos4 emission row includes entries 0.05 and 0.45 (the rest of the tables are as before).

20 Another interesting example
With these ties (P(Xi|Zi = pos4) = P(Xj|Zj = pos4), the optional G(T|A|G) part, the 0.5/0.5 branch from p1 into p2 or p4, and emission tables P(X|p1), P(X|p2), …), the whole model is described by three tables:
- P(posj|posi) for all i, j (aka the transition probabilities);
- P(x|posi) for all x, i (aka the emission probabilities);
- P(Z1 = posi) (the initial-state probabilities).

21 Another interesting example

22 IE by text segmentation
Example: addresses and bibliography records.
Address segmentation (House number | Building | Road | City | State | Zip): 4089 | Whispering Pines | Nobel Drive | San Diego | CA | 92122.
Citation segmentation (Author | Year | Title | Journal | Volume | Page): P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick | (1993) | Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media | J. Amer. Chem. Soc. | 115 | …
Author, title, year, … are like "positions" in the previous example.

23 IE with Hidden Markov Models
HMMs for IE. Note: we know how to train this model from segmented citations. The slide shows an HMM whose states correspond to fields such as Title, Author, Journal, and Year, with transition probabilities between states and emission probabilities over tokens and token shapes such as "dddd" and "dd" (numeric entries like 0.6, 0.3, 0.1, 0.4, 0.2, 0.8, 0.9, 0.5 appear in the diagram). Probabilistic transitions and outputs make the model more robust to errors and slight variations.
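A minimal Viterbi sketch (illustrative, with an assumed data layout; not Borkar et al.'s system) for decoding the most likely field sequence once such a model is trained:

```python
import numpy as np

# log_init[i], log_trans[i, j] are log-probabilities over K states (e.g. Title,
# Author, Journal, Year); log_emit[i] is a dict token -> log P(token | state i).
def viterbi(tokens, log_init, log_trans, log_emit):
    K, T = len(log_init), len(tokens)
    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    score[0] = log_init + np.array([log_emit[s].get(tokens[0], -20.0) for s in range(K)])
    for t in range(1, T):
        for s in range(K):
            cand = score[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(cand))
            score[t, s] = cand[back[t, s]] + log_emit[s].get(tokens[t], -20.0)
    # Trace back the best labeling from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return list(reversed(path))   # one state index per token
```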

24 (Emission-table fragment from an address HMM: P(X|Z) rows for states such as House, Apt., and Road over a word vocabulary; legible entries include aardvark .000001, .23, zymurgy 0.0001, ave .1, and forbes 0.05.)
Basic idea of E/M: plug in expectations, update the θ's, continue.

25 Results: Comparative Evaluation
Dataset                | Instances | Elements
IITB student addresses | 2388      | 17
Company addresses      | 769       | 6
US addresses           | 740       |
The Nested model does best in all three cases (from Borkar et al., 2001).

26 Learning with hidden variables
Hidden variables: what if some of your data is not completely observed? For example, in the (Z, X, Y) data from before, one record has Z = ?.
Method (expectation-maximization, aka EM):
1. Estimate parameters somehow or other.
2. Predict the unknown values from your estimate (the expectation step).
3. Add pseudo-data corresponding to these predictions, weighting each example by confidence in its correctness.
4. Re-estimate parameters using the extended dataset, real + pseudo-data (the maximization step, via MLE/MAP).
5. Repeat starting at step 2….
A code sketch of this loop is given below.
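A minimal EM sketch (illustrative, with assumed names and smoothing; not the lecture's code) for the Z → X, Z → Y network, where records are (z, x, y) tuples and z is None when hidden:

```python
from collections import defaultdict

Z_VALS = ["ugrad", "grad", "prof"]

def em(records, x_vals, y_vals, iters=20, alpha=1.0):
    # Step 1: "estimate parameters somehow or other" -- start uniform.
    pz = {z: 1 / len(Z_VALS) for z in Z_VALS}
    px = {z: {x: 1 / len(x_vals) for x in x_vals} for z in Z_VALS}
    py = {z: {y: 1 / len(y_vals) for y in y_vals} for z in Z_VALS}
    for _ in range(iters):
        cz = defaultdict(float); cx = defaultdict(float); cy = defaultdict(float)
        for z, x, y in records:
            if z is None:                       # E-step: soft-assign the hidden Z
                w = {zz: pz[zz] * px[zz][x] * py[zz][y] for zz in Z_VALS}
                tot = sum(w.values())
                weights = {zz: w[zz] / tot for zz in Z_VALS}
            else:                               # observed Z counts with weight 1
                weights = {z: 1.0}
            for zz, wt in weights.items():
                cz[zz] += wt; cx[(zz, x)] += wt; cy[(zz, y)] += wt
        # M-step: re-estimate tables from weighted counts, with MAP smoothing alpha.
        pz = {z: (cz[z] + alpha) / (len(records) + alpha * len(Z_VALS)) for z in Z_VALS}
        px = {z: {x: (cx[(z, x)] + alpha) / (cz[z] + alpha * len(x_vals)) for x in x_vals} for z in Z_VALS}
        py = {z: {y: (cy[(z, y)] + alpha) / (cz[z] + alpha * len(y_vals)) for y in y_vals} for z in Z_VALS}
    return pz, px, py
```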

27 Learning with Hidden Variables: Example
Detailed example. Current estimates: P(Z) = 0.333 for each of undergr, grad, prof; P(X|Z) and P(Y|Z) as on slide 14 (undergr: <20 = .4, 30+ = .2; grad: .6; prof: .25, .5; and undergr: facebk = .6, thesis = .2; grad: .4; prof: .25, .5). The data D is as before, plus one record whose Z is hidden (Z = ?).

28 Learning with Hidden Variables: Example
Same estimates and data as the previous slide, with attention on the record whose Z is hidden.

29 Learning with Hidden Variables: Example
Same estimates; the data table now appears twice on the slide: the original records plus the pseudo-data in which the Z = ? record is filled in with each possible value of Z, weighted by its posterior probability (step 3 of the method).
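The expectation step for that record, written out (the standard computation the method's step 2 calls for; the record's observed x and y are left generic since they are not legible in the transcript):

$$P(Z=z \mid x, y) \;=\; \frac{P(z)\,P(x \mid z)\,P(y \mid z)}{\sum_{z'} P(z')\,P(x \mid z')\,P(y \mid z')}, \qquad z \in \{\text{ugrad}, \text{grad}, \text{prof}\}.$$

These three posterior weights become the pseudo-counts that are folded into the re-estimated tables on the next slides.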

30 Learning with Hidden Variables: Example
Re-estimation: P(Z), P(X|Z), and P(Y|Z) are recomputed from the real + pseudo data (the slide shows the tables cleared, with the new values filled in on the following slides).

31 Learning with Hidden Variables: Example
Detailed example, continued. The re-estimated prior is P(Z): undergr .38, grad .35, prof .27.

32 Learning with Hidden Variables: Example
Detailed example, continued. The re-estimated P(Y|Z) table; its legible entries are .24, .32, and .54.

33 Learning with hidden variables
Recap of the method:
1. Estimate parameters somehow or other.
2. Predict unknown values from your estimate.
3. Add pseudo-data corresponding to these predictions, weighting each example by confidence in its correctness.
4. Re-estimate parameters using the extended dataset (real + pseudo-data).
5. Repeat starting at step 2….
(Applied above to the (Z, X, Y) data with one hidden Z.)

34 Why does this work? (Why E/M works: the proof.)
Ignore the prior, i.e., work with the MLE objective. For any distribution Q over z (Q(z) a pdf, with Q(z) > 0) we can write
log P(X|θ) = log Σ_z P(X, z|θ) = log Σ_z Q(z) · P(X, z|θ) / Q(z).

35 Why does this work? (continued)
Still ignoring the prior, with Q(z) a pdf and Q(z) > 0: take Q to be computed from an initial estimate of θ, say θ0.

36 Jensen's inequality
Claim: log(q1·x1 + q2·x2) ≥ q1·log(x1) + q2·log(x2), where q1 + q2 = 1.
This holds for any downward-concave function, not just log(x). Further: log(E_Q[X]) ≥ E_Q[log(X)].
(Jensen's inequality picture: the log curve evaluated at the mixture point q1·x1 + q2·x2 lies above the chord value q1·log(x1) + q2·log(x2) between x1 and x2.)

37 Why E/M works: the proof (continued)
As before, ignore the prior; Q(z) is a pdf with Q(z) > 0, and Q is any estimate of θ, say θ0. Since log(E_Q[X]) ≥ E_Q[log(X)], the log-sum above can be lower-bounded by moving the log inside the Q-expectation. Also, θ' depends on X, Z but not directly on Q, so P(X,Z,θ'|Q) = P(θ'|X,Z,Q) · P(X,Z|Q).
So, plugging in pseudo-data weighted by Q and finding the MLE optimizes a lower bound on the log-likelihood.
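A hedged reconstruction of the bound the slide is sketching (standard EM algebra, not copied from the deck):

$$\log P(X \mid \theta) \;=\; \log \sum_z Q(z)\,\frac{P(X,z \mid \theta)}{Q(z)} \;\ge\; \sum_z Q(z)\,\log P(X,z \mid \theta) \;-\; \sum_z Q(z)\,\log Q(z).$$

The second term does not depend on θ, so maximizing the bound over θ means maximizing E_Q[log P(X, z|θ)], which is exactly the MLE/MAP fit to data in which the hidden z has been filled in as pseudo-data weighted by Q(z). Choosing Q(z) = P(z|X, θ0) makes the bound tight at θ0, which is the handwave for why each EM iteration cannot decrease the log-likelihood.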

