Sequential Learning with Dependency Nets
William W. Cohen 2/22
CRFs: the good, the bad, and the cumbersome…
Good points:
- Global optimization of the weight vector that guides decision making
- Trades off decisions made at different points in the sequence

Worries:
- Cost (of training)
- Complexity (do we need all this math?)
- Amount of context: the matrix for the normalizer is |Y| × |Y|, so high-order models with many classes get expensive fast
- Strong commitment to maxent-style learning: loglinear models are nice, but nothing is always best
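The context-cost worry can be made concrete: an order-k chain model's transition structure ranges over k-tuples of labels, so the normalizer's matrix has |Y|^k × |Y|^k entries. A tiny sketch (the label count 45 is just an illustrative value, roughly the size of the Penn Treebank POS tag set):

```python
def transition_matrix_entries(num_labels: int, order: int) -> int:
    """Entries in the transition matrix of an order-`order` chain model:
    each state is a tuple of `order` labels, giving |Y|**order states."""
    states = num_labels ** order
    return states * states

print(transition_matrix_entries(45, 1))  # 2025
print(transition_matrix_entries(45, 2))  # 4100625
```

Going from first to second order with 45 labels blows the matrix up by a factor of 45² ≈ 2000, which is the "expensive fast" in the bullet above.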
Dependency Nets
Proposed solution: the parents of a node are its Markov blanket
- like an undirected Markov net: capture all "correlational associations"
- one conditional probability for each node X, namely P(X | parents of X)
- like a directed Bayes net: no messy clique potentials
Example – bidirectional chains
(Figure: a bidirectional chain of labels Y1, Y2, …, Yi, … over the words "When will Dr Cohen post the notes".)
DN chains
(Figure: a chain of labels Yi over "When will Dr Cohen post the notes", with the current values shown.)
How do we do inference? Iteratively:
1. Pick values for Y1, Y2, … at random
2. Pick some j and compute P(Yj | the current values of the other Y's)
3. Set the new value of Yj according to this distribution
4. Go back to (2)
This is an MCMC process
Markov Chain Monte Carlo: a randomized process that changes y(t) to y(t+1), with a transition probability that doesn't depend on earlier y's.
How do we do inference? Iteratively:
1. Pick values for Y1, Y2, … at random: y(0)
2. Pick some j and compute P(Yj | the current values of the other Y's)
3. Set the new value of Yj according to this: y(1)
4. Go back to (2) and repeat to get y(1), y(2), …, y(t), …
Each particular run of this procedure is one sample path of the chain.
This is an MCMC process
Claim: suppose Y(t) is drawn from a distribution D whose local conditionals P(Yj | the other Y's) are exactly the ones used for the random flips. Then Y(t+1) is also drawn from D (i.e., the random flip doesn't move us "away from D").
This is an MCMC process
Claim: if you wait long enough, then for some t, Y(t) will be drawn from a distribution D that the flips leave unchanged, under certain reasonable conditions (e.g., the graph of potential edges is connected, …). So D is a "sink": once the chain reaches it, it stays there. The initial part of the run, before the chain reaches D, is the "burn-in".
This is an MCMC process
An algorithm:
1. Run the MCMC chain for a long time t, discarding the "burn-in" samples, and hope that Y(t) is drawn from the target distribution D.
2. Run the chain a while longer and save the sample S = {Y(t), Y(t+1), …, Y(t+m)}.
3. Use S, averaged, to answer probabilistic queries like Pr(Yj | X).
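The algorithm above can be sketched as a minimal Gibbs sampler over a chain of binary labels. Everything here is illustrative: `agree` is a made-up local model standing in for a learned P(Yj | Markov blanket).

```python
import random

def gibbs_chain(local_cond, n, burn_in=1000, samples=500, seed=0):
    """Gibbs sampling over a chain Y0..Y(n-1) of binary labels.

    local_cond(j, left, right) -> Pr(Yj = 1 | its neighbors); a stand-in
    for the learned local conditional models of a dependency net."""
    rng = random.Random(seed)
    y = [rng.randint(0, 1) for _ in range(n)]   # 1. random initial values
    kept = []
    for t in range(burn_in + samples):
        j = rng.randrange(n)                    # 2. pick some j
        left = y[j - 1] if j > 0 else None
        right = y[j + 1] if j < n - 1 else None
        p1 = local_cond(j, left, right)         #    compute Pr(Yj | rest)
        y[j] = 1 if rng.random() < p1 else 0    # 3. resample Yj
        if t >= burn_in:                        #    discard burn-in, keep S
            kept.append(tuple(y))
    return kept

def agree(j, left, right):
    """Toy local model: Yj prefers to agree with its neighbors."""
    votes = [v for v in (left, right) if v is not None]
    return 0.5 if not votes else 0.1 + 0.8 * (sum(votes) / len(votes))

S = gibbs_chain(agree, n=5)
pr_mid = sum(s[2] for s in S) / len(S)  # estimated marginal of the middle label
```

A query like Pr(Yj | X) is then answered by averaging over the retained samples, as in the last line.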
More on MCMC
This particular process is Gibbs sampling: the transition probabilities are defined by sampling from the posterior of one variable Yj given the others.
- MCMC is a very general-purpose inference scheme (and sometimes very slow).
- On the plus side, learning is relatively cheap, since there's no inference involved (!)
- A dependency net is closely related to a Markov random field learned by maximizing pseudo-likelihood. Identical?
- The statistical relational learning community has some proponents of this approach: Pedro Domingos, David Jensen, …
- A big advantage is the generality of the approach: sparse learners (e.g., L1-regularized maxent, decision trees, …) can be used to infer the Markov blanket (NIPS 2006).
Examples
(Figure: a chain of labels Y1, Y2, …, Yi, … over the words "When will Dr Cohen post the notes".)
Examples
(Figure: a two-layer model over "When will Dr Cohen post the notes": a POS layer Z1, Z2, …, Zi and a BIO layer Y1, Y2, …, Yi over the same words.)
Examples
(Figure: another dependency-net structure over the labels Y1, Y2, …, Yi, … and the words "When will Dr Cohen post the notes".)
Dependency nets
The bad and the ugly:
- Inference is less efficient: MCMC sampling
- Can't reconstruct the joint probability via the chain rule
- Networks might be inconsistent, i.e., the local P(x | pa(x))'s don't define a pdf
- Exactly equal, representationally, to ordinary undirected Markov nets
Dependency nets
The good:
- Learning is simple and elegant (if you know each node's Markov blanket): just learn a probabilistic classifier for P(X | pa(X)) for each node X. (You might not learn a consistent model, but you'll probably learn a reasonably good one.)
- Inference can be sped up substantially over naive Gibbs sampling.
Dependency nets
Learning is simple and elegant (if you know each node's Markov blanket): just learn a probabilistic classifier for P(X | pa(X)) for each node X.
(Figure: a chain y1, y2, y3, y4 over a shared input x, with local models Pr(y1 | x, y2), Pr(y2 | x, y1, y3), Pr(y3 | x, y2, y4), Pr(y4 | x, y3).)
Learning is local, but inference is not, and it need not be unidirectional.
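A minimal sketch of this local learning step, using simple counting in place of the maxent classifiers the slides use; the tiny label sequences below are invented for illustration.

```python
from collections import Counter, defaultdict

def learn_local_conditionals(sequences):
    """Estimate P(Yj = y | left neighbor, right neighbor) for interior
    positions of labeled chains by counting -- a stand-in for training
    one probabilistic classifier per node on its Markov blanket."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for j in range(1, len(seq) - 1):
            counts[(seq[j - 1], seq[j + 1])][seq[j]] += 1

    def cond(y, left, right):
        c = counts[(left, right)]
        total = sum(c.values())
        return c[y] / total if total else 0.0

    return cond

# Invented BIO-style training chains:
cond = learn_local_conditionals([("B", "I", "O"), ("B", "I", "I"), ("B", "O", "O")])
print(cond("I", "B", "O"))  # 0.5: "I" fills half of the observed (B, _, O) contexts
```

Note that each conditional is learned independently, so, exactly as the slides warn, nothing forces the resulting local models to be globally consistent.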
Toutanova, Klein, Manning, Singer
Dependency nets for POS tagging vs. CMMs, with maxent used for the local conditional models. Goals:
- An easy-to-train bidirectional model
- A really good POS tagger
Toutanova et al.
Don't use Gibbs sampling for inference: instead use a Viterbi variant (which is not guaranteed to produce the most likely sequence).
Example: from the data D = {11, 11, 11, 12, 21, 33}, the most likely joint state is {11}, but
P(a=1 | b=1) · P(b=1 | a=1) = 3/4 · 3/4 < 1
P(a=3 | b=3) · P(b=3 | a=3) = 1 · 1 = 1
so scoring by the product of local conditionals prefers {33}.
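The arithmetic in this example can be checked directly from the empirical conditionals of D:

```python
# The six observed (a, b) pairs from the slide's dataset D
D = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 1), (3, 3)]

def p_a_given_b(a, b):
    """Empirical P(a | b) estimated from the pairs in D."""
    match = [pair for pair in D if pair[1] == b]
    return sum(1 for pair in match if pair[0] == a) / len(match)

def p_b_given_a(b, a):
    """Empirical P(b | a) estimated from the pairs in D."""
    match = [pair for pair in D if pair[0] == a]
    return sum(1 for pair in match if pair[1] == b) / len(match)

print(p_a_given_b(1, 1) * p_b_given_a(1, 1))  # 0.5625 (< 1)
print(p_a_given_b(3, 3) * p_b_given_a(3, 3))  # 1.0
```

So a Viterbi-style product of local conditionals scores the state {33} above {11}, even though {11} is the maximum-likelihood state in D.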
Results with model
The "best" model includes some special unknown-word features, including "a crude company-name detector".
Final test-set results
MXPost (Ratnaparkhi): 47.6, 96.4, 86.2
CRF+ (Lafferty et al., ICML 2001): 95.7, 76.4
Other comments
Smoothing (quadratic regularization, a.k.a. a Gaussian prior) is important: it avoids the overfitting effects reported elsewhere.