1
Conditional Random Fields
Rahul Gupta (KReSIT, IIT Bombay)
2
Undirected models
Graph representation of random variables
Edges define conditional independence statements via graph separation criteria
Markovian independence statements are enough: P(S | V) = P(S | Nb(S))
(Figure: graph over nodes X, Y, Z, W; Z separates X from W, so X is independent of W given Z)
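Below is a minimal sketch (not from the slides; plain Python with illustrative node names) of the graph separation criterion: X and W are conditionally independent given Z exactly when removing Z disconnects X from W.

```python
# Check graph separation by BFS in the graph with the conditioning set removed.
from collections import deque

def separated(adj, src, dst, given):
    """True iff every path from src to dst passes through a node in `given`."""
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in given or v in seen:
                continue
            if v == dst:
                return False          # found a path avoiding the separator
            seen.add(v)
            queue.append(v)
    return True

adj = {'X': ['Y'], 'Y': ['X', 'Z'], 'Z': ['Y', 'W'], 'W': ['Z']}
print(separated(adj, 'X', 'W', given={'Z'}))   # True: X indep. of W given Z
print(separated(adj, 'X', 'W', given=set()))   # False: the path X-Y-Z-W remains
```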
3
Hammersley-Clifford Theorem
Not all distributions satisfy Markovian properties
– The ones which do (and are strictly positive) can be factorized over the cliques of the graph as P(y) = (1/Z) ∏_c ψ_c(y_c)
– In practice, we assume exponential (log-linear) clique potentials, ψ_c(y_c) = exp(w·f(c, y_c))
(Figure: example graph over nodes 1–6)
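A small sketch of the factorization (made-up potentials, binary labels, a 3-node chain, not from the slides): the joint is a normalized product of clique potentials, with the partition function Z computed by brute force.

```python
# Build P(y) = (1/Z) * prod_c psi_c(y_c) for the chain 1 - 2 - 3 with
# exponential potentials psi_c = exp(w_c . f_c).
import itertools, math

def log_psi(y):
    y1, y2, y3 = y
    node = 0.5 * y1 + 0.2 * y2 - 0.3 * y3          # node cliques
    edge = 1.0 * (y1 == y2) + 1.0 * (y2 == y3)     # edge cliques reward agreement
    return node + edge

labels = [0, 1]
Z = sum(math.exp(log_psi(y)) for y in itertools.product(labels, repeat=3))

def P(y):
    return math.exp(log_psi(y)) / Z

# probabilities are positive and sum to 1, as the theorem requires
print(P((1, 1, 1)), sum(P(y) for y in itertools.product(labels, repeat=3)))
```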
4
Computing marginals
We are interested in
– Marginals P(S)
– Conditionals P(S|S') = P(S,S')/P(S'): same as computing marginals
– Labelings with highest joint probabilities: reuse the algorithm for computing marginals
Only compute marginals for cliques
5
Marginals in trees
Sum-Product Algorithm
– Receive belief messages from children
– Pass consolidated belief to parent
– Node marginal is proportional to (node potential) × (messages from children), e.g. P(y_1) ∝ ψ_1(y_1) m_21(y_1) m_31(y_1)
(Figure: root 1 with children 2 and 3, sending messages m_21(y_1) and m_31(y_1))
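A minimal sketch of the Sum-Product computation for the 3-node tree on this slide, with made-up potentials: the leaves send messages m_21 and m_31 to the root, whose marginal is proportional to its own potential times the incoming messages. The brute-force check at the end is only for illustration.

```python
import numpy as np

psi1 = np.array([1.0, 2.0])                 # node potentials (binary labels)
psi2 = np.array([3.0, 1.0])
psi3 = np.array([1.0, 1.0])
psi_edge = np.array([[2.0, 1.0],            # psi_edge[y_child, y_root]
                     [1.0, 2.0]])

m21 = psi_edge.T @ psi2                     # m21[y1] = sum_{y2} psi2[y2] psi_edge[y2, y1]
m31 = psi_edge.T @ psi3
belief1 = psi1 * m21 * m31
marginal1 = belief1 / belief1.sum()         # P(y_1)

# brute-force check against the full joint distribution
joint = np.einsum('a,b,c,ba,ca->abc', psi1, psi2, psi3, psi_edge, psi_edge)
print(marginal1, joint.sum(axis=(1, 2)) / joint.sum())
```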
6
Marginals (contd.)
Reuse messages for other marginals
Running time dependent on tree diameter
For best labeling, replace sum by max
– Store argmax along with max
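A sketch of the max-product variant on a small chain (made-up potentials, not from the slides): sums are replaced by max, the argmax is stored at every step, and the best labeling is recovered by backtracking.

```python
import numpy as np

node = np.array([[1.0, 2.0],      # node[i, y_i]
                 [3.0, 1.0],
                 [1.0, 4.0]])
edge = np.array([[2.0, 1.0],      # edge[y_i, y_{i+1}], shared by both chain edges
                 [1.0, 2.0]])

n, K = node.shape
best = node[0].copy()             # best[y] = max score of a prefix ending in label y
back = np.zeros((n, K), dtype=int)
for i in range(1, n):
    scores = best[:, None] * edge * node[i][None, :]   # scores[y_prev, y]
    back[i] = scores.argmax(axis=0)                    # store the argmax
    best = scores.max(axis=0)                          # keep the max

# backtrack to recover the highest-scoring labeling
y = [int(best.argmax())]
for i in range(n - 1, 0, -1):
    y.append(int(back[i][y[-1]]))
print(list(reversed(y)), best.max())
```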
7
Junction-Tree algorithm
Sum-Product fails to converge on graphs with cycles
– Make the graph acyclic by merging cliques
– Run Sum-Product on the transformed graph
– Local consistency
Runtime exponential in max clique size
(Figure: the graph on nodes 1–6 is merged into cliques 1', 2', 3', with e.g.
ψ_1'(y_1') = ψ_1(y_1) ψ_2(y_2) ψ_12(y_1, y_2) ψ_23(y_2, y_3) ψ_13(y_1, y_3) and
ψ_2'(y_2') = ψ_3(y_3) ψ_4(y_4) ψ_34(y_3, y_4))
8
Junction-Tree (contd.)
Blindly arranging cliques is wrong
– Message passing maintains only local consistency
– Need a 'running intersection property'
(Figure: graph on nodes 1–5 with cliques {1,2,4}, {2,5}, {2,3,4}; one arrangement of these cliques is a valid junction tree, while in the other P(4) may not be consistent)
9
Junction Tree (contd.)
Junction-Tree cannot exist for un-triangulated graphs
– Triangulate the graph
– Max clique size may increase arbitrarily
– NP-hard to output the best triangulation
(Figure: 4-cycle on nodes 1–4 with cliques {1,2}, {2,3}, {3,4}, {1,4}; P(1) is not consistent)
10
ILP-based inferencing
μ_c(y_c) is 1 if clique c has value y_c
The μ_c(y_c)'s are mutually exclusive
μ_c(y_c) and μ_c'(y_c') are consistent for c' ⊂ c
LP relaxation
– The μ_c(y_c) behave like marginal probabilities
– May admit invalid (fractional) solutions if the graph is un-triangulated!
Triangulation adds variables and constraints that keep the solution valid
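A sketch of the LP relaxation for a tiny two-node model (made-up potentials; assumes scipy): the μ variables are constrained to be normalized and locally consistent, and the linear objective Σ_c θ_c(y_c) μ_c(y_c) is maximized. Because this toy graph is a tree, the optimum is integral.

```python
import numpy as np
from scipy.optimize import linprog

K = 2
theta1 = np.array([0.0, 1.5])           # node log-potentials
theta2 = np.array([0.5, 0.0])
theta12 = np.array([[2.0, 0.0],         # edge log-potentials, theta12[y1, y2]
                    [0.0, 2.0]])

# variable order: mu1(0..1), mu2(0..1), mu12(00, 01, 10, 11)
c = -np.concatenate([theta1, theta2, theta12.ravel()])    # linprog minimizes

A_eq, b_eq = [], []
A_eq.append([1, 1, 0, 0, 0, 0, 0, 0]); b_eq.append(1)     # sum_y mu1(y) = 1
A_eq.append([0, 0, 1, 1, 0, 0, 0, 0]); b_eq.append(1)     # sum_y mu2(y) = 1
for y1 in range(K):                                       # consistency: sum_{y2} mu12 = mu1
    row = [0.0] * 8
    row[y1] = -1
    for y2 in range(K):
        row[4 + 2 * y1 + y2] = 1
    A_eq.append(row); b_eq.append(0)
for y2 in range(K):                                       # consistency: sum_{y1} mu12 = mu2
    row = [0.0] * 8
    row[2 + y2] = -1
    for y1 in range(K):
        row[4 + 2 * y1 + y2] = 1
    A_eq.append(row); b_eq.append(0)

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 8)
print(res.x.round(3), -res.fun)          # integral mu here => a valid MAP labeling
```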
11
Approximate inferencing
Sampling-based methods
Variational methods
– Find upper and lower bounds to marginals
Approximate algos when potentials are metrics (Kleinberg, Tardos 99)
– O(log k log log k) approx ratio for k labels
– 2-approx algo for uniform potentials
12
Learning the potentials
13
Max-margin formulation
Assume
– Triangulated graph; one node x is fixed and has edges to all nodes
– Log-linear clique potentials ψ_c(y_c) = w^T f(c, y_c, x)
f: vector of arbitrary local features of the clique
w: weights of the features (will be learnt)
P(y|x) ∝ exp(w^T Σ_c f(c, y_c, x))
– Loss function can be decomposed over cliques
Given (x, y), ideally we want P(y|x) to be higher than any other P(y'|x)
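A small sketch of this scoring model with hypothetical features and weights (none of them from the slides): each edge clique of a chain contributes f(c, y_c, x), and P(y|x) is obtained by exponentiating w^T Σ_c f and normalizing (brute force here; message passing in practice).

```python
import itertools, math
import numpy as np

x = ["Rahul", "Gupta", "IIT"]                  # a toy observation sequence
labels = ["PER", "ORG"]
w = np.array([1.2, 0.8, 1.5])                  # weights (made up, normally learnt)

def f(c, y_c, x):
    """Features of a clique; here the cliques are the edges (i, i+1) of a chain."""
    i, j = c
    return np.array([
        float(y_c[0] == y_c[1]),                        # adjacent labels agree
        float(x[i][0].isupper() and y_c[0] == "PER"),   # capitalized word tagged PER
        float(x[j] == "IIT" and y_c[1] == "ORG"),       # known org name tagged ORG
    ])

cliques = [(0, 1), (1, 2)]

def score(y):
    return w @ sum(f(c, (y[c[0]], y[c[1]]), x) for c in cliques)

Z = sum(math.exp(score(y)) for y in itertools.product(labels, repeat=len(x)))
y = ("PER", "PER", "ORG")
print(score(y), math.exp(score(y)) / Z)        # unnormalized score and P(y|x)
```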
14
Max-margin (contd.)
Exponential number of constraints: w^T F(x_i, y_i) ≥ w^T F(x_i, y') + L(y_i, y') − ξ_i for every alternative labeling y', where F(x, y) = Σ_c f(c, y_c, x)
Transform to one max-constraint per example: w^T F(x_i, y_i) ≥ max_y' [w^T F(x_i, y') + L(y_i, y')] − ξ_i
Can use a cutting plane approach
Still not fully exploiting the decomposability of F and L
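A brute-force sketch of the separation-oracle step used by a cutting-plane approach (made-up data and features): for the current w, find the labeling that most violates the margin constraint, i.e. the argmax of the loss-augmented score. Real implementations compute this argmax with max-product inference rather than enumeration.

```python
import itertools
import numpy as np

labels = [0, 1]
y_true = (0, 1, 1)

def F(x, y):
    """Joint feature map: per-position label indicators plus label transitions."""
    feats = np.zeros(4)
    for i, lab in enumerate(y):
        feats[lab] += x[i]                      # label-observation features
    for a, b in zip(y, y[1:]):
        feats[2 + int(a == b)] += 1.0           # transition features
    return feats

def hamming(y, y2):
    return sum(a != b for a, b in zip(y, y2))

x = np.array([0.2, -0.5, 1.0])
w = np.array([0.5, 1.0, -0.2, 0.3])

# most violated constraint: argmax over y' of the loss-augmented score
best = max(itertools.product(labels, repeat=len(x)),
           key=lambda y2: w @ F(x, y2) + hamming(y_true, y2))
violation = (w @ F(x, best) + hamming(y_true, best)) - w @ F(x, y_true)
print(best, violation)    # add this constraint to the working set if violation > xi
```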
15
Max-margin dual
The dual variables α_i(y') occur only through their clique-level sums, μ_i,c(y_c) = Σ_{y' consistent with y_c} α_i(y')
– The α's behave as probability distributions, and the μ's behave like marginals
However, these constraints are not enough if the graph is un-triangulated.
16
Max-margin (contd.)
Rewrite the dual in terms of the marginals μ_c(y_c)
– Drastic reduction in the number of variables
Algorithms for solving the dual
– Modified SMO
– Exponentiated gradient algorithm
17
SMO
For example i:
– Pick y' and y'' using KKT violations, e.g. α_i,y' = 0 but (i, y') is a support vector
– α_i,y' and α_i,y'' will be optimized
But the α's are neither known, nor unique
– Use maximum entropy principle
– Closed form optimization for the chosen α's
Let the single optimized quantity be the mass transferred between α_i,y' and α_i,y''
18
SMO (contd.)
Project back the changes to the relevant μ's
– μ_i,c(y_c) receives mass if y_c is a part of y'
– μ_i,c(y_c) loses mass if y_c is a part of y''
– Both may happen to the same marginal variable
19
Exponentiated Gradient
Exponentiated updates: α_i,y ← α_i,y exp(η r_i,y), followed by renormalization
– Normalization ⇒ simplex constraint; exp() ⇒ non-negativity
– r = gradient of the dual objective
Cannot update the α's explicitly (exponentially many of them)
– Use the decomposability of F and L
20
Exponentiated Gradient
Parameterize the α's ⇒ gradient descent on the parameters
The gradient should not involve the α's explicitly
– The tricky term is a sum over all labelings y' weighted by α_i(y'), i.e. an 'expectation'
Replace sums over the α's by local marginals
– The expectation becomes a sum of clique-level terms weighted by the marginals μ_i,c(y_c)
21
Exponentiated Gradient (contd.)
Marginals can be computed quickly using message passing
Final algo:
1. Begin with some random parameters
2. Compute marginals
3. Compute gradient using the marginals
4. Update the parameters, go to Step 2
Exponentiated gradient experimentally shown to be better than SMO
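A generic sketch of the exponentiated-gradient loop (a toy concave quadratic over the simplex standing in for the max-margin dual; not the exact objective from these slides): multiplicative exp(η · gradient) updates keep the variables non-negative, and renormalization keeps them on the simplex, exactly as the earlier slide notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
b = rng.normal(size=n)
Q = np.eye(n)                                  # makes the toy objective concave

def grad(alpha):
    return b - Q @ alpha                       # gradient of b.alpha - 0.5 alpha'Q alpha

alpha = np.full(n, 1.0 / n)                    # Step 1: start from a valid point
eta = 0.5
for _ in range(200):                           # Steps 2-4, repeated
    alpha = alpha * np.exp(eta * grad(alpha))  # exponentiated update => non-negativity
    alpha /= alpha.sum()                       # normalization => simplex constraint

print(alpha.round(3), b @ alpha - 0.5 * alpha @ Q @ alpha)
```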
22
Other training methods
Maximize log-likelihood (log P(y|x))
Maximize pseudo-likelihood (Σ_i log P(y_i | y_Nb(i), x))
– Useful for Hamming loss functions
Voted perceptron
– Approximate the gradient of log-likelihood
Piecewise training
– Divide the graph into independently normalized parts
Gradient tree boosting
Logarithmic pooling
– Learn a committee of simpler CRFs
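A small sketch of the pseudo-likelihood objective listed above (made-up potentials on a 3-node chain): sum over nodes the log-probability of each node's label given its neighbours' observed labels, which needs only per-node normalization instead of the full partition function.

```python
import math
import numpy as np

K = 2
node = np.log(np.array([[1.0, 2.0], [3.0, 1.0], [1.0, 4.0]]))  # log node potentials
edge = np.log(np.array([[2.0, 1.0], [1.0, 2.0]]))              # log edge potentials (symmetric)
neighbours = {0: [1], 1: [0, 2], 2: [1]}                        # chain 0 - 1 - 2
y = [0, 0, 1]                                                   # observed labeling

def log_p_local(i, yi, y):
    """log P(y_i | y_Nb(i)): normalize over the K labels of node i only."""
    scores = [node[i, k] + sum(edge[k, y[j]] for j in neighbours[i]) for k in range(K)]
    return scores[yi] - math.log(sum(math.exp(s) for s in scores))

pseudo_ll = sum(log_p_local(i, y[i], y) for i in range(len(y)))
print(pseudo_ll)
```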
23
Associative Markov Networks
Potentials favour the same labeling of all vertices in the clique
– ψ_c(y_c) = Σ_k f_c(k) [[y_c = (k,…,k)]]
– Applications: hypertext classification, image segmentation
Instead of μ_c(y_c), we have μ_c(k), which is one if the whole clique is labeled k
– Training is highly simplified
24
AMNs (contd.)
Inferencing
– LP relaxation is optimal for binary labels; can be reduced to graph min-cut
– For multiple labels, use iterative min-cut: at step k, do a min-cut to decide whether nodes keep their labels or switch to label k
– Approx algo with O(log k log log k) ratio if potentials are metric
– 2-approx algo if potentials are uniform and metric
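A sketch of the binary-label reduction to graph min-cut mentioned above (assumes networkx; costs are made up): minimize Σ_i θ_i(y_i) + Σ_ij λ_ij [y_i ≠ y_j] with λ_ij ≥ 0 by building a source/sink graph whose minimum cut value equals the minimum energy; nodes on the source side of the cut get label 0, nodes on the sink side get label 1.

```python
import networkx as nx

theta = {0: (0.0, 2.0), 1: (1.5, 0.5), 2: (3.0, 0.0)}   # theta[i] = (cost of label 0, cost of label 1)
lam = {(0, 1): 1.0, (1, 2): 1.0}                         # associative (Potts) edge costs

G = nx.DiGraph()
for i, (c0, c1) in theta.items():
    G.add_edge('s', i, capacity=c1)      # cut if i ends up on the sink side (label 1)
    G.add_edge(i, 't', capacity=c0)      # cut if i ends up on the source side (label 0)
for (i, j), c in lam.items():
    G.add_edge(i, j, capacity=c)         # exactly one of these is cut if i, j disagree
    G.add_edge(j, i, capacity=c)

cut_value, (source_side, sink_side) = nx.minimum_cut(G, 's', 't')
labeling = {i: (0 if i in source_side else 1) for i in theta}
print(labeling, cut_value)               # minimum-energy labeling and its energy
```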
25
Other work
Approximating graphs with low tree-width networks (Srebro 2001)
Finding best subgraphs with d edges less (Narasimhan, Bilmes 2004)
DLR hierarchy of approximate inference (Jordan et al. 2005)
Extragradient method (Taskar et al. 2005)
26
Possible directions
Integrating CRFs with imprecise DBs
– CRF probabilities interpretable as confidence
– Compress exponential number of outputs
– Exploit Markovian property
Piecewise training
– Label bias / loss of correlation vs efficiency
Constrained inferencing
– Probabilities lose interpretability?