Conditional Random Fields Rahul Gupta (KReSIT, IIT Bombay)

1 Conditional Random Fields Rahul Gupta (KReSIT, IIT Bombay)

2 Undirected models
– Graph representation of random variables
– Edges define conditional independence statements via graph separation criteria
– Markovian independence statements are enough: P(S|V) = P(S|Nb(S))
– (Figure: example graph over X, Y, Z, W; X is independent of W given Z)

3 Hammersley-Clifford Theorem
– Not all distributions satisfy the Markovian properties
– The ones which do can be factorized over the cliques as P(y) = (1/Z) ∏_c ψ_c(y_c)
– In practice, we assume log-linear potentials, ψ_c(y_c) = exp(w^T f(c, y_c))
– (Figure: example graph with nodes 1–6)
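The factorization can be made concrete with a tiny runnable sketch. The graph (two cliques over three binary variables) and the potential tables below are invented for illustration, and the partition function Z is computed by brute-force enumeration, which is only feasible for toy graphs.

```python
# Minimal sketch of the Hammersley-Clifford factorization on an invented graph:
# P(y) = (1/Z) * prod_c psi_c(y_c), with Z computed by brute force.
import itertools
import numpy as np

# Maximal cliques of a 3-node chain 1-2-3 (0-indexed here as 0-1-2).
cliques = [(0, 1), (1, 2)]

# One positive potential table psi_c(y_c) per clique.
psi = {
    (0, 1): np.array([[4.0, 1.0], [1.0, 4.0]]),   # favours y1 == y2
    (1, 2): np.array([[2.0, 1.0], [1.0, 2.0]]),   # favours y2 == y3
}

def unnormalized_prob(y):
    """prod_c psi_c(y_c) for a full assignment y."""
    p = 1.0
    for c in cliques:
        p *= psi[c][tuple(y[i] for i in c)]
    return p

# Brute-force partition function Z (exponential in the number of variables).
Z = sum(unnormalized_prob(y) for y in itertools.product([0, 1], repeat=3))
print("P(y = (0,0,0)) =", unnormalized_prob((0, 0, 0)) / Z)
```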

4 Computing marginals
We are interested in
– Marginals P(S)
– Conditionals P(S|S') = P(S,S')/P(S'), which is the same as computing marginals
– Labelings with the highest joint probability: reuse the algorithm for computing marginals
We only ever compute marginals for cliques

5 Marginals in trees
Sum-Product Algorithm
– Receive belief messages from children
– Pass consolidated belief to the parent
– Node marginal is proportional to (messages from children) × (node potential), e.g. P(y1) ∝ ψ1(y1) m21(y1) m31(y1)
– (Figure: three-node tree with root 1 and children 2, 3)
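A small runnable sketch of exactly this step, with an invented three-node star (root 1, leaves 2 and 3) and made-up potential tables; the leaves send messages m21 and m31, and the root marginal is read off their product with the node potential.

```python
# Sum-product on a tiny tree: leaves 2 and 3 send messages to root 1,
# whose marginal is proportional to its potential times the incoming messages.
import numpy as np

node_pot = {1: np.array([1.0, 2.0]),
            2: np.array([3.0, 1.0]),
            3: np.array([2.0, 1.0])}
edge_pot = {(2, 1): np.array([[2.0, 1.0], [1.0, 2.0]]),   # psi_21(y2, y1)
            (3, 1): np.array([[1.0, 3.0], [3.0, 1.0]])}   # psi_31(y3, y1)

def message(child, parent):
    """m_{child->parent}(y_parent) = sum_{y_child} psi_child(y_child) * psi_edge(y_child, y_parent)."""
    return (node_pot[child][:, None] * edge_pot[(child, parent)]).sum(axis=0)

m21, m31 = message(2, 1), message(3, 1)
belief = node_pot[1] * m21 * m31          # unnormalized marginal at the root
print("P(y1) =", belief / belief.sum())
```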

6 Marginals (contd.)
– Reuse messages for the other marginals
– Running time depends on the tree diameter
– For the best labeling, replace sum by max
  – Store the argmax along with the max
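The max-product variant can be sketched in the same style (again with invented potentials): the message recursion is unchanged except that the sum becomes a max, and the argmax is kept as a back-pointer so the best child labels can be read off once the root is decided.

```python
# Max-product on a tiny tree: replace sum by max and store the argmax.
import numpy as np

node_pot = {1: np.array([1.0, 2.0]),
            2: np.array([3.0, 1.0]),
            3: np.array([2.0, 1.0])}
edge_pot = {(2, 1): np.array([[2.0, 1.0], [1.0, 2.0]]),
            (3, 1): np.array([[1.0, 3.0], [3.0, 1.0]])}

def max_message(child, parent):
    scores = node_pot[child][:, None] * edge_pot[(child, parent)]
    return scores.max(axis=0), scores.argmax(axis=0)    # best value and back-pointer per y_parent

(m21, bp21), (m31, bp31) = max_message(2, 1), max_message(3, 1)
root_scores = node_pot[1] * m21 * m31
y1 = int(root_scores.argmax())                          # best label at the root
best = {1: y1, 2: int(bp21[y1]), 3: int(bp31[y1])}      # follow the stored argmaxes
print("best labeling:", best, "unnormalized score:", root_scores[y1])
```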

7 Junction-Tree algorithm
Sum-Product fails to converge with cycles
– Make the graph acyclic by merging cliques
– Run Sum-Product on the transformed graph
– Gives local consistency
– Runtime is exponential in the maximum clique size
– (Figure: graph over nodes 1–6 transformed into merged cliques 1', 2', 3')
– Merged potentials: ψ_1'(y_1') = ψ_1(y_1) ψ_2(y_2) ψ_12(y_1,y_2) ψ_23(y_2,y_3) ψ_13(y_1,y_3), and ψ_2'(y_2') = ψ_3(y_3) ψ_4(y_4) ψ_34(y_3,y_4)
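Forming a merged-clique potential like ψ_1' is just a broadcasted product of its constituent tables; the sketch below does this with invented binary potentials, and the size of the resulting table is what makes the runtime exponential in the clique size.

```python
# Build the merged-clique potential psi_1'(y1, y2, y3) as the product of its factors.
import numpy as np

psi1 = np.array([1.0, 2.0])                     # psi_1(y1)
psi2 = np.array([2.0, 1.0])                     # psi_2(y2)
psi12 = np.array([[3.0, 1.0], [1.0, 3.0]])      # psi_12(y1, y2)
psi23 = np.array([[1.0, 2.0], [2.0, 1.0]])      # psi_23(y2, y3)
psi13 = np.array([[2.0, 1.0], [1.0, 2.0]])      # psi_13(y1, y3)

# Broadcast every factor into the full (y1, y2, y3) table of merged clique 1'.
psi_1prime = (psi1[:, None, None] * psi2[None, :, None]
              * psi12[:, :, None] * psi23[None, :, :] * psi13[:, None, :])
print(psi_1prime.shape)       # (2, 2, 2): one entry per joint labeling of {1, 2, 3}
print(psi_1prime[1, 0, 1])    # psi_1'(y1=1, y2=0, y3=1)
```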

8 Junction-Tree (contd.)
Blindly arranging the cliques is wrong
– Message passing maintains only local consistency
– Need a 'running intersection property'
– (Figure: graph over nodes 1–5 with cliques {1,2,4}, {2,5}, {2,3,4}, arranged as two different clique trees: one is a valid junction tree; in the other, P(4) may not be consistent)

9 Junction Tree (contd.)
A junction tree cannot exist for un-triangulated graphs
– Triangulate the graph
– The max clique size may increase arbitrarily
– It is NP-hard to output the best triangulation
– (Figure: 4-cycle over nodes 1–4 with cliques {1,2}, {2,3}, {3,4}, {1,4}; P(1) is not consistent)

10 ILP-based inferencing
– μ_c(y_c) is 1 if clique c takes value y_c
– The μ_c(y_c)'s are mutually exclusive
– μ_c(y_c) and μ_c'(y_c') are consistent for c' ⊂ c
LP relaxation (a code sketch follows)
– The μ_c(y_c) behave like marginal probabilities
– May admit invalid solutions if the graph is un-triangulated!
– Triangulation adds variables and constraints that keep the solution valid
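The sketch below is my own hedged rendering of such an LP relaxation (not necessarily the talk's exact formulation): one variable μ per clique value, normalization ("mutually exclusive") constraints, and node/edge consistency constraints, solved with scipy's linprog. The 3-node chain and its scores are invented; since a chain is triangulated, the relaxation is tight and the solution comes out integral.

```python
# LP relaxation of MAP inference on a tiny chain, solved with scipy.
import itertools
import numpy as np
from scipy.optimize import linprog

K = 2
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2)]
score_node = {0: [0.0, 1.5], 1: [0.2, 0.0], 2: [1.0, 0.0]}
score_edge = {e: [[1.0, 0.0], [0.0, 1.0]] for e in edges}   # rewards label agreement

var, scores = {}, []                       # one LP variable per (clique, value)
for i in nodes:
    for y in range(K):
        var[(i, y)] = len(scores); scores.append(score_node[i][y])
for e in edges:
    for y in itertools.product(range(K), repeat=2):
        var[(e, y)] = len(scores); scores.append(score_edge[e][y[0]][y[1]])

A_eq, b_eq = [], []
def add_constraint(coeffs, rhs):
    row = np.zeros(len(scores))
    for key, val in coeffs:
        row[var[key]] = val
    A_eq.append(row); b_eq.append(rhs)

for i in nodes:                            # mutual exclusion: sum_y mu_i(y) = 1
    add_constraint([((i, y), 1.0) for y in range(K)], 1.0)
for (i, j) in edges:                       # edge marginals consistent with node marginals
    for y in range(K):
        add_constraint([(((i, j), (y, t)), 1.0) for t in range(K)] + [((i, y), -1.0)], 0.0)
        add_constraint([(((i, j), (t, y)), 1.0) for t in range(K)] + [((j, y), -1.0)], 0.0)

res = linprog(-np.array(scores), A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
labels = {i: int(np.argmax([res.x[var[(i, y)]] for y in range(K)])) for i in nodes}
print("MAP labeling from the LP relaxation:", labels)   # expected {0: 1, 1: 0, 2: 0}
```

On an un-triangulated graph (for example a frustrated cycle), the same construction can return fractional μ's, which is the failure mode the slide points out.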

11 Approximate inferencing
– Sampling-based methods
– Variational methods
  – Find upper and lower bounds on the marginals
– Approximation algorithms when the potentials are metrics (Kleinberg, Tardos 99)
  – O(log k log log k) approximation ratio for k labels
  – 2-approximation algorithm for uniform potentials

12 Learning the potentials

13 Max-margin formulation
Assume
– A triangulated graph; one node x is fixed and has edges to all the other nodes
– Log-linear potentials: log ψ_c(y_c) = w^T f(c, y_c, x) (a code sketch follows this list)
  – f: vector of arbitrary local features of the clique
  – w: weights of the features (will be learnt)
  – So P(y|x) ∝ exp(w^T Σ_c f(c, y_c, x))
– The loss function can be decomposed over the cliques
Given (x, y), ideally we want P(y|x) to be higher than any other P(y'|x)
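A small sketch of the clique-decomposed score and a Hamming loss, assuming a chain of three nodes; the feature map, weights and input x are all invented purely for illustration.

```python
# Clique-decomposed scoring: score(y|x) = w . sum_c f(c, y_c, x),
# with a Hamming loss that also decomposes over nodes.
import numpy as np

def clique_features(c, y_c, x):
    """Toy local features: an indicator of the clique's label pair, scaled by a token value."""
    f = np.zeros(4)
    f[y_c[0] * 2 + y_c[1]] = x[c[0]]        # arbitrary illustrative choice
    return f

def score(w, x, y, cliques):
    return sum(w @ clique_features(c, tuple(y[i] for i in c), x) for c in cliques)

def hamming_loss(y, y_true):
    return sum(a != b for a, b in zip(y, y_true))

cliques = [(0, 1), (1, 2)]                  # edges of a 3-node chain
w = np.array([1.0, -0.5, -0.5, 1.0])
x = [1.0, 0.5, 2.0]
y_true = (0, 1, 1)
# Max-margin training wants score(y_true) to beat score(y') by at least the loss L(y_true, y').
for y in [(0, 1, 1), (0, 0, 1), (1, 1, 1)]:
    print(y, score(w, x, y, cliques), hamming_loss(y, y_true))
```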

14 Max-margin (contd.)
– Exponential number of constraints
– Transform to an equivalent problem with a single max-over-y' constraint per example
– Can use a cutting-plane approach
– Still not fully exploiting the decomposability of F and L

15 Max-margin dual
– The dual variables α occur only through their per-clique sums, μ_c(y_c) = Σ_{y consistent with y_c} α(y)
– The α's behave as probability distributions, and the μ's behave like marginals
– However, these constraints are not enough if the graph is un-triangulated.

16 Max-margin (contd.)
Rewrite the dual in terms of the μ_c(y_c)
– Drastic reduction in the number of variables.
Algorithms for solving the dual
– Modified SMO
– Exponentiated gradient algorithm

17 SMO
For example i:
– Pick y' and y'' using KKT violations, e.g. α_{i,y'} = 0 but (i, y') is a support vector
– α_{i,y'} and α_{i,y''} will be optimized
– But the α's are neither known, nor unique
  – Use the maximum entropy principle
– Closed-form optimization for the chosen α's
  – Let δ be the mass transferred from α_{i,y'} to α_{i,y''}

18 SMO (contd.)
Project back the changes onto the relevant μ's
– μ_{i,c}(y_c) receives δ mass if y_c is a part of y'
– μ_{i,c}(y_c) loses δ mass if y_c is a part of y''
– Both may happen to the same marginal variable
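A minimal sketch of this back-projection, with my own naming and sign convention (δ is removed from the "donor" labeling and added to the "receiver" labeling); the cliques, labelings and starting marginals (zeroed, just to show the deltas) are invented.

```python
# Project an SMO-style transfer of `delta` dual mass between two full labelings
# down onto the per-clique marginal variables mu_{i,c}(y_c).
from collections import defaultdict

def project_mass_transfer(mu_i, cliques, y_donor, y_receiver, delta):
    for c in cliques:
        yc_donor = tuple(y_donor[v] for v in c)
        yc_receiver = tuple(y_receiver[v] for v in c)
        mu_i[c][yc_donor] -= delta        # clique values occurring in the donor labeling
        mu_i[c][yc_receiver] += delta     # clique values occurring in the receiver labeling
        # If yc_donor == yc_receiver, both updates hit the same marginal
        # variable and cancel, leaving it unchanged.

cliques = [(0, 1), (1, 2)]
mu_i = {c: defaultdict(float) for c in cliques}
project_mass_transfer(mu_i, cliques, y_donor=(0, 0, 0), y_receiver=(0, 1, 1), delta=0.1)
print(dict(mu_i[(0, 1)]), dict(mu_i[(1, 2)]))
```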

19 Exponentiated Gradient
Exponentiated updates (see the sketch below)
– Normalization ⇒ simplex constraint; exp() ⇒ non-negativity
– ∇ = gradient of the dual objective
Cannot update the α's explicitly
– Use the decomposability of F and L
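A hedged, self-contained sketch of a single exponentiated-gradient loop on a toy quadratic objective that I invented as a stand-in for the real dual: multiplying by exp(-η·gradient) keeps the variables positive, and renormalizing keeps them on the simplex. In the actual algorithm the gradient is built from clique marginals (next slides) rather than from the α's directly.

```python
# Exponentiated-gradient descent of a toy convex objective over the simplex.
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])     # toy objective 0.5 * a'Aa - b'a
b = np.array([1.0, 0.0])

def gradient(alpha):
    return A @ alpha - b

alpha = np.full(2, 0.5)                    # start on the probability simplex
eta = 0.2
for _ in range(200):
    alpha = alpha * np.exp(-eta * gradient(alpha))   # exp() => non-negativity
    alpha /= alpha.sum()                             # normalization => simplex
print("alpha after EG:", alpha)            # approaches the constrained minimizer
```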

20 Exponentiated Gradient
– Parameterize the α's ⇒ gradient descent on the parameters
– The gradient should not involve the (exponentially many) α's
  – The tricky term is an 'expectation' under the α distribution
  – Replace sums over the α's by local marginals, so that the expectation decomposes over cliques

21 Exponentiated Gradient (contd.)
– The marginals can be computed quickly using message passing
Final algorithm:
1. Begin with some random parameters
2. Compute the marginals μ
3. Compute the gradient using μ
4. Update the parameters, go to Step 2
– Exponentiated gradient has been experimentally shown to be better than SMO

22 Other training methods
– Maximize the log-likelihood (log P(y|x))
– Maximize the pseudo-likelihood (Σ_i log P(y_i | Nb(y_i), x)) (a small sketch follows this list)
  – Useful for Hamming loss functions
– Voted perceptron
  – Approximate the gradient of the log-likelihood
– Piecewise training
  – Divide the graph into independently normalized parts
– Gradient tree boosting
– Logarithmic pooling
  – Learn a committee of simpler CRFs
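A minimal sketch of the pseudo-likelihood objective on a chain, assuming log-linear node and edge scores (the tables below are invented): each factor is the conditional log-probability of one node given its observed neighbours, so no global partition function is needed.

```python
# Log pseudo-likelihood of a labeling y on a chain with log-linear scores.
import numpy as np

def log_pseudo_likelihood(y, theta_node, theta_edge):
    n = theta_node.shape[0]
    total = 0.0
    for i in range(n):
        # Score every candidate label at node i with its neighbours clamped
        # to their observed labels, then normalize locally.
        scores = theta_node[i].copy()
        if i > 0:
            scores += theta_edge[y[i - 1], :]
        if i < n - 1:
            scores += theta_edge[:, y[i + 1]]
        total += scores[y[i]] - np.logaddexp.reduce(scores)
    return total

theta_node = np.array([[0.0, 1.0], [0.5, 0.0], [0.0, 0.3]])   # invented node scores
theta_edge = np.array([[1.0, 0.0], [0.0, 1.0]])               # rewards label agreement
print(log_pseudo_likelihood([1, 1, 0], theta_node, theta_edge))
```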

23 Associative Markov Networks
Potentials favour the same labeling of all vertices in a clique
– ψ_c(y_c) = Σ_k f_c(k) ⟦y_c = (k, …, k)⟧ (sketched in code below)
– Applications: hypertext classification, image segmentation
Instead of μ_c(y_c), we have μ_c(k), which is one when the whole clique is labeled k
– Training is highly simplified
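The associative potential itself is a one-liner; the sketch below (with an invented per-label reward vector f_c) just returns f_c(k) when every node in the clique carries label k and zero otherwise.

```python
# Associative clique potential: psi_c(y_c) = sum_k f_c(k) * [[ y_c = (k,...,k) ]].
import numpy as np

def associative_potential(y_c, f_c):
    """f_c[k] if every node in the clique takes label k, otherwise 0."""
    first = y_c[0]
    return f_c[first] if all(y == first for y in y_c) else 0.0

f_c = np.array([0.7, 1.3])                     # per-label clique rewards f_c(k)
print(associative_potential((1, 1, 1), f_c))   # 1.3
print(associative_potential((0, 1, 1), f_c))   # 0.0
```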

24 AMNs (contd.)
Inferencing
– The LP relaxation is optimal for binary labels
  – Can be reduced to a graph min-cut (see the sketch below)
– For multiple labels, use iterative min-cut
  – At step k, do a min-cut to decide whether nodes keep their labels or switch to label k
  – Approximation algorithm with O(log k log log k) ratio if the potentials are metric
  – 2-approximation algorithm if the potentials are uniform and metric
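For the binary case, here is a hedged sketch of the standard min-cut reduction (the conversion from AMN rewards to the non-negative energies used below, and the energies themselves, are my own invented choices): source-side nodes get label 0, sink-side nodes get label 1, and an associative (Potts) pairwise term with non-negative weights makes the construction exact.

```python
# Exact binary MAP via s-t min-cut for an associative (Potts) pairwise energy.
import networkx as nx

unary = {0: (0.0, 2.0), 1: (1.0, 0.5), 2: (3.0, 0.0)}   # E_i(0), E_i(1), non-negative
potts = {(0, 1): 1.0, (1, 2): 1.0}                       # lambda_ij >= 0, paid if labels differ

G = nx.DiGraph()
for i, (e0, e1) in unary.items():
    G.add_edge("s", i, capacity=e1)      # this edge is cut iff i ends up with label 1
    G.add_edge(i, "t", capacity=e0)      # this edge is cut iff i ends up with label 0
for (i, j), lam in potts.items():
    G.add_edge(i, j, capacity=lam)       # one of these two edges is cut iff the labels disagree
    G.add_edge(j, i, capacity=lam)

cut_value, (source_side, _) = nx.minimum_cut(G, "s", "t")
labels = {i: (0 if i in source_side else 1) for i in unary}
print("minimum energy:", cut_value, "labels:", labels)   # expected labels {0: 0, 1: 1, 2: 1}
```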

25 Other work
– Approximating graphs with low tree-width networks (Srebro 2001)
– Finding the best subgraphs with d fewer edges (Narasimhan, Bilmes 2004)
– DLR hierarchy of approximate inference (Jordan et al. 2005)
– Extragradient method (Taskar et al. 2005)

26 Possible directions
– Integrating CRFs with imprecise DBs
  – CRF probabilities are interpretable as confidence
  – Compress the exponential number of outputs
  – Exploit the Markovian property
– Piecewise training
  – Label bias / loss of correlation vs. efficiency
– Constrained inferencing
  – Do the probabilities lose interpretability?

