Inference and Learning via Integer Linear Programming
Vasin Punyakanok, Dan Roth, Scott Yih, Dav Zimak
Outline
Problem Definition
Integer Linear Programming (ILP) and its generality
Learning and Inference via ILP
Experiments
Extension to hierarchical learning
Future Direction: Hidden Variables

Notes: Pose the inference task (finding an assignment to the variable set) as an ILP:
- Cost function: defined by a set of learned classifiers
- Constraints: maintain the structure of the solution
Doing this (1) allows MANY constraints in inference and (2) sets up a framework where different learning methods can be used; we compare two natural algorithms, independent vs. global training.
Experiments: independent training is sometimes better on easy problems; global training is better for difficult problems.
Extension: when tasks are dependent (one on the next) and classification is done in levels.
Future direction: learning with hidden variables.
Problem Definition
X = (X1, ..., Xk) ∈ X^k = X
Y = (Y1, ..., Yl) ∈ Y^l = Y
Given X = x, find Y = y
Notation: capital letters denote variables; lowercase letters denote values; bold denotes vectors or matrices; X and Y denote the domain sets.
Example (Text Chunking)
y = NP ADJP VP ADVP VP
x = The guy presenting now is so tired
Classifiers
A classifier h: X × Y^(l-1) × Y × {1,..,l} → R
Example:
score(x, y-3, NP, 3) = 0.3
score(x, y-3, VP, 3) = 0.5
score(x, y-3, ADVP, 3) = 0.2
score(x, y-3, ADJP, 3) = 1.2
score(x, y-3, NULL, 3) = 0.1
Inference
Goal: x → y
Given: an input x; score(x, y-t, y, t) for all (y-t, y) ∈ Y^l and t ∈ {1,..,l}; C, a set of constraints over Y
Find: y that maximizes the global function score(x,y) = Σ_t score(x, y-t, y_t, t) and satisfies the constraints C
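Before turning to the ILP formulation, a minimal brute-force sketch of this inference goal, assuming generic stand-ins: `score` is the learned scoring function and `constraints` is a list of predicates over the full assignment (both hypothetical names, not the authors' code).

```python
from itertools import product

def infer_bruteforce(x, labels, l, score, constraints):
    """Enumerate all y in Y^l, keep those satisfying the constraints,
    and return the one maximizing sum_t score(x, y_-t, y_t, t)."""
    best_y, best_val = None, float("-inf")
    for y in product(labels, repeat=l):
        if not all(c(y) for c in constraints):
            continue
        # y[:t] + y[t+1:] plays the role of y_-t (the assignment minus position t)
        val = sum(score(x, y[:t] + y[t+1:], y[t], t) for t in range(l))
        if val > best_val:
            best_y, best_val = y, val
    return best_y
```

This is exponential in l; the point of the ILP formulation on the next slides is to hand exactly this maximization to a solver instead.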
Integer Linear Programming
Boolean variables: U = (U1,...,Ud) ∈ {0,1}^d
Cost vector: p = (p1,…,pd) ∈ R^d
Cost function: p·U
Constraint matrix: c ∈ R^(e×d)
Maximize p·U subject to cU ≥ 0 (constraints such as cU = 0 or cU ≥ 3 are also possible)
ILP (Example)
U = (U1, U2, U3)
p = (0.3, 0.5, 0.8)
c =
  1   2   3
 -1  -2   2
 -1  -2   2
  0  -3   2
Maximize p·U subject to cU ≥ 0
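A sketch of this toy ILP using the PuLP modeling library (an assumption on my part; any ILP solver would do). The matrix rows are as reconstructed above.

```python
import pulp

p = [0.3, 0.5, 0.8]
c = [[1, 2, 3], [-1, -2, 2], [-1, -2, 2], [0, -3, 2]]

prob = pulp.LpProblem("toy_ilp", pulp.LpMaximize)
U = [pulp.LpVariable(f"U{i+1}", cat="Binary") for i in range(3)]
prob += pulp.lpSum(p[i] * U[i] for i in range(3))            # cost function p.U
for row in c:                                                 # constraints cU >= 0
    prob += pulp.lpSum(row[i] * U[i] for i in range(3)) >= 0
prob.solve()
print([int(u.value()) for u in U])
```

With these numbers the solver returns (1, 0, 1): setting U2 = 1 would violate the second and third constraint rows.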
Boolean Functions as Linear Constraints
Conjunction: U1 ∧ U2 ∧ U3 ⇔ U1 = 1, U2 = 1, U3 = 1
Disjunction: U1 ∨ U2 ∨ U3 ⇔ U1 + U2 + U3 ≥ 1
CNF: (U1 ∨ U2) ∧ (U3 ∨ U4) ⇔ U1 + U2 ≥ 1, U3 + U4 ≥ 1
Text Chunking
Indicator variables: U_{1,NP}, U_{1,NULL}, U_{2,VP}, ... correspond to y1 = NP, y1 = NULL, y2 = VP, ...
U_{1,NP} indicates that phrase 1 is labeled NP
Cost vector: p_{1,NP} = score(x, NP, 1), p_{1,NULL} = score(x, NULL, 1), p_{2,VP} = score(x, VP, 2), ...
Then p·U = score(x,y) = Σ_t score(x, y_t, t), subject to constraints
Structural Constraints
Coherency: y_t can take only one value: Σ_{y ∈ {NP,...,NULL}} U_{t,y} = 1
Non-overlapping: if phrases 1 and 2 overlap, at least one must be NULL: U_{1,NULL} + U_{2,NULL} ≥ 1
Linguistic Constraints
Every sentence must have at least one VP: Σ_t U_{t,VP} ≥ 1
Every sentence must have at least one NP: Σ_t U_{t,NP} ≥ 1
...
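Putting the last three slides together, a sketch (again with PuLP, assumed available) of the chunking ILP: indicator variables U[t, y], the coherency constraint, a non-overlap constraint for each overlapping pair of candidate phrases, and the two linguistic constraints. `scores[t][y]` is a hypothetical stand-in for score(x, y, t), and `overlapping_pairs` is assumed to be precomputed from the candidate spans.

```python
import pulp

LABELS = ["NP", "VP", "ADVP", "ADJP", "NULL"]

def chunking_ilp(scores, overlapping_pairs):
    T = range(len(scores))
    prob = pulp.LpProblem("chunking", pulp.LpMaximize)
    U = {(t, y): pulp.LpVariable(f"U_{t}_{y}", cat="Binary") for t in T for y in LABELS}
    prob += pulp.lpSum(scores[t][y] * U[t, y] for t in T for y in LABELS)  # p.U
    for t in T:                                     # coherency: exactly one label per phrase
        prob += pulp.lpSum(U[t, y] for y in LABELS) == 1
    for t1, t2 in overlapping_pairs:                # overlapping phrases: at least one is NULL
        prob += U[t1, "NULL"] + U[t2, "NULL"] >= 1
    prob += pulp.lpSum(U[t, "VP"] for t in T) >= 1  # at least one VP
    prob += pulp.lpSum(U[t, "NP"] for t in T) >= 1  # at least one NP
    prob.solve()
    return {t: next(y for y in LABELS if U[t, y].value() > 0.5) for t in T}
```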
Interacting Classifiers
The classifier for an output y_t uses the other outputs y-t as inputs: score(x, y-t, y, t)
We need to ensure that the final output from the ILP is computed from a consistent y
Introduce additional variables and additional coherency constraints
Interacting Classifiers
Additional variables: U_{Y=y} for every possible assignment y (i.e., each (y-t, y))
Additional coherency constraints: U_{Y=y} = 1 iff U_{t,y_t} = 1 for all y_t in y
Σ_{y_t in y} U_{t,y_t} − U_{Y=y} ≤ l − 1
Σ_{y_t in y} U_{t,y_t} − l·U_{Y=y} ≥ 0
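A tiny check of these two coherency constraints: over all 0/1 settings, every feasible setting has the "whole assignment" indicator equal to 1 exactly when all l component indicators are 1. Pure Python, no solver needed; l = 3 is an arbitrary illustrative choice.

```python
from itertools import product

l = 3
for bits in product([0, 1], repeat=l + 1):
    *u, u_Y = bits                              # u = component indicators, u_Y = U_{Y=y}
    feasible = (sum(u) - u_Y <= l - 1) and (sum(u) - l * u_Y >= 0)
    if feasible:
        assert u_Y == int(all(u)), (u, u_Y)     # the "iff" holds on every feasible setting
```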
Learning Classifiers
score(x, y-t, y, t) = f_y(x, y-t, t): a separate scoring function f_y for each y ∈ Y
Learn f_y for all y ∈ Y (multi-class learning)
Each example (x,y) yields the training instances {((x, y-t, t), y_t)} for t = 1..l
Learn each classifier independently
Learn with Inference Feedback
Learn by observing global behavior. For each example (x,y):
Make a prediction with the current classifiers and ILP: y’ = argmax_y Σ_t score(x, y-t, y_t, t)
For each t, update: if y’_t ≠ y_t, promote score(x, y-t, y_t, t) and demote score(x, y’-t, y’_t, t)
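A minimal sketch of this training loop, assuming the scores are linear: score(x, y-t, y, t) = w_y·phi(x, y-t, t). The feature function `phi`, the inference routine `infer_ilp`, and the sparse weight layout are hypothetical stand-ins, not the authors' exact implementation.

```python
def train_with_inference_feedback(examples, labels, phi, infer_ilp, epochs=10):
    w = {y: {} for y in labels}                       # one sparse weight vector per label

    def score(x, y_minus_t, y, t):
        feats = phi(x, y_minus_t, t)
        return sum(w[y].get(f, 0.0) * v for f, v in feats.items())

    def update(vec, feats, scale):
        for f, v in feats.items():
            vec[f] = vec.get(f, 0.0) + scale * v

    for _ in range(epochs):
        for x, y in examples:                         # y is the gold label sequence
            y_pred = infer_ilp(x, score)              # global prediction with current weights
            for t, (yt, yt_pred) in enumerate(zip(y, y_pred)):
                if yt_pred != yt:
                    update(w[yt], phi(x, y[:t] + y[t+1:], t), +1.0)                  # promote
                    update(w[yt_pred], phi(x, y_pred[:t] + y_pred[t+1:], t), -1.0)   # demote
    return w
```

Learning each classifier independently (previous slide) corresponds to replacing `infer_ilp` with per-position predictions that ignore the constraints.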
Experiments
Semantic Role Labeling
Assume correct argument boundaries are given
Only sentences with more than 5 arguments are included
Experimental Results (Winnow and Perceptron)
For the difficult task: inference feedback during training improves performance
For the easy task: learning without inference feedback is better
Conservative Updating
Update only if necessary
Example: constraint U1 + U2 = 1; predicted (U1, U2) = (1, 0); correct (U1, U2) = (0, 1)
Naive feedback: demote class 1, promote class 2
But U1 = 0 already forces U2 = 1 through the constraint, so it suffices to demote class 1 only
Conservative Updating
S = minset(constraints): the set of functions that, if changed, would make the global prediction correct
Promote (demote) only those functions in the minset S
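One way to read "minset" is: the smallest set of outputs that, once corrected, lets the constraints force the rest of the gold assignment. A brute-force illustration of that reading (my interpretation, exponential in the number of outputs, shown only to make the (U1, U2) example concrete):

```python
from itertools import combinations, product

def minset(pred, gold, satisfies):
    """Smallest index set S such that fixing the outputs in S to their gold values,
    together with the constraints, admits exactly one completion: the gold assignment."""
    n = len(pred)
    for k in range(n + 1):
        for S in combinations(range(n), k):
            completions = [y for y in product([0, 1], repeat=n)
                           if satisfies(y) and all(y[i] == gold[i] for i in S)]
            if completions == [gold]:
                return S
    return tuple(range(n))

# Slide example: constraint U1 + U2 = 1, predicted (1, 0), correct (0, 1).
print(minset((1, 0), (0, 1), lambda y: y[0] + y[1] == 1))   # -> (0,): only class 1 is updated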
Hierarchical Learning
Given x, compute hierarchically:
z1 = h1(x), z2 = h2(x, z1), …, y = h_{s+1}(x, z1, …, zs)
Assume all z are known in training
Hierarchical Learning
Assume each h_j can be computed via ILP with p_j, U_j, c_j
y = argmax_y max_{z1,…,zs} Σ_j λ_j p_j·U_j
subject to c1U1 ≥ 0, c2U2 ≥ 0, …, c_{s+1}U_{s+1} ≥ 0
where each λ_j is a constant large enough to preserve the hierarchy
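A sketch of combining the level-wise ILPs into one weighted objective, again with PuLP (assumed available). Each level j contributes its own variables U_j, cost vector p_j, and constraint rows c_j; the weights `lam` play the role of the λ_j above. The coupling constraints that tie one level's output to the next level's input are omitted here; all names are illustrative, not the authors' code.

```python
import pulp

def hierarchical_ilp(levels, lam):
    # levels: list of (p, c) pairs; p is a cost vector, c a list of constraint rows
    prob = pulp.LpProblem("hierarchical", pulp.LpMaximize)
    all_U, objective = [], 0
    for j, (p, c) in enumerate(levels):
        U = [pulp.LpVariable(f"U_{j}_{i}", cat="Binary") for i in range(len(p))]
        all_U.append(U)
        objective += lam[j] * pulp.lpSum(p[i] * U[i] for i in range(len(p)))  # lambda_j * p_j.U_j
        for row in c:                                                         # c_j U_j >= 0
            prob += pulp.lpSum(row[i] * U[i] for i in range(len(p))) >= 0
    prob += objective
    prob.solve()
    return [[int(u.value() > 0.5) for u in U] for U in all_U]
```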
Hidden Variables
Given x, y = h(x, z), where z is not known in training
y = argmax_y max_z Σ_t score(x, z, y-t, y_t, t), subject to some constraints
Learning with Hidden Variables
Truncated-EM-style learning. For each example (x,y):
Compute z with the current classifiers and ILP: z = argmax_z Σ_t score(x, z, y-t, y_t, t)
Make a prediction with the current classifiers and ILP: (y’, z’) = argmax_{y,z} Σ_t score(x, z, y-t, y_t, t)
For each t, update: if y’_t ≠ y_t, promote score(x, z, y-t, y_t, t) and demote score(x, z’, y’-t, y’_t, t)
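A rough sketch of this truncated-EM-style loop. `infer_z` fixes the gold y and fills in z with the current classifiers and ILP; `infer_yz` jointly predicts (y’, z’); `promote`/`demote` are the same kind of hypothetical per-classifier updates as in the inference-feedback sketch earlier.

```python
def train_with_hidden_variables(examples, infer_z, infer_yz, promote, demote, epochs=10):
    for _ in range(epochs):
        for x, y in examples:
            z = infer_z(x, y)               # z = argmax_z sum_t score(x, z, y_-t, y_t, t)
            y_pred, z_pred = infer_yz(x)    # (y', z') = argmax_{y,z} sum_t score(x, z, y_-t, y_t, t)
            for t, (yt, yt_pred) in enumerate(zip(y, y_pred)):
                if yt_pred != yt:
                    promote(x, z, y, t)           # promote score(x, z, y_-t, y_t, t)
                    demote(x, z_pred, y_pred, t)  # demote score(x, z', y'_-t, y'_t, t)
```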
Conclusion
ILP is: powerful, general, learnable, useful, fast (or at least not too slow), extendable
Boolean Functions as Linear Constraints
Conjunction a ∧ b ∧ c: Ua + Ub + Uc ≥ 3
Disjunction a ∨ b ∨ c: Ua + Ub + Uc ≥ 1
DNF (a ∧ b) ∨ (c ∧ d): I_ab + I_cd ≥ 1, introducing new variables I_ab, I_cd
Helper Variables
We must link Ia, Ib, and I_ab so that I_ab ⇔ a ∧ b, i.e., I_ab ⇔ Ia ∧ Ib:
2·I_ab ≤ Ia + Ib (I_ab = 1 only if both Ia and Ib are 1)
Ia + Ib − I_ab ≤ 1 (if both Ia and Ib are 1, then I_ab = 1)
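A sketch of encoding the DNF constraint (a ∧ b) ∨ (c ∧ d) with these helper variables in PuLP (assumed available; variable names are illustrative, and the placeholder objective stands in for the real cost vector from the earlier slides).

```python
import pulp

prob = pulp.LpProblem("dnf_example", pulp.LpMaximize)
Ia, Ib, Ic, Id, Iab, Icd = (pulp.LpVariable(n, cat="Binary")
                            for n in ["Ia", "Ib", "Ic", "Id", "Iab", "Icd"])
prob += 0 * Ia                      # placeholder objective; the real cost function goes here
prob += 2 * Iab <= Ia + Ib          # I_ab = 1 only if both Ia and Ib are 1
prob += Ia + Ib - Iab <= 1          # if both Ia and Ib are 1, then I_ab = 1
prob += 2 * Icd <= Ic + Id
prob += Ic + Id - Icd <= 1
prob += Iab + Icd >= 1              # the DNF itself: ab OR cd must hold
```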
Semantic Role Labeling
Indicator variables: Ia, Ib, Ic, ... correspond to ph1 = A0, ph1 = A1, ph2 = A0, ...
Ia indicates that phrase 1 is labeled A0
Cost vector: pa = score(ph1 = A0), pb = score(ph1 = A1), ...
So pa·Ia = 0.3 if Ia = 1 and 0 otherwise
Learning
X = (X1, ..., Xk) ∈ X1 × … × Xk = X
Y-t = (Y1, ..., Y_{t-1}, Y_{t+1}, ..., Yl) ∈ Y1 × … × Y_{t-1} × Y_{t+1} × … × Yl = Y-t
Yt ∈ Yt
Given X = x and Y-t = y-t, find Yt = yt, or a score for each possible yt:
X × Y-t → Yt, or X × Y-t × Yt → R
SRL via Generalized Inference
Outline
Find potential argument candidates
Classify arguments into types
Inference for the argument structure: integer linear programming (ILP), cost function, constraints
Features

Notes: We follow a now seemingly standard approach to SRL. Given a sentence, we first find a set of potential argument candidates by identifying which words are at the border of an argument. Then, once we have a set of potential arguments, we use a suite of classifiers to tell us how likely each argument is to be of each type. Finally, we use all of the information we have so far to find the assignment of types to arguments that gives us the “optimal” global assignment. Similar approaches (with similar results) use inference procedures tied to their representation. Instead, we use a general inference procedure by setting up the problem as a linear programming problem. This is really where our technique allows us to apply powerful information that similar approaches cannot.
Find Potential Arguments
Example: I left my nice pearls to her
Every chunk can be an argument, so restrict the set of potential arguments:
BEGIN(word) = 1: “word begins an argument”
END(word) = 1: “word ends an argument”
(wi, ..., wj) is a potential argument iff BEGIN(wi) = 1 and END(wj) = 1
This reduces the set of potential arguments
(Slide figure: the example sentence with candidate span brackets [ [ [ [ [ ] ] ] ] ])
Details...
Learn a function BEGIN(word): B(word, context, structure) → {0,1}
Learn a function END(word): E(word, context, structure) → {0,1}
POTARG = {arg | BEGIN(first(arg)) and END(last(arg))}
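A small sketch of candidate generation from the BEGIN/END classifiers. `begin(i)` and `end(j)` are hypothetical stand-ins for the learned B(word, context, structure) and E(word, context, structure) functions, returning 0/1 for the word at each position.

```python
def potential_arguments(words, begin, end):
    """POTARG: all spans (i, j) whose first word begins and last word ends an argument."""
    candidates = []
    for i in range(len(words)):
        if not begin(i):
            continue
        for j in range(i, len(words)):
            if end(j):
                candidates.append((i, j))   # argument spans words[i..j]
    return candidates
```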
Argument Type Likelihood
Assign a type likelihood: how likely is it that argument a is of type t?
For all a ∈ POTARG, t ∈ T, estimate P(argument a = type t)
Example (I left my nice pearls to her), likelihoods for two of the candidates:
            A0    CA1   A1    Ø
candidate 1: 0.3   0.2   0.2   0.3
candidate 2: 0.6   0.0   0.0   0.4
Details...
Learn a classifier ARGTYPE(arg): P(arg) over {A0, A1, ..., CA0, ..., LOC, ...}
Predict argmax_{t ∈ {A0, A1, ..., CA0, ..., LOC, ...}} w_t·P(arg)
Estimate probabilities: P(a = t) = w_t·P(a) / Z
What is a Good Assignment?
Likelihood of being correct: P(arg a = type t), if t is the correct type for argument a
For a set of arguments a1, a2, ..., an, the expected number of correct arguments is Σ_i P(ai = ti)
We search for the assignment with the maximum expected number correct
Inference
Maximize the expected number correct: T* = argmax_T Σ_i P(ai = ti), subject to structural and linguistic constraints
Example (I left my nice pearls to her), type likelihoods for the four candidates (columns A0, CA1, A1, Ø):
candidate 1: 0.3  0.2  0.2  0.3
candidate 2: 0.6  0.0  0.0  0.4
candidate 3: 0.1  0.3  0.5  0.1
candidate 4: 0.1  0.2  0.3  0.4
Independent max: cost = 0.3 + 0.6 + 0.5 + 0.4 = 1.8
Best non-overlapping assignment: cost = 0.3 + 0.4 + 0.5 + 0.4 = 1.6
Best assignment satisfying both the non-overlapping and linguistic constraints: cost = 0.3 + 0.4 + 0.3 + 0.4 = 1.4
Everything is Linear
Cost function: Σ_{a ∈ POTARG} P(a = t_a) = Σ_{a ∈ POTARG, t ∈ T} P(a = t)·I_{a,t}
Constraints:
Non-overlapping: if a and a’ overlap, at least one must be Ø: I_{a,Ø} + I_{a’,Ø} ≥ 1
Linguistic: CA0 implies A0: Σ_a I_{a,CA0} ≤ Σ_a I_{a,A0}
This is an integer linear program
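A sketch of the resulting SRL inference ILP in PuLP (assumed available): indicators I[a, t], an objective equal to the expected number of correct arguments, coherency (each candidate gets exactly one type, possibly the null type Ø), non-overlap, and the CA0 ⇒ A0 linguistic constraint. `probs[a][t]` stands in for P(a = t), and `types` is assumed to include "A0", "CA0", and the null type "Ø".

```python
import pulp

NULL = "Ø"

def srl_ilp(probs, types, overlapping_pairs):
    args = range(len(probs))
    prob = pulp.LpProblem("srl", pulp.LpMaximize)
    I = {(a, t): pulp.LpVariable(f"I_{a}_{k}", cat="Binary")
         for a in args for k, t in enumerate(types)}
    prob += pulp.lpSum(probs[a][t] * I[a, t] for a in args for t in types)   # expected correct
    for a in args:                                    # coherency: exactly one type per candidate
        prob += pulp.lpSum(I[a, t] for t in types) == 1
    for a, b in overlapping_pairs:                    # overlapping candidates: at least one is null
        prob += I[a, NULL] + I[b, NULL] >= 1
    # CA0 implies A0 (analogous constraints would cover the other continuation types)
    prob += pulp.lpSum(I[a, "CA0"] for a in args) <= pulp.lpSum(I[a, "A0"] for a in args)
    prob.solve()
    return {a: next(t for t in types if I[a, t].value() > 0.5) for a in args}
```

Run on the four-candidate example from the inference slide, this returns the constrained assignment with cost 1.4 rather than the unconstrained independent max of 1.8.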
Features are Important Here, a discussion of the features should go. Which are most important? Comparison to other people.