Amortized Integer Linear Programming Inference
Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign
Inferning Workshop, ICML, Atlanta GA, June 2013
With thanks to collaborators: Gourab Kundu, Vivek Srikumar, and many others. Funding: NSF; DHS; NIH; DARPA. DASH Optimization (Xpress-MP).
Learning and Inference
Global decisions in which several local decisions play a role, but with mutual dependencies among their outcomes. In current NLP we often think about simpler structured problems: parsing, information extraction, SRL, etc. As we move up the problem hierarchy (textual entailment, QA, ...), not all component models can be learned simultaneously, so we need to think about (learned) models for the different sub-problems. Knowledge relating the sub-problems (constraints) becomes more essential and may appear only at evaluation time.
Goal: incorporate the models' information, along with prior knowledge (constraints), in making coherent decisions: decisions that respect the local models as well as domain- and context-specific knowledge/constraints.
Comprehension
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin's dad was a magician. 4. Christopher Robin must be at least 65 now.
This is an Inference Problem.
Outline
Integer Linear Programming formulations for natural language processing: examples.
Amortized inference: what is it and why could it be possible? The general scheme. Theorems for amortized inference: making the k-th inference problem cheaper than the first. Full structures; partial structures. Experimental results.
Semantic Role Labeling
I left my pearls to my daughter in my will.
[I]_A0 left [my pearls]_A1 [to my daughter]_A2 [in my will]_AM-LOC.
A0: Leaver; A1: Things left; A2: Benefactor; AM-LOC: Location.
An archetypical information extraction problem: e.g., concept identification and typing, event identification, etc.
Algorithmic Approach
Identify argument candidates: pruning [Xue & Palmer, EMNLP'04]; an argument identifier performs binary classification over candidate spans of the sentence (e.g., bracketing candidates in "I left my nice pearls to her").
Classify argument candidates: an argument classifier performs multi-class classification over the surviving candidates.
Inference: use the estimated probability distribution given by the argument classifier, together with structural and linguistic constraints, to infer the optimal global output. One inference problem is solved for each verb predicate.
The variable y_{a,t} indicates whether candidate argument a is assigned label t; c_{a,t} is the corresponding model score. The inference problem is
  argmax_y Σ_{a,t} c_{a,t} y_{a,t}
subject to: one label per argument (Σ_t y_{a,t} = 1 for every a); no overlapping or embedding arguments; relations between verbs and arguments; etc.
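As a concrete and deliberately tiny illustration of this inference step, the sketch below enumerates label assignments by brute force instead of calling an ILP solver. The candidate spans, label set, and scores are made up for illustration and are not taken from the Illinois SRL system.

```python
from itertools import product

# Hypothetical candidate argument spans (start, end) and per-label scores c[a][t].
# "NULL" means the candidate is not an argument; all numbers are illustrative only.
candidates = [(0, 1), (2, 5), (2, 3)]          # the last two spans overlap
labels = ["NULL", "A0", "A1", "A2", "AM-LOC"]
scores = {
    (0, 1): {"NULL": 0.0, "A0": 2.1, "A1": 0.3, "A2": 0.1, "AM-LOC": 0.0},
    (2, 5): {"NULL": 0.1, "A0": 0.2, "A1": 1.7, "A2": 0.4, "AM-LOC": 0.2},
    (2, 3): {"NULL": 0.2, "A0": 0.1, "A1": 1.5, "A2": 0.3, "AM-LOC": 0.1},
}

def overlaps(s1, s2):
    return s1[0] < s2[1] and s2[0] < s1[1]

def feasible(assignment):
    """Check the 'no overlapping arguments' constraint among non-NULL spans."""
    active = [a for a, t in zip(candidates, assignment) if t != "NULL"]
    return all(not overlaps(a, b) for i, a in enumerate(active) for b in active[i + 1:])

# Exhaustive search: the 'one label per argument' constraint holds by construction,
# since each candidate receives exactly one label from `labels`.
best = max(
    (a for a in product(labels, repeat=len(candidates)) if feasible(a)),
    key=lambda a: sum(scores[c][t] for c, t in zip(candidates, a)),
)
print(dict(zip(candidates, best)))   # {(0, 1): 'A0', (2, 5): 'A1', (2, 3): 'NULL'}
```

A real system would encode the same objective and constraints as a 0-1 ILP and hand it to a solver; the brute-force search here only makes the objective and the two constraints explicit.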
Verb SRL is not Sufficient
John, a fast-rising politician, slept on the train to Chicago.
Verb predicate: sleep. Sleeper: John, a fast-rising politician. Location: on the train to Chicago.
Who was John? Relation: apposition (comma): John, a fast-rising politician.
What was John's destination? Relation: destination (preposition): train to Chicago.
Examples of preposition relations: Queen of England; City of Chicago.
Coherence of predictions
The bus was heading for Nairobi in Kenya.
Preposition relations: Destination (for Nairobi), Location (in Kenya).
Verb SRL: predicate head.02; A0 (mover): The bus; A1 (destination): for Nairobi in Kenya.
Predicate arguments from different triggers should be consistent: the preposition's Destination should agree with the verb's A1. Joint constraints link the two tasks.
Joint inference (CCMs)
Variables: for the verb, y_{a,t} indicates whether candidate argument a is assigned label t, with model score c_{a,t}; analogous variables and scores are defined for each preposition and its relation label, with re-scaling parameters (one per label) weighting the two components. The objective simply adds the verb-argument terms and the preposition-relation terms.
Constraints: the verb SRL constraints; only one label per preposition; and joint constraints between the two tasks, which are easy to add in the ILP formulation.
This is joint inference with no (or minimal) joint learning.
Constrained Conditional Models (ILP Formulations)
These formulations have been shown useful in the context of many NLP problems [Roth & Yih 04, 07: entities and relations; Punyakanok et al.: SRL; ...]: summarization, co-reference, information and relation extraction, event identification, transliteration, textual entailment, knowledge acquisition, sentiment analysis, temporal reasoning, dependency parsing, and more.
There is some theoretical work on training paradigms [Punyakanok et al. 05 and more; Constraint-Driven Learning, PR, Constrained EM, ...] and some work on inference, mostly approximations, bringing back ideas on Lagrangian relaxation, etc.
A good summary and description of training paradigms: [Chang, Ratinov & Roth, Machine Learning Journal 2012]. Summary of work and a bibliography: http://L2R.cs.uiuc.edu/tutorials.html
Outline
Integer Linear Programming formulations for natural language processing: examples.
Amortized inference: what is it and why could it be possible? The general scheme. Theorems for amortized inference: making the k-th inference problem cheaper than the first. Full structures; partial structures. Experimental results.
Inference in NLP
In NLP we typically don't solve a single inference problem; we solve one or more per sentence. Beyond improving the inference algorithm itself, what can be done?
S1: He is reading a book. S2: They are watching a movie. POS: PRP VBZ VBG DT NN.
S1 and S2 look very different, but their output structures are the same; the inference outcomes are the same. After inferring the POS structure for S1, can we speed up inference for S2? Can we make the k-th inference problem cheaper than the first?
Amortized ILP Inference [Kundu, Srikumar & Roth, EMNLP-12, ACL-13]
We formulate the problem of amortized inference: reducing inference time over the lifetime of an NLP tool. We develop conditions under which the solution of a new, previously unseen problem can be exactly inferred from earlier solutions without invoking a solver. This results in a family of exact inference schemes. The algorithms are invariant to the underlying solver; we simply reduce the number of calls to the solver. This yields significant improvements, both in the number of solver calls and in wall clock time, in a state-of-the-art semantic role labeling system.
The Hope: POS Tagging on Gigaword
[Plot over the number of tokens per sentence: the number of examples of a given size vs. the number of unique POS tag sequences.]
The number of structures is much smaller than the number of sentences.
The Hope: Dependency Parsing on Gigaword
[Plot over the number of tokens per sentence: the number of examples of a given size vs. the number of unique dependency trees.]
The number of structures is much smaller than the number of sentences.
The Hope: Semantic Role Labeling on Gigaword
[Plot over the number of arguments per predicate: the number of examples of a given size vs. the number of unique SRL structures.]
The number of structures is much smaller than the number of sentences.
POS Tagging on Gigaword
How skewed is the distribution of the structures? [Plot over the number of tokens per sentence.] A small number of structures occur very frequently.
Amortized ILP Inference
These statistics show that many different instances are mapped to identical inference outcomes. How can we exploit this fact to save inference cost? After solving n inference problems, can we make the (n+1)-th one faster?
We do this in the context of 0-1 integer linear programming, the most commonly used formulation in NLP:
  max c·x  subject to  Ax ≤ b,  x ∈ {0,1}^n
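To make the later sketches concrete, here is a minimal, self-contained representation of such a 0-1 ILP in Python, with a brute-force routine standing in for the base solver (Gurobi, Xpress-MP, etc.). The toy problem matches the two-constraint example used on the following slides; everything else (function names, data layout) is an assumption of the sketch.

```python
import itertools
import numpy as np

def solve_01ilp(c, A, b):
    """Brute-force 'base solver' for a tiny 0-1 ILP: max c.x s.t. Ax <= b, x in {0,1}^n.
    Only meant to stand in for a real solver in these sketches."""
    n = len(c)
    best_x, best_val = None, -np.inf
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits)
        if np.all(A @ x <= b) and c @ x > best_val:
            best_x, best_val = x, c @ x
    return best_x, best_val

# Toy problem: objective 2x1 + 4x2 + 2x3 + 0.5x4,
# constraints x1 + x2 <= 1 and x3 + x4 <= 1.
A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])
b = np.array([1, 1])
c = np.array([2.0, 4.0, 2.0, 0.5])
print(solve_01ilp(c, A, b))   # optimal solution x* = (0, 1, 1, 0), value 6.0
```

Any real solver could replace solve_01ilp; the amortization schemes below only care about how often it gets called.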
Equivalence Classes
We define an equivalence class as the set of ILPs that have the same number of inference variables and the same feasible set (the same constraints, modulo variable renaming).
Example: P and Q below belong to the same equivalence class; they differ only in their objective coefficients c_P and c_Q.
P: max 2x_1 + 3x_2 + 2x_3 + x_4, subject to x_1 + x_2 ≤ 1, x_3 + x_4 ≤ 1; optimal solution x*_P = (0, 1, 1, 0).
Q: max 2x_1 + 4x_2 + 2x_3 + 0.5x_4, subject to x_1 + x_2 ≤ 1, x_3 + x_4 ≤ 1.
For problems in a given equivalence class, we give conditions on the objective functions under which the solution of a new problem Q is the same as that of a problem P we have already cached.
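One way to implement the equivalence-class lookup is to key the cache on the constraint set and the number of variables only, leaving the objective out of the key. The sketch below does exactly that; note that it ignores the "modulo renaming" part of the definition, which a real implementation would have to canonicalize (an assumption of this sketch).

```python
import numpy as np

def equivalence_key(A, b):
    """Hypothetical cache key for an ILP's equivalence class: problems with the same
    number of variables and the same feasible set (same A, b in this canonical byte
    representation) land in the same bucket. Objective coefficients are NOT part of
    the key; that is exactly what the amortization theorems reason about."""
    A = np.asarray(A, dtype=np.int64)
    b = np.asarray(b, dtype=np.int64)
    return (A.shape[1], A.tobytes(), b.tobytes())

# P and Q from the slides share constraints x1 + x2 <= 1, x3 + x4 <= 1,
# so they map to the same equivalence class even though their objectives differ.
A = [[1, 1, 0, 0], [0, 0, 1, 1]]
b = [1, 1]
print(equivalence_key(A, b) == equivalence_key(A, b))   # True
```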
The Recipe
Given a cache of solved ILPs and a new problem:
If CONDITION(cache, new problem), then SOLUTION(new problem) = the cached solution.
Else, call the base solver and update the cache.
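A hypothetical rendering of this recipe as code: the condition test and the base solver are pluggable, so any of the theorems below (and any ILP solver) can be dropped in. The cache layout and function names are assumptions of the sketch, not the authors' implementation.

```python
import numpy as np

def amortized_solve(c, A, b, cache, condition, base_solver):
    """The recipe: if some cached problem in the same equivalence class satisfies
    CONDITION for the new objective c, reuse its solution; otherwise call the base
    solver and add the new problem to the cache."""
    key = (np.asarray(A).tobytes(), np.asarray(b).tobytes())
    for c_cached, x_cached in cache.get(key, []):
        if condition(c_cached, x_cached, c):
            return x_cached, False          # cache hit: no solver call
    x, _ = base_solver(c, A, b)
    cache.setdefault(key, []).append((np.asarray(c, dtype=float), x))
    return x, True                          # cache miss: the solver was called
```

Used with the brute-force solver sketched earlier and, say, the Theorem I test below, the returned flag records whether the base solver was actually invoked, which is the quantity the speedup numbers later in the talk measure.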
Amortized Inference Experiments
Setup: verb semantic role labeling (other results appear at the end of the talk). Speedup and accuracy are measured over the WSJ test set (Section 23). The baseline solves each ILP with the Gurobi solver. For amortization, we cache 250,000 SRL inference problems from Gigaword; for each problem in the test set, we invoke an amortized inference algorithm.
Theorem I
P: max 2x_1 + 3x_2 + 2x_3 + x_4, subject to x_1 + x_2 ≤ 1, x_3 + x_4 ≤ 1; optimal solution x*_P = (0, 1, 1, 0).
Q: max 2x_1 + 4x_2 + 2x_3 + 0.5x_4, subject to the same constraints.
If the objective coefficients of the active variables (those set to 1 in x*_P) did not decrease from P to Q, and the objective coefficients of the inactive variables (those set to 0) did not increase from P to Q, then the optimal solution of Q is the same as P's: x*_P = x*_Q.
Theorem I (compactly)
If x*_{P,i} = 0 implies c_{Q,i} ≤ c_{P,i}, and x*_{P,i} = 1 implies c_{Q,i} ≥ c_{P,i}, for every variable i, then x*_P is an optimal solution of Q.
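This condition is cheap to test. A small sketch (variable and function names are mine, not the paper's):

```python
import numpy as np

def theorem_1_holds(c_P, x_star_P, c_Q):
    """Theorem I test: the cached optimum x*_P of P also solves Q if, going from P
    to Q, no active variable's coefficient decreased and no inactive variable's
    coefficient increased. Sufficient, not necessary."""
    c_P, c_Q, x = map(np.asarray, (c_P, c_Q, x_star_P))
    active_ok = np.all(c_Q[x == 1] >= c_P[x == 1])
    inactive_ok = np.all(c_Q[x == 0] <= c_P[x == 0])
    return active_ok and inactive_ok

# The slide's example: P has c_P = (2, 3, 2, 1) with x*_P = (0, 1, 1, 0);
# Q has c_Q = (2, 4, 2, 0.5). Active coefficients did not decrease and inactive
# ones did not increase, so the cached solution can be reused.
print(theorem_1_holds([2, 3, 2, 1], [0, 1, 1, 0], [2, 4, 2, 0.5]))   # True
```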
Speedup & Accuracy
[Bar chart: speedup relative to the baseline solver.]
Amortized inference gives a speedup without losing accuracy; we solve only 40% of the problems.
Theorem II
P1: max 2x_1 + 3x_2 + 2x_3 + x_4, subject to x_1 + x_2 ≤ 1, x_3 + x_4 ≤ 1.
P2: max 2x_1 + 4x_2 + 2x_3 + 0.5x_4, subject to the same constraints.
Q: max 10x_1 + 18x_2 + 10x_3 + 3.5x_4, subject to the same constraints; note that c_Q = 2c_P1 + 3c_P2.
Since x*_P1 = x*_P2 and c_Q is a non-negative combination of c_P1 and c_P2, the optimal solution of Q is the same: x*_Q = x*_P1 = x*_P2.
Theorem II (geometric interpretation)
[Figure: the feasible region with solution x*, and the cone spanned by the objective vectors c_P1 and c_P2.]
ILPs corresponding to all objective vectors in this cone share the same maximizer for this feasible region.
Theorem II
If the objective coefficients of Q can be written as a non-negative combination of the objective coefficients of previously solved problems in the same equivalence class, and all of those problems share the same optimal solution x*, then x* is also an optimal solution of Q.
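One way to test this cone-membership condition is non-negative least squares; that choice is an assumption of the sketch (the paper does not prescribe this particular test), and the tolerance is arbitrary.

```python
import numpy as np
from scipy.optimize import nnls

def theorem_2_holds(cached_objectives, c_Q, tol=1e-9):
    """Theorem II test sketch: if c_Q is a non-negative combination of the objective
    vectors of cached problems that all share the same optimal solution, that
    solution is also optimal for Q. Cone membership is checked here with
    non-negative least squares."""
    M = np.column_stack([np.asarray(c, dtype=float) for c in cached_objectives])
    weights, residual = nnls(M, np.asarray(c_Q, dtype=float))
    return residual <= tol, weights

# The slide's example: c_Q = 2*c_P1 + 3*c_P2, and both P1 and P2 have x* = (0, 1, 1, 0).
ok, w = theorem_2_holds([[2, 3, 2, 1], [2, 4, 2, 0.5]], [10, 18, 10, 3.5])
print(ok, w)   # True, weights approximately [2. 3.]
```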
Theorem III (Combining I and II)
Theorem III
Consider the objective values, under P and under Q, of two competing structures: y*, the solution to problem P, and some other structure y; the structured margin of P separates them. Moving from P to Q, the objective value of the solution can decrease by A = (c_P − c_Q)·y*, and the objective value of a competing structure can increase by B = (c_Q − c_P)·y.
Theorem (margin-based amortized inference): if A + B is less than the structured margin, then y* is still the optimum for Q.
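A sketch of the margin test, assuming the structured margin of P was stored at caching time; the bound on B used here (the sum of the positive coefficient increases) is a crude relaxation that I am substituting for the one in the paper.

```python
import numpy as np

def theorem_3_holds(c_P, c_Q, y_star, margin):
    """Margin-based (Theorem III) test sketch. `margin` is the structured margin of
    the cached problem P; the competitors' gain B is upper-bounded by summing all
    positive coefficient increases, which is valid for 0-1 variables."""
    c_P, c_Q, y = (np.asarray(v, dtype=float) for v in (c_P, c_Q, y_star))
    A = (c_P - c_Q) @ y                       # drop in the cached solution's value
    B = np.clip(c_Q - c_P, 0.0, None).sum()   # upper bound on any competitor's gain
    return A + B <= margin

# Toy usage with made-up numbers: a cached solution with structured margin 2.0
# survives a small perturbation of the objective.
print(theorem_3_holds([2, 3, 2, 1], [2, 3.2, 2, 1.1], [0, 1, 1, 0], margin=2.0))  # True
```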
Speedup & Accuracy
[Bar chart: speedup of the amortization schemes of [EMNLP'12, ACL'13] relative to the baseline (1.0).]
Amortized inference gives a speedup without losing accuracy; we solve only one in three problems.
So far…
Amortized inference: making inference faster by re-using previous computations, and techniques for doing so.
But these techniques are not useful if the full structure is not redundant (i.e., does not repeat). Smaller structures are more redundant.
Decomposed amortized inference
Take advantage of redundancy in components of structures: extend the amortization techniques to cases where the full structured output may not be repeated, by storing partial computations over "components" for use in future inference problems.
Coherence of predictions
The bus was heading for Nairobi in Kenya.
Preposition relations: Destination (for Nairobi), Location (in Kenya).
Verb SRL: predicate head.02; A0 (mover): The bus; A1 (destination): for Nairobi in Kenya.
Predicate arguments from different triggers should be consistent: the preposition's Destination should agree with the verb's A1. Joint constraints link the two tasks.
Example: Decomposition for inference
Split the joint problem into the verb problem and the preposition problem. The constraints are the verb SRL constraints, the "only one label per preposition" constraint, and the joint constraints linking verb relations and preposition relations. For decomposition, the joint constraints are removed and then re-introduced using Lagrangian relaxation [Komodakis et al., 2007], [Rush & Collins, 2011], [Chang & Collins, 2011], ...
Decomposed amortized inference: intuition
Create smaller problems by removing constraints from the ILPs; smaller problems mean more cache hits. Solve the relaxed inference problems using any amortized inference algorithm, and re-introduce the removed constraints via Lagrangian relaxation.
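A toy dual-decomposition sketch of this idea, with one coupling constraint (the verb's and the preposition's "destination" decisions must agree). The two single-variable subproblems stand in for the verb and preposition ILPs, which in the full scheme would themselves be solved by an amortized solver; the scores, step size, and iteration count are made up.

```python
# Dual decomposition sketch (assumed, simplified): the joint constraint y_verb = y_prep
# is dropped, each piece is solved independently (each solve can be amortized, since
# the smaller pieces repeat far more often), and the constraint is re-introduced via
# a Lagrange multiplier.

def solve_verb(score_dest, lam):
    # maximize score_dest * y - lam * y over y in {0, 1}
    return 1 if score_dest - lam > 0 else 0

def solve_prep(score_dest, lam):
    # maximize score_dest * y + lam * y over y in {0, 1}
    return 1 if score_dest + lam > 0 else 0

def dual_decomposition(verb_score, prep_score, steps=50, eta=0.1):
    lam = 0.0
    for _ in range(steps):
        y_v = solve_verb(verb_score, lam)
        y_p = solve_prep(prep_score, lam)
        if y_v == y_p:                 # agreement: the relaxed solution is feasible
            return y_v, y_p
        lam += eta * (y_v - y_p)       # subgradient step on the multiplier
    return y_v, y_p                    # may still disagree if not converged

print(dual_decomposition(verb_score=0.7, prep_score=-0.2))   # (1, 1)
```

When the two subproblems agree, the relaxed solution is feasible for the joint problem; otherwise the multiplier keeps nudging the pieces toward agreement.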
Speedup & Accuracy
[Bar chart: speedup of the amortization schemes of [EMNLP'12, ACL'13] relative to the baseline (1.0).]
Amortized inference gives a speedup without losing accuracy; we solve only one in six problems.
Reduction in inference calls (SRL)
We solve only one in six problems.
Reduction in inference calls (entity-relation extraction)
We solve only one in four problems.
So far…
We have given theorems that allow saving 5/6 of the calls to your favorite inference engine. But there is some cost: checking the conditions of the theorems and accessing the cache. Our implementations are clearly not state-of-the-art, but…
Reduction in wall-clock time (SRL)
In wall-clock terms, this amounts to solving only one in 2.6 problems.
Conclusion
Amortized inference: we gave conditions for determining when a new, unseen problem shares a previously seen solution (or parts of it). The theory depends on the ILP formulation of the problem but applies to your favorite inference algorithm; in particular, approximate inference can be used as the base solver, and the approximation properties of the underlying algorithm are retained. We showed that we can save 5/6 of the calls to an inference engine, and the theorems can be relaxed to increase cache hits.
Integer Linear Programming formulations are powerful. We already knew that they are expressive and easy to use in many problems; moreover, even if you want to use other solvers, we showed that the ILP formulation is key to amortization.
Thank You!
Theorem III
As before: y* is the solution to problem P; for a competing structure y, the decrease in the objective value of the solution is A = (c_P − c_Q)·y*, and the increase in the objective value of the competing structures is B = (c_Q − c_P)·y. If A + B is less than the structured margin, then y* is still the optimum for Q.
The structured margin is easy to compute during caching, and A is easy to compute. B requires a max over y, which is hard to compute, so we relax the problem, which only increases B.
Experiments: Semantic Role Labeling
SRL is based on the state-of-the-art Illinois SRL [V. Punyakanok, D. Roth and W. Yih, The Importance of Syntactic Parsing and Inference in Semantic Role Labeling, Computational Linguistics 2008]. In SRL, we solve an ILP problem for each verb predicate in each sentence.
Amortization experiments: speedup and accuracy are measured over the WSJ test set (Section 23); the baseline solves each ILP with Gurobi 4.6. For amortization, we collect 250,000 SRL inference problems from Gigaword and store them in a database. For each ILP in the test set, we invoke one of the theorems (exact or approximate); if a matching cached solution is found, we return it, otherwise we call the baseline ILP solver.
Inference with General Constraint Structure [Roth & Yih '04, '07]: Recognizing Entities and Relations
Dole's wife, Elizabeth, is a native of N.C.
Entity scores: E1 (Dole): other 0.05, per 0.85, loc 0.10. E2 (Elizabeth): other 0.10, per 0.60, loc 0.30. E3 (N.C.): other 0.05, per 0.50, loc 0.45.
Relation scores: R12: irrelevant 0.05, spouse_of 0.45, born_in 0.50. R23: irrelevant 0.10, spouse_of 0.05, born_in 0.85.
An objective function that incorporates learned models with knowledge (constraints), a constrained conditional model:
  y = argmax_y Σ_v score(y = v)·[[y = v]]
    = argmax score(E1 = PER)·[[E1 = PER]] + score(E1 = LOC)·[[E1 = LOC]] + ... + score(R12 = S-of)·[[R12 = S-of]] + ...
  subject to constraints.
Improvement over no inference: 2-5%. Models could be learned separately; constraints may come up only at decision time. Note that this is a non-sequential model. Key questions: how to guide the global inference, and how to learn: why not jointly?
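To make the role of the constraints visible, here is a brute-force version of this joint assignment using the scores above; the mapping of score tables to variables and the exact constraint encoding are assumptions of this sketch.

```python
from itertools import product

# Scores as listed on the slide (the assignment of each table to E1/E2/E3 and
# R12/R23 follows the usual presentation of this example).
ent_scores = {
    "E1": {"other": 0.05, "per": 0.85, "loc": 0.10},
    "E2": {"other": 0.10, "per": 0.60, "loc": 0.30},
    "E3": {"other": 0.05, "per": 0.50, "loc": 0.45},
}
rel_scores = {
    "R12": {"irrelevant": 0.05, "spouse_of": 0.45, "born_in": 0.50},
    "R23": {"irrelevant": 0.10, "spouse_of": 0.05, "born_in": 0.85},
}

def consistent(rel, e_a, e_b):
    """Constraints: spouse_of(a, b) needs per/per; born_in(a, b) needs per/loc."""
    if rel == "spouse_of":
        return e_a == "per" and e_b == "per"
    if rel == "born_in":
        return e_a == "per" and e_b == "loc"
    return True

best, best_score = None, float("-inf")
for e1, e2, e3, r12, r23 in product(
        *(list(ent_scores[e]) for e in ("E1", "E2", "E3")),
        *(list(rel_scores[r]) for r in ("R12", "R23"))):
    if not (consistent(r12, e1, e2) and consistent(r23, e2, e3)):
        continue
    score = (ent_scores["E1"][e1] + ent_scores["E2"][e2] + ent_scores["E3"][e3]
             + rel_scores["R12"][r12] + rel_scores["R23"][r23])
    if score > best_score:
        best, best_score = (e1, e2, e3, r12, r23), score
print(best)   # ('per', 'per', 'loc', 'spouse_of', 'born_in')
```

Without the constraints, R12 would be labeled born_in (0.50 > 0.45) and E3 would be labeled per (0.50 > 0.45); the constraints push the joint optimum to the coherent assignment above.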