Constraints Driven Learning for Natural Language Understanding
Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign
June 2011, Microsoft Research, Washington
With thanks to collaborators: Scott Yih, Ming-Wei Chang, James Clarke, Dan Goldwasser, Lev Ratinov, Vivek Srikumar, and many others.
Funding: NSF (ITR IIS, SoD-HCER), DHS; DARPA Bootstrap Learning and Machine Reading Programs; DASH Optimization (Xpress-MP).

Page 2 Comprehension
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1926. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin's dad was a magician. 4. Christopher Robin must be at least 65 now.
Answering these requires a process that maintains and updates a collection of propositions about the state of affairs. This is an Inference Problem.

Coherency in Semantic Role Labeling [EMNLP'11]
Predicate-arguments generated should be consistent across phenomena: verbs, nominalizations, prepositions.
Example: The touchdown scored by Mccoy cemented the victory of the Eagles.
Verb predicate: score. A0: Mccoy (scorer); A1: the touchdown (points scored).
Nominalization predicate: win. A0: the Eagles (winner).
Preposition sense: 11(6), "the object of the preposition is the object of the underlying verb of the nominalization".
Linguistic constraints: A0: the Eagles ⇒ Sense(of): 11(6); A0: Mccoy ⇒ Sense(by): 1(1).
Page 3

Semantic Parsing [CoNLL'10, …]
X: "What is the largest state that borders New York and Maryland?"
Y: largest( state( next_to( state(NY)) AND next_to( state(MD))))
Successful interpretation involves multiple decisions:
What entities appear in the interpretation? Does "New York" refer to a state or a city?
How to compose fragments together? state(next_to()) ≠ next_to(state())
Page 4

Learning and Inference Natural Language Decisions are Structured  Global decisions in which several local decisions play a role but there are mutual dependencies on their outcome. It is essential to make coherent decisions in a way that takes the interdependencies into account. Joint, Global Inference. But: Learning structured models requires annotating structures. Interdependencies among decision variables should be exploited in Decision Making (Inference) and in Learning.  Goal: learn from minimal, indirect supervision  Amplify it using interdependencies among variables Page 5

Natural Language Decisions are Structured  Global decisions in which several local decisions play a role but there are mutual dependencies on their outcome. It is essential to make coherent decisions in a way that takes the interdependencies into account. Joint, Global Inference. But: What are the best ways to learn in support of global inference?  Often, decoupling learning from inference is best [IJCAI’05, others]  Sometimes, interdependencies among decision variables can be exploited in Decision Making (Inference) and in Learning. Learning and Inference Page 6

Three Ideas Idea 1: Separate modeling and problem formulation from algorithms  Similar to the philosophy of probabilistic modeling Idea 2: Keep model simple, make expressive decisions (via constraints)  Unlike probabilistic modeling, where models become more expressive Idea 3: Expressive structured decisions can be supervised indirectly via related simple binary decisions  Global Inference can be used to amplify the minimal supervision. Page 7 Modeling Inference Learning

Constrained Conditional Models (aka ILP Inference)
y* = argmax_y Σ_i w_i φ_i(x, y) − Σ_i ρ_i d_{C_i}(x, y)
The first term scores y with "local" models: features, classifiers, log-linear models (HMM, CRF) or a combination; w is the weight vector for these local models. The second term is the (soft) constraints component: ρ_i is the penalty for violating constraint C_i, and d_{C_i}(x, y) measures how far y is from a "legal" assignment.
How to solve? This is an Integer Linear Program. Solving using ILP packages gives an exact solution; Cutting Planes, Dual Decomposition and other search techniques are possible.
How to train? Training is learning the objective function. Decouple? Decompose? How to exploit the structure to minimize supervision?
Page 8
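To make the ILP view concrete, here is a minimal sketch (an illustration, not the system behind these slides) that assumes the open-source PuLP package: two binary decisions are scored by "local" models, and a soft constraint y1 ⇒ y2 is penalized by ρ when violated, mirroring argmax_y w·φ(x, y) − ρ·d_C(x, y). All scores and the penalty are made-up numbers.

```python
# Minimal sketch of CCM inference as an ILP (PuLP assumed; toy numbers).
# Two binary decisions y1, y2 with local scores, plus one soft constraint
# "y1 => y2" whose violation costs rho, as in argmax_y w.phi(x,y) - rho*d_C(x,y).
import pulp

score = {"y1": 2.0, "y2": -0.5}   # hypothetical local-model scores (w.phi terms)
rho = 1.5                          # penalty for violating the soft constraint

prob = pulp.LpProblem("ccm_inference", pulp.LpMaximize)
y = {k: pulp.LpVariable(k, cat=pulp.LpBinary) for k in score}
viol = pulp.LpVariable("viol_y1_implies_y2", cat=pulp.LpBinary)

# d_C(x,y): "y1 => y2" is violated exactly when y1 = 1 and y2 = 0.
# Linearized as: viol >= y1 - y2.
prob += viol >= y["y1"] - y["y2"]

# Objective: local scores minus the soft-constraint penalty.
prob += pulp.lpSum(score[k] * y[k] for k in score) - rho * viol

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({k: round(v.value()) for k, v in y.items()}, "violation:", round(viol.value()))
# With these numbers the penalty outweighs y2's negative local score, so y2 flips to 1.
```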

Page 9 Constrained Conditional Models
How to solve? [Inference] An Integer Linear Program; exact (ILP packages) or approximate solutions.
How to train? [Learning] Training is learning the objective function [a lot of work on this]. Difficulty of annotating data: decouple? Joint learning vs. joint inference.
Examples: Constraint Driven Learning, Semi-supervised Learning, Indirect Supervision, New Applications.

Outline
I. Modeling: From Pipelines to Integer Linear Programming. Global Inference in NLP.
II. Simple Models, Expressive Decisions: Semi-supervised Training for structures; Constraints Driven Learning.
III. Indirect Supervision Training Paradigms for structure:
Indirect Supervision Training with latent structure (NAACL'10): Transliteration; Textual Entailment; Paraphrasing.
Training Structure Predictors by Inventing (simple) binary labels (ICML'10): POS, Information extraction tasks.
Driving supervision signal from World's Response (CoNLL'10, IJCAI'11, …): Semantic Parsing.
Page 10

Pipeline
Most problems are not single classification problems: Raw Data → POS Tagging → Phrases → Semantic Entities → Relations (also Parsing, WSD, Semantic Role Labeling).
Conceptually, pipelining is a crude approximation: interactions occur across levels, and downstream decisions often interact with previous decisions. This leads to propagation of errors. Occasionally, later-stage problems are easier, but they cannot correct earlier errors.
But there are good reasons to use pipelines: putting everything in one bucket may not be right. How about choosing some stages and thinking about them jointly?
Page 11

Inference with General Constraint Structure [Roth & Yih '04, '07]: Recognizing Entities and Relations
Dole's wife, Elizabeth, is a native of N.C.
Local (independently learned) scores:
E1 (Dole): other 0.05, per 0.85, loc 0.10
E2 (Elizabeth): other 0.05, per 0.50, loc 0.45
E3 (N.C.): other 0.10, per 0.60, loc 0.30
R12: irrelevant 0.05, spouse_of 0.45, born_in 0.50
R23: irrelevant 0.10, spouse_of 0.05, born_in 0.85
Improvement over no inference: 2-5%.
Some questions: How to guide the global inference? Why not learn jointly?
Note: models could be learned separately; constraints may come up only at decision time. The structure is non-sequential.
Key components:
1. Write down a linear objective function (depends on the models; one per instance).
2. Write down constraints as linear inequalities.
Page 12
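As an illustration only (the original systems used solvers such as Xpress-MP through LBJ, not the code below), the joint entity/relation decision above can be posed as an ILP with PuLP. The coherence constraints used here, spouse_of(per, per) and born_in(per, loc), and the exact encoding are assumptions of this sketch; the scores are the local probabilities listed above.

```python
# Sketch (illustration only): joint entity/relation inference for
# "Dole's wife, Elizabeth, is a native of N.C." as an ILP with PuLP.
# Hard coherence constraints: spouse_of(per, per), born_in(per, loc).
import pulp

ent_labels = ["other", "per", "loc"]
rel_labels = ["irrelevant", "spouse_of", "born_in"]

# Local (independently learned) scores, as listed on the slide.
ent_scores = {
    "E1": {"other": 0.05, "per": 0.85, "loc": 0.10},   # Dole
    "E2": {"other": 0.05, "per": 0.50, "loc": 0.45},   # Elizabeth
    "E3": {"other": 0.10, "per": 0.60, "loc": 0.30},   # N.C.
}
rel_scores = {
    ("E1", "E2"): {"irrelevant": 0.05, "spouse_of": 0.45, "born_in": 0.50},
    ("E2", "E3"): {"irrelevant": 0.10, "spouse_of": 0.05, "born_in": 0.85},
}

prob = pulp.LpProblem("entities_and_relations", pulp.LpMaximize)
e = {(i, l): pulp.LpVariable(f"{i}_{l}", cat=pulp.LpBinary)
     for i in ent_scores for l in ent_labels}
r = {(p, l): pulp.LpVariable(f"{p[0]}_{p[1]}_{l}", cat=pulp.LpBinary)
     for p in rel_scores for l in rel_labels}

# Each entity / relation variable gets exactly one label.
for i in ent_scores:
    prob += pulp.lpSum(e[i, l] for l in ent_labels) == 1
for p in rel_scores:
    prob += pulp.lpSum(r[p, l] for l in rel_labels) == 1

# Coherence: a relation label forces compatible entity types on its arguments.
for (a, b) in rel_scores:
    prob += r[(a, b), "spouse_of"] <= e[a, "per"]
    prob += r[(a, b), "spouse_of"] <= e[b, "per"]
    prob += r[(a, b), "born_in"] <= e[a, "per"]
    prob += r[(a, b), "born_in"] <= e[b, "loc"]

# Objective: sum of local scores of the chosen labels.
prob += (pulp.lpSum(ent_scores[i][l] * e[i, l] for i in ent_scores for l in ent_labels)
         + pulp.lpSum(rel_scores[p][l] * r[p, l] for p in rel_scores for l in rel_labels))

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({i: next(l for l in ent_labels if e[i, l].value() > 0.5) for i in ent_scores})
print({p: next(l for l in rel_labels if r[p, l].value() > 0.5) for p in rel_scores})
# Joint inference corrects the locally ambiguous decisions: E3 becomes loc,
# R12 becomes spouse_of, and R23 stays born_in.
```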

Page 13 Examples: CCM Formulations (aka ILP for NLP)
CCMs can be viewed as a general interface to easily combine domain knowledge with data-driven statistical models. Formulate NLP problems as ILP problems (inference may be done otherwise):
1. Sequence tagging (HMM/CRF + global constraints). Sequential prediction, HMM/CRF based: argmax Σ λ_ij x_ij. Example linguistic constraint: cannot have both A states and B states in an output sequence.
2. Sentence Compression/Summarization (language model + global constraints): argmax Σ λ_ijk x_ijk. Example linguistic constraints: if a modifier is chosen, include its head; if a verb is chosen, include its arguments.
3. SRL (independent classifiers + global constraints).

Page 14 Example: Semantic Role Labeling
Who did what to whom, when, where, why, …
I left my pearls to my daughter in my will.
[ I ]_A0 left [ my pearls ]_A1 [ to my daughter ]_A2 [ in my will ]_AM-LOC.
A0: Leaver; A1: Things left; A2: Benefactor; AM-LOC: Location.
Constraints include: no overlapping arguments; if A2 is present, A1 must also be present.

Page 15 Semantic Role Labeling (2/2)
PropBank [Palmer et al., 05] provides a large human-annotated corpus of semantic verb-argument relations. It adds a layer of generic semantic labels to Penn Tree Bank II; (almost) all the labels are on the constituents of the parse trees.
Core arguments: A0-A5 and AA; different semantics for each verb, specified in the PropBank frame files.
13 types of adjuncts labeled as AM-arg, where arg specifies the adjunct type.

Page 16 Algorithmic Approach
Identify argument candidates: pruning [Xue & Palmer, EMNLP'04]; argument identifier: binary classification (A-Perc).
Classify argument candidates: argument classifier, multi-class classification (A-Perc).
Inference: use the estimated probability distribution given by the argument classifier, together with structural and linguistic constraints, to infer the optimal global output.
(Example: candidate argument bracketings over "I left my nice pearls to her".)

Page 17 Semantic Role Labeling (SRL) I left my pearls to my daughter in my will

Page 18 Semantic Role Labeling (SRL) I left my pearls to my daughter in my will

Page 19 Semantic Role Labeling (SRL) I left my pearls to my daughter in my will One inference problem for each verb predicate.

Page 20 Integer Linear Programming Inference
For each argument a_i, set up a Boolean variable a_{i,t} indicating whether a_i is classified as type t.
Goal: maximize Σ_i Σ_t score(a_i = t) · a_{i,t}, subject to the (linear) constraints.
If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of arguments that are correct and satisfies the constraints.
The Constrained Conditional Model is completely decomposed during training.

Page 21 Constraints
Any Boolean rule can be encoded as a (collection of) linear constraints. These are universally quantified rules; LBJ allows a developer to encode constraints in FOL, and these are compiled into linear inequalities automatically.
No duplicate argument classes: Σ_{a ∈ POTARG} x_{a = A0} ≤ 1.
R-ARG (if there is an R-ARG phrase, there is an ARG phrase): ∀ a2 ∈ POTARG, Σ_{a ∈ POTARG} x_{a = A0} ≥ x_{a2 = R-A0}.
C-ARG (if there is a C-ARG phrase, there is an ARG phrase before it): ∀ a2 ∈ POTARG, Σ_{a ∈ POTARG, a before a2} x_{a = A0} ≥ x_{a2 = C-A0}.
Many other possible constraints: unique labels; no overlapping or embedding; relations between the number of arguments; order constraints; if the verb is of type A, no argument of type B.
Joint inference can also be used to combine different (SRL) systems.
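Here is a hedged sketch of how a few of these Boolean rules become linear inequalities over indicator variables x_{a = t} (one per candidate-label pair). The reduced label set, the candidate scores, and the use of PuLP are all assumptions made for illustration.

```python
# Sketch: SRL inference with indicators x[i, t] = "candidate i gets label t".
# Encodes three of the constraints above as linear inequalities (PuLP; toy scores).
import pulp

labels = ["A0", "A1", "A2", "R-A0", "null"]
# Hypothetical per-candidate scores from an argument classifier.
score = [
    {"A0": 0.6, "A1": 0.2, "A2": 0.1, "R-A0": 0.3, "null": 0.1},  # candidate 0
    {"A0": 0.5, "A1": 0.4, "A2": 0.2, "R-A0": 0.1, "null": 0.2},  # candidate 1
    {"A0": 0.1, "A1": 0.3, "A2": 0.2, "R-A0": 0.6, "null": 0.2},  # candidate 2
]

prob = pulp.LpProblem("srl_inference", pulp.LpMaximize)
x = {(i, t): pulp.LpVariable(f"x_{i}_{t}", cat=pulp.LpBinary)
     for i in range(len(score)) for t in labels}

# Each candidate takes exactly one label (possibly null).
for i in range(len(score)):
    prob += pulp.lpSum(x[i, t] for t in labels) == 1

# No duplicate argument classes: sum_i x[i, A] <= 1 for each core argument A.
for t in ["A0", "A1", "A2"]:
    prob += pulp.lpSum(x[i, t] for i in range(len(score))) <= 1

# R-A0 for candidate j requires some candidate labeled A0:
# x[j, R-A0] <= sum_i x[i, A0].
for j in range(len(score)):
    prob += x[j, "R-A0"] <= pulp.lpSum(x[i, "A0"] for i in range(len(score)))

prob += pulp.lpSum(score[i][t] * x[i, t] for i in range(len(score)) for t in labels)
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([next(t for t in labels if x[i, t].value() > 0.5) for i in range(len(score))])
# The no-duplicate constraint forces candidate 1 away from A0 and onto A1.
```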

Page 22 SRL: Posing the Problem (Demo)
1) Top-ranked system in the CoNLL'05 shared task; the key difference is the inference.
2) Produces a very good semantic parser: F1 ~90%.
3) Easy and fast: ~7 sentences/sec (using Xpress-MP).

Three Ideas Idea 1: Separate modeling and problem formulation from algorithms  Similar to the philosophy of probabilistic modeling Idea 2: Keep model simple, make expressive decisions (via constraints)  Unlike probabilistic modeling, where models become more expressive Idea 3: Expressive structured decisions can be supervised indirectly via related simple binary decisions  Global Inference can be used to amplify the minimal supervision. Page 23 Modeling Inference Learning

Page 24 Constrained Conditional Models
Constrained Conditional Models – ILP formulations – have been shown useful in the context of many NLP problems [Roth & Yih 04, 07; Chang et al. 07, 08, …]: SRL, Summarization, Co-reference, Information Extraction, Transliteration, Textual Entailment, Knowledge Acquisition.
There is some theoretical work on training paradigms [Punyakanok et al. 05, and more]. See the NAACL'10 tutorial on my web page and the NAACL'09 ILPNLP workshop for a summary of work and a bibliography.
But: learning structured models requires annotating structures.

Page 25 Information Extraction without Prior Knowledge
Citation: Lars Ole Andersen. Program analysis and specialization for the C Programming language. PhD thesis. DIKU, University of Copenhagen, May 1994.
Fields: [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE].
The prediction result of a trained HMM on this citation violates lots of natural constraints!

Page 26 Strategies for Improving the Results
(Pure) machine learning approaches, i.e., increasing the model complexity: higher-order HMM/CRF? Increasing the window size? Adding a lot of new features? These require a lot of labeled examples. What if we only have a few labeled examples?
Other options? The output does not make sense. Can we keep the learned model simple and still make expressive decisions?

Page 27 Examples of Constraints
Each field must be a consecutive list of words and can appear at most once in a citation.
State transitions must occur on punctuation marks.
The citation can only start with AUTHOR or EDITOR.
The words pp., pages correspond to PAGE.
Four digits starting with 20xx and 19xx are DATE.
Quotations can appear only in TITLE.
…
These are easy-to-express pieces of "knowledge"; they are non-propositional and may use quantifiers.
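These pieces of knowledge can also be expressed directly in code. The sketch below, which assumes a simple token-level labeling and uses hypothetical helper names, writes a few of the constraints above as Python predicates and counts violations; such counts could play the role of the d_C(x, y) distances in the CCM objective.

```python
# Sketch: a few of the citation constraints above as predicates over a predicted
# label sequence (one label per token).  Violation counts can serve as d_C(x, y).
import re

def starts_with_author_or_editor(labels):
    return labels[0] in ("AUTHOR", "EDITOR")

def fields_are_consecutive(labels):
    # Each field is a consecutive block and appears at most once.
    seen, prev = set(), None
    for lab in labels:
        if lab != prev and lab in seen:
            return False
        seen.add(lab)
        prev = lab
    return True

def pages_tokens_are_page(tokens, labels):
    return all(lab == "PAGE" for tok, lab in zip(tokens, labels)
               if tok.lower() in ("pp.", "pages"))

def year_tokens_are_date(tokens, labels):
    return all(lab == "DATE" for tok, lab in zip(tokens, labels)
               if re.fullmatch(r"(19|20)\d{2}", tok))

def num_violations(tokens, labels):
    checks = [starts_with_author_or_editor(labels),
              fields_are_consecutive(labels),
              pages_tokens_are_page(tokens, labels),
              year_tokens_are_date(tokens, labels)]
    return sum(1 for ok in checks if not ok)

tokens = "Lars Ole Andersen . PhD thesis , May 1994 .".split()
labels = ["AUTHOR", "AUTHOR", "AUTHOR", "AUTHOR", "TECH-REPORT",
          "TECH-REPORT", "TECH-REPORT", "DATE", "DATE", "DATE"]
print(num_violations(tokens, labels))  # prints 0: this labeling satisfies all four
```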

Page 28 Information Extraction with Constraints
Adding constraints, we get correct results, without changing the model:
[AUTHOR] Lars Ole Andersen. [TITLE] Program analysis and specialization for the C Programming language. [TECH-REPORT] PhD thesis. [INSTITUTION] DIKU, University of Copenhagen, [DATE] May, 1994.
Constrained Conditional Models allow learning a simple model while making decisions with a more complex model. This is accomplished by directly incorporating constraints to bias/re-rank decisions made by the simpler model.

Page 29 II. Guiding Semi-Supervised Learning with Constraints
In traditional semi-supervised learning the model can drift away from the correct one. Constraints can be used to generate better training data:
At training time, to improve the labeling of unlabeled data (and thus improve the model).
At decision time, to bias the objective function towards favoring constraint satisfaction.

Page 30 Constraints Driven Learning (CoDL)
[Chang, Ratinov, Roth, ACL'07; ICML'08; MLJ, to appear]; generalized by Ganchev et al. [Posterior Regularization]
(w0, ρ0) = learn(L)   // supervised learning algorithm parameterized by (w, ρ)
For N iterations do:
  T = ∅
  For each x in the unlabeled dataset:
    h ← argmax_y w^T φ(x, y) − Σ_k ρ_k d_{C_k}(x, y)   // inference with constraints: augment the training set
    T = T ∪ {(x, h)}
  (w, ρ) = γ (w0, ρ0) + (1 − γ) learn(T)   // learn from the new training data; weigh supervised and unsupervised models
Learning can be justified as an optimization procedure for an objective function. Excellent experimental results show the advantages of using constraints, especially with small amounts of labeled data [Chang et al., others]. Several training paradigms exist.
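The loop above can be summarized schematically in code. This is a sketch rather than the released implementation: learn(), the constrained argmax, and the interpolation weight gamma are left abstract because they depend on the underlying model.

```python
# Sketch of the CoDL loop (abstract): learn() returns model parameters (w, rho),
# constrained_argmax(x, w, rho) performs inference with constraints, and gamma
# interpolates between the supervised and the unsupervised models.
def codl(labeled, unlabeled, learn, constrained_argmax, gamma=0.9, iterations=10):
    w0, rho0 = learn(labeled)          # supervised starting point
    w, rho = w0, rho0
    for _ in range(iterations):
        # Inference with constraints: label the unlabeled data with the
        # current model, biased toward constraint satisfaction.
        T = [(x, constrained_argmax(x, w, rho)) for x in unlabeled]
        w_new, rho_new = learn(T)      # learn from the newly labeled set
        # Weigh the supervised and unsupervised models.
        w = [gamma * a + (1 - gamma) * b for a, b in zip(w0, w_new)]
        rho = [gamma * a + (1 - gamma) * b for a, b in zip(rho0, rho_new)]
    return w, rho
```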

Page 31 Constraints Driven Learning (CODL): Objective Function
[Chang, Ratinov, Roth, ACL'07; ICML'08; MLJ, to appear]; generalized by Ganchev et al. [Posterior Regularization]
A semi-supervised learning paradigm that makes use of constraints to bootstrap from a small number of examples. Constraints are used to: bootstrap a semi-supervised learner; correct weak models' predictions on unlabeled data, which in turn are used to keep training the model.
(Chart: performance vs. number of available labeled examples, learning with constraints vs. learning without constraints. A poor model plus constraints reaches with roughly 10 labeled examples what learning without constraints needs 300 examples to reach.)

Outline I. Modeling: From Pipelines to Integer Linear Programming  Global Inference in NLP II. Simple Models, Expressive Decisions:  Semi-supervised Training for structures  Constraints Driven Learning III. Indirect Supervision Training Paradigms for structure  Indirect Supervision Training with latent structure (NAACL’10) Transliteration; Textual Entailment; Paraphrasing  Training Structure Predictors by Inventing (simple) binary labels (ICML’10) POS, Information extraction tasks  Driving supervision signal from World’s Response (CoNLL’10,IJCAI’11,….) Semantic Parsing Page 32 Indirect Supervision Replace a structured label by a related (easy to get) binary label

Page 33 Paraphrase Identification
Consider the following sentences:
S1: Druce will face murder charges, Conte said.
S2: Conte said Druce will be charged with murder.
Are S1 and S2 a paraphrase of each other? There is a need for an intermediate representation to justify this decision. Textual Entailment is equivalent.
Standard setting: given an input x ∈ X, learn a model f: X → {-1, 1}.
We need latent variables that explain why this is a positive example. Instead: given an input x ∈ X, learn a model f: X → H → {-1, 1}.

Page 34 Algorithms: Two Conceptual Approaches
Two-stage approach (a pipeline; typically used for TE, paraphrase identification, and others): learn the hidden variables and fix them (this needs supervision for the hidden layer, or heuristics); for each example, extract features over x and (the fixed) h; then learn a binary classifier for the target task.
Proposed approach: joint learning. Drive the learning of h from the binary labels; find the best h(x). An intermediate structure representation is good to the extent it supports better final prediction. Algorithm? How to drive learning a good H?

Page 35 Learning with Constrained Latent Representation (LCLR): Intuition
If x is positive, there must exist a good explanation (intermediate representation): ∃ h, w^T φ(x, h) ≥ 0; or, max_h w^T φ(x, h) ≥ 0.
If x is negative, no explanation is good enough to support the answer: ∀ h, w^T φ(x, h) ≤ 0; or, max_h w^T φ(x, h) ≤ 0.
Altogether, this can be combined into an objective function:
min_w λ/2 ||w||² + C Σ_i L(1 − z_i max_{h ∈ C} w^T Σ_s h_s φ_s(x_i))
The chosen h selects a representation, i.e., a new feature vector for the final decision. Inference: find the best h subject to constraints C. Why does inference help? It constrains intermediate representations to those supporting good predictions.

Page 36 Optimization
The objective is non-convex, due to the maximization term inside the global minimization problem. In each iteration:
Find the best feature representation h* for all positive examples (off-the-shelf ILP solver).
Having fixed the representation for the positive examples, update w by solving the convex optimization problem. This is not the standard SVM/LR: it needs inference.
Asymmetry: only positive examples require a good intermediate representation that justifies the positive label. Consequently, the objective function decreases monotonically.
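A rough sketch of this alternating scheme follows, assuming a subgradient update for the inner convex problem and an abstract best_h routine (e.g., an ILP over the constrained hidden structures, returning the feature vector of the best h). It illustrates the asymmetry between positive and negative examples rather than reproducing the paper's algorithm.

```python
# Sketch of LCLR-style alternating optimization (illustrative, not the paper's code).
# best_h(x, w): inference -- returns the feature vector of the best hidden
# structure h for x under w, subject to the declarative constraints C.
import numpy as np

def lclr_train(pos, neg, best_h, dim, C=1.0, lam=0.1,
               outer_iters=10, inner_iters=100, lr=0.01):
    w = np.zeros(dim)
    for _ in range(outer_iters):
        # Step 1: fix the best representation for the *positive* examples only
        # (the asymmetry discussed above).
        pos_feats = [best_h(x, w) for x in pos]
        # Step 2: update w by subgradient descent on the now-convex problem;
        # negatives still require inference inside the loss (max over h).
        for _ in range(inner_iters):
            grad = lam * w
            for phi in pos_feats:            # positives: want w.phi >= 1
                if w.dot(phi) < 1:
                    grad -= C * phi
            for x in neg:                    # negatives: want max_h w.phi(x,h) <= -1
                phi = best_h(x, w)
                if w.dot(phi) > -1:
                    grad += C * phi
            w -= lr * grad
    return w
```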

Page 37 Iterative Objective Function Learning
Formalized as Structured SVM + constrained hidden structure. LCLR: Learning Constrained Latent Representation.
The loop: start from an initial objective function; run inference to find the best h subject to C (the ILP inference discussed earlier, restricting the possible hidden structures considered); generate features from the inferred h; train w with respect to the binary decision label; update the weight vector using feedback relative to the binary problem; predict with the inferred h; and repeat.

Page 38 Learning with Constrained Latent Representation (LCLR): Framework
LCLR provides a general inference formulation that allows the use of expressive constraints to determine the hidden level; it is flexibly adapted for many tasks that require latent representations. H: problem-specific declarative constraints.
Paraphrasing: model the input as graphs G1, G2 with vertices V(G1), V(G2) and edges E(G1), E(G2).
Hidden variables: h_{v1,v2}, the possible vertex mappings, and h_{e1,e2}, the possible edge mappings.
Constraints: each vertex in G1 can be mapped to a single vertex in G2 or to null; each edge in G1 can be mapped to a single edge in G2 or to null; an edge mapping is active iff the corresponding node mappings are active.

Page 39 Experimental Results
(Result charts in the original slides.) Results are reported on three tasks: Transliteration, Recognizing Textual Entailment, and Paraphrase Identification.

Outline I. Modeling: From Pipelines to Integer Linear Programming  Global Inference in NLP II. Simple Models, Expressive Decisions:  Semi-supervised Training for structures  Constraints Driven Learning III. Indirect Supervision Training Paradigms for structure  Indirect Supervision Training with latent structure (NAACL’10) Transliteration; Textual Entailment; Paraphrasing  Training Structure Predictors by Inventing (simple) binary labels (ICML’10) POS, Information extraction tasks  Driving supervision signal from World’s Response (CoNLL’10,IJCAI’11,….) Semantic Parsing Page 40 Indirect Supervision Replace a structured label by a related (easy to get) binary label

Page 41 Structured Prediction Before, the structure was in the intermediate level  We cared about the structured representation only to the extent it helped the final binary decision  The binary decision variable was given as supervision What if we care about the structure?  Information Extraction; Relation Extraction; POS tagging, many others. Invent a companion binary decision problem!

Page 42 Information Extraction
Prediction result of a trained HMM on the citation: Lars Ole Andersen. Program analysis and specialization for the C Programming language. PhD thesis. DIKU, University of Copenhagen, May 1994.
Fields: [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE].

Page 43 Structured Prediction
Before, the structure was in the intermediate level: we cared about the structured representation only to the extent it helped the final binary decision, and the binary decision variable was given as supervision.
What if we care about the structure itself? Information Extraction, Relation Extraction, POS tagging, and many others.
Invent a companion binary decision problem!
Parse citations: Lars Ole Andersen. Program analysis and specialization for the C Programming language. PhD thesis. DIKU, University of Copenhagen, May 1994. Companion: given a citation, does it have a legitimate citation parse?
POS tagging. Companion: given a word sequence, does it have a legitimate POS tagging sequence?
Binary supervision is almost free.

Page 44 Companion Task Binary Label as Indirect Supervision
The two tasks are related just like the binary and structured tasks discussed earlier: all positive examples must have a good structure, and negative examples cannot have a good structure. For example, positive transliteration pairs must have "good" phonetic alignments, and negative transliteration pairs cannot have "good" phonetic alignments.
We are in the same setting as before: binary labeled examples are easier to obtain, and we can take advantage of this to help learn a structured model. Algorithm: combine binary learning and structured learning.

Page 45 Learning Structure with Indirect Supervision
In this case we care about the predicted structure. Use both structural learning and binary learning.
(Figure: the feasible structures of an example, with the correct and the predicted structures marked.) Negative examples cannot have a good structure; they restrict the space of hyperplanes supporting the decisions for x.

Page 46 Joint Learning Framework
Joint learning: if available, make use of both supervision types. Target task: e.g., the phonetic alignment of the transliteration pair (Italy, איטליה). Companion task: a yes/no decision, e.g., is (Illinois, אילינוי) a transliteration pair?
Loss function: loss on the target task plus loss on the companion task, the same as described earlier. Key: the same parameter vector w is used for both components.
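To make the shared-parameter idea concrete, here is a hedged sketch of the joint objective with a single weight vector w. The structured hinge loss, the feature map phi, and the inference routine predict_best are placeholders, and the loss-augmented (margin-rescaled) term is omitted for brevity.

```python
# Sketch: joint objective with one weight vector w shared by the structured
# (target) loss and the binary (companion) loss, as described above.
import numpy as np

def joint_objective(w, structured_data, binary_data, phi, predict_best,
                    lam=0.1, C1=1.0, C2=1.0):
    obj = 0.5 * lam * np.dot(w, w)
    # Loss on the target (structured) task: a simple structured hinge loss.
    for x, y_gold in structured_data:
        y_hat = predict_best(x, w)                 # argmax_y w.phi(x, y)
        margin = w.dot(phi(x, y_gold)) - w.dot(phi(x, y_hat))
        obj += C1 * max(0.0, 1.0 - margin)
    # Loss on the companion (binary) task: does x have a legitimate structure?
    for x, z in binary_data:                       # z in {+1, -1}
        best_score = w.dot(phi(x, predict_best(x, w)))   # max over feasible structures
        obj += C2 * max(0.0, 1.0 - z * best_score)
    return obj
```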

Page 47 Experimental Result Very little direct (structured) supervision.

Page 48 Experimental Result Very little direct (structured) supervision. (Almost free) Large amount binary indirect supervision

Outline I. Modeling: From Pipelines to Integer Linear Programming  Global Inference in NLP II. Simple Models, Expressive Decisions:  Semi-supervised Training for structures  Constraints Driven Learning III. Indirect Supervision Training Paradigms for structure  Indirect Supervision Training with latent structure (NAACL’10) Transliteration; Textual Entailment; Paraphrasing  Training Structure Predictors by Inventing (simple) binary labels (ICML’10) POS, Information extraction tasks  Driving supervision signal from World’s Response (CoNLL’10,IJCAI’11,….) Semantic Parsing Page 49

Page 50 Connecting Language to the World [CoNLL'10, ACL'11, IJCAI'11]
"Can I get a coffee with no sugar and just a bit of milk?" A semantic parser maps this to MAKE(COFFEE, SUGAR=NO, MILK=LITTLE). If the result is wrong the user complains; if it is right the user is happy. Can we rely on this interaction to provide supervision?

Page 51 Real World Feedback: Supervision = Expected Response
Semantic parsing is a structured prediction problem: identify mappings from text to a meaning representation.
NL query x: "What is the largest state that borders NY?"
Logical query y: largest( state( next_to( const(NY))))
Query response r: Pennsylvania
Traditional approach: learn from logical forms and gold alignments. EXPENSIVE!
Our approach: use only the responses. The interactive computer system executes the predicted logical query against the real world and checks whether the predicted response equals the expected response. This yields binary supervision: Expected: Pennsylvania, Predicted: Pennsylvania is a positive response; Expected: Pennsylvania, Predicted: NYC is a negative response.
Train a structured predictor with this binary supervision!
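A minimal sketch of the response-driven loop, assuming a perceptron-style promote/demote update (one simple instantiation, not necessarily the exact protocols evaluated in the papers). Here predict, execute, and phi are placeholders for the semantic parser, the database executor, and the feature map.

```python
# Sketch of response-based learning (illustrative):
# the world's answer to the predicted query provides binary supervision.
def response_driven_training(data, predict, execute, phi, w, lr=1.0, epochs=5):
    """data: list of (nl_query, expected_response) pairs -- no logical forms given."""
    for _ in range(epochs):
        for x, expected in data:
            y_hat = predict(x, w)        # structured prediction (e.g., via ILP inference)
            response = execute(y_hat)    # run the predicted logical form on the database
            if response == expected:
                # Positive feedback: promote the features of the predicted structure.
                for f, v in phi(x, y_hat).items():
                    w[f] = w.get(f, 0.0) + lr * v
            else:
                # Negative feedback: demote them.
                for f, v in phi(x, y_hat).items():
                    w[f] = w.get(f, 0.0) - lr * v
    return w
```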

Page 52 Empirical Evaluation [CoNLL'10]
Key question: can we learn from this type of supervision?
Algorithm (# training structures; test set accuracy):
No Learning: Initial Objective Fn (0 structures).
Binary signal, Protocol I (0 structures): 69.2%.
Binary signal, Protocol II (0 structures): 73.2%.
WM* 2007 (fully supervised; uses gold structures).
*[WM] Y.-W. Wong and R. Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. ACL.
Current emphasis: learning to understand natural language instructions for games via response-based learning.

Page 53 Conclusion
Constrained Conditional Models: a computational framework for global inference and a vehicle for incorporating knowledge in structured tasks. The Integer Linear Programming formulation has seen a lot of recent work (see the tutorial).
Today focused on Constraint Driven Learning and Indirect Supervision: simple models, expressive decisions; indirect supervision is cheap and easy to obtain.
Learning structures from real-world feedback: obtain binary supervision from "real world" interaction; indirect supervision replaces direct supervision.
LBJ (Learning Based Java): a modeling language for Constrained Conditional Models. It supports programming along with building learned models, high-level specification of constraints, and inference with constraints.
Thank You!

Page 54 Summary: Constrained Conditional Models
y* = argmax_y Σ_i w_i φ_i(x; y) − Σ_i ρ_i d_{C_i}(x, y)
The first term is a linear objective function over the output variables (a Conditional Markov Random Field); often φ(x, y) will be local functions, or φ(x, y) = φ(x). The second term is a constraints network: expressive constraints over output variables; soft, weighted constraints; specified declaratively as FOL formulae.
Clearly, there is a joint probability distribution that represents this mixed model. We would like to learn a simple model (or several simple models) and make decisions with respect to a complex model.
Key difference from MLNs, which provide a concise definition of a model, but of the whole joint one.

Nice to Meet You Page 55

Learning and Inference
Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome; e.g., structured output problems with multiple dependent output variables.
(Learned) models/classifiers for different sub-problems; in some cases, not all local models can be learned simultaneously. Examples: Information Extraction, Co-Ref, Dep. Parsing, Summarization, TE, QA, …
Incorporate the models' information, along with prior knowledge (constraints), in making coherent decisions: decisions that respect the local models as well as domain- and context-specific knowledge/constraints. Page 56

Page 57 Predicting Phonetic Alignment (for Transliteration)
Target task: input is an English named entity and its Hebrew transliteration, e.g. (Italy, איטליה); output is the phonetic alignment (a character-sequence mapping). This is a structured output prediction task (many constraints), hard to label.
Companion task: input is an English named entity and a Hebrew named entity, e.g. (Illinois, אילינוי); the companion output is a yes/no decision: do they form a transliteration pair? This is a binary output problem, easy to label, and negative examples are FREE, given positive examples.
Why is it a companion task?