Page 1 Learning and Inference in Natural Language: From Stand Alone Learning Tasks to Structured Representations. Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign. Joint work with my students: Vasin Punyakanok, Wen-tau Yih, Dav Zimak. Biologically Inspired Computing, Sendai, Japan, Nov. 2004

Page 2 Cognitive Computation Group. The group develops theories and systems pertaining to intelligent behavior using a unified methodology. At the heart of the approach is the idea that learning has a central role in intelligence. We have concentrated on developing the theoretical basis within which to address some of the obstacles, and on developing an experimental paradigm so that realistic experiments can be performed to validate the theory. The emphasis is on large-scale real-world problems in natural language understanding and visual recognition.

Page 3 Cognitive Computation Group. Foundations: • Learning theory: classification; multi-class classification; ranking. • Knowledge representation: relational representations, relational kernels. • Inference approaches: structural mappings. Intelligent information access: • Information extraction. • Named entities and relations. • Matching entity mentions within and across documents and databases. Natural language processing: • Semantic role labeling. • Question answering. • Semantics. Software: • Basic tools development: SNoW, FEX, shallow parser, POS tagger, semantic parser, … This talk covers some of our work on understanding the role of learning in supporting reasoning in the natural language domain.

Page 4 Comprehension (ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous. 1. Who is Christopher Robin? 2. When was Winnie the Pooh written? 3. What did Mr. Robin do when Chris was three years old? 4. Where did young Chris live? 5. Why did Chris write two books of his own?

Page 5 What we Know: Stand Alone Ambiguity Resolution. Illinois’ bored of education board... Nissan Car and truck plant is … …divide life into plant and animal kingdom. (This Art) (can N) (will MD) (rust V) V,N,N. The dog bit the kid. He was taken to a veterinarian / a hospital. Tiger was in Washington for the PGA Tour.

Page 6 Comprehension (ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous. 1. Who is Christopher Robin? 2. When was Winnie the Pooh written? 3. What did Mr. Robin do when Chris was three years old? 4. Where did young Chris live? 5. Why did Chris write two books of his own?

Page 7 Inference

Page 8  Global decisions in which several local decisions play a role but there are mutual dependencies on their outcome.  Learned classifiers for different sub-problems  Incorporate classifiers’ information, along with constraints, in making coherent decisions – decisions that respect the local classifiers as well as domain & context specific constraints.  Global inference for the best assignment to all variables of interest. Inference with Classifiers

Page 9 Overview. Stand Alone Learning: • Modeling. • Representational issues. • Computational issues. Inference: • Making decisions under general constraints. • Semantic role labeling. How to train components of global decisions: • A tradeoff that depends on how easy the components are to learn. • Feedback to learning is (indirectly) given by the reasoning stage. • There may not be a need (or even a possibility) to learn exactly, but only to the extent that it supports reasoning.

Page 10 Structured Input → Feature Mapping → Learning → Structured Output. (Diagram: structured input; primitive features; feature functions (non-linear); structured output; learning of multi-valued outputs.)

Page 11 Stand Alone Ambiguity Resolution. Illinois’ bored of education board... Nissan Car and truck plant is … …divide life into plant and animal kingdom. (This Art) (can N) (will MD) (rust V) V,N,N. The dog bit the kid. He was taken to a veterinarian / a hospital. Tiger was in Washington for the PGA Tour.

Page 12 Disambiguation Problems Middle Eastern ____ are known for their sweetness Task: Decide which of { deserts, desserts } is more likely in the given context. Ambiguity: modeled as confusion sets (class labels C ) C={ deserts, desserts} C={ Noun,Adj.., Verb…} C={ topic=Finance, topic=Computing} C={ NE=Person, NE=location}

Page 13 Learning to Disambiguate. Given a confusion set C = {deserts, desserts} and a sentence s: "Middle Eastern ____ are known for their sweetness." Map the sentence into a feature-based representation Φ: S → {Φ1(s), Φ2(s), …}. Learn a function F_C that determines which of C = {deserts, desserts} is more likely in a given context: F_C(x) = w · Φ(x). Evaluate the function on future sentences containing C.

Page 14 Example: Representation. S = I don’t know whether to laugh or cry [x x x x]. Consider words, POS tags, and relative location within a window. Generate binary features representing the presence of a word/POS within a window around the target word (don’t within +/-3; know within +/-3; Verb at -1; to within +/-3; laugh within +/-3; to at +1) and conjunctions of size 2 within a window of size 3 (words: know__to, ___to laugh; POS+words: Verb__to, ____to Verb). The sentence is represented as the set of its active features: S = (don’t at -2, know within +/-3, …, ____to Verb, …). Hope: S = I don’t care whether to laugh or cry has almost the same representation.
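A rough sketch of this kind of feature extraction (Python; the window size, feature templates, and function names are illustrative assumptions, not the actual FEX feature language):

def extract_features(tokens, pos_tags, target_index, window=3):
    """Return the set of active (binary) features around a target word:
    words/POS tags within +/-window of the target, plus size-2 conjunctions."""
    active = set()
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    for i in range(lo, hi):
        if i == target_index:
            continue
        offset = i - target_index
        active.add("word=%s@%d" % (tokens[i], offset))    # word at a relative position
        active.add("word=%s@win" % tokens[i])             # word anywhere in the window
        active.add("pos=%s@%d" % (pos_tags[i], offset))   # POS tag at a relative position
    for i in range(lo, hi - 1):                           # conjunctions of two adjacent items
        active.add("conj:%s_%s" % (tokens[i], tokens[i + 1]))
        active.add("conj:%s_%s" % (pos_tags[i], tokens[i + 1]))
    return active

# "I don't know whether to laugh or cry"; the target is the confusion-set slot ("whether")
tokens = ["I", "don't", "know", "whether", "to", "laugh", "or", "cry"]
pos = ["PRP", "VBP", "VB", "IN", "TO", "VB", "CC", "VB"]
print(sorted(extract_features(tokens, pos, target_index=3)))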

Page 15 Structured Input: Features can be Complex. Examples: the structure over "John will join the board as a director" (with join as the predicate), and the chunked question [NP Which type] [PP of] [NP submarine] [VP was bought] [ADVP recently] [PP by] [NP South Korea] (. ?). S = John will join the board as a director; features may include Word=, POS=, IS-A=, … Feature extraction can be an involved process; it builds on previous learners; it is computationally hard; some algorithms (Perceptron) support implicit mapping.

Page 16 Notes on Representation. A feature is a function over sentences, which maps a sentence to a set of properties of the sentence: Φ: S → {0,1} or [0,1]. There is a huge number of potential features (~10^5); out of these, only a small number is actually active in each example. Representation: list only the features that are active (non-zero) in the example. When the number of features is fixed, the collection of examples is {Φ1(s), Φ2(s), …, Φn(s)} ⊆ {0,1}^n. There is no need to fix the number of features (on-line algorithms): in the infinite attribute domain, {Φ1(s), Φ2(s), …} ⊆ {0,1}^∞. Some algorithms can make use of variable size input.

Page 17 Embedding. (Figure: weather/whether examples embedded in a higher-dimensional feature space, where the new discriminator is functionally simpler.)

Page 18 The number of potential features is very large The instance space is sparse Decisions depend on a small set of features (sparse) Want to learn from a number of examples that is small relative to the dimensionality Natural Language: Domain Characteristics

Page 19 Algorithm Descriptions. Focus: two families of on-line algorithms. Examples x ∈ {0,1}^n; hypothesis w ∈ R^n; prediction: sgn(w · x − θ). Additive weight update algorithm (Perceptron, Rosenblatt; variations exist). Multiplicative weight update algorithm (Winnow, Littlestone; variations exist).
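A minimal sketch of the two update families (Python; the threshold θ, learning rate, and promotion factor α are illustrative choices, not the slide's settings):

import numpy as np

def perceptron_update(w, x, y, theta=0.0, rate=1.0):
    """Additive rule: on a mistake, add (or subtract) the example to the weights."""
    if y * (np.dot(w, x) - theta) <= 0:          # mistake-driven
        w = w + rate * y * x
    return w

def winnow_update(w, x, y, theta, alpha=2.0):
    """Multiplicative rule: on a mistake, promote/demote the weights of active features."""
    if y * (np.dot(w, x) - theta) <= 0:          # mistake-driven
        w = w * (alpha ** (y * x))               # only coordinates with x_i = 1 change
    return w

# x in {0,1}^n, y in {-1, +1}
x = np.array([1, 0, 1, 0, 0])
w_add = perceptron_update(np.zeros(5), x, y=+1)
w_mul = winnow_update(np.ones(5), x, y=+1, theta=5 / 2.0)
print(w_add, w_mul)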

Page 20 Generalization. Dominated by the sparseness of the function space: • Most features are irrelevant → advantage to multiplicative updates. • The number of examples required by multiplicative algorithms depends mostly on the number of relevant features. • Generalization bounds depend on ||w||. Lesser issue: sparseness of the feature space: • Very few active features → advantage to additive updates. • Generalization depends on ||x|| [Kivinen/Warmuth 95].

Page 21 Mistake bounds for "10 out of 100 out of n". Target function: at least 10 out of a fixed set of 100 variables are active; dimensionality is n. (Plot: number of mistakes to convergence vs. total number of variables n; the Perceptron/SVM bounds grow with the dimensionality n, while Winnow's bound depends only weakly on n.)

Page 22 Multiclass Classification in NLP. Named Entity Recognition: • Label people, locations, and organizations in a sentence. • [PER Sam Houston], [born in] [LOC Virginia], [was a member of the] [ORG US Congress]. Decompose into sub-problems: • Sam Houston, born in Virginia... → (PER, LOC, ORG, ?) → PER (1). • Sam Houston, born in Virginia... → (PER, LOC, ORG, ?) → None (0). • Sam Houston, born in Virginia... → (PER, LOC, ORG, ?) → LOC (2). Input: {0,1}^d or R^d. Output: {0,1,2,3,...,k}.

Page 23 Solving Multi-Class via Binary Learning. Decompose; use winner-take-all: y = argmax_i (w_i · x + t_i), with w_i, x ∈ R^n and t_i ∈ R. (Pairwise classification is also possible.) Key issue: how to train the binary classifiers w_i. Via the Kesler construction – comparative training – which allows learning Voronoi diagrams; equivalently, learn in nk dimensions. One-vs-all is not expressive enough.
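A minimal sketch of winner-take-all prediction over k linear separators (Python; the weights and feature values are made up for illustration, and training, e.g. via the Kesler construction, is not shown):

import numpy as np

def predict_winner_take_all(W, t, x):
    """One linear function (w_i, t_i) per class; predict the class with the highest score."""
    scores = W @ x + t            # W: k x n, t: k, x: n
    return int(np.argmax(scores))

# toy example: k = 3 classes (PER, LOC, ORG), n = 4 features
W = np.array([[0.5, -0.1, 0.0, 0.2],
              [-0.3, 0.7, 0.1, 0.0],
              [0.1, 0.0, -0.4, 0.6]])
t = np.array([0.0, 0.1, -0.2])
x = np.array([1.0, 0.0, 1.0, 1.0])
print(predict_winner_take_all(W, t, x))   # index of the winning class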

Page 24 Detour – Basic Classifier: SNoW. A learning architecture that supports several linear update rules (Winnow, Perceptron, naïve Bayes). Allows regularization, pruning, and many other options. True multi-class classification [Har-Peled, Roth, Zimak, NIPS 2003]. Variable size examples; very good support for large scale domains like NLP, both in terms of number of examples and number of features. Very efficient (1-2 orders of magnitude faster than SVMs). Integrated with an expressive Feature EXtraction Language (FEX). [Download from: ]

Page 25 Summary: Stand Alone Classification Theory is well understood  Generalization bounds  Practical issues Essentially all work is done with linear representations  Features: generated explicitly or implicitly (Kernels)  Tradeoff here is relatively understood  Success on a large number of large scale classification problems Key issues:  Features How to decide what are good features How to compute/extract features (intermediate representations)  Supervision: learning protocol

Page 26 Overview. Stand Alone Learning: • Modeling. • Representational issues. • Computational issues. Inference: • Making decisions under general constraints. • Semantic role labeling. How to train components of global decisions: • A tradeoff that depends on how easy the components are to learn. • Feedback to learning is (indirectly) given by the reasoning stage. • There may not be a need (or even a possibility) to learn exactly, but only to the extent that it supports reasoning.

Page 27 Comprehension (ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous. 1. Who is Christopher Robin? 2. When was Winnie the Pooh written? 3. What did Mr. Robin do when Chris was three years old? 4. Where did young Chris live? 5. Why did Chris write two books of his own?

Page 28 Identifying Phrase Structure Classifiers 1. Recognizing “The beginning of NP” 2. Recognizing “The end of NP” (or: word based classifiers: BIO representation) Also for other kinds of phrases… Some Constraints 1. Phrases do not overlap 2. Order of phrases 3. Length of phrases Use classifiers to infer a coherent set of phrases He reckons the current account deficit will narrow to only # 1.8 billion in September [ NP He ] [ VP reckons ] [ NP the current account deficit ] [ VP will narrow ] [ PP to ] [ NP only # 1.8 billion ] [ PP in ] [ NP September ]

Page 29 Constraint Structure. Sequential constraints: • Three models for sequential inference with classifiers [Punyakanok & Roth, NIPS'01, JMLR'05]: HMM / HMM with classifiers; conditional models; constraint satisfaction models (CSCL: more general constraints). • Other models have been proposed that can deal with sequential structures: conditional models (other classifiers), CRF, structured Perceptron [later]. • Many applications: shallow parsing, named entities, biological sequences. • These allow dynamic-programming-based inference. General constraint structure: • An Integer/Linear Programming formulation [Roth & Yih '02, '03, '04]; no dynamic programming. (Diagram: two sequential models over states s1…s6 and observations o1…o6.)

Page 30 Identifying Entities and Relations. J.V. Oswald was murdered at JFK after his assassin, K. F. Johns… Identify: J.V. Oswald [person] was murdered at JFK [location] after his assassin, K. F. Johns [person]…, with the relation Kill(X, Y). Identify named entities; identify relations between entities; exploit the mutual dependencies between named entities and relations to yield a coherent global detection. Some knowledge (classifiers) may be known in advance; some constraints may be available only at decision time.

Page 31 Inference with Classifiers. Scenario: global decisions in which several local decisions / components play a role, but there are mutual dependencies on their outcome. Assume: learned classifiers for different sub-problems; constraints on classifiers' labels (known during training or only at evaluation time). Goal: incorporate classifiers' predictions, along with the constraints, in making coherent decisions – decisions that respect the classifiers as well as domain/context specific constraints. Formally: global inference for the best assignment to all variables of interest.

Page 32 Setting. Inference with classifiers is not a new idea. • On sequential constraint structure: HMM, PMM [Punyakanok & Roth], CRF [Lafferty et al.], CSCL [Punyakanok & Roth]. • On general structure: heuristic search. • Attempts to use Bayesian networks [Roth & Yih '02] have problems. The proposed Integer Linear Programming (ILP) formulation is: • General: works on non-sequential constraint structure. • Expressive: can represent many types of constraints. • Optimal: finds the optimal solution. • Fast: commercial packages are able to quickly solve very large problems (hundreds of variables and constraints).

Page 33 Problem Setting (1/2). (Diagram: random variables x1…x8 with constraints C(x1, x4) and C(x2, x3, x6, x7, x8).) Random variables X; conditional distributions P (learned by classifiers); constraints C – any Boolean function defined on partial assignments (with possible weights W on constraints). Goal: find the "best" assignment – the assignment that achieves the highest global accuracy. This is an Integer Programming problem: X* = argmax_X P · X subject to constraints C (+ W · C). Everything is linear.

Page 34 Integer Linear Programming. A set of binary variables x = (x1, …, xd); a cost vector p ∈ R^d; constraint matrices C1 ∈ R^{t×d} and C2 ∈ R^{r×d}, where t and r are the numbers of inequality and equality constraints and d is the number of variables. The ILP solution x* is the vector that maximizes the cost function: x* = argmax_{x ∈ {0,1}^d} p · x, subject to C1 x ≥ b1 and C2 x = b2, where b1 ∈ R^t and b2 ∈ R^r.
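For intuition, a tiny sketch of this 0-1 program solved by exhaustive search (Python; the cost vector and constraint matrices are made up, and a real system would hand the problem to an ILP solver as the later slides note):

from itertools import product
import numpy as np

def solve_tiny_ilp(p, ineq, eq):
    """argmax_{x in {0,1}^d} p.x  subject to  A x >= b  and  C x = g.
    Exhaustive search over {0,1}^d; only sensible for toy d."""
    d = len(p)
    best_x, best_val = None, -np.inf
    A, b = ineq
    C, g = eq
    for bits in product([0, 1], repeat=d):
        x = np.array(bits)
        if not np.all(A @ x >= b) or not np.all(C @ x == g):
            continue                      # infeasible assignment
        val = p @ x
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val

p = np.array([0.6, 0.9, 0.4])                      # made-up costs
A, b = np.array([[1, 1, 1]]), np.array([2])        # at least two variables on
C, g = np.array([[1, 0, 1]]), np.array([1])        # exactly one of x1, x3
print(solve_tiny_ilp(p, (A, b), (C, g)))           # best feasible assignment and its cost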

Page 35 Problem Setting (2/2). A very general formalism, with connections to a large number of well studied optimization problems and a variety of applications. Justification: • a direct argument for the appropriate "best assignment"; • relations to Markov Random Fields (but better computationally). Significant modeling and computational advantages.

Page 36 Semantic Role Labeling. For each verb in a sentence: 1. identify all constituents that fill a semantic role; 2. determine their roles: Agent, Patient or Instrument, …, and their adjuncts, e.g., Locative, Temporal or Manner. The PropBank project [Kingsbury & Palmer '02] provides a large human-annotated corpus of semantic verb-argument relations. Experiment: the CoNLL-2004 shared task [Carreras & Marquez '04]; no parsed data in the input.

Page 37 Example: I left my nice pearls to her. • A0 represents the leaver, • A1 represents the thing left, • A2 represents the benefactor, • AM-LOC is an adjunct indicating the location of the action, • V determines the verb.

Page 38 Argument Types. A0-A5 and AA have different semantics for each verb, as specified in the PropBank Frame files. 13 types of adjuncts are labeled AM-XXX, where XXX specifies the adjunct type. C-ARG is used to specify the continuity of the argument ARG. In some cases the actual argument is labeled with the appropriate argument type ARG, while the relative pronoun referring to it is labeled R-ARG.

Page 39 Examples: C-ARG and R-ARG.

Page 40 Algorithm I. Find potential argument candidates (Filtering) II. Classify arguments to types III. Inference for Argument Structure  Cost Function  Constraints  Integer linear programming (ILP)

Page 41 I. Find Potential Arguments. An argument can be any set of consecutive words. Restrict potential arguments: • Classify BEGIN(word): BEGIN(word) = 1 ⇔ "word begins an argument". • Classify END(word): END(word) = 1 ⇔ "word ends an argument". A span (w_i, ..., w_j) is a potential argument iff BEGIN(w_i) = 1 and END(w_j) = 1. This reduces the set of potential arguments (PotArg). Example: I left my nice pearls to her, with candidate brackets [ [ [ [ [ ] ] ] ] ].
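A small sketch of how the BEGIN/END decisions induce the candidate set PotArg (Python; the per-word classifier outputs are hard-coded for illustration):

def candidate_arguments(begin, end):
    """All spans (i, j) with BEGIN(w_i) = 1 and END(w_j) = 1 and i <= j."""
    spans = []
    for i, b in enumerate(begin):
        if not b:
            continue
        for j in range(i, len(end)):
            if end[j]:
                spans.append((i, j))
    return spans

# "I left my nice pearls to her" -- made-up classifier decisions per word
begin = [1, 0, 1, 0, 0, 1, 1]
end = [1, 0, 0, 0, 1, 0, 1]
print(candidate_arguments(begin, end))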

Page 42 II. Argument Type Likelihood. Assign a type likelihood: how likely is it that argument a is of type t? For all a ∈ PotArg and t ∈ T, estimate P(argument a = type t). Example: I left my nice pearls to her, with candidate spans and possible labels A0, A1, C-A1, Ø.

Page 43 Details – Phrase-level Classifier. Learn a classifier (SNoW) ARGTYPE(arg): Φ^P(arg) → {A0, A1, ..., C-A0, ..., AM-LOC, ...}, predicting argmax_{t ∈ {A0, A1, ..., C-A0, ..., AM-LOC, ...}} w_t · Φ^P(arg). Estimate probabilities via a softmax over the SNoW activations: P(a = t) = exp(w_t · Φ^P(a)) / Z.
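A minimal sketch of the softmax step over the per-label activations (Python; the activation values are made up):

import numpy as np

def softmax(activations):
    """P(a = t) = exp(activation_t) / Z, computed in a numerically stable way."""
    z = np.asarray(activations, dtype=float)
    z -= z.max()              # subtract the max activation for stability
    e = np.exp(z)
    return e / e.sum()

# activations of one candidate argument for the labels (A0, A1, AM-LOC, null)
print(softmax([2.1, 0.3, -1.0, 0.5]))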

Page 44 What is a Good Assignment? Likelihood of being correct: P(Arg a = Type t), if t is the correct type for argument a. For a set of arguments a1, a2, ..., an, the expected number of arguments that are correct is Σ_i P(a_i = t_i). The solution is the assignment with the maximum expected number of correct arguments.

Page 45 Inference. Maximize the expected number of correct arguments: T* = argmax_T Σ_i P(a_i = t_i), subject to some constraints – structural and linguistic (e.g., R-A1 ⇒ A1). (Figure: candidate labelings of "I left my nice pearls to her"; the independent per-argument maximum has cost 1.8 but violates the constraints, the best non-overlapping assignment has cost 1.6, and adding the linguistic constraint yields cost 1.4.)

Page 46 LP Formulation – Linear Cost. Cost function: Σ_{a ∈ PotArg, t ∈ T} P(a = t) · x_{a=t}, with indicator variables x_{a1=A0}, x_{a1=A1}, …, x_{a4=AM-LOC}, x_{a4=Ø} ∈ {0,1}. Total cost = p(a1=A0) · x(a1=A0) + p(a1=Ø) · x(a1=Ø) + … + p(a4=Ø) · x(a4=Ø). This corresponds to maximizing the expected number of correct phrases.

Page 47 Binary values  a  P OT A RG, t  T, x { a = t }  {0,1} Unique labels  a  P OT A RG,  t  T x { a = t } = 1 No overlapping or embedding a1 and a2 overlap  x {a1= Ø } + x {a2= Ø }  1 Linear Constraints (1/2)

Page 48 No duplicate argument classes  a  P OT A RG x { a = A0 }  1 R-ARG  a2  P OT A RG,  a  P OT A RG x { a = A0 }  x { a2 = R-A0 } C-ARG  a2  P OT A RG,  (a  P OT A RG )  (a is before a2 ) x { a = A0 }  x { a2 = C-A0 } Many other possible constraints:  Exactly one argument of type Z  If verb is of type A, no argument of type B Linear Constraints (2/2) Any Boolean rule can be encoded as a linear constraint. If the is an R-ARG phrase, there is an ARG Phrase If the is an C-ARG phrase, there is an ARG before it

Page 49 Discussion. The inference approach was also used for simultaneous named entity and relation identification (CoNLL'04); a few other problems are in progress. Global inference helps! • All constraints vs. only non-overlapping constraints: error reduction > 5%; > 1% absolute F1. • A lot of room for improvement (additional constraints). • Easy and fast: sentences/second (using Xpress-MP). Modeling and implementation details:

Page 50 Overview. Stand Alone Learning: • Modeling. • Representational issues. • Computational issues. Inference: • Semantic role labeling. • Making decisions under general constraints. How to train components of global decisions: • A tradeoff that depends on how easy the components are to learn. • Feedback to learning is (indirectly) given by the reasoning stage. • There may not be a need (or even a possibility) to learn exactly, but only to the extent that it supports reasoning.

Page 51 Phrase Identification Problem. Use classifiers' outcomes to identify phrases; the final outcome is determined by optimizing the classifiers' outcomes together with the constraints. (Figure: input tokens o1…o10; open/close bracket predictions from Classifier 1 and Classifier 2; the inferred output phrase structure over s1…s10.) Did this classifier make a mistake? How should we train it?

Page 52 Learning Structured Output. Input variables x = (x1, …, xd) ∈ X; output variables y = (y1, …, yd) ∈ Y; a set of constraints C(Y) ⊆ Y. A cost function f(x, y) assigns a score to each possible output. The cost function is linear in the components of y = (y1, …, yd): f(x, (y1, …, yd)) = Σ_i f_i(x, y). Each scoring function (classifier) is linear over some feature space: f_i(x, y) = w_i · Φ(x, y). Therefore the overall cost function is linear. We seek a solution y* that maximizes the cost function subject to the constraints C(Y): y* = argmax_{y ∈ C(Y)} Σ_i w_i · Φ(x, y).

Page 53 Learning and Inference for Structured Output. Inference is the task of determining an optimal assignment y given an assignment x. For a sequential structure of constraints, polynomial-time algorithms such as Viterbi or CSCL [Punyakanok & Roth, NIPS'01] can be used. For a general structure of constraints, we proposed a formalism that uses Integer Linear Programming (ILP). Irrespective of the inference procedure chosen, there are several ways to learn the scoring function parameters; these differ in whether or not the structure-based inference process is leveraged during training. • Learning Local Classifiers: decouple learning from inference. • Learning Global Classifiers: interleave inference with learning.

Page 54 Learning Local and Global Classifiers. Learning Local Classifiers: no knowledge of the inference procedure is used during learning. • For each example (x, y) ∈ D, the learning algorithm must ensure that each component of y produces the correct output. • Global constraints are enforced only at evaluation time. Learning Global Classifiers: train to produce the correct global output. • Feedback from the inference process determines which classifiers to provide feedback to; together, the classifiers and the inference yield the desired result. • At each step a subset of the classifiers is updated according to the inference feedback. • Conceptually similar to CRFs and Collins' Perceptron; we provide an online algorithm with a more general inference procedure.
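A rough, structured-perceptron-style sketch of Inference Based Training (Python; an illustration of the idea rather than the paper's exact algorithm -- the unconstrained_inference function stands in for the real constrained inference step, and the toy data is made up):

import numpy as np

def ibt_train(examples, labels, inference, n_features, epochs=5, rate=1.0):
    """Inference Based Training: run global inference with the current weights and
    update only the component classifiers that the inference process got wrong."""
    W = {t: np.zeros(n_features) for t in labels}      # one weight vector per label
    for _ in range(epochs):
        for x_parts, y_gold in examples:               # x_parts: one feature vector per local decision
            y_pred = inference(W, x_parts)             # feedback comes from the global decision
            for x_i, t_gold, t_pred in zip(x_parts, y_gold, y_pred):
                if t_pred != t_gold:
                    W[t_gold] += rate * x_i
                    W[t_pred] -= rate * x_i
    return W

def unconstrained_inference(W, x_parts):
    """Stand-in for the inference step: independent argmax per decision.
    In the full model this is where the constrained (ILP) inference would go."""
    return [max(W, key=lambda t: W[t] @ x_i) for x_i in x_parts]

# toy data: each example has two local decisions, each described by a 3-dim feature vector
examples = [
    ([np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 0.0])], ["A0", "A1"]),
    ([np.array([1.0, 1.0, 0.0]), np.array([0.0, 1.0, 1.0])], ["A0", "A1"]),
]
W = ibt_train(examples, labels=["A0", "A1"], inference=unconstrained_inference, n_features=3)
print(unconstrained_inference(W, examples[0][0]))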

Page 55 Relative Merits. Learning Local Classifiers = L+I (Learning plus Inference). Learning Global Classifier = Inference Based Training (IBT). Claim: with a fixed number of examples: 1. If the local classification tasks are separable, then L+I outperforms IBT. 2. If the task is globally separable, but not locally separable, then IBT outperforms L+I only with sufficiently many examples. This number correlates with the degree of separability of the local classifiers (the stricter the constraints are, the larger the number of examples IBT needs).

Page 56 Relative Merits. Learning Local Classifiers = L+I; Learning Global Classifier = Inference Based Training (IBT); LO: learning a stand-alone component. (Figure: results on the SRL task.) In order to get into the region in which IBT is good, we reduced the number of features used by the individual classifiers.

Page 57 Relative Merits (2). Simulation results comparing different learning strategies at various degrees of difficulty of the local classifiers. κ = 0 implies local linear separability; higher κ indicates harder local classification. (Figure panels: κ = 0, κ = 0.15, κ = 1.)

Page 58 Summary. Stand alone learning: • Learning itself is well understood. Learning & inference with general global constraints: • Many problems in NLP involve several interdependent components and require applying inference to obtain the best global solution. • We need to incorporate linguistic and other constraints into NLP reasoning. • A general paradigm for inference over learned components, based on Integer Linear Programming (ILP) over general learners and expressive constraints. • Preliminary understanding of the relative merits of different training approaches: in practical situations, decoupling training from inference is advantageous. Open issues: features (how to map to a high dimensional space); the learning protocol (weaker forms of supervision); what are the components? What else do we do wrong: developmental issues.

Page 59 Questions? Thank you