CIS 700 Advanced Machine Learning for NLP
A First Look at Structures
Dan Roth, Department of Computer and Information Science, University of Pennsylvania
Augmented and modified by Vivek Srikumar

Admin Stuff
Project proposals are due on 9/28; come to office hours to discuss them.
Paper presentations: Next Tuesday we will start with students' presentations of papers. If you haven't done so, please choose a paper TODAY. Presentations are 20 min plus 10 min of discussion. Send your presentation to me no later than 2 days before the presentation (Sunday evening for a Tuesday slot, Tuesday evening for a Thursday slot), and come to the office hour to discuss it once you get comments. You will be asked to grade the presentations of all other students.
Critical reviews: The first one is due next week, 9/26. Under 2 pages, PDF; make it look like a paper and try to make it interesting.
Grading: Project 40%, Reviews 30% (12, 6, 6, 6), Presentation 20%, Participation 10%.

So far…
Binary classifiers: output is 0/1.
Multiclass classifiers: output is one of a set of labels.
Decomposed learning vs. joint/global learning.
Winner-take-all prediction for multiclass.

Sequence Tagging
The cat sat on the mat .
DT  NN  VBD  IN  DT  NN  .
A naïve approach (which actually works pretty well for POS tagging) treats each tag as a separate prediction. Multiple extensions are possible. What are the issues that come up?

Structured Prediction
Structured prediction brings up two key issues.
1. The large space of outputs. Prediction is y* = argmax_{y ∈ Y} w^T φ(x, y). Since Y is large, prediction needs to be more sophisticated; we call it inference. And we can't train a separate weight vector for each possible output, as we did in multiclass.
2. If you want to decompose the output and deal with a smaller Y, you need to account for (in)consistencies: if the various components y = (y1, y2, …, yk) depend on each other, how would you predict one without knowledge of the others?
This issue will come up both in training and at decision time.
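To make the argmax concrete, here is a minimal brute-force sketch (the feature templates, labels, and weights are all made up for illustration): a single weight vector scores joint features φ(x, y), and prediction enumerates the whole output space, which is exactly the loop that real inference procedures must replace.

```python
import itertools

# Toy joint feature map phi(x, y): counts of (word, label) pairs plus
# (previous-label, label) pairs. Feature names and labels are made up.
LABELS = ["DT", "NN", "VBD"]

def joint_features(words, labels):
    feats = {}
    for i, (w, y) in enumerate(zip(words, labels)):
        feats[("word", w.lower(), y)] = feats.get(("word", w.lower(), y), 0) + 1
        prev = labels[i - 1] if i > 0 else "<s>"
        feats[("trans", prev, y)] = feats.get(("trans", prev, y), 0) + 1
    return feats

def score(weights, words, labels):
    # w^T phi(x, y), with both w and phi stored as sparse dictionaries
    return sum(weights.get(f, 0.0) * v
               for f, v in joint_features(words, labels).items())

def predict_brute_force(weights, words):
    # argmax over ALL label sequences: |LABELS|^n candidates, so this is
    # only feasible for toy inputs; inference algorithms replace this loop.
    best = max(itertools.product(LABELS, repeat=len(words)),
               key=lambda ys: score(weights, words, list(ys)))
    return list(best)

weights = {("word", "the", "DT"): 1.0, ("word", "cat", "NN"): 1.0,
           ("word", "sat", "VBD"): 1.0, ("trans", "DT", "NN"): 0.5}
print(predict_brute_force(weights, ["The", "cat", "sat"]))  # ['DT', 'NN', 'VBD']
```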

Decomposing the output
We need to produce a graph, but in general we cannot enumerate all possible graphs for the argmax.
Solution: Think of the graph as a combination of many smaller parts. The parts should agree with each other in the final output. Each part has a score, and the total score for the graph is the sum of the scores of the parts.
Decomposition of the output into parts also helps generalization. Why?

Decomposing the output: Example
Setting: Nodes and edges are labeled (3 possible node labels, 3 possible edge labels), and the blue and orange edges form a tree. Goal: Find the highest scoring tree.
The scoring function (via the weight vector) scores outputs. For generalization and ease of inference, break the output into parts and score each part; the score for the structure is the sum of the part scores.
What is the best way to do this decomposition? It depends….
Note: The output y is a labeled assignment of the nodes and edges; the input x is not shown here.

Decomposing the output: Example
One option: Decompose fully. All nodes and edges are independently scored (with linear functions), but we still need to ensure that the colored edges form a valid output, i.e. a tree.

Decomposing the output: Example
With full decomposition, the highest-scoring parts taken independently may produce an invalid output, for example colored edges that do not form a tree. Even this simple decomposition requires inference to ensure validity.
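A minimal sketch of this fully decomposed scoring (the tiny graph, label sets, and scores below are hypothetical): the structure's score is a sum of independently scored node and edge parts, and a separate validity check stands in for the work inference has to do to restrict the search to trees.

```python
# Every node and every edge is scored on its own; the structure's score is
# the sum of the part scores. Score tables and the example graph are made up.

def structure_score(node_labels, edge_labels, node_scores, edge_scores):
    total = sum(node_scores[n][lab] for n, lab in node_labels.items())
    total += sum(edge_scores[e][lab] for e, lab in edge_labels.items())
    return total

def selected_edges_form_tree(edge_labels, nodes, selected=("blue", "orange")):
    # The validity check inference must enforce: the chosen (colored) edges
    # should connect all nodes without forming a cycle.
    chosen = [e for e, lab in edge_labels.items() if lab in selected]
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n
    for u, v in chosen:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False                     # this edge closes a cycle
        parent[ru] = rv
    return len(chosen) == len(nodes) - 1     # exactly enough edges to span

nodes = ["a", "b", "c"]
node_scores = {"a": {"red": 1.0, "green": 0.2},
               "b": {"red": 0.1, "green": 0.9},
               "c": {"red": 0.4, "green": 0.5}}
edge_scores = {("a", "b"): {"blue": 1.0, "none": 0.0},
               ("b", "c"): {"orange": 0.8, "none": 0.0}}
node_labels = {"a": "red", "b": "green", "c": "green"}
edge_labels = {("a", "b"): "blue", ("b", "c"): "orange"}
print(structure_score(node_labels, edge_labels, node_scores, edge_scores))  # 4.2
print(selected_edges_form_tree(edge_labels, nodes))                         # True
```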

Decomposing the output: Example
Another possibility: Score each edge and its nodes together (again with a linear function), and similarly for the other edges. Each patch represents a piece that is scored independently.

Decomposing the output: Example
With this decomposition, inference should ensure that (1) the output is a tree, and (2) shared nodes have the same label in all the pieces.

Decomposing the output: Example
An assignment can be invalid because two parts disagree on the label for a shared node; inference must rule out such disagreements as well as non-tree outputs.

Decomposing the output: Example
We have seen two examples of decomposition; many other decompositions are possible…

Inference
Each part is scored independently. Key observation: the number of possible outcomes for each part may not be large, even if the number of possible structures is.
Inference: How do we glue together the pieces to build a valid output? The answer depends on the "shape" of the output.
The computational complexity of inference is important. In the worst case it is intractable, but with assumptions about the output, polynomial algorithms exist. We may encounter some examples in more detail: the Viterbi algorithm for predicting sequence chains (sketched below), and the CKY algorithm for parsing a sentence into a tree. In general, we might have to either live with intractability or approximate. Questions?
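For chain-shaped outputs the gluing is tractable. Here is a minimal Viterbi sketch (the emission and transition score tables are hypothetical) that finds the argmax label sequence for a score decomposed into per-position and adjacent-pair parts, in time O(n · |labels|²) rather than |labels|^n.

```python
def viterbi(words, labels, emission, transition):
    """Highest-scoring label sequence for a chain-decomposed score:
    sum_i emission[(words[i], y_i)] + sum_i transition[(y_{i-1}, y_i)].
    emission/transition are hypothetical score lookups (default 0)."""
    n = len(words)
    # best[i][y] = score of the best prefix ending with label y at position i
    best = [{y: float("-inf") for y in labels} for _ in range(n)]
    back = [{y: None for y in labels} for _ in range(n)]
    for y in labels:
        best[0][y] = emission.get((words[0], y), 0.0)
    for i in range(1, n):
        for y in labels:
            for y_prev in labels:
                s = (best[i - 1][y_prev]
                     + transition.get((y_prev, y), 0.0)
                     + emission.get((words[i], y), 0.0))
                if s > best[i][y]:
                    best[i][y], back[i][y] = s, y_prev
    # Recover the argmax sequence by following back-pointers
    y_last = max(labels, key=lambda y: best[n - 1][y])
    seq = [y_last]
    for i in range(n - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))

labels = ["DT", "NN", "VBD"]
emission = {("the", "DT"): 2.0, ("cat", "NN"): 2.0, ("sat", "VBD"): 2.0}
transition = {("DT", "NN"): 1.0, ("NN", "VBD"): 1.0}
print(viterbi(["the", "cat", "sat"], labels, emission, transition))
```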

Training regimes
Decomposition of outputs gives two approaches to training.
Decomposed training (learning without inference): the learning algorithm does not use the prediction procedure during training.
Global training (joint training, inference-based training): the learning algorithm uses the final prediction procedure during training.
These are similar to the two strategies we had before with multiclass.
Inference complexity is often an important consideration in the choice of modeling and training, especially if full inference plays a part during training. The ease of training smaller, less complex models can also give intermediate training strategies between fully decomposed and fully joint.

Example 1: Semantic Role Labeling
One important lesson in the following examples is that "consistency" is actually a loaded term; in most interesting cases it means "coherency via some knowledge".
The task is based on PropBank [Palmer et al. 05], a large human-annotated corpus of verb semantic relations. The task: predict the arguments of verbs. Given a sentence, identify who does what to whom, where and when.
Example: The bus was heading for Nairobi in Kenya.
Predicate relation: Head. Arguments: Mover [A0]: the bus; Destination [A1]: Nairobi in Kenya.

Predicting verb arguments
The bus was heading for Nairobi in Kenya.
1. Identify candidate arguments for the verb using the parse tree, filtered using a binary classifier.
2. Classify argument candidates with a multiclass classifier (one of multiple labels per candidate).
3. Inference: using probability estimates from the argument classifier, produce an output that respects structural and linguistic constraints, e.g. no overlapping arguments.

Inference: verb arguments
The bus was heading for Nairobi in Kenya.
[Figure: each candidate argument span receives a score for every possible argument label, including a special label meaning "Not an argument".]
The highest-scoring assignment overall (total score 2.0), heading(The bus, for Nairobi, for Nairobi in Kenya), violates a constraint: "for Nairobi" and "for Nairobi in Kenya" are overlapping arguments. The best assignment that satisfies the constraints (total score 1.9) is heading(The bus, for Nairobi in Kenya), and this is what inference returns.
Input: text with pre-processing. Output: five possible decisions for each candidate; create a binary variable for each decision, only one of which is true for each candidate. Collectively, this is a "structure".
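A minimal sketch of this inference step (the candidate spans and scores below are made up and do not reproduce the slide's numbers): each candidate span receives one label, including the "not an argument" label, and we search over joint assignments while discarding any that violate the no-overlap constraint. In practice this search is usually formulated as an ILP rather than brute force.

```python
import itertools

# Hypothetical candidate argument spans (start, end) with per-label scores;
# "O" is the special "not an argument" label.
LABELS = ["A0", "A1", "AM-LOC", "O"]
candidates = {
    (0, 2): {"A0": 0.5, "A1": 0.1, "AM-LOC": 0.1, "O": 0.3},  # "The bus"
    (4, 6): {"A0": 0.1, "A1": 0.4, "AM-LOC": 0.2, "O": 0.3},  # "for Nairobi"
    (4, 9): {"A0": 0.1, "A1": 0.6, "AM-LOC": 0.2, "O": 0.1},  # "for Nairobi in Kenya"
}

def overlaps(span_a, span_b):
    return span_a[0] < span_b[1] and span_b[0] < span_a[1]

def valid(assignment):
    # Structural constraint: two spans both labeled as arguments may not overlap.
    args = [s for s, lab in assignment.items() if lab != "O"]
    return all(not overlaps(a, b)
               for i, a in enumerate(args) for b in args[i + 1:])

def constrained_inference(candidates):
    spans = list(candidates)
    best, best_score = None, float("-inf")
    for labs in itertools.product(LABELS, repeat=len(spans)):
        assignment = dict(zip(spans, labs))
        if not valid(assignment):
            continue                      # skip constraint-violating assignments
        total = sum(candidates[s][lab] for s, lab in assignment.items())
        if total > best_score:
            best, best_score = assignment, total
    return best, best_score

print(constrained_inference(candidates))
```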

Structured output is…
A data structure with a pre-defined schema. E.g., SRL converts raw text into a record in a database:
  Predicate: Head | A0: The bus | A1: Nairobi in Kenya | Location: -
Equivalently, a graph (here, a predicate node "Head" with an A0 edge to "The bus" and an A1 edge to "Nairobi in Kenya"), often restricted to a specific family of graphs: chains, trees, etc.
Questions/comments?

Example 2: Object detection
A good illustration of why one might want decomposition: transfer. E.g., a wheel classifier might be useful in many other tasks.
How would you design a predictor that labels all the parts using the tools we have seen so far?
[Figure: a right-facing bicycle with its parts labeled: handle bar, saddle/seat, left wheel, right wheel. Photo by Andrew Dressel, own work, licensed under Creative Commons Attribution-Share Alike 3.0.]

One approach to build this structure
Train a binary detector for each part, each answering "Is this part in this box?": 1. a left wheel detector, 2. a right wheel detector, 3. a handle bar detector, 4. a seat detector.
Final output: Combine the predictions of these individual (local) classifiers. The predictions interact with each other, e.g. the same box cannot be both a left wheel and a right wheel, the handle bar does not overlap with the seat, etc. We need inference to compose the output.

Example 3: Sequence labeling (more on this in the next lecture)
Input: a sequence of tokens (like words). Output: a sequence of labels of the same length as the input.
E.g., part-of-speech tagging: given a sentence, find the parts of speech of all the words.
The  Fed  raises  interest  rates
Determiner  Noun  Verb  Noun  Noun
Several of these words take other tags in different contexts: Fed can be a Verb (I fed the dog), interest can be a Verb (Poems don't interest me), rates can be a Verb (He rates movies online).

Part-of-speech tagging
Given a word, its label depends on:
The identity and characteristics of the word. E.g., raises is a Verb because it ends in -es (among other reasons).
Its grammatical context. Fed in "The Fed" is a Noun because it follows a Determiner; fed in "I fed the…" is a Verb because it follows a Pronoun.
One possible model: each output label depends on its neighbors in addition to the input.
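A minimal sketch of the kind of local feature function such a model might use (the feature templates are illustrative, not taken from any particular tagger): the features for a position look at the word's identity and characteristics and at the previous label, which is exactly the dependence described above, and they would feed the w^T φ(x, y) scores used earlier.

```python
def local_features(words, i, y_prev, y):
    """Features for labeling position i with tag y, given the previous tag.
    Illustrative templates: word identity, a suffix feature, and the
    (previous-tag, tag) pair that captures grammatical context."""
    w = words[i].lower()
    return {
        ("word", w, y): 1.0,
        ("ends-in-es", w.endswith("es"), y): 1.0,  # e.g. "raises" -> Verb
        ("prev_tag", y_prev, y): 1.0,              # e.g. Determiner -> Noun
    }

# "Fed" after a Determiner vs. "fed" after a Pronoun activates different
# (prev_tag, tag) features, so the same word can receive different scores.
print(local_features(["The", "Fed", "raises"], 1, "Determiner", "Noun"))
print(local_features(["I", "fed", "the"], 1, "Pronoun", "Verb"))
```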

More examples
Protein 3D structure prediction. Inferring the layout of a room (image from [Schwing et al. 2013]).

Structured output is…
Representation: a graph, possibly labeled and/or directed, possibly from a restricted family such as chains, trees, etc.; a discrete representation of the input, e.g. a table, the SRL frame output, a sequence of labels, etc.
Procedural: a collection of inter-dependent decisions, e.g. the sequence of decisions used to construct the output; the result of a combinatorial optimization problem, argmax_{y ∈ all outputs} score(x, y).
We have seen something similar before in the context of multiclass, and there are a countable number of graphs. Question: why can't we treat each output as a label and train/predict as multiclass?

Challenges with structured output
Two challenges:
1. We cannot train a separate weight vector for each possible inference outcome (for multiclass, we could train one weight vector for each label).
2. We cannot enumerate all possible structures for inference (inference for multiclass was easy).
Solution: Decompose the output into parts that are labeled, and define how the parts interact with each other, how labels are scored for each part, and an inference algorithm to assign labels to all the parts.

Computational issues: Reprise
Model definition: What are the parts of the output? What are the inter-dependencies? This is driven by background knowledge about the domain (task dependent, of course) and also by training and inference concerns.
How to train the model? Different algorithms, joint vs. decomposed learning; data annotation difficulty; semi-supervised or indirectly supervised?
How to do inference? Different algorithms, exact vs. approximate inference.

Summary
We saw several examples of structured output. Structures are graphs; it is sometimes useful to think of them as a sequence of decisions, and also as data structures. Multiclass is the simplest type of structure, and lessons from multiclass are useful.
Modeling outputs as structures involves decomposition of the output, inference, and training.
Questions?

Structured Prediction: Inference
Placing this in context: a crash course in structured prediction (may be skipped).
Inference: given an input x (a document, a sentence), predict the best structure y = {y1, y2, …, yn} ∈ Y (e.g. entities and relations), i.e. assign values to y1, y2, …, yn while accounting for dependencies among the yi's.
Inference is expressed as the maximization of a scoring function:
  y' = argmax_{y ∈ Y} w^T φ(x, y)
where φ(x, y) are joint features on inputs and outputs, Y is the set of allowed structures, and w are the feature weights (estimated during learning).
Inference requires, in principle, touching all y ∈ Y at decision time, when we are given x ∈ X and attempt to determine the best y ∈ Y for it, given w. For some structures inference is computationally easy, e.g. using the Viterbi algorithm; in general it is NP-hard (and can be formulated as an ILP).

Structured Prediction: Learning
Learning: given a set of structured examples {(x, y)}, find a scoring function w that minimizes empirical loss.
Learning is thus driven by the attempt to find a weight vector w such that, for each given annotated example (xi, yi), the score of the annotated structure is at least the score of any other structure plus a penalty for predicting that other structure:
  w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(yi, y)   for all y

Structured Prediction: Learning
We call these conditions, over all y, the learning constraints: the score of the annotated structure must beat the score of any other structure by the penalty for predicting that other structure.
In most learning algorithms used today, the update of the weight vector w is done in an online fashion. Think of it as a Perceptron; this procedure applies to the Structured Perceptron, CRFs, and linear structured SVMs. (Almost) without loss of generality, we can thus write the generic structured learning algorithm as follows.

Structured Prediction: Learning Algorithm
In the structured case, prediction (inference) is often intractable but needs to be done many times.
For each example (xi, yi) do (with the current weight vector w):
  Predict: perform inference with the current weight vector, yi' = argmax_{y ∈ Y} w^T φ(xi, y)
  Check the learning constraints: is the score of the current prediction better than that of (xi, yi)?
    If yes (a mistaken prediction): update w
    Otherwise: no need to update w on this example
EndFor
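A minimal structured-perceptron-style sketch of this loop, assuming a sparse-dictionary feature representation (the function names are placeholders; `inference` would be something like the Viterbi or constrained search sketched earlier): predict with the current weights and, on a mistake, move the weights toward the gold structure's features and away from the predicted structure's features.

```python
def structured_perceptron(examples, joint_features, inference, epochs=5):
    """Generic mistake-driven structured learner (illustrative sketch).
    examples:       list of (x, y_gold) pairs
    joint_features: (x, y) -> sparse dict, i.e. phi(x, y)
    inference:      (weights, x) -> argmax_y w^T phi(x, y)
    CRFs and structured SVMs change the loss and the update, but share
    this inference-in-the-loop shape."""
    weights = {}
    for _ in range(epochs):
        for x, y_gold in examples:
            y_pred = inference(weights, x)           # predict with current w
            if y_pred != y_gold:                     # learning constraint violated
                for f, v in joint_features(x, y_gold).items():
                    weights[f] = weights.get(f, 0.0) + v   # toward gold structure
                for f, v in joint_features(x, y_pred).items():
                    weights[f] = weights.get(f, 0.0) - v   # away from prediction
    return weights
```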

Structured Prediction: Learning Algorithm
Solution I: decompose the scoring function into EASY and HARD parts.
For each example (xi, yi) do:
  Predict: perform inference with the current weight vector, yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y) + w_HARD^T φ_HARD(xi, y)
  Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
    If yes (a mistaken prediction): update w
    Otherwise: no need to update w on this example
EndDo
EASY could be feature functions that correspond to an HMM or a linear CRF, or even φ_EASY(x, y) = φ(x), omitting the dependence on y and corresponding to classifiers. This may not be enough if the HARD part is still part of each inference step.

Structured Prediction: Learning Algorithm
Solution II: disregard some of the dependencies, i.e. assume a simpler model. The learning loop is the same as above, with inference performed over the simplified model.

Structured Prediction: Learning Algorithm
Solution III: disregard some of the dependencies during learning, and take them into account only at decision time. The learning loop is again the same as above. This is the most commonly used solution in NLP today.
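A minimal sketch of this "learn locally, enforce constraints only at decision time" recipe in the style of the earlier SRL example (all names are hypothetical stand-ins): each candidate is trained as an ordinary multiclass example with the dependencies ignored, and the structural constraints enter only through constrained inference at prediction time.

```python
# Hypothetical sketch of decomposed training + constrained prediction.
# `train_multiclass` and `constrained_inference` are stand-ins for an
# ordinary local multiclass learner and for an inference routine like the
# one sketched in the SRL example above.

def train_locally(training_data, train_multiclass):
    # Each candidate becomes an independent multiclass example
    # (candidate features, gold argument label); dependencies between
    # candidates are ignored at training time.
    examples = [(feats, label)
                for sentence in training_data
                for feats, label in sentence["candidates"]]
    return train_multiclass(examples)

def predict_globally(local_model, candidates, constrained_inference):
    # At decision time, local scores are computed independently and the
    # structural constraints (e.g. no overlapping arguments) are enforced
    # by the inference procedure.
    scores = {span: local_model.score_all_labels(feats)
              for span, feats in candidates.items()}
    return constrained_inference(scores)
```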

Next steps…
Sequence prediction: Markov models; predicting a sequence with the Viterbi algorithm; training MEMMs, CRFs, and the structured perceptron for sequences.
After sequences: general representations of probabilistic models (Bayes nets and Markov random fields); generalization of global training algorithms to arbitrary conditional models; inference techniques; more on conditional models and constraints on inference.