Zero-Shot Relation Extraction via Reading Comprehension

Zero-Shot Relation Extraction via Reading Comprehension
Omer Levy, Minjoon Seo, Eunsol Choi, Luke Zettlemoyer
University of Washington / Allen Institute for Artificial Intelligence

I'd like to talk to you about how we can leverage recent progress in reading comprehension to achieve a very interesting result in relation extraction.

Relation Extraction (Slot Filling)

Relation: educated_at(x, ?)
Entity: x = Turing
Sentence: "Alan Turing graduated from Princeton."
→ Relation Extraction Model → Answer: Princeton

So in this talk, I'm going to refer to a specific type of relation extraction: slot filling. In this task, we're given a relation, like "educated at", an entity, like Alan Turing, and a sentence: "Alan Turing graduated from Princeton." The goal is to fill the missing slot in the relation from the information in the sentence, if possible. In this case, the answer is "Princeton".

Relation Extraction (Slot Filling)

Relation: educated_at(x, ?)
Entity: x = Turing
Sentence: "Turing was an English mathematician."
→ Relation Extraction Model → Answer: <null>

Obviously, there are many cases where the sentence cannot complete the relation, in which case we expect the model to indicate that there is no answer.
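To make the task contract concrete, here is a minimal sketch of the slot-filling interface. The function name and the toy lookup logic are ours, purely for illustration; the actual system is a learned model, not a string match.

```python
from typing import Optional

def fill_slot(relation: str, entity: str, sentence: str) -> Optional[str]:
    """Toy stand-in for a slot-filling model (names and logic are ours):
    return the span of `sentence` that fills the missing slot of
    `relation`, or None if the sentence cannot complete the relation."""
    # A real system would be a trained model; this stub just
    # reproduces the two examples from the slides.
    if relation == "educated_at" and "graduated from Princeton" in sentence:
        return "Princeton"
    return None  # e.g. "Turing was an English mathematician."

print(fill_slot("educated_at", "Turing",
                "Alan Turing graduated from Princeton."))  # Princeton
print(fill_slot("educated_at", "Turing",
                "Turing was an English mathematician."))   # None
```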

Reading Comprehension

Question: "Where did Turing study?"
Sentence: "Alan Turing graduated from Princeton."
→ Reading Comprehension Model → Answer: Princeton

In the task of reading comprehension, we're also given a text, but instead of a relation and an entity, we have a natural-language question: "Where did Turing study?" The goal is to answer the question from the text by selecting a span (or a set of spans) from the text.

Relation Extraction via Reading Comprehension

Relation: educated_at(x, ?)
Entity: x = Turing
Sentence: "Alan Turing graduated from Princeton."
→ Reading Comprehension Model → Answer: Princeton

Now, our main observation is that the task of relation extraction can be reduced to reading comprehension. What we basically need to do is translate the knowledge-base relation and entity into a natural-language question.

Relation Extraction via Reading Comprehension

Relation: educated_at(x, ?)
→ (querification) → Question Template: "Where did x study?"
Entity: x = Turing
Sentence: "Alan Turing graduated from Princeton."
→ Reading Comprehension Model → Answer: Princeton

We do this by first converting the relation into a question template, a process we call "querification".

Relation Extraction via Reading Comprehension

Relation: educated_at(x, ?)
→ (querification) → Question Template: "Where did x study?"
→ (instantiation with x = Turing) → Question: "Where did Turing study?"
Sentence: "Alan Turing graduated from Princeton."
→ Reading Comprehension Model → Answer: Princeton

And then simply instantiating the template with the entity, to produce a natural-language question.
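The two steps just described are mechanically simple. Here is a minimal sketch (our own illustration, not the authors' released code; the template store is hypothetical) of mapping a relation to question templates and instantiating a template with an entity:

```python
import re

# Hypothetical template store: "querification" produces question
# templates with a placeholder x for each relation.
QUESTION_TEMPLATES = {
    "educated_at": ["Where did x study?",
                    "Which university did x go to?"],
}

def instantiate(template: str, entity: str) -> str:
    # Substitute the entity for the whole-word placeholder x
    # (word boundaries avoid touching the x in, say, "examined").
    return re.sub(r"\bx\b", entity, template)

for t in QUESTION_TEMPLATES["educated_at"]:
    print(instantiate(t, "Turing"))
# Where did Turing study?
# Which university did Turing go to?
```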

Advantages

So, what is it good for? What do we get from reducing relation extraction to reading comprehension?

Advantage: Generalize to Unseen Questions

Provides a natural-language API for defining and querying relations:
educated_at(Turing, ?) ≈ "Where did Turing study?" ≈ "Which university did Turing go to?"

Well, first of all, a model based on reading comprehension can generalize to unseen questions, which allows us to use natural-language questions like "Where did someone study?", instead of relation identifiers like "educated_at", to both define our schema and query it at test time.

Advantage: Generalize to Unseen Relations

Enables zero-shot relation extraction:
Train: educated_at, occupation, spouse, …
Test: country
Impossible for many relation-extraction systems.

And perhaps an even more interesting advantage is that it can generalize to questions about completely new relations, which were not even part of the schema during training. In other words, reducing relation extraction to reading comprehension allows us to do zero-shot relation extraction, which is an impossible feat for many relation-extraction models, including some of the Universal Schema approaches.

Challenges

Translating relations into question templates
  – Schema querification
  – Generated over 30,000,000 examples
Modeling reading comprehension
  – Plenty of research on SQuAD (Rajpurkar et al., EMNLP 2016)
  – Model based on BiDAF (Seo et al., ICLR 2017)
Predicting negative instances
  – Modified BiDAF can indicate no answer

Naturally, there are also some technical challenges. For instance, how do we translate relations into questions in a scalable way, so that we can collect enough data to train our reading comprehension model? We'll talk about schema querification, a very efficient process for doing this translation. In fact, it's so efficient that we were able to generate more than 30 million reading comprehension examples with a very modest budget. Once we have data, how do we actually model reading comprehension? Well, there's been a huge amount of research on SQuAD over the past year, and it so happens that one of the more successful models was Minjoon's bi-directional attention flow (or "BiDAF") model, which is what we used. Now, SQuAD and relation extraction are not quite the same. In particular, SQuAD assumes that the raw text we're given as input must contain the answer, but in relation extraction, most of the sentences in a given document are actually *not* going to fill the missing slot. To address this, we modified BiDAF to consider the "no answer" option in addition to any potential answer spans.

Challenges

So let's dive into the challenges first.

Instance Querification

educated_at(Turing, Princeton) →
"Where did Turing study?"
"Where did Turing graduate from?"
"Which university did Turing go to?"

Problem: scaling to millions of examples
(cf. Large-Scale Simple Question Answering with Memory Networks, Bordes et al., 2015)

And we'll start with the problem of translating relations into questions. Some QA datasets were built by taking instances from a knowledge base, like educated_at(Turing, Princeton), and then asking people to generate natural-language questions from them. Now, this works well, but it also scales linearly with your budget, because you're annotating at the instance level. And if you want to annotate millions of examples, it's going to cost you.

Schema Querification: The Challenge

educated_at(x, ?) →
"Where did x study?"
"Where did x graduate from?"
"Which university did x go to?"

Problem: not enough information

So one way to avoid this is by annotating at the schema level. Instead of asking a question about a specific entity, we use a placeholder x. Now, this is actually a hard problem, because the relation name alone doesn't give the annotator enough information. For example, can "educated at" also refer to the high school in which x studied? Or the country? Also, how do the annotators come up with the phrasing?

Schema Querification: Crowdsourcing Solution

Ask a single question about x whose answer is, for each sentence, the underlined spans:
  The wine is produced in the x region of France.
  x, the capital of Mexico, is the most populous city in North America.
  x is an unincorporated and organized territory of the United States.
  The x mountain range stretched across the United States and Canada.
→ "In which country is x located?"

So what we want to do is basically prime and constrain annotators with actual data. We give them four sentences with a placeholder x, and ask them to think of a single question about x whose answer is, for each sentence, the underlined spans. So in this example, a good question would be "In which country is x located?", because it fits well with all four sentences.

Dataset

Annotated 120 relations from WikiReading (Hewlett et al., ACL 2016)
Collected 10 templates per relation with high agreement
Generated over 30,000,000 reading comprehension examples
Generated negative examples by mixing questions about the same entity

So overall, we crowdsourced questions for 120 relations from the WikiReading dataset, collecting an average of 10 question templates per relation, with high agreement, which allowed us to generate over 30 million reading comprehension examples. We also generated negative examples by taking two examples about the same entity and mixing their questions, as sketched below.
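A sketch of that negative-example trick (the field names are our assumption about the data layout, not the released format): pair up examples about the same entity but different relations, and attach one example's question to the other's sentence, where it has no answer.

```python
from itertools import combinations

def mix_negatives(examples):
    """examples: dicts with keys 'entity', 'relation', 'question',
    'sentence' (our assumed layout). Returns unanswerable instances."""
    negatives = []
    for a, b in combinations(examples, 2):
        if a["entity"] == b["entity"] and a["relation"] != b["relation"]:
            # a's question cannot be answered from b's sentence,
            # so the pair becomes a "no answer" training example.
            negatives.append({"question": a["question"],
                              "sentence": b["sentence"],
                              "answer": None})
    return negatives
```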

Reading Comprehension Model: BiDAF

Pre-trained word embeddings
Character embeddings
Bi-directional LSTMs for contextualization
Special attention mechanism:
  Attends on both question and sentence
  Computed independently for each token in the sentence

For our reading comprehension model, we used an adaptation of BiDAF. BiDAF follows a lot of standard practices, like word embeddings, character embeddings, and bi-directional LSTMs, but it also has a special attention mechanism that computes a weighted average of both the question and the sentence, for each token.
(Bi-Directional Attention Flow for Machine Comprehension, Seo et al., ICLR 2017)
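As a rough illustration of that attention idea (heavily simplified: real BiDAF scores each sentence/question token pair with a learned function over both vectors and their elementwise product, and also attends in the question-to-sentence direction), here is a numpy sketch of computing a question-aware vector for each sentence token:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))  # 5 sentence token vectors, dim 8
U = rng.standard_normal((4, 8))  # 4 question token vectors, dim 8

S = H @ U.T              # similarity of every sentence/question token pair
A = softmax(S, axis=1)   # per sentence token, attention over the question
U_tilde = A @ U          # question-aware summary for each sentence token
print(U_tilde.shape)     # (5, 8)
```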

Reading Comprehension Model: BiDAF

Output layer:
Token:   Alan  Turing  graduated  from  Princeton
Begin:   0.1   0.3     0.1        0.1   0.4
End:     0.1   0.1     0.1        0.1   0.6

Now, the way BiDAF predicts the answer span is by computing two softmaxes over the sentence: one to mark the beginning of the answer, and another for the end.
(Bi-Directional Attention Flow for Machine Comprehension, Seo et al., ICLR 2017)

Reading Comprehension Model: BiDAF

Output layer:
Token:   Alan  Turing  graduated  from  Princeton
Begin:   0.1   0.3     0.1        0.1   0.4
End:     0.1   0.1     0.1        0.1   0.6

From each softmax, BiDAF basically takes the index with the highest confidence, as long as the beginning appears before the end.
(Bi-Directional Attention Flow for Machine Comprehension, Seo et al., ICLR 2017)
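One common way to decode a span from the two distributions (our sketch; the only constraint is begin ≤ end) is to pick the pair maximizing the product of begin and end probabilities, which on the slide's numbers recovers "Princeton":

```python
import numpy as np

def best_span(p_begin, p_end):
    """Return ((i, j), score) maximizing p_begin[i] * p_end[j], i <= j."""
    best, best_score = None, -1.0
    for i in range(len(p_begin)):
        for j in range(i, len(p_end)):
            score = p_begin[i] * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

# Probabilities from the slide.
tokens  = ["Alan", "Turing", "graduated", "from", "Princeton"]
p_begin = np.array([0.1, 0.3, 0.1, 0.1, 0.4])
p_end   = np.array([0.1, 0.1, 0.1, 0.1, 0.6])
(i, j), score = best_span(p_begin, p_end)
print(tokens[i:j + 1], round(score, 2))  # ['Princeton'] 0.24
```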

Predicting Negative Instances

Add a <null> token to the end of the sentence:
Token:   Alan  Turing  graduated  from  Princeton  <null>
Begin:   0.01  0.03    0.01       0.01  0.04       0.9
End:     0.01  0.01    0.01       0.01  0.06       0.9

Now, to handle negative examples, which don't have an answer, we add a <null> token at the end of the sentence.

Predicting Negative Instances

Token:   Alan  Turing  graduated  from  Princeton  <null>
Begin:   0.01  0.03    0.01       0.01  0.04       0.9
End:     0.01  0.01    0.01       0.01  0.06       0.9

If argmax = <null>, predict no answer.

If the <null> token is selected by the model, we predict "no answer".
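Continuing the sketch with the probabilities from the slide: the <null> position is appended after the last token, and if the argmax lands on it, the model abstains.

```python
import numpy as np

tokens  = ["Alan", "Turing", "graduated", "from", "Princeton", "<null>"]
p_begin = np.array([0.01, 0.03, 0.01, 0.01, 0.04, 0.90])
p_end   = np.array([0.01, 0.01, 0.01, 0.01, 0.06, 0.90])

NULL_IDX = len(tokens) - 1  # the appended <null> position
if p_begin.argmax() == NULL_IDX or p_end.argmax() == NULL_IDX:
    print("no answer")  # negative instance: the slot cannot be filled
else:
    print("answer span lies somewhere in", tokens[:-1])
```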

Experiments

So now that we've addressed the challenges, let's see how our approach works in practice.

Generalizing to Unseen Questions

Model is trained on several question templates per relation:
"Where did Alan Turing study?"
"Where did Claude Shannon graduate from?"
"Which university did Edsger Dijkstra go to?"
User asks about the relation using a different form:
"Which university awarded Noam Chomsky a PhD?"

So first, we want to check whether the model can generalize to questions that it hasn't seen before. This basically tests what happens when the model is trained using several question templates that allude to the same relation, like "Where did Turing study?" or "Where did Shannon graduate from?", and then at test time the user asks about the same relation, but uses a completely different template to phrase the question. For example, "Which university awarded Chomsky a PhD?"

Generalizing to Unseen Questions

Experiment: split the data by question templates
Performance on seen question templates: 86.6% F1
Performance on unseen question templates: 83.1% F1
Our method is robust to new descriptions of existing relations.

So we took our dataset and split it according to question templates, which allowed us to test how well our model performed on instances with seen templates versus new, unseen templates. As you can see, there is some difference, which is expected, but it's relatively small. What this basically means is that our model is robust to new descriptions of relations that it saw during training.
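The split itself is worth being precise about: whole templates are held out, so a test question's phrasing never appears in training even though its relation does. A minimal sketch (the 'template' field name and split ratio are our assumptions):

```python
import random

def split_by_template(examples, test_frac=0.1, seed=0):
    """Hold out whole question templates, so test-time phrasings
    are unseen in training (relations may still overlap)."""
    templates = sorted({ex["template"] for ex in examples})
    random.Random(seed).shuffle(templates)
    held_out = set(templates[:max(1, int(len(templates) * test_frac))])
    train = [ex for ex in examples if ex["template"] not in held_out]
    test  = [ex for ex in examples if ex["template"] in held_out]
    return train, test
```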

Generalizing to Unseen Relations

Model is trained on several relations:
"Where did Alan Turing study?" (educated_at)
"What is Ivanka Trump's job?" (occupation)
"Who is Justin Trudeau married to?" (spouse)
User asks about a new, unseen relation:
"In which country is Seattle located?" (country)

But what about relations that the model didn't see during training? How well can our model, or any other model for that matter, generalize to a completely new relation? In this scenario, we train our model on a set of questions that pertain to a certain set of relations, like educated_at, occupation, and spouse, and at test time ask it questions about a completely different relation, like country.

Generalizing to Unseen Relations

Experiment: split the data by relations

Results:
Random named-entity baseline: 12.2% F1
Off-the-shelf RE system: impossible
BiDAF w/ relation name as query: 33.4% F1
BiDAF w/ querified relation as query: 39.6% F1
BiDAF w/ querified relation + multiple questions at test: 41.1% F1

So this time, we split the data according to relations, and tested how well our model, as well as some others, performs on unseen relations. As a simple unsupervised baseline, we just picked one of the named entities at random, and that gave us about 12 points F1. We also tried an off-the-shelf relation extraction model, but, as expected, it didn't get anything correct. Now, it's not that there's anything wrong with that model; it just wasn't designed for the zero-shot scenario. The same is true for many other models. Mathematically, the only way you can try to solve this problem with a supervised approach is by featurizing the relation. One way to do that is to use the relation's name, the actual string, as the question, and that does much better than the random baseline. Now, when you use natural-language questions, which is basically what we're proposing in this work, you get even better results. You can improve those results a bit more if you allow the model to look at multiple questions at test time.

Why does a reading comprehension model enable zero-shot relation extraction?

It can learn answer types that are used across relations:
Q: When was the Snow Hawk released?
S: The Snow Hawk is a 1925 film…
It can detect paraphrases of relations:
Q: Who started the Furstenberg China Factory?
S: The Furstenberg China Factory was founded by Johann Georg…

So what is the reading comprehension model learning that allows it to generalize to new relations? From analyzing the results, we found two interesting properties. First, the model is able to learn answer types that are common to many relations. For example, "when" typically refers to a date, and "where" is often a country or a city. Second, it's able to detect paraphrases of relations, like "started" and "was founded by". We suspect that it's able to do this with the help of pre-trained word embeddings.

Conclusion

So, in conclusion,

Conclusion

Relation extraction can be reduced to reading comprehension
Provides a natural-language API for defining and querying relations
Enables zero-shot relation extraction
Challenging dataset: nlp.cs.washington.edu/zeroshot/

We showed that the task of relation extraction can be reduced to reading comprehension, providing a natural-language API for defining and querying relations, one that can even extract new relation types that were never observed during training. This task is far from solved, so we've made all our code and data publicly available, in the hope that we, as a community, can use this benchmark to advance research in reading comprehension. Thank you!