CoQA: A Conversational Question Answering Challenge


CoQA: A Conversational Question Answering Challenge Siva Reddy, Danqi Chen, Christopher D. Manning

Introduction Jessica went to sit in her rocking chair. Today was her birthday and she was turning 80. Her granddaughter Annie was coming over in the afternoon and Jessica was very excited to see her. Her daughter Melanie and Melanie’s husband Josh were coming as well. Jessica had … Q1: Who had a birthday? A1: Jessica R1: Jessica went to sit in her rocking chair. Today was her birthday and she was turning 80.

Introduction Q2: How old would she be? A2: 80 R2: she was turning 80 Q3: Did she plan to have any visitors? A3: Yes R3: Her granddaughter Annie was coming over

Introduction Q4: How many? A4: Three R4: Her granddaughter Annie was coming over in the afternoon and Jessica was very excited to see her. Her daughter Melanie and Melanie’s husband Josh were coming as well. Q5: Who? A5: Annie, Melanie and Josh R5: Her granddaughter Annie was coming over in the afternoon and Jessica was very excited to see her. Her daughter Melanie and Melanie’s husband Josh were coming as well.

Introduction The first goal of CoQA concerns the nature of questions in a human conversation. As is well known, state-of-the-art models rely heavily on lexical similarity between a question and a passage. At present, there are no large-scale reading comprehension datasets containing questions that depend on a conversation history.

Introduction The second goal of CoQA is to ensure the naturalness of answers in a conversation. Many existing QA datasets restrict answers to a contiguous span in a given passage, also known as extractive answers. We propose that the answers can be free-form text (abstractive answers).

Introduction The third goal of CoQA is to enable building QA systems that perform robustly across domains. We collect our dataset from seven different domains: children's stories, literature, middle and high school English exams, news, Wikipedia, science and Reddit.

Dataset Collection For each conversation, we employ two annotators, a questioner and an answerer. We use Amazon Mechanical Turk (AMT) to pair workers on a passage, for which we use the ParlAI MTurk API (Miller et al., 2017). On average, each passage costs 3.6 USD for conversation collection and another 4.5 USD for collecting three additional answers for development and test data. We want questioners to avoid using the exact words of the passage in order to increase lexical diversity, and we want answerers to stick to the vocabulary of the passage.

Dataset Collection We select passages with multiple entities, events and pronominal references using Stanford CoreNLP. We truncate long articles to their first few paragraphs, resulting in passages of around 200 words.
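As a rough illustration, here is a minimal Python sketch of the truncation step; the paragraph splitting and the 200-word target follow the description above, but the authors' exact heuristic may differ.

    def truncate_article(text, target_words=200):
        """Keep the leading paragraphs until roughly target_words is reached."""
        kept, count = [], 0
        for paragraph in text.split("\n\n"):
            kept.append(paragraph)
            count += len(paragraph.split())
            if count >= target_words:
                break  # stop once the running word count passes the target
        return "\n\n".join(kept)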

Dataset Analysis While coreferences are nonexistent in SQuAD, almost every sector of CoQA's question-prefix distribution contains coreferences (he, him, she, it, they), indicating that CoQA is highly conversational. While nearly half of SQuAD questions are dominated by what questions, the CoQA distribution is spread across multiple question types. Several sectors, indicated by the prefixes did, was, is, does and and, are frequent in CoQA but completely absent in SQuAD.

Dataset Analysis Our conjecture is that human factors such as wage may have influenced workers to ask questions that elicit faster responses by selecting text. It is worth noting that 11.1% of CoQA questions have yes and 8.7% have no as answers, whereas SQuAD has none. Both datasets have a high number of named entities and noun phrases as answers.

Dataset Analysis Paraphrasing covers phenomena such as synonymy, antonymy, hypernymy, hyponymy and negation. Pragmatics covers phenomena such as common sense and presupposition.

Conversational Model We generate the answer using an LSTM decoder that attends to the encoder states. Additionally, since answer words are likely to appear in the original passage, we employ a copy mechanism in the decoder that allows it to (optionally) copy a word from the passage. We call this the pointer-generator network (PGNet).

Conversational Model (See et al., 2017)
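A minimal sketch of the pointer-generator mixing step in the spirit of See et al. (2017), assuming PyTorch; the function name and tensor shapes are illustrative, not the authors' code.

    import torch

    def pg_distribution(p_vocab, attn, src_ids, p_gen):
        """Final output distribution: generate from the vocabulary with
        probability p_gen, otherwise copy a source word via attention.

        p_vocab: (batch, vocab_size) softmax over the output vocabulary
        attn:    (batch, src_len) attention weights over passage tokens
        src_ids: (batch, src_len) vocabulary ids of the passage tokens
        p_gen:   (batch, 1) generation probability from the decoder state
        """
        copy = torch.zeros_like(p_vocab).scatter_add(1, src_ids, (1.0 - p_gen) * attn)
        return p_gen * p_vocab + copy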

Reading Comprehension Model We use the Document Reader (DrQA) model of Chen et al. (2017), which has demonstrated strong performance on multiple datasets. Since DrQA requires text spans as answers during training, we select the span with the highest lexical overlap (F1 score) with the original answer as the gold answer. If the answer appears multiple times in the story, we use the rationale to find the correct occurrence. If any answer word does not appear in the story, we fall back to an additional unknown token as the answer (about 17% of examples). We prepend each question with its past questions and answers to account for conversation history.
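A minimal sketch of the gold-span selection just described, assuming whitespace-tokenized text; the span-length cap is an assumption for tractability, not a detail from the paper.

    from collections import Counter

    def f1(pred_tokens, gold_tokens):
        """Word-overlap F1 between two token lists."""
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    def best_span(story_tokens, answer_tokens, max_span_len=30):
        """Return the (start, end) span with the highest F1 against the answer."""
        best, best_f1 = (0, 0), 0.0
        for i in range(len(story_tokens)):
            for j in range(i + 1, min(i + 1 + max_span_len, len(story_tokens) + 1)):
                score = f1(story_tokens[i:j], answer_tokens)
                if score > best_f1:
                    best, best_f1 = (i, j), score
        return best, best_f1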

A Combined Model In this combined model, DrQA first points to the answer evidence in the text, and PGNet naturalizes the evidence into an answer. For example, we expect DrQA first to predict the rationale R5, and PGNet then to generate A5 from R5. For DrQA, we use rationales as answers only for questions with non-extractive answers. For PGNet, we provide only the current question and DrQA's span predictions as input to the encoder.
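A sketch of the two-stage composition; drqa and pgnet are hypothetical wrappers around the two trained models, and their method names are placeholders rather than the authors' API.

    def combined_answer(question, history, passage, drqa, pgnet):
        """DrQA extracts evidence; PGNet rewrites it into a natural answer."""
        # Prepend past questions and answers as conversation history
        # (the exact input format here is hypothetical).
        query = " ".join(history + [question])
        rationale = drqa.predict_span(query, passage)  # extractive evidence span
        return pgnet.generate(question, rationale)     # abstractive final answer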

Evaluation Following SQuAD, we use the macro-average F1 score of word overlap as our main evaluation metric. In the final evaluation, we use n = 4 human answers for every question (the original answer and three additionally collected answers). The articles a, an and the, as well as punctuation, are excluded in evaluation.
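A simplified sketch of this scoring rule; the official CoQA evaluation script includes additional normalization and bookkeeping not shown here.

    import re
    import string
    from collections import Counter

    def normalize(text):
        """Lowercase, drop the articles a/an/the and punctuation, then tokenize."""
        text = text.lower()
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        text = "".join(ch for ch in text if ch not in string.punctuation)
        return text.split()

    def word_f1(pred, gold):
        """Word-overlap F1 between two token lists."""
        overlap = sum((Counter(pred) & Counter(gold)).values())
        if overlap == 0:
            return 0.0
        p, r = overlap / len(pred), overlap / len(gold)
        return 2 * p * r / (p + r)

    def question_score(prediction, references):
        """Best F1 against the n human answers; the dataset-level metric
        macro-averages this value over all questions."""
        return max(word_f1(normalize(prediction), normalize(ref)) for ref in references)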

Evaluation For in-domain results, both the best model and humans find the literature domain harder than the others, since literature's vocabulary requires proficiency in English. For out-of-domain results, the Reddit domain is apparently harder; this could be because Reddit requires reasoning over longer passages. While humans achieve high performance on children's stories, models perform poorly, probably due to fewer training examples in this domain compared to others.

Evaluation While humans find the questions without coreferences easier than those with coreferences, the models behave sporadically. It is not clear why humans find implicit coreferences easier than explicit ones. One conjecture is that implicit coreferences depend directly on the previous turn, whereas explicit coreferences may have long-distance dependencies on the conversation.

Conclusions In this paper, we introduced CoQA, a large-scale dataset for building conversational question answering systems. Unlike existing reading comprehension datasets, CoQA contains conversational questions, natural answers along with extractive spans as rationales, and text passages from diverse domains. We hope this work will stir more research in conversational modeling, which is a key ingredient for enabling natural human-machine communication.