Ling573 NLP Systems and Applications
May 30, 2013
Beyond TREC-QA II
Reminder
New scoring script: compute_mrr_rev.py
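For reference, a minimal sketch of how mean reciprocal rank is typically computed; the actual compute_mrr_rev.py may differ in input format and details.

```python
def mean_reciprocal_rank(ranked_answers, gold_sets):
    """ranked_answers: one ranked candidate list per question.
    gold_sets: one set of acceptable answers per question."""
    total = 0.0
    for candidates, gold in zip(ranked_answers, gold_sets):
        for rank, cand in enumerate(candidates, start=1):
            if cand in gold:
                total += 1.0 / rank   # reciprocal rank of first correct answer
                break
    return total / len(gold_sets)
```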
Roadmap
Beyond TREC Question Answering:
  Distant supervision for web-scale relation extraction
  Machine reading: question generation
New Strategy
Distant supervision: supervision (examples) via a large semantic database
Key intuition: if a sentence contains two entities from a Freebase relation, it should express that relation
Secondary intuition: many witness sentences expressing a relation can jointly contribute to features in the relation classifier
Advantages: avoids overfitting, uses named relations
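A minimal sketch of the distant-supervision labeling step, assuming a toy freebase_relations lookup and a find_entities helper (both hypothetical stand-ins): any sentence mentioning an entity pair known to the database becomes a positive example for that relation.

```python
# Hypothetical toy knowledge base: (entity1, entity2) -> relation name
freebase_relations = {
    ("Edwin Hubble", "Marshfield"): "person-birthplace",
    ("John M. Stahl", "Back Street"): "film-director",
}

def label_sentences(sentences, find_entities):
    """Distant supervision: pair each sentence mentioning two related
    entities with the relation they hold in the knowledge base."""
    examples = []
    for sent in sentences:
        entities = find_entities(sent)      # hypothetical NER step
        for e1 in entities:
            for e2 in entities:
                if e1 == e2:
                    continue
                rel = freebase_relations.get((e1, e2))
                if rel is not None:
                    examples.append((sent, e1, e2, rel))
    return examples
```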
Feature Extraction
Lexical features: conjuncts of
  Sequence of words between entities
  POS tags of the sequence between entities
  Flag for entity order
  k words + POS before the 1st entity
  k words + POS after the 2nd entity
Example: Astronomer Edwin Hubble was born in Marshfield, MO
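A rough sketch of building the conjoined lexical feature for an entity pair, assuming pre-tokenized, POS-tagged input and that entity 1 precedes entity 2; the helper name and window size k=2 are illustrative, not the paper's exact setup.

```python
def lexical_feature(tokens, tags, e1_span, e2_span, k=2):
    """tokens/tags: parallel lists; e1_span, e2_span: (start, end) index
    pairs, with e1 assumed to precede e2. Returns one conjoined feature."""
    s1, t1 = e1_span
    s2, t2 = e2_span
    between = list(zip(tokens[t1:s2], tags[t1:s2]))   # words + POS between entities
    before = list(zip(tokens[max(0, s1 - k):s1], tags[max(0, s1 - k):s1]))
    after = list(zip(tokens[t2:t2 + k], tags[t2:t2 + k]))
    order = "E1_BEFORE_E2" if s1 < s2 else "E2_BEFORE_E1"
    # The classifier sees the conjunction of all pieces as a single feature
    return "|".join([
        order,
        " ".join(w for w, _ in between),
        " ".join(t for _, t in between),
        " ".join(f"{w}/{t}" for w, t in before),
        " ".join(f"{w}/{t}" for w, t in after),
    ])
```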
Feature Extraction II
Syntactic features: conjuncts of
  Dependency path between entities, parsed by Minipar
  Chunks, dependencies, and directions
  Window node not on the dependency path
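A rough sketch of extracting the labeled dependency path between two entity tokens, using a simple head-index tree representation in place of Minipar output (illustrative only).

```python
def dependency_path(heads, labels, e1_idx, e2_idx):
    """heads[i]: index of token i's head (-1 for root); labels[i]: the
    dependency relation of token i to its head. Returns the labeled path
    from the entity-1 token up to their lowest common ancestor and down
    to the entity-2 token."""
    def path_to_root(i):
        path = []
        while i != -1:
            path.append(i)
            i = heads[i]
        return path

    p1, p2 = path_to_root(e1_idx), path_to_root(e2_idx)
    shared = set(p1) & set(p2)
    lca = next(i for i in p1 if i in shared)   # lowest common ancestor
    up = [labels[i] + "_up" for i in p1[:p1.index(lca)]]
    down = [labels[i] + "_down" for i in reversed(p2[:p2.index(lca)])]
    return up + down
```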
High Weight Features
Features are highly specific: a problem?
  Not really: they are attested in a large text corpus
Evaluation Paradigm
Train on a subset of the data, test on a held-out portion
Train on all relations, using part of the corpus
  Test on new relations extracted from Wikipedia text
How to evaluate newly extracted relations?
  Send to human assessors
  Issue: 100s or 1000s of each type of relation
  Crowdsource: send to Amazon Mechanical Turk
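A minimal sketch of the held-out evaluation idea: rank extracted triples by classifier confidence and measure precision at k against Freebase facts withheld from training (names are illustrative).

```python
def heldout_precision_at_k(ranked_extractions, heldout_facts, k):
    """ranked_extractions: (entity1, relation, entity2) triples sorted by
    classifier confidence; heldout_facts: set of true triples withheld
    from training. Returns precision among the top k extractions."""
    top_k = ranked_extractions[:k]
    correct = sum(1 for triple in top_k if triple in heldout_facts)
    return correct / k
```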
Results
Overall, on the held-out set:
  Best precision combines lexical and syntactic features
  Significant skew in identified relations
    @100,000: 60% location-contains, 13% person-birthplace
  Syntactic features help in ambiguous, long-distance cases
    E.g., Back Street is a 1932 film made by Universal Pictures, directed by John M. Stahl, …
Human-Scored Results
@ Recall 100: combined lexical + syntactic best
@ 1000: mixed
Distant Supervision
Uses a large database as the source of true relations
Exploits co-occurring entities in a large text collection
  Scale of corpus and richer syntactic features overcome limitations of earlier bootstrap approaches
Yields reasonably good precision
  Drops somewhat with recall
  Skewed coverage of categories
Roadmap
Beyond TREC Question Answering:
  Distant supervision for web-scale relation extraction
  Machine reading: question generation
Question Generation
Mind the Gap: Learning to Choose Gaps for Question Generation, Becker, Basu, Vanderwende '12
The other side of question answering
Related to "machine reading": generate questions based on arbitrary text
Motivation
Why generate questions? (Aside from playing Jeopardy!, of course)
Educational (self-)assessment
  Testing aids retention of studied concepts
  Reading and review are relatively passive; active retrieval and recall of concepts are more effective
Shift to less-structured learning settings
  Online study, reading, MOOCs
  Assessment is difficult, so automatically generate it
Generating Questions
Prior work: shared task on question generation
  Focused on grammatical question generation and creating distractors for multiple-choice questions
Here: generating good questions
  Given a text, decompose into two steps:
    Which sentences form the basis of good questions?
    Which aspects of these sentences should we focus on?
Overview
Goal: generate good gap-fill questions
  Allows focus on content vs. form
  Later extend to wh-questions or multiple choice
Approach:
  Summarization-based sentence selection
  Machine-learning-based gap selection
  Training data created from Wikipedia data
Preview: 83% true positives vs. 19% false positives
Sentence Selection
Intuition: when studying a new subject, focus on the main concepts; nitty-gritty details come later
Insight: similar to an existing NLP task
  Summarization: pick out key content first
Exploit a simple summarization approach based on SumBasic
  Best sentences: most representative of the article
  ~ Contain the most frequent (non-stop) words in the article
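A minimal sketch of SumBasic-style sentence selection as described above: score sentences by the average probability of their non-stop words and down-weight words already covered; tokenization and stopword handling are simplified.

```python
from collections import Counter

def sumbasic_select(sentences, stopwords, n=10):
    """Pick the n sentences most representative of the article."""
    tokenized = [[w.lower() for w in s.split() if w.lower() not in stopwords]
                 for s in sentences]
    counts = Counter(w for toks in tokenized for w in toks)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}

    selected, remaining = [], list(range(len(sentences)))
    while remaining and len(selected) < n:
        # Score each sentence by the average probability of its words
        best = max(remaining,
                   key=lambda i: sum(prob[w] for w in tokenized[i])
                   / max(len(tokenized[i]), 1))
        selected.append(sentences[best])
        remaining.remove(best)
        # SumBasic update: square probabilities of words already covered
        for w in tokenized[best]:
            prob[w] = prob[w] ** 2
    return selected
```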
Gap Selection
How can we pick gaps?
  Heuristics? Possible, but unsatisfying
  Empirical approach using machine learning
Methodology: generate and filter
  Generate many possible questions per sentence
  Train a discriminative classifier to filter them
  Apply to new Q/A pairs
Generate
What's a good gap? Any word position?
  No: stopwords are only good for a grammar test
  No: bad for training, too many negative examples
Who did what to whom…
  Items filling the main semantic roles: verb predicate, child NP, AP
  Based on a syntactic parser and semantic role labeling
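A rough sketch of candidate-gap generation from semantic role spans, assuming an srl_parse helper that returns (span, role) pairs (hypothetical; the actual system uses a full syntactic parse plus semantic role labeling).

```python
def generate_gap_candidates(sentence, srl_parse):
    """Blank out each main semantic-role span to form a gap-fill question."""
    candidates = []
    for span_text, role in srl_parse(sentence):    # e.g. ("Edwin Hubble", "ARG0")
        if role in {"V", "ARG0", "ARG1", "ARG2"}:  # predicate and core roles
            question = sentence.replace(span_text, "_____", 1)
            candidates.append({"question": question,
                               "answer": span_text,
                               "role": role})
    return candidates
```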
Processing Example
Classification
Identify positive/negative examples
  Create a single score; > threshold positive, otherwise negative
Extract a feature vector
Train a logistic regression learner
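A minimal sketch of the filtering classifier, using scikit-learn's LogisticRegression as a stand-in for the learner; extract_features is a placeholder for the feature extraction described on the Features slides below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_gap_filter(candidates, ratings, extract_features, threshold=0.5):
    """candidates: (sentence, gap) pairs; ratings: crowdsourced scores in
    [0, 1]. Scores above the threshold become positive labels."""
    X = np.array([extract_features(c) for c in candidates])
    y = np.array([1 if r > threshold else 0 for r in ratings])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    return clf

def filter_candidates(clf, candidates, extract_features, cutoff=0.5):
    X = np.array([extract_features(c) for c in candidates])
    probs = clf.predict_proba(X)[:, 1]      # P(good question)
    return [c for c, p in zip(candidates, probs) if p >= cutoff]
```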
Data
Documents: 105 Wikipedia articles listed as vital/popular, across a range of domains
Sentences: 10 selected by the summarization measure, 10 random (why? to avoid tuning)
Labels: human judgments via crowdsourcing
  Rating questions (Good/Okay/Bad) alone? Not reliable
  Rate in context of the original sentence and other alternatives
HIT Set-up
Good: key concepts, fair to answer
Okay: key concepts, but odd/ambiguous/long to answer
Bad: unimportant, uninteresting
Scoring: Good = 1; otherwise 0
Another QA: Maintaining Quality in Crowdsourcing
Basic crowdsourcing issue: malicious or sloppy workers
Validation:
  Compare across annotators: if a worker's average is > 2 std dev from the median, reject
  Compare questions: exclude those with high variance (> 0.3)
Yield: 1821 questions, 700 "Good"; alpha = 0.51
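A minimal sketch of the two validation filters just described, worker-level and question-level, with simplified data structures; the thresholds follow the slide, the rest is illustrative.

```python
import statistics

def validate_annotations(ratings_by_worker, ratings_by_question):
    """ratings_by_worker: {worker_id: [scores]}; ratings_by_question:
    {question_id: [scores]}. Returns accepted workers and questions."""
    means = {w: statistics.mean(s) for w, s in ratings_by_worker.items()}
    median = statistics.median(means.values())
    stdev = statistics.stdev(means.values())
    # Reject workers whose average rating is > 2 std devs from the median
    good_workers = {w for w, m in means.items()
                    if abs(m - median) <= 2 * stdev}
    # Exclude questions whose ratings vary too much across annotators
    good_questions = {q for q, s in ratings_by_question.items()
                      if len(s) > 1 and statistics.variance(s) <= 0.3}
    return good_workers, good_questions
```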
Features
Diverse feature set characterizing the answer, the source sentence, and the relation between them:
  Count: 5
  Lexical: 11
  Syntactic: 112
  Semantic role: 40
  Named entity: 11
  Link: 3
Features
Token count features: lengths, overlaps
  Questions shouldn't be too short; answer gaps shouldn't be too long
Lexical features:
  Specific words? Too specific
  Word class densities: good and bad
  Capitalized words in the answer; pronouns, stopwords in the answer
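A rough sketch of a few count and lexical features of the kind named above, as one might encode them; the exact feature set is the paper's, this encoding is illustrative.

```python
def count_and_lexical_features(question_tokens, answer_tokens, stopwords):
    """A handful of count/lexical features for a (question, answer-gap) pair."""
    ans_len = len(answer_tokens)
    return {
        "question_length": len(question_tokens),
        "answer_length": ans_len,
        "answer_to_question_ratio": ans_len / max(len(question_tokens), 1),
        "capitalized_in_answer": sum(t[:1].isupper() for t in answer_tokens),
        "stopword_density_in_answer": sum(t.lower() in stopwords
                                          for t in answer_tokens) / max(ans_len, 1),
        "pronoun_in_answer": any(t.lower() in {"he", "she", "it", "they"}
                                 for t in answer_tokens),
    }
```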
Features
Syntactic features:
  Syntactic structure: e.g., answer depth, relation to the verb
  POS: composition and context of the answer
Semantic role label features:
  SRL spans and their relations to the answer
  Primarily subject and object roles
  Verb predicate: strong association, but BAD
Named entity features:
  NE density, type frequency in the answer; frequency in the sentence
  NEs likely in good questions, but across types the majority are without them
Link features: cue to the importance of linked spans
  Density and ratio of links
Results
Equal error rate: 83% true positives, 19% false positives
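A minimal sketch of sweeping the classifier threshold to trade off true and false positives (and so choose an operating point such as equal error rate); assumes predicted probabilities and gold labels are available.

```python
import numpy as np

def sweep_thresholds(probs, labels, steps=101):
    """Return (threshold, true_positive_rate, false_positive_rate) triples
    so an operating point balancing the two error types can be chosen."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    points = []
    for t in np.linspace(0.0, 1.0, steps):
        predicted = probs >= t
        tpr = np.sum(predicted & (labels == 1)) / max(np.sum(labels == 1), 1)
        fpr = np.sum(predicted & (labels == 0)) / max(np.sum(labels == 0), 1)
        points.append((float(t), float(tpr), float(fpr)))
    return points
```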
Observations
Performs well; can tune to balance errors
Is it Wikipedia-centric?
  No: little change without Wikipedia features
How much data does it need?
  Learning curve levels out around 1200 samples
Which features matter?
  All types; feature weights are distributed across types
False Positives
False Negatives
Error Analysis
False positives: need some notion of predictability, repetition
False negatives:
Looking Back
TREC QA vs. Jeopardy!, web-scale relation extraction, question generation
Obvious differences:
  Scale, web vs. fixed documents, question-answer direction
  Task-specific constraints: betting, timing, evidence, …
Core similarities:
  Similar wide range of features applied
  Deep + shallow approaches; learning & rules
  Relations between question and answer (pattern/relation)