Ang Sun Director of Research, Principal Scientist, inome

Presentation transcript:


 The Slot Filling Challenge  Overview of the NYU 2011 System  Pattern Filler  Distant Learning Filler

 Hand annotation performance  Precision: 70%  Recall: 54%  F-measure: 61%  Top systems rarely exceed 30% F-measure

Query: Jim Parsons | eng-WL | PER | E | per:date_of_birth, per:age, per:city_of_birth
DOC: After graduating from high school, Jim Parsons received an undergraduate degree from the University of Houston. He was prolific during this time, appearing in 17 plays in 3 years.
Response: SF114 | per:schools_attended | University of Houston

 The entry barrier is pretty high  High-performance name extraction  High-performance coreference resolution  … …  Extraction at large scale  2011: 1.8 million documents  2012: 3.7 million documents  Example: "Jim Parsons was born and raised in Houston … … He attended Klein Oak High School in …"

 Documents have not gone through a careful selection process  Evaluation in a real-world scenario  Slot types are of different granularities  per:employee_of  org:top_members/employees  … …

[Results chart: recall, precision, and F-measure]

 Hand crafted patterns
pattern set | patterns | slots
local patterns for person queries | title of org, org title, org’s title, title | title, employee_of
local patterns for person queries | title in GPE, GPE title | origin, location_of_residence
local patterns for person queries | person, integer | age
local patterns for org queries | title of org, org title, org’s title | top_members/employees
local patterns for org queries | GPE’s org, GPE-based org, org of GPE, org in GPE | location_of_headquarters
local patterns for org queries | org’s org | subsidiaries / parents
implicit organization | title [where there is a unique org mentioned in the current + prior sentence] | employee_of [for person queries]; top_members/employees [for org queries]
functional noun | F of X, X’s F, where F is a functional noun | family relations; org parents and subsidiaries


 Hand crafted patterns 
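To make the pattern idea concrete, here is a minimal Python sketch (not the actual NYU matcher) of applying a surface pattern such as "TITLE of ORG" to a sentence whose entity mentions have already been tagged; the tag set, function name, and example sentence are invented for illustration.

```python
# A minimal, illustrative sketch: apply the surface pattern "<TITLE> of <ORG>"
# to a name-tagged sentence and emit title / employee_of fills for a person
# query.  The tag inventory and example data are assumptions.

def match_title_of_org(tagged_tokens, query_name):
    """tagged_tokens: list of (token, tag) pairs, e.g. ('chairman', 'TITLE')."""
    tokens = [tok for tok, _ in tagged_tokens]
    if query_name.split()[-1] not in tokens:   # crude check that the query is mentioned
        return []
    fills = []
    for i in range(len(tagged_tokens) - 2):
        (tok1, tag1), (tok2, _), (tok3, tag3) = tagged_tokens[i:i + 3]
        if tag1 == "TITLE" and tok2.lower() == "of" and tag3 == "ORG":
            fills.append(("per:title", tok1))
            fills.append(("per:employee_of", tok3))
    return fills

sentence = [("John", "PER"), ("Smith", "PER"), (",", "O"),
            ("chairman", "TITLE"), ("of", "O"), ("Acme", "ORG")]
print(match_title_of_org(sentence, "John Smith"))
# -> [('per:title', 'chairman'), ('per:employee_of', 'Acme')]
```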

 Learned patterns (through bootstrapping) Basic idea: bootstrapping starts from a few seed patterns, which are used to extract named entity (NE) pairs; those pairs in turn lead to more semantic patterns being learned from the corpus.
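A self-contained toy sketch of this bootstrapping cycle is shown below; the miniature corpus, the use of a raw context string as the "pattern", and the ranking heuristic are illustrative assumptions, not the NYU implementation.

```python
# Toy sketch of the bootstrapping cycle: the corpus is reduced to
# (name1, context, name2) triples and a "pattern" is simply the context.

def bootstrap(corpus, seed_patterns, iterations=2, patterns_per_round=2):
    patterns, pairs = set(seed_patterns), set()
    for _ in range(iterations):
        # 1. current patterns -> extracted name pairs
        pairs |= {(a, b) for a, ctx, b in corpus if ctx in patterns}
        # 2. known name pairs -> candidate patterns, ranked by support
        counts = {}
        for a, ctx, b in corpus:
            if (a, b) in pairs and ctx not in patterns:
                counts[ctx] = counts.get(ctx, 0) + 1
        ranked = sorted(counts, key=counts.get, reverse=True)
        # 3. accept the top candidates (manual review / clustering would prune here)
        patterns |= set(ranked[:patterns_per_round])
    return patterns, pairs

corpus = [
    ("John Smith", ", chairman of", "Acme Corp"),
    ("John Smith", ", CEO of", "Acme Corp"),
    ("Jane Doe", ", CEO of", "Widget Inc"),
    ("Jane Doe", ", director at", "Widget Inc"),
]
print(bootstrap(corpus, {", chairman of"}))
```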

 Learned patterns (through bootstrapping)  Seed pattern: “<PER>, chairman of <ORG>”

 Learned patterns (through bootstrapping)  The seed pattern “<PER>, chairman of <ORG>” extracts name pairs <PER, ORG>, … from the corpus

 Learned patterns (through bootstrapping)  The extracted name pairs <PER, ORG>, … yield new patterns: “<PER>, CEO of <ORG>”, “<PER>, director at <ORG>”, … …

 Learned patterns (through bootstrapping)  The new patterns “<PER>, CEO of <ORG>”, “<PER>, director at <ORG>”, … … in turn extract more name pairs, … …

 Learned patterns (through bootstrapping)  Problem: semantic drift  a pair of names may be connected by patterns belonging to multiple relations

 Learned patterns (through bootstrapping)  Problem: semantic drift  Solutions: ▪ Manually review top ranked patterns ▪ Guide bootstrapping with pattern clusters

President Clinton traveled to the Irish border for an evening ceremony. (Dependency parse tree shown as a figure.) Shortest path between the two names: nsubj'_traveled_prep_to
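Once a dependency parse is available, the shortest path can simply be read off the parse graph. Below is a sketch using networkx over hand-written head/label/dependent triples that approximate the example sentence; the triples and graph representation are assumptions, not parser output or the NYU code.

```python
import networkx as nx

# Hand-written dependency triples approximating the example sentence.
edges = [
    ("traveled", "nsubj", "Clinton"),
    ("Clinton", "nn", "President"),
    ("traveled", "prep_to", "border"),
    ("border", "det", "the"),
    ("border", "amod", "Irish"),
    ("traveled", "prep_for", "ceremony"),
    ("ceremony", "det", "an"),
    ("ceremony", "amod", "evening"),
]

g = nx.Graph()
for head, label, dep in edges:
    g.add_edge(head, dep, label=label)

path = nx.shortest_path(g, "Clinton", "border")
print(path)  # ['Clinton', 'traveled', 'border']

# The edge labels along the path give the pattern, roughly nsubj'_traveled_prep_to.
labels = [g[a][b]["label"] for a, b in zip(path, path[1:])]
print(labels)  # ['nsubj', 'prep_to']
```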

 Distant Learning (the general algorithm)  Map relations in knowledge bases to KBP slots  Search corpora for sentences that contain name pairs  Generate positive and negative training examples  Train classifiers using generated examples  Fill slots using trained classifiers

 Distant Learning  Map 4.1M Freebase relation instances to 28 slots  Given a pair of names <name_i, name_j> occurring together in a sentence in the KBP corpus, treat it as a ▪ positive example if <name_i, name_j> is a Freebase relation instance ▪ negative example if <name_i, name_j> is not a Freebase instance but <name_i, name_j'> is an instance for some j' ≠ j  Train classifiers using MaxEnt  Fill slots using trained classifiers, in parallel with other components of the NYU system
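The labeling rule can be sketched as follows; the tiny knowledge-base dictionary and the function name are hypothetical stand-ins for the Freebase lookup, not the NYU code.

```python
# Illustrative sketch of the distant-supervision labeling rule: a
# sentence-level name pair is positive for a slot if it is in the KB, and
# negative if the first name has that slot filled in the KB by a different
# second argument.

def label_pair(name1, name2, slot, kb):
    """kb: dict mapping (name1, slot) -> set of correct fillers."""
    fillers = kb.get((name1, slot), set())
    if name2 in fillers:
        return "positive"
    if fillers:              # name1 has this slot in the KB, but not with name2
        return "negative"
    return None              # no evidence either way; the pair is skipped

kb = {("Bill Gates", "per:employee_of"): {"Microsoft"}}
print(label_pair("Bill Gates", "Microsoft", "per:employee_of", kb))  # positive
print(label_pair("Bill Gates", "Harvard", "per:employee_of", kb))    # negative
```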

 Problems  Problem 1: Class labels are noisy ▪ Many false positives because name pairs are often connected by non-relational contexts

 Problems  Problem 1: Class labels are noisy ▪ Many false negatives because current knowledge bases are incomplete

 Problems  Problem 2: Class distribution is extremely unbalanced ▪ Treat <name_i, name_j> as negative if it is NOT a Freebase relation instance ▪ Positive vs. negative: 1:37 ▪ Treat <name_i, name_j> as negative if it is NOT a Freebase instance but <name_i, name_j'> is an instance for some j' ≠ j AND the two names are separated by no more than 12 tokens ▪ Positive vs. negative: 1:13 ▪ Trained classifiers will have low recall, biased towards negative

 Problems  Problem 3: Training ignores co-reference info ▪ Training relies on full name matches between Freebase and text ▪ But partial names (Bill, Mr. Gates …) occur often in text ▪ Use co-reference during training? ▪ The co-reference module itself might be inaccurate and add noise to training ▪ But can it help during testing?

 Solutions to Problems  Problem 1: Class labels are noisy ▪ Refine class labels to reduce noise  Problem 2: Class distribution is extremely unbalanced ▪ Undersample the majority classes  Problem 3: training ignores co-reference info ▪ Incorporate coreference during testing

 The refinement algorithm I. Represent a training instance by its dependency pattern, the shortest path connecting the two names in the dependency tree representation of the sentence II. Estimate the precision of the pattern: the precision of a pattern p for class C_i is the number of occurrences of p in class C_i divided by the number of occurrences of p across all classes C_j III. Assign the instance the class for which its dependency pattern is most precise
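A compact sketch of this refinement step follows; it is an approximation of the algorithm above, and the toy instances are invented for illustration.

```python
from collections import Counter, defaultdict

# Sketch of label refinement: estimate each dependency pattern's precision
# per class from the noisy distantly-labeled data, then relabel every
# instance with the class its pattern is most precise for.

def refine(instances):
    """instances: list of (dependency_pattern, noisy_class) pairs."""
    per_class = defaultdict(Counter)   # pattern -> class -> count
    total = Counter()                  # pattern -> count over all classes
    for pattern, label in instances:
        per_class[pattern][label] += 1
        total[pattern] += 1

    def precision(pattern, label):
        return per_class[pattern][label] / total[pattern]

    return [(p, max(per_class[p], key=lambda c: precision(p, c))) for p, _ in instances]

data = [("appos chairman prep_of", "PERSON:Employee_of"),
        ("appos chairman prep_of", "PERSON:Employee_of"),
        ("appos chairman prep_of", "ORG:Founded_by")]
print(refine(data))   # every instance is relabeled PERSON:Employee_of
```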

 The refinement algorithm (cont)  Examples (both instances share the dependency pattern appos chairman prep_of):
Example sentence | distant-supervision class | refined class
"Jon Corzine, the former chairman and CEO of Goldman Sachs" | PERSON:Employee_of | PERSON:Employee_of
"William S. Paley, chairman of CBS" | ORG:Founded_by | PERSON:Employee_of
Since prec(appos chairman prep_of, PERSON:Employee_of) > prec(appos chairman prep_of, ORG:Founded_by), both instances are assigned PERSON:Employee_of.

 Effort 1: multiple n-way instead of single n-way classification  single n-way: an n-way classifier for all classes ▪ Biased towards majority classes  multiple n-way: an n-way classifier for each pair of name types ▪ A classifier for PERSON and PERSON ▪ Another one for PERSON and ORGANIZATION ▪ … …  On average (10 runs on 2011 evaluation data) ▪ single n-way: 180 fills for 8 slots ▪ multiple n-way: 240 fills for 15 slots
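For illustration, here is a sketch of the multiple n-way setup with scikit-learn's LogisticRegression (a maximum-entropy classifier) standing in for the MaxEnt learner; the toy context features, labels, and training data are assumptions, not the NYU feature set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One classifier per pair of name types, each trained only on instances of
# that type pair (illustrative sketch; features here are just context strings).
train = [
    (("PERSON", "ORGANIZATION"), ", chairman of", "per:employee_of"),
    (("PERSON", "ORGANIZATION"), ", CEO of",      "per:employee_of"),
    (("PERSON", "ORGANIZATION"), "visited",       "OTHER"),
    (("PERSON", "PERSON"),       ", son of",      "per:parents"),
    (("PERSON", "PERSON"),       "met with",      "OTHER"),
]

classifiers = {}
for type_pair in {t for t, _, _ in train}:
    texts  = [ctx for t, ctx, _ in train if t == type_pair]
    labels = [lab for t, _, lab in train if t == type_pair]
    clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    classifiers[type_pair] = clf.fit(texts, labels)

print(classifiers[("PERSON", "ORGANIZATION")].predict([", director of"]))
```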

 Effort 2: undersample the OTHER class  Even with the multiple n-way classification approach  OTHER (not a defined KBP slot) is still the majority class for each such n-way classifier  Downsize OTHER by randomly selecting a subset of its examples
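A minimal sketch of such undersampling is given below; the ratio and data are illustrative parameters, not the values used in the system.

```python
import random

# Keep at most `ratio` OTHER (no-slot) examples per positive example so the
# negative class does not overwhelm the positives.

def undersample(examples, ratio=5, seed=0):
    """examples: list of (features, label) pairs."""
    positives = [e for e in examples if e[1] != "OTHER"]
    negatives = [e for e in examples if e[1] == "OTHER"]
    random.Random(seed).shuffle(negatives)
    return positives + negatives[: ratio * len(positives)]

data = [("f1", "per:employee_of")] + [(f"f{i}", "OTHER") for i in range(2, 50)]
print(len(undersample(data)))   # 1 positive + 5 OTHER examples
```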

 No use of co-reference during training  Run Jet (NYU IE toolkit) to get co-referred names of the query  Use these names when filling slots for the query  Co-reference is beneficial to our official system  P/R/F of the distant filler itself ▪ With co-reference: 36.4/11.4/17.4 ▪ Without co-reference: 28.8/10.0/14.3
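A sketch of how coreference output might be used at query time follows; the hard-coded chain stands in for Jet's output, and the filtering rule is an assumption for illustration.

```python
# The query name is expanded with its co-referring name mentions before slot
# fills are searched for (pronouns in the chain are dropped).

def query_aliases(query_name, coref_chains):
    """coref_chains: list of mention lists produced by a coreference system."""
    for chain in coref_chains:
        if query_name in chain:
            return {m for m in chain if m[0].isupper()}   # keep name mentions only
    return {query_name}

chains = [["Bill Gates", "Gates", "Mr. Gates", "he"]]     # illustrative chain
print(query_aliases("Bill Gates", chains))                # {'Bill Gates', 'Gates', 'Mr. Gates'}
```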

Undersampling Ratio: ratio between negatives and positives  Multiple n-way outperformed single n-way  Models with refinement:  higher performance  curves are much flatter  less sensitive to undersampling ratio  more robust to noise

 Models with refinement have better P, R  Multiple n-way outperforms single n-way mainly through improved recall

Thanks!

?

 Baseline: 2010 System (three basic components) 1) Document Retrieval  Use Lucene to retrieve a maximum of 300 documents  Query: the query name and some minor name variants 2) Answer Extraction  Begins with text analysis: POS tagging, chunking, name tagging, time expression tagging, and coreference  Coreference is used to fill alternate_names slots  Other slots are filled using patterns (hand-coded and created semi-automatically using bootstrapping) 3) Merging  Combines answers from different documents and passages, and from different answer extraction procedures
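As a rough illustration of the merging idea, here is a sketch that groups candidate fills by slot, collapses near-duplicate answers, and keeps the highest-confidence supporting answer; the confidence scores, normalization rule, and data are assumptions, not the baseline's actual merging logic.

```python
from collections import defaultdict

def merge(candidates):
    """candidates: list of (slot, answer_string, confidence, doc_id)."""
    best = {}
    for slot, answer, conf, doc in candidates:
        key = (slot, answer.strip().lower())              # crude normalization
        if key not in best or conf > best[key][1]:
            best[key] = (answer, conf, doc)
    merged = defaultdict(list)
    for (slot, _), (answer, conf, doc) in best.items():
        merged[slot].append((answer, conf, doc))
    return dict(merged)

cands = [("per:schools_attended", "University of Houston", 0.9, "DOC1"),
         ("per:schools_attended", "university of houston", 0.6, "DOC2"),
         ("per:city_of_birth",    "Houston",               0.8, "DOC3")]
print(merge(cands))
```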

 Passage Retrieval (QA)  For each slot, a set of index terms is generated using distant supervision (using Freebase)  Terms are used to retrieve and rank passages for a specific slot  An answer is then selected based on name type and distance from the query name  Due to time limitations, this procedure was implemented for only a few slots and was used as a fall-back strategy if the other answer extraction components did not find any slot fill.
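A miniature sketch of this fallback is shown below; the slot term list, passages, and overlap scoring are illustrative assumptions (the actual system derives the terms via distant supervision).

```python
# Passages are scored by overlap with a slot-specific set of index terms,
# and the best-scoring passage is returned for answer extraction.

slot_terms = {"per:schools_attended": {"graduated", "degree", "attended", "university"}}

def rank_passages(passages, slot):
    terms = slot_terms[slot]
    return sorted(passages,
                  key=lambda p: len(terms & set(p.lower().split())),
                  reverse=True)

passages = [
    "Jim Parsons received an undergraduate degree from the University of Houston.",
    "He was prolific during this time, appearing in 17 plays in 3 years.",
]
print(rank_passages(passages, "per:schools_attended")[0])
```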

 Result Analysis (NYU2 R/P/F 25.5/35.0/29.5)