Overview of the KBP 2013 Slot Filler Validation Track
Hoa Trang Dang, National Institute of Standards and Technology

Slot Filler Validation (SFV) Track

Goals:
▫ Allow teams without a full slot-filling system to participate, focusing on answer validation rather than document retrieval
▫ Evaluate the contribution of RTE systems to KBP slot filling
▫ Allow teams to experiment with system voting and other global approaches

SFV input:
▫ Candidate slot fillers
▫ Possibly additional information about candidate slot fillers

SFV output:
▫ Binary classification (Correct / Incorrect) of each candidate slot filler

SFV can only improve precision, not recall, of full slot-filling systems. The appropriate evaluation metric depends on the SFV use case and on the availability of additional information about candidate fillers.

Predecessor tasks:
▫ TAC RTE KBP Validation task (2011)
▫ TAC KBP Slot Filler Validation task (2012)
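The point that validation can only improve precision, never recall, can be seen in a small sketch (plain Python, not the official KBP scorer; the example data and SFV verdict are made up):

```python
# Minimal sketch of why an SFV filter can raise precision but never recall:
# a validator can only drop returned fillers, so true positives never increase.

def precision_recall(returned, gold):
    """returned/gold: sets of (query, slot, filler) tuples."""
    tp = len(returned & gold)
    p = tp / len(returned) if returned else 0.0
    r = tp / len(gold) if gold else 0.0
    return p, r

gold = {("q1", "per:city_of_birth", "Chicago"),
        ("q2", "org:founded_by", "Jobs")}
run = {("q1", "per:city_of_birth", "Chicago"),
       ("q1", "per:city_of_birth", "Boston"),  # spurious filler
       ("q2", "org:founded_by", "Jobs")}

# Hypothetical SFV verdict: reject the spurious filler.
filtered = run - {("q1", "per:city_of_birth", "Boston")}

p0, r0 = precision_recall(run, gold)       # p0 = 2/3, r0 = 1.0
p1, r1 = precision_recall(filtered, gold)  # p1 = 1.0, r1 = 1.0 (recall can only stay or drop)
```

If the validator had instead rejected a correct filler, precision could still rise or fall but recall would strictly decrease, which is why filtering cannot add to recall.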

TAC RTE KBP Validation task (2011)

Each slot filler returned by SF systems → 1 RTE evaluation pair, where:
▫ T is the entire document supporting the slot filler
▫ H is a set of synonymous sentences, representing different realizations of the slot filler
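As an illustration, converting one candidate slot filler into such a T/H pair might look like the sketch below; the template table, slot name, and field names are assumptions for illustration, not the official task format:

```python
# Hypothetical per-slot templates for generating the synonymous H sentences.
TEMPLATES = {
    "per:cities_of_residence": ["{e} has lived in {f}.",
                                "{e} resided in {f}."],
}

def make_rte_pair(entity, slot, filler, doc_text):
    """Build one RTE pair: T is the whole supporting document,
    H is the set of synonymous realizations of the slot filler."""
    return {"T": doc_text,
            "H": [tpl.format(e=entity, f=filler) for tpl in TEMPLATES[slot]]}

pair = make_rte_pair("Barack Obama", "per:cities_of_residence", "Chicago",
                     "Obama moved to Chicago in 1985 and ...")
# pair["H"][0] == "Barack Obama has lived in Chicago."
```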

Use Case 1: SFV as Textual Entailment (2011)

SFV input:
▫ All regular English slot filling input (slot definitions, queries, source documents)
▫ Individual candidate slot fillers (filler, provenance)

Local approach:
▫ Generic textual entailment: H is the relation implied by the candidate slot filler (e.g., "Barack Obama has lived in Chicago"); T is the provenance (entire document, or smaller regions defined by justification offsets)
▫ Tailored textual entailment: train on different slot types; could serve as a validation module for a full slot-filling system

Evaluation:
▫ F score on the entire pool of candidate slot fillers (unique filler, provenance)
▫ Baseline: all T's classified as entailing the corresponding H: P = R = percentage of entailing pairs in the pooled SF responses
▫ Weak baseline, easily beaten by all SFV systems; not a direct measure of the utility of SFV to SF

Use Case 2: SFV impact on single SF systems

SFV input:
▫ All regular English slot filling input (slot definitions, queries, source documents)
▫ Individual candidate slot fillers (filler, provenance, confidence), broken out into individual slot filling runs

Global approach:
▫ System voting, leveraging features across multiple SF runs

Evaluation:
▫ Filter out "Incorrect" slot fillers from each run, and score according to regular English SF; compare to the score for the original run
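A minimal sketch of the system-voting idea, assuming each run is reduced to a set of (query, slot, filler) tuples and using a simple vote threshold (the data layout and threshold are illustrative assumptions, not any team's actual method):

```python
# Weak global validation by voting: accept a candidate filler when at least
# min_votes independent SF runs returned it.
from collections import Counter

def vote(runs, min_votes=2):
    """runs: list of sets of (query, slot, filler); returns accepted fillers."""
    counts = Counter(cand for run in runs for cand in run)
    return {cand for cand, n in counts.items() if n >= min_votes}

run_a = {("q1", "per:spouse", "Ann"), ("q1", "per:spouse", "Bea")}
run_b = {("q1", "per:spouse", "Ann")}
run_c = {("q1", "per:spouse", "Ann"), ("q2", "org:website", "x.com")}

accepted = vote([run_a, run_b, run_c])
# Only ("q1", "per:spouse", "Ann") clears the 2-vote threshold.
```

In practice, stronger global systems weight votes by per-run features (such as the system profile) rather than counting each run equally.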

Slot Filler Validation (SFV) 2012

SFV input:
▫ All regular English slot filling input (slot definitions, queries, source documents)
▫ Individual candidate slot fillers (filler, provenance, confidence), broken out into individual slot filling runs
▫ System profile for each SF run
▫ Preliminary assessment of 10% of KBP 2012 Slot Filling queries

SFV output:
▫ Binary classification (Correct / Incorrect) of each candidate slot filler

Evaluation:
▫ Filter out "Incorrect" slot fillers from each run, and score according to regular English SF; compare to the score for the original run

Result: the one SFV submission decreased the F1 of almost all SF runs, except the poorest-performing ones.

Slot Filler Validation (SFV) 2013

SFV input:
▫ All regular English slot filling input (slot definitions, queries, source documents)
▫ Individual candidate slot fillers (filler, provenance, confidence), broken out into individual slot filling runs
▫ System profile for each SF run
▫ Preliminary assessment of 10% of KBP 2013 Slot Filling queries

SFV output:
▫ Binary classification (Correct / Incorrect) of each candidate slot filler

Evaluation:
▫ Filter out "Incorrect" slot fillers from each run, and score according to regular English SF; compare to the score for the original run
▫ Score only on the 90% of KBP 2013 slot filling queries that didn't have preliminary assessments released as part of the SFV input

SF System Profile

▫ SF team's ranks in KBP
▫ Did the system extract fillers from the KBP 2013 source corpus?
▫ Do the confidence values have meaning? Is the confidence value a probability?
▫ Tools or methods for:
  - Query expansion
  - Document retrieval
  - Sentence retrieval
  - NER / nominal tagging
  - Coreference resolution
  - Third-party relation/event extraction
  - Dependency/constituent parsing
  - POS tagging
  - Chunking
  - Main slot filling algorithm
  - Learning algorithm
  - Ensemble model
  - External resources
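One possible machine-readable representation of such a per-run profile, sketched as a Python dataclass; all field names and the example values are assumptions, not the official profile schema:

```python
# Hypothetical structured form of an SF system profile.
from dataclasses import dataclass, field

@dataclass
class SFSystemProfile:
    team_rank: int                    # SF team's rank in KBP
    uses_2013_corpus: bool            # fillers extracted from the KBP 2013 corpus?
    confidence_meaningful: bool       # do confidence values have meaning?
    confidence_is_probability: bool   # is the confidence value a probability?
    tools: dict = field(default_factory=dict)  # e.g. {"coreference": "..."}

profile = SFSystemProfile(team_rank=3,
                          uses_2013_corpus=True,
                          confidence_meaningful=True,
                          confidence_is_probability=False,
                          tools={"pos_tagging": "third-party tagger"})
```

A global SFV system can condition its filtering on such fields, for example trusting raw confidence values only from runs where they are flagged as meaningful.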

Slot Filler Validation Teams and Approaches

BIT: Beijing Institute of Technology [local]
▫ Generic RTE approach based on word overlap, cosine similarity, and token edit distance

Stanford: Stanford University [local]
▫ Based on Stanford's full slot-filling system, especially the component for checking consistency and validity of candidate fillers

UI_CCG: University of Illinois at Urbana-Champaign [local]
▫ Tailored RTE approach; checks candidates against slot-specific constraints

jhuapl: Johns Hopkins University Applied Physics Laboratory [weak global]
▫ Considers only the confidence value associated with each candidate filler, aggregating confidence values across systems

RPI_BLENDER: Rensselaer Polytechnic Institute [strong global]
▫ Based on the RPI_BLENDER full slot-filling system (like Stanford), but also leveraged the full set of SFV input (including SF system profiles and preliminary assessments) to rank systems and apply tier-specific filtering
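The weak-global idea of aggregating confidence across systems can be sketched as follows; the summation scheme and the threshold are illustrative assumptions, not jhuapl's actual algorithm:

```python
# Weak global validation from confidence alone: accept a candidate when the
# summed confidence over all runs that returned it clears a threshold.

def aggregate_confidence(candidates, threshold=1.0):
    """candidates: list of (filler_key, confidence) pairs pooled over runs."""
    totals = {}
    for key, conf in candidates:
        totals[key] = totals.get(key, 0.0) + conf
    return {key for key, total in totals.items() if total >= threshold}

pool = [("q1|per:spouse|Ann", 0.9),   # returned by two runs
        ("q1|per:spouse|Ann", 0.4),
        ("q1|per:spouse|Bea", 0.3)]   # returned by one low-confidence run

accepted = aggregate_confidence(pool)
# Ann totals 1.3 >= 1.0 and is accepted; Bea totals 0.3 and is rejected.
```

Summing raw confidences only makes sense when the values are comparable across runs, which is exactly what the system-profile questions about confidence are meant to establish.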

Impact of RPI_BLENDER2 SFV on SF Runs

[Table: F1 of each original SF run vs. F1 after applying the SFV filter, shown for the top 10 SF runs and for the negatively impacted SF runs. Runs listed include lsv, ARPANI, RPI_BLENDER, PRIS, NYU, UWashington, SAFT_KRes, CMUML, and TALP_UPC; the numeric F1 scores were lost in transcription.]

Conclusion

▫ Leveraging global features boosts scores of individual SF runs... if done discriminately
  - Don't treat all slot filling systems the same
▫ Even weak global features (e.g., raw confidence values) may help in some cases
▫ Caveat: other evaluation metrics are also valid depending on the use case
  - The RTE KBP validation (2011) metric may be appropriate if the goal is to make assessment more efficient