Tri-lingual EDL for 2017 and Beyond

Slides:



Advertisements
Similar presentations
The Chinese Room: Understanding and Correcting Machine Translation This work has been supported by NSF Grants IIS Solution: The Chinese Room Conclusions.
Advertisements

Overview of the TAC2013 Knowledge Base Population Evaluation: Temporal Slot Filling Mihai Surdeanu with a lot help from: Hoa Dang, Joe Ellis, Heng Ji,
Linking Entities in #Microposts ROMIL BANSAL, SANDEEP PANEM, PRIYA RADHAKRISHNAN, MANISH GUPTA, VASUDEVA VARMA INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY,
Text Analysis Conference Knowledge Base Population 2013 Hoa Trang Dang National Institute of Standards and Technology Sponsored by:
Overview of the TAC2013 Knowledge Base Population Evaluation: English Slot Filling Mihai Surdeanu with a lot help from: Hoa Dang, Joe Ellis, Heng Ji, and.
Tri-lingual EDL Planning Heng Ji (RPI) Hoa Trang Dang (NIST) WORRY, BE HAPPY!
Linguistic Resources for the 2013 TAC KBP Slot Filling Evaluations Joe Ellis (presenter), Jeremy Getman, Jonathan Wright, Stephanie Strassel Linguistic.
Search Engines and Information Retrieval
Supervised by Prof. LYU, Rung Tsong Michael Department of Computer Science & Engineering The Chinese University of Hong Kong Prepared by: Chan Pik Wah,
Enhance legal retrieval applications with an automatically induced knowledge base Ka Kan Lo.
What’s New in Search? How destinations can leverage new search trends.
Search Engines and Information Retrieval Chapter 1.
Processing of large document collections Part 10 (Information extraction: multilingual IE, IE from web, IE from semi-structured data) Helena Ahonen-Myka.
1 The BT Digital Library A case study in intelligent content management Paul Warren
University of Sheffield, NLP Entity Linking Kalina Bontcheva © The University of Sheffield, This work is licensed under the Creative Commons.
Tables to Linked Data Zareen Syed, Tim Finin, Varish Mulwad and Anupam Joshi University of Maryland, Baltimore County
Semantic Search: different meanings. Semantic search: different meanings Definition 1: Semantic search as the problem of searching documents beyond the.
Overview of the KBP 2012 Slot-Filling Tasks Hoa Trang Dang (National Institute of Standards and Technology Javier Artiles (Rakuten Institute of Technology)
1 Usability and accessibility of educational web sites Nigel Bevan University of York UK eTEN Tenuta support action.
CROSSMARC Web Pages Collection: Crawling and Spidering Components Vangelis Karkaletsis Institute of Informatics & Telecommunications NCSR “Demokritos”
Curriculum Project for Information Extraction. Task definitions Task 1: Entity detection and recognition Task 2: Relation detection and recognition Both.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Big Data Javad Azimi May First of All… Sorry about the language  Feel free to ask any question Please share similar experiences.
Institute of Informatics & Telecommunications NCSR “Demokritos” Spidering Tool, Corpus collection Vangelis Karkaletsis, Kostas Stamatakis, Dimitra Farmakiotou.
© NCSR, Frascati, July 18-19, 2002 CROSSMARC big picture Domain-specific Web sites Domain-specific Spidering Domain Ontology XHTML pages WEB Focused Crawling.
Multi-Source Information Extraction Valentin Tablan University of Sheffield.
How can I use a digital library to support my teaching? Find good resources to enhance existing curriculum  Search special collections aimed at your interests.
Data mining in web applications
Chapter 17 The Need for HTML 5.
Search Engine Optimization
Information Retrieval in Practice
SEARCH ENGINE OPTIMIZATION.
Automatically Labeled Data Generation for Large Scale Event Extraction
DR Seema Singhal MS, FACS, FICOG, FCLS, MNAMS Assistant Professor
A knowledge-based text annotation tool
What the problem looks like:
Introduction Multimedia initial focus
RECENT TRENDS IN METADATA GENERATION
INTRODUCTION.
12 Steps to Useful Software Metrics
Multi Rater Feedback Surveys FAQs for Participants
Multi Rater Feedback Surveys FAQs for Participants
Big Data Quality the next semantic challenge
Microsoft Word - Formatting Pages
Auditing in SQL Server 2008 DBA-364-M
Django Presentation: Project MiTunes 2.0
Fred Dirkse CEO, OIC Group, Inc.
Chapter 12: Automated data collection methods
Social Knowledge Mining
Experience Management
#VisualHashtags Visual Summarization of Social Media Events using Mid-Level Visual Elements Sonal Goel (IIIT-Delhi), Sarthak Ahuja (IBM Research, India),
From a thesaurus standard to a general knowledge organization standard?! 04/12/2018.
ECE 544 Software Project 3: Description and Timeline
English Language Proficiency Assessment for the 21st Century
ECE 544 Software Project 3: Description and Timeline
Introduction Task: extracting relational facts from text
Pilar Orero, Spain Yoshikazu SEKI, Japan 2018
LO4 - Be Able to Update Websites to Meet Business Needs
Capturing and Organizing Scientific Annotations
Lecture 8 Information Retrieval Introduction
Effective Entity Recognition and Typing by Relation Phrase-Based Clustering
Requirements Management
ECE 544 Software Project 3: Description and Timeline
ADCOS February.
ROLE OF «electronic virtual enhanced research-engaged student teams» WEB PORTAL IN SOLUTION OF PROBLEM OF COLLABORATION INTERNATIONAL TEAMS INSIDE ONE.
Creating your multimedia product – setting up pages and navigation
Adobe Acrobat DC Accessibility: Accessibility Checker
The Backbone of Genealogy
Kaspersky Social Channel
Presentation transcript:

Tri-lingual EDL for 2017 and Beyond Heng Ji (on behalf of KBP Organizing Committee) jih@rpi.edu

Cold-Start KBP++ Combine with tri-lingual slot filling, event extraction and sentiment/belief anaysis allow more development time to achieve end-to-end KBP performance DEFT teams: CMU, BBN, and TinkerBell (Columbia+Cornell+JHU+RPI+Stanford+UIUC+Upenn) Two weeks between the end of EDL and the start of SF? EDL teams run on data from previous years in by spring 17, so SF teams have chance to play with the results and try out effective ways of using EDL

Multiple Hypotheses and Confidence Estimation 14/08/12 Only two EDL teams in KBP2016 produced meaningful confidence values, but it will become a requirement in EDL2017 Produce multiple or n-best hypotheses at all levels, from individual assertions to subgraphs Perhaps produce canonical form for each NIL cluster All kb assertions to be tagged with meaningful confidence/certainty measures Important for both downstream applications and real-world users Improve EDL scorer so multi-layer confidence values can be incorporated

EDL for 300+ languages 14/08/12 EDL for all the languages in the world, at least for those we have Wikipedia entries Some teams including RPI and Sydney have been cleaning and enriching Wikipedia markups so we can get silver standard annotations for EDL for all the 300+ languages in Wikipedia Possible collaborations with other tasks by KBC / EU

LoReHLT – Low Resources HLT Open evaluations of component technologies relevant to LORELEI (Low Resource Languages for Emergent Incidents) LoReHLT16: NER, MT, and SF (not Slot Filling) in Uyghur LoReHLT17: EDL, MT and SF Two surprise incident languages (IL’s) in parallel Three evaluation checkpoints to gauge performance based on time and resources given EDL in LoReHLT similar to but more challenging than EDL in KBP 2~3 weeks in late summer https://www.nist.gov/itl/iad/mig/lorehlt-evaluations

KB Choice 14/08/12 BaseKB is difficult for human annotators to work with, due to the lack of effective search tools Most EDL systems use Wikipedia or DBPedia and map results back to BaseKB Any suggestions from the community? LORELEI EDL will use Geoname for GPE/LOC, an dWorld Leader Book and World Fact Book for PER/ORG Too specific on Incidents for TAC-KBP?

List-only EDL 14/08/12 Link mentions to entities in various automatically or manually constructed lists

Time Table 14/08/12 Evaluation Window 1: before SF/Event/Best Evaluations, August/September? Set up 2 weeks between the end of EDL and the start of SF Allow enough time for system development and focused research Evaluation Window 2: late October Constrained Resources Setting Difficult to disallow Google translation / Google search

Multi-media EDL and KBP Arguments Entity Type Definition-Text Definition-Image Participant PER ORG People or organizations who commit the attack Person/People with weapon(s) Target ORG GPE objects People or organization or location which is attacked object which is aimed or attacked Place GPE LOC FAC scene where the attack happens Generic description of the attack setting. Time TIMEX when the attack happens N/A Instruments Weapon The type of weapon used during the attack. The schema is defined by experts based on their experience. Red text is the new / modified items from the original ACE. Multimedia: entity subtypes, speechwriter, attacker, protester, relation, or multimedia event

Fine-grained Entity Typing Type Hierarchy from Freebase Merge this with next page

Streaming Data 14/08/12 The reference KB needs to be dynamically updated, when new entries are created from NIL mentions. Besides the quality measures, we need to check Run time of the algorithm based on different size of data as time goes by Check the tradeoff between quality decrease vs. making algorithms parallel Aging-off of low confidence assertions over-time What criteria should be used to age-off information over time? Are system allowed to go back and fix errors as they know more? Update entries but also slot fillers? KBP slot types or open slot types? Data RPI purchased 17 million tweets crawled from major cities influenced by Baltimore Riots during April 12th (the day when Freddie Gray was arrested) ~ May 7th Cities/districts: Baltimore + D.C., Philadelphia, New York City, Boston, Seattle, Chicago, Minneapolis + St. Paul, San Francisco, Los Angles Contains text and multimedia data (also links from Youtube and other image/video sharing websites) Protest tweets (532K ISIS, 29K Hong Kong, 37K New York) ). For example at Time X, my system thinks that "Whitey Bulger" and "James Bulger" are two different people (and adds two novel NILs into the KB). At Time Y, my system gets the sentence "James Bulger, who is known as Whitey Bulger". -- Can my system correct the mistake? There is a related question of what is required of the streaming system. On at time period n, can the system re-read the documents it saw at time period n-1 (and possibly reprocess everything, turning streaming into a task of working with an ever-growing amount of data). Or is it required to work with only the documents in time period n-1 and the KB? If the latter, what is allowed to be stored in the KB? Can my internal KB record every semantic role that a name string participates in? Or is it limited to the ~40-slots in the TAC ontology? yay for MT?