ACE: Automatic Content Extraction
A program to develop technology to extract and characterize meaning from human language.

Government ACE Team
Project Management
NSA, CIA, DIA, NIST
Research Oversight
JK Davis (NSA), Charles Wayne (NSA), Boyan Onyshkevych (NSA), Steve Dennis (NSA), George Doddington (NIST), John Garofolo (NIST)

ACE Five-Year Goals
– Develop automatic content extraction technology to extract information from human language in textual form: Text (newswire), Speech (ASR), Image (OCR)
– Enable new applications in: Data Mining, Browsing, Link Analysis, Summarization, Visualization, Collaboration, TDT, DR, IE
– Provide major improvements in analyst access to relevant data

The ACE Processing Model
A database maintenance task:
[Diagram: Source language data (Newswire (text), Broadcast News (ASR), Newspaper (OCR)) feeds ACE technology (Detection and tracking of entities, Recognition of semantic relations, Recognition of events), which produces Content for a Content database used by the analyst for Visualization, Data mining, Browsing and Link analysis. The diagram also carries the label "The ACE Pilot Study".]

The ACE Pilot Study
Objective: To lay the groundwork for the ACE program.
– Answer key questions:
    What are the right technical goals?
    What is the impact of degraded text?
    How should performance be measured?
– Establish performance baselines
– Choose initial research directions (Entity Detection and Tracking)
– Begin developing content extraction technology

The ACE Pilot Study Process (May ’99 to May ’00)
– Discuss/Explore candidate R&D tasks
– Bimonthly meetings
– Identify Data
– Bimonthly site visits
– Provide infrastructure support: annotation / reconciliation / evaluation
– Select/Define Pilot Study common task
– Annotate Data
– Implement and evaluate baseline systems
– Final pilot study workshop (22-23 May ’00)

The Pilot Study R&D Task
EDT: Entity Detection and Tracking (limited to “within-document” processing)
EDT is a suite of four tasks:
1) Detection of Entities, limited to five types: PER, ORG, GPE, LOC, FAC
2) Recognition of Entity Attributes, limited to: Type, Name
3) Detection of Entity Mentions (i.e., entity tracking)
4) Recognition of Mention Extent

The Entity Detection Task
This is the most basic common task. It is the foundation upon which the other tasks are built, and it is therefore a required task for all ACE technology developers. Recognition of entity type and entity attributes is separate from entity detection. Note, however, that detection is limited to entities of specified types.

Entity Types
Entities to be detected and recognized will be limited to the following five types:
1 – Person. Person entities are limited to humans. A person may be a single individual or a group if the group has a group identity.
2 – Organization. Organization entities are limited to corporations, agencies, and other groups of people defined by an established organizational structure. Churches, schools, embassies and restaurants are examples of organization entities.
3 – GPE (a Geo-Political Entity). GPE entities are politically defined geographical regions. A GPE entity subsumes and does not distinguish between a geographical region, its government or its people. GPE entities include nations, states and cities.

Entity Types (continued)
4 – Location. Location entities are limited to geographic entities with physical extent. Location entities include geographical areas and landmasses, bodies of water, and geological formations. A politically defined geographic area is a GPE entity rather than a location entity.
5 – Facility. Facility entities are human-made artifacts falling under the domains of architecture and civil engineering. Facility entities include buildings such as houses, factories, stadiums, museums; and elements of transportation infrastructure such as streets, airports, bridges and tunnels.

The Entity Detection Process
A system must output a representation of each entity mentioned in a document, at the end of that document:
– Pointers to the beginning and end of the head of one or more mentions of the entity. (As an option, pointers to all mentions may be output, in order to support the evaluation of Mention Detection performance.)
– Entity type and attribute (name) information.
– Mention extent, in terms of pointers to the beginning and end of each mention. (optional – for evaluation of mention extent recognition performance only)
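
The output representation described on this slide could be modeled roughly as follows. This is only a sketch: the class names, the field names, the use of character offsets as "pointers", and the check that an extent contains its head are all assumptions made for illustration, not part of the ACE specification.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumption: a "pointer pair" is a (start, end) pair of character offsets.
Span = Tuple[int, int]

# The five entity types named in the pilot study.
ENTITY_TYPES = {"PER", "ORG", "GPE", "LOC", "FAC"}


@dataclass
class Mention:
    head: Span                      # beginning and end of the mention head
    extent: Optional[Span] = None   # optional full extent of the mention


@dataclass
class Entity:
    entity_id: str                  # hypothetical identifier
    entity_type: str                # one of ENTITY_TYPES
    mentions: List[Mention]         # one or more mentions of the entity
    name: Optional[str] = None      # the name attribute, if any


def validate_entity(entity: Entity) -> List[str]:
    """Return a list of problems found; an empty list means the record looks well formed."""
    problems = []
    if entity.entity_type not in ENTITY_TYPES:
        problems.append(f"unknown entity type: {entity.entity_type}")
    if not entity.mentions:
        problems.append("an entity needs at least one mention")
    for m in entity.mentions:
        if m.head[0] > m.head[1]:
            problems.append(f"head span out of order: {m.head}")
        # Assumption: when an extent is given, it should contain the head.
        if m.extent is not None and not (
            m.extent[0] <= m.head[0] and m.head[1] <= m.extent[1]
        ):
            problems.append(f"extent {m.extent} does not contain head {m.head}")
    return problems


example = Entity("E1", "PER", [Mention(head=(10, 14), extent=(0, 14))], name="Smith")
```

Keeping the extent optional mirrors the slide, where extents are only needed when mention extent recognition is being evaluated.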

Evaluation of Entity Detection
Entity Detection performance will be measured in terms of missed entities and false alarm entities. In order to measure misses and false alarms, each reference entity must first be associated with the appropriate corresponding system output entity. This is done by choosing, for each reference entity, that system output entity with the best matching set of mentions. Note, however, that a system output entity is permitted to map to at most one reference entity.
– Miss: a miss occurs whenever a reference entity has no corresponding output entity.
– False alarm: a false alarm occurs whenever an output entity has no corresponding reference entity.
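
One way to picture the mapping step and the two error counts is the sketch below. The slide does not say how "best matching" is scored or how ties are resolved, so this example assumes that the match score is the number of identical mention head spans and that pairs are assigned greedily, best score first. Function and variable names are hypothetical.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int]  # assumed: (start, end) offsets of a mention head


def map_entities(
    reference: Dict[str, Set[Span]], system: Dict[str, Set[Span]]
) -> Dict[str, str]:
    """Map reference entity ids to system entity ids, one-to-one at most."""
    candidates = []
    for ref_id, ref_mentions in reference.items():
        for sys_id, sys_mentions in system.items():
            overlap = len(ref_mentions & sys_mentions)
            if overlap > 0:
                candidates.append((overlap, ref_id, sys_id))
    # Best overlap first; ids break ties so the result is deterministic.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    mapping: Dict[str, str] = {}
    used_system_ids = set()
    for _, ref_id, sys_id in candidates:
        if ref_id in mapping or sys_id in used_system_ids:
            continue  # each side may be used at most once
        mapping[ref_id] = sys_id
        used_system_ids.add(sys_id)
    return mapping


def count_entity_errors(
    reference: Dict[str, Set[Span]], system: Dict[str, Set[Span]]
) -> Tuple[int, int]:
    """Return (misses, false_alarms) at the entity level."""
    mapping = map_entities(reference, system)
    misses = len(reference) - len(mapping)      # unmapped reference entities
    false_alarms = len(system) - len(mapping)   # unmapped system entities
    return misses, false_alarms


reference = {"R1": {(0, 4), (20, 22)}, "R2": {(30, 35)}, "R3": {(50, 55)}}
system = {"S1": {(0, 4)}, "S2": {(30, 35), (20, 22)}, "S3": {(70, 75)}}
print(count_entity_errors(reference, system))  # (1, 1)
```

In this toy input R3 has no counterpart (one miss) and S3 matches nothing (one false alarm). A greedy assignment is simple but not guaranteed to be globally optimal; an actual scorer might use a different matching procedure.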

Recognition of Entity Attributes
This is the basic task of characterizing entities. It includes recognition of entity type. It is a required task for all ACE technology developers. Performance is measured only for those entities that are mapped to reference entities. Evaluation of performance will be conditioned on entity and attribute type. For the EDT pilot study, the only attributes to be recognized are entity type and entity name. An entity name is “recognized” by detecting its presence and then correctly determining its extent.

Detection of Entity Mentions
Mention detection measures the ability of the system to correctly detect and associate all of the mentions of an entity, for all correctly detected entities. It is in essence a co-reference task. Detection performance will be measured in terms of missed mentions and false alarm mentions. For each mapped reference entity:
– Miss: a miss occurs for each reference mention of that entity without a matching mention in the corresponding output entity, and
– False alarm: a false alarm occurs for each mention in the corresponding output entity without a matching reference mention.
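
The mention-level counts can be sketched in a few lines, given a mapping from reference entity ids to system entity ids (however that mapping was obtained). The example assumes that two mentions "match" when their head spans are identical; the slide does not define the matching criterion, and all names here are hypothetical.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int]  # assumed: (start, end) offsets of a mention head


def count_mention_errors(
    reference: Dict[str, Set[Span]],
    system: Dict[str, Set[Span]],
    mapping: Dict[str, str],
) -> Tuple[int, int]:
    """Return (missed_mentions, false_alarm_mentions) over mapped entities only."""
    misses = 0
    false_alarms = 0
    for ref_id, sys_id in mapping.items():
        ref_mentions = reference[ref_id]
        sys_mentions = system[sys_id]
        # Reference mentions with no matching mention in the output entity.
        misses += len(ref_mentions - sys_mentions)
        # Output mentions with no matching reference mention.
        false_alarms += len(sys_mentions - ref_mentions)
    return misses, false_alarms


ref_entities = {"R1": {(0, 4), (20, 22)}, "R2": {(30, 35)}}
sys_entities = {"S1": {(0, 4)}, "S2": {(30, 35), (20, 22)}}
entity_mapping = {"R1": "S1", "R2": "S2"}
print(count_mention_errors(ref_entities, sys_entities, entity_mapping))  # (1, 1)
```

Here the mention at (20, 22) was attached to the wrong output entity, so it counts once as a miss for R1 and once as a false alarm for S2, which is how a co-reference error shows up under this scheme.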

Recognition of Mention Extent
Extent recognition measures the ability of the system to correctly determine the extent of the mentions, for all correctly detected mentions. This ability will be measured in terms of the classification error rate, which is simply the fraction of all mapped reference mentions that have extents that are not “identical” to the extents of the corresponding system output mentions.
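
As a small worked illustration of this error rate: the sketch below takes pairs of (reference extent, system extent) for mapped mentions and returns the fraction whose extents differ. The slide puts “identical” in quotation marks, which may imply some tolerance; this example assumes exact equality of offsets, and the function name is hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # assumed: (start, end) offsets of a mention extent


def extent_error_rate(pairs: List[Tuple[Span, Span]]) -> float:
    """Fraction of mapped mentions whose system extent differs from the reference."""
    if not pairs:
        raise ValueError("no mapped mentions to score")
    errors = sum(1 for ref_extent, sys_extent in pairs if ref_extent != sys_extent)
    return errors / len(pairs)


mapped_pairs = [
    ((0, 14), (0, 14)),    # identical
    ((20, 31), (20, 31)),  # identical
    ((40, 52), (44, 52)),  # system extent starts too late
    ((60, 66), (60, 66)),  # identical
]
print(extent_error_rate(mapped_pairs))  # 0.25
```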

Action Items that remain to be completed for the ACE pilot study
Annotate the Pilot Corpus
ASR:
– Publish ASR transcription output
– Produce timing information for ref transcripts
OCR:
– Produce and publish OCR recognition output
– Produce bounding boxes for ref transcripts
EDT technology development:
– Implement EDT systems
– Evaluate them

The ACE/EDT Pilot Corpus
Columns: Training (01-02/98), Dev Test (03-04/98), Eval Test (05-06/98)
Newswire: 30,000 words, 15,000 words
Broadcast News: 30,000 words, 15,000 words
Newspaper: 30,000 words, 15,000 words

Schedule for Pilot Corpus Annotation and EDT Evaluation

EDT Annotation Assignment for the Pilot Corpus

Pilot Study Planning
Resolve remaining actions, issues and schedule
– Mark Przybocki will provide ACE sites with sample ASR/OCR source files no later than Monday March 27.
– David Day will provide working scripts, no later than Monday April 17, for:
    converting ASR/OCR_source files to newswire_source files
    converting EDT_newswire_out files to EDT_ASR/OCR_out files
Anything else?…

ACE Program Direction
– Proposed extensions to the EDT task
– Proposed new ACE tasks

Proposed extensions to the EDT task
– New entity types
– New entity attributes
– Role attribute for entity mentions
– Cross-document entity tracking
– Restrict entities to just the important ones
– Restrict mentions to those that are referential
– …

New Entity Types
Current: Facility, GSP, Location, Organization, Person
Proposed:
– FOG (a human-created enterprise = FAC+ORG)
– GPE (a geo-political entity = GSP)
– NGE (a natural geographic entity = LOC)
– PER (a person = PER)
– POS (a place, a spatially determined location)

New Entity Attributes
– ORG: subtype = {government, business, other}
– GPE: subtype = {nation, state, city, other}
– NGE: subtype = {land, water, other}
– PER: nationality = {…}; sex = {M, F, other}
– POS: subtype = {point, line, other}
Plurality
dis/conjunctive

Introduce a new concept: The “role” of a mention
“Entity” is a symbolic construct that represents an abstract identity. Entities have various aspects and functional roles that are associated with their identities. We would like to identify these functional roles in addition to identifying the (more abstract) entity identity. This may be done by tagging each mention of an entity with its “role”, which may be simply one of the (five) “fundamental” entity types.

Proposed new ACE tasks
Unnumbered tasks
– Predicate Argument Recognition (aka Proposition Bank)
– …
…
Numbered tasks
– …
…

Program Planning
Application ideas
– Presentations (?)
– Brainstorming
Technical infrastructure needs
– Corpora
– Tools
Program direction plans (Steve Dennis)

ACE Common Task Candidates (to be evaluated)
– EDT
– Intradoc facts/events (this includes temporal information)
– Xdoc EDT (+ attribute normalization)
– EDT+ (+ = mention roles, more types, metonymy tags, attribute normalization)
– Xdoc facts/events
– Intradoc facts/events+ (+ = modality)
– Predicate Argument Recognition

ACE program activity candidates
– Proposition Bank corpus development
– Create a comprehensive ACE database schema
– Identify a terrific demo for ACE technology