

A Flexible and Extensible Architecture for Linguistic Annotation
Steven Bird*, David Day†, John Garofolo‡, John Henderson†, Christophe Laprun‡ and Mark Liberman*
* Linguistic Data Consortium, University of Pennsylvania
† MITRE Corporation
‡ National Institute of Standards and Technology

Tradition: Create formats and tools for each research domain
Existing bazaar of formats and tools discourages exchange and reuse
[Figure labels: SGML, RDB]

Background
Participant “Troika” motivated by application needs
  – NIST work in evaluation infrastructure
  – LDC work in corpus building and annotation graph research
  – MITRE work in multi-modal visualization/annotation, extraction technology, Alembic Workbench
Began collaboration in early summer ’99
  – Initially, exploring feasibility of fitting together existing resources under the Bird & Liberman annotation graph formalism
Early goals
  – develop ability to construct flexible and extensible tools and data formats for existing research domains and applications
  – focus task to create formats to support ACE infrastructure
Project has evolved substantially as we continue to explore new domains and uses

Base Ontology for Linguistic Annotation of Signals
Establishing an annotation requires specifying
  – The source signal that is being annotated
  – The particular region of the signal about which one wants to say something
  – The content of the annotation being asserted about that region of the signal
[Diagram labels: Signal, Annotation, Region]
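The three ingredients listed on this slide (signal, region, content) can be sketched as plain data classes. This is only an illustration in Python: the class names, the field names, and the file name "example.wav" are assumptions made for the example and are not taken from the ATLAS API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Signal:
    """The source signal being annotated (hypothetical fields)."""
    signal_id: str
    media_type: str  # e.g. "text" or "audio"
    location: str    # where the signal data lives


@dataclass(frozen=True)
class Region:
    """The part of a signal that an annotation says something about."""
    signal: Signal
    start: float  # offset into the signal, here in seconds
    end: float


@dataclass
class Annotation:
    """The content asserted about one region of a signal."""
    region: Region
    content: dict = field(default_factory=dict)


# One word-level annotation over a stretch of audio.
speech = Signal("sig1", "audio", "example.wav")
word = Annotation(Region(speech, 0.50, 0.82), {"type": "word", "label": "hello"})
```

Keeping the region separate from the content is what lets the same content structure be reused over different kinds of signal.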

The Annotation Graph Model
The Annotation Graph model, a proper subset of the more general case, addresses annotation for one-dimensional signals (text, audio)
  – intervals specified with start and end nodes
      – nodes have (optional) offsets
  – annotations specified as labeled arcs between nodes
      – labels are fielded records (attributes + values)
  – collection of annotations => annotation graph
Formal definition
  – labeled directed acyclic graph, with a partial time function on nodes (see Bird & Liberman 2000)
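A minimal sketch of the structure described on this slide, assuming a simple in-memory representation: nodes carry an optional offset (the partial time function), arcs carry a fielded record as their label, and a helper checks that the graph is acyclic. The class and method names are hypothetical; they are not the annotation graph API mentioned later in the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    node_id: str
    offset: Optional[float] = None  # None: this node has no known time


@dataclass
class Arc:
    start: str             # id of the start node
    end: str               # id of the end node
    label: Dict[str, str]  # fielded record: attribute -> value


@dataclass
class AnnotationGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    arcs: List[Arc] = field(default_factory=list)

    def add_node(self, node_id: str, offset: Optional[float] = None) -> None:
        self.nodes[node_id] = Node(node_id, offset)

    def add_arc(self, start: str, end: str, **label: str) -> None:
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("both end points must be existing nodes")
        self.arcs.append(Arc(start, end, dict(label)))

    def is_acyclic(self) -> bool:
        """Kahn's algorithm: repeatedly remove nodes with no incoming arcs."""
        indegree = {node_id: 0 for node_id in self.nodes}
        for arc in self.arcs:
            indegree[arc.end] += 1
        ready = [node_id for node_id, deg in indegree.items() if deg == 0]
        removed = 0
        while ready:
            current = ready.pop()
            removed += 1
            for arc in self.arcs:
                if arc.start == current:
                    indegree[arc.end] -= 1
                    if indegree[arc.end] == 0:
                        ready.append(arc.end)
        return removed == len(self.nodes)


graph = AnnotationGraph()
graph.add_node("n0", 0.0)
graph.add_node("n1")        # word boundary with no time offset
graph.add_node("n2", 0.9)
graph.add_arc("n0", "n1", type="word", label="hello")
graph.add_arc("n1", "n2", type="word", label="world")
graph.add_arc("n0", "n2", type="phrase", label="greeting")
```

The phrase arc spans the same nodes as the two word arcs, which is how several layers of annotation can share one timeline.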

ATLAS Generalized Model
The generalized model has been designed to accommodate non-linear signals such as images:
  – annotation elements describing regions within signals with signal pointer(s) and content-bearing attributes
  – annotation sets containing clusters of annotation elements
      – annotations may be treated as signals themselves
      – standoff annotations provide alignment of annotations & signals
[Diagram labels: Signal, Content, Annotation, Region]
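The two ideas at the end of this slide, standoff annotation and annotations acting as signals, can be illustrated with a small sketch. It assumes a text signal addressed by character offsets and uses plain dictionaries; the ids ("doc1", "a1", ...) and the helper covered_text are hypothetical and only serve the example.

```python
# The signal is stored once and never modified; annotations only point into it.
signals = {"doc1": "ATLAS supports standoff annotation"}

annotations = {
    # Standoff annotations over the text, addressed by character offsets.
    "a1": {"signal": "doc1", "region": (0, 5), "content": {"type": "name"}},
    "a2": {"signal": "doc1", "region": (15, 23), "content": {"type": "term"}},
    # An annotation whose "signal" is another annotation.
    "a3": {"signal": "a1", "region": None,
           "content": {"type": "comment", "note": "project name"}},
}


def covered_text(ann_id: str) -> str:
    """Return the stretch of source text that an annotation ultimately covers."""
    annotation = annotations[ann_id]
    # Follow pointers through annotations until a real signal is reached.
    while annotation["signal"] in annotations:
        annotation = annotations[annotation["signal"]]
    start, end = annotation["region"]
    return signals[annotation["signal"]][start:end]


for ann_id in ("a1", "a2", "a3"):
    print(ann_id, repr(covered_text(ann_id)))
```

Because the annotations live outside the text, any number of annotation sets can be aligned with the same signal without conflicting markup.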

Extensibility
Impossible to anticipate all the varieties of “linguistic signals” and the ways one might wish to annotate them
ATLAS includes a mechanism for declaring new signal classes and defining new ways of carving out regions of those signals via
  – the definition of an anchor type for the new signal class
  – the creation of an anchor “plug-in” component
ATLAS will support general purpose signal classes for popular linguistic resource types
  – Signals: text, audio, images, video
  – Symbol tables: word lists, part-of-speech tagsets, …
  – Attribute value matrices: dictionaries, thesauri, knowledge representation propositions, …
  – Tree databases: Treebanks, …
  – Signal alignments: bilingual corpora, …
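One way to picture the anchor "plug-in" mechanism is a registry that maps each signal class to a function building anchors for it. The sketch below is an assumption about how such a mechanism could look, not a description of how ATLAS implements it; register_anchor, make_anchor and the anchor fields are all hypothetical names.

```python
from typing import Callable, Dict

# Registry of anchor factories, keyed by signal class name.
ANCHOR_TYPES: Dict[str, Callable[..., dict]] = {}


def register_anchor(signal_class: str):
    """Decorator that declares the anchor type for a signal class."""
    def decorator(factory: Callable[..., dict]) -> Callable[..., dict]:
        ANCHOR_TYPES[signal_class] = factory
        return factory
    return decorator


@register_anchor("audio")
def time_anchor(seconds: float) -> dict:
    return {"kind": "time", "seconds": seconds}


@register_anchor("text")
def char_anchor(offset: int) -> dict:
    return {"kind": "char", "offset": offset}


# A new signal class is supported by adding a plug-in, not by editing the core.
@register_anchor("image")
def pixel_anchor(x: int, y: int) -> dict:
    return {"kind": "pixel", "x": x, "y": y}


def make_anchor(signal_class: str, *args) -> dict:
    """Build an anchor using whichever plug-in is registered for the class."""
    try:
        factory = ANCHOR_TYPES[signal_class]
    except KeyError:
        raise ValueError(
            f"no anchor type declared for signal class {signal_class!r}"
        ) from None
    return factory(*args)


corner = make_anchor("image", 10, 20)
```

A region of an image can then be carved out with two such pixel anchors, in the same way an audio interval is carved out with two time anchors.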

ATLAS Layers
Approach: Separate/abstract physical and logical levels from application-specific levels for maximum flexibility.
  – Physical level provides a persistent representation of logical level data for long-term storage, exchange, and pipelining
      – XML-based ATLAS Interchange Format (AIF)
      – Relational database implementation
  – Logical level provides a structural framework for the manipulation of annotation data
      – annotation elements and sets
      – atomic operators (creation, manipulation, destruction)
  – Application level specifies semantic interpretation of annotation data and provides user interfaces
      – application-specific (developer-provided)
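The separation of levels can be sketched as follows: a logical-level annotation set with atomic operators, and a physical-level store behind an abstract interface. The names are hypothetical, and the JSON store is only a stand-in chosen to keep the example short; it does not reproduce AIF or the relational implementation named on the slide.

```python
import json
from abc import ABC, abstractmethod


class AnnotationSet:
    """Logical level: annotation elements plus atomic operators."""

    def __init__(self):
        self._elements = {}
        self._next_id = 0

    def create(self, region, content):
        ann_id = self._next_id
        self._next_id += 1
        self._elements[ann_id] = {"region": list(region), "content": dict(content)}
        return ann_id

    def update(self, ann_id, content):
        self._elements[ann_id]["content"].update(content)

    def destroy(self, ann_id):
        del self._elements[ann_id]

    def elements(self):
        return dict(self._elements)


class PhysicalStore(ABC):
    """Physical level: persistent representation of logical-level data."""

    @abstractmethod
    def save(self, annotations):
        ...

    @abstractmethod
    def load(self):
        ...


class JsonStore(PhysicalStore):
    """Stand-in store that serializes to a JSON string held in memory."""

    def __init__(self):
        self.text = ""

    def save(self, annotations):
        self.text = json.dumps(annotations.elements())

    def load(self):
        restored = AnnotationSet()
        # Ids are reassigned from 0 in stored order on load.
        for element in json.loads(self.text).values():
            restored.create(element["region"], element["content"])
        return restored


# Application level: decides what the annotations mean (here, word labels).
working = AnnotationSet()
first = working.create((0.0, 0.4), {"type": "word", "label": "hello"})
second = working.create((0.4, 0.9), {"type": "word", "label": "wrld"})
working.update(second, {"label": "world"})

store = JsonStore()
store.save(working)
restored = store.load()
```

Because the application only talks to AnnotationSet, the store could be swapped for a different physical representation without touching application code.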

Layered Solution
[Diagram: layered architecture]
Applications: Evaluation Software, Conversion Tools, Query Systems, Visualization and Exploration, Extraction Systems, Annotation Tools, Automatic Aligners
ATLAS API
ATLAS Logical Level: ATLAS CORE
ATLAS Physical Level: RDB, AIF Files

ATLAS Architecture
[Diagram: components around the ATLAS Internal Representation]
Annotation: AC1, AC2, ACn
Visualization: VC1, VC2, VCn
Format Exchange: EC1, EC2, ECn
Search/Access: SC1, SC2, SCn
Persistent Storage: RDBMS, flat files (AIF)
XML Processing: DTD validation, XML parser, XSLT
Data Access: file sharing, network protocols, multi-user/collaboration, privacy

ATLAS Interchange Format
An Example …
[Callout labels on the example: Annot element, Source Signal, Standoff Content, Signal types, Annot set]

Potential ATLAS Applications
Corpora:
  – data exchange/reuse, consistent meta data formats
  – multi-layered/multi-linked annotation
  – multi-lingual dictionaries, aligned multi-lingual data
  – aligned multi-modal data (audio/video/image/text)
  – lexicons with varying levels of structure
Tools:
  – modular/reusable annotation components
  – development infrastructure
  – conversion tools
Applications:
  – internal/external data representation
  – faster prototyping and development
  – evaluation
  – data pipelining and plug-and-play data exchange
  – document segmentation/zoning

ATLAS Projects Underway
Evaluation Formats:
  – ACE Entity Detection and Tracking (EDT) Evaluation
  – DARPA/NIST ASR/Segmentation scoring
Corpora:
  – NSF linguistic exploration project on low-density languages
  – NSF Talkbank
  – UMD Image Recognition Evaluation Corpus
Tools:
  – LDC annotation tools
  – MITRE Alembic Workbench
  – Emu speech database access tools
  – DGA speech Transcriber
  – next generation SCLITE

Development Status
ATLAS Prototype Suite implemented:
  – ATLAS Interchange Format (AIF) XML DTD
  – Annotation graph API definition
  – Core API implementations (C++, Java) for annotation graphs
Extending the architecture for new signal types
Defining query language
Currently soliciting research community input
  – ACE, TIDES, DARPA ASR, ISLE, CES, industry...
Complete ATLAS 1.0 (Beta) (Sep. 2000)
  – Internal representation, AIF, basic query language, sample applications (transcription/annotation tools, conversion tools)
Open Source ATLAS (Winter, )
ATLAS Website:
  –