
1 A Flexible and Extensible Architecture for Linguistic Annotation
Steven Bird*, David Day†, John Garofolo‡, John Henderson†, Christophe Laprun‡ and Mark Liberman*
* Linguistic Data Consortium, University of Pennsylvania
† MITRE Corporation
‡ National Institute of Standards and Technology

2 Tradition: Create formats and tools for each research domain
Existing bazaar of formats and tools discourages exchange and reuse
[Figure labels: SGML, RDB]

3 Background
Participant “Troika” motivated by application needs
–NIST work in evaluation infrastructure
–LDC work in corpus building and annotation graph research
–MITRE work in multi-modal visualization/annotation, extraction technology, Alembic Workbench
Began collaboration in early summer ’99
–Initially, exploring feasibility of fitting together existing resources under the Bird & Liberman annotation graph formalism
Early goals
–develop the ability to construct flexible and extensible tools and data formats for existing research domains and applications
–focus task: create formats to support ACE infrastructure
Project has evolved substantially as we continue to explore new domains and uses

4 Base Ontology for Linguistic Annotation of Signals
Establishing an annotation requires specifying
–The source signal that is being annotated
–The particular region of the signal about which one wants to say something
–The content of the annotation being asserted about that region of the signal
[Figure labels: Signal, Annotation, Region]
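The three parts of the ontology above can be sketched as plain data types. This is only an illustrative sketch, not the ATLAS API: the class names, field names, and the sample file name are assumptions made for the example, and the region is simplified to a start/end pair on a one-dimensional signal.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Signal:
    """The source being annotated (all field names are hypothetical)."""
    signal_id: str
    kind: str        # e.g. "text" or "audio"
    location: str    # where the signal data lives


@dataclass(frozen=True)
class Region:
    """The part of the signal one wants to say something about."""
    signal: Signal
    start: float
    end: float

    def __post_init__(self):
        # Reject regions whose end comes before their start.
        if self.end < self.start:
            raise ValueError("region end precedes start")


@dataclass
class Annotation:
    """Content asserted about one region of one signal."""
    region: Region
    content: dict = field(default_factory=dict)


# Hypothetical usage: one word-level annotation on an audio signal.
speech = Signal("sig1", "audio", "interview.wav")
word = Annotation(Region(speech, 4.53, 4.97), {"type": "word", "label": "hello"})
```

Keeping the signal, the region, and the content as separate objects mirrors the slide: the same signal can carry many regions, and each region can carry any content.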

5 The Annotation Graph Model
The Annotation Graph model, a proper subset of the more general case, addresses annotation for one-dimensional signals (text, audio)
–intervals specified with start and end nodes
  nodes have (optional) offsets
–annotations specified as labeled arcs between nodes
  labels are fielded records (attributes + values)
–collection of annotations => annotation graph
Formal definition
–labeled directed acyclic graph, with a partial time function on nodes (see Bird & Liberman 2000)
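A minimal sketch of the structure described above: nodes with optional offsets, arcs labeled with attribute/value records, and a check that the graph stays acyclic. The class and method names are invented for illustration and are not the ATLAS or annotation graph API; the rule that an arc may not run backwards in time is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    node_id: str
    offset: Optional[float] = None   # the time function is partial: may be absent


@dataclass
class Arc:
    start: str
    end: str
    label: dict                      # fielded record: attributes + values


@dataclass
class AnnotationGraph:
    nodes: dict = field(default_factory=dict)
    arcs: list = field(default_factory=list)

    def add_node(self, node_id, offset=None):
        self.nodes[node_id] = Node(node_id, offset)

    def add_arc(self, start, end, **label):
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        # If 'start' is already reachable from 'end', the new arc closes a cycle.
        if self._reachable(end, start):
            raise ValueError("arc would create a cycle")
        s, e = self.nodes[start].offset, self.nodes[end].offset
        if s is not None and e is not None and s > e:
            raise ValueError("arc runs backwards in time")
        self.arcs.append(Arc(start, end, label))

    def _reachable(self, src, dst):
        """Depth-first search: is dst reachable from src along existing arcs?"""
        stack, seen = [src], set()
        while stack:
            cur = stack.pop()
            if cur == dst:
                return True
            if cur in seen:
                continue
            seen.add(cur)
            stack.extend(a.end for a in self.arcs if a.start == cur)
        return False


# Hypothetical usage: two words and the phrase that spans them.
g = AnnotationGraph()
g.add_node("n0", 0.0)
g.add_node("n1")          # no offset: this node is not anchored in time
g.add_node("n2", 1.2)
g.add_arc("n0", "n1", type="word", label="the")
g.add_arc("n1", "n2", type="word", label="cat")
g.add_arc("n0", "n2", type="phrase", label="NP")
```

The unanchored node `n1` shows why the time function is partial: the word boundary is known to lie between `n0` and `n2` without its exact offset being recorded.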

6 ATLAS Generalized Model
The generalized model has been designed to accommodate non-linear signals such as images:
–annotation elements describing regions within signals with signal pointer(s) and content-bearing attributes
–annotation sets containing clusters of annotation elements
  annotations may be treated as signals themselves
  standoff annotations provide alignment of annotations & signals
[Figure labels: Signal, Content, Annotation, Region]

7 Extensibility
Impossible to anticipate all the varieties of “linguistic signals” and the ways one might wish to annotate them
ATLAS includes a mechanism for declaring new signal classes and defining new ways of carving out regions of those signals via
–the definition of an anchor type for the new signal class
–the creation of an anchor “plug-in” component
ATLAS will support general purpose signal classes for popular linguistic resource types
–Signals: text, audio, images, video
–Symbol tables: word lists, part-of-speech tagsets, …
–Attribute value matrices: dictionaries, thesauri, knowledge representation propositions, …
–Tree databases: Treebanks, …
–Signal alignments: bilingual corpora, …
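One way to picture the anchor plug-in idea is a registry that maps each signal class to its own anchor type. The sketch below is purely illustrative: the registry, the decorator, and the three anchor classes are hypothetical names, and the slides do not say how ATLAS actually implements plug-ins.

```python
from dataclasses import dataclass

# Hypothetical registry: signal class name -> anchor class.
ANCHOR_TYPES = {}


def register_anchor(signal_class):
    """Class decorator that plugs an anchor type in for one signal class."""
    def decorator(cls):
        ANCHOR_TYPES[signal_class] = cls
        return cls
    return decorator


@register_anchor("text")
@dataclass(frozen=True)
class CharOffsetAnchor:
    offset: int          # character position in a text signal


@register_anchor("audio")
@dataclass(frozen=True)
class TimeAnchor:
    seconds: float       # time position in an audio signal


@register_anchor("image")
@dataclass(frozen=True)
class PixelAnchor:
    x: int               # a two-dimensional anchor for a non-linear signal
    y: int


def make_anchor(signal_class, **coords):
    """Build an anchor for a signal class, failing clearly if none is registered."""
    try:
        anchor_cls = ANCHOR_TYPES[signal_class]
    except KeyError:
        raise ValueError(f"no anchor type registered for {signal_class!r}") from None
    return anchor_cls(**coords)


# Hypothetical usage.
corner = make_anchor("image", x=3, y=4)
onset = make_anchor("audio", seconds=4.53)
```

Supporting a new signal class then means adding one decorated anchor class; code that calls `make_anchor` does not change.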

8 ATLAS Layers
Approach: Separate/abstract physical and logical levels from application-specific levels for maximum flexibility.
–Physical level provides a persistent representation of logical level data for long-term storage, exchange, and pipelining
  XML-based ATLAS Interchange Format (AIF)
  Relational database implementation
–Logical level provides a structural framework for the manipulation of annotation data
  annotation elements and sets
  atomic operators (creation, manipulation, destruction)
–Application level specifies semantic interpretation of annotation data and provides user interfaces
  application-specific (developer-provided)
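The separation of levels can be illustrated with a small sketch: a logical-level set with atomic operators, a physical-level storage interface, and an application-level function that interprets the data. Every name here is hypothetical, and the JSON backend is a stand-in chosen for brevity; it is not AIF and not the relational implementation.

```python
import json
from abc import ABC, abstractmethod


class AnnotationSet:
    """Logical level: in-memory elements plus atomic operators."""

    def __init__(self):
        self.elements = {}
        self._next_id = 0

    def create(self, region, content):
        elem_id = f"a{self._next_id}"
        self._next_id += 1
        self.elements[elem_id] = {"region": list(region), "content": dict(content)}
        return elem_id

    def update(self, elem_id, **content):
        self.elements[elem_id]["content"].update(content)

    def destroy(self, elem_id):
        del self.elements[elem_id]


class Storage(ABC):
    """Physical level: persistence behind a backend-neutral interface."""

    @abstractmethod
    def save(self, annotation_set): ...

    @abstractmethod
    def load(self): ...


class JsonStringStorage(Storage):
    """Stand-in backend that serializes to a JSON string held in memory."""

    def __init__(self):
        self.blob = ""

    def save(self, annotation_set):
        self.blob = json.dumps(
            {"elements": annotation_set.elements, "next_id": annotation_set._next_id}
        )

    def load(self):
        data = json.loads(self.blob)
        restored = AnnotationSet()
        restored.elements = data["elements"]
        restored._next_id = data["next_id"]
        return restored


def words_in(annotation_set):
    """Application level: interpret elements whose content marks them as words."""
    return [
        e["content"]["label"]
        for e in annotation_set.elements.values()
        if e["content"].get("type") == "word"
    ]


# Hypothetical usage: edit at the logical level, persist, reload, interpret.
s = AnnotationSet()
first = s.create((0, 3), {"type": "word", "label": "the"})
second = s.create((4, 7), {"type": "word", "label": "cat"})
phrase = s.create((0, 7), {"type": "phrase", "label": "NP"})
s.update(second, label="dog")
s.destroy(phrase)

store = JsonStringStorage()
store.save(s)
reloaded = store.load()
```

Because `words_in` only touches the logical-level structure, swapping `JsonStringStorage` for another `Storage` subclass would not affect application code.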

9 Layered Solution
[Figure: layered diagram. Applications: Evaluation Software, Conversion Tools, Query Systems, Visualization and Exploration, Extraction Systems, Annotation Tools, Automatic Aligners. Other labels: ATLAS API, ATLAS CORE, ATLAS Logical Level, ATLAS Physical Level, RDB, AIF Files]

10 ATLAS Architecture
[Figure: component diagram around the ATLAS Internal Representation]
–Annotation: AC1, AC2, ACn
–Visualization: VC1, VC2, VCn
–Format Exchange: EC1, EC2, ECn
–Search/Access: SC1, SC2, SCn
–Persistent Storage: RDBMS, flat files (AIF)
–XML Processing: DTD validation, XML parser, XSLT
–Data Access: file sharing, network protocols, multi-user/collaboration, privacy

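11 ATLAS Interchange Format: An Example
[Figure: AIF XML listing, markup not preserved in this transcript; surviving values: 453, 497, 25, 29, … Callouts: Annot set, Signal types, Source Signal, Annot element, Standoff, Content]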

12 Potential ATLAS Applications
Corpora:
–data exchange/reuse, consistent meta data formats
–multi-layered/multi-linked annotation
–multi-lingual dictionaries, aligned multi-lingual data
–aligned multi-modal data (audio/video/image/text)
–lexicons with varying levels of structure
Tools
–modular/reusable annotation components
–development infrastructure
–conversion tools
Applications
–internal/external data representation
–faster prototyping and development
–evaluation
–data pipelining and plug-and-play data exchange
–document segmentation/zoning

13 ATLAS Projects Underway
Evaluation Formats:
–ACE Entity Detection and Tracking (EDT) Evaluation
–DARPA/NIST ASR/Segmentation scoring
Corpora:
–NSF linguistic exploration project on low-density languages
–NSF Talkbank
–UMD Image Recognition Evaluation Corpus
Tools:
–LDC annotation tools
–MITRE Alembic Workbench
–Emu speech database access tools
–DGA speech Transcriber
–next generation SCLITE

14 Development Status
ATLAS Prototype Suite implemented:
–ATLAS Interchange Format (AIF) XML DTD
–Annotation graph API definition
–Core API implementations (C++, Java) for annotation graphs
Extending the architecture for new signal types
Defining query language
Currently soliciting research community input
–ACE, TIDES, DARPA ASR, ISLE, CES, industry...
Complete ATLAS 1.0 (Beta) (Sep. 2000)
–Internal representation, AIF, basic query language, sample applications (transcription/annotation tools, conversion tools)
Open Source ATLAS (Winter, 2000-2001)
ATLAS Website:
–http://www.nist.gov/speech/atlas

