EAGLES/ISLE Workshop, LREC 2000, Athens, Greece
Requirements, Tools, and Architectures for Annotated Corpora
Nancy Ide, Vassar College
Chris Brew, Ohio State University
Data Architectures and Software Support for Large Corpora
Towards an American National Corpus

Resources are expensive!
  funders expect to amortize cost of resource creation over several projects
  researchers don't want to reinvent the wheel
  want to be able to accommodate uses for corpora and tools that may not yet be envisaged
  cross-disciplinary acceptance
  no longer an option
  we need
    – reusability to avoid unnecessary labor and cost
    – flexibility and extensibility to accommodate different applications, different modes and media, different approaches, and potential future uses

Areas for consideration
  Annotation formats
    – format of annotations themselves
  Encoding formats
    – markup scheme used to identify and delineate elements in the data
  Data architecture
    – organization of data in terms of document structure, linkage
  Tools architecture
    – framework for tool interoperability
  Tool support components
    – facilities to enable tools to work efficiently

Annotation Formats
  need not be identical to achieve commonality
  must work toward specifications that enable mapping among annotations of the same type
  EAGLES/ISLE guidelines
    – layered model
        universally agreed-upon and applicable specifications at the bottom
        modules for specific languages, applications, and/or theoretical approaches at higher levels
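
The idea of mapping among annotations of the same type through a shared bottom layer can be sketched as follows. This is only an illustration under stated assumptions: the two tag sets (`SCHEME_A_TO_CORE`, `SCHEME_B_TO_CORE`), their tag names, and the three core categories are hypothetical and are not taken from the EAGLES/ISLE guidelines.

```python
# Hypothetical shared bottom layer: a small set of agreed-upon categories.
CORE_CATEGORIES = {"NOUN", "VERB", "ADJ"}

# Hypothetical scheme-specific modules, each mapping its own tags to the core.
SCHEME_A_TO_CORE = {"N-sg": "NOUN", "N-pl": "NOUN", "V-fin": "VERB", "A": "ADJ"}
SCHEME_B_TO_CORE = {"subst": "NOUN", "verb": "VERB", "adj": "ADJ"}


def to_core(tag, mapping):
    """Map a scheme-specific tag to its core category, or raise KeyError."""
    core = mapping.get(tag)
    if core not in CORE_CATEGORIES:
        raise KeyError(f"no core mapping for tag {tag!r}")
    return core


def comparable(tag_a, tag_b):
    """True if a scheme A tag and a scheme B tag share a core category."""
    return to_core(tag_a, SCHEME_A_TO_CORE) == to_core(tag_b, SCHEME_B_TO_CORE)
```

The two schemes never need to be identical; they only need a mapping into the common layer, which is where finer distinctions (such as number in scheme A) are deliberately dropped.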

Encoding Formats
  standardized formats required for
    – data interchange
    – enabling easy human-readable display and access
  may or may not serve as direct input to tools
  but must be capable of capturing all information that is input and output of tools

XML
  international standard, web compatible
  used in several corpus-handling applications
    – LT XML (Edinburgh)
    – ATLAS (NIST)
    – XCES (EAGLES)
    – American National Corpus
  provides good tools for linkage, search and extraction, validation and error reduction

Data Architectures
  must support:
    – full range of annotation types
    – alternative annotations and versions
    – different languages
    – different media and modalities (e.g., text, speech signal, audio, video, image)
    – potentially complex linkage among documents, parts of documents, and different modalities

"Stand-off" Data Architecture
  annotations maintained in separate documents that point back to the original
  yields a "hyper-document" composed of the original text and all annotations
  increasingly accepted as the appropriate architecture for language resources
    – MULTEXT, LT NSL and LT XML, ATLAS, CES and XCES, ANC
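
A minimal sketch of the stand-off idea, assuming character offsets as the pointing mechanism: the base text is never modified, each annotation layer lives in its own list, and the "hyper-document" is assembled on demand. The class name, layer names, and tag values are hypothetical; the systems listed above use XML linkage rather than Python objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffAnnotation:
    layer: str   # hypothetical layer name, e.g. "token" or "pos"
    start: int   # character offset into the base text (inclusive)
    end: int     # character offset into the base text (exclusive)
    value: str   # the annotation content itself


# The original document stays untouched.
BASE_TEXT = "Resources are expensive"

# Each layer is kept separately and points back into BASE_TEXT.
TOKEN_LAYER = [
    StandoffAnnotation("token", 0, 9, "w1"),
    StandoffAnnotation("token", 10, 13, "w2"),
    StandoffAnnotation("token", 14, 23, "w3"),
]
POS_LAYER = [
    StandoffAnnotation("pos", 0, 9, "NOUN"),
    StandoffAnnotation("pos", 10, 13, "VERB"),
    StandoffAnnotation("pos", 14, 23, "ADJ"),
]


def resolve(text, annotation):
    """Return the stretch of base text that an annotation points to."""
    return text[annotation.start:annotation.end]


def hyper_document(text, *layers):
    """Merge any number of layers into one view ordered by position."""
    merged = [a for layer in layers for a in layer]
    merged.sort(key=lambda a: (a.start, a.end, a.layer))
    return [(a.layer, a.value, resolve(text, a)) for a in merged]
```

Because layers are independent, an alternative part-of-speech layer or a new version can be added without touching either the base text or the existing layers.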

Advantages
  avoids unwieldy documents
  allows for versioning, alternative annotations
  XML mechanisms support complex inter-document linkage, linking various media
  XSLT enables selecting, transforming, adding to multiple documents to create new document

Data Models
  XML support for easy transduction of tags makes common tag set less an issue
  But... must have a common underlying data model
    – formalized description of data objects
        composition, attributes, class membership, applicable procedures, etc.
        relations among these, independent of instantiation in any particular form

The data model...
  must be able to capture structure and relations in diverse types of data and annotations
  impacts the design of annotation schema, encoding formats, data and tool architectures
  is the most important current need for corpus-based work

Existing models
  TIPSTER
    – object-oriented
    – designed for use in IE
  ATLAS
    – annotation graph formalism
    – designed for use in speech
  Design strongly influenced by background assumptions that may not scale up

Abstraction
  an annotation is a one- or two-way link between
    – an annotation object, and
    – a point or span (or a list/set of points or spans) within a base data set
  Links may or may not have a semantics
  Points and spans may be objects, or sets/lists of objects
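
One way to write this abstraction down as types is sketched below. All class and field names are hypothetical; the slide fixes only the concepts (annotation object, point, span, list/set of these, optional link semantics, link direction), not any particular representation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Point:
    offset: int  # a single position in the base data


@dataclass(frozen=True)
class Span:
    start: int   # inclusive
    end: int     # exclusive


# A target is one point or span, or a tuple (list) of points and spans.
Target = Union[Point, Span, Tuple[Union[Point, Span], ...]]


@dataclass(frozen=True)
class AnnotationObject:
    kind: str                                   # e.g. "phrasal-verb"
    features: Tuple[Tuple[str, str], ...] = ()  # attribute/value pairs


@dataclass(frozen=True)
class Link:
    annotation: AnnotationObject
    target: Target
    two_way: bool = False            # one-way unless stated otherwise
    semantics: Optional[str] = None  # links may or may not have a semantics


def targets_of(link):
    """Return the link's targets as a flat list of points and spans."""
    if isinstance(link.target, tuple):
        return list(link.target)
    return [link.target]


# A discontinuous unit: "look ... up" in a hypothetical base text.
BASE = "look the word up"
PHRASAL = Link(
    AnnotationObject("phrasal-verb", (("lemma", "look up"),)),
    (Span(0, 4), Span(14, 16)),
)
```

Allowing a tuple of spans as a single target is what lets one annotation cover material that is not contiguous in the base data.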

Observations
  assumes fundamental linearity of objects in the base
    – time line (speech)
    – sequence of characters, words, sentences, etc.
    – pixel data
    – etc.
  the granularity of the data representation and encoding is critical
  Targets may be individual objects or sets or lists of objects, so information with more than one dimension is accommodated

Implications
  annotation scheme must be mappable to the structures defined for annotation objects
  encoding scheme must be able to capture the object structure and relations expressed in the model (e.g., class membership and inheritance)
  requires sophisticated means to specify linkage
  consider logistics of identifying spans by enclosing them in start and end tags (enabling hierarchical grouping of objects in the data), vs. explicit addressing of start and end points
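
The contrast between enclosing tags and explicit addressing can be made concrete with a small conversion, sketched here with Python's standard `xml.etree.ElementTree`. The element names `s` and `w` and the sample sentence are hypothetical; the function turns each start/end tag pair into an explicit (tag, start, end) triple over the recovered plain text.

```python
import xml.etree.ElementTree as ET

# Spans identified by enclosing start and end tags (hierarchical grouping).
INLINE = "<s><w>Resources</w> <w>are</w> <w>expensive</w></s>"


def inline_to_standoff(xml_string):
    """Return (plain_text, spans), where spans are (tag, start, end) triples."""
    root = ET.fromstring(xml_string)
    text_parts = []
    spans = []
    position = 0

    def visit(element):
        nonlocal position
        start = position
        if element.text:
            text_parts.append(element.text)
            position += len(element.text)
        for child in element:
            visit(child)
            # Text after a child's end tag belongs to the parent element.
            if child.tail:
                text_parts.append(child.tail)
                position += len(child.tail)
        # Children are recorded before the element that contains them.
        spans.append((element.tag, start, position))

    visit(root)
    return "".join(text_parts), spans
```

Enclosing tags give the hierarchy for free but force a single tree; explicit start and end points lose nothing in this direction and additionally permit overlapping spans, at the cost of needing a linkage mechanism.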

Implications...
  must be possible to represent objects and relations in some form that is both usable by a variety of tools and prevents information loss
    – ideally, in a variety of formats suitable to different tools and applications

Recommendation
  Form a group to study this, consisting of representatives for
    – different areas of LE (text, speech, etc.)
    – different languages, geographical location
    – different media
    – different user needs
    – Information Retrieval and Computer Science

Tools and Tool Architectures
  must support multi-lingual, multi-modal data
  must be flexible
    – adaptable to different annotation schemes, different applications
  must be extensible
  must be reusable

Existing systems
  MULTEXT (1994)
    – developed fundamental data and tool architecture for corpora used in subsequent systems
        tool modularity, pipeline tool architecture
        API interface
        SGML encoding standard for linguistic annotation (CES)
        concept of "stand-off" annotation
  LT XML (1999), U of Edinburgh
    – grew out of MULTEXT
    – views XML files as either
        flat stream of markup and text
        tree-structured XML
    – powerful query language
  GATE (Sheffield)
    – implements TIPSTER data and tool architecture
    – object model for data and annotation
    – modular tool design, very extensible
  ATLAS (2000) (NIST)
    – still in development
    – layered data and tool architecture similar to previous systems
    – annotation graph formalism instantiated in XML

Agreement on tools/systems
  tool architecture
    – "plug-and-play"
    – modular
    – layered design
        physical storage representation
        intermediate data representation (model)
        API to enable application development
  query capability
  stand-off data architecture
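
A minimal sketch of a modular, "plug-and-play" pipeline over a stand-off document, assuming that every tool shares one intermediate representation and adds its own layer. The `Document` class, the two toy tools, and the layer names are hypothetical and do not reproduce the API of MULTEXT, LT XML, GATE, or ATLAS.

```python
from typing import Callable, Dict, List, Tuple

Annotation = Tuple[int, int, str]  # (start offset, end offset, value)


class Document:
    """Shared intermediate representation: base text plus stand-off layers."""

    def __init__(self, text):
        self.text = text
        self.layers: Dict[str, List[Annotation]] = {}


Tool = Callable[[Document], Document]


def whitespace_tokenizer(doc):
    """Toy tool: add a 'token' layer by splitting on whitespace."""
    tokens = []
    position = 0
    for word in doc.text.split():
        start = doc.text.index(word, position)
        end = start + len(word)
        tokens.append((start, end, word))
        position = end
    doc.layers["token"] = tokens
    return doc


def capitalization_tagger(doc):
    """Toy tool: add a 'case' layer; depends only on the 'token' layer."""
    doc.layers["case"] = [
        (start, end, "capitalized" if doc.text[start].isupper() else "lower")
        for start, end, _ in doc.layers["token"]
    ]
    return doc


def run_pipeline(doc, tools):
    """Apply the tools in order; each one reads and extends the document."""
    for tool in tools:
        doc = tool(doc)
    return doc
```

Any tool with the same signature can be swapped in or appended, for example `run_pipeline(Document("Towards an American National Corpus"), [whitespace_tokenizer, capitalization_tagger])`; tools communicate only through named layers, never through each other's internals.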

Details to work out
  data model
  level to extend notion of modularity
    – gross function, or minimal function?
  best means to accommodate different languages, modalities
    – engine-based approach, language- or medium-specific knowledge as data?

Tool Support Components
  resources are large
  compression and indexing required for a usable system
    – compression is easy
        excellent compression techniques for XML data
    – indexing is trickier
        good techniques for full-text search exist
        but... may not scale up to more complex data
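
For reference, the well-understood case of full-text indexing can be sketched as a simple inverted index. This is an illustrative assumption about what "full-text search" covers here, with hypothetical function names and whitespace tokenization; it deliberately indexes only flat word occurrences, which is the kind of technique that may not carry over to richly linked, multi-layer annotation.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A query such as "all noun phrases that overlap a pause in the speech signal" has no counterpart in this structure, since it relates spans across layers rather than words within one document.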

Non-traditional data
  Documents with diagrams, engineering drawings
  Illustrated books, with body text and illustration intermingled or overlaid
  Manuscripts in which the physical details of the calligraphy and media matter
  Interlinked texts, including output of machine translation systems, speech transcription efforts, lexicographic endeavors
  Databases of phonetic phenomena
  Personal and public information spaces: hard disk folder structures, mailing list archives, personal archives, voice mailboxes, etc.
  Dialogue
  etc.

Recommendations
  develop architectures that abandon the notion of a single distinguished time line
  adopt ideas from the database community
    – work on semi-structured data
    – work that views XML documents as a collection of documents with additional tags and relations between tags

Conclusion
  design tools and resources not based on needs of a particular research community
  open architecture approach
  build on existing standards, emerging consensus
  (widely) distributed development
  involve other relevant communities (IR, CS)