EAGLES/ISLE Workshop LREC 2000 Athens, Greece Requirements, Tools, and Architectures for Annotated Corpora Nancy Ide Vassar College Chris Brew Ohio State.

Presentation transcript:

EAGLES/ISLE Workshop, LREC 2000, Athens, Greece
Requirements, Tools, and Architectures for Annotated Corpora
Nancy Ide, Vassar College
Chris Brew, Ohio State University

Data Architectures and Software Support for Large Corpora
Towards an American National Corpus

Resources are expensive!
- funders expect to amortize the cost of resource creation over several projects
- researchers don't want to reinvent the wheel
- want to be able to accommodate uses for corpora and tools that may not yet be envisaged

- cross-disciplinary acceptance no longer an option
- we need
  - reusability, to avoid unnecessary labor and cost
  - flexibility and extensibility, to accommodate different applications, different modes and media, different approaches, and potential future uses

Areas for consideration
- Annotation formats: format of annotations themselves
- Encoding formats: markup scheme used to identify and delineate elements in the data
- Data architecture: organization of data in terms of document structure, linkage
- Tools architecture: framework for tool interoperability
- Tool support components: facilities to enable tools to work efficiently

Annotation Formats
- need not be identical to achieve commonality
- must work toward specifications that enable mapping among annotations of the same type
- EAGLES/ISLE guidelines
  - layered model
    - universally agreed-upon and applicable specifications at the bottom
    - modules for specific languages, applications, and/or theoretical approaches at higher levels

Encoding Formats
- standardized formats required for
  - data interchange
  - enabling easy human-readable display and access
- may or may not serve as direct input to tools
- but must be capable of capturing all information that is input and output of tools

XML
- international standard, web compatible
- used in several corpus-handling applications
  - LT XML (Edinburgh)
  - ATLAS (NIST)
  - XCES (EAGLES)
  - American National Corpus
- provides good tools for linkage, search and extraction, validation and error reduction

Data Architectures
must support:
- full range of annotation types
- alternative annotations and versions
- different languages
- different media and modalities (e.g., text, speech signal, audio, video, image)
- potentially complex linkage among documents, parts of documents, and different modalities

"Stand-off" Data Architecture
- annotations maintained in separate documents that point back to the original
- yields a "hyper-document" composed of the original text and all annotations
- increasingly accepted as the appropriate architecture for language resources
  - MULTEXT, LT NSL and LT XML, ATLAS, CES and XCES, ANC
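
As a rough illustration of the stand-off idea, the sketch below keeps a base text untouched and stores each annotation layer separately as records that point back into it by character offset. The sentence, the layer names, and the helpers `resolve` and `hyper_document` are all hypothetical; real systems such as XCES or ATLAS use XML linking rather than Python tuples.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical base text; in a stand-off architecture it is never modified.
BASE_TEXT = "Nancy Ide teaches at Vassar College."


@dataclass(frozen=True)
class StandoffAnnotation:
    layer: str  # name of the annotation document this record belongs to
    start: int  # character offset into the base text (inclusive)
    end: int    # character offset into the base text (exclusive)
    label: str


# Two separate "annotation documents", each pointing back into BASE_TEXT.
ENTITY_LAYER = [
    StandoffAnnotation("entity", 0, 9, "PERSON"),
    StandoffAnnotation("entity", 21, 35, "ORGANIZATION"),
]
POS_LAYER = [
    StandoffAnnotation("pos", 0, 5, "NNP"),
    StandoffAnnotation("pos", 6, 9, "NNP"),
    StandoffAnnotation("pos", 10, 17, "VBZ"),
]


def resolve(text: str, ann: StandoffAnnotation) -> str:
    """Follow an annotation's pointer back to the span it covers."""
    return text[ann.start:ann.end]


def hyper_document(
    text: str, *layers: List[StandoffAnnotation]
) -> List[Tuple[int, int, str, str, str]]:
    """Combine the base text with any number of layers into one sorted view."""
    merged = [
        (a.start, a.end, a.layer, a.label, resolve(text, a))
        for layer in layers
        for a in layer
    ]
    return sorted(merged)


if __name__ == "__main__":
    for row in hyper_document(BASE_TEXT, ENTITY_LAYER, POS_LAYER):
        print(row)
```

Because each layer is its own list, an alternative or newer version of a layer can be swapped in without touching the base text or the other layers.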

Advantages
- avoids unwieldy documents
- allows for versioning, alternative annotations
- XML mechanisms support complex inter-document linkage, linking various media
- XSLT enables selecting, transforming, adding to multiple documents to create new document

Data Models
- XML support for easy transduction of tags makes common tag set less an issue
- But... must have a common underlying data model
  - formalized description of
    - data objects: composition, attributes, class membership, applicable procedures, etc.
    - relations among these, independent of instantiation in any particular form

The data model...
- must be able to capture structure and relations in diverse types of data and annotations
- impacts the design of annotation schema, encoding formats, data and tool architectures
- is the most important current need for corpus-based work

Existing models
- TIPSTER
  - object-oriented
  - designed for use in IE
- ATLAS
  - annotation graph formalism
  - designed for use in speech
- Design strongly influenced by background assumptions that may not scale up

Abstraction
- an annotation is a one- or two-way link between
  - an annotation object, and
  - a point or span (or a list/set of points or spans) within a base data set
- Links may or may not have a semantics
- Points and spans may be objects, or sets/lists of objects
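
One way to read this abstraction is as three small record types: a target (point or span), an annotation object, and a link between them. The sketch below is only an assumed rendering of the slide, not a model from TIPSTER or ATLAS; every class and field name (`Point`, `Span`, `AnnotationObject`, `Link`, `two_way`, `semantics`) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Point:
    offset: int  # a single position in the base data


@dataclass(frozen=True)
class Span:
    start: int  # inclusive
    end: int    # exclusive


Target = Union[Point, Span]


@dataclass
class AnnotationObject:
    kind: str
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class Link:
    annotation: AnnotationObject
    targets: Tuple[Target, ...]      # one target, or a list/set of them
    two_way: bool = False            # one-way unless stated otherwise
    semantics: Optional[str] = None  # links may or may not carry a meaning


def covered_text(base: str, link: Link) -> List[str]:
    """Return the base-data text under each target; a point covers none."""
    out = []
    for target in link.targets:
        if isinstance(target, Span):
            out.append(base[target.start:target.end])
        else:
            out.append("")
    return out


# Hypothetical example: a discontinuous particle verb needs two spans.
BASE = "look the word up"
PARTICLE_VERB = Link(
    annotation=AnnotationObject("verb", {"lemma": "look up"}),
    targets=(Span(0, 4), Span(14, 16)),
    semantics="realizes",
)
SENTENCE_END = Link(AnnotationObject("boundary"), (Point(16),))
```

Allowing a tuple of targets is what lets a single annotation cover discontinuous or multi-part material.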

Observations
- assumes fundamental linearity of objects in the base
  - time line (speech)
  - sequence of characters, words, sentences, etc.
  - pixel data
  - etc.
- the granularity of the data representation and encoding is critical
- Targets may be individual objects or sets or lists of objects, so information with more than one dimension is accommodated

Implications
- annotation scheme must be mappable to the structures defined for annotation objects
- encoding scheme must be able to capture the object structure and relations expressed in the model (e.g., class membership and inheritance)
- requires sophisticated means to specify linkage
  - consider logistics of identifying spans by enclosing them in start and end tags (enabling hierarchical grouping of objects in the data), vs. explicit addressing of start and end points
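
The contrast between enclosing tags and explicit addressing can be made concrete with a small conversion. The sketch below, using only Python's standard `xml.etree.ElementTree`, takes a made-up inline-tagged sentence and derives the plain text plus explicit `(tag, start, end)` character offsets for every element. The function name `inline_to_standoff` and the sample markup are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def inline_to_standoff(xml_string: str) -> Tuple[str, List[Tuple[str, int, int]]]:
    """Turn inline start/end tags into plain text plus explicit offsets."""
    root = ET.fromstring(xml_string)
    pieces: List[str] = []
    spans: List[Tuple[str, int, int]] = []
    pos = 0  # number of characters emitted so far

    def walk(elem: ET.Element) -> None:
        nonlocal pos
        start = pos
        if elem.text:
            pieces.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            walk(child)
            if child.tail:  # text after the child's end tag belongs to elem
                pieces.append(child.tail)
                pos += len(child.tail)
        spans.append((elem.tag, start, pos))

    walk(root)
    # Outermost elements first, so the hierarchy stays readable.
    spans.sort(key=lambda s: (s[1], -s[2]))
    return "".join(pieces), spans


# Hypothetical inline-annotated sentence.
INLINE = "<s><np>The corpus</np> <vp>is <adj>large</adj></vp>.</s>"

if __name__ == "__main__":
    text, spans = inline_to_standoff(INLINE)
    print(text)
    for tag, start, end in spans:
        print(tag, start, end, repr(text[start:end]))
```

The inline form gets hierarchical grouping for free from tag nesting, while the offset form can also express overlapping spans, which nested tags cannot.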

Implications...
- must be possible to represent objects and relations in some form that is both usable by a variety of tools and prevents information loss
  - ideally, in a variety of formats suitable to different tools and applications

Recommendation
Form a group to study this, consisting of representatives for
- different areas of LE (text, speech, etc.)
- different languages, geographical location
- different media
- different user needs
- Information Retrieval and Computer Science

Tools and Tool Architectures
- must support multi-lingual, multi-modal data
- must be flexible
  - adaptable to different annotation schemes, different applications
- must be extensible
- must be reusable

Existing systems
MULTEXT (1994)
- developed fundamental data and tool architecture for corpora used in subsequent systems
  - tool modularity, pipeline tool architecture
  - API interface
  - SGML encoding standard for linguistic annotation (CES)
  - concept of "stand-off" annotation

LT XML (1999), U of Edinburgh
- grew out of MULTEXT
- views XML files as either
  - flat stream of markup and text
  - tree-structured XML
- powerful query language

GATE (Sheffield)
- implements TIPSTER data and tool architecture
- object model for data and annotation
- modular tool design, very extensible

ATLAS (2000) (NIST)
- still in development
- layered data and tool architecture similar to previous systems
- annotation graph formalism instantiated in XML

Agreement on tools/systems
- tool architecture
  - "plug-and-play"
  - modular
  - layered design
    - physical storage representation
    - intermediate data representation (model)
    - API to enable application development
- query capability
- stand-off data architecture
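
A minimal sketch of what a modular, "plug-and-play" pipeline might look like, assuming every tool shares one call signature and communicates only through named stand-off annotation layers. The tools `tokenizer` and `capitalized_tagger`, the `Layers` type, and `run_pipeline` are hypothetical and do not reproduce the API of MULTEXT, LT XML, GATE, or ATLAS.

```python
import re
from typing import Callable, Dict, List, Tuple

Annotation = Tuple[int, int, str]        # (start, end, label)
Layers = Dict[str, List[Annotation]]     # intermediate data representation
Tool = Callable[[str, Layers], Layers]   # the one interface every tool shares


def tokenizer(text: str, layers: Layers) -> Layers:
    """Add a 'token' layer; needs no earlier layers."""
    tokens = [(m.start(), m.end(), "token") for m in re.finditer(r"\w+", text)]
    return {**layers, "token": tokens}


def capitalized_tagger(text: str, layers: Layers) -> Layers:
    """Add a 'cap' layer built on the 'token' layer produced upstream."""
    caps = [(s, e, "CAP") for s, e, _ in layers["token"] if text[s].isupper()]
    return {**layers, "cap": caps}


def run_pipeline(text: str, tools: List[Tool]) -> Layers:
    """Run tools in order; each sees the text and all layers made so far."""
    layers: Layers = {}
    for tool in tools:
        layers = tool(text, layers)
    return layers


if __name__ == "__main__":
    result = run_pipeline("Athens hosted LREC", [tokenizer, capitalized_tagger])
    for name, annotations in result.items():
        print(name, annotations)
```

Because tools depend only on the shared layer structure, one tokenizer can be replaced by another (for a different language, say) without changing downstream tools, provided it still emits a `token` layer.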

Details to work out
- data model
- level to extend notion of modularity
  - gross function, or minimal function?
- best means to accommodate different languages, modalities
  - engine-based approach, language- or medium-specific knowledge as data?

Tool Support Components
- resources are large
- compression and indexing required for a usable system
  - compression is easy
    - excellent compression techniques for XML data
  - indexing is trickier
    - good techniques for full-text search exist
    - but... may not scale up to more complex data
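
For reference, the full-text technique alluded to here is usually some form of inverted index. The sketch below builds a small positional one over made-up documents; `build_index` and the sample data are hypothetical. It maps each word to the places it occurs, which is exactly the kind of structure that does not by itself answer queries over nested or linked annotations.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

Posting = Tuple[str, int]  # (document id, character offset of the word)


def build_index(docs: Dict[str, str]) -> Dict[str, List[Posting]]:
    """Map each lower-cased word to every (document, offset) where it occurs."""
    index: Dict[str, List[Posting]] = defaultdict(list)
    for doc_id, text in docs.items():
        for match in re.finditer(r"\w+", text.lower()):
            index[match.group()].append((doc_id, match.start()))
    return index


# Hypothetical two-document collection.
DOCS = {
    "d1": "Large corpora need indexing",
    "d2": "Indexing large data",
}

if __name__ == "__main__":
    index = build_index(DOCS)
    print(index["large"])
    print(index["indexing"])
```

Storing character offsets rather than only document ids means a hit can be joined against stand-off annotations that use the same offsets, though doing that efficiently at scale is the harder problem the slide points to.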

Non-traditional data
- Documents with diagrams, engineering drawings
- Illustrated books, with body text and illustration intermingled or overlaid
- Manuscripts in which the physical details of the calligraphy and media matter
- Interlinked texts, including output of machine translation systems, speech transcription efforts, lexicographic endeavors
- Databases of phonetic phenomena
- Personal and public information spaces: hard disk folder structures, mailing list archives, personal archives, voice mailboxes, etc.
- Dialogue
- etc.

Recommendations
- develop architectures that abandon the notion of a single distinguished time line
- adopt ideas from the database community
  - work on semi-structured data
  - work that views XML documents as a collection of documents with additional tags and relations between tags

Conclusion
- design tools and resources not based on needs of a particular research community
- open architecture approach
- build on existing standards, emerging consensus
- (widely) distributed development
- involve other relevant communities (IR, CS)