Encoding language corpora: current trends and future directions. Tomaž Erjavec, Department of Knowledge Technologies.


Encoding language corpora: current trends and future directions
Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
National Institute for Japanese Language

Overview
1. History and current practices in corpus encoding: TEI P4, CES
2. Open issues: multiple annotations, metadata and analytical tools
3. Future directions: TEI P5, ISO TC 37

I. Some history
- 1980s: corpora (and other language resources) were encoded in idiosyncratic formats, usually bound to specific tools
- corpora were expensive to produce, but
  – difficult to exchange and reuse
  – quickly became obsolete
- to address these problems, the Text Encoding Initiative was established in 1987
- the initiative came from humanities computing: sponsorship by ACH, ALLC, ACL

Text Encoding Initiative
- TEI is the only systematized attempt to develop a fully general text encoding model and set of encoding conventions based upon it
- intended for processing and analysis of any type of text, in any language
- main result: the TEI Guidelines for Electronic Text Encoding and Interchange
- SGML was chosen as the underlying standard for the TEI Guidelines
- drafts: TEI P1 (1990), TEI P2 (1993)

TEI P3 and P4
- the third version of the Guidelines, TEI P3 (1994), was published in two substantial green volumes (1200 pp.) and soon also on the Web
- a major revision, TEI P4, was published in 2002
- TEI P4 addresses the following issues:
  – error correction
  – provides equal support for XML and SGML
  – retains backward compatibility with TEI P3
- today, TEI P4 is the most widely used version of TEI: over 130 projects are listed on the TEI web pages

The TEI scheme
- TEI P4 consists of the written guidelines + a set of DTD fragments
- to obtain a project-specific DTD (TEI parameterisation), the DTD fragments are combined:
  1. core tagset (always present): includes the TEI header
  2. base tagsets (specific text types): e.g. prose, dictionaries, drama
  3. additional tagsets (particular analyses): e.g. dates & times, certainty, simple linguistic analysis
  4. user extensions, which extend or modify the TEI
- a widely used parameterisation of TEI: TEI Lite

What is good about TEI
- is a “standard”
- offers a rich vocabulary of tags with extensive documentation
- can be extended and modified
- many best practice scenarios
- software and user community support (tei-c web pages & tei-l mailing list)
- tutorials teaching TEI

What is bad about TEI
- steep learning curve (difficult to start using it)
- TEI is general, so tags are often too generic for the needs of particular projects; also, too deeply nested (tag bloat)
- it is often not clear how to encode a particular phenomenon (more than one possibility exists)
- while TEI is modular, it will still allow lots of tags that a project (encoder) has no need for
- never really became accepted in the computational linguistics community
- some areas missing or not up to date: computational lexicons, terminological databases, complex linguistic annotations

TEI for corpus encoding
- base module: TEI.prose
- additional modules:
  – TEI.corpus: additional tags in the header
  – TEI.analysis: tags for simple analytic mechanisms
  – TEI.linking: tags for linking, segmentation, and alignment
  – TEI.fs: tags for feature structure analysis
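The module selection can be sketched as the internal DTD subset of a TEI P4 document, in which each chosen tagset is switched on by a parameter entity set to 'INCLUDE'. This is a minimal sketch under stated assumptions, not a project's actual prologue: the helper name `tei_p4_doctype`, the system identifier `tei2.dtd` and the `TEI.XML` switch are assumptions for illustration, and the entity names simply reuse the module names listed on the slide.

```python
def tei_p4_doctype(base, additional, xml=True):
    """Build a DOCTYPE declaration that switches on the chosen TEI tagsets.

    Each tagset is enabled by declaring a parameter entity with the
    value 'INCLUDE' in the internal DTD subset.
    """
    entities = []
    if xml:
        # Assumed switch selecting the XML (rather than SGML) flavour.
        entities.append("TEI.XML")
    entities.append(base)            # the single base tagset
    entities.extend(additional)      # any number of additional tagsets
    lines = ['<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [']  # system id is assumed
    lines += [f"  <!ENTITY % {name} 'INCLUDE'>" for name in entities]
    lines.append("]>")
    return "\n".join(lines)


doctype = tei_p4_doctype(
    "TEI.prose",
    ["TEI.corpus", "TEI.analysis", "TEI.linking", "TEI.fs"],
)
print(doctype)
```

Keeping the selection in one function makes the parameterisation explicit: a different project would change only the two arguments.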

Example annotated text
Token sequence of the encoded sentence (one token per element, closed by </seg>): " Big Brother is watching you " the caption said .
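A segment of this kind can be read back with a few lines of code. The markup below is a hedged reconstruction rather than the slide's original encoding: the element names `<seg>`, `<w>` and `<c>` exist in the TEI analysis tagset, while the attribute values (`type="s"` and the lemmas) are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative encoding only: <w> marks words, <c> marks punctuation characters.
SEGMENT = """
<seg type="s">
  <c>"</c>
  <w lemma="big">Big</w>
  <w lemma="brother">Brother</w>
  <w lemma="be">is</w>
  <w lemma="watch">watching</w>
  <w lemma="you">you</w>
  <c>"</c>
  <w lemma="the">the</w>
  <w lemma="caption">caption</w>
  <w lemma="say">said</w>
  <c>.</c>
</seg>
"""

seg = ET.fromstring(SEGMENT)

# Word tokens paired with their lemmas; punctuation (<c>) is skipped.
words = [(w.text, w.get("lemma")) for w in seg.iter("w")]

# All tokens in document order, words and punctuation alike.
tokens = [child.text for child in seg]

print(words)
print(tokens)
```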

Example morphosyntactic encoding
In text: ženskama
In the MSD specification: <fsLib>......</fsLib><fLib>......</fLib>
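One way to read this arrangement: the word in the text carries a pointer to a feature structure in the `<fsLib>`, and that feature structure in turn lists features defined once in the `<fLib>`. The sketch below follows that chain of pointers. Everything specific in it is an assumption for illustration: the `<doc>` wrapper, the code `Ncfdi`, the identifiers and the feature names are hypothetical and are not copied from the actual specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the MSD code, ids and feature values are illustrative.
DOC = """
<doc>
  <w ana="Ncfdi">ženskama</w>
  <fsLib>
    <fs id="Ncfdi" type="Noun" feats="N1.c N2.f N3.d N4.i"/>
  </fsLib>
  <fLib>
    <f id="N1.c" name="Type"><sym value="common"/></f>
    <f id="N2.f" name="Gender"><sym value="feminine"/></f>
    <f id="N3.d" name="Number"><sym value="dual"/></f>
    <f id="N4.i" name="Case"><sym value="instrumental"/></f>
  </fLib>
</doc>
"""

root = ET.fromstring(DOC)
fs_by_id = {fs.get("id"): fs for fs in root.iter("fs")}
f_by_id = {f.get("id"): f for f in root.iter("f")}


def expand(word):
    """Follow ana -> <fs> -> feats -> <f> and return attribute-value pairs."""
    fs = fs_by_id[word.get("ana")]
    return {
        f_by_id[fid].get("name"): f_by_id[fid].find("sym").get("value")
        for fid in fs.get("feats").split()
    }


word = root.find("w")
features = expand(word)
print(word.text, features)
```

The design keeps the text compact (one short code per word) while the full attribute-value description is stored only once per code.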

CES: the Corpus Encoding Standard
- CES was developed in the scope of EU EAGLES, the Expert Advisory Group on Language Engineering Standards (1996)
- CES is an SGML DTD and is a particular parameterization (and modification) of TEI P3
- XCES (2002) is the XML version of CES
- (X)CES has been used in a number of corpus projects, mainly because it is simpler to use and understand than the full TEI
- however, there is no prescribed way to modify or extend it
- also, less strictly maintained than the TEI

II. Open issues
- multiple annotations
- metadata
- corpus analytical tools

Multiple annotations
More and more linguistic annotation is being added to the data, e.g.
- sentences, words, punctuation, part-of-speech, (morphosyntactic) tags, multi-word units (terms), named entities, syntactic structure, co-reference annotation (anaphora), word-sense information
- also rhetorical structure: quoted speech, paragraphs, lists, …
- even more annotation can be added to multimodal data, e.g. speech signals
- furthermore, the same level of analysis can be marked up by more than one tool / annotator

How to combine these annotations?
- simply have distinct tags & attributes for each of the phenomena covered
  – easy to understand and hand-edit
  – easy to validate
  – easy to process
- but XML requires a tree structure; what if the tags do not nest properly?

Crossing hierarchies
- simple example, page breaks vs. paragraph boundaries: … …. …
- a well-known problem for XML encoding, but with multiple annotations it is now becoming more severe
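The problem is easy to reproduce: a paragraph that runs across a page boundary cannot be wrapped in both a page element and a paragraph element, because the two would overlap instead of nesting. A minimal sketch, in which the element names `<page>` and `<p>` and the helper `is_well_formed` are assumptions for illustration:

```python
import xml.etree.ElementTree as ET

# A paragraph that starts on one page and ends on the next: the <page>
# and <p> elements overlap instead of nesting.
OVERLAPPING = "<text><page><p>first part</page><page>second part</p></page></text>"


def is_well_formed(xml_string):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(OVERLAPPING))
```

Any conforming XML parser rejects such input, so the overlap has to be encoded some other way.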

Solutions to crossing hierarchies
Discussed in TEI chapter 14 “Linking, Segmentation, and Alignment”:
- split elements: … …
- “milestones”, i.e. empty elements: … …. …
- but somewhat difficult to process and not very general
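The milestone approach can be sketched as follows: the paragraph stays a normal element, and the page boundary becomes an empty element inside it. The price is the processing difficulty the slide mentions, since page membership is no longer given by nesting and has to be computed by walking the document in order. The sample text and the helper `chunks_by_page` are assumptions for illustration; `<pb/>` with an `n` attribute is the TEI page-break milestone.

```python
import xml.etree.ElementTree as ET

# Milestone version: the page break is an empty <pb/> inside the paragraph.
MILESTONE = '<text><pb n="1"/><p>first part <pb n="2"/>second part</p></text>'


def chunks_by_page(elem, state=None, chunks=None):
    """Walk in document order and pair each text chunk with the current page."""
    if state is None:
        state, chunks = {"page": None}, []
    if elem.tag == "pb":
        state["page"] = elem.get("n")      # a milestone only updates the state
    if elem.text and elem.text.strip():
        chunks.append((state["page"], elem.text.strip()))
    for child in elem:
        chunks_by_page(child, state, chunks)
        # Text following a child belongs after that child in document order.
        if child.tail and child.tail.strip():
            chunks.append((state["page"], child.tail.strip()))
    return chunks


chunks = chunks_by_page(ET.fromstring(MILESTONE))
print(chunks)
```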

Stand-off markup
General solution to crossing hierarchies is to keep markup in separate documents that only point into the text (or other markup).
Several specific recommendations and projects:
- TEI P5 and TEI Workgroup on Stand-Off Markup, XLink and XPointer
- Annotation Graphs with AGTK
- TIGER annotation scheme
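The idea can be illustrated with a base document that holds only identified tokens and a separate annotation document whose spans point at those identifiers. Because the spans are not nested in the text, they may cross freely. This is a minimal sketch and does not follow any of the recommendations listed: the element names, the `from`/`to` attributes and the span types are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Base document: tokens only, each with an identifier (names are hypothetical).
TEXT = """
<text>
  <w id="w1">Big</w> <w id="w2">Brother</w> <w id="w3">is</w>
  <w id="w4">watching</w> <w id="w5">you</w>
</text>
"""

# Separate annotation document: spans point at token ids by from/to.
# The "name" and "line" spans cross each other, which inline XML could not express.
ANNOTATIONS = """
<annotations>
  <span type="name" from="w1" to="w2"/>
  <span type="line" from="w2" to="w4"/>
  <span type="clause" from="w1" to="w5"/>
</annotations>
"""

tokens = ET.fromstring(TEXT).findall("w")
order = {w.get("id"): i for i, w in enumerate(tokens)}


def resolve(span):
    """Return the words covered by a stand-off span."""
    start, end = order[span.get("from")], order[span.get("to")]
    return [w.text for w in tokens[start:end + 1]]


resolved = {
    span.get("type"): resolve(span)
    for span in ET.fromstring(ANNOTATIONS).iter("span")
}
print(resolved)
```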

Stand-off markup example: TIGER
…. ….
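In the TIGER-XML style, the words of a sentence are listed as terminals and the syntactic structure is expressed by nonterminal nodes whose edges refer to other nodes by identifier instead of containing them. The fragment below is a simplified, hand-written illustration in that style and is not taken from the TIGER corpus; the sentence, the identifiers, the category and edge labels, and the helper `bracket` are assumptions.

```python
import xml.etree.ElementTree as ET

# Simplified fragment in the style of TIGER-XML; contents are illustrative.
SENTENCE = """
<s id="s1">
  <graph root="s1_501">
    <terminals>
      <t id="s1_1" word="Big" pos="ADJ"/>
      <t id="s1_2" word="Brother" pos="N"/>
      <t id="s1_3" word="watches" pos="V"/>
    </terminals>
    <nonterminals>
      <nt id="s1_500" cat="NP">
        <edge label="MOD" idref="s1_1"/>
        <edge label="HD" idref="s1_2"/>
      </nt>
      <nt id="s1_501" cat="S">
        <edge label="SB" idref="s1_500"/>
        <edge label="HD" idref="s1_3"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""

graph = ET.fromstring(SENTENCE).find("graph")
# Index every node that carries an id, terminals and nonterminals alike.
nodes = {n.get("id"): n for n in graph.iter() if n.get("id")}


def bracket(node_id):
    """Render the subtree under node_id as a labelled bracketing."""
    node = nodes[node_id]
    if node.tag == "t":                       # terminal: a word of the text
        return node.get("word")
    children = " ".join(bracket(e.get("idref")) for e in node.findall("edge"))
    return f"[{node.get('cat')} {children}]"


tree = bracket(graph.get("root"))
print(tree)
```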

Problems with stand-off markup
- need tools to link the data: more difficult processing and editing
- no automatic validity checking: consistency, cycles
- difficult to change (correct) primary data or downstream annotations
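Since a schema validator does not follow pointers between documents, the consistency and cycle checks have to be written by hand. A minimal sketch, assuming the annotation layer has already been loaded into a dictionary that maps each identifier to the identifiers it points to; the function name `check_pointers` and the sample data are hypothetical.

```python
def check_pointers(nodes):
    """Check a stand-off annotation layer given as {id: [ids it points to]}.

    Returns (dangling, has_cycle): the set of targets that do not exist,
    and whether following the pointers can loop back on itself.
    """
    dangling = {t for targets in nodes.values() for t in targets if t not in nodes}

    WHITE, GREY, BLACK = 0, 1, 2            # unvisited, in progress, finished
    colour = {n: WHITE for n in nodes}

    def visit(n):
        colour[n] = GREY
        for t in nodes[n]:
            if t not in nodes:
                continue                     # dangling pointers are reported above
            if colour[t] == GREY:            # pointer back into the current path
                return True
            if colour[t] == WHITE and visit(t):
                return True
        colour[n] = BLACK
        return False

    has_cycle = any(colour[n] == WHITE and visit(n) for n in nodes)
    return dangling, has_cycle


# Hypothetical layers: tokens point nowhere, phrases point at tokens or phrases.
good = {"w1": [], "w2": [], "np1": ["w1", "w2"], "s1": ["np1"]}
bad = {"w1": [], "np1": ["w1", "w9"], "s1": ["np1", "s2"], "s2": ["s1"]}

print(check_pointers(good))
print(check_pointers(bad))
```

The depth-first search is recursive, so for very deep annotation graphs an iterative version would be the safer choice.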

Metadata
- description of the corpus or corpus elements
- traditional bibliographic standards (MARC)
- but computer corpora need to be documented also along other dimensions: availability, size, markup used, relation of digital file to source text, etc.
- EAD developed for archives, but many similarities to corpus description
- a meta-data recommendation closely coupled with the data itself is the TEI header

TEI header
<teiHeader> is an obligatory part of every TEI document and consists of:
- <fileDesc>, file description: full bibliographical description of the computer file itself; includes information about the source or sources of the electronic text
- <encodingDesc>, encoding description: describes relationship between electronic text and its source: normalization, ambiguity resolution, levels of encoding or analysis, etc.
- <profileDesc>, text profile: classificatory & contextual information, e.g. subject matter. Important for corpora, to perform retrievals from a body of text in terms of text type or origin (taxonomies)
- <revisionDesc>, revision history: history of changes made during the development of the electronic text
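The four parts of the header correspond to the elements `<fileDesc>`, `<encodingDesc>`, `<profileDesc>` and `<revisionDesc>` inside `<teiHeader>`. The skeleton below is a sketch for orientation only: all textual content is placeholder, the nesting is simplified, and it has not been validated against the TEI DTD.

```python
import xml.etree.ElementTree as ET

# Skeleton header; every piece of text in it is a placeholder.
HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt><title>Sample corpus</title></titleStmt>
    <publicationStmt><p>Placeholder availability statement.</p></publicationStmt>
    <sourceDesc><p>Placeholder description of the source text.</p></sourceDesc>
  </fileDesc>
  <encodingDesc><p>Placeholder note on normalization.</p></encodingDesc>
  <profileDesc>
    <langUsage><language id="en">English</language></langUsage>
  </profileDesc>
  <revisionDesc>
    <change><item>Placeholder description of a change.</item></change>
  </revisionDesc>
</teiHeader>
"""

PARTS = ["fileDesc", "encodingDesc", "profileDesc", "revisionDesc"]

header = ET.fromstring(HEADER)

# Report which of the four header parts are present.
present = {part: header.find(part) is not None for part in PARTS}

# Pull one bibliographic field out of the file description.
title = header.findtext("fileDesc/titleStmt/title")

print(present)
print(title)
```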

TEI header II.
- an example of a TEI header
- very detailed information is possible, but again, many ways to express the same information (e.g. free text or structured in elements)
- stricter, but poorer alternatives exist: Dublin Core

Dublin Core
- Dublin Core Metadata Initiative (DCMI) was founded in 1995 with the aim to create a core set of meta-data descriptions for Web-based resources that would be useful for categorizing the Web for easier search and retrieval
- Dublin Core Metadata Element Set (DCES) defines 15 elements, i.e.: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights
- can be extended
- DC is used e.g. by the Open Language Archives Community (OLAC)
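A flat Dublin Core description is simple enough to build and check in a few lines. The sketch below restricts a record to the 15 element names and serialises it with the `dc` namespace. The wrapper element `record`, the helper `dc_record` and all field values are assumptions for illustration; this is not the OLAC record format.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The 15 elements of the Dublin Core Metadata Element Set, in lower case.
DCES = ["title", "creator", "subject", "description", "publisher",
        "contributor", "date", "type", "format", "identifier",
        "source", "language", "relation", "coverage", "rights"]


def dc_record(**fields):
    """Build a flat record; reject names outside the 15-element set."""
    record = ET.Element("record")          # wrapper element name is assumed
    for name, value in fields.items():
        if name not in DCES:
            raise ValueError(f"not a Dublin Core element: {name}")
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


record = dc_record(
    title="Sample corpus",
    creator="A. N. Author",
    language="sl",
    type="Text",
    format="text/xml",
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The closed vocabulary is what makes this scheme stricter than a TEI header: a field such as `author` is rejected rather than silently accepted.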

Corpus analytical tools
Currently, many corpus exploration tools exist, and they typically offer:
- search with regular expressions over strings
- sometimes search over (lemma/PoS) annotations
- concordance and word frequency list display of results
- sometimes search and display of parallel corpora
- sometimes basic statistical tests (keywordness, collocation strength)
- examples: WordSmith, MonoConc, IMS CQP, Manatee/Bonito, SARA/Xaira, Tigersearch
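The two most common displays, a keyword-in-context concordance and a word frequency list, can be sketched with the standard library alone. This is a toy illustration over a made-up two-sentence text and is not how any of the tools named above is implemented; the function name `concordance` and the `width` parameter are assumptions.

```python
import re
from collections import Counter

TEXT = ("Big Brother is watching you, the caption said. "
        "The caption beneath the poster said that Big Brother is watching.")

# Crude tokenisation: lower-case the text and keep runs of word characters.
tokens = re.findall(r"\w+", TEXT.lower())
frequencies = Counter(tokens)


def concordance(pattern, tokens, width=3):
    """Return keyword-in-context triples for tokens fully matching a regex."""
    regex = re.compile(pattern)
    lines = []
    for i, token in enumerate(tokens):
        if regex.fullmatch(token):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


hits = concordance(r"watch\w*", tokens)
for left, key, right in hits:
    print(f"{left:>25} [{key}] {right}")
print(frequencies.most_common(3))
```

Searching over lemma or part-of-speech annotations would follow the same pattern, with the regular expression applied to an annotation attached to each token instead of the word form.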

What is missing
- possibility to combine different types of annotation in queries and displays, esp. for multimodal corpora
- integration of more powerful statistical methods, esp. for collocations and parallel corpora
- tools targeted to different types of users (e.g. Sketch Engine)
- merging of digital library viewers with corpus concordancing software

Corpora vs. digital libraries
- classical reference corpora were composed of samples, and interesting only for their linguistic content
- today, more and more corpora contain integral texts, which are of interest in themselves (e.g. historical texts)
- conversely, digital libraries are growing in size and accessibility and becoming interesting also for linguistic research
- what is needed is a system that can perform two tasks: enable selection of (fragments of) heavily structured (multimedia, text-critical) texts for reading, and allow for concordance views of selections
- currently the only available open-source system that attempts this is Philologic from the University of Chicago

III. Future directions
Two directions in standardisation of corpus and language resource annotation:
- next version of TEI, version P5
- work by ISO TC 37 SC4

TEI P5
- the next version of TEI, currently at beta stage: available, but not stable
- significantly revised and brought in line with current practices
- not backward compatible with P3/P4 (although scripts exist for conversion)
- formal specification based on the ISO Relax NG schema language (although DTD and W3C schemas also available)
- parameterisation also produces dedicated documentation

ISO TC 37
- ISO TC 37: ISO Technical Committee on Terminology, est
- maybe best known for ISO 639 and MARTIF
- in 2002 changed name to Technical Committee on Terminology and Other Language Resources
- also established ISO TC 37/SC 4 Sub-Committee on Language Resource Management

ISO TC 37 SC4 WGs
- WG 1: Basic descriptors and mechanisms for language resources
  – terminology used in language resources
  – basic mechanisms and data structures for linguistic representation
  – meta-data representation scheme to document linguistic information structures and processes
- WG 2: Representation schemes
  – definition of annotation/representation schemes for morpho-syntax and syntax
  – representation scheme for the semantic content of multimodal information
  – metadata for discourse level representation scheme
- WG 3: Multilingual text representation
  – translation memory and alignment of parallel corpora
  – segmentation and counting algorithms
  – meta-markup for Globalization, Internationalization and Localization (GIL)
- WG 4: Lexical databases
  – standardization of lexical representation formats for the various types of NLP applications (Machine Readable Lexica)
- WG 5: Workflow of language resource management
  – standardization of guidelines for language validation and net-based distributed cooperative work

WG4 standards
- Language Resource Management: Feature Structures
- Language Resource Management: Lexical Markup Framework (LMF)
- Language Resource Management: Morpho-syntactic Annotation Framework (MAF)
- all under development!

Conclusions
- I presented some history, current state and possible future directions in the field of encoding standardisation of, mainly, corpora
- the main recommendation (for me!) still seems to be TEI: combines tradition with innovation

Thank you!