Encoding language corpora: current trends and future directions
Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
tomaz.erjavec@ijs.si, http://nl.ijs.si/et/
National Institute for Japanese Language, 2006-09-28

National Institute for Japanese Language 2006-09-28 Tomaž Erjavec Dept. of Knowledge Technologies, Jozef Stefan Institute
Overview
1. History and current practices in corpus encoding: TEI P4, CES
2. Open issues: multiple annotations, metadata and analytical tools
3. Future directions: TEI P5, ISO TC 37

I. Some history
– 1980s: corpora (and other language resources) were encoded in idiosyncratic formats, usually bound to specific tools
– corpora were expensive to produce, but difficult to exchange and reuse, and they quickly became obsolete
– to address these problems, the Text Encoding Initiative was established in 1987
– the initiative came from humanities computing: sponsorship by ACH, ALLC, ACL

Text Encoding Initiative
– TEI is the only systematized attempt to develop a fully general text encoding model and a set of encoding conventions based upon it
– intended for processing and analysis of any type of text, in any language
– main result: the TEI Guidelines for Electronic Text Encoding and Interchange
– SGML was chosen as the underlying standard for the TEI Guidelines
– drafts: TEI P1 (1990), TEI P2 (1993)

TEI P3 and P4
– the third version of the Guidelines, TEI P3 (1994), was published in two substantial green volumes (1200 pp.) and soon also on the Web
– a major revision, TEI P4, was published in 2002
– TEI P4 addresses the following issues:
  – error correction
  – provides equal support for XML and SGML
  – retains backward compatibility with TEI P3
– today, TEI P4 is the most widely used version of TEI: over 130 projects are listed on the TEI web pages

The TEI scheme
– TEI P4 consists of the written guidelines plus a set of DTD fragments
– to obtain a project-specific DTD (TEI parameterisation), the DTD fragments are combined:
  1. core tagset (always present), which includes the TEI header
  2. base tagsets (specific text types), e.g. prose, dictionaries, drama
  3. additional tagsets (particular analyses), e.g. dates & times, certainty, simple linguistic analysis
  4. user extensions, which extend or modify the TEI
– a widely used parameterisation of TEI: TEI Lite

What is good about TEI
– is a “standard”
– offers a rich vocabulary of tags with extensive documentation
– can be extended and modified
– many best practice scenarios
– software and user community support (tei-c web pages & tei-l mailing list)
– tutorials teaching TEI

What is bad about TEI
– steep learning curve (difficult to start using it)
– TEI is general, so tags are often too generic for the needs of particular projects; also, too deeply nested (tag bloat)
– it is often not clear how to encode a particular phenomenon (more than one possibility exists)
– while TEI is modular, it still allows lots of tags that a project (encoder) has no need for
– never really became accepted in the computational linguistics community
– some areas are missing or not up to date: computational lexicons, terminological databases, complex linguistic annotations

TEI for corpus encoding
– base module: TEI.prose
– additional modules:
  – TEI.corpus: additional tags in the header
  – TEI.analysis: tags for simple analytic mechanisms
  – TEI.linking: tags for linking, segmentation, and alignment
  – TEI.fs: tags for feature structure analysis
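As a rough illustration of how such a parameterisation is switched on in TEI P4, the sketch below generates a DOCTYPE declaration whose internal subset sets one parameter entity per tagset to INCLUDE. The function name `tei_p4_doctype` and the list `CORPUS_MODULES` are hypothetical helpers, and the system identifier `tei2.dtd` assumes a local copy of the P4 DTD; treat the output as a sketch to check against the Guidelines, not as a validated prolog.

```python
# Modules named on the slide for corpus encoding.
CORPUS_MODULES = ["TEI.prose", "TEI.corpus", "TEI.analysis", "TEI.linking", "TEI.fs"]


def tei_p4_doctype(modules, xml=True):
    """Return a DOCTYPE declaration that enables the given TEI P4 tagsets.

    Each tagset is enabled by declaring a parameter entity with the
    value 'INCLUDE' in the internal DTD subset.
    """
    lines = [
        '<!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN" "tei2.dtd" ['
    ]
    if xml:
        # Selects the XML (rather than SGML) flavour of the P4 DTD.
        lines.append("  <!ENTITY % TEI.XML 'INCLUDE'>")
    for module in modules:
        if not module.startswith("TEI."):
            raise ValueError(f"not a TEI tagset name: {module!r}")
        lines.append(f"  <!ENTITY % {module} 'INCLUDE'>")
    lines.append("]>")
    return "\n".join(lines)


print(tei_p4_doctype(CORPUS_MODULES))
```

The core tagset needs no entity of its own, since it is always present; only the base and additional tagsets are listed.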

Example annotated text
(element markup not preserved in this transcript; only the tokens and the closing </seg> tag survive)
" Big Brother is watching you " the caption said .
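Since the markup of this example did not survive, the following sketch shows one way a sentence like this could be tokenised into a `<seg>` holding `<w>` (word) and `<c>` (punctuation character) elements, which are the TEI.analysis tags for simple analytic mechanisms. The tokenisation rule and the function name `tokenise_to_seg` are assumptions made for illustration and are not a reconstruction of the original slide.

```python
import re
import xml.etree.ElementTree as ET


def tokenise_to_seg(sentence):
    """Wrap each word in <w> and each punctuation mark in <c>, inside one <seg>."""
    seg = ET.Element("seg")
    # Assumed tokenisation: runs of word characters, or single non-space symbols.
    for token in re.findall(r"\w+|[^\w\s]", sentence):
        tag = "w" if token[0].isalnum() else "c"
        ET.SubElement(seg, tag).text = token
    return seg


seg = tokenise_to_seg('"Big Brother is watching you" the caption said.')
print(ET.tostring(seg, encoding="unicode"))
```

Attributes such as lemmas or morphosyntactic tags would then be added to the `<w>` elements by later annotation steps.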

Example morphosyntactic encoding
In text: ženskama
In the MSD specification: <fsLib>......</fsLib> <fLib>......</fLib>
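The idea behind a feature-structure library is that a compact morphosyntactic description (MSD) attached to a word stands for a full set of feature-value pairs. The sketch below expands a positional tag into an `<fs>` element with `<f>` and `<sym>` children. The tag `Ncfdi` and the position table `NOUN_POSITIONS` are hypothetical, simplified illustrations and not the specification used in the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified table: one (feature name, code -> value) pair
# per character position after the part-of-speech letter "N".
NOUN_POSITIONS = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def msd_to_fs(msd):
    """Expand a positional noun tag into a feature structure element."""
    if not msd or msd[0] != "N":
        raise ValueError("this sketch only handles noun tags")
    features = [("pos", "noun")]
    for code, (name, values) in zip(msd[1:], NOUN_POSITIONS):
        if code == "-":  # assumed convention: '-' means "not applicable"
            continue
        features.append((name, values[code]))
    fs = ET.Element("fs", {"id": msd})
    for name, value in features:
        f = ET.SubElement(fs, "f", {"name": name})
        ET.SubElement(f, "sym", {"value": value})
    return fs


print(ET.tostring(msd_to_fs("Ncfdi"), encoding="unicode"))
```

In the text itself, a word would then only need to point at the identifier of the feature structure, keeping the running text compact.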

CES: the Corpus Encoding Standard
– CES was developed in the scope of EU EAGLES, the Expert Advisory Group on Language Engineering Standards (1996)
– CES is an SGML DTD and is a particular parameterization (and modification) of TEI P3
– XCES (2002) is the XML version of CES
– (X)CES has been used in a number of corpus projects, mainly because it is simpler to use and understand than the full TEI
– however, there is no prescribed way to modify or extend it
– also, it is less strictly maintained than the TEI

II. Open issues
– multiple annotations
– metadata
– corpus analytical tools

Multiple annotations
– more and more linguistic annotation is being added to the data, e.g. sentences, words, punctuation, part-of-speech (morphosyntactic) tags, multi-word units (terms), named entities, syntactic structure, co-reference annotation (anaphora), word-sense information
– also rhetorical structure: quoted speech, paragraphs, lists, …
– even more annotation can be added to multimodal data, e.g. speech signals
– furthermore, the same level of analysis can be marked up by more than one tool or annotator

How to combine these annotations?
– simply have distinct tags & attributes for each of the phenomena covered
  – easy to understand and hand-edit
  – easy to validate
  – easy to process
– but XML requires a tree structure; what if the tags do not nest properly?

Crossing hierarchies
– simple example, page breaks vs. paragraph boundaries: … …. …
– a well-known problem for XML encoding, but with multiple annotations it is now becoming more severe
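The problem can be demonstrated with any XML parser: as soon as a paragraph runs across a page boundary, marking both pages and paragraphs as elements produces overlapping tags, which are not well-formed. In the sketch below the element names `page` and `p` and both sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Paragraphs nest cleanly inside one page: a proper tree.
nested = "<text><page><p>One.</p><p>Two.</p></page></text>"

# The second paragraph starts on one page and ends on the next,
# so <p> and <page> overlap instead of nesting.
crossing = (
    "<text><page><p>One.</p><p>Two starts here</page>"
    "<page> and ends here.</p></page></text>"
)

print(is_well_formed(nested))
print(is_well_formed(crossing))
```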

Solutions to crossing hierarchies
Discussed in TEI chapter 14, “Linking, Segmentation, and Alignment”:
– split elements: … …
– “milestones”, i.e. empty elements: … …. …
– but somewhat difficult to process and not very general
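A minimal sketch of the milestone approach, and of why it takes extra processing: paragraphs stay as ordinary elements, the page boundary is an empty `<pb/>` element, and the page a piece of text belongs to has to be recovered by walking the tree in document order. The sample document and the function name `text_by_page` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<body>"
    '<p>First paragraph, which starts on page one <pb n="2"/>'
    "and ends on page two.</p>"
    "<p>Second paragraph.</p>"
    "</body>"
)


def text_by_page(root, first_page="1"):
    """Collect running text per page, using <pb n="..."/> milestones."""
    pages = {first_page: ""}
    current = first_page

    def walk(element):
        nonlocal current
        if element.tag == "pb":
            # A milestone has no content; it only switches the current page.
            current = element.get("n")
            pages.setdefault(current, "")
        elif element.text:
            pages[current] += element.text
        for child in element:
            walk(child)
            if child.tail:  # text following the child, inside this element
                pages[current] += child.tail

    walk(root)
    return pages


print(text_by_page(ET.fromstring(DOC)))
```

The page structure is no longer visible to a validator or to a simple path query, which is the processing cost the slide refers to.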

Stand-off markup
– the general solution to crossing hierarchies is to keep markup in separate documents that only point into the text (or into other markup)
– several specific recommendations and projects:
  – TEI P5 and the TEI Workgroup on Stand-Off Markup, XLink and XPointer
  – Annotation Graphs with AGTK
  – TIGER annotation scheme
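The principle can be sketched without any particular standard: the primary text is left untouched, and each annotation layer is a separate list of spans that point into it by character offset. Because the layers are independent, spans from different layers may cross freely. All names below (`Span`, `make_span`, the layer names and labels) are hypothetical and do not follow any of the schemes listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # offset of the first character
    end: int    # offset one past the last character
    label: str


PRIMARY = "Big Brother is watching you"


def make_span(text, substring, label):
    """Create a span pointing at the first occurrence of substring."""
    start = text.index(substring)
    return Span(start, start + len(substring), label)


# Two independent annotation layers over the same primary text.
layers = {
    "names": [make_span(PRIMARY, "Big Brother", "person")],
    "chunks": [make_span(PRIMARY, "Brother is watching", "example-chunk")],
}


def resolve(text, span):
    """Follow the pointer: return the stretch of text a span covers."""
    return text[span.start:span.end]


def crosses(a, b):
    """True if two spans overlap without one containing the other."""
    overlap = a.start < b.end and b.start < a.end
    nested = (a.start <= b.start and b.end <= a.end) or (
        b.start <= a.start and a.end <= b.end
    )
    return overlap and not nested


name, chunk = layers["names"][0], layers["chunks"][0]
print(resolve(PRIMARY, name), "|", resolve(PRIMARY, chunk), "|", crosses(name, chunk))
```

The two spans cross, which inline XML could not express directly, yet each layer on its own remains simple.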

Stand-off markup example: TIGER
…. ….
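The TIGER example itself did not survive in this transcript. As a stand-in, the sketch below parses a small fragment written in the style of TIGER-XML, where words are listed as terminals and syntactic nodes refer to their children by identifier through `idref`. The sentence, the identifiers, and the category and edge labels are invented for illustration, and the helper `yield_of` is hypothetical.

```python
import xml.etree.ElementTree as ET

TIGER_LIKE = """
<s id="s1">
  <graph root="s1_500">
    <terminals>
      <t id="s1_1" word="Big" pos="ADJ"/>
      <t id="s1_2" word="Brother" pos="N"/>
      <t id="s1_3" word="watches" pos="V"/>
    </terminals>
    <nonterminals>
      <nt id="s1_501" cat="NP">
        <edge label="MOD" idref="s1_1"/>
        <edge label="HD" idref="s1_2"/>
      </nt>
      <nt id="s1_500" cat="S">
        <edge label="SB" idref="s1_501"/>
        <edge label="HD" idref="s1_3"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""


def yield_of(graph, node_id):
    """Return the words dominated by a node, following idref pointers."""
    words = {t.get("id"): t.get("word") for t in graph.iter("t")}
    children = {
        nt.get("id"): [edge.get("idref") for edge in nt.iter("edge")]
        for nt in graph.iter("nt")
    }

    def collect(current):
        if current in words:
            return [words[current]]
        collected = []
        for child in children[current]:
            collected.extend(collect(child))
        return collected

    return collect(node_id)


graph = ET.fromstring(TIGER_LIKE.strip()).find("graph")
print(yield_of(graph, graph.get("root")))
```

The tree structure lives entirely in the pointers, so the XML nesting itself stays flat.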

Problems with stand-off markup
– tools are needed to link the data: more difficult processing and editing
– no automatic validity checking: consistency, cycles
– difficult to change (correct) primary data or downstream annotations
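Since an XML validator does not follow pointers between documents, checks of this kind have to be written separately. A minimal sketch, assuming the annotation has already been loaded into a set of terminal identifiers and a dictionary mapping each nonterminal to the identifiers it points at (both data shapes and the function name `check_standoff` are assumptions): it reports pointers to missing targets and cycles among nonterminals.

```python
def check_standoff(terminals, nonterminals):
    """Return a list of problems found in a pointer-based annotation.

    terminals: set of identifiers of primary units (e.g. words)
    nonterminals: dict mapping an identifier to the identifiers it refers to
    """
    problems = []
    known = set(terminals) | set(nonterminals)

    # Consistency: every pointer must lead to an existing unit.
    for node, refs in nonterminals.items():
        for ref in refs:
            if ref not in known:
                problems.append(f"dangling reference: {node} -> {ref}")

    # Cycles: depth-first search with three states per node.
    unvisited, in_progress, done = 0, 1, 2
    state = {node: unvisited for node in nonterminals}

    def visit(node):
        state[node] = in_progress
        for ref in nonterminals[node]:
            if ref not in nonterminals:
                continue  # terminals and dangling targets cannot form cycles
            if state[ref] == in_progress:
                problems.append(f"cycle through: {node} -> {ref}")
            elif state[ref] == unvisited:
                visit(ref)
        state[node] = done

    for node in nonterminals:
        if state[node] == unvisited:
            visit(node)
    return problems


print(check_standoff({"t1", "t2"}, {"n1": ["t1", "t2"]}))
print(check_standoff({"t1"}, {"n1": ["t1", "n2"], "n2": ["n1", "t9"]}))
```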

Metadata
– description of the corpus or corpus elements
– traditional bibliographic standards (MARC)
– but computer corpora also need to be documented along other dimensions: availability, size, markup used, relation of digital file to source text, etc.
– EAD was developed for archives, but has many similarities to corpus description
– a meta-data recommendation closely coupled with the data itself is the TEI header

TEI header
is an obligatory part of every TEI document and consists of:
– file description: full bibliographical description of the computer file itself; includes information about the source or sources of the electronic text
– encoding description: describes the relationship between the electronic text and its source: normalization, ambiguity resolution, levels of encoding or analysis, etc.
– text profile: classificatory & contextual information, e.g. subject matter; important for corpora, to perform retrievals from a body of text in terms of text type or origin (taxonomies)
– revision history: history of changes made during the development of the electronic text
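The four parts correspond to the header elements `fileDesc`, `encodingDesc`, `profileDesc` and `revisionDesc`. The sketch below builds a skeletal `teiHeader` with one placeholder entry in each. It is not validated against any TEI DTD, the content below the four main parts is deliberately simplified, and the function name and all sample values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_tei_header(title, source, encoding_note, text_type, change_note):
    """Build a skeletal header with the four parts named on the slide."""
    header = ET.Element("teiHeader")

    # File description: the electronic file and the source it derives from.
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    publication = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(publication, "p").text = "Placeholder publication statement."
    source_desc = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source_desc, "p").text = source

    # Encoding description: how the electronic text relates to its source.
    encoding_desc = ET.SubElement(header, "encodingDesc")
    ET.SubElement(encoding_desc, "p").text = encoding_note

    # Text profile: classification used for retrieval by text type.
    profile_desc = ET.SubElement(header, "profileDesc")
    text_class = ET.SubElement(profile_desc, "textClass")
    keywords = ET.SubElement(text_class, "keywords")
    ET.SubElement(keywords, "term").text = text_type

    # Revision history (simplified: a real entry is more structured).
    revision_desc = ET.SubElement(header, "revisionDesc")
    ET.SubElement(revision_desc, "change").text = change_note

    return header


header = build_tei_header(
    title="Sample corpus text",
    source="Digitised from a printed edition (placeholder).",
    encoding_note="Words and punctuation tokenised; spelling not normalised.",
    text_type="fiction",
    change_note="First version of the electronic text.",
)
print(ET.tostring(header, encoding="unicode"))
```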

TEI header II.
– an example of a TEI header
– very detailed information is possible, but again there are many ways to express the same information (e.g. free text or structured in elements)
– stricter but poorer alternatives exist: Dublin Core

Dublin Core
– the Dublin Core Metadata Initiative (DCMI) was founded in 1995 with the aim of creating a core set of meta-data descriptions for Web-based resources that would be useful for categorizing the Web for easier search and retrieval
– the Dublin Core Metadata Element Set (DCES) defines 15 elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights
– can be extended
– DC is used e.g. by the Open Language Archives Community (OLAC)
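To show how flat a Dublin Core description is compared with a TEI header, the sketch below writes a record using the 15 DCES elements in the Dublin Core elements namespace. The wrapper element `metadata`, the function name `dc_record` and the choice of values (taken from the title slide of this talk) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements of the Dublin Core Metadata Element Set.
DCES = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]


def dc_record(**fields):
    """Build a flat record; every element is optional and holds plain text."""
    unknown = set(fields) - set(DCES)
    if unknown:
        raise ValueError(f"not DCES elements: {sorted(unknown)}")
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("metadata")  # hypothetical wrapper element
    for name in DCES:
        if name in fields:
            ET.SubElement(record, f"{{{DC_NS}}}{name}").text = fields[name]
    return record


record = dc_record(
    title="Encoding language corpora: current trends and future directions",
    creator="Tomaž Erjavec",
    date="2006-09-28",
    language="en",
)
print(ET.tostring(record, encoding="unicode"))
```

Each element is an unstructured string, which makes records easy to produce and compare but leaves no room for the detail a TEI header can carry.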

Corpus analytical tools
Currently, many corpus exploration tools exist, and they typically offer:
– search with regular expressions over strings
– sometimes search over (lemma/PoS) annotations
– concordance and word frequency list display of results
– sometimes search and display of parallel corpora
– sometimes basic statistical tests (keywordness, collocation strength)
– examples: WordSmith, MonoConc, IMS CQP, Manatee/Bonito, SARA/Xaira, Tigersearch
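The two most basic functions, a regular-expression search shown as a keyword-in-context concordance and a word frequency list, can be sketched in a few lines. This is a toy over a raw string and does not reflect how any of the tools named above is implemented; the sample text, tokenisation and function names are assumptions.

```python
import re
from collections import Counter

TEXT = (
    "Corpora are expensive to produce. Corpora are difficult to exchange. "
    "Old corpora quickly became obsolete."
)


def frequency_list(text, n=5):
    """Return the n most frequent lower-cased word forms with their counts."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common(n)


def kwic(text, pattern, width=20):
    """Return one keyword-in-context line per regular-expression match."""
    lines = []
    for match in re.finditer(pattern, text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both contexts so that the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


print(frequency_list(TEXT, 3))
for line in kwic(TEXT, r"[Cc]orpora"):
    print(line)
```

Searching over lemma or part-of-speech annotations needs the tokens to carry those attributes, which is where the encoding choices discussed earlier start to matter for the tools.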

What is missing
– possibility to combine different types of annotation in queries and displays, esp. for multimodal corpora
– integration of more powerful statistical methods, esp. for collocations and parallel corpora
– tools targeted to different types of users (e.g. Sketch Engine)
– merging of digital library viewers with corpus concordancing software

Corpora vs. digital libraries
– classical reference corpora were composed of samples, and were interesting only for their linguistic content
– today, more and more corpora contain integral texts, which are of interest in themselves (e.g. historical texts)
– conversely, digital libraries are growing in size and accessibility and are also becoming interesting for linguistic research
– what is needed is a system that can perform two tasks: enable selection of (fragments of) heavily structured (multimedia, text-critical) texts for reading, and allow for concordance views of selections
– currently the only available (OS) system that attempts this is Philologic from the University of Chicago

III. Future directions
Two directions in standardisation of corpus and language resource annotation:
– next version of TEI, version P5
– work by ISO TC 37 SC4

TEI P5
– the next version of TEI, currently at beta stage: available, but not stable
– significantly revised and brought in line with current practices
– not backward compatible with P3/P4 (although scripts exist for conversion)
– formal specification based on the ISO RELAX NG schema language (although DTD and W3C schemas are also available)
– parameterisation also produces dedicated documentation

ISO TC 37
– ISO TC 37: ISO Technical Committee on Terminology, est. 1952
– maybe best known for ISO 639 and MARTIF
– in 2002 changed its name to Technical Committee on Terminology and Other Language Resources
– also established ISO TC 37/SC 4, the Sub-Committee on Language Resource Management

ISO TC 37 SC4 WGs
– WG 1: Basic descriptors and mechanisms for language resources
  – terminology used in language resources
  – basic mechanisms and data structures for linguistic representation
  – meta-data representation scheme to document linguistic information structures and processes
– WG 2: Representation schemes
  – definition of annotation/representation schemes for morpho-syntax and syntax
  – representation scheme for the semantic content of multimodal information
  – metadata for discourse level representation scheme
– WG 3: Multilingual text representation
  – translation memory and alignment of parallel corpora
  – segmentation and counting algorithms
  – meta-markup for Globalization, Internationalization and Localization (GIL)
– WG 4: Lexical databases
  – standardization of lexical representation formats for the various types of NLP applications (Machine Readable Lexica)
– WG 5: Workflow of language resource management
  – standardization of guidelines for language validation and net-based distributed cooperative work

WG4 standards
– Language Resource Management - Feature Structures
– Language Resource Management - Lexical Markup Framework (LMF)
– Language Resource Management - Morpho-syntactic Annotation Framework (MAF)
– all under development!

Conclusions
– I presented some history, the current state and possible future directions in the field of encoding standardisation of, mainly, corpora
– the main recommendation (for me!) still seems to be TEI: it combines tradition with innovation

Thank you!
