Standards for digital encoding Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž.

Slides:



Advertisements
Similar presentations
Ontology Assessment – Proposed Framework and Methodology.
Advertisements

A Common Standard for Data and Metadata: The ESDS Qualidata XML Schema Libby Bishop ESDS Qualidata – UK Data Archive E-Research Workshop Melbourne 27 April.
Music Encoding Initiative (MEI) DTD and the OCVE
Developing a Metadata Exchange Format for Mathematical Literature David Ruddy Project Euclid Cornell University Library DML 2010 Paris 7 July 2010.
Annotating language data Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž Erjavec.
Jing-Shin Chang National Chi Nan University, IJCNLP-2013, Nagoya 2013/10/15 ACLCLP – Activities ( ) & Text Corpora.
INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING NLP-AI IIIT-Hyderabad CIIL, Mysore ICON DECEMBER, 2003.
Annotating language data Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž Erjavec.
SLIDE 1IS 257 – Fall 2007 Codes and Rules for Description: History 2 University of California, Berkeley School of Information IS 245: Organization.
Information Retrieval in Practice
3. Technical and administrative metadata standards Metadata Standards and Applications.
© Tefko Saracevic, Rutgers University1 metadata considerations for digital libraries.
1 CS 502: Computing Methods for Digital Libraries Lecture 17 Descriptive Metadata: Dublin Core.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
OLC Spring Chapter Conferences Metadata, Schmetadata … Tell Me Why I Should Care? OLC Spring Chapter Conferences, 2004 Margaret.
The LC-STAR project (IST ) Objectives: Track I (duration 2 years) Specification and creation of large word lists and lexica suited for flexible.
UKOLUG - July Metadata for the Web RDF and the Dublin Core Andy Powell UKOLN, University of Bath UKOLN.
Publishing Digital Content to a LOR Publishing Digital Content to a LOR 1.
ORGANIZING AND STRUCTURING DATA FOR DIGITAL PROJECTS Suzanne Huffman Digital Resources Librarian Simpson Library.
RDF (Resource Description Framework) Why?. XML XML is a metalanguage that allows users to define markup XML separates content and structure from formatting.
McEnery, T., Xiao, R. and Y.Tono Corpus-based language studies. Routledge. Unit A 2. Representativeness, balance and sampling (pp13-21)
Corpus linguistics for translators Amanda Saksida University of Nova Gorica.
1 © Netskills Quality Internet Training, University of Newcastle Metadata Explained © Netskills, Quality Internet Training.
8/28/97Organization of Information in Collections Introduction to Description: Dublin Core and History University of California, Berkeley School of Information.
Mark Sullivan University of Florida Libraries Digital Library of the Caribbean.
Translation Studies 8. Research methods in Translation Studies Krisztina Károly, Spring, 2006 Sources: Károly, 2002; Klaudy, 2003.
Some Thoughts on HPC in Natural Language Engineering Steven Bird University of Melbourne & University of Pennsylvania.
1 Corpora: Annotating and Searching LING 5200 Computational Corpus Linguistics Martha Palmer.
TEXT ENCODING INITIATIVE (TEI) Inf 384C Block II, Module C.
Metadata and Geographical Information Systems Adrian Moss KINDS project, Manchester Metropolitan University, UK
Jennie Ning Zheng Linda Melchor Ferhat Omur. Contents Introduction WordNet Application – WordNet Data Structure - WordNet FrameNet Application – FrameNet.
Meta Tagging / Metadata Lindsay Berard Assisted by: Li Li.
389F/Description1 ARCHIVAL DESCRIPTION. 389F/Description2 INTRODUCTION Finding Aid Any descriptive medium that establishes physical, administrative and/or.
Using a Lemmatizer to Support the Development and Validation of the Greek WordNet Harry Kornilakis 1, Maria Grigoriadou 1, Eleni Galiotou 1,2, Evangelos.
Tracking Language Development with Learner Corpora Xiaofei Lu CALPER 2010 Summer Workshop July 12, 2010.
Annotation : the scope er so Steven said it was not a property of um annotated corpora verb phrase Anaphoric reference noun phrase named entity passing.
1 Metadata –Information about information – Different objects, different forms – e.g. Library catalogue record Property:Value: Author Ian Beardwell Publisher.
Seminar in Applied Corpus Linguistics: Introduction APLNG 597A Xiaofei Lu August 26, 2009.
Discovery Metadata for Special Collections Concepts, Considerations, Choices William E. Moen School of Library and Information Sciences Texas Center for.
Jan 9, 2004 Symposium on Best Practice LSA, Boston, MA 1 Comparability of language data and analysis Using an ontology for linguistics Scott Farrar, U.
Introduction to metadata
Encoding language corpora: current trends and future directions Tomaž Erjavec Department of Knowledge Technologies Department of Knowledge Technologies.
October 2005CSA3180 NLP1 CSA3180 Natural Language Processing Introduction and Course Overview.
SKOS. Ontologies Metadata –Resources marked-up with descriptions of their content. No good unless everyone speaks the same language; Terminologies –Provide.
Intellectual Works and their Manifestations Representation of Information Objects IR Systems & Information objects Spring January, 2006 Bharat.
Strategies for subject navigation of linked Web sites using RDF topic maps Carol Jean Godby Devon Smith OCLC Online Computer Library Center Knowledge Technologies.
Digital Libraries1 David Rashty. Digital Libraries2 “A library is an arsenal of liberty” Anonymous.
WP 3: Standardisation of shared metadata Mode of operation –All partners are involved –Building on practice outside the project Achievements of Year 1.
Standards for representing meeting metadata and annotations in meeting databases Standards for representing meeting metadata and annotations in meeting.
Sally McCallum Library of Congress
Corpus Linguistics MOHAMMAD ALIPOUR ISLAMIC AZAD UNIVERSITY, AHVAZ BRANCH.
Introduction to metadata for IDAH fellows Jenn Riley Metadata Librarian Digital Library Program.
Describing resources II: Dublin Core CERN-UNESCO School on Digital Libraries Rabat, Nov 22-26, 2010 Annette Holtkamp CERN.
An Application Profile and Prototype Metadata Management System for Licensed Electronic Resources Adam Chandler Information Technology Librarian Central.
What is a Corpus? What is not a corpus?  the Web  collection of citations  a text Definition of a corpus “A corpus is a collection of pieces of language.
INTRODUCTION TO APPLIED LINGUISTICS
Geospatial metadata Prof. Wenwen Li School of Geographical Sciences and Urban Planning 5644 Coor Hall
Some basic concepts Week 1 Lecture notes INF 384C: Organizing Information Spring 2016 Karen Wickett UT School of Information.
Information Retrieval in Practice
UNIFIED MEDICAL LANGUAGE SYSTEMS (UMLS)
CORPUS LINGUISTICS Corpus linguistics is the study of language as expressed in samples (corpora) or "real world" text. An approach to derive at a set of.
Demystifying Digital Scholarship 10: TEI
Computational and Statistical Methods for Corpus Analysis: Overview
Natural Language Processing (NLP)
Introduction to Metadata
Introduction to Semantic Metadata & Semantic Web
Natural Language Processing (NLP)
FRBR and FRAD as Implemented in RDA
Natural Language Processing (NLP)
Presentation transcript:

Standards for digital encoding Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž Erjavec

Overview 1. what are corpora 2. structure of a TEI corpus 3. linguistic annotation of corpora 4. alignment 5. TEI (corpus) header Practicum: XSLT: making an index and ToC XSLT: making an index and ToC TEI encoding of a corpus TEI encoding of a corpus

I. What is a corpus? The Collins English Dictionary (1986): 1. a collection or body of writings, esp. by a single author or topic. Guidelines of the Expert Advisory Group on Language Engineering Standards, EAGLES: Corpus : A collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language. Computer corpus : a corpus which is encoded in a standardised and homogeneous way for open-ended retrieval tasks. Its constituent pieces of language are documented as to their origins and provenance. EAGLES Corpus Computer corpus EAGLES Corpus Computer corpus

Using corpora Research on actual language: descriptive approach, study of performance, empirical linguistics. Research on actual language: descriptive approach, study of performance, empirical linguistics. Applied linguistics: Applied linguistics:  Lexicography: mono-lingual dictionaries, terminological, bi-lingual  Language studies: hypothesis verification, knowledge discovery (lexis, morphology, syntax,...)  Translation studies: a source translation equivalents and their contexts translation memories, machine aided translations  Language learning: real-life examples "idiomatic teaching", curriculum development Language technology: Language technology:  testing set for developed methods;  training set for inductive learning  (statistical Natural Language Processing) statistical Natural Language Processingstatistical Natural Language Processing

Characteristics of a corpus Quantity: Quantity:  the bigger, the better  but processing / sampling problems Quality : Quality :  the texts are authentic (character sets, spacing, form);  the mark-up is validated (XML) Simplicity: Simplicity:  computer representation is understandable  markup easily separated from the text (XML) Documented: Documented:  corpus contains bibliographic  and other meta-data (teiHeader / Dublin Core / …)

Typology of corpora Corpora of written language, spoken and speech corpora Corpora of written language, spoken and speech corpora  authenticity vs. price  e.g. the agency ELRA catalog ELRAcatalogELRAcatalog Reference corpora and specialised (sub-language) corpora Reference corpora and specialised (sub-language) corpora  representativeness and balance  e.g. BNC, ICE, COLT BNCICECOLTBNCICECOLT Corpora with integral texts or of text samples Corpora with integral texts or of text samples  historical and legal reasons, e.g. Brown Brown  but also balance: the whelk problem Static and monitor corpora (language change) Static and monitor corpora (language change) Monolingual and multilingual parallel and comparable corpora e.g. Hansard, Europarl Monolingual and multilingual parallel and comparable corpora e.g. Hansard, EuroparlHansardEuroparlHansardEuroparl Plain text and annotated corpora Plain text and annotated corpora  the “original” should be archived, ideally linked to the corpus

The history of computer corpora: First milestones: Brown (AE, 1 million words) 1964 First milestones: Brown (AE, 1 million words) 1964Brown LOB et al. (~ Brown but BE) 1974 LOB et al. (~ Brown but BE) 1974 LOB 100M reference corpora: Cobuild Bank of English (monitor) 1980; BNC 1995; Czech CNC 1998; Slovene FIDA, Nova Beseda 1998; Croatian HNK, Hungarian HNC, M reference corpora: Cobuild Bank of English (monitor) 1980; BNC 1995; Czech CNC 1998; Slovene FIDA, Nova Beseda 1998; Croatian HNK, Hungarian HNC,...BNCCNCFIDANova BesedaHNKHNCBNCCNCFIDANova BesedaHNKHNC EU corpus oriented projects in the '90: EAGLES, MULTEXT, SQEL, MULTEXT-East,... EU corpus oriented projects in the '90: EAGLES, MULTEXT, SQEL, MULTEXT-East,...MULTEXT-East Language resources brokers: LDC 1992, ELRA 1995 Language resources brokers: LDC 1992, ELRA 1995LDCELRALDCELRA

Steps in the preparation of a corpus 1. Which texts? Sampling “the universe of discourse” linguistic and non-linguistic criteria; availability; simplicity; size linguistic and non-linguistic criteria; availability; simplicity; size 2. Legalities: © sensitivity of source (privacy / publishing considerations); sensitivity of source (privacy / publishing considerations); agreement with providers; usage, publication agreement with providers; usage, publication 3. How to get them? Acquiring digital originals typing, OCR, get CD-ROM, download from provider typing, OCR, get CD-ROM, download from provider Web, Google, BootCat,...? Web, Google, BootCat,...? 4. Up-translation. Conversion to standard format (+documentation) consistency; character set encodings consistency; character set encodings 5. (Linguistic annotation) Segmentation and classification errors, language dependency errors, language dependency 6. Documentation Metadata TEI header; Dublin Core; Open Archives etc. TEI header; Dublin Core; Open Archives etc. 7. Using it yourself, or letting others too? (Web-based) concordancers for linguists (Web-based) concordancers for linguists download needed for HLT use download needed for HLT use licences for use licences for use

What annotation can be added to the text of the corpus? Annotation = interpretation Annotation = interpretation Documentation about the corpus and corpus component ( s) Documentation about the corpus and corpus component ( s) Document structure (,, etc.) usu. not a priority! Document structure (,, etc.) usu. not a priority! Basic linguistic markup: sentences, words, punctuation (, ) Basic linguistic markup: sentences, words, punctuation (, ) Lemmas and morphosyntactic descriptions Lemmas and morphosyntactic descriptions Syntax (treebanks) Syntax (treebanks) Multilinguality (sentence and word alignment) Multilinguality (sentence and word alignment) Terms, semantics, anaphora, pragmatics, intonation,... Terms, semantics, anaphora, pragmatics, intonation,...

Markup Methods hand annotation: documentation, first steps generic (XML, spreadsheet) editors or specialised editors hand annotation: documentation, first steps generic (XML, spreadsheet) editors or specialised editors semi-automatic: morphosyntactic and other linguistic annotation cyclic approach: machine, hand, validate, correct, machine,... semi-automatic: morphosyntactic and other linguistic annotation cyclic approach: machine, hand, validate, correct, machine,... machine, with hand-written rules: tokenisation regular expression machine, with hand-written rules: tokenisation regular expression machine, with inductively built models from annotated data: "supervised learning"; HMMs, decision trees, inductive logic programming,... machine, with inductively built models from annotated data: "supervised learning"; HMMs, decision trees, inductive logic programming,... machine, with inductively built models from un-annotated data: "unsupervised leaning"; clustering techniques; text mining machine, with inductively built models from un-annotated data: "unsupervised leaning"; clustering techniques; text mining overview of the field overview of the field overview of the field overview of the field

The future of corpus and data- driven linguistics Size: Size:  Larger quantities of readily accessible data (Web as corpus)  Larger storage and processing power (Moore law) Complexity: Complexity:  Deeper analysis: syntax, deixis, semantic roles, dialogue acts,...  Multimodal corpora: speech, film, transcriptions,...  Annotation levels and linking: co-existence and linking of varied types of annotations; ambiguity  Development of tools and platforms: precision, robustness, unsupervised learning, meta-learning

II. Types of annotation Segmentation Segmentation  paragraphs, sentences, words, phonemes, … Categorisation Categorisation  word-level morphosyntactic information (PoS)  word lemmas or stems  syntactic structures Alignment and correspondence Alignment and correspondence  translation equivalence  anaphoric reference Metadata Metadata  text source: title, date, author, …  text type: register, publ. type, …

Segmentation Corpora (and texts) are generally organized hierarchically Segmentation and labelling allow their components to be identified and accessed at any level   for reference purposes e.g. this occurs at....   for scoping purposes e.g. find this within a that   for analytic purposes e.g. 90% of these of type that contain a the other

Typical TEI corpus structure teiCorpus teiHeader TEI TEI TEI teiHeadertext div div div p p p s s s w w w

Example It It was was a a bright bright cold cold day day in in April April,, and and the the clocks clocks were were striking striking thirteen thirteen..

Representation of trees Example: (NC (N the method) (C to be used)) in TEI: the method to be used can be used for any type of linguistic segment more specific elements:,,,, all defined in module for linguistic analysis

Alignment and correspondence The spans to be annotated are not always structural Discontinuities are commonplace Parallel structures are the norm Solutions:   pointers   milestone (empty) elements   stand-off markup

Alignment of multiple speakers A: Have you heard the It’s a disaster! B: the election results? It’s a miracle! Have you heard the the election results? its a disaster its a miracle

Synchronization using pointers of whole elements of points in time its a disaster its a miracle Have you heard the the election results?

Discontinuity: using pointers “You put it,” Quill reminded him, “in the safe.” Encoding: Encoding: can also use PART attribute to indicate that segments are incomplete ”You put it,” Quill reminded him, “in the safe.”

Translation equivalence using pointers For a long time I used to go to bed early … Longtemps je me couchais de bonne heure

Translation equivalence using stand-off markup For a long time I used to go to bed early Longtemps je me couchais de bonne heure

Anaphoric reference Shirley, which made its Friday night debut only a month ago, was not listed on NBC 's new schedule, although the network says the show still is being considered. With pointers: With pointers: Stand-off: Stand-off:

Word class categorization Difficulty is being expressed... It was a bright cold day

Making analysis explicit The ana attribute is a pointer What does NN1 identify?   a prose description   an element   a feature structure Difficulty

for example… annotated corpora verb past tense plural common noun

Formal encoding of analyses Linguistic Annotation Frameworks and standards   the philosophers stone Generic feature structure system   any analysis can be represented by bundles of named feature-value pairs   embedded within text or indirectly linked Ancillary feature system declaration Theoretically neutral (?) pragmatic solution to real world problem of intermachine communication

Feature structures a feature structure consists of a set bundle of features a feature has a name and a value values may be binary switches, symbols, strings, feature structures, or operations on them bundling may constrained in various (not necessarily hierarchic) ways

... or, in XML: The element represents a feature structure, which contains... One or more elements, each of which has   a name   a value Feature values may be   atomic:   complex:   expressions:... or

Using a feature structure... corpora

An example of a lexical definition goose lemma: goose category: noun number: plural bar level: 0

A more complex example

The TEI header The TEI header gives the meta-data on the TEI document and consists of four elements: the file description (the only obligatory element) the file description (the only obligatory element) the encoding description the encoding description the text profile the text profile the revision history the revision history

file description, containing a full bibliographical description of the computer file itself file description, containing a full bibliographical description of the computer file itself e.g. the title statement, edition statement, size, … e.g. the title statement, edition statement, size, … includes also information about the source or sources ( ) from which the electronic text was derived includes also information about the source or sources ( ) from which the electronic text was derived

File description (2) all subelements all subelements </teiHeader> minimal header minimal header<teiHeader> </teiHeader>

fileDesc/titleStmt Short Example Short Example<titleStmt> Two stories by Edgar Allen Poe: electronic version Two stories by Edgar Allen Poe: electronic version Poe, Edgar Allen ( ) Poe, Edgar Allen ( ) compiled by compiled by James Benson James Benson </titleStmt> Longer example: Longer example: Yogadarśanam (arthāt yogasūtrap¯&tdot;ha&hdot;): a machine readable transcription. The Yogasūutras of Patañjali: a machine readable transcription. Wellcome Institute for the History of Medicine Dominik Wujastyk Wieslaw Mical data entry and proof correction Jan Hajic conversion to TEI-conformant markup Yogadarśanam (arthāt yogasūtrap¯&tdot;ha&hdot;): a machine readable transcription. The Yogasūutras of Patañjali: a machine readable transcription. Wellcome Institute for the History of Medicine Dominik Wujastyk Wieslaw Mical data entry and proof correction Jan Hajic conversion to TEI-conformant markup

fileDesc/publicationStmt publication statement groups information concerning the publication or distribution of an electronic or other text publication statement groups information concerning the publication or distribution of an electronic or other text It may contain either a simple prose description It may contain either a simple prose description or groups of the elements described below: or groups of the elements described below:  who is responsible for the publication publisher  who is responsible for the distribution distributor  (release authority) who is responsible for making an electronic file available, other than a publisher or distributor authority each of the above elements may be followed by one or more of the following elements:,,,, each of the above elements may be followed by one or more of the following elements:,,,, pubPlaceaddressidnoavailabilitydatepubPlaceaddressidnoavailabilitydate

fileDesc/sourceDesc The Source Description records details of the source or sources from which a computer file is derived. The Source Description records details of the source or sources from which a computer file is derived. An electronic file may also have no source, if what is being catalogued is an original text created in electronic form. An electronic file may also have no source, if what is being catalogued is an original text created in electronic form. The element may contain a simple prose description, or, more usefully, a bibliographic citation: The element may contain a simple prose description, or, more usefully, a bibliographic citation:  ,,,,

examples The first folio of Shakespeare, prepared by Charlton Hinman (The Norton Facsimile, 1968) The first folio of Shakespeare, prepared by Charlton Hinman (The Norton Facsimile, 1968) </sourceDesc> <sourceDesc> No source: created in machine-readable form. No source: created in machine-readable form. <sourceDesc> Eugène Sue Eugène Sue Martin, l'enfant trouvé Martin, l'enfant trouvé Mémoires d'un valet de chambre Mémoires d'un valet de chambre Bruxelles et Leipzig Bruxelles et Leipzig C. Muquardt C. Muquardt

encoding description encoding description describes the relationship between an electronic text and its source or sources describes the relationship between an electronic text and its source or sources allows for detailed description of whether (or how) the text was normalized during transcription, how the encoder resolved ambiguities in the source, what levels of encoding or analysis were applied, etc. allows for detailed description of whether (or how) the text was normalized during transcription, how the encoder resolved ambiguities in the source, what levels of encoding or analysis were applied, etc.

(2) The content of the encoding description may be a prose description, or it may contain elements from the following list, in the order given: (project description) (sampling declaration) (editorial practice declaration) (tagging declaration) (references declaration) (classification declarations) (FSD (feature-system declaration) declaration) (metrical declaration) declares method used to encode text-critical variants. For more details, see the TEI P5 guidelines, 5.3 The Encoding Description 5.3 The Encoding Description

text profile text profile contains classificatory and contextual information about the text contains classificatory and contextual information about the text e.g. its subject matter, the individuals described by or participating in producing it, etc. e.g. its subject matter, the individuals described by or participating in producing it, etc. of particular use in structured composite texts such as corpora, where it is often desirable to enforce a controlled descriptive vocabulary or to perform retrievals from a body of text in terms of text type or origin of particular use in structured composite texts such as corpora, where it is often desirable to enforce a controlled descriptive vocabulary or to perform retrievals from a body of text in terms of text type or origin For more details, see the TEI P5 guidelines, 5.4 The Profile Description 5.4 The Profile Description

revision history revision history allows the encoder to provide a history of changes made during the development of the electronic text allows the encoder to provide a history of changes made during the development of the electronic text important for version control and for resolving questions about the history of a file. important for version control and for resolving questions about the history of a file. For more details, see the TEI P5 guidelines, 5.5 The Revision Description 5.5 The Revision Description

Dublin core The Dublin Core Metadata Initiative (dublincore.org) was founded during a joint workshop of the National Center for Supercomputing Applications (NCSA) and the Online Computer Library Center (OCLC) held in Dublin, Ohio, March The Dublin Core Metadata Initiative (dublincore.org) was founded during a joint workshop of the National Center for Supercomputing Applications (NCSA) and the Online Computer Library Center (OCLC) held in Dublin, Ohio, March 1995.dublincore.org The aim was to create a core set of meta-data descriptions for Web-based resources that would be useful for categorizing the Web for easier search and retrieval. The aim was to create a core set of meta-data descriptions for Web-based resources that would be useful for categorizing the Web for easier search and retrieval. Dublin Core Metadata Element Set (DCES) defines 15 elements: Dublin Core Metadata Element Set (DCES) defines 15 elements:  Title: A name given to the resource.  Creator: An entity primarily responsible for making the content of the resource.  Subject: The topic of the content of the resource.  Description: An account of the content of the resource.  Publisher: An entity responsible for making the resource available.  Contributor: An entity responsible for making contributions to the content of the resource.  Date: A date associated with an event in the life cycle of the resource.  Type: The nature or genre of the content of the resource.  Format: The physical or digital manifestation of the resource.  Identifier: An unambiguous reference to the resource within a given context.  Source: A Reference to a resource from which the present resource is derived.  Language: A language of the intellectual content of the resource.  Relation: A reference to a related resource.  Coverage: The extent or scope of the content of the resource.  Rights: Information about rights held in and over the resource.

Tools for corpus building 1. Google, BootCat 2. Perl, Python,… 3. XML, XSLT 4. tokenisers, taggers, aligners, parsers… 5. annotation editors

And exploitation PC concordancers: PC concordancers:  WordSmith, ParaConc Web concordancers / viewers Web concordancers / viewers  With corpus: BNC View etc.  Engines: CQP, Manatee  Specialised: TIGER search Data extraction: Data extraction:  XSLT / XQuery  SAX + Perl, Java  Database Analysis: Analysis:  Excel  SQL HLT uses.. HLT uses..