Annotating language data Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž Erjavec.

Annotating language data
Tomaž Erjavec
Institut für Informationsverarbeitung, Geisteswissenschaftliche Fakultät, Karl-Franzens-Universität Graz
3.11.2006

Overview
1. a few words about me
2. a few words about you
3. introduction to corpora and annotation
Practicum: exploring publicly accessible corpora

Lecturer
- Tomaž Erjavec, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
- http://nl.ijs.si/et/
- tomaz.erjavec@ijs.si
- corpora and other language resources, standards, annotation, text-critical editions
- Web page for this course: http://nl.ijs.si/et/teach/graz06/annotation/

Students
- background: field of study, exposure to corpus linguistics?
- emails?
- expectations?

What is a corpus?
- The Collins English Dictionary (1986): 1. a collection or body of writings, esp. by a single author or topic.
- Guidelines of the Expert Advisory Group on Language Engineering Standards, EAGLES:
  - Corpus: a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.
  - Computer corpus: a corpus which is encoded in a standardised and homogeneous way for open-ended retrieval tasks. Its constituent pieces of language are documented as to their origins and provenance.

Using corpora
- Research on actual language: descriptive approach, study of performance, empirical linguistics.
- Applied linguistics:
  - Lexicography: monolingual, terminological and bilingual dictionaries
  - Language studies: hypothesis verification, knowledge discovery (lexis, morphology, syntax, ...)
  - Translation studies: a source of translation equivalents and their contexts; translation memories, machine-aided translation
  - Language learning: real-life examples, "idiomatic teaching", curriculum development
- Language technology:
  - testing set for developed methods
  - training set for inductive learning (statistical Natural Language Processing)

Characteristics of a corpus
- Quantity: the bigger, the better
- Quality: the texts are authentic; the mark-up is validated
- Simplicity: the computer representation is understandable, with the markup easily separated from the text
- Documented: the corpus contains bibliographic and other metadata

Typology of corpora
- Corpora of written language, spoken and speech corpora (authenticity/price), e.g. the catalogue of the ELRA agency
- Reference corpora (representative) and sub-language corpora (specialised), e.g. BNC, ICE, COLT
- Corpora with integral texts or of text samples (historical and legal reasons), e.g. Brown
- Static and monitor corpora (language change)
- Monolingual and multilingual parallel and comparable corpora, e.g. Hansard, Europarl
- Plain text and annotated corpora

History
(Computational) linguistic paradigms:
- 1950-1960: empiricism
  weak computers: frequency lists
- 1970-1980: cognitive modeling (generative approaches, artificial intelligence)
  deep analysis / "basic science": computational linguistics
- 1990-...: empiricist revival, also combined approaches
  quantity / usefulness: language technologies
- 2000-...: the Web

The history of computer corpora
- First milestones: Brown (1 million words) 1964; LOB (also 1M) 1974
- The spread of reference corpora: Cobuild Bank of English (monitor, 100..200..M) 1980; BNC (100M) 1995; Czech CNC (100M) 1998; Croatian HNK (100M) 1999 ...
- Slovene language reference corpora: FIDA (100M), Nova Beseda (100M...) 1998
- EU corpus-oriented projects in the '90s: NERC, MULTEXT-East, ...
- Language resources brokers: LDC 1992, ELRA 1995

Literature on corpora
- Corpus Linguistics by Tony McEnery and Andrew Wilson. Edinburgh: Edinburgh University Press, 1996
- An Introduction to Corpus Linguistics by Graeme D. Kennedy. Studies in Language and Linguistics, London, 1998
- Corpus Linguistics: Investigating Language Structure and Use by Douglas Biber, Susan Conrad, Randi Reppen. Cambridge University Press, 1998
- Uvod v korpusno jezikoslovje, Vojko Gorjanc. Domžale: Izolit, 2005
- LREC conferences: Fifth international conference on Language Resources and Evaluation, LREC'06
- Slovenian Conferences on Language Technologies: 2006, 2004, 2002, 2000, 1998

Steps in the preparation of a corpus
- Choosing the component texts: linguistic and non-linguistic criteria; availability; simplicity; size
- Copyright: sensitivity of source (financial and privacy considerations); agreement with providers; usage, publication
- Acquiring digital originals: Web transfer; visit; OCR
- Up-translation: conversion to standard format; consistency; character set encodings
- Linguistic annotation: language dependent methods; errors
- Documentation: TEI header; Open Archives etc.
- Use / Download:
  - (Web-based) concordancers for linguists
  - download needed for HLT use
  - licences for use
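
The "character set encodings" part of up-translation can be illustrated with a minimal sketch. It assumes the digital originals arrive in a legacy 8-bit encoding (ISO-8859-2 is used here only as an example that covers Slovene letters) and that the corpus standardises on UTF-8; the function name to_utf8 is hypothetical.

```python
import unicodedata

def to_utf8(raw: bytes, source_encoding: str = "iso-8859-2") -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8.

    The source encoding is an assumption; in practice it has to be known
    or detected for each acquired text.
    """
    text = raw.decode(source_encoding)
    # Normalise to composed form so that the same letter always has
    # the same code point sequence across the whole corpus.
    text = unicodedata.normalize("NFC", text)
    return text.encode("utf-8")

# Simulate a legacy-encoded original and convert it.
legacy = "Tomaž Erjavec, Jožef Stefan".encode("iso-8859-2")
converted = to_utf8(legacy)
print(converted.decode("utf-8"))
```

Decoding with the wrong source encoding usually does not raise an error for 8-bit encodings; it silently produces wrong characters, which is one reason the slide lists consistency next to conversion.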

What annotation can be added to the text of the corpus?
Annotation = interpretation
- Documentation about the corpus
- Document structure
- Basic linguistic markup: sentences, words, punctuation, abbreviations
- Lemmas and morphosyntactic descriptions
- Syntax
- Alignment
- Terms, semantics, anaphora, pragmatics, intonation, ...

Markup methods
- hand annotation: documentation, first steps
  generic (XML, spreadsheet) editors or specialised editors
- semi-automatic: morphosyntactic and other linguistic annotation
  cyclic approach: machine, hand, validate, correct, machine, ...
- machine, with hand-written rules: tokenisation
  regular expressions
- machine, with inductively built models from annotated data: "supervised learning"; HMMs, decision trees, inductive logic programming, ...
- machine, with inductively built models from un-annotated data: "unsupervised learning"; clustering techniques
- overview of the field
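
Two of the machine methods listed above can be sketched in a few lines. The first function is a hand-written regular-expression tokeniser; the second builds a model from annotated data by remembering the most frequent tag of each word. That baseline is a deliberately simple stand-in for the HMMs and decision trees named on the slide, not an implementation of them. The tag names and the tiny training set are invented for illustration.

```python
import re
from collections import Counter, defaultdict

# Hand-written rule: a token is a run of word characters (optionally joined
# by an internal hyphen or apostrophe) or a single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def tokenise(text):
    """Split raw text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

def train_unigram_tagger(tagged_sentences):
    """Learn, for every word form, the tag it carries most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(tokens, model, default="UNK"):
    """Annotate tokens with the learned tag, or a default for unseen words."""
    return [(token, model.get(token.lower(), default)) for token in tokens]

# Hypothetical hand-annotated training data.
TRAINING = [
    [("the", "DET"), ("corpus", "NOUN"), ("grows", "VERB"), (".", "PUNCT")],
    [("the", "DET"), ("tagger", "NOUN"), ("works", "VERB"), (".", "PUNCT")],
]

model = train_unigram_tagger(TRAINING)
print(tag(tokenise("The corpus works."), model))
```

The tokeniser splits abbreviations such as "e.g." into separate letters and full stops, which shows why abbreviations need their own rules or word lists, and why the machine output is then validated and corrected by hand in the cyclic approach.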

Computer coding of corpora
- Many corpora are encoded in a simple tabular format
- A good encoding must ensure durability and enable interchange between computer platforms and applications
- The basic standard used is the Extensible Markup Language, XML
- There are a number of companion standards and technologies: XML transformations (XSLT), data definition (DTD, XML Schema, ISO Relax NG), addressing and queries (XPath, XQuery), ...
- The vocabulary of annotations for corpora and other language resources is defined by the Text Encoding Initiative, TEI
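
To make the contrast between the tabular format and XML concrete, here is a small sketch that writes the same annotated sentence both ways. The element and attribute names (s, w, c, lemma, ana) are TEI-inspired but chosen for illustration only; the output is not claimed to validate against any TEI schema, and the tags are invented.

```python
import xml.etree.ElementTree as ET

# One sentence as (word form, lemma, tag) triples; the tags are hypothetical.
TOKENS = [
    ("Corpora", "corpus", "NOUN"),
    ("grow", "grow", "VERB"),
    (".", ".", "PUNCT"),
]

def to_tabular(tokens):
    """One token per line, columns separated by tabs."""
    return "\n".join("\t".join(row) for row in tokens)

def to_xml(tokens):
    """Build a sentence element with word and punctuation children."""
    sentence = ET.Element("s")
    for form, lemma, tag in tokens:
        if tag == "PUNCT":
            element = ET.SubElement(sentence, "c")
        else:
            element = ET.SubElement(sentence, "w", {"lemma": lemma, "ana": tag})
        element.text = form
    return sentence

sentence = to_xml(TOKENS)
print(to_tabular(TOKENS))
print(ET.tostring(sentence, encoding="unicode"))

# A simple path query over the markup: the lemmas of all words.
lemmas = [w.get("lemma") for w in sentence.findall("w")]
print(lemmas)
```

The tabular form is easy to read and process line by line, while the XML form keeps the annotation separable from the text and can be checked against a schema and queried with the companion technologies listed above.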

Examples of use
- Concordances
- Collocations: “You shall know a word by the company it keeps.” (Firth, 1957)
- Induction of multilingual lexica
- Automatic translation
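
The first two uses can be sketched on a toy text. The concordance function prints each hit of a keyword with its left and right context (keyword in context), and the collocates function counts which words appear near the keyword. The sample sentence, function names and window sizes are assumptions made for this illustration; raw co-occurrence counts are used here, whereas real collocation extraction normally applies a statistical association measure on a much larger corpus.

```python
from collections import Counter

def concordance(tokens, keyword, width=3):
    """Return one keyword-in-context line per occurrence of the keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up.
            lines.append(f"{left:>25} [{token}] {right}")
    return lines

def collocates(tokens, keyword, window=2):
    """Count the words occurring within `window` tokens of the keyword."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(t.lower() for t in neighbours)
    return counts

TEXT = "strong tea is good but strong coffee is better than weak tea"
tokens = TEXT.split()

for line in concordance(tokens, "strong"):
    print(line)
print(collocates(tokens, "strong").most_common(3))
```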

The future of corpus and data-driven linguistics
- Size:
  - Larger quantities of readily accessible data (Web as corpus)
  - Larger storage and processing power (Moore's law)
- Complexity:
  - Deeper analysis: syntax, deixis, semantic roles, dialogue acts, ...
  - Multimodal corpora: speech, film, transcriptions, ...
  - Annotation levels and linking: co-existence and linking of varied types of annotations; ambiguity
  - Development of tools and platforms: precision, robustness, unsupervised learning, meta-learning
