Annotating language data
Tomaž Erjavec, Institut für Informationsverarbeitung, Geisteswissenschaftliche Fakultät, Karl-Franzens-Universität Graz

Overview
1. A few words about me
2. A few words about you
3. Introduction to corpora and annotation
Practicum: exploring publicly accessible corpora

Lecturer
- Tomaž Erjavec, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
- corpora and other language resources, standards, annotation, text-critical editions
- Web page for this course:

Students
- background: field of study, exposure to corpus linguistics?
- ...s?
- expectations?

What is a corpus?
- The Collins English Dictionary (1986): "1. a collection or body of writings, esp. by a single author or topic."
- Guidelines of the Expert Advisory Group on Language Engineering Standards (EAGLES):
  - Corpus: a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.
  - Computer corpus: a corpus which is encoded in a standardised and homogeneous way for open-ended retrieval tasks. Its constituent pieces of language are documented as to their origins and provenance.

Using corpora
- Research on actual language: descriptive approach, study of performance, empirical linguistics.
- Applied linguistics:
  - Lexicography: monolingual, terminological and bilingual dictionaries
  - Language studies: hypothesis verification, knowledge discovery (lexis, morphology, syntax, ...)
  - Translation studies: a source of translation equivalents and their contexts; translation memories, machine-aided translation
  - Language learning: real-life examples, "idiomatic teaching", curriculum development
- Language technology:
  - testing set for developed methods
  - training set for inductive learning (statistical Natural Language Processing)
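The two language-technology uses named above, a training set for inductive learning and a testing set for evaluating a method, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy tagged sentences are invented, the tag names are arbitrary, and the "most frequent tag per word" learner is only a simple baseline, not a method taken from the slides.

```python
from collections import Counter, defaultdict

# Toy hand-annotated corpus of (word, part-of-speech tag) pairs.
# The data and tag names are invented for illustration.
tagged = [
    ("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
    ("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"),
    ("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB"),
    ("the", "DET"), ("cat", "NOUN"), ("barks", "VERB"),
]


def split_corpus(items, train_fraction=0.75):
    """Split the annotated data into a training part and a testing part."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


def train_most_frequent_tag(training):
    """Learn, for every word seen in training, its most frequent tag."""
    counts = defaultdict(Counter)
    for word, tag in training:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def accuracy(model, testing, default_tag="NOUN"):
    """Share of test tokens whose predicted tag matches the annotation."""
    correct = sum(1 for word, tag in testing
                  if model.get(word, default_tag) == tag)
    return correct / len(testing)


train, test = split_corpus(tagged)
model = train_most_frequent_tag(train)
score = accuracy(model, test)
print(f"trained on {len(train)} tokens, accuracy on {len(test)} held-out tokens: {score:.2f}")
```

The point of the split is that the accuracy is measured on annotated tokens the learner has not seen; on a corpus this small the number itself means nothing.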

Characteristics of a corpus
- Quantity: the bigger, the better
- Quality: the texts are authentic; the mark-up is validated
- Simplicity: the computer representation is understandable, with the markup easily separated from the text
- Documented: the corpus contains bibliographic and other metadata

Typology of corpora
- Corpora of written language, spoken and speech corpora (authenticity/price), e.g. the catalogue of the ELRA agency
- Reference corpora (representative) and sub-language corpora (specialised), e.g. BNC, ICE, COLT
- Corpora with integral texts or of text samples (historical and legal reasons), e.g. Brown
- Static and monitor corpora (language change)
- Monolingual and multilingual parallel and comparable corpora, e.g. Hansard, Europarl
- Plain text and annotated corpora

History
(Computational) linguistic paradigms, in chronological order:
1. Empiricism; weak computers: frequency lists
2. Cognitive modelling (generative approaches, artificial intelligence); deep analysis / "basic science": computational linguistics
3. Empiricist revival, also combined approaches; quantity / usefulness: language technologies
4. The Web

The history of computer corpora
- First milestones: Brown (1 million words) 1964; LOB (also 1M) 1974
- The spread of reference corpora: Cobuild Bank of English (monitor, M) 1980; BNC (100M) 1995; Czech CNC (100M) 1998; Croatian HNK (100M)
- Slovene language reference corpora: FIDA (100M), Nova Beseda (100M...) 1998
- EU corpus-oriented projects in the '90s: NERC, MULTEXT-East, ...
- Language resources brokers: LDC 1992, ELRA 1995

Literature on corpora
- Corpus Linguistics by Tony McEnery and Andrew Wilson. Edinburgh: Edinburgh University Press, 1996
- An Introduction to Corpus Linguistics by Graeme D. Kennedy. Studies in Language and Linguistics, London, 1998
- Corpus Linguistics: Investigating Language Structure and Use by Douglas Biber, Susan Conrad, Randi Reppen. Cambridge University Press, 1998
- Uvod v korpusno jezikoslovje, Vojko Gorjanc. Domžale: Izolit, 2005
- LREC conferences: Fifth international conference on Language Resources and Evaluation, LREC'06
- Slovenian Conferences on LANGUAGE TECHNOLOGIES 2006, 2004, 2002, 2000, 1998

Steps in the preparation of a corpus
- Choosing the component texts: linguistic and non-linguistic criteria; availability; simplicity; size
- Copyright: sensitivity of source (financial and privacy considerations); agreement with providers; usage, publication
- Acquiring digital originals: Web transfer; visit; OCR
- Up-translation: conversion to standard format; consistency; character set encodings
- Linguistic annotation: language-dependent methods; errors
- Documentation: TEI header; Open Archives etc.
- Use / Download:
  - (Web-based) concordancers for linguists
  - download needed for HLT use
  - licences for use

What annotation can be added to the text of the corpus?
- Annotation = interpretation
- Documentation about the corpus
- Document structure
- Basic linguistic markup: sentences, words, punctuation, abbreviations
- Lemmas and morphosyntactic descriptions
- Syntax
- Alignment
- Terms, semantics, anaphora, pragmatics, intonation, ...

Markup Methods
- Hand annotation: documentation, first steps; generic (XML, spreadsheet) editors or specialised editors
- Semi-automatic: morphosyntactic and other linguistic annotation; cyclic approach: machine, hand, validate, correct, machine, ...
- Machine, with hand-written rules: tokenisation; regular expressions
- Machine, with models built inductively from annotated data: "supervised learning"; HMMs, decision trees, inductive logic programming, ...
- Machine, with models built inductively from un-annotated data: "unsupervised learning"; clustering techniques
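As an illustration of machine annotation with hand-written rules, here is a minimal sketch of a regular-expression tokeniser. The three rules (dotted abbreviations, words with internal hyphens or apostrophes, single punctuation marks) are assumptions chosen for the example; a real tokeniser needs language-specific rules and abbreviation lists.

```python
import re

# Hand-written tokenisation rules, tried in order at each position.
# The rule set is illustrative only and far from complete.
TOKEN_PATTERN = re.compile(
    r"""
    (?:[A-Za-z]\.){2,}      # dotted abbreviations such as "e.g." or "U.S."
  | \w+(?:[-']\w+)*         # words, allowing internal hyphens and apostrophes
  | [^\w\s]                 # any other single non-space character (punctuation)
    """,
    re.VERBOSE,
)


def tokenise(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_PATTERN.findall(text)


print(tokenise("The U.S. corpus isn't small, e.g. 100M words."))
```

The abbreviation rule comes first so that the full stops in "U.S." stay inside the token, while the sentence-final full stop is split off as punctuation. Cases the rules do not foresee are where the "validate, correct" part of the semi-automatic cycle comes in.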

Computer coding of corpora
- Many corpora are encoded in a simple tabular format
- A good encoding must ensure durability and enable interchange between computer platforms and applications
- The basic standard used is the Extensible Markup Language, XML
- There are a number of companion standards and technologies: XML transformations (XSLT), data definition (DTD, XML Schema, ISO Relax NG), addressing and queries (XPath, XQuery), ...
- The vocabulary of annotations for corpora and other language resources is defined by the Text Encoding Initiative, TEI
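A small sketch of what XML encoding buys in practice: the annotation can be queried, and the text can be separated from the markup again. The sample below is an assumption made for illustration; its element and attribute names (`s`, `w`, `c`, `lemma`, `msd`) are only loosely modelled on TEI-style corpus markup and the tag values are invented, so it should not be read as valid TEI. The queries use the limited XPath subset supported by Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Invented, simplified sample: one sentence with word-level annotation.
SAMPLE = """
<text>
  <s id="s1">
    <w lemma="the" msd="Dd">The</w>
    <w lemma="corpus" msd="Ncns">corpus</w>
    <w lemma="grow" msd="Vmip3s">grows</w>
    <c>.</c>
  </s>
</text>
"""

root = ET.fromstring(SAMPLE)

# Addressing: every <w> element anywhere below the root.
words = [(w.text, w.get("lemma"), w.get("msd")) for w in root.iterfind(".//w")]

# Querying: word forms whose lemma attribute is "corpus".
by_lemma = [w.text for w in root.findall(".//w[@lemma='corpus']")]

# Separating text from markup: keep only the token content, in document order.
tokens = [el.text for el in root.iter() if el.tag in ("w", "c")]

print(words)
print(by_lemma)
print(" ".join(tokens))
```

The same document could be checked against a schema or transformed with XSLT; those steps need tools beyond the Python standard library and are left out here.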

Examples of use
- Concordances
- Collocations: "You shall know a word by the company it keeps." (Firth, 1957)
- Induction of multilingual lexica
- Automatic translation
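The first two uses can be sketched in a few lines. The functions below are illustrative assumptions, not the workings of any particular concordancer: `concordance` produces keyword-in-context lines, and `collocates` counts the words found within a fixed window around the keyword. The sample sentence is invented.

```python
from collections import Counter


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def collocates(tokens, keyword, window=2):
    """Count the words occurring within `window` tokens of the keyword."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(t.lower() for t in neighbours)
    return counts


text = "strong tea is better than weak tea but strong coffee beats strong tea"
tokens = text.split()

kwic = concordance(tokens, "tea")
near_tea = collocates(tokens, "tea")

for left, key, right in kwic:
    print(f"{left:>20}  [{key}]  {right}")
print(near_tea.most_common(3))
```

Raw co-occurrence counts favour frequent words; work on real corpora usually replaces them with an association measure such as mutual information or log-likelihood, which needs corpus-wide frequencies that this toy example does not have.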

The future of corpus and data-driven linguistics
- Size:
  - Larger quantities of readily accessible data (Web as corpus)
  - Larger storage and processing power (Moore's law)
- Complexity:
  - Deeper analysis: syntax, deixis, semantic roles, dialogue acts, ...
  - Multimodal corpora: speech, film, transcriptions, ...
  - Annotation levels and linking: co-existence and linking of varied types of annotations; ambiguity
  - Development of tools and platforms: precision, robustness, unsupervised learning, meta-learning