Advanced Language Technologies Information and Communication Technologies Module "Knowledge Technologies" Jožef Stefan International Postgraduate School.

Slides:

Advertisements

Similar presentations

School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.

Advertisements

MULTEXT-East Version 4: multilingual morphosyntactic specifications for lots of languages Tomaž Erjavec Department of Knowledge Technologies.

MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora Tomaž Erjavec Department of Knowledge Technologies Jožef.

CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 2 (06/01/06) Prof. Pushpak Bhattacharyya IIT Bombay Part of Speech (PoS)

Annotating language data Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž Erjavec.

What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.

The MULTEXT-East multilingual language resources Tomaž Erjavec Department of Knowledge Technologies Jožef Stefan Institute, Ljubljana

1 A Hidden Markov Model- Based POS Tagger for Arabic ICS 482 Presentation A Hidden Markov Model- Based POS Tagger for Arabic By Saleh Yousef Al-Hudail.

Part II. Statistical NLP Advanced Artificial Intelligence Part of Speech Tagging Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most.

Shallow Processing: Summary Shallow Processing Techniques for NLP Ling570 December 7, 2011.

New Slovene corpora within the »Communication in Slovene« project Nataša Logar BergincSimon Krek University of LjubljanaAmebis, Kamnik Faculty of Social.

Ch 10 Part-of-Speech Tagging Edited from: L. Venkata Subramaniam February 28, 2002.

POS based on Jurafsky and Martin Ch. 8 Miriam Butt October 2003.

1 Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004.

Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.

Syllabus Text Books Classes Reading Material Assignments Grades Links Forum Text Books עיבוד שפות טבעיות - שיעור חמישי POS Tagging Algorithms עידו.

Part of speech (POS) tagging

Semi-Automatic Learning of Transfer Rules for Machine Translation of Low-Density Languages Katharina Probst April 5, 2002.

BIOI 7791 Projects in bioinformatics Spring 2005 March 22 © Kevin B. Cohen.

The LC-STAR project (IST ) Objectives: Track I (duration 2 years) Specification and creation of large word lists and lexica suited for flexible.

Overview of Search Engines

Research methods in corpus linguistics Xiaofei Lu.

Statistical Natural Language Processing. What is NLP?  Natural Language Processing (NLP), or Computational Linguistics, is concerned with theoretical.

Albert Gatt Corpora and Statistical Methods Lecture 9.

Advanced Language Technologies Information and Communication Technologies Module "Knowledge Technologies" Jožef Stefan International Postgraduate School.

An Automatic Segmentation Method Combined with Length Descending and String Frequency Statistics for Chinese Shaohua Jiang, Yanzhong Dang Institute of.

Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2005 Lecture 1 21 July 2005.

Corpus linguistics for translators Amanda Saksida University of Nova Gorica.

Part II. Statistical NLP Advanced Artificial Intelligence Applications of HMMs and PCFGs in NLP Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme.

Some Advances in Transformation-Based Part of Speech Tagging

BTANT 129 w5 Introduction to corpus linguistics. BTANT 129 w5 Corpus The old school concept – A collection of texts especially if complete and self-contained:

Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.

Bilingual term extraction revisited: Comparing statistical and linguistic methods for a new pair of languages Špela Vintar Faculty of Arts Dept. of Translation.

Survey of Semantic Annotation Platforms

1 Corpora: Annotating and Searching LING 5200 Computational Corpus Linguistics Martha Palmer.

Standards for digital encoding Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž.

인공지능 연구실 정 성 원 Part-of-Speech Tagging. 2 The beginning The task of labeling (or tagging) each word in a sentence with its appropriate part of speech.

Language Data Resources About Corpora. J. Sinclair: “Language looks rather different when you look at a lot of it at once.“ P. Eisner: “Znáte jej, ten.

27/03/01CROSSMARC kick-off meeting LTG Background XML-based Processing –Several years of experience in developing XML-based software –LT XML Tools –Pipeline.

Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop Nizar Habash and Owen Rambow Center for Computational Learning.

Using a Lemmatizer to Support the Development and Validation of the Greek WordNet Harry Kornilakis 1, Maria Grigoriadou 1, Eleni Galiotou 1,2, Evangelos.

Introduction to CL & NLP CMSC April 1, 2003.

Tracking Language Development with Learner Corpora Xiaofei Lu CALPER 2010 Summer Workshop July 12, 2010.

S1: Chapter 1 Mathematical Models Dr J Frost Last modified: 6 th September 2015.

Advanced Language Technologies Information and Communication Technologies Research Area "Knowledge Technologies" Jožef Stefan International Postgraduate.

Seminar in Applied Corpus Linguistics: Introduction APLNG 597A Xiaofei Lu August 26, 2009.

Improving Morphosyntactic Tagging of Slovene by Tagger Combination Jan Rupnik Miha Grčar Tomaž Erjavec Jožef Stefan Institute.

Tokenization & POS-Tagging

CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging I Introduction Tagsets Approaches.

CSKGOI'08 Commonsense Knowledge and Goal Oriented Interfaces.

Auckland 2012Kilgarriff: NLP and Corpus Processing1 The contribution of NLP: corpus processing.

For Friday Finish chapter 23 Homework –Chapter 23, exercise 15.

Part-of-speech tagging

Shallow Parsing for South Asian Languages -Himanshu Agrawal.

Corpus Linguistics MOHAMMAD ALIPOUR ISLAMIC AZAD UNIVERSITY, AHVAZ BRANCH.

Word classes and part of speech tagging. Slide 1 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches.

Overview of Statistical NLP IR Group Meeting March 7, 2006.

NSF PARTNERSHIP FOR RESEARCH AND EDUCATION : M EANING R EPRESENTATION FOR S TATISTICAL L ANGUAGE P ROCESSING 1 TectoMT TectoMT = highly modular software.

Part-of-Speech Tagging CSCI-GA.2590 – Lecture 4 Ralph Grishman NYU.

What is a Corpus? What is not a corpus?  the Web  collection of citations  a text Definition of a corpus “A corpus is a collection of pieces of language.

Part-Of-Speech Tagging Radhika Mamidi. POS tagging Tagging means automatic assignment of descriptors, or tags, to input tokens. Example: “Computational.

Information Retrieval in Practice

Language Identification and Part-of-Speech Tagging

Computational and Statistical Methods for Corpus Analysis: Overview

Natural Language Processing (NLP)

MULTEXT-East Version 4: multilingual morphosyntactic specifications for lots of languages Tomaž Erjavec Department of Knowledge.

Computer Corpora and their annotation

Natural Language Processing (NLP)

TECHNICAL REPORTS WRITING

Presentation transcript:

Advanced Language Technologies Information and Communication Technologies Module "Knowledge Technologies" Jožef Stefan International Postgraduate School Winter 2010 / Spring 2011 Jožef Stefan International Postgraduate School Tomaž Erjavec Lecture II. Computer Corpora

Overview of the lecture 1. Background 2. Corpus compilation and markup 3. Morphosyntactic tagging

Background What is a corpus? Using corpora Characteristics of a corpus Typology of corpora History Slovene language corpora

A corpus is: a large collection of texts a large collection of texts in digital format in digital format language “as it is” language “as it is” a sample of the language it is meant to represent a sample of the language it is meant to represent used for describing language (descriptive/empirical linguistics) used for describing language (descriptive/empirical linguistics)

A more precise definition Corpus (plural corpora) is Latin for body Corpus (plural corpora) is Latin for body Guidelines of the Expert Advisory Group on Language Engineering Standards, EAGLES: Guidelines of the Expert Advisory Group on Language Engineering Standards, EAGLES:EAGLES –Corpus : A collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language. Corpus Corpus –Computer corpus : a corpus which is encoded in a standardised and homogeneous way for open-ended retrieval tasks. Its constituent pieces of language are documented as to their origins and provenance. Computer corpus Computer corpus For computer scientists: a dataset For computer scientists: a dataset

Using corpora Applied linguistics: Applied linguistics: –Lexicography: making dictionaries (first users of corpora) –Translation studies: translation equivalents with contexts translation memories, machine aided translations –Language learning: real-life examples, curriculum development Corpus linguistics: Corpus linguistics: –linguistics based not on introspection, but on observation of real data Language technology: Language technology: –testing set for developed methods; –training set for inductive learning (statistical Natural Language Processing) statistical Natural Language Processingstatistical Natural Language Processing

Characteristics of a (good) corpus Quantity: the bigger, the better Quantity: the bigger, the better Quality : the texts are authentic; the mark-up is validated Quality : the texts are authentic; the mark-up is validated Simplicity: the computer representation is understandable, with the markup easily separated from the text Simplicity: the computer representation is understandable, with the markup easily separated from the text Documented: the corpus contains bibliographic and other meta- data Documented: the corpus contains bibliographic and other meta- data

Typology of corpora I. Medium: Medium: –written language –spoken language (spoken, but in writing / transcription) –speech corpora (actual speech signal) Content: Content: –reference corpora (representative), e.g. BNC BNC –sub-language corpora (specialised), e.g. COLT COLT Structure: Structure: –corpora with integral texts –corpora or of text samples (historical and legal reasons) e.g. Brown Brown

Typology of corpora II Time: Time: –static corpora –monitor corpora (language change) Languages: Languages: –monolingual corpora –multilingual parallel corpora (e.g. Hansard, Europarl, JRC Acquis) Hansard EuroparlJRC AcquisHansard EuroparlJRC Acquis –multilingual comparable corpora Annotation: Annotation: –plain text corpora –annotated corpora

Reference corpora Characteristics: Characteristics: –a sample of the “complete” language –large, expensive, detailed and explicit design criteria –typically of contemporary language –documented and annotated –legaly clean, available (but usu. only via a concordancer) Criteria for including texts: Criteria for including texts: –representativeness: corpus includes “all” text types –balance: the sizes of text type samples are in proportion to their “importance” for the speakers of the language metodhodology v.s. practical constraints metodhodology v.s. practical constraints

History of corpora First milestones: Brown (1 million words) 1964; LOB (also 1M) 1974 First milestones: Brown (1 million words) 1964; LOB (also 1M) 1974 BrownLOB BrownLOB The spread of reference corpora: Cobuild Bank of English (monitor, M) 1980; BNC (100M) 1995; Czech CNC (100M) 1998; Croatian HNK (100M) The spread of reference corpora: Cobuild Bank of English (monitor, M) 1980; BNC (100M) 1995; Czech CNC (100M) 1998; Croatian HNK (100M) BNCCNCHNKBNCCNCHNK Slovene reference corpora: FIDA (100M), Nova Beseda (100M...) 1998; FIDA+ (600M) 2006; gigaFIDA (2011?). Slovene reference corpora: FIDA (100M), Nova Beseda (100M...) 1998; FIDA+ (600M) 2006; gigaFIDA (2011?).FIDANova BesedaFIDA+FIDANova BesedaFIDA+ EU corpus oriented projects in the '90: NERC, MULTEXT-East,... EU corpus oriented projects in the '90: NERC, MULTEXT-East,... MULTEXT-East Language resources brokers: LDC 1992, ELRA 1995 Language resources brokers: LDC 1992, ELRA 1995LDCELRALDCELRA Web as Corpus (2000..): ukWaC, itWaC, … slWaC Web as Corpus (2000..): ukWaC, itWaC, … slWaC more, larger, for more languages, with diverse annotations: EUROPARL, PDT, … more, larger, for more languages, with diverse annotations: EUROPARL, PDT, …

Slovene language corpora Monolingual reference corpora: ZRC SAZU: Beseda, 1998; Nova beseda, ZRC SAZU: Beseda, 1998; Nova beseda, 2000-BesedaNova besedaBesedaNova beseda DZS, Amebis, FF, IJS: FIDA, 1998, FidaPlus, 2006 DZS, Amebis, FF, IJS: FIDA, 1998, FidaPlus, 2006FIDAFidaPlusFIDAFidaPlus IJS, FF: JOS corpora IJS, FF: JOS corporaJOS Parallel corpora: IJS: MULTEXT-East 1998-, SVEZ-IJS, 2004, JRC-ACQUIS, 2006 IJS: MULTEXT-East 1998-, SVEZ-IJS, 2004, JRC-ACQUIS, 2006MULTEXT-EastSVEZ-IJSJRC-ACQUISMULTEXT-EastSVEZ-IJSJRC-ACQUIS SVEZ: EuroKorpus SVEZ: EuroKorpus FF: TRANS, 2002 FF: TRANS, 2002TRANS Speech corpora: Laboratory for Digital Signal Processing, University of Maribor: SpeechDat, ONOMASTICA... Laboratory for Digital Signal Processing, University of Maribor: SpeechDat, ONOMASTICA... Laboratory for Digital Signal Processing, University of Maribor Laboratory for Digital Signal Processing, University of Maribor Laboratory of Articifical Perception, Systems and Cybernetics, University of Ljubljana: SQEL, GOPOLIS,... Laboratory of Articifical Perception, Systems and Cybernetics, University of Ljubljana: SQEL, GOPOLIS,... Laboratory of Articifical Perception, Systems and Cybernetics, University of Ljubljana Laboratory of Articifical Perception, Systems and Cybernetics, University of Ljubljana

II. Compilation and markup of corpora Steps in the preparation of a corpus What annotation can be added to the text Computer coding of corpora Markup Methods

Before making your own corpus check if an appropriate corpus is already available google google LDC, ELRA LDC, ELRA LDCELRA LDCELRA

Steps in the preparation of a corpus 1. Choosing the component texts and acquiring digital originals 2. Up-translation to standard format 3. Linguistic annotation 4. Documentation 5. Use and Dissemination

Getting the text 1. Choosing the component texts: linguistic and non-linguistic criteria; availability; simplicity; size 2. Copyright sensitivity of source (financial and privacy considerations); agreement with providers; usage, publication 3. Acquiring digital originals OCR; digital originals; Web BootCat BootCat

Processing 1. Conversion to common format consistency; character set encodings; structure Web as Corpus: Wacky tools Web as Corpus: Wacky tools 2. Documentation e.g. TEI header; Open Archives etc. 3. Linguistic annotation language dependent methods; errors

Use and dissemination Using the corpus: Using the corpus: –concordancer (linguists) e.g FidaPLUS, SKE, iKorpus, JOS, IMP FidaPLUSSKEiKorpusFidaPLUSSKEiKorpus –statistics extraction –development of new methods for analysis Dissemination: Dissemination: –legalities (source copyright, corpus use agreement) –mode: concordancer or dataset

Computer coding of corpora Encoding must ensure Encoding must ensure –durability –interchange between computer platforms –interchange between applications Basic standard: XML Basic standard: XMLXML –companion standards: W3C Schema, ISO Relax NG, XSLT, XPath, XQuery,... XML vocabulary of annotations of arbitrary texts: Text Encoding Initiative, TEI XML vocabulary of annotations of arbitrary texts: Text Encoding Initiative, TEITEI ISO TC 37 „Terminology and other language resources“: many standards for text encoding ISO TC 37 „Terminology and other language resources“: many standards for text encoding

Corpus annotation Annotation = interpretation Documentation about the corpus (example) Documentation about the corpus (example)example Document structure (example) Document structure (example)example Basic linguistic markup: sentences, words (example), punctuation, abbreviations (example) Basic linguistic markup: sentences, words (example), punctuation, abbreviations (example)example Lemmas and morphosyntactic descriptions (example) Lemmas and morphosyntactic descriptions (example)example Syntax (example) Syntax (example)example Alignment (example) Alignment (example)example Terms, semantics, anaphora, pragmatics, intonation,... Terms, semantics, anaphora, pragmatics, intonation,...

Example: TEI header Ekonomsko ogledalo; 13 številk 98/99 Ekonomsko ogledalo; 13 številk 98/99 Slovenian Economic Mirror; 13 issues, 98/99 Slovenian Economic Mirror; 13 issues, 98/99 Andrej Skubic, FF Andrej Skubic, FF Zagotovitev digitalnega originala, poravnava Zagotovitev digitalnega originala, poravnava Provision of digital original, alignment Provision of digital original, alignment Tomaž Erjavec, IJS Tomaž Erjavec, IJS Tokenizacija, pretvorba v TEI Tokenizacija, pretvorba v TEI Tokenisation, conversion to TEI Tokenisation, conversion to TEI......

Example: text structure Tam pod kostanjevim drevesom Tam pod kostanjevim drevesom izdala si me, izdala si me, izdal sem te, izdal sem te, ne da bi trenila z očesom. ne da bi trenila z očesom. Trije možje se niso niti ganili. Trije možje se niso niti ganili. Toda ko je Winston znova pogledal v Rutherfordov propadli obraz, je opazil, da so njegove oči polne solz.... Toda ko je Winston znova pogledal v Rutherfordov propadli obraz, je opazil, da so njegove oči polne solz....

Example: morphosyntactic tagging Bil Bil je je jasen, jasen, mrzel mrzel aprilski aprilski dan dan in in ure ure so so bile bile trinajst. trinajst. </s>

Example: alignment

Methods for linguistic markup hand annotation: documentation, first steps generic (XML, spreadsheet) editors or specialised editors hand annotation: documentation, first steps generic (XML, spreadsheet) editors or specialised editors semi-automatic: morphosyntactic and other linguistic annotation cyclic approach: machine, hand, validate, correct, machine,... semi-automatic: morphosyntactic and other linguistic annotation cyclic approach: machine, hand, validate, correct, machine,... machine, with hand-written rules: tokenisation regular expression machine, with hand-written rules: tokenisation regular expression machine, with inductively built models from annotated data: "supervised learning"; HMMs, decision trees, inductive logic programming,... machine, with inductively built models from annotated data: "supervised learning"; HMMs, decision trees, inductive logic programming,... machine, with inductively built models from un-annotated data: "unsupervised leaning"; clustering techniques machine, with inductively built models from un-annotated data: "unsupervised leaning"; clustering techniques overview of the field overview of the field overview of the field overview of the field

III. Morphosyntactic tagging Better known as part-of-speech (PoS) tagging Better known as part-of-speech (PoS) tagging Tagging is the task of labeling each word in a sequence of words with its appropriate part-of-speech Words are often ambiguous with respect to their POS: – –saw → singular noun „I brought a saw“ – –saw → past tense of verb „I saw a tree“ Purposes and applications (examples): – –pre-processing step for further analyses: lemmatisation syntactic structure, etc. – –text indexing, e.g. nouns are more useful than verbs – –pronunciation in speech processing

Steps in tagging for each word token in text the tagger needs to know all its possible tags (ambiguity class) → a morphological lexicon for each word token in text the tagger needs to know all its possible tags (ambiguity class) → a morphological lexicon given the context in which the word appears in, the tagger must decide in the correct tag: given the context in which the word appears in, the tagger must decide in the correct tag: –he saw/V a man carrying a saw/N so, tagging performs limited syntactic disambiguation so, tagging performs limited syntactic disambiguation

Example: Penn Treebank Under/IN the/DT proposal/NN,/, Delmed/NNP would/MD issue/VB about/IN 123.5/CD million/CD additional/JJ Delmed/NNP common/JJ shares/NNS to/TO Fresenius/NNP at/IN an/DT average/JJ price/NN of/IN about/IN 65/CD cents/NNS a/DT share/NN,/, though/IN under/IN no/DT circumstances/NNS more/JJR than/IN 75/CD cents/NNS a/DT share/NN./.

PoS taggers Most taggers induce the language model from a hand-annotated corpus Most taggers induce the language model from a hand-annotated corpus Typically, two resources are induced: Typically, two resources are induced: –lexicon, giving the ambiguity class of a word and their frequencies in the training corpus –tag n-grams

Tagging with Markov Models Sequence of tags in a text is regarded a Markov chain Limited horizon: A word’s tag only depends on the previous tag: p(x i+1 = t j | x 1,..., x i ) = p (x i+1 = t j | x i ) Time invariant: This dependency does not change over time: p(x i+1 = t j | x i ) = p(x 2 = t j | x 1 ) Task: Find the most probable tag sequence for a sequence of words Maximum likelihood estimate of tag t k following t j : p(t k | t j ) = f(t j,t k ) / f(t j ) Optimal tags for a sentence: t´ 1,n = arg max p(t 1,n | w 1,n ) = Π p(w i | t i ) p(t i | ti -1 )

Most popular Markov model tagger TnT (Trigrams ‘n Tags) TnT (Trigrams ‘n Tags) TnT induces lexicon and tag trigrams from the training corpus induces lexicon and tag trigrams from the training corpus has heuristics to tag unknown words has heuristics to tag unknown words has no problem with large tagsets has no problem with large tagsets fast in training and tagging fast in training and tagging freely available for non-commercial use freely available for non-commercial use but only as a Linux executable but only as a Linux executable OS alternative: hunpos OS alternative: hunposhunpos

Yet another Tagger For a while, trying out new approaches to tagging was in fashion Maximum Entropy taggers Maximum Entropy taggers Support Vector Machine taggers Support Vector Machine taggers Memory based taggers Memory based taggers …

Tagsets A tagset is a set of part-of-speech tags A tagset is a set of part-of-speech tags Classical 8 classes (Thrax, 100 BC): noun, verb, article, participle, pronoun, preposition, adverb, conjunction But all tagset use more tags than that! Criteria: Criteria: – –specifiability: degree to which humans use the tagset uniformly on the same text – –accuracy: evaluation of output on tagged text – –suitability for intended application

Tagsets for English For English, there exist several tagsets: Brown, CLAWS, Penn, … For English, there exist several tagsets: Brown, CLAWS, Penn, … English tagsets include PoS + some other morphological (inflectional) properties: tags English tagsets include PoS + some other morphological (inflectional) properties: tags Penn Treebank Tagset for English: 37 tags, e.g. Penn Treebank Tagset for English: 37 tags, e.g. – –JJ adjective, positive – –JJR adjective, comparative – –JJS adjective, superlative – –NN non-plural common noun – –NNS plural common noun – –NNP non-plural proper name – –NNPS plural proper name – –IN preposition –…

Morphosyntactic tagsets For inflectionaly rich languages (such as Slavic languages), tagsets contain much more information than just PoS For inflectionaly rich languages (such as Slavic languages), tagsets contain much more information than just PoS Slovene, Czech, etc. > 1,000 different morphosyntactic tags Slovene, Czech, etc. > 1,000 different morphosyntactic tags –gender, number, case, animacy, definiteness, … Efforts to standardise tagsets across languages: Efforts to standardise tagsets across languages: –Eagles –MULTEXT –MULTEXT-East

MULTEXT-East EU project in ’90s: development of language resources for Central and East-European languages EU project in ’90s: development of language resources for Central and East-European languages Several later releases, V4 in 2010 (17 languages) Several later releases, V4 in 2010 (17 languages) Development of morphosyntactic specifications, lexica and annotated corpus Development of morphosyntactic specifications, lexica and annotated corpus Parallel annotated corpus: Orwell’s 1984 Parallel annotated corpus: Orwell’s 1984 Orwell’s 1984 Orwell’s 1984 Web site: Web site:

MULTEXT-East morphosyntactic specifications Specify Specify –what morphosyntactic features particular languages distinguish, –what their names and values are, –how they can be mapped to –how they can be mapped to tags (morphosyntactic descriptions, MSDs) e.g. that e.g. that Ncms is: – –a valid for Slovene – –is equivalent to PoS:Noun, Type:common, Gender:masculine, Number:singular

JOS project JOS language resources are meant to facilitate developments of human language technologies and corpus linguistics for the Slovene language JOS language resources are meant to facilitate developments of human language technologies and corpus linguistics for the Slovene language Morphosyntactic specifications Morphosyntactic specifications Two annotated corpora (morphosyntactic descriptions and lemmas) Two annotated corpora (morphosyntactic descriptions and lemmas) –jos100k (hand validated) –jos1M (partially hand validated) Sampled from FidaPLUS corpus Sampled from FidaPLUS corpus jos100k: syntactic and semantic levels of linguistic description jos100k: syntactic and semantic levels of linguistic description Two web services Two web services –concordancer –text annotation tool Encoded in TEI P5 Encoded in TEI P5 Freely available (CC): Freely available (CC):

jos100k encoding To To je je <w xml:id="F " lemma="turističen„ <w xml:id="F " lemma="turističen„ msd="Ppnmein">turističen msd="Ppnmein">turističen kraj kraj.. </s>

Processing Historical Language interesting for diachronic linguistics and better access to digital libraries interesting for diachronic linguistics and better access to digital libraries problems: problems: –difficult to obtain good transcriptions –great variation in spelling –no resouces for tool training Historical slv: Late standardisation (XIX ≠ XX) Late standardisation (XIX ≠ XX) Before 1850: ſ ſh s sh z zh → s š z ž c č Before 1850: ſ ſh s sh z zh → s š z ž c č No corpora/lexica of historical Slovene No corpora/lexica of historical Slovene

AHLib (2004–08) Deutsch-slowenische/kroatische Übersetzung 1848–1918 AHLib (2004–08) Deutsch-slowenische/kroatische Übersetzung 1848–1918 –Scans + correction + (lemmatisation) of ger → slv books –AAS & Karl-Franzens University, Graz (prof. Erich Prunč) –JSI: correction & lemmatisation environment EU IP IMPACT (ext. 2010–2011) EU IP IMPACT (ext. 2010–2011) –Better OCR for historical texts –NUK: GTD transcriptions –JSI: (semi)manual lexicon construction Google award (2011) Developing language models for historical Slovene Google award (2011) Developing language models for historical Slovene –ZRC SAZU: transcriptions of old texts –JSI: annotating a corpus of XIX th century Slovene Background Tomaž Erjavec: Annotating Historical Slovene42

Producing the IMP corpus Representative & balanced, sampled Representative & balanced, sampled Corpus element: unbroken & contiguous text from 1 page Corpus element: unbroken & contiguous text from 1 page Sampled by decade & text Sampled by decade & text Target size: 1,000 pages (~200,000 words) Target size: 1,000 pages (~200,000 words) Encoded in TEI P5 Encoded in TEI P5 Automatically annotated Automatically annotated Tool for manual annotation: IMPACT INL Cobalt Tool for manual annotation: IMPACT INL Cobalt Annotator training & management: May Annotator training & management: May Manual correction: June–November Manual correction: June–November Tomaž Erjavec: Annotating Historical Slovene43

Annotation tool Approach: Modernise, then process as contemporary language Modernise, then process as contemporary language Language independent (trainable) modules Language independent (trainable) modulesSteps: 1. Tokenisation (mlToken) 2. Transcription (Vaam) 3. Tagging (TnT ) 4. Lemmatisation (CLOG) = ToTrTaLe Pipeline in Perl Pipeline in Perl TEI P5 I/O TEI P5 I/O Tomaž Erjavec: Annotating Historical Slovene44

Conclusions What is a corpus What is a corpus How to make it How to make it How to annotate it How to annotate it Case studies: MULTEXT-East, JOS, IMP Case studies: MULTEXT-East, JOS, IMP