
MULTEXT-East Version 4: multilingual morphosyntactic specifications for lots of languages
Tomaž Erjavec, http://nl.ijs.si/et/
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
Dublin, April 3rd, 2009

Overview of the talk
- Part-of-speech tagging, tagsets and interoperability
- MULTEXT(-East) morphosyntactic specifications
- Languages, formats, transformations
- An application: JOS resources for Slovene
- Conclusions
Erjavec: MULTEXT-East Version 4 Dublin, 4.4.2009

Part-of-speech tagging
- The task of assigning the correct PoS tag to each word in a running text, e.g.
  Under/IN the/DT proposal/NN ,/, Delmed/NNP would/MD issue/VB about/IN 123.5/CD million/CD additional/JJ Delmed/NNP common/JJ shares/NNS to/TO Fresenius/NNP …
- Important HLT infrastructure
- Very useful annotations for linguists
- Some applications:
  - pre-processing step for further analyses: lemmas, syntactic structure, etc.
  - text indexing, e.g. nouns are more useful than verbs

Methods of PoS tagging
- PoS tagging:
  - determine the ambiguity class of a word (saw → NN | VBD)
  - disambiguate to the correct tag in (local) context (“I saw/VBD a saw/NN”)
- Tagger training:
  - manually annotated corpus: source of probabilities for tags given a (local) context
  - + (lexicon: gives possible tags for each word-form)
- Popular taggers: TnT (HMM tagger), TreeTagger (decision trees), TBL (transformation-based tagging)
- Tagging usefulness as well as accuracy crucially depends on the tagset
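
The two steps on this slide (look up the ambiguity class, then disambiguate in local context) can be sketched in a few lines of Python. This is only an illustration: the lexicon and the transition scores below are hand-set toy values, not estimated from any corpus, and the greedy left-to-right choice is a simplification, not the algorithm used by TnT, TreeTagger or TBL.

```python
# Toy lexicon: word-form -> ambiguity class (the possible tags for that form).
LEXICON = {
    "i": ["PRP"],
    "saw": ["NN", "VBD"],
    "a": ["DT"],
}

# Hand-set toy scores standing in for P(tag | previous tag).
# A real tagger would estimate these from a manually annotated corpus.
TRANSITIONS = {
    ("<s>", "PRP"): 0.9,
    ("PRP", "VBD"): 0.8,
    ("PRP", "NN"): 0.1,
    ("VBD", "DT"): 0.7,
    ("DT", "NN"): 0.9,
    ("DT", "VBD"): 0.01,
}


def tag(tokens):
    """Greedily pick, for each token, the tag that best follows the previous tag."""
    previous = "<s>"
    tagged = []
    for token in tokens:
        candidates = LEXICON[token.lower()]  # step 1: ambiguity class
        best = max(candidates, key=lambda t: TRANSITIONS.get((previous, t), 0.0))
        tagged.append((token, best))         # step 2: disambiguation in context
        previous = best
    return tagged


print(" ".join(f"{word}/{t}" for word, t in tag(["I", "saw", "a", "saw"])))
# prints: I/PRP saw/VBD a/DT saw/NN
```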

English tagsets
- Tagging was first developed for English (Brown, CLAWS, PTB tagsets)
- English is an inflectionally very poor language → small tagsets, ~50 different tags
- Tags are typically “synthetic”, i.e. the tag does not transparently map to features, e.g. to/TO (PoS?), Delmed/NNP (number?), shares/NNS (number?)

Tagsets for other languages
- Other languages will often have many more morphosyntactic features associated with a word, so tagsets will be larger
- e.g. Slovene nouns:
  - type: common, proper
  - gender: masculine, feminine, neuter
  - number: singular, dual, plural
  - case: nom., gen., dat., acc., loc., ins.
  - (animacy: yes, no)
  = 104 “PoS” tags just for nouns
- Russian, Czech, Slovene: ~1000-2000 word-level syntactic tags
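
To see why such tagsets grow quickly, the Python sketch below enumerates every combination of the four obligatory noun features. The one-letter codes are assumptions in the style of the "Ncmsn" tag shown later in the talk. The unrestricted product gives 108 candidates; the slide's figure of 104 presumably reflects which combinations are actually valid, plus the optional animacy feature, and cannot be recomputed from the feature list alone.

```python
from itertools import product

# Attribute -> one-character value codes (assumed codes, in positional order).
NOUN_VALUES = [
    ("Type", "cp"),        # common, proper
    ("Gender", "mfn"),     # masculine, feminine, neuter
    ("Number", "sdp"),     # singular, dual, plural
    ("Case", "ngdali"),    # nom., gen., dat., acc., loc., ins.
]


def candidate_noun_msds():
    """Return every combination of the noun features as a positional tag string."""
    code_sets = [codes for _, codes in NOUN_VALUES]
    return ["N" + "".join(combo) for combo in product(*code_sets)]


candidates = candidate_noun_msds()
print(len(candidates))   # 2 * 3 * 3 * 6 = 108 unrestricted combinations
print(candidates[0])     # Ncmsn
```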

PoS tags vs. MSDs
- PoS tags: used in corpora for corpus annotations / tagging; typically synthetic
- Morphosyntactic Descriptions (MSDs): used in inflectional lexica for lexical annotations / morphological analysis; typically analytic
- Relation of PoS tagsets to MSD tagsets/features:
  - in general: |PoS| < |MSD|
  - but in most MULTEXT-East languages: [PoS] ≡ [MSD]

Developing a multilingual morphosyntactic framework
- Interoperability: tagsets developed for various languages (or even for the same language) have no connection with each other and are often poorly documented
- Best practice: languages that do not yet have a tagset could benefit from an operational framework in which to model it

So, wouldn’t it be nice to have:
- an open, standardised, documented, flexible model for MSD/PoS tagset design,
- that would be instantiated for lots of languages,
- and could be simply applied to any language?

EU standardisation efforts
- EAGLES: Expert Advisory Group on Language Engineering Standards (1993-1996)
- MULTEXT: Multilingual Text Tools and Corpora (1995)
- MULTEXT-East: MULTEXT for Central and Eastern European Languages:
  - Version 1: TELRI edition (1998)
  - Version 2: Concede edition (2002)
  - Version 3: TEI edition (2004)
  - Version 4: MondiLex edition (2009?)
- ...
- ISO / TC 37 / LMF / isoCat (2008)

MULTEXT-East morphosyntactic resources
- Basic Language Resource Kit:
  - specifications: define features and MSDs
  - lexica (~15,000 lemmas): triplets of word-form / lemma / MSD
  - parallel corpus: MSD and lemma annotated
- Freely available for research: http://nl.ijs.si/ME/

1984: aligned and annotated

MULTEXT-East languages

The MULTEXT(-East) morphosyntactic specifications
- They specify that e.g. “Ncmsn”:
  - corresponds to the feature structure [Noun, Type=common, Gender=masculine, Number=singular, Case=nominative]
  - is a valid MSD for Slovene
- Specifications consist of:
  - Front matter
  - Common part: common definitions for all languages (features)
  - Language particular parts: particulars for each language (MSD set)

V4 specs draft in HTML

Specifications in Version 4
- Encoded in XML / teiLite (in Version 3: LaTeX)
- TEI = Text Encoding Initiative Guidelines P4
- Still “book-like” in form, to make authoring easier
- XSLT into other formats:
  - HTML
  - tabular mapping formats (e.g. MSD to features)
  - XML/TEI feature library (OWL)

The common specifications
- Define categories (“parts-of-speech”)
- For each category define features, i.e. attributes and their values
- For each attribute-value specify for which languages it is appropriate
- Give positional mapping to MSDs:
  - each attribute assigned a position
  - each attribute-value assigned a one-character code
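
The positional mapping can be illustrated by decoding an MSD string back into attribute-value pairs. The Python sketch below covers nouns only. The table is a hand-written assumption modelled on the "Ncmsn" example from the talk, not a copy of the actual common tables, and treating "-" as "attribute not applicable" is likewise assumed from the "Vm-----d" example.

```python
# Position -> (attribute, {code: value}) for nouns; illustrative subset only.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]

# First character of the MSD selects the category and its attribute table.
CATEGORIES = {"N": ("Noun", NOUN_ATTRIBUTES)}


def decode_msd(msd):
    """Expand a positional MSD string into a feature dictionary."""
    name, attributes = CATEGORIES[msd[0]]
    features = {"Category": name}
    for (attribute, codes), code in zip(attributes, msd[1:]):
        if code != "-":  # "-" marks an attribute that does not apply
            features[attribute] = codes[code]
    return features


print(decode_msd("Ncmsn"))
# {'Category': 'Noun', 'Type': 'common', 'Gender': 'masculine',
#  'Number': 'singular', 'Case': 'nominative'}
```

Because each attribute has a fixed position, the same code letter can be reused across attributes (here "n" is both neuter and nominative, "d" both dual and dative) without ambiguity.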

Common table (HTML)

Common table (source XML/teiLite)

Language particular sections
- Recap the feature definitions for the language
- Add “combinations”, i.e. feature co-occurrence restrictions
- Add “lexicon”, i.e. list of all valid MSDs for the language
- Possibly localise the features and codes
- Possibly give notes and examples
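
The "combinations" and "lexicon" parts suggest two complementary validity checks, sketched below in Python. Both are illustrative: the MSD list holds only the three tags quoted in this talk, and the co-occurrence rule (animacy may be set only for masculine nouns) is a hypothetical restriction chosen for the example, not one taken from the specifications.

```python
# "Lexicon": the list of valid MSDs for a language (tiny excerpt for illustration).
VALID_MSDS = {"Ncmsn", "Ncmsg", "Ncmsd"}


def in_lexicon(msd, valid_msds=VALID_MSDS):
    """Check an MSD against the enumerated list of valid MSDs."""
    return msd in valid_msds


def satisfies_noun_combination(msd):
    """Hypothetical co-occurrence restriction: animacy only with masculine gender.

    Assumed noun positions after the category letter:
    1 Type, 2 Gender, 3 Number, 4 Case, 5 Animacy.
    """
    padded = msd.ljust(6, "-")  # unspecified trailing attributes count as "-"
    gender, animacy = padded[2], padded[5]
    return animacy == "-" or gender == "m"


print(in_lexicon("Ncmsn"))                   # True
print(satisfies_noun_combination("Ncfsay"))  # False: animacy on a feminine noun
```

A rule-based check scales to unseen tags, while the enumerated list is the final authority on what a lexicon or corpus may contain.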

Combinations

Lexicon

JOS: Jezikoslovno označevanje slovenščine (Linguistic Annotation of Slovene) http://nl.ijs.si/jos

JOS as a bridge to MULTEXT-East Version 4
[Diagram; labels: FidaPLUS corpus, MTE V3 slv specifications, JOS corpora, JOS (slv) specifications, MTE V4 specifications, MTE V4 (slv) specifications]


JOS specifications
- XML/teiLite + XSLT transforms
- Allow reordering of attribute positions (Vm-----d → Vmd)
- i18n / slv+eng:
  - translation: specifications
  - localisation: attributes, values, codes
  - localisation: TEI element names
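
The effect of reordering attribute positions can be sketched as a permutation of the value codes followed by dropping trailing "-" placeholders. In the Python sketch below the permutation is a hypothetical one, chosen only so that it reproduces the slide's example; the real JOS ordering is defined in the specifications and applied by the XSLT transforms.

```python
def reorder_msd(msd, new_order):
    """Reorder the attribute positions of an MSD and trim trailing '-' placeholders.

    new_order[i] is the old (0-based) attribute position that moves to new position i.
    """
    category, values = msd[0], msd[1:]
    values = values.ljust(len(new_order), "-")  # pad attributes left unspecified
    reordered = "".join(values[old] for old in new_order)
    return category + reordered.rstrip("-")


# Hypothetical verb ordering: the 7th attribute moves up to 2nd place.
VERB_ORDER = [0, 6, 1, 2, 3, 4, 5]

print(reorder_msd("Vm-----d", VERB_ORDER))  # Vmd
```

Moving frequently specified attributes to the front is what makes the tags shorter, since only trailing placeholders can be omitted.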


MSD conversion tables
- Tabular UTF-8 files:
  - MSD-slv to -eng
  - MSD to features
  - Collating sequence
- e.g.

  01N0101010100  Somei  Ncmsn
  01N0101010200  Somer  Ncmsg
  01N0101010300  Somed  Ncmsd

  Ncmsn  Noun  Type=common  Gender=masculine  Number=singular  Case=nominative  Animacy=0
  Ncmsg  Noun  Type=common  Gender=masculine  Number=singular  Case=genitive    Animacy=0
  Ncmsd  Noun  Type=common  Gender=masculine  Number=singular  Case=dative      Animacy=0
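
Tables like these are easy to consume programmatically. The Python sketch below reads the two example tables from this slide. The column interpretation (collating key, Slovene MSD, English MSD) is inferred from the slide, and whitespace-separated columns are an assumption about the file layout.

```python
def parse_msd_mapping(lines):
    """Map each localised (Slovene) MSD to its English MSD."""
    mapping = {}
    for line in lines:
        _collating_key, msd_slv, msd_eng = line.split()
        mapping[msd_slv] = msd_eng
    return mapping


def parse_feature_row(line):
    """Split an 'MSD to features' row into (MSD, category, {attribute: value})."""
    msd, category, *pairs = line.split()
    return msd, category, dict(pair.split("=", 1) for pair in pairs)


SLV_TO_ENG = [
    "01N0101010100 Somei Ncmsn",
    "01N0101010200 Somer Ncmsg",
    "01N0101010300 Somed Ncmsd",
]

print(parse_msd_mapping(SLV_TO_ENG)["Somer"])  # Ncmsg
print(parse_feature_row(
    "Ncmsg Noun Type=common Gender=masculine Number=singular Case=genitive Animacy=0"
))
```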

Adding a new language
- XSLT scripts:
  - mtems-split.xsl: make a template for the language particular section of a new language
  - mtems-merge: merge a new language particular section into the common tables
- Maybe shortly to be tested on new Slavic languages in the scope of MondiLex

Critiques
- It’s just an exercise in encoding anyway
- Same is different, different is same
- The Procrustean bed of standards
- Policy change: from unification to harmonisation (hippy school)

Conclusions
- Presented work-in-progress on “standardisation” of multilingual morphosyntactic specifications
- Specifications are a de-facto standard for several languages (Romanian, Slovene, Croatian)
- Could serve as a “hub” encoding for multilingual applications, e.g. MT
- and as a framework for new languages

Further work
- Finishing MTE V4!
- Distribution: LDC, ELDA
- Relation to ISO-TC37 standards: LMF, isoCAT
- Connecting to GOLD ontology
- Adding new languages:
  - Slavic completion
  - Western European: MULTEXT
  - Japanese: chasen tagset, jpWaC(-L2)
  - Irish?☺