MULTEXT-East Version 4: multilingual morphosyntactic specifications for lots of languages
Tomaž Erjavec, http://nl.ijs.si/et/
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
Dublin, April 3rd, 2009

Overview of the talk
  Part-of-speech tagging, tagsets and interoperability
  MULTEXT(-East) morphosyntactic specifications
  Languages, formats, transformations
  An application: JOS resources for Slovene
  Conclusions
Erjavec: MULTEXT-East Version 4, Dublin

Part-of-speech tagging
  The task of assigning the correct PoS tag to each word in a running text, e.g.
    Under/IN the/DT proposal/NN ,/, Delmed/NNP would/MD issue/VB about/IN 123.5/CD million/CD additional/JJ Delmed/NNP common/JJ shares/NNS to/TO Fresenius/NNP …
  Important HLT infrastructure
  Very useful annotations for linguists
  Some applications:
    pre-processing step for further analyses: lemmas, syntactic structure, etc.
    text indexing, e.g. nouns are more useful than verbs

Methods of PoS tagging
  PoS tagging:
    determine the ambiguity class of the word (saw → NN | VBD)
    disambiguate to the correct tag in (local) context (“I saw/VBD a saw/NN”)
  Tagger training:
    manually annotated corpus: source of probabilities for tags given a (local) context
    + (lexicon: gives possible tags for each word-form)
  Popular taggers: TnT (HMM tagger), TreeTagger (decision trees), TBL (transformation-based tagging)
  Tagging usefulness as well as accuracy crucially depends on the tagset
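
The two tagging steps (ambiguity class lookup, then disambiguation in local context) can be sketched as below. This is a toy illustration only: the lexicon, the scores and the function names are invented for the example, and the greedy left-to-right choice is a simplification, not how TnT, TreeTagger or TBL actually work.

```python
# Toy two-step tagger: (1) look up the ambiguity class, (2) pick a tag in local context.
# LEXICON and BIGRAM_SCORE are hypothetical, hand-made values for illustration.
LEXICON = {
    "i": {"PRP"},
    "saw": {"NN", "VBD"},
    "a": {"DT"},
}

# Score of seeing tag `cur` right after tag `prev` (higher is better).
BIGRAM_SCORE = {
    ("<s>", "PRP"): 1.0,
    ("PRP", "VBD"): 0.9,
    ("PRP", "NN"): 0.1,
    ("VBD", "DT"): 0.8,
    ("DT", "NN"): 0.9,
    ("DT", "VBD"): 0.01,
}

def ambiguity_class(word):
    """Step 1: all tags the lexicon allows for this word-form."""
    return LEXICON.get(word.lower(), {"NN"})  # assumed fallback for unknown words

def tag(words):
    """Step 2: greedy left-to-right disambiguation using the previous tag."""
    prev, result = "<s>", []
    for word in words:
        candidates = sorted(ambiguity_class(word))
        best = max(candidates, key=lambda t: BIGRAM_SCORE.get((prev, t), 0.0))
        result.append((word, best))
        prev = best
    return result

tagged = tag(["I", "saw", "a", "saw"])
print(tagged)
```

In a trained tagger the scores would be estimated from the manually annotated corpus rather than written by hand.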

English tagsets
  Tagging first developed for English (Brown, CLAWS, PTB tagsets)
  English is an inflectionally very poor language → small tagsets, ~50 different tags
  Tags are typically “synthetic”, i.e. the tag does not transparently map to features, e.g.:
    to/TO (PoS?)
    Delmed/NNP (number?)
    shares/NNS (number?)

Tagsets for other languages
  will often have many more morphosyntactic features associated with a word, so tagsets will be larger
  e.g. Slovene nouns:
    type: common, proper
    gender: masculine, feminine, neuter
    number: singular, dual, plural
    case: nom., gen., dat., acc., loc., ins.
    (animacy: yes, no)
    = 104 “PoS” tags just for nouns
  Russian, Czech, Slovene ~ word-level syntactic tags
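
To get a feel for why such tagsets grow quickly, the sketch below enumerates the raw cross-product of the Slovene noun features listed on the slide, leaving animacy out. It gives only an upper bound on that cross-product. The slide's figure of 104 is taken as given: it depends on which combinations the specification actually admits and on how animacy is counted, neither of which this sketch models.

```python
from itertools import product

# Feature values as listed on the slide (animacy left out on purpose).
NOUN_FEATURES = {
    "Type": ["common", "proper"],
    "Gender": ["masculine", "feminine", "neuter"],
    "Number": ["singular", "dual", "plural"],
    "Case": ["nominative", "genitive", "dative",
             "accusative", "locative", "instrumental"],
}

# Every combination of one value per attribute.
combinations = list(product(*NOUN_FEATURES.values()))
upper_bound = len(combinations)  # 2 * 3 * 3 * 6
print(upper_bound)
```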

PoS tags vs. MSDs
  PoS tags:
    used in corpora
    for corpus annotations / tagging
    typically synthetic
  Morphosyntactic Descriptions (MSDs):
    used in inflectional lexica
    for lexical annotations / morphological analysis
    typically analytic
  Relation of PoS tagsets to MSD tagsets/features:
    in general: |PoS| < |MSD|
    but in most MULTEXT-East languages: [PoS] ≡ [MSD]

Developing a multilingual morphosyntactic framework
  Interoperability: tagsets developed for various languages (or even for the same language) have no connection with each other and are often poorly documented
  Best practice: languages that do not yet have a tagset could benefit from an operational framework in which to model it

so, wouldn’t it be nice to have:
  an open, standardised, documented, flexible model for MSD/PoS tagset design,
  that would be instantiated for lots of languages,
  and could be simply applied to any language?

EU standardisation efforts
  EAGLES: Expert Advisory Group on Language Engineering Standards ( )
  MULTEXT: Multilingual Text Tools and Corpora (1995)
  MULTEXT-East: MULTEXT for Central and Eastern European Languages:
    Version 1: TELRI edition (1998)
    Version 2: Concede edition (2002)
    Version 3: TEI edition (2004)
    Version 4: MondiLex edition (2009?)
  ...
  ISO / TC 37 / LMF / ISOcat (2008)

MULTEXT-East morphosyntactic resources
  Basic Language Resource Kit:
    specifications: define features and MSDs
    lexica (~15,000 lemmas): triplets: word-form / lemma / MSD
    parallel corpus: MSD and lemma annotated
  Freely available for research

1984: aligned and annotated

MULTEXT-East languages

The MULTEXT(-East) morphosyntactic specifications
  They specify that e.g. “Ncmsn”
    corresponds to the feature-structure [Noun, Type=common, Gender=masculine, Number=singular, Case=nominative]
    is a valid MSD for Slovene
  Specifications consist of:
    Front matter
    Common part: common definitions for all languages (features)
    Language particular parts: particulars for each language (MSD set)

V4 specs draft in HTML

Specifications in Version 4
  Encoded in XML / teiLite (in Version 3: LaTeX)
    TEI = Text Encoding Initiative Guidelines P4
  Still “book-like” in form, to make authoring easier
  XSLT into other formats:
    HTML
    tabular mapping formats (e.g. MSD to features)
    XML/TEI feature library
    (OWL)

The common specifications
  Define categories (“parts-of-speech”)
  For each category define features, i.e. attributes and their values
  For each attribute-value specify for which languages it is appropriate
  Give positional mapping to MSDs:
    each attribute assigned a position
    each attribute-value assigned a one-character code
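
The positional mapping can be illustrated with a small decoder. The tables below are a partial, hand-written assumption covering only the noun codes that can be read off the examples in this talk (Ncmsn, Ncmsg, Ncmsd). They are not the real common tables, and `decode_msd` is a hypothetical helper name.

```python
# Positional decoding of an MSD: the first character is the category,
# each following position is one attribute, each character one value code.
CATEGORIES = {"N": "Noun"}

# (attribute, {code: value}) in positional order; partial, for illustration only.
NOUN_POSITIONS = [
    ("Type", {"c": "common"}),
    ("Gender", {"m": "masculine"}),
    ("Number", {"s": "singular"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative"}),
]

def decode_msd(msd):
    """Map e.g. 'Ncmsn' to (category, {attribute: value}).

    A '-' in a position means the attribute does not apply and is skipped.
    """
    category = CATEGORIES[msd[0]]
    features = {}
    for code, (attribute, values) in zip(msd[1:], NOUN_POSITIONS):
        if code != "-":
            features[attribute] = values[code]
    return category, features

category, features = decode_msd("Ncmsn")
print(category, features)
```

Keeping the tables as data rather than code mirrors the idea that the specifications, not the software, define what each position and code means.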

Common table (HTML)

Common table (source XML/teiLite)

Language particular sections
  Recap the feature definitions for the language
  Add “combinations”, i.e. feature co-occurrence restrictions
  Add “lexicon”, i.e. list of all valid MSDs for the language
  Possibly localise the features and codes
  Possibly give notes and examples
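
One practical use of the “lexicon” list of valid MSDs is checking annotated data against it. A minimal sketch, assuming the list has already been loaded into a set: `VALID_MSDS` here holds only the three MSDs shown in this talk, and `invalid_msds` is a hypothetical helper, not part of the MULTEXT-East distribution.

```python
# VALID_MSDS stands in for the "lexicon" section of a language particular part.
VALID_MSDS = {"Ncmsn", "Ncmsg", "Ncmsd"}

def invalid_msds(tags):
    """Return the tags, in order of first appearance, that are not valid MSDs."""
    seen, bad = set(), []
    for t in tags:
        if t not in VALID_MSDS and t not in seen:
            seen.add(t)
            bad.append(t)
    return bad

# "Ncmsx" is a made-up MSD used to trigger the check.
problems = invalid_msds(["Ncmsn", "Ncmsx", "Ncmsd", "Ncmsx"])
print(problems)
```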

Combinations

Lexicon

JOS: Jezikoslovno označevanje slovenščine (Linguistic Annotation of Slovene), http://nl.ijs.si/jos

JOS as a bridge to MULTEXT-East Version 4
  [Diagram; node labels:]
    FidaPLUS corpus
    MTE V3 slv specifications
    JOS corpora
    JOS (slv) specifications
    MTE V4 specifications
    MTE V4 (slv) specifications

JOS specifications
  XML/teiLite + XSLT transforms
  Allow reordering of attribute positions (Vm-----d → Vmd)
  i18n / slv+eng:
    translation: specifications
    localisation: attributes, values, codes
    localisation: TEI element names
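
The effect of reordering attribute positions (Vm-----d → Vmd) can be sketched as follows. The particular reordering passed in is an assumption chosen to reproduce the slide's example, and dropping trailing “-” fillers is assumed from that same example. The actual JOS position order is defined in its specifications, not here.

```python
def reorder_msd(msd, new_order):
    """Reorder the attribute positions of an MSD and drop trailing '-' fillers.

    msd[0] is the category code; new_order lists the old attribute positions
    (1-based) in their new order.
    """
    category, codes = msd[0], msd[1:]
    reordered = "".join(codes[i - 1] for i in new_order)
    return category + reordered.rstrip("-")

# Hypothetical reordering: move the 7th attribute directly after the 1st.
short = reorder_msd("Vm-----d", [1, 7, 2, 3, 4, 5, 6])
print(short)
```

Moving frequently used attributes towards the front makes the resulting MSDs shorter and easier to read.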

MSD conversion tables
  Tabular UTF-8 files:
    MSD-slv to -eng
    MSD to features
    Collating sequence
  e.g.
    01N Somei Ncmsn
    01N Somer Ncmsg
    01N Somed Ncmsd

    Ncmsn Noun Type=common Gender=masculine Number=singular Case=nominative Animacy=0
    Ncmsg Noun Type=common Gender=masculine Number=singular Case=genitive Animacy=0
    Ncmsd Noun Type=common Gender=masculine Number=singular Case=dative Animacy=0
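
A tabular MSD-to-features file of this shape is easy to load into a lookup table. The sketch below assumes whitespace-separated columns (the real files may use tabs) and uses only the three rows shown in the talk. `parse_feature_row` is a hypothetical helper name.

```python
def parse_feature_row(line):
    """Parse 'Ncmsn Noun Type=common ...' into (msd, category, {attribute: value})."""
    msd, category, *pairs = line.split()
    features = dict(pair.split("=", 1) for pair in pairs)
    return msd, category, features

# The three example rows from the slide.
rows = """\
Ncmsn Noun Type=common Gender=masculine Number=singular Case=nominative Animacy=0
Ncmsg Noun Type=common Gender=masculine Number=singular Case=genitive Animacy=0
Ncmsd Noun Type=common Gender=masculine Number=singular Case=dative Animacy=0
"""

# Build an MSD -> (category, features) lookup table.
table = {}
for line in rows.splitlines():
    msd, category, features = parse_feature_row(line)
    table[msd] = (category, features)

print(table["Ncmsg"])
```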

Adding a new language
  XSLT scripts:
    mtems-split.xsl: make a template for the language particular section of a new language
    mtems-merge: merge a new language particular section into the common tables
  May shortly be tested on new Slavic languages within the scope of MondiLex

Critiques
  It’s just an exercise in encoding anyway
  Same is different, different is same
  The Procrustean bed of standards
  Policy change: from unification to harmonisation (hippy school)

Conclusions
  Presented work-in-progress on “standardisation” of multilingual morphosyntactic specifications
  Specifications are a de-facto standard for several languages (Romanian, Slovene, Croatian)
  Could serve as a “hub” encoding for multilingual applications, e.g. MT, and as a framework for new languages

Further work
  Finishing MTE V4!
  Distribution: LDC, ELDA
  Relation to ISO-TC37 standards: LMF, ISOcat
  Connecting to GOLD ontology
  Adding new languages:
    Slavic completion
    Western European: MULTEXT
    Japanese: ChaSen tagset, jpWaC(-L2)
    Irish? ☺
