MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. Tomaž Erjavec, Department of Knowledge Technologies, Jožef Stefan Institute.

Presentation transcript:

MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora
Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
LREC 2010, Malta

Overview
Specifications (comprehensive): define features and MSD tagsets
  Ncmsn ≡ [Noun, Type=common, Gender=masculine, Number=singular, Case=nominative]
Lexicons (medium sized): wordform/lemma/MSD triplets
  abstinent abstinent Ncmsn
Corpora (small): partly annotated and sentence aligned
  <w xml:id="Osl.1.5.25.8.4" lemma="abstinent" ana="#Ncmsn">abstinent</w>
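The MSD notation on this slide is positional: the first character names the category and each following character gives the value of one attribute. A minimal Python sketch of the expansion is shown here. The lookup table is a toy subset built only from the single example above (the real specifications define many more categories and values), and the function name `decode_msd` is hypothetical.

```python
# Toy positional table for nouns, derived only from the slide's example.
# ASSUMPTION: this is an illustrative subset, not the actual specification tables.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common"}),
    ("Gender", {"m": "masculine"}),
    ("Number", {"s": "singular"}),
    ("Case", {"n": "nominative"}),
]
CATEGORIES = {"N": ("Noun", NOUN_ATTRIBUTES)}


def decode_msd(msd):
    """Expand a positional MSD string into a (category, features) pair."""
    category, attributes = CATEGORIES[msd[0]]
    features = {}
    for (name, values), code in zip(attributes, msd[1:]):
        if code == "-":
            # A hyphen marks a position whose attribute does not apply.
            continue
        features[name] = values[code]
    return category, features


print(decode_msd("Ncmsn"))
```

A dictionary per position keeps the sketch close to the tabular form of the specifications; an unknown code raises `KeyError`, which is a simple way to flag MSDs that fall outside the table.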

Motivation
Interoperability for multilingual applications: tagsets developed for various languages (or even for the same language) have no connection with each other and are often poorly documented.
BLARK best practice: many languages do not yet have a morphosyntactic tagset and associated resources, and could benefit from an operational framework in which to model them.
Erjavec: MULTEXT-East Version 4

Background
EAGLES: Expert Advisory Group on Language Engineering Standards (1993-1996)
MULTEXT: Multilingual Text Tools and Corpora (1995)
MULTEXT-East: MULTEXT for Central and Eastern European Languages
  Version 1: TELRI edition (1998)
  Version 2: Concede edition (2002)
  Version 3: TEI edition (2004)
  Version 4: MondiLex edition (2010)

Multilingual Morphosyntactic Specifications, Lexicons and Corpora
Slavic languages:
  Polish (West Slavic)
  Czech (West Slavic)
  Slovak (West Slavic)
  Slovene (South West Slavic)
  Resian (dialect of Slovene)
  Croatian (South West Slavic)
  Serbian (South West Slavic)
  Russian (East Slavic)
  Ukrainian (East Slavic)
  Macedonian (South East Slavic)
  Bulgarian (South East Slavic)
Other languages: English, Romanian, Estonian, Hungarian, Persian
Legend: added in V4 / updated in V4

MULTEXT-East morphosyntactic specifications in Version 4
Encoded in XML TEI P5 (in Version 3: LaTeX)
In form they still follow the original MULTEXT specs but add many extensions:
  localisation of feature names and MSDs
  language-specific MSDs: Vm-----d → Vmd
XSLT scripts:
  for adding new languages (consistency checking)
  for HTML display
  for creating tabular files of various mappings
→ HTML and tabular files are part of the distribution

Common tables (HTML)

Language particular tables

MSD tag lists

Related work
Vocabularies of linguistic features:
  GOLD, http://linguistics-ontology.org/
  ISO TC 37 / LMF / isoCat: http://www.isocat.org/
…connecting MULTEXT-East features with isoCat and GOLD

MULTEXT-East lexica
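The Overview slide describes lexicon entries as wordform/lemma/MSD triplets. A small sketch of reading such entries into an index keyed by wordform is given here. It assumes one entry per line with whitespace-separated fields, as in the slide's example; the layout of the distributed lexicon files may differ, and the names `Entry`, `parse_lexicon` and `index_by_wordform` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    wordform: str
    lemma: str
    msd: str


def parse_lexicon(lines):
    """Parse lines of the form 'wordform lemma MSD' into Entry tuples.

    ASSUMPTION: fields are separated by whitespace, one entry per line.
    """
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        entries.append(Entry(*fields))
    return entries


def index_by_wordform(entries):
    """Map each wordform to all of its (lemma, MSD) readings."""
    index = defaultdict(list)
    for entry in entries:
        index[entry.wordform].append((entry.lemma, entry.msd))
    return dict(index)


sample = ["abstinent abstinent Ncmsn"]
print(index_by_wordform(parse_lexicon(sample)))
```

Keeping a list of readings per wordform reflects that one wordform can be morphosyntactically ambiguous, which is the case a tagger has to resolve.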

MULTEXT-East corpora in V4: XML TEI P5
small parallel corpus of spoken texts, taken from the EUROM-1 speech corpus
comparable corpus (2 x 100,000 words):
  fiction
  newspaper articles
parallel corpus: Orwell's "1984"

tagged with morphosyntactic descriptions and lemmas
sentence aligned
a nice (if small) dataset for various experiments
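In the annotated corpus each token carries its lemma and MSD as attributes, as in the `<w>` element on the Overview slide. The sketch below reads one such token with Python's standard `xml.etree.ElementTree`. It works on the bare fragment from the slide; it assumes no TEI namespace on the element, which real corpus files would most likely declare, and `read_token` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def read_token(xml_fragment):
    """Return id, form, lemma and MSD of a single <w> element.

    ASSUMPTION: the fragment is a standalone <w> element without a TEI namespace.
    """
    w = ET.fromstring(xml_fragment)
    return {
        "id": w.get(XML_ID),
        "form": w.text,
        "lemma": w.get("lemma"),
        # The ana attribute points to the MSD definition, e.g. "#Ncmsn".
        "msd": w.get("ana", "").lstrip("#"),
    }


token = read_token('<w xml:id="Osl.1.5.25.8.4" lemma="abstinent" ana="#Ncmsn">abstinent</w>')
print(token)
```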

Distribution
http://nl.ijs.si/ME/V4
Documentation, browsing and download
Specifications & speech corpus: Creative Commons BY-SA
Lexica and text corpora: freely available for research use (after filling out a web agreement form)

Further work
Correct mistakes
Other East European languages
Add missing resources for current languages
Relation to standards (isoCat)
Unify (Slavic) features
Western European languages?

Conclusions
Presented MULTEXT-East V4
Covers most Slavic languages
Resources uniformly encoded in XML TEI P5
As freely available as possible
Up to V3, over a hundred registered users; hopefully many more to come

Acknowledgements
Adam Radziszewski, Aleksandar Petrovski, Anna Feldman, Behrang QasemiZadeh, Csaba Oravecz, Cvetana Krstev, Dagmar Divjak, Igor Shevchenko, Ivan Derzhanski, Katerina Čundeva, Marcin Woliński, Mikhail Kopotev, Natalia Kotsyba, Radovan Garabík, Serge Sharoff
EU FP7 Capacities - Research Infrastructures project MONDILEX "Conceptual Modelling of Networking of Centres for High-Quality Research in Slavic Lexicography and Their Digital Resources"