The FIDA & MULTEXT-East language resources Tomaž Erjavec Department of Knowledge Technologies Jožef Stefan Institute, Ljubljana

Presentation transcript:

The FIDA & MULTEXT-East language resources
Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
Gralis 2006, Institut für Slawistik der Universität Graz

Gralis Tomaž Erjavec Dept. of Knowledge Technologies, Jožef Stefan Institute
Overview
1. Background
2. FIDA: a reference corpus of Slovene
3. MULTEXT-East: morphosyntactic resources for Central and East European languages
4. Other language resources for Slovene

Language Resources
LRs comprise three layers of data:
–corpora: mono- or multilingual, reference or specialised, … (variously annotated)
–lexica: vocabularies, morphosyntactic, syntactic, semantic, (ontologies)
–standards: linguistic and technical encoding
LRs, esp. corpora, are used for empirical language research:
–linguistic studies: (annotated) corpus + (sophisticated) search engine
–human language technology R&D: testing and training datasets

Part I. The FIDA corpus
Slovene reference corpus for linguistic studies

FIDA
Joint project ( ) of:
–Filozofska fakulteta: Vojko Gorjanc, Marko Stabej, Špela Vintar
–Institut Jožef Stefan: Tomaž Erjavec
–DZS: Simon Krek
–Amebis: Peter Holozan, Miro Romih
Financed by industry partners

Characteristics of FIDA
monolingual
synchronic
written language
reference
–representative
–balanced
annotated

Sizes
Total: 103,513,072 words, 29,177 texts
Avg. text length: 3,548 words
Largest texts: Leksikon DZS: 508,370 words; 69 texts >
Smallest texts: < 100 words; 2 x rezgrtshdrghgth4
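The average text length on this slide follows directly from the two totals. A minimal sketch of the arithmetic, using only the figures given above (the function name `average_text_length` is a hypothetical helper, not part of any FIDA tooling):

```python
def average_text_length(total_words: int, total_texts: int) -> float:
    """Return the mean number of words per text."""
    return total_words / total_texts


# Figures quoted on the "Sizes" slide.
FIDA_WORDS = 103_513_072
FIDA_TEXTS = 29_177

average = average_text_length(FIDA_WORDS, FIDA_TEXTS)
print(round(average))  # rounded to whole words, as on the slide
```

Rounding the quotient to whole words reproduces the 3,548 words reported on the slide.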

Time Composition
Oldest/most recent text: 1989/2000
Average date
Texts/Words with unknown date: 3.94%/8.28%

FIDA taxonomy: publication types
…
Ft.P.P.O (published) 95.72%
Ft.P.P.O.K (books) 22.71%
Ft.P.P.O.P (periodicals) 70.50%
Ft.P.P.O.P.C (newspaper) 46.59%
Ft.P.P.O.P.C.D (daily) 32.67%
Ft.P.P.O.P.C.T (weekly) 66.18%
Ft.P.P.O.P.C.V (multi-weekly) 17.74%
…

FIDA taxonomy: text types
Ft.Z (text type) 99.47%
Ft.Z.N (non-fiction) 93.57%
Ft.Z.N.N (non-professional) 75.14%
Ft.Z.N.S (professional) 18.37%
Ft.Z.N.S.H (hum. & soc. sci.) 10.57%
Ft.Z.N.S.N (nat. & tech. sci.) 6.04%
Ft.Z.U (fiction) 5.90%
Ft.Z.U.D (drama) 0.10%
Ft.Z.U.P (poetry) 0.17%
Ft.Z.U.R (prose) 5.12%

Markup of FIDA
corpus elements annotated with metadata (bibliographic, taxonomy)
text linguistically annotated
encoded according to international standards and recommendations
–technical: SGML, TEI P3
–linguistic: MULTEXT-East (MULTEXT, EAGLES)

Linguistic annotation

Accessibility
Exploitation by partners:
–DZS: new dictionaries
–Amebis: development of HLT
–Arts faculty: teaching
–IJS: research on HLT
Availability to the public:
–access via concordance engine by Amebis
–free access, but displays only a few hits
–possibility of academic licences
FIDA (web site) no longer maintained!

FIDA+
FIDA Plus project:
–Filozofska fakulteta, Fakulteta za družbene vede, Institut Jožef Stefan
–DZS, Amebis
Financed by the ministry + ind. partners
Extend the corpus with
–Web materials
–spoken component
Better linguistic markup
Free concordances: up to 100 lines
Also possibility of licences

Concordancer

Output

Extended searches

Corpus “Nova Beseda”
being developed at the Institute for Slovene Language, ZRC SAZU (Primož Jakopin)
Web concordancer with no hit limit
now larger than FIDA
but much less varied: fiction, Delo, DZ
not linguistically annotated

Part II. MULTEXT-East
multilingual morphosyntactic resources for HLT development

MULTEXT-East resources
MULTEXT-East: Copernicus Joint Project COP 106 ( ) Multilingual Texts and Corpora for Eastern and Central European Languages
Based on the results of EU MULTEXT (~West)
To produce a harmonised BLARK for six languages:
–corpus encoding standardisation (TEI / CES)
–multilingual parallel, comparable, speech corpora
–morphosyntactic specifications (EAGLES / MULTEXT)
–(inflectional) lexicon
–annotated corpus
–language processing tools

History of MULTEXT-East resources
First release 1998 on TELRI CD-ROM Vol II: already extended with new languages
Resources since 1998 available on the Web:
Second release 2002 in scope of EU CONCEDE: re-encoding in XML/TEI, harmonisation
Third release 2004: merge of first two releases, further languages
Work (indirectly) supported by: TELRI, CONCEDE, NSF grant, bi-lateral projects

The Languages of MULTEXT-East
Germanic: English
Romance: Romanian
Baltic:
–Latvian
–Lithuanian
Finno-Ugric:
–Estonian
–Hungarian
Slavic:
–Russian (East Slavic)
–Czech (West Slavic)
–Slovene (South West Slavic)
–Resian (Slovene dialect)
–Croatian (South West Slavic)
–Serbian (South West Slavic)
–Bulgarian (South East Slavic)
In progress:
–Macedonian
–Persian

Version 3
Available on
Some parts completely free, others free for research → Web licence
Web page gives:
–extensive documentation
–bibliography list
–web licence form
–resource download

The MULTEXT morphosyntactic trinity
1. MULTEXT-East morphosyntactic specifications
2. MULTEXT-East morphosyntactic lexica
3. MULTEXT-East morphosyntactically annotated "1984" corpus

1. Morphosyntactic specifications
Based on EAGLES / MULTEXT
Define PoS, their attributes and values
The specs are a document containing:
–introduction
–common tables
–language particular sections
Written in LaTeX → PDF & HTML
Derived XML/TEI encoding as feature structures
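Because the specifications define each PoS together with its attributes and values, a morphosyntactic description (MSD) can be read position by position: the first character names the PoS and each following character gives the value of one attribute. The sketch below illustrates this for nouns only, assuming an MSD such as `Ncfsn` (a value that occurs in the Slovene lexicon). The table is deliberately partial, and the names `NOUN_ATTRIBUTES` and `decode_noun_msd` are hypothetical, not part of the MULTEXT-East distribution.

```python
# Partial, illustrative attribute table for nouns (assumption: it covers only
# the first four noun attributes and a handful of values; the real
# specifications define many more categories, attributes and values).
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_msd(msd: str) -> dict:
    """Expand a noun MSD into attribute/value pairs, position by position."""
    if not msd or msd[0] != "N":
        raise ValueError(f"not a noun MSD: {msd!r}")
    features = {"PoS": "Noun"}
    # zip stops at the shorter sequence, so positions beyond this partial
    # table are simply ignored.
    for code, (name, values) in zip(msd[1:], NOUN_ATTRIBUTES):
        if code == "-":  # a hyphen marks an attribute that does not apply
            continue
        features[name] = values[code]
    return features


print(decode_noun_msd("Ncfsn"))
```

A table-driven decoder keeps the language-particular knowledge in data rather than in code, which mirrors how the specifications separate common tables from language-particular sections.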

Example common table

Example language specific section
table (shows only categories actually used)
notes
combinations
lexicon
for Slovene (FIDA): localisation of category names

Morphosyntactic Complexity

2. The lexica
Medium size morphosyntactic lexica
Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian.
~ all word-forms of cca lemmas
Lexical entry is composed of three fields:
–the word-form: the inflected form of the word
–the lemma: the base-form of the word
–the morphosyntactic description (MSD)

Example: Slovene lexicon
abeced abeceda Ncfdg
abeced abeceda Ncfpg
abeceda = Ncfsn
abecedah abeceda Ncfdl
abecedah abeceda Ncfpl
abecedam abeceda Ncfpd
abecedama abeceda Ncfdd
abecedama abeceda Ncfdi
abecedami abeceda Ncfpi
abecede abeceda Ncfpa
abecede abeceda Ncfpn
abecede abeceda Ncfsg
abecedi abeceda Ncfda
abecedi abeceda Ncfdn
…
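The three-field entries above are straightforward to process. The following is a minimal sketch, assuming whitespace-separated fields and reading `=` in the lemma field as "the lemma is identical to the word-form"; the names `LexEntry`, `parse_lexicon_line` and `msds_by_word_form` are hypothetical helpers, and the sample is copied from the excerpt above.

```python
from collections import defaultdict
from typing import NamedTuple


class LexEntry(NamedTuple):
    word_form: str  # the inflected form
    lemma: str      # the base form
    msd: str        # the morphosyntactic description


def parse_lexicon_line(line: str) -> LexEntry:
    """Split one lexicon line into its three fields."""
    word_form, lemma, msd = line.split()
    if lemma == "=":  # assumption: "=" stands for "same as the word-form"
        lemma = word_form
    return LexEntry(word_form, lemma, msd)


def msds_by_word_form(entries):
    """Group MSDs by word-form to expose ambiguous forms."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.word_form].append(entry.msd)
    return dict(grouped)


SAMPLE = """abeced abeceda Ncfdg
abeced abeceda Ncfpg
abeceda = Ncfsn
abecedah abeceda Ncfdl
abecedah abeceda Ncfpl"""

entries = [parse_lexicon_line(line) for line in SAMPLE.splitlines()]
print(msds_by_word_form(entries))
```

Grouping by word-form shows why context disambiguation is needed in the annotated corpus: a single form such as abeced is compatible with more than one MSD.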

Lexicon sizes

3. The “1984” corpus
Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr,…))
Structurally annotated
Sentence aligned with English
Words annotated with lemma and MSD
Encoded in TEI P4 (XML)

Example linguistic encoding
Bil je jasen, mrzel aprilski dan in ure so bile trinajst. … (It was a bright cold day in April, and the clocks were striking thirteen.)
Sentence alignment & context-disambiguated lemmas and MSDs

Quantifying the corpus

Utility of MULTEXT-East LRs
Specifications became, for some, the “national” standard
Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
A base dataset for further annotation and experiments:
–Word-sense disambiguation
–WordNet development and evaluation
–Syntactic parser induction
Teaching aid in HLT courses
~ 100 registered users
As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian

JSI
Also ours: VAYNA, GORE, sloWNet
Contributors to: FIDA, DSI, FDV, JRC-ACQUIS

Overview of Slovene LRs and Slovenian Language Technologies Society

Thank you!