The FIDA & MULTEXT-East language resources
Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
Gralis 2006, Institut für Slawistik der Universität Graz

Overview
1. Background
2. FIDA: a reference corpus of Slovene
3. MULTEXT-East: morphosyntactic resources for Central and East European languages
4. Other language resources for Slovene

Language Resources
Language resources (LRs) comprise three layers of data:
  – corpora: mono- or multilingual, reference or specialised, …, variously annotated
  – lexica: vocabularies, morphosyntactic, syntactic, semantic, (ontologies)
  – standards: linguistic and technical encoding
LRs, especially corpora, are used for empirical language research:
  – linguistic studies: (annotated) corpus + (sophisticated) search engine
  – human language technology (HLT) R&D: testing and training datasets

Part I. The FIDA corpus
Slovene reference corpus for linguistic studies

FIDA
Joint project ( ) of:
  – Filozofska fakulteta: Vojko Gorjanc, Marko Stabej, Špela Vintar
  – Institut Jožef Stefan: Tomaž Erjavec
  – DZS: Simon Krek
  – Amebis: Peter Holozan, Miro Romih
Financed by the industry partners

Characteristics of FIDA
monolingual
synchronous
written language
reference
  – representative
  – balanced
annotated

Sizes
Total: 103,513,072 words, 29,177 texts
Avg. text length: 3,548 words
Largest texts:
  – Leksikon DZS: 508,370 words
  – 69 texts >
Smallest texts: < 100 words
  – 2 x rezgrtshdrghgth4

Time Composition
Oldest/most recent text: 1989/2000
Average date
Texts/Words with unknown date: 3.94%/8.28%

FIDA taxonomy: publication types
…
Ft.P.P.O        (published)      95.72%
Ft.P.P.O.K      (books)          22.71%
Ft.P.P.O.P      (periodicals)    70.50%
Ft.P.P.O.P.C    (newspaper)      46.59%
Ft.P.P.O.P.C.D  (daily)          32.67%
Ft.P.P.O.P.C.T  (weekly)         66.18%
Ft.P.P.O.P.C.V  (multi-weekly)   17.74%
…

FIDA taxonomy: text types
Ft.Z        (text type)            99.47%
Ft.Z.N      (non-fiction)          93.57%
Ft.Z.N.N    (non-professional)     75.14%
Ft.Z.N.S    (professional)         18.37%
Ft.Z.N.S.H  (hum. & soc. sci.)     10.57%
Ft.Z.N.S.N  (nat. & tech. sci.)     6.04%
Ft.Z.U      (fiction)               5.90%
Ft.Z.U.D    (drama)                 0.10%
Ft.Z.U.P    (poetry)                0.17%
Ft.Z.U.R    (prose)                 5.12%

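The dotted taxonomy codes encode a hierarchy: each code extends its parent by one component. As a minimal sketch (the function names `parent_of`, `children_of` and `unassigned_share` are hypothetical helpers, not part of any FIDA tooling), the text-type shares listed above can be loaded into a dictionary to see how much of each category is not covered by its listed subcategories:

```python
# Shares (percent of the corpus) copied from the text-type slide.
SHARES = {
    "Ft.Z": 99.47,
    "Ft.Z.N": 93.57,
    "Ft.Z.N.N": 75.14,
    "Ft.Z.N.S": 18.37,
    "Ft.Z.N.S.H": 10.57,
    "Ft.Z.N.S.N": 6.04,
    "Ft.Z.U": 5.90,
    "Ft.Z.U.D": 0.10,
    "Ft.Z.U.P": 0.17,
    "Ft.Z.U.R": 5.12,
}


def parent_of(code):
    """Return the code one level up, or None if there is no dot left."""
    head, sep, _ = code.rpartition(".")
    return head if sep else None


def children_of(code, shares=SHARES):
    """Return the listed codes whose direct parent is `code`."""
    return sorted(c for c in shares if parent_of(c) == code)


def unassigned_share(code, shares=SHARES):
    """Share of `code` that is not covered by its listed subcategories."""
    covered = sum(shares[c] for c in children_of(code, shares))
    return round(shares[code] - covered, 2)


if __name__ == "__main__":
    for code in SHARES:
        kids = children_of(code)
        if kids:
            print(code, kids, unassigned_share(code))
```

The remainder is only meaningful under the assumption that subcategories are mutually exclusive; the slides do not say whether a text can carry more than one code.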
Markup of FIDA
corpus elements annotated with metadata (bibliographic, taxonomy)
text linguistically annotated
encoded according to international standards and recommendations
  – technical: SGML, TEI P3
  – linguistic: MULTEXT-East (MULTEXT, EAGLES)

Linguistic annotation

Accessibility
Exploitation by partners:
  – DZS: new dictionaries
  – Amebis: development of HLT
  – Arts faculty: teaching
  – IJS: research on HLT
Availability to the public:
  – access via concordance engine by Amebis
  – free access, but displays only a few hits
  – possibility of academic licences
FIDA (web site) no longer maintained!

FIDA+
FIDA Plus project:
  – Filozofska fakulteta, Fakulteta za družbene vede, Institut Jožef Stefan
  – DZS, Amebis
Financed by the ministry and the industry partners
Extend the corpus with
  – Web materials
  – spoken component
Better linguistic markup
Free concordances: up to 100 lines
Licences also possible

Concordancer

Output

Extended searches

Corpus “Nova Beseda”
being developed at the Institute for Slovene language, ZRC SAZU (Primož Jakopin)
Web concordancer with no hit limit
now larger than FIDA
but much less varied: fiction, Delo, DZ
not linguistically annotated

Part II. MULTEXT-East
multilingual morphosyntactic resources for HLT development

MULTEXT-East resources
MULTEXT-East: Copernicus Joint Project COP 106 ( ), Multilingual Texts and Corpora for Eastern and Central European Languages
Based on the results of EU MULTEXT (~West)
To produce a harmonised BLARK (Basic Language Resource Kit) for six languages:
  – corpus encoding standardisation (TEI / CES)
  – multilingual parallel, comparable, speech corpora
  – morphosyntactic specifications (EAGLES / MULTEXT)
  – (inflectional) lexicon
  – annotated corpus
  – language processing tools

History of MULTEXT-East resources
First release 1998 on TELRI CD-ROM Vol II: already extended with new languages
Resources since 1998 available on the Web:
Second release 2002 in scope of EU CONCEDE: re-encoding in XML/TEI, harmonisation
Third release 2004: merge of first two releases, further languages
Work (indirectly) supported by: TELRI, CONCEDE, NSF grant, bilateral projects

The Languages of MULTEXT-East
Germanic: English
Romance: Romanian
Baltic:
  – Latvian
  – Lithuanian
Finno-Ugric:
  – Estonian
  – Hungarian
Slavic:
  – Russian (East Slavic)
  – Czech (West Slavic)
  – Slovene (South West Slavic)
  – Resian (Slovene dialect)
  – Croatian (South West Slavic)
  – Serbian (South West Slavic)
  – Bulgarian (South East Slavic)
In progress:
  – Macedonian
  – Persian

Version 3
Available on
Some parts completely free, others free for research under a Web licence
The web pages give:
  – extensive documentation
  – bibliography list
  – web licence form
  – resource download

The MULTEXT morphosyntactic trinity
1. MULTEXT-East morphosyntactic specifications
2. MULTEXT-East morphosyntactic lexica
3. MULTEXT-East morphosyntactically annotated "1984" corpus

1. Morphosyntactic specifications
Based on EAGLES / MULTEXT
Define the parts of speech (PoS), their attributes and values
The specs are a document containing:
  – introduction
  – common tables
  – language-particular sections
Written in LaTeX, converted to PDF & HTML
Derived XML/TEI encoding as feature structures

Example common table

Example language-specific section
table (shows only categories actually used)
notes
combinations
lexicon
for Slovene (FIDA): localisation of category names

Morphosyntactic Complexity

2. The lexica
Medium-size morphosyntactic lexica
Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian.
~ all word-forms of cca lemmas
A lexical entry is composed of three fields:
  – the word-form: the inflected form of the word
  – the lemma: the base form of the word
  – the morphosyntactic description (MSD)

Example: Slovene lexicon
abeced      abeceda  Ncfdg
abeced      abeceda  Ncfpg
abeceda     =        Ncfsn
abecedah    abeceda  Ncfdl
abecedah    abeceda  Ncfpl
abecedam    abeceda  Ncfpd
abecedama   abeceda  Ncfdd
abecedama   abeceda  Ncfdi
abecedami   abeceda  Ncfpi
abecede     abeceda  Ncfpa
abecede     abeceda  Ncfpn
abecede     abeceda  Ncfsg
abecedi     abeceda  Ncfda
abecedi     abeceda  Ncfdn
…

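The three-field entry format can be read with a few lines of code. The following is a minimal sketch, not the project's own tooling: it assumes whitespace-separated fields, treats "=" in the lemma field as "lemma identical to the word-form" (as the `abeceda = Ncfsn` line suggests), and decodes noun MSDs with a positional table (`NOUN_POSITIONS`, a hypothetical name) inferred from the sample entries; the authoritative attribute tables are in the morphosyntactic specifications.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    word_form: str
    lemma: str
    msd: str


def parse_entry(line):
    """Split one lexicon line into word-form, lemma and MSD."""
    word_form, lemma, msd = line.split()
    if lemma == "=":  # assumed shorthand: lemma equals the word-form
        lemma = word_form
    return Entry(word_form, lemma, msd)


# Assumed positional layout of noun MSDs, inferred from the sample entries.
NOUN_POSITIONS = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_msd(msd):
    """Expand a noun MSD such as 'Ncfsn' into named features."""
    if not msd.startswith("N"):
        raise ValueError(f"not a noun MSD: {msd!r}")
    features = {"pos": "noun"}
    for (name, values), code in zip(NOUN_POSITIONS, msd[1:]):
        if code != "-":  # '-' is assumed to mark a non-applicable attribute
            features[name] = values.get(code, code)
    return features


SAMPLE = """\
abeced abeceda Ncfdg
abeceda = Ncfsn
abecedama abeceda Ncfdi
"""

if __name__ == "__main__":
    for line in SAMPLE.splitlines():
        entry = parse_entry(line)
        print(entry, decode_noun_msd(entry.msd))
```

Keeping the MSD as an opaque string in `Entry` and decoding it separately mirrors the split between the lexicon and the specifications: the lexicon stores the code, the specifications give it meaning.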
Lexicon sizes

3. The “1984” corpus
Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr,…))
Structurally annotated
Sentence-aligned with English
Words annotated with lemma and MSD
Encoded in TEI P4 (XML)

Example linguistic encoding
Bil je jasen, mrzel aprilski dan in ure so bile trinajst. …
Sentence alignment & context-disambiguated lemmas and MSDs

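Word-level annotation of this kind is straightforward to read back with a standard XML parser. The sketch below is illustrative only: the element and attribute names (`s`, `w`, `lemma`, `ana`) are assumptions about how a TEI-style encoding might look, and the two sample tokens are hand-written, with noun MSDs following the pattern of the lexicon example, rather than copied from the corpus.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; element and attribute names are assumed, not quoted
# from the corpus files.
SAMPLE = """\
<s>
  <w lemma="dan" ana="Ncmsn">dan</w>
  <w lemma="ura" ana="Ncfpn">ure</w>
</s>
"""


def read_tokens(xml_text):
    """Return (word-form, lemma, MSD) triples for every <w> in a sentence."""
    sentence = ET.fromstring(xml_text)
    return [(w.text, w.get("lemma"), w.get("ana")) for w in sentence.iter("w")]


if __name__ == "__main__":
    for word_form, lemma, msd in read_tokens(SAMPLE):
        print(f"{word_form}\t{lemma}\t{msd}")
```

The triples have the same shape as the lexicon entries, which is what makes the corpus usable as training and testing data for taggers and lemmatizers.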
Quantifying the corpus

Utility of MULTEXT-East LRs
Specifications became, for some, the “national” standard
Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
A base dataset for further annotation and experiments:
  – Word-sense disambiguation
  – WordNet development and evaluation
  – Syntactic parser induction
Teaching aid in HLT courses
~ 100 registered users
As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian

JSI
Also ours: VAYNA, GORE, sloWNet
Contributors to: FIDA, DSI, FDV, JRC-ACQUIS

Overview of Slovene LRs and Slovenian Language Technologies Society

Thank you!