The FIDA & MULTEXT-East language resources Tomaž Erjavec Department of Knowledge Technologies Jožef Stefan Institute, Ljubljana

tomaz.erjavec@ijs.si, http://nl.ijs.si/et/
Gralis 2006, Institut für Slawistik der Universität Graz, 2006-05-09

Gralis 2006-05-09 Tomaž Erjavec Dept. of Knowledge Technologies, Jožef Stefan Institute
Overview
1. Background
2. FIDA: a reference corpus of Slovene
3. MULTEXT-East: morphosyntactic resources for Central and East European languages
4. Other language resources for Slovene

Language Resources
LRs comprise three layers of data:
–corpora: mono- or multilingual, reference or specialised, … /variously annotated/
–lexica: vocabularies, morphosyntactic, syntactic, semantic, (ontologies)
–standards: linguistic and technical encoding
LRs, esp. corpora, are used for empirical language research:
–linguistic studies: (annotated) corpus + (sophisticated) search engine
–human language technology R&D: testing and training datasets

Part I. The FIDA corpus
Slovene reference corpus for linguistic studies

FIDA http://www.fida.net/
Joint project (1997-2000) of:
–Filozofska fakulteta: Vojko Gorjanc, Marko Stabej, Špela Vintar
–Institut Jožef Stefan: Tomaž Erjavec
–DZS: Simon Krek
–Amebis: Peter Holozan, Miro Romih
Financed by the industry partners

Characteristics of FIDA
monolingual
synchronous
written language
reference
–representative
–balanced
annotated

Sizes
Total: 103,513,072 words, 29,177 texts
Avg. text length: 3,548 words
Largest texts: Leksikon DZS: 508,370 words; 69 texts > 100,000 words
Smallest texts: 2,648 texts < 100 words, e.g. 2 x rezgrtshdrghgth4
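The average text length follows from the two totals on this slide. A quick check, using only the figures given above (the variable names are illustrative):

```python
# Figures from the "Sizes" slide.
total_words = 103_513_072
total_texts = 29_177

# Mean number of words per text, and the share of the largest text.
avg_length = total_words / total_texts
largest_share = 508_370 / total_words

print(round(avg_length))          # rounded mean text length in words
print(f"{largest_share:.2%}")     # share of the corpus taken by Leksikon DZS
```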

Time Composition
Oldest/most recent text: 1989/2000
Average date: 1997-02
Texts/words with unknown date: 3.94%/8.28%

FIDA taxonomy: publication types
…
Ft.P.P.O (published) 95.72%
Ft.P.P.O.K (books) 22.71%
Ft.P.P.O.P (periodicals) 70.50%
Ft.P.P.O.P.C (newspaper) 46.59%
Ft.P.P.O.P.C.D (daily) 32.67%
Ft.P.P.O.P.C.T (weekly) 66.18%
Ft.P.P.O.P.C.V (multi-weekly) 17.74%
…

FIDA taxonomy: text types
Ft.Z (text type) 99.47%
Ft.Z.N (non-fiction) 93.57%
Ft.Z.N.N (non-professional) 75.14%
Ft.Z.N.S (professional) 18.37%
Ft.Z.N.S.H (hum. & soc. sci.) 10.57%
Ft.Z.N.S.N (nat. & tech. sci.) 6.04%
Ft.Z.U (fiction) 5.90%
Ft.Z.U.D (drama) 0.10%
Ft.Z.U.P (poetry) 0.17%
Ft.Z.U.R (prose) 5.12%
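The dotted taxonomy codes can be read as paths in a tree. The sketch below assumes that each code's parent is obtained by dropping its last component and that the text-type percentages are all shares of the whole corpus; the slide does not state either point, so both are assumptions. Under them, the code computes how much of each category is not covered by its listed subcategories.

```python
# Text-type shares (percent) copied from the slide.
shares = {
    "Ft.Z": 99.47, "Ft.Z.N": 93.57, "Ft.Z.N.N": 75.14, "Ft.Z.N.S": 18.37,
    "Ft.Z.N.S.H": 10.57, "Ft.Z.N.S.N": 6.04, "Ft.Z.U": 5.90,
    "Ft.Z.U.D": 0.10, "Ft.Z.U.P": 0.17, "Ft.Z.U.R": 5.12,
}

def parent(code: str) -> str:
    """Assumed parent of a code: the code minus its last dotted component."""
    return code.rsplit(".", 1)[0]

def children(code: str) -> list[str]:
    """Listed codes that sit directly below the given code."""
    return [c for c in shares if c != code and parent(c) == code]

def unassigned(code: str) -> float:
    """Share of a category not covered by its listed subcategories."""
    return round(shares[code] - sum(shares[c] for c in children(code)), 2)

for code in shares:
    if children(code):
        print(code, unassigned(code))
```

A non-zero remainder would then correspond to texts classified only at the parent level, or to subcategories not shown on the slide.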

Markup of FIDA
corpus elements annotated with metadata (bibliographic, taxonomy)
text linguistically annotated
encoded according to international standards and recommendations
–technical: SGML, TEI P3
–linguistic: MULTEXT-East (MULTEXT, EAGLES)

Linguistic annotation

Accessibility
Exploitation by partners:
–DZS: new dictionaries
–Amebis: development of HLT
–Arts faculty: teaching
–IJS: research on HLT
Availability to the public:
–access via concordance engine by Amebis
–free access, but displays only a few hits
–possibility of academic licences
FIDA (web site) no longer maintained!

FIDA+ http://www.fidaplus.net/
FIDA Plus project:
–Filozofska fakulteta, Fakulteta za družbene vede, Institut Jožef Stefan
–DZS, Amebis
Financed by the ministry + ind. partners
Extend the corpus with
–Web materials
–spoken component
Better linguistic markup
Free concordances: up to 100 lines
Also possibility of licences

Concordancer

Output

Extended searches

Corpus “Nova Beseda” http://bos.zrc-sazu.si/
being developed at the Institute for Slovene Language, ZRC SAZU (Primož Jakopin)
Web concordancer with no hit limit
now larger than FIDA
but much less varied: fiction, Delo, DZ
not linguistically annotated

Part II. MULTEXT-East
multilingual morphosyntactic resources for HLT development

MULTEXT-East resources
MULTEXT-East: Copernicus Joint Project COP 106 (1995-1997), Multilingual Texts and Corpora for Eastern and Central European Languages
Based on the results of EU MULTEXT (~West)
To produce a harmonised BLARK (Basic Language Resource Kit) for six languages:
–corpus encoding standardisation (TEI / CES)
–multilingual parallel, comparable, speech corpora
–morphosyntactic specifications (EAGLES / MULTEXT)
–(inflectional) lexicon
–annotated corpus
–language processing tools

History of MULTEXT-East resources
First release 1998 on TELRI CD-ROM Vol II: already extended with new languages
Resources available on the Web since 1998: http://nl.ijs.si/ME/
Second release 2002 in the scope of EU CONCEDE: re-encoding in XML/TEI, harmonisation
Third release 2004: merge of first two releases, further languages
Work (indirectly) supported by: TELRI, CONCEDE, NSF grant, bilateral projects

The Languages of MULTEXT-East
Germanic: English
Romance: Romanian
Baltic:
–Latvian
–Lithuanian
Finno-Ugric:
–Estonian
–Hungarian
Slavic:
–Russian (East Slavic)
–Czech (West Slavic)
–Slovene (South West Slavic)
–Resian (Slovene dialect)
–Croatian (South West Slavic)
–Serbian (South West Slavic)
–Bulgarian (South East Slavic)
In progress:
–Macedonian
–Persian

Version 3
Available on http://nl.ijs.si/ME/V3/
Some parts completely free, others free for research → Web licence
The Web page gives:
–extensive documentation
–bibliography list
–web licence form
–resource download

The MULTEXT morphosyntactic trinity
1. MULTEXT-East morphosyntactic specifications
2. MULTEXT-East morphosyntactic lexica
3. MULTEXT-East morphosyntactically annotated "1984" corpus

1. Morphosyntactic specifications
Based on EAGLES / MULTEXT
Define PoS, their attributes and values
The specs are a document containing:
–introduction
–common tables
–language particular sections
Written in LaTeX → PDF & HTML
Derived XML/TEI encoding as feature structures

Example common table

Example language specific section
table (shows only categories actually used)
notes
combinations
lexicon
for Slovene (FIDA): localisation of category names

Morphosyntactic Complexity

2. The lexica
Medium-size morphosyntactic lexica
Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian.
~ all word-forms of ca. 15,000 lemmas
A lexical entry is composed of three fields:
–the word-form: the inflected form of the word
–the lemma: the base-form of the word
–the morphosyntactic description (MSD)
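The three-field entry can be read with a few lines of Python. This is a minimal sketch, not the project's own tooling: it assumes the fields are separated by whitespace, and it treats a lemma of "=" as shorthand for "same as the word-form", as in the entry "abeceda = Ncfsn" from the Slovene lexicon. The names `Entry` and `parse_entry` are illustrative.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    word_form: str  # inflected form
    lemma: str      # base form
    msd: str        # morphosyntactic description

def parse_entry(line: str) -> Entry:
    """Split one lexicon line into its three fields (whitespace assumed)."""
    word_form, lemma, msd = line.split()
    if lemma == "=":  # lemma identical to the word-form
        lemma = word_form
    return Entry(word_form, lemma, msd)

# Sample lines taken from the Slovene lexicon example in this presentation.
sample = """abeced abeceda Ncfdg
abeceda = Ncfsn
abecedah abeceda Ncfpl"""

entries = [parse_entry(line) for line in sample.splitlines()]
for entry in entries:
    print(entry)
```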

Example: Slovene lexicon
abeced abeceda Ncfdg
abeced abeceda Ncfpg
abeceda = Ncfsn
abecedah abeceda Ncfdl
abecedah abeceda Ncfpl
abecedam abeceda Ncfpd
abecedama abeceda Ncfdd
abecedama abeceda Ncfdi
abecedami abeceda Ncfpi
abecede abeceda Ncfpa
abecede abeceda Ncfpn
abecede abeceda Ncfsg
abecedi abeceda Ncfda
abecedi abeceda Ncfdn
…
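An MSD such as Ncfsn is a positional code: the first letter is the part of speech, and each following position holds the value of one attribute. The sketch below decodes noun MSDs only. The attribute order and value letters (type, gender, number, case) are written from memory of the MULTEXT-East specifications for Slovene nouns and should be checked against the published tables; the function name is illustrative.

```python
# Assumed positional layout for Slovene noun MSDs (after the leading "N").
NOUN_POSITIONS = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]

def decode_noun_msd(msd: str) -> dict[str, str]:
    """Expand a noun MSD into attribute-value pairs."""
    if not msd.startswith("N"):
        raise ValueError("this sketch only handles noun MSDs")
    features = {"category": "Noun"}
    for code, (attribute, values) in zip(msd[1:], NOUN_POSITIONS):
        if code != "-":  # "-" marks an attribute that does not apply
            features[attribute] = values[code]
    return features

print(decode_noun_msd("Ncfsn"))
print(decode_noun_msd("Ncfdi"))
```

Under these assumptions, the entry "abecedama abeceda Ncfdi" reads as a common feminine noun in the dual instrumental.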

Lexicon sizes

3. The “1984” corpus
Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr,…))
Structurally annotated
Sentence aligned with English
Words annotated with lemma and MSD
Encoded in TEI P4 (XML)
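Word-level annotation of this kind can be read with a standard XML parser. The fragment below is hypothetical and is not taken from the corpus: the element names `s`, `w` and `c` and the attributes `lemma` and `ana` are assumptions about a TEI-style encoding, and the single annotated word reuses an entry from the Slovene lexicon example.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style sentence; names and values are illustrative only.
fragment = """
<s id="demo.1">
  <w lemma="abeceda" ana="Ncfsn">abeceda</w>
  <c>.</c>
</s>
""".strip()

sentence = ET.fromstring(fragment)

# Collect (word-form, lemma, MSD) triples for every word element.
tokens = [(w.text, w.get("lemma"), w.get("ana")) for w in sentence.iter("w")]
print(tokens)
```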

Example linguistic encoding
Bil je jasen, mrzel aprilski dan in ure so bile trinajst. …
Sentence alignment & context-disambiguated lemmas and MSDs

Quantifying the corpus

Utility of MULTEXT-East LRs
Specifications became, for some, the “national” standard
Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
A base dataset for further annotation and experiments:
–Word-sense disambiguation
–WordNet development and evaluation
–Syntactic parser induction
Teaching aid in HLT courses
~ 100 registered users
As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian

LRs @ JSI http://nl.ijs.si/nl.html#Resource
Also ours: VAYNA, GORE, sloWNet
Contributors to: FIDA, DSI, FDV, JRC-ACQUIS

Overview of Slovene LRs and services @ Slovenian Language Technologies Society http://nl.ijs.si/sdjt/

Thank you!
