CLEF Workshop ECDL 2003 Trondheim, Michael Kluck
Slide 1: Introduction to the Monolingual and Domain-Specific Tasks of the Cross-Language Evaluation Forum 2003
Michael Kluck (Informationszentrum Sozialwissenschaften – IZ, Bonn/Berlin, Humboldt-University Berlin)
Slide 2: Monolingual Task
Languages:
– Dutch, Finnish, French, German, Italian, Spanish, Swedish
– New: Russian (with a reduced topic set, because of the time span of the data)
– Exclusion of English (widely used in TREC etc., overflow of runs; only for newcomers)
Aim:
– Building a starting point for CLIR
– Enlarging and balancing the pool
– Use of recently introduced or new languages in the CLEF campaign
Slide 3: Monolingual runs by 22 participants
Table columns: Lang. | Delivered runs | Judged runs | %
Table rows: DE, EN, ES, FI, FR, IT, NL, RU, SV (cell values not preserved in this transcript, apart from the trailing "18 100" on the SV row)
Slide 4: Domain-Specific Task
Amaryllis
– Could not be continued because of a lack of funding in France
– An attempt to obtain social science data from INIST failed
GIRT
– New, bigger corpus GIRT4 in German, drawn from social science literature and current research information
– Parallel corpus in English, although with less text than the German part
Slide 5: Features of GIRT4
Bigger than GIRT3, now: 302,638 documents
– 151,319 original German
– 151,319 translated into English
Pseudo-parallel corpus:
– Title, Controlled-Term, Classification-Text available in German and English for all documents
– Abstract available for 96 % of documents in German, but only for 15 % in English -> reduced amount of text for the English part
– Translated texts (Abstract) are in some cases the result of machine translation by SYSTRAN (EU)
– Renumbered
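The sizes of the two language parts and the abstract coverage listed above can be cross-checked with a little arithmetic. The following is a minimal sketch and not part of the original slides: the variable names are illustrative, and the abstract counts are only estimates, because the percentages on the slide are rounded.

```python
# Figures taken from the slide; all names here are illustrative.
DOCS_PER_LANGUAGE = 151_319  # German originals, and the same number of English translations
ABSTRACT_SHARE = {"DE": 0.96, "EN": 0.15}  # rounded share of documents that have an abstract

# The corpus total is simply the sum of the two equally sized language parts.
total_docs = 2 * DOCS_PER_LANGUAGE

# Estimated abstract counts per language part (estimates only: the shares are rounded).
approx_abstracts = {
    lang: round(DOCS_PER_LANGUAGE * share) for lang, share in ABSTRACT_SHARE.items()
}

print(total_docs)
print(approx_abstracts)
```

The gap between the two estimates illustrates why the English part offers far less running text for retrieval than the German part.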
Slide 6: Field Availability in GIRT4
Equal distribution for the German and English part:
– Title: 1 per doc
On average:
– Controlled-Terms: per doc (value not preserved in this transcript)
– Classification-Text: 2.02 per doc
Different distribution for the German and English part, on average:
– Method-Term: DE 2.35 per doc, EN 1.93 per doc
– Abstract: DE 0.96 per doc, EN 0.15 per doc
Slide 7: GIRT4 Tasks
Monolingual
– DE topics -> DE data
– EN topics -> EN data
Bilingual
– EN or RU topics -> DE data
– DE or RU topics -> EN data
Additional instruments
– German-English thesaurus
– German-Russian translation table (not fully up to date)
Concordance list of document numbers
– Will be available by the end of August 2003
Slide 8: Assessment of GIRT4
17,031 docs, +65 %
Started with the German part
Then identified the identical English documents (if they had been indicated as relevant hits)
Continued with those hits in the English part that had been indicated as relevant but had no counterpart in the German part
During assessment it became apparent that the search results in the two language parts were not fully congruent
– For a given topic, the result hits in the English part were not identical with those in the German part (assessed without knowing which hit belonged to which run)
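The assessment order described above can be sketched in code. This is one possible reading of the slide, not the procedure actually used by the assessors: the function name `carry_over`, the data structures, and the document identifiers (`DE-001`, `EN-001`, ...) are all hypothetical, and the concordance is assumed to map each German document number to its English counterpart.

```python
def carry_over(de_judgments, concordance, en_pool):
    """Reuse German judgments for English hits; return the hits still to be assessed.

    de_judgments: dict mapping German doc id -> True/False (judged relevant or not)
    concordance:  dict mapping German doc id -> English doc id (assumed one-to-one)
    en_pool:      set of English doc ids returned by the runs for the same topic
    """
    # Invert the concordance so that English ids can be looked up directly.
    en_to_de = {en_id: de_id for de_id, en_id in concordance.items()}

    carried = {}       # English hits whose German counterpart was already judged
    remaining = set()  # English hits without a judged German counterpart

    for en_id in en_pool:
        de_id = en_to_de.get(en_id)
        if de_id in de_judgments:
            carried[en_id] = de_judgments[de_id]
        else:
            remaining.add(en_id)
    return carried, remaining


# Hypothetical example data for a single topic.
de_judgments = {"DE-001": True, "DE-002": False}
concordance = {"DE-001": "EN-001", "DE-002": "EN-002", "DE-003": "EN-003"}
en_pool = {"EN-001", "EN-003"}

carried, remaining = carry_over(de_judgments, concordance, en_pool)
print(carried)    # English hits judged via their German counterpart
print(remaining)  # English hits that need a fresh assessment
```

In this toy example `EN-003` ends up in `remaining` because its German counterpart was never returned for the topic, which mirrors the observation that the result sets of the two language parts were not fully congruent.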
Slide 9: GIRT4 runs by 4 participants
Data      Topic lang.  Judged runs  Task         Task total
GIRT4 DE  DE           13           Monolingual  17
GIRT4 EN  EN           4            Monolingual
GIRT4 DE  EN           1            Bilingual    5
GIRT4 DE  RU           2            Bilingual
GIRT4 EN  DE           1            Bilingual
GIRT4 EN  RU           1            Bilingual
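The run counts above can be tallied by task type as a consistency check. This is a minimal sketch with illustrative names. It assumes that a run is monolingual when the topic language equals the data language, and that the topic language of the second row (4 runs on GIRT4 EN), which is not legible in the slide text, is EN.

```python
# (data language, topic language, judged runs), as read from the slide.
# Assumption: the topic language of the second row is EN (monolingual run).
runs = [
    ("DE", "DE", 13),
    ("EN", "EN", 4),
    ("DE", "EN", 1),
    ("DE", "RU", 2),
    ("EN", "DE", 1),
    ("EN", "RU", 1),
]

totals = {"monolingual": 0, "bilingual": 0}
for data_lang, topic_lang, judged in runs:
    # A run counts as monolingual when topics and data share the same language.
    task = "monolingual" if data_lang == topic_lang else "bilingual"
    totals[task] += judged

print(totals)
```

Under these assumptions the sums agree with the two task totals shown on the slide (17 and 5).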