1 The Domain-Specific Track at CLEF 2008
Vivien Petras & Stefan Baerisch
GESIS Social Science Information Centre, Bonn, Germany
Aarhus, Denmark, September 17, 2008
2 Outline
- The Domain-Specific Task
- Collections & Controlled Vocabularies
- Participants, Runs & Relevance Assessments
- Themes
- Outlook
3 The Domain-Specific Task
CLIR on structured scientific document collections:
- social science domain
- bibliographic metadata
- controlled vocabularies for subject description
Leverage for:
- search
- query expansion
- translation
4 The Domain-Specific Task
Tasks:
- Monolingual: against German, English or Russian
- Bilingual: against German, English or Russian
- Multilingual: against combined collection
Topics:
- 25 topics in standard TREC format (title, desc, narr)
- suggestions from 28 subject specialties in the Social Sciences
- translated from German into English and Russian
5 Collections
Language  Name     Coverage  Docs     Abstracts  Description
German    GIRT-DE            151,319  96%        German social science literature & projects
English   GIRT-EN            151,319  17%        GIRT-DE translated
English   CSA-SA             20,000   94%        Sociolog. Abstracts
Russian   ISISS              145,802  27%        Inst. of Scientific Inf. for Soc. Sc. of the Ru. Acad. of Science
6 Controlled Vocabularies
                    GIRT  CSA-SA  INION
Descriptors / doc
Class. codes / doc  2     1.3     n/a
5 different subject-describing terminologies:
- Thesaurus for the Social Sciences (GIRT-DE, -EN)
- Thesaurus of Sociological Indexing Terms (CSA-SA)
- INION Thesaurus (ISISS)
- Social Sciences Classification (GIRT-DE, -EN)
- Sociological Abstracts Classification (CSA-SA)
7 Controlled Vocabularies – Mapping Tools
Translation:
- GIRT German to GIRT English, GIRT Russian
- INION Russian to INION English
Term mappings: equivalent terms in vocabularies
- GIRT German / English and CSA-SA English
- GIRT German and INION Russian
- Example: "counseling for the aged" maps to "Counseling + Elderly"
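A term mapping of this kind can be pictured as a lookup from a term in one vocabulary to one or more terms in another. The following is a minimal sketch, not the track's actual mapping tool: the table `TERM_MAPPINGS` and the functions `map_term` and `map_query` are hypothetical names, and the single entry is the slide's own example, with "+" read as a combination of two target terms.

```python
# Hypothetical mapping table (assumption): one source term maps to a
# combination of target-vocabulary terms, as in the slide's example.
TERM_MAPPINGS = {
    "counseling for the aged": ["Counseling", "Elderly"],
}


def map_term(term, mappings=TERM_MAPPINGS):
    """Return the mapped target terms, or the term itself if no mapping exists."""
    return mappings.get(term.lower(), [term])


def map_query(terms, mappings=TERM_MAPPINGS):
    """Map every query term, keeping order and dropping duplicate targets."""
    mapped = []
    for term in terms:
        for target in map_term(term, mappings):
            if target not in mapped:
                mapped.append(target)
    return mapped
```

Unmapped terms pass through unchanged, so a query can mix controlled-vocabulary terms and free text.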
8 Participants
6 groups
Group      Institution                                              Country
Amsterdam  University of Amsterdam                                  The Netherlands
Chemnitz   Media Informatics, Chemnitz University of Technology     Germany
Cheshire   School of Information, UC Berkeley                       USA
Darmstadt  Technical University Darmstadt                           Germany
Hug        University Hospitals Geneva                              Switzerland
Unine      Computer Science Department, University of Neuchatel     Switzerland
9 Runs
Task                Runs 2008  Runs 2007  Runs 2006
Monolingual
- against German
- against English
- against Russian   9          11         1
Bilingual
- against German
- against English
- against Russian   8          9          3
Multilingual        9          9          2
Total               69         86         36
10 Relevance Assessments
              German  English  Russian
Pool size
Rel. Docs %           14%      2%*
Rel. Docs %           25%      10%**
Rel. Docs %           26%      n/a
* In Russian collection: 1 topic without relevant docs
** 3 topics without relevant docs
11 Relevance Assessments – Best MAP
Task                Best MAP 2008  Best MAP 2007  Best MAP 2006
Monolingual
- against German
- against English
- against Russian
Bilingual
- against German    (82%)          (90%)          (45%)
- against English   (87%)          (95%)          (72%)
- against Russian   (49%)          (68%)          (62%)
Multilingual        0.2816*
* German topics; English = ; Russian =
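For readers unfamiliar with the measure, mean average precision (MAP) can be sketched as follows. This sketch rests on two assumptions the slide does not state: topics with no relevant documents are left out of the mean, and the percentages in parentheses are read as the best bilingual MAP relative to the best monolingual MAP for the same target language. All function names and the toy data in the usage note are hypothetical.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision for one topic: mean of precision at each relevant hit."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_ids)


def mean_average_precision(run, qrels):
    """MAP over topics that have at least one relevant document (assumption)."""
    scores = [
        average_precision(run.get(topic, []), relevant)
        for topic, relevant in qrels.items()
        if relevant
    ]
    return sum(scores) / len(scores) if scores else 0.0


def percent_of_baseline(bilingual_map, monolingual_map):
    """Bilingual MAP as a percentage of the monolingual baseline (assumed reading)."""
    return 100.0 * bilingual_map / monolingual_map
```

With a toy run `{"t1": ["d1", "d2", "d3"], "t2": ["d4", "d5"]}` and judgments `{"t1": {"d1", "d3"}, "t2": {"d5"}}`, topic t1 scores (1/1 + 2/3) / 2 and topic t2 scores 1/2, so the mean is 2/3.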
12 Themes - Retrieval models
- Lucene (Xtrieval Chemnitz, Darmstadt)
- Semantic relatedness: Wikipedia / Wiktionary (Darmstadt)
- Language Models (Amsterdam)
- Vector space (EasyIR, Hug)
- Probabilistic – Logistic Regression (Cheshire)
- Comparison: Vector Space, LM, Probabilistic, DFR (Unine)
- Data fusion
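The slide does not say which data fusion method the participants used. As an illustration only, here is a minimal sketch of CombSUM, one common fusion technique, with min-max score normalization. The function names and the input format (one dict of document id to score per retrieval system) are assumptions.

```python
def min_max_normalize(scores):
    """Scale one system's scores to [0, 1] so that systems are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(result_lists):
    """CombSUM: sum each document's normalized scores across all systems."""
    fused = {}
    for scores in result_lists:
        if not scores:
            continue  # skip systems that returned nothing
        for doc, s in min_max_normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + s
    # Highest fused score first; ties broken by document id for stable output.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

A document retrieved by several systems accumulates score from each of them, which is why fusion tends to favor documents that different models agree on.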
13 Themes – Query Expansion
- Blind Feedback (Rocchio)
- idf-window BF (infrequent terms near search term)
- Thesaurus Lookup
- Thesaurus as pivot language: double translation
- Google (text snippets)
- Wikipedia (frequent terms from top-ranked articles)
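Blind feedback with Rocchio can be sketched as follows: the top-ranked documents of an initial search are assumed to be relevant, and the query is moved toward their centroid. This is a minimal illustration, not any participant's implementation. It uses raw term frequencies instead of tf-idf weights and omits the negative (non-relevant) component. The function name, parameter defaults, and input format are assumptions.

```python
from collections import Counter


def rocchio_expand(query_terms, top_docs, alpha=1.0, beta=0.75, n_expansion=3):
    """Return term weights for an expanded query.

    top_docs is a list of token lists for the top-ranked documents,
    which blind feedback treats as relevant without any judgment.
    """
    query_vec = Counter(query_terms)

    # Centroid of the pseudo-relevant documents (mean term frequency).
    centroid = Counter()
    for tokens in top_docs:
        for term, tf in Counter(tokens).items():
            centroid[term] += tf / len(top_docs)

    # Rocchio update without the negative component: alpha * q + beta * centroid.
    weights = {term: alpha * w for term, w in query_vec.items()}
    for term, w in centroid.items():
        weights[term] = weights.get(term, 0.0) + beta * w

    # Keep all original terms and add only the strongest new terms.
    expanded = {term: weights[term] for term in query_vec}
    candidates = sorted(
        ((t, w) for t, w in weights.items() if t not in query_vec),
        key=lambda item: (-item[1], item[0]),
    )
    expanded.update(candidates[:n_expansion])
    return expanded
```

For the query `["migration"]` and the two feedback documents `["migration", "labour", "market"]` and `["labour", "policy"]`, with `n_expansion=1`, the only term added is "labour", because it has the highest mean frequency in the feedback documents.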
14 Themes – Translation
- Google AJAX language API
- Commercial Software (Systran, LEC)
- Bilingual thesaurus look-up
- ML retrieval thesaurus look-up
- Wikipedia (Cross-language links)
15 Summary & Outlook
Enough interest for 2009?
- Different corpora
- Different tasks
  - full topic run (125 topics)
  - result: controlled vocabulary terms (not documents)
  - robust task
- Full-text retrieval with open access literature
16
Domain-Specific Track: information_technology/clef_ds.htm
Vocabulary Mappings: information_technology/komohe.htm