Published by Paul Roberts. Modified over 8 years ago.
1 The Domain-Specific Track at CLEF 2007
Vivien Petras, Stefan Baerisch & Max Stempfhuber
GESIS Social Science Information Centre, Bonn, Germany
Budapest, September 19, 2007

2 Outline
- The Domain-Specific Task
- Collections & Controlled Vocabularies
- Topics
- Participants, Runs & Relevance Assessments
- Themes
- Summary & Outlook

3 The Domain-Specific Task
CLIR on structured scientific document collections:
- social science domain
- bibliographic metadata
- controlled vocabularies for subject description
Leverage bibliographic metadata & controlled vocabularies for:
- search
- translation

4 The Domain-Specific Task
Tasks:
- Monolingual: against German, English or Russian
- Bilingual: against German, English or Russian
- Multilingual: against combined collection

5 Collections
Language  Name     Coverage   Docs     Abstracts  Description
German    GIRT-DE  1990-2000  151,319  96%        German social science literature & projects
English   GIRT-EN  1990-2000  151,319  17%        GIRT-DE translated
English   CSA-SA   1994-1996  20,000   94%        Sociological Abstracts
Russian   ISISS    -          145,802  27%        Institute of Scientific Information for Social Sciences of the Russian Academy of Sciences

6 Controlled Vocabularies
                    GIRT  CSA-SA  ISISS
Descriptors / doc   10    6.4     3.9
Class. codes / doc  2     1.3     n/a
5 different subject-describing terminologies:
- Thesaurus for the Social Sciences (GIRT-DE, -EN)
- Thesaurus of Sociological Indexing Terms (CSA-SA)
- INION Thesaurus (ISISS)
- Social Sciences Classification (GIRT-DE, -EN)
- Sociological Abstracts Classification (CSA-SA)

7 Controlled Vocabularies – Mapping Tools
Translation:
- GIRT German - GIRT English
Intellectual term mappings (cross-walks): equivalent terms in vocabularies
- GIRT German - CSA-SA English
- GIRT English - CSA-SA English
Example:
  original-term: agricultural area
  mapped-term: Rural areas
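
The cross-walk idea above can be illustrated with a minimal Python sketch. It assumes a mapping table is available as a plain dictionary from a source-vocabulary term to its equivalent target-vocabulary terms. The names `CROSSWALK` and `map_terms` are hypothetical and are not part of the track's tooling; only the single example pair comes from the slide.

```python
# Hypothetical cross-walk table: source term -> equivalent target terms.
# Only this one pair is taken from the slide; real tables are far larger.
CROSSWALK = {
    "agricultural area": ["Rural areas"],
}


def map_terms(terms, crosswalk):
    """Replace each term with its mapped equivalents, if any.

    Terms without an entry in the cross-walk are passed through unchanged,
    so the query never loses a term just because no mapping exists.
    """
    mapped = []
    for term in terms:
        key = term.strip().lower()  # normalise case and whitespace for lookup
        mapped.extend(crosswalk.get(key, [term]))
    return mapped


if __name__ == "__main__":
    print(map_terms(["Agricultural Area", "poverty"], CROSSWALK))
```

Passing unmapped terms through unchanged is one possible design choice; a system could equally drop them or fall back to machine translation.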

8 Topics
25 topics in standard TREC format (title, desc, narr):
- 15 volunteers (social scientists)
- 2-5 suggestions from 28 subject specialties
- checked for:
  - coverage in collections
  - variance from previous years
- translated into English, Russian

9 Participants
5 groups
Group     Institution                                           Country
Chemnitz  Media Informatics, Chemnitz University of Technology  Germany
Cheshire  School of Information, UC Berkeley                    USA
Moscow    Moscow State University                               Russia
Unine     Computer Science Department, University of Neuchatel  Switzerland
Xerox     Data Mining Group, Xerox Research Centre Europe       France

10 Runs
Task               Runs 2007  Runs 2006  Runs 2005
Monolingual
- against German   13         13         17
- against English  15         8          15
- against Russian  11         1          8
Bilingual
- against German   14         6          15
- against English  15         3          13
- against Russian  9          3          5
Multilingual       9          2          3
Total              86         36         76

11 Relevance Assessments
                German  English  Russian
Pool size       16,288  17,867   14,473
Rel. docs 2007  22%     25%      10%*
Rel. docs 2006  39%     26%      n/a
Rel. docs 2005  20%     21%      9% (RSSC)
* In Russian collection: 3 topics without relevant documents
All assessments done with Univ. of Padova's DIRECT System.

12 Relevance Assessments – Best MAP
Task               MAP 2007      MAP 2006      MAP 2005
Monolingual
- against German   0.5051        0.5454        0.4936
- against English  0.3534        0.4576        0.5065
- against Russian  0.1971        0.2542        0.3038
Bilingual
- against German   0.4568 (90%)  0.2448 (45%)  0.4201 (85%)
- against English  0.3341 (95%)  0.3301 (72%)  0.4743 (94%)
- against Russian  0.1348 (68%)  0.1648 (62%)  0.2331 (77%)
Multilingual       0.0884        0.0753        0.0532
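
The percentages in parentheses appear to express the best bilingual MAP as a share of the best monolingual MAP for the same target language and year. That reading is an assumption; the slide does not define them. A minimal Python sketch of the calculation, with the hypothetical helper name `bilingual_share`:

```python
def bilingual_share(mono_map, bili_map):
    """Return bilingual MAP as a whole-number percentage of monolingual MAP."""
    return round(100 * bili_map / mono_map)


# (monolingual best MAP, bilingual best MAP) for 2007, taken from the slide.
BEST_MAP_2007 = {
    "German": (0.5051, 0.4568),
    "English": (0.3534, 0.3341),
    "Russian": (0.1971, 0.1348),
}

if __name__ == "__main__":
    for language, (mono, bili) in BEST_MAP_2007.items():
        print(language, bilingual_share(mono, bili))
```

Under this reading the 2007 figures reproduce the slide's 90%, 95% and 68%. The 2006 Russian pair (0.2542, 0.1648) computes to about 65% rather than the 62% shown, so that cell may rest on a different baseline.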

13 Themes - Retrieval models
- Lucene
- Language Modelling
- Logistic Regression
- Comparison: Vector Space, LM, Probabilistic - Okapi, DFR
- Data fusion
- Russian word-based vs. N-gram retrieval
- new light-weight stemmer
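
To make the word-based vs. N-gram contrast concrete, here is a minimal Python sketch of character n-gram indexing terms. The function name `char_ngrams`, the default length of 4 and the rule of keeping short words whole are illustrative assumptions; the slides do not say how participants built their n-grams.

```python
def char_ngrams(text, n=4):
    """Split text into overlapping character n-grams, word by word.

    Words no longer than n characters are kept whole, so short words
    still produce an indexing term.
    """
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


if __name__ == "__main__":
    print(char_ngrams("rural areas"))  # word-internal 4-grams
    print(char_ngrams("рынок труд"))   # works the same on Cyrillic text
```

Because n-grams overlap, inflected forms of a word share most of their indexing terms, which is why the approach is often tried as an alternative to stemming for morphologically rich languages.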

14 Themes – Query Expansion
- Entry Vocabulary Modules: query terms associated with thesaurus terms from documents
- Thesaurus Lookup
  - combined thesaurus from all CVs
  - GIRT Thesaurus Index
- Lexical Entailment: find document terms in relation to query terms
- Blind Feedback
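
As an illustration of the last item, blind (pseudo-relevance) feedback expands a query with terms taken from the top-ranked documents of an initial search, without any human judgement. The Python sketch below is a deliberately simple version: it assumes documents are already tokenised and ranked, and it selects expansion terms by raw frequency. The function name `blind_feedback`, its parameters and the toy documents are hypothetical; participants' systems may have used different term-weighting schemes.

```python
from collections import Counter


def blind_feedback(query_terms, ranked_docs, top_docs=2, new_terms=2):
    """Expand a query with the most frequent terms from the top-ranked docs.

    ranked_docs is a list of token lists, best match first. Terms already
    in the query are skipped; frequency ties are broken alphabetically.
    """
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(t for t in doc if t not in query_terms)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:new_terms]]


if __name__ == "__main__":
    # Toy data for illustration only.
    docs = [
        ["unemployment", "youth", "labour", "market"],
        ["youth", "unemployment", "labour"],
        ["sports"],
    ]
    print(blind_feedback(["unemployment"], docs))
```

The expanded query is then run again in a second retrieval pass.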

15 Themes – Translation
- Lucene plug-in: Babelfish, Google, PROMT, Reverso
- Bilingual thesaurus mapping
- Dictionary adaptation: disambiguate term translation given language context of feedback documents
- Statistical machine translation: MATRAX
- Commercial software

16 Summary & Outlook
- Extension of Russian materials
  - Translation table DE-EN-RU for GIRT Thesaurus
  - Translation table RU-EN for INION Thesaurus
  - Mapping between GIRT and INION Thesaurus
- More tools for terminology mapping
  - different relationships (0T, SYN, BT, NT, RT)
- GESIS-IZ project:
  - > 40 mappings
  - 25 controlled vocabularies / 11 disciplines
  - ~ 125,000 terms & phrases
  - ~ 400,000 relations

17 Domain-Specific Track: http://www.gesis.org/en/research/information_technology/clef_ds_2007.htm
Vocabulary Mappings: http://www.gesis.org/en/research/information_technology/komohe.htm
Email: vivien.petras@gesis.org