LREC 2008
From Research to Application in Multilingual Information Access: The Contribution of Evaluation
Carol Peters, ISTI-CNR, Pisa, Italy
LREC 2008 Outline
- What is MLIA/CLIR?
- What is the State of the Art? Where are the Problems?
- What is the Contribution of Evaluation? Where are the Problems? What more can we do?
- From CLEF to TrebleCLEF
LREC 2008 Europe’s Linguistic Diversity
LREC 2008
- There are 6,800 known languages spoken in 200 countries
- 2,261 have writing systems (the others are only spoken)
- Just 300 have some kind of language processing tools
LREC 2008 What is MLIA?
MLIA-related research concerns the storage, access, retrieval and presentation of information in any of the world's languages. Two main areas of interest:
- multiple-language access, browsing and display
- cross-language information discovery and retrieval
LREC 2008 Multi-Language Access, Browsing, Display
The enabling technology:
- character encoding
- specific requirements of particular languages and scripts
- internationalization & localization
LREC 2008 Cross-Language Information Retrieval
Crossing the language barrier:
- querying a multilingual collection in one language, against documents in many other languages
- filtering, selecting and ranking the retrieved documents
- presenting the retrieved information in an interpretable and exploitable fashion
LREC 2008 The Problem
[Diagram: a user query and a document, separated by the language barrier, each mapped to a query representation and a document representation]
CLIR methods How is it done?
- Pre-process & index both documents and queries, generally using language-dependent techniques (tokenisation, stopwords, stemming, morphological analysis, decompounding, etc.)
- Translate: queries or documents (or both)
- Translation resources: Machine Translation (MT), parallel/comparable corpora, bilingual dictionaries, multilingual thesauri, conceptual interlingua
- Find relevant documents in the target collection(s) & present the results
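To make the pipeline concrete, here is a minimal sketch of dictionary-based query translation followed by retrieval. The stop list, the toy Italian-English lexicon, the tf-idf scoring and all document and function names are illustrative assumptions, not the method of any particular CLEF participant.

```python
# A minimal sketch of a dictionary-based CLIR pipeline: pre-process the query,
# translate it term by term, and retrieve from an in-memory English collection.
# Lexicon, stop list and documents are toy data for illustration only.

import math
import re
from collections import Counter, defaultdict

SOURCE_STOPWORDS = {"il", "la", "di", "e", "in"}   # toy Italian stop list
BILINGUAL_LEXICON = {                              # toy IT -> EN dictionary
    "inquinamento": ["pollution"],
    "aria": ["air"],
    "citta": ["city", "town"],
}

def preprocess(text, stopwords=frozenset()):
    """Lowercase, tokenise and remove stopwords (no stemming in this sketch)."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords]

def translate_query(tokens):
    """Replace each source-language token by all its dictionary translations;
    untranslatable (OOV) tokens, e.g. proper names, are kept as-is."""
    translated = []
    for t in tokens:
        translated.extend(BILINGUAL_LEXICON.get(t, [t]))
    return translated

def build_index(docs):
    """Very small inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(preprocess(text)).items():
            index[term][doc_id] = tf
    return index

def retrieve(query_tokens, index, n_docs):
    """Rank documents by a simple tf-idf score."""
    scores = Counter()
    for term in query_tokens:
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return scores.most_common()

docs = {
    "d1": "Air pollution in European cities is rising.",
    "d2": "The city council discussed public transport.",
}
index = build_index(docs)
query = translate_query(preprocess("inquinamento aria in citta", SOURCE_STOPWORDS))
print(retrieve(query, index, len(docs)))
```

The same skeleton holds whichever translation resource is used (MT, parallel corpora, thesauri); only the translation step changes.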
LREC 2008 CLIR for Multimedia
Retrieval from a mixed-media collection is a non-trivial problem. Different media are processed in different ways and suffer from different kinds of indexing errors:
- spoken documents indexed using speech recognition
- handwritten documents indexed using OCR
- images indexed using significant features
Need for complex integration of multiple technologies; need for merging of results from different sources (see the sketch below).
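One reason merging is hard is that raw scores from media-specific systems are not comparable. A minimal sketch of normalised score fusion, with invented ranked lists and weights, is shown here; it is one simple option, not the merging strategy of any specific CLEF track.

```python
# A minimal sketch of merging ranked lists from different media-specific
# indexers (text, speech transcripts, image features). Each list is min-max
# normalised before a weighted CombSUM-style merge. Input lists are invented.

def min_max_normalise(run):
    """Map one ranked list's scores into [0, 1]."""
    scores = list(run.values())
    lo, hi = min(scores), max(scores)
    if hi == lo:                                   # degenerate list: all equal
        return {doc: 1.0 for doc in run}
    return {doc: (s - lo) / (hi - lo) for doc, s in run.items()}

def merge(runs, weights=None):
    """Weighted sum of normalised scores across the media-specific runs."""
    weights = weights or {name: 1.0 for name in runs}
    merged = {}
    for name, run in runs.items():
        for doc, score in min_max_normalise(run).items():
            merged[doc] = merged.get(doc, 0.0) + weights[name] * score
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

runs = {
    "text":   {"d1": 12.3, "d2": 9.1, "d3": 2.0},   # BM25-style scores
    "speech": {"d2": 0.81, "d4": 0.75},             # ASR-based retrieval scores
    "image":  {"d3": 0.40, "d1": 0.35},             # visual-feature scores
}
print(merge(runs, weights={"text": 1.0, "speech": 0.8, "image": 0.5}))
```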
LREC 2008 Main CLIR Difficulties (I)
- Language identification
- Morphology: inflection, derivation, compounding, ... (see the decompounding sketch below)
- OOV terms, e.g. proper names, terminology
- Multi-word concepts, e.g. phrases and idioms
- Ambiguity, e.g. polysemy
- Handling many languages: L1 -> Ln
- Merging results from different sources / media
- Presenting the results in a useful fashion
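As one concrete illustration of the morphology problem, here is a minimal sketch of greedy longest-match decompounding for a Germanic compound. The tiny vocabulary and the splitting heuristic are toy assumptions; real splitters use full lexica, frequency statistics and linking-element rules.

```python
# A minimal sketch of greedy longest-prefix decompounding with a toy vocabulary.

VOCAB = {"donau", "dampf", "schiff", "fahrt", "kapitaen"}

def decompound(word, vocab=VOCAB, min_len=3):
    """Split a compound into known parts, longest prefix first; fall back to
    the whole word if no full segmentation into vocabulary items is found."""
    word = word.lower()
    if word in vocab:
        return [word]
    for cut in range(len(word) - min_len, min_len - 1, -1):
        head, tail = word[:cut], word[cut:]
        if head in vocab:
            rest = decompound(tail, vocab, min_len)
            if all(part in vocab for part in rest):
                return [head] + rest
    return [word]

print(decompound("Donaudampfschifffahrt"))   # ['donau', 'dampf', 'schiff', 'fahrt']
```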
LREC 2008 Main CLIR Difficulties (II)
- MLIA systems need clever pre-processing of target collections (e.g. semantic analysis, classification, information extraction)
- MLIA systems need intelligent post-processing of results: merging / summarization / translation
- MLIA systems need well-developed resources: language processing tools and language resources
- Resources are expensive to acquire, maintain and update
LREC 2008 Cross-Language Evaluation Forum
Objectives: promote research and stimulate development of multilingual IR systems for European languages, through
- creation of an evaluation infrastructure
- building of an MLIA/CLIR research community
- construction of publicly available test suites
Major goal: encourage development of truly multilingual, multimodal systems
LREC 2008 CLEF Coordination
Institutions contributing to the organisation of the different tracks of CLEF 2007:
- Centre for the Evaluation of Human Language and Multimodal Communication Technologies (CELCT), Trento, Italy
- College of Information Studies and Institute for Advanced Computer Studies, U. Maryland, USA
- Dept. of Computer Science, U. Indonesia
- Depts. of Computer Science & Medical Informatics, RWTH Aachen U., Germany
- Dept. of Computer Science and Information Systems, U. Limerick, Ireland
- Dept. of Computer Science and Information Engineering, National U. Taiwan
- Dept. of Information Engineering, U. Padua, Italy
- Dept. of Information Science, U. Hildesheim, Germany
- Dept. of Information Studies, U. Sheffield, UK
- Evaluations and Language Resources Distribution Agency Sarl, Paris, France
- Fondazione Bruno Kessler FBK-irst, Trento, Italy
- German Research Centre for Artificial Intelligence, DFKI, Saarbrücken, Germany
- Information and Language Processing Systems, U. Amsterdam, Netherlands
- IZ Bonn, Germany
- Inst. for Information Technology, Hyderabad, India
- Inst. of Formal and Applied Linguistics, Charles University, Czech Republic
- LSI-UNED, Madrid, Spain
- Linguateca, SINTEF, Oslo, Norway
- Linguistic Modelling Lab., Bulgarian Acad. Sci.
- Microsoft Research Asia
- NIST, USA
- Biomedical Informatics, Oregon Health and Science University, USA
- Research Computing Center of Moscow State U.
- Research Institute for Linguistics, Hungarian Academy of Sciences
- School of Computer Science and Mathematics, Victoria U., Australia
- School of Computing, DCU, Ireland
- UC Data Archive and School of Information Management and Systems, UC Berkeley, USA
- University "Alexandru Ioan Cuza", Iasi, Romania
- U. Hospitals and U. of Geneva, Switzerland
- Vienna University of Technology, Austria
LREC 2008 Evolution of CLEF
CLEF 2000 tracks:
- mono-, bi- & multilingual text document retrieval (Ad Hoc)
- mono- and cross-language information on structured scientific data (Domain-Specific)
CLEF 2001 added:
- interactive cross-language retrieval (iCLEF)
CLEF 2002 added:
- cross-language spoken document retrieval (CL-SR)
CLEF 2003 added:
- multiple-language question answering (QA@CLEF)
- cross-language retrieval in image collections (ImageCLEF)
CLEF 2005 added:
- multilingual retrieval of Web documents (WebCLEF)
- cross-language geographical retrieval (GeoCLEF)
CLEF 2008 added:
- cross-language video retrieval (VideoCLEF)
- multilingual information filtering (INFILE@CLEF)
LREC 2008 CLEF Test Collections
2000:
- News documents in 4 languages
- GIRT German social science database
2007:
- CLEF multilingual comparable corpus of more than 3M news documents in 13 languages: CZ, DE, EN, ES, FI, FR, IT, NL, RU, SV, PT, BG and HU
- GIRT-4 social science database in EN and DE; Russian ISISS collection; Cambridge Sociological Abstracts
- Malach collection of conversational speech derived from the Shoah archives, EN & CZ
- EuroGOV, 3.5M web pages crawled from European governmental sites
- IAPR TC-12 photo database; PASCAL VOC 2006 training data
- ImageCLEFmed radiological database consisting of 6 distinct datasets; IRMA collection in EN & DE for automatic medical image annotation
Each track creates topics/queries & relevance assessments in diverse languages (see the sketch below for how such assessments are used).
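To show what those relevance assessments are for, here is a minimal sketch that reads TREC/CLEF-style qrels lines ("topic iteration doc relevance") and scores a ranked run with uninterpolated Average Precision. The topic and document identifiers and the run itself are invented toy data, not real CLEF assessments.

```python
# A minimal sketch of using a test collection's relevance assessments (qrels)
# to benchmark one system's ranked output with Average Precision.

QRELS_LINES = """\
10.2452/401-AH 0 LASTAMPA94-001022 1
10.2452/401-AH 0 LASTAMPA94-004382 0
10.2452/401-AH 0 SDA94-008371 1
"""

def load_qrels(lines):
    """topic id -> set of relevant document ids."""
    relevant = {}
    for line in lines.strip().splitlines():
        topic, _, doc, rel = line.split()
        if int(rel) > 0:
            relevant.setdefault(topic, set()).add(doc)
    return relevant

def average_precision(ranked_docs, relevant):
    """Uninterpolated AP of one ranked list against one topic's judgements."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

qrels = load_qrels(QRELS_LINES)
system_ranking = ["SDA94-008371", "LASTAMPA94-004382", "LASTAMPA94-001022"]
print(average_precision(system_ranking, qrels["10.2452/401-AH"]))  # 0.833...
```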
LREC 2008 Promoting Research through Evaluation
Text Retrieval (from 2000): mono-, bi- and multilingual system performance tested using news documents (13 European languages)
- bilingual task testing on unusual language combinations
- multilingual system testing with many target languages
- advanced tasks to monitor improvement in system performance over time, focused on the problem of merging results from different collections/languages
- "robust" task emphasized the importance of stable performance across languages rather than high average performance (see the sketch below)
- since 2006, queries in non-European languages (Indian sub-task)
- 2008: new tasks on library archives; tasks on non-European target collections; the robust task uses WSD data
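One common way to reward stable rather than peaky behaviour is to average per-topic (or per-language) scores with a geometric rather than an arithmetic mean, as GMAP-style measures do. The per-topic AP values below are invented purely to illustrate the difference.

```python
# A minimal sketch of why a "robust" view differs from plain averaging:
# the geometric mean punishes near-zero scores on individual topics or
# languages far harder than the arithmetic mean.

import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values, eps=1e-5):
    # A small epsilon keeps zero scores from collapsing the whole product.
    return math.exp(sum(math.log(v + eps) for v in values) / len(values))

steady  = [0.30, 0.28, 0.33, 0.31]   # stable system
erratic = [0.70, 0.60, 0.02, 0.00]   # high peaks, but fails on some topics

for name, aps in (("steady", steady), ("erratic", erratic)):
    print(name, round(arithmetic_mean(aps), 3), round(geometric_mean(aps), 3))
# The erratic system has the higher arithmetic mean but a far lower geometric mean.
```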
LREC 2008 Results: Cross-Language Text Retrieval
Comparing bilingual results with monolingual baselines:
TREC-6, 1997:
- EN→FR: 49% of the best monolingual French system
- EN→DE: 64% of the best monolingual German system
CLEF 2002:
- EN→FR: 83.4% of the best monolingual French system
- EN→DE: 85.6% of the best monolingual German system
CLEF 2003 enforced the use of "unusual" language pairs:
- IT→ES: 83% of the best monolingual Spanish IR system
- DE→IT: 87% of the best monolingual Italian IR system
- FR→NL: 82% of the best monolingual Dutch IR system
CLEF 2007: best bilingual system at 88% of the best monolingual system
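For clarity, these percentages are simply the effectiveness of the best bilingual run divided by that of the best monolingual baseline. A minimal sketch of the arithmetic, with invented MAP values:

```python
# Relative effectiveness of a bilingual run against the best monolingual
# baseline; the MAP values below are invented for illustration.
best_monolingual_map = 0.45    # e.g. best FR->FR run (invented)
best_bilingual_map   = 0.375   # e.g. best EN->FR run (invented)
print(f"{100 * best_bilingual_map / best_monolingual_map:.1f}% of monolingual baseline")  # 83.3%
```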
LREC 2008 Other results: non-document & non-text retrieval
- Interactive CLEF: cross-language IR from a user-inclusive perspective
- Multilingual question answering: 10 different target collections, real-time exercise, answer validation, QA on speech transcripts
- Geographical CLIR
- Cross-language image retrieval: tasks on photo and medical archives, tasks for retrieval and classification
- Cross-language spoken document & cross-language speech retrieval
LREC 2008 CLEF Achievements
- Stimulation of research activity in new, previously unexplored areas
- Study and implementation of evaluation methodologies for diverse types of cross-language IR systems
- Creation of a large set of empirical data about multilingual information access from the user perspective
- Quantitative and qualitative evidence with respect to best practice in cross-language system development
- Creation of reusable test collections for system benchmarking
- Building of a strong, multidisciplinary research community
BUT
LREC 2008 BUT: notable lack of take-up by application communities
LREC 2008 TrebleCLEF
TrebleCLEF is a Coordination Action, funded under FP7 from 2008 to 2009, which aims at:
- continuing to promote the development of advanced multilingual multimedia information access systems
- disseminating know-how, tools and resources to enable DL creators to make content and knowledge accessible, usable and exploitable over time, over media and over language boundaries
LREC 2008 Objectives I
TrebleCLEF will promote R&D and industrial take-up of multilingual, multimodal information access functionality in the following ways:
by continuing to support the annual CLEF system evaluation campaigns, with particular focus on:
- user modeling, e.g. the requirements of different classes of users when querying multilingual information sources
- language-specific experimentation, e.g. looking at differences across languages in order to derive best practices for each language
- results presentation, e.g. how results can be presented in the most useful and comprehensible way to the user
LREC 2008 Objectives II
by constituting a scientific forum for the MLIA community of researchers, enabling them to meet and discuss results, emerging trends and new directions
by providing a scientific digital library to manage and make accessible the scientific data and experiments produced during the course of an evaluation campaign, with tools to:
- analyze, compare and cite the data and experiments
- curate, preserve, annotate and enrich them (promoting their re-use)
LREC 2008 Objectives III
by acting as a virtual centre of competence, providing a central reference point for anyone interested in studying or implementing MLIA functionality:
- making publicly available sets of guidelines on best practices in MLIA (e.g. what stemmer to use, what stop list, what translation resources, how best to evaluate, etc., depending on the application requirements), as sketched below
- making tools and resources used in the evaluation campaigns freely available to a wider public whenever possible; otherwise providing links to where they can be acquired
- organising workshops and/or tutorials and training sessions
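As an idea of what machine-usable best-practice guidelines could look like, here is a minimal sketch of a per-language configuration registry. The component names, file names and recommendations are illustrative assumptions, not actual TrebleCLEF guidance.

```python
# A minimal sketch of per-language "best practice" configuration: which stemmer,
# stop list, decompounding choice and translation resource to use per language.
# All entries are illustrative placeholders.

from dataclasses import dataclass

@dataclass
class LanguageConfig:
    stemmer: str            # e.g. a Snowball stemmer variant (assumed name)
    stoplist: str           # name of the stop-word list to load (assumed file)
    decompound: bool        # Germanic languages often benefit from splitting
    translation: str        # preferred translation resource for this setting

BEST_PRACTICE = {
    "de": LanguageConfig("snowball-german",  "de-stoplist.txt", True,  "MT"),
    "fr": LanguageConfig("snowball-french",  "fr-stoplist.txt", False, "bilingual dictionary"),
    "fi": LanguageConfig("snowball-finnish", "fi-stoplist.txt", True,  "MT"),
}

def configure_pipeline(lang):
    """Look up the recommended components for a target language, with a
    conservative fallback when no guideline exists yet."""
    return BEST_PRACTICE.get(lang, LanguageConfig("none", "none", False, "MT"))

print(configure_pipeline("de"))
```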
LREC 2008 Approach
Three lines of activity around MLIA:
Evaluation:
- test collections and laboratory evaluation
- user evaluation
- log analysis
Best Practices & Guidelines:
- system-oriented aspects of MLIA applications
- collaborative user studies
- user-oriented aspects of MLIA interfaces
Dissemination and Training:
- tutorials
- workshops
- summer school
LREC 2008 Consortium
- ISTI-CNR, Pisa, Italy
- University of Padua, Italy
- University of Sheffield, United Kingdom
- Universidad Nacional de Educación a Distancia, Spain
- Zurich University of Applied Sciences, Switzerland
- Centre for the Evaluation of Language Communication Technologies, Italy
- Evaluations & Language Resources Distribution Agency, France
LREC 2008 Contacts
For further information see: http://www.trebleclef.eu/
or contact: Carol Peters, ISTI-CNR
E-mail: carol.peters@isti.cnr.it