LREC 2008
From Research to Application in Multilingual Information Access: The Contribution of Evaluation
Carol Peters, ISTI-CNR, Pisa, Italy
LREC 2008 Outline
- What is MLIA/CLIR?
- What is the State of the Art? Where are the Problems?
- What is the Contribution of Evaluation? Where are the Problems? What more can we do?
- From CLEF to TrebleCLEF
LREC 2008 Europe’s Linguistic Diversity
LREC 2008
- There are 6,800 known languages spoken in 200 countries
- 2,261 have writing systems (the others are only spoken)
- Just 300 have some kind of language processing tools
LREC 2008 What is MLIA?
MLIA-related research concerns the storage, access, retrieval and presentation of information in any of the world's languages. Two main areas of interest:
- multiple-language access, browsing and display
- cross-language information discovery and retrieval
LREC 2008 Multi-Language Access, Browsing, Display
The enabling technology:
- character encoding
- specific requirements of particular languages and scripts
- internationalization & localization
LREC 2008 Cross-Language Information Retrieval
Crossing the language barrier:
- querying a multilingual collection in one language, against documents in many other languages
- filtering, selecting and ranking the retrieved documents
- presenting the retrieved information in an interpretable and exploitable fashion
LREC 2008 The Problem
[Diagram: a user query and a document, separated by the language barrier, each mapped to a query representation and a document representation]
CLIR methods How is it done?
- Pre-process & index both documents and queries, generally using language-dependent techniques (tokenisation, stopwords, stemming, morphological analysis, decompounding, etc.)
- Translate: queries or documents (or both)
- Translation resources: Machine Translation (MT), parallel/comparable corpora, bilingual dictionaries, multilingual thesauri, conceptual interlingua
- Find relevant documents in the target collection(s) & present the results
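To make the pipeline concrete, here is a minimal sketch of dictionary-based query translation followed by retrieval. The stop list, the toy Italian-English lexicon, the tf-idf scoring and all document and function names are illustrative assumptions, not the method of any particular CLEF participant.

```python
# A minimal sketch of a dictionary-based CLIR pipeline: pre-process the query,
# translate it term by term, and retrieve from an in-memory English collection.
# Lexicon, stop list and documents are toy data for illustration only.

import math
import re
from collections import Counter, defaultdict

SOURCE_STOPWORDS = {"il", "la", "di", "e", "in"}   # toy Italian stop list
BILINGUAL_LEXICON = {                              # toy IT -> EN dictionary
    "inquinamento": ["pollution"],
    "aria": ["air"],
    "citta": ["city", "town"],
}

def preprocess(text, stopwords=frozenset()):
    """Lowercase, tokenise and remove stopwords (no stemming in this sketch)."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords]

def translate_query(tokens):
    """Replace each source-language token by all its dictionary translations;
    untranslatable (OOV) tokens, e.g. proper names, are kept as-is."""
    translated = []
    for t in tokens:
        translated.extend(BILINGUAL_LEXICON.get(t, [t]))
    return translated

def build_index(docs):
    """Very small inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(preprocess(text)).items():
            index[term][doc_id] = tf
    return index

def retrieve(query_tokens, index, n_docs):
    """Rank documents by a simple tf-idf score."""
    scores = Counter()
    for term in query_tokens:
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return scores.most_common()

docs = {
    "d1": "Air pollution in European cities is rising.",
    "d2": "The city council discussed public transport.",
}
index = build_index(docs)
query = translate_query(preprocess("inquinamento aria in citta", SOURCE_STOPWORDS))
print(retrieve(query, index, len(docs)))
```

The same skeleton holds whichever translation resource is used (MT, parallel corpora, thesauri); only the translation step changes.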
LREC 2008 CLIR for Multimedia
Retrieval from a mixed-media collection is a non-trivial problem. Different media are processed in different ways and suffer from different kinds of indexing errors:
- spoken documents indexed using speech recognition
- handwritten documents indexed using OCR
- images indexed using significant features
Need for complex integration of multiple technologies; need for merging of results from different sources (see the sketch below).
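One reason merging is hard is that raw scores from media-specific systems are not comparable. A minimal sketch of normalised score fusion, with invented ranked lists and weights, is shown here; it is one simple option, not the merging strategy of any specific CLEF track.

```python
# A minimal sketch of merging ranked lists from different media-specific
# indexers (text, speech transcripts, image features). Each list is min-max
# normalised before a weighted CombSUM-style merge. Input lists are invented.

def min_max_normalise(run):
    """Map one ranked list's scores into [0, 1]."""
    scores = list(run.values())
    lo, hi = min(scores), max(scores)
    if hi == lo:                                   # degenerate list: all equal
        return {doc: 1.0 for doc in run}
    return {doc: (s - lo) / (hi - lo) for doc, s in run.items()}

def merge(runs, weights=None):
    """Weighted sum of normalised scores across the media-specific runs."""
    weights = weights or {name: 1.0 for name in runs}
    merged = {}
    for name, run in runs.items():
        for doc, score in min_max_normalise(run).items():
            merged[doc] = merged.get(doc, 0.0) + weights[name] * score
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

runs = {
    "text":   {"d1": 12.3, "d2": 9.1, "d3": 2.0},   # BM25-style scores
    "speech": {"d2": 0.81, "d4": 0.75},             # ASR-based retrieval scores
    "image":  {"d3": 0.40, "d1": 0.35},             # visual-feature scores
}
print(merge(runs, weights={"text": 1.0, "speech": 0.8, "image": 0.5}))
```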
LREC 2008 Main CLIR Difficulties (I)
- Language identification
- Morphology: inflection, derivation, compounding, ... (see the decompounding sketch below)
- OOV terms, e.g. proper names, terminology
- Multi-word concepts, e.g. phrases and idioms
- Ambiguity, e.g. polysemy
- Handling many languages: L1 -> Ln
- Merging results from different sources / media
- Presenting the results in a useful fashion
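As one concrete illustration of the morphology problem, here is a minimal sketch of greedy longest-match decompounding for a Germanic compound. The tiny vocabulary and the splitting heuristic are toy assumptions; real splitters use full lexica, frequency statistics and linking-element rules.

```python
# A minimal sketch of greedy longest-prefix decompounding with a toy vocabulary.

VOCAB = {"donau", "dampf", "schiff", "fahrt", "kapitaen"}

def decompound(word, vocab=VOCAB, min_len=3):
    """Split a compound into known parts, longest prefix first; fall back to
    the whole word if no full segmentation into vocabulary items is found."""
    word = word.lower()
    if word in vocab:
        return [word]
    for cut in range(len(word) - min_len, min_len - 1, -1):
        head, tail = word[:cut], word[cut:]
        if head in vocab:
            rest = decompound(tail, vocab, min_len)
            if all(part in vocab for part in rest):
                return [head] + rest
    return [word]

print(decompound("Donaudampfschifffahrt"))   # ['donau', 'dampf', 'schiff', 'fahrt']
```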
LREC 2008 Main CLIR Difficulties (II)
- MLIA systems need clever pre-processing of target collections (e.g. semantic analysis, classification, information extraction)
- MLIA systems need intelligent post-processing of results: merging / summarization / translation
- MLIA systems need well-developed resources: language processing tools and language resources
- Resources are expensive to acquire, maintain and update
LREC 2008 Cross-Language Evaluation Forum
Objectives: promote research and stimulate development of multilingual IR systems for European languages, through
- creation of an evaluation infrastructure
- building of an MLIA/CLIR research community
- construction of publicly available test suites
Major goal: encourage development of truly multilingual, multimodal systems
LREC 2008 CLEF Coordination
Institutions contributing to the organisation of the different tracks of CLEF 2007:
- Centre for the Evaluation of Human Language and Multimodal Communication Technologies (CELCT), Trento, Italy
- College of Information Studies and Institute for Advanced Computer Studies, U. Maryland, USA
- Dept. of Computer Science, U. Indonesia
- Depts. of Computer Science & Medical Informatics, RWTH Aachen U., Germany
- Dept. of Computer Science and Information Systems, U. Limerick, Ireland
- Dept. of Computer Science and Information Engineering, National U. Taiwan
- Dept. of Information Engineering, U. Padua, Italy
- Dept. of Information Science, U. Hildesheim, Germany
- Dept. of Information Studies, U. Sheffield, UK
- Evaluations and Language Resources Distribution Agency Sarl, Paris, France
- Fondazione Bruno Kessler FBK-irst, Trento, Italy
- German Research Centre for Artificial Intelligence, DFKI, Saarbrücken, Germany
- Information and Language Processing Systems, U. Amsterdam, Netherlands
- IZ Bonn, Germany
- Inst. for Information Technology, Hyderabad, India
- Inst. of Formal and Applied Linguistics, Charles University, Czech Republic
- LSI-UNED, Madrid, Spain
- Linguateca, SINTEF, Oslo, Norway
- Linguistic Modelling Lab., Bulgarian Acad. Sci.
- Microsoft Research Asia
- NIST, USA
- Biomedical Informatics, Oregon Health and Science University, USA
- Research Computing Center of Moscow State U.
- Research Institute for Linguistics, Hungarian Academy of Sciences
- School of Computer Science and Mathematics, Victoria U., Australia
- School of Computing, DCU, Ireland
- UC Data Archive and School of Information Management and Systems, UC Berkeley, USA
- University "Alexandru Ioan Cuza", Iasi, Romania
- U. Hospitals and U. of Geneva, Switzerland
- Vienna University of Technology, Austria
LREC 2008 Evolution of CLEF
CLEF 2000 tracks:
- mono-, bi- & multilingual text document retrieval (Ad Hoc)
- mono- and cross-language information on structured scientific data (Domain-Specific)
CLEF 2001 added:
- interactive cross-language retrieval (iCLEF)
CLEF 2002 added:
- cross-language spoken document retrieval (CL-SR)
CLEF 2003 added:
- multiple-language question answering (QA@CLEF)
- cross-language retrieval in image collections (ImageCLEF)
CLEF 2005 added:
- multilingual retrieval of Web documents (WebCLEF)
- cross-language geographical retrieval (GeoCLEF)
CLEF 2008 added:
- cross-language video retrieval (VideoCLEF)
- multilingual information filtering (INFILE@CLEF)
LREC 2008 CLEF Test Collections
2000:
- News documents in 4 languages
- GIRT German social science database
2007:
- CLEF multilingual comparable corpus of more than 3M news documents in 13 languages: CZ, DE, EN, ES, FI, FR, IT, NL, RU, SV, PT, BG and HU
- GIRT-4 social science database in EN and DE; Russian ISISS collection; Cambridge Sociological Abstracts
- Malach collection of conversational speech derived from the Shoah archives, EN & CZ
- EuroGOV, 3.5M web pages crawled from European governmental sites
- IAPR TC-12 photo database; PASCAL VOC 2006 training data
- ImageCLEFmed radiological database consisting of 6 distinct datasets; IRMA collection in EN & DE for automatic medical image annotation
Each track creates topics/queries & relevance assessments in diverse languages (see the sketch below for how such assessments are used).
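To show what those relevance assessments are for, here is a minimal sketch that reads TREC/CLEF-style qrels lines ("topic iteration doc relevance") and scores a ranked run with uninterpolated Average Precision. The topic and document identifiers and the run itself are invented toy data, not real CLEF assessments.

```python
# A minimal sketch of using a test collection's relevance assessments (qrels)
# to benchmark one system's ranked output with Average Precision.

QRELS_LINES = """\
10.2452/401-AH 0 LASTAMPA94-001022 1
10.2452/401-AH 0 LASTAMPA94-004382 0
10.2452/401-AH 0 SDA94-008371 1
"""

def load_qrels(lines):
    """topic id -> set of relevant document ids."""
    relevant = {}
    for line in lines.strip().splitlines():
        topic, _, doc, rel = line.split()
        if int(rel) > 0:
            relevant.setdefault(topic, set()).add(doc)
    return relevant

def average_precision(ranked_docs, relevant):
    """Uninterpolated AP of one ranked list against one topic's judgements."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

qrels = load_qrels(QRELS_LINES)
system_ranking = ["SDA94-008371", "LASTAMPA94-004382", "LASTAMPA94-001022"]
print(average_precision(system_ranking, qrels["10.2452/401-AH"]))  # 0.833...
```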
LREC 2008 Promoting Research through Evaluation
Text Retrieval (from 2000): mono-, bi- and multilingual system performance tested using news documents (13 European languages)
- bilingual task testing on unusual language combinations
- multilingual system testing with many target languages
- advanced tasks to monitor improvement in system performance over time, focused on the problem of merging results from different collections/languages
- "robust" task emphasized the importance of stable performance across languages rather than high average performance (see the sketch below)
- since 2006, queries in non-European languages (Indian sub-task)
- 2008: new tasks on library archives; tasks on non-European target collections; the robust task uses WSD data
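One common way to reward stable rather than peaky behaviour is to average per-topic (or per-language) scores with a geometric rather than an arithmetic mean, as GMAP-style measures do. The per-topic AP values below are invented purely to illustrate the difference.

```python
# A minimal sketch of why a "robust" view differs from plain averaging:
# the geometric mean punishes near-zero scores on individual topics or
# languages far harder than the arithmetic mean.

import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values, eps=1e-5):
    # A small epsilon keeps zero scores from collapsing the whole product.
    return math.exp(sum(math.log(v + eps) for v in values) / len(values))

steady  = [0.30, 0.28, 0.33, 0.31]   # stable system
erratic = [0.70, 0.60, 0.02, 0.00]   # high peaks, but fails on some topics

for name, aps in (("steady", steady), ("erratic", erratic)):
    print(name, round(arithmetic_mean(aps), 3), round(geometric_mean(aps), 3))
# The erratic system has the higher arithmetic mean but a far lower geometric mean.
```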
LREC 2008 Results: Cross-Language Text Retrieval
Comparing bilingual results with monolingual baselines:
TREC-6, 1997:
- EN→FR: 49% of the best monolingual French system
- EN→DE: 64% of the best monolingual German system
CLEF 2002:
- EN→FR: 83.4% of the best monolingual French system
- EN→DE: 85.6% of the best monolingual German system
CLEF 2003 enforced the use of "unusual" language pairs:
- IT→ES: 83% of the best monolingual Spanish IR system
- DE→IT: 87% of the best monolingual Italian IR system
- FR→NL: 82% of the best monolingual Dutch IR system
CLEF 2007: best bilingual system at 88% of the best monolingual system
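For clarity, these percentages are simply the effectiveness of the best bilingual run divided by that of the best monolingual baseline. A minimal sketch of the arithmetic, with invented MAP values:

```python
# Relative effectiveness of a bilingual run against the best monolingual
# baseline; the MAP values below are invented for illustration.
best_monolingual_map = 0.45    # e.g. best FR->FR run (invented)
best_bilingual_map   = 0.375   # e.g. best EN->FR run (invented)
print(f"{100 * best_bilingual_map / best_monolingual_map:.1f}% of monolingual baseline")  # 83.3%
```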
LREC 2008 Other results: non-document & non-text retrieval
- Interactive CLEF: cross-language IR from a user-inclusive perspective
- Multilingual question answering: 10 different target collections, real-time exercise, answer validation, QA on speech transcripts
- Geographical CLIR
- Cross-language image retrieval: tasks on photo and medical archives, tasks for retrieval and classification
- Cross-language spoken document & cross-language speech retrieval
LREC 2008 CLEF Achievements
- Stimulation of research activity in new, previously unexplored areas
- Study and implementation of evaluation methodologies for diverse types of cross-language IR systems
- Creation of a large set of empirical data about multilingual information access from the user perspective
- Quantitative and qualitative evidence with respect to best practice in cross-language system development
- Creation of reusable test collections for system benchmarking
- Building of a strong, multidisciplinary research community
BUT
LREC 2008 BUT: notable lack of take-up by application communities
LREC 2008 TrebleCLEF
TrebleCLEF is a Coordination Action, funded under FP7 from 2008 to 2009, which aims at:
- continuing to promote the development of advanced multilingual multimedia information access systems
- disseminating know-how, tools and resources to enable DL creators to make content and knowledge accessible, usable and exploitable over time, over media and over language boundaries
LREC 2008 Objectives I
TrebleCLEF will promote R&D and industrial take-up of multilingual, multimodal information access functionality in the following ways:
by continuing to support the annual CLEF system evaluation campaigns, with particular focus on:
- user modeling, e.g. the requirements of different classes of users when querying multilingual information sources
- language-specific experimentation, e.g. looking at differences across languages in order to derive best practices for each language
- results presentation, e.g. how results can be presented in the most useful and comprehensible way to the user
LREC 2008 Objectives II
by constituting a scientific forum for the MLIA community of researchers, enabling them to meet and discuss results, emerging trends and new directions
by providing a scientific digital library to manage and make accessible the scientific data and experiments produced during the course of an evaluation campaign, with tools to:
- analyze, compare and cite the data and experiments
- curate, preserve, annotate and enrich them (promoting their re-use)
LREC 2008 Objectives III
by acting as a virtual centre of competence, providing a central reference point for anyone interested in studying or implementing MLIA functionality:
- making publicly available sets of guidelines on best practices in MLIA (e.g. what stemmer to use, what stop list, what translation resources, how best to evaluate, etc., depending on the application requirements), as sketched below
- making tools and resources used in the evaluation campaigns freely available to a wider public whenever possible; otherwise providing links to where they can be acquired
- organising workshops and/or tutorials and training sessions
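As an idea of what machine-usable best-practice guidelines could look like, here is a minimal sketch of a per-language configuration registry. The component names, file names and recommendations are illustrative assumptions, not actual TrebleCLEF guidance.

```python
# A minimal sketch of per-language "best practice" configuration: which stemmer,
# stop list, decompounding choice and translation resource to use per language.
# All entries are illustrative placeholders.

from dataclasses import dataclass

@dataclass
class LanguageConfig:
    stemmer: str            # e.g. a Snowball stemmer variant (assumed name)
    stoplist: str           # name of the stop-word list to load (assumed file)
    decompound: bool        # Germanic languages often benefit from splitting
    translation: str        # preferred translation resource for this setting

BEST_PRACTICE = {
    "de": LanguageConfig("snowball-german",  "de-stoplist.txt", True,  "MT"),
    "fr": LanguageConfig("snowball-french",  "fr-stoplist.txt", False, "bilingual dictionary"),
    "fi": LanguageConfig("snowball-finnish", "fi-stoplist.txt", True,  "MT"),
}

def configure_pipeline(lang):
    """Look up the recommended components for a target language, with a
    conservative fallback when no guideline exists yet."""
    return BEST_PRACTICE.get(lang, LanguageConfig("none", "none", False, "MT"))

print(configure_pipeline("de"))
```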
LREC 2008 Approach
Three lines of activity around MLIA:
Evaluation:
- test collections and laboratory evaluation
- user evaluation
- log analysis
Best Practices & Guidelines:
- system-oriented aspects of MLIA applications
- collaborative user studies
- user-oriented aspects of MLIA interfaces
Dissemination and Training:
- tutorials
- workshops
- summer school
LREC 2008 Consortium
- ISTI-CNR, Pisa, Italy
- University of Padua, Italy
- University of Sheffield, United Kingdom
- Universidad Nacional de Educación a Distancia, Spain
- Zurich University of Applied Sciences, Switzerland
- Centre for the Evaluation of Language Communication Technologies, Italy
- Evaluations & Language Resources Distribution Agency, France
LREC 2008 Contacts
For further information see: http://www.trebleclef.eu/
or contact: Carol Peters, ISTI-CNR
E-mail: carol.peters@isti.cnr.it