
1 Evaluating Multilingual Question Answering Systems at CLEF Pamela Forner 1, Danilo Giampiccolo 1, Bernardo Magnini 2, Anselmo Peñas 3, Álvaro Rodrigo 3, Richard Sutcliffe 4. 1 CELCT, Trento, Italy; 2 FBK, Trento, Italy; 3 UNED, Madrid, Spain; 4 University of Limerick, Ireland

2 Outline Background QA at CLEF Resources Participation Evaluation Discussion Conclusions

3 Background – QA A Question Answering (QA) system takes as input a short natural-language question and a document collection, and produces an exact answer to the question, taken from the collection. In monolingual QA, the question and answer are in the same language; in cross-lingual QA, they are in different languages.

4 Background – Monolingual Example Question: How many gold medals did Brian Goodell win in the 1979 Pan American Games? Answer: three gold medals Docid: LA112994-0248 Context: When comparing Michele Granger and Brian Goodell, Brian has to be the clear winner. In 1976, while still a student at Mission Viejo High, Brian won two Olympic gold medals at Montreal, breaking his own world records in both the 400- and 1,500-meter freestyle events. He went on to win three gold medals in the 1979 Pan American Games.

5 Background – Cross-Lingual Example Question: How high is the Eiffel Tower? Answer: 300 Meter Docid: SDA.950120.0207 Context: Der Eiffelturm wird jaehrlich von 4,5 bis 5 Millionen Menschen besucht. Das 300 Meter hohe Wahrzeichen von Paris hatte im vergangenen Jahr vier neue Aufzuege von der zweiten bis zur vierten Etage erhalten. (English: The Eiffel Tower is visited by 4.5 to 5 million people every year. Last year, the 300-metre-high landmark of Paris received four new lifts running from the second to the fourth floor.)

6 Background – Grouped Questions With grouped questions, there are several questions on the same topic, which may be linked, even indirectly, by co-reference: Question: Who wrote the song "Dancing Queen"? Question: When did it come out? Question: How many people were in the group?

7 QA at CLEF - Eras The origin was QA at the Text REtrieval Conference (TREC), from 1999 onwards; the term "factoid" was coined there. At CLEF there have been three eras: Era 1 (2003-06): ungrouped; mainly factoid; monolingual newspapers; exact answers. Era 2 (2007-08): grouped; mainly factoid; monolingual newspapers and Wikipedias; exact answers. Era 3 (2009-10): ungrouped; factoid + others; multilingual aligned EU documents; passages and exact answers.

8 QA at CLEF - Tasks

9 Resources - Documents Originally various newspapers (different in each target language, but from the same years, 1994/95). For Era 2 (linked questions), Wikipedia 2006 was added. With Era 3, the collection changed to the JRC-Acquis corpus of European agreements and laws. In 2010, Europarl (partly transcribed debates from the European Parliament) was added. Acquis and Europarl are parallel, aligned corpora.

10 Resources - Questions In all years, questions are back-composed from the target-language corpus. They are carefully grouped into various categories (person, place, etc.). However, they are not naturally occurring or real.

11 Resources – Back Translation of Questions Each group composes questions in its own language, with answers in its target document collection. They translate these into English (the pivot language). All resulting English translations are pooled. Each group then translates the English questions into its own language. Eras 1 & 2: questions in a given target language can be asked in any source language. Era 3: questions in any target language can be asked in any source language.

12 Resources – Back Translation (cont.) Eras 1 & 2: each participating group answers different questions, depending on the target language. Era 3: every group answers the same questions. The Gold Standard, comprising questions, answers and contexts in each target language, is probably the most interesting thing to come out of the QA at CLEF activity. The back-translation paradigm was worked out for the first campaign.
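
To make the shape of this resource concrete, the sketch below shows one way a single Gold Standard entry might be represented. The field names are hypothetical and do not reflect the official CLEF distribution format; the values simply reuse the cross-lingual example from slide 5.

```python
# Hypothetical sketch of one multilingual Gold Standard entry.
# Field names are illustrative only, not the official CLEF format;
# values reuse the cross-lingual example from slide 5.
gold_entry = {
    "question_id": "0042",              # hypothetical identifier
    "question_type": "FACTOID",
    "source_language": "EN",            # language the question is asked in
    "target_language": "DE",            # language of the document collection
    "question": "How high is the Eiffel Tower?",
    "answer": "300 Meter",
    "docid": "SDA.950120.0207",         # supporting document in the target collection
    "context": "Das 300 Meter hohe Wahrzeichen von Paris ...",
}
```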

13 Participation

14 Evaluation - Measures Answers are judged Right / Wrong / Unsupported / ineXact; these standard TREC judgements have been used all along. Accuracy: proportion of answers judged Right. MRR: mean reciprocal rank of the first correct answer; thus each question contributes 1, 0.5, 0.33, ..., or 0. C@1: rewards a system for withholding an answer rather than answering wrongly. CWS (Confidence-Weighted Score): rewards a system for being confident of its correct answers. K1: also links correctness and confidence.
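
As a rough illustration, here is a minimal Python sketch of the first three measures. It is not the official CLEF/TREC scorer, and the per-question record format (an 'answered' flag plus a ranked list of R/W/U/X judgements) is an assumption made for the example. C@1 follows the published definition C@1 = (nR + nU * nR/n) / n, where nR is the number of questions answered correctly, nU the number left unanswered, and n the total.

```python
# Minimal sketch of Accuracy, MRR and C@1 (not the official scorer).
# Each question is a dict with hypothetical fields:
#   'answered': False if the system chose not to answer (relevant for C@1)
#   'ranked_judgements': list of 'R'/'W'/'U'/'X' for returned answers, best first

def accuracy(questions):
    """Proportion of questions whose first answer was judged Right."""
    right = sum(1 for q in questions
                if q['ranked_judgements'] and q['ranked_judgements'][0] == 'R')
    return right / len(questions)

def mrr(questions):
    """Mean reciprocal rank of the first Right answer (0 if none)."""
    total = 0.0
    for q in questions:
        for rank, judgement in enumerate(q['ranked_judgements'], start=1):
            if judgement == 'R':
                total += 1.0 / rank
                break
    return total / len(questions)

def c_at_1(questions):
    """C@1: unanswered questions earn partial credit at the system's accuracy rate."""
    n = len(questions)
    n_right = sum(1 for q in questions
                  if q['answered'] and q['ranked_judgements'][0] == 'R')
    n_unanswered = sum(1 for q in questions if not q['answered'])
    return (n_right + n_unanswered * (n_right / n)) / n

# Example: one question right at rank 1, one wrong at rank 1, one left unanswered.
qs = [
    {'answered': True,  'ranked_judgements': ['R', 'W']},
    {'answered': True,  'ranked_judgements': ['W', 'R']},
    {'answered': False, 'ranked_judgements': []},
]
print(accuracy(qs), mrr(qs), c_at_1(qs))  # -> 0.333..., 0.5, 0.444...
```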

15 Evaluation - Method Originally, runs were inspected individually by hand. LIM used Perl TREC tools incorporating double judging. The WiQA group produced an excellent web-based system allowing double judging. CELCT produced a web-based system. Evaluation is very interesting work!

16 Evaluation - Results

17 Discussion – Era 1 (03-06) Monolingual QA improved from 49% to 68%. The best system was for a different language each year! Reason: increasingly sophisticated techniques were used, mostly learned from TREC, plus CLEF and NTCIR. Cross-lingual QA remained at 35-45% throughout. Reason: the required improvement in Machine Translation has not been realised by participants.

18 Discussion – Era 2 (07-08) Monolingual QA improved from 54% to 64%. However, the range of results was greater, as only a few groups were capable of the more difficult task. Cross-lingual QA deteriorated from 42% to 19%! Reason: 42% was an isolated result, and the general field was much worse.

19 Discussion – Era 3 (09-10) In 2009, the task was only passage retrieval (easier). However, the documents are much more difficult than newspapers, and the questions reflect this. Monolingual passage retrieval was 61%; cross-lingual passage retrieval was 18%.

20 Conclusions - General A lot of groups around Europe and beyond have been able to participate in their own languages. Hence, the general capability in European languages has improved considerably – both systems and research groups. However, people are often interested in their own language only, i.e. monolingual systems. Cross-lingual systems are mostly X->EN or EN->X, i.e. to or from English. Many language directions are supported by the campaign but not taken up.

21 Conclusions – Resources & Tools During the campaigns, very useful resources have been developed – Gold Standards for each year. These are readily available and can be used by groups to develop systems even if they did not participate in CLEF. Interesting tools for devising questions and evaluating results have also been produced.

22 Conclusions - Results Monolingual results have improved to the level of TREC English results. Thus new, more dynamic and more realistic QA challenges must be found for future campaigns. Cross-lingual results have not improved to the same degree. High-quality MT (especially on Named Entities) is not a solved problem and requires further attention.

