NE Recognition and Keyword-based Relevance Feedback in Mono and CL Automatic Speech Transcriptions Retrieval
F. López-Ostenero, V. Peinado, V. Sama & F. Verdejo
CLEF 2005. Vienna, Austria. September 22nd, 2005

2 Goals
- Test the suitability of our translation resources on a new track.
- Test different strategies to clean the automatic transcriptions.
- Check whether proper noun recognition can help to improve retrieval over automatic speech transcriptions.
- Compare the effectiveness of manual and automatic keywords in a keyword-based pseudo-relevance feedback approach.

3 Cleaning strategies
Starting point: the ASR2004A field. We observed that proper nouns the ASR software does not recognize were transcribed as lists of single characters.
- CLEAN collection: join all occurrences of a list of single characters (l i e b b a c h a r d → liebbachard) and remove all extra occurrences of duplicated words ("yes yes yes yes" → "yes").
- 3GRAMS collection: split the documents of the CLEAN collection into character 3-grams.
- MORPHO collection: perform a morphological analysis over the CLEAN collection and remove all words that cannot act as a noun, adjective or verb.
- POS collection: perform full part-of-speech tagging over the CLEAN collection and delete all words except nouns, adjectives and verbs.
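
The two CLEAN heuristics can be pictured with a minimal Python sketch; the regular expressions are our own approximation of the rules described above, not the exact implementation used in the experiments:

```python
import re

def clean_transcription(text: str) -> str:
    """Approximate the CLEAN heuristics on a whitespace-tokenized transcription."""
    # 1. Join runs of single characters (unrecognized proper nouns are
    #    often transcribed letter by letter): "l i e b ..." -> "lieb..."
    text = re.sub(r'\b(?:\w )+\w\b', lambda m: m.group(0).replace(' ', ''), text)
    # 2. Collapse consecutive repetitions of a word: "yes yes yes yes" -> "yes"
    text = re.sub(r'\b(\w+)( \1\b)+', r'\1', text)
    return text

print(clean_transcription("the l i e b b a c h a r d case yes yes yes yes"))
# -> "the liebbachard case yes"
```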

4 Submitted runs

Ranking  MAP     Run           Language
20       0.0934  mono-pos      English
21       0.0918  mono-morpho   English
29       0.0706  mono-3grams   English
32       0.0373  trans-pos     Spanish
33       0.0370  trans-morpho  Spanish

- Runs based on a query translation approach (using Pirkola's structured queries) and the INQUERY retrieval engine.
- Scores far from the best monolingual and Spanish crosslingual runs: room for improvement.
- MAP scores of morpho and pos are very similar: what is the influence of the cleaning strategies?
- Character 3-grams score worse than full-word retrieval.
- Only 5 different runs: not enough data to draw clear conclusions.
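
Pirkola's approach wraps the translation alternatives of each source-language term in a synonym operator, so that all translations share a single term slot. Here is a small sketch of how such a query could be assembled; #sum and #syn are standard INQUERY operators, but the dictionary entries and the exact query shape are illustrative assumptions:

```python
def pirkola_query(source_terms, bilingual_dict):
    """Build an INQUERY-style structured query where every source-language
    term becomes one #syn() group holding all of its candidate translations."""
    groups = []
    for term in source_terms:
        translations = bilingual_dict.get(term, [term])  # keep untranslated terms as-is
        groups.append('#syn(' + ' '.join(translations) + ')')
    return '#sum(' + ' '.join(groups) + ')'

# Hypothetical Spanish->English dictionary entries, for illustration only
d = {'rescate': ['rescue', 'ransom'], 'emergencia': ['emergency']}
print(pirkola_query(['rescate', 'emergencia'], d))
# -> #sum(#syn(rescue ransom) #syn(emergency))
```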

5 Proper noun identification
We used a shallow entity recognizer to identify proper nouns in the topics:
- Monolingual: we identify proper nouns in the English topics and structure the query, tagging them with a proximity operator.
- Crosslingual: we also identify proper nouns in the Spanish topics but, which proper nouns should be translated and which ones should not?
  - If a proper noun appears in the SUMMARY field of the documents, we assume that it should not be translated and tag it using a proximity operator.
  - Otherwise, we try to translate the proper noun.
Example: "La historia de Varian Fry y el Comité de Rescates de Emergencia ..." ("The story of Varian Fry and the Emergency Rescue Committee ..."): "Varian Fry" is left untranslated, while we try to translate "Comité de Rescates de Emergencia".
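
The translate-or-not decision reduces to a few lines; in this sketch the proximity operator name, the window size and the translate() helper are hypothetical placeholders for the resources described above:

```python
def handle_proper_noun(name, summary_vocabulary, translate):
    """Keep a proper noun verbatim if the manual SUMMARY field already
    contains it; otherwise fall back to dictionary translation.
    translate() is a hypothetical helper standing in for our resources."""
    if name.lower() in summary_vocabulary:
        return f'#uw5({name})'  # illustrative proximity operator and window
    return translate(name)
```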

6 Pseudo-relevance feedback
Five collections to study keyword-based pseudo-relevance feedback:
1. The AUTOKEYWORD2004A1 field (to build the AK1 collection).
2. The AUTOKEYWORD2004A2 field (to build the AK2 collection).
3. A mix of 1 and 2 (to build the AK12 collection):
   - Single-field keyword scores according to position: 1st keyword = 20, 2nd keyword = 19, ...
   - If a keyword appears in both fields, its final score is the sum of both single-field scores.
   - Select the top 20 scored keywords.
4. The MANUALKEYWORD field (to build the MK collection).
5. A mix of 3 and 4 (to build the MKAK12 collection):
   - Select the n manual keywords from 4.
   - If n < 20, add the first 20 − n keywords from 3.
Pseudo-relevance feedback procedure:
1. Launch a plain query (without keyword expansion).
2. Retrieve the keywords from the top 10 retrieved documents.
3. Mix these keywords using the AK12 algorithm described above (see the sketch below).
4. Expand the query using the top 20 keywords.
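
A minimal sketch of the AK12 score-and-merge step as described on this slide; the sample keyword fields are hypothetical:

```python
from collections import defaultdict

def mix_keywords(field1, field2, top_k=20):
    """Merge two ranked keyword lists with position-based scores
    (1st keyword = 20, 2nd = 19, ...); a keyword present in both
    fields receives the sum of its two single-field scores."""
    scores = defaultdict(int)
    for field in (field1, field2):
        for rank, keyword in enumerate(field[:20]):
            scores[keyword] += 20 - rank
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical automatic keyword fields for one document
ak1 = ['refugees', 'marseille', 'visa']
ak2 = ['visa', 'rescue']
print(mix_keywords(ak1, ak2))
# -> ['visa', 'refugees', 'marseille', 'rescue']  ('visa' scores 18 + 20)
```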

7 Combining techniques
We tested all possible combinations of the different strategies, except 3-grams in the crosslingual runs, where such combinations do not make any sense.
- All combinations: 2 (entities) × 2 (languages) × 4 (cleaning) × 6 (feedback) = 96 runs.
- Excluding 3-grams from the crosslingual runs:
  - mono: 2 × 4 × 6 = 48 runs
  - trans: 2 × 3 × 6 = 36 runs
  - Real number of runs: 48 + 36 = 84
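
The run grid is small enough to enumerate directly; this sketch just verifies the arithmetic, using the strategy labels from the run names on these slides:

```python
from itertools import product

entities = ['ent', 'noent']
cleaning = ['clean', 'morpho', 'pos', '3grams']
feedback = ['NO', 'AK1', 'AK2', 'AK12', 'MK', 'MKAK12']

runs = [('mono', e, c, f) for e, c, f in product(entities, cleaning, feedback)]
# character 3-grams make no sense for the query-translation (crosslingual) runs
runs += [('trans', e, c, f) for e, c, f in product(entities, cleaning, feedback)
         if c != '3grams']
assert len(runs) == 48 + 36 == 84
```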

8 Results

Ranking  MAP     Run                   Variation wrt UNED submitted runs
1        0.2595  mono-ent-morpho-MK    277.8% (monolingual)
13       0.2036  trans-ent-pos-MK      545.8% (crosslingual)
31       0.0934  mono-noent-pos-NO     100% (monolingual)
63       0.0706  mono-noent-3grams-NO  75.6% (monolingual)
73       0.0373  trans-noent-pos-NO    100% (crosslingual)

Preliminary conclusions:
- Monolingual improvement: 277.8%.
- Crosslingual improvement: 545.8%.
- Best strategies: PRF using the MK or MKAK12 collections, and using proper noun recognition.
- Monolingual 3-grams score poorly, reaching only 27.2% of our best run.

9 Influence of proper nouns
[Plot: MAP(ent) / MAP(noent), as a percentage, for each clean/pos/morpho and PRF combination; for instance, the leftmost red point represents MAP(mono-ent-clean-NO) / MAP(mono-noent-clean-NO).]
- Monolingual: the increase is negligible and probably not statistically significant.
- Crosslingual: proper noun detection more than doubles MAP.
- Proper nouns show more influence in the crosslingual runs when combined with AK2 or AK12 PRF.

10 Influence of relevance feedback
[Plot: MAP(PRF method) / MAP(no PRF), as a percentage, for each configuration.]
- MK is the best option, but when combined with AK12 the MAP decreases.
- AK12 is usually better than AK1 or AK2: more stable in monolingual runs, better in crosslingual runs with entities, worse in crosslingual runs without entities.
- AK1 and AK2 are identical in monolingual runs; AK1 is better in crosslingual runs.
- AK2 and AK12 decrease MAP in crosslingual runs without proper nouns, which can explain why the identified proper nouns improve MAP more in the AK2 and AK12 runs.

11 Conclusions and future work
- Using a shallow entity recognizer to identify proper nouns seems very useful, especially in a crosslingual environment, where MAP increases by 221.9% on average.
- Cleaning methods based on full words (clean, morpho and pos) show no significant differences, but the character 3-grams approach does not seem useful for this task.
- Pseudo-relevance feedback using manually generated keywords proves to be the best option to improve retrieval performance, averaging 271.6% of the no-feedback MAP.
Future work:
- Perform further analysis over the results, including statistical significance tests (one possible test is sketched below).
- Try a different approach to identify proper nouns in the automatic transcriptions or in the automatic keyword fields, instead of using the manual summary of the transcriptions: the big improvement from detecting proper nouns in the crosslingual environment may be due to the use of a manually generated field to decide whether a proper noun should be translated.
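
One common choice for the significance tests mentioned above is a paired randomization test over per-topic average precision; this sketch is our own illustration, not necessarily the test the authors had in mind:

```python
import random

def randomization_test(ap_run_a, ap_run_b, trials=10_000, seed=0):
    """Paired randomization test on per-topic average precision.
    Under the null hypothesis the two runs are exchangeable per topic,
    so we randomly flip the sign of each per-topic difference and count
    how often the mean absolute difference reaches the observed one."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(ap_run_a, ap_run_b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = sum(
        1 for _ in range(trials)
        if abs(sum(d if rng.random() < 0.5 else -d for d in diffs)) / len(diffs)
        >= observed
    )
    return hits / trials  # two-sided p-value
```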
