Cross-Language Access to Recorded Speech in the MALACH Project Douglas Oard, Dina Demner-Fushman, Jan Hajic, Bhuvana Ramabhadran, Sam Gustman, Bill Byrne,

Presentation transcript:

Cross-Language Access to Recorded Speech in the MALACH Project Douglas Oard, Dina Demner-Fushman, Jan Hajic, Bhuvana Ramabhadran, Sam Gustman, Bill Byrne, Dagobert Soergel, Bonnie Dorr, Philip Resnik, Michael Picheny, Josef Psutka

Outline
The MALACH project
Searching speech
A cross-language retrieval experiment
Next steps

The MALACH Project
52,000 interviews with Holocaust survivors
–116,000 hours (180 TB MPEG-1)
–32 languages, recorded in 67 countries
Present: Manual indexing
–14,000 controlled vocabulary terms
Future: Automatic indexing
–Speech recognition
–Translation

Who Uses the Collection?
Discipline: History, Linguistics, Journalism, Material culture, Education, Psychology, Political science, Law enforcement
Products: Book, Documentary film, Research paper, CDROM, Study guide, Obituary, Evidence, Personal use
Based on analysis of 280 access requests

Research Challenges
Speech Recognition
–Spontaneous, accented, elderly, language switching
Computational Linguistics
–Segmentation, classification, summarization, extraction
Information Retrieval
–Query formulation, search, selection, examination, use
Today Tomorrow (Josef Psutka)

Supporting Information Access
[Diagram labels: Source Selection, Query Formulation, Query, Search, Search System, Ranked List, Selection, Recording, Examination, Recording, Delivery; feedback loops: Query Reformulation and Relevance Feedback, Source Reselection]

Key Issues in Speech Retrieval
Recognition accuracy
–Content-based retrieval works when WER < 40%
Topic segmentation
–Average MALACH interview is 2.3 hours!
Multi-scale summarization
–Brief summaries: selection from a ranked list
–Detailed summaries: minimize audio replay
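The WER threshold on this slide can be made concrete with a small sketch. It uses the standard definition of word error rate (word-level edit distance divided by the number of reference words). The function names and the `likely_searchable` helper are illustrative assumptions, not part of the MALACH system.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def likely_searchable(wer: float, threshold: float = 0.40) -> bool:
    """Hypothetical helper: apply the slide's WER < 40% rule of thumb."""
    return wer < threshold
```

Under this rule of thumb, a transcript at 33% WER would pass the check and one at 46% WER would not.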

English Recognition Accuracy
60% WER for off-the-shelf systems!
–3 systems (broadcast news, dictation, telephone)
MLLR adaptation helps
–33% WER for fluent speech
–46% WER for heavy accents/disfluent speech
Next step: retrain on transcribed interviews
–200 hours from 800 speakers

Cross-Language Search
Query formulation
–Spoken words (free text)
–Thesaurus descriptors
Segment selection
–Speech-to-text translation
–Multi-scale indicative summaries
Use of retrieved segments
–Query reformulation
–Incorporation in projects

Ranked Retrieval System Design
[Diagram labels: Documents, Compute Term Weights, Build Index, Query, Translation Lexicon, Compute Term Weights, Compute Document Score, Sort Scores, Ranked List]
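The stages named on this slide (compute term weights, build index, compute document score, sort scores) can be sketched as follows. The slide does not say which weighting scheme was used, so the tf-idf weights here are an assumption, and `build_index` and `rank` are hypothetical names chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Compute term weights and build an index: term -> {doc_id: weight}."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(text.lower().split())
                   for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    index = defaultdict(dict)
    for doc_id, counts in term_counts.items():
        for term, tf in counts.items():
            # Assumed weighting: raw term frequency times a smoothed idf.
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            index[term][doc_id] = tf * idf
    return index


def rank(index, query_terms):
    """Score documents against the (already translated) query and sort."""
    scores = defaultdict(float)
    query_counts = Counter(term.lower() for term in query_terms)
    for term, query_tf in query_counts.items():
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += query_tf * weight
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

In a cross-language setting, `query_terms` would be the output of the translation step, so a Czech query is scored against English documents only after lexicon lookup.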

Evaluation Framework
[Diagram labels: Czech Queries, Czech/English Translation Lexicon, English Documents, Ranked Retrieval, Ranked List, Relevance Judgments, Evaluation, Measure of Effectiveness]

Czech/English Test Collection
113,000 English newspaper stories
Two sets of 33 Czech queries
–S: Very short (1-3 words)
–L: Sentence-length
Human “ground truth” relevance judgments
–Pooled assessment methodology (CLEF-2000)
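Pooled assessment, in general terms, means that assessors judge only the union of the top-ranked documents from each participating system rather than the whole collection. A minimal sketch of that idea follows. The pool depth and run names are assumptions for illustration and are not taken from the CLEF-2000 setup.

```python
def build_pool(runs, depth):
    """Union of the top `depth` documents from every system's ranked list.

    `runs` maps a system name to its ranked list of document ids for one
    query. Documents outside the pool are left unjudged and are usually
    treated as not relevant when scoring.
    """
    pool = set()
    for ranked_list in runs.values():
        pool.update(ranked_list[:depth])
    return pool
```

For example, `build_pool({"sysA": ["d1", "d2", "d3"], "sysB": ["d2", "d4", "d5"]}, depth=2)` yields the set of `d1`, `d2`, and `d4`.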

Translation Lexicon
Machine-readable dictionary
–Lemmatized Czech query words
–Looked each up in “PC Translator”
Bilingual term list
–Downloaded 800 term pairs from Ergane
Retained untranslatable terms
–Stripped diacritics to match proper names
–Optionally, made minor corrections (by hand), e.g., “afrika” to “africa”

Example Query
Original Czech query (S)
–Architektura v Berlíně
Word-by-word translation into English
–architecture architecture
–at below beneath by embattled in inside into on per under upon upstairs v within at below beneath by embattled in inside into on per under upon upstairs v within
–berlin
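The word-by-word translation shown here can be approximated with a short sketch: lemmatize each query word, look the lemma up in a bilingual lexicon, and keep untranslatable words after stripping diacritics. The `LEMMAS` and `LEXICON` tables below are tiny hypothetical stand-ins (the lexicon entry for "v" is only a subset of the slide's list), not the PC Translator or Ergane resources.

```python
import unicodedata

# Hypothetical toy resources for illustration only.
LEMMAS = {"berlíně": "berlín"}
LEXICON = {
    "architektura": ["architecture"],
    "v": ["at", "in", "inside", "within"],
}


def strip_diacritics(text):
    """Remove combining marks, e.g. 'Berlín' -> 'Berlin'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def translate_query(query, lemmas, lexicon):
    """Replace each query word with all of its lexicon translations.

    Words with no lexicon entry are retained with diacritics stripped,
    so proper names can still match English text.
    """
    translated = []
    for word in unicodedata.normalize("NFC", query).lower().split():
        lemma = lemmas.get(word, word)
        if lemma in lexicon:
            translated.extend(lexicon[lemma])
        else:
            translated.append(strip_diacritics(lemma))
    return translated
```

With these toy tables, `translate_query("Architektura v Berlíně", LEMMAS, LEXICON)` returns the translations of the first two words followed by the retained name `berlin`.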

Example Search Results
Creating a new architectural vocabulary for a democratic Berlin
UCLA merges architecture and arts into a new school
Best of Berlin for young travelers
Who owns the Nazi paper trail?
A commitment to change the world; No place like utopia: Modern Architecture and the Company we Kept …
On the record: Sanderling's dark take on Sibelius
Max Bill, 85; Controversial Swiss artist, sculptor and writer
The week ahead: Berlin; Farewell to allies
Roll over Beethoven; Jeff Berlin leaves the violin and classical …
Californians had right stuff for airlift; Europe: former pilots …

Precision-Recall Graph
Czech title query 1, LA Times Documents, CLEF 2000 Relevance Assessments
Average Precision =

Average Precision
Czech title queries, LA Times Documents, CLEF 2000 Relevance Assessments
Mean Average Precision =
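Average precision and its mean over queries can be computed as in the sketch below. It uses the common uninterpolated definition (precision at each relevant document's rank, averaged over all relevant documents). The slides do not state which variant was used, so that choice is an assumption, as are the function names.

```python
def average_precision(ranked_ids, relevant_ids):
    """Uninterpolated average precision for one query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs, judgments):
    """Mean of per-query average precision.

    `runs` maps query id -> ranked list of document ids;
    `judgments` maps query id -> collection of relevant document ids.
    """
    scores = [average_precision(ranking, judgments.get(query_id, ()))
              for query_id, ranking in runs.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

Because unretrieved relevant documents count as zero, a query translation that drops a key term lowers average precision even when the documents it does retrieve are ranked well.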

Results

Czech seems to pose no unusual problems
–55% of monolingual with simple techniques
Suitable Czech/English resources exist
–Czech morphology
–Czech/English bilingual lexicon
Multiword expression handling would help
–Named entities, non-compositional phrases

Some Next Steps
Integrate Czech/English statistical MT
–Johns Hopkins (Summer 2002 Workshop)
Integrate with English and Czech ASR
–IBM and Univ of West Bohemia/Charles Univ
Integrate into an interactive retrieval system
–University of Maryland and Shoah Foundation

For More Information
Cross-language and speech retrieval
–
–
The MALACH project
–
NSF/EU Spoken Word Access Working Group
–