Cross-Language Access to Recorded Speech in the MALACH Project Douglas Oard, Dina Demner-Fushman, Jan Hajic, Bhuvana Ramabhadran, Sam Gustman, Bill Byrne,


1 Cross-Language Access to Recorded Speech in the MALACH Project Douglas Oard, Dina Demner-Fushman, Jan Hajic, Bhuvana Ramabhadran, Sam Gustman, Bill Byrne, Dagobert Soergel, Bonnie Dorr, Philip Resnik, Michael Picheny, Josef Psutka

2 Outline
The MALACH project
Searching speech
A cross-language retrieval experiment
Next steps

3 The MALACH Project
52,000 interviews with Holocaust survivors
–116,000 hours (180 TB MPEG-1)
–32 languages, recorded in 67 countries
Present: Manual indexing
–14,000 controlled vocabulary terms
Future: Automatic indexing
–Speech recognition
–Translation

4 Who Uses the Collection?
Discipline: History, Linguistics, Journalism, Material culture, Education, Psychology, Political science, Law enforcement
Products: Book, Documentary film, Research paper, CDROM, Study guide, Obituary, Evidence, Personal use
Based on analysis of 280 access requests

5 Research Challenges
Speech Recognition
–Spontaneous, accented, elderly, language switching
Computational Linguistics
–Segmentation, classification, summarization, extraction
Information Retrieval
–Query formulation, search, selection, examination, use
Today Tomorrow (Josef Psutka)

6 Supporting Information Access
[Diagram of the information access process. Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery. Intermediate items: Search System, Query, Ranked List, Recording. Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection.]

7 Key Issues in Speech Retrieval
Recognition accuracy
–Content-based retrieval works when WER<40%
Topic segmentation
–Average MALACH interview is 2.3 hours!
Multi-scale summarization
–Brief summaries: selection from a ranked list
–Detailed summaries: minimize audio replay
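The WER<40% threshold above refers to word error rate. As a rough illustration, WER is commonly computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the reference length. The function name and the sample strings below are hypothetical; the slides do not say which scoring tool the project used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Illustrative strings only, not MALACH data.
    print(word_error_rate("we left the city in spring", "we left city in the spring"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.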

8 English Recognition Accuracy
60% WER for off-the-shelf systems!
–3 systems (broadcast news, dictation, telephone)
MLLR adaptation helps
–33% WER for fluent speech
–46% WER for heavy accents/disfluent speech
Next step: retrain on transcribed interviews
–200 hours from 800 speakers

9 Cross-Language Search
Query formulation
–Spoken words (free text)
–Thesaurus descriptors
Segment selection
–Speech-to-text translation
–Multi-scale indicative summaries
Use of retrieved segments
–Query reformulation
–Incorporation in projects

10 Ranked Retrieval System Design
[Diagram boxes: Documents, Compute Term Weights, Build Index; Query, Translation Lexicon, Compute Term Weights; Compute Document Score, Sort Scores, Ranked List]
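The boxes on this slide can be read as a small pipeline: weight and index the documents, translate the query through the lexicon, weight the query terms, score each document, and sort. The sketch below is one minimal way to do that, assuming whitespace tokenization and a simple TF-IDF weighting. The slides do not name the weighting scheme actually used, and all function names and sample data here are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Compute term weights and build an inverted index.

    docs maps doc_id -> text; returns {term: {doc_id: weight}}.
    """
    n_docs = len(docs)
    term_counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
    doc_freq = Counter(t for counts in term_counts.values() for t in counts)
    index = defaultdict(dict)
    for doc_id, counts in term_counts.items():
        for term, tf in counts.items():
            # Assumed weighting: raw term frequency times a smoothed IDF.
            index[term][doc_id] = tf * math.log(1 + n_docs / doc_freq[term])
    return index


def translate(query_terms, lexicon):
    """Replace each query term with all of its lexicon translations."""
    translated = []
    for term in query_terms:
        translated.extend(lexicon.get(term, [term]))  # keep unknown terms as-is
    return translated


def rank(query_terms, index):
    """Score documents against the query terms and sort by descending score."""
    query_weights = Counter(query_terms)
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    docs = {  # toy documents, not the actual collection
        "d1": "berlin architecture berlin",
        "d2": "ucla architecture school",
        "d3": "sibelius recording",
    }
    lexicon = {"architektura": ["architecture"]}  # toy lexicon
    query = translate(["architektura", "berlin"], lexicon)
    print(rank(query, build_index(docs)))
```

Translating the query rather than the documents keeps the index monolingual, so only the lexicon has to change to support another query language.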

11 [Diagram labels: Ranked Retrieval; Czech/English Translation Lexicon; Evaluation Framework; Ranked List; English Documents; Relevance Judgments; Evaluation; Measure of Effectiveness; Czech Queries]

12 Czech/English Test Collection
113,000 English newspaper stories
Two sets of 33 Czech queries
–S: Very short (1-3 words)
–L: Sentence-length
Human “ground truth” relevance judgments
–Pooled assessment methodology (CLEF-2000)
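In a pooled assessment, human judges assess only the union of the top-ranked documents from several systems rather than the whole collection. A minimal sketch of building such a pool follows. The pool depth and the run contents are assumptions for illustration, since the slide does not give the CLEF-2000 settings.

```python
def build_pool(runs, depth):
    """Return the set of documents to judge for one query.

    runs  -- list of ranked lists (one per system), best document first
    depth -- how many top documents to take from each run (assumed value)
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    pool = set()
    for ranked_ids in runs:
        pool.update(ranked_ids[:depth])
    return pool


if __name__ == "__main__":
    system_a = ["d3", "d1", "d7", "d2"]  # hypothetical runs
    system_b = ["d1", "d9", "d3", "d4"]
    print(sorted(build_pool([system_a, system_b], depth=2)))
```

Documents outside the pool are normally treated as not relevant, which is why a system that did not contribute to the pool may be scored slightly pessimistically.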

13 Translation Lexicon
Machine-readable dictionary
–Lemmatized Czech query words
–Looked each up in “PC Translator”
Bilingual term list
–Downloaded 800 term pairs from Ergane
Retained untranslatable terms
–Stripped diacritics to match proper names
–Optionally, made minor corrections (by hand), e.g., “afrika” to “africa”
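The lookup-with-fallback behaviour described on this slide might look roughly like the sketch below: each lemma is looked up in the lexicon, and a lemma with no entry is retained with its diacritics stripped and any hand correction applied. Lemmatization is assumed to have happened upstream. The toy lexicon, the correction table, and the function names are hypothetical and do not come from PC Translator or Ergane.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Remove combining marks, e.g. 'berlín' -> 'berlin'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def translate_query(lemmas, lexicon, corrections=None):
    """Word-by-word translation that keeps untranslatable terms.

    lemmas      -- lemmatized Czech query words
    lexicon     -- {czech_lemma: [english translations]}
    corrections -- optional hand-made fixes for retained terms
    """
    corrections = corrections or {}
    translated = []
    for lemma in lemmas:
        key = lemma.lower()
        if key in lexicon:
            translated.extend(lexicon[key])  # keep every listed translation
        else:
            fallback = strip_diacritics(key)  # helps proper names match
            translated.append(corrections.get(fallback, fallback))
    return translated


if __name__ == "__main__":
    toy_lexicon = {"architektura": ["architecture"], "v": ["in", "at"]}
    print(translate_query(["architektura", "v", "berlín"], toy_lexicon))
    print(translate_query(["afrika"], toy_lexicon, {"afrika": "africa"}))
```

Keeping every translation of an ambiguous word, as the example query on the next slide does for "v", trades precision for coverage.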

14 Example Query
Original Czech query (S)
–Architektura v Berlíně
Word-by-word translation into English
–architecture architecture
–at below beneath by embattled in inside into on per under upon upstairs v within at below beneath by embattled in inside into on per under upon upstairs v within
–berlin

15 Example Search Results
Creating a new architectural vocabulary for a democratic Berlin
UCLA merges architecture and arts into a new school
Best of Berlin for young travelers
Who owns the Nazi paper trail?
A commitment to change the world; No place like utopia: Modern Architecture and the Company we Kept …
On the record: Sanderling's dark take on Sibelius
Max Bill, 85; Controversial Swiss artist, sculptor and writer
The week ahead: Berlin; Farewell to allies
Roll over Beethoven; Jeff Berlin leaves the violin and classical …
Californians had right stuff for airlift; Europe: former pilots …

16 Precision-Recall Graph Average Precision = 0.477 Czech title query 1, LA Times Documents, CLEF 2000 Relevance Assessments

17 Average Precision
Czech title queries, LA Times Documents, CLEF 2000 Relevance Assessments
Mean Average Precision = 0.188
(chart annotation: 0.477)
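For readers unfamiliar with the measures on these two slides, the sketch below shows the usual uninterpolated definitions: average precision for one query is the mean of the precision values at each rank where a relevant document appears, divided by the total number of relevant documents, and mean average precision is the mean of that value over all queries. This is an assumption about the definition; the slides do not say which evaluation tool produced 0.477 and 0.188, and the document IDs below are made up.

```python
def average_precision(ranked_ids, relevant_ids):
    """Uninterpolated average precision for one query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    query_1 = (["a", "b", "c", "d"], {"a", "c"})  # hypothetical judgments
    query_2 = (["x", "a"], {"a", "b"})
    print(average_precision(*query_1))
    print(mean_average_precision([query_1, query_2]))
```

A single easy query can score far above the mean, which is consistent with one query reaching 0.477 while the mean over all queries is 0.188.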

18 Results

19 Czech seems to pose no unusual problems
–55% of monolingual with simple techniques
Suitable Czech/English resources exist
–Czech morphology
–Czech/English bilingual lexicon
Multiword expression handling would help
–Named entities, non-compositional phrases

20 Some Next Steps
Integrate Czech/English statistical MT
–Johns Hopkins (Summer 2002 Workshop)
Integrate with English and Czech ASR
–IBM and Univ of West Bohemia/Charles Univ
Integrate into an interactive retrieval system
–University of Maryland and Shoah Foundation

21 For More Information
Cross-language and speech retrieval
–http://www.clis.umd.edu/~dlrg/clir/
–http://www.clis.umd.edu/~dlrg/speech/
The MALACH project
–http://www.clsp.jhu.edu/research/malach/
NSF/EU Spoken Word Access Working Group
–http://www.dcs.shef.ac.uk/spandh/projects/swag/

