Mandarin-English Information (MEI): Johns Hopkins University Summer Workshop 2000. Presented at the TDT-3 Workshop, February 28, 2000. Helen Meng (The Chinese University of Hong Kong); Sanjeev Khudanpur (Johns Hopkins University); Douglas W. Oard (University of Maryland); Hsin-Min Wang (Academia Sinica, Taiwan)
Outline Background The MEI Project –Multi-scale Retrieval –Multi-scale Translation Using the TDT-3 Collection Schedule
Motivation Emerging speech retrieval applications –E.g., Increasing need for translingual audio search –1896 Internet-accessible radio & TV stations –529 of these (28%) are not in English source:
The Big Picture [Diagram; labels: Speech to Speech Translation, Translingual Audio Browsing, Translingual Audio Search, English Query, English Audio, Select, Examine, MEI]
Related Work TREC Spoken Document Retrieval –Close coupling of recognition and retrieval TREC Cross-Language Retrieval –Close coupling of translation and retrieval TDT-3 –Coupling recognition, translation and retrieval –Using baseline recognizer transcripts
The MEI Project Closely coupling recognition and translation –For the purpose of retrieval English text queries, Mandarin news audio Specific research issues: –Multi-scale retrieval –Multi-scale translation
Multi-scale Analysis of Mandarin [Figure: a syllable decomposed at three scales: Initial/Final, Preme/Core Final, Preme/Toneme; unit labels: /iang/, /ji/, /j/, /ang/, /i/, /a/, /ng/, /j/]
Multi-scale Retrieval Subword-scale –Syllable lattice matching [Chen, Wang & Lee, 2000] –Overlapping syllable n-grams [Meng et al., 1999] –Skipped syllable pairs [Chen, Wang & Lee, 2000] –Syllable confusion matrix [Meng et al., 1999] Word-scale –Structured queries [Pirkola, 1998] Multi-scale –Unified retrieval using a merged feature set –Scale-optimized retrieval with result-set merging
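The two subword indexing units named above, overlapping syllable n-grams and skipped syllable pairs, can be sketched roughly as follows. This is a minimal illustration, not the cited authors' implementation: the function names and the `skip` parameter are assumptions, and "skipped pair" is taken here to mean two syllables separated by a fixed number of intervening syllables.

```python
from typing import List, Tuple


def overlapping_ngrams(syllables: List[str], n: int = 2) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive syllables (empty if the input is shorter than n)."""
    return [tuple(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


def skipped_pairs(syllables: List[str], skip: int = 1) -> List[Tuple[str, str]]:
    """Return pairs of syllables separated by `skip` intervening syllables."""
    gap = skip + 1
    return [(syllables[i], syllables[i + gap]) for i in range(len(syllables) - gap)]


# Syllable sequence taken from one of the Kosovo renderings in these slides.
syllables = "ke1 sou3 wo4".split()
bigrams = overlapping_ngrams(syllables, 2)
pairs = skipped_pairs(syllables, 1)
```

Indexing both kinds of unit gives a matcher some tolerance to a single misrecognized syllable, since a skipped pair survives an error in the syllable between its two members.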
Why Multi-scale Retrieval? Word-based retrieval exploits lexical knowledge –Enhances precision Subword units achieve complete phonological coverage –Enhances recall Combination of evidence may beat either alone
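One simple way to combine word-scale and subword-scale evidence is to normalize each result set's scores and interpolate them. The slides do not specify a merging method, so the sketch below is an assumption throughout: min-max normalization, the linear weight `word_weight`, and the story identifiers and scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a result set with one distinct score maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def merge_result_sets(word_scores: Dict[str, float],
                      subword_scores: Dict[str, float],
                      word_weight: float = 0.5) -> List[Tuple[str, float]]:
    """Linearly interpolate normalized scores; a story missing from one set scores 0 there."""
    w = min_max_normalize(word_scores)
    s = min_max_normalize(subword_scores)
    merged = {
        doc: word_weight * w.get(doc, 0.0) + (1.0 - word_weight) * s.get(doc, 0.0)
        for doc in set(w) | set(s)
    }
    # Highest score first; ties are broken by story id so the ranking is deterministic.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores from a word-based run and a syllable-based run.
ranking = merge_result_sets(
    {"story_a": 2.0, "story_b": 1.0},
    {"story_b": 0.9, "story_c": 0.3},
)
```

A story retrieved only by the subword run still appears in the merged ranking, which is where the recall benefit would come from.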
Multi-scale Translation Word-scale –Dictionary-based [Levow & Oard, 2000] –Parallel corpora [Nie, 1999] –Comparable corpora [Fung, 1998] Subword-scale –Cross-language phonetic map [Knight & Graehl, 1997] /bei2 ai4 er3 lan2/ Kosovo (/ke1-sou3-wo4/, /ke1-sou3-fo2/, /ke1-sou3-fu1/, /ke1-sou3-fu2/)
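The Kosovo example shows one English name surfacing as several Mandarin syllable sequences. A hedged sketch of how such variants might be collapsed for matching is given below. It is a plain lookup table keyed on tone-stripped syllables, not the generative phonetic mapping model of Knight & Graehl; the function names and the decision to ignore tone are assumptions made for illustration.

```python
import re
from typing import Dict, List, Optional, Tuple


def strip_tones(syllables: str) -> Tuple[str, ...]:
    """'/ke1-sou3-wo4/' -> ('ke', 'sou', 'wo'): split on hyphens or spaces, drop tone digits."""
    parts = re.split(r"[-\s]+", syllables.strip("/ "))
    return tuple(re.sub(r"\d$", "", p) for p in parts if p)


# The four renderings of "Kosovo" listed on the slide.
KOSOVO_VARIANTS = ["/ke1-sou3-wo4/", "/ke1-sou3-fo2/", "/ke1-sou3-fu1/", "/ke1-sou3-fu2/"]


def build_variant_index(term: str, variants: List[str]) -> Dict[Tuple[str, ...], str]:
    """Map each tone-stripped syllable sequence to its English term."""
    return {strip_tones(v): term for v in variants}


index = build_variant_index("Kosovo", KOSOVO_VARIANTS)


def lookup(recognized: str) -> Optional[str]:
    """Return the English term for a recognized syllable string, or None if unknown."""
    return index.get(strip_tones(recognized))
```

Dropping tone merges the two /fu/ renderings into one key, so the four variants yield three index entries, and a recognizer output such as `ke1 sou3 fu4` still resolves to the same term.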
Using the TDT-3 Collection English queries formed from topic descriptions –2-4 words (simulated Web search) –Full topic description (simulated routing profile) Mandarin broadcast news audio (121 hours) –Story-boundary-known condition (4624 stories) –Baseline recognizer transcripts provide words
Schedule [Timeline; axis: Dec, Feb, Apr, Jun, Aug; labels: Six Weeks: Summer Workshop Planning Meeting; First MEI Team Planning Meeting; Second MEI Team Planning Meeting]
Things We Need Ideas –To sharpen our focus Connections –To build a community of interest Resources –To build on what others have done
Background: Chinese Many dialects (e.g., Mandarin and Cantonese) –Differences in phonetics, vocabulary, syntax… Syllable-based language –~400 base syllables, 4 lexical tones + light tone Syllable structure (CG)V(X) –(CG): onset, optional, consonant + medial glide –V: nuclear vowel –X: coda, glide / alveolar nasal / velar nasal –~21 initials, 39 finals
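The initial/final view of a syllable can be sketched for the tone-numbered Pinyin spellings used in these slides. This is an illustrative parser under stated assumptions: it works on Pinyin spelling rather than on phonemes, treats syllables spelled with y or w as having no initial, and accepts digit 5 for the light tone, which is a common convention but not something the slides specify.

```python
import re
from typing import NamedTuple, Optional

# The 21 Pinyin initials; two-letter initials come first so "zh" wins over "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]


class Syllable(NamedTuple):
    initial: str          # empty string when the syllable has no initial
    final: str            # everything after the initial, medial glide included
    tone: Optional[int]   # None when no tone digit is written


def parse_syllable(text: str) -> Syllable:
    """Split a syllable such as 'xing2' into initial, final and tone."""
    match = re.fullmatch(r"([a-z]+)([1-5])?", text.strip("/ ").lower())
    if match is None:
        raise ValueError(f"not a tone-numbered syllable: {text!r}")
    base, tone = match.group(1), match.group(2)
    initial = next((i for i in INITIALS if base.startswith(i)), "")
    return Syllable(initial, base[len(initial):], int(tone) if tone else None)


parsed = [parse_syllable(s) for s in "zhe4 xing2 ai4".split()]
```

Because the final keeps the medial glide, a spelling like `jiang` splits into `j` plus `iang`, which corresponds to the Initial/Final scale rather than the finer-grained ones.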
Background: Chinese (cont.) Characters (written) -> syllables (spoken) Degenerate mapping –/hang2/, /hang4/, /heng2/ or /xing2/ –/fu4 shu4/ (LDC’s CALLHOME lexicon) Tokenization / Segmentation –/zhe4 yi1 wan3 hui4 ru2 chang2 ju3 xing2/
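To make the segmentation problem concrete, here is a minimal sketch of greedy forward maximum matching over the syllable string from the slide. Maximum matching is a common baseline and is swapped in here for illustration; the slides do not say which segmenter the project uses. The lexicon entries are a toy assumption, not drawn from the CALLHOME lexicon.

```python
from typing import List, Set, Tuple

Word = Tuple[str, ...]


def max_match(syllables: List[str], lexicon: Set[Word], max_len: int = 4) -> List[Word]:
    """Greedy left-to-right segmentation: take the longest lexicon entry at each position."""
    words: List[Word] = []
    i = 0
    while i < len(syllables):
        for length in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = tuple(syllables[i:i + length])
            # A single syllable is always accepted, so the loop is guaranteed to advance.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Toy lexicon of two-syllable words (hypothetical entries for illustration).
TOY_LEXICON = {("wan3", "hui4"), ("ru2", "chang2"), ("ju3", "xing2")}

sentence = "zhe4 yi1 wan3 hui4 ru2 chang2 ju3 xing2".split()
segments = max_match(sentence, TOY_LEXICON)

# Adding one more entry changes the result, which is the ambiguity the slide points at.
alternative = max_match(sentence, TOY_LEXICON | {("yi1", "wan3")})
```

With the first lexicon the sentence splits as zhe4 | yi1 | wan3 hui4 | ru2 chang2 | ju3 xing2; with the extra entry the greedy pass commits to yi1 wan3 and leaves hui4 on its own, so the word boundaries depend on lexicon coverage as much as on the input.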