Issues in Pre- and Post-translation Document Expansion: Untranslatable Cognates and Missegmented Words
Gina-Anne Levow
University of Chicago
July 7, 2003
Roadmap
  Goals of expansion
    – Expansion points in CL-SDR
  Pre- and post-translation document expansion experiments
    – Task, query & document processing
    – Expansion methodology
  Results
  Discussion & Conclusions
Why Expansion?
  Recover terms that could have appeared
    – Compensate for difference in term choice
        Author concepts vs. searcher information need
    – Compensate for noisy processing
        ASR transcription errors
          – Misrecognitions, deletions, missegmentations
        Translation errors
          – Gaps, missegmentations
          – Context disambiguates
Expansion Opportunities
  Query:
    – (Ballesteros & Croft ’96; McNamee & Mayfield 2002)
    – Before, after translation; both
    – Different enhancements to precision/recall
    – Pre-translation key: something to translate
        European languages
  Document:
    – Before, after translation; both
    – Developed for monolingual SDR (Singhal 1999)
    – CLIR (+SDR) (Levow & Oard 2000)
        Post-translation promising
Experimental Configuration: Basic Task
  Variant of Topic Detection and Tracking (TDT)
    – English queries to Mandarin documents
        Query-by-example
          – English newswire or broadcast news stories
        Mandarin audio broadcast news documents
          – Automatically transcribed by Dragon ASR system
    – Modifications:
        Retrospective retrieval
        Evaluation metric: Mean Average Precision
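Note: the slide names Mean Average Precision as the evaluation metric but does not say which variant was computed. The following is a minimal sketch assuming the common uninterpolated form, where each query's average precision is taken over all of its relevant documents. The function names and the input layout are illustrative assumptions, not part of the original system.

```python
def average_precision(ranked_ids, relevant_ids):
    """Uninterpolated average precision for one query (assumed variant).

    ranked_ids: document ids in ranked order, best first.
    relevant_ids: ids judged relevant for this query.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank of each relevant document.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of per-query average precision.

    runs: list of (ranked_ids, relevant_ids) pairs, one per query.
    """
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a ranking d1, d2, d3, d4 with d1 and d3 relevant, the per-query value is the mean of 1/1 and 2/3.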
Experimental Configuration: Query and Document Processing
  Query:
    – Select top 180 positively correlated terms in 4 exemplars
        Based on χ² test
        996 prior documents assumed not relevant
  Document:
    – Dictionary-based word-for-word translation
        Segmentation: NMSU ch_seg
        Translation resource:
          – Merged bilingual term list: CETA & LDC term list
        Translation ranking:
          – Target language unigram frequency: single words, multi-word
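Note: the query-side term selection above can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: it assumes document-level presence counts, a 2x2 contingency table of exemplar versus background documents, and the standard Pearson χ² statistic. The function names are hypothetical. Only `top_k=180` and the split into exemplars and background documents come from the slide.

```python
def chi_square(a, b, c, d):
    """Pearson chi-square for a 2x2 table.

    a: exemplars containing the term   b: exemplars lacking it
    c: background docs containing it   d: background docs lacking it
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_query_terms(exemplars, background, top_k=180):
    """Return up to top_k terms positively correlated with the exemplars.

    exemplars, background: lists of token lists. The background
    documents are assumed not relevant, as on the slide.
    """
    ex_sets = [set(doc) for doc in exemplars]
    bg_sets = [set(doc) for doc in background]
    vocab = set().union(*ex_sets) if ex_sets else set()
    scored = []
    for term in vocab:
        a = sum(term in s for s in ex_sets)
        c = sum(term in s for s in bg_sets)
        b = len(ex_sets) - a
        d = len(bg_sets) - c
        # Keep only terms that are more frequent in the exemplars
        # than in the background (positive correlation).
        if a * d > b * c:
            scored.append((chi_square(a, b, c, d), term))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:top_k]]
```

In the configuration on the slide, `exemplars` would hold the 4 example stories and `background` the 996 prior documents.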
Experimental Configuration: Document Expansion
Document Expansion: Details
  Side collections:
    – Mandarin: TDT-2 Xinhua, Zaobao newswire
    – English: TDT-2 New York Times, AP news
  Expansion term selection
    – Top 5 documents
    – Sort candidate terms by idf
    – Exclude terms in only one document
    – Add one term instance per document
    – Add until document doubled in length
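Note: the term selection steps above can be sketched as follows. This is a minimal illustration, not the original implementation. It assumes the side-collection retrieval has already produced a ranked list of documents, reads "one term instance per document" as one added instance for each top-ranked document that contains the term, and reads "exclude terms in only one document" as applying to the top-ranked set. The function names and the idf formula (log of collection size over document frequency) are assumptions.

```python
import math


def idf_table(collection):
    """Map each term to log(N / df) over a list of token lists."""
    n = len(collection)
    df = {}
    for doc in collection:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n / count) for term, count in df.items()}


def expand_document(doc, ranked_side_docs, idf, max_top=5):
    """Append expansion terms to doc until it has doubled in length.

    doc: token list for the document being expanded.
    ranked_side_docs: side-collection token lists, best match first.
    idf: term -> idf weight, e.g. from idf_table().
    """
    top_docs = ranked_side_docs[:max_top]
    # Number of top-ranked documents containing each candidate term.
    support = {}
    for side_doc in top_docs:
        for term in set(side_doc):
            support[term] = support.get(term, 0) + 1
    # Exclude terms that occur in only one of the top documents.
    candidates = [t for t, n in support.items() if n >= 2]
    # Highest idf first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda t: (-idf.get(t, 0.0), t))
    expanded = list(doc)
    limit = 2 * len(doc)
    for term in candidates:
        # One instance per top-ranked document containing the term.
        for _ in range(support[term]):
            if len(expanded) >= limit:
                return expanded
            expanded.append(term)
    return expanded
```

The same routine applies to either side collection: run it on the Mandarin transcript against Mandarin newswire for pre-translation expansion, or on the translated document against English newswire for post-translation expansion.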
Results
  Post-translation significantly outperforms pre-translation expansion
  [Chart: expansion conditions None, Pre, Post, Pre+Post]
Discussion: Post-translation Effectiveness
  Post-translation document expansion significantly improves retrieval effectiveness
    – Little improvement from pre-translation expansion
        Either alone or in conjunction
  Expansion introduces key enriching terms
    – Named entities, alternate forms
        E.g. Tariq Aziz, Saddam, Yeltsin, etc.
    – Available in English (post-translation) collection
Discussion: Pre-translation Limitations
  Expansion terms do not exist
    – Segmentation & transcription rely on term lists
        Named entities frequently absent
        Cannot extract terms from Mandarin newswire
  Expansion terms cannot be translated
    – Key terms (e.g. named entities) absent from bilingual term lists
        All examples on previous page absent
Discussion: Contrasts
  Contradict prior query expansion results
    – Re: primacy of pre-translation expansion
  Explanation:
    – Prior languages: mostly European
        Common writing system, white-space delimited
        Pre-translation expansion produces translatable terms + (possibly) untranslatable cognates
          – Cognates still match, even without translation
    – Current experiment: English-Mandarin
        Untranslatable cognates useless
          – Different orthography
        Terms not identified: missegmentation
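Note: the cognate argument above can be illustrated with a toy example. Everything in it is made up for illustration: the two tiny term lists, the sample documents, and the function names are assumptions, and passing untranslatable terms through unchanged is one simple way to model an untranslated cognate. It shows that a term missing from the term list still matches a document in a shared writing system but not a Mandarin one.

```python
def translate_word_for_word(terms, term_list):
    """Replace each term by its term-list entry.

    Terms missing from the list are passed through unchanged,
    which models an untranslatable cognate.
    """
    return [term_list.get(term, term) for term in terms]


def matching_terms(query_terms, doc_terms):
    """Terms shared by the translated query and the document."""
    return sorted(set(query_terms) & set(doc_terms))


# Toy English query after pre-translation expansion added a name.
expanded_query = ["president", "yeltsin"]

# Hypothetical term lists: the named entity is absent from both.
en_to_es = {"president": "presidente"}
en_to_zh = {"president": "总统"}

# Hypothetical documents, already tokenized or segmented.
spanish_doc = ["el", "presidente", "yeltsin", "habló"]
mandarin_doc = ["总统", "叶利钦", "讲话"]

spanish_matches = matching_terms(
    translate_word_for_word(expanded_query, en_to_es), spanish_doc
)
mandarin_matches = matching_terms(
    translate_word_for_word(expanded_query, en_to_zh), mandarin_doc
)
```

Here `spanish_matches` includes the untranslated name, while `mandarin_matches` contains only the dictionary-translated term, because the name is written in a different orthography.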
Conclusion
  Document expansion improves effectiveness
    – For CL-SDR case, recovers terms lost by missegmentation, mistranscription, or mistranslation; supports different terms
  Post-translation expansion most effective
    – Translated terms provide context for retrieval
        Correct translations/transcriptions coherent; others noise
    – Enriching terms often absent from term lists
        Segmentation, transcription, translation all rely on lists
    – Expansion in indexing language bypasses barriers
        Crucial in languages with segmentation issues and different forms