Translingual Topic Tracking with PRISE Gina-Anne Levow and Douglas W. Oard University of Maryland February 28, 2000.


1 Translingual Topic Tracking with PRISE Gina-Anne Levow and Douglas W. Oard University of Maryland February 28, 2000

2 Roadmap
The signal to noise perspective
Our topic tracking system
Boosting signal
Reducing noise
Future directions

3 Translingual Tracking Challenges
Segmentation of text adds noise
  – Unknown words
Transcription of speech adds noise
  – Unknown words
  – Easily confused words (e.g., homophones)
Translation adds noise
  – Vocabulary mismatch with ASR / segmentation
  – Incorrect translation selection

4 Improving the Signal to Noise Ratio
Translation coverage
  – Enrich the term list using large dictionaries
Translation selection
  – Statistical evidence from comparable corpora
Enriching indexing vocabulary
  – Add related terms from comparable corpora
Score normalization
  – Learn source dependence from dry-run collection

5 Preview
Focusing on noise alone is not enough
  – Signal boosting is a big win
Baseline: Systran
  – Goal: choose the best single translation
Two signal-boosting strategies beat Systran
  – Choose the best two translations
  – Add related terms for indexing (found in related documents)

6 Improvements Since TDT-2
Weight selection
  – PRISE “bm25idf”
Query representation
  – Vector of 180 most selective terms by χ² test
Two-pass normalization
  – Source-specific, 5 source classes: NYT, APW, Eng. Speech, Man. Text, Man. Speech
  – Topic-specific: average of example story scores
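The χ² query-term selection on this slide can be illustrated with a minimal Python sketch. It assumes a 2x2 contingency table per term (on-topic vs. off-topic stories, term present vs. absent) and keeps only terms that are positively associated with the topic. The function names, the document-frequency counting, and the tie-breaking rule are illustrative assumptions; the slides do not say how the test was set up inside the system.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 table.

    a: on-topic stories containing the term
    b: off-topic stories containing the term
    c: on-topic stories lacking the term
    d: off-topic stories lacking the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_query_terms(on_topic, off_topic, k=180):
    """Return the k terms most strongly associated with the on-topic stories.

    Each story is a list of tokens. k=180 follows the slide; everything
    else here is a hypothetical setup for illustration.
    """
    on_df = Counter(t for story in on_topic for t in set(story))
    off_df = Counter(t for story in off_topic for t in set(story))
    scores = {}
    for term, a in on_df.items():
        b = off_df.get(term, 0)
        c = len(on_topic) - a
        d = len(off_topic) - b
        # Keep only terms that occur relatively more often on-topic.
        if a * d > b * c:
            scores[term] = chi_square(a, b, c, d)
    # Highest score first; alphabetical order breaks ties deterministically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

In use, `on_topic` would hold the example stories for a topic and `off_topic` a background sample; the returned terms would then form the query vector.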

7 [Figure: two panels, Mandarin (All Sources) and English (All Sources), each comparing source-independent and source-dependent results]
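One way to picture the two-pass, source-dependent normalization is the Python sketch below. The slides name the two passes but not the arithmetic, so the formula is an assumption: pass 1 divides a raw score by the mean dry-run score for its source class, and pass 2 divides by the average source-normalized score of the topic's example stories. The function names and source labels are hypothetical.

```python
from statistics import mean


def source_factors(dry_run_scores):
    """Learn one scaling factor per source class from dry-run scores.

    dry_run_scores maps a source class (e.g. "NYT") to a list of raw scores.
    Using the mean as the factor is an assumption, not the authors' formula.
    """
    return {source: mean(scores) for source, scores in dry_run_scores.items()}


def normalize(raw_score, source, factors, example_scores):
    """Apply source-specific, then topic-specific normalization.

    example_scores is a list of (raw_score, source) pairs for the topic's
    example stories.
    """
    # Pass 1: remove the source-dependent score scale.
    source_normalized = raw_score / factors[source]
    # Pass 2: scale by the average (source-normalized) example story score.
    topic_average = mean(score / factors[src] for score, src in example_scores)
    return source_normalized / topic_average
```

With factors learned this way, a story from a low-scoring source class (such as recognized speech) and a story from a high-scoring one can be compared against a single decision threshold.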

8 Translingual Approaches
Indexing strategies (boosting signal)
  – Post-translation document expansion
  – n-best translation
Translation tweaks (reducing noise)
  – Enriched bilingual term list
  – Corpus-based translation selection
  – Pre-translation Mandarin stopword removal

9 Translingual Runs (* = official run scored by NIST)

10 Document Expansion
[Flow diagram. Recoverable labels: Mandarin BNNWT; ASR Transcript; NMSU Segmenter; Word-to-Word Translation; Single Document; Query Vector; PRISE; Comp. English Corpus; Top 5; Term Selection; English BNNWT; Documents to Index; PRISE; Results]

11 Mandarin Newswire Text

12 Mandarin Broadcast News

13 Why Document Expansion Works
Story-length objects provide useful context
Ranked retrieval finds signal amid the noise
Selective terms discriminate among documents
  – Enrich index with high IDF terms from top documents
Similar strategies work well in other applications
  – TREC-7 SDR [Singhal et al., 1998]
  – CLIR query translation [Ballesteros & Croft, 1997]
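Post-translation document expansion, as described on the slides, uses each translated story as a query against a comparable English corpus and adds high-IDF terms from the top-ranked documents to the story before indexing. The Python sketch below illustrates the idea under stated assumptions: the ranking function is a simple sum of IDF values over shared terms (a stand-in, not PRISE), `top_docs=5` follows the "Top 5" label in the diagram, and `new_terms` and all function names are hypothetical.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Inverse document frequency for every term in a tokenized corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def expand_document(doc_terms, corpus, top_docs=5, new_terms=10):
    """Append high-IDF terms from the best-matching comparable documents."""
    idf = idf_table(corpus)
    query = set(doc_terms)
    scored = []
    for index, candidate in enumerate(corpus):
        # Stand-in ranking: summed IDF of the terms shared with the story.
        score = sum(idf[t] for t in query & set(candidate))
        if score > 0:
            scored.append((-score, index))
    scored.sort()  # best score first, corpus order breaks ties
    pool = {}
    for _, index in scored[:top_docs]:
        for term in set(corpus[index]) - query:
            pool[term] = idf[term]
    # Prefer the most selective (highest-IDF) terms; alphabetical on ties.
    added = sorted(pool, key=lambda t: (-pool[t], t))[:new_terms]
    return list(doc_terms) + added
```

The expanded term list, not the raw translation, is what would be indexed, so a story whose translation missed a key word can still match a query that contains it.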

14 n-best Translation
We generally used 1-best translation
  – Highest unigram frequency in comparable corpus
Tried 2-best: two highest-ranked translations
  – Duplicating unique translations where necessary
Should reduce miss rate
  – But at what cost in false alarms?
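A minimal Python sketch of n-best word-to-word translation follows. It ranks each word's candidate translations by unigram frequency in a comparable corpus and keeps the top n. "Duplicating unique translations" is read here as repeating a word's only translation so that every source word contributes the same number of target terms; that reading, the decision to drop untranslatable words, and all names and sample data are assumptions.

```python
def n_best_translate(words, term_list, unigram_freq, n=2):
    """Translate each word to its n most frequent candidate translations.

    term_list maps a source word to a list of candidate translations.
    unigram_freq maps a target-language word to its corpus frequency.
    """
    output = []
    for word in words:
        candidates = term_list.get(word, [])
        if not candidates:
            continue  # assumption: words with no translation are dropped
        ranked = sorted(candidates, key=lambda t: (-unigram_freq.get(t, 0), t))
        best = ranked[:n]
        # Pad by repeating the last choice so every word yields n terms.
        while len(best) < n:
            best.append(best[-1])
        output.extend(best)
    return output


# Hypothetical data; pinyin strings stand in for segmented Mandarin words.
term_list = {"yinhang": ["banker", "bank"], "dizhen": ["earthquake"]}
unigram_freq = {"bank": 120, "banker": 15, "earthquake": 40}
print(n_best_translate(["yinhang", "dizhen"], term_list, unigram_freq))
```

Setting `n=1` reproduces the 1-best behavior, which makes it easy to compare the two strategies on the same input.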

15 Mandarin Newswire Text

16 Mandarin Broadcast News

17 Comparison With Systran
Used baseline translations provided by LDC
  – Untranslated words not used
  – No document expansion
Systran produces 1-best translations
  – Natural comparison is with our 2-best run

18 Mandarin Newswire Text

19 Mandarin Broadcast News

20 Bilingual Term List Enrichment
Two sources of candidate translations
  – LDC Chinese-English term list (version 2)
  – CETA (Optilex) dictionary: >250K entries, hand-built from >250 sources
Merging strategy
  – Used only general-purpose sources in CETA
  – Filtered out definitions
  – Removed parenthetical clauses
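The merging steps can be sketched in Python as follows. The slides do not say how definitions were detected, so the word-count threshold used here is purely an assumption, as are the function names and sample entries. Restricting CETA to its general-purpose sources is not modeled.

```python
import re

# Matches a parenthetical clause together with any whitespace before it.
PARENTHETICAL = re.compile(r"\s*\([^)]*\)")


def clean_translation(text):
    """Remove parenthetical clauses and collapse leftover whitespace."""
    without_parens = PARENTHETICAL.sub("", text)
    return re.sub(r"\s+", " ", without_parens).strip()


def looks_like_definition(text, max_words=4):
    """Heuristic (an assumption): long entries are definitions, not terms."""
    return len(text.split()) > max_words


def merge_term_lists(ldc, ceta, max_words=4):
    """Merge two {source word: [translations]} lists, LDC entries first."""
    merged = {}
    for term_list in (ldc, ceta):
        for word, translations in term_list.items():
            bucket = merged.setdefault(word, [])
            for translation in translations:
                cleaned = clean_translation(translation)
                if not cleaned or looks_like_definition(cleaned, max_words):
                    continue
                if cleaned not in bucket:
                    bucket.append(cleaned)
    return merged
```

Cleaning before deduplication matters: an entry such as "bank (financial)" collapses to "bank" and is then recognized as a duplicate of the existing term.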

21 Term List Statistics

22 [Figure panels: Broadcast News; Newswire Text]

23 Translation Preference
Unigram statistics guided translation selection
  – Minimize effect of rare translations, misspellings, …
Based on dry run stories and rolling update
  – Backoff to balanced corpus for unknown words (Brown corpus: variety of genres)
Compared with use of balanced corpus alone
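The translation-preference scheme can be illustrated with a small Python sketch: unigram counts accumulate from stories as they arrive (the rolling update), and a candidate that has never been seen there falls back to counts from a balanced corpus. Comparing relative frequencies across the two corpora is a simplifying assumption, and the class name, method names, and sample counts are hypothetical.

```python
from collections import Counter


class TranslationPreference:
    """Pick the candidate translation with the strongest unigram evidence."""

    def __init__(self, balanced_counts):
        # Counts from a balanced corpus (the slides use the Brown corpus).
        self.balanced = Counter(balanced_counts)
        # Counts from comparable-corpus stories seen so far.
        self.rolling = Counter()

    def update(self, story_tokens):
        """Rolling update: fold a newly seen English story into the counts."""
        self.rolling.update(story_tokens)

    def score(self, word):
        """Relative frequency, backing off to the balanced corpus."""
        rolling_total = sum(self.rolling.values())
        if rolling_total and word in self.rolling:
            return self.rolling[word] / rolling_total
        balanced_total = sum(self.balanced.values())
        return self.balanced[word] / balanced_total if balanced_total else 0.0

    def choose(self, candidates):
        """Return the highest-scoring candidate (first one wins on ties)."""
        return max(candidates, key=self.score)
```

Before any stories are seen the chooser behaves like the balanced-corpus-only condition mentioned on the slide, which gives a direct way to compare the two settings.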

24 Mandarin Newswire Text

25 Pre-Translation Stopword Removal
Common words don’t help retrieval much
  – But mistranslations might hurt
We built a Mandarin stopword list
  – Processed dictionary to identify function words
  – Added the top 300 words in LDC frequency list
  – Filtered by two speakers of Mandarin
Suppressed translation of stopwords
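The stopword steps can be sketched in Python as below. The manual review by two Mandarin speakers is modeled as a set of rejected candidates, which is an assumption about how that filtering was applied; the function names and the pinyin sample tokens are hypothetical stand-ins for segmented Mandarin words.

```python
def build_stopword_list(function_words, frequency_ranked, top_n=300, rejected=()):
    """Combine dictionary function words with the most frequent words.

    frequency_ranked is a list of words ordered from most to least frequent.
    rejected holds candidates that reviewers decided to keep as content words.
    """
    candidates = set(function_words) | set(frequency_ranked[:top_n])
    return candidates - set(rejected)


def translate_without_stopwords(tokens, stopwords, term_list):
    """Translate only non-stopword tokens that have a term-list entry."""
    translated = []
    for token in tokens:
        if token in stopwords:
            continue  # suppress translation of stopwords
        if token in term_list:
            translated.append(term_list[token])
    return translated
```

Filtering before translation, rather than after, is the point of the technique: a function word never reaches the term list, so it cannot be mistranslated into a content word.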

26 Mandarin Newswire Text

27 Summary
3 techniques produced improvements:
  – Source-dependent normalization
  – Post-translation document expansion
  – n-best translation
3 techniques had little effect:
  – Bilingual term list enrichment
  – Comparable-corpus-based translation preference
  – Pre-translation stopword removal

28 Future Directions
Statistical significance
  – Can this be added to the scoring software?
Pre-translation document expansion
  – An effective approach in CLIR query translation
Further experiments with n-best translation
  – Probably using a weighted strategy
Structured translation [Pirkola, 1998]
  – Some concern about efficiency, though

29 Where is the Perfect TDT System? Run TDT-4 In Nova Scotia! Maryland Penn BBN

