Translingual Topic Tracking with PRISE Gina-Anne Levow and Douglas W. Oard University of Maryland February 28, 2000.

Roadmap
The signal to noise perspective
Our topic tracking system
Boosting signal
Reducing noise
Future directions

Translingual Tracking Challenges
Segmentation of text adds noise
  – Unknown words
Transcription of speech adds noise
  – Unknown words
  – Easily confused words (e.g., homophones)
Translation adds noise
  – Vocabulary mismatch with ASR / segmentation
  – Incorrect translation selection

Improving the Signal to Noise Ratio
Translation coverage
  – Enrich the term list using large dictionaries
Translation selection
  – Statistical evidence from comparable corpora
Enriching indexing vocabulary
  – Add related terms from comparable corpora
Score normalization
  – Learn source dependence from dry-run collection

Preview
Focusing on noise alone is not enough
  – Signal boosting is a big win
Baseline: Systran
  – Goal: choose the best single translation
Two signal-boosting strategies beat Systran
  – Choose the best two translations
  – Add related terms for indexing (found in related documents)

Improvements Since TDT-2
Weight selection
  – PRISE “bm25idf”
Query representation
  – Vector of 180 most selective terms by χ² test
Two-pass normalization
  – Source-specific, 5 source classes: NYT, APW, Eng. Speech, Man. Text, Man. Speech
  – Topic-specific: average of example story scores
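The χ² term selection mentioned above could look roughly like the following. This is a minimal sketch, not the PRISE implementation: it assumes each story is a list of tokens, uses the standard 2×2 χ² statistic over on-topic and off-topic stories, and the function names (`chi_square`, `select_query_terms`) as well as the filter that keeps only terms relatively more common in on-topic stories are illustrative assumptions.

```python
from collections import Counter

def chi_square(a, b, c, d):
    """Standard 2x2 chi-square statistic.

    a: on-topic stories containing the term
    b: off-topic stories containing the term
    c: on-topic stories lacking the term
    d: off-topic stories lacking the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def select_query_terms(on_topic, off_topic, k=180):
    """Return up to k terms ranked by chi-square (k=180 follows the slide)."""
    on_df = Counter(t for story in on_topic for t in set(story))
    off_df = Counter(t for story in off_topic for t in set(story))
    n_on, n_off = len(on_topic), len(off_topic)
    scored = []
    for term, a in on_df.items():
        b = off_df.get(term, 0)
        # Assumption: keep only terms relatively more frequent on-topic.
        if a * n_off <= b * n_on:
            continue
        scored.append((chi_square(a, b, n_on - a, n_off - b), term))
    # Highest statistic first; alphabetical order breaks ties.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]

# Toy data (invented for illustration).
on_topic = [["earthquake", "taiwan", "the"], ["earthquake", "rescue", "the"]]
off_topic = [["election", "the"], ["market", "the"], ["the", "taiwan"]]
print(select_query_terms(on_topic, off_topic, k=2))
```

On the toy data, a word that appears in every story ("the") is discarded, while a word found only in on-topic stories ranks first.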

[Figure: two panels, Mandarin (All Sources) and English (All Sources), each comparing source-independent and source-dependent normalization]

Translingual Approaches
Indexing strategies (boosting signal)
  – Post-translation document expansion
  – n-best translation
Translation tweaks (reducing noise)
  – Enriched bilingual term list
  – Corpus-based translation selection
  – Pre-translation Mandarin stopword removal

Translingual Runs (* = official run scored by NIST)

Document Expansion
[Diagram labels: Mandarin BN/NWT; ASR Transcript; NMSU Segmenter; Word-to-Word Translation; Single Document; Query Vector; PRISE; Comp. English Corpus; Results; Top 5; Term Selection; Documents to Index; PRISE; English BN/NWT]

Mandarin Newswire Text

Mandarin Broadcast News

Why Document Expansion Works
Story-length objects provide useful context
Ranked retrieval finds signal amid the noise
Selective terms discriminate among documents
  – Enrich index with high IDF terms from top documents
Similar strategies work well in other applications
  – TREC-7 SDR [Singhal et al., 1998]
  – CLIR query translation [Ballesteros & Croft, 1997]
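The expansion step can be sketched as follows. This is a simplified illustration, not the PRISE pipeline: the translated story is used as a query against a comparable English corpus, the top-ranked stories are collected (the diagram shows the top 5), and their highest-IDF terms are appended before indexing. The ranking function (summed IDF of shared terms), the number of added terms, and all names are assumptions.

```python
import math
from collections import Counter

def idf_table(corpus):
    """Inverse document frequency for every term in a tokenized corpus."""
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))
    return {t: math.log(n / count) for t, count in df.items()}

def expand_document(doc, corpus, top_docs=5, top_terms=3):
    """Append high-IDF terms from the best-matching comparable stories."""
    idf = idf_table(corpus)
    query = set(doc)
    scored = []
    for i, other in enumerate(corpus):
        # Assumed ranking: summed IDF of the terms shared with the query.
        score = sum(idf[t] for t in query & set(other))
        if score > 0:
            scored.append((score, i))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    candidates = set()
    for _, i in scored[:top_docs]:
        candidates.update(corpus[i])
    candidates -= query  # only add terms the story does not already have
    ranked = sorted(candidates, key=lambda t: (-idf[t], t))
    return list(doc) + ranked[:top_terms]

# Toy data (invented): a noisy word-for-word translation and a small corpus.
story = ["taiwan", "quake", "rescue"]
comparable = [
    ["taiwan", "earthquake", "rescue", "survivors"],
    ["taiwan", "election", "vote"],
    ["stock", "market", "rally"],
    ["earthquake", "rescue", "aftershock", "taiwan"],
]
expanded = expand_document(story, comparable, top_docs=2, top_terms=3)
print(expanded)
```

Here the mistranslated "quake" matches nothing, yet the surviving terms still retrieve the two earthquake stories, which contribute "earthquake" and related vocabulary to the index entry.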

n-best Translation
We generally used 1-best translation
  – Highest unigram frequency in comparable corpus
Tried 2-best: two highest-ranked translations
  – Duplicating unique translations where necessary
Should reduce miss rate
  – But at what cost in false alarms?
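A rough sketch of n-best selection follows. It assumes a bilingual term list mapping each source word to candidate English translations and a table of English unigram counts from a comparable corpus. Reading "duplicating unique translations" as repeating a sole translation so that every source word contributes the same number of tokens is an interpretation, and the keys `w_bank` and `w_quake` are hypothetical placeholders for Mandarin words.

```python
def n_best_translate(words, term_list, unigram_freq, n=2):
    """Replace each source word by its n most frequent translations."""
    output = []
    for word in words:
        candidates = sorted(
            set(term_list.get(word, [])),
            key=lambda e: (-unigram_freq.get(e, 0), e),
        )
        if not candidates:
            continue  # assumption: words without a translation are dropped
        chosen = candidates[:n]
        # Repeat translations cyclically so each word yields exactly n tokens.
        output.extend(chosen[i % len(chosen)] for i in range(n))
    return output

# Toy term list and counts (invented for illustration).
term_list = {
    "w_bank": ["bank", "banker", "embankment"],
    "w_quake": ["earthquake"],
}
unigram_freq = {"bank": 120, "banker": 15, "embankment": 3, "earthquake": 40}

print(n_best_translate(["w_bank", "w_quake"], term_list, unigram_freq, n=1))
print(n_best_translate(["w_bank", "w_quake"], term_list, unigram_freq, n=2))
```

With n=2 the single-translation word is emitted twice, so it carries the same weight in the index as a word with two distinct translations.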

Mandarin Newswire Text

Mandarin Broadcast News

Comparison With Systran
Used baseline translations provided by LDC
  – Untranslated words not used
  – No document expansion
Systran produces 1-best translations
  – Natural comparison is with our 2-best run

Mandarin Newswire Text

Mandarin Broadcast News

Bilingual Term List Enrichment
Two sources of candidate translations
  – LDC Chinese-English term list (version 2)
  – CETA (Optilex) dictionary: >250K entries, hand-built from >250 sources
Merging strategy
  – Used only general-purpose sources in CETA
  – Filtered out definitions
  – Removed parenthetical clauses
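The merging strategy might be approximated as below. The slides do not say how definitions were detected, so the word-count threshold (`max_words`) is purely an assumed heuristic; the data layout (a dict from source word to a list of English glosses) and the function names are likewise hypothetical.

```python
import re

# Matches one non-nested parenthetical clause plus any whitespace before it.
PAREN = re.compile(r"\s*\([^()]*\)")

def clean_gloss(gloss):
    """Remove parenthetical clauses and normalize whitespace."""
    return " ".join(PAREN.sub(" ", gloss).split())

def merge_term_lists(*term_lists, max_words=3):
    """Merge term lists, dropping glosses that look like definitions."""
    merged = {}
    for term_list in term_lists:
        for source, glosses in term_list.items():
            for gloss in glosses:
                cleaned = clean_gloss(gloss)
                # Assumed heuristic: long glosses are definitions, not terms.
                if not cleaned or len(cleaned.split()) > max_words:
                    continue
                if cleaned not in merged.get(source, []):
                    merged.setdefault(source, []).append(cleaned)
    return merged

# Toy entries (invented); w1 and w2 stand in for Mandarin headwords.
base_list = {"w1": ["bank"]}
large_dictionary = {
    "w1": [
        "bank (financial institution)",
        "a place where money is kept and lent",
        "banking house",
    ],
    "w2": ["(geology) earthquake"],
}
print(merge_term_lists(base_list, large_dictionary))
```

Duplicates that appear only after cleaning, such as "bank (financial institution)" against an existing "bank", collapse into a single entry.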

Term List Statistics

[Figure panels: Broadcast News; Newswire Text]

Translation Preference
Unigram statistics guided translation selection
  – Minimize effect of rare translations, misspellings, …
Based on dry run stories and rolling update
  – Backoff to balanced corpus for unknown words (Brown corpus: variety of genres)
Compared with use of balanced corpus alone
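One way to picture the backoff is sketched here. It assumes two unigram count tables, one built from in-domain news stories and updated as new stories arrive, and one from a balanced corpus, and it compares candidates by relative frequency. Mixing relative frequencies from the two tables is a simplification, and none of the names come from the original system.

```python
from collections import Counter

def relative_frequency(word, counts):
    """Share of all tokens in `counts` accounted for by `word`."""
    total = sum(counts.values())
    return counts.get(word, 0) / total if total else 0.0

def preference_score(word, in_domain, balanced):
    """Use in-domain statistics; back off to the balanced corpus if unseen."""
    if in_domain.get(word, 0) > 0:
        return relative_frequency(word, in_domain)
    return relative_frequency(word, balanced)

def preferred_translation(candidates, in_domain, balanced):
    """Pick the candidate with the highest score (alphabetical tie-break)."""
    if not candidates:
        return None
    return max(
        sorted(candidates),
        key=lambda w: preference_score(w, in_domain, balanced),
    )

# Toy counts (invented for illustration).
news = Counter({"president": 50, "chairman": 5})
balanced = Counter({"president": 80, "chairman": 30, "chairperson": 2})

first = preferred_translation(["chairperson", "chairman", "president"], news, balanced)
backoff = preferred_translation(["chairperson", "presider"], news, balanced)

# Rolling update: counts from newly processed stories change later choices.
news.update(["chairman"] * 100)
later = preferred_translation(["chairman", "president"], news, balanced)
print(first, backoff, later)
```

Rare candidates and misspellings receive low or zero scores in both tables, which is how frequency-based selection limits their effect.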

Mandarin Newswire Text

Pre-Translation Stopword Removal
Common words don’t help retrieval much
  – But mistranslations might hurt
We built a Mandarin stopword list
  – Processed dictionary to identify function words
  – Added the top 300 words in LDC frequency list
  – Filtered by two speakers of Mandarin
Suppressed translation of stopwords
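The stopword steps above can be sketched as follows. The manual review by Mandarin speakers is modeled as a `rejected` set of words to keep out of the stopword list, which is an assumption about how that filtering worked; the tiny word lists are invented, and `top_k` is set to 3 only to keep the example small (the slide uses 300).

```python
def build_stopword_list(function_words, frequency_ranked, top_k=300, rejected=()):
    """Union of function words and the top_k frequent words, minus rejections."""
    candidates = set(function_words) | set(frequency_ranked[:top_k])
    return candidates - set(rejected)

def translate_without_stopwords(words, term_list, stopwords):
    """1-best translation that skips stopwords before lookup."""
    output = []
    for word in words:
        if word in stopwords or word not in term_list:
            continue  # stopwords (and unknown words) produce no output
        output.append(term_list[word][0])
    return output

# Toy inputs (invented for illustration).
function_words = {"的", "了"}
frequency_ranked = ["的", "是", "中国", "在"]  # most frequent first
stopwords = build_stopword_list(
    function_words, frequency_ranked, top_k=3, rejected={"中国"}
)

term_list = {"中国": ["China"], "的": ["of"], "地震": ["earthquake"]}
print(translate_without_stopwords(["中国", "的", "地震"], term_list, stopwords))
```

A frequent content word such as 中国 survives because the review step rejects it as a stopword, while 的 is dropped before it can be translated.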

Mandarin Newswire Text

Summary
3 techniques produced improvements:
  – Source-dependent normalization
  – Post-translation document expansion
  – n-best translation
3 techniques had little effect:
  – Bilingual term list enrichment
  – Comparable-corpus-based translation preference
  – Pre-translation stopword removal

Future Directions
Statistical significance
  – Can this be added to the scoring software?
Pre-translation document expansion
  – An effective approach in CLIR query translation
Further experiments with n-best translation
  – Probably using a weighted strategy
Structured translation [Pirkola, 1998]
  – Some concern about efficiency, though

Where is the Perfect TDT System?
Run TDT-4 in Nova Scotia!
[Figure labels: Maryland, Penn, BBN]