December 13, 2008, FIRE
Not So Surprising Anymore: Hindi from TIDES to FIRE
Douglas W. Oard and Tan Xu, University of Maryland, USA


Slides from: Leah Larkey, Mike Maxwell, Franz Josef Och, David Yarowsky
Ideas from: just about all of "Team TIDES"

A Very Brief History of NLP
1966: ALPAC
–Refocus investment on enabling technologies
1990: IBM's Candide MT system
–Credible data-driven approaches
1999: TIDES
–Translation, Detection, Extraction, Summarization

Surprise Language Framework
–English-only users / documents in language X
–Zero-resource start (treasure hunt)
–Sharply time-constrained (29 days)
–Character-coded text
–Research-oriented
–Intense team-based collaboration

Schedule
             Cebuano         Hindi
Announce:    Mar 5           Jun 1
Test Data:                   Jun 27
Stop Work:   Mar 14          Jun 30
Newsletter:  April           August
Talks:       May 30 (HLT)    Aug 5 (TIDES PI)
Papers:                      October (TALIP)

300-Language Survey

Five evaluated tasks
–Automatic CLIR (English queries)
–Topic tracking (English examples, event-based)
–Machine translation into English
–English "headline" generation
–Entity tagging (five MUC types)
Several useful components
–POS tags, morphology, time expressions, parsing
Several demonstration systems
–Interactive CLIR (two systems)
–Cross-language QA (English Q, translated A)
–Machine translation (+ translation elicitation)
–Cross-document entity tracking

16 Participating Teams
Cebuano + Hindi: USC-ISI, Maryland, NYU, Johns Hopkins, Sheffield, U Penn-LDC, CMU, UC Berkeley, MITRE
Hindi only: U Mass, Alias-i, BBN, IBM, CUNY-Queens, K-A-T (Colorado), Navy-SPAWAR

[Diagram: coordination strategy as an innovation cycle, linking resource harvesting (books, Web, people, yielding lexicons and corpora over time) to systems research, results capture, and process knowledge, across Translation, Detection, Extraction, and Summarization.]

10-Day Cebuano Pre-Exercise

Hindi Participants
Teams: Alias-i, UC Berkeley, BBN, CMU, CUNY, Johns Hopkins, IBM, ISI, LDC, MITRE, NYU, SPAWAR, U. Sheffield, U. Massachusetts, U. Maryland
Task areas: resource generation, detection, extraction, summarization, translation

Hindi Resources
–Much more data available than for Cebuano
–Data collected by all project participants: Web pages, news, handbooks, manually created resources, dictionaries
–Major problems: many non-standard encodings; often no converters available; available converters often did not work properly
–Huge effort: data conversion and cleaning
–Resulting bilingual corpus: 4.2 million words

Hindi Translation Elicitation Server – Johns Hopkins University (David Yarowsky)
–People voluntarily translated large numbers of Hindi news sentences for nightly prizes at a novel Johns Hopkins University website
–Performance was measured by BLEU score on 20% randomly interspersed test sentences, giving an immediate way to rank and reward quality translations and exclude junk
–Result: 300,000 words of perfectly sentence-aligned bitext (exactly on genre) for 1–2 cents/word within ~5 days
–Much cheaper than 25 cents/word for translation services, or 5 cents/word for a prior MT group's recruitment of local students
–[Sample interface: English translations typed into text boxes; user choice of 2–3 encoding alternatives]
–Observed exponential growth in usage (before prizes ended): viral advertising via family, friends, newsgroups, …
–$0 in recruitment, advertising, and administrative costs
–Nightly incentive rewards given automatically via amazon.com gift certificates to e-mail addresses (any $ amount, no fee): no hiring overhead; rewards only for proven high-quality work already performed (prizes, not salary); immediate positive feedback encourages continued use
–Direct, immediate access to a worldwide labor market fluent in the source language

MT Challenges
Lexicon coverage
–Hindi morphology
–Transliteration of names
Hindi word order
–SOV vs. SVO
Training data inconsistencies, misalignments
Incomplete tuning cycle
–The same data and same model would give better results with better tuning of model parameters

Example Translation Indonesian City of Bali in October last year in the bomb blast in the case of imam accused India of the sea on Monday began to be averted. The attack on getting and its plan to make the charges and decide if it were found guilty, he death sentence of May. Indonesia of the police said that the imam sea bomb blasts in his hand claim to be accepted. A night Club and time in the bomb blast in more than 200 people were killed and several injured were in which most foreign nationals. …

MT Results Overview – Hindi
Results in NIST evaluation: 7.43 cased NIST (7.80 uncased)

Comparison to other languages

Language pair      Words of training data    NIST score   Relative human NIST
Cebuano-English    1.3M (w/o Bible: 400K)    ?            ?
Hindi-English      4.2M                      7.4          73%
Chinese-English    150M                      9.0          80%
Arabic-English     120M                      10.1         89%

Note: different (news) test corpora, so NIST scores are not directly comparable.

Hindi Week 1: Porting
Monday
–2,973 BBC documents (UTF-8)
–Batch CLIR (no stemming; 2/3 of known items at rank 1)
Tuesday
–MIRACLE ("ITRANS", gloss)
–Stemmer (implemented from a paper)
Wednesday
–BBC CLIR collection (19 topics, known item)
Friday
–Parallel text (Bible: 900K words; Web: 4K words)
–Devanagari OCR system

Hindi Weeks 2/3/4: Exploration
–N-grams (trigrams best for UTF-8)
–Relative Average Term Frequency (Kwok)
–Scanned bilingual dictionary (Oxford)
–More topics for the test collection (29)
–Weighted structured queries (IBM lexicon)
–Alternative stemmers (U Mass, Berkeley)
–Blind relevance feedback
–Transliteration
–Noun phrase translation
–MIRACLE integration (ISI MT, BBN headlines)
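The overlapping character n-gram indexing mentioned above (trigrams worked best for UTF-8 Hindi) can be sketched as follows; the function name and the boundary-marking convention are illustrative, not the exercise's actual code:

```python
def char_ngrams(text, n=3):
    """Tokenize text into overlapping character n-grams.

    Operating on Unicode code points means UTF-8 Devanagari text needs
    no language-specific tokenizer; spaces are replaced with a boundary
    marker so n-grams can also capture word edges.
    """
    text = text.replace(" ", "_")
    if len(text) < n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Example: trigrams of a short string
print(char_ngrams("abcd"))  # ['abc', 'bcd']
```

Indexing both queries and documents this way sidesteps stemming and word segmentation entirely, at the cost of a larger index.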

Formative Evaluation

Lessons Learned
We learned more from 2 languages than 1
–Simple techniques worked for Cebuano
–Hindi needed more (encodings, MT, transliteration)
Usable systems can be built in a month
–Parallel text for MT is the pacing item
Broad collaboration yielded useful insights

Our FIRE-2008 Goals
Evaluate Surprise Language resources
–IBM and LDC translation lexicons
–Berkeley stemmer
Compare CLIR techniques
–Probabilistic Structured Queries (PSQ)
–Derived Aggregated Meaning Matching (DAMM)

Comparing Test Collections

                     FIRE-2008    Surprise Language
Query language       English      English
Doc language         Hindi        Hindi
Topics               50           15
Documents            95,215       41,697
Avg rel docs/topic   68           41

Monolingual Baselines
[Chart: our FIRE-2008 training runs (TDN) compared with the 2003 Surprise Language collection (TDNS), on 15 Surprise Language topics.]

A Ranking Function: Okapi BM25

score(Q, D) = \sum_{t \in Q} \log\frac{N - df_t + 0.5}{df_t + 0.5} \cdot \frac{(k_1 + 1)\, tf_{t,D}}{k_1\left((1 - b) + b\,\frac{|D|}{avgdl}\right) + tf_{t,D}} \cdot \frac{(k_3 + 1)\, qtf_t}{k_3 + qtf_t}

where df_t is the document frequency of query term t, tf_{t,D} its term frequency in document D, qtf_t its term frequency in query Q, |D| the document length, and avgdl the average document length.
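A minimal scoring sketch of Okapi BM25 with a query-term-frequency component; the parameter defaults (k1=1.2, b=0.75, k3=8.0) are conventional choices, not values from the talk:

```python
import math

def bm25_score(query_tf, doc_tf, df, n_docs, doc_len, avg_doc_len,
               k1=1.2, b=0.75, k3=8.0):
    """Score one document for one query under Okapi BM25.

    query_tf: {term: frequency in the query}
    doc_tf:   {term: frequency in the document}
    df:       {term: number of documents containing the term}
    """
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # term absent from the document contributes nothing
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
        # Term-frequency saturation, normalized by document length
        tf_part = ((k1 + 1) * tf) / (
            k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf)
        # Query-term-frequency saturation
        qtf_part = ((k3 + 1) * qtf) / (k3 + qtf)
        score += idf * tf_part * qtf_part
    return score
```

Ranking a collection then amounts to computing this score for every document and sorting in descending order.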

Estimating TF and DF for Query Terms

For an English query term e_1 with translations f_1 … f_4, term statistics are estimated as translation-probability-weighted sums over the translations:

tf(e_1) = \sum_i p(f_i \mid e_1) \cdot tf(f_i)
df(e_1) = \sum_i p(f_i \mid e_1) \cdot df(f_i)

[Worked numeric example on the slide (weighted sums with terms such as 0.4·…, …·50, and …·200) is partially illegible.]
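These weighted sums are the core of Probabilistic Structured Queries: translated term statistics stand in for the English term's own. A minimal sketch, with a hypothetical function name and illustrative numbers:

```python
def psq_tf_df(translations, tf, df):
    """Estimate TF and DF for a query term e from its translations.

    translations: {f: p(f|e)} translation probabilities for e
    tf, df:       {f: statistic for f in the target-language collection}
    """
    est_tf = sum(p * tf.get(f, 0) for f, p in translations.items())
    est_df = sum(p * df.get(f, 0) for f, p in translations.items())
    return est_tf, est_df

# Illustrative: two translations, each with probability 0.5
est_tf, est_df = psq_tf_df({'f1': 0.5, 'f2': 0.5},
                           {'f1': 4, 'f2': 2},
                           {'f1': 100, 'f2': 50})
# est_tf = 3.0, est_df = 75.0
```

The estimated statistics can then be plugged directly into a ranking function such as BM25 in place of monolingual TF and DF.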

Bidirectional Translation
"wonders of ancient world" (CLEF Topic 151)

Unidirectional translations of "wonders": se/0.31, demande/0.24, demander/0.08, peut/0.07, merveilles/0.04, question/0.02, savoir/0.02, on/0.02, bien/0.01, merveille/0.01, pourrait/0.01, si/0.01, sur/0.01, me/0.01, t/0.01, emerveille/0.01, ambition/0.01, merveilleusement/0.01, veritablement/0.01, cinq/0.01, hier/0.01

Bidirectional: merveilles/0.92, merveille/0.03, emerveille/0.03, merveilleusement/0.02
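One simple way to realize the bidirectional idea, shown here as a sketch (the talk's exact formulation may differ), is to rescore each candidate f by p(f|e)·p(e|f) and renormalize. Spurious candidates like "se", which almost never translate back to "wonders", get suppressed:

```python
def bidirectional(p_f_given_e, p_e_given_f, e):
    """Rescore the translations of e by p(f|e) * p(e|f), renormalized.

    p_f_given_e: {f: p(f|e)} forward lexicon entry for query term e
    p_e_given_f: {f: {e: p(e|f)}} reverse-direction lexicon
    """
    scores = {f: p * p_e_given_f.get(f, {}).get(e, 0.0)
              for f, p in p_f_given_e.items()}
    total = sum(scores.values())
    if total == 0.0:
        return scores
    return {f: s / total for f, s in scores.items()}

# Illustrative probabilities (not the real lexicon entries):
fwd = {'merveilles': 0.04, 'se': 0.31}
rev = {'merveilles': {'wonders': 0.9}, 'se': {'wonders': 0.001}}
rescored = bidirectional(fwd, rev, 'wonders')
# 'merveilles' now dominates despite its low forward probability
```

This mirrors the slide's before/after lists: the forward direction is dominated by function words, while the bidirectional distribution concentrates on "merveilles".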

Surprise Language Translation Lexicons

Source       Translation pairs   English words   Hindi words
LDC (dict)   69,195              21,842          33,251
IBM (stat)   181,110             50,141          77,517
ISI (stat)   512,248             65,366          97,275

[Diagram: translation probabilities in both directions, p(h|e) and p(e|h); 40% / 60%]

Synonym Sets as Models of Term Meaning
English–Chinese synonym sets:
–George W. Bush ↔ 乔治·布什, 布什
–shrubbery, grass ↔ 草丛
–grass, lawn ↔ 草坪
–marijuana, grass ↔ 大麻
The ambiguous terms "bush" and "grass" map into multiple sets: 布什, 草丛, 大麻.

"Meaning Matching" Variants
[Table: nine variants (FAMM, DAMM, PAMM_q, PAMM_d, IMM, PSQ, APSQ, PDT, APDT), distinguished by which resources each uses: query translation knowledge, document translation knowledge, query-language synsets, document-language synsets, and pre-aligned synsets.]

Pruning Translations
[Figure: twelve translations with probabilities f1 0.32, f2 0.21, f3 0.11, f4 0.09, f5 0.08, f6 0.05, f7 0.04, f8 0.03, f9 0.03, f10 0.02, f11 0.01, f12 0.01; as the cumulative probability threshold rises, the retained set grows from {f1} through {f1…f5} and {f1…f7} to all twelve translations.]
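The cumulative-probability pruning in the figure can be sketched as: sort the translations by probability and keep them until the cumulative mass reaches the threshold (function name illustrative):

```python
def prune_translations(probs, threshold):
    """Keep the most probable translations until their cumulative
    probability reaches the threshold.

    probs: {translation: probability}, assumed to sum to at most 1.0
    """
    kept, cumulative = [], 0.0
    for f, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        if cumulative >= threshold:
            break
        kept.append(f)
        cumulative += p
    return kept

# With the figure's top probabilities, a low threshold keeps only f1:
print(prune_translations({'f1': 0.32, 'f2': 0.21, 'f3': 0.11}, 0.3))
# ['f1']
```

Raising the threshold trades translation coverage against noise, which is exactly the knob the figure varies.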

Comparing PSQ and DAMM 15 Surprise Language topics, TDN queries

1/3 of Topics Improve w/DAMM 15 Surprise Language topics, TDN queries

Official CLIR Results 50 FIRE-2008 topics, TDN queries

Comparing Stemmers
[Scatter plot: per-topic comparison of the YASS stemmer vs. the Berkeley stemmer on 50 FIRE-2008 topics, TDN queries; points above and below the diagonal indicate which stemmer did better.]

Best (Overall) CLIR Run
[Scatter plot: run clir-EH-umd-man2 vs. the per-topic median, on the 41 FIRE-2008 topics with ≥ 5 relevant documents, TDN queries; points above and below the diagonal indicate which did better.]

Cross-Language "Retrieval"
[Diagram: Query → Query Translation → Translated Query → Search → Ranked List]

Interactive Translingual Search
[Diagram: Query Formulation → Query → Query Translation → Translated Query → Search → Ranked List → Examination → Selection → Document → Use, with Query Reformulation feeding back into the cycle; supported by MT, translated "headlines", and English definitions.]

UMass Interactive Hindi CLIR

MIRACLE Design Goals
Value-added interactive search
–Regardless of available resources
Maximize the value of minimal resources
–Bilingual term list + comparable English text
Leverage other available resources
–Parallel text, morphology, MT, summarization

Summary
Larger Hindi test collection
–Prerequisite for insightful failure analysis
Surprise Language resources were useful
–Translation lexicons
–Berkeley stemmer (combine with YASS?)
DAMM is robust with weaker resources

Looking Forward
Shared resources
–Test collections
–Translation lexicons (or parallel corpora)
–Stemmers
System infrastructure
–IL variants of Indri/Terrier/Zettair/Lucene
Community-based cycle of innovation
–Students are our most important "result"

For More Information
Team TIDES newsletter
–Cebuano: April 2003
–Hindi: October 2003
Papers
–NAACL/HLT 2003
–MT Summit 2003
–ACM TALIP special issues (Jun/Sep 2003)