1
Not So Surprising Anymore: Hindi from TIDES to FIRE
Douglas W. Oard and Tan Xu, University of Maryland, USA
http://terpconnect.umd.edu/~oard
FIRE, December 13, 2008
Slides from: Leah Larkey, Mike Maxwell, Franz Josef Och, David Yarowsky
Ideas from: Just about all of “Team TIDES”
2
A Very Brief History of NLP
1966: ALPAC
– Refocus investment on enabling technologies
1990: IBM’s Candide MT system
– Credible data-driven approaches
1999: TIDES
– Translation, Detection, Extraction, Summarization
3
Surprise Language Framework
– English-only users / documents in language X
– Zero-resource start (treasure hunt)
– Sharply time constrained (29 days)
– Character-coded text
– Research-oriented
– Intense team-based collaboration
4
Schedule
              Cebuano          Hindi
Announce:     Mar 5            Jun 1
Test Data:                     Jun 27
Stop Work:    Mar 14           Jun 30
Newsletter:   April            August
Talks:        May 30 (HLT)     Aug 5 (TIDES PI)
Papers:                        October (TALIP)
5
300-Language Survey
6
Five evaluated tasks
– Automatic CLIR (English queries)
– Topic tracking (English examples, event-based)
– Machine translation into English
– English “headline” generation
– Entity tagging (five MUC types)
Several useful components
– POS tags, morphology, time expressions, parsing
Several demonstration systems
– Interactive CLIR (two systems)
– Cross-language QA (English question, translated answer)
– Machine translation (+ translation elicitation)
– Cross-document entity tracking
7
16 Participating Teams
Cebuano + Hindi: USC-ISI, Maryland, NYU, Johns Hopkins, Sheffield, U Penn-LDC, CMU, UC Berkeley, MITRE
Hindi only: U Mass, Alias-i, BBN, IBM, CUNY-Queens, K-A-T (Colorado), Navy-SPAWAR
8
Coordination Strategy [diagram]: an innovation cycle in which resource harvesting (books, the web, people) yields lexicons and corpora over time, feeding systems research in translation, detection, extraction, and summarization; results and process knowledge are captured and pushed back out through organized talks.
9
10-Day Cebuano Pre-Exercise
10
Hindi Participants [diagram]: Alias-I, UC Berkeley, BBN, CMU, CUNY, Johns Hopkins, IBM, ISI, LDC, MITRE, NYU, SPAWAR, U. Sheffield, U. Massachusetts, and U. Maryland, spanning resource generation, detection, extraction, summarization, and translation.
11
Hindi Resources
Much more data available than for Cebuano
Data collected by all project participants
– Web pages, news, handbooks, manually created, …
– Dictionaries
Major problems:
– Many non-standard encodings
– Often no converters available
– Available converters often did not work properly
Huge effort: data conversion and cleaning
Resulting bilingual corpus: 4.2 million words
12
Hindi Translation Elicitation Server - Johns Hopkins University (David Yarowsky)
– Volunteers translated large numbers of Hindi news sentences for nightly prizes at a novel Johns Hopkins website
– Translation quality is measured by BLEU score on 20% randomly interspersed test sentences
– Gives an immediate way to rank and reward quality translations and exclude junk
– Result: 300,000 words of perfectly sentence-aligned bitext (exactly on genre) for 1-2 cents/word within ~5 days
– Much cheaper than 25 cents/word for translation services or 5 cents/word for a prior MT group’s recruitment of local students
– [Sample interface screenshot: English translations are typed into text boxes; users choose among 2-3 encoding alternatives]
– Observed exponential growth in usage (before prizes ended): viral advertising via family, friends, newsgroups, …
– $0 in recruitment, advertising, and administrative costs
– Nightly incentive rewards given automatically via amazon.com gift certificates sent to email addresses (any dollar amount, no fee), with no hiring overhead
– Rewards given only for proven high-quality work already performed (prizes, not salary); immediate positive feedback encourages continued use
– Direct, immediate access to a worldwide labor market fluent in the source language
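A rough sketch of the quality-control part of that protocol, mixing known control sentences into each volunteer's work and scoring only those, is shown below. It is an illustration only: the helper names, the data layout, and the use of NLTK's sentence-level BLEU are assumptions, not details of the Johns Hopkins system.

```python
# Illustrative sketch (not the Johns Hopkins system): intersperse known
# control sentences with real work, then score a volunteer on the controls.
import random
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def build_task(work_sentences, control_pairs, control_fraction=0.2):
    """Mix real Hindi sentences with known (hindi, english) control pairs
    so that roughly control_fraction of the combined task is controls."""
    n_controls = round(len(work_sentences) * control_fraction / (1 - control_fraction))
    controls = random.sample(control_pairs, n_controls)
    task = [(sentence, None) for sentence in work_sentences] + list(controls)
    random.shuffle(task)  # the volunteer cannot tell which items are controls
    return task

def volunteer_quality(task, volunteer_translations):
    """Average sentence-level BLEU over the control items only."""
    smooth = SmoothingFunction().method1
    scores = []
    for (hindi, gold_english), hypothesis in zip(task, volunteer_translations):
        if gold_english is None:          # real work item: no reference exists
            continue
        scores.append(sentence_bleu([gold_english.split()],
                                    hypothesis.split(),
                                    smoothing_function=smooth))
    return sum(scores) / len(scores) if scores else 0.0
```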
13
MT Challenges
Lexicon coverage
– Hindi morphology
– Transliteration of names
Hindi word order
– SOV vs. SVO
Training data inconsistencies, misalignments
Incomplete tuning cycle
– The same data and model would give better results with better tuning of model parameters
14
Example Translation Indonesian City of Bali in October last year in the bomb blast in the case of imam accused India of the sea on Monday began to be averted. The attack on getting and its plan to make the charges and decide if it were found guilty, he death sentence of May. Indonesia of the police said that the imam sea bomb blasts in his hand claim to be accepted. A night Club and time in the bomb blast in more than 200 people were killed and several injured were in which most foreign nationals. …
15
MT Results Overview - Hindi Results in NIST evaluation: 7.43 Cased NIST (7.80 uncased)
16
Comparison to Other Languages
Language pair      Words of training data     NIST score   Relative to human NIST
Cebuano-English    1.3M (w/o Bible: 400K)     ?            ?
Hindi-English      4.2M                       7.4          73%
Chinese-English    150M                       9.0          80%
Arabic-English     120M                       10.1         89%
Note: different (news) test corpora, so NIST scores are not directly comparable.
17
Hindi Week 1: Porting
Monday
– 2,973 BBC documents (UTF-8)
– Batch CLIR (no stemming; 2/3 of known items at rank 1)
Tuesday
– MIRACLE (“ITRANS”, gloss)
– Stemmer (implemented from a paper)
Wednesday
– BBC CLIR collection (19 topics, known item)
Friday
– Parallel text (Bible: 900K words, Web: 4K words)
– Devanagari OCR system
18
Hindi Weeks 2/3/4: Exploration
– N-grams (trigrams best for UTF-8)
– Relative Average Term Frequency (Kwok)
– Scanned bilingual dictionary (Oxford)
– More topics for the test collection (29)
– Weighted structured queries (IBM lexicon)
– Alternative stemmers (U Mass, Berkeley)
– Blind relevance feedback
– Transliteration
– Noun phrase translation
– MIRACLE integration (ISI MT, BBN headlines)
20
Formative Evaluation
21
Lessons Learned
We learned more from two languages than from one
– Simple techniques worked for Cebuano
– Hindi needed more (encoding, MT, transliteration)
Usable systems can be built in a month
– Parallel text for MT is the pacing item
Broad collaboration yielded useful insights
22
Our FIRE-2008 Goals
Evaluate Surprise Language resources
– IBM and LDC translation lexicons
– Berkeley stemmer
Compare CLIR techniques
– Probabilistic Structured Queries (PSQ)
– Derived Aggregated Meaning Matching (DAMM)
23
Comparing Test Collections
                      FIRE-2008    Surprise Language
Query language        English      English
Doc language          Hindi        Hindi
Topics                50           15
Documents             95,215       41,697
Avg rel docs/topic    68           41
24
Monolingual Baselines [chart]: our FIRE-2008 training runs (TDN) and the 2003 Surprise Language runs (TDNS); 15 Surprise Language topics.
25
A Ranking Function: Okapi BM25

    \mathrm{BM25}(q,d) = \sum_{t \in q} \log\frac{N - df_t + 0.5}{df_t + 0.5} \cdot \frac{(k_1 + 1)\, tf_{t,d}}{k_1\left((1-b) + b\,\frac{|d|}{avgdl}\right) + tf_{t,d}} \cdot \frac{(k_3 + 1)\, tf_{t,q}}{k_3 + tf_{t,q}}

where t is a query term in query q, d is the document, tf_{t,d} is the term frequency in the document, tf_{t,q} is the term frequency in the query, df_t is the document frequency, |d| is the document length, avgdl is the average document length, N is the collection size, and k_1, b, k_3 are tuning constants.
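A minimal sketch of this scoring function, assuming common default parameters (k1 = 1.2, b = 0.75, k3 = 7) and simple token counts; the function and argument names are illustrative, not from any system described in these slides.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, df, n_docs, avgdl,
               k1=1.2, b=0.75, k3=7.0):
    """Okapi BM25 score of one document for one query.

    query_terms, doc_terms: lists of tokens
    df: dict mapping term -> document frequency in the collection
    n_docs: number of documents in the collection
    avgdl: average document length (in tokens)
    """
    tf_d = Counter(doc_terms)       # term frequency in the document
    tf_q = Counter(query_terms)     # term frequency in the query
    dl = len(doc_terms)             # document length
    score = 0.0
    for t in tf_q:
        if t not in tf_d or df.get(t, 0) == 0:
            continue
        idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5))
        doc_part = ((k1 + 1) * tf_d[t]) / (k1 * ((1 - b) + b * dl / avgdl) + tf_d[t])
        query_part = ((k3 + 1) * tf_q[t]) / (k3 + tf_q[t])
        score += idf * doc_part * query_part
    return score
```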
26
Estimating TF and DF for Query Terms [diagram]: an English query term e1 translates to document-language terms f1, f2, f3, f4 with probabilities 0.4, 0.3, 0.2, 0.1; TF and DF for e1 are estimated as probability-weighted sums over those translations, e.g. TF = 0.4*20 + 0.3*5 + 0.2*2 + 0.1*50 = 14.9 and DF = 0.4*50 + 0.3*40 + 0.2*30 + 0.1*200 = 58.
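A rough sketch of that estimate, assuming a translation table of the form {english term: {document-language term: probability}}; the function name and data layout are assumptions made for illustration.

```python
def expected_tf_df(eng_term, translations, tf, df):
    """Probability-weighted TF and DF estimates for one English query term.

    translations: dict {english term: {doc-language term: p(f|e)}}
    tf: dict {doc-language term: term frequency in the current document}
    df: dict {doc-language term: document frequency in the collection}
    """
    exp_tf = sum(p * tf.get(f, 0) for f, p in translations[eng_term].items())
    exp_df = sum(p * df.get(f, 0) for f, p in translations[eng_term].items())
    return exp_tf, exp_df

# The worked example from the slide:
translations = {"e1": {"f1": 0.4, "f2": 0.3, "f3": 0.2, "f4": 0.1}}
tf = {"f1": 20, "f2": 5, "f3": 2, "f4": 50}
df = {"f1": 50, "f2": 40, "f3": 30, "f4": 200}
print(expected_tf_df("e1", translations, tf, df))   # (14.9, 58.0)
```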
27
Bidirectional Translation: “wonders of ancient world” (CLEF Topic 151)
Unidirectional translations of “wonders”: se/0.31, demande/0.24, demander/0.08, peut/0.07, merveilles/0.04, question/0.02, savoir/0.02, on/0.02, bien/0.01, merveille/0.01, pourrait/0.01, si/0.01, sur/0.01, me/0.01, t/0.01, emerveille/0.01, ambition/0.01, merveilleusement/0.01, veritablement/0.01, cinq/0.01, hier/0.01
Bidirectional: merveilles/0.92, merveille/0.03, emerveille/0.03, merveilleusement/0.02
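One common way to obtain such bidirectional estimates is to multiply p(f|e) by p(e|f) and renormalize, which concentrates probability mass on translations that translate back to the original term. The sketch below shows that product-and-renormalize step as an illustration of the general idea; it is not necessarily the exact formula used in the work described here.

```python
def bidirectional(p_f_given_e, p_e_given_f, eng_term):
    """Combine two directional translation tables for one English term.

    p_f_given_e: {english: {foreign: prob}}   (English -> foreign)
    p_e_given_f: {foreign: {english: prob}}   (foreign -> English)
    Returns a renormalized {foreign: prob} that favors translations
    which translate back to eng_term with high probability.
    """
    combined = {}
    for f, p_fe in p_f_given_e.get(eng_term, {}).items():
        p_ef = p_e_given_f.get(f, {}).get(eng_term, 0.0)
        if p_fe * p_ef > 0.0:
            combined[f] = p_fe * p_ef
    total = sum(combined.values())
    return {f: p / total for f, p in combined.items()} if total else {}
```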
28
Surprise Language Translation Lexicons
Source        Translation pairs   English words   Hindi words
LDC (dict)    69,195              21,842          33,251
IBM (stat)    181,110             50,141          77,517
ISI (stat)    512,248             65,366          97,275
[figure labels: p(h|e), p(e|h); 40%, 60%]
29
Synonym Sets as Models of Term Meaning [diagram]: English terms and their Chinese translations are grouped by meaning rather than by surface form, e.g. “George W. Bush” / 乔治.布什, “shrubbery” / 草丛, “grass lawn” / 草坪, “marijuana” / 大麻; ambiguous English words such as “bush” and “grass” link to the synonym sets 布什, 草丛, and 大麻 with translation probabilities (0.7, 0.3, 0.8, 0.2, 0.6, 0.4, 1.0) on the links.
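The underlying bookkeeping, summing a term's translation probability by synonym set (meaning) rather than by surface form, can be sketched as follows. The synset labels and probabilities below are made up for illustration; this is the general idea behind aggregated meaning matching, not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical example in the spirit of the slide: English "bush" has several
# Chinese translations, and each translation belongs to a synonym set (meaning).
p_translation = {"bush": {"布什": 0.6, "草丛": 0.3, "大麻": 0.1}}   # made-up probabilities
synset_of = {"布什": "PERSON_BUSH", "乔治.布什": "PERSON_BUSH",
             "草丛": "SHRUBBERY", "草坪": "SHRUBBERY",
             "大麻": "MARIJUANA"}

def aggregate_by_synset(eng_term):
    """Sum translation probability over each synonym set (i.e., each meaning)."""
    by_meaning = defaultdict(float)
    for surface_form, prob in p_translation.get(eng_term, {}).items():
        by_meaning[synset_of[surface_form]] += prob
    return dict(by_meaning)

print(aggregate_by_synset("bush"))
# {'PERSON_BUSH': 0.6, 'SHRUBBERY': 0.3, 'MARIJUANA': 0.1}
```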
30
“Meaning Matching” Variants [table]: FAMM, DAMM, PAMMq, PAMMd, IMM, PSQ, APSQ, PDT, and APDT are distinguished by which resources each uses: query-side translation knowledge, document-side translation knowledge, query-language synsets, document-language synsets, and pre-aligned synsets.
31
Pruning Translations [chart]: candidate translations are ranked by probability, f1 (0.32), f2 (0.21), f3 (0.11), f4 (0.09), f5 (0.08), f6 (0.05), f7 (0.04), f8 (0.03), f9 (0.03), f10 (0.02), f11 (0.01), f12 (0.01), and kept in that order until a cumulative probability threshold (0.0 to 1.0) is reached; a low threshold keeps only f1, while a threshold near 1.0 keeps all twelve.
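A minimal sketch of that cumulative-threshold pruning, assuming a {translation: probability} dictionary; the function name is illustrative.

```python
def prune_translations(probs, threshold):
    """Keep the most probable translations until their cumulative
    probability reaches the threshold (always keeps at least one)."""
    kept, cumulative = [], 0.0
    for term, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(term)
        cumulative += p
        if cumulative >= threshold:
            break
    return kept

# The probabilities shown on the slide:
probs = {"f1": 0.32, "f2": 0.21, "f3": 0.11, "f4": 0.09, "f5": 0.08,
         "f6": 0.05, "f7": 0.04, "f8": 0.03, "f9": 0.03, "f10": 0.02,
         "f11": 0.01, "f12": 0.01}
print(prune_translations(probs, 0.5))    # ['f1', 'f2']
print(prune_translations(probs, 0.85))   # ['f1', 'f2', 'f3', 'f4', 'f5', 'f6']
```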
32
Comparing PSQ and DAMM 15 Surprise Language topics, TDN queries
33
1/3 of Topics Improve w/DAMM 15 Surprise Language topics, TDN queries
34
Official CLIR Results 50 FIRE-2008 topics, TDN queries
35
Comparing Stemmers [per-topic scatter plot, regions labeled “YASS Stemmer Better” and “Berkeley Stemmer Better”]; 50 FIRE-2008 topics, TDN queries.
36
Best (Overall) CLIR Run [per-topic scatter plot, regions labeled “clir-EH-umd-man2 Better” and “Median Better”]; 41 FIRE-2008 topics with ≥ 5 relevant documents, TDN queries.
37
Cross-Language “Retrieval” [diagram]: Query → Query Translation → Translated Query → Search → Ranked List.
38
Interactive Translingual Search [diagram]: Query Formulation → Query → Query Translation → Translated Query → Search → Ranked List → Selection → Document → Examination → Document Use, with Query Reformulation feeding back into the query; selection and examination are supported by MT, translated “headlines”, and English definitions.
39
UMass Interactive Hindi CLIR
40
MIRACLE Design Goals
Value-added interactive search
– Regardless of available resources
Maximize the value of minimal resources
– Bilingual term list + comparable English text
Leverage other available resources
– Parallel text, morphology, MT, summarization
45
Summary
Larger Hindi test collection
– Prerequisite for insightful failure analysis
Surprise Language resources were useful
– Translation lexicons
– Berkeley stemmer (combine with YASS?)
DAMM is robust with weaker resources
46
Looking Forward
Shared resources
– Test collections
– Translation lexicons (or parallel corpora)
– Stemmers
System infrastructure
– IL variants of Indri/Terrier/Zettair/Lucene
Community-based cycle of innovation
– Students are our most important “result”
47
For More Information
Team TIDES newsletter
– http://language.cnri.reston.va.us/TeamTIDES.html
– Cebuano: April 2003
– Hindi: October 2003
Papers
– NAACL/HLT 2003
– MT Summit 2003
– ACM TALIP special issues (Jun/Sep 2003)