
Hindi SLE Debriefing AVENUE Transfer System July 3, 2003

Summary of our Final Hindi-to-English Transfer System
Overview of our Lexical Resources and how they were used in the system
Grammar Development
Transfer System Runtime Configuration
Dev-test Evaluation Results
Observations and Lessons Learned

Elicited Data Collection
Goal: Acquire high-quality word-aligned Hindi-English data to support system development, especially grammar development and automatic grammar learning
We recruited a sizeable team of bilingual speakers – Rachel…
"Original" Elicitation Corpus was translated into Hindi
Corpus of Phrases extracted from the Brown Corpus (NPs and PPs) was broken into files and assigned to translators, here and in India
Resulting in a total of word-aligned translated phrases

Summary of Lexical Resources
Manual: manually written phrase transfer rules (72)
Postpos: manually written postposition rules (105)
Bigram: translations of the 500 most frequent bigrams in Hindi (from Ralf)
Elicited: elicited data from the controlled corpus and Brown, w-to-w and p-to-p, total of lexical and phrase rules
LDC: "master" bilingual dictionary from LDC, frequency sorted; Richard and Shobha manually cleaned up the top 12% of entries, total of rules
NE: Named Entity lists from the LDC website and from Fei, total of 3346 rules
IBM: statistical w-to-w and p-to-p lexicon from IBM, sorted by translation probability, rules
JOY: SMT system w-to-w and p-to-p lexicon, sorted by translation probability, rules
TOTAL: rules

Ordering of Lexical Resources
Corresponds to three passes of the system:
–Phrase-to-phrase (used in first pass)
–POS-tagged w-to-w pass (morph, enhanced, sorted, can feed into grammar)
–LEX-tagged w-to-w pass (full forms, can only be used for w-to-w, no grammar)

Ordering of Lexical Resources
Man rules (p-to-p, w-to-w)
Postpos (w-to-w)
Bigrams (p-to-p)
LDC (w-to-w, enhanced, sorted)
Etposrules (w-to-w, enhanced, sorted)
NE (p-to-p, w-to-w)
Etlexrules (w-to-w, sorted)
Etphraserules (p-to-p)
IBM (p-to-p, w-to-w, sorted)
JOY (p-to-p, w-to-w, sorted)
Cleaned up and duplicates removed
Total Rules in Global Lexicon: xxx
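The resources are concatenated in the priority order above and exact duplicates are dropped. The following is a minimal illustrative sketch of that merge step, not the actual AVENUE lexicon-building code; the entry format (source phrase, target phrase, rule kind) and the resource names in the example are assumptions.

# Minimal sketch of building a global lexicon from prioritized resources.
# Illustrative only, not the actual AVENUE lexicon builder; the entry format
# (hindi, english, kind) is an assumption for this example.
from typing import List, Tuple

Entry = Tuple[str, str, str]  # (hindi, english, kind) with kind in {"p-to-p", "w-to-w"}

def build_global_lexicon(resources: List[Tuple[str, List[Entry]]]) -> List[Entry]:
    """Concatenate resources in priority order, dropping exact duplicates."""
    seen = set()
    lexicon = []
    for name, entries in resources:  # earlier resources take precedence
        for hindi, english, kind in entries:
            key = (hindi, english, kind)
            if key in seen:
                continue  # duplicate of a higher-priority entry
            seen.add(key)
            lexicon.append((hindi, english, kind))
    return lexicon

# Priority order mirrors the slide: manual rules first, noisy statistical lexica last.
resources = [
    ("manual", [("kitAba", "book", "w-to-w")]),
    ("ibm",    [("kitAba", "book", "w-to-w"), ("kitAba", "notebook", "w-to-w")]),
]
print(build_global_lexicon(resources))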

Grammar Development
Grammar covers mostly VPs (verb complexes)
73 grammar rules, covering all tenses, active and passive, subjunctive
Experimented also with simple NP and PP rules (movement of postpositions in Hindi to prepositions in English); these hurt performance
Problems in grammar testing and debugging – Ari…

Example Grammar Rule
;; SIMPLE PRESENT AND PAST (depends on the tense of the Aux)
; Ex: (tu) bolta hE -> (I) (usually) speak
; Ex: (maiM) sotA hUM -> (I) sleep (now)
; Ex: (maiM) sotA thA -> (I) slept (used to sleep)
{VP,5}
VP::VP : [V Aux] -> [V]
(
(X1::Y1)
((x1 form) = part)
((x1 aspect) = imperf)
((x2 lexwx) = 'honA')
((x2 tense) = (*NOT* fut))
((x2 tense) = (*NOT* subj))
((x0 tense) = (x2 tense))
((x0 agr num) = (x2 agr num))
((x0 agr pers) = (x2 agr pers))
(x0 = x1)
((y1 tense) = (x0 tense))
((y1 agr num) = (x0 agr num)) ; not always agrees, try commenting
((y1 agr pers) = (x0 agr pers))
)
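To make the constraint notation concrete, here is a small, purely illustrative Python sketch of what the x-side constraints and feature percolation of {VP,5} amount to. It is not the transfer engine's unification code; the dict-based feature structures and function names are assumptions for the example.

# Toy check of the x-side constraints of rule {VP,5} against dict-based feature
# structures. The real transfer engine performs full unification; everything
# named here is a stand-in for illustration.

def matches_vp5(x1: dict, x2: dict) -> bool:
    """Return True if V (x1) and Aux (x2) satisfy the rule's value constraints."""
    return (
        x1.get("form") == "part"
        and x1.get("aspect") == "imperf"
        and x2.get("lexwx") == "honA"
        and x2.get("tense") not in ("fut", "subj")   # (*NOT* fut) / (*NOT* subj)
    )

def build_y1(x1: dict, x2: dict) -> dict:
    """Percolate features: x0 copies x1, then takes tense/agreement from the Aux (x2);
    the English V (y1) inherits tense and agreement from x0."""
    x0 = dict(x1)                       # (x0 = x1)
    x0["tense"] = x2.get("tense")       # ((x0 tense) = (x2 tense))
    x0["agr num"] = x2.get("agr num")
    x0["agr pers"] = x2.get("agr pers")
    return {"tense": x0["tense"], "agr num": x0["agr num"], "agr pers": x0["agr pers"]}

# Example corresponding to "(maiM) sotA thA -> (I) slept":
v   = {"form": "part", "aspect": "imperf", "root": "sonA"}
aux = {"lexwx": "honA", "tense": "past", "agr num": "sg", "agr pers": "1"}
if matches_vp5(v, aux):
    print(build_y1(v, aux))  # {'tense': 'past', 'agr num': 'sg', 'agr pers': '1'}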

Transfer Runtime System
Three passes:
–Pass 1: match against p-to-p entries, halt if a match is found (ver2 allows continuing)
–Pass 2: morphologically analyze the word and match against all w-to-w resources, halt if a match is found
–Pass 3: match the original word against all w-to-w resources; provides only w-to-w output, no feeding into grammar rules
Selection of best set of arcs: greedy left-to-right search that prefers longer input segments
Unknown-word policy: replace with English "the"
Post-processing:
–remove be/give at end of sentence if preceded by a verb
–replace all remaining "be" with "is"
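The arc-selection and unknown-word steps can be pictured with the following minimal sketch; the arc representation (start index, end index, English output) is an assumption, and this is not the actual runtime decoder.

# Greedy left-to-right selection that prefers arcs covering longer input segments,
# plus the unknown-word policy. Illustrative only.
from typing import List, Tuple

Arc = Tuple[int, int, str]  # (start index, end index exclusive, English output)

def greedy_select(tokens: List[str], arcs: List[Arc]) -> List[str]:
    output, i = [], 0
    while i < len(tokens):
        # Among arcs starting at position i, take the one covering the most input.
        candidates = [a for a in arcs if a[0] == i]
        if candidates:
            start, end, translation = max(candidates, key=lambda a: a[1] - a[0])
            output.append(translation)
            i = end
        else:
            output.append("the")  # unknown-word policy: emit English "the"
            i += 1
    return output

tokens = ["amerikI", "senA", "ne", "kahA", "hE"]
arcs = [(0, 1, "American"), (0, 2, "the American army"), (3, 5, "has said")]
print(" ".join(greedy_select(tokens, arcs)))  # -> "the American army the has said"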

Development Testing
Three dev-test sets:
–India Today: 59 sentences, single reference
–Full ISI: 358 sentences, newswire, single reference
–Small ISI: first 25 sentences of Full ISI
Full ISI was the most meaningful test set: we had tested on India Today earlier on, and wanted to ensure no over-fitting.

Final Performance: ISI-Full
Grammar / BLEU / M-BLEU / NIST:
–VP0630a
–VP0629b
–NO GRA
Lexicon: xferdict.0630-al-4
Ngram Hits (1g / 2g / 3g / 4g / 5g): VP0630a
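The figures here are corpus-level BLEU/NIST scores. As a rough illustration of how such corpus-level scores are computed (this is not the scoring setup actually used for the evaluation), a small sketch with NLTK's corpus_bleu:

# Illustrative only: each hypothesis is scored against a single reference,
# mirroring the single-reference dev sets above. Not the official scoring pipeline.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

hypotheses = [
    "the American army has said".split(),
]
references = [
    ["the American army said".split()],  # list of references per hypothesis
]
smooth = SmoothingFunction().method1  # avoid zero scores on very small samples
print("BLEU:", corpus_bleu(references, hypotheses, smoothing_function=smooth))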

Debug Output with Sources
Hindi source sentence (romanized):
amerikI senA ne kahA hE ki irAka kI galiyoM meM cAro waraPa vyApwa aparAXa ko niyaMwriwa karane ke lie uMhoMne irAkiyoM ko senA ke kAma meM Ane vAle haWiyAra sOMpane ke lie 2 sapwAha kA samaya xiyA hE.

Histogram of Source Information
SORTED total = 2425
IBM total = 447
JOY total = 483
MANUAL ERIK total = 619
MANUAL ARI total = 139
BIGR total = 196
POSTPOS total = 510
TIMEEXP total = 4
ETLEX total = 0
ETPHRASE total = 0
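A histogram like this can be produced by tallying the source tag attached to each output arc in the debug trace. A minimal sketch under an assumed trace format (one arc per line, ending in a SOURCE=<tag> field, which is a hypothetical stand-in for the real debug output):

# Tally which lexical resource each output arc came from. The trace format is
# a hypothetical stand-in, not the real debug output.
from collections import Counter
import re

def source_histogram(trace_lines):
    counts = Counter()
    for line in trace_lines:
        match = re.search(r"SOURCE=(\S+)", line)
        if match:
            counts[match.group(1)] += 1
    return counts

trace = [
    "amerikI senA -> the American army SOURCE=MANUAL",
    "kahA hE -> has said SOURCE=IBM",
    "irAka -> Iraq SOURCE=NE",
]
for tag, n in source_histogram(trace).most_common():
    print(f"{tag} total = {n}")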

Things we Tried at the Last Minute
Allowing the second pass to take place even if phrases matched in the first pass – no improvement in score
Throwing in NP rules and solving the lost unigrams with a clever final pass that replaces the choices for words – hurt the score slightly…

Real Eval Set Transfer Run
The eval set consisted of 450 sentences from a variety of newswire sources
Suspicion that some sentences were drawn from dev data!
We submitted XFER-ONLY and XFER-ONLY+CASE
Aggregate stats from our run:
–Coverage: 88.3%
–Compounds matched: 2279 (token)
–Went through morphology and matched: 6256/9605
–Unknown Hindi words: 1122
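Coverage statistics of this kind can be recomputed from the tokenized input and the global lexicon. A toy sketch under assumed data structures (token list, set of known source words, morphological analyzer); it is not the script used for the actual run:

# Toy coverage statistics over an eval set; all inputs are stand-ins.
def coverage_stats(tokens, lexicon_words, morph_analyze):
    covered = unknown = morph_matched = morph_tried = 0
    for token in tokens:
        if token in lexicon_words:
            covered += 1
            continue
        morph_tried += 1
        if any(form in lexicon_words for form in morph_analyze(token)):
            morph_matched += 1
            covered += 1
        else:
            unknown += 1
    return {
        "coverage": covered / len(tokens) if tokens else 0.0,
        "morph matched": f"{morph_matched}/{morph_tried}",
        "unknown words": unknown,
    }

# Toy morph analyzer: strip a two-character suffix as a crude stem guess.
toy_morph = lambda w: [w, w[:-2]]
print(coverage_stats(["kitAboM", "senA", "xyz"], {"kitAb", "senA"}, toy_morph))
# -> {'coverage': 0.666..., 'morph matched': '1/2', 'unknown words': 1}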

Limited Resource Scenario
The "rules of the game" were skewed against us in this evaluation:
–1.5 million words of parallel text
–Noisy statistical lexical resources
–We don't have a strong statistical selection model
How do we do in the minority-language scenario, with our limited resources?
Kathrin ran a test with a lexicon constructed just from the Man rules, bigrams, postpos, the LDC dictionary and the elicited data
We will also test EBMT and SMT under the same scenario!

Results: ISI-Full
Grammar / BLEU / M-BLEU / NIST:
–VP0630a
–NP0630a
–Learned NP
–Learned NP+PP
–NO GRA

Observations and Lessons
Serious grammar development occurred very late in the process (last few days)
Very hard time getting the grammar to start pulling performance numbers up
Grammar rules are often blocked from applying because of phrasal matches
Rather hard to find cases where they were supposed to apply and didn't
NP/PP rules did not help, partly because NP boundaries were not adequately found
Strange phenomenon of losing unigrams when NP rules apply. Need to debug this thoroughly

Things we should find out
What sources did the output come from in the real eval test run? Get a histogram…
What is the marginal contribution of the various resources to our performance?
–Conduct runs with individual resources omitted (sketched below), in particular without:
–Our elicited data
–The IBM data
–The "Joy" data
–The LDC Lexicon
–The phrase-to-phrase pass
Can more grammar development help?
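Such a leave-one-resource-out comparison could be scripted roughly as follows. This is a hypothetical sketch: build_lexicon, run_transfer and score_bleu are stand-ins for the real lexicon builder, transfer runtime and scoring step, and are not existing tools.

# Hypothetical ablation driver: rebuild the lexicon with one resource left out,
# rerun translation, and compare scores. Only the toy stand-ins at the bottom
# exist in this sketch.
ALL_RESOURCES = ["manual", "postpos", "bigrams", "ldc", "elicited", "ne", "ibm", "joy"]

def ablation_study(dev_set, references, build_lexicon, run_transfer, score_bleu):
    baseline = score_bleu(run_transfer(dev_set, build_lexicon(ALL_RESOURCES)), references)
    print(f"baseline (all resources): {baseline:.4f}")
    for resource in ALL_RESOURCES:
        reduced = [r for r in ALL_RESOURCES if r != resource]
        score = score_bleu(run_transfer(dev_set, build_lexicon(reduced)), references)
        print(f"without {resource}: {score:.4f} (delta {score - baseline:+.4f})")

# Toy stand-ins so the sketch runs end-to-end; replace with the real pipeline.
toy_build = lambda resources: set(resources)
toy_run = lambda dev, lexicon: [f"translated with {len(lexicon)} resources" for _ in dev]
toy_score = lambda hypotheses, references: len(hypotheses) / 10.0  # dummy score
ablation_study(["sent 1", "sent 2"], ["ref 1", "ref 2"], toy_build, toy_run, toy_score)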

Further Work on our Hindi System
Our Hindi system is an ideal platform for advanced thesis-related research work from now on
The eval test set will remain unseen test data for future experimentation (reference translations will be available soon)
Low-pace further system development throughout July (grammar, bug fixes)
Worthwhile new results to be reported at the August PI meeting