1
Computing & Information Sciences, Kansas State University
Human Language Technologies (HLT) Workshop 2006, Rabat, Morocco
Classification-Based Contextual Correction of Mistranslations: A Machine Learning Approach
William H. Hsu
Joint work with: Waleed Al-Jandal, Martin S. R. Paradesi, Tejaswi Pydimarri, Chris Meyer
Thursday, 01 June 2006
Laboratory for Knowledge Discovery in Databases, Kansas State University
http://www.kddresearch.org/KSU/CIS/HLT-Specialized-20060601.ppt
2
A Technical Survey of Statistical MT: Phrase-Based Methods and Metrics
3
Outline
Background: Statistical Approaches to MT State of the Field: Metrics Open Problems New Approaches, Applications and Software Tools Current and Future Research Prospectus
4
Machine Translation: Generic System Architecture
- Input: Source Language s
- Output: Target Language t
- Global Search Decoder Algorithm: argmax_t P(t) * P(s|t)
- Language Model P(t): Language Modeling toolkit (Target Language)
- Translation Model P(s|t): Training Program (e.g., GIZA) on Bilingual Parallel Corpora
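As a minimal sketch of the decoder box in this architecture, the Python below picks the target sentence t that maximizes P(t) * P(s|t) over a small candidate list. The candidate sentences and both probability tables are made-up illustrative values, not output of GIZA or of any language modeling toolkit, and a real decoder searches a far larger hypothesis space than an explicit list.

```python
import math

def decode(source, candidates, lm_prob, tm_prob):
    """Return the candidate t maximizing log P(t) + log P(source | t)."""
    def score(t):
        return math.log(lm_prob[t]) + math.log(tm_prob[(source, t)])
    return max(candidates, key=score)

# Hypothetical toy probabilities (assumptions for illustration only).
lm_prob = {"the house": 0.02, "house the": 0.0001}        # language model P(t)
tm_prob = {("la maison", "the house"): 0.3,                # translation model P(s|t)
           ("la maison", "house the"): 0.3}

best = decode("la maison", list(lm_prob), lm_prob, tm_prob)
print(best)
```

Here the translation model cannot tell the two word orders apart, so the language model decides; that division of labor is the point of the noisy channel factorization.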
5
Background [1]: Phrase Translation Model
- Based on noisy channel model
  - Source: foreign sentence f
  - Target: English sentence e
- Bayesian inference: Maximum A Posteriori (MAP)
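The MAP step named on this slide is the usual Bayes-rule rewrite of the noisy channel model. The equation shown on the original slide is not preserved in this transcript, so the following is the textbook form in the slide's symbols (f: foreign sentence, e: English sentence), offered as a sketch rather than a quotation:

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
        \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

The denominator P(f) is dropped because it does not depend on e.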
6
Background [2]: Modeling Steps
- Preliminaries: segmentation of foreign input
  - Result:
  - Use: lexical analysis tools (string tokenizer, etc.)
- Goal: decoding
  - Segmented input:
  - Output:
- Distributions
  - Prediction:
  - Distortion: a_i = start of f_i, b_{i-1} = end of f_{i-1}
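The slide's own formulas are missing from this transcript. As a hedged sketch, the Python below assumes the standard phrase-based decomposition of Koehn, Och & Marcu (2003): the segmented input is scored as a product of phrase translation probabilities phi(f_i | e_i) and distortion terms d(a_i - b_{i-1}), with the distortion taken as alpha ** |a_i - b_{i-1} - 1|. The function names, the `phi` table, and the value of `alpha` are illustrative assumptions.

```python
def distortion(a_i, b_prev, alpha=0.5):
    """Assumed distortion penalty: alpha ** |a_i - b_{i-1} - 1|."""
    return alpha ** abs(a_i - b_prev - 1)

def phrase_model_score(phrase_pairs, phi, alpha=0.5):
    """Product over phrase pairs of phi(f_i | e_i) * d(a_i - b_{i-1}).

    phrase_pairs: (f_phrase, e_phrase, start, end) tuples listed in English
    output order; start/end are 1-based word positions in the foreign input.
    """
    score = 1.0
    b_prev = 0  # end position of the previous foreign phrase (0 before the first)
    for f_phrase, e_phrase, start, end in phrase_pairs:
        score *= phi[(f_phrase, e_phrase)] * distortion(start, b_prev, alpha)
        b_prev = end
    return score

# Hypothetical phrase table (values are assumptions).
phi = {("das", "the"): 0.5, ("haus", "house"): 0.8}
monotone = [("das", "the", 1, 1), ("haus", "house", 2, 2)]
swapped = [("haus", "house", 2, 2), ("das", "the", 1, 1)]

monotone_score = phrase_model_score(monotone, phi)
swapped_score = phrase_model_score(swapped, phi)
print(monotone_score, swapped_score)
```

Monotone order pays no distortion penalty, while the swapped order is penalized twice, once for jumping ahead and once for jumping back.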
7
Background [3]: Probabilistic Formulation
- Length normalization factor:
- Language Model (p_LM): Trigram [Seymour and Rosenfeld, 1997]
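To make the two ingredients on this slide concrete, here is a small Python sketch of an unsmoothed maximum-likelihood trigram model p_LM and a length factor of the form omega ** length(e). The omega form and its value are assumptions (the slide's formula is not preserved), and the model is deliberately simpler than the smoothed models a real toolkit would build.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-word histories, with sentence padding."""
    tri, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>", "<s>"] + s.split() + ["</s>"]
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            bi[tuple(toks[i - 2:i])] += 1
    return tri, bi

def trigram_prob(sentence, tri, bi):
    """Unsmoothed MLE p_LM(sentence); returns 0.0 for unseen histories."""
    toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for i in range(2, len(toks)):
        history = tuple(toks[i - 2:i])
        if bi[history] == 0:
            return 0.0
        p *= tri[tuple(toks[i - 2:i + 1])] / bi[history]
    return p

def lm_score(sentence, tri, bi, omega=1.1):
    """p_LM(e) * omega ** length(e); omega is a hypothetical tuning value."""
    return trigram_prob(sentence, tri, bi) * omega ** len(sentence.split())

# Tiny illustrative corpus (an assumption, not real training data).
corpus = ["the gunman was killed", "the gunman was arrested"]
tri, bi = train_trigram(corpus)
p = trigram_prob("the gunman was killed", tri, bi)
print(p, lm_score("the gunman was killed", tri, bi))
```

With omega above 1, longer outputs are favored, which offsets the language model's tendency to prefer short sentences.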
8
Methods for Learning in MT: Survey
- Transformation-Based Learning (TBL)
- Example-Based Machine Translation (EBMT)
- Symbolic AI: Frames, Conceptual Grammars, Analogy, CBR
- Statistical
  0. classical / naïve (cf. Weaver’s correspondence with Wiener)
  1. phrase alignments from word-aligned model [Och & Ney, 2000]
  2. linguistically motivated models [Yamada & Knight, 2001]
  3. joint phrase model [Marcu & Wong, 2002]
  4. generative phrase alignment [Koehn, Och & Marcu, 2003]
  5. hierarchical models [Chiang, 2005; Taskar, 2005]
  6. new approaches
9
Outline
Background: Statistical Approaches to MT State of the Field: Metrics Open Problems New Approaches, Applications and Software Tools Current and Future Research Prospectus
10
Bilingual Evaluation Understudy (BLEU) [1]: Papineni et al. (ACL, 2002)
- N-gram precision (score is between 0 & 1)
  - What percentage of machine n-grams can be found in the reference translation?
  - n-gram: sequence of n units (words)
  - Not allowed to use same portion of reference translation twice (can’t cheat by repetition)
- Brevity penalty
  - Can’t just type out single word “the”
- Symbols: p_n = n-gram precision; w_n = positive weights; r = words-in-reference; c = words-in-machine
- Hard to “game” system (i.e., change machine output so that BLEU goes up, but quality doesn’t)
Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.
Adapted from Knight (2003)
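The quantities defined on this slide combine in the usual BLEU form, BP * exp(sum of w_n * log p_n), with brevity penalty BP = min(1, exp(1 - r/c)). The Python below is a minimal sentence-level sketch under stated assumptions: uniform weights w_n = 1/4, whitespace tokenization, no smoothing, and r taken as the reference length closest to c. It is not the official evaluation script.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, references, n):
    """p_n: each candidate n-gram is credited at most as often as it
    appears in any single reference (no credit for repetition)."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and brevity penalty."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # r: reference length closest to c (shorter one on ties).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    precisions = [clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the gunman was shot to death by the police".split()
hyp = "the gunman was shot dead by the police".split()
print(bleu(hyp, [ref]))
print(clipped_precision(["the"] * 4, [ref], 1))
```

The last line shows the clipping rule at work: a candidate made of four copies of "the" gets unigram precision 0.5, because the reference contains "the" only twice.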
11
Bilingual Evaluation Understudy (BLEU) [2]: Multiple Reference Translations
Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places.
Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden, which threatens to launch a biochemical attack on such public places as airport. Guam authority has been on alert.
Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia. They said there would be biochemistry air raid to Guam Airport and other public places. Guam needs to be in high precaution about this matter.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.
© 2003 Knight, K.
12
Bilingual Evaluation Understudy (BLEU) [3]: Tracking Human Judgment
(variant of BLEU)
Courtesy G. Doddington (NIST)
13
Bilingual Evaluation Understudy (BLEU) [4]: Metrics in Action
Foreign original: 枪手被警方击毙。
Reference translation: the gunman was shot to death by the police.
#1 the gunman was police kill.
#2 wounded police jaya of
#3 the gunman was shot dead by the police.
#4 the gunman arrested by police kill.
#5 the gunmen were killed.
#6 the gunman was shot to death by the police.
#7 gunmen were killed by police ?SUB>0 ?SUB>0
#8 al by the police.
#9 the ringer is killed by the police.
#10 police killed the gunman.
Legend: green = 4-gram match (good!); red = word not matched (bad!)
© 2003 Kevin Knight
14
Issues with BLEU
- Significance of High Correlation with Human Judgment
- Sensitivity: Need to Recalibrate for Corpus, Language?
- Generalizability to Other Translation Tasks
  - Causal explanation
  - Associative reasoning in customer relationship management
  - Collaborative recommendation
  - Diagnosis (form of gisting)
  - Speech-to-speech
- Meaning
15
New Technologies and Transfer Plan
16
Outline
Background: Statistical Approaches to MT State of the Field: Metrics Open Problems New Approaches, Applications and Software Tools Current and Future Research Prospectus
17
Context-Driven NLP: MT Applications
- Classical Natural Language Processing (NLP)
  - (Noun and verb) phrase extraction
  - Detection of named entity phrases
  - Word sense disambiguation
  - Spelling correction
- Interlingual Challenges
  - Making use of mixed resources: bilingual & monolingual
  - Semi-supervised learning
- Applications
  - Mixed-mode (semi-interactive) MT: assistive technology
  - Correcting mistranslations
18
Translating Named Entity Phrases [1]: Arabic-English Application
The frequency of named-entity phrases in news text reflects the significance of the events they are associated with, so the same news is most likely to be reported in many languages.
For example: The Arabic newspaper article is about negotiations between the US and North Korean authorities regarding the search for the remains of US soldiers who died during the Korean war.
[Knight & Al-Onaizan, 2001]
19
Translating Named Entity Phrases [2]: Two-Phase Approach
- Generate ranked list of translation candidates
  - Bilingual resources: parallel corpus
  - Monolingual resources
- Re-score list of candidates using different monolingual clues
[Knight & Al-Onaizan, 2001]
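The two phases can be sketched as a generate-then-rescore pipeline. The Python below is an illustration only: the translation table, the monolingual counts, the placeholder source key, and the linear interpolation with `weight` are all assumptions, not the scoring method of the cited work.

```python
def generate_candidates(phrase, translation_table, top_k=3):
    """Phase 1: ranked candidates from a (hypothetical) bilingual table."""
    scored = translation_table.get(phrase, {})
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

def rescore(candidates, monolingual_counts, weight=0.5):
    """Phase 2: mix the phase-1 score with a monolingual relative frequency."""
    total = sum(monolingual_counts.values()) or 1
    rescored = []
    for cand, score in candidates:
        mono = monolingual_counts.get(cand, 0) / total
        rescored.append((cand, (1 - weight) * score + weight * mono))
    return sorted(rescored, key=lambda kv: kv[1], reverse=True)

# Hypothetical resources; "SRC_PHRASE" stands in for a source-language phrase.
translation_table = {"SRC_PHRASE": {"Northern Korea": 0.5,
                                    "North Korea": 0.4,
                                    "Korea North": 0.1}}
monolingual_counts = {"North Korea": 90, "Northern Korea": 10}

first_pass = generate_candidates("SRC_PHRASE", translation_table)
final = rescore(first_pass, monolingual_counts)
print(first_pass[0][0], "->", final[0][0])
```

In this toy run the bilingual table prefers one rendering, and the monolingual evidence promotes the form that is actually frequent in target-language text.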
20
Correcting Faulty Translations
- Human-Assistive Technology
- Semi-Supervised: Two Training Corpora
  - Labeled: “bad translations” and “near misses”
  - Unlabeled: candidate translations
- Interactive Aspect
  - “Which of these translations is right?”
  - “Why is this candidate incorrect?”
- Application: Boosting Accuracy of SMT
21
Boosting the Accuracy of SMT
- Parsing [Koehn et al., 2003]
  - Pro: Found to slow growth of translation tables
  - Con: Limited effect on BLEU
- Context-Specificity
  - Supported by computational linguistic theory
  - Some positive results in NLP prediction tasks [Elman, 1994]
  - Very effective in sequence learning [Barash & Friedman, 2001]
  - Important for Relational and First-Order Representations
- New Work: Semi-Supervised Approaches
22
Outline
Background: Statistical Approaches to MT State of the Field: Metrics Open Problems New Approaches, Applications and Software Tools Current and Future Research Prospectus
23
Popular SMT Tools
- Translation Model Generator: GIZA++
- Search Decoder: PHARAOH, ISI ReWrite Decoder
- Language Model Generator: SRILM, CMU-Cambridge Statistical Language Modeling Toolkit
- EGYPT: A toolkit for SMT that consists of GIZA/GIZA++ and word alignment tools
- Evaluation packages: MTEVAL, GMT
- Metrics: BLEU, NIST, n-grams, WER, PER and SSER
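Two of the metrics listed here are simple enough to sketch directly. The Python below computes word error rate (WER) as word-level edit distance divided by reference length, and a position-independent error rate (PER) that ignores word order. PER has several variants in the literature; the bag-of-words definition used here is one common form and is an assumption, as are the function names. This is not the code of MTEVAL or any other package.

```python
from collections import Counter

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # match or substitution
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

def per(ref, hyp):
    """Position-independent error rate (one common bag-of-words variant)."""
    matches = sum((Counter(ref) & Counter(hyp)).values())
    extra = max(0, len(hyp) - len(ref))
    return 1 - (matches - extra) / len(ref)

ref = "the gunman was shot to death by the police".split()
hyp = "the gunman was shot dead by the police".split()
reordered = "by the police the gunman was shot to death".split()

print(wer(ref, hyp), per(ref, hyp))
print(wer(ref, reordered), per(ref, reordered))
```

The reordered hypothesis shows the difference between the two metrics: it contains exactly the reference words, so PER is 0, while WER penalizes the changed order.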
24
Software Tools for Graphical Models: BNJ v3
ALARM Network
© 2005 KSU Bayesian Network tools in Java (BNJ) Development Team
25
Outline
Background: Statistical Approaches to MT State of the Field: Metrics Open Problems New Approaches, Applications and Software Tools Current and Future Research Prospectus
26
Current Work
- Development: End-to-End SMT System for NIST 2006 Evaluation
  - Arabic-English
  - Chinese-English
- Assemblage of Parallel Corpora
- Software Library Development: SMT Modules
  - Aligners
  - Parsers
  - Phrase-Based Learning
  - Transformation-Based Learning (TBL)
- Development of Graphical Models Toolkit
  - BNJ v4 under development: http://bnj.sourceforge.net
  - Integration with KSU SMT library
  - Applications: Relational Link Mining in Social Networks
© 2005 Walker Blogs
27
Future Research Directions
[Figure: MT Strategies (1954-2006). Axis 1, Knowledge Representation Strategy, from Shallow/Simple to Deep/Complex: Word-based only, Phrase tables, Syntactic Constituent Structure, Semantic analysis, Interlingua. Axis 2, Knowledge Acquisition Strategy, from All manual to Fully automated: Hand-built by experts, Hand-built by non-experts, Electronic dictionaries, Learn from annotated data, Learn from un-annotated data. Plotted systems: Original direct approach, Typical transfer system, Classic interlingual system, Example-based MT, Original statistical MT. New Research: Context-Specificity.]
Slide courtesy of Laurie Gerber
28
Questions and Discussion