Computing & Information Sciences, Kansas State University. Human Language Technologies (HLT) Workshop, Rabat, Morocco.

Computing & Information Sciences, Kansas State University
Human Language Technologies (HLT) Workshop 2006, Rabat, Morocco
Classification-Based Contextual Correction of Mistranslations: A Machine Learning Approach
William H. Hsu
Joint work with: Waleed Al-Jandal, Martin S. R. Paradesi, Tejaswi Pydimarri, Chris Meyer
Thursday, 01 June 2006
Laboratory for Knowledge Discovery in Databases, Kansas State University

A Technical Survey of Statistical MT: Phrase-Based Methods and Metrics

Outline
- Background: Statistical Approaches to MT
- State of the Field: Metrics
- Open Problems
- New Approaches, Applications and Software Tools
- Current and Future Research Prospectus

Machine Translation: Generic System Architecture
- Input: source language s
- Output: target language t
- Translation model P(s|t): trained on bilingual parallel corpora by a training program (e.g., GIZA)
- Language model P(t): built from target-language text with a language modeling toolkit
- Global search (decoder algorithm): argmax_t P(t) * P(s|t)

Background [1]: Phrase Translation Model
- Based on noisy channel model
  - Source: foreign sentence f
  - Target: English sentence e
- Bayesian inference: Maximum A Posteriori (MAP)
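
The MAP rule under the noisy channel model picks the English sentence e that maximizes p(e) * p(f | e). A minimal sketch of that decision rule follows; the candidate list and all probabilities are hypothetical toy values chosen for illustration, and a real decoder searches a far larger hypothesis space than a fixed list.

```python
import math

def map_decode(f, candidates, lm_logprob, tm_logprob):
    """Return the candidate e that maximizes log p(e) + log p(f | e)."""
    return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(f, e))

# Hypothetical toy distributions (illustration only, not from the slides).
LM = {"the house": 0.6, "house the": 0.05}            # p(e)
TM = {("la maison", "the house"): 0.4,                # p(f | e)
      ("la maison", "house the"): 0.5}

def lm_logprob(e):
    return math.log(LM[e])

def tm_logprob(f, e):
    return math.log(TM[(f, e)])

best = map_decode("la maison", list(LM), lm_logprob, tm_logprob)
```

Here the translation model alone slightly prefers the scrambled output, and the language model prior overturns that preference, which is the point of combining the two.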

Background [2]: Modeling Steps
- Preliminaries: segmentation of foreign input
  - Result:
  - Use: lexical analysis tools (string tokenizer, etc.)
- Goal: decoding
  - Segmented input:
  - Output:
- Distributions
  - Prediction:
  - Distortion: a_i = start of f_i, b_{i-1} = end of f_{i-1}
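
The formulas on this slide did not survive in the transcript. As a sketch, the code below assumes the standard phrase-based form from Koehn, Och & Marcu (2003): each phrase pair contributes a phrase translation probability phi(f_i | e_i) and a distortion term d(a_i - b_{i-1}) = alpha^|a_i - b_{i-1} - 1|. The phrase table, the value of alpha, and the function names are hypothetical.

```python
import math

def distortion(a_i, b_prev, alpha=0.5):
    """d(a_i - b_{i-1}) = alpha ** |a_i - b_{i-1} - 1|; equals 1 for monotone order."""
    return alpha ** abs(a_i - b_prev - 1)

def translation_logprob(phrase_pairs, phi, alpha=0.5):
    """Log of the product over phrase pairs of phi(f_i | e_i) * d(a_i - b_{i-1}).

    phrase_pairs is listed in English output order; each entry is
    (foreign_phrase, english_phrase, start, end), where start and end are
    1-based word positions of the foreign phrase in the input sentence.
    """
    logp = 0.0
    b_prev = 0  # end position of the previously translated foreign phrase
    for f_phrase, e_phrase, start, end in phrase_pairs:
        logp += math.log(phi[(f_phrase, e_phrase)])
        logp += math.log(distortion(start, b_prev, alpha))
        b_prev = end
    return logp

# Hypothetical phrase table, for illustration only.
phi = {("la", "the"): 0.5, ("maison", "house"): 0.5}
monotone = [("la", "the", 1, 1), ("maison", "house", 2, 2)]
swapped = [("maison", "house", 2, 2), ("la", "the", 1, 1)]
```

With these toy values the monotone derivation incurs no distortion cost, while the swapped one is penalized twice: once for skipping ahead and once for jumping back.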

Background [3]: Probabilistic Formulation
- Length normalization factor:
- Language Model (p_LM): Trigram [Seymour and Rosenfeld, 1997]
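
To make the trigram language model and the length factor concrete, here is a small sketch. It assumes add-one smoothing and a length factor of the form omega^length(e), as in Koehn, Och & Marcu (2003); the training sentences, the value of omega, and the function names are hypothetical, and production toolkits use better smoothing than add-one.

```python
from collections import Counter
import math

def train_trigram(sentences):
    """Count trigrams and their two-word histories, padding with <s> and </s>."""
    tri, hist = Counter(), Counter()
    vocab = {"</s>"}
    for s in sentences:
        words = s.split()
        vocab.update(words)
        toks = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            hist[tuple(toks[i - 2:i])] += 1
    return tri, hist, len(vocab)

def trigram_logprob(sentence, model):
    """log p_LM(e) with add-one smoothing (an assumption made for this sketch)."""
    tri, hist, v = model
    toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((tri[tuple(toks[i - 2:i + 1])] + 1) / (hist[tuple(toks[i - 2:i])] + v))
        for i in range(2, len(toks))
    )

def lm_score(sentence, model, omega=1.1):
    """log p_LM(e) + length(e) * log(omega); omega > 1 favors longer output."""
    return trigram_logprob(sentence, model) + len(sentence.split()) * math.log(omega)

model = train_trigram(["the gunman was shot", "the gunman was killed"])
```

A word order seen in training scores higher than a scrambled one, and setting omega to 1 switches the length factor off.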

Methods for Learning in MT: Survey
- Transformation-Based Learning (TBL)
- Example-Based Machine Translation (EBMT)
- Symbolic AI: Frames, Conceptual Grammars, Analogy, CBR
- Statistical
  0. classical / naïve (cf. Weaver's correspondence with Wiener)
  1. phrase alignments from word-aligned model [Och & Ney, 2000]
  2. linguistically motivated models [Yamada & Knight, 2001]
  3. joint phrase model [Marcu & Wong, 2002]
  4. generative phrase alignment [Koehn, Och & Marcu, 2003]
  5. hierarchical models [Chiang, 2005; Taskar, 2005]
  6. new approaches

Bilingual Evaluation Understudy (BLEU) [1]: Papineni et al. (ACL, 2002)
- N-gram precision (score is between 0 and 1)
  - What percentage of machine n-grams can be found in the reference translation?
  - n-gram: sequence of n units (words)
  - Not allowed to use the same portion of the reference translation twice (can't cheat by repetition)
- Brevity penalty
  - Can't just type out the single word "the"
- Notation: p_n = n-gram precision; w_n = positive weights; r = words in reference; c = words in machine output
- Hard to "game" the system (i.e., change machine output so that BLEU goes up, but quality doesn't)
Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.
Adapted from Knight (2003)
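
The pieces named on this slide combine in Papineni et al.'s definition as BLEU = BP * exp(sum_n w_n log p_n), with brevity penalty BP = 1 if c > r and exp(1 - r/c) otherwise. The sketch below is a sentence-level illustration with uniform weights w_n = 1/N and whitespace tokenization, both of which are assumptions of this example; the function names are hypothetical and this is not the reference implementation.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(cand, refs, n):
    """Clipped n-gram precision: a reference n-gram cannot be matched twice."""
    cand_counts = ngrams(cand, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights; returns 0.0 if any p_n is 0."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = [modified_precision(cand, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    c = len(cand)
    # Use the reference length closest to the candidate length (ties: shorter).
    r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Clipping is what blocks cheating by repetition: a candidate consisting of "the" four times earns unigram credit only once per "the" in the reference. The brevity penalty then discounts candidates that are precise but too short.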

Bilingual Evaluation Understudy (BLEU) [2]: Multiple Reference Translations
Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places.
Reference translation 3: The US International Airport of Guam and its office has received an e-mail from a self-claimed Arabian millionaire named Laden, which threatens to launch a biochemical attack on such public places as airport. Guam authority has been on alert.
Reference translation 4: US Guam International Airport and its office received an e-mail from Mr. Bin Laden and other rich businessman from Saudi Arabia. They said there would be biochemistry air raid to Guam Airport and other public places. Guam needs to be in high precaution about this matter.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.
© 2003 Knight, K.

Bilingual Evaluation Understudy (BLEU) [3]: Tracking Human Judgment
(variant of BLEU)
Courtesy G. Doddington (NIST)

Bilingual Evaluation Understudy (BLEU) [4]: Metrics in Action
Foreign original: 枪手被警方击毙。
Reference translation: the gunman was shot to death by the police.
Machine translations:
#1 the gunman was police kill.
#2 wounded police jaya of
#3 the gunman was shot dead by the police.
#4 the gunman arrested by police kill.
#5 the gunmen were killed.
#6 the gunman was shot to death by the police.
#7 gunmen were killed by police
#8 al by the police.
#9 the ringer is killed by the police.
#10 police killed the gunman.
Legend: green = 4-gram match (good!); red = word not matched (bad!)
© 2003 Kevin Knight

Issues with BLEU
- Significance of High Correlation with Human Judgment
- Sensitivity: Need to Recalibrate for Corpus, Language?
- Generalizability to Other Translation Tasks
  - Causal explanation
  - Associative reasoning in customer relationship management
  - Collaborative recommendation
  - Diagnosis (form of gisting)
  - Speech-to-speech
- Meaning

New Technologies and Transfer Plan

Context-Driven NLP: MT Applications
- Classical Natural Language Processing (NLP)
  - (Noun and verb) phrase extraction
  - Detection of named entity phrases
  - Word sense disambiguation
  - Spelling correction
- Interlingual Challenges
  - Making use of mixed resources: bilingual & monolingual
  - Semi-supervised learning
- Applications
  - Mixed-mode (semi-interactive) MT: assistive technology
  - Correcting mistranslations

Translating Named Entity Phrases [1]: Arabic-English Application
- The frequency of named-entity phrases in news text reflects the significance of the events they are associated with, so the same news is most likely to be reported in many languages.
- For example: the Arabic newspaper article is about negotiations between the US and North Korean authorities regarding the search for the remains of US soldiers who died during the Korean War.
[Knight & Al-Onaizan, 2001]

Translating Named Entity Phrases [2]: Two-Phase Approach
- Generate ranked list of translation candidates
  - Bilingual resources: parallel corpus
  - Monolingual resources
- Re-score list of candidates using different monolingual clues
[Knight & Al-Onaizan, 2001]
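
The two phases can be pictured as candidate generation followed by re-ranking. The sketch below covers only the second phase and is an illustration, not the method of Knight & Al-Onaizan: the monolingual clue is assumed to be a raw frequency count, the combination rule (log score plus weighted log count) is an arbitrary choice, and the candidate scores, counts, and function name are hypothetical.

```python
import math

def rescore(candidates, mono_counts, weight=1.0):
    """Re-rank phase-1 candidates using a monolingual frequency clue.

    candidates: list of (translation, phase1_score) with scores in (0, 1].
    mono_counts: hypothetical counts of each candidate in monolingual text.
    Returns (translation, combined_score) pairs, best first.
    """
    rescored = [
        (text, math.log(score) + weight * math.log1p(mono_counts.get(text, 0)))
        for text, score in candidates
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Hypothetical phase-1 output: the misspelled transliteration ranks first.
phase1 = [("Bill Clintun", 0.5), ("Bill Clinton", 0.4)]
counts = {"Bill Clinton": 1000}
ranked = rescore(phase1, counts)
```

In this toy case the spelling that is actually attested in monolingual text overtakes the candidate that the first phase preferred.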

Correcting Faulty Translations
- Human-Assistive Technology
- Semi-Supervised: Two Training Corpora
  - Labeled: "bad translations" and "near misses"
  - Unlabeled: candidate translations
- Interactive Aspect
  - "Which of these translations is right?"
  - "Why is this candidate incorrect?"
- Application: Boosting Accuracy of SMT

Boosting the Accuracy of SMT
- Parsing [Koehn et al., 2003]
  - Pro: Found to slow growth of translation tables
  - Con: Limited effect on BLEU
- Context-Specificity
  - Supported by computational linguistic theory
  - Some positive results in NLP prediction tasks [Elman, 1994]
  - Very effective in sequence learning [Barash & Friedman, 2001]
  - Important for Relational and First-Order Representations
- New Work: Semi-Supervised Approaches

Popular SMT Tools
- Translation Model Generator: GIZA++
- Search Decoder: PHARAOH, ISI ReWrite Decoder
- Language Model Generator: SRILM, CMU-Cambridge Statistical Language Modeling Toolkit
- EGYPT: a toolkit for SMT that consists of GIZA/GIZA++ and word alignment tools
- Evaluation packages: MTEVAL, GMT
- Metrics: BLEU, NIST, n-grams, WER, PER and SSER
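
Of the metrics listed, word error rate (WER) is the simplest to show in code: it is the word-level edit distance between hypothesis and reference, divided by the reference length. The sketch below is a plain dynamic-programming illustration, not the code of any evaluation package named above; it assumes whitespace tokenization and a non-empty reference, and the function name is hypothetical.

```python
def wer(hypothesis, reference):
    """Word error rate: word-level Levenshtein distance / reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # prev[j] = edit distance between the empty reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # reference word dropped by the hypothesis
                cur[j - 1] + 1,                        # extra word in the hypothesis
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)
```

For example, scoring "the gunman was shot dead by the police" against the reference "the gunman was shot to death by the police" costs one substitution and one deletion over nine reference words.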

Software Tools for Graphical Models: BNJ v3
[Figure: ALARM Network. © 2005 KSU Bayesian Network tools in Java (BNJ) Development Team]

Current Work
- Development: End-to-End SMT System for NIST 2006 Evaluation
  - Arabic-English
  - Chinese-English
- Assemblage of Parallel Corpora
- Software Library Development: SMT Modules
  - Aligners
  - Parsers
  - Phrase-Based Learning
  - Transformation-Based Learning (TBL)
- Development of Graphical Models Toolkit
  - BNJ v4 under development:
  - Integration with KSU SMT library
- Applications: Relational Link Mining in Social Networks
[Figure credit: © 2005 Walker Blogs]

Future Research Directions
[Figure: MT Strategies, plotted by knowledge representation strategy against knowledge acquisition strategy. Slide courtesy of Laurie Gerber.]
- Knowledge Representation Strategy (Shallow/Simple to Deep/Complex): Word-based only, Phrase tables, Syntactic Constituent Structure, Semantic analysis, Interlingua
- Knowledge Acquisition Strategy (All manual to Fully automated): Hand-built by experts, Hand-built by non-experts, Learn from annotated data, Learn from un-annotated data
- Other labels in the figure: Original direct approach, Typical transfer system, Classic interlingual system, Electronic dictionaries, Example-based MT, Original statistical MT
- New Research: Context-Specificity

Questions and Discussion