The Linguistic-Core Approach to Structured Translation and Analysis of Low- Resource Languages 2011 Program Review for ARL MURI Project 4 November 2011.

Slides:



Advertisements
Similar presentations
A Word-Class Approach to Labeling PSCFG Rules for Machine Translation (ACL 2011) Andreas Zollmann and Stephan Vogel Presented by Yun Huang 01/07/2011.
Advertisements

The Application of Machine Translation in CADAL Huang Chen, Chen Haiying Zhejiang University Libraries, Hangzhou, China
Linking Entities in #Microposts ROMIL BANSAL, SANDEEP PANEM, PRIYA RADHAKRISHNAN, MANISH GUPTA, VASUDEVA VARMA INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY,
Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2007) Learning for Semantic Parsing Advisor: Hsin-His.
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
Improving Machine Translation Quality via Hybrid Systems and Refined Evaluation Methods Andreas Eisele DFKI GmbH and Saarland University Helsinki, November.
Vamshi Ambati | Stephan Vogel | Jaime Carbonell Language Technologies Institute Carnegie Mellon University A ctive Learning and C rowd-Sourcing for Machine.
Chinese Word Segmentation Method for Domain-Special Machine Translation Su Chen; Zhang Yujie; Guo Zhen; Xu Jin’an Beijing Jiaotong University.
Shallow Processing: Summary Shallow Processing Techniques for NLP Ling570 December 7, 2011.
The current status of Chinese- English EBMT -where are we now Joy (Ying Zhang) Ralf Brown, Robert Frederking, Erik Peterson Aug 2001.
A Non-Parametric Bayesian Approach to Inflectional Morphology Jason Eisner Johns Hopkins University This is joint work with Markus Dreyer. Most of the.
1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/ Shallow Parsing.
Inducing Information Extraction Systems for New Languages via Cross-Language Projection Ellen Riloff University of Utah Charles Schafer, David Yarowksy.
Introduction to CL Session 1: 7/08/2011. What is computational linguistics? Processing natural language text by computers  for practical applications.
The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001.
09:10 Mikko Kurimo: "Unsupervised Morpheme Analysis -- Morpho Challenge Workshop 2007" 09:30 Mikko Kurimo: "Evaluation by a Comparison to a Linguistic.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
Course Summary LING 575 Fei Xia 03/06/07. Outline Introduction to MT: 1 Major approaches –SMT: 3 –Transfer-based MT: 2 –Hybrid systems: 2 Other topics.
Machine Learning in Natural Language Processing Noriko Tomuro November 16, 2006.
Semi-Automatic Learning of Transfer Rules for Machine Translation of Low-Density Languages Katharina Probst April 5, 2002.
Distributed Iterative Training Kevin Gimpel Shay Cohen Severin Hacker Noah A. Smith.
LEARNING WORD TRANSLATIONS Does syntactic context fare better than positional context? NCLT/CNGL Internal Workshop Ankit Kumar Srivastava 24 July 2008.
Does Syntactic Knowledge help English- Hindi SMT ? Avinesh. PVS. K. Taraka Rama, Karthik Gali.
An Information Theoretic Approach to Bilingual Word Clustering Manaal Faruqui & Chris Dyer Language Technologies Institute SCS, CMU.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2005 Lecture 1 21 July 2005.
LDMT MURI Data Collection and Linguistic Annotations November 4, 2011 Jason Baldridge, UT Austin Ulf Hermjakob, USC/ISI.
Computational Methods to Vocalize Arabic Texts H. Safadi*, O. Al Dakkak** & N. Ghneim**
English-Persian SMT Reza Saeedi 1 WTLAB Wednesday, May 25, 2011.
Morphosyntactic correspondence: a progress report on bitext parsing Alexander Fraser, Renjing Wang, Hinrich Schütze Institute for NLP University of Stuttgart.
Survey of Semantic Annotation Platforms
Advanced Signal Processing 05/06 Reinisch Bernhard Statistical Machine Translation Phrase Based Model.
10/12/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 10 Giuseppe Carenini.
2010 Failures in Czech-English Phrase-Based MT 2010 Failures in Czech-English Phrase-Based MT Full text, acknowledgement and the list of references in.
Morpho Challenge competition Evaluations and results Authors Mikko Kurimo Sami Virpioja Ville Turunen Krista Lagus.
This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number.
Dependency Tree-to-Dependency Tree Machine Translation November 4, 2011 Presented by: Jeffrey Flanigan (CMU) Lori Levin, Jaime Carbonell In collaboration.
NUDT Machine Translation System for IWSLT2007 Presenter: Boxing Chen Authors: Wen-Han Chao & Zhou-Jun Li National University of Defense Technology, China.
Tracking Language Development with Learner Corpora Xiaofei Lu CALPER 2010 Summer Workshop July 12, 2010.
Advanced MT Seminar Spring 2008 Instructors: Alon Lavie and Stephan Vogel.
A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources Author: Carmen Banea, Rada Mihalcea, Janyce Wiebe Source:
Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013.
Approaches to Machine Translation CSC 5930 Machine Translation Fall 2012 Dr. Tom Way.
Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling Ferhan Ture and Jimmy Lin University of Maryland,
Transfer-based MT with Strong Decoding for a Miserly Data Scenario Alon Lavie Language Technologies Institute Carnegie Mellon University Joint work with:
1 CSI 5180: Topics in AI: Natural Language Processing, A Statistical Approach Instructor: Nathalie Japkowicz Objectives of.
Publications Vamshi Ambati, Stephan Vogel and Jaime Carbonell. "Collaborative Workflow for Crowdsourcing Translation”, To Appear in the 2012 ACM Conference.
Indirect Supervision Protocols for Learning in Natural Language Processing II. Learning by Inventing Binary Labels This work is supported by DARPA funding.
Carnegie Mellon Goal Recycle non-expert post-editing efforts to: - Refine translation rules automatically - Improve overall translation quality Proposed.
October 2005CSA3180 NLP1 CSA3180 Natural Language Processing Introduction and Course Overview.
Languages at Inxight Ian Hersey Co-Founder and SVP, Corporate Development and Strategy.
LDMT MURI Data Collection and Linguistic Annotation November 2, 2012 Jason Baldridge, UT Austin Lori Levin, CMU.
The Unreasonable Effectiveness of Data
Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models Elias Ponvert, Jason Baldridge and Katrin Erk The University of Texas.
Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics.
Statistical Machine Translation Part II: Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart
Error Analysis of Two Types of Grammar for the purpose of Automatic Rule Refinement Ariadna Font Llitjós, Katharina Probst, Jaime Carbonell Language Technologies.
Discriminative Modeling extraction Sets for Machine Translation Author John DeNero and Dan KleinUC Berkeley Presenter Justin Chiu.
October 10, 2003BLTS Kickoff Meeting1 Transfer with Strong Decoding Learning Module Transfer Rules {PP,4894} ;;Score: PP::PP [NP POSTP] -> [PREP.
CMU MilliRADD Small-MT Report TIDES PI Meeting 2002 The CMU MilliRADD Team: Jaime Carbonell, Lori Levin, Ralf Brown, Stephan Vogel, Alon Lavie, Kathrin.
A Simple English-to-Punjabi Translation System By : Shailendra Singh.
LING 575 Lecture 5 Kristina Toutanova MSR & UW April 27, 2010 With materials borrowed from Philip Koehn, Chris Quirk, David Chiang, Dekai Wu, Aria Haghighi.
Dan Roth University of Illinois, Urbana-Champaign 7 Sequential Models Tutorial on Machine Learning in Natural.
Learning to Generate Complex Morphology for Machine Translation Einat Minkov †, Kristina Toutanova* and Hisami Suzuki* *Microsoft Research † Carnegie Mellon.
LingWear Language Technology for the Information Warrior Alex Waibel, Lori Levin Alon Lavie, Robert Frederking Carnegie Mellon University.
Approaches to Machine Translation
PRESENTED BY: PEAR A BHUIYAN
--Mengxue Zhang, Qingyang Li
Deep Learning based Machine Translation
Machine Learning in Natural Language Processing
Approaches to Machine Translation
Presentation transcript:

The Linguistic-Core Approach to Structured Translation and Analysis of Low- Resource Languages 2011 Program Review for ARL MURI Project 4 November 2011

The Cast CMU: Jaime Carbonell Lori Levin Stephan Vogel Noah Smith ISI: Kevin Knight David Chiang MIT: UT: Regina Barzilay Jason Bladridge Supporting roles: 8 Graduate Students, 2 Postdocs, N Informants, …

3 The Plot Selected Languages – Kinyarwanda Bantu (7.5M speakers) – Malagasy Malayo-Polynesian (14.5M speakers) Objectives – MT & Analysis – Of Low-resource Languages – Based on Linguistics – Supported by Stat Learning Challenges – Data sparsity & collection – Major Divergences from Eng – “Universal” Solutions (vs just for a given language)

The Setting (from Proposal) LR Languages, e.g. in Africa, cannot be ignored MT & TA for LRLs requires a linguistic core – Insufficient parallel text for standard SMT – Insufficient annotations for purely statistical TA Phrasal SMT, even for HRL, errs – E.g. divergences, long-distance movements,… But Computational Linguists are Expensive – Cannot dedicate person-centuries per language to write, test and debug massive rule-based systems – Army needs a more rapid & cost effective approach

The Scientific Questions Can deep linguistic representations benefit practical MT & TA? Can we marry learning from data with expert-crafted declarative linguistics? Can we uncover underlying linguistic structure through comparative language analysis? How can we extend MT-motivated linguistic-core capabilities to related TA tasks? Can different linguistic analyses reinforce each other synergistically? How important is resolving complex morphology? How important are general semantic features for MT? How well can unsupervised learning methods augment linguistically motivated analyses for MT and TA? ….

Act I: Exploratory Research Scene I: Data Obtained Malagasy Bible and align with modern English Bible. Converted 33 KGMC multilingual transcripts (Kinyarwanda, English, French) of interviews of survivors of the Rwandan genocide to clean, aligned XML. Created seed datasets for Malagasy and Kinyarwanda from the linguistic literature and annotate them for syntactic structure. Reached out to Rwandan and Malagasy communities to find native speakers, Translate three BBC Rwanda articles to English and annotate. Translated Malagasy website articles (Lakroa and Lagazette) to English and annotate with syntactic structures Adapted Malagasy morphological transducer from Dalrymple et al and annotate several sentences based on its output. Annotated KGMC transcripts for syntactic structure (about 100 trees). Created tools for supporting consistent annotations that will work for MT researchers (tokenizers, tree validators). Crowd-sourcing for non-linguist native-speaker data collection. Active learning for focusing on most valuable missing data. Curated data releases 1.0 and 2.0 for Malagasy, Kinyarwanda, and English.

Act I: Exploratory Research Scene 2: Linguistic Core + ML Rule-based Kinyarwanda morphological analyzer. Development of semantic representation graphs for general-purpose translation. Development of probabilistic acceptors and transducers for graph structures. Tokenizer for Kinyarwanda and Malagasy. Completed formalism design (dependency to dependency MT) Investigate hand-written synchronous tree-adjoining grammar rules for Kinyarwanda. Learning syntactic structure from sparse semantic representations. Learning unsupervised morphology by modeling syntactic context. Upparse unsupervised parsing methodology based on finite-state methods and evaluated on English, German and Chinese data. Bilingual part of speech model based on feature-rich Markov random fields. Method for transferring information in supervised models for one or more resource- rich languages to an unsupervised learner for a resource-poor language, tested on part-of-speech tagging and parsin. Model for discovering multi-word, gappy expressions in monolingual and bilingual text, evaluated within a translation system Model for word alignment based on feature-rich conditional random fields

Act I: Exploratory Research Scene 3: MT Frameworks & Systems Phrase-based Malagasy and Kinyarwanda systems build for initial baseline Four end-to-end MT systems (m2e and k2e, using Hiero and syntax- based MT systems). Kinyarwanda Synchronous-grammar (SAMT) system build Implemented a hierarchical phrase-based German-English translation system that was ranked #2 in the SMT competition (after Google). This system incorporated a discriminative German parsing model Designed, implemented, and tested a new translation model based on dependencies over phrases. Explored methodology for testing hypotheses about translation systems, leading to practical recommendations for researchers in the field. Translation systems targeting Kinyarwanda and Malagasy incorporating the data developed by MURI collaborators were developed, and improvements using CRF word alignments were replicated in these new language pairs.

Our Slightly-Revised Approach Linguistic core: Universals & Specifics – Specialize core to each language pair minimally as needed – Active Learning when annotations/translations required – Unsupervised learning when possible Parallel activities: – Development of training/testing data & annotations (on targeted L’s) – Linguistic analysis (of targeted L’s) – Core linguistic engine development (on other L’s) Exploration of multiple paradigms – E.g. Dependency parsing – E.g. Finite-state transducers – Ensemble methods Build, evaluate, refine glass-box end-to-end prototypes – Requires baselines, and end-to-end MT systems

Linguistic Core Team (LL, JB, SV, JC) Linguistic Analyzers Team (NS, RB, JB) MT Systems Team (KK, DC, SV, JC) Parser, Taggers, Morph. Analyzers Hand-built Linguistic Core Triple Gold Data Triple Ungold Data MT Visualizations and logs MT Features MT Error Analysis MT Systems Inference Algorithms Data: Parallel Monolingual Elicited Related language Multi-parallel Comparable Elicitation corpus Data selection for annotation Functional Collaboration

Publications Vamshi Ambati, Stephan Vogel and Jaime Carbonell. "Collaborative Workflow for Crowdsourcing Translation”, To Appear in the 2012 ACM Conference on Computer Supported Cooperative Work, Washington, USA Vamshi Ambati, Stephan Vogel and Jaime Carbonell. "Multi-Strategy Approaches to Active Learning for Statistical Machine Translation”, Accepted to the 13th Machine Translation Summit, Xiamen, China, 2011 Vamshi Ambati, Stephan Vogel and Jaime Carbonell. "Towards Task Recommendation in Micro-Task Markets”, In the Proc. of the 3rd workshop on Human Computation, AAAI Vamshi Ambati, Sanjika Hewavitharana, Stephan Vogel and Jaime Carbonell. "Active Learning with Multiple Annotations for Comparable Data Classification Task”, In the Proc. of Building Comparable Corpora Workshop, ACL Desai Chen, Chris Dyer, Shay B. Cohen, and Noah A. Smith. Unsupervised Bilingual POS Tagging with Markov Random Fields. In Proceedings of the First Workshop on Unsupervised Learning in NLP Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. Better Hypothesis Testing for Machine Translation: Controlling for Optimizer Instability. In Proc. of ACL Shay B. Cohen, Dipanjan Das, and Noah A. Smith. Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance. In Proc. EMNLP

More Publications Chris Dyer, Kevin Gimpel, Jonathan H. Clark, and Noah A. Smith. The CMU-ARK German-English Translation System. In Proc. WMT Chris Dyer, Jonathan H. Clark, Alon Lavie, and Noah A. Smith. Unsupervised Word Alignment with Arbitrary Features. In Proceedings of ACL Kevin Gimpel and Noah A. Smith. Quasi-Synchronous Phrase Dependency Grammars for Machine Translation. In Proc. EMNLP Kevin Gimpel and Noah A. Smith. Generative Models of Monolingual and Bilingual Gappy Patterns. In Proc. WMT Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments. In Proceedings ACL Yoong Keok Lee, Aria Haghighi and Regina Barzilay. Modeling Syntactic Context Improves Morphological Segmentation. In Proc. of CoNLL, Tahira Naseem, Regina Barzilay. Using Semantic Cues to Learn Syntax. In Proc. AAAI Elias Ponvert, Jason Baldridge and Katrin Erk Simple unsupervised grammar induction from raw text with cascaded finite state models. In Proceedings of ACL Sanjika Hewavitharana, Nguyen Bach, Qin Gao, Vamshi Ambati and Stephan Vogel, “CMU Haitian Creole-English Translation System”, In Proc. WMT

Jaime Carbonell, CMU13 THANK YOU!