--Mengxue Zhang, Qingyang Li

Presentation transcript:

Chris Callison-Burch --Mengxue Zhang, Qingyang Li

Chris Callison-Burch --- Timeline
Associate Professor in the Computer and Information Science Department at the University of Pennsylvania
The Symbolic Systems Program (SSP) at Stanford University, where he studied as an undergraduate, focuses on computers and minds: artificial and natural systems that use symbols to communicate and to represent information
General Chair for ACL 2017
Secretary-Treasurer of SIGDAT (the group that organizes EMNLP)
Program Co-Chair for EMNLP 2015
Sloan Research Fellow
Received tenure in June 2017

Chris Callison-Burch --- Development
PPDB --- the paraphrase database, a resource with 169 million paraphrases
Joshua --- an open-source decoder for statistical machine translation; it uses synchronous context-free grammars and extracts linguistically informed translation rules
Moses --- an open-source toolkit for statistical machine translation
Synchronous context-free grammars: rules in these grammars apply to two languages at the same time, capturing grammatical structures that are each other's translations.
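A minimal sketch (not Joshua's actual API; the rule and the lexical fillers are invented) of what a synchronous context-free grammar rule expresses: one rule rewrites a nonterminal on the source and target sides at the same time, with co-indexed slots capturing the reordering between the two languages.

    # Toy SCFG rule: English determiner-adjective-noun order paired with a
    # determiner-noun-adjective order on the target side.
    scfg_rule = {
        "lhs": "NP",
        "source": ["DT[1]", "JJ[2]", "NN[3]"],
        "target": ["DT[1]", "NN[3]", "JJ[2]"],
    }

    def apply_rule(rule, fillers):
        """Substitute co-indexed fillers into both sides of the rule at once."""
        def fill(side, lang):
            out = []
            for sym in side:
                if sym.endswith("]"):                      # co-indexed nonterminal, e.g. "NN[3]"
                    idx = int(sym[sym.index("[") + 1:-1])
                    out.append(fillers[idx][lang])
                else:
                    out.append(sym)                        # terminal symbol
            return out
        return fill(rule["source"], 0), fill(rule["target"], 1)

    # index -> (English filler, foreign filler); toy values for illustration only
    fillers = {1: ("the", "la"), 2: ("red", "rouge"), 3: ("car", "voiture")}
    print(apply_rule(scfg_rule, fillers))
    # (['the', 'red', 'car'], ['la', 'voiture', 'rouge'])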

Research Interests: Natural Language Understanding via Paraphrasing
Method that extracts paraphrases from bilingual parallel corpora
  Paraphrasing with Bilingual Parallel Corpora, ACL 2005
  Paraphrasing and Translation, PhD thesis
Extended his bilingual pivoting methodology to syntactic representations of translation rules
  Semantically-Informed Syntactic Machine Translation, AMTA 2010
  Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation, EMNLP 2011
Then used the bilingual pivoting technique to create the paraphrase database
  PPDB: The Paraphrase Database, NAACL 2013
Made several advances to PPDB:
  Semantics: add an interpretable semantics to PPDB
    Adding Semantics to Data-Driven Paraphrasing, ACL 2015
  Domain adaptation: language is used differently in different domains; an algorithm automatically adapts paraphrases to suit a particular domain
    Domain-Specific Paraphrase Extraction, ACL 2015
  Natural language generation: problems in text simplification
    Problems in Current Text Simplification Research, TACL 2015
    Optimizing Statistical Machine Translation for Text Simplification, 2016
He developed a method that extracts paraphrases from bilingual parallel corpora by identifying equivalent English expressions that share a foreign phrase; sharing a translation ensures that their meanings are similar (see the sketch after this slide). The Joshua decoder is useful for translating between languages with different word orders: instead of pivoting over foreign phrases, it pivots over foreign synchronous context-free grammar rules.
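A minimal sketch of the bilingual pivoting idea, assuming a toy phrase table (the pivot phrase and probabilities are only illustrative): two English phrases that translate to the same foreign phrase are treated as paraphrase candidates, and the paraphrase score marginalizes over the shared pivot.

    from collections import defaultdict

    # p(foreign | english) and p(english | foreign) from a bilingual phrase table
    p_f_given_e = {"under control": {"unter kontrolle": 0.7}}
    p_e_given_f = {"unter kontrolle": {"under control": 0.6, "in check": 0.3}}

    def paraphrase_scores(e1):
        """score(e2 | e1) = sum over pivot phrases f of p(f | e1) * p(e2 | f)."""
        scores = defaultdict(float)
        for f, p_f in p_f_given_e.get(e1, {}).items():
            for e2, p_e2 in p_e_given_f.get(f, {}).items():
                if e2 != e1:                  # skip the identity "paraphrase"
                    scores[e2] += p_f * p_e2
        return dict(scores)

    print(paraphrase_scores("under control"))   # {'in check': ~0.21}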

Research Interests: Statistical Machine Translation
Goal: build statistical machine translation systems without parallel corpora
He used bilingual lexicon induction to estimate the parameters of phrase-based statistical machine translation systems
  A Comprehensive Analysis of Bilingual Lexicon Induction, Computational Linguistics 2016
  Joshua: An Open Source Toolkit for Parsing-Based Machine Translation, WMT 2009 (Joshua 2.0 through 5.0)
Combines a diverse set of monolingually derived signals of translation equivalence (see the sketch after this slide)
  Supervised Bilingual Lexicon Induction with Multiple Monolingual Signals, NAACL 2013
His goal is to go beyond simply expanding bilingual dictionaries, so that bilingual lexicon induction techniques can be used to produce full translation systems.
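A hedged sketch of the general idea of combining monolingual signals for lexicon induction; the specific signals, vectors, and weights below are toy placeholders rather than the features used in the NAACL 2013 paper.

    import math
    from difflib import SequenceMatcher

    def orthographic_sim(src_word, tgt_word):
        # string similarity: a useful signal for cognates and loanwords
        return SequenceMatcher(None, src_word, tgt_word).ratio()

    def contextual_sim(src_vec, tgt_vec):
        # cosine similarity between monolingual context vectors (toy vectors here;
        # in practice they are built from large monolingual corpora)
        dot = sum(a * b for a, b in zip(src_vec, tgt_vec))
        norm = math.sqrt(sum(a * a for a in src_vec)) * math.sqrt(sum(b * b for b in tgt_vec))
        return dot / norm if norm else 0.0

    def rank_candidates(src_word, src_vec, candidates, weights=(0.4, 0.6)):
        # `candidates` maps target words to context vectors; the weights stand in
        # for parameters a supervised model would learn from a seed dictionary
        scored = {tgt: weights[0] * orthographic_sim(src_word, tgt)
                       + weights[1] * contextual_sim(src_vec, vec)
                  for tgt, vec in candidates.items()}
        return sorted(scored.items(), key=lambda kv: -kv[1])

    print(rank_candidates("nation", [0.9, 0.1, 0.3],
                          {"nación": [0.8, 0.2, 0.3], "mesa": [0.1, 0.9, 0.2]}))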

Research Interests: Crowdsourcing
Showed that the quality of Urdu-English translations produced by non-professional translators can be made to approach the quality of professional translation at a fraction of the cost
  Crowdsourcing Translation: Professional Quality from Non-Professionals, ACL 2011
Uses crowdsourcing to create a wide range of new NLP data sets
  The Arabic Online Commentary Dataset: An Annotated Dataset of Informal Arabic with High Dialectal Content, ACL 2011
  Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing, WMT 2012
  Translations of the CALLHOME Egyptian Arabic Corpus for Conversational Speech Translation, IWSLT 2014
  ...etc.
Beyond NLP, designs tools to help crowd workers find better, higher-paying work
  Crowd-Workers: Aggregating Information Across Turkers to Help Them Find Higher Paying Work, HCOMP 2014
His third research focus is crowdsourcing. The idea of using crowdsourcing to create annotated data for NLP applications is a relatively new topic.

Important Work: Moses
Moses: Open Source Toolkit for Statistical Machine Translation, ACL 2007. Citations: 3905.
Motivation: phrase-based statistical machine translation has been dominant, but the field lacked openness.
An implementation of the data-driven approach to machine translation.
Automatically trains translation models for any language pair.
Supports multiple translation types:
  Phrase-based machine translation
  Syntax-based translation
  Factored machine translation
Supports multiple language models.
Chris's most influential work is the paper Moses: Open Source Toolkit for Statistical Machine Translation, accepted at ACL 2007; with about 3,905 citations, it is very highly cited. The motivation is that phrase-based statistical machine translation had been dominant in machine translation research, but most work in the field was in-house research, so there was a lack of openness. They therefore implemented this open-source toolkit for statistical machine translation. The toolkit can automatically train translation models for any language pair; all you need is a parallel corpus. It supports multiple translation types, including phrase-based machine translation, syntax-based translation, and factored translation, and it also supports multiple language models.

Important Work: Moses
Consists of all the components needed to preprocess data and to train the language models and the translation models.
Contains tools for tuning these models.
Uses standard external tools for some of the tasks:
  GIZA++ (Och and Ney, 2003) for word alignment
  SRILM for language modeling
Two main components:
  Training pipeline: turns raw data into a machine translation model (mainly Perl, some C++)
  Decoder: translates the source sentence into the target language (a single C++ application)
Moses consists of all the components needed to preprocess data and to train the language models and the translation models; it also contains tools for tuning these models. It uses standard external tools for some of the tasks, for example GIZA++ for word alignments and SRILM for language modeling. The two main components of Moses are the training pipeline and the decoder: the training pipeline is a collection of tools that take the raw data and turn it into a machine translation model, and the decoder translates the source sentence into the target language.
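As background (this is the standard log-linear formulation of phrase-based SMT, not text taken from the slides), the decoder searches for the target sentence that maximizes a weighted combination of feature functions $h_i(e, f)$, such as the phrase translation probabilities, the language model score, reordering, and word penalties, with weights $\lambda_i$ set by the tuning tools on held-out data:

\[
\hat{e} = \arg\max_{e} \sum_{i} \lambda_i \, h_i(e, f)
\]

where $f$ is the source sentence (or, for spoken input, a confusion network as described below).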

Important Work: Moses
Novel contributions:
Support for linguistically motivated factors, such as POS tags or lemmas.
Besides being an open-source toolkit, the paper makes its own novel contributions. The first is support for linguistically motivated factors, such as part-of-speech tags, which turn out to be informative for translation. The left picture (not reproduced here) shows a phrase table with no factors, and the right picture shows an augmented phrase table that also contains POS-tag factors.
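Illustrative only: in factored translation each surface token carries additional factors, such as its lemma and POS tag, which the models can translate or condition on separately. Moses conventionally writes factors separated by "|"; the toy annotator below simply zips pre-computed factor streams together.

    def annotate(tokens, lemmas, pos_tags):
        """Combine parallel streams of factors into pipe-separated factored tokens."""
        return ["|".join(factors) for factors in zip(tokens, lemmas, pos_tags)]

    print(annotate(["the", "houses"], ["the", "house"], ["DT", "NNS"]))
    # ['the|the|DT', 'houses|house|NNS']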

Novel contributions:
Confusion network decoding
  A weighted directed graph with the peculiarity that each path from the start node to the end node goes through all the other nodes.
  Allows multiple, ambiguous input hypotheses.
  Improves spoken language translation, which is prone to speech recognition errors.
Efficient data formats for the translation models and language models.
The second novel contribution is confusion network decoding. In spoken language translation the input may be noisy and ambiguous; to address this issue, they include confusion network decoding in Moses. A confusion network is a weighted directed graph with the peculiarity that each path from the start node to the end node goes through all the other nodes, as shown in the picture (not reproduced here). This allows multiple, ambiguous input hypotheses and therefore improves spoken language translation. The paper also introduced efficient data structures for the translation models and language models, which reduce memory use while maintaining translation speed. A toy confusion network is sketched below.
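A toy sketch of a confusion network (the words and weights are invented): a sequence of slots, each holding alternative words with weights, where every hypothesis path passes through every slot in order. This is what lets the decoder consider many ambiguous inputs compactly.

    from itertools import product

    confusion_net = [
        [("i", 0.9), ("a", 0.1)],
        [("want", 0.6), ("won't", 0.4)],
        [("to", 0.8), ("*EPS*", 0.2)],   # *EPS* marks an empty (skip) arc
        [("go", 1.0)],
    ]

    def best_paths(cn, k=3):
        """Enumerate every path through the network and return the k best hypotheses."""
        paths = []
        for combo in product(*cn):
            words = [w for w, _ in combo if w != "*EPS*"]
            score = 1.0
            for _, p in combo:
                score *= p
            paths.append((" ".join(words), score))
        return sorted(paths, key=lambda x: -x[1])[:k]

    print(best_paths(confusion_net))
    # best: ('i want to go', ~0.43), then ("i won't to go", ~0.29), ('i want go', ~0.11)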

Thanks & Q&A