Large Language Models in Machine Translation. Conference on Empirical Methods in Natural Language Processing (EMNLP) 2007. Presenter: 郝柏翰, 2013/06/04. Authors: Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, Jeffrey Dean (Google, Inc.).

Presentation transcript:

Large Language Models in Machine Translation. Conference on Empirical Methods in Natural Language Processing (EMNLP) 2007. Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, Jeffrey Dean (Google, Inc.). Presenter: 郝柏翰, 2013/06/04.

Outline: Introduction, Stupid Backoff, Experiments, Conclusion.

Introduction. This paper reports on the benefits of large-scale statistical language modeling in machine translation. The authors propose a distributed infrastructure used to train on up to 2 trillion tokens, resulting in language models with up to 300 billion n-grams. The infrastructure is capable of providing smoothed probabilities for fast, single-pass decoding. They also introduce a new smoothing method, dubbed Stupid Backoff, that is inexpensive to train on large data sets and approaches the quality of Kneser-Ney smoothing as the amount of training data increases.

Introduction (continued).

Questions that arise in this context include:
1. How might one build a language model that allows scaling to very large amounts of training data?
2. How much does translation performance improve as the size of the language model increases?
3. Is there a point of diminishing returns in performance as a function of language model size?
This paper proposes one possible answer to the first question, explores the second by providing learning curves in the context of a particular statistical machine translation system, and hints that the third may yet be some time in answering.

Stupid Backoff. State-of-the-art smoothing uses variations of context-dependent backoff, with the scheme shown below. State-of-the-art techniques require additional, more expensive steps: at runtime, the client needs to additionally request up to 4 backoff factors for each 5-gram requested from the servers, thereby multiplying network traffic.
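The equation referred to above did not survive the transcript (the slide shows it as an image). In standard notation (a reconstruction of the generic scheme, not a verbatim copy of the slide), context-dependent backoff has the form

\[
P(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\rho(w_i \mid w_{i-k+1}^{i-1}) & \text{if } (w_{i-k+1}^{i}) \text{ is found,}\\
\gamma(w_{i-k+1}^{i-1})\, P(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise,}
\end{cases}
\]

where \(\rho(\cdot)\) is a discounted probability and \(\gamma(\cdot)\) is a context-dependent backoff weight; Kneser-Ney smoothing is one particular way of choosing \(\rho\) and \(\gamma\).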

Stupid Backoff. We introduce a similar but simpler scheme, named Stupid Backoff. The main difference is that we do not apply any discounting and instead directly use the relative frequencies, as in the sketch below.
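The formula itself is missing from the transcript; as defined in the paper, the Stupid Backoff score is

\[
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0,\\
\alpha\, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise,}
\end{cases}
\]

with the recursion ending at unigrams, \(S(w_i) = f(w_i)/N\), where \(f\) denotes raw counts, \(N\) is the size of the training corpus, and a single fixed back-off factor \(\alpha = 0.4\) is used. \(S\) is written as a score rather than a probability because it is not normalized.

A minimal Python sketch of this scoring rule (a toy in-memory version for illustration, not the paper's distributed implementation; the corpus and helper names are made up):

```python
from collections import Counter

ALPHA = 0.4  # fixed back-off factor; the paper reports 0.4 working well

def build_counts(tokens, max_order=3):
    """Count all n-grams up to max_order in a token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def stupid_backoff(ngram, counts, total_tokens, alpha=ALPHA):
    """Unnormalized Stupid Backoff score S for an n-gram given as a tuple of tokens."""
    if len(ngram) == 1:                                # recursion ends at unigrams: f(w)/N
        return counts[ngram] / total_tokens
    if counts[ngram] > 0:
        return counts[ngram] / counts[ngram[:-1]]      # relative frequency, no discounting
    return alpha * stupid_backoff(ngram[1:], counts, total_tokens, alpha)

corpus = "the cat sat on the mat because the cat was tired".split()
counts = build_counts(corpus, max_order=3)
print(stupid_backoff(("the", "cat", "sat"), counts, len(corpus)))  # seen trigram: 1/2
print(stupid_backoff(("the", "dog", "sat"), counts, len(corpus)))  # unseen: backs off to the unigram "sat"
```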

Distributed Application. Our goal is to use distributed language models integrated into the first pass of a decoder. The decoder accesses n-grams whenever necessary. This is inefficient in a distributed system because network latency causes a constant overhead on the order of milliseconds, whereas onboard memory is around 10,000 times faster. We therefore implemented a new decoder architecture: the decoder first queues some number of requests, e.g. 1,000 or 10,000 n-grams, and then sends them together to the servers, thereby exploiting the fact that network requests with large numbers of n-grams take roughly the same time to complete as requests with single n-grams.

Distributed Application When the scores are returned, the decoder re-visits all of these tentative hypotheses, assigns scores, and re-prunes the search graph. 9 When using a distributed language model, the decoder first tentatively extends all current hypotheses, taking note of which n-grams are required to score them. These are queued up for transmission as a batch request.

Experiments, Data Sets. We trained 5-gram language models on amounts of text varying from 13 million to 2 trillion tokens. The data is divided into four sets:
– target: the English side of Arabic-English parallel data provided by LDC (237 million tokens).
– ldcnews: a concatenation of several English news data sets provided by LDC (5 billion tokens).
– webnews: data collected over several years, up to December 2005, from web pages containing predominantly English news articles (31 billion tokens).
– web: general web data, collected in January 2006 (2 trillion tokens).
For testing we use the "NIST" part of the 2006 Arabic-English NIST MT evaluation set.

Experiments, Size of the Language Models. Figure 3 shows the number of n-grams for language models trained on 13 million to 2 trillion tokens; both axes are on a logarithmic scale. The annotation x1.8/x2 means that the number of n-grams grows by a factor of 1.8 each time we double the amount of training data.
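As an illustrative back-of-the-envelope reading of that annotation (not a calculation from the paper): if each doubling of the training data multiplies the number of n-grams by roughly 1.8, then the number of n-grams grows as a power of the data size \(T\),

\[
\#\text{n-grams}(T) \;\propto\; T^{\log_2 1.8} \approx T^{0.85},
\]

i.e. sublinearly: a 1000-fold increase in data would multiply the number of distinct n-grams by roughly \(1000^{0.85} \approx 350\) rather than by 1000.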

Experiments, Size of the Language Models. Table 2 shows sizes and approximate training times when training on the full target, webnews, and web data sets.

Experiments, Perplexity and n-Gram Coverage. Figure 4 shows perplexities for models with Kneser-Ney smoothing. Note that the perplexities of the different language models are not directly comparable because they use different vocabularies. Perplexities cannot be calculated for language models with Stupid Backoff because their scores are not normalized probabilities. To nevertheless get an indication of potential quality improvements with increasing training size, we looked at 5-gram coverage instead.
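For reference (added here for context, not taken from the slide), perplexity on a test set \(w_1, \dots, w_N\) is

\[
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\ln P(w_i \mid w_{i-n+1}^{i-1})\right),
\]

which is only well defined when the conditional scores sum to one over the vocabulary; Stupid Backoff scores do not, which is why 5-gram coverage serves as a proxy.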

Experiments, Machine Translation Results. We use a state-of-the-art machine translation system for translating from Arabic to English. System scores are reported as BLEU and score differences in BLEU points (BP), where 1 BP = 0.01 BLEU. On large corpora, the gap between Kneser-Ney smoothing and Stupid Backoff narrows.

Conclusion. A distributed infrastructure has been described to train and apply large-scale language models to machine translation. The technique is made efficient by judicious batching of score requests by the decoder in a server-client architecture. Significantly, the authors found that translation quality, as indicated by BLEU score, continues to improve with increasing language model size, even at the largest sizes considered.