Large Scale Machine Translation Architectures Qin Gao

Outline
Typical problems in machine translation
A programming model for machine translation: MapReduce
Required system components
Supporting software
◦ Distributed streaming data storage system
◦ Distributed structured data storage system
Integrating: how to make a fully distributed system

Why large-scale MT?
We need more data... but more data brings problems that a single machine cannot handle.

Some representative MT problems
Counting events in corpora
◦ N-gram counts
Sorting
◦ Phrase table extraction
Preprocessing data
◦ Parsing, tokenizing, etc.
Iterative optimization
◦ GIZA++ (all EM algorithms)

Characteristics of different tasks
Counting events in corpora
◦ Extract knowledge from data
Sorting
◦ Process data; the knowledge is inside the data
Preprocessing data
◦ Process data; requires external knowledge
Iterative optimization
◦ In each iteration, process the data using existing knowledge, then update the knowledge

Components required for large-scale MT
◦ Stream data (the data)
◦ Structured knowledge (the knowledge)
◦ Processor

Problems for each component
Stream data
◦ As the amount of data grows, even a single complete pass over it becomes infeasible.
Processor
◦ A single processor's computation power is not enough.
Knowledge
◦ The tables are too large to fit into memory.
◦ Cache-based or distributed knowledge bases suffer from low speed.

Make it simple: what is the underlying problem?
We have a huge cake, and we want to cut it into pieces and eat them.
Different cases:
◦ We just need to eat the cake.
◦ We also want to count how many peanuts are inside the cake.
◦ (Sometimes) we have only one fork!

Parallelization
[diagram: splitting data and knowledge across parallel workers]

Solutions
Large-scale distributed processing
◦ MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. Communications of the ACM, vol. 51, no. 1, 2008.
Handling huge streaming data
◦ The Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003.
Handling structured data
◦ Large Language Models in Machine Translation. Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
◦ Bigtable: A Distributed Storage System for Structured Data. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006.

MapReduce
MapReduce can refer to:
◦ A programming model for massive, unordered, streaming data processing tasks (MUD)
◦ A supporting software environment implemented by Google Inc.
Alternative implementation:
◦ Hadoop, by the Apache Foundation

MapReduce programming model
The computation is abstracted into two functions:
◦ Map
◦ Reduce
The user is responsible for implementing the Map and Reduce functions; the supporting software takes care of executing them.
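As a concrete, if toy, illustration, here is a minimal in-memory driver in Python. The names and the single-process simulation are illustrative only, not any real framework's API; the later examples plug into this `run_mapreduce` helper.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate one MapReduce job in memory (single process).

    records   -- iterable of (key, value) input pairs
    map_fn    -- (key, value) -> iterable of intermediate (key, value) pairs
    reduce_fn -- (key, list_of_values) -> one result
    """
    # Map phase: apply map_fn to every input pair.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle/sort phase: group intermediate pairs by key. A real
    # framework does this with a distributed sort across the network.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reduce_fn call per distinct intermediate key.
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(intermediate, key=itemgetter(0))]
```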

Representation of data
The streaming data is abstracted as a sequence of key/value pairs.
Example:
◦ (sentence_id : sentence_content)

Map function
The Map function takes one input key/value pair and outputs a set of intermediate key/value pairs.
[diagram: input (key, value) pairs flowing through Map() into intermediate (key, value) pairs]

Reduce function
The Reduce function accepts one intermediate key and the set of intermediate values for that key, and produces the result.
[diagram: all values for each intermediate key grouped and fed through Reduce() to produce the result]

The architecture of MapReduce
[diagram: Map function, distributed sort, Reduce function]

Benefits of MapReduce
◦ Automatic data splitting
◦ Fault tolerance
◦ High-throughput computing that uses the nodes efficiently
◦ Most important: simplicity. You just need to convert your algorithm to the MapReduce model.

Requirements for expressing an algorithm in MapReduce
Process unordered data
◦ The data must be unordered: no matter in what order the data is processed, the result should be the same.
Produce independent intermediate keys
◦ The Reduce function cannot see the values of other keys.

Example
Distributed word count (1)
◦ Input key: word
◦ Input value: 1
◦ Intermediate key: constant
◦ Intermediate value: 1
◦ Reduce(): sum all intermediate values
Distributed word count (2), sketched below
◦ Input key: document/sentence ID
◦ Input value: document/sentence content
◦ Intermediate key: constant
◦ Intermediate value: number of words in the document/sentence
◦ Reduce(): sum all intermediate values
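A sketch of variant (2), plugging into the toy `run_mapreduce` driver above. The constant intermediate key means a single Reduce call sees every per-sentence count:

```python
# Variant (2): total word count. The constant intermediate key routes
# every per-sentence count to the same Reduce call.
def count_map(sentence_id, sentence):
    yield ("TOTAL", len(sentence.split()))

def count_reduce(key, values):
    return (key, sum(values))

corpus = [(1, "we need more data"), (2, "but more data is more work")]
print(run_mapreduce(corpus, count_map, count_reduce))
# -> [('TOTAL', 10)]
```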

Example 2
Distributed unigram count
◦ Input key: document/sentence ID
◦ Input value: document/sentence content
◦ Intermediate key: word
◦ Intermediate value: number of occurrences of the word in the document/sentence
◦ Reduce(): sum all intermediate values
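Only the Map function changes from the previous sketch: the word itself becomes the intermediate key, so each Reduce call totals one word. Again illustrative, reusing `run_mapreduce` and `count_reduce` from above:

```python
from collections import Counter

# Unigram count: emit (word, local_count) per sentence; the shuffle
# groups by word, so each Reduce call sums one word's global count.
def unigram_map(sentence_id, sentence):
    for word, n in Counter(sentence.split()).items():
        yield (word, n)

print(run_mapreduce([(1, "more data more"), (2, "data wins")],
                    unigram_map, count_reduce))
# -> [('data', 2), ('more', 2), ('wins', 1)]
```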

Example 3
Distributed sort
◦ Input key: entry key
◦ Input value: entry content
◦ Intermediate key: entry key (modification may be needed for ascending/descending order)
◦ Intermediate value: entry content
◦ Reduce(): output all the entry content
This makes use of the framework's built-in sorting functionality; a sketch follows.
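Map and Reduce are essentially identity functions here; the shuffle phase does the real work, since it already delivers intermediate keys in sorted order:

```python
# Distributed sort: the shuffle phase orders the keys, so Map and
# Reduce just pass data through.
def sort_map(entry_key, entry):
    yield (entry_key, entry)   # e.g. negate numeric keys for descending order

def sort_reduce(entry_key, entries):
    return (entry_key, entries)

print(run_mapreduce([("b", 2), ("a", 1), ("c", 3)], sort_map, sort_reduce))
# -> [('a', [1]), ('b', [2]), ('c', [3])]
```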

Supporting MapReduce: distributed storage
A reminder of what we are dealing with in MapReduce:
◦ Massive, unordered, streaming data
Motivation:
◦ We need to store a large amount of data
◦ Make use of the storage on all the nodes
◦ Automatic replication: fault tolerant, and it avoids hot spots because a client can read from many servers
Examples: Google FS and Hadoop FS (HDFS)

Design principles of Google FS
Optimize for a special workload:
◦ Large streaming reads, small random reads
◦ Large streaming writes, rare modification
Support concurrent appending
◦ It effectively assumes the data are unordered
High sustained bandwidth is more important than low latency; fast response time is not important
Fault tolerance

Google FS architecture
◦ Optimized for large streaming reads and large, concurrent writes
◦ Small random reads/writes are also supported, but not optimized
◦ Appending to existing files is allowed
◦ Files are split into chunks and stored on several chunkservers
◦ A master is responsible for storing and answering queries about chunk metadata
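Schematically, a read involves the master only for metadata; the bulk bytes flow directly from a chunkserver. The client objects below are hypothetical stand-ins paraphrasing the paper's protocol, not the real interface:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # GFS chunks are 64 MB

def gfs_read(master, path, offset, length):
    """Sketch of the GFS read path (hypothetical client objects)."""
    chunk_index = offset // CHUNK_SIZE
    # 1. Metadata lookup: the master maps (file, chunk index) to a
    #    chunk handle plus the chunkservers holding its replicas.
    handle, replicas = master.lookup(path, chunk_index)
    # 2. Data transfer: read the bytes from one replica directly,
    #    keeping the master off the data path.
    return replicas[0].read(handle, offset % CHUNK_SIZE, length)
```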

Google FS architecture
[architecture diagram]

Replication
When a chunk is frequently or simultaneously read, the node serving it may become overloaded or fail, and a fault on one node can make the file unusable.
Solution: store each chunk on multiple machines.
The number of replicas of each chunk is the replication factor.

HDFS
HDFS shares the design principles of Google FS.
Write-once-read-many: a file can only be written once; even appending is not allowed.
"Moving computation is cheaper than moving data."

Are we done?
No. There are still problems with the existing architecture.

We are good at dealing with data.
What about knowledge, i.e., structured data?
What if the knowledge is HUGE?

A good example: GIZA
A typical EM algorithm.
[flowchart: word alignment and count collection for each sentence; when no sentences remain, normalize the counts; repeat while more iterations remain]

When parallelized, it seems to be a perfect MapReduce application.
[flowchart: the word-alignment/count-collection loop replicated across cluster nodes, feeding a shared count-normalization step each iteration]

However:
[dataflow diagram: a large parallel corpus is split into corpus chunks (Map); per-chunk count tables are combined into one count table (Reduce) and renormalized into a statistical lexicon, which must be held in memory and redistributed to every worker for the next iteration]
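To make that dataflow concrete, here is a toy single-process rendering of one such iteration, in the style of IBM Model 1. This is an illustrative reconstruction, not GIZA++ code; `t_table` maps (source, target) word pairs to probabilities:

```python
from collections import defaultdict

def em_iteration(bitext, t_table):
    """One EM iteration of an IBM Model 1 style aligner as map + reduce."""
    # "Map" phase: every worker loads the full t_table, aligns its share
    # of sentence pairs, and emits fractional counts. That shared table
    # is exactly the bottleneck discussed on the next slide.
    counts = defaultdict(float)
    for src, tgt in bitext:
        for t in tgt:
            z = sum(t_table[(s, t)] for s in src)        # normalizer
            for s in src:
                counts[(s, t)] += t_table[(s, t)] / z    # emit ((s, t), c)
    # "Reduce" phase: sum counts per pair, then renormalize per source
    # word to produce the t_table for the next iteration.
    totals = defaultdict(float)
    for (s, t), c in counts.items():
        totals[s] += c
    return {(s, t): c / totals[s] for (s, t), c in counts.items()}
```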

Huge tables
The lexicon probability table (T-table) can reach 3 GB in early stages.
As the number of workers increases, every worker needs to load this 3 GB file!
And every node needs 3 GB+ of memory. Do we need a cluster of supercomputers?

Another example: decoding
Consider language models: what can we do if the language model grows to several TBs?
We need a storage/query mechanism for large, structured data.
Considerations:
◦ Distributed storage
◦ Fast access: the network has high latency

Google Language Model
Storage:
◦ Central or distributed storage
How to deal with latency?
◦ Modify the decoder to collect a number of queries and send them in one batch.
This is a specific application; we still need something more general.
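The batching idea, in sketch form. The `lm_server` object and its `batch_lookup` method are hypothetical stand-ins for an RPC client:

```python
def score_hypotheses(hypotheses, lm_server, order=5):
    """Batch LM queries: one network round trip instead of one per n-gram."""
    # Collect every n-gram the current search step needs...
    ngrams = {tuple(h.split()[-order:]) for h in hypotheses}
    # ...then query them all at once, amortizing the network latency.
    probs = lm_server.batch_lookup(sorted(ngrams))
    return {h: probs[tuple(h.split()[-order:])] for h in hypotheses}
```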

Again, made in Google: Bigtable
◦ Specially optimized for structured data
◦ Serving many applications now
◦ It is not a complete database
Definition:
◦ A Bigtable is a sparse, distributed, persistent, multi-dimensional, sorted map.

Data model in Bigtable
A four-dimensional table:
◦ Row
◦ Column family
◦ Column
◦ Timestamp
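The sorted-map definition can be pictured as nested maps. The rows below rework the Webtable example from the Bigtable paper; the timestamps are made up:

```python
# (row, column family, column qualifier, timestamp) -> byte string.
# Rows are kept sorted by key, which is what makes range scans cheap.
webtable = {
    "com.cnn.www": {                      # row key (reversed URL)
        "contents": {                     # column family
            "": {                         # column qualifier (empty here)
                1234567892: b"<html>v3",  # newest version
                1234567891: b"<html>v2",  # older versions kept by timestamp
            },
        },
        "anchor": {                       # family "anchor": incoming links
            "cnnsi.com": {1234567890: b"CNN"},   # anchor text of one link
        },
    },
}
```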

Distributed storage unit: the tablet
A tablet consists of a range of rows.
Tablets can be stored on different nodes and served by different servers.
Concurrent reads of multiple rows can therefore be fast.

Random access unit: the column family
Each tablet is a string-to-string map (not stated outright, but the API shows it).
At the column-family level, the index is loaded into memory, so fast random access is possible.
The set of column families should be fixed.

Tables inside a table: column and timestamp
A column can be any arbitrary string value; a timestamp is an integer; a value is a byte array.
So it is, in effect, a table of tables.

Performance
Measured as the number of 1000-byte values read/written per second.
[benchmark table]
What is striking:
◦ Effective I/O for random reads (from GFS) is more than 100 MB/second
◦ Effective I/O for random reads from memory is more than 3 GB/second

An example: the phrase table
◦ Row: first bigram/trigram of the source phrase
◦ Column family: length of the source phrase, or some hash of the remaining part of the source phrase
◦ Column: remaining part of the source phrase
◦ Value: all the phrase pairs for the source phrase
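A sketch of that key scheme. The exact splitting is illustrative; the point is that the row prefix spreads source phrases across tablets while the family set stays small and fixed:

```python
def phrase_table_key(source_phrase):
    """Map a source phrase to a (row, column family, column) triple."""
    words = source_phrase.split()
    row = " ".join(words[:2])        # first bigram: spreads load across tablets
    family = "len%d" % len(words)    # small, fixed set of column families
    column = " ".join(words[2:])     # rest of the phrase disambiguates in the row
    return row, family, column

print(phrase_table_key("the big red dog"))
# -> ('the big', 'len4', 'red dog')
```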

Benefit
Different source phrases are served by different servers, so the load is balanced and reads can run concurrently, much faster.
Filtering the phrase table before decoding also becomes much more efficient.

Another example: GIZA++
The lexicon table:
◦ Row: source word ID
◦ Column family: nothing
◦ Column: target word ID
◦ Value: the probability value
With a simple local cache, table loading can be extremely efficient compared to the current implementation.
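A sketch of that cache. The `bigtable` client object and its `read_row` method are hypothetical:

```python
class CachedLexicon:
    """Read-through cache over a Bigtable-backed lexicon (sketch)."""

    def __init__(self, bigtable):
        self.bigtable = bigtable
        self.cache = {}                  # source_id -> {target_id: prob}

    def prob(self, source_id, target_id):
        # Fetch a source word's row only on first use, so each worker
        # pulls just the slice of the T-table it actually needs instead
        # of loading the whole multi-GB file.
        if source_id not in self.cache:
            self.cache[source_id] = self.bigtable.read_row(str(source_id))
        return self.cache[source_id].get(target_id, 0.0)
```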

Conclusion
Strangely, this talk is all about how Google does it.
A useful framework for distributed MT systems requires three components:
◦ MapReduce software
◦ A distributed streaming data storage system
◦ A distributed structured data storage system

Open-source alternatives
◦ MapReduce library → Hadoop
◦ Google FS → Hadoop FS (HDFS)
◦ Bigtable → Hypertable

THANK YOU! Qin Gao, LTI, CMU