MapReduce: Simplified Data Processing on Large Clusters

Presentation transcript:

MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat, Google, Inc.
Presented by Zhiqin Chen

Motivation
Parallel applications: inverted indices, summaries of web pages, most frequent queries
Common issues:
- Parallelize the computation
- Distribute the data
- Handle failures

Overview
Data flows as key/value pairs:
- map: input = an input key/value pair; output = intermediate key/value pairs
- reduce: input = an intermediate key with its list of values; output = output key/value pairs
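Expressed as function signatures, this is roughly the following (a minimal Python sketch; the alias names MapFn and ReduceFn are mine, not from the paper or the slides):

from typing import Callable, Iterable, Iterator, Tuple

# map takes an input key/value pair and emits intermediate key/value pairs;
# reduce takes one intermediate key plus all values emitted under it and
# produces the output values for that key.
MapFn = Callable[[str, str], Iterator[Tuple[str, str]]]
ReduceFn = Callable[[str, Iterable[str]], Iterator[str]]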

Word Count Example
Input files: 1.txt "A B C", 2.txt "B B C", 3.txt "C B C", 4.txt "A A C"
key: document name, value: document contents

map(String key, String value):
  for each word w in value:
    Emit_Intermediate(w, 1);

Example - Map
Input files: 1.txt "A B C", 2.txt "B B C", 3.txt "C B C", 4.txt "A A C"

map(String key, String value):
  for each word w in value:
    Emit_Intermediate(w, 1);

Worker_1 and Worker_2 run map over their splits and write the intermediate pairs, e.g. (A,1) (B,1) (C,1) (C,1) (B,1) (C,1) (B,1) (A,1) (C,1), to their local disks.

Example - Iterator
The intermediate pairs on Worker_1 and Worker_2 are read over the LAN and grouped by key for the reduce workers, Worker_3 and Worker_4 (users don't need to write this step):
A, { 1, 1, 1 }
B, { 1, 1, 1 }
C, { 1, 1, 1, 1, 1, 1 }
key: a word, values: a list of counts

Example - Reduce
Input to Worker_3 and Worker_4: A, { 1, 1, 1 }  B, { 1, 1, 1 }  C, { 1, 1, 1, 1, 1, 1 }

reduce(String key, Iterator values):
  result = 0;
  for each v in values:
    result += v;
  Emit(result);

Output: A, 3  B, 3  C, 6
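The whole word-count pipeline above can be simulated in a single process. The following is a hedged Python sketch (the in-memory grouping stands in for the distributed shuffle; function names such as run_word_count are mine, not from the paper):

from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

def word_count_map(key: str, value: str) -> Iterator[Tuple[str, int]]:
    # key: document name, value: document contents
    for w in value.split():
        yield (w, 1)

def word_count_reduce(key: str, values: List[int]) -> int:
    # key: a word, values: a list of counts
    return sum(values)

def run_word_count(documents: Dict[str, str]) -> Dict[str, int]:
    # Group every intermediate value under its key, standing in for the
    # distributed sort/merge that a real MapReduce run performs.
    intermediate: Dict[str, List[int]] = defaultdict(list)
    for name, contents in documents.items():
        for k, v in word_count_map(name, contents):
            intermediate[k].append(v)
    return {k: word_count_reduce(k, vs) for k, vs in intermediate.items()}

if __name__ == "__main__":
    docs = {"1.txt": "A B C", "2.txt": "B B C", "3.txt": "C B C", "4.txt": "A A C"}
    print(run_word_count(docs))  # per-word counts over the four documents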

Implementation: Overview

Implementation: Split
- Split the input files into M pieces
- Start up many copies of the program on a cluster of machines
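As a rough illustration of the splitting step (a sketch under my own assumptions; the paper splits inputs into pieces of typically 16 to 64 MB):

import os
from typing import Iterator, Tuple

def split_input(path: str, split_size: int = 64 * 1024 * 1024) -> Iterator[Tuple[int, int]]:
    # Yield (offset, length) byte ranges of roughly split_size bytes each.
    # Illustrative only: a real implementation also respects record boundaries.
    total = os.path.getsize(path)
    offset = 0
    while offset < total:
        length = min(split_size, total - offset)
        yield (offset, length)
        offset += length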

Implementation: Master
- Picks idle workers and assigns them tasks: M map tasks and R reduce tasks
- Multiple tasks can be assigned to the same worker

Implementation: Map worker
- Reads its input split and parses the key/value pairs
- Passes each pair to the user's map function
- Intermediate pairs are periodically written to local disk

Implementation: Local write
- The intermediate output on local disk is partitioned into R regions
- The locations of these regions are passed back to the master
- The master forwards the locations to the reduce workers
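The paper's default partitioning function is hash(key) mod R. A minimal sketch of how a map worker might bucket its intermediate pairs into R regions (the function name and in-memory buckets are my own simplifications):

from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def partition_intermediate(pairs: Iterable[Tuple[str, int]], R: int) -> Dict[int, List[Tuple[str, int]]]:
    # Default partitioning from the paper: hash(key) mod R.
    regions: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
    for key, value in pairs:
        regions[hash(key) % R].append((key, value))
    # In a real run each region is written to a local file and its location
    # is reported back to the master.
    return regions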

Implementation: Reduce worker
- Remotely reads all of its intermediate data from the map workers' local disks
- Sorts it by the intermediate keys

Implementation: Reduce worker
- Iterates over the sorted intermediate data
- Passes each key and its list of values to the user's reduce function
- The output is appended to a final output file
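A small Python sketch of the sort-and-group step described above (in-memory only; the paper uses an external sort when the data does not fit, and the helper name is mine):

from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, List, Tuple

def sorted_groups(pairs: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, List[int]]]:
    # Sort by intermediate key, then hand each key and its list of values to reduce.
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [v for _, v in group]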

Implementation: Locality
- Network bandwidth is scarce
- The Google File System divides each file into blocks and stores several copies on different machines
- The MapReduce master schedules a map task on a machine that contains a replica of the corresponding input data, or failing that, near a replica
- As a result, most input data is read locally
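A hedged sketch of that scheduling preference (not the paper's actual scheduler; names and data structures are assumptions):

from typing import Iterable, Optional, Set

def pick_worker_for_split(replica_hosts: Iterable[str], idle_workers: Set[str]) -> Optional[str]:
    # Prefer an idle worker that already stores a replica of the input split;
    # otherwise fall back to any idle worker (ideally one on the same rack).
    for host in replica_hosts:
        if host in idle_workers:
            return host
    return next(iter(idle_workers), None)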

Implementation: Fault tolerance
- Worker failure is common: the master pings workers and reschedules incomplete tasks
- Completed map tasks are also rescheduled, because their output sits on the failed worker's local disk; completed reduce tasks are ignored, because their output is already in the global file system
- Master failure is uncommon: recover from periodic checkpoints
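A minimal sketch of that rescheduling rule (the Task class and field names are my own, not from the paper):

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str                    # "idle", "in_progress", or "completed"
    worker: Optional[str] = None  # worker the task was assigned to

def handle_worker_failure(failed_worker: str, tasks: List[Task]) -> None:
    # Map tasks on the failed worker go back to idle even if completed, since
    # their output on that worker's local disk is now unreachable. Completed
    # reduce tasks are kept; only in-progress reduce tasks are rescheduled.
    for t in tasks:
        if t.worker != failed_worker or t.state == "idle":
            continue
        if t.kind == "map" or t.state == "in_progress":
            t.state = "idle"
            t.worker = None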

Implementation: Tasks
- M map tasks and R reduce tasks, chosen to be much larger than the number of workers
- Improves dynamic load balancing and speeds up recovery
- M and R need to be tuned accordingly, e.g. 2,000 workers with M = 200,000 and R = 5,000

Implementation: Backups
- Problem: stragglers, i.e. unusually slow machines
- Solution: when the MapReduce operation is close to completion, launch backup executions of the remaining in-progress tasks
- Significantly reduces completion time: in the sort experiment, the run without backups takes 44% longer
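A sketch of when backup tasks might be launched (the threshold and data layout are assumptions for illustration, not values from the paper):

from typing import Dict, List

def schedule_backups(tasks: List[Dict[str, str]], completed_fraction: float,
                     threshold: float = 0.95) -> List[Dict[str, str]]:
    # Once the overall job is close to done, return the in-progress tasks that
    # should get a duplicate "backup" execution; whichever copy of a task
    # finishes first wins, and the other is discarded.
    if completed_fraction < threshold:
        return []
    return [t for t in tasks if t["state"] == "in_progress"]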

Performance: Experimental setup
- Measurements on two computations in which I/O is the scarce resource
- Cluster of approximately 1,800 machines
- Each machine: two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE disks, gigabit Ethernet

Performance: Grep
- Grep for a rare three-character pattern in 10^10 100-byte records (~100,000 hits)
- Large map, small reduce: M = 15,000, R = 1

Performance: Grep
- Execution time: about 150 seconds, including roughly 1 minute of startup overhead
- The overhead comes from propagating the program to all workers and from opening the 1,000 input files for the locality optimization

Performance: Sort
- Large sort based on the TeraSort benchmark: 10^10 100-byte records, about 1 TB of data
- Additional experiments: turning off backup tasks and inducing machine failures

Performance: Sort
- Normal execution: 891 seconds
- With induced machine failures: 933 seconds
- Without backup tasks: 1,283 seconds

Performance: Backups
- With backups, the overall execution pattern is similar and the overhead is minimal, while computation time is reduced
- Without backups, all but 5 reduce tasks have finished at 960 seconds, yet the stragglers do not finish until 300 seconds later (about 23% of the total time)
- The run without backups finishes at 1,283 seconds, 44% slower than the execution with backups

Performance: Sort
- Normal execution: 891 seconds
- With induced machine failures: 933 seconds
- Without backup tasks: 1,283 seconds

Performance: Failures
- 200 of the 1,746 worker processes were killed intentionally, between 200 and 300 seconds into the run
- Re-execution of their work begins immediately
- The result is only a 5% increase in total execution time

Experience
- First version released in February 2003; significant improvements in August 2003
- Extremely reusable, and it has simplified code across many applications
- Applications: large-scale machine learning, clustering problems, extraction of data or properties, large-scale graph computations

Problems
- It can be hard to express a problem in the MapReduce model (people are more familiar with SQL)
- Google's MapReduce implementation is closed-source C++; Hadoop is an open-source, Java-based rewrite
- Why not use a parallel DBMS instead?

To be continued … Q&A

Refinements
- Partitioning: users can specify how output is partitioned across the R reduce tasks/output files, e.g. partitioning URLs by host (see the sketch below)
- Ordering guarantees: within a partition, intermediate key/value pairs are processed in increasing key order
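A hedged sketch of such a custom partitioner in Python (the paper gives the idea as hash(Hostname(urlkey)) mod R; the function name here is mine):

from urllib.parse import urlparse

def partition_by_host(url_key: str, R: int) -> int:
    # All URLs from the same host land in the same reduce partition / output file.
    host = urlparse(url_key).netloc
    return hash(host) % R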

Refinements
- Combiner function: an optional step between map and reduce that partially merges intermediate data on the map worker, e.g. reducing the size of word-count data
- Example (Worker_2): six intermediate pairs, two for A, one for B and three for C, are combined locally into (A,2) (B,1) (C,3)
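For word count the combiner is essentially the same logic as reduce, run locally on the map worker. A minimal Python sketch (the function name is mine):

from collections import defaultdict
from typing import Iterable, Iterator, Tuple

def combine(pairs: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    # Partial merging before the data crosses the network: sum the counts
    # emitted by this map worker for each word.
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    yield from partial.items()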

Refinements
- Skipping bad records: the master skips records that repeatedly cause failures
- Local execution: an alternate implementation runs the whole job sequentially on one machine to ease debugging
- Counters: user-defined counters, aggregated by the master, e.g.:

Counter* uppercase;
uppercase = GetCounter("uppercase");

map(String name, String contents):
  for each word w in contents:
    if (IsCapitalized(w)):
      uppercase->Increment();
    EmitIntermediate(w, "1");