MapReduce: Simplified Data Processing on Large Clusters. Hongwei Wang, Sihuizi Jin & Yajing Zhang. 2014.10.6.

Outline
- Introduction
- Programming model
- Implementation
- Refinements
- Performance
- Conclusion

1. Introduction

What is MapReduce?
- Originated at Google [OSDI'04]
- A simple, functional-style programming model
- For large-scale data processing
- Exploits large sets of commodity computers
- Executes processing in a distributed manner
- Offers high availability

Motivation
- Great demand for very large scale data processing:
  - The computations are conceptually straightforward
  - The input data is large and distributed across thousands of machines
- The issues of how to parallelize the computation, distribute the data, and handle failures obscure the originally simple computation with large amounts of complex code

Distributed Grep (pipeline): very big data -> split data -> grep -> matches -> cat -> all matches

Distributed Word Count (pipeline): very big data -> split data -> count -> merge -> merged counts

Goal
Design a new abstraction that allows us to:
- express the simple computation we are trying to perform
- hide the messy details of parallelization, fault tolerance, data distribution and load balancing in a library

2. Programming Model

Map + Reduce
- Map:
  - Accepts an input key/value pair
  - Emits intermediate key/value pairs
- Reduce:
  - Accepts an intermediate key and the list of values for that key
  - Emits output key/value pairs
- Overall data flow: very big data -> Map -> partitioning function -> Reduce -> result

A Simple Example: counting words in a large set of documents

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
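For experimentation, the same word count can be expressed as a minimal single-process Python sketch (the helper names map_fn, reduce_fn and run_mapreduce are illustrative, not part of Google's API):

    from collections import defaultdict

    def map_fn(doc_name, contents):
        # Emit an intermediate ("word", "1") pair for every word in the document.
        for word in contents.split():
            yield word, "1"

    def reduce_fn(word, counts):
        # Sum all the partial counts emitted for this word.
        yield word, str(sum(int(c) for c in counts))

    def run_mapreduce(documents):
        intermediate = defaultdict(list)
        for name, contents in documents.items():        # map phase
            for key, value in map_fn(name, contents):
                intermediate[key].append(value)
        output = {}
        for key in sorted(intermediate):                 # shuffle + reduce phase
            for k, v in reduce_fn(key, intermediate[key]):
                output[k] = v
        return output

    print(run_mapreduce({"doc1": "the quick fox", "doc2": "the lazy dog"}))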

More Examples
- Distributed Grep
  - Map: emits a line if it matches the supplied pattern
  - Reduce: identity function
- Count of URL Access Frequency
  - Map: processes logs of web page requests and emits (URL, 1)
  - Reduce: adds together all values for the same URL and emits a (URL, total count) pair
- Distributed Sort
  - Map: extracts the key from each record and emits a (key, record) pair
  - Reduce: identity function
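As an illustration of the URL access frequency example, hypothetical map/reduce functions in Python might look like the following (the log line format is an assumption; they plug into the run_mapreduce sketch above):

    def map_url(log_name, log_contents):
        # Assume each log line looks like "<timestamp> <URL> ..."; emit (URL, "1") per request.
        for line in log_contents.splitlines():
            fields = line.split()
            if len(fields) >= 2:
                yield fields[1], "1"

    def reduce_url(url, counts):
        # Total number of requests observed for this URL.
        yield url, str(sum(int(c) for c in counts))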

3. Implementation

Environment
The implementation depends on the environment:
- Dual-processor x86 machines with 2-4 GB of memory
- Commodity networking hardware, 100 Mb/s or 1 Gb/s at the machine level
- A cluster consisting of hundreds or thousands of machines
- Storage provided by inexpensive IDE disks attached directly to the machines

Execution Overview

1. Input data partitioning: the input is split into M pieces (typically 16-64 MB each), and copies of the program are started on the cluster
2. Task assignment: the master assigns map or reduce tasks to idle workers
3. Map task: parse key/value pairs from the input split and produce intermediate key/value pairs with the user's Map function

Execution Overview (continued)
4. Partitioning of intermediate pairs (by a hash function, typically hash(key) mod R); their locations are forwarded to the reduce workers by the master
5. Reduce task: read the data from the map workers, sort it by intermediate key, and group the values by key
6. Reduce function: invoked once for each group produced by the reduce task
7. When all tasks have completed, the MapReduce call returns
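The whole flow can be mimicked in one process; the sketch below (assumed names, no real distribution, fault tolerance or remote reads) shows input split into M pieces, intermediate pairs partitioned with hash(key) mod R, and each of the R partitions reduced in sorted key order:

    from collections import defaultdict

    def execute(records, map_fn, reduce_fn, M=4, R=2):
        # 1. Split the input (a list of key/value pairs) into M pieces.
        splits = [records[i::M] for i in range(M)]
        # 3./4. Run each map task and partition its output by hash(key) mod R.
        partitions = [defaultdict(list) for _ in range(R)]
        for split in splits:
            for key, value in split:
                for k, v in map_fn(key, value):
                    partitions[hash(k) % R][k].append(v)
        # 5./6. Each reduce task processes its partition in sorted key order.
        output = []
        for r in range(R):
            for k in sorted(partitions[r]):
                output.extend(reduce_fn(k, partitions[r][k]))
        return output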

Details of Map/Reduce Task

Master Data Structures
- The master keeps several data structures:
  - It stores the state (idle, in-progress, or completed) of each map and reduce task
  - It stores the identity of the worker machine for each non-idle task
- The master is the conduit through which the locations of intermediate files are propagated from map tasks to reduce tasks
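A simplified, hypothetical rendering of this bookkeeping in Python (the field names are assumptions, not the paper's actual structures):

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import List, Optional

    class State(Enum):
        IDLE = 1
        IN_PROGRESS = 2
        COMPLETED = 3

    @dataclass
    class Task:
        kind: str                                 # "map" or "reduce"
        state: State = State.IDLE
        worker: Optional[str] = None              # identity of the assigned worker machine
        # For completed map tasks: locations of the R intermediate files, which
        # the master forwards to reduce workers.
        intermediate_files: List[str] = field(default_factory=list)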

Fault Tolerance
- Worker failure
  - The master pings workers periodically
  - Any machine that does not respond is considered "dead"
  - For both map and reduce workers, any task in progress is reset and re-executed
  - For map workers, completed tasks are also reset, because their results are stored on the failed machine's local disk
- Master failure
  - Abort the entire computation
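A minimal sketch of the worker-failure rule, reusing the hypothetical Task/State types from the previous sketch: in-progress tasks of either kind go back to idle, and completed map tasks are also reset because their output lived on the failed machine's local disk:

    def handle_worker_failure(tasks, dead_worker):
        # tasks: the master's list of Task objects (see the sketch above).
        for t in tasks:
            if t.worker != dead_worker:
                continue
            if t.state == State.IN_PROGRESS:
                t.state, t.worker = State.IDLE, None      # will be re-executed
            elif t.state == State.COMPLETED and t.kind == "map":
                t.state, t.worker = State.IDLE, None      # output was on the lost local disk
                t.intermediate_files.clear()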

Locality
- Master scheduling policy:
  - Asks GFS for the locations of the replicas of the input file blocks
  - Map task inputs are typically 64 MB splits (== the GFS block size)
  - Map tasks are scheduled so that a replica of their input block is on the same or a nearby machine
- Effect: most input data is read locally and consumes no network bandwidth

Task Granularity
Choice of M and R:
- Ideally, M and R should be much larger than the number of worker machines
- There are practical bounds on M and R:
  - O(M+R) scheduling decisions
  - O(M*R) state kept in memory by the master
- Typical values: M = 200,000 and R = 5,000, using 2,000 worker machines
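To make those bounds concrete: with M = 200,000 and R = 5,000 the master makes roughly M + R = 205,000 scheduling decisions, and the O(M*R) state amounts to about 10^9 map/reduce task pairs; the paper notes this is only on the order of one byte per pair, so the whole table fits in roughly 1 GB of master memory.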

Backup Tasks
- Some "straggler" machines do not perform optimally
- Near the end of the computation, schedule redundant (backup) executions of the remaining in-progress tasks
- Whichever copy completes first "wins"

4. Refinements

Refinements
- An input reader
  - Supports reading input data in different formats
  - Supports reading records from a database or from memory
- An output writer
  - Supports producing output data in different formats

Refinements
- A partitioning function
  - Intermediate data is partitioned across the R reduce tasks using a function of the intermediate key
  - Default: hash(key) mod R
- A combiner function
  - Does partial merging of a map task's output before it is sent over the network
  - Typically the same code is used for the combiner and the reduce function
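For illustration, the default partitioner and a word-count style combiner could be sketched as follows (the combiner simply applies the reduce logic to one map task's local output before the network shuffle):

    def default_partition(key, R):
        # Default partitioning function: hash(key) mod R.
        return hash(key) % R

    def combine(local_pairs):
        # Partial merge of one map task's output, e.g. many ("the", "1") pairs
        # collapse into a single ("the", "412") before being sent over the network.
        merged = {}
        for key, value in local_pairs:
            merged[key] = merged.get(key, 0) + int(value)
        return [(k, str(v)) for k, v in merged.items()]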

Refinements
- Ordering guarantees
  - Within a partition, the intermediate key/value pairs are processed in increasing key order
  - This generates a sorted output file per partition
- Side effects
  - Tasks may produce auxiliary files as additional outputs
  - Each should be written to a temporary file and atomically renamed once complete
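The temporary-file-plus-atomic-rename idiom, sketched in Python (paths and names are placeholders; os.rename is atomic on POSIX within a single file system):

    import os, tempfile

    def write_output_atomically(final_path, lines):
        # Write the task's output to a private temporary file first, then
        # atomically rename it so a crashed or duplicated task never leaves a
        # partially written output file visible.
        dir_name = os.path.dirname(final_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=dir_name)
        with os.fdopen(fd, "w") as f:
            for line in lines:
                f.write(line + "\n")
        os.rename(tmp_path, final_path)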

Refinements
- Skipping bad records
  - The map/reduce functions might deterministically crash on particular records
  - Fixing the bug is not always possible, e.g. when it is in a third-party library
  - On an error, the worker sends a signal (with the record's sequence number) to the master
  - If the master sees more than one failure on the same record, it tells subsequent re-executions to skip that record
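A hedged sketch of the idea: the worker wraps each call to the user's map function, reports the failing record's sequence number to the master, and later attempts skip records the master has marked as bad:

    def run_map_with_skipping(map_fn, records, records_to_skip, report_failure):
        # records_to_skip: record numbers the master has marked as bad.
        # report_failure: callback that sends a record number to the master.
        for seq, (key, value) in enumerate(records):
            if seq in records_to_skip:
                continue                      # acceptable to drop a few records
            try:
                yield from map_fn(key, value)
            except Exception:
                report_failure(seq)           # master may mark seq for skipping
                raise                         # the task will be re-executed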

Refinements
- Local execution
  - Debugging problems in a distributed system can be tricky
  - An alternative implementation of the library executes all the work sequentially on the local machine
  - The computation can be limited to particular map tasks

Refinements
- Status information
  - The master exports a set of status pages for human consumption
  - Useful for diagnosing bugs
- Counters
  - Count occurrences of various events
  - The counter values are periodically propagated to the master
  - They are displayed on the status pages

Status monitor

5. Performance

Performance Boasts
- Distributed grep
  - Scans 10^10 100-byte records (~1 TB of data)
  - Searches for a rare 3-character pattern occurring in ~92,000 records
  - ~1,800 worker machines
  - ~150 seconds start to finish, including ~60 seconds of startup overhead

Performance Boasts
- Distributed sort
  - Same records and workers as above
  - About 50 lines of MapReduce user code
  - 891 seconds, including startup overhead
  - Compares well with the then-best reported result of 1,057 seconds for the TeraSort benchmark

Performance Boasts

6. Conclusion

Conclusion
- MapReduce is easy to use
- A large variety of problems are easily expressible as MapReduce computations
- Google developed a scalable implementation of MapReduce for large clusters

Thank you!