MapReduce: Simplified Data Processing on Large Clusters (2009-21146 Lim JunSeok)


MapReduce: Simplified Data Processing on Large Clusters Lim JunSeok

Contents
1. Introduction
2. Programming Model
3. Structure
4. Performance & Experience
5. Conclusion

Introduction

What is MapReduce?
- A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations
- A programming model that
  - executes processing in a distributed manner
  - exploits a large set of commodity computers
  - targets large data sets (> 1 TB)
- With an underlying runtime system that
  - parallelizes the computation across large-scale clusters of machines
  - handles machine failures
  - schedules inter-machine communication to make efficient use of the network and disks
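In the paper the user supplies just two functions with simple types: map takes an input key/value pair and emits intermediate pairs, and reduce merges all values that share an intermediate key. A minimal sketch of those signatures in Python (the names and type hints are illustrative, not the paper's actual C++ interface):

```python
from typing import Iterable, Iterator, Tuple

# map:    (k1, v1)         -> list of (k2, v2)
# reduce: (k2, list of v2) -> list of output values
def map_fn(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """User-defined: emit intermediate key/value pairs for one input record."""
    raise NotImplementedError

def reduce_fn(key: str, values: Iterable[int]) -> Iterator[int]:
    """User-defined: merge all intermediate values that share a key."""
    raise NotImplementedError
```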

Motivation  Want to process lots of data( > 1TB)  E.g.  Raw data: crawled documents, Web request logs, …  Derived data: inverted indices, summaries of the number of pages, a set of most frequent queries in a given day.  Want to parallelize across hundreds/thousands of CPUs  And, want to make these easy Google Data Centers – File System DistributedThe Digital Universe

Motivation  Application: Sifting through large amounts of data  Used for  Generating the Google search index  Clustering problems for Google News and Froogle products  Extraction of data used to produce reports of popular queries  Large scale graph computation  Large scale machine learning  … Google SearchPageRankMachine learning 6

Motivation  Platform: clusters of inexpensive machines  Commodity computers(15,000 Machines in 2003)  Scale to large clusters: thousands of machines  Data distributed and replicated across machines of the cluster  Recover from machine failure  Hadoop, Google File System Hadoop Google File System 7

Programming Model

MapReduce Programming Model
(Diagram: Map, partitioning function, Reduce)

MapReduce Programming Model  Map phase  Local computation  Process each record independently and locally  Reduce phase  Aggregate the filtered output Local Storage Map Reduce Result Commodity computers 10

Example: Word Counting
- File 1: Hello World Bye SQL
- File 2: Hello Map Bye Reduce
(Diagram: Map procedure, Reduce procedure, partitioning function)
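The same word-count example written against the toy driver sketched earlier (a transliteration of the paper's pseudocode; emitting integer counts rather than strings is a simplification):

```python
def wc_map(doc_name, contents):
    # key: document name, value: document contents
    for word in contents.split():
        yield word, 1

def wc_reduce(word, counts):
    # key: a word, values: all the counts emitted for that word
    yield sum(counts)

docs = [("File 1", "Hello World Bye SQL"),
        ("File 2", "Hello Map Bye Reduce")]
# run_mapreduce(docs, wc_map, wc_reduce)
# -> {'Hello': [2], 'World': [1], 'Bye': [2], 'SQL': [1], 'Map': [1], 'Reduce': [1]}
```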

Example: PageRank  PageRank review:  Link analysis algorithm 12

Example: PageRank  Key ideas for Map Reduce  RageRank calculation only depends on the PageRank values of previous iteration  PageRank calculation of each Web pages can be processed in parallel  Algorithm:  Map: Provide each page’s PageRank ‘fragments’ to the links  Reduce: Sum up the PageRank fragments for each page 13

Example: PageRank  Key ideas for Map Reduce 14

Example: PageRank  PageRank calculation with 4 pages 15

Example: PageRank  Map phase: Provide each page’s PageRank ‘fragments’ to the links PageRank fragment computation of page 1 PageRank fragment computation of page 2 16

Example: PageRank  Map phase: Provide each page’s PageRank ‘fragments’ to the links PageRank fragment computation of page 3PageRank fragment computation of page 4 17

Example: PageRank  Reduce phase: Sum up the PageRank fragments for each page 18

Structure

Execution Overview
(1) Split the input files into M pieces of 16-64 MB each, then start many copies of the program
(2) The master is special: the rest are workers that are assigned work by the master
  - M map tasks and R reduce tasks
(3) Map phase
  - The assigned worker reads its input split
  - Parses the input data into key/value pairs
  - Produces intermediate key/value pairs by applying the user's Map function

Execution Overview
(4) Buffered pairs are written to local disk, partitioned into R regions by the partitioning function
  - The locations are passed back to the master
  - The master forwards these locations to the reduce workers
(5) Reduce phase 1: read and sort
  - Each reduce worker reads the intermediate data for its partition
  - Sorts the intermediate key/value pairs to group the data by key
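The paper's default partitioning function is simply a hash of the intermediate key modulo R; a tiny sketch (a stable hash is used so a given key always lands in the same reduce partition):

```python
import hashlib

def partition(key: str, R: int) -> int:
    # Default partitioning: hash(key) mod R.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % R
```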

Execution Overview
(6) Reduce phase 2: reduce function
  - Iterate over the sorted intermediate data and pass each key and its set of values to the user's Reduce function
  - The output is appended to a final output file for this reduce partition
(7) Return to user code
  - When all map and reduce tasks have completed, the master wakes up the user program
  - The MapReduce call returns back to the user code

Failure Tolerance  Handled via re-execution: worker failure  Failure detection: heartbeat  The master pings every worker periodically  Handling Failure: re-execution  Map task:  Re-execute completed and in-progress map tasks since map tasks are performed in the local  Reset the state of map tasks and re-schedule  Reduce tasks  Re-execute in-progress map tasks since the data is stored in local  Completed reduce tasks do NOT need to be re-executed  The results are stored in global file system 23

Failure Tolerance  Master failure:  Job state is checkpointed to global file system  New master recovers and continues the tasks from checkpoint  Robust to large-scale worker failure:  Simply re-execute the tasks!  Simply make new masters!  E.g.  Lost 1600 of 1800 machines once, but finished fine. 24

Locality  Network bandwidth is a relatively scarce resource  Input data is stored on the local disks of the machines  GFS divides each file into 64MB blocks  Store several copies of each block on different machines  Local computation:  Master takes the information of location of input data’s replica  Map task is performed in the local disk that contains the replica of the input data  If it fails, master schedules the map task near a replica  E.g.: worker on the same network switch  Most input data is read locally and consumes no network bandwidth 25

Task Granularity

Backup Tasks  Slow workers significantly lengthen completion time  Other jobs consuming resources on machine  Bad disks with soft errors  Data transfer very slowly  Weird things  Processor cashes disabled  Solution: Near end of phase, spawn backup copies of tasks  Whichever, one finishes first wins  As a result, job completion time dramatically shortened  E.g. 44% longer to complete if backup task mechanism is disabled 27

Performance & Experience

Performance  Experiment setting  1,800 machines  4 GB of memory  Dual-processor 2 GHz Xeons with Hyperthreading  Dual 160 GB IDE disks  Gigabit Ethernet per machine  Approximately Gbps of aggregate bandwith 29

Performance  MR_Grep: Grep task with MapReduce  Grep: search relatively rare three-character pattern through 1 terabyte  80 sec to hit zero  Computation peaks at over 30GB/s when 1764 workers are assigned  Locality optimization helps  Without this, rack switches would limit to 10GB/s Data transfer rate over time 30

Performance  MR_Sort: Sorting task with MapReduce  Sort: sort 1 terabyte of 100 byte records  Takes about 14 min.  Input rate is higher than the shuffle rate and the output rate; locality  Shuffle rate is higher than output rate  Output phase writes two copies for reliability 31

Performance  MR_Sort: Backup task and failure tolerance  Backup tasks reduce job completion time significantly  System deal well with failures 32

Experience  Large-scale indexing  MapReduce used for the Google Web search service  As a results,  The indexing code is simpler, smaller, and easier to understand  Performance is good enough  Locality makes it easy to change the indexing process  A few months  a few days  MapReduce takes care of failures, slow machines  Easy to make indexing faster by adding more machines 33

Experience
- The number of MapReduce instances grows significantly over time
  - 2003/02: first version
  - 2004/09: almost 900
  - 2006/03: about 4,000
  - 2007/01: over 6,000
(Figure: MapReduce instances over time)

Experience
- New MapReduce programs per month
  - The number of new MapReduce programs increases continuously
(Figure: new MapReduce programs per month)

Experience  MapReduce statistics for different months Aug. ‘04Mar. ‘06Sep. ‘07 Number of jobs(1000s) ,217 Avg. completion time (secs) Machine years used 2172,00211,081 Map input data (TB) 3,28852,254403,152 Map output data (TB) 7586,74334,774 Reduce output data(TB) 1932,97014,018 Avg. machines per job Unique implementation Map Reduce

Conclusion

Is every task suitable for MapReduce?
- NOT every task is suitable for MapReduce
- Suitable if…
  - You have a cluster, and computation can be done locally
  - You are working with a large dataset
  - You are working with independent data
  - The information to share across the cluster is small
  - E.g. word count, grep, K-means clustering, PageRank
- NOT suitable if…
  - The data cannot be processed independently
  - The computation cannot be cast into Map and Reduce
  - The information to share across the cluster is large: exponential or even linear in the input size

Is it a trend? Really?
- Percentage of matching job postings
  - SQL: 4%
  - MapReduce: 0……%

Conclusion  Focus on problem:  let library deal with messy details  Automatic parallelization and distribution  MapReduce has proven to be a useful abstraction  MapReduce Simplifies large-scale computations at Google  Functional programming paradigm can be applied to large- scale application 40

EOD