Advanced Topics: MapReduce
ECE 454 Computer Systems Programming
Cristiana Amza
Topics: Reductions Implemented in Distributed Frameworks; Distributed Key-Value Stores; Hadoop Distributed File System

– 2 – Motivation: Big Data Analytics
(2012) Average Searches Per Day: 5,134,000,000

– 3 – Motivation: Big Data Analytics
Process lots of data: Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data in parallel; you need a distributed system to store and process it in parallel.
Parallel programming? Threading is hard!
Communication: how do you facilitate communication between nodes?
Scaling: how do you scale to more machines?
Failures: how do you handle machine failures?

– 4 – MapReduce
MapReduce [OSDI'04] provides:
Automatic parallelization and distribution
I/O scheduling
Load balancing
Network and data transfer optimization
Fault tolerance: handling of machine failures
Need more power? Scale out, not up: use a large number of commodity servers rather than a few high-end specialized servers.
Apache Hadoop: open-source implementation of MapReduce.

– 5 – Typical problem solved by MapReduce
Read a lot of data
Map: extract something you care about from each record
Shuffle and sort
Reduce: aggregate, summarize, filter, or transform
Write the results
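Conceptually (as in the OSDI'04 paper), the programmer supplies only the two functions below; the framework groups all intermediate values that share a key and hands them to reduce:

    map    (k1, v1)        -> list(k2, v2)
    reduce (k2, list(v2))  -> list(v2)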

– 6 – MapReduce workflow
[Diagram: Input Data (Split 0, Split 1, Split 2) is read by Map workers, which extract something you care about from each record; intermediate results are written to local disk, then remotely read and sorted by Reduce workers, which aggregate, summarize, filter, or transform and write Output File 0 and Output File 1.]

– 7 – Mappers and Reducers
Need to handle more data? Just add more mappers/reducers!
No need to write multithreaded code: mappers and reducers are typically single-threaded and deterministic.
Determinism allows failed jobs to be restarted.
Mappers/reducers run entirely independently of each other; in Hadoop, they run in separate JVMs.

– 8 – Example: Word Count
(2012) Average Searches Per Day: 5,134,000,000. With 1,000 nodes, each node processes 5,134,000 queries.

– 9 – Mapper
Reads in an input <key, value> pair and outputs a <key, value> pair.
Let's count the number of occurrences of each word in user queries (or tweets/blogs).
The input to the mapper will be a <key, line of text> pair; the output would be a <word, 1> pair for every word in that line.
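As a concrete sketch (the class and variable names below are illustrative, not taken from the slides), a word-count mapper written against Hadoop's Java API could look like this:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input:  <byte offset in the file, one line of text>
    // Output: <word, 1> for every word on the line
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);  // emit <word, 1>
            }
        }
    }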

– 10 – Reducer
Accepts the mapper output and aggregates the values on the key.
For our example, the reducer input would be <word, [1, 1, 1, ...]>; the output would be <word, count>.
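A matching reducer sketch (same caveat: illustrative names, not the slides' own code) sums the 1s for each word:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Input:  <word, [1, 1, 1, ...]> (all values for one key)
    // Output: <word, total count>
    public class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();  // add up the 1s emitted by the mappers
            }
            context.write(key, new IntWritable(sum));  // emit <word, count>
        }
    }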

– 11 – MapReduce
[Diagram: the Hadoop program forks a Master, which assigns map and reduce tasks to Workers. Map workers read the input splits (Split 0, Split 1, Split 2) and write intermediate data to local disk; reduce workers remotely read and sort it, then write Output File 0 and Output File 1. Without a distributed file system, this design transfers peta-scale data through the network.]

– 12 – Google File System (GFS) / Hadoop Distributed File System (HDFS)
Splits data into chunks and stores 3 replicas of each chunk on commodity servers.
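For illustration (the file path below is hypothetical), a file's replication factor can be inspected and set through HDFS's Java client API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path p = new Path("/data/input.txt");      // hypothetical HDFS file
            FileStatus st = fs.getFileStatus(p);
            System.out.println("replication = " + st.getReplication());  // typically 3
            fs.setReplication(p, (short) 3);           // request 3 replicas explicitly
        }
    }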

– 13 – MapReduce
[Diagram: the Master asks the HDFS NameNode "Where are the chunks of input data?" and receives the location of the chunks. Map tasks are then assigned so that each worker reads its input split (Split 0, Split 1, Split 2) from local disk and writes intermediate data locally; reduce workers remotely read and sort it before writing Output File 0 and Output File 1.]

– 14 – Locality Optimization
Master scheduling policy:
Asks GFS for the locations of the replicas of each input file block.
Schedules map tasks so that a replica of their input block is on the same machine or the same rack.
Effect: thousands of machines read input at local-disk speed, eliminating the network bottleneck!

– 15 – Failure in MapReduce
Failures are the norm on commodity hardware.
Worker failure: detect failures via periodic heartbeats; re-execute in-progress map/reduce tasks.
Master failure: single point of failure; resume from the execution log.
Robust: in Google's experience, a job once lost 1,600 of 1,800 machines but still finished fine.

– 16 – Fault tolerance: handled via re-execution
On worker failure:
Detect the failure via periodic heartbeats.
Re-execute both completed and in-progress map tasks (completed map output lives on the failed machine's local disk, so it is lost with the machine).
Task completion is committed through the master.
Robust: [Google's experience] lost 1,600 of 1,800 machines, but finished fine.

– 17 – Refinement: Redundant Execution
Slow workers significantly lengthen completion time:
Other jobs consuming resources on the machine.
Bad disks with soft errors that transfer data very slowly.
Weird things: processor caches disabled (!!).
Solution: spawn backup copies of tasks; whichever copy finishes first "wins".
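Hadoop calls this refinement speculative execution. As a sketch (assuming the Hadoop 2.x property names), it can be toggled per job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpeculativeConfig {
        public static Job configure() throws Exception {
            Configuration conf = new Configuration();
            // Allow backup copies of straggling map and reduce tasks.
            conf.setBoolean("mapreduce.map.speculative", true);
            conf.setBoolean("mapreduce.reduce.speculative", true);
            return Job.getInstance(conf, "word count");
        }
    }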

– 18 – Refinement: Skipping Bad Records
Map/reduce functions sometimes fail deterministically on particular inputs.
The best solution is to debug and fix the bug, but that is not always possible (e.g., the bug is in a third-party library).
If the master sees two failures for the same record, the next worker is told to skip that record.
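Hadoop exposes a comparable mechanism through the SkipBadRecords class of its older mapred API; the following is a hedged sketch assuming that API, not the exact mechanism described on the slide:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SkipBadRecords;

    public class SkippingConfig {
        public static JobConf configure() {
            JobConf conf = new JobConf();
            // Enter skipping mode after 2 failed attempts of the same task...
            SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
            // ...and tolerate up to 1 unskipped bad record per map task.
            SkipBadRecords.setMapperMaxSkipRecords(conf, 1);
            return conf;
        }
    }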

– 19 – A MapReduce Job
[Code: a Mapper class and a Reducer class; run this program as a MapReduce job.]
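A minimal driver sketch that wires the hypothetical WordCountMapper and WordCountReducer above into a runnable job (class names and paths are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class);  // local pre-aggregation before the shuffle
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }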

– 21 – Summary
MapReduce:
A programming paradigm for data-intensive computing.
A distributed and parallel execution model.
Simple to program: the framework automates many tedious tasks (machine selection, failure handling, etc.).