Table of Contents
 Overview
 Scheduling in Hadoop
 Heterogeneity in Hadoop
 The LATE Scheduler (Longest Approximate Time to End)
 The SAMR (A Self-adaptive MapReduce Scheduling Algorithm) Scheduler
 Experiment
 Conclusion

Overview
[Figure: MapReduce execution overview. The user program forks a master and worker processes; the master assigns map and reduce tasks; map workers read the input splits (Split 0, Split 1, Split 2) and write intermediate results to local disk; reduce workers remotely read and sort those results and write the final output files (Output File 0, Output File 1).]

The Map Step
[Figure: the map function is applied to input key-value pairs and emits intermediate key-value pairs.]

The Reduce Step
[Figure: intermediate key-value pairs are grouped by key, and the reduce function turns each key-value group into output key-value pairs.]
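
To make the two steps concrete, here is a minimal word-count sketch in Python. The single-machine driver below is only a toy stand-in for the framework's distributed shuffle, and the names map_fn, reduce_fn and run_job are illustrative, not Hadoop's API.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map step: emit an intermediate (word, 1) pair for every word in the input.
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce step: collapse all counts grouped under the same word into a total.
    yield (word, sum(counts))

def run_job(documents):
    # Group phase (the "shuffle"): collect intermediate values by key.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    # Reduce phase: one reduce call per key group.
    output = {}
    for key, values in groups.items():
        for out_key, out_value in reduce_fn(key, values):
            output[out_key] = out_value
    return output

print(run_job({"d1": "the map step the reduce step"}))
# {'the': 2, 'map': 1, 'step': 2, 'reduce': 1}
```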

Overview
 Google has noted that speculative execution improves response time by 44%
 The paper shows an efficient way to do speculative execution in order to maximize performance
 It also shows that Hadoop's simple speculative algorithm, which compares each task's progress to the average progress, breaks down in heterogeneous systems

Overview
 The proposed scheduling algorithm improves Hadoop's response time
 The paper addresses two important problems in speculative execution:
 Choosing the best node on which to run the speculative task
 Distinguishing between nodes that are slightly slower than the mean and true stragglers

Scheduling in Hadoop
 Assumptions made by the Hadoop scheduler:
 Nodes can perform work at roughly the same rate
 Tasks progress at a constant rate over time

Scheduling in Hadoop
[Diagram: Hadoop's fixed phase weights for the progress score. Reduce task: R1 = 1/3 (copy data), R2 = 1/3 (order/sort), R3 = 1/3 (merge). Map task: M1 = 1 (execute the map function), M2 = 0 (reorder intermediate results).]
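
A small sketch of how these fixed weights turn into a progress score, together with the native speculation rule described in the LATE paper (a task becomes a candidate for speculation when its score drops more than 0.2 below the average of its category). The dictionary fields and function names below are illustrative, not Hadoop's actual code.

```python
def progress_score(task):
    # Map tasks: the score is simply the fraction of input data read (weight M1 = 1).
    if task["type"] == "map":
        return task["fraction_done"]
    # Reduce tasks: copy, sort and merge each contribute up to 1/3 of the score
    # (weights R1 = R2 = R3 = 1/3); within the current phase, the fraction of
    # data processed so far counts toward that phase's third.
    return (task["phases_completed"] + task["fraction_of_phase"]) / 3.0

def needs_speculation(task, all_tasks, threshold=0.2):
    # Native Hadoop heuristic (as described in the LATE paper): speculate a task
    # whose progress score is more than `threshold` below the average score of
    # tasks in the same category.
    same_kind = [t for t in all_tasks if t["type"] == task["type"]]
    avg = sum(progress_score(t) for t in same_kind) / len(same_kind)
    return progress_score(task) < avg - threshold
```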

Scheduling in Hadoop
[Diagram: three tasks side by side. Task A: copy 1/3 done, sort 1/3 done, merge 1/4, processing. Task B: copy 1/3 done, sort 1/3 done, merge 1/4, processing. Task C: copy 1/3 done, sort 1/5, processing.]

Scheduling in Hadoop
[Diagram: Task A: copy 1/3 done, sort 1/3 done, merge 1/4, processing. Task B: copy 1/3 done, sort 1/3 done, merge 1/4, processing. Task C: copy 1/3 done, sort 1/5 done, merge waiting, processing.]

Scheduling in Hadoop
[Diagram: Task A: copy 1/3 done, sort 1/4 done, merge waiting, processing. Task B: copy 1/3 done, sort 1/12 done, merge waiting, processing.]

Scheduling in Hadoop
[Diagram: Task A: copy 1/3 done, sort waiting, merge waiting, processing. Task B: copy 1/3 done, sort 1/12 done, merge waiting, processing.]

The LATE Scheduler
[Diagram: the same phase weights as before. Reduce task: R1 = 1/3 (copy data), R2 = 1/3 (order/sort), R3 = 1/3 (merge). Map task: M1 = 1 (execute the map function), M2 = 0 (reorder intermediate results).]

The LATE Scheduler
[Diagram: Task A: copy 1/3 done, sort 1/3 done, merge 1/4, processing. Task B: copy 1/3 done, sort 1/4 done, merge waiting, processing.]

The LATE Scheduler
[Diagram: Task A: copy 1/3 done, sort waiting, merge waiting, processing. Task B: copy 1/3 done, sort 1/12 done, merge waiting, processing.]

The LATE Scheduler
 To give the backup copy the best chance of beating the original task it speculates, the algorithm launches speculative tasks only on fast nodes
 It does this using a SlowNodeThreshold, a metric based on the total work a node has performed
 Because speculative tasks cost resources, LATE uses two additional heuristics:
 A cap on the number of speculative tasks running at once (SpeculativeCap)
 A SlowTaskThreshold that determines whether a task is slow enough to be speculated (the comparison uses the task's progress rate)
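
A minimal sketch of the LATE decision logic listed above, assuming the paper's estimators (progress rate = progress score / elapsed time; estimated time left = (1 - progress score) / progress rate). The field names, percentile helper and default threshold values are assumptions for illustration; LATE itself sets SpeculativeCap as a share of the available task slots and uses percentile-based thresholds.

```python
def progress_rate(task, now):
    # Rate at which the task's progress score has grown since it started.
    elapsed = max(now - task["start_time"], 1e-9)
    return task["progress_score"] / elapsed

def time_left(task, now):
    # LATE's estimate of the remaining time: remaining work / observed rate.
    return (1.0 - task["progress_score"]) / max(progress_rate(task, now), 1e-9)

def percentile_rank(value, values):
    # Fraction of observed values that are strictly smaller than `value`.
    return sum(v < value for v in values) / len(values)

def pick_speculative_task(running_tasks, node_speed, all_node_speeds,
                          running_backups, now,
                          speculative_cap=10,        # max backups running at once
                          slow_node_threshold=0.25,  # percentile cut-off on node speed
                          slow_task_threshold=0.25): # percentile cut-off on task rate
    # 1. Never hand a backup to a slow node: the requesting node's total work
    #    done must sit above the SlowNodeThreshold percentile.
    if percentile_rank(node_speed, all_node_speeds) < slow_node_threshold:
        return None
    # 2. Respect the global SpeculativeCap on concurrently running backups.
    if running_backups >= speculative_cap:
        return None
    # 3. Only tasks whose progress rate falls below the SlowTaskThreshold
    #    percentile, and that have no backup yet, are candidates.
    rates = [progress_rate(t, now) for t in running_tasks]
    slow = [t for t in running_tasks
            if percentile_rank(progress_rate(t, now), rates) < slow_task_threshold
            and not t.get("has_backup")]
    if not slow:
        return None
    # 4. Speculate the candidate with the longest estimated time to end.
    return max(slow, key=lambda t: time_left(t, now))
```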

The SAMR Scheduler
[Diagram: the same phase breakdown with the weights left open: R1 = ?, R2 = ?, R3 = ? for the reduce task's copy, order/sort and merge phases; M1 = ?, M2 = ? for the map task's execute and reorder stages. SAMR learns these weights per node instead of hard-coding them.]

The SAMR Scheduler
[Figure: how SAMR reads and updates the historical information stored on each node.]
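
One plausible sketch of this step, assuming each node's stored stage weights (M1, M2 for map tasks; R1, R2, R3 for the copy, sort and merge phases of reduce tasks) are blended with the values measured from a task that just finished, with HP giving the share assigned to history (HP = 0.2 is the value used in the experiments later). The exact blending rule is an assumption here, not code from the paper.

```python
def update_historical_weights(stored, measured, hp=0.2):
    # Blend the stage weights stored on the node (historical information) with
    # the weights measured from the task that just finished. HP is the share
    # given to history; HP = 0.2 is the value used in the experiments later.
    return {stage: hp * stored[stage] + (1.0 - hp) * measured[stage]
            for stage in stored}

# Example: on this node the merge phase turned out heavier than history suggests,
# so the stored R3 weight is pulled upward for future time-left estimates.
history  = {"M1": 0.9, "M2": 0.1, "R1": 0.3, "R2": 0.3, "R3": 0.4}
measured = {"M1": 0.8, "M2": 0.2, "R1": 0.2, "R2": 0.3, "R3": 0.5}
print(update_historical_weights(history, measured))
# approximately {'M1': 0.82, 'M2': 0.18, 'R1': 0.22, 'R2': 0.30, 'R3': 0.48}
```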

The SAMR Scheduler
 SLOW_TASK_CAP (STaC): the threshold used to classify slow tasks

The SAMR Scheduler
 SLOW_TRACKER_CAP (STrC): the threshold used to classify slow TaskTrackers

The SAMR Scheduler

 SLOW_TRACKER_PRO (STrP): SlowTrackerNum < STrP × TrackerNum   (14)

The SAMR Scheduler
 Launching backup tasks: BackupNum < BP (Backup Pro) × TaskNum   (15)
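
Read literally, inequalities (14) and (15) act as admission checks before SAMR marks another TaskTracker as slow or launches another backup task. A small sketch under that reading, with the parameter values from the experiment slide used as defaults:

```python
def may_mark_tracker_slow(slow_tracker_num, tracker_num, strp=0.3):
    # Inequality (14): the number of TaskTrackers classified as slow must stay
    # below STrP (SLOW_TRACKER_PRO) times the total number of trackers.
    return slow_tracker_num < strp * tracker_num

def may_launch_backup(backup_num, task_num, bp=0.2):
    # Inequality (15): the number of backup tasks must stay below
    # BP (Backup Pro) times the total number of tasks.
    return backup_num < bp * task_num

# With the parameter values used in the experiments (STrP = 0.3, BP = 0.2):
print(may_mark_tracker_slow(2, 8))   # True: 2 < 0.3 * 8 = 2.4
print(may_launch_backup(20, 100))    # False: 20 is not < 0.2 * 100 = 20
```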

The SAMR Scheduler

Experiment: Effect of "HP" on the execution time

Experiment: Effect of "STaC", "STrC", and "STrP" on the execution time

Experiment: Effect of "BP" on the execution time

Experiment: Historical information and real information on all 8 nodes

Experiment: parameter values used: HP = 0.2, STaC = 0.3, STrC = 0.2, STrP = 0.3, and BP = 0.2

Experiment: The execution results of "Sort" running on the experiment platform.

Experiment
 LATE decreases execution time by about 7%
 LATE using historical information decreases execution time by about 15%
 SAMR decreases execution time by about 24% compared to Hadoop

Conclusion
 Identified the problem in Hadoop's scheduler
 Compared two schedulers for improving the performance of MapReduce in heterogeneous environments
 Discussed how to improve the performance of SAMR