GreenSched: An Energy-Aware Hadoop Workflow Scheduler


Krish K. R.†, M. Safdar Iqbal†, M. Mustafa Rafique‡
† Distributed Systems and Storage Lab, Virginia Tech
‡ IBM Research, Ireland

Hadoop has become the de facto big data processing framework
Yahoo!: 40,000 nodes running Hadoop
Google: 20 PB of data/day
Facebook: 3,000 jobs/day
DSFs such as Hadoop/MapReduce have become synonymous with big data. Several key industry players in big data use these DSFs.

An overview of Hadoop architecture
[Figure labels: Client, Master Node (JobTracker), Job, Output, three Slave Nodes (each with a TaskTracker, tasks, and an HDFS Data Node). Slide callouts: Reliability, Scalability, Homogeneity, Fault Tolerance, Simplicity of management.]
Hadoop has become the de facto platform for big data and is used extensively by Facebook and Yahoo! to process large amounts of data. Hadoop jobs are typically batch processed, and their performance depends mostly on the performance of the storage.
HDFS: Hadoop Distributed File System.

Heterogeneity is inevitable
Emergence of specialized architectures such as GPUs, FPGAs, micro-servers, SSDs and customized SAN/NAS
Regular system upgrades
Heterogeneity is not an exception but the norm in large data centers
References: Optimizing Hadoop for Microservers (Hortonworks); Heterogeneous Storages in HDFS (Hortonworks)

Goal
Design a workflow management system that effectively schedules jobs to multiple heterogeneous clusters while considering the power and performance goals
Study the power characteristics of Hadoop applications and perform power-conserving job scheduling for Hadoop

Enabling technology: ΦSched, a heterogeneity-aware Hadoop workflow scheduler
[Figure labels: Manager, File a, Job 1 and Job 2 (tasks 1 to 3 each), Job Tracker (two), Master, Slave nodes, Cluster 1, Cluster 2, Map Slot, DataNode.]
Optimize placement / retrieval to be region aware.
To this end, we propose decoupling storage and compute…

Design of GreenSched
ΦSched invokes the Cluster Manager to compute the expected execution times.
The Cluster Manager consults the Execution Predictor to return a sorted list of optimal clusters.
ΦSched uses the list and the enhanced HDFS APIs to schedule a job.
The Energy Profiler tracks the power usage characteristics of the applications on a particular hardware.
Static and dynamic profiling are done using the SAR tool.
[Figure labels: Job Queue, ΦSched, Cluster Manager, Energy Profiler, HDFS APIs, Exec. Predictor, three Compute Substrates over three Storage Substrates (different heterogeneous configurations).]
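The scheduling flow on this slide can be sketched roughly as follows. This is a minimal illustration, not the GreenSched implementation: the `Cluster` class, the function names, and the timing values are all hypothetical, and the per-job estimates stand in for whatever the Execution Predictor actually returns.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    # Hypothetical per-job estimates in seconds; in the slides these come
    # from the Execution Predictor, whose real interface is not shown.
    expected_seconds: dict


def rank_clusters(job, clusters):
    """Cluster Manager step: sort clusters by expected execution time."""
    known = [c for c in clusters if job in c.expected_seconds]
    return sorted(known, key=lambda c: c.expected_seconds[job])


def schedule(job, clusters):
    """Scheduler step: use the head of the sorted list as the target."""
    ranked = rank_clusters(job, clusters)
    if not ranked:
        raise ValueError(f"no estimate available for job {job!r}")
    return ranked[0].name


# Made-up estimates, for illustration only.
clusters = [
    Cluster("cluster-1", {"sort": 120.0, "grep": 40.0}),
    Cluster("cluster-2", {"sort": 95.0, "grep": 55.0}),
]
print(schedule("sort", clusters))
print(schedule("grep", clusters))
```

With these made-up estimates, each job goes to a different cluster, which is the behaviour a heterogeneity-aware scheduler is meant to allow.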

Execution Predictor
Maintains a catalog of job execution histories
Profiles the job statically and fine-tunes the profile dynamically
Determines the expected execution time and the required resources for submitted jobs
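One simple way to picture a history-based predictor is a catalog keyed by job and cluster that averages past run times. The slides do not say how the prediction is computed, so the averaging rule, the class name, and the method names below are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


class ExecutionPredictor:
    """Hypothetical catalog of past run times, keyed by (job, cluster)."""

    def __init__(self):
        self._history = defaultdict(list)

    def record(self, job, cluster, seconds):
        """Store one observed execution time."""
        self._history[(job, cluster)].append(seconds)

    def expected_time(self, job, cluster):
        """Return the mean of past runs, or None if there is no history."""
        runs = self._history.get((job, cluster))
        return mean(runs) if runs else None


predictor = ExecutionPredictor()
predictor.record("sort", "cluster-1", 100.0)
predictor.record("sort", "cluster-1", 110.0)
print(predictor.expected_time("sort", "cluster-1"))
print(predictor.expected_time("sort", "cluster-2"))
```

A job with no history yields `None` here, which a caller could treat as a signal to profile the job first.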

Cluster Manager
Manages all the clusters in a deployment
Profiles and predicts the utilization of resources
Tracks the load as jobs are assigned to the clusters
Balances load in the cluster
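Load tracking and balancing could look something like the sketch below. It assumes load is measured in occupied task slots and that a new job goes to the least-utilized cluster with enough free slots; neither detail is stated in the slides, and every name here is hypothetical.

```python
class ClusterManager:
    """Hypothetical load tracker: counts occupied task slots per cluster."""

    def __init__(self, slots):
        self.slots = dict(slots)              # cluster -> total slots
        self.in_use = {c: 0 for c in slots}   # cluster -> occupied slots

    def utilization(self, cluster):
        return self.in_use[cluster] / self.slots[cluster]

    def assign(self, slots_needed):
        """Place a job on the least-utilized cluster that can fit it."""
        candidates = [
            c for c in self.slots
            if self.in_use[c] + slots_needed <= self.slots[c]
        ]
        if not candidates:
            return None
        target = min(candidates, key=self.utilization)
        self.in_use[target] += slots_needed
        return target

    def release(self, cluster, slots_freed):
        """Free slots when a job finishes."""
        self.in_use[cluster] = max(0, self.in_use[cluster] - slots_freed)


manager = ClusterManager({"cluster-1": 4, "cluster-2": 8})
print([manager.assign(2) for _ in range(4)])
```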

HDFS enhancement
Provides APIs to move / prefetch data across regions
Ensures data availability / data mobility across multiple clusters
Associates each DataNode with a unique ‘region’
Identifies all DataNodes in a cluster (of clusters) with the same region
Proposes a ‘region-aware’ placement and retrieval policy
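The region idea can be illustrated with a small sketch in which every DataNode carries a region label, placement picks nodes inside a target region, and retrieval prefers a replica in the reader's region. This is not the HDFS API or the authors' policy; the function names, the fallback rule, and the node names are assumptions.

```python
def place_block(datanode_regions, target_region, replicas=2):
    """Placement: choose DataNodes inside the target region for a new block."""
    in_region = sorted(
        node for node, region in datanode_regions.items()
        if region == target_region
    )
    if len(in_region) < replicas:
        raise ValueError(f"region {target_region!r} has too few DataNodes")
    return in_region[:replicas]


def pick_replica(holders, datanode_regions, reader_region):
    """Retrieval: prefer a replica in the reader's region, else any holder."""
    for node in holders:
        if datanode_regions[node] == reader_region:
            return node
    return holders[0]


# Hypothetical deployment: two clusters, one region label per cluster.
regions = {
    "dn1": "region-a", "dn2": "region-a",
    "dn3": "region-b", "dn4": "region-b",
}
print(place_block(regions, "region-b"))
print(pick_replica(["dn1", "dn3"], regions, "region-b"))
```

When no replica exists in the reader's region, the sketch falls back to a remote holder, which is the case where the move / prefetch APIs on the slide would matter.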

Energy Profiler
Tracks, records and analyzes the power usage characteristics of the application on a particular hardware
Computes the performance-to-power ratio for different jobs in each cluster
Schedules the job to the cluster with the least performance-to-power ratio
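The selection rule can be written down as a short sketch. The slides do not define the performance metric or its units, so the sketch treats it as an opaque number, follows the slide's wording in choosing the smallest ratio, and uses made-up profile values; all names are hypothetical.

```python
def performance_to_power(performance, watts):
    """Ratio of a performance metric to average power draw in watts."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return performance / watts


def pick_cluster(profiles):
    """Return the cluster with the least ratio, as worded on the slide.

    profiles maps cluster name -> (performance metric, average watts).
    The metric's definition is not given in the slides (assumption).
    """
    ratios = {
        name: performance_to_power(perf, watts)
        for name, (perf, watts) in profiles.items()
    }
    return min(ratios, key=ratios.get), ratios


# Made-up numbers, for illustration only.
profiles = {"cluster-1": (120.0, 300.0), "cluster-2": (90.0, 150.0)}
best, ratios = pick_cluster(profiles)
print(best, ratios)
```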

Evaluation testbed
Three clusters, each containing 5 homogeneous nodes
Name | CPUs | RAM (GB) | Storage (GB) | Network | Map slots | Reduce slots
Cluster 1: 8, SSD, 1 Gbps, 4, 2
Cluster 2: HDD, 128 Mbps
Cluster 3: 16, 10 Gbps

Performance of applications on different clusters
We observe the overall I/O throughput across map tasks and the average I/O rate, calculated as the average of the throughputs achieved per map task.

Power consumption of applications on different clusters

Performance comparison of Cluster 1 vs. Cluster 2

Power consumption comparison of Cluster 1 vs. Cluster 2

Comparison of performance-to-power ratio
No single ‘best’ cluster for all workloads

Overall performance of GreenSched
GreenSched achieves a 12.8% performance improvement and a 21% reduction in power consumption over hardware-oblivious policies

Conclusion: GreenSched enables power-aware Hadoop scheduling
Exploits hardware heterogeneity to perform power-aware Hadoop scheduling
Provides APIs for effective data management
Realizes significant benefits: 12.8% performance improvement and 21% less power

QUESTIONS