Energy Efficiency for MapReduce Workloads: An In-depth Study. Boliang Feng, Renmin University of China. Dec 19.

Energy Efficiency for MapReduce Workloads: An In-depth Study. Boliang Feng, Renmin University of China. ADC 2012, Dec 19.

Outline  Why Energy?  Factors Affecting Energy Efficiency of MapReduce  Experimental Design  Analysis of Results  Key Findings and Recommendations  Conclusion and Future Work

Why energy?  Cooling  Cost  Environmental Effect  Performance

Data Center: some numbers
 The data center in The Dalles, Oregon: ~50 MW. Average electricity consumption in the USA: ~900 kWh/month per family, i.e. ~1.25 kW
 Power consumption is a major cost and constraint of data centers
 About 7,000 data centers in the USA
 US data centers accounted for roughly 61 billion kWh (1.5% of total U.S. electricity consumption) in 2006 (EPA 2007); that figure was expected to double by 2011

Green Computing in the Cloud
 Physical construction & chip level
 System software: Virtual Datacenter OS; IBM: Power-Aware Request Distribution
 Cluster-level view: Dynamic Resource Configuration, Workload Distribution, …
 Green applications: DBMS, MapReduce
 Industry standards: Green Grid, PUE (Power Usage Effectiveness), DCiE

Why MapReduce?
 MapReduce & Hadoop: MapReduce is a popular distributed processing model for parallel computing in data centers. Hadoop is an open-source implementation of MapReduce
 New challenges: energy efficiency has received little attention in the design of MapReduce platforms, which perform automatic parallelization and distribution of computation and incorporate mechanisms for resilience to failures

Aim to Answer 2 Questions  We aim to address the following two questions: Which factors affect the cluster-wide energy efficiency of a MapReduce platform? Is there any opportunity to trade off performance against energy?

Question 1  Which factors affect energy efficiency? We identify four factors that affect the energy efficiency of MapReduce:  the CPU intensiveness and I/O intensiveness of the workload  factors of the underlying distributed file system: the replica factor and the file block size

Question 2  Is there any opportunity? We identify four typical MapReduce workloads that represent different application scenarios  TextWrite, WordCount, GrepSearch, TeraSort  and measure their energy consumption at varied cluster scales and with other related factors

Factors Affecting Energy Efficiency of MapReduce  Energy consumption model for MapReduce  CPU intensiveness, I/O intensiveness  Replica factor of the underlying distributed file system, as well as the file block size

Experiment Design  Metrics: Power, Time, Energy; Energy Efficiency (EE)  Cluster setup: 2.4 GHz Intel Core Duo processor, 4 GB RAM, 1 Gbps network card, Hadoop
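
The slide lists Power, Time, Energy, and EE as metrics but does not spell out the EE formula; a common formulation is useful work per unit energy, sketched below. The record count and power figures are made-up examples, not measurements from the paper:

```python
# Illustrative energy-efficiency metric (assumed form: work per watt-hour);
# the paper's exact EE definition is not given on the slide.
def energy_wh(avg_power_w: float, runtime_s: float) -> float:
    """Energy in watt-hours from average power draw and runtime."""
    return avg_power_w * runtime_s / 3600.0

def energy_efficiency(records: int, avg_power_w: float, runtime_s: float) -> float:
    """Records processed per watt-hour of energy consumed."""
    return records / energy_wh(avg_power_w, runtime_s)

# Hypothetical example: 10 million records, cluster averaging 800 W, 30-minute job.
ee = energy_efficiency(10_000_000, 800, 30 * 60)
print(f"{ee:,.0f} records/Wh")
```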

Experiment Design  Workloads:
 TextWrite: writes a large unsorted random sequence of words from a word list. Network-intensive, map-only job
 WordCount: CPU-intensive job that counts word occurrences in the input files. High CPU utilization in the map stage
 GrepSearch: matches regular expressions against the input files. Balanced between the CPU-intensive map stage and the I/O-intensive reduce stage. High map/reduce ratio
 TeraSort: sorts the official input data sets. CPU-bound in the map stage and I/O-bound in the reduce stage. Low map/reduce ratio
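
For concreteness, the shape of the WordCount workload can be sketched as a Hadoop Streaming style mapper/reducer pair (a minimal stand-alone illustration, not the benchmark's actual code):

```python
# Minimal WordCount sketch in the Hadoop Streaming style: the mapper
# emits (word, 1) pairs and the reducer sums the counts per word.
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield word.lower(), 1

def reducer(pairs):
    """Sum counts per word (Hadoop delivers pairs grouped by key)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog"]
pairs = [kv for line in lines for kv in mapper(line)]
print(reducer(pairs))  # {'the': 2, 'quick': 1, 'brown': 1, ...}
```

In the real job, the shuffle stage performs the grouping between mapper and reducer; the CPU cost lives almost entirely in the map side, which is why the slide calls WordCount CPU-intensive.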

Experiment Design  Varied cluster parameters: Cluster size: 2–6 nodes; Replica factor: 1–5 replicas; Block size: 16 MB–1 GB; Data size: 5–20 GB
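
Block size is worth sweeping because, to a first approximation, it fixes the number of input splits and hence the map-task parallelism. The model below is a simplification (real Hadoop split sizing also honors min/max split settings):

```python
# Simplified model: one map task per HDFS block of input data.
import math

def num_map_tasks(data_size_mb, block_size_mb):
    """Number of map tasks if every block becomes one input split."""
    return math.ceil(data_size_mb / block_size_mb)

data_mb = 20 * 1024  # 20 GB, the largest data size in the experiments
for block_mb in (16, 64, 256, 1024):
    print(f"{block_mb:>5} MB blocks -> {num_map_tasks(data_mb, block_mb):>5} map tasks")
```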

Analysis of Results  We ran these four workloads with varied workload sizes, cluster scales, replica factors, and block sizes

TextWrite (1)  Both latency and energy consumption grow almost linearly as the replica factor increases

TextWrite (2)  With more nodes: a large performance improvement of 51.3%; energy decreased from 71 Wh to 49 Wh  More nodes mean higher power draw, but the reduction in response time offsets the extra energy consumption
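
The tradeoff described here — more nodes draw more power but finish sooner — can be captured by a toy model E(N) = N · P · T(N) with T(N) = T1 / N^alpha: energy falls with cluster size only when the speedup exponent alpha beats 1. All constants below are illustrative, not measured values from the paper:

```python
# Toy energy model for the power-vs-runtime tradeoff on this slide:
#   E(N) = N * P * T(N),  with T(N) = T1 / N**alpha.
# alpha > 1 (superlinear speedup) makes energy fall as nodes are added,
# as observed for TextWrite; alpha < 1 makes it rise; alpha = 1 keeps it
# flat. All constants are illustrative, not measurements from the paper.
def energy_wh(nodes, alpha, node_power_w=150.0, t1_hours=2.0):
    runtime_h = t1_hours / nodes ** alpha   # sublinear/superlinear speedup
    return nodes * node_power_w * runtime_h

for alpha in (0.8, 1.0, 1.2):
    trend = [round(energy_wh(n, alpha)) for n in (2, 4, 6)]
    print(f"alpha={alpha}: energy at 2/4/6 nodes = {trend} Wh")
```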

TextWrite (3)  A small block size enables more tasks to be processed in parallel  When the block size exceeds 64 MB, the parallelism of the system is reduced, so energy consumption increases significantly

GrepSearch (1)  A higher replica factor means more choices for task assignment, improving the load balancing of the system  The HDFS replica placement policy not only improves data reliability and availability but also improves the parallelism of the GrepSearch workload

GrepSearch (2)  More nodes mean more resources and better performance  When the workload size is small, the startup cost cannot be amortized and resources are already sufficient, so there is no obvious energy saving as the cluster size increases

GrepSearch (3)  As the workload size increases, the EE value decreases  A small block size means a large overhead for job initialization  A well-tuned block size can save as much as 36.8% of energy

TeraSort (1)  No significant performance change is seen with varied replica factors  More replicas would improve the load balance of map tasks  But they would also cause large data transfers in the shuffle stage, slowing the progress of the whole job

TeraSort (2)  With varied cluster size and workload size  A larger workload increases the job runtime, and more IT components lead to more wasted energy  Adding more servers provides higher I/O throughput and reduces energy consumption by 20.2%

TeraSort (3)  Fig. 3(c) shows the EE value with varied block sizes  A small block size means fine-grained input splits, which improve the performance of the reduce-side sort  A big block size means less data to shuffle

WordCount (1)  It uses almost 95% of the available CPU capacity  Increasing the replica factor improves performance and parallelism  A larger replica factor implies more opportunities for load balancing

WordCount (2)  With more nodes added to the cluster, response time and energy consumption decrease by 56.8% and 18.2% respectively on the 20 GB data set

WordCount (3)  A small block size brings a high cost for task initialization  A large block size such as 1 GB negatively affects the parallelism of the system

Key Findings  With well-tuned system parameters and adaptive resource configuration, a MapReduce cluster can in some cases achieve both performance improvement and substantial energy saving simultaneously  This stands in surprising contrast to previous work on cluster-level energy conservation

Recommendations
 For CPU-intensive workloads with a high map/reduce ratio, an appropriate number of servers should be provisioned based on the workload size, ensuring load balance and adequate CPU resources
 For I/O-intensive workloads, fine-grained input splits are effective for the shuffle and reduce stages and are more energy efficient, provided the initialization cost can be amortized
 Improved data partitioning algorithms in the map stage and content-aware reduce-task scheduling strategies are key areas for energy efficiency where refinement is needed

Conclusion  We identified four factors that affect the energy efficiency of MapReduce clusters  We chose four typical MapReduce workloads and measured their energy consumption at varied cluster scales and with other related factors  A MapReduce cluster can achieve both significant energy saving and performance improvement simultaneously in some instances

Future Work  Verify the results of this paper on larger clusters  Introduce more MapReduce benchmarks  Investigate the effects of other parameters: parallel reduce copies, memory limits, and file buffer sizes

Thank you Q&A