Self-Adaptive, Energy-Conserving Variant of the Hadoop Distributed File System
Kumar Sharshembiev

1. Current energy issues with HDFS and large server farms
2. Past approaches and solutions for energy conservation and cost reduction
3. GreenHDFS: its unique design and solution
4. Conclusions and references

• The purpose of HDFS was to build a scalable file system that runs on a large number of commodity servers (currently about 155,500 at Yahoo!)

• A large number of servers generates heat and consumes energy in very large quantities
• Over the lifetime of a server, the operating energy cost is comparable to the initial acquisition cost, and ownership costs keep growing (power, cooling, etc.)
• A lot of effort and research has gone into energy conservation for extremely large-scale server farms

• One commonly used technique is the "scale-down" approach: transitioning servers into a low-power state
• Example: many datacenters consolidate workloads and their state onto a smaller number of servers during low-activity hours
• Problem: this approach works only when the servers are stateless, i.e. they get all of their data from NAS/SAN storage

 “Scale-down” approaches work only with NAS/SAN since all of the data is stored on dedicated storage devices – possible to migrate workload to fewer number

• Hadoop distributes all of its files among many servers, so any of the thousands of nodes can be participating in a job at any moment

• Self-adaptive: depends only on HDFS itself and on file access patterns
• Applies data-classification techniques
• Performs energy-aware placement of data
• Trades off cost, performance, and power by separating the cluster into logical zones (a sketch of this zoning follows below)
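
As a rough, hypothetical illustration of the zoning idea (the ClusterZoning class and its method names are not from GreenHDFS), the separation can be modeled as a mapping from DataNodes to logical zones:

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal sketch of GreenHDFS-style logical zoning; all names are illustrative. */
public class ClusterZoning {
    public enum Zone { HOT, COLD }

    // Maps a DataNode hostname to the logical zone it belongs to.
    private final Map<String, Zone> nodeZones = new HashMap<>();

    /** Assign a node to a zone, e.g. compute-heavy nodes to HOT, storage-heavy nodes to COLD. */
    public void assignZone(String datanodeHost, Zone zone) {
        nodeZones.put(datanodeHost, zone);
    }

    /** Energy-aware placement: newly created (presumably hot) data goes to the Hot Zone by default. */
    public Zone zoneForNewFile() {
        return Zone.HOT;
    }

    public Zone zoneOf(String datanodeHost) {
        return nodeZones.getOrDefault(datanodeHost, Zone.HOT);
    }
}
```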

• The team did a detailed analysis of files in a production Yahoo! Hadoop cluster:
• Files are heterogeneous in their access and lifespan patterns: some are rarely accessed, some are deleted shortly after creation, and some stay around for a while
• About 60% of the data is "cold", or dormant, meaning it lies there without being accessed but needs to exist as history

• 95-98% of files had a very short "hotness" lifespan of less than 3 days, meaning they were actively used only during their first 3 days
• 90% of files in the top-level directory were dormant, or "cold", for more than 18 days
• The majority of the data had a news-server-like access pattern, where most of the computation happens soon after a file's creation

• GreenHDFS organizes servers into logical Hot and Cold Zones, managed by different policies (FMP, SCP, and FRP, described below), trading off performance, cost, and power between the zones

• The goal of GreenHDFS is to have the maximum number of servers in the Hot Zone and to minimize the number in the Cold Zone
• Servers in the Cold Zone are storage-heavy
• GreenHDFS relies heavily on the "temperature" of files: the higher the dormancy (the more rarely a file is accessed), the lower its temperature, and vice versa
• Dormancy is determined simply by reading the last-access information when a file is accessed (see the sketch below)
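
A minimal sketch of how a file's dormancy could be computed from HDFS metadata. FileStatus.getAccessTime() is a standard HDFS API; the FileTemperature class, the isCold helper, and the 20-day threshold are illustrative assumptions, not GreenHDFS code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

/** Illustrative dormancy check based on HDFS access-time metadata; not GreenHDFS code. */
public class FileTemperature {

    /** Days since the file was last read, according to the NameNode's access-time metadata. */
    public static long dormancyDays(FileSystem fs, Path file) throws IOException {
        FileStatus status = fs.getFileStatus(file);
        long sinceLastAccessMillis = System.currentTimeMillis() - status.getAccessTime();
        return sinceLastAccessMillis / (24L * 60 * 60 * 1000);
    }

    /** Higher dormancy means lower "temperature": long-unread files are treated as cold. */
    public static boolean isCold(FileSystem fs, Path file, long coldnessThresholdDays) throws IOException {
        return dormancyDays(fs, file) > coldnessThresholdDays;
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical 20-day coldness threshold; the real threshold is a tunable policy parameter.
        System.out.println(isCold(fs, new Path(args[0]), 20));
    }
}
```

Note that HDFS updates access times only at a configurable precision (dfs.namenode.accesstime.precision), so dormancy measured this way is approximate.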

• FMP monitors the dormancy of files and runs in the Hot Zone
• This gives the Hot Zone higher storage efficiency, since rarely accessed files are moved to the Cold Zone
• It also yields significant energy conservation
• [Slide diagram: files flow from the Hot Zone (heavy computation) to the Cold Zone (idle servers) via FMP when their coldness exceeds a threshold, and move back when their hotness exceeds a threshold]
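
A rough external sketch of such a migration scan. GreenHDFS implements this logic inside HDFS itself; the FileMigrationScan class and the moveToColdZone stub below are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

/** Illustrative FMP-style scan over a Hot Zone directory; names and thresholds are assumptions. */
public class FileMigrationScan {
    private static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

    private final long coldnessThresholdDays;

    public FileMigrationScan(long coldnessThresholdDays) {
        this.coldnessThresholdDays = coldnessThresholdDays;
    }

    /** Flag files whose dormancy exceeds the coldness threshold for movement to the Cold Zone. */
    public void scan(FileSystem fs, Path hotZoneDir) throws IOException {
        for (FileStatus status : fs.listStatus(hotZoneDir)) {
            long dormancyDays = (System.currentTimeMillis() - status.getAccessTime()) / DAY_MILLIS;
            if (status.isFile() && dormancyDays > coldnessThresholdDays) {
                moveToColdZone(status.getPath());
            }
        }
    }

    private void moveToColdZone(Path file) {
        // Hypothetical stub: GreenHDFS itself relocates the file's blocks onto Cold Zone servers.
        System.out.println("Would migrate to Cold Zone: " + file);
    }

    public static void main(String[] args) throws IOException {
        new FileMigrationScan(20).scan(FileSystem.get(new Configuration()), new Path(args[0]));
    }
}
```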

• SCP runs in the Cold Zone and determines which servers can go into standby/sleep mode
• SCP uses hardware techniques to put the CPU, disks, and DRAM into a low-power state
• SCP wakes a server up only if:
◦ Data on that server is accessed
◦ New data needs to be placed on that server
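
A simplified sketch of the standby decision, under the assumption that a Cold Zone server may sleep only after a quiet period with no data accesses or new placements. The class and method names are illustrative, and the actual hardware power-state transitions are not shown:

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative Cold Zone power logic; not the actual GreenHDFS implementation. */
public class ColdZonePowerManager {
    private final Map<String, Long> lastActivityMillisByServer = new HashMap<>();
    private final long quietPeriodMillis;

    public ColdZonePowerManager(long quietPeriodMillis) {
        this.quietPeriodMillis = quietPeriodMillis;
    }

    /** Record any read of data stored on, or placement of new data onto, this server. */
    public void recordActivity(String server) {
        lastActivityMillisByServer.put(server, System.currentTimeMillis());
    }

    /** A server may enter standby only after a full quiet period with no activity. */
    public boolean maySleep(String server) {
        long last = lastActivityMillisByServer.getOrDefault(server, 0L);
        return System.currentTimeMillis() - last > quietPeriodMillis;
    }

    /** Waking is triggered by an access to data on the server or by a new data placement. */
    public void wakeIfNeeded(String server) {
        recordActivity(server); // the access itself resets the quiet-period clock
        // Hypothetical: issue a wake-up / power-management call to the server here.
    }
}
```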

• FRP runs in the Cold Zone and ensures that QoS, bandwidth, and response time are managed well if files become "popular" again
• If the number of accesses to a file exceeds a threshold, its metadata is changed and the file is "moved" back to the Hot Zone
• The threshold values of FMP, SCP, and FRP should all be chosen so that the result is maximum energy efficiency
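
A sketch of the reversal check, under the assumption that a simple per-file access counter is compared against the hotness threshold; the FileReversalCheck class is hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative File Reversal Policy-style check; counter and threshold handling are assumptions. */
public class FileReversalCheck {
    private final Map<String, Integer> accessCounts = new ConcurrentHashMap<>();
    private final int hotnessThreshold;

    public FileReversalCheck(int hotnessThreshold) {
        this.hotnessThreshold = hotnessThreshold;
    }

    /** Called on every read of a Cold Zone file; returns true if the file should move back to the Hot Zone. */
    public boolean onColdFileAccess(String filePath) {
        int count = accessCounts.merge(filePath, 1, Integer::sum);
        if (count > hotnessThreshold) {
            accessCounts.remove(filePath); // reset once the reversal is triggered
            return true;                   // caller updates metadata and relocates the file's blocks
        }
        return false;
    }
}
```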

• A file goes through several stages in its lifetime:
◦ File creation: just created
◦ Hot period: frequently used
◦ Dormant period: not accessed
◦ Deletion
• GreenHDFS introduced various lifespan metrics and analyzed their distributions to determine optimal threshold values for the policies:
◦ FileLifeSpanCFR: file creation to first read
◦ FileLifeSpanCLR: file creation to last read
◦ FileLifeSpanLRD: last read access to deletion
◦ FileLifeSpanFLR: first read access to last read
◦ FileLifeTime: creation to deletion
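
The metrics above reduce to simple differences between per-file timestamps; the FileLifespan class below is an illustrative rendering of those definitions, not code from GreenHDFS:

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative lifespan metrics for one file, matching the definitions listed above. */
public class FileLifespan {
    Instant created, firstRead, lastRead, deleted;

    Duration lifeSpanCFR() { return Duration.between(created, firstRead); }   // creation to first read
    Duration lifeSpanCLR() { return Duration.between(created, lastRead); }    // creation to last read
    Duration lifeSpanLRD() { return Duration.between(lastRead, deleted); }    // last read to deletion
    Duration lifeSpanFLR() { return Duration.between(firstRead, lastRead); }  // first read to last read
    Duration lifeTime()    { return Duration.between(created, deleted); }     // creation to deletion
}
```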

• The majority of files have a short hotness lifespan

• 80% of the files in directory d have a dormancy period of more than 20 days

• A simulation was run to test the energy conservation

• 24% reduction in energy consumption: roughly $2.1 million saved for 38,000 servers, or about $8.5 million if scaled to the 155K servers in use today
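
A quick back-of-the-envelope check of how the reported $2.1 million figure scales to 155K servers (the per-server saving is derived here for illustration only; it is not a value reported in the source):

```java
/** Back-of-the-envelope check: scale the reported savings from 38,000 to 155,000 servers. */
public class SavingsEstimate {
    public static void main(String[] args) {
        double reportedSavings = 2_100_000.0;   // ~$2.1M for the simulated 38,000-server cluster
        int simulatedServers = 38_000;
        int todaysServers = 155_000;

        double savingsPerServer = reportedSavings / simulatedServers;   // ~$55 per server
        double scaledSavings = savingsPerServer * todaysServers;        // ~$8.6M, consistent with ~$8.5M

        System.out.printf("Per-server saving: $%.0f, scaled to %d servers: $%.1fM%n",
                savingsPerServer, todaysServers, scaledSavings / 1_000_000);
    }
}
```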

• More servers and space available = better performance

• GreenHDFS is a policy-driven, self-adaptive variant of HDFS
• It relies on data-classification-driven data placement, which creates significant periods of idleness on a subset of servers
• It divides the cluster into 2 zones: Hot and Cold
• It applies sets of policies to classify files as Hot or Cold

• Energy consumption was reduced by 24%, saving $2.1 million for 38,000 servers at that time; today the savings could be more than $8.5 million
• Storage efficiency also increased, since dormant files are moved to the Cold Zone
• More space and better utilization of the Hot Zone lead to better performance for HDFS/MapReduce

• resentations/papers/kaushik.pdf