Published by Hunter Petre. Modified over 9 years ago.
1
Self-Adaptive, Energy-Conserving Variant of the Hadoop Distributed File System
Kumar Sharshembiev
2
1. Current energy issues with HDFS and large server farms
2. Past approaches and solutions for energy conservation and cost cutting
3. GreenHDFS's unique design and solution
4. Conclusions and references
3
HDFS was designed as a scalable file system that runs on a large number of commodity servers (currently about 155,500 at Yahoo!).
4
A large number of servers generates heat and consumes energy in very large quantities.
Over the lifetime of a server, the operating energy cost is comparable to the initial acquisition cost, and ownership costs (power, cooling, etc.) keep growing.
Much effort and research has gone into energy-conservation solutions for extremely large-scale server farms.
5
One commonly used approach is “scale-down”: transitioning servers into a low-power state.
Example: many datacenters migrate workloads and their state to fewer servers during low-activity hours.
Problem? This approach works only when servers are stateless, i.e. when they get all of their data from NAS/SAN.
6
“Scale-down” approaches work only with NAS/SAN because all of the data is stored on dedicated storage devices, which makes it possible to migrate the workload to fewer servers.
7
Hadoop distributes all of its files among many servers, so any of the thousands of nodes can be participating at any moment.
8
Self-adaptive: depends only on HDFS and file access patterns.
Applies data-classification techniques.
Does energy-aware placement of data.
Trades off cost, performance, and power by separating the cluster into logical zones.
9
The team did a detailed analysis of files in a production Yahoo! Hadoop cluster:
Files are heterogeneous in access and lifespan patterns: some are rarely accessed, some are deleted shortly after creation, and some stay a while.
60% of the data is “cold” or dormant, meaning it sits without being accessed (“need to exist for history files”).
10
95-98% of files had a very short “hotness” lifespan of less than 3 days, meaning they were actively used only during the first 3 days.
90% of files in the top-level directory were dormant or “cold” for more than 18 days.
The majority of the data had a news-server-like access pattern, where most of the computation happens soon after the data is created.
11
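GreenHDFS organizes servers into logical Hot and Cold Zones using different policies: the File Migration Policy (FMP), the Server Power Conserving Policy (SCP), and the File Reversal Policy (FRP).
[Diagram labels: FMP; Performance, Cost and Power]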
12
The goal of GreenHDFS is to keep the maximum number of servers in the Hot Zone and to minimize the number in the Cold Zone.
Servers in the Cold Zone are storage-heavy.
GreenHDFS relies heavily on the “temperature” of files: the higher the dormancy (the more rarely a file is accessed), the lower the temperature, and vice versa.
Dormancy is determined simply from the last-access information recorded when a file is read.
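As a rough illustration of the last point, dormancy can be derived from a per-file last-read timestamp. The sketch below is not GreenHDFS code: `AccessTracker`, its methods, and the file paths are hypothetical names chosen for this example, and an in-memory dictionary stands in for whatever metadata store the real system uses.

```python
from datetime import datetime, timedelta


class AccessTracker:
    """Hypothetical tracker: remembers when each file was last read."""

    def __init__(self):
        self._last_access = {}  # path -> datetime of the most recent read

    def record_read(self, path, when):
        # Every read overwrites the timestamp, which resets the file's dormancy.
        self._last_access[path] = when

    def dormancy_days(self, path, now):
        # Dormancy = time elapsed since the last recorded read, in days.
        return (now - self._last_access[path]).total_seconds() / 86400.0


tracker = AccessTracker()
t0 = datetime(2011, 1, 1)
tracker.record_read("/logs/a", t0)
tracker.record_read("/logs/b", t0 + timedelta(days=9))

now = t0 + timedelta(days=10)
# Higher dormancy means lower "temperature", so sort the coldest files first.
coldest_first = sorted(
    ["/logs/a", "/logs/b"],
    key=lambda p: tracker.dormancy_days(p, now),
    reverse=True,
)
```

Here `/logs/a` has been dormant for 10 days and `/logs/b` for 1 day, so `/logs/a` is the colder file.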
13
FMP monitors the dormancy of files and runs in the Hot Zone.
This gives higher storage efficiency for the Hot Zone, as less-accessed files are moved to the Cold Zone.
It also gives significant energy conservation.
[Diagram: Hot Zone (heavy computations) and Cold Zone (idle servers); files move to the Cold Zone when coldness > threshold (FMP) and back to the Hot Zone when hotness > threshold]
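One way to picture a single FMP pass is as a filter over Hot Zone files whose dormancy exceeds a threshold. This is only a sketch under assumptions: the function names, the `"hot"`/`"cold"` zone labels, the plain dictionaries, and the 20-day threshold are all invented for illustration and are not taken from GreenHDFS.

```python
def select_for_migration(dormancy_by_file, zone_by_file, threshold_days):
    """Return the Hot Zone files whose dormancy exceeds the threshold."""
    return sorted(
        path
        for path, days in dormancy_by_file.items()
        if zone_by_file[path] == "hot" and days > threshold_days
    )


def run_fmp_pass(dormancy_by_file, zone_by_file, threshold_days):
    """Reassign dormant Hot Zone files to the Cold Zone; return what moved."""
    moved = select_for_migration(dormancy_by_file, zone_by_file, threshold_days)
    for path in moved:
        zone_by_file[path] = "cold"
    return moved


# Illustrative data: dormancy in days and the current zone of each file.
dormancy = {"/a": 25.0, "/b": 2.0, "/c": 40.0}
zones = {"/a": "hot", "/b": "hot", "/c": "cold"}

moved = run_fmp_pass(dormancy, zones, threshold_days=20)  # assumed threshold
```

Only `/a` moves: `/b` is still hot, and `/c` is already in the Cold Zone.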
14
SCP runs in the Cold Zone and determines which servers can go into standby/sleep mode.
SCP uses hardware techniques to transition the CPU, disks, and DRAM into a low-power state.
SCP wakes a server up only if:
◦ Data on that server is accessed
◦ New data needs to be placed on that server
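The two wake-up conditions can be sketched as a small state machine. Everything here is hypothetical: `ColdServer`, `PowerState`, and in particular the "sleep when there are no pending requests" rule are assumptions made for this example, since the slide does not say how SCP decides that a server may sleep.

```python
from enum import Enum


class PowerState(Enum):
    ACTIVE = "active"
    SLEEPING = "sleeping"


class ColdServer:
    """Hypothetical Cold Zone server that sleeps unless it is needed."""

    def __init__(self, name):
        self.name = name
        self.state = PowerState.SLEEPING
        self.files = set()

    def _wake(self):
        if self.state is PowerState.SLEEPING:
            self.state = PowerState.ACTIVE

    def read(self, path):
        if path not in self.files:
            raise KeyError(path)
        self._wake()  # wake condition 1: data on this server is accessed
        return path

    def place(self, path):
        self._wake()  # wake condition 2: new data is placed on this server
        self.files.add(path)

    def sleep_if_idle(self, pending_requests):
        # Assumed idleness rule, not from the source: sleep when nothing is pending.
        if pending_requests == 0:
            self.state = PowerState.SLEEPING


server = ColdServer("cold-01")
state_log = [server.state]
server.place("/archive/2010.log")
state_log.append(server.state)
server.sleep_if_idle(pending_requests=0)
state_log.append(server.state)
server.read("/archive/2010.log")
state_log.append(server.state)
```

The server starts asleep, wakes for the placement, goes back to sleep when idle, and wakes again for the read.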
15
FRP runs in the Cold Zone and ensures that QoS, bandwidth, and response time are managed well if files become “popular”.
If the number of accesses to a file becomes higher than the threshold, the file's metadata is changed and the file is “moved” to the Hot Zone.
All the threshold values of FMP, SCP, and FRP should be chosen to give maximum energy efficiency.
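The reversal rule amounts to a per-file access counter with a threshold. The sketch below assumes a hypothetical `FileReversalPolicy` class, a dictionary of zone labels, and a threshold of 2; none of these come from GreenHDFS itself.

```python
from collections import Counter


class FileReversalPolicy:
    """Hypothetical FRP: counts Cold Zone accesses and promotes popular files."""

    def __init__(self, access_threshold):
        self.access_threshold = access_threshold
        self.access_counts = Counter()

    def on_access(self, path, zone_by_file):
        """Record one access; return True if the file was moved to the Hot Zone."""
        if zone_by_file.get(path) != "cold":
            return False  # only Cold Zone files are candidates for reversal
        self.access_counts[path] += 1
        if self.access_counts[path] > self.access_threshold:
            zone_by_file[path] = "hot"     # the "move" is a metadata change
            del self.access_counts[path]   # start fresh if the file cools again
            return True
        return False


zones = {"/old/report": "cold"}
frp = FileReversalPolicy(access_threshold=2)  # assumed threshold
results = [frp.on_access("/old/report", zones) for _ in range(3)]
```

The third access pushes the count above the threshold, so only the last call reports a move.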
16
A file goes through several stages in its lifetime:
◦ File creation: just created
◦ Hot period: frequently used
◦ Dormant period: not accessed
◦ Deletion
GreenHDFS introduced various lifespan metrics and analyzed lifespan distributions to determine optimal threshold values for its policies:
◦ FileLifeSpanCFR: file creation to first read
◦ FileLifeSpanCLR: file creation to last read
◦ FileLifeSpanLRD: last read to deletion
◦ FileLifeSpanFLR: first read to last read
◦ FileLifeTime: creation to deletion
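Each of these metrics is the difference between two timestamps in a file's history. A minimal sketch, assuming a hypothetical `FileHistory` record with four timestamps and reporting every metric in days:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FileHistory:
    """Hypothetical record of the key events in one file's life."""
    created: datetime
    first_read: datetime
    last_read: datetime
    deleted: datetime


def _days(start, end):
    return (end - start).total_seconds() / 86400.0


def lifespan_metrics(h):
    """Compute the five lifespan metrics, in days, for one file."""
    return {
        "FileLifeSpanCFR": _days(h.created, h.first_read),
        "FileLifeSpanCLR": _days(h.created, h.last_read),
        "FileLifeSpanLRD": _days(h.last_read, h.deleted),
        "FileLifeSpanFLR": _days(h.first_read, h.last_read),
        "FileLifeTime": _days(h.created, h.deleted),
    }


history = FileHistory(
    created=datetime(2011, 1, 1),
    first_read=datetime(2011, 1, 2),
    last_read=datetime(2011, 1, 4),
    deleted=datetime(2011, 1, 31),
)
metrics = lifespan_metrics(history)
```

For this made-up file the hot period (FileLifeSpanCLR) is 3 days and the dormant period before deletion (FileLifeSpanLRD) is 27 days.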
18
The majority of files have a short hotness lifespan.
19
80% of files in directory d have a dormancy period of more than 20 days.
20
Simulation to test energy-conservation
21
A 24% reduction in energy consumption: about $2.1 million saved for 38,000 servers, or about $8.5 million on 155K servers today.
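The larger figure is consistent with scaling the measured savings linearly by server count. That linearity is an assumption made here for the arithmetic; the slide does not state how the $8.5 million was derived, and `scale_savings` is a hypothetical helper.

```python
def scale_savings(savings_usd, servers_measured, servers_target):
    """Scale savings linearly with server count (an assumed model)."""
    return savings_usd * servers_target / servers_measured


# $2.1 million on 38,000 servers, projected onto 155,000 servers.
projected = scale_savings(2_100_000, 38_000, 155_000)
print(f"${projected / 1e6:.2f} million")
```

Under this assumption the projection lands just above $8.5 million, in line with the number on the slide.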
22
More servers and space available = better performance
23
GreenHDFS is a policy-driven, self-adaptive variant of HDFS.
It relies on data-classification-driven data placement, which yields significant periods of idleness on a subset of servers.
It categorizes files into 2 zones: Hot and Cold.
It applies sets of policies to classify files into Hot and Cold.
24
Energy consumption was reduced by 24%, saving $2.1 million for 38,000 servers at that time. Today the savings could be more than $8.5 million.
Storage efficiency also increased, since dormant files are moved to the Cold Zone.
More space and better utilization of the Hot Zone lead to better performance for HDFS/MapReduce.
25
http://www.cs.odu.edu/~mukka/cs775s11/Presentations/papers/kaushik.pdf
http://images.google.com/
http://cloudera.com/
http://hadoop.apache.org/