BALANCED DATA LAYOUT IN HADOOP
CPS 216
Kyungmin (Jason) Lee, Ke (Jessie) Xu, Weiping Zhang
Background
How data is stored on HDFS affects Hadoop MapReduce performance
Map phase: performance drops when a mapper must fetch its input data from a remote node across the network
Imbalance during a MapReduce workflow (output of one job used as input to the next) makes the problem even worse
Project goal: minimize the need to fetch Map input across the network by balancing input data across nodes
Previous Work
Reactive solution: HDFS Rebalancer
  Algorithm that rebalances the data layout in HDFS based on storage utilization
  Reacts to an imbalance that already exists; we would like a way to prevent it altogether
Proactive solution: RR (round robin) Block Placement Policy
  On HDFS writes, choose the target node in round-robin fashion, so data is guaranteed to be balanced
  Unnecessary writes across the network? Can we do better?
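To make the round-robin concern concrete, here is a minimal sketch, not the actual RR policy implementation. The function names and the four node names are hypothetical, and it assumes a single writer node and one replica per block. It shows that round robin spreads blocks evenly but sends most of them off the writer's node.

```python
from itertools import cycle


def round_robin_targets(node_names, num_blocks):
    """Assign each block to the next node in a fixed rotation."""
    ring = cycle(node_names)
    return [next(ring) for _ in range(num_blocks)]


def remote_fraction(targets, writer):
    """Fraction of block writes that leave the writer's own node."""
    if not targets:
        return 0.0
    return sum(t != writer for t in targets) / len(targets)


# Hypothetical 4-node cluster; node "n1" writes 8 blocks.
nodes = ["n1", "n2", "n3", "n4"]
targets = round_robin_targets(nodes, 8)
print(targets)
print(remote_fraction(targets, "n1"))
```

Under these assumptions, every node receives the same number of blocks, but three out of four writes cross the network. That cost is what a more locality-aware policy tries to avoid.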
Balanced Block Placement Policy
Do writes ‘greedily’ as long as the cluster is ‘fairly balanced’
  ‘Greedily’ = prioritize target nodes by location: local node > node on the same rack > remote node
  ‘Fairly balanced’ = HDFS usage of all nodes falls within a specified range (windowSize)
Algorithm:
  Sort live nodes by HDFS space used; threshold = max - windowSize (max = usage of the most utilized node)
  1st replica: write to the local node if it is below the threshold or if all nodes are above the threshold; otherwise write to the least utilized node
  2nd replica: least utilized node on a different rack (if possible) from the 1st replica
  3rd replica and beyond: least utilized remaining node
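The algorithm above can be sketched as follows. This is an illustrative Python model, not the project's Hadoop implementation. The `Node` class, the `choose_targets` function and the sample cluster are hypothetical. It also assumes that "below threshold" means strictly less than, that "above threshold" includes equality, and that ties are broken by node name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rack: str
    used: int  # HDFS space used on this node (any consistent unit)


def choose_targets(nodes, local_name, replication, window_size):
    """Pick replica targets following the balanced placement rules."""
    # Sort live nodes by HDFS space used (name breaks ties: an assumption).
    ranked = sorted(nodes, key=lambda n: (n.used, n.name))
    threshold = max(n.used for n in nodes) - window_size
    local = next((n for n in nodes if n.name == local_name), None)
    # "Fairly balanced": no node sits below the threshold.
    balanced = all(n.used >= threshold for n in nodes)

    # 1st replica: stay local when that does not hurt balance.
    if local is not None and (local.used < threshold or balanced):
        first = local
    else:
        first = ranked[0]
    targets = [first]
    remaining = [n for n in ranked if n != first]

    # 2nd replica: least utilized node on a different rack, if one exists.
    if replication >= 2 and remaining:
        off_rack = [n for n in remaining if n.rack != first.rack]
        second = off_rack[0] if off_rack else remaining[0]
        targets.append(second)
        remaining.remove(second)

    # 3rd replica and beyond: least utilized remaining nodes.
    targets.extend(remaining[:max(0, replication - 2)])
    return [n.name for n in targets]


cluster = [
    Node("a", "r1", 100), Node("b", "r1", 40),
    Node("c", "r2", 60), Node("d", "r2", 90),
]
# Narrow window: local node "a" is over the threshold, so the write moves.
print(choose_targets(cluster, "a", 3, window_size=30))
# Wide window: the cluster counts as balanced, so the write stays local.
print(choose_targets(cluster, "a", 3, window_size=100))
```

With the narrow window, the first replica goes to the least utilized node instead of the writer. With the wide window, the same write stays on the local node. The sketch is a small way to see how windowSize trades locality against balance.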
Test Workloads
4-node cluster
Default Policy (DP) vs. Balanced Policy (BP)
2 MapReduce jobs
  Balanced Sort (each reducer has approximately the same output size)
  Skewed Sort (skewed reducer output sizes)
2 workloads
  Single run, varying the number of reducers (1, 2, 4, 10, 12)
  Cascaded workflow: 3 sorts in series, reducers = 10
Other parameters
  RF (replication factor): 1 and 3
  Speculative execution on and off (SE vs. NSE)
Metric: standard deviation of the amount of data written to each node
  Higher StdDev implies more imbalance
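A minimal sketch of the imbalance metric, using made-up per-node byte counts rather than measured results. The function name `write_imbalance` is hypothetical. The slides do not say whether the population or the sample standard deviation was used, so the population form is assumed here.

```python
from statistics import pstdev


def write_imbalance(bytes_per_node):
    """Population standard deviation of data written to each node."""
    return pstdev(bytes_per_node.values())


# Illustrative numbers only: the same 100 units written two different ways.
even = {"n1": 25, "n2": 25, "n3": 25, "n4": 25}
skewed = {"n1": 100, "n2": 0, "n3": 0, "n4": 0}

print(write_imbalance(even))    # perfectly even layout
print(write_imbalance(skewed))  # everything on one node
```

An even layout gives a standard deviation of zero. Concentrating the same total on one node gives a large value, which is why a higher StdDev is read as more imbalance.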
Quick Demo on Amazon EC2
Balanced Sort Single Run
DP very skewed for reducers < 4, as expected
Otherwise both policies are fairly balanced, as expected
Skewed Sort Single Run
DP significantly worse than BP
RF3 shows better balance than RF1
Disabling SE improves balance in BP
Balanced Sort Workflow
Skewed Sort Workflow
Performance
No significant overhead or improvement observed
Speculative Execution
Hadoop performance feature that runs the same task on 2 nodes concurrently, uses the output of the task that completes first, and discards the other
Usually occurs toward the end of a job, leading to unintended data imbalance under the balanced policy
Turning off speculative execution improved data balance, but in practice we would like to keep this feature on for the performance boost
Our policy is too greedy; the effect would be smaller if each node wrote approximately equally to all nodes, as in round robin
Hybrid policy, where some nodes run round robin and some run the balanced policy?
  Tradeoff between balance and network traffic?
Future Considerations
Current implementation assumes data will be balanced throughout the cluster’s lifetime
  What if some nodes are down for a period of time and data becomes imbalanced?
Data output per job should be spread evenly, vs. overall data layout spread evenly
  Needs additional knowledge of which job each write belongs to
Effect of window size on balance/performance?
  Unable to test due to insufficient funds
Conclusion
Implemented a new block placement policy that focuses on maintaining data balance while keeping writes local as much as possible
Test data showed success at maintaining data balance
  Greatest improvements with skewed outputs
Performance not affected; we would expect an improvement for skewed datasets given the reduction in network usage
  Only tested on a small cluster with small datasets
  Should be more effective on large datasets
Performance weakened by speculative execution
  In practice, the policy should be tuned to get the best performance results