1
Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters
Jiong Xie, Ph.D. Student
April 2010
2
Presentation Outline
– Background
– Motivation
– Related Work
– Design and Implementation
– Experimental Results
– Conclusion/Future Work
3
Background
– The MapReduce programming model is growing in popularity
– Hadoop is used by Yahoo, Facebook, and Amazon
4
Data-Intensive Applications
– Bioinformatics
– Weather forecast
– Astronautics
– Medical science
5
Hadoop Overview
(J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI ’04, pages 137–150, 2004)
6
Hadoop Distributed File System
(http://lucene.apache.org/hadoop)
7
Motivational Example
[Figure: timeline (Time, min) for three nodes]
– Node A (fast): 1 task/min
– Node B (slow): 2x slower
– Node C (slowest): 3x slower
8
The Native Strategy
[Figure: per-node timeline (Time, min) for Node A, Node B, and Node C, showing the Loading, Transferring, and Processing phases; task labels: 3 tasks, 2 tasks, 6 tasks]
9
Our Solution: Reducing data transfer time
[Figure: per-node timeline (Time, min) showing the Loading, Transferring, and Processing phases; node labels: Node A’, Node B’, Node C’, Node A; task labels: 3 tasks, 2 tasks, 6 tasks]
10
Preliminary Results
Impact of data placement on the performance of Grep
11
Challenges
– Does the computing ratio depend on the application?
– Initial data distribution
– Data skew problem
  – New data arrival
  – Data deletion
  – New joining node
  – Data updating
12
Measure Computing Ratios
– Computing ratio
– Fast machines process large data sets
[Figure: timeline (Time) for Node A (1 task/min), Node B (2x slower), and Node C (3x slower)]
13
Steps to Measure Computing Ratios
1. Run the application on each node individually with the same amount of data and collect the response time
2. Set the ratio of the node with the shortest response time to 1, and set the ratios of the other nodes accordingly
3. Calculate the least common multiple of these ratios
4. Compute the portion of each node: the least common multiple divided by the node's ratio

Node     Response time (s)   Ratio   # of File Fragments   Speed
Node A   10                  1       6                     Fastest
Node B   20                  2       3                     Average
Node C   30                  3       2                     Slowest
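The four steps can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the talk: the function name `fragments_per_node` is hypothetical, and the sketch assumes that every response time is a whole-number multiple of the fastest one, as in the table above.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def fragments_per_node(response_times):
    """Map each node to its number of file fragments.

    response_times: dict of node name -> measured response time in seconds
    (step 1, measured beforehand with the same input size on every node).
    """
    fastest = min(response_times.values())
    ratios = {}
    for node, seconds in response_times.items():
        # Assumption of this sketch: ratios are whole numbers.
        if seconds % fastest != 0:
            raise ValueError("this sketch assumes whole-number ratios")
        # Step 2: the shortest response time gets ratio 1.
        ratios[node] = seconds // fastest
    # Step 3: least common multiple of all ratios.
    common = reduce(lcm, ratios.values())
    # Step 4: a node's portion is the LCM divided by its ratio.
    return {node: common // ratio for node, ratio in ratios.items()}


times = {"Node A": 10, "Node B": 20, "Node C": 30}
print(fragments_per_node(times))
# {'Node A': 6, 'Node B': 3, 'Node C': 2}
```

With the response times from the table, the ratios are 1, 2, and 3, their least common multiple is 6, and the portions come out as 6, 3, and 2 fragments.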
14
Initial Data Distribution
– Input files are split into 64 MB blocks
– Round-robin data distribution algorithm
– Portion 3:2:1
[Figure: the Namenode distributes the blocks of File1 (1–9, a–c) to Datanodes A, B, and C]
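One way to picture a round-robin distribution that respects a portion such as 3:2:1 is sketched below. It is an illustration only and does not reproduce the HDFS code from the talk: the function name `distribute` is hypothetical, and the assignment of the portion 3:2:1 to nodes A, B, and C in that order is an assumption.

```python
from itertools import cycle


def distribute(blocks, portions):
    """Assign blocks to nodes round-robin, weighted by each node's portion.

    blocks: list of block identifiers, in file order
    portions: dict of node name -> integer portion (assumed mapping)
    """
    # One round of the schedule: each node appears as often as its portion.
    schedule = [node for node, share in portions.items() for _ in range(share)]
    placement = {node: [] for node in portions}
    # Walk the blocks while cycling through the schedule.
    for block, node in zip(blocks, cycle(schedule)):
        placement[node].append(block)
    return placement


# Twelve blocks labelled as on the slide: 1-9, then a-c.
blocks = list("123456789abc")
placement = distribute(blocks, {"A": 3, "B": 2, "C": 1})
for node, held in placement.items():
    print(node, len(held), held)
```

For twelve blocks and the portion 3:2:1, the three nodes end up holding 6, 4, and 2 blocks respectively.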
15
Data Redistribution
1. Get the network topology, the ratio, and the utilization
2. Build and sort two lists: the under-utilized node list L1 and the over-utilized node list L2
3. Select the source and destination nodes from the lists
4. Transfer data
5. Repeat steps 3 and 4 until the list is empty
[Figure: the Namenode moves blocks (1–9, a–c) among nodes A, B, and C using lists L1 and L2; Portion 3:2:1]
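The redistribution loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the mechanism implemented in the talk: the function name `redistribute` is hypothetical, network topology is ignored, utilization is measured as the number of blocks held relative to the node's portion, and the loop stops as soon as either list is empty.

```python
def redistribute(held, portions):
    """Move blocks from over-utilized to under-utilized nodes.

    held: dict of node name -> list of blocks (modified in place)
    portions: dict of node name -> integer portion
    Returns the list of moves as (block, source, destination) tuples.
    """
    total_blocks = sum(len(blocks) for blocks in held.values())
    total_share = sum(portions.values())
    # Target block count per node (rounded down if it does not divide evenly).
    target = {n: total_blocks * portions[n] // total_share for n in held}
    moves = []
    while True:
        # L1: under-utilized nodes, largest deficit first.
        under = sorted((n for n in held if len(held[n]) < target[n]),
                       key=lambda n: len(held[n]) - target[n])
        # L2: over-utilized nodes, largest surplus first.
        over = sorted((n for n in held if len(held[n]) > target[n]),
                      key=lambda n: target[n] - len(held[n]))
        if not under or not over:
            break
        src, dst = over[0], under[0]
        block = held[src].pop()       # pick a block on the source node
        held[dst].append(block)       # "transfer" it to the destination
        moves.append((block, src, dst))
    return moves


# Start from an even split and rebalance toward the portion 3:2:1.
held = {"A": list("1234"), "B": list("5678"), "C": list("9abc")}
moves = redistribute(held, {"A": 3, "B": 2, "C": 1})
print(moves)
print({node: len(blocks) for node, blocks in held.items()})
```

Starting from four blocks per node, the sketch moves two blocks from node C to node A, which leaves 6, 4, and 2 blocks on A, B, and C.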
16
Sharing Files among Multiple Applications
– The computing ratio depends on data-intensive applications
  – Redistribution
  – Redundancy
17
Experimental Environment
Five nodes in a heterogeneous Hadoop cluster

Node     CPU Model          CPU (Hz)    L1 Cache (KB)
Node A   Intel Core 2 Duo   2*1G=2G     204
Node B   Intel Celeron      2.8G        256
Node C   Intel Pentium 3    1.2G        256
Node D   Intel Pentium 3    1.2G        256
Node E   Intel Pentium 3    1.2G        256
18
Grep and WordCount
– Grep is a tool that searches for a regular expression in a text file
– WordCount is a program that counts the words in a text file
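What the two benchmarks compute can be shown with a single-machine sketch. This is plain Python for illustration and is not the Hadoop MapReduce code used in the experiments; the function names `grep` and `word_count` and the sample text are hypothetical.

```python
import re
from collections import Counter


def grep(lines, pattern):
    """Return the lines that contain a match for the regular expression."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def word_count(lines):
    """Count how often each whitespace-separated word occurs."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


text = ["hadoop stores data in blocks", "fast nodes get more blocks"]
print(grep(text, r"fast|slow"))     # ['fast nodes get more blocks']
print(word_count(text)["blocks"])   # 2
```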
19
Computing ratio for two applications
Computing ratios of the five nodes with respect to the Grep and WordCount applications

Computing Node   Ratios for Grep   Ratios for WordCount
Node A           1                 1
Node B           2                 2
Node C           3.3               5
Node D           3.3               5
Node E           3.3               5
20
Response time of Grep and WordCount on each node
– Application dependence
– Data size independence
21
Six Data Placement Decisions
22
Impact of data placement on performance of Grep
23
Impact of data placement on performance of WordCount
24
Conclusion
– Identified the performance degradation caused by heterogeneity
– Designed and implemented a data placement mechanism in HDFS
25
Future Work
– Data redundancy issue
– Dynamic data distribution mechanism
– Prefetching
26
Thanks
27
Questions?