Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters Jiong Xie Ph.D. Student April 2010.

Presentation Outline
– Background
– Motivation
– Related Work
– Design and Implementation
– Experimental Results
– Conclusion/Future Work

Background
– The MapReduce programming model is growing in popularity.
– Hadoop is used by Yahoo, Facebook, and Amazon.

Data-Intensive Applications
– Bioinformatics
– Weather forecasting
– Astronautics
– Medical science

Hadoop Overview (J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI ’04, pages 137–150, 2004)

Hadoop Distributed File System (http://lucene.apache.org/hadoop)

Motivational Example
– Node A (fast): 1 task/min
– Node B (slow): 2x slower
– Node C (slowest): 3x slower
[Figure: per-node processing timeline; x-axis: Time (min)]

The Native Strategy
[Figure: timeline for Node A, Node B, and Node C showing the loading, transferring, and processing phases; task labels: 3 tasks, 2 tasks, 6 tasks; x-axis: Time (min)]

Our Solution: Reducing Data Transfer Time
[Figure: timeline for Node A’, Node B’, and Node C’ showing the loading, transferring, and processing phases; task labels: 3 tasks, 2 tasks, 6 tasks; x-axis: Time (min)]

Preliminary Results: Impact of data placement on the performance of Grep

Challenges
– Does the computing ratio depend on the application?
– Initial data distribution
– Data skew problem
  – New data arrival
  – Data deletion
  – New joining node
  – Data updating

Measure Computing Ratios
– Computing ratio
– Fast machines process large data sets
[Figure: per-node timeline for Node A (1 task/min), Node B (2x slower), and Node C (3x slower); x-axis: Time]

Steps to Measure Computing Ratios
Node    Response time (s)  Ratio  # of file fragments  Speed
Node A  10                 1      6                    Fastest
Node B  20                 2      3                    Average
Node C  30                 3      2                    Slowest
1. Run the application on each node individually with the same amount of data, and collect the response time.
2. Set the ratio of the node with the shortest response time to 1, and set the ratios of the other nodes accordingly.
3. Calculate the least common multiple of these ratios.
4. Compute each node's portion by dividing the least common multiple by that node's ratio.
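The four steps on this slide can be sketched in a few lines of Python. This is only an illustration using the response times from the table; the function name `fragment_portions` is hypothetical, and rounding each ratio to an integer is an assumption made so that a least common multiple exists.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def fragment_portions(response_times):
    """Map each node to its number of file fragments.

    response_times: dict of node name -> response time in seconds,
    measured with the same input size on every node (step 1).
    """
    fastest = min(response_times.values())
    # Step 2: the fastest node gets ratio 1; others scale accordingly.
    # Assumption: ratios are rounded to integers.
    ratios = {node: round(t / fastest) for node, t in response_times.items()}
    # Step 3: least common multiple of all ratios.
    multiple = reduce(lcm, ratios.values())
    # Step 4: each node's portion is the multiple divided by its ratio.
    return {node: multiple // r for node, r in ratios.items()}


portions = fragment_portions({"Node A": 10, "Node B": 20, "Node C": 30})
print(portions)
```

With the table's response times (10 s, 20 s, 30 s), the ratios are 1, 2, and 3, the least common multiple is 6, and the portions come out as 6, 3, and 2 fragments, matching the table.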

Initial Data Distribution
– Input files are split into 64 MB blocks
– Round-robin data distribution algorithm
– Portion 3:2:1
[Figure: the Namenode assigns the blocks of File1 (1–9, a–c) to Datanodes A, B, and C]
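One way to picture a round-robin distribution that respects a 3:2:1 portion is the sketch below. It is not taken from the talk's implementation: the function name `distribute_blocks` is hypothetical, and assigning the portions 3, 2, and 1 to nodes A, B, and C in that order is an assumption.

```python
from itertools import cycle


def distribute_blocks(blocks, portions):
    """Assign blocks to nodes in a weighted round-robin order.

    portions: dict of node name -> integer share (assumed A=3, B=2, C=1).
    """
    # Expand the portions into one repeating schedule: A A A B B C ...
    schedule = [node for node, share in portions.items() for _ in range(share)]
    placement = {node: [] for node in portions}
    # Walk the blocks in order, cycling through the schedule.
    for block, node in zip(blocks, cycle(schedule)):
        placement[node].append(block)
    return placement


# The twelve blocks shown on the slide: 1-9, then a, b, c.
blocks = list("123456789abc")
placement = distribute_blocks(blocks, {"A": 3, "B": 2, "C": 1})
for node, assigned in placement.items():
    print(node, assigned)
```

Under these assumptions the twelve blocks split 6, 4, and 2 across the three nodes, which preserves the 3:2:1 portion.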

Data Redistribution
1. Get the network topology, the ratio, and the utilization.
2. Build and sort two lists: the under-utilized node list (L1) and the over-utilized node list (L2).
3. Select the source and destination nodes from the lists.
4. Transfer the data.
5. Repeat steps 3 and 4 until the list is empty.
– Portion 3:2:1
[Figure: the Namenode moves blocks (1–9, a–c) among nodes A, B, and C according to lists L1 and L2]
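The redistribution steps can be sketched as a small planning function. This is a simplified illustration, not the mechanism implemented in HDFS: the name `plan_moves` is hypothetical, network topology is ignored, utilization is reduced to a block count per node, and the loop stops as soon as either list is empty.

```python
def plan_moves(current, target):
    """Plan block transfers from over-utilized to under-utilized nodes.

    current, target: dict of node name -> number of blocks.
    Returns a list of (source, destination, block_count) tuples.
    """
    # Step 2: build and sort the two lists, largest imbalance first.
    over = sorted(
        ([current[n] - target[n], n] for n in current if current[n] > target[n]),
        reverse=True,
    )
    under = sorted(
        ([target[n] - current[n], n] for n in current if current[n] < target[n]),
        reverse=True,
    )
    moves = []
    # Step 5: repeat steps 3 and 4 until a list is exhausted.
    while over and under:
        src, dst = over[0], under[0]           # Step 3: pick source and destination.
        count = min(src[0], dst[0])
        moves.append((src[1], dst[1], count))  # Step 4: record the transfer.
        src[0] -= count
        dst[0] -= count
        if src[0] == 0:
            over.pop(0)
        if dst[0] == 0:
            under.pop(0)
    return moves


# Twelve blocks spread evenly, rebalanced toward a 3:2:1 portion (6, 4, 2).
moves = plan_moves({"A": 4, "B": 4, "C": 4}, {"A": 6, "B": 4, "C": 2})
print(moves)
```

In this example node C holds two blocks more than its target and node A two fewer, so the plan is a single transfer of two blocks from C to A.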

Sharing Files among Multiple Applications
– The computing ratio depends on the data-intensive application.
  – Redistribution
  – Redundancy

Experimental Environment
Five nodes in a heterogeneous Hadoop cluster
Node    CPU model         CPU (Hz)  L1 cache (KB)
Node A  Intel Core 2 Duo  2*1G=2G   204
Node B  Intel Celeron     2.8G      256
Node C  Intel Pentium 3   1.2G      256
Node D  Intel Pentium 3   1.2G      256
Node E  Intel Pentium 3   1.2G      256

Grep and WordCount
– Grep is a tool that searches for a regular expression in a text file.
– WordCount is a program that counts the words in a text file.
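To make the two benchmarks concrete, here is a single-process Python sketch of what each one computes. It is not the Hadoop implementation of either job; the function names and the sample lines are made up for illustration.

```python
import re
from collections import Counter


def grep(pattern, lines):
    """Return the lines that contain a match for the regular expression."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def word_count(lines):
    """Count how often each whitespace-separated word occurs."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


sample = ["hadoop stores data", "mapreduce processes data", "no match here"]
matches = grep(r"data$", sample)
counts = word_count(sample)
print(matches)
print(counts["data"])
```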

Computing Ratio for Two Applications
Computing ratios of the five nodes with respect to the Grep and WordCount applications
Computing node  Ratio for Grep  Ratio for WordCount
Node A          1               1
Node B          2               2
Node C          3.3             5
Node D          3.3             5
Node E          3.3             5
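Because the Grep ratio for Nodes C, D, and E is not an integer, a least common multiple does not apply directly. One possible way to turn such ratios into data shares, shown here purely as an assumption and not as the method used in the talk, is to make each node's share proportional to the inverse of its ratio. The function name `data_shares` is hypothetical.

```python
from math import isclose


def data_shares(ratios):
    """Return each node's fraction of the data, proportional to 1 / ratio.

    Assumption: a node that is k times slower should hold 1/k as much data.
    """
    speeds = {node: 1.0 / ratio for node, ratio in ratios.items()}
    total = sum(speeds.values())
    return {node: speed / total for node, speed in speeds.items()}


# Ratios from the table above.
grep_ratios = {"A": 1, "B": 2, "C": 3.3, "D": 3.3, "E": 3.3}
wordcount_ratios = {"A": 1, "B": 2, "C": 5, "D": 5, "E": 5}

grep_shares = data_shares(grep_ratios)
wordcount_shares = data_shares(wordcount_ratios)
for node in grep_ratios:
    print(node, round(grep_shares[node], 3), round(wordcount_shares[node], 3))
```

Under this assumption Node A always holds twice the share of Node B, and the slower nodes hold a smaller share for WordCount than for Grep, which reflects the application dependence of the ratios.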

Response Time of Grep and WordCount on Each Node
– Application dependence
– Data size independence

Six Data Placement Decisions

Impact of data placement on the performance of Grep

Impact of data placement on the performance of WordCount

Conclusion
– Identified the performance degradation caused by heterogeneity.
– Designed and implemented a data placement mechanism in HDFS.

Future Work
– Data redundancy issue
– Dynamic data distribution mechanism
– Prefetching

Thanks

Questions?
