Slide 1: HJ-Hadoop: An Optimized MapReduce Runtime for Multi-core Systems
Yunming Zhang, Rice University. Advised by Prof. Alan Cox and Prof. Vivek Sarkar.
Slide 2: Social Informatics
Slide borrowed from Prof. Geoffrey Fox's presentation.
Slide 3: Big Data Era
Data volumes at large web companies, 2008-2012: 20 PB processed per day, 100 PB of media, a 120 PB cluster, Bing at ~300 PB. Slide borrowed from Florin Dinu's presentation.
Slide 4: MapReduce Runtime
Figure 1: the MapReduce programming model. A job starts, a set of map tasks runs over the input, reduce tasks aggregate the map outputs, and the job ends.
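The map-then-reduce flow in Figure 1 can be sketched as a minimal in-memory word count. This is plain Java with no Hadoop dependency; the class and method names are illustrative, not Hadoop's API:

```java
import java.util.*;
import java.util.stream.*;

public class MiniMapReduce {
    // Map phase: emit a (word, 1) pair for every word in an input line.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.split("\\s+"))
                     .map(w -> Map.entry(w, 1));
    }

    // Reduce phase: group pairs by key and sum the counts for each key.
    static Map<String, Integer> reduce(Stream<Map.Entry<String, Integer>> pairs) {
        return pairs.collect(Collectors.toMap(Map.Entry::getKey,
                                              Map.Entry::getValue,
                                              Integer::sum));
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data", "big cluster");
        Map<String, Integer> counts =
            reduce(input.stream().flatMap(MiniMapReduce::map));
        System.out.println(counts.get("big")); // prints 2
    }
}
```

In real Hadoop the map and reduce phases run as separate tasks on many machines, with a shuffle in between; this sketch only shows the data flow.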
Slide 5: Hadoop MapReduce
Open-source implementation of the MapReduce runtime system:
– Scalable
– Reliable
– Available
A popular platform for big data analytics.
Slide 6: Habanero Java (HJ)
Programming language and runtime developed at Rice University, optimized for multi-core systems:
– Lightweight async tasks
– Work-sharing runtime
– Dynamic task parallelism
http://habanero.rice.edu
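HJ's lightweight tasks follow an async/finish model: spawn child tasks ("async") and wait for all of them at a synchronization point ("finish"). A rough plain-Java analogy using a work-stealing executor (this is not HJ's actual API, just the shape of the model):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;

public class AsyncFinishSketch {
    // Spawn two parallel tasks ("async") over halves of the array, then
    // join both ("finish") before returning the combined result.
    public static int parallelSum(int[] data) throws Exception {
        ExecutorService pool = Executors.newWorkStealingPool(); // work-sharing/stealing pool
        AtomicInteger sum = new AtomicInteger();
        int mid = data.length / 2;
        Future<?> left = pool.submit(() -> {
            for (int i = 0; i < mid; i++) sum.addAndGet(data[i]);
        });
        Future<?> right = pool.submit(() -> {
            for (int i = mid; i < data.length; i++) sum.addAndGet(data[i]);
        });
        left.get();   // "finish": wait for all async tasks
        right.get();
        pool.shutdown();
        return sum.get();
    }
}
```

HJ tasks are much lighter than threads, so the runtime can create one per small unit of work and let the scheduler balance load across cores.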
Slide 7: Kmeans
Slide 8: Kmeans
Kmeans takes a large number of documents as input and tries to classify them into different topics: the to-be-classified documents and the cluster centroids go in, topic assignments come out. (The slide reuses a join-table diagram: on every machine, each map task runs in its own JVM and looks up keys from its input slice in a full in-memory lookup table, so the table is duplicated 1x per task; reducers assemble the join table from the emitted pairs.)
Slide 9: Kmeans using Hadoop
On each machine, every map task runs in its own JVM, and each JVM holds a duplicated in-memory copy of the cluster centroids (1x per task). The map tasks classify the to-be-classified documents into topics.
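The classification step each map task performs is the standard Kmeans assignment: find the nearest centroid for each document vector. A minimal sketch (squared Euclidean distance; names illustrative):

```java
public class KmeansAssign {
    // Return the index of the centroid nearest to the given document vector,
    // using squared Euclidean distance (the square root is monotone, so it
    // can be skipped when only comparing distances).
    static int nearest(double[] doc, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < doc.length; i++) {
                double diff = doc[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }
}
```

The centroids array is the large read-only structure that Hadoop ends up duplicating in every map-task JVM.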
Slide 14: Memory Wall
Chart: for sequential Hadoop we used 8 mappers for centroid sizes of 30-80 MB, 4 mappers for 100-150 MB, and 2 mappers for 180-380 MB.
Slide 15: Memory Wall
Hadoop's approach to the problem:
– Increase the memory available to each map task JVM by reducing the number of map tasks assigned to each machine.
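In Hadoop 1.x, this trade-off is expressed through configuration: fewer map slots per TaskTracker and a larger heap per task JVM. A sketch of the relevant `mapred-site.xml` entries (the values shown are illustrative):

```xml
<!-- mapred-site.xml: trade map-task parallelism for per-task memory -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>   <!-- e.g. down from 8 map slots per machine -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>   <!-- larger heap for each task JVM -->
</property>
```

Every remaining task JVM still holds its own copy of the centroids; the setting only changes how many copies coexist on a machine.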
Slide 16: Kmeans using Hadoop
Running fewer map tasks per machine leaves each JVM with more heap, so a 2x larger set of cluster centroids fits in memory.
Slide 17: Memory Wall
Throughput decreases because fewer map tasks run per machine (same mapper counts as the previous chart: 8, 4, and 2 mappers for increasing centroid sizes in sequential Hadoop).
Slide 18: HJ-Hadoop Approach 1
One map task JVM per machine with no duplicated in-memory cluster centroids: all worker threads share a single copy, so a 4x larger set of centroids fits. Input documents are divided among the threads via dynamic chunking.
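The dynamic chunking in Approach 1 can be sketched with worker threads that claim fixed-size chunks of the input from a shared counter, so faster threads take more chunks while all threads share one in-memory copy of the centroids. This is a plain-Java sketch of the idea, not HJ-Hadoop's implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class DynamicChunking {
    // Classify all documents using nThreads workers that dynamically claim
    // chunkSize-sized ranges of the input. The centroids array is shared
    // (read-only) by every thread: one copy per JVM, not one per task.
    static int[] classifyAll(double[][] docs, double[][] centroids,
                             int nThreads, int chunkSize) throws InterruptedException {
        int[] labels = new int[docs.length];
        AtomicInteger next = new AtomicInteger(0);
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            workers[t] = new Thread(() -> {
                int start;
                while ((start = next.getAndAdd(chunkSize)) < docs.length) {
                    int end = Math.min(start + chunkSize, docs.length);
                    for (int i = start; i < end; i++)
                        labels[i] = nearest(docs[i], centroids);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        return labels;
    }

    // Nearest-centroid assignment by squared Euclidean distance.
    static int nearest(double[] doc, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < doc.length; i++) {
                double diff = doc[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }
}
```

Small chunks keep all cores busy near the end of the input (good load balance) at the cost of slightly more counter traffic.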
Slide 20: Results
We used 2 mappers for HJ-Hadoop.
Slide 21: Results
HJ-Hadoop processes 5x as many topics efficiently.
Slide 22: Results
4x throughput improvement over Hadoop.
Slide 23: HJ-Hadoop Approach 1, revisited
A single map task JVM per machine shares one 4x-sized copy of the cluster centroids across all threads, with dynamic chunking of the input. Limitation: only a single thread reads input.
Slide 25: Kmeans using Hadoop
With multiple map tasks per machine, four threads read input in parallel, but each JVM still duplicates the in-memory cluster centroids (1x each).
Slide 26: HJ-Hadoop Approach 2
Four threads read input in parallel, with no duplicated in-memory cluster centroids.
Slide 28: Trade-offs between the two approaches
Approach 1:
– Minimum memory overhead
– Improved CPU utilization through small task granularity
Approach 2:
– Improved IO performance
– Overlap between IO and computation
Hybrid approach:
– Improved IO with small memory overhead
– Improved CPU utilization
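The IO/computation overlap that Approach 2 and the hybrid exploit can be sketched as a producer-consumer pipeline: a reader thread fills a bounded queue while a compute thread drains it, so input reading and classification proceed concurrently instead of alternating. A plain-Java sketch (names illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class IoComputeOverlap {
    // One reader thread produces records into a bounded queue ("IO") while
    // the caller's thread consumes and processes them ("computation").
    // Returns the total number of characters processed.
    static int process(String[] records) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        final String POISON = "\u0000EOF";   // sentinel marking end of input

        Thread reader = new Thread(() -> {
            try {
                for (String r : records) queue.put(r);   // produce: read input
                queue.put(POISON);                       // signal completion
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();

        int processed = 0;
        String r;
        while (!(r = queue.take()).equals(POISON)) {
            processed += r.length();                     // consume: compute on record
        }
        reader.join();
        return processed;
    }
}
```

The bounded queue is the knob for the hybrid's memory trade-off: a small capacity caps buffering overhead while still keeping the compute threads fed.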
Slide 29: Conclusions
Our goal is to tackle the memory inefficiency of MapReduce applications on multi-core systems by integrating a shared-memory parallel model into the Hadoop MapReduce runtime:
– HJ-Hadoop can solve larger problems efficiently than Hadoop, processing 5x more data at the full throughput of the system.
– HJ-Hadoop delivers 4x the throughput of Hadoop when processing very large in-memory data sets.