Slide 1: HJ-Hadoop: An Optimized MapReduce Runtime for Multi-core Systems
Yunming Zhang, Rice University. Advised by Prof. Alan Cox and Prof. Vivek Sarkar.
Slide 2: Social Informatics
Slide borrowed from Prof. Geoffrey Fox's presentation.
Slide 3: Big Data Era
Data volumes at large web companies, 2008-2012: 20 PB processed per day, 100 PB of media, a 120 PB cluster, Bing at ~300 PB. Slide borrowed from Florin Dinu's presentation.
Slide 4: MapReduce Runtime
Figure 1: the MapReduce programming model. A job starts, a set of map tasks runs over the input, reduce tasks aggregate the map outputs, and the job ends.
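The map-then-reduce flow in Figure 1 can be sketched as a minimal in-memory word count. This is plain Java with no Hadoop dependency; the class and method names are illustrative, not Hadoop's API:

```java
import java.util.*;
import java.util.stream.*;

public class MiniMapReduce {
    // Map phase: emit a (word, 1) pair for every word in an input line.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.split("\\s+"))
                     .map(w -> Map.entry(w, 1));
    }

    // Reduce phase: group pairs by key and sum the counts for each key.
    static Map<String, Integer> reduce(Stream<Map.Entry<String, Integer>> pairs) {
        return pairs.collect(Collectors.toMap(Map.Entry::getKey,
                                              Map.Entry::getValue,
                                              Integer::sum));
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data", "big cluster");
        Map<String, Integer> counts =
            reduce(input.stream().flatMap(MiniMapReduce::map));
        System.out.println(counts.get("big")); // prints 2
    }
}
```

In real Hadoop the map and reduce phases run as separate tasks on many machines, with a shuffle in between; this sketch only shows the data flow.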
Slide 5: Hadoop MapReduce
Open-source implementation of the MapReduce runtime system:
– Scalable
– Reliable
– Available
A popular platform for big data analytics.
Slide 6: Habanero Java (HJ)
Programming language and runtime developed at Rice University, optimized for multi-core systems:
– Lightweight async tasks
– Work-sharing runtime
– Dynamic task parallelism
http://habanero.rice.edu
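HJ's lightweight tasks follow an async/finish model: spawn child tasks ("async") and wait for all of them at a synchronization point ("finish"). A rough plain-Java analogy using a work-stealing executor (this is not HJ's actual API, just the shape of the model):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;

public class AsyncFinishSketch {
    // Spawn two parallel tasks ("async") over halves of the array, then
    // join both ("finish") before returning the combined result.
    public static int parallelSum(int[] data) throws Exception {
        ExecutorService pool = Executors.newWorkStealingPool(); // work-sharing/stealing pool
        AtomicInteger sum = new AtomicInteger();
        int mid = data.length / 2;
        Future<?> left = pool.submit(() -> {
            for (int i = 0; i < mid; i++) sum.addAndGet(data[i]);
        });
        Future<?> right = pool.submit(() -> {
            for (int i = mid; i < data.length; i++) sum.addAndGet(data[i]);
        });
        left.get();   // "finish": wait for all async tasks
        right.get();
        pool.shutdown();
        return sum.get();
    }
}
```

HJ tasks are much lighter than threads, so the runtime can create one per small unit of work and let the scheduler balance load across cores.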
Slide 7: Kmeans
Slide 8: Kmeans
Kmeans takes a large number of documents as input and tries to classify them into different topics: the to-be-classified documents and the cluster centroids go in, topic assignments come out. (The slide reuses a join-table diagram: on every machine, each map task runs in its own JVM and looks up keys from its input slice in a full in-memory lookup table, so the table is duplicated 1x per task; reducers assemble the join table from the emitted pairs.)
Slide 9: Kmeans using Hadoop
On each machine, every map task runs in its own JVM, and each JVM holds a duplicated in-memory copy of the cluster centroids (1x per task). The map tasks classify the to-be-classified documents into topics.
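The classification step each map task performs is the standard Kmeans assignment: find the nearest centroid for each document vector. A minimal sketch (squared Euclidean distance; names illustrative):

```java
public class KmeansAssign {
    // Return the index of the centroid nearest to the given document vector,
    // using squared Euclidean distance (the square root is monotone, so it
    // can be skipped when only comparing distances).
    static int nearest(double[] doc, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < doc.length; i++) {
                double diff = doc[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }
}
```

The centroids array is the large read-only structure that Hadoop ends up duplicating in every map-task JVM.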
Slide 14: Memory Wall
Chart: for sequential Hadoop we used 8 mappers for centroid sizes of 30-80 MB, 4 mappers for 100-150 MB, and 2 mappers for 180-380 MB.
Slide 15: Memory Wall
Hadoop's approach to the problem:
– Increase the memory available to each map task JVM by reducing the number of map tasks assigned to each machine.
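In Hadoop 1.x, this trade-off is expressed through configuration: fewer map slots per TaskTracker and a larger heap per task JVM. A sketch of the relevant `mapred-site.xml` entries (the values shown are illustrative):

```xml
<!-- mapred-site.xml: trade map-task parallelism for per-task memory -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>   <!-- e.g. down from 8 map slots per machine -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>   <!-- larger heap for each task JVM -->
</property>
```

Every remaining task JVM still holds its own copy of the centroids; the setting only changes how many copies coexist on a machine.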
Slide 16: Kmeans using Hadoop
Running fewer map tasks per machine leaves each JVM with more heap, so a 2x larger set of cluster centroids fits in memory.
Slide 17: Memory Wall
Throughput decreases because fewer map tasks run per machine (same mapper counts as the previous chart: 8, 4, and 2 mappers for increasing centroid sizes in sequential Hadoop).
Slide 18: HJ-Hadoop Approach 1
One map task JVM per machine with no duplicated in-memory cluster centroids: all worker threads share a single copy, so a 4x larger set of centroids fits. Input documents are divided among the threads via dynamic chunking.
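The dynamic chunking in Approach 1 can be sketched with worker threads that claim fixed-size chunks of the input from a shared counter, so faster threads take more chunks while all threads share one in-memory copy of the centroids. This is a plain-Java sketch of the idea, not HJ-Hadoop's implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class DynamicChunking {
    // Classify all documents using nThreads workers that dynamically claim
    // chunkSize-sized ranges of the input. The centroids array is shared
    // (read-only) by every thread: one copy per JVM, not one per task.
    static int[] classifyAll(double[][] docs, double[][] centroids,
                             int nThreads, int chunkSize) throws InterruptedException {
        int[] labels = new int[docs.length];
        AtomicInteger next = new AtomicInteger(0);
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            workers[t] = new Thread(() -> {
                int start;
                while ((start = next.getAndAdd(chunkSize)) < docs.length) {
                    int end = Math.min(start + chunkSize, docs.length);
                    for (int i = start; i < end; i++)
                        labels[i] = nearest(docs[i], centroids);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        return labels;
    }

    // Nearest-centroid assignment by squared Euclidean distance.
    static int nearest(double[] doc, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < doc.length; i++) {
                double diff = doc[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }
}
```

Small chunks keep all cores busy near the end of the input (good load balance) at the cost of slightly more counter traffic.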
Slide 20: Results
We used 2 mappers for HJ-Hadoop.
Slide 21: Results
HJ-Hadoop processes 5x as many topics efficiently.
Slide 22: Results
4x throughput improvement over Hadoop.
Slide 23: HJ-Hadoop Approach 1, revisited
A single map task JVM per machine shares one 4x-sized copy of the cluster centroids across all threads, with dynamic chunking of the input. Limitation: only a single thread reads input.
Slide 25: Kmeans using Hadoop
With multiple map tasks per machine, four threads read input in parallel, but each JVM still duplicates the in-memory cluster centroids (1x each).
Slide 26: HJ-Hadoop Approach 2
Four threads read input in parallel, with no duplicated in-memory cluster centroids.
Slide 28: Trade-offs between the two approaches
Approach 1:
– Minimum memory overhead
– Improved CPU utilization through small task granularity
Approach 2:
– Improved IO performance
– Overlap between IO and computation
Hybrid approach:
– Improved IO with small memory overhead
– Improved CPU utilization
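The IO/computation overlap that Approach 2 and the hybrid exploit can be sketched as a producer-consumer pipeline: a reader thread fills a bounded queue while a compute thread drains it, so input reading and classification proceed concurrently instead of alternating. A plain-Java sketch (names illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class IoComputeOverlap {
    // One reader thread produces records into a bounded queue ("IO") while
    // the caller's thread consumes and processes them ("computation").
    // Returns the total number of characters processed.
    static int process(String[] records) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        final String POISON = "\u0000EOF";   // sentinel marking end of input

        Thread reader = new Thread(() -> {
            try {
                for (String r : records) queue.put(r);   // produce: read input
                queue.put(POISON);                       // signal completion
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();

        int processed = 0;
        String r;
        while (!(r = queue.take()).equals(POISON)) {
            processed += r.length();                     // consume: compute on record
        }
        reader.join();
        return processed;
    }
}
```

The bounded queue is the knob for the hybrid's memory trade-off: a small capacity caps buffering overhead while still keeping the compute threads fed.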
Slide 29: Conclusions
Our goal is to tackle the memory inefficiency of MapReduce applications on multi-core systems by integrating a shared-memory parallel model into the Hadoop MapReduce runtime:
– HJ-Hadoop can solve larger problems efficiently than Hadoop, processing 5x more data at the full throughput of the system.
– HJ-Hadoop delivers 4x the throughput of Hadoop when processing very large in-memory data sets.