Optimizing MapReduce for GPUs with Effective Shared Memory Usage
Linchuan Chen and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University
Outline Introduction Background System Design Experiment Results
Related Work Conclusions and Future Work
Introduction Motivations
GPUs
Suitable for extreme-scale computing
Cost-effective and power-efficient
MapReduce Programming Model
Emerged with the development of data-intensive computing
GPUs have been shown to be suitable for implementing MapReduce
Utilizing the fast but small shared memory for MapReduce is challenging
Storing (key, value) pairs leads to high memory overhead, prohibiting the use of shared memory
Introduction Our approach
Reduction-based method
Reduce the (key, value) pair into the reduction object immediately after it is generated by the map function
Very suitable for reduction-intensive applications
A general and efficient MapReduce framework
Dynamic memory allocation within a reduction object
Maintaining a memory hierarchy
Multi-group mechanism
Overflow handling
Before going deeper, let me first cover some background on MapReduce and the GPU architecture.
MapReduce [diagram: map tasks (M) emit (key, value) pairs such as k1:v and k2:v; the runtime groups them by key, e.g. K1: v, v, v, v; K2: v; K3: v, v; K4: v, v, v; K5: v, and each group is consumed by a reduce task (R)]
MapReduce Programming Model
Map(): generates a large number of (key, value) pairs
Reduce(): merges the values associated with the same key
Efficient Runtime System
Parallelization
Concurrency Control
Resource Management
Fault Tolerance … …
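The model above can be sketched with the classic word-count example. This is an illustrative host-side sketch of the programming model, not the framework's GPU API; the function names are my own:

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using Pair = std::pair<std::string, int>;

// Map(): emit a (word, 1) pair for every word in the input.
std::vector<Pair> map_func(const std::string& input) {
    std::vector<Pair> out;
    std::istringstream iss(input);
    std::string word;
    while (iss >> word) out.emplace_back(word, 1);  // emit(word, 1)
    return out;
}

// Runtime: group the emitted pairs by key and apply Reduce()
// (here a sum) to the values of each key.
std::map<std::string, int> run(const std::vector<std::string>& inputs) {
    std::map<std::string, int> grouped;
    for (const auto& in : inputs)
        for (const auto& kv : map_func(in))
            grouped[kv.first] += kv.second;  // reduce: merge values per key
    return grouped;
}
```

Note that in this sequential sketch the grouping and reduction are fused; the point of the runtime system on a real platform is to do this grouping in parallel, which is exactly where the memory overhead discussed later comes from.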
GPUs
Processing component [diagram: the host launches Kernel 1, Kernel 2, …; each kernel executes on the device as a grid of thread blocks, and each block contains many threads]
Memory component [diagram: per-thread registers and local memory; per-block shared memory; device-wide global (device) memory, constant memory, and texture memory; host memory on the CPU side]
System Design Traditional MapReduce
map(input) {
    (key, value) = process(input);
    emit(key, value);
}
(the runtime system groups the key-value pairs by key)
reduce(key, iterator) {
    for each value in iterator:
        result = operation(result, value);
    emit(key, result);
}
System Design Reduction-based approach
map(input) {
    (key, value) = process(input);
    reductionobject->insert(key, value);
}
reduce(value1, value2) {
    value1 = operation(value1, value2);
}
Reduces the memory overhead of storing key-value pairs
Makes it possible to effectively utilize shared memory on a GPU
Eliminates the need for grouping
Especially suitable for reduction-intensive applications
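The reduction-based approach above can be sketched on the host as follows. A hash table stands in for the GPU reduction object, and the reduce operation is a sum; both are illustrative assumptions, not the framework's actual data structure:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Sketch of a reduction object: each (key, value) pair produced by
// map is folded in immediately, so no list of pairs is ever stored.
struct ReductionObject {
    std::unordered_map<std::string, int> table;  // stand-in for the GPU object

    void insert(const std::string& key, int value) {
        auto it = table.find(key);
        if (it == table.end())
            table.emplace(key, value);   // first value for this key
        else
            it->second += value;         // reduce(value1, value2) in place
    }
};
```

Because every pair is reduced on arrival, memory usage is proportional to the number of distinct keys rather than the total number of emitted pairs, which is what makes the small shared memory usable.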
Challenges
Result collection and overflow handling
Maintain a memory hierarchy
Trade off space requirement and locking overhead
A multi-group scheme
To keep the framework general and efficient
A well-defined data structure for the reduction object
Memory Hierarchy [diagram: on the GPU, each thread block (Block 0, Block 1, …) keeps reduction objects (Reduction Object 0, Reduction Object 1, …) in its shared memory; these are backed by a device memory reduction object and a result array in device memory, which are copied back to host memory on the CPU]
Reduction Object Updating the reduction object
Use locks to synchronize Memory allocation in reduction object Dynamic memory allocation Multiple offsets in device memory reduction object
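The lock-synchronized update can be sketched on the host with a spinlock per bucket. On the GPU the lock would be an atomic compare-and-swap on a flag in shared memory; here `std::atomic_flag` plays that role, and the per-bucket granularity is an assumption for illustration:

```cpp
#include <cassert>
#include <atomic>
#include <thread>
#include <vector>

// One bucket of the reduction object, guarded by its own spinlock.
struct Bucket {
    std::atomic_flag lock = ATOMIC_FLAG_INIT;
    long long value = 0;

    void add(long long v) {
        // Spin until the lock is acquired (the atomicCAS loop on a GPU).
        while (lock.test_and_set(std::memory_order_acquire)) {}
        value += v;                       // user-defined reduce operation
        lock.clear(std::memory_order_release);
    }
};
```

With many threads hammering the same bucket, this lock becomes the contention point that the multi-group scheme on a later slide is designed to relieve.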
Reduction Object [diagram: a memory allocator manages the object's storage; index arrays KeyIdx[0], ValIdx[0], … point to per-bucket entries, each storing Key Size, Val Size, Key Data, and Val Data]
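The layout above can be sketched as a single memory pool with bump-pointer allocation. The field names KeyIdx/ValIdx follow the slide; the pool size and the sequential offset bump are illustrative (on the GPU the offset would be advanced with an atomic add):

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Sketch of the reduction object's storage: keys and values of
// arbitrary size are placed in one pool via dynamic allocation,
// and per-bucket index arrays record where each one starts.
struct ReductionObjectPool {
    std::vector<char> pool = std::vector<char>(1 << 16);
    std::size_t offset = 0;              // bump pointer (atomicAdd on a GPU)
    std::vector<std::size_t> KeyIdx, ValIdx;

    std::size_t alloc(std::size_t bytes) {
        std::size_t o = offset;
        offset += bytes;
        return o;
    }

    void insert(const void* key, std::size_t klen,
                const void* val, std::size_t vlen) {
        std::size_t k = alloc(klen), v = alloc(vlen);
        std::memcpy(&pool[k], key, klen);
        std::memcpy(&pool[v], val, vlen);
        KeyIdx.push_back(k);             // bucket i's key lives at KeyIdx[i]
        ValIdx.push_back(v);             // bucket i's value lives at ValIdx[i]
    }
};
```

Keeping only offsets in the index arrays is what lets the same structure live in shared memory (small pool) and device memory (large pool) without changing the update code.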
Multi-group Scheme
Locks are used for synchronization
A large number of threads in each thread block leads to severe contention on the shared memory RO
One solution: full replication
every thread owns a shared memory RO
leads to memory overhead and combination overhead
Trade-off: multi-group scheme
divide the threads in each thread block into multiple sub-groups
each sub-group owns a shared memory RO
Choice of the number of groups
Contention overhead
Combination overhead
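The thread-to-group mapping can be sketched in a few lines. The contiguous assignment of threads to sub-groups is an illustrative assumption; any fixed partition of the block would do:

```cpp
#include <cassert>

// Multi-group scheme: split the threads of one thread block into
// num_groups sub-groups, each updating its own shared-memory RO.
// Fewer groups -> more lock contention; more groups -> more replicas
// to store and later combine.
int group_of(int thread_id, int threads_per_block, int num_groups) {
    int group_size = threads_per_block / num_groups;
    return thread_id / group_size;   // contiguous threads share one RO
}
```

For example, with 256 threads and 4 groups, threads 0-63 update one RO, threads 64-127 the next, and so on; at the end the 4 replicas are combined, which is the combination overhead the slide trades against contention.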
Overflow Handling
Swapping
Merge the full shared memory ROs into the device memory RO
Empty the full shared memory ROs
In-object sorting
Sort the buckets in the reduction object and delete the useless data
Users define how two buckets are compared
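The swapping step can be sketched as follows. Hash maps stand in for the shared-memory and device-memory reduction objects, and the merge uses a sum as the reduce operation; both are illustrative assumptions:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

using RO = std::unordered_map<std::string, int>;

// Swapping: when a shared-memory RO fills up, merge its buckets into
// the larger device-memory RO using the reduce operation, then empty
// the shared-memory RO so map tasks can keep inserting.
void swap_out(RO& shared_ro, RO& device_ro) {
    for (const auto& kv : shared_ro)
        device_ro[kv.first] += kv.second;   // merge via the reduce op
    shared_ro.clear();                      // empty the full shared RO
}
```

Because the merge itself is just the user's reduce operation applied bucket by bucket, swapping never needs to materialize or re-group the original (key, value) pairs.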
Discussion
Reduction-intensive applications: our framework has a significant advantage
Applications with little or no reduction: no need to use shared memory
Users need to set up system parameters; we plan to develop auto-tuning techniques in future work
Extension for Multi-GPU
Shared memory usage can speed up single-node execution Potentially benefits the overall performance Reduction objects can avoid global shuffling overhead Can also reduce communication overhead
Experiment Results
Applications used
5 reduction-intensive
2 map computation-intensive
Tested with small, medium, and large datasets
Evaluation of the multi-group scheme
1, 2, 4 groups
Comparison with other implementations
Sequential implementations
MapCG
Ji et al.'s work
Evaluation of the swapping mechanism
Tested with a large number of distinct keys
Evaluation of the Multi-group Scheme
Comparison with Sequential Implementations
Comparison with MapCG With reduction-intensive applications
Comparison with MapCG With other applications
Comparison with Ji et al.'s work
Evaluation of the Swapping Mechanism
vs. MapCG and Ji et al.'s work
Evaluation of the Swapping Mechanism
vs. MapCG
Evaluation of the Swapping Mechanism
swap_frequency = num_swaps / num_tasks
Related Work
MapReduce for multi-core systems: Phoenix, Phoenix Rebirth
MapReduce on GPUs: Mars, MapCG
MapReduce-like framework on GPUs for SVM: Catanzaro et al.
MapReduce in heterogeneous environments: MITHRA, IDAV
Utilizing shared memory of GPUs for specific applications: Nyland et al., Gutierrez et al.
Compiler optimizations for utilizing shared memory: Baskaran et al. (PPoPP '08), Moazeni et al. (SASP '09)
Conclusions and Future Work
Reduction-based MapReduce
Storing the reduction object across the memory hierarchy of the GPU
A multi-group scheme
Improved performance compared with previous implementations
Future work: extend our framework to support new architectures
Thank you!