Accelerating MapReduce on a Coupled CPU-GPU Architecture



Presentation on theme: "Accelerating MapReduce on a Coupled CPU-GPU Architecture"— Presentation transcript:

1 Accelerating MapReduce on a Coupled CPU-GPU Architecture
Linchuan Chen, Xin Huo and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University

2 Outline
Introduction
Background
System Design
Experiment Results
Conclusions and Future Work
9/19/2018

3 Introduction
Motivations
Evolution of heterogeneous architectures:
Decoupled CPU-GPU architectures: CPU + NVIDIA GPU
Coupled CPU-GPU architectures: AMD Fusion, Intel Ivy Bridge
MapReduce programming model: emerged with the development of data-intensive computing
GPUs are used to speed up MapReduce, but no prior work targets coupled CPU-GPUs

4 Introduction
Our Work: a MapReduce framework
On a coupled CPU-GPU, using both CPU and GPU cores
Based on continuous reduction
Task scheduling schemes:
Map-dividing scheme: divides map tasks between CPU and GPU
Pipelining scheme: pipelines the map and reduce stages on different devices
Optimizing load balance: runtime tuning
Significant speedup: 1.21-2.1x over single-device versions

5 Outline
Introduction
Background
System Design
Experiment Results
Conclusions and Future Work

6 Heterogeneous Architecture (AMD Fusion Chip)
Processing component of a GPU:
Device: Grid (CUDA) / NDRange (OpenCL)
Streaming Multiprocessor (SM): Block (CUDA) / Workgroup (OpenCL)
Processing Core: Thread (CUDA) / Work Item (OpenCL)
[Figure: a device holding grids of thread blocks, each block holding a grid of threads, omitted]

7 Heterogeneous Architecture (AMD Fusion Chip)
Memory component:
The GPU shares the same physical memory with the CPU: no device memory, no PCIe bus, zero-copy memory buffers
Shared memory: small (32 KB), faster I/O, faster locking operations
[Figure: CPU and GPU SMs with per-thread private memory and per-SM shared memory, both reaching host RAM through zero-copy, omitted]

8 MapReduce Programming Model
Map(): generates (key, value) pair(s)
Reduce(): merges the values associated with the same key
Efficient runtime system: parallelization, concurrency control, resource management, fault tolerance, ...
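To make the two user-defined functions concrete, here is a minimal host-side word-count sketch; the helper names are hypothetical, not the framework's actual API:

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// map(): emit one (key, value) pair per word in the input line.
std::vector<std::pair<std::string, int>> map_words(const std::string& line) {
    std::vector<std::pair<std::string, int>> pairs;
    std::istringstream in(line);
    std::string w;
    while (in >> w) pairs.emplace_back(w, 1);
    return pairs;
}

// reduce(): merge all values associated with the same key (here, by summing).
std::map<std::string, int> reduce_pairs(
        const std::vector<std::pair<std::string, int>>& pairs) {
    std::map<std::string, int> counts;
    for (const auto& kv : pairs) counts[kv.first] += kv.second;
    return counts;
}
```

The runtime system's job is everything around these two functions: running many map calls in parallel and merging their outputs safely.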

9 Outline
Introduction
Background
System Design
Experiment Results
Conclusions and Future Work

10 Overcoming the Memory Overhead of MapReduce
Traditional MapReduce procedure: map -> shuffle -> reduce

11 Overcoming the Memory Overhead of MapReduce
MapReduce based on continuous reduction

12 MapReduce Based on Continuous Reduction
Key-value pairs are reduced immediately: no shuffling overhead, low memory overhead
A general data structure stores the result: the reduction object, a hash-table-based structure
With a small number of keys, the reduction object can use shared memory
Non-associative-and-commutative reductions are supported via in-object sort
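The continuous-reduction idea can be sketched with a CPU-side reduction object; the class name and the sum-reduction are assumptions for illustration (the real object is hash-table based and can live in GPU shared memory when the key set is small):

```cpp
#include <string>
#include <unordered_map>

// Sketch of a reduction object: every emitted (key, value) pair is folded
// into the table immediately, so intermediate pairs are never buffered and
// no shuffle stage is needed.
class ReductionObject {
public:
    void emit(const std::string& key, int value) {
        auto it = table_.find(key);
        if (it == table_.end())
            table_.emplace(key, value);  // first pair seen for this key
        else
            it->second += value;         // reduce in place (sum, as an example)
    }
    int get(const std::string& key) const {
        auto it = table_.find(key);
        return it == table_.end() ? 0 : it->second;
    }
private:
    std::unordered_map<std::string, int> table_;
};
```

Compared with map -> shuffle -> reduce, the memory footprint is bounded by the number of distinct keys rather than the number of emitted pairs.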

13 Task Scheduling
Map-dividing scheme: the map tasks are divided between the CPU and the GPU, and each device runs both the map stage and the reduce stage on its share
Pipelining scheme: one device (the map device) runs the map stage and feeds key-value buffers to the other device (the reduce device), which runs the reduce stage

14 Map-dividing Scheme
Static scheduling? The relative speeds of the CPU and GPU vary by application, so a partitioning ratio cannot be determined in advance
Dynamic scheduling:
Kernel-relaunch based: high kernel launch overhead
Locking based: put a global offset in zero-copy memory and use atomic operations to retrieve tasks; however, locking this memory from both the CPU and the GPU is not correctly supported
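The locking-based idea can be sketched with CPU-side atomics; as the slide notes, the equivalent CPU+GPU atomic on zero-copy memory is not correctly supported on this chip, which is what motivates the master-worker model on the next slides. `claim_block` is a hypothetical name:

```cpp
#include <atomic>

// Global task offset, conceptually placed in zero-copy memory so both
// devices could see it. Each worker claims a block of tasks with a single
// atomic fetch-and-add, so no two workers ever receive the same tasks.
std::atomic<int> g_offset{0};

// Returns the first task index of the claimed block, or -1 if no tasks remain.
int claim_block(int block_size, int total_tasks) {
    int start = g_offset.fetch_add(block_size);
    return start < total_tasks ? start : -1;
}
```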

15 Map-dividing Scheme (master-worker model)
A scheduler on CPU core 0 tracks busy/idle worker info in zero-copy memory and assigns map task blocks to idle workers on both devices
[Figure: scheduler core assigning task blocks B0..Bm to CPU workers and B0..Bn to GPU workers, whose map outputs feed the final output, omitted]

16 Map-dividing Scheme (master-worker model)
Locking-free: worker cores do not retrieve tasks actively, so there is no competition on a global task offset
Drawback: dedicating a CPU core to scheduling wastes resources, especially for applications where the CPU is much faster than the GPU
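One pass of such a scheduler might look like the following sketch; the struct and function names are assumptions, and in the real system the worker-info entries live in zero-copy memory where the GPU workers poll them:

```cpp
#include <vector>

// Hypothetical per-worker entry in zero-copy memory: the worker sets 'idle'
// when it finishes its block; only the scheduler writes 'next_task'.
struct WorkerInfo {
    bool idle;
    int next_task;
};

// One scheduler pass: hand the next task block to every idle worker.
// Workers never fetch tasks themselves, so no lock on a global offset is
// needed. Returns the number of blocks assigned in this pass.
int schedule_pass(std::vector<WorkerInfo>& workers,
                  int& offset, int block_size, int total_tasks) {
    int assigned = 0;
    for (auto& w : workers) {
        if (w.idle && offset < total_tasks) {
            w.next_task = offset;
            w.idle = false;
            offset += block_size;  // only the scheduler core advances this
            ++assigned;
        }
    }
    return assigned;
}
```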

17 Pipelining Scheme
GPUs are good at highly parallel operations: potentially suited to the map stage, which tends to be compute intensive and parallel
CPUs are good at control flow and data retrieval: potentially suited to the reduce stage, which involves branch operations and data retrieval

18 Pipelining Scheme with Dynamic Load Balancing
[Figure: a scheduler tracks busy/idle worker info on both devices; map workers on the map device fill shared key-value buffers, which reduce workers on the reduce device drain into the output, omitted]

19 Pipelining Scheme with Static Load Balancing
[Figure: each map worker B1..Bm on the map device is statically paired with a reduce worker B1..Bn on the reduce device through its own key-value buffer, omitted]

20 Runtime Tuning for the Map-dividing Scheme
Fixed-size scheduling:
Large task block size: low scheduling overhead, but high load imbalance
Small task block size: low load imbalance, but high scheduling overhead
Runtime tuning: profile using small blocks, adjust block sizes according to speed, reduce at the end

Worker ID | Tasks Completed at Probe Stage | Tuned Size
0         | N0                             | N0 / Nave * Size_large
1         | N1                             | N1 / Nave * Size_large
2         | N2                             | N2 / Nave * Size_large
...       | ...                            | ...
n         | Nn                             | Nn / Nave * Size_large
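The tuning rule in the table above can be sketched as a small helper (hypothetical name; `probe_counts` holds each worker's completed-task count Ni from the probe stage):

```cpp
#include <vector>

// After profiling with small blocks, give each worker a block size
// proportional to its measured speed: size_i = (Ni / Nave) * Size_large.
// Faster workers get larger blocks, keeping scheduling overhead low
// without reintroducing load imbalance.
std::vector<int> tuned_sizes(const std::vector<int>& probe_counts,
                             int size_large) {
    double avg = 0.0;
    for (int n : probe_counts) avg += n;
    avg /= probe_counts.size();  // Nave: average tasks completed per worker
    std::vector<int> sizes;
    for (int n : probe_counts)
        sizes.push_back(static_cast<int>(n / avg * size_large));
    return sizes;
}
```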

21 Outline
Introduction
Background
System Design
Experiment Results
Conclusions and Future Work

22 Experimental Setup
Platform: a coupled CPU-GPU, the AMD Fusion APU A3850 (quad-core AMD CPU + HD 6550 GPU, 5 x 80 = 400 cores)
Applications: K-means (KM), Word Count (WC), Naive Bayes Classifier (NBC), Matrix Multiplication (MM), K-nearest Neighbor (kNN)

23 Load Imbalance under Different Task Block Sizes
Map-dividing scheme: measure the load imbalance between the CPU and GPU for each application using different task block sizes:
Load_imbalance = |T_CPU - T_GPU| / max(T_CPU, T_GPU)
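The metric translates directly into code; 0 means the two devices finished together, and values near 1 mean one device sat idle most of the time:

```cpp
#include <algorithm>
#include <cmath>

// Load_imbalance = |T_CPU - T_GPU| / max(T_CPU, T_GPU),
// where T_CPU and T_GPU are the per-device computation times.
double load_imbalance(double t_cpu, double t_gpu) {
    return std::fabs(t_cpu - t_gpu) / std::max(t_cpu, t_gpu);
}
```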

24 Computation Time under Different Task Block Sizes
Map-dividing scheme: measure the computation time of each application using different task block sizes

25 Comparison of Different Approaches
Single-device versions:
CPU: CPU-only version
GPU: GPU-only version
Map-dividing scheme:
MDO: map-dividing scheme with a manually chosen optimal task block size
TUNED: map-dividing scheme with runtime tuning
Pipelining scheme:
GMCR: pipelining scheme, GPU map, CPU reduce
GMCRD: GMCR with dynamic load balancing
GMCRS: GMCR with static load balancing
CMGR: pipelining scheme, CPU map, GPU reduce
CMGRD: CMGR with dynamic load balancing
CMGRS: CMGR with static load balancing

26 Comparison of Different Approaches
K-means: comparison between single-device versions (CPU, GPU), the map-dividing scheme (MDO, TUNED), and the pipelining scheme (GMCR, CMGR)

27 Comparison of Different Approaches
Word Count: comparison between single-device versions (CPU, GPU), the map-dividing scheme (MDO, TUNED), and the pipelining scheme (GMCR, CMGR)

28 Comparison of Different Approaches
Naive Bayes: comparison between single-device versions (CPU, GPU), the map-dividing scheme (MDO, TUNED), and the pipelining scheme (GMCR, CMGR)

29 Comparison of Different Approaches
Matrix Multiplication: comparison between single-device versions (CPU, GPU) and the map-dividing scheme (MDO, TUNED)

30 Comparison of Different Approaches
K-nearest Neighbor: comparison between single-device versions (CPU, GPU), the map-dividing scheme (MDO, TUNED), and the pipelining scheme (GMCR, CMGR)

31 Overall Speedups from Our Framework
Comparing single-core execution and CPU-GPU execution against a handwritten sequential version.

Execution Time (ms)    | KM          | WC          | NBC         | MM            | kNN
Sequential             | 7042        | 2017        | 2655        | 98810         | 1004
MapReduce (1 core)     | 7804        | 2057        | 2712        | 93647         | 1154
Best CPU-GPU (speedup) | 959 (7.34x) | 516 (3.91x) | 818 (3.25x) | 3445 (28.68x) | 112 (8.69x)

32 Outline
Introduction
Background
System Design
Experiment Results
Conclusions and Future Work

33 Conclusions and Future Work
Conclusions:
Scheduling MapReduce on a coupled CPU-GPU with two different scheduling schemes
Runtime tuning to lower load imbalance
MapReduce based on continuous reduction
Significant speedups over single-device versions for most applications
Future work:
Extend to clusters with coupled CPU-GPU nodes
Apply the design ideas to other applications with different communication patterns

34 Thank you
Questions?
Contacts: Linchuan Chen, Xin Huo, Gagan Agrawal

