Accelerating MapReduce on a Coupled CPU-GPU Architecture



Presentation on theme: "Accelerating MapReduce on a Coupled CPU-GPU Architecture"— Presentation transcript:

1 Accelerating MapReduce on a Coupled CPU-GPU Architecture
Linchuan Chen, Xin Huo and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University

2 Outline
Introduction
Background
System Design
Experiment Results
Conclusions and Future Work
9/19/2018

3 Introduction
Motivations
Evolution of heterogeneous architectures:
Decoupled CPU-GPU architectures: CPU + NVIDIA GPU
Coupled CPU-GPU architectures: AMD Fusion, Intel Ivy Bridge
MapReduce programming model: emerged with the development of data-intensive computing
GPUs are used to speed up MapReduce, but no prior work targets coupled CPU-GPUs

4 Introduction
Our Work: a MapReduce framework
On a coupled CPU-GPU, using both CPU and GPU cores
Based on continuous reduction
Task scheduling schemes:
Map-dividing scheme: divides map tasks between CPU and GPU
Pipelining scheme: pipelines the map and reduce stages on different devices
Optimizing load balance: runtime tuning
Significant speedup: 1.21-2.1x over single-device versions

5 Outline
Introduction
Background
System Design
Experiment Results
Conclusions and Future Work

6 Heterogeneous Architecture (AMD Fusion Chip)
Processing component of a GPU:
Device: Grid (CUDA) / NDRange (OpenCL)
Streaming Multiprocessor (SM): Block (CUDA) / Workgroup (OpenCL)
Processing Core: Thread (CUDA) / Work Item (OpenCL)
[Figure: a device holding grids of thread blocks, each block holding a grid of threads, omitted]

7 Heterogeneous Architecture (AMD Fusion Chip)
Memory component:
The GPU shares the same physical memory with the CPU: no device memory, no PCIe bus, zero-copy memory buffers
Shared memory: small (32 KB), faster I/O, faster locking operations
[Figure: CPU and GPU SMs with per-thread private memory and per-SM shared memory, both reaching host RAM through zero-copy, omitted]

8 MapReduce Programming Model
Map(): generates (key, value) pair(s)
Reduce(): merges the values associated with the same key
Efficient runtime system: parallelization, concurrency control, resource management, fault tolerance, ...
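To make the two user-defined functions concrete, here is a minimal host-side word-count sketch; the helper names are hypothetical, not the framework's actual API:

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// map(): emit one (key, value) pair per word in the input line.
std::vector<std::pair<std::string, int>> map_words(const std::string& line) {
    std::vector<std::pair<std::string, int>> pairs;
    std::istringstream in(line);
    std::string w;
    while (in >> w) pairs.emplace_back(w, 1);
    return pairs;
}

// reduce(): merge all values associated with the same key (here, by summing).
std::map<std::string, int> reduce_pairs(
        const std::vector<std::pair<std::string, int>>& pairs) {
    std::map<std::string, int> counts;
    for (const auto& kv : pairs) counts[kv.first] += kv.second;
    return counts;
}
```

The runtime system's job is everything around these two functions: running many map calls in parallel and merging their outputs safely.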

9 Outline
Introduction
Background
System Design
Experiment Results
Conclusions and Future Work

10 Overcoming the Memory Overhead of MapReduce
Traditional MapReduce procedure: map -> shuffle -> reduce

11 Overcoming the Memory Overhead of MapReduce
MapReduce based on continuous reduction

12 MapReduce Based on Continuous Reduction
Key-value pairs are reduced immediately: no shuffling overhead, low memory overhead
A general data structure stores the result: the reduction object, a hash-table-based structure
With a small number of keys, the reduction object can use shared memory
Non-associative-and-commutative reductions are supported via in-object sort
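The continuous-reduction idea can be sketched with a CPU-side reduction object; the class name and the sum-reduction are assumptions for illustration (the real object is hash-table based and can live in GPU shared memory when the key set is small):

```cpp
#include <string>
#include <unordered_map>

// Sketch of a reduction object: every emitted (key, value) pair is folded
// into the table immediately, so intermediate pairs are never buffered and
// no shuffle stage is needed.
class ReductionObject {
public:
    void emit(const std::string& key, int value) {
        auto it = table_.find(key);
        if (it == table_.end())
            table_.emplace(key, value);  // first pair seen for this key
        else
            it->second += value;         // reduce in place (sum, as an example)
    }
    int get(const std::string& key) const {
        auto it = table_.find(key);
        return it == table_.end() ? 0 : it->second;
    }
private:
    std::unordered_map<std::string, int> table_;
};
```

Compared with map -> shuffle -> reduce, the memory footprint is bounded by the number of distinct keys rather than the number of emitted pairs.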

13 Task Scheduling
Map-dividing scheme: the map tasks are divided between the CPU and the GPU, and each device runs both the map stage and the reduce stage on its share
Pipelining scheme: one device (the map device) runs the map stage and feeds key-value buffers to the other device (the reduce device), which runs the reduce stage

14 Map-dividing Scheme
Static scheduling? The relative speeds of the CPU and GPU vary by application, so a partitioning ratio cannot be determined in advance
Dynamic scheduling:
Kernel-relaunch based: high kernel launch overhead
Locking based: put a global offset in zero-copy memory and use atomic operations to retrieve tasks; however, locking this memory from both the CPU and the GPU is not correctly supported
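The locking-based idea can be sketched with CPU-side atomics; as the slide notes, the equivalent CPU+GPU atomic on zero-copy memory is not correctly supported on this chip, which is what motivates the master-worker model on the next slides. `claim_block` is a hypothetical name:

```cpp
#include <atomic>

// Global task offset, conceptually placed in zero-copy memory so both
// devices could see it. Each worker claims a block of tasks with a single
// atomic fetch-and-add, so no two workers ever receive the same tasks.
std::atomic<int> g_offset{0};

// Returns the first task index of the claimed block, or -1 if no tasks remain.
int claim_block(int block_size, int total_tasks) {
    int start = g_offset.fetch_add(block_size);
    return start < total_tasks ? start : -1;
}
```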

15 Map-dividing Scheme (master-worker model)
A scheduler on CPU core 0 tracks busy/idle worker info in zero-copy memory and assigns map task blocks to idle workers on both devices
[Figure: scheduler core assigning task blocks B0..Bm to CPU workers and B0..Bn to GPU workers, whose map outputs feed the final output, omitted]

16 Map-dividing Scheme (master-worker model)
Locking-free: worker cores do not retrieve tasks actively, so there is no competition on a global task offset
Drawback: dedicating a CPU core to scheduling wastes resources, especially for applications where the CPU is much faster than the GPU
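One pass of such a scheduler might look like the following sketch; the struct and function names are assumptions, and in the real system the worker-info entries live in zero-copy memory where the GPU workers poll them:

```cpp
#include <vector>

// Hypothetical per-worker entry in zero-copy memory: the worker sets 'idle'
// when it finishes its block; only the scheduler writes 'next_task'.
struct WorkerInfo {
    bool idle;
    int next_task;
};

// One scheduler pass: hand the next task block to every idle worker.
// Workers never fetch tasks themselves, so no lock on a global offset is
// needed. Returns the number of blocks assigned in this pass.
int schedule_pass(std::vector<WorkerInfo>& workers,
                  int& offset, int block_size, int total_tasks) {
    int assigned = 0;
    for (auto& w : workers) {
        if (w.idle && offset < total_tasks) {
            w.next_task = offset;
            w.idle = false;
            offset += block_size;  // only the scheduler core advances this
            ++assigned;
        }
    }
    return assigned;
}
```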

17 Pipelining Scheme
GPUs are good at highly parallel operations: potentially suited to the map stage, which tends to be compute intensive and parallel
CPUs are good at control flow and data retrieval: potentially suited to the reduce stage, which involves branch operations and data retrieval

18 Pipelining Scheme with Dynamic Load Balancing
[Figure: a scheduler tracks busy/idle worker info on both devices; map workers on the map device fill shared key-value buffers, which reduce workers on the reduce device drain into the output, omitted]

19 Pipelining Scheme with Static Load Balancing
[Figure: each map worker B1..Bm on the map device is statically paired with a reduce worker B1..Bn on the reduce device through its own key-value buffer, omitted]

20 Runtime Tuning for the Map-dividing Scheme
Fixed-size scheduling:
Large task block size: low scheduling overhead, but high load imbalance
Small task block size: low load imbalance, but high scheduling overhead
Runtime tuning: profile using small blocks, adjust block sizes according to speed, reduce at the end

Worker ID | Tasks Completed at Probe Stage | Tuned Size
0         | N0                             | N0 / Nave * Size_large
1         | N1                             | N1 / Nave * Size_large
2         | N2                             | N2 / Nave * Size_large
...       | ...                            | ...
n         | Nn                             | Nn / Nave * Size_large
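The tuning rule in the table above can be sketched as a small helper (hypothetical name; `probe_counts` holds each worker's completed-task count Ni from the probe stage):

```cpp
#include <vector>

// After profiling with small blocks, give each worker a block size
// proportional to its measured speed: size_i = (Ni / Nave) * Size_large.
// Faster workers get larger blocks, keeping scheduling overhead low
// without reintroducing load imbalance.
std::vector<int> tuned_sizes(const std::vector<int>& probe_counts,
                             int size_large) {
    double avg = 0.0;
    for (int n : probe_counts) avg += n;
    avg /= probe_counts.size();  // Nave: average tasks completed per worker
    std::vector<int> sizes;
    for (int n : probe_counts)
        sizes.push_back(static_cast<int>(n / avg * size_large));
    return sizes;
}
```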

21 Outline
Introduction
Background
System Design
Experiment Results
Conclusions and Future Work

22 Experimental Setup
Platform: a coupled CPU-GPU, the AMD Fusion APU A3850 (quad-core AMD CPU + HD 6550 GPU, 5 x 80 = 400 cores)
Applications: K-means (KM), Word Count (WC), Naive Bayes Classifier (NBC), Matrix Multiplication (MM), K-nearest Neighbor (kNN)

23 Load Imbalance under Different Task Block Sizes
Map-dividing scheme: measure the load imbalance between the CPU and GPU for each application using different task block sizes:
Load_imbalance = |T_CPU - T_GPU| / max(T_CPU, T_GPU)
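The metric translates directly into code; 0 means the two devices finished together, and values near 1 mean one device sat idle most of the time:

```cpp
#include <algorithm>
#include <cmath>

// Load_imbalance = |T_CPU - T_GPU| / max(T_CPU, T_GPU),
// where T_CPU and T_GPU are the per-device computation times.
double load_imbalance(double t_cpu, double t_gpu) {
    return std::fabs(t_cpu - t_gpu) / std::max(t_cpu, t_gpu);
}
```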

24 Computation Time under Different Task Block Sizes
Map-dividing scheme: measure the computation time of each application using different task block sizes

25 Comparison of Different Approaches
Single-device versions:
CPU: CPU-only version
GPU: GPU-only version
Map-dividing scheme:
MDO: map-dividing scheme with a manually chosen optimal task block size
TUNED: map-dividing scheme with runtime tuning
Pipelining scheme:
GMCR: pipelining scheme, GPU map, CPU reduce
GMCRD: GMCR with dynamic load balancing
GMCRS: GMCR with static load balancing
CMGR: pipelining scheme, CPU map, GPU reduce
CMGRD: CMGR with dynamic load balancing
CMGRS: CMGR with static load balancing

26 Comparison of Different Approaches
K-means: comparison between single-device versions (CPU, GPU), the map-dividing scheme (MDO, TUNED), and the pipelining scheme (GMCR, CMGR)

27 Comparison of Different Approaches
Word Count: comparison between single-device versions (CPU, GPU), the map-dividing scheme (MDO, TUNED), and the pipelining scheme (GMCR, CMGR)

28 Comparison of Different Approaches
Naive Bayes: comparison between single-device versions (CPU, GPU), the map-dividing scheme (MDO, TUNED), and the pipelining scheme (GMCR, CMGR)

29 Comparison of Different Approaches
Matrix Multiplication: comparison between single-device versions (CPU, GPU) and the map-dividing scheme (MDO, TUNED)

30 Comparison of Different Approaches
K-nearest Neighbor: comparison between single-device versions (CPU, GPU), the map-dividing scheme (MDO, TUNED), and the pipelining scheme (GMCR, CMGR)

31 Overall Speedups from Our Framework
Comparing single-core execution and CPU-GPU execution against a handwritten sequential version.

Execution Time (ms)    | KM          | WC          | NBC         | MM            | kNN
Sequential             | 7042        | 2017        | 2655        | 98810         | 1004
MapReduce (1 core)     | 7804        | 2057        | 2712        | 93647         | 1154
Best CPU-GPU (speedup) | 959 (7.34x) | 516 (3.91x) | 818 (3.25x) | 3445 (28.68x) | 112 (8.69x)

32 Outline
Introduction
Background
System Design
Experiment Results
Conclusions and Future Work

33 Conclusions and Future Work
Conclusions:
Scheduling MapReduce on a coupled CPU-GPU with two different scheduling schemes
Runtime tuning to lower load imbalance
MapReduce based on continuous reduction
Significant speedups over single-device versions for most applications
Future work:
Extend to clusters with coupled CPU-GPU nodes
Apply the design ideas to other applications with different communication patterns

34 Thank you
Questions?
Contacts: Linchuan Chen, Xin Huo, Gagan Agrawal

