1 Scheduling Methods for Accelerating Applications on Architectures with Heterogeneous Cores
Linchuan Chen, Xin Huo and Gagan Agrawal

2 Outline Introduction Background System Design Experiment Results
Conclusions and Future Work

3 Introduction
Motivations
Evolution of heterogeneous architectures:
Decoupled CPU-GPU architectures: CPU + NVIDIA GPU
Coupled CPU-GPU architectures: AMD Fusion, Intel Ivy Bridge, with non-coherent, non-uniform access shared memory
Accelerating applications using both CPU and GPU
Desirable: both the CPU and the GPU contribute significant computing power
Issues: locking overhead (e.g., work stealing), lack of locking support between devices, kernel re-launch overhead

4 Introduction
Our Work
Several locking-free scheduling methods:
Master-worker: one CPU core acts as a master for scheduling, the others as workers
Core-level token passing: uses a token to pass the permission for retrieving tasks among cores/SMs
Device-level token passing: coarse-grained token passing between the CPU and the GPU(s)
Platforms
A decoupled CPU-GPU node with 2 GPUs
An AMD Fusion coupled CPU-GPU
Efficiency
CPU+1GPU: up to 1.88x speedup over the better single-device version
CPU+2GPU: further improves performance by up to 1.79x over the CPU+1GPU version
Up to 21% faster than StarPU or OmpSs

5 Outline Introduction Background System Design Experiment Results
Conclusions and Future Work

6 Decoupled GPU Architectures
Processing component, in CUDA / OpenCL terminology:
Device: Grid (CUDA), NDRange (OpenCL)
Streaming Multiprocessor (SM): Block (CUDA), Workgroup (OpenCL)
Processing core: Thread (CUDA), Work Item (OpenCL)
[Figure: the CUDA grid/block/thread hierarchy on the GPU device, connected to the CPU over PCIe]
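To make the hierarchy concrete, here is a minimal CUDA sketch; the kernel name scaleKernel and the sizes are illustrative, not from the talk:

    #include <cuda_runtime.h>

    // Each thread (work item) handles one element; blocks (workgroups)
    // are scheduled onto SMs, and the grid (NDRange) spans the device.
    __global__ void scaleKernel(float *data, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) data[idx] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;            // illustrative problem size
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        dim3 block(256);                  // threads per block (per-SM unit)
        dim3 grid((n + 255) / 256);       // blocks in the grid (device-wide)
        scaleKernel<<<grid, block>>>(d_data, n);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }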

7 Decoupled GPU Architectures
Memory component:
Device memory
Shared memory: small (32 KB), faster I/O, faster locking operations
Registers
Constant memory
[Figure: the CUDA memory hierarchy -- per-thread registers and local memory, per-block shared memory, and device-wide global, constant, and texture memory, with the host across the PCIe bus]
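Why shared memory matters can be shown with a small, hedged CUDA sketch: each block stages a tile from slow device memory into its SM's fast shared memory before reducing it there. The name blockSum and the 256-thread block size are assumptions for illustration:

    // Illustrative sketch: each block copies its slice from device
    // (global) memory into shared memory, then reduces it there,
    // so the repeated accesses hit the fast per-SM memory.
    __global__ void blockSum(const float *in, float *out) {
        __shared__ float tile[256];      // per-SM shared memory, small but fast
        int tid = threadIdx.x;           // assumes blockDim.x == 256
        tile[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction
            if (tid < s) tile[tid] += tile[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = tile[0];        // one result per block
    }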

8 Heterogeneous Architecture (AMD Fusion Chip)
Processing component: same as a decoupled GPU
Memory component:
The GPU shares the same physical memory with the CPU
No separate device memory, no PCIe bus
Zero-copy memory buffers accessible by both devices
Also has a memory hierarchy: shared memory and registers
[Figure: CPU and GPU on one chip sharing RAM; both the device-memory and host-memory regions are reachable through zero-copy]
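The zero-copy buffers mentioned above can be illustrated with CUDA's mapped pinned memory; the analogous OpenCL mechanism on a Fusion APU would be a CL_MEM_ALLOC_HOST_PTR buffer. A minimal sketch, with an arbitrary buffer size:

    #include <cuda_runtime.h>

    int main() {
        cudaSetDeviceFlags(cudaDeviceMapHost);   // allow mapping host memory
        float *h_buf, *d_buf;
        // Pinned host allocation that the GPU can access directly.
        cudaHostAlloc(&h_buf, 1024 * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer(&d_buf, h_buf, 0);
        // h_buf (CPU view) and d_buf (GPU view) alias the same pages:
        // no explicit copy is needed, but accesses are non-coherent and
        // non-uniform, which is why cross-device locking on this memory
        // is unsafe (the problem the talk's schemes avoid).
        cudaFreeHost(h_buf);
        return 0;
    }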

9 Outline Introduction Background System Design Experiment Results
Conclusions and Future Work

10 Task Scheduling Challenges
Static scheduling? The relative speeds of CPU and GPU vary, so a partitioning ratio cannot be determined in advance.
Dynamic scheduling:
Kernel-relaunch based: a device exits the kernel and gets tasks using pthread locking; high kernel launch overhead.
Work stealing: retrieves tasks from within the kernel, but relies on the locking mechanism supported by coherent shared memory. Access to zero-copy memory by the CPU and GPU is non-coherent and non-uniform, and locking on this memory between the CPU and GPU is not correctly supported.
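For contrast, here is a hedged sketch of in-kernel task retrieval through an atomic counter, the mechanism work stealing relies on. It is safe among the SMs of one GPU, but the slide's point is that the same pattern across CPU and GPU over zero-copy memory is not, since that memory is non-coherent between the devices. All names are illustrative:

    // Each block repeatedly grabs a block of tasks by atomically
    // advancing a device-resident global offset (sketch only).
    __global__ void atomicWorker(int *globalOffset, int numTasks,
                                 int taskBlock, float *data) {
        __shared__ int start;                    // task block start for this SM
        for (;;) {
            if (threadIdx.x == 0)
                start = atomicAdd(globalOffset, taskBlock);
            __syncthreads();
            if (start >= numTasks) return;       // all tasks consumed
            int t = start + threadIdx.x;
            if (t < start + taskBlock && t < numTasks)
                data[t] *= 2.0f;                 // stand-in for real work
            __syncthreads();                     // finish before next grab
        }
    }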

11 We need relaunch-free, locking-free dynamic scheduling methods

12 Master-worker Scheduling
[Figure: a scheduler on CPU core 0 exchanges schedule messages (has_task, task_idx, task_size) and per-worker busy/idle status with CPU cores and GPU SMs through zero-copy buffers; each worker maps its assigned task block and the results are combined into the output]

13 Master-worker Scheduling
Locking-free: worker cores do not retrieve tasks actively, so there is no competition on the global task offset.
Relaunch-free: the GPU kernel continues processing until the end.
Drawback: dedicating a CPU core to scheduling wastes a resource, especially for applications where the CPU is much faster than the GPU.
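A hedged sketch of the worker side of this scheme: each GPU SM polls its own message slot in zero-copy memory, which the master core fills in. The field names follow the figure on slide 12; the polling loop, the done flag, and the omitted memory fencing are simplifications, not the talk's actual implementation:

    struct TaskMsg {
        volatile int has_task;    // set by the master, cleared by the worker
        volatile int task_idx;    // first task of the assigned block
        volatile int task_size;   // number of tasks in the block
    };

    // One message slot per block (SM); `done` is raised by the master
    // once all tasks are distributed. Sketch only: fencing omitted.
    __global__ void workerLoop(TaskMsg *msgs, volatile int *done, float *data) {
        __shared__ int idx, size, stop;
        TaskMsg *my = &msgs[blockIdx.x];     // this SM's message slot
        do {
            if (threadIdx.x == 0) {          // leader polls zero-copy flags
                stop = *done;
                size = my->has_task ? my->task_size : 0;
                idx  = my->task_idx;
            }
            __syncthreads();
            if (size > 0) {
                if (threadIdx.x < size)
                    data[idx + threadIdx.x] *= 2.0f;    // stand-in for work
                __syncthreads();                        // finish before ack
                if (threadIdx.x == 0) my->has_task = 0; // report idle
            }
            __syncthreads();                 // uniform view of stop/size
        } while (!stop);
    }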

14 Token-based Scheduling
Basic idea
Put the input in a zero-copy buffer.
Use a token to pass the permission for retrieving tasks: a worker can retrieve tasks only while it holds the token, and it passes the token to an idle worker after task retrieval.
Mutual exclusion without locking, and no need to dedicate a core to scheduling.

15 Core-level Token Passing
The token is first held by CPU Core 0.
CPU Core 0 retrieves a task block.
CPU Core 0 passes the token to an idle core, CPU Core 2.
CPU Core 2 repeats the same process.
[Figure: busy and idle CPU cores and GPU SMs sharing a global task offset over the unprocessed tasks; the token travels from the retrieving core to an idle core]
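A hedged, host-side-only sketch of this protocol for the CPU workers (the GPU SMs would follow the same steps over zero-copy status flags). The function and variable names are assumptions; note that globalOffset needs no lock because only the token holder may ever touch it:

    #include <atomic>
    #include <vector>

    std::atomic<int> tokenHolder{0};          // id of the worker holding the token
    std::vector<std::atomic<int>> busy(16);   // per-worker flags; 16 is arbitrary
    int globalOffset = 0;                     // unprotected by design: only the
                                              // token holder may advance it

    // Called by worker `id` when it runs out of tasks; returns the start
    // of its next task block, or -1 if the token is elsewhere. Sketch only.
    int tryFetch(int id, int blockSize) {
        if (tokenHolder.load() != id) return -1;   // must hold the token
        int start = globalOffset;                  // exclusive via the token
        globalOffset += blockSize;
        busy[id] = 1;
        for (int w = 0; w < (int)busy.size(); ++w) // hand token to an idle worker
            if (!busy[w]) { tokenHolder.store(w); break; }
        return start;
    }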

16 Core-level Token Passing
Pros: no core is used exclusively for scheduling; locking-free mutual exclusion; relaunch-free.
Cons: status checking overhead, and a high frequency of token passing if the number of cores/SMs is large.

17 Device-level Token Passing
The token is first held by the CPU.
The CPU retrieves a task block, then passes the token to the GPU, which is idle.
The task block retrieved by the CPU is scheduled to CPU cores using intra-device locking.
The GPU repeats the same process.
[Figure: the CPU (cores 0..n) and the GPU (SMs 0..n) share a global task offset over the unprocessed tasks; the token moves between the two devices]

18 Device-level Token Passing
Token passing is between devices, which reduces status checking overhead and token passing frequency.
Intra-device task scheduling is through locking; the locking frequency is low because tasks are scheduled in relatively large blocks.
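A hedged sketch of the GPU half of this scheme. The token and status flags live in zero-copy memory; since only the token holder advances the global offset, no cross-device lock is needed, and the SMs then subdivide the retrieved block with an ordinary device-local atomic. All names and the fence placement are assumptions, not the talk's code:

    #define TOKEN_CPU 0
    #define TOKEN_GPU 1

    // One designated thread acquires a large task block for the whole
    // GPU, then hands the token to the CPU if the CPU is idle.
    __device__ int fetchDeviceBlock(volatile int *token, volatile int *cpuIdle,
                                    volatile int *globalOffset, int deviceBlock) {
        while (*token != TOKEN_GPU) { }      // spin until the token arrives
        int start = *globalOffset;           // exclusive: we hold the token
        *globalOffset = start + deviceBlock;
        __threadfence_system();              // publish before releasing token
        if (*cpuIdle) *token = TOKEN_CPU;    // pass the token along
        return start;
    }

    // SMs carve tasks out of [start, start + deviceBlock) using a
    // device-local counter; this intra-device locking is cheap and
    // infrequent because the blocks are large.
    __device__ int nextTask(int *localOffset) {
        return atomicAdd(localOffset, 1);
    }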

19 Outline Introduction Background System Design Experiment Results
Conclusions and Future Work

20 Experimental Setup
Platforms
Coupled CPU-GPU: AMD Fusion APU A3850, quad-core AMD CPU + HD 6550 GPU (5 x 80 = 400 cores)
Decoupled CPU-GPU: 12-core Intel Xeon 5650 CPU + 2 NVIDIA M2070 (Fermi) cards (14 x 32 = 448 cores each)
Applications
Generalized reductions: Kmeans, Gridding Kernel
Stencil computation: Jacobi, Sobel Filter
Irregular reductions: Moldyn, Euler

21 Overall Performance of Different Scheduling (Coupled CPU-GPU)
Device-level token passing is the fastest, less than 8% slower than hand-optimal

22 Overall Performance of Different Scheduling (Decoupled CPU-GPU)
Device-level token passing is the fastest, less than 5% slower than hand-optimal

23 Core-level Token Passing vs. Device-level Token Passing
Results are from the AMD Fusion APU.
First component (red lines): starts with GPU-only execution and gradually adds CPU cores.
Second component (blue lines): starts with CPU-only execution and gradually adds GPU cores.

24 Scaling on Multiple GPUs
Scale from CPU+1GPU to CPU+2GPU for each scheduling scheme; the token passing method used here is device-level token passing.
Token passing is the fastest. The master-worker scheme works better as the number of GPUs increases, since sacrificing a single CPU core (the master) has less effect.

25 Comparison with StarPU and OmpSs
We use the device-level token passing scheme to compare against StarPU and OmpSs. Our scheme achieves 1.08x to 1.21x speedups over the faster of StarPU and OmpSs.

26 Outline Introduction Background System Design Experiment Results
Conclusions and Future Work

27 Conclusions and Future Work
Conclusions
Exploiting parallelism across heterogeneous cores with locking-free and relaunch-free scheduling schemes.
Device-level token passing has very low scheduling overhead, makes efficient use of both CPU and GPU cores, and outperforms StarPU and OmpSs.
Future Work
Extend to other emerging intra-node architectures, e.g., MIC.
Extend to clusters where each node has a collection of heterogeneous cores.

28 Thank you! Questions? Contacts:
Linchuan Chen, Xin Huo, Gagan Agrawal

