Scheduling Methods for Accelerating Applications on Architectures with Heterogeneous Cores Linchuan Chen, Xin Huo and Gagan Agrawal
Outline Introduction Background System Design Experiment Results Conclusions and Future Work 9/22/2018
Introduction Motivations Evolution of Heterogeneous Architectures Decoupled CPU-GPU architectures CPU + NVIDIA GPU Coupled CPU-GPU architectures AMD Fusion Intel Ivy Bridge Non-coherent, non-uniform access shared memory Accelerating Applications using both CPU and GPU Desirable: both the CPU and the GPU contribute significant computing power Issues: locking overhead (e.g., work stealing), lack of locking support between devices, kernel re-launch overhead
Introduction Our Work Several Locking-free Scheduling Methods: Master-worker One CPU core as master for scheduling, others as workers Core-level token passing Uses a token to pass the permission for retrieving tasks among cores/SMs Device-level token passing Coarse-grained token passing: pass token between CPU and GPU(s) Platforms A decoupled CPU-GPU node with 2 GPUs An AMD Fusion coupled CPU-GPU chip Efficiency CPU+1GPU: up to 1.88x speedup over the better of the single-device versions CPU+2GPU: further improves the performance by up to 1.79x over the CPU+1GPU version Up to 21% faster than StarPU or OmpSs
Decoupled GPU Architectures Processing Component Device: Grid (CUDA), NDRange (OpenCL) Streaming Multiprocessor (SM): Block (CUDA), Workgroup (OpenCL) Processing Core: Thread (CUDA), Work Item (OpenCL) [Figure: a CPU connected over PCIe to a GPU device; the device runs Grid 1 and Grid 2, each grid is a 2D array of blocks, and each block is a 2D array of threads.]
Decoupled GPU Architectures Memory Component Device memory Shared memory Small size (32 KB) Faster I/O Faster locking operations Registers Constant memory [Figure: memory hierarchy of a device grid: constant and texture memory, per-block shared memory, per-thread registers and local memory, with the host alongside.]
Heterogeneous Architecture (AMD Fusion Chip) Processing Component: Same as a decoupled GPU Memory Component GPU shares the same physical memory with CPU No separate device memory No PCIe bus Zero-copy memory buffer Also has a memory hierarchy Shared memory Registers [Figure: GPU SMs, each with per-thread private memory and a shared memory, and the CPU both reach device memory and host memory in the same RAM through zero copy.]
Task Scheduling Challenges Static Scheduling? Relative speeds of CPU and GPU vary Partitioning ratio cannot be determined in advance Dynamic Scheduling Kernel relaunch based A device exits the kernel and gets tasks using pthread locking High kernel launch overhead Work Stealing Retrieves tasks from within the kernel Relies on the locking mechanism supported by coherent shared memory Access to zero-copy memory by CPU and GPU is non-coherent, non-uniform Locking on this memory is not correctly supported between CPU and GPU
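The cost of a wrong static partitioning ratio can be illustrated with a little arithmetic. The sketch below is not from the slides: the function name and the throughput numbers are hypothetical, and it assumes each device processes tasks at a constant rate.

```python
def static_makespan(num_tasks, cpu_share, cpu_rate, gpu_rate):
    """Finish time of a static split (hypothetical cost model).

    cpu_share is the fraction of tasks given to the CPU; cpu_rate and
    gpu_rate are assumed throughputs in tasks per time unit. Both devices
    run concurrently, so the slower side decides the finish time.
    """
    cpu_tasks = num_tasks * cpu_share
    gpu_tasks = num_tasks - cpu_tasks
    return max(cpu_tasks / cpu_rate, gpu_tasks / gpu_rate)


# Assumed example: the GPU is 3x faster than the CPU for this application.
even_split = static_makespan(1200, 0.5, cpu_rate=1.0, gpu_rate=3.0)
ideal_split = static_makespan(1200, 0.25, cpu_rate=1.0, gpu_rate=3.0)
print(even_split, ideal_split)
```

Under these assumed rates the even split takes twice as long as the 25/75 split, and the right ratio changes whenever the relative speeds change, which is what motivates dynamic scheduling.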
We need relaunch-free, locking-free dynamic scheduling methods
Master-worker Scheduling [Figure: the scheduler (CPU core 0) reads worker info (busy/idle) for every CPU and GPU worker from a zero-copy buffer and writes each worker a schedule message with task info (has_task, task_idx, task_size); CPU blocks B 0 to B m and GPU blocks B 0 to B n run map on their tasks, and the results are combined into the output.]
Master-worker Scheduling Locking-free Worker cores do not retrieve tasks actively No competition on global task offset Relaunch-free GPU kernel continues processing until the end Drawback: dedicating a CPU core to scheduling Wastes a compute resource Especially costly for applications where the CPU is much faster than the GPU
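The master-worker protocol can be sketched as a sequential simulation. This is a minimal illustration, not the authors' implementation: `WorkerSlot` and `master_worker` are hypothetical names, the per-worker slots stand in for entries of the zero-copy buffer, and real workers would run concurrently instead of in a loop. The fields `has_task`, `task_idx` and `task_size` follow the schedule message shown on the slide.

```python
from dataclasses import dataclass


@dataclass
class WorkerSlot:
    """Per-worker message slot (stand-in for a zero-copy buffer entry)."""
    has_task: bool = False
    task_idx: int = 0
    task_size: int = 0
    busy: bool = False


def master_worker(num_tasks, block_size, num_workers):
    """Return the task indices each worker processed (num_workers >= 1)."""
    slots = [WorkerSlot() for _ in range(num_workers)]
    processed = [[] for _ in range(num_workers)]
    offset = 0  # global task offset, written only by the master
    while offset < num_tasks or any(s.busy for s in slots):
        # Master step: give the next task block to every idle worker.
        for s in slots:
            if not s.busy and offset < num_tasks:
                s.task_idx = offset
                s.task_size = min(block_size, num_tasks - offset)
                s.has_task = True
                s.busy = True
                offset += s.task_size
        # Worker step: a worker only reads its own slot, never the offset.
        for w, s in enumerate(slots):
            if s.has_task:
                processed[w].extend(range(s.task_idx, s.task_idx + s.task_size))
                s.has_task = False
                s.busy = False
    return processed


print(master_worker(10, 3, 2))
```

Because only the master writes the global offset and each worker touches only its own slot, no lock is needed, at the price of one core that does no task processing.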
Token-based Scheduling Basic Idea Put input in zero-copy buffer Use a token to pass the permission for retrieving tasks A worker could retrieve tasks only when it holds the token A worker passes the token to an idle worker after task retrieval Mutual exclusion without locking No need to use a core for scheduling
Core-level Token Passing Token is first held by CPU Core 0 CPU Core 0 retrieves a task block CPU Core 0 passes the token to an idle core, CPU Core 2 CPU Core 2 will repeat the same process [Figure: CPU Cores 0 to 2 and GPU SMs 0 to 2, each marked busy or idle; the token holder advances the global task offset into the unprocessed tasks.]
Core-level Token Passing Pros: No core is used exclusively for scheduling Locking-free mutual exclusion Relaunch-free Cons: Status checking overhead High frequency of token passing when the number of cores/SMs is large
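Core-level token passing can be sketched as a discrete-time simulation. The function name and the `steps_per_block` cost model are hypothetical, and the hand-off is simplified: the token moves to the first idle worker whenever the current holder is busy, and time advances only when every worker is busy.

```python
def core_level_token_passing(num_tasks, block_size, steps_per_block):
    """Simulate token passing among workers (CPU cores or GPU SMs).

    steps_per_block[w] is a hypothetical cost: the number of time steps
    worker w needs to process one task block (must be >= 1).
    Returns, per worker, the (offset, size) blocks it retrieved.
    """
    n = len(steps_per_block)
    remaining = [0] * n          # time steps left on each worker's block
    blocks = [[] for _ in range(n)]
    offset = 0                   # global task offset
    token = 0                    # worker 0 holds the token first
    while offset < num_tasks:
        if remaining[token] > 0:
            # Holder is busy: look for an idle worker to take the token.
            idle = [w for w in range(n) if remaining[w] == 0]
            if not idle:
                # Everyone is busy, so one time step passes.
                remaining = [r - 1 for r in remaining]
                continue
            token = idle[0]
        # Only the token holder reads and advances the offset: no lock.
        size = min(block_size, num_tasks - offset)
        blocks[token].append((offset, size))
        offset += size
        remaining[token] = steps_per_block[token]
    return blocks


# Assumed example: worker 0 is twice as fast as worker 1.
print(core_level_token_passing(6, 1, [1, 2]))
```

In this run the faster worker ends up with four of the six blocks, so load balancing falls out of the token going to whoever is idle. The scan for idle workers corresponds to the status checking overhead listed under Cons.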
Device-level Token Passing Token is first held by CPU CPU retrieves a task block CPU passes the token to GPU, which is idle The task block retrieved by CPU is scheduled to CPU cores using intra-device locking GPU will repeat the same process [Figure: the CPU (Core 0 to Core n) and the GPU (SM 0 to SM n) take turns advancing the global task offset into the unprocessed tasks.]
Device-level Token Passing Token Passing is Between Devices Reduces status checking overhead and token passing frequency Intra-device Task Scheduling is through Locking Locking frequency is low because tasks are scheduled in relatively large blocks
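The two levels of device-level token passing can be sketched as follows. All names (`run_device`, `device_level_token_passing`, the `chunk` size) are hypothetical, and the sketch simplifies in two ways: the devices take turns sequentially instead of overlapping, and the token simply alternates between them on the assumption that the other device is idle. Python threads with a `threading.Lock` stand in for the cores of one device and its intra-device locking.

```python
import threading


def run_device(start, size, num_cores, chunk):
    """Process one device-level block; cores pull chunks under a lock."""
    lock = threading.Lock()          # intra-device lock only
    state = {"next": start}
    end = start + size
    per_core = [[] for _ in range(num_cores)]

    def core(cid):
        while True:
            with lock:
                lo = state["next"]
                if lo >= end:
                    return
                hi = min(end, lo + chunk)
                state["next"] = hi
            per_core[cid].extend(range(lo, hi))  # "process" the chunk

    threads = [threading.Thread(target=core, args=(c,)) for c in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [i for tasks in per_core for i in tasks]


def device_level_token_passing(num_tasks, device_block, devices, chunk=4):
    """devices maps a device name to its (assumed) number of cores."""
    names = list(devices)
    assignment = {name: [] for name in names}
    offset = 0                       # global task offset
    token = 0                        # the first device holds the token first
    while offset < num_tasks:
        name = names[token]
        size = min(device_block, num_tasks - offset)
        start = offset
        offset += size               # only the token holder advances it
        token = (token + 1) % len(names)  # pass the token to the next device
        assignment[name].extend(run_device(start, size, devices[name], chunk))
    return assignment


result = device_level_token_passing(100, 32, {"cpu": 4, "gpu": 2})
print({name: len(tasks) for name, tasks in result.items()})
```

The lock is taken once per chunk inside a device, while the token moves only once per large device-level block, which is why both the locking frequency and the token passing frequency stay low.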
Experimental Setup Platforms Coupled CPU-GPU: AMD Fusion APU A3850, quad-core AMD CPU + HD6550 GPU (5 x 80 = 400 cores) Decoupled CPU-GPU: 12-core Intel Xeon 5650 CPU + 2 NVIDIA M2070 (Fermi) cards (14 x 32 = 448 cores) Applications Generalized Reductions: Kmeans, Gridding Kernel Stencil Computation: Jacobi, Sobel Filter Irregular Reductions: Moldyn, Euler
Overall Performance of Different Scheduling (Coupled CPU-GPU) Device-level token passing is the fastest, less than 8% slower than hand-optimal
Overall Performance of Different Scheduling (Decoupled CPU-GPU) Device-level token passing is the fastest, less than 5% slower than hand-optimal
Core-level Token Passing vs. Device-level Token Passing Results are from the AMD Fusion APU First component (red lines): starts with GPU-only execution and gradually adds CPU cores Second component (blue lines): starts with CPU-only execution and gradually adds GPU cores
Scaling on Multiple GPUs Scale from CPU+1GPU to CPU+2GPU for each scheduling scheme Token passing method used here is device-level token passing Token passing is the fastest The master-worker scheme fares better as the number of GPUs increases, since sacrificing a single CPU core (master) has less impact
Comparison with StarPU and OmpSs We use the device-level token passing scheme to compare against StarPU and OmpSs. Our scheduling scheme achieves 1.08x to 1.21x speedups over the faster of StarPU and OmpSs.
Conclusions and Future Work Exploiting parallelism across heterogeneous cores Locking-free and relaunch-free scheduling schemes Device-level token passing has very low scheduling overhead Efficient use of both CPU and GPU cores Outperforms StarPU and OmpSs Future Work Extend to other emerging intra-node architectures, e.g., MIC Extend to clusters where each node has a collection of heterogeneous cores
Thank you! Questions? Contacts: Linchuan Chen chenlinc@cse.ohio-state.edu Xin Huo huox@cse.ohio-state.edu Gagan Agrawal agrawal@cse.ohio-state.edu