Programming Models, Languages, and Compilation for Accelerator-Based Architectures
R. Govindarajan, SERC, IISc
ATIP 1st Workshop on HPC in India, SC-09
Current Trends in HPC Systems
- Top500 systems have hundreds of thousands (100,000s) of cores; performance scaling on such large HPC systems is a major challenge.
- The number of cores per processor/node is increasing: 4 – 6 cores per processor, and more per node, so there is parallelism even at the node level.
- Top systems use accelerators (GPUs and Cell BEs): 1000s of cores/processing elements in a single GPU!
HPC Design Using Accelerators
- High level of performance from accelerators.
- Variety of general-purpose hardware accelerators: GPUs (NVIDIA, ATI), accelerators (ClearSpeed, Cell BE, …), programmable accelerators (e.g., FPGA-based); a plethora of instruction sets, even for SIMD.
- HPC design using accelerators: exploit instruction-level parallelism, data-level parallelism on SIMD units, and thread-level parallelism on multiple units/multi-cores.
- Challenges: portability across different generations and platforms; ability to exploit different types of parallelism.
Accelerators – Cell BE
Accelerators – GPU
The Challenge: SSE, CUDA, OpenCL, ARM Neon, AltiVec, AMD CAL
Programming in Accelerator-Based Architectures
Develop a framework that:
- is programmed in a higher-level language, and is efficient;
- can exploit different types of parallelism on different hardware, including parallelism across heterogeneous functional units;
- is portable across platforms – not device specific!
Existing Approaches (diagram)
- C/C++ with an auto-vectorizer: targets SSE/AltiVec on the CPU.
- CUDA/OpenCL with the nvcc/JIT compiler: targets CPUs and GPUs via PTX/ATI CAL IL.
- Brook with the Brook compiler: targets CPUs and GPUs via ATI CAL IL.
Existing Approaches (contd., diagram)
- StreamIt with the StreamIt compiler: targets Cell BE and RAW.
- Accelerator with the DirectX runtime: targets CPUs and GPUs.
- OpenMP with a standard compiler: targets CPUs and GPUs.
What Is Needed? (diagram)
- Many programming models: streaming languages, MPI, OpenMP, CUDA/OpenCL, array languages (Matlab), parallel languages.
- Many targets: Cell BE, other accelerators, multicores, GPUs, SSE.
- A compiler/runtime system in between, enabling synergistic execution on multiple heterogeneous cores.
What Is Needed? (contd., diagram)
- The same picture, with PLASMA, a high-level IR, as the interface between the languages and the compiler/runtime system that targets Cell BE, other accelerators, multicores, GPUs, and SSE.
- Goal: synergistic execution on multiple heterogeneous cores.
Stream Programming Model
- Higher-level programming model in which nodes represent computation and channels represent communication (producer/consumer relations) between them.
- Exposes pipelined parallelism and task-level parallelism; temporal streaming of data.
- Examples: Synchronous Data Flow (SDF), Stream Flow Graph, StreamIt, Brook, …
- Compiling techniques for achieving rate-optimal, buffer-optimal, software-pipelined schedules.
- Mapping applications to accelerators such as GPUs and Cell BE.
The StreamIt Language
- StreamIt programs are a hierarchical composition of three basic constructs: Pipeline, SplitJoin (with a round-robin or duplicate splitter), and FeedbackLoop.
- Stateful filters, peek values, ...
(Diagram: a filter; a splitter/joiner pair forming a splitjoin; a body/loop pair forming a feedback loop.)
Why StreamIt on GPUs?
- More "natural" than frameworks like CUDA or CTM; easier learning curve than CUDA.
- No need to think of "threads" or blocks.
- StreamIt programs are easier to verify; the schedule can be determined statically.
Issues in Mapping StreamIt to GPUs
- Work distribution across multiprocessors: GPUs have hundreds of processing pipes!
- Exploiting task-level and data-level parallelism: scheduling across the multiprocessors; multiple concurrent threads per SM to exploit DLP.
- Execution configuration: task granularity and concurrency.
- Lack of synchronization between the processors (SMs) of the GPU.
- Managing CPU-GPU memory bandwidth.
Stream Graph Execution (diagram)
- A stream graph with filters A, B, C, D, and its software-pipelined execution across SM1–SM4: instances A1–A4, B1–B4, C1–C4, D1–D4 run so that pipeline parallelism, task parallelism, and data parallelism are all exploited.
Our Approach for GPUs: Code for SAXPY

    float->float filter saxpy {
        float a = 2.5f;
        work pop 2 push 1 {
            float x = pop();
            float y = pop();
            float s = a * x + y;
            push(s);
        }
    }
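For intuition, here is a minimal CUDA sketch of the kind of data-parallel kernel such a filter could be lowered to when many instances of saxpy run concurrently. This is illustrative only: the slides do not show the generated code, and the kernel and buffer names below are hypothetical.

    // Illustrative only: one possible data-parallel lowering of the saxpy filter.
    // Each thread acts as one filter instance: it pops (x, y) from the input
    // buffer and pushes s to the output buffer.
    __global__ void saxpy_filter(const float *in, float *out, int n) {
        const float a = 2.5f;                          // filter state (a constant here)
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = in[2 * i];                       // pop 2
            float y = in[2 * i + 1];
            out[i] = a * x + y;                        // push 1
        }
    }

A launch such as saxpy_filter<<<(n + 255) / 256, 256>>>(d_in, d_out, n) would run n filter instances concurrently. Note the strided in[2*i] reads: the kind of non-contiguous pattern that the buffer-layout scheme on the later slides aims to avoid.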
Our Approach (contd.)
- Multithreading: identify a good execution configuration to exploit the right amount of data parallelism.
- Memory: an efficient buffer layout scheme to ensure all accesses to GPU memory are coalesced.
- Task partitioning between GPU and CPU cores: a work scheduling and processor (SM) assignment problem that takes communication bandwidth restrictions into account.
Execution Configuration (diagram)
- The diagram compares two configurations for filter instances A0 … A127 and B0 … B127, with macro-node execution times of 32 and 16.
- Total exec. time on 2 SMs = MII = 64/2 = 32.
- More threads for exploiting data-level parallelism.
Coalesced Memory Accessing
- GPUs have a banked memory architecture with a very wide memory channel; accesses by the threads in an SM have to be coalesced.
(Diagram: the same elements d0–d7 laid out across banks B0–B3 in two ways; in the second layout, threads 0–3 read consecutive elements, so their accesses coalesce.)
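As a hedged illustration of why this layout matters (these kernels are illustrative and not taken from the framework), contrast a strided access pattern with a contiguous one:

    // Uncoalesced: consecutive threads read elements 'stride' apart, so a
    // warp's loads are spread over many memory segments.
    __global__ void copy_strided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i * stride];
    }

    // Coalesced: consecutive threads read consecutive elements, so a warp's
    // loads are served by a few wide memory transactions.
    __global__ void copy_contiguous(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }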
Execution on CPU and GPU
- Problem: partition work across the CPU and the GPU; data transfer between GPU and host memory is required, depending on the partition!
- Coalesced access is efficient for the GPU but harmful for the CPU, so data is transformed before moving from/to GPU memory.
- Reduce the overall execution time, taking memory transfer and transform delays into account!
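A minimal, hypothetical host-side sketch of such a transform (the interleaved-pairs layout and the function name are assumptions for illustration, not the framework's actual scheme): data kept CPU-friendly as (x, y) pairs is re-laid out so GPU threads read consecutive elements, then copied to the device.

    // Hypothetical sketch: de-interleave a CPU-friendly (x0,y0,x1,y1,...) buffer
    // into a GPU-friendly layout (all x's, then all y's) before the copy.
    #include <cuda_runtime.h>
    #include <vector>

    void push_to_gpu(const std::vector<float> &pairs_xy, float *d_buf) {
        size_t n = pairs_xy.size() / 2;
        std::vector<float> gpu_layout(2 * n);
        for (size_t i = 0; i < n; ++i) {
            gpu_layout[i]     = pairs_xy[2 * i];       // x's first
            gpu_layout[n + i] = pairs_xy[2 * i + 1];   // then y's
        }
        cudaMemcpy(d_buf, gpu_layout.data(), 2 * n * sizeof(float),
                   cudaMemcpyHostToDevice);            // transfer after transform
    }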
Scheduling and Mapping (diagram)
- Initial StreamIt graph: filters A–E, each annotated with its CPU and GPU execution times (e.g., GPU:20, CPU:20, CPU:80, CPU:15, GPU:10, …).
- Partitioned graph: each filter is assigned to the CPU or the GPU, giving CPU load 45, GPU load 40, DMA load 40, and MII = 45.
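On these numbers the MII is simply the most heavily loaded resource once the filters and the induced DMA transfers are partitioned; a tiny sketch under that assumption (the helper below is illustrative, not part of the framework):

    // Illustrative: after partitioning, the initiation-interval lower bound is
    // the largest of the per-resource loads (CPU, GPU, DMA channel).
    #include <algorithm>
    #include <cstdio>

    int mii(int cpu_load, int gpu_load, int dma_load) {
        return std::max(cpu_load, std::max(gpu_load, dma_load));
    }

    int main() {
        // Loads taken from the slide's partitioned graph.
        std::printf("MII = %d\n", mii(45, 40, 40));    // prints MII = 45
        return 0;
    }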
Scheduling and Mapping (contd., diagram)
- The software-pipelined steady state: instances of A–E from different iterations (A_n, B_(n-1), B_(n-2), B_(n-3), C_(n-3), C_(n-4), C_(n-5), D_(n-5), D_(n-6), E_(n-7), …) execute concurrently in the CPU, DMA-channel, and GPU columns of the schedule.
Compiler Framework (flow diagram)
StreamIt program → generate code for profiling → execute profile runs → configuration selection → task partitioning (ILP partitioner or heuristic partitioner) → instance partitioning → modulo scheduling → code generation → CUDA code + C code.
Experimental Results on Tesla (chart)
- Significant speedups for synergistic execution (bars annotated > 52x, > 32x, and > 65x).
What Is Needed? (recap diagram)
- The roadmap slide again: languages (streaming languages, MPI, OpenMP, CUDA/OpenCL, array languages, parallel languages) → PLASMA high-level IR → compiler and runtime system → Cell BE, other accelerators, multicores, GPUs, SSE; synergistic execution on multiple heterogeneous cores.
IR: What Should a Solution Provide?
- Rich abstractions for functionality; independence from any single architecture.
- Portability without compromising efficiency.
- Scale up and scale down: from a single-core embedded processor to a multi-core workstation; take advantage of accelerators (GPU, Cell, …).
- Transparent distributed memory.
- PLASMA: Portable Programming for PLASTIC SIMD Accelerators.
PLASMA IR (diagram)
Matrix-vector multiply expressed with par/reduce operations over slices of the matrix and vector:

    par    mul, temp, A[i*n : i*n+n : 1], X
    reduce add, Y[i : i+1 : 1], temp
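For intuition only, a hedged CUDA sketch of what these two IR operations compute for one row i of the matrix-vector product (an illustrative lowering, not the output of the PLASMA compiler; the kernel and variable names are assumptions):

    // Illustrative: one thread block computes one row of Y = A * X.
    // Launch with blockDim.x a power of two and blockDim.x * sizeof(float)
    // bytes of dynamic shared memory.
    __global__ void matvec_row(const float *A, const float *X, float *Y, int n) {
        extern __shared__ float temp[];         // per-row partial products
        int i = blockIdx.x;                     // row index
        int t = threadIdx.x;

        // "par mul": each thread multiplies its strided share of the row slice
        // A[i*n : i*n+n : 1] with X and accumulates a partial sum.
        float partial = 0.0f;
        for (int j = t; j < n; j += blockDim.x)
            partial += A[i * n + j] * X[j];
        temp[t] = partial;
        __syncthreads();

        // "reduce add": sum the partial products into Y[i : i+1 : 1].
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (t < s) temp[t] += temp[t + s];
            __syncthreads();
        }
        if (t == 0) Y[i] = temp[0];
    }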
Our Framework
- "CPLASM", a prototype high-level assembly language; a prototype PLASMA IR compiler.
- Currently supported targets: C (scalar), SSE3, CUDA (NVIDIA GPUs).
- Future targets: Cell, ATI, ARM Neon, ...
- Compiler optimizations for this "vector" IR.
Our Framework (contd., diagram)
- Plenty of optimization opportunities!
PLASMA IR Performance (chart)
- Normalized execution time is comparable to that of a hand-tuned library!
Ongoing Work
- The same roadmap (languages → PLASMA high-level IR → compiler/runtime → Cell BE, other accelerators, multicores, GPUs, SSE; synergistic execution on multiple heterogeneous cores).
- Look at other high-level languages!
- Target other accelerators.
Compiling OpenMP / MPI / X10
- Mapping the semantics; exploiting data parallelism and task parallelism.
- Communication and synchronization across CPU/GPU/multiple nodes.
- Accelerator-specific optimizations: memory layout, memory transfer, …
- Performance and scaling.
Acknowledgements
- My students!
- IISc and SERC
- Microsoft and Nvidia
- ATIP, NSF, ONR, and all sponsors
Thank You!!