Slide 1: Programming Models, Languages, and Compilation for Accelerator-Based Architectures
R. Govindarajan, SERC, IISc (govind@serc.iisc.ernet.in)
ATIP 1st Workshop on HPC in India @ SC-09
Slide 2: Current Trend in HPC Systems
- Top500 systems have hundreds of thousands (100,000s) of cores; performance scaling in such large HPC systems is a major challenge.
- The number of cores per processor/node is increasing: 4-6 cores per processor, 16-24 cores per node, so there is parallelism even at the node level.
- Top systems use accelerators (GPUs and Cell BEs), with thousands of cores/processing elements in a single GPU!
Slide 3: HPC Design Using Accelerators
- Accelerators deliver a high level of performance.
- Variety of general-purpose hardware accelerators: GPUs (NVIDIA, ATI), accelerators such as ClearSpeed and Cell BE, programmable (e.g., FPGA-based) accelerators; a plethora of instruction sets even for SIMD.
- HPC design using accelerators: exploit instruction-level parallelism, data-level parallelism on SIMD units, and thread-level parallelism on multiple units/multi-cores.
- Challenges: portability across different generations and platforms; ability to exploit different types of parallelism.
Slide 4: Accelerators - Cell BE
Slide 5: Accelerators - 8800 GPU
Slide 6: The Challenge
- A proliferation of accelerator programming interfaces and SIMD instruction sets: SSE, CUDA, OpenCL, ARM Neon, AltiVec, AMD CAL.
Slide 7: Programming in Accelerator-Based Architectures
Develop a framework that:
- is programmed in a higher-level language, and is efficient;
- can exploit different types of parallelism on different hardware, including parallelism across heterogeneous functional units;
- is portable across platforms - not device specific!
Slide 8: Existing Approaches
(Diagram of existing compilation paths: C/C++ through an auto-vectorizer to SSE/AltiVec on the CPU; CUDA/OpenCL through the nvcc/JIT compiler to PTX or ATI CAL IL on GPUs; Brook through the Brook compiler to the CPU and GPUs via ATI CAL IL.)
Slide 9: Existing Approaches (contd.)
(Diagram, continued: StreamIt through the StreamIt compiler to Cell BE and RAW; Accelerator through the DirectX runtime to CPUs and GPUs; OpenMP through standard compilers to CPUs and GPUs.)
Slide 10: What Is Needed?
(Diagram: synergistic execution on multiple heterogeneous cores - a compiler/runtime system that maps many programming models (streaming languages, MPI, OpenMP, CUDA/OpenCL, array languages such as MATLAB, parallel languages) onto many targets (Cell BE, other accelerators, multicores, GPUs, SSE).)
Slide 11: What Is Needed? (contd.)
(Diagram: the same picture with PLASMA, a high-level IR, inserted between the source languages and the compiler/runtime system that targets Cell BE, other accelerators, multicores, GPUs, and SSE.)
Slide 12: Stream Programming Model
- A higher-level programming model in which nodes represent computation and channels represent communication (producer/consumer relations) between them.
- Exposes pipelined parallelism, task-level parallelism, and temporal streaming of data.
- Examples: Synchronous Data Flow (SDF), Stream Flow Graph, StreamIt, Brook, ...
- Compilation techniques exist for achieving rate-optimal, buffer-optimal, software-pipelined schedules.
- Goal: mapping applications to accelerators such as GPUs and Cell BE.
Slide 13: The StreamIt Language
- StreamIt programs are a hierarchical composition of three basic constructs: Pipeline, SplitJoin (with a round-robin or duplicate splitter), and FeedbackLoop.
- Filters can be stateful and can peek at values beyond those they pop.
(Diagram: a filter; a pipeline of filters forming a stream; a splitter/joiner pair forming a split-join; a body/loop pair with a splitter forming a feedback loop.)
Slide 14: Why StreamIt on GPUs?
- More "natural" than frameworks like CUDA or CTM, with an easier learning curve than CUDA: no need to think in terms of "threads" or blocks.
- StreamIt programs are easier to verify.
- The schedule can be determined statically.
Slide 15: Issues in Mapping StreamIt to GPUs
- Work distribution across multiprocessors: GPUs have hundreds of processing pipes! Exploit task-level and data-level parallelism, schedule work across the multiprocessors, and run multiple concurrent threads in each SM to exploit DLP.
- Execution configuration: choosing task granularity and concurrency.
- Lack of synchronization between the processors of the GPU.
- Managing CPU-GPU memory bandwidth.
Slide 16: Stream Graph Execution
(Diagram: a stream graph with filters A, B, C, and D, and its software-pipelined execution of instances A1..A4, B1..B4, C1..C4, and D1..D4 across SM1-SM4 over time steps 0-7, exposing pipeline parallelism, task parallelism, and data parallelism.)
Slide 17: Our Approach for GPUs - StreamIt Code for SAXPY

    float->float filter saxpy {
        float a = 2.5f;
        work pop 2 push 1 {
            float x = pop();
            float y = pop();
            float s = a * x + y;
            push(s);
        }
    }
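For illustration, the following is a minimal sketch of the kind of CUDA kernel such a filter could be lowered to, with each thread executing one firing (pop 2, push 1) on a contiguous stream buffer; the kernel name, buffer layout, and launch parameters are assumptions for this sketch, not the framework's actual generated code.

    // Each thread performs one firing of the saxpy filter on the input stream,
    // which is assumed to be stored in firing order (x0, y0, x1, y1, ...).
    __global__ void saxpy_filter(const float *in, float *out, int n_firings) {
        const float a = 2.5f;                              // filter state
        int i = blockIdx.x * blockDim.x + threadIdx.x;     // firing index
        if (i < n_firings) {
            float x = in[2 * i];                           // first pop
            float y = in[2 * i + 1];                       // second pop
            out[i] = a * x + y;                            // push
        }
    }

    // Hypothetical host-side launch, assuming d_in/d_out already reside on the GPU:
    //   int threads = 256;
    //   int blocks  = (n_firings + threads - 1) / threads;
    //   saxpy_filter<<<blocks, threads>>>(d_in, d_out, n_firings);

Note that the naive interleaved layout used here makes neighbouring threads read addresses two elements apart; the buffer layout scheme discussed on the coalescing slide below exists precisely to avoid such access patterns.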
Slide 18: Our Approach (contd.)
- Multithreading: identify a good execution configuration to exploit the right amount of data parallelism.
- Memory: an efficient buffer layout scheme that ensures all accesses to GPU memory are coalesced.
- Task partitioning between GPU and CPU cores: a work scheduling and processor (SM) assignment problem that takes communication bandwidth restrictions into account.
Slide 19: Execution Configuration
(Diagram: filter instances A0..A127 and B0..B127 grouped into macro nodes with execution times of 32 and 16; the total execution time on 2 SMs equals the MII, 64/2 = 32.)
- Use more threads to exploit data-level parallelism.
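As a rough illustration of the resource-constrained MII calculation implied by the figure (the formula is the standard one from modulo scheduling; the function and variable names here are hypothetical):

    // Resource-constrained minimum initiation interval (MII): the total work of
    // one steady-state iteration divided (with ceiling) by the number of SMs.
    int resource_mii(const int *node_exec_time, int n_nodes, int n_sms) {
        int total = 0;
        for (int i = 0; i < n_nodes; ++i)
            total += node_exec_time[i];
        return (total + n_sms - 1) / n_sms;   // ceiling division
    }
    // Example from the slide: macro nodes totalling 64 time units on 2 SMs
    // give MII = 64 / 2 = 32.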
Slide 20: Coalesced Memory Access
- GPUs have a banked memory architecture with a very wide memory channel; accesses by the threads in an SM have to be coalesced.
(Diagram: two layouts of data elements d0..d7 across memory banks B0..B3 and threads 0-3. In the natural layout d0, d1, ..., d7, each thread's two elements are adjacent, so simultaneous accesses by the four threads are strided and uncoalesced; in the reordered layout d0, d2, d4, d6, d1, d3, d5, d7, the four threads' first accesses, and then their second accesses, fall on consecutive addresses and coalesce.)
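A minimal CUDA sketch of the effect of this buffer layout on the SAXPY example, under the assumption that the host reorders the buffer into a split (struct-of-arrays) form; the names are hypothetical and this is not the framework's exact layout scheme.

    // Uncoalesced: with the natural firing-order layout, thread i's first pop
    // reads element 2*i, so neighbouring threads touch addresses 0, 2, 4, 6, ...
    __global__ void saxpy_strided(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.5f * in[2 * i] + in[2 * i + 1];
    }

    // Coalesced: all "first pops" are stored contiguously, followed by all
    // "second pops", so neighbouring threads read consecutive addresses.
    __global__ void saxpy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.5f * in[i] + in[n + i];
    }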
Slide 21: Execution on CPU and GPU
- Problem: partition the work across the CPU and the GPU; data transfer between GPU and host memory is required, depending on the partition.
- A coalesced layout is efficient for the GPU but harmful for the CPU, so data must be transformed before moving it from/to GPU memory.
- Goal: reduce the overall execution time, taking memory transfer and transform delays into account.
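A minimal sketch of the kind of host-side transform this implies, assuming the interleaved (CPU-friendly) and split (GPU-friendly) layouts used in the sketches above; the function and buffer names are hypothetical.

    #include <cuda_runtime.h>

    // Convert the CPU-friendly interleaved layout (x0, y0, x1, y1, ...) into the
    // GPU-friendly split layout (x0..x(n-1), y0..y(n-1)) and copy it to the GPU.
    void transform_and_upload(const float *host_interleaved, float *host_staging,
                              float *device_buf, int n) {
        for (int i = 0; i < n; ++i) {
            host_staging[i]     = host_interleaved[2 * i];      // all x's first
            host_staging[n + i] = host_interleaved[2 * i + 1];  // then all y's
        }
        cudaMemcpy(device_buf, host_staging, 2 * n * sizeof(float),
                   cudaMemcpyHostToDevice);
    }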
Slide 22: Scheduling and Mapping
(Diagram: an initial StreamIt graph with filters A-E, each annotated with its CPU and GPU execution costs and with communication costs on the edges, and the partitioned graph after each filter is assigned to the CPU or the GPU, including the added DMA transfers; the resulting loads are CPU 45, GPU 40, DMA 40, giving MII = 45.)
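For intuition only, the following toy greedy partitioner balances CPU and GPU loads filter by filter; the actual framework uses an ILP formulation and a heuristic partitioner that also model DMA load and SM assignment, so the names and policy here are illustrative assumptions.

    struct Filter { int cpu_cost; int gpu_cost; };

    // Assign each filter to whichever device keeps the larger of the two
    // accumulated loads as small as possible (DMA cost ignored in this toy).
    void greedy_partition(const Filter *f, int n, bool *on_gpu) {
        int cpu_load = 0, gpu_load = 0;
        for (int i = 0; i < n; ++i) {
            int max_if_cpu = cpu_load + f[i].cpu_cost > gpu_load
                                 ? cpu_load + f[i].cpu_cost : gpu_load;
            int max_if_gpu = gpu_load + f[i].gpu_cost > cpu_load
                                 ? gpu_load + f[i].gpu_cost : cpu_load;
            on_gpu[i] = max_if_gpu < max_if_cpu;
            if (on_gpu[i]) gpu_load += f[i].gpu_cost;
            else           cpu_load += f[i].cpu_cost;
        }
    }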
Slide 23: Scheduling and Mapping (contd.)
(Diagram: the software-pipelined steady-state schedule for the partitioned graph, in which different iterations of filters A-E, e.g., A(n), B(n-1), C(n-3), D(n-5), E(n-7), execute concurrently on the CPU, the DMA channel, and the GPU.)
Slide 24: Compiler Framework
(Flow diagram: StreamIt program -> generate code for profiling -> execute profile runs -> configuration selection -> task partitioning (ILP partitioner or heuristic partitioner) -> instance partitioning -> modulo scheduling -> code generation -> CUDA code + C code.)
Slide 25: Experimental Results on Tesla
- Significant speedup from synergistic execution.
(Chart: speedups per benchmark; selected bars exceed 52x, 32x, and 65x.)
Slide 26: What Is Needed? (recap)
(Roadmap diagram repeated: PLASMA, a high-level IR, between the source languages (streaming languages, MPI, OpenMP, CUDA/OpenCL, array languages, parallel languages) and the compiler/runtime system targeting Cell BE, other accelerators, multicores, GPUs, and SSE.)
Slide 27: IR: What Should a Solution Provide?
- Rich abstractions for functionality; independence from any single architecture.
- Portability without compromising efficiency.
- Scale up and scale down: from a single-core embedded processor to a multi-core workstation; take advantage of accelerators (GPU, Cell, ...).
- Transparent distributed memory.
- PLASMA: Portable Programming for PLASTIC SIMD Accelerators.
Slide 28: PLASMA IR
(Diagram: matrix-vector multiply expressed as a parallel multiply over a matrix slice and the vector, feeding a reduce-add.)
Matrix-vector multiply in the IR:

    par mul, temp, A[i*n : i*n+n : 1], X
    reduce add, Y[i : i+1 : 1], temp
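For reference, a minimal CUDA sketch of the computation these two IR operations describe for an n x n matrix, with one output element per thread; this is an illustrative lowering with assumed names, not the PLASMA compiler's actual output.

    __global__ void matvec(const float *A, const float *X, float *Y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // output row index
        if (i < n) {
            float temp = 0.0f;
            for (int j = 0; j < n; ++j)
                temp += A[i * n + j] * X[j];   // par mul ...
            Y[i] = temp;                       // ... followed by reduce add into Y[i]
        }
    }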
Slide 29: Our Framework
- "CPLASM", a prototype high-level assembly language, and a prototype PLASMA IR compiler.
- Currently supported targets: C (scalar), SSE3, CUDA (NVIDIA GPUs).
- Future targets: Cell, ATI, ARM Neon, ...
- Compiler optimizations for this "vector" IR.
Slide 30: Our Framework (contd.)
- Plenty of optimization opportunities!
Slide 31: PLASMA IR Performance
- Normalized execution time is comparable to that of a hand-tuned library! (Chart.)
Slide 32: Ongoing Work
(Roadmap diagram repeated.)
- Look at other high-level languages!
- Target other accelerators.
Slide 33: Compiling OpenMP/MPI/X10
- Mapping the semantics; exploiting data parallelism and task parallelism.
- Communication and synchronization across CPUs, GPUs, and multiple nodes.
- Accelerator-specific optimizations: memory layout, memory transfer, ...
- Performance and scaling.
Slide 34: Acknowledgements
Thank you!! My students; IISc and SERC; Microsoft and NVIDIA; ATIP, NSF, ONR, and all sponsors.