Programming Models for Accelerator-Based Architectures
R. Govindarajan
HPC Lab, SERC, IISc
HPC Design Using Accelerators

Accelerators offer a high level of performance, and a variety of general-purpose hardware accelerators is available:
– GPUs: NVIDIA, ATI, …
– Accelerators: ClearSpeed, Cell BE, …
– A plethora of instruction sets, even for SIMD
– Programmable accelerators, e.g., FPGA-based

HPC design using accelerators:
– Exploit instruction-level parallelism
– Exploit data-level parallelism on SIMD units
– Exploit thread-level parallelism on multiple units/multi-cores

Challenges:
– Portability across different generations and platforms
– Ability to exploit different types of parallelism
Accelerators – Cell BE
Accelerators – GPU
The Challenge
Programming in Accelerator-Based Architectures

Develop a framework that:
– Is programmed in a higher-level language, and is efficient
– Can exploit different types of parallelism on different hardware
– Exploits parallelism across heterogeneous functional units
– Is portable across platforms – not device-specific!

Jointly with Prof. Matthew Jacob, Architecture Lab, SERC, IISc
Existing Approaches

[Figure: StreamIt → compiler → RAW, Cell BE; Accelerator → runtime system → GPUs; Brook → compiler → GPUs; C/C++ → auto-vectorizer → SSE/AltiVec]
What Is Needed

[Figure: a common compiler/runtime system layer between high-level programs and the diverse accelerator targets]
Two-Pronged Approach

[Figure: (1) StreamIt programs compiled by a profile-based compiler to CUDA for GPUs; (2) PLASMA, a high-level intermediate representation with its compiler and runtime system, targeting GPUs and multicores]
Stream Programming Model

A higher-level programming model in which nodes represent computation and channels represent communication (producer/consumer relations) between them.
– Exposes pipelined parallelism and task-level parallelism
– Temporal streaming of data
– Examples: Synchronous Data Flow (SDF), Stream Flow Graph, StreamIt, Brook, …
– Compilation techniques exist for achieving rate-optimal, buffer-optimal, software-pipelined schedules
– Our goal: mapping such applications to accelerators such as GPUs and the Cell BE
The StreamIt Language

StreamIt programs are a hierarchical composition of three basic constructs:
– Pipeline
– SplitJoin, with a round-robin or duplicate splitter
– FeedbackLoop
Filters may be stateful, and may peek at values beyond those they pop.

[Figure: a pipeline of filters; a splitter/joiner pair around parallel streams; a feedback loop with body and loop streams]
StreamIt (contd.)

– The number of push/pop values per firing is fixed and known at compile time
– Multi-rate firing

[Figure: a 2-band equalizer – signal source → duplicate splitter → two 'bandpass filter + amplifier' branches → combiner]
Multi-Rate Firing

The firing rates of nodes must be consistent to ensure no data accumulates on channels. In the example graph A → B → C, if node A fires 3 times, B should fire twice, and C should fire 4 times.

Finding the rates amounts to solving a set of linear balance equations:
  N_A * 2 = N_B * 3
  N_B * 4 = N_C * 2
Multiple solutions are possible; the smallest positive integer solution gives the primitive steady-state firing rates.
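To make the balance-equation computation concrete, here is a minimal sketch (not from the slides; the chain topology, rates, and variable names are illustrative) that derives the primitive steady-state firing rates for the example above. It is plain host-side C++17 and compiles unchanged under the CUDA toolchain:

  // Sketch: primitive steady-state firing rates for the chain A -> B -> C.
  // Channel A->B: A pushes 2 per firing, B pops 3. Channel B->C: B pushes 4, C pops 2.
  // Balance equations: N_A * 2 == N_B * 3 and N_B * 4 == N_C * 2.
  #include <cstdio>
  #include <numeric>   // std::gcd, std::lcm (C++17)

  int main() {
      // Solve left to right with rationals num/den, starting from N_A = 1.
      long numA = 1, denA = 1;                  // N_A = 1
      long numB = numA * 2, denB = denA * 3;    // N_B = N_A * 2/3
      long numC = numB * 4, denC = denB * 2;    // N_C = N_B * 4/2
      // Scale by the lcm of the denominators to get integer rates.
      long l = std::lcm(std::lcm(denA, denB), denC);
      long NA = numA * (l / denA), NB = numB * (l / denB), NC = numC * (l / denC);
      long g = std::gcd(std::gcd(NA, NB), NC); // reduce to the primitive solution
      printf("N_A=%ld N_B=%ld N_C=%ld\n", NA / g, NB / g, NC / g);
      return 0;
  }

Running it prints N_A=3 N_B=2 N_C=4, matching the rates on the slide.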
StreamIt on GPUs

– StreamIt provides a convenient way of programming GPUs, more 'natural' than frameworks like CUDA or CTM for most domains.
– It has an easier learning curve than CUDA: the programmer does not need to think of the program in terms of 'threads' or blocks, only as a set of communicating filters.
– StreamIt programs are easier to verify: since the I/O rates of each filter are static, the schedule can be determined entirely at compile time.
Challenges on GPUs

– Work distribution between the multiprocessors: GPUs have hundreds of processors (SMs and SIMD units)!
– Exploiting task-level and data-level parallelism: scheduling across the multiprocessors, and multiple concurrent threads per SM to exploit DLP
– Determining the execution configuration (number of threads for each filter) that minimizes execution time, under register constraints (even though there are thousands of registers)
– Lack of synchronization mechanisms between the multiprocessors of the GPU
– Managing CPU–GPU memory bandwidth efficiently
– 'Stateless' filters exploit data parallelism, but 'stateful' filters require special attention
Existing Approaches

[Figure: single-threaded and SIMD execution of a stream graph]
Existing Approaches (contd.)

[Figure: execution on the Cell BE, contrasted with our approach for GPUs]
Compiling Stream Programs to CUDA for GPUs

Software-pipeline the execution of the stream program on the GPU:
– This takes care of synchronization and consistency issues, since the multiprocessors can execute their work in a decoupled fashion, with kernel invocations being the only synchronization points.
– Work distribution and scheduling are accomplished by formulating the problem as a unified Integer Linear Program and solving it using standard ILP solvers.
– The ILP formulation is simple enough to be solved in a few seconds on current hardware.
Example

High-level code:
  for (i = 0; i < n; i++)
      A[i] = A[i] + s;

Target assembly code:
  Loop: LD   F0, 0(R1)
        ADDD F4, F2, F0
        ST   0(R1), F4
        ADD  R1, R1, #8
        SUB  R2, R2, #1
        BNEZ R2, Loop

DDG: Ld → Addd (latency 2), Addd → St (latency 3), plus Add, Sub, and the branch.
Basic Block Scheduling

Target architecture: 1 Int, 1 Ld/St, 1 FP, and 1 Branch FU.
– Load latency = 2 cycles
– FP latency = 3 cycles
– All other instructions take 1 cycle

  T  | Int. | Ld/St | FP   | Br.
  1  |      | Ld    |      |
  2  |      |       |      |
  3  | Sub  |       | Addd |
  4  |      |       |      |
  5  |      |       |      |
  6  | Add  | St    |      | Bnez
  7  |      | Ld    |      |
  8  |      |       |      |
  9  | Sub  |       | Addd |
  …  |      |       |      |
  12 | Add  | St    |      | Bnez

6 cycles for each iteration.
Overlapped Execution of Iterations

  T | Int.   | Ld/St | FP      | Br.
  1 | Sub(1) | Ld(1) |         |
  2 | Add(1) |       |         |
  3 | Sub(2) | Ld(2) | Addd(1) |
  4 | Add(2) |       |         |
  5 | Sub(3) | Ld(3) | Addd(2) |
  6 | Add(3) | St(1) |         | Bnez(1)
  7 | Sub(4) | Ld(4) | Addd(3) |
  8 | Add(4) | St(2) |         | Bnez(2)

– Schedule the Add (and Sub) early.
  – This may cause a problem with St due to the anti-dependence (WAR) on R1, but the offset of the store can be adjusted (-8 or -16 can be used!).
  – It enables the next Ld to be scheduled sooner!
– A repetitive pattern appears: throughput = 2 cycles per iteration!
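Two cycles per iteration is in fact a lower bound here, not just what this particular schedule happens to achieve. The standard resource-constrained minimum initiation interval (ResMII) argument, spelled out for this example (this step is not on the original slide), is:

  \mathrm{ResMII} = \max_{r} \left\lceil \frac{\#\text{ops using } r}{\#\text{units of } r} \right\rceil
                  = \max\!\left( \underbrace{\left\lceil \tfrac{2}{1} \right\rceil}_{\text{Int: Add, Sub}},\;
                                 \underbrace{\left\lceil \tfrac{2}{1} \right\rceil}_{\text{Ld/St: Ld, St}},\;
                                 \underbrace{\left\lceil \tfrac{1}{1} \right\rceil}_{\text{FP: Addd}},\;
                                 \underbrace{\left\lceil \tfrac{1}{1} \right\rceil}_{\text{Br: Bnez}} \right) = 2

Since the kernel achieves an initiation interval of 2, the schedule is rate-optimal.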
Overlapped Execution of Iterations (contd.)

The schedule decomposes into three parts: a prolog, a kernel that is repeated n-2 times, and an epilog.

[Figure: the schedule table from the previous slide, with the filling cycles marked as the prolog, the repeating 2-cycle pattern as the kernel (repeated n-2 times), and the draining cycles as the epilog]
Stream Graph Execution – SIMD Execution

[Figure: stream graph with filters A, B, C, D; under SIMD execution, SM1-SM4 execute instances A1-A4 together, then B1-B4, then the C and D instances]

Buffer requirement = 4 ×
Stream Graph Execution – Software-Pipelined Execution

[Figure: the same stream graph under software-pipelined execution; SM1-SM4 concurrently execute instances of different filters from different steady-state iterations]

Buffer requirement = 2 ×
Our Approach

– A good execution configuration is determined by profiling: identify a near-optimal number of concurrent thread instances per filter, taking register constraints into consideration.
– Work scheduling and processor (SM) assignment are formulated as a unified Integer Linear Program, taking communication bandwidth restrictions into account.
– An efficient buffer layout scheme ensures all accesses to GPU memory are coalesced (see the sketch below).
– Stateful filters are assigned to CPUs – synergistic execution on CPUs and GPUs is ongoing work!
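To illustrate the coalescing idea, here is a minimal CUDA sketch under an assumed interleaved layout (the kernel name, the amplifier body, and the one-thread-per-filter-instance convention are hypothetical, not the paper's actual generated code). Element i of instance t lives at buf[i * T + t], so consecutive threads touch consecutive addresses and each access coalesces into wide memory transactions:

  // Each of T concurrent instances of a (stateless) filter processes its
  // own stream elements, laid out interleaved across instances.
  __global__ void filter_kernel(const float *in, float *out, int elemsPerInstance) {
      int t = blockIdx.x * blockDim.x + threadIdx.x;  // filter-instance id
      int T = gridDim.x * blockDim.x;                 // total concurrent instances
      for (int i = 0; i < elemsPerInstance; ++i) {
          float v = in[i * T + t];     // coalesced: stride-1 across threads
          out[i * T + t] = 2.0f * v;   // placeholder filter work (an amplifier)
      }
  }

With a row-per-instance layout (buf[t * elemsPerInstance + i]) the threads of a warp would instead access addresses far apart, and the accesses would not coalesce.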
ILP Formulation

Resource constraints. Let the binary variable $w_{k,v,p} = 1$ iff the kth instance of filter v is mapped to SM p. Every instance must be mapped to exactly one SM:

  \sum_{p} w_{k,v,p} = 1 \quad \forall k, v
ILP Formulation (contd.)

Dependence constraints. Let $\Gamma(j,k,v)$ be the schedule time of the kth instance of filter v in steady-state iteration j, decomposed as

  \Gamma(j,k,v) = (j + f_{k,v}) \cdot T + o_{k,v}

where $o_{k,v}$ specifies the time within the software-pipelined (SWP) kernel of length T and $f_{k,v}$ specifies the stage of the SWP kernel. Filter execution must complete by the end of the kernel:

  o_{k,v} + d_v \le T

with $d_v$ the execution time of filter v.
ILP Formulation (contd.)

Admissibility of the schedule is given by the dependence constraints: each filter instance may be scheduled only after the instances producing its inputs have completed. Solving the resource and dependence constraints together gives the schedule!
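Schematically, the pieces above assemble into an ILP of the following shape (a hedged summary, not the paper's exact formulation; the objective and the precise indexing of the dependence inequalities are in the CGO-09 paper):

  \begin{aligned}
  \text{find } & w_{k,v,p} \in \{0,1\},\; f_{k,v} \in \mathbb{Z}_{\ge 0},\; o_{k,v} \in [0, T) \\
  \text{s.t. } & \textstyle\sum_{p} w_{k,v,p} = 1 && \text{(each instance on exactly one SM)} \\
               & o_{k,v} + d_v \le T && \text{(complete within the kernel)} \\
               & \Gamma(j,k,v) \ge \Gamma(j',k',u) + d_u && \text{(for each dependence } (j',k',u) \to (j,k,v))
  \end{aligned}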
Compiler Framework
Experimental Results

[Figure: speedup of stream programs on the GPU (8800) compared to the CPU]

Filters are coarsened before scheduling!
Experimental Results (contd.)

[Figure: improvements due to the coalesced buffer layout]

More results in the CGO-09 paper!
Two-Pronged Approach (contd.)

[Figure: the two-pronged diagram again, now highlighting the second prong – the compiler/runtime system and high-level IR targeting GPUs and multicores]
Challenges

– Different SIMD architectures: threaded (GPU) vs. short-vector (CPU)
– Multiple homogeneous cores
– Heterogeneous accelerators
– Distributed memory on chip!
What Should a Solution Provide?

– Rich abstractions for functionality – not a lowest common denominator
– Independence from any single architecture
– Portability without compromises on efficiency – don't forget the high-performance goals of the ISA
– Scale-up and scale-down – from a single-core embedded processor to a multi-core workstation
– Ability to take advantage of accelerators (GPU, Cell, etc.)
– Transparent distributed memory

PLASMA: Portable Programming for PLASTIC SIMD Accelerators
Our Approach

Stream program → intermediate representation → target machine.
– CUDA, C with intrinsics, stream, or other high-level programming models are translated to a high-level intermediate language:
  – Suitable compiler optimizations are performed on it
  – The intermediate representation is expressive enough to handle (target) machine specificities
– The IR is then compiled to the target machine:
  – Exploits SIMD and thread-level parallelism
  – Agnostic to SIMD width
  – Manages heterogeneous memory
PLASMA Overview
PLASMA IR

– Operator: add, mult, …
– Vector: a 1-D bulk data type over base types
– Distributor: distributes an operator over a vector (e.g., par add returns the vector of element-wise sums)
– Vector composition: concat, slice, gather, scatter, …

Example – matrix-vector multiply (row i), using par, reduce, and slices:
  par mul, temp, A[i * n : i * n + n : 1], X
  reduce add, Y[i : i + 1 : 1], temp
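For intuition, here is a minimal sketch of what this par mul / reduce add pair might correspond to on the CUDA target (assumptions: row-major n × n matrix A, one thread per row; the kernel and its signature are illustrative, not the compiler's actual output):

  // Y = A * X: thread i handles the row slice A[i*n : i*n+n].
  __global__ void matvec(const float *A, const float *X, float *Y, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;   // row index
      if (i >= n) return;
      float acc = 0.0f;
      for (int j = 0; j < n; ++j)
          acc += A[i * n + j] * X[j];  // par mul and reduce add, fused
      Y[i] = acc;                      // Y[i : i+1 : 1]
  }

The point of the IR is that the same par/reduce program could equally be lowered to an SSE loop or scalar C, since it is agnostic to SIMD width.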
Our Framework

– 'CPLASM', a prototype high-level assembly language
– A prototype PLASMA IR compiler
– Currently supported targets: C (scalar), SSE3, CUDA (NVIDIA GPUs)
– Future targets: Cell, ATI, ARM Neon, …
– Compiler optimizations for this 'vector' IR
Our Framework (contd.)
Experimental Results

– Kernel programs written in CPLASM
– Compiled to C or CUDA, exposing SIMD parallelism
– Executed on SSE2 or the GPU
– Compared with hand-optimized libraries
Initial Results

– Compares well with hand-optimized library kernels
– The blocking (tiling) optimization can lead to better performance
Future Directions

– Synergistic execution of stream programs on the CPU and GPU
– Support for multiple heterogeneous functional units
– Retargeting PLASMA to multiple accelerators
– Extending the framework beyond stream programming models
Thank You!!