
1 Jan. 2009 (C) RG@SERC,IISc Programming Models for Accelerator-Based Architectures. R. Govindarajan, HPC Lab, SERC, IISc. govind@serc.iisc.ernet.in

2 Jan. 2009 © RG@SERC,IISc 2 HPC Design Using Accelerators
High level of performance from accelerators.
Variety of general-purpose hardware accelerators:
– GPUs: NVIDIA, ATI, …
– Accelerators: ClearSpeed, Cell BE, …
– Plethora of instruction sets, even for SIMD
Programmable accelerators, e.g., FPGA-based.
HPC design using accelerators:
– Exploit instruction-level parallelism
– Exploit data-level parallelism on SIMD units
– Exploit thread-level parallelism on multiple units/multi-cores
Challenges:
– Portability across different generations and platforms
– Ability to exploit different types of parallelism

3 Jan. 2009 © RG@SERC,IISc 3 Accelerators – Cell BE

4 Jan. 2009 © RG@SERC,IISc 4 Accelerators – 8800 GPU

5 Jan. 2009 © RG@SERC,IISc 5 The Challenge

6 Jan. 2009 © RG@SERC,IISc 6 Programming in Accelerator-Based Architectures
Develop a framework that:
– is programmed in a higher-level language, and is efficient
– can exploit different types of parallelism on different hardware
– exploits parallelism across heterogeneous functional units
– is portable across platforms – not device specific!
Jointly with Prof. Matthew Jacob, Architecture Lab., SERC, IISc

7 Jan. 2009 © RG@SERC,IISc 7 Existing Approaches
(Figure: existing toolchains and their targets.)
– StreamIt → compiler → RAW, Cell BE
– Accelerator → runtime system → GPUs
– Brook → compiler → GPUs
– C/C++ → auto-vectorizer → SSE/AltiVec

8 Jan. 2009 © RG@SERC,IISc 8 What is Needed
(Figure: a single compiler/runtime system spanning all of these targets.)

9 Jan. 2009 © RG@SERC,IISc 9 Two-Pronged Approach
(Figure: a profile-based compiler generating CUDA for GPUs, and PLASMA, a high-level intermediate representation with a compiler and runtime system targeting GPUs and multicores.)

10 Jan. 2009 © RG@SERC,IISc 10 Two-Pronged Approach
(Figure: as on the previous slide, with StreamIt as the source language feeding both prongs.)

11 Jan. 2009 © RG@SERC,IISc 11 Stream Programming Model
A higher-level programming model in which nodes represent computation and channels represent communication (a producer/consumer relation) between them.
Exposes pipelined parallelism and task-level parallelism.
Temporal streaming of data.
Examples: Synchronous Data Flow (SDF), Stream Flow Graph, StreamIt, Brook, …
Compiling techniques exist for achieving rate-optimal, buffer-optimal, software-pipelined schedules.
Mapping applications to accelerators such as GPUs and the Cell BE.

12 Jan. 2009 © RG@SERC,IISc 12 The StreamIt Language
StreamIt programs are a hierarchical composition of three basic constructs:
– Pipeline
– SplitJoin, with a round-robin or duplicate splitter
– FeedbackLoop
Filters may be stateful and may peek at values beyond those they pop.
(Figure: splitter, filter, stream, and joiner; the body and loop of a FeedbackLoop.)
A minimal sketch of these constructs appears below.
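A minimal StreamIt sketch of a pipeline of filters; the filter names and rates here are illustrative, not taken from the talk:

    void->void pipeline Minimal {
        add Source();
        add Doubler();
        add Printer();
    }

    // Stateful source: pushes 0, 1, 2, ... (one value per firing)
    void->int filter Source {
        int i;
        init { i = 0; }
        work push 1 { push(i++); }
    }

    // Stateless filter: pop/push rates are fixed at compile time,
    // which is what lets the compiler build a static schedule
    int->int filter Doubler {
        work pop 1 push 1 { push(2 * pop()); }
    }

    int->void filter Printer {
        work pop 1 { println(pop()); }
    }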

13 Jan. 2009 © RG@SERC,IISc 13 StreamIt
The number of push/pop values is fixed and known at compile time.
Multi-rate firing.
(Figure: a 2-band equalizer – a signal source feeds a duplicate splitter; each branch is a bandpass filter plus amplifier; a combiner merges the two bands.)
A sketch of this graph follows.
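A hedged StreamIt-style sketch of the 2-band equalizer graph; SignalSource, BandPassAmplify, and Combiner are stand-in names, and the actual band-pass logic is elided:

    void->void pipeline Equalizer2Band {
        add SignalSource();
        add splitjoin {
            split duplicate;          // duplicate splitter: each band sees every sample
            add BandPassAmplify(2.0);
            add BandPassAmplify(0.5);
            join roundrobin;          // interleave the two band outputs
        };
        add Combiner();
    }

    void->float filter SignalSource {
        work push 1 { push(1.0); }
    }

    // Stand-in for a real bandpass filter followed by an amplifier
    float->float filter BandPassAmplify(float gain) {
        work pop 1 push 1 { push(gain * pop()); }
    }

    // Recombines corresponding samples of the two bands
    float->void filter Combiner {
        work pop 2 { println(pop() + pop()); }
    }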

14 Jan. 2009 © RG@SERC,IISc 14 Multi-Rate Firing
Consistent firing rates of nodes ensure that no data accumulates on the channels.
If node A fires 3 times, B should fire twice, and C should fire 4 times.
This amounts to solving a set of linear equations:
N_A * 2 = N_B * 3
N_B * 4 = N_C * 2
Multiple solutions are possible; the primitive steady-state solution gives the firing rates.
(Figure: graph with A pushing 2 and B popping 3 on channel A→B, and B pushing 4 and C popping 2 on channel B→C.)
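Worked out for the rates shown, the balance equations and their solutions are:

    2 * N_A = 3 * N_B
    4 * N_B = 2 * N_C
    =>  (N_A, N_B, N_C) = k * (3, 2, 4),  k = 1, 2, ...

The primitive solution is (N_A, N_B, N_C) = (3, 2, 4): A fires 3 times, B twice, and C 4 times per steady-state iteration, and neither channel accumulates data.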

15 Jan. 2009 © RG@SERC,IISc 15 StreamIt on GPUs
StreamIt provides a convenient way of programming GPUs:
– More "natural" than frameworks like CUDA or CTM for most domains.
– Easier learning curve than CUDA: the programmer does not need to think of the program in terms of "threads" or blocks, but only as a set of communicating filters.
– StreamIt programs are easier to verify, since the I/O rates of each filter are static, and hence the schedule can be determined entirely at compile time.

16 Jan. 2009 © RG@SERC,IISc 16 Challenges on GPUs
Work distribution between the multiprocessors:
– GPUs have hundreds of processors (SMs and SIMD units)!
Exploiting task-level and data-level parallelism:
– Scheduling across the multiprocessors.
– Multiple concurrent threads in each SM to exploit DLP.
Determining the execution configuration (the number of threads for each filter) that minimizes execution time.
Register constraints (even though there are thousands of them).
Lack of synchronization mechanisms between the multiprocessors of the GPU.
Managing CPU-GPU memory bandwidth efficiently.
"Stateless" filters exploit data parallelism, but "stateful" filters require special attention.

17 Jan. 2009 © RG@SERC,IISc 17 Existing Approaches
(Figure: single-threaded and SIMD execution of the stream graph.)

18 Jan. 2009 © RG@SERC,IISc 18 Existing Approaches (contd.)
(Figure: execution on the Cell BE, contrasted with our approach for GPUs.)

19 Jan. 2009 © RG@SERC,IISc 19 Compiling Stream Programs to CUDA for GPUs
Software-pipeline the execution of the stream program on the GPU:
– This takes care of synchronization and consistency issues, since the multiprocessors can execute their work in a decoupled fashion, with kernel invocations being the only synchronization points.
– Work distribution and scheduling are accomplished by formulating the problem as a unified Integer Linear Program and solving it using standard ILP solvers.
– The ILP formulation is simple enough to be solved in a few seconds on current hardware.

20 Jan. 2009 © RG@SERC,IISc 20 Example
High-level code:

    for (i = 0; i < n; i++)
        A[i] = A[i] + s;

Target assembly code:

    Loop: LD    F0, 0(R1)     ; load A[i]
          ADDD  F4, F2, F0    ; A[i] + s  (s is in F2)
          ST    0(R1), F4     ; store A[i]
          ADD   R1, R1, #8    ; advance the array pointer
          SUB   R2, R2, #1    ; decrement the trip count
          BNEZ  R2, Loop      ; repeat while R2 != 0

(Figure: data-dependence graph – Ld → Addd with latency 2, Addd → St with latency 3; Add, Sub, and the branch complete the graph.)

21 Jan. 2009 © RG@SERC,IISc 21 Basic Block Scheduling
A target architecture with 1 Int, 1 FP, 1 Ld/St, and 1 Branch FU.
Load latency = 2 cycles; FP latency = 3 cycles; all other instructions take 1 cycle.

    T  | Int | Ld/St | FP   | Br
    1  |     | Ld    |      |
    2  |     |       |      |
    3  | Sub |       | Addd |
    4  |     |       |      |
    5  |     |       |      |
    6  | Add | St    |      | Bnez
    7  |     | Ld    |      |
    8  |     |       |      |
    9  | Sub |       | Addd |
    10 |     |       |      |
    11 |     |       |      |
    12 | Add | St    |      | Bnez

6 cycles for each iteration.

22 Jan. 2009 © RG@SERC,IISc 22 Overlapped Execution of Iterations

    T  | Iteration 1 | Iteration 2 | Iteration 3
    1  | Sub, Ld     |             |
    2  | Add         |             |
    3  | Addd        | Sub, Ld     |
    4  |             | Add         |
    5  |             | Addd        | Sub, Ld
    6  | St, Bnez    |             | Add
    7  |             |             | Addd
    8  |             | St, Bnez    | …

Schedule the Add (and Sub) early:
– This may cause a problem with the St due to the anti-dependence (WAR) on R1.
– The offset of the store can be adjusted (-8 or -16 can be used!).
– It also enables the next Ld to be scheduled sooner!
A repetitive pattern appears!
Throughput = 2 cycles per iteration!

23 Jan. 2009 © RG@SERC,IISc 23 Overlapped Execution of Iterations (contd.)
(Figure: the same schedule, annotated into three regions – a prolog that fills the pipeline, a 2-cycle kernel that is repeated n-2 times, and an epilog that drains it.)

24 Jan. 2009 © RG@SERC,IISc 24 Stream Graph Execution
(Figure: SIMD execution of a stream graph with filters A, B, C, D on SM1-SM4 over time steps 0-7: all instances A1-A4 execute first, then B1-B4, then the C and D instances.)
Buffer requirement = 4x.

25 Jan. 2009 © RG@SERC,IISc 25 Stream Graph Execution (contd.)
(Figure: software-pipelined execution of the same graph on SM1-SM4 over time steps 0-7: different SMs work on different filters concurrently, so fewer instances need to be buffered at once.)
Buffer requirement = 2x.

26 Jan. 2009 © RG@SERC,IISc 26 Our Approach
A good execution configuration is determined by profiling:
– Identify a near-optimal number of concurrent thread instances per filter.
– Takes register constraints into consideration.
Formulate work scheduling and processor (SM) assignment as a unified Integer Linear Program:
– Takes communication bandwidth restrictions into account.
An efficient buffer layout scheme ensures that all accesses to GPU memory are coalesced.
Stateful filters are assigned to CPUs – synergistic execution on CPUs and GPUs is ongoing work!

27 Jan. 2009 © RG@SERC,IISc 27 ILP Formulation
Resource constraints: w_{k,v,p} = 1 ⇒ the kth instance of filter v is mapped to SM p.
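The constraint equations themselves were figures in the original slides; a plausible reconstruction of the basic assignment constraint implied by the definition of w_{k,v,p} (see the CGO-09 paper for the exact formulation):

    sum over p of w_{k,v,p} = 1    for every instance k of every filter v

i.e., each filter instance is mapped to exactly one SM, with further per-SM constraints bounding the work assigned to each processor.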

28 Jan. 2009 © RG@SERC,IISc 28 ILP Formulation (contd.)
Dependence constraint: σ(j,k,v) is the scheduled time of the kth instance of filter v in steady-state iteration j.
– o_{k,v} specifies the time within the SWP kernel.
– f_{k,v} specifies the stage of the SWP kernel.
– Filter execution must complete by the end of the kernel.
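The equations were likewise figures; a reconstruction consistent with the definitions above, assuming a kernel initiation interval of T cycles (again, the exact formulation is in the CGO-09 paper):

    σ(j, k, v) = o_{k,v} + (j + f_{k,v}) * T

and for each channel u → v, a dependence constraint of the form

    σ(j, k, u) + d_u <= σ(j', k', v)

where d_u is the execution delay of u and (j', k') ranges over the consuming instances of v.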

29 Jan. 2009 © RG@SERC,IISc 29 ILP Formulation (contd.)
Dependence constraint (contd.): the admissibility of the schedule is captured by a further set of inequalities.
Constraint-solving the above equations gives the schedule!

30 Jan. 2009 © RG@SERC,IISc 30 Compiler Framework

31 Jan. 2009 © RG@SERC,IISc 31 Experimental Results
(Figure: speedup of stream programs on a GPU (8800) relative to a CPU.)
Filters are coarsened before scheduling!

32 Jan. 2009 © RG@SERC,IISc 32 Experimental Results (contd.)
(Figure: improvements due to buffer coalescing.)
More results in the CGO-09 paper!

33 Jan. 2009 © RG@SERC,IISc 33 Two-Pronged Approach
(Figure: the two-pronged picture again – the profile-based compiler generating CUDA for GPUs, and the compiler/runtime system for GPUs and multicores – now turning to the second prong.)

34 Jan. 2009 © RG@SERC,IISc 34 Challenges
– Different SIMD architectures: threaded (GPU) vs. short-vector (CPU)
– Multiple homogeneous cores
– Heterogeneous accelerators
– Distributed memory on chip!

35 Jan. 2009 © RG@SERC,IISc 35 What Should a Solution Provide?
– Rich abstractions for functionality – not a lowest common denominator.
– Independence from any single architecture.
– Portability without compromises on efficiency – don't forget the high-performance goals of the ISA.
– Scale-up and scale-down – from a single-core embedded processor to a multi-core workstation.
– Take advantage of accelerators (GPU, Cell, etc.).
– Transparent distributed memory.
⇒ PLASMA: Portable Programming for PLASTIC SIMD Accelerators

36 Jan. 2009 © RG@SERC,IISc 36 Our Approach
(Figure: stream program → intermediate representation → CUDA / C with intrinsics.)
From a stream (or other high-level) programming model to a high-level intermediate language:
– Perform suitable compiler optimizations.
– The intermediate representation is expressive enough to handle (target) machine specificities.
From the IR to the target machine:
– Exploit SIMD and thread-level parallelism.
– Agnostic to SIMD width.
– Manages heterogeneous memory.

37 Jan. 2009 © RG@SERC,IISc 37 PLASMA Overview

38 Jan. 2009 © RG@SERC,IISc 38 PLASMA IR
Operator:
– Add, Mult, …
Vector:
– A 1-D bulk data type over the base types.
Distributor:
– Distributes an operator over a vector.
– Example: par add returns a vector.
Vector composition:
– Concat, slice, gather, scatter, …
Matrix-vector multiply (row i):

    par mul, temp, A[i * n : i * n + n : 1], X
    reduce add, Y[i : i + 1 : 1], temp

(Figure: dataflow – slices of M and V feed Par Mul, whose result feeds Reduce Add.)
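In effect, this instruction pair computes one row of Y = A * X (reading the slice syntax as start : end : stride):

    Y[i] = sum over j = 0 .. n-1 of A[i*n + j] * X[j]

The par step forms the elementwise products of row i of A with X in temp, and the reduce step sums temp into Y[i].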

39 Jan. 2009 © RG@SERC,IISc 39 Our Framework
"CPLASM", a prototype high-level assembly language.
A prototype PLASMA IR compiler.
Currently supported targets: C (scalar), SSE3, CUDA (NVIDIA GPUs).
Future targets: Cell, ATI, ARM NEON, …
Compiler optimizations for this "Vector" IR.

40 Jan. 2009 © RG@SERC,IISc 40 Our Framework (contd.)

41 Jan. 2009 © RG@SERC,IISc 41 Experimental Results
– Kernel programs written in CPLASM.
– Compiled to C or CUDA, exposing SIMD parallelism.
– Execution on SSE2 or GPU.
– Comparison with hand-optimized libraries.

42 Jan. 2009 © RG@SERC,IISc 42 Initial Results
– Compares well with hand-optimized library kernels.
– The blocking (tiling) optimization can lead to better performance.

43 Jan. 2009 © RG@SERC,IISc 43 Future Directions
– Synergistic execution of stream programs on CPUs and GPUs.
– Support for multiple heterogeneous functional units.
– Retargeting PLASMA for multiple accelerators.
– Extending the framework beyond stream programming models.

44 Jan. 2009 (C)RG@SERC,IISc Thank You !!

