Programming Models for Accelerator-Based Architectures
R. Govindarajan
HPC Lab, SERC, IISc
HPC Design Using Accelerators

Accelerators offer a high level of performance, and a variety of general-purpose hardware accelerators is available:
– GPUs: NVIDIA, ATI, …
– Accelerators: ClearSpeed, Cell BE, …
– A plethora of instruction sets, even for SIMD
– Programmable accelerators, e.g., FPGA-based

HPC design using accelerators:
– Exploit instruction-level parallelism
– Exploit data-level parallelism on SIMD units
– Exploit thread-level parallelism on multiple units/multi-cores

Challenges:
– Portability across different generations and platforms
– Ability to exploit different types of parallelism
Accelerators – Cell BE
Accelerators – GPU
The Challenge
Programming in Accelerator-Based Architectures

Develop a framework that:
– Is programmed in a higher-level language, and is efficient
– Can exploit different types of parallelism on different hardware
– Exploits parallelism across heterogeneous functional units
– Is portable across platforms – not device-specific!

Jointly with Prof. Matthew Jacob, Architecture Lab, SERC, IISc
Existing Approaches

[Figure: StreamIt → compiler → RAW, Cell BE; Accelerator → runtime system → GPUs; Brook → compiler → GPUs; C/C++ → auto-vectorizer → SSE/AltiVec]
What Is Needed

[Figure: a common compiler/runtime system layer between high-level programs and the diverse accelerator targets]
Two-Pronged Approach

[Figure: (1) StreamIt programs compiled by a profile-based compiler to CUDA for GPUs; (2) PLASMA, a high-level intermediate representation with its compiler and runtime system, targeting GPUs and multicores]
Stream Programming Model

A higher-level programming model in which nodes represent computation and channels represent communication (producer/consumer relations) between them.
– Exposes pipelined parallelism and task-level parallelism
– Temporal streaming of data
– Examples: Synchronous Data Flow (SDF), Stream Flow Graph, StreamIt, Brook, …
– Compilation techniques exist for achieving rate-optimal, buffer-optimal, software-pipelined schedules
– Our goal: mapping such applications to accelerators such as GPUs and the Cell BE
The StreamIt Language

StreamIt programs are a hierarchical composition of three basic constructs:
– Pipeline
– SplitJoin, with a round-robin or duplicate splitter
– FeedbackLoop
Filters may be stateful, and may peek at values beyond those they pop.

[Figure: a pipeline of filters; a splitter/joiner pair around parallel streams; a feedback loop with body and loop streams]
StreamIt (contd.)

– The number of push/pop values per firing is fixed and known at compile time
– Multi-rate firing

[Figure: a 2-band equalizer – signal source → duplicate splitter → two 'bandpass filter + amplifier' branches → combiner]
Multi-Rate Firing

The firing rates of nodes must be consistent to ensure no data accumulates on channels. In the example graph A → B → C, if node A fires 3 times, B should fire twice, and C should fire 4 times.

Finding the rates amounts to solving a set of linear balance equations:
  N_A * 2 = N_B * 3
  N_B * 4 = N_C * 2
Multiple solutions are possible; the smallest positive integer solution gives the primitive steady-state firing rates.
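To make the balance-equation computation concrete, here is a minimal sketch (not from the slides; the chain topology, rates, and variable names are illustrative) that derives the primitive steady-state firing rates for the example above. It is plain host-side C++17 and compiles unchanged under the CUDA toolchain:

  // Sketch: primitive steady-state firing rates for the chain A -> B -> C.
  // Channel A->B: A pushes 2 per firing, B pops 3. Channel B->C: B pushes 4, C pops 2.
  // Balance equations: N_A * 2 == N_B * 3 and N_B * 4 == N_C * 2.
  #include <cstdio>
  #include <numeric>   // std::gcd, std::lcm (C++17)

  int main() {
      // Solve left to right with rationals num/den, starting from N_A = 1.
      long numA = 1, denA = 1;                  // N_A = 1
      long numB = numA * 2, denB = denA * 3;    // N_B = N_A * 2/3
      long numC = numB * 4, denC = denB * 2;    // N_C = N_B * 4/2
      // Scale by the lcm of the denominators to get integer rates.
      long l = std::lcm(std::lcm(denA, denB), denC);
      long NA = numA * (l / denA), NB = numB * (l / denB), NC = numC * (l / denC);
      long g = std::gcd(std::gcd(NA, NB), NC); // reduce to the primitive solution
      printf("N_A=%ld N_B=%ld N_C=%ld\n", NA / g, NB / g, NC / g);
      return 0;
  }

Running it prints N_A=3 N_B=2 N_C=4, matching the rates on the slide.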
StreamIt on GPUs

– StreamIt provides a convenient way of programming GPUs, more 'natural' than frameworks like CUDA or CTM for most domains.
– It has an easier learning curve than CUDA: the programmer does not need to think of the program in terms of 'threads' or blocks, only as a set of communicating filters.
– StreamIt programs are easier to verify: since the I/O rates of each filter are static, the schedule can be determined entirely at compile time.
Challenges on GPUs

– Work distribution between the multiprocessors: GPUs have hundreds of processors (SMs and SIMD units)!
– Exploiting task-level and data-level parallelism: scheduling across the multiprocessors, and multiple concurrent threads per SM to exploit DLP
– Determining the execution configuration (number of threads for each filter) that minimizes execution time, under register constraints (even though there are thousands of registers)
– Lack of synchronization mechanisms between the multiprocessors of the GPU
– Managing CPU–GPU memory bandwidth efficiently
– 'Stateless' filters exploit data parallelism, but 'stateful' filters require special attention
Existing Approaches

[Figure: single-threaded and SIMD execution of a stream graph]
Existing Approaches (contd.)

[Figure: execution on the Cell BE, contrasted with our approach for GPUs]
Compiling Stream Programs to CUDA for GPUs

Software-pipeline the execution of the stream program on the GPU:
– This takes care of synchronization and consistency issues, since the multiprocessors can execute their work in a decoupled fashion, with kernel invocations being the only synchronization points.
– Work distribution and scheduling are accomplished by formulating the problem as a unified Integer Linear Program and solving it using standard ILP solvers.
– The ILP formulation is simple enough to be solved in a few seconds on current hardware.
Example

High-level code:
  for (i = 0; i < n; i++)
      A[i] = A[i] + s;

Target assembly code:
  Loop: LD   F0, 0(R1)
        ADDD F4, F2, F0
        ST   0(R1), F4
        ADD  R1, R1, #8
        SUB  R2, R2, #1
        BNEZ R2, Loop

DDG: Ld → Addd (latency 2), Addd → St (latency 3), plus Add, Sub, and the branch.
Basic Block Scheduling

Target architecture: 1 Int, 1 Ld/St, 1 FP, and 1 Branch FU.
– Load latency = 2 cycles
– FP latency = 3 cycles
– All other instructions take 1 cycle

  T  | Int. | Ld/St | FP   | Br.
  1  |      | Ld    |      |
  2  |      |       |      |
  3  | Sub  |       | Addd |
  4  |      |       |      |
  5  |      |       |      |
  6  | Add  | St    |      | Bnez
  7  |      | Ld    |      |
  8  |      |       |      |
  9  | Sub  |       | Addd |
  …  |      |       |      |
  12 | Add  | St    |      | Bnez

6 cycles for each iteration.
Overlapped Execution of Iterations

  T | Int.   | Ld/St | FP      | Br.
  1 | Sub(1) | Ld(1) |         |
  2 | Add(1) |       |         |
  3 | Sub(2) | Ld(2) | Addd(1) |
  4 | Add(2) |       |         |
  5 | Sub(3) | Ld(3) | Addd(2) |
  6 | Add(3) | St(1) |         | Bnez(1)
  7 | Sub(4) | Ld(4) | Addd(3) |
  8 | Add(4) | St(2) |         | Bnez(2)

– Schedule the Add (and Sub) early.
  – This may cause a problem with St due to the anti-dependence (WAR) on R1, but the offset of the store can be adjusted (-8 or -16 can be used!).
  – It enables the next Ld to be scheduled sooner!
– A repetitive pattern appears: throughput = 2 cycles per iteration!
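Two cycles per iteration is in fact a lower bound here, not just what this particular schedule happens to achieve. The standard resource-constrained minimum initiation interval (ResMII) argument, spelled out for this example (this step is not on the original slide), is:

  \mathrm{ResMII} = \max_{r} \left\lceil \frac{\#\text{ops using } r}{\#\text{units of } r} \right\rceil
                  = \max\!\left( \underbrace{\left\lceil \tfrac{2}{1} \right\rceil}_{\text{Int: Add, Sub}},\;
                                 \underbrace{\left\lceil \tfrac{2}{1} \right\rceil}_{\text{Ld/St: Ld, St}},\;
                                 \underbrace{\left\lceil \tfrac{1}{1} \right\rceil}_{\text{FP: Addd}},\;
                                 \underbrace{\left\lceil \tfrac{1}{1} \right\rceil}_{\text{Br: Bnez}} \right) = 2

Since the kernel achieves an initiation interval of 2, the schedule is rate-optimal.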
Overlapped Execution of Iterations (contd.)

The schedule decomposes into three parts: a prolog, a kernel that is repeated n-2 times, and an epilog.

[Figure: the schedule table from the previous slide, with the filling cycles marked as the prolog, the repeating 2-cycle pattern as the kernel (repeated n-2 times), and the draining cycles as the epilog]
Stream Graph Execution – SIMD Execution

[Figure: stream graph with filters A, B, C, D; under SIMD execution, SM1-SM4 execute instances A1-A4 together, then B1-B4, then the C and D instances]

Buffer requirement = 4 ×
Stream Graph Execution – Software-Pipelined Execution

[Figure: the same stream graph under software-pipelined execution; SM1-SM4 concurrently execute instances of different filters from different steady-state iterations]

Buffer requirement = 2 ×
Our Approach

– A good execution configuration is determined by profiling: identify a near-optimal number of concurrent thread instances per filter, taking register constraints into consideration.
– Work scheduling and processor (SM) assignment are formulated as a unified Integer Linear Program, taking communication bandwidth restrictions into account.
– An efficient buffer layout scheme ensures all accesses to GPU memory are coalesced (see the sketch below).
– Stateful filters are assigned to CPUs – synergistic execution on CPUs and GPUs is ongoing work!
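To illustrate the coalescing idea, here is a minimal CUDA sketch under an assumed interleaved layout (the kernel name, the amplifier body, and the one-thread-per-filter-instance convention are hypothetical, not the paper's actual generated code). Element i of instance t lives at buf[i * T + t], so consecutive threads touch consecutive addresses and each access coalesces into wide memory transactions:

  // Each of T concurrent instances of a (stateless) filter processes its
  // own stream elements, laid out interleaved across instances.
  __global__ void filter_kernel(const float *in, float *out, int elemsPerInstance) {
      int t = blockIdx.x * blockDim.x + threadIdx.x;  // filter-instance id
      int T = gridDim.x * blockDim.x;                 // total concurrent instances
      for (int i = 0; i < elemsPerInstance; ++i) {
          float v = in[i * T + t];     // coalesced: stride-1 across threads
          out[i * T + t] = 2.0f * v;   // placeholder filter work (an amplifier)
      }
  }

With a row-per-instance layout (buf[t * elemsPerInstance + i]) the threads of a warp would instead access addresses far apart, and the accesses would not coalesce.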
ILP Formulation

Resource constraints. Let the binary variable $w_{k,v,p} = 1$ iff the kth instance of filter v is mapped to SM p. Every instance must be mapped to exactly one SM:

  \sum_{p} w_{k,v,p} = 1 \quad \forall k, v
ILP Formulation (contd.)

Dependence constraints. Let $\Gamma(j,k,v)$ be the schedule time of the kth instance of filter v in steady-state iteration j, decomposed as

  \Gamma(j,k,v) = (j + f_{k,v}) \cdot T + o_{k,v}

where $o_{k,v}$ specifies the time within the software-pipelined (SWP) kernel of length T and $f_{k,v}$ specifies the stage of the SWP kernel. Filter execution must complete by the end of the kernel:

  o_{k,v} + d_v \le T

with $d_v$ the execution time of filter v.
ILP Formulation (contd.)

Admissibility of the schedule is given by the dependence constraints: each filter instance may be scheduled only after the instances producing its inputs have completed. Solving the resource and dependence constraints together gives the schedule!
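Schematically, the pieces above assemble into an ILP of the following shape (a hedged summary, not the paper's exact formulation; the objective and the precise indexing of the dependence inequalities are in the CGO-09 paper):

  \begin{aligned}
  \text{find } & w_{k,v,p} \in \{0,1\},\; f_{k,v} \in \mathbb{Z}_{\ge 0},\; o_{k,v} \in [0, T) \\
  \text{s.t. } & \textstyle\sum_{p} w_{k,v,p} = 1 && \text{(each instance on exactly one SM)} \\
               & o_{k,v} + d_v \le T && \text{(complete within the kernel)} \\
               & \Gamma(j,k,v) \ge \Gamma(j',k',u) + d_u && \text{(for each dependence } (j',k',u) \to (j,k,v))
  \end{aligned}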
Compiler Framework
Experimental Results

[Figure: speedup of stream programs on the GPU (8800) compared to the CPU]

Filters are coarsened before scheduling!
Experimental Results (contd.)

[Figure: improvements due to the coalesced buffer layout]

More results in the CGO-09 paper!
Two-Pronged Approach (contd.)

[Figure: the two-pronged diagram again, now highlighting the second prong – the compiler/runtime system and high-level IR targeting GPUs and multicores]
Challenges

– Different SIMD architectures: threaded (GPU) vs. short-vector (CPU)
– Multiple homogeneous cores
– Heterogeneous accelerators
– Distributed memory on chip!
What Should a Solution Provide?

– Rich abstractions for functionality – not a lowest common denominator
– Independence from any single architecture
– Portability without compromises on efficiency – don't forget the high-performance goals of the ISA
– Scale-up and scale-down – from a single-core embedded processor to a multi-core workstation
– Ability to take advantage of accelerators (GPU, Cell, etc.)
– Transparent distributed memory

PLASMA: Portable Programming for PLASTIC SIMD Accelerators
Our Approach

Stream program → intermediate representation → target machine.
– CUDA, C with intrinsics, stream, or other high-level programming models are translated to a high-level intermediate language:
  – Suitable compiler optimizations are performed on it
  – The intermediate representation is expressive enough to handle (target) machine specificities
– The IR is then compiled to the target machine:
  – Exploits SIMD and thread-level parallelism
  – Agnostic to SIMD width
  – Manages heterogeneous memory
PLASMA Overview
PLASMA IR

– Operator: add, mult, …
– Vector: a 1-D bulk data type over base types
– Distributor: distributes an operator over a vector (e.g., par add returns the vector of element-wise sums)
– Vector composition: concat, slice, gather, scatter, …

Example – matrix-vector multiply (row i), using par, reduce, and slices:
  par mul, temp, A[i * n : i * n + n : 1], X
  reduce add, Y[i : i + 1 : 1], temp
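For intuition, here is a minimal sketch of what this par mul / reduce add pair might correspond to on the CUDA target (assumptions: row-major n × n matrix A, one thread per row; the kernel and its signature are illustrative, not the compiler's actual output):

  // Y = A * X: thread i handles the row slice A[i*n : i*n+n].
  __global__ void matvec(const float *A, const float *X, float *Y, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;   // row index
      if (i >= n) return;
      float acc = 0.0f;
      for (int j = 0; j < n; ++j)
          acc += A[i * n + j] * X[j];  // par mul and reduce add, fused
      Y[i] = acc;                      // Y[i : i+1 : 1]
  }

The point of the IR is that the same par/reduce program could equally be lowered to an SSE loop or scalar C, since it is agnostic to SIMD width.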
Our Framework

– 'CPLASM', a prototype high-level assembly language
– A prototype PLASMA IR compiler
– Currently supported targets: C (scalar), SSE3, CUDA (NVIDIA GPUs)
– Future targets: Cell, ATI, ARM Neon, …
– Compiler optimizations for this 'vector' IR
Our Framework (contd.)
Experimental Results

– Kernel programs written in CPLASM
– Compiled to C or CUDA, exposing SIMD parallelism
– Executed on SSE2 or the GPU
– Compared with hand-optimized libraries
Initial Results

– Compares well with hand-optimized library kernels
– The blocking (tiling) optimization can lead to better performance
Future Directions

– Synergistic execution of stream programs on the CPU and GPU
– Support for multiple heterogeneous functional units
– Retargeting PLASMA to multiple accelerators
– Extending the framework beyond stream programming models
Thank You!!