Software Pipelined Execution of Stream Programs on GPUs
Abhishek Udupa et al., Indian Institute of Science, Bangalore, 2009

Abstract
- Describes the challenges in mapping StreamIt programs to GPUs
- Proposes an efficient technique to software pipeline the execution of stream programs on GPUs
- Formulates the problem of scheduling and assigning actors to processors as an ILP
- Also describes an efficient buffer layout technique for GPUs that facilitates exploiting their high memory bandwidth
- Yields speedups between 1.87X and 36.83X over a single-threaded CPU

Motivations
- GPUs have emerged as massively parallel machines
- GPUs can perform general-purpose computation
- The latest GPUs, consisting of hundreds of SPUs, can support thousands of concurrent threads with zero-cost hardware-controlled context switching
- GeForce 8800 GTS: 128 SPUs forming 16 multiprocessors, connected by a 256-bit wide bus to 512 MB of memory

Motivations Contd.
- Each multiprocessor in NVIDIA GPUs is conceptually a wide SIMD processor
- CPUs use hierarchies of caches to tolerate memory latency; GPUs address the problem by providing a high-bandwidth processor-memory link and supporting a high degree of simultaneous multithreading (SMT) within each processing element
- The combination of SIMD and SMT enables GPUs to achieve a peak throughput of 400 GFLOPS

Motivations Contd.
- The high performance of GPUs comes at the cost of reduced flexibility and greater programming effort from the user
- For instance, in NVIDIA GPUs, threads executing on different multiprocessors cannot, within a kernel invocation:
  - synchronize in an efficient manner
  - communicate in a reliable and consistent manner through the device memory
- The GPU cannot access the memory of the host system
- While the memory bus is capable of delivering very high bandwidth, accesses to device memory by threads executing on a multiprocessor need to be coalesced in order to achieve it

Motivations Contd.
- ATI and NVIDIA have proposed the CTM and CUDA frameworks, respectively, for developing general-purpose applications targeting GPUs
  - Both frameworks still require the programmer to express the program as data-parallel computations that can be executed efficiently on the GPU
  - Programming with these frameworks ties the application to the platform
  - Any change in the platform would require significant rework to port the applications

Prior Works
- I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, "Brook for GPUs: Stream Computing on Graphics Hardware," ACM Transactions on Graphics, vol. 23, no. 3, pp. 777–786, 2004
- D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," in ASPLOS-XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, 2006, pp. 325–335
- These techniques still require considerable effort by the programmer to transform a streaming application to execute efficiently on the GPU
- M. Kudlur and S. Mahlke (PLDI 2008) map StreamIt onto the Cell BE platform

[Figure: executing the loop (a) on an unthreaded CPU, (b) with 4-wide SIMD, and (c) on the Cell BE, with a split across SPE0 and SPE1 followed by a join; SIMD reduces the trip count of the loop]

With the GPU Model
[Figure: iterations X1..X2n, Y1..Y2n, Z1..Z2n spread across GPU threads]
- Thousands of iterations are carried out in parallel
- Greatly reduces the trip count of the loop
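As a minimal sketch (an assumed example, not code from the paper), the picture above corresponds to launching one GPU thread per loop iteration, so a CPU loop of n iterations collapses into a single parallel kernel launch:

```cuda
// Hypothetical sketch: each GPU thread performs one iteration of the
// stream loop z[i] = f(x[i], y[i]).
__global__ void streamStep(const float *x, const float *y, float *z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        z[i] = x[i] + y[i];   // stand-in for the actual filter computation
    }
}
```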

Motivations Contd.
- Parallelism must be exploited at various levels:
  - data parallelism across threads executing on a multiprocessor
  - SMT parallelism among threads on the same multiprocessor, which must be managed to obtain optimum performance
  - the parallelism offered by having multiple multiprocessors on a GPU, which should be used to exploit task- and pipeline-level parallelism
- Further, accesses to device memory need to be coalesced as far as possible to ensure optimal usage of the available bandwidth
- These are the challenges addressed in this paper

Contributions
This paper makes the following contributions:
- Describes a software pipelining framework for efficiently mapping StreamIt programs onto GPUs
- Presents a buffer mapping scheme for StreamIt programs that makes efficient use of the memory bandwidth of the GPU
- Describes a profile-based approach to decide the optimal number of threads assigned to a multiprocessor (the execution configuration) for StreamIt filters
- Implements the scheme in the StreamIt compiler and demonstrates a 1.87X to 36.83X speedup over a single-threaded CPU on a set of streaming applications

StreamIt Overview
- StreamIt is a novel language for streaming, parallel computation
  - Exposes parallelism and communication
  - Architecture independent
  - Modular and composable: simple structures are composed to create complex graphs
  - Malleable: program behavior can be changed with small modifications
[Figure: StreamIt constructs (filter, pipeline, splitjoin with splitter and joiner, feedback loop); each component may be any StreamIt language construct]

Organization of the NVIDIA GeForce 8800 Series of GPUs
[Figure: architecture of the GeForce 8800 GPU and the architecture of an individual SM]

ILP Formulation: Definitions
- Formulate the scheduling of the stream graph across the multiprocessors (SMs) of the GPU as an ILP
- Each instance of each filter is the fundamental schedulable entity
- $V$ denotes the set of all nodes (filters) in the stream graph
- $N$ is the number of nodes in the stream graph
- $E$ denotes the set of all (directed) edges in the stream graph
- $k_v$ represents the number of firings of a filter $v \in V$ in the steady state

ILP Formulation: Definitions Contd.
- $T$ represents the Initiation Interval (II) of the software pipelined loop
- $d(v)$ is the delay, or execution time, of filter $v \in V$
- $I_{uv}$ is the number of elements consumed by filter $v$ on each firing of $v$, for an edge $(u, v) \in E$
- $O_{uv}$ is the number of elements produced by filter $u$ on each firing of $u$, for an edge $(u, v) \in E$
- $m_{uv}$ denotes the number of elements initially present on the edge $(u, v) \in E$

ILP Formulation: Resource Constraints
- $w_{k,v,p}$ are 0-1 integer variables, for all $k \in [0, k_v)$, all $v \in V$, and all $p \in [0, P_{max})$
- $w_{k,v,p} = 1$ implies that the $k$-th instance of filter $v$ has been assigned to SM $p$
- The following constraint ensures that each instance of each filter is assigned to exactly one SM (see the reconstruction below)
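The equation image is not preserved in the transcript; from the slide text, it is presumably the standard assignment constraint:

```latex
% Each instance of each filter is assigned to exactly one SM.
\sum_{p=0}^{P_{max}-1} w_{k,v,p} = 1 \qquad \forall v \in V,\ \forall k \in [0, k_v)
```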

ILP Formulation: Resource Constraints Contd.
- The next constraint models the fact that all the filters assigned to an SM can be scheduled within the specified II ($T$); see the sketch below
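The equation is again missing; a plausible reconstruction, assuming the filter instances mapped to an SM execute sequentially within one II:

```latex
% The total execution time of the instances assigned to SM p fits in T.
\sum_{v \in V} \sum_{k=0}^{k_v - 1} d(v)\, w_{k,v,p} \le T \qquad \forall p \in [0, P_{max})
```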

ILP Formulation Contd.
- For dependences among filters, we first need to ensure that the execution of any instance of any filter cannot wrap around into the next II
- Consider the linear form of a software pipelined schedule (reconstructed below), where $\sigma(j, k, v)$ represents the time at which the execution of the $k$-th instance of filter $v$ in the $j$-th steady-state iteration of the stream graph is scheduled, and the $A_{k,v}$ are integer variables with $A_{k,v} \ge 0$ for all $k \in [0, k_v)$ and all $v \in V$
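The missing equation is presumably the standard linear form of a modulo schedule:

```latex
% Start time of the k-th instance of v in steady-state iteration j.
\sigma(j, k, v) = A_{k,v} + j \cdot T
```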

ILP Formulation Contd.
- We can rewrite $A_{k,v}$ by separating it into a whole number of IIs plus an offset within one II, defining a variable for each part; the resulting linear form of the schedule is sketched below
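A plausible reconstruction of the three missing equations, using standard modulo-scheduling notation:

```latex
% Split A_{k,v} into whole IIs plus an offset within one II:
A_{k,v} = \lfloor A_{k,v} / T \rfloor \cdot T + (A_{k,v} \bmod T)
% Define the stage and offset variables:
a_{k,v} = \lfloor A_{k,v} / T \rfloor, \qquad o_{k,v} = A_{k,v} \bmod T
% The linear form of the schedule then becomes:
\sigma(j, k, v) = (a_{k,v} + j) \cdot T + o_{k,v}
```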

ILP Formulation Contd.
- Constrain the starting times $o_{k,v}$ of filters as sketched below
- This constraint ensures that every firing of a filter is scheduled so that it completes within the same kernel, i.e., within the II
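Assuming the usual bound that keeps a firing inside one II, the missing constraint reads:

```latex
% A firing of length d(v) starting at offset o_{k,v} must end by time T.
0 \le o_{k,v} \le T - d(v) \qquad \forall v \in V,\ \forall k \in [0, k_v)
```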

ILP Formulation: Dependency Constraints
- To ensure that the firing rule for each instance of each filter is satisfied by a schedule, an admissibility condition is imposed (sketched below)
- It assumes that the firings of the instances of filter $v$ are serialized
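The equation itself is an image and is missing; a plausible form, following standard SDF scheduling theory (an assumption, not necessarily the paper's exact formulation): writing $\sigma(n, v)$ for the start time of the $n$-th overall firing of $v$ (so $n = j \cdot k_v + k$), the firing rule requires

```latex
% Firing n of v needs (n+1) I_{uv} - m_{uv} tokens on edge (u,v), which are
% produced by the first q firings of u; the q-th such firing must finish first.
% (No constraint when q <= 0, i.e., when the initial tokens suffice.)
\sigma(n, v) \;\ge\; \sigma(q - 1, u) + d(u), \qquad
q = \left\lceil \frac{(n+1)\, I_{uv} - m_{uv}}{O_{uv}} \right\rceil
```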

ILP Formulation: Dependency Constraints Contd.
- However, this assumption does not hold for our model, where instances of each filter could execute out of order, or in parallel across different SMs, as long as the firing rule is satisfied
- So we need to ensure that the schedule is admissible with respect to all the predecessors of the $i$-th firing of filter $v$

ILP Formulation: Dependency Constraints Contd. A B 2 3 A0A0 B0B0 21 A1A1 A2A2 B1B

ILP Formulation: Dependency Constraints Contd.
- Intuitively, the $i$-th firing of a filter $v$ must wait for all of the $I_{uv}$ tokens produced after its $(i-1)$-th firing

ILP Formulation: Dependency Constraints Contd.
- So far we have assumed that the result of executing a filter $u$ is available $d(u)$ time units after the filter $u$ has started executing
- However, the limitations of a GPU imply that this may not always be the case
- If results are required from a producer instance of a filter $u$ that is scheduled on an SM different from the SM where the consumer instance of a filter $v$ is scheduled, then the data produced can only be reliably accessed in the next steady-state iteration

ILP Formulation: Dependency Constraints Contd.
- This is modelled using the $w_{k,v,p}$ variables, which indicate which SM a particular instance of a given filter is assigned to
- 0-1 integer variables $g_{l,k,u,v}$ are defined (see the sketch below)
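The defining constraints are an image and are missing; presumably $g_{l,k,u,v} = 1$ when the $l$-th instance of $u$ and the $k$-th instance of $v$ are assigned to different SMs. One standard linear encoding, offered here only as an assumption:

```latex
% g_{l,k,u,v} is forced to 1 whenever the two instances sit on different
% SMs; the dependence constraint can then add a T * g term to push the
% consumer into the next steady-state iteration.
g_{l,k,u,v} \;\ge\; w_{l,u,p} - w_{k,v,p} \qquad \forall p \in [0, P_{max})
```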

ILP Formulation: Dependency Constraints Contd.
- Finally, the constraints are simplified
- Equations (1)–(6) together form the ILP

Code Generation for GPUs
[Figure: overview of the compilation process for targeting a StreamIt program onto the GPU]

CUDA Memory Model
- All threads of a thread block are assigned to exactly one SM; up to 8 thread blocks can be assigned to one SM
- A group of thread blocks forms a grid
- Finally, a kernel call dispatched to the GPU through the CUDA runtime consists of exactly one grid
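A minimal illustration of this hierarchy in CUDA launch syntax (illustrative values only; streamStep, dX, dY, dZ, and n are the hypothetical names from the earlier sketch):

```cuda
// One kernel call dispatches exactly one grid; the grid here is 64 thread
// blocks of 128 threads each, and every block runs on exactly one SM.
dim3 block(128);
dim3 grid(64);
streamStep<<<grid, block>>>(dX, dY, dZ, n);
cudaDeviceSynchronize();   // wait for the whole grid to finish
```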

Profiling and Execution Configuration Selection
- It is important to determine the optimal execution configuration, specified by the number of threads per block and the number of registers allocated per thread
- This is achieved by the profiling and execution configuration selection phase
- The StreamIt compiler is modified to generate CUDA sources along with a driver routine for each filter in the StreamIt program

The Purpose of Profiling
- First, to accurately estimate the execution time of each filter on the GPU, for use in the ILP
- Second, to identify the optimal number of threads, by running multiple profile runs of the same filter while varying the number of threads and the number of registers

Algorithm for Profiling Filters
[Figure: profiling algorithm pseudocode]
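A hypothetical profiling driver, sketched under assumptions (the paper's actual algorithm is the figure above): it times a filter's generated kernel at several thread counts and keeps the fastest, and the measured times also serve as the $d(v)$ values fed to the ILP. Varying registers per thread would happen at compile time, e.g. via nvcc's -maxrregcount flag, with one binary per register budget.

```cuda
#include <cuda_runtime.h>

// Stand-in for one filter's generated work function.
__global__ void filterKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Time one launch of the kernel at a given thread count, in milliseconds.
static float timeConfig(const float *dIn, float *dOut, int n, int threadsPerBlock) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    cudaEventRecord(start);
    filterKernel<<<blocks, threadsPerBlock>>>(dIn, dOut, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Try a few candidate thread counts and return the fastest one.
int bestThreadCount(const float *dIn, float *dOut, int n) {
    const int candidates[] = {32, 64, 128, 256, 512};
    int best = candidates[0];
    float bestMs = 1e30f;
    for (int i = 0; i < 5; ++i) {
        float ms = timeConfig(dIn, dOut, n, candidates[i]);
        if (ms < bestMs) { bestMs = ms; best = candidates[i]; }
    }
    return best;
}
```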

Optimizing Buffer Layout
[Figure: a filter with pop rate 4, executing with 4 threads, on a device with memory organized as 8 banks]

Optimizing Buffer Layout Contd.
- Each thread in a cluster of 128 threads pops or pushes the first token it consumes or produces from contiguous locations in memory, as sketched below
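A hypothetical CUDA illustration of the idea (the indexing scheme is assumed, not the paper's exact layout): storing the i-th token of every thread contiguously lets threads t, t+1, ... read adjacent addresses, which the hardware can coalesce.

```cuda
// A filter with pop rate 4: each thread consumes 4 tokens per firing.
__global__ void popFour(const float *buf, float *out, int numThreads) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numThreads) return;
    float acc = 0.0f;
    for (int i = 0; i < 4; ++i) {
        // The per-thread layout buf[t * 4 + i] would make each thread
        // stride through memory; the transposed layout below keeps
        // consecutive threads on consecutive addresses instead.
        acc += buf[i * numThreads + t];
    }
    out[t] = acc;   // stand-in: 'push' one token per firing
}
```

With the transposed layout, the 128 threads of a cluster read 128 consecutive words per pop, which the memory controller services as a few wide transactions rather than 128 scattered ones.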

Experimental Evaluation
- The evaluation considers only stateless filters

Experimental Setup
- Each benchmark was compiled and executed on a machine with dual quad-core Intel Xeon processors running at 2.83 GHz and 16 GB of FB-DIMM main memory
- The machine runs Linux
- A GeForce 8800 GTS 512 GPU with 512 MB of device memory was used

[Figure: speedup results; SWPNC denotes the SWP implementation with no coalescing, and Serial denotes serial execution of filters using a SAS schedule; the benchmarks show phased behaviour]

[Figure: effect of coarsening; SWP is the optimized SWP schedule with no coarsening, SWP 4 is coarsened 4 times, SWP 8 is coarsened 8 times, and SWP 16 is coarsened 16 times]

About the ILP Solver
- CPLEX version 9.0, running on Linux, was used to solve the ILP
- The machine used was a quad-processor Intel Xeon at 3.06 GHz with 4 GB of RAM
- CPLEX was run as a single-threaded application and hence did not make use of the available SMP
- The solver was allotted 20 seconds to attempt a solution with a given II
- If it failed to find a solution within 20 seconds, the II was relaxed by 0.5% and the process repeated until a feasible solution was found
- Handling stateful filters on GPUs is a possible direction for future work

Questions?