Software Pipelined Execution of Stream Programs on GPUs
Abhishek Udupa et al., Indian Institute of Science, Bangalore, 2009
Abstract
- Describes the challenges in mapping StreamIt to GPUs
- Proposes an efficient technique to software pipeline the execution of stream programs on GPUs
- Formulates the problem of scheduling and actor assignment to processors as an efficient ILP
- Also describes an efficient buffer layout technique for GPUs that facilitates exploiting the high memory bandwidth available on GPUs
- Yields speedups between 1.87X and 36.83X over a single-threaded CPU
Motivations
- GPUs have emerged as massively parallel machines
- GPUs can perform general-purpose computation
- The latest GPUs, consisting of hundreds of SPUs, are capable of supporting thousands of concurrent threads with zero-cost hardware-controlled context switching
- GeForce 8800 GTS: 128 SPUs forming 16 multiprocessors, connected by a 256-bit wide bus to 512 MB of memory
Motivations Contd.
- Each multiprocessor in NVIDIA GPUs is conceptually a wide SIMD processor
- CPUs have hierarchies of caches to tolerate memory latency
- GPUs address the problem by providing a high-bandwidth processor-memory link and supporting a high degree of simultaneous multithreading (SMT) within each processing element
- The combination of SIMD and SMT enables GPUs to achieve a peak throughput of 400 GFLOPs
Motivations Contd.
- The high performance of GPUs comes at the cost of reduced flexibility and greater programming effort from the user
- For instance, in NVIDIA GPUs, threads executing on different multiprocessors cannot, within a kernel invocation, synchronize in an efficient manner or communicate in a reliable and consistent manner through the device memory
- The GPU cannot access the memory of the host system
- While the memory bus is capable of delivering very high bandwidth, accesses to device memory by threads executing on a multiprocessor need to be combined (coalesced) in order to achieve it
Motivations Contd.
- ATI and NVIDIA have proposed the CTM and CUDA frameworks, respectively, for developing general-purpose applications targeting GPUs
- But both of these frameworks still require the programmer to express the program as data-parallel computations that can be executed efficiently on the GPU
- Programming with these frameworks ties the application to the platform
- Any change in the platform would require significant rework to port the applications
Prior Works
- I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, "Brook for GPUs: Stream Computing on Graphics Hardware," ACM Transactions on Graphics, vol. 23, no. 3, pp. 777-786, 2004
- D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," in ASPLOS-XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, 2006, pp. 325-335
- These techniques still require considerable effort from the programmer to transform the streaming application to execute efficiently on the GPU
- M. Kudlur and S. Mahlke [PLDI 2008] map StreamIt onto the Cell BE platform
[Figure: executing a stream computation (a) on an unthreaded CPU, (b) with 4-wide SIMD, (c) on the Cell BE (per Kudlur and Mahlke), with the data split across SPE0 and SPE1 and then joined. SIMD reduces the trip count of the loop.]
With GPU Model
[Figure: with the GPU model, thousands of iterations are carried out in parallel, greatly reducing the trip count of the loop.]
Motivations Contd.
- Parallelism must be exploited at various levels: the data parallelism across threads executing on a multiprocessor
- The SMT parallelism among threads on the same multiprocessor needs to be managed to provide optimum performance
- The parallelism offered by having multiple multiprocessors on a GPU should be used to exploit task- and pipeline-level parallelism
- Further, accesses to device memory need to be coalesced as far as possible to ensure optimal usage of the available bandwidth
- This paper addresses these challenges
Contributions
This paper makes the following contributions:
- Describes a software pipelining framework for efficiently mapping StreamIt programs onto GPUs
- Presents a buffer mapping scheme for StreamIt programs that makes efficient use of the memory bandwidth of the GPU
- Describes a profile-based approach to decide the optimal number of threads assigned to a multiprocessor (the execution configuration) for StreamIt filters
- Implements the scheme in the StreamIt compiler and demonstrates a 1.87X to 36.83X speedup over a single-threaded CPU on a set of streaming applications
StreamIt Overview
- StreamIt is a novel language for streaming
- Exposes parallelism and communication
- Architecture independent
- Modular and composable: simple structures are composed to create complex graphs
- Malleable: change program behavior with small modifications
[Figure: StreamIt language constructs - filter, pipeline, splitjoin (splitter/joiner), and feedback loop; each component may be any StreamIt language construct]
Organization of the NVIDIA GeForce 8800 Series of GPUs
[Figure: architecture of the GeForce 8800 GPU and architecture of an individual SM]
ILP Formulation: Definitions
- Formulate the scheduling of the stream graph across the multiprocessors (SMs) of the GPU as an ILP
- Each instance of each filter is treated as the fundamental schedulable entity
- V denotes the set of all nodes (filters) in the stream graph
- N is the number of nodes in the stream graph
- E denotes the set of all (directed) edges in the stream graph
- k_v represents the number of firings of a filter v ∈ V in steady state
ILP Formulation: Definitions Contd.
- T represents the Initiation Interval (II) of the software pipelined loop
- d(v) is the delay, or execution time, of filter v ∈ V
- I_uv is the number of elements consumed by filter v on each firing of v, for an edge (u, v) ∈ E
- O_uv is the number of elements produced by filter u on each firing of u, for an edge (u, v) ∈ E
- m_uv denotes the number of elements initially present on the edge (u, v) ∈ E
ILP Formulation: Resource Constraints
- w_{k,v,p} are 0-1 integer variables, for all k ∈ [0, k_v), all v ∈ V, all p ∈ [0, P_max)
- w_{k,v,p} = 1 implies that the k-th instance of the filter v has been assigned to SM p
- The following constraint ensures that each instance of each filter is assigned to exactly one SM
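The constraint itself is an image in the original slides; a reconstruction from the stated meaning:

$$\sum_{p=0}^{P_{max}-1} w_{k,v,p} = 1 \qquad \forall v \in V,\ \forall k \in [0, k_v)$$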
ILP Formulation: Resource Constraints Contd.
- This constraint models the fact that all the filters assigned to an SM can be scheduled within the II (T) specified
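The constraint is an image in the slides; a reconstruction, assuming the execution times of all instances assigned to SM p must fit within one II:

$$\sum_{v \in V} \sum_{k=0}^{k_v - 1} d(v) \cdot w_{k,v,p} \le T \qquad \forall p \in [0, P_{max})$$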
ILP Formulation Contd.
- For dependencies among filters, first need to ensure that the execution of any instance of any filter cannot wrap around into the next II
- Consider the linear form of a software pipelined schedule (reconstructed below), where σ(j, k, v) represents the time at which the execution of the k-th instance of filter v in the j-th steady-state iteration of the stream graph is scheduled, and A_{k,v} are integer variables with A_{k,v} ≥ 0, for all k ∈ [0, k_v), all v ∈ V
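The linear form is an image in the slides; presumably the standard one for software pipelining:

$$\sigma(j, k, v) = T \cdot j + A_{k,v}$$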
ILP Formulation Contd.
- Can rewrite A_{k,v} by defining a stage number f_{k,v} and an offset o_{k,v} within the II; the linear form of the schedule then follows, as sketched below
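A reconstruction of the missing equations, assuming the standard modulo-scheduling decomposition:

$$A_{k,v} = T \cdot f_{k,v} + o_{k,v}, \qquad f_{k,v} = \left\lfloor \frac{A_{k,v}}{T} \right\rfloor, \qquad o_{k,v} = A_{k,v} \bmod T$$

$$\sigma(j, k, v) = T \cdot (j + f_{k,v}) + o_{k,v}$$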
ILP Formulation Contd.
- Constrain the starting times o_{k,v} of filters as follows
- This constraint ensures that every firing of a filter is scheduled so that it completes within the same kernel, or II
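A reconstruction, assuming a firing that starts at offset o_{k,v} within the II must also finish within it:

$$0 \le o_{k,v} \le T - d(v) \qquad \forall v \in V,\ \forall k \in [0, k_v)$$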
ILP Formulation: Dependency Constraints
- To ensure that the firing rule for each instance of each filter is satisfied by a schedule, the admissibility of a schedule is given below
- It assumes that the firings of the instances of filter v are serialized
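The admissibility condition is an image in the slides; a sketch of the standard SDF form it likely takes, for an edge (u, v) ∈ E: the i-th firing of v may start only after the producer firing that generates the last token it needs has completed:

$$\sigma(j, i, v) \;\ge\; \sigma\!\left(j,\ \left\lceil \frac{(i+1)\,I_{uv} - m_{uv}}{O_{uv}} \right\rceil - 1,\ u\right) + d(u)$$

(A negative producer index means the required tokens are already available as initial tokens m_uv.)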
ILP Formulation: Dependency Constraints Contd.
- However, this assumption does not hold for this model, where instances of each filter could execute out of order, or in parallel across different SMs, as long as the firing rule is satisfied
- So need to ensure that the schedule is admissible with respect to all the predecessors of the i-th firing of the filter v
[Figure: example stream graph with an edge from filter A to filter B, where A pushes 2 tokens and B pops 3 tokens per firing; the steady state has three firings of A (A0, A1, A2) and two firings of B (B0, B1)]
ILP Formulation: Dependency Constraints Contd.
- Intuitively, the i-th firing of a filter v must wait for all of the I_uv tokens produced after its (i-1)-th firing
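Applying the sketched condition to the A→B example above (I_AB = 3, O_AB = 2, m_AB = 0; a worked reconstruction, not from the slides): firing B_0 needs tokens up to producer firing ⌈(1·3 − 0)/2⌉ − 1 = 1, so it waits for A_1; firing B_1 needs tokens up to producer firing ⌈(2·3 − 0)/2⌉ − 1 = 2, so it waits for A_2.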
ILP Formulation: Dependency Constraints Contd.
- So far, the assumption has been that the result of executing a filter u is available d(u) time units after the filter u has started executing
- However, the limitations of a GPU imply that this may not always be the case
- If the results are required from a producer instance of a filter u that is scheduled on an SM different from the SM where the consumer instance of a filter v is scheduled, then the data produced can only be reliably accessed in the next steady-state iteration
ILP Formulation: Dependency Constraints Contd.
- Model this using the w_{k,v,p} variables, which indicate which SM a particular instance of a given filter is assigned to
- Define 0-1 integer variables g_{l,k,u,v} (one possible linearization is sketched below)
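The definition is an image in the slides; one standard linearization (an assumption, not quoted from the paper) makes g_{l,k,u,v} = 1 whenever the l-th instance of u and the k-th instance of v land on different SMs:

$$g_{l,k,u,v} \;\ge\; w_{l,u,p} - w_{k,v,p} \qquad \forall p \in [0, P_{max})$$

Since each instance is assigned to exactly one SM, the right-hand side equals 1 for some p exactly when the two assignments differ; the dependency constraint can then be relaxed by T · g_{l,k,u,v}, deferring cross-SM consumption to a later iteration.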
ILP Formulation: Dependency Constraints Contd.
- Finally, by simplifying, the combined dependency constraint is obtained; together, equations 1-6 form an ILP
Code Generation for GPUs
[Figure: overview of the compilation process for targeting a StreamIt program onto the GPU]
CUDA Memory Model
- All threads of a thread block are assigned to exactly one SM, and up to 8 thread blocks can be assigned to one SM
- A group of thread blocks forms a grid
- Finally, a kernel call dispatched to the GPU through the CUDA runtime consists of exactly one grid
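To make the block/grid hierarchy concrete, a minimal CUDA sketch (illustrative only; the kernel, sizes, and configuration are assumptions, not from the paper):

```cuda
#include <cuda_runtime.h>

// Hypothetical filter kernel: each thread handles one element.
// The global thread index combines grid and block coordinates.
__global__ void filterKernel(const float *in, float *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = 2.0f * in[tid];  // stand-in for a filter's work function
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    // Execution configuration: all threads of a block run on one SM;
    // the blocks of the grid are distributed across the SMs.
    dim3 block(256);                            // threads per block
    dim3 grid((n + block.x - 1) / block.x);     // thread blocks in the grid
    filterKernel<<<grid, block>>>(in, out, n);  // one kernel call = one grid
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```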
Profiling and Execution Configuration Selection
- It is important to determine the optimal execution configuration, specified by the number of threads per block and the number of registers allocated per thread
- This is achieved by the profiling and execution configuration selection phase
- The StreamIt compiler is modified to generate the CUDA sources along with a driver routine for each filter in the StreamIt program
The Purpose of Profiling
- First, to accurately estimate the execution time of each filter on the GPU, for use in the ILP
- Second, by running multiple profile runs of the same filter while varying the number of threads and the number of registers, to identify the optimal number of threads
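A minimal sketch of such a profile run, timing a generated driver kernel with CUDA events while sweeping the thread count (the driver body and the candidate counts are hypothetical; the register budget is fixed at compile time, e.g. via nvcc's -maxrregcount flag):

```cuda
#include <cuda_runtime.h>

// Hypothetical generated driver: repeatedly applies stand-in filter work.
__global__ void filterDriver(float *buf, int iters) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < iters; ++i)
        buf[tid] = buf[tid] * 1.0001f + 0.5f;
}

// Time one execution configuration with CUDA events; returns milliseconds.
float profileConfig(float *buf, int iters, int blocks, int threads) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    filterDriver<<<blocks, threads>>>(buf, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Sweep candidate thread counts and keep the fastest configuration.
// buf must hold at least blocks * 512 floats for the largest sweep.
int pickThreadCount(float *buf, int iters, int blocks) {
    int best = 32;
    float bestMs = 1e30f;
    for (int t = 32; t <= 512; t *= 2) {
        float ms = profileConfig(buf, iters, blocks, t);
        if (ms < bestMs) { bestMs = ms; best = t; }
    }
    return best;
}
```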
Algorithm for Profiling Filters
[Figure: pseudocode of the filter profiling algorithm]
Optimizing Buffer Layout
- Example: a filter with pop rate 4, executing with 4 threads, on a device with memory organized as 8 banks
Optimizing Buffer Layout Contd.
- Each thread of each cluster of 128 threads pops or pushes the first token it consumes or produces from contiguous locations in memory
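A sketch contrasting a naive layout with the interleaved layout implied above, assuming token i of thread t is stored at buf[i * numThreads + t] (index scheme inferred from the description, not quoted from the paper):

```cuda
// Naive layout: thread t's tokens are contiguous (buf[t * popRate + i]).
// Consecutive threads then read addresses popRate elements apart, so the
// accesses of a half-warp cannot be coalesced.
__global__ void popNaive(const float *buf, float *out, int popRate) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int i = 0; i < popRate; ++i)
        acc += buf[t * popRate + i];
    out[t] = acc;
}

// Interleaved layout: token i of every thread is stored contiguously
// (buf[i * numThreads + t]). Consecutive threads read consecutive
// addresses, so each iteration's loads coalesce into few transactions.
__global__ void popInterleaved(const float *buf, float *out,
                               int popRate, int numThreads) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int i = 0; i < popRate; ++i)
        acc += buf[i * numThreads + t];
    out[t] = acc;
}
```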
Experimental Evaluation
- Only stateless filters are considered
Experimental Setup
- Each benchmark was compiled and executed on a machine with dual quad-core Intel Xeon processors running at 2.83 GHz, with 16 GB of FB-DIMM main memory
- The machine runs Linux, with kernel version 2.6.18, and NVIDIA driver version 173.14.09
- Used a GeForce 8800 GTS 512 GPU, with 512 MB of device memory
[Figure: speedup results. SWPNC: SWP implementation with no coalescing; Serial: serial execution of filters using a SAS schedule. Annotated: phased behaviour.]
[Figure: effect of coarsening. SWP: optimized SWP schedule with no coarsening; SWP 4: coarsened 4 times; SWP 8: coarsened 8 times; SWP 16: coarsened 16 times.]
About the ILP Solver
- CPLEX version 9.0, running on Linux, was used to solve the ILP
- The machine used was a quad-processor Intel Xeon 3.06 GHz with 4 GB of RAM
- CPLEX was run as a single-threaded application and hence did not make use of the available SMP
- The solver was allotted 20 seconds to attempt a solution with a given II
- If it failed to find a solution in 20 seconds, the II was relaxed by 0.5% and the process repeated until a feasible solution was found
- Handling stateful filters on GPUs is possible future work
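The II search described above is a simple relaxation loop; a host-side sketch with a stub standing in for the real solver call (solveILPWithTimeout and its behavior are hypothetical, not a CPLEX API):

```cuda
#include <cstdio>

// Stub standing in for building the ILP and invoking the solver with a
// time limit; a real implementation would call into CPLEX here.
bool solveILPWithTimeout(double II, double timeoutSeconds) {
    (void)timeoutSeconds;
    return II >= 100.0;  // pretend schedules become feasible at II >= 100
}

// Start from a lower bound on the II; relax by 0.5% after each failure.
double findFeasibleII(double minII) {
    double II = minII;
    while (!solveILPWithTimeout(II, 20.0))  // 20 s budget per attempt
        II *= 1.005;
    return II;
}

int main() {
    printf("feasible II: %.2f\n", findFeasibleII(80.0));
    return 0;
}
```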
Questions?