An Architecture Evaluation and Implementation of a Soft GPGPU for FPGAs
Kevin Andryc
2017/12/1
Outline
- Motivation
- Background
- FlexGrip: Soft GPGPU
- FlexGrip: Soft GPGPU Optimizations
Motivation
- Compiling FPGA designs is time consuming: every change requires resynthesizing the design to create a netlist, then implementing it (translate, map, place & route) and generating the BIT file
- Not every system has a GPGPU available
- GPGPUs are not practical for systems that require minimal power and heat
- GPGPUs are inflexible compared to FPGAs
FlexGrip: Soft GPGPU
FlexGrip: FLEXible GRaphIcs Processor
- Fully CUDA binary-compatible integer soft GPGPU: runs multiple applications without recompiling the hardware
- Supports highly multithreaded applications and complex conditional execution
- Hardware flexibility:
  - Trade power versus performance
  - Add processing, memory, and custom resources
  - Reconfigure (perhaps on-the-fly) for specific applications (e.g., cloud computing)
Outline
- Motivation
- Background
- FlexGrip: Soft GPGPU
- FlexGrip: Soft GPGPU Optimizations
Introduction to the GPGPU Hardware Architecture
- Array of streaming multiprocessors (SMs); each SM consists of a set of 32-bit scalar processors (SPs)
- Single Instruction Multiple Data (SIMD) execution: the multiprocessor executes the same instruction on different scalar processors at each clock cycle
- SP: scalar processor (core)
- SFU: special function unit, which executes transcendental instructions such as sine, cosine, reciprocal, and square root
Image courtesy: S. Collange, M. Daumas, D. Defour, and D. Parello, "Barra: A Parallel Functional Simulator for GPGPU," IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), Aug. 2010
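To make the SP/SFU split concrete, here is a minimal CUDA sketch (my own illustration, not from the talk; the kernel name is hypothetical). The fast-math intrinsics __sinf, __cosf, __logf, and rsqrtf are the kind of transcendental operations a GPU services with its SFUs rather than its ordinary scalar cores:

```cuda
// Hypothetical example: transcendental intrinsics of the sort handled by SFUs.
__global__ void sfu_demo(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n) {
        float x = in[i];
        // __sinf/__cosf/__logf/rsqrtf trade a little accuracy for throughput.
        out[i] = __sinf(x) + __cosf(x) + __logf(x + 1.0f) + rsqrtf(x + 1.0f);
    }
}
```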
CUDA: The Software Model
- Compute Unified Device Architecture (CUDA): a parallel programming model from NVIDIA; an extended version of C
- Functions executed on the GPU (kernels) run as SIMD groups of threads called warps
- A kernel is launched as a grid of thread blocks; a thread block is a collection of threads that can execute in parallel
Image courtesy: Patterson, David A.; Hennessy, John L., Computer Architecture: A Quantitative Approach, 5th ed., Morgan Kaufmann, Sept. 30, 2011
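A minimal end-to-end sketch of this model (my own example; the kernel name and sizes are arbitrary). The launch creates a grid of thread blocks, each block is split into 32-thread warps, and every thread computes a unique index:

```cuda
#include <cuda_runtime.h>

// Each thread increments one element of the array.
__global__ void add_one(int *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) x[i] = x[i] + 1;  // threads of a warp execute this in lockstep
}

int main() {
    const int n = 256;
    int *d_x;
    cudaMalloc(&d_x, n * sizeof(int));
    cudaMemset(d_x, 0, n * sizeof(int));
    add_one<<<2, 128>>>(d_x, n);  // grid of 2 blocks x 128 threads (4 warps each)
    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}
```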
Software to Hardware Mapping
- Block scheduler assigns thread blocks to multiprocessors; threads are scheduled in the form of warps
- Warp: a subset of threads whose operations are performed in parallel, sometimes conditionally
- Fine-grained scheduling: the SM is architected as a single instruction, multiple thread (SIMT) processor
- Each scalar processor (SP) executes one thread while maintaining its own PC: it performs the same operation on a different set of data and is free to independently execute data-dependent branches
Image courtesy: S. Collange, M. Daumas, D. Defour, and D. Parello, "Barra: A Parallel Functional Simulator for GPGPU," IEEE MASCOTS, Aug. 2010
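The SIMT mapping can be read directly off the thread indices. In this assumed sketch (names are mine), each thread derives its warp and lane position, and lanes may follow different data-dependent branches exactly as described above:

```cuda
__global__ void simt_demo(const int *data, int *out, int n) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / 32;  // which warp within the block
    int lane = threadIdx.x % 32;  // position within that warp
    if (tid < n) {
        // Each SP runs one thread with its own PC, so lanes may diverge here.
        out[tid] = (data[tid] > 0) ? warp : lane;
    }
}
```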
Outline
- Motivation
- Background
- FlexGrip: Soft GPGPU
- FlexGrip: Soft GPGPU Optimizations
System Architecture
FlexGrip Streaming Multiprocessor
FPGA-Specific Considerations
- Execute numerous CUDA applications without FPGA recompilation
- Flexible design to accommodate tradeoffs (e.g., area versus performance): vary the number of SPs per SM and the number of SMs per soft GPGPU
- MATLAB Simulink used to generate coarse-grained RTL blocks and to rapidly develop application-specific SPs
- Multi-port block RAMs for reduced latency
- DSP48E1 digital signal processing blocks: multiple arithmetic functions supported in a single block, optimized for speed and reduced latency
Design Environment and Benchmarks
Design environment:
- Synthesis and design: Xilinx ISE 14.2
- Simulation: ModelSim SE 10.1
A total of five CUDA applications were evaluated: benchmarks from the University of Wisconsin [1] and the NVIDIA CUDA Programming Guide [2], a mix of data-parallel and control-flow-intensive kernels.

Benchmark | Description
Autocorr | Autocorrelation of 1D array
Bitonic | High-performance sorting network
MatrixMul | Multiplication of square matrices
Reduction | Parallel reduction of 1D array
Transpose | Matrix transpose

[1] D. Chang, C. Jenkins, P. Garcia, S. Gilani, P. Aguilera, A. Nagarajan, M. Anderson, M. Kenny, S. Bauer, M. Schulte, and K. Compton, "ERCBench: An open-source benchmark suite for embedded and reconfigurable computing," in International Conference on Field Programmable Logic and Applications, Aug. 2010, pp. 408-413.
[2] "NVIDIA CUDA programming guide," version
Benchmarking vs. MicroBlaze
MicroBlaze soft processor:
- Implemented on a Xilinx ML605 development board (Virtex-6 LX240T FPGA)
- Software timer used to measure execution time
FlexGrip soft GPGPU:
- Implemented on the ML605 with 8 SPs; ModelSim 10.1 used to benchmark the 16- and 32-SP configurations
- All five benchmarks ran successfully with the same bitstream; compile times < 1 second
Area Comparison

Parameters | Freq. (MHz) | LUTs | Registers | BRAM | DSP48E
8 SP | 100 | 71,323 | 103,776 | 120 | 156
16 SP | - | 113,504 | 149,297 | 132 | 300
32 SP | - | 231,436 | 240,230 | - | 588

Area breakdown per stage shown for the 8 SP configuration.
Architecture Scalability
Varying SPs in a single SM; average speedups vs. MicroBlaze:
- 8 cores: 12x
- 16 cores: 18x
- 32 cores: 22x
Largest speedups:
- Reduction: array size is a multiple of 32, fully utilizing warps (see the sketch below)
- MatrixMul: high arithmetic density
- Bitonic: divergence cost amortized by more swapping in parallel
Scaling is ultimately limited by memory bandwidth.
Figure: speedup vs. MicroBlaze for a variable number of scalar processors, input data size 256, 1 SM.
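For reference, a classic shared-memory parallel reduction looks like the following (an assumed sketch, not FlexGrip's benchmark source). When the array size is a multiple of 32, every warp stays fully populated until the tree narrows:

```cuda
__global__ void reduce_sum(const int *in, int *out, int n) {
    extern __shared__ int sdata[];  // one slot per thread in the block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0;
    __syncthreads();
    // Tree reduction: half the active threads drop out at every step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];  // per-block partial sum
}
```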
Application Scalability
- Speedup of almost 30x for Reduction
- Autocorrelation and Bitonic Sort speedups taper off as the array size increases: the accumulating divergence penalty erodes the benefit of parallel processing
- Memory bandwidth vs. parallelism
Figure: speedup of a 1 SM, 32-SP GPGPU vs. MicroBlaze for varying problem size.
Energy Efficiency
- Estimated using Xilinx's XPower tool
- Dynamic power used to compute efficiency; static power is largely a function of device size
- Energy = Power x Execution Time
- MicroBlaze requires an average of 80% more energy than FlexGrip for the 1 SM, 8 SP configuration
Figure: normalized dynamic energy consumption vs. MicroBlaze for different SP counts, application size 256.
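Spelling out the comparison metric (the symbols are my own labels, not from the talk):

```latex
E = P_{\mathrm{dyn}} \cdot t_{\mathrm{exec}}, \qquad
\frac{E_{\mathrm{MicroBlaze}}}{E_{\mathrm{FlexGrip}}}
  = \frac{P^{\mathrm{MB}}_{\mathrm{dyn}}\, t^{\mathrm{MB}}}
         {P^{\mathrm{FG}}_{\mathrm{dyn}}\, t^{\mathrm{FG}}}
  \approx 1.8 \quad \text{(average, 1 SM, 8 SP)}
```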
Outline
- Motivation
- Background
- FlexGrip: Soft GPGPU
- FlexGrip: Soft GPGPU Optimizations
Branch Divergence

if (x[i] < n)
    x[i] = x[i] + 1;
else
    x[i] = x[i] - 1;

- Branch divergence occurs when threads inside a warp branch to different execution paths
- Divergence state for the threads in a warp is stored on a warp stack
- Example: instructions inside the ELSE statement are masked (i.e., not executed) while the IF path runs; once the IF statement completes, the complement of the mask is used to execute the ELSE statement
Figure: threads of a warp splitting at a branch into Path A and Path B.
Conditional Branch Optimizations
- Each of the 24 warps within an SM contains its own warp stack
- Each warp stack has an entry for each thread (32); each entry holds a 32-bit active thread mask, a 2-bit type, and a 32-bit address
- Prior to executing the taken path, the instruction address and active thread mask are pushed onto the stack
- Upon completion of the taken path, the stack is read, the active mask is inverted, and processing continues
- Worst case: nesting for all 32 threads, i.e., 24 stacks x 32 entries x 66 bits ≈ 50 Kb of memory!
- Optimization: profile applications for the optimal stack depth (see the sketch below)
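A C-style sketch of the mechanism just described (struct and function names are mine, not FlexGrip's RTL):

```cuda
#include <stdint.h>

// One warp-stack entry: 32-bit active mask + 2-bit type + 32-bit address = 66 bits.
struct WarpStackEntry {
    uint32_t active_mask;  // threads that must still run the deferred path
    uint8_t  type;         // 2-bit entry type
    uint32_t address;      // instruction address of the deferred path
};

// One stack per warp; 24 warps x 32 entries x 66 bits ≈ 50 Kb worst case.
struct WarpStack {
    WarpStackEntry entries[32];
    int top;
};

// On a divergent branch: push the deferred path's address and thread mask,
// then execute the taken path under the taken-path mask.
void push_divergence(WarpStack *s, uint32_t mask, uint32_t addr) {
    s->entries[s->top++] = WarpStackEntry{mask, /*type=*/0, addr};
}

// On completion of the taken path: pop, invert the active mask,
// and resume the deferred path with the remaining threads.
WarpStackEntry pop_divergence(WarpStack *s) {
    return s->entries[--s->top];
}
```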
Source Operand Optimizations
Multiple Streaming Multiprocessors
- Maximum of 256 threads in a thread block
- At the start of execution, the maximum number of thread blocks that can be scheduled is calculated
- Thread blocks are scheduled across the SMs in a round-robin fashion (sketched below)
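A hypothetical host-side sketch of that scheduling policy (my illustration, not the scheduler's RTL; the workload size is assumed):

```cuda
#include <stdio.h>

int main() {
    const int threads_per_block = 256;  // FlexGrip's per-block maximum
    const int total_threads = 2048;     // assumed workload size
    const int num_sms = 2;
    // Max number of blocks is computed once, at the start of execution.
    int num_blocks = (total_threads + threads_per_block - 1) / threads_per_block;
    for (int b = 0; b < num_blocks; b++)
        printf("block %d -> SM %d\n", b, b % num_sms);  // round-robin assignment
    return 0;
}
```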
Benchmarking vs. MicroBlaze
MicroBlaze soft processor:
- Implemented on a Xilinx ML605 development board (Virtex-6 LX240T FPGA)
- Software timer used to measure execution time
- Compile times < 1 second
FlexGrip soft GPGPU:
- Implemented on the ML605 for 1 SM and 8 SPs; ModelSim 10.1 used to benchmark 1 SM with 16 and 32 SPs and the 2 SM designs with 8, 16, and 32 SPs
- All five benchmarks ran successfully with the same bitstream; compile times < 1 second
- All designs were evaluated at 100 MHz

Benchmark | Description
Autocorr | Autocorrelation of 1D array
Bitonic | High-performance sorting network
MatrixMul | Multiplication of square matrices
Reduction | Parallel reduction of 1D array
Transpose | Matrix transpose
Architecture Scalability – 2 SM
Varying SPs in the 2 SM design:
- Peak speedup of over 40x for 4 of 5 benchmarks
- 1 SM vs. 2 SM: speedup ranged from 1.77x (Reduction) to 1.98x (Transpose, MatrixMul)
Figure: speedup vs. MicroBlaze for a variable number of scalar processors, input data size 256, 2 SM.

Speedup of 2 SM vs. 1 SM (256 data size):
Benchmark | 8 SP | 16 SP | 32 SP
Autocorr | 1.94 | - | -
Bitonic | 1.82 | 1.83 | 1.85
MatrixMul | 1.98 | - | -
Reduction | 1.78 | 1.77 | -
Transpose | - | - | -
Architectural Customizations

Design | Num. of Oper. | Warp Depth | Slice LUTs | Flip-Flops | Block RAM | DSP | % Area Red. | % Dyn. Red.
Baseline | 3 | 32 | 60,375 | 103,776 | 124 | 156 | - | -
Autocorr | - | 16 | 52,121 | 82,017 | - | - | 14% | 3%
Mat. Mult. | - | - | 42,536 | 60,161 | - | - | 20% | 9%
Reduction | - | - | - | - | - | - | 30% | -
Transpose | - | - | - | - | - | - | - | -
Bitonic | 2 | - | 39,189 | 57,301 | - | - | 35% | 15%
- | - | - | 27,136 | - | 120 | 12 | 62% | 38%

- Removing the multiplier/third operand and reducing the warp depth achieves a 23% energy reduction for any benchmark
- Depending on the application space, one could vary these parameters to optimize the system