An Architecture Evaluation and Implementation of a Soft GPGPU for FPGAs
Kevin Andryc
2017/12/1
Outline
- Motivation
- Background
- FlexGrip: Soft GPGPU
- FlexGrip: Soft GPGPU Optimizations
Motivation
- Compiling FPGA designs is time consuming: every change requires resynthesizing the design to create a netlist, then implementing it (translate, map, place & route) and generating the BIT file
- Not every system has a GPGPU available
- GPGPUs are not practical for systems that require minimal power and heat
- GPGPUs are inflexible compared to FPGAs
FlexGrip: Soft GPGPU
FlexGrip: FLEXible GRaphIcs Processor
- Fully CUDA binary-compatible integer soft GPGPU: runs multiple applications without recompiling the hardware
- Supports highly multithreaded applications and complex conditional execution
- Hardware flexibility:
  - Trade power versus performance
  - Add processing, memory, and custom resources
  - Reconfigure (perhaps on-the-fly) for specific applications (e.g., cloud computing)
Outline
- Motivation
- Background
- FlexGrip: Soft GPGPU
- FlexGrip: Soft GPGPU Optimizations
Introduction to the GPGPU Hardware Architecture
- Array of streaming multiprocessors (SMs); each SM consists of a set of 32-bit scalar processors (SPs)
- Single Instruction Multiple Data (SIMD) execution: the multiprocessor executes the same instruction on different scalar processors at each clock cycle
- SP: scalar processor (core)
- SFU: special function unit, which executes transcendental instructions such as sine, cosine, reciprocal, and square root
Image courtesy: S. Collange, M. Daumas, D. Defour, and D. Parello, "Barra: A Parallel Functional Simulator for GPGPU," IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), Aug. 2010
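To make the SP/SFU split concrete, here is a minimal CUDA sketch (my own illustration, not from the talk; the kernel name is hypothetical). The fast-math intrinsics __sinf, __cosf, __logf, and rsqrtf are the kind of transcendental operations a GPU services with its SFUs rather than its ordinary scalar cores:

```cuda
// Hypothetical example: transcendental intrinsics of the sort handled by SFUs.
__global__ void sfu_demo(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n) {
        float x = in[i];
        // __sinf/__cosf/__logf/rsqrtf trade a little accuracy for throughput.
        out[i] = __sinf(x) + __cosf(x) + __logf(x + 1.0f) + rsqrtf(x + 1.0f);
    }
}
```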
CUDA: The Software Model
- Compute Unified Device Architecture (CUDA): a parallel programming model from NVIDIA; an extended version of C
- Functions executed on the GPU (kernels) run as SIMD groups of threads called warps
- A kernel is launched as a grid of thread blocks; a thread block is a collection of threads that can execute in parallel
Image courtesy: Patterson, David A.; Hennessy, John L., Computer Architecture: A Quantitative Approach, 5th ed., Morgan Kaufmann, Sept. 30, 2011
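A minimal end-to-end sketch of this model (my own example; the kernel name and sizes are arbitrary). The launch creates a grid of thread blocks, each block is split into 32-thread warps, and every thread computes a unique index:

```cuda
#include <cuda_runtime.h>

// Each thread increments one element of the array.
__global__ void add_one(int *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) x[i] = x[i] + 1;  // threads of a warp execute this in lockstep
}

int main() {
    const int n = 256;
    int *d_x;
    cudaMalloc(&d_x, n * sizeof(int));
    cudaMemset(d_x, 0, n * sizeof(int));
    add_one<<<2, 128>>>(d_x, n);  // grid of 2 blocks x 128 threads (4 warps each)
    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}
```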
Software to Hardware Mapping
- Block scheduler assigns thread blocks to multiprocessors; threads are scheduled in the form of warps
- Warp: a subset of threads whose operations are performed in parallel, sometimes conditionally
- Fine-grained scheduling: the SM is architected as a single instruction, multiple thread (SIMT) processor
- Each scalar processor (SP) executes one thread while maintaining its own PC: it performs the same operation on a different set of data and is free to independently execute data-dependent branches
Image courtesy: S. Collange, M. Daumas, D. Defour, and D. Parello, "Barra: A Parallel Functional Simulator for GPGPU," IEEE MASCOTS, Aug. 2010
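The SIMT mapping can be read directly off the thread indices. In this assumed sketch (names are mine), each thread derives its warp and lane position, and lanes may follow different data-dependent branches exactly as described above:

```cuda
__global__ void simt_demo(const int *data, int *out, int n) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / 32;  // which warp within the block
    int lane = threadIdx.x % 32;  // position within that warp
    if (tid < n) {
        // Each SP runs one thread with its own PC, so lanes may diverge here.
        out[tid] = (data[tid] > 0) ? warp : lane;
    }
}
```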
Outline
- Motivation
- Background
- FlexGrip: Soft GPGPU
- FlexGrip: Soft GPGPU Optimizations
System Architecture
FlexGrip Streaming Multiprocessor
FPGA-Specific Considerations
- Execute numerous CUDA applications without FPGA recompilation
- Flexible design to accommodate tradeoffs (e.g., area versus performance): vary the number of SPs per SM and the number of SMs per soft GPGPU
- MATLAB Simulink used to generate coarse-grained RTL blocks and to rapidly develop application-specific SPs
- Multi-port block RAMs for reduced latency
- DSP48E1 digital signal processing blocks: multiple arithmetic functions supported in a single block, optimized for speed and reduced latency
Design Environment and Benchmarks
Design environment:
- Synthesis and design: Xilinx ISE 14.2
- Simulation: ModelSim SE 10.1
A total of five CUDA applications were evaluated: benchmarks from the University of Wisconsin [1] and the NVIDIA CUDA Programming Guide [2], a mix of data-parallel and control-flow-intensive kernels.

Benchmark | Description
Autocorr | Autocorrelation of 1D array
Bitonic | High-performance sorting network
MatrixMul | Multiplication of square matrices
Reduction | Parallel reduction of 1D array
Transpose | Matrix transpose

[1] D. Chang, C. Jenkins, P. Garcia, S. Gilani, P. Aguilera, A. Nagarajan, M. Anderson, M. Kenny, S. Bauer, M. Schulte, and K. Compton, "ERCBench: An open-source benchmark suite for embedded and reconfigurable computing," in International Conference on Field Programmable Logic and Applications, Aug. 2010, pp. 408-413.
[2] "NVIDIA CUDA programming guide," version
Benchmarking vs. MicroBlaze
MicroBlaze soft processor:
- Implemented on a Xilinx ML605 development board (Virtex-6 LX240T FPGA)
- Software timer used to measure execution time
FlexGrip soft GPGPU:
- Implemented on the ML605 with 8 SPs; ModelSim 10.1 used to benchmark the 16- and 32-SP configurations
- All five benchmarks ran successfully with the same bitstream; compile times < 1 second
Area Comparison

Parameters | Freq. (MHz) | LUTs | Registers | BRAM | DSP48E
8 SP | 100 | 71,323 | 103,776 | 120 | 156
16 SP | - | 113,504 | 149,297 | 132 | 300
32 SP | - | 231,436 | 240,230 | - | 588

Area breakdown per stage shown for the 8 SP configuration.
Architecture Scalability
Varying SPs in a single SM; average speedups vs. MicroBlaze:
- 8 cores: 12x
- 16 cores: 18x
- 32 cores: 22x
Largest speedups:
- Reduction: array size is a multiple of 32, fully utilizing warps (see the sketch below)
- MatrixMul: high arithmetic density
- Bitonic: divergence cost amortized by more swapping in parallel
Scaling is ultimately limited by memory bandwidth.
Figure: speedup vs. MicroBlaze for a variable number of scalar processors, input data size 256, 1 SM.
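For reference, a classic shared-memory parallel reduction looks like the following (an assumed sketch, not FlexGrip's benchmark source). When the array size is a multiple of 32, every warp stays fully populated until the tree narrows:

```cuda
__global__ void reduce_sum(const int *in, int *out, int n) {
    extern __shared__ int sdata[];  // one slot per thread in the block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0;
    __syncthreads();
    // Tree reduction: half the active threads drop out at every step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];  // per-block partial sum
}
```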
Application Scalability
- Speedup of almost 30x for Reduction
- Autocorrelation and Bitonic Sort speedups taper off as the array size increases: the accumulating divergence penalty erodes the benefit of parallel processing
- Memory bandwidth vs. parallelism
Figure: speedup of a 1 SM, 32-SP GPGPU vs. MicroBlaze for varying problem size.
Energy Efficiency
- Estimated using Xilinx's XPower tool
- Dynamic power used to compute efficiency; static power is largely a function of device size
- Energy = Power x Execution Time
- MicroBlaze requires an average of 80% more energy than FlexGrip for the 1 SM, 8 SP configuration
Figure: normalized dynamic energy consumption vs. MicroBlaze for different SP counts, application size 256.
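Spelling out the comparison metric (the symbols are my own labels, not from the talk):

```latex
E = P_{\mathrm{dyn}} \cdot t_{\mathrm{exec}}, \qquad
\frac{E_{\mathrm{MicroBlaze}}}{E_{\mathrm{FlexGrip}}}
  = \frac{P^{\mathrm{MB}}_{\mathrm{dyn}}\, t^{\mathrm{MB}}}
         {P^{\mathrm{FG}}_{\mathrm{dyn}}\, t^{\mathrm{FG}}}
  \approx 1.8 \quad \text{(average, 1 SM, 8 SP)}
```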
Outline
- Motivation
- Background
- FlexGrip: Soft GPGPU
- FlexGrip: Soft GPGPU Optimizations
Branch Divergence

if (x[i] < n)
    x[i] = x[i] + 1;
else
    x[i] = x[i] - 1;

- Branch divergence occurs when threads inside a warp branch to different execution paths
- Divergence state for the threads in a warp is stored on a warp stack
- Example: instructions inside the ELSE statement are masked (i.e., not executed) while the IF path runs; once the IF statement completes, the complement of the mask is used to execute the ELSE statement
Figure: threads of a warp splitting at a branch into Path A and Path B.
Conditional Branch Optimizations
- Each of the 24 warps within an SM contains its own warp stack
- Each warp stack has an entry for each thread (32); each entry holds a 32-bit active thread mask, a 2-bit type, and a 32-bit address
- Prior to executing the taken path, the instruction address and active thread mask are pushed onto the stack
- Upon completion of the taken path, the stack is read, the active mask is inverted, and processing continues
- Worst case: nesting for all 32 threads, i.e., 24 stacks x 32 entries x 66 bits ≈ 50 Kb of memory!
- Optimization: profile applications for the optimal stack depth (see the sketch below)
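A C-style sketch of the mechanism just described (struct and function names are mine, not FlexGrip's RTL):

```cuda
#include <stdint.h>

// One warp-stack entry: 32-bit active mask + 2-bit type + 32-bit address = 66 bits.
struct WarpStackEntry {
    uint32_t active_mask;  // threads that must still run the deferred path
    uint8_t  type;         // 2-bit entry type
    uint32_t address;      // instruction address of the deferred path
};

// One stack per warp; 24 warps x 32 entries x 66 bits ≈ 50 Kb worst case.
struct WarpStack {
    WarpStackEntry entries[32];
    int top;
};

// On a divergent branch: push the deferred path's address and thread mask,
// then execute the taken path under the taken-path mask.
void push_divergence(WarpStack *s, uint32_t mask, uint32_t addr) {
    s->entries[s->top++] = WarpStackEntry{mask, /*type=*/0, addr};
}

// On completion of the taken path: pop, invert the active mask,
// and resume the deferred path with the remaining threads.
WarpStackEntry pop_divergence(WarpStack *s) {
    return s->entries[--s->top];
}
```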
Source Operand Optimizations
Multiple Streaming Multiprocessors
- Maximum of 256 threads in a thread block
- At the start of execution, the maximum number of thread blocks that can be scheduled is calculated
- Thread blocks are scheduled across the SMs in a round-robin fashion (sketched below)
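A hypothetical host-side sketch of that scheduling policy (my illustration, not the scheduler's RTL; the workload size is assumed):

```cuda
#include <stdio.h>

int main() {
    const int threads_per_block = 256;  // FlexGrip's per-block maximum
    const int total_threads = 2048;     // assumed workload size
    const int num_sms = 2;
    // Max number of blocks is computed once, at the start of execution.
    int num_blocks = (total_threads + threads_per_block - 1) / threads_per_block;
    for (int b = 0; b < num_blocks; b++)
        printf("block %d -> SM %d\n", b, b % num_sms);  // round-robin assignment
    return 0;
}
```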
Benchmarking vs. MicroBlaze
MicroBlaze soft processor:
- Implemented on a Xilinx ML605 development board (Virtex-6 LX240T FPGA)
- Software timer used to measure execution time
- Compile times < 1 second
FlexGrip soft GPGPU:
- Implemented on the ML605 for 1 SM and 8 SPs; ModelSim 10.1 used to benchmark 1 SM with 16 and 32 SPs and the 2 SM designs with 8, 16, and 32 SPs
- All five benchmarks ran successfully with the same bitstream; compile times < 1 second
- All designs were evaluated at 100 MHz

Benchmark | Description
Autocorr | Autocorrelation of 1D array
Bitonic | High-performance sorting network
MatrixMul | Multiplication of square matrices
Reduction | Parallel reduction of 1D array
Transpose | Matrix transpose
Architecture Scalability – 2 SM
Varying SPs in the 2 SM design:
- Peak speedup of over 40x for 4 of 5 benchmarks
- 1 SM vs. 2 SM: speedup ranged from 1.77x (Reduction) to 1.98x (Transpose, MatrixMul)
Figure: speedup vs. MicroBlaze for a variable number of scalar processors, input data size 256, 2 SM.

Speedup of 2 SM vs. 1 SM (256 data size):
Benchmark | 8 SP | 16 SP | 32 SP
Autocorr | 1.94 | - | -
Bitonic | 1.82 | 1.83 | 1.85
MatrixMul | 1.98 | - | -
Reduction | 1.78 | 1.77 | -
Transpose | - | - | -
Architectural Customizations

Design | Num. of Oper. | Warp Depth | Slice LUTs | Flip-Flops | Block RAM | DSP | % Area Red. | % Dyn. Red.
Baseline | 3 | 32 | 60,375 | 103,776 | 124 | 156 | - | -
Autocorr | - | 16 | 52,121 | 82,017 | - | - | 14% | 3%
Mat. Mult. | - | - | 42,536 | 60,161 | - | - | 20% | 9%
Reduction | - | - | - | - | - | - | 30% | -
Transpose | - | - | - | - | - | - | - | -
Bitonic | 2 | - | 39,189 | 57,301 | - | - | 35% | 15%
- | - | - | 27,136 | - | 120 | 12 | 62% | 38%

- Removing the multiplier/third operand and reducing the warp depth achieves a 23% energy reduction for any benchmark
- Depending on the application space, one could vary these parameters to optimize the system