VESPA: Portable, Scalable, and Flexible FPGA-Based Vector Processors
Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose (University of Toronto)

Soft Processors in FPGA Systems
- Custom hardware (HDL + CAD): faster, smaller, less power
- Soft processor (C + compiler): easier
- Soft processors are configurable: how can we make use of this?
- Data-level parallelism → soft vector processors

Vector Processing Primer

    // C code
    for (i = 0; i < 16; i++)
        b[i] += a[i];

    // Vectorized code
    set    vl, 16
    vload  vr0, b
    vload  vr1, a
    vadd   vr0, vr0, vr1
    vstore vr0, b

Each vector instruction holds many units of independent operations: the single vadd encodes all 16 element additions, b[0]+=a[0] through b[15]+=a[15]. With 1 vector lane, they execute one element at a time.

Vector Processing Primer (continued)
With 16 vector lanes, the same vadd performs all 16 element additions in parallel, for a 16x speedup. This approach is 1) portable, 2) flexible, and 3) scalable.

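To make the lane model concrete, here is a minimal software sketch (mine, not VESPA's hardware) of how one vadd over vl elements maps onto a machine with L lanes: each group of L elements is processed together, so the instruction takes ceil(vl/L) cycles.

    #include <stdio.h>

    #define VL 16

    /* Sketch: a single vector add over vl elements executed on a
     * machine with `lanes` parallel lanes. Each iteration of the
     * outer loop models one cycle in which every lane handles one
     * element. */
    static void vadd_on_lanes(int *b, const int *a, int vl, int lanes) {
        for (int base = 0; base < vl; base += lanes) {     /* one "cycle" */
            for (int lane = 0; lane < lanes && base + lane < vl; lane++) {
                int i = base + lane;   /* element owned by this lane */
                b[i] += a[i];
            }
        }
    }

    int main(void) {
        int a[VL], b[VL];
        for (int i = 0; i < VL; i++) { a[i] = i; b[i] = 100; }
        vadd_on_lanes(b, a, VL, 4);    /* 4 lanes -> 16/4 = 4 "cycles" */
        printf("b[15] = %d\n", b[15]); /* 100 + 15 = 115 */
        return 0;
    }

With 16 lanes the outer loop runs once, which is the 16x case on this slide; with 1 lane it runs 16 times.
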
Soft Vector Processor Benefits
1. Portable
   - SW: agnostic to the HW implementation (e.g., number of lanes)
   - HW: can be implemented on any FPGA architecture
2. Flexible
   - Many parameters to tune, by the end user rather than the vendor (e.g., number of lanes, width of lanes, etc.)
3. Scalable
   - SW: applies to any code with data-level parallelism
   - HW: the number of lanes can grow with the capacity of the device, so parallelism can scale with Moore's law
How would this fit in with the current FPGA design flow?

Conventional FPGA Design Flow
[System diagram: a soft processor running software routines alongside custom accelerators, a memory interface, and peripherals]
Is the soft processor the bottleneck? If yes, find the hot code. There are three options for accelerating it:
1. Manual hardware design
2. Acquire an RTL IP core
3. High-level synthesis (e.g., Altera C2H): push-button, but results are code-dependent

Proposed Soft Vector Processor System Design Flow
We propose adding vector extensions to existing soft processors.
[System diagram: the soft processor extended with vector lanes 1 through 4, alongside a custom accelerator, a memory interface, and peripherals; the software routines are now vectorized]
Is the soft processor the bottleneck? If yes, increase the number of lanes. The user code remains portable, easy to use, flexible, and scalable.

Our Goals
1. Evaluate soft vector processing for real:
   - Using a complete hardware design (in Verilog)
   - On real FPGA hardware (Stratix 1S80C6)
   - Running full benchmarks (EEMBC)
   - From off-chip memory (DDR, 133 MHz)
2. Quantify performance/area trade-offs across different vector processor configurations
3. Explore application-specific customizations that reduce the generality of soft vector processors

Current Infrastructure
SOFTWARE: EEMBC C benchmarks are compiled with GCC, vectorized assembly subroutines are assembled with GNU as extended with vector support, and ld links them into an ELF binary; instruction set simulation is used for verification.
HARDWARE: SPREE extended with vector support generates the scalar µP, which is paired with a manually designed vector coprocessor (VPU); the resulting Verilog goes through RTL simulation (for cycle counts) and CAD software (for area and frequency), and runs on the TM4 platform.
[Diagram: the vector-extended soft processor architecture, with vector control and scalar pipelines feeding replicated vector lanes]

VESPA Architecture Design
- Scalar pipeline: 3 stages
- Vector control pipeline: 3 stages
- Vector pipeline: 6 stages
- Supports integer and fixed-point operations, and predication
- 32-bit datapaths
- Icache and Dcache, with the Dcache shared between the scalar and vector units
[Pipeline diagram: decode, replicate, and hazard-check stages feeding per-lane register files, ALUs, multiply-and-saturate units, shift/saturate logic, and a memory unit]

Experiment #1: Vector Lane Exploration
- Vary the number of vector lanes implemented, using the parameterized vector core
- Measure speedup on 6 EEMBC benchmarks, directly on a Stratix I 1S80C6 clocked at 50 MHz (the design targets Stratix III, where it runs at 135 MHz)
- Use a 32 KB direct-mapped level-1 cache; DDR at 133 MHz gives a 10-cycle miss penalty
- Measure area cost by equating the silicon area of all resources used, reported in units of equivalent LEs

Performance Scaling Across Vector Lanes
[Chart: cycle speedup normalized to 1 lane]
Good scaling: an average of 1.85x with 2 lanes, up to 6.3x with 16 lanes. Scaling past 16 lanes is limited by the number of multipliers in the Stratix 1S80.

Design Characteristics on Stratix III (device: 3S200C2)
[Chart: clock frequency (MHz), logic used (ALMs), multipliers used (18-bit DSPs), and block RAMs used (M9Ks), versus the number of lanes L]
- Clock frequency is steady ... until 64 lanes
- ALMs grow by 570 ALMs/lane
- DSPs grow by 4(1+L)
- Block RAMs are unaffected ... until 32 lanes, when port width dominates

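The growth rates above can be read as a back-of-the-envelope resource estimator. The sketch below does only that arithmetic; BASE_ALMS is a hypothetical placeholder for the lane-independent logic, which this slide does not quantify.

    #include <stdio.h>

    /* Only the growth rates (570 ALMs/lane, 4*(1+L) DSPs) come from the
     * slide; BASE_ALMS is a hypothetical stand-in for the scalar core
     * and fixed vector-control logic. */
    #define BASE_ALMS 0

    static int estimate_alms(int lanes) { return BASE_ALMS + 570 * lanes; }
    static int estimate_dsps(int lanes) { return 4 * (1 + lanes); }

    int main(void) {
        for (int lanes = 1; lanes <= 64; lanes *= 2)
            printf("%2d lanes: ~%5d ALMs above base, %3d DSPs\n",
                   lanes, estimate_alms(lanes), estimate_dsps(lanes));
        return 0;
    }
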
Application-Specific Vector Processing
Customize to the application if:
1. It is the only application that will run, OR
2. The FPGA can be reconfigured between runs
Observation: not all applications
1. Operate on 32-bit data types
2. Use the entire vector instruction set
Eliminating the unused hardware reduces area, which lets us reduce cost (buy a smaller FPGA), re-invest the area savings into more lanes, or speed up the clock (nets span shorter distances).

Opportunity for Customization

    Benchmark   Largest data type size   Percentage of vector ISA used
    autcor      4 bytes                   9.6%
    conven      1 byte                    5.9%
    fbital      2 bytes                  14.1%
    viterb      2 bytes                  13.3%
    rgbcmyk     1 byte                    5.9%
    rgbyiq      2 bytes                   8.1%

There is lots of opportunity to customize both width and ISA support: the data types allow anywhere from 0% up to 75% width reduction, and every benchmark uses less than 15% of the vector ISA.

Customizing the Vector Processor
The parameterized core can very easily change:
- L: number of vector lanes
- W: bit-width of the vector lanes
- M: size of the memory crossbar
- MVL: maximum vector length
The instruction set is automatically subsetted: each vector instruction can be individually enabled or disabled, and the corresponding control logic and datapath hardware is automatically removed.

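For illustration only, the knobs above amount to a small configuration record plus a per-instruction enable mask. The names below are hypothetical (the real core is parameterized Verilog, not a C API):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical sketch of VESPA's customization knobs. */
    typedef struct {
        unsigned lanes;      /* L:   number of vector lanes           */
        unsigned lane_width; /* W:   bit-width of each lane           */
        unsigned xbar_size;  /* M:   size of the memory crossbar      */
        unsigned max_vl;     /* MVL: maximum vector length            */
        uint64_t isa_enable; /* one bit per vector instruction; a
                                clear bit removes that instruction's
                                control logic and datapath hardware   */
    } vespa_config;

    int main(void) {
        /* Example: 16 lanes, 16-bit datapaths, and only two
         * (hypothetical) instruction opcodes enabled. */
        vespa_config cfg = {
            .lanes = 16, .lane_width = 16, .xbar_size = 16, .max_vl = 64,
            .isa_enable = (1ULL << 3) | (1ULL << 7),
        };
        printf("lanes=%u width=%u\n", cfg.lanes, cfg.lane_width);
        return 0;
    }
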
Experiment #2: Reducing Area by Reducing Vector Width
[Chart: normalized vector coprocessor area versus the largest data type size (in bytes), annotated with 54% and 38% savings]
Up to 54% of the vector coprocessor area is eliminated, and the savings increase with more lanes => better scalability.

Experiment #3: Reducing Area by Subsetting the Instruction Set
[Chart: normalized vector coprocessor area, annotated with 55% and 46% savings]
Up to 55% of the VPU area is eliminated, 46% on average. Again, the savings increase with more lanes => better scalability.

Experiment #4: Combined Width Reduction and Instruction Set Subsetting
[Chart: annotated with 61% and 70% area savings]
The performance scaling seen previously is achieved at almost 1/3 the area cost.

Re-Investing Area Savings into Lanes (Improved VESPA)
[Chart: annotated with 9.3x and 11.5x speedups]
Area savings can be converted into better performance.

Summary
- Evaluated soft vector processors with real hardware, a real memory system, and full benchmarks
- Observed significant performance scaling: an average of 6.3x with 16 lanes, with further scaling possible on newer devices
- Explored measures to reduce area cost: reducing vector width, reducing the supported instruction set, and combining the two, for a 61% area reduction on average (up to 70%)
- Soft vector processors provide a portable, flexible, and scalable framework for exploiting data-level parallelism that is easier to use than designing custom FPGA hardware

Future Work
- Improve scalability bottlenecks, in particular the memory system
- Evaluate scaling past 16 lanes by porting to a platform with a newer FPGA
- Compare against custom hardware: what do we pay for the simpler design?

Performance Impact of Cache Size
[Chart: impact of cache size on a 16-lane VPU, with the streaming benchmarks annotated]
The streaming benchmarks are largely insensitive to cache size => prefetching could be fruitful.

Combined Width Reduction and Instruction Set Subsetting
[Chart]
Close to 70% area reduction.

Performance vs. Scalar (C) Code
[Table: speedup over scalar C code for autcor, conven, fbital, viterb, rgbcmyk, rgbyiq, and their geometric mean, at 1, 2, 4, 8, and 16 lanes]

Vector Memory Unit
[Diagram: each lane i (for i = 0..L, where L = #lanes - 1) generates its Dcache address as base + stride*i or via its per-lane index_i, selected by a MUX; requests enter a memory request queue, a read crossbar routes Dcache data out to rddata0..rddataL, and a write crossbar gathers wrdata0..wrdataL into a memory write queue]

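A minimal C sketch of the per-lane address generation implied by the diagram (the mode names and function are mine, assuming the usual strided vs. indexed vector addressing; VESPA's actual control signals may differ):

    #include <stdint.h>
    #include <stdio.h>

    /* Each lane i either uses base + stride*i (strided access) or
     * base + index[i] (indexed/gather access), as selected by the
     * per-lane MUX in the diagram. */
    typedef enum { STRIDED, INDEXED } addr_mode;

    static void gen_lane_addrs(uint32_t *addr, int lanes, addr_mode mode,
                               uint32_t base, uint32_t stride,
                               const uint32_t *index) {
        for (int i = 0; i < lanes; i++)
            addr[i] = (mode == STRIDED) ? base + stride * (uint32_t)i
                                        : base + index[i];
    }

    int main(void) {
        uint32_t addr[4];
        gen_lane_addrs(addr, 4, STRIDED, 0x1000, 4, NULL); /* word stride */
        for (int i = 0; i < 4; i++)
            printf("lane %d -> 0x%x\n", i, (unsigned)addr[i]);
        return 0;
    }
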