
1 Fine-Grain Performance Scaling of Soft Vector Processors. Peter Yiannacouras, Jonathan Rose, Gregory J. Steffan. ESWEEK – CASES 2009, Grenoble, France, Oct 13, 2009.

2 FPGA Systems and Soft Processors. Computation in an FPGA digital system can be implemented either as custom hardware (HDL + CAD: faster, smaller, less power, but months of design effort) or as software on a soft processor (software + compiler: easier, weeks of effort). Soft processors are used in roughly 25% of FPGA designs [source: Altera, 2009]. A hard processor is another option, but it costs board space, latency, and power, or requires a specialized device at increased cost. Goal: simplify FPGA design by customizing the configurable soft processor architecture so that it can compete with custom hardware. Target: data-level parallelism → vector processors.

3 Vector Processing Primer.

  // C code
  for (i = 0; i < 16; i++)
      c[i] = a[i] + b[i];

  // Vectorized code
  set    vl, 16
  vload  vr0, a
  vload  vr1, b
  vadd   vr2, vr0, vr1
  vstore vr2, c

Each vector instruction encodes many independent element operations: the single vadd computes vr2[i] = vr0[i] + vr1[i] for all 16 elements. With 1 vector lane, these element operations execute one at a time.

4 Vector Processing Primer (continued). The same vectorized code on 16 vector lanes executes all 16 element operations of the vadd (vr2[0] = vr0[0] + vr1[0] through vr2[15] = vr0[15] + vr1[15]) in parallel → 16x speedup. Previous work on soft vector processors (CASES'08) demonstrated: 1. Scalability, 2. Flexibility, 3. Portability.
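To make the lane-parallel execution concrete, here is a minimal C sketch of the idea; the lane count, vector length, and function names are illustrative and not part of VESPA's RTL. Each step, every lane applies the same element operation to its own slice of the vector registers, so 16 lanes finish a 16-element vadd in one step while 1 lane needs 16.

```c
#include <stdio.h>

#define NUM_LANES 16   /* illustrative lane count */
#define VL        16   /* vector length set by "set vl,16" */

/* One step of a vadd: all lanes operate in lock-step on one element each. */
static void vadd_step(const int *vr0, const int *vr1, int *vr2, int base)
{
    for (int lane = 0; lane < NUM_LANES && base + lane < VL; lane++)
        vr2[base + lane] = vr0[base + lane] + vr1[base + lane];
}

int main(void)
{
    int a[VL], b[VL], c[VL];
    for (int i = 0; i < VL; i++) { a[i] = i; b[i] = 2 * i; }

    /* With 16 lanes, a 16-element vadd finishes in a single step;
       with 1 lane it would take 16 steps (the 16x speedup on slide 4). */
    for (int base = 0; base < VL; base += NUM_LANES)
        vadd_step(a, b, c, base);

    for (int i = 0; i < VL; i++)
        printf("c[%d] = %d\n", i, c[i]);
    return 0;
}
```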

5 VESPA Architecture Design (Vector Extended Soft Processor Architecture). VESPA pairs a 3-stage scalar pipeline with a 3-stage vector control pipeline and a 6-stage vector pipeline; vector instructions are decoded, replicated across lanes, checked for hazards, then read the vector register file, execute, and write back. The scalar and vector units share the instruction cache and data cache. Lanes are 32 bits wide, and the vector unit supports integer and fixed-point operations [VIRAM]. The figure shows two example lanes: Lane 1 with an ALU and memory unit, Lane 2 with an ALU, memory unit, and multiplier.

6 In This Work:
1. Evaluate for real using modern hardware: scale to 32 lanes (previous work reached 16 lanes).
2. Add more fine-grain architectural parameters:
   2a. Scale more finely: augment with parameterized vector chaining support.
   2b. Customize to functional unit demand: augment with heterogeneous lanes.
3. Explore a large design space.

7 Evaluation Infrastructure. We evaluate soft vector processors with high accuracy using a full hardware design of the VESPA soft vector processor. Software side: EEMBC benchmarks with vectorized assembly subroutines are built with the GCC compiler and GNU as/ld into binaries, which drive instruction set simulation (for verification) and RTL simulation (for cycle counts). Hardware side: the Verilog design is run through FPGA CAD software targeting a Stratix III 340 with DDR2 memory to obtain area, power, and clock frequency.

8 VESPA Scalability. [Chart: speedup vs. number of lanes, with normalized area of 1, 1.3, 1.9, 3.2, 6.3, and 12.3 for the successive configurations.] Speedup reaches up to 19x, with an average of 11x at 32 lanes → good scaling. The number of lanes is a powerful parameter, but it is coarse-grained.

9 Vector Lane Design Space. [Chart: area in equivalent ALMs vs. performance for the lane-count design space; the gap between successive configurations grows to about 8% of the largest FPGA.] Doubling the lane count is too coarse-grained a step; the FPGA's reprogrammability allows a more exact-fit architecture.

10 In This Work (roadmap): 1. Evaluate for real using modern hardware; 2a. Scale more finely with parameterized vector chaining; 2b. Customize to functional unit demand with heterogeneous lanes; 3. Explore a large design space. Next: vector chaining.

11 Vector Chaining: simultaneous execution of the independent element operations within dependent instructions. Example:

  vadd vr10, vr1, vr2
  vmul vr20, vr10, vr11   (depends on vr10)

Although the vmul depends on the vadd, element i of the vmul needs only element i of the vadd's result, so the later element operations of the vadd can overlap with the earlier element operations of the vmul.

12 Vector Chaining in VESPA. VESPA supports chaining by banking the vector register file (B = 1 is a unified register file; B = 2 gives two banks, each feeding the functional units through a mux). Without chaining (B = 1), only one vector instruction executes at a time, so a vadd and a vmul proceed back-to-back. With chaining (B = 2, Lanes = 4), the vadd and vmul execute simultaneously on different element groups, sharing the lanes' ALUs, multipliers, and memory unit. Performance increases if instructions are scheduled so that chaining opportunities exist.
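Below is a toy C model of the banking idea, not VESPA's actual RTL or scheduler; the lane count, vector length, and the assumption that an element group's result is usable in the following cycle are all illustrative. It shows how a dependent vmul can start on bank 0 while the vadd is still working on bank 1.

```c
#include <stdio.h>

#define LANES 4   /* illustrative, matches the Lanes=4 figure */
#define VL    8   /* illustrative vector length */
#define BANKS 2   /* B: register-file banks (try 1 to see serial execution) */

int main(void)
{
    /* Each cycle an instruction operates on one group of LANES elements,
       element group g lives in bank (g % BANKS), and two instructions can
       proceed in the same cycle only if they touch different banks. */
    int groups = VL / LANES;         /* element groups per instruction  */
    int add_next = 0, mul_next = 0;  /* next group for vadd / vmul      */

    for (int cycle = 0; add_next < groups || mul_next < groups; cycle++) {
        int add_done = add_next;     /* vadd groups finished before this cycle */
        int bank_busy[BANKS] = {0};
        printf("cycle %d:", cycle);

        if (add_next < groups) {     /* vadd vr10,vr1,vr2 issues its next group */
            int bank = add_next % BANKS;
            bank_busy[bank] = 1;
            printf("  vadd group %d (bank %d)", add_next, bank);
            add_next++;
        }
        /* vmul vr20,vr10,vr11: group g needs vadd group g done and a free bank */
        if (mul_next < groups && mul_next < add_done &&
            !bank_busy[mul_next % BANKS]) {
            printf("  vmul group %d (bank %d)", mul_next, mul_next % BANKS);
            mul_next++;
        }
        printf("\n");
    }
    return 0;
}
```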

13 ALU Replication. Banking alone is not always enough: with B = 2 but APB = false (no ALU replication per bank), a vadd and a vsub both need the same ALU, so they still execute one at a time even though the register file is banked. With APB = true, the ALU is replicated per bank, and the vadd and vsub execute simultaneously (Lanes = 4 in the figure). Only the ALU is replicated, as the parameter name suggests; the multiplier and memory unit remain shared.
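Extending the toy model above with the APB knob, again purely for illustration (the cycle formulas are mine and ignore VESPA's real pipeline): when APB = false the two banks share one ALU, so a second ALU instruction cannot issue in the same cycle even if its bank is free; when APB = true each bank has its own ALU and the two instructions overlap.

```c
#include <stdio.h>

/* Toy cycle count for two chained ALU instructions (e.g., vadd then vsub)
   over `groups` element groups, with B=2 banks. Not VESPA's RTL. */
static int cycles(int groups, int alu_per_bank)
{
    if (!alu_per_bank)           /* one shared ALU: the instructions serialize */
        return 2 * groups;
    return groups + 1;           /* replicated ALUs: the second op trails the
                                    first by one group and overlaps with it    */
}

int main(void)
{
    int groups = 4;              /* e.g., a 16-element vector on 4 lanes */
    printf("APB=false: %d cycles\n", cycles(groups, 0));
    printf("APB=true:  %d cycles\n", cycles(groups, 1));
    return 0;
}
```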

14 Vector Chaining Speedup (on an 8-lane VESPA). [Chart: cycle speedup over no chaining for each benchmark, across configurations with more banks and/or more ALUs; for some benchmarks the configuration is a don't-care.] Chaining can be quite costly in area (27%-92%), and its performance benefit is application-dependent (5%-76%), but on average it gives a significant speedup over no chaining (22-35%). It is also a finer-grain step than doubling the lanes: 19%-89% of the speed for 86% of the area.

15 In This Work (roadmap): 1. Evaluate for real using modern hardware; 2a. Scale more finely with parameterized vector chaining; 2b. Customize to functional unit demand with heterogeneous lanes; 3. Explore a large design space. Next: heterogeneous lanes.

16 Heterogeneous Lanes. [Figure: 4 lanes (L = 4), each with an ALU and a multiplier, executing a vmul.] The idea is to customize the lanes to functional unit demand, for example keeping multipliers in only 2 of the lanes (X = 2).

17 Heterogeneous Lanes (continued). [Figure: the same 4 lanes (L = 4) with only 2 multiplier lanes (X = 2); the vmul stalls.] With fewer multiplier lanes, a vmul must stall while its element operations are funneled through the remaining multipliers. This saves area but reduces speed, depending on the application's demand on the multiplier.
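A hypothetical first-order cycle estimate of that trade-off follows; the formula and constants are mine, for illustration only, and ignore pipelining, memory behaviour, and the actual stall mechanism. ALU operations use all L lanes, while multiplies are limited by the X multiplier lanes.

```c
#include <stdio.h>

/* Hypothetical first-order cycle estimate for one vector instruction:
   ALU ops use all L lanes, multiplies are funneled through the X
   multiplier lanes (X <= L). */
static int op_cycles(int vl, int lanes_for_this_op)
{
    return (vl + lanes_for_this_op - 1) / lanes_for_this_op;  /* ceiling */
}

int main(void)
{
    const int L = 4, X = 2, VL = 16;   /* illustrative L=4, X=2 */

    int vadd_cycles = op_cycles(VL, L);
    int vmul_cycles = op_cycles(VL, X);

    printf("vadd over %d elements: %d cycles (%d lanes)\n", VL, vadd_cycles, L);
    printf("vmul over %d elements: %d cycles (%d multiplier lanes)\n",
           VL, vmul_cycles, X);
    printf("vmul slowdown vs homogeneous lanes: %.1fx\n",
           (double)vmul_cycles / op_cycles(VL, L));
    return 0;
}
```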

18 Impact of Heterogeneous Lanes (on a 32-lane VESPA). [Chart: per-benchmark slowdown; removing multipliers is free for some benchmarks, moderate or expensive for others.] The performance penalty is application-dependent (0%-85%). The area savings are modest (6%-13%) because the multipliers map to the FPGA's dedicated multiplier blocks.

19 In This Work (roadmap): 1. Evaluate for real using modern hardware; 2a. Scale more finely with parameterized vector chaining; 2b. Customize to functional unit demand with heterogeneous lanes; 3. Explore a large design space. Next: design space exploration.

20 Design Space Exploration using VESPA Architectural Parameters.

Compute architecture:
  Number of Lanes             L     1, 2, 4, 8, ...
  Memory Crossbar Lanes       M     1, 2, ..., L
  Multiplier Lanes            X     1, 2, ..., L
  Banks for Vector Chaining   B     1, 2, 4
  ALU Replicate Per Bank      APB   on/off

Instruction set architecture:
  Maximum Vector Length       MVL   2, 4, 8, ...
  Width of Lanes (in bits)    W     1-32
  Instruction Enable (each)   -     on/off

Memory architecture:
  Data Cache Capacity         DD    any
  Data Cache Line Size        DW    any
  Data Prefetch Size          DPK   < DD
  Vector Data Prefetch Size   DPV   < DD/MVL
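A hypothetical configuration record capturing these parameters, just to make the knobs concrete; the field names and the example values are illustrative and not taken from the VESPA sources.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical grouping of the VESPA parameters from the table above.
   Per-instruction enables are omitted in this sketch. */
struct vespa_config {
    /* compute architecture */
    int  lanes;            /* L:   1, 2, 4, 8, ...      */
    int  mem_crossbar;     /* M:   1 .. L               */
    int  mul_lanes;        /* X:   1 .. L               */
    int  banks;            /* B:   1, 2, 4              */
    bool alu_per_bank;     /* APB                       */
    /* instruction set architecture */
    int  max_vl;           /* MVL: 2, 4, 8, ...         */
    int  lane_width_bits;  /* W:   1 - 32               */
    /* memory architecture */
    int  dcache_bytes;     /* DD                        */
    int  dcache_line;      /* DW                        */
    int  prefetch_scalar;  /* DPK, < DD                 */
    int  prefetch_vector;  /* DPV, < DD/MVL             */
};

int main(void)
{
    /* One example point in the design space; the values are made up. */
    struct vespa_config cfg = {
        .lanes = 8, .mem_crossbar = 4, .mul_lanes = 4,
        .banks = 2, .alu_per_bank = true,
        .max_vl = 64, .lane_width_bits = 32,
        .dcache_bytes = 16 * 1024, .dcache_line = 64,
        .prefetch_scalar = 7 * 64, .prefetch_vector = 64,
    };
    printf("L=%d M=%d X=%d B=%d APB=%d MVL=%d W=%d\n",
           cfg.lanes, cfg.mem_crossbar, cfg.mul_lanes, cfg.banks,
           cfg.alu_per_bank, cfg.max_vl, cfg.lane_width_bits);
    return 0;
}
```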

21 VESPA Design Space (768 architectural configurations). [Scatter plot: normalized coprocessor area (1 to 64) vs. normalized wall-clock time, with the best configurations roughly following a 1:1 area-performance trade-off.] The fine-grain design space spans a 28x range in area and an 18x range in performance, allowing a better-fit architecture; the near-1:1 trade-off between performance and area is evidence of efficiency.

22 Summary.
1. Evaluated VESPA on modern FPGA hardware: it scales up to 32 lanes with an 11x average speedup.
2. Augmented VESPA with fine-tunable parameters:
   2a. Vector chaining (by banking the register file): 22-35% better average performance than without chaining; the impact of the chaining configuration is very application-dependent.
   2b. Heterogeneous lanes (lanes without multipliers): multipliers are saved at a performance cost (sometimes free).
3. Explored a vast architectural design space: an 18x range in performance and a 28x range in area.
Use software for non-critical data-parallel computation.

23 Thank You! VESPA release: http://www.eecg.utoronto.ca/VESPA

24 VESPA Parameters (backup slide): the same architectural parameter table as slide 20 (compute, instruction set, and memory architecture parameters).

25 VESPA Scalability (backup slide). [Chart: speedup vs. number of lanes, with normalized area of 1, 1.3, 1.9, 3.2, 6.3, and 12.3.] Up to 27x, average of 15x for 32 lanes → good scaling. The number of lanes is a powerful parameter, but too coarse-grained.

26 Proposed Soft Vector Processor System Design Flow (backup slide). We propose adding vector extensions to existing soft processors. The system, a soft processor with vector lanes alongside custom HW, peripherals, and the memory interface, runs user code linked with portable vectorized software routines (obtained, for example, from the FPGA vendor, "www.fpgavendor.com" in the figure). If the soft processor is the bottleneck, the designer increases the number of vector lanes, since the vectorized routines are portable, flexible, and scalable. We want to evaluate soft vector processors for real.

27 Vector Memory Unit (backup slide). [Figure: address generation and crossbars for Lanes = 4; lane indices run 0 to L, where L = #lanes - 1.] For each lane i, a mux selects between a strided address (base + stride*i) and an indexed address (base + index_i); the resulting addresses enter the memory request queue. On the read side, a read crossbar routes data returning from the Dcache and memory to the lanes (rddata0 .. rddataL); on the write side, lane data (wrdata0 .. wrdataL) passes through a write crossbar into a memory write queue.
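A small C sketch of the per-lane address generation described above; the function name, lane count, and example addresses are hypothetical, and the real unit does this in hardware with muxes feeding the memory request queue.

```c
#include <stdint.h>
#include <stdio.h>

#define LANES 4   /* illustrative; the figure labels lanes 0 .. L */

/* Hypothetical sketch of per-lane address generation in the vector memory
   unit: each lane forms base + stride*lane for strided accesses, or
   base + index[lane] for indexed (gather/scatter) accesses. */
static void gen_addresses(uint32_t base, int32_t stride,
                          const uint32_t *index, int indexed,
                          uint32_t out[LANES])
{
    for (int lane = 0; lane < LANES; lane++)
        out[lane] = indexed ? base + index[lane]                 /* indexed */
                            : base + (uint32_t)(stride * lane);  /* strided */
}

int main(void)
{
    uint32_t addrs[LANES];
    uint32_t idx[LANES] = {0, 12, 4, 40};

    gen_addresses(0x1000, 4, NULL, 0, addrs);       /* unit-stride words */
    for (int i = 0; i < LANES; i++)
        printf("strided lane %d: 0x%x\n", i, addrs[i]);

    gen_addresses(0x2000, 0, idx, 1, addrs);        /* indexed (gather)  */
    for (int i = 0; i < LANES; i++)
        printf("indexed lane %d: 0x%x\n", i, addrs[i]);
    return 0;
}
```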

28 Overall Memory System Performance (backup slide, 16 lanes). [Chart: cycle breakdown for 4KB and 16KB data caches; memory/miss cycles of 67%, 48%, 31%, and 4% across the configurations.] A wider cache line plus prefetching reduces memory-unit stall cycles significantly and eliminates all but 4% of the miss cycles.

