Customizable Soft Vector Processors
Peter Yiannacouras, PhD Candidate
Connections 2009
Soft Processors in FPGA Systems

Soft Processor (Software + Compiler): Weeks. Easier. Configurable. Used in 25% of designs [source: Altera, 2009]
Custom HW (HDL + CAD): Months. Faster, Smaller, Less Power.

For soft processors to COMPETE with custom HW:
    Make FPGA technology more easily accessible
    Optimize soft processor to application properties
Data Level Parallelism

    // C code
    for(i=0;i<16; i++)
        c[i]=a[i]+b[i];

    // Processor instructions (one loop iteration)
    load r0,a[1]
    load r1,b[1]
    add r2,r0,r1
    store r2,c[1]

Same operation, independent data:

    c[15]=a[15]+b[15]   c[14]=a[14]+b[14]   c[13]=a[13]+b[13]   c[12]=a[12]+b[12]
    c[11]=a[11]+b[11]   c[10]=a[10]+b[10]   c[9]= a[9]+b[9]     c[8]= a[8]+b[8]
    c[7]= a[7]+b[7]     c[6]= a[6]+b[6]     c[5]= a[5]+b[5]     c[4]= a[4]+b[4]
    c[3]= a[3]+b[3]     c[2]= a[2]+b[2]     c[1]= a[1]+b[1]     c[0]= a[0]+b[0]

Data Level Parallelism is commonly found in embedded systems.
Exploit using a Vector Processor: c=a+b
Vector Processing Primer

    // C code
    for(i=0;i<16; i++)
        c[i]=a[i]+b[i];

    // Vectorized code
    set vl,16
    vload vr0,a
    vload vr1,b
    vadd vr2,vr0,vr1
    vstore vr2,c

Each vector instruction holds many units of independent operations. For vadd:

    vr2[15]=vr0[15]+vr1[15]   vr2[14]=vr0[14]+vr1[14]   vr2[13]=vr0[13]+vr1[13]   vr2[12]=vr0[12]+vr1[12]
    vr2[11]=vr0[11]+vr1[11]   vr2[10]=vr0[10]+vr1[10]   vr2[9]= vr0[9]+vr1[9]     vr2[8]= vr0[8]+vr1[8]
    vr2[7]= vr0[7]+vr1[7]     vr2[6]= vr0[6]+vr1[6]     vr2[5]= vr0[5]+vr1[5]     vr2[4]= vr0[4]+vr1[4]
    vr2[3]= vr0[3]+vr1[3]     vr2[2]= vr0[2]+vr1[2]     vr2[1]= vr0[1]+vr1[1]     vr2[0]= vr0[0]+vr1[0]

1 Vector Lane: the element operations are processed one after another.
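The vectorized sequence above can be emulated in plain C to make the semantics of each instruction concrete. This is a minimal sketch, not the processor's implementation: the names `vreg_t`, `set_vl`, `vload`, `vadd`, `vstore` and `vector_add` are hypothetical helpers chosen to mirror the slide's mnemonics, and `MVL` is assumed to be 16 to match the example.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 16  /* assumed maximum vector length, matching the 16-element example */

/* Hypothetical model of one vector register holding MVL integer elements. */
typedef struct { int elem[MVL]; } vreg_t;

static size_t vl = MVL;  /* vector length register, written by set_vl */

/* set vl,n : later vector instructions operate on the first n elements. */
static void set_vl(size_t n) { assert(n <= MVL); vl = n; }

/* vload vr,mem : copy vl consecutive elements from memory into a register. */
static void vload(vreg_t *vr, const int *mem) {
    for (size_t i = 0; i < vl; i++) vr->elem[i] = mem[i];
}

/* vadd vd,va,vb : vl independent element-wise additions. */
static void vadd(vreg_t *vd, const vreg_t *va, const vreg_t *vb) {
    for (size_t i = 0; i < vl; i++) vd->elem[i] = va->elem[i] + vb->elem[i];
}

/* vstore vr,mem : copy vl elements from a register back to memory. */
static void vstore(const vreg_t *vr, int *mem) {
    for (size_t i = 0; i < vl; i++) mem[i] = vr->elem[i];
}

/* c = a + b for n <= MVL elements, following the slide's instruction order. */
void vector_add(int *c, const int *a, const int *b, size_t n) {
    vreg_t vr0, vr1, vr2;
    set_vl(n);
    vload(&vr0, a);
    vload(&vr1, b);
    vadd(&vr2, &vr0, &vr1);
    vstore(&vr2, c);
}
```

Calling `vector_add(c, a, b, 16)` produces the same result as the 16-iteration C loop. The inner `for` loops stand in for the work that the hardware spreads across its vector lanes.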
Vector Processing Primer

    // C code
    for(i=0;i<16; i++)
        c[i]=a[i]+b[i];

    // Vectorized code
    set vl,16
    vload vr0,a
    vload vr1,b
    vadd vr2,vr0,vr1
    vstore vr2,c

Each vector instruction holds many units of independent operations. For vadd:

    vr2[15]=vr0[15]+vr1[15]   vr2[14]=vr0[14]+vr1[14]   vr2[13]=vr0[13]+vr1[13]   vr2[12]=vr0[12]+vr1[12]
    vr2[11]=vr0[11]+vr1[11]   vr2[10]=vr0[10]+vr1[10]   vr2[9]= vr0[9]+vr1[9]     vr2[8]= vr0[8]+vr1[8]
    vr2[7]= vr0[7]+vr1[7]     vr2[6]= vr0[6]+vr1[6]     vr2[5]= vr0[5]+vr1[5]     vr2[4]= vr0[4]+vr1[4]
    vr2[3]= vr0[3]+vr1[3]     vr2[2]= vr0[2]+vr1[2]     vr2[1]= vr0[1]+vr1[1]     vr2[0]= vr0[0]+vr1[0]

16 Vector Lanes: all 16 element operations proceed in parallel => 16x speedup
Implemented on an FPGA (Soft Vector Processor). Is it scalable?
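The 16x figure follows from a simple idealized count. This is a sketch under stated assumptions: each lane completes one element operation per cycle, and memory stalls, pipeline fill and instruction issue overhead are ignored. The function names `vector_cycles` and `ideal_speedup` are hypothetical.

```c
#include <assert.h>

/* Cycles for one vector instruction over vl elements on the given number
 * of lanes, assuming one element per lane per cycle and no stalls. */
unsigned vector_cycles(unsigned vl, unsigned lanes) {
    assert(lanes > 0);
    return (vl + lanes - 1) / lanes;  /* ceiling of vl / lanes */
}

/* Speedup of a multi-lane configuration over a single lane (vl > 0). */
double ideal_speedup(unsigned vl, unsigned lanes) {
    assert(vl > 0);
    return (double)vector_cycles(vl, 1) / (double)vector_cycles(vl, lanes);
}
```

With `vl = 16`, one lane needs 16 cycles and 16 lanes need 1 cycle, giving the 16x ideal speedup. When `vl` is not a multiple of the lane count, the last cycle leaves some lanes idle, so the speedup falls below the lane count.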
Soft Vector Processor Scalability

[Figure: scalability plot, annotated 14x (speed) and 9x (area)]

7 configurations span 14x in speed and 9x in area => coarse-grained!
More Architectural Parameters

    Description                     Symbol   Values

    Processor Architecture
    Number of Lanes                 L        1,2,4,8, …
    Memory Crossbar Lanes           M        1,2, …, L
    Multiplier Lanes                X
    Register Banks for Chaining     B        1,2,4, …
    ALU per Register Bank           APB      true/false

    Instruction Set Architecture
    Maximum Vector Length           MVL      2,4,8, …
    Width of Lanes (in bits)        W        1-32
    Instruction Enable (each)       -        on/off

    Memory System
    Data Cache Capacity             DD       any
    Data Cache Line Size            DW
    Data Prefetch Size              DPK      < DD
    Vector Data Prefetch Size       DPV      < DD/MVL
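The value ranges in the table can be read as constraints on a configuration record. The following is a minimal sketch, not the tool flow used in the work. The names `svp_config_t` and `svp_config_valid` are hypothetical. It assumes that series such as "1,2,4,8, …" denote powers of two, and that DD, DPK and DPV share one unit. X and DW are left unchecked because no value range is listed for them.

```c
#include <stdbool.h>

/* Hypothetical configuration record using the table's symbols. */
typedef struct {
    unsigned L;    /* number of lanes */
    unsigned M;    /* memory crossbar lanes */
    unsigned X;    /* multiplier lanes (no range listed, not checked) */
    unsigned B;    /* register banks for chaining */
    bool     APB;  /* ALU per register bank */
    unsigned MVL;  /* maximum vector length */
    unsigned W;    /* width of lanes in bits */
    unsigned DD;   /* data cache capacity */
    unsigned DW;   /* data cache line size (no range listed, not checked) */
    unsigned DPK;  /* data prefetch size */
    unsigned DPV;  /* vector data prefetch size */
} svp_config_t;

static bool is_pow2(unsigned v) { return v != 0 && (v & (v - 1)) == 0; }

/* Returns true when the configuration satisfies the listed value ranges. */
bool svp_config_valid(const svp_config_t *c) {
    if (!is_pow2(c->L)) return false;                   /* L: 1,2,4,8,... */
    if (c->M < 1 || c->M > c->L) return false;          /* M: 1..L */
    if (!is_pow2(c->B)) return false;                   /* B: 1,2,4,... */
    if (c->MVL < 2 || !is_pow2(c->MVL)) return false;   /* MVL: 2,4,8,... */
    if (c->W < 1 || c->W > 32) return false;            /* W: 1-32 bits */
    if (c->DPK >= c->DD) return false;                  /* DPK < DD */
    /* DPV < DD/MVL, rewritten as DPV*MVL < DD to avoid integer truncation */
    if ((unsigned long long)c->DPV * c->MVL >= c->DD) return false;
    return true;
}
```

A check like this is one way to prune illegal points before exploring the design space; the per-instruction enable switches are omitted here for brevity.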
Fine-Grained Trade-Off Space

[Figure: design points grouped by Memory System: Weak, Moderate, Good]