Improving Memory System Performance for Soft Vector Processors
Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose
WoSPS – Oct 26, 2008
Soft Processors in FPGA Systems
Soft processor (C + compiler): easier to use.
Custom logic (HDL + CAD): faster, smaller, less power.
Soft processors are configurable – how can we make use of this?
Data-level parallelism → soft vector processors
Vector Processing Primer (1 vector lane)
// C code
for (i = 0; i < 16; i++) b[i] += a[i];
// Vectorized code
set vl, 16
vload vr0, b
vload vr1, a
vadd vr0, vr0, vr1
vstore vr0, b
Each vector instruction holds many units of independent operations.
[Diagram: with 1 vector lane, b[0]+=a[0] through b[15]+=a[15] execute one element at a time.]
Vector Processing Primer (16 vector lanes)
Same vectorized code, but with 16 vector lanes → 16x speedup.
[Diagram: all 16 element operations b[0]+=a[0] through b[15]+=a[15] execute in parallel, one per lane.]
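As a rough illustration of how elements map onto lanes (a software sketch, not VESPA's actual hardware), the following C model executes the same loop in groups of `lanes` independent element operations; with 16 lanes and vl=16 the whole vector finishes in one group, which is where the 16x speedup comes from:

```c
#include <stdio.h>

#define VL 16   /* vector length set by "set vl,16" */

/* Hypothetical software model: execute b[i] += a[i] across `lanes` lanes.
 * In each step, every lane handles one independent element. */
static void vadd_model(int *b, const int *a, int vl, int lanes)
{
    for (int base = 0; base < vl; base += lanes) {            /* one step per group   */
        for (int lane = 0; lane < lanes && base + lane < vl; lane++) {
            int i = base + lane;                              /* element for this lane */
            b[i] += a[i];
        }
    }
}

int main(void)
{
    int a[VL], b[VL];
    for (int i = 0; i < VL; i++) { a[i] = i; b[i] = 100; }

    vadd_model(b, a, VL, 16);          /* 16 lanes: all elements in one step */
    printf("b[15] = %d\n", b[15]);     /* prints 115 */
    return 0;
}
```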
Sub-Linear Scalability
Vector lanes are not being fully utilized.
Where Are The Cycles Spent?
With 16 lanes, 67% (two thirds) of cycles are spent waiting on the memory unit, often due to cache misses.
Our Goals
Improve the memory system:
- Better cache design
- Hardware prefetching
Evaluate improvements for real:
- Using a complete hardware design (in Verilog)
- On real FPGA hardware (Stratix 1S80C6)
- Running full benchmarks (EEMBC)
- From off-chip memory (DDR-133MHz)
Current Infrastructure
Software: EEMBC C benchmarks compiled with GCC, plus vectorized assembly subroutines assembled with GNU as, linked (ld) into an ELF binary; cycle behaviour verified against the MINT instruction set simulator.
Hardware: Verilog for the scalar µP and vector (vpu) support, simulated in Modelsim (RTL simulation, cycle verification) and synthesized with Altera Quartus II v8.0 for area and clock frequency.
VESPA Architecture Design
Scalar pipeline: 3-stage, with the icache and dcache shared with the vector unit.
Vector control pipeline: 3-stage.
Vector pipeline: 6-stage (decode, replicate, hazard check, vector register file, ALU with saturation and shift support, memory unit, writeback); 32-bit datapaths.
Supports integer and fixed-point operations, and predication.
[Diagram: pipeline block diagram with per-lane register files and ALUs and a shared memory unit.]
Memory System Design (original VESPA)
vld.w: load 16 contiguous 32-bit words on a 16-lane VESPA.
16 lanes → vector memory crossbar → dcache (4KB, 16B lines) → DDR (9-cycle access).
[Diagram: scalar core and vector coprocessor lanes 1–16 connected through the vector memory crossbar to the dcache and DDR.]
Memory System Design (improved)
Same vld.w, but with a 16KB dcache and 64B lines (4x wider): the 16 contiguous 32-bit words fit in a single cache line, giving 4x fewer cache accesses plus some prefetching. DDR access latency is unchanged (9 cycles).
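A back-of-the-envelope check of the 4x reduction (a sketch assuming aligned, unit-stride accesses): the number of dcache accesses for the 16-word vld.w is roughly the 64 bytes touched divided by the line size.

```c
#include <stdio.h>

/* Rough estimate of dcache accesses for an aligned unit-stride vector load. */
static int accesses_per_vld(int elems, int elem_bytes, int line_bytes)
{
    int total_bytes = elems * elem_bytes;
    return (total_bytes + line_bytes - 1) / line_bytes;   /* ceiling divide */
}

int main(void)
{
    printf("16B line: %d accesses\n", accesses_per_vld(16, 4, 16));  /* 4 */
    printf("64B line: %d accesses\n", accesses_per_vld(16, 4, 64));  /* 1 -> 4x fewer */
    return 0;
}
```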
Improving Cache Design
Vary the cache depth and cache line size using a parameterized design:
- Cache line size: 16, 32, 64, 128 bytes
- Cache depth: 4, 8, 16, 32, 64 KB
Measure performance on 9 benchmarks (6 from EEMBC), all executed in hardware.
Measure area cost: equate the silicon area of all resources used and report it in units of equivalent LEs.
Cache Design Space – Performance (Wall Clock Time)
Clock frequency ranges from 122MHz to 129MHz across the design space.
The best cache design almost doubles the performance of the original VESPA.
Cache line size matters more than cache depth (lots of streaming).
More pipelining/retiming could reduce the clock frequency penalty.
Cache Design Space – Area
[Diagram: a 64B (512-bit) cache line striped across M4K block RAMs, each configured 16 bits wide (4096 bits per block); 32 M4Ks => 16KB of storage. MRAM shown for comparison.]
System area almost doubled in the worst case.
Cache Design Space – Area (observations)
a) Choose the cache depth to fill the block RAMs needed for the line size.
b) Don't use MRAMs: they are big, few in number, and overkill.
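The arithmetic behind observation (a), as a sketch (assuming the 16-bit-wide M4K configuration shown in the diagram): the data array must be wide enough to read a whole 64B line per access, and the M4Ks needed to supply that width already provide 16KB of storage, so a shallower cache would waste block RAM capacity.

```c
#include <stdio.h>

/* Why a 64B line "wants" a 16KB cache: the line is read in one access,
 * so the data array is striped across enough M4Ks to cover the line width,
 * and those M4Ks come with a minimum storage capacity for free. */
int main(void)
{
    int line_bits = 64 * 8;      /* 64B line = 512 bits                  */
    int m4k_width = 16;          /* assumed 256x16 M4K configuration      */
    int m4k_bits  = 4096;        /* 4 Kbit per M4K block                  */

    int m4ks_needed        = line_bits / m4k_width;             /* 32     */
    int min_capacity_bytes = m4ks_needed * m4k_bits / 8;        /* 16384  */

    printf("%d M4Ks -> minimum depth %d KB\n",
           m4ks_needed, min_capacity_bytes / 1024);
    return 0;
}
```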
Hardware Prefetching Example
No prefetching: each vld.w misses and pays the 9-cycle DDR penalty.
Prefetching 3 blocks: the first vld.w misses and fetches 3 additional blocks, so subsequent vld.w instructions hit in the dcache.
[Diagram: dcache and DDR with and without prefetching.]
Hardware Data Prefetching
Advantages: little area overhead; memory fetching is parallelized with computation; uses the full memory bandwidth.
Disadvantage: cache pollution.
We use sequential prefetching triggered on: a) any miss, or b) a sequential vector instruction miss.
We measure performance/area using a 16KB dcache with 64B lines.
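A minimal, self-contained C model of sequential prefetching on a direct-mapped cache (illustrative only, not the VESPA RTL; the cache parameters here are assumptions): on a miss, the demanded line is fetched along with the next K sequential lines.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define LINE_BYTES 64    /* 64B lines, as in the evaluated dcache     */
#define NUM_LINES  256   /* 16KB direct-mapped toy cache              */
#define PREFETCH_K 3     /* extra sequential lines fetched on a miss  */

static uint32_t tags[NUM_LINES];
static bool     valid[NUM_LINES];
static int      misses;

/* Install a line (models a fetch from DDR). */
static void fill(uint32_t line)
{
    valid[line % NUM_LINES] = true;
    tags [line % NUM_LINES] = line;
}

/* One access with sequential prefetching triggered on a miss. */
static void cache_access(uint32_t addr)
{
    uint32_t line = addr / LINE_BYTES;
    uint32_t set  = line % NUM_LINES;
    if (!(valid[set] && tags[set] == line)) {
        misses++;
        fill(line);                          /* demand fetch        */
        for (int k = 1; k <= PREFETCH_K; k++)
            fill(line + k);                  /* prefetch next lines */
    }
}

int main(void)
{
    /* Stream through 4KB: 64 misses without prefetching, 16 with K=3. */
    for (uint32_t a = 0; a < 4096; a += 4)
        cache_access(a);
    printf("misses = %d\n", misses);
    return 0;
}
```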
Prefetching K Blocks – Any Miss
Only half the benchmarks are significantly sped up (the rest are not receptive); maximum speedup of 2.2x, peak average speedup of 28%.
Prefetching Area Cost: Writeback Buffer
When a prefetch would displace dirty lines, there are two options: deny the prefetch, or buffer all dirty lines in a writeback buffer.
Area cost is small: 1.6% of system area, mostly block RAMs, little logic, and no clock frequency impact.
[Diagram: a vld.w miss fills the dcache while dirty lines drain through the WB buffer to DDR (9-cycle penalty).]
Any Miss vs Sequential Vector Miss
The two curves are collinear – nearly all misses in our benchmarks come from sequential vector memory instructions.
Vector Length Prefetching
Previously: a constant number of cache lines was prefetched.
Now: prefetch a multiple of the vector length (k*VL), only for sequential vector memory instructions.
E.g., for a vector load of 32 elements, the triggering miss fetches the demanded line and prefetches enough lines to cover the remaining elements (scaled by k).
Guarantees <= 1 miss per vector memory instruction.
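A sketch of the prefetch-size computation for vector length prefetching (names and parameters here are illustrative): prefetching enough lines to cover k*VL elements starting at the missing line means a sequential vector memory instruction can take at most one miss when k >= 1.

```c
#include <stdio.h>

/* Lines to fetch for a k*VL prefetch on a sequential vector load/store.
 * With k >= 1 this covers all VL elements of the triggering instruction,
 * so it takes at most one miss. */
static int prefetch_lines(int k, int vl, int elem_bytes, int line_bytes)
{
    int bytes = k * vl * elem_bytes;
    return (bytes + line_bytes - 1) / line_bytes;   /* ceiling divide */
}

int main(void)
{
    /* Example: vector load of 32 x 32-bit elements, 64B cache lines. */
    printf("1*VL: %d lines\n", prefetch_lines(1, 32, 4, 64));   /* 2  */
    printf("8*VL: %d lines\n", prefetch_lines(8, 32, 4, 64));   /* 16 */
    return 0;
}
```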
Vector Length Prefetching – Performance
1*VL prefetching provides a good speedup (21% average) without tuning and without cache pollution; 8*VL is best, with a 29% peak average speedup and a maximum of 2.2x; one benchmark is not receptive.
Overall Memory System Performance
Memory unit stall cycles drop from 67% (original 4KB cache) to 48% (16KB, 64B lines) to 31% (with prefetching added).
The wider line plus prefetching eliminates all but 4% of miss cycles.
Improved Scalability
Previous: 3-8x speedup range, average of 5x for 16 lanes.
Now: 6-13x speedup range, average of 10x for 16 lanes.
Summary
Explored the cache design space: ~2x performance for ~2x system area (area growth due largely to the memory crossbar); widened the cache line to 64B and the depth to 16KB.
Enhanced VESPA with hardware data prefetching: up to 2.2x performance, average of 28% for K=15.
The vector length prefetcher gains 21% on average for 1*VL – good for mixed workloads, with no tuning and no cache pollution; peak at 8*VL with a 29% average speedup.
Overall, improved the VESPA memory system and scalability: miss cycles decreased to 4%, memory unit stall cycles to 31%.
Vector Memory Unit
[Diagram: per-lane address generation (base + stride*lane, or base + index for indexed accesses), a memory request queue and memory write queue, and read/write crossbars between the L lanes and the dcache.]
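A simplified C view of the per-lane address generation in the diagram above (an illustrative sketch, not the actual Verilog): each lane l computes base + stride*l for unit-stride/strided accesses, or base + index[l] for indexed (scatter/gather) accesses, before the requests go through the read and write crossbars to the dcache.

```c
#include <stdio.h>
#include <stdint.h>

#define LANES 4   /* matches the 4-lane example in the diagram */

/* Per-lane address generation for one vector memory instruction.
 * indexed == 0: unit/constant-stride access (addr = base + stride*l)
 * indexed == 1: indexed scatter/gather      (addr = base + index[l]) */
static void gen_addrs(uint32_t base, int32_t stride, const int32_t *index,
                      int indexed, uint32_t addr[LANES])
{
    for (int l = 0; l < LANES; l++)
        addr[l] = indexed ? base + (uint32_t)index[l]
                          : base + (uint32_t)(stride * l);
}

int main(void)
{
    uint32_t addr[LANES];
    int32_t  idx[LANES] = { 0, 12, 4, 8 };

    gen_addrs(0x1000, 4, NULL, 0, addr);   /* unit-stride word access */
    printf("lane3 strided addr = 0x%x\n", (unsigned)addr[3]);  /* 0x100c */

    gen_addrs(0x2000, 0, idx, 1, addr);    /* indexed (gather) access */
    printf("lane1 indexed addr = 0x%x\n", (unsigned)addr[1]);  /* 0x200c */
    return 0;
}
```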