1
Improving Memory System Performance for Soft Vector Processors
Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose. WoSPS, Oct 26, 2008
2
Soft Processors in FPGA Systems
Soft processors (C + compiler) are easier to use; custom logic (HDL + CAD) is faster, smaller, and lower power.
Soft processors are configurable: how can we make use of this?
Exploit data-level parallelism → soft vector processors
3
Vector Processing Primer
// C code
for (i = 0; i < 16; i++)
    b[i] += a[i];

// Vectorized code
set    vl, 16
vload  vr0, b
vload  vr1, a
vadd   vr0, vr0, vr1
vstore vr0, b

With 1 vector lane, the vadd executes the element operations b[0]+=a[0] through b[15]+=a[15] one at a time. Each vector instruction holds many units of independent operations.
4
Vector Processing Primer
With 16 vector lanes, the same vectorized code executes all 16 element operations b[0]+=a[0] through b[15]+=a[15] in parallel, giving a 16x speedup. Each vector instruction holds many units of independent operations.
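As a concrete reference, here is a minimal C sketch (an illustration, not VESPA code) of the semantics of the vadd above: the hardware lanes operate in parallel, while the loop below is the equivalent sequential software model.

#include <stdio.h>

#define NUM_LANES 16                 /* lane count assumed for this example */

/* Software model of one vadd: in hardware, each lane handles one element
 * and all NUM_LANES element operations complete in parallel. */
static void vadd_inplace(int *b, const int *a, int vl)
{
    for (int lane = 0; lane < vl; lane++)
        b[lane] += a[lane];
}

int main(void)
{
    int a[16], b[16];
    for (int i = 0; i < 16; i++) { a[i] = i; b[i] = 100; }

    vadd_inplace(b, a, NUM_LANES);   /* one vadd with vl = 16 */
    printf("b[15] = %d\n", b[15]);   /* prints 115 */
    return 0;
}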
5
Sub-Linear Scalability
Vector lanes not being fully utilized
6
Where Are The Cycles Spent?
With 16 lanes, 67% (2/3) of cycles are spent waiting on the memory unit, often due to cache misses.
7
Our Goals
Improve the memory system:
Better cache design
Hardware prefetching
Evaluate improvements for real:
Using a complete hardware design (in Verilog)
On real FPGA hardware (Stratix 1S80C6)
Running full benchmarks (EEMBC)
From off-chip memory (DDR-133MHz)
8
Current Infrastructure
SOFTWARE: EEMBC C benchmarks are compiled with GCC and linked (ld) into an ELF binary together with vectorized assembly subroutines assembled by GNU as; correctness is verified against the MINT instruction set simulator.
HARDWARE: a complete Verilog design of the scalar μP plus the VPU (decode with replicate and hazard check, vector register files, ALUs with multiply, saturate and right-shift support, and a memory unit); cycle counts are verified with Modelsim (RTL simulator), and area and frequency come from Altera Quartus II v8.0.
9
VESPA Architecture Design
Scalar pipeline (3-stage): Icache, Decode, RF, ALU, shared Dcache, writeback.
Vector control pipeline (3-stage): Logic, Decode, VC/VS register files and writeback.
Vector pipeline (6-stage): Decode, Replicate, Hazard check, VR register files, ALUs with multiply, saturate and right-shift, memory unit, VR writeback.
32-bit datapaths; the Dcache is shared with the scalar pipeline. Supports integer and fixed-point operations, and predication.
10
Memory System Design
Example: vld.w loads 16 contiguous 32-bit words on a 16-lane VESPA. The scalar processor, the vector coprocessor, and all 16 lanes reach the Dcache through a vector memory crossbar. Original Dcache: 4KB with 16B lines, backed by DDR with a 9-cycle access latency.
11
Memory System Design (improved)
The same vld.w of 16 contiguous 32-bit words, but with the Dcache enlarged to 16KB with 64B lines and the vector memory crossbar widened 4x. This gives 4x fewer cache accesses plus some prefetching from the wider line. DDR access latency is still 9 cycles.
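A quick back-of-envelope check of the 4x figure (a sketch; it assumes an aligned, unit-stride vld.w of 16 32-bit words):

#include <stdio.h>

int main(void)
{
    const int vl         = 16;   /* elements per vector memory instruction */
    const int elem_bytes = 4;    /* 32-bit words                           */
    const int line_orig  = 16;   /* original Dcache line size (bytes)      */
    const int line_wide  = 64;   /* widened Dcache line size (bytes)       */

    int bytes = vl * elem_bytes;                                /* 64 bytes */
    printf("lines touched, 16B line: %d\n", bytes / line_orig); /* 4 */
    printf("lines touched, 64B line: %d\n", bytes / line_wide); /* 1 */
    return 0;
}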
12
Improving Cache Design
Vary the cache depth and cache line size using a parameterized design (see the geometry sketch after this list):
Cache line size: 16, 32, 64, 128 bytes
Cache depth: 4, 8, 16, 32, 64 KB
Measure performance on 9 benchmarks (6 from EEMBC), all executed in hardware.
Measure area cost by equating the silicon area of all resources used, reported in units of equivalent LEs.
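As a rough guide to what each design point means physically, the sketch below maps line size and depth to line count and address-bit split. It assumes a direct-mapped cache; associativity is not stated on this slide.

#include <stdio.h>

static int log2i(int x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

int main(void)
{
    const int line_sizes[] = {16, 32, 64, 128};   /* bytes */
    const int depths_kb[]  = {4, 8, 16, 32, 64};  /* KB    */

    for (int d = 0; d < 5; d++)
        for (int l = 0; l < 4; l++) {
            int lines = depths_kb[d] * 1024 / line_sizes[l];
            printf("%2d KB, %3dB line: %4d lines (%d offset bits, %d index bits)\n",
                   depths_kb[d], line_sizes[l], lines,
                   log2i(line_sizes[l]), log2i(lines));
        }
    return 0;
}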
13
Cache Design Space – Performance (Wall Clock Time)
(The plotted design points run between 122MHz and 129MHz.) The best cache design almost doubles the performance of the original VESPA. Cache line size matters more than cache depth, since the benchmarks do a lot of streaming. More pipelining/retiming could reduce the clock frequency penalty.
14
Cache Design Space – Area
A 64B (512-bit) cache line is striped across M4K block RAMs read 16 bits wide (4096 bits each), so 32 M4Ks are needed, giving 16KB of storage; MRAMs are the larger alternative block RAM. System area almost doubled in the worst case.
15
Cache Design Space – Area
a) Choose the cache depth to fill the block RAMs (M4Ks) already required by the line size (see the sizing sketch below).
b) Don't use MRAMs: they are big, few, and overkill for a cache of this size.
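The "fill the block RAMs" guideline follows from simple arithmetic; a hedged sketch (it assumes each Stratix M4K holds 4 Kbits and is read 16 bits wide, as the previous slide suggests):

#include <stdio.h>

int main(void)
{
    const int line_bits = 64 * 8;   /* a 64B line is 512 bits            */
    const int m4k_width = 16;       /* bits read from one M4K per cycle  */
    const int m4k_bits  = 4096;     /* capacity of one M4K block         */

    int m4ks_needed = line_bits / m4k_width;               /* 32 blocks  */
    int min_kb      = m4ks_needed * m4k_bits / 8 / 1024;   /* 16 KB      */

    printf("M4Ks needed to read a full line per cycle: %d\n", m4ks_needed);
    printf("cache depth that exactly fills them: %d KB\n", min_kb);
    return 0;
}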
16
Hardware Prefetching Example
No prefetching: consecutive vld.w instructions each miss and each pays the 9-cycle DDR access penalty.
Prefetching 3 blocks: the first vld.w misses and pays the 9-cycle penalty, but the three following vld.w instructions hit in the Dcache because their lines were brought in alongside the missing one.
17
Hardware Data Prefetching
Advantages:
Little area overhead
Parallelizes memory fetching with computation
Uses the full memory bandwidth
Disadvantages:
Cache pollution
We use sequential prefetching, triggered on a) any miss, or b) a miss by a sequential vector memory instruction (a sketch of this policy follows below). We measure performance/area using a 16KB dcache with 64B lines.
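A minimal sketch of the sequential prefetch policy described above (illustrative only: the cache-model helpers are invented for this example, not taken from the VESPA RTL):

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64                       /* dcache line size used in the evaluation */

extern bool cache_lookup(uint32_t line);    /* hypothetical: true on hit     */
extern void cache_fill(uint32_t line);      /* hypothetical: fetch from DDR  */

/* On a miss, fetch the missing line plus K sequential successors.
 * vector_only = true selects trigger (b): prefetch only when the miss
 * came from a sequential vector memory instruction. */
void handle_access(uint32_t addr, int K, bool is_seq_vector, bool vector_only)
{
    uint32_t line = addr / LINE_BYTES;
    if (cache_lookup(line))
        return;                             /* hit: nothing to do            */

    cache_fill(line);                       /* demand fetch                  */

    if (vector_only && !is_seq_vector)
        return;                             /* trigger (b): no prefetch here */

    for (int k = 1; k <= K; k++)            /* prefetch the next K blocks    */
        cache_fill(line + k);
}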
18
Prefetching K Blocks – Any Miss
Peak average speedup is 28%; several benchmarks are not receptive. Only half the benchmarks are significantly sped up, with a maximum of 2.2x and an average of 28%.
19
Prefetching Area Cost: Writeback Buffer
If a prefetch (here, 3 blocks) would evict dirty lines, there are two options: deny the prefetch, or buffer all the dirty lines in a writeback buffer while the prefetched data is fetched from DDR (9-cycle penalty). The area cost is small: 1.6% of system area, mostly block RAMs with little logic, and no clock frequency impact.
20
Any Miss vs Sequential Vector Miss
The two curves are collinear: nearly all misses in our benchmarks come from sequential vector memory instructions.
21
Vector Length Prefetching
Previously: a constant number of cache lines was prefetched.
Now: prefetch a multiple of the vector length (VL), and only for sequential vector memory instructions. E.g., for a vector load of 32 elements, the fetch plus prefetch covers the whole access, guaranteeing at most one miss per vector memory instruction (see the sketch below).
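A sketch of the vector-length prefetcher (illustrative; the cache_fill helper is a stand-in, not the VESPA RTL). On a sequential-vector miss it fetches enough lines to cover k*VL elements, so the remaining elements of the current instruction cannot miss:

#include <stdint.h>

#define LINE_BYTES 64

extern void cache_fill(uint32_t line);      /* hypothetical: fetch from DDR */

void prefetch_on_vector_miss(uint32_t miss_addr, int vl, int elem_bytes, int k)
{
    uint32_t first = miss_addr / LINE_BYTES;
    uint32_t span  = (uint32_t)k * vl * elem_bytes;          /* bytes to cover */
    uint32_t lines = (span + LINE_BYTES - 1) / LINE_BYTES;   /* round up       */

    for (uint32_t i = 0; i < lines; i++)    /* demand line plus k*VL worth of lines */
        cache_fill(first + i);
}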
22
Vector Length Prefetching - Performance
1*VL prefetching provides a good speedup (21% on average) without tuning and with no cache pollution; 8*VL is best, with a 29% average speedup and up to 2.2x. Some benchmarks are not receptive.
23
Overall Memory System Performance
Memory unit stall cycles fall from 67% (original 4KB cache) to 48% (16KB cache with wider lines) to 31% once prefetching is added. The wider line plus prefetching eliminates all but 4% of miss cycles.
24
Improved Scalability
Previous: 3-8x range, average of 5x for 16 lanes
Now: 6-13x range, average of 10x for 16 lanes
25
Summary
Explored the cache design space:
Widened the cache line size to 64B and the depth to 16KB
~2x performance for ~2x system area; area growth largely due to the memory crossbar
Enhanced VESPA with hardware data prefetching:
Up to 2.2x performance, average of 28% for K=15
Vector length prefetching gains 21% on average for 1*VL: good for mixed workloads, no tuning, no cache pollution
Peaks at 8*VL with an average speedup of 29%
Overall, improved the VESPA memory system and scalability:
Decreased miss cycles to 4% and memory unit stall cycles to 31%
26
Vector Memory Unit
[Diagram: for each lane (L = # lanes - 1, with Memory Lanes = 4 shown), a mux selects either base + stride*lane or base + index[lane]; the resulting addresses and wrdata enter the Memory Request Queue. A read crossbar returns rddata to the lanes, and a write crossbar feeds the Memory Write Queue in front of the Dcache.]
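A small C sketch of the per-lane address generation implied by the diagram (illustrative; in hardware a mux selects between the strided and indexed address for each lane before the requests enter the memory request queue):

#include <stdbool.h>
#include <stdint.h>

void generate_lane_addresses(uint32_t base, int32_t stride,
                             const uint32_t *index, bool indexed,
                             int lanes, uint32_t *addr)
{
    for (int lane = 0; lane < lanes; lane++)
        addr[lane] = indexed ? base + index[lane]                  /* indexed access */
                             : base + (uint32_t)(stride * lane);   /* strided access */
}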