1
Improving Memory System Performance for Soft Vector Processors
Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose. WoSPS, Oct 26, 2008
2
Soft Processors in FPGA Systems
Soft processors (C + compiler) are easier to use; custom logic (HDL + CAD) is faster, smaller, and lower power.
Soft processors are configurable: how can we make use of this?
Exploit data-level parallelism → soft vector processors
3
Vector Processing Primer
// C code
for (i = 0; i < 16; i++)
    b[i] += a[i];

// Vectorized code
set    vl, 16
vload  vr0, b
vload  vr1, a
vadd   vr0, vr0, vr1
vstore vr0, b

With 1 vector lane, the vadd executes the element operations b[0]+=a[0] through b[15]+=a[15] one at a time. Each vector instruction holds many units of independent operations.
4
Vector Processing Primer
With 16 vector lanes, the same vectorized code executes all 16 element operations b[0]+=a[0] through b[15]+=a[15] in parallel, giving a 16x speedup. Each vector instruction holds many units of independent operations.
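As a concrete reference, here is a minimal C sketch (an illustration, not VESPA code) of the semantics of the vadd above: the hardware lanes operate in parallel, while the loop below is the equivalent sequential software model.

#include <stdio.h>

#define NUM_LANES 16                 /* lane count assumed for this example */

/* Software model of one vadd: in hardware, each lane handles one element
 * and all NUM_LANES element operations complete in parallel. */
static void vadd_inplace(int *b, const int *a, int vl)
{
    for (int lane = 0; lane < vl; lane++)
        b[lane] += a[lane];
}

int main(void)
{
    int a[16], b[16];
    for (int i = 0; i < 16; i++) { a[i] = i; b[i] = 100; }

    vadd_inplace(b, a, NUM_LANES);   /* one vadd with vl = 16 */
    printf("b[15] = %d\n", b[15]);   /* prints 115 */
    return 0;
}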
5
Sub-Linear Scalability
Vector lanes not being fully utilized
6
Where Are The Cycles Spent?
With 16 lanes, 67% (2/3) of cycles are spent waiting on the memory unit, often due to cache misses.
7
Our Goals
Improve the memory system:
Better cache design
Hardware prefetching
Evaluate improvements for real:
Using a complete hardware design (in Verilog)
On real FPGA hardware (Stratix 1S80C6)
Running full benchmarks (EEMBC)
From off-chip memory (DDR-133MHz)
8
Current Infrastructure
SOFTWARE: EEMBC C benchmarks are compiled with GCC and linked (ld) into an ELF binary together with vectorized assembly subroutines assembled by GNU as; correctness is verified against the MINT instruction set simulator.
HARDWARE: a complete Verilog design of the scalar μP plus the VPU (decode with replicate and hazard check, vector register files, ALUs with multiply, saturate and right-shift support, and a memory unit); cycle counts are verified with Modelsim (RTL simulator), and area and frequency come from Altera Quartus II v8.0.
9
VESPA Architecture Design
Scalar pipeline (3-stage): Icache, Decode, RF, ALU, shared Dcache, writeback.
Vector control pipeline (3-stage): Logic, Decode, VC/VS register files and writeback.
Vector pipeline (6-stage): Decode, Replicate, Hazard check, VR register files, ALUs with multiply, saturate and right-shift, memory unit, VR writeback.
32-bit datapaths; the Dcache is shared with the scalar pipeline. Supports integer and fixed-point operations, and predication.
10
Memory System Design
Example: vld.w loads 16 contiguous 32-bit words on a 16-lane VESPA. The scalar processor, the vector coprocessor, and all 16 lanes reach the Dcache through a vector memory crossbar. Original Dcache: 4KB with 16B lines, backed by DDR with a 9-cycle access latency.
11
Memory System Design (improved)
The same vld.w of 16 contiguous 32-bit words, but with the Dcache enlarged to 16KB with 64B lines and the vector memory crossbar widened 4x. This gives 4x fewer cache accesses plus some prefetching from the wider line. DDR access latency is still 9 cycles.
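A quick back-of-envelope check of the 4x figure (a sketch; it assumes an aligned, unit-stride vld.w of 16 32-bit words):

#include <stdio.h>

int main(void)
{
    const int vl         = 16;   /* elements per vector memory instruction */
    const int elem_bytes = 4;    /* 32-bit words                           */
    const int line_orig  = 16;   /* original Dcache line size (bytes)      */
    const int line_wide  = 64;   /* widened Dcache line size (bytes)       */

    int bytes = vl * elem_bytes;                                /* 64 bytes */
    printf("lines touched, 16B line: %d\n", bytes / line_orig); /* 4 */
    printf("lines touched, 64B line: %d\n", bytes / line_wide); /* 1 */
    return 0;
}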
12
Improving Cache Design
Vary the cache depth and cache line size using a parameterized design (see the geometry sketch after this list):
Cache line size: 16, 32, 64, 128 bytes
Cache depth: 4, 8, 16, 32, 64 KB
Measure performance on 9 benchmarks (6 from EEMBC), all executed in hardware.
Measure area cost by equating the silicon area of all resources used, reported in units of equivalent LEs.
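As a rough guide to what each design point means physically, the sketch below maps line size and depth to line count and address-bit split. It assumes a direct-mapped cache; associativity is not stated on this slide.

#include <stdio.h>

static int log2i(int x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

int main(void)
{
    const int line_sizes[] = {16, 32, 64, 128};   /* bytes */
    const int depths_kb[]  = {4, 8, 16, 32, 64};  /* KB    */

    for (int d = 0; d < 5; d++)
        for (int l = 0; l < 4; l++) {
            int lines = depths_kb[d] * 1024 / line_sizes[l];
            printf("%2d KB, %3dB line: %4d lines (%d offset bits, %d index bits)\n",
                   depths_kb[d], line_sizes[l], lines,
                   log2i(line_sizes[l]), log2i(lines));
        }
    return 0;
}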
13
Cache Design Space – Performance (Wall Clock Time)
(The plotted design points run between 122MHz and 129MHz.) The best cache design almost doubles the performance of the original VESPA. Cache line size matters more than cache depth, since the benchmarks do a lot of streaming. More pipelining/retiming could reduce the clock frequency penalty.
14
Cache Design Space – Area
A 64B (512-bit) cache line is striped across M4K block RAMs read 16 bits wide (4096 bits each), so 32 M4Ks are needed, giving 16KB of storage; MRAMs are the larger alternative block RAM. System area almost doubled in the worst case.
15
Cache Design Space – Area
a) Choose the cache depth to fill the block RAMs (M4Ks) already required by the line size (see the sizing sketch below).
b) Don't use MRAMs: they are big, few, and overkill for a cache of this size.
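The "fill the block RAMs" guideline follows from simple arithmetic; a hedged sketch (it assumes each Stratix M4K holds 4 Kbits and is read 16 bits wide, as the previous slide suggests):

#include <stdio.h>

int main(void)
{
    const int line_bits = 64 * 8;   /* a 64B line is 512 bits            */
    const int m4k_width = 16;       /* bits read from one M4K per cycle  */
    const int m4k_bits  = 4096;     /* capacity of one M4K block         */

    int m4ks_needed = line_bits / m4k_width;               /* 32 blocks  */
    int min_kb      = m4ks_needed * m4k_bits / 8 / 1024;   /* 16 KB      */

    printf("M4Ks needed to read a full line per cycle: %d\n", m4ks_needed);
    printf("cache depth that exactly fills them: %d KB\n", min_kb);
    return 0;
}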
16
Hardware Prefetching Example
No prefetching: consecutive vld.w instructions each miss and each pays the 9-cycle DDR access penalty.
Prefetching 3 blocks: the first vld.w misses and pays the 9-cycle penalty, but the three following vld.w instructions hit in the Dcache because their lines were brought in alongside the missing one.
17
Hardware Data Prefetching
Advantages:
Little area overhead
Parallelizes memory fetching with computation
Uses the full memory bandwidth
Disadvantages:
Cache pollution
We use sequential prefetching, triggered on a) any miss, or b) a miss by a sequential vector memory instruction (a sketch of this policy follows below). We measure performance/area using a 16KB dcache with 64B lines.
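A minimal sketch of the sequential prefetch policy described above (illustrative only: the cache-model helpers are invented for this example, not taken from the VESPA RTL):

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64                       /* dcache line size used in the evaluation */

extern bool cache_lookup(uint32_t line);    /* hypothetical: true on hit     */
extern void cache_fill(uint32_t line);      /* hypothetical: fetch from DDR  */

/* On a miss, fetch the missing line plus K sequential successors.
 * vector_only = true selects trigger (b): prefetch only when the miss
 * came from a sequential vector memory instruction. */
void handle_access(uint32_t addr, int K, bool is_seq_vector, bool vector_only)
{
    uint32_t line = addr / LINE_BYTES;
    if (cache_lookup(line))
        return;                             /* hit: nothing to do            */

    cache_fill(line);                       /* demand fetch                  */

    if (vector_only && !is_seq_vector)
        return;                             /* trigger (b): no prefetch here */

    for (int k = 1; k <= K; k++)            /* prefetch the next K blocks    */
        cache_fill(line + k);
}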
18
Prefetching K Blocks – Any Miss
Peak average speedup is 28%; several benchmarks are not receptive. Only half the benchmarks are significantly sped up, with a maximum of 2.2x and an average of 28%.
19
Prefetching Area Cost: Writeback Buffer
If a prefetch (here, 3 blocks) would evict dirty lines, there are two options: deny the prefetch, or buffer all the dirty lines in a writeback buffer while the prefetched data is fetched from DDR (9-cycle penalty). The area cost is small: 1.6% of system area, mostly block RAMs with little logic, and no clock frequency impact.
20
Any Miss vs Sequential Vector Miss
The two curves are collinear: nearly all misses in our benchmarks come from sequential vector memory instructions.
21
Vector Length Prefetching
Previously: a constant number of cache lines was prefetched.
Now: prefetch a multiple of the vector length (VL), and only for sequential vector memory instructions. E.g., for a vector load of 32 elements, the fetch plus prefetch covers the whole access, guaranteeing at most one miss per vector memory instruction (see the sketch below).
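A sketch of the vector-length prefetcher (illustrative; the cache_fill helper is a stand-in, not the VESPA RTL). On a sequential-vector miss it fetches enough lines to cover k*VL elements, so the remaining elements of the current instruction cannot miss:

#include <stdint.h>

#define LINE_BYTES 64

extern void cache_fill(uint32_t line);      /* hypothetical: fetch from DDR */

void prefetch_on_vector_miss(uint32_t miss_addr, int vl, int elem_bytes, int k)
{
    uint32_t first = miss_addr / LINE_BYTES;
    uint32_t span  = (uint32_t)k * vl * elem_bytes;          /* bytes to cover */
    uint32_t lines = (span + LINE_BYTES - 1) / LINE_BYTES;   /* round up       */

    for (uint32_t i = 0; i < lines; i++)    /* demand line plus k*VL worth of lines */
        cache_fill(first + i);
}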
22
Vector Length Prefetching - Performance
1*VL prefetching provides a good speedup (21% on average) without tuning and with no cache pollution; 8*VL is best, with a 29% average speedup and up to 2.2x. Some benchmarks are not receptive.
23
Overall Memory System Performance
Memory unit stall cycles fall from 67% (original 4KB cache) to 48% (16KB cache with wider lines) to 31% once prefetching is added. The wider line plus prefetching eliminates all but 4% of miss cycles.
24
Improved Scalability
Previous: 3-8x range, average of 5x for 16 lanes
Now: 6-13x range, average of 10x for 16 lanes
25
Summary
Explored the cache design space:
Widened the cache line size to 64B and the depth to 16KB
~2x performance for ~2x system area; area growth largely due to the memory crossbar
Enhanced VESPA with hardware data prefetching:
Up to 2.2x performance, average of 28% for K=15
Vector length prefetching gains 21% on average for 1*VL: good for mixed workloads, no tuning, no cache pollution
Peaks at 8*VL with an average speedup of 29%
Overall, improved the VESPA memory system and scalability:
Decreased miss cycles to 4% and memory unit stall cycles to 31%
26
Vector Memory Unit
[Diagram: for each lane (L = # lanes - 1, with Memory Lanes = 4 shown), a mux selects either base + stride*lane or base + index[lane]; the resulting addresses and wrdata enter the Memory Request Queue. A read crossbar returns rddata to the lanes, and a write crossbar feeds the Memory Write Queue in front of the Dcache.]
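A small C sketch of the per-lane address generation implied by the diagram (illustrative; in hardware a mux selects between the strided and indexed address for each lane before the requests enter the memory request queue):

#include <stdbool.h>
#include <stdint.h>

void generate_lane_addresses(uint32_t base, int32_t stride,
                             const uint32_t *index, bool indexed,
                             int lanes, uint32_t *addr)
{
    for (int lane = 0; lane < lanes; lane++)
        addr[lane] = indexed ? base + index[lane]                  /* indexed access */
                             : base + (uint32_t)(stride * lane);   /* strided access */
}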