Improving Memory System Performance for Soft Vector Processors
Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose
WoSPS – Oct 26, 2008
2 Soft Processors in FPGA Systems
Soft processor: programmed in C with a compiler – easier to develop.
Custom logic: designed in HDL with CAD tools – faster, smaller, less power.
Data-level parallelism → soft vector processors.
Soft processors are configurable – how can we make use of this?
3 Vector Processing Primer
// C code
for (i = 0; i < 16; i++)
    b[i] += a[i];
// Vectorized code
set    vl, 16
vload  vr0, b
vload  vr1, a
vadd   vr0, vr0, vr1
vstore vr0, b
Each vector instruction holds many units of independent operations: b[0]+=a[0], b[1]+=a[1], …, b[15]+=a[15].
With 1 vector lane, the vadd steps through these element operations one at a time.
4 Vector Processing Primer
(Same C code and vectorized code as the previous slide.)
With 16 vector lanes, the vadd performs all 16 element operations in parallel → 16x speedup.
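To make the lane picture concrete, here is a minimal C sketch (not from the talk) that models how one vector add over VL elements is time-multiplexed onto a configurable number of lanes; NUM_LANES, the loop structure, and the notion of an "execution step" are illustrative assumptions rather than VESPA's actual microarchitecture.

```c
#include <stdio.h>

#define VL        16   /* vector length set by "set vl,16"                  */
#define NUM_LANES 16   /* hardware lanes; 1 lane => 16 steps, 16 => 1 step  */

int main(void) {
    int a[VL], b[VL];
    for (int i = 0; i < VL; i++) { a[i] = i; b[i] = 100 + i; }

    /* Each "step" models one pass of the vadd: every lane works on a
       different element in parallel until the whole vector is consumed. */
    int steps = 0;
    for (int base = 0; base < VL; base += NUM_LANES, steps++)
        for (int lane = 0; lane < NUM_LANES && base + lane < VL; lane++)
            b[base + lane] += a[base + lane];   /* b[i] += a[i] on one lane */

    printf("VL=%d, lanes=%d -> %d execution step(s)\n", VL, NUM_LANES, steps);
    return 0;
}
```

With NUM_LANES set to 1 the add takes 16 steps, and with 16 lanes it takes a single step – the idealized 16x speedup on the slide.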
5 Sub-Linear Scalability
Speedup grows sub-linearly with the number of lanes: the vector lanes are not being fully utilized.
6 Where Are The Cycles Spent?
With 16 lanes, 2/3 (67%) of cycles are spent waiting on the memory unit, often because of cache misses.
7 Our Goals
1. Improve the memory system:
   - better cache design
   - hardware prefetching
2. Evaluate improvements for real:
   - using a complete hardware design (in Verilog)
   - on real FPGA hardware (Stratix 1S80C6)
   - running full benchmarks (EEMBC)
   - from off-chip memory (DDR-133MHz)
8 Current Infrastructure
Software flow: EEMBC C benchmarks (GCC) plus vectorized assembly subroutines (GNU as with vector support) are linked (ld) into an ELF binary, which runs on the MINT instruction set simulator (scalar μP + VPU) for verification.
Hardware flow: the Verilog design runs in Modelsim (RTL simulator) for cycle counts and verification, and through Altera Quartus II v8.0 for area and frequency.
[Diagram: the vector coprocessor pipeline – decode, replicate, hazard check, vector register files, ALUs, multiply & saturate, saturate/round-shift, memory unit, write-back.]
9 VESPA Architecture Design
Three pipelines share the Icache and Dcache: a 3-stage scalar pipeline (decode, register file, ALU, write-back), a 3-stage vector control pipeline (vector control/scalar register files and write-back), and a 6-stage vector pipeline (decode, replicate, hazard check, vector register files, ALUs with multiply & saturate, saturate/round-shift, memory unit, write-back).
Supports integer and fixed-point operations, and predication. 32-bit datapaths. Shared Dcache.
10 Memory System Design
[Diagram: VESPA with 16 lanes – the scalar processor and the vector coprocessor lanes connect through a vector memory crossbar to a 4KB Dcache with 16B lines, backed by DDR with a 9-cycle access latency.]
Example access: vld.w (load 16 contiguous 32-bit words).
11 Memory System Design – improved cache
[Same diagram, now with a 16KB Dcache and 64B lines.]
Example access: vld.w (load 16 contiguous 32-bit words).
With 64B lines, the 16 words (64 bytes) fit in a single line instead of four 16B lines: 4x fewer cache accesses, plus some prefetching effect from the wider line.
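As a back-of-the-envelope check of the "4x reduced cache accesses" claim, the following C sketch (illustrative only, assuming an aligned, unit-stride access) counts how many cache lines a 16-element 32-bit vld.w touches for each candidate line size.

```c
#include <stdio.h>

/* Number of cache lines touched by an aligned, unit-stride vector load. */
static int lines_touched(int elements, int elem_bytes, int line_bytes) {
    int total_bytes = elements * elem_bytes;
    return (total_bytes + line_bytes - 1) / line_bytes;   /* ceiling division */
}

int main(void) {
    /* vld.w: 16 contiguous 32-bit (4-byte) words = 64 bytes of data. */
    int sizes[] = { 16, 32, 64, 128 };
    for (int i = 0; i < 4; i++)
        printf("line=%3dB -> %d cache access(es) per vld.w\n",
               sizes[i], lines_touched(16, 4, sizes[i]));
    return 0;
}
```

A 16B line requires 4 accesses, while a 64B line covers the whole 64-byte load in a single access; any data a wide line brings in beyond the requested words acts as a modest form of prefetching.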
12 Improving Cache Design
Vary the cache depth and cache line size using a parameterized design:
   - cache line size: 16, 32, 64, 128 bytes
   - cache depth: 4, 8, 16, 32, 64 KB
Measure performance on 9 benchmarks (6 from EEMBC), all executed in hardware.
Measure area cost by equating the silicon area of all resources used, reported in units of equivalent LEs.
13 Cache Design Space – Performance (Wall Clock Time)
The best cache design almost doubles the performance of the original VESPA.
Clock frequency stays in a narrow range across designs (122, 123, 126, 129 MHz); more pipelining/retiming could reduce the clock frequency penalty.
Cache line size is more important than cache depth (lots of streaming).
14 Cache Design Space – Area
[Diagram: a 64B (512-bit) cache line is striped 16 bits wide across M4K block RAMs (4096 bits each); 32 M4Ks => 16KB of storage. MRAM blocks shown for comparison.]
System area almost doubled in the worst case.
15 Cache Design Space – Area
a) Choose the cache depth to fill the block RAMs needed for the line size (see the sketch below).
b) Don't use MRAMs: they are big, few, and overkill.
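The block-RAM arithmetic behind point (a) can be sketched in C. The numbers assume Stratix M4K blocks of 4096 bits used in a 16-bit-wide configuration, as in the previous slide's diagram; treat it as a worked example, not a synthesis result.

```c
#include <stdio.h>

#define M4K_BITS  4096   /* capacity of one Stratix M4K block RAM          */
#define M4K_WIDTH   16   /* data width used per M4K (as in the diagram)    */

int main(void) {
    int line_bytes = 64;                        /* 64B = 512-bit cache line  */
    int line_bits  = line_bytes * 8;

    int m4ks     = line_bits / M4K_WIDTH;       /* M4Ks striped across a line */
    int depth    = M4K_BITS / M4K_WIDTH;        /* lines stored per M4K       */
    int capacity = m4ks * M4K_BITS / 8;         /* bytes if every M4K is full */

    printf("%dB line -> %d M4Ks, depth %d lines, %d KB to fill them\n",
           line_bytes, m4ks, depth, capacity / 1024);
    return 0;
}
```

For a 64B line this reports 32 M4Ks, a depth of 256 lines, and 16KB of storage – the "32 => 16KB" figure from the area slide.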
16 Hardware Prefetching Example
[Diagram: with no prefetching, each vld.w that misses pays the 9-cycle DDR access penalty. With prefetching of 3 blocks, the first vld.w misses (9-cycle penalty) and also brings in the following blocks, so the next vld.w hits.]
17 Hardware Data Prefetching
Advantages: little area overhead; memory fetching is parallelized with computation; uses the full memory bandwidth.
Disadvantage: cache pollution.
We use sequential prefetching triggered on: a) any miss, or b) a sequential vector instruction miss.
We measure performance/area using a 16KB dcache with 64B lines.
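Below is a minimal C model (not VESPA's RTL) of the sequential prefetching policy described above: on a miss, fetch the missing line plus K following lines. The direct-mapped cache structure, array names, and the streaming access pattern are all illustrative assumptions.

```c
#include <stdio.h>
#include <string.h>

#define LINE_BYTES 64
#define NUM_LINES  256              /* models a 16KB direct-mapped dcache */

static unsigned tags[NUM_LINES];
static int      valid[NUM_LINES];
static int      misses, ddr_fetches;

/* Bring one line into the cache (models a DDR burst). */
static void fill_line(unsigned addr) {
    unsigned line = addr / LINE_BYTES;
    tags[line % NUM_LINES]  = line;
    valid[line % NUM_LINES] = 1;
    ddr_fetches++;
}

/* Access with sequential prefetching: on a miss, fetch the line plus K more. */
static void access_with_prefetch(unsigned addr, int K) {
    unsigned line = addr / LINE_BYTES;
    if (valid[line % NUM_LINES] && tags[line % NUM_LINES] == line)
        return;                                      /* hit                 */
    misses++;
    for (int k = 0; k <= K; k++)                     /* miss: fetch+prefetch */
        fill_line(addr + (unsigned)k * LINE_BYTES);
}

int main(void) {
    memset(valid, 0, sizeof valid);
    /* Stream through 4KB of data, one 32-bit word at a time, with K=3. */
    for (unsigned a = 0; a < 4096; a += 4)
        access_with_prefetch(a, 3);
    printf("misses=%d, ddr line fetches=%d\n", misses, ddr_fetches);
    return 0;
}
```

Streaming through 4KB with K=0 would take 64 misses; with K=3 the model takes 16, which is the kind of miss reduction the next slides quantify on real benchmarks.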
18 Prefetching K Blocks – Any Miss
Peak average speedup of 28%. Only half the benchmarks are significantly sped up (the rest are not receptive); the maximum speedup is 2.2x.
19 Prefetching Area Cost: Writeback Buffer
When prefetched lines evict dirty lines, there are two options: deny the prefetch, or buffer all the dirty lines in a writeback buffer.
The area cost is small: 1.6% of system area, mostly block RAMs with little logic, and no clock frequency impact.
[Diagram: prefetching 3 blocks on a vld.w miss (9-cycle penalty), with evicted dirty lines placed in the WB buffer.]
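A small C sketch of the "buffer all dirty lines" option: evicted dirty lines go into a FIFO that is later drained to DDR. The buffer depth, the names, and the fallback to denying a prefetch when the buffer is full are illustrative assumptions, not details from the talk.

```c
#include <stdio.h>

#define WB_DEPTH 8                      /* illustrative writeback-buffer depth */

typedef struct { unsigned addr; } DirtyLine;

static DirtyLine wb_buf[WB_DEPTH];
static int       wb_count;

/* Called when a prefetched line would evict a dirty victim: queue the dirty
   line for writeback, or deny the prefetch if the buffer has no room. */
static int prefetch_evicting_dirty(unsigned victim_addr) {
    if (wb_count == WB_DEPTH)
        return 0;                       /* buffer full: deny this prefetch    */
    wb_buf[wb_count++].addr = victim_addr;
    return 1;                           /* dirty line buffered; prefetch goes */
}

/* Drained to DDR in the background, overlapped with computation. */
static void drain_one_to_ddr(void) {
    if (wb_count > 0)
        printf("writeback of line 0x%x\n", wb_buf[--wb_count].addr);
}

int main(void) {
    for (unsigned a = 0; a < 4 * 64; a += 64)
        prefetch_evicting_dirty(a);
    while (wb_count) drain_one_to_ddr();
    return 0;
}
```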
20 Any Miss vs Sequential Vector Miss
The two trigger options perform the same (the curves are collinear) – nearly all misses in our benchmarks come from sequential vector memory instructions.
21 Vector Length Prefetching
Previously: a constant number of cache lines was prefetched.
Now: prefetch a multiple of the vector length, and only for sequential vector memory instructions (e.g. a vector load of 32 elements).
Guarantees <= 1 miss per vector memory instruction.
[Diagram: a vld.w over elements 0–31 triggers a fetch of the missing line plus a prefetch scaled by the multiplier k.]
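The prefetch-size calculation for vector length prefetching can be sketched as follows: on a miss by a sequential vector memory instruction, fetch enough lines to cover k times the current vector length starting at the faulting address. The function name, the fixed 64B line, and the element size are illustrative.

```c
#include <stdio.h>

#define LINE_BYTES 64

/* Lines to fetch on a miss so that a sequential vector memory instruction of
   `vl` elements (each `elem_bytes` wide), starting at `addr`, misses at most
   once: cover everything from the missing line to the end of k*vl elements. */
static unsigned vl_prefetch_lines(unsigned addr, unsigned vl,
                                  unsigned elem_bytes, unsigned k) {
    unsigned first_line = addr / LINE_BYTES;
    unsigned last_byte  = addr + k * vl * elem_bytes - 1;
    unsigned last_line  = last_byte / LINE_BYTES;
    return last_line - first_line + 1;
}

int main(void) {
    /* e.g. a 32-element, 32-bit sequential vector load */
    for (unsigned k = 1; k <= 8; k *= 2)
        printf("k=%u*VL -> fetch %u line(s) on a miss\n",
               k, vl_prefetch_lines(0x100, 32, 4, k));
    return 0;
}
```

Because the fetched span reaches at least to the end of the current instruction's data (k >= 1), that instruction can miss at most once – the guarantee stated on the slide.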
22 Vector Length Prefetching – Performance
1*VL prefetching provides a good speedup (21% on average) without tuning and without cache pollution.
8*VL is best, with a peak average speedup of 29% (up to 2.2x on receptive benchmarks; some benchmarks are not receptive).
23 Overall Memory System Performance
Moving from the original 4KB cache to the 16KB cache with wider lines plus prefetching reduces memory unit stall cycles significantly (from 67% to 31% of cycles) and eliminates all but 4% of miss cycles (down from 48%).
24 Improved Scalability
Previous: 3–8x speedup range, average of 5x for 16 lanes.
Now: 6–13x speedup range, average of 10x for 16 lanes.
25 Summary
Explored the cache design space: ~2x performance for ~2x system area (area growth due largely to the memory crossbar); widened the cache line size to 64B and the depth to 16KB.
Enhanced VESPA with hardware data prefetching: up to 2.2x performance, average of 28% for K=15.
Vector length prefetching gains 21% on average for 1*VL – good for mixed workloads, with no tuning and no cache pollution – and peaks at 8*VL with an average speedup of 29%.
Overall, improved the VESPA memory system and scalability: miss cycles decreased to 4%, memory unit stall cycles decreased to 31%.
26 Vector Memory Unit
[Diagram: for each lane i (i = 0 … L, where L = #lanes − 1), a MUX selects base + stride*i or index_i to form that lane's address; the addresses enter a memory request queue, and read/write crossbars route rddata_i / wrdata_i between the lanes and the Dcache, with a memory write queue on the store path. The diagram shows 4 memory lanes.]
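A software sketch of the per-lane address generation shown in the diagram: each lane's MUX selects either base + stride*i (constant-stride accesses) or an index (indexed/gather accesses), and the resulting addresses are pushed toward the memory request queue. The scaling by element size and the function names are illustrative assumptions.

```c
#include <stdio.h>

#define LANES 4                        /* "Memory Lanes = 4" in the diagram */

typedef enum { STRIDED, INDEXED } AccessMode;

/* One address per lane, as selected by the per-lane MUX in the diagram. */
static void generate_addresses(AccessMode mode, unsigned base, int stride,
                               const int *index, unsigned elem_bytes,
                               unsigned request_queue[LANES]) {
    for (int lane = 0; lane < LANES; lane++) {
        int offset = (mode == STRIDED) ? stride * lane : index[lane];
        request_queue[lane] = base + (unsigned)(offset * (int)elem_bytes);
    }
}

int main(void) {
    unsigned q[LANES];
    int idx[LANES] = { 7, 2, 9, 0 };

    generate_addresses(STRIDED, 0x1000, 1, NULL, 4, q);   /* unit-stride vld.w */
    for (int i = 0; i < LANES; i++) printf("lane %d: 0x%x\n", i, q[i]);

    generate_addresses(INDEXED, 0x2000, 0, idx, 4, q);    /* indexed (gather)  */
    for (int i = 0; i < LANES; i++) printf("lane %d: 0x%x\n", i, q[i]);
    return 0;
}
```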