Presentation transcript:

Improving Memory System Performance for Soft Vector Processors Peter Yiannacouras J. Gregory Steffan Jonathan Rose WoSPS – Oct 26, 2008

2 Soft Processors in FPGA Systems
Soft processor (programmed with C + compiler): easier. Custom logic (built with HDL + CAD): faster, smaller, less power.
Data-level parallelism → soft vector processors.
Soft processors are configurable – how can we make use of this?

3 Vector Processing Primer
// C code
for (i = 0; i < 16; i++)
    b[i] += a[i];
// Vectorized code
set vl, 16
vload vr0, b
vload vr1, a
vadd vr0, vr0, vr1
vstore vr0, b
Each vector instruction encodes many independent element operations (b[0]+=a[0] … b[15]+=a[15]). With 1 vector lane, the vadd issues these 16 operations one at a time.

4 Vector Processing Primer
Same code as above, now with 16 vector lanes: all 16 element operations (b[0]+=a[0] … b[15]+=a[15]) execute in parallel → 16x speedup.

5 Sub-Linear Scalability
Speedup scales sub-linearly with lane count: the vector lanes are not being fully utilized.

6 Where Are The Cycles Spent?
With 16 lanes, 2/3 (67%) of cycles are spent waiting on the memory unit, often due to cache misses.

7 Our Goals
1. Improve the memory system: better cache design, hardware prefetching.
2. Evaluate the improvements for real: using a complete hardware design (in Verilog), on real FPGA hardware (Stratix 1S80C6), running full benchmarks (EEMBC), from off-chip memory (DDR-133MHz).

8 Current Infrastructure
[Toolflow diagram]
SOFTWARE: EEMBC C benchmarks plus vectorized assembly subroutines → GCC and GNU as (with vector support) → ELF binary (linked with ld) → MINT instruction set simulator (scalar µP + VPU), used for verification against the hardware.
HARDWARE: Verilog → Modelsim (RTL simulator) for cycle counts; Altera Quartus II v8.0 for area and frequency.

9 VESPA Architecture Design
[Block diagram] A 3-stage scalar pipeline, a 3-stage vector control pipeline, and a 6-stage vector pipeline, with the Dcache shared between the scalar and vector units. Supports integer and fixed-point operations, and predication. 32-bit datapaths.

10 Memory System Design
[Block diagram] The scalar pipeline and the 16-lane vector coprocessor share a 4KB Dcache with 16B lines through a vector memory crossbar; the DDR memory behind the cache has a 9-cycle access latency. A vld.w loads 16 contiguous 32-bit words.

11 Memory System Design
Same system with a 16KB Dcache and 64B lines: a 16-word (64B) vld.w now touches one cache line instead of four, a 4x reduction in cache accesses, plus some implicit prefetching of neighbouring data within the wider line.
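As a back-of-the-envelope check (a sketch, not from the slides), the access count per aligned unit-stride vector load is just a ceiling divide; accesses_per_vld is a hypothetical helper:

/* Cache accesses needed by one aligned unit-stride vector load
   (illustrative; assumes the load starts on a line boundary). */
#include <stdio.h>

static unsigned accesses_per_vld(unsigned lanes, unsigned elem_bytes,
                                 unsigned line_bytes)
{
    unsigned vector_bytes = lanes * elem_bytes;          /* 16 lanes * 4B = 64B */
    return (vector_bytes + line_bytes - 1) / line_bytes; /* ceiling divide */
}

int main(void)
{
    printf("16B lines: %u accesses\n", accesses_per_vld(16, 4, 16)); /* prints 4 */
    printf("64B lines: %u accesses\n", accesses_per_vld(16, 4, 64)); /* prints 1 */
    return 0;
}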

12 Improving Cache Design
Vary the cache depth and cache line size using a parameterized design: line sizes of 16, 32, 64, and 128 bytes; depths of 4, 8, 16, 32, and 64 KB. Measure performance on 9 benchmarks (6 from EEMBC), all executed in hardware. Measure area cost by equating the silicon area of all resources used, reported in units of equivalent LEs.
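For illustration, a minimal sketch of how depth and line size determine the address slicing, assuming a direct-mapped cache; log2u and cache_index are hypothetical names, not VESPA's RTL:

#include <stdint.h>

static unsigned log2u(unsigned x)   /* x must be a power of two */
{
    unsigned n = 0;
    while (x >>= 1) n++;
    return n;
}

static uint32_t cache_index(uint32_t addr, unsigned depth_bytes,
                            unsigned line_bytes)
{
    unsigned offset_bits = log2u(line_bytes);               /* e.g. 6 for 64B lines  */
    unsigned index_bits  = log2u(depth_bytes / line_bytes); /* e.g. 8 for 16KB depth */
    return (addr >> offset_bits) & ((1u << index_bits) - 1);
}

For a 16KB cache with 64B lines this yields 256 sets, so 8 index bits and 6 offset bits, with the remaining address bits forming the tag.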

13 Cache Design Space – Performance (Wall Clock Time)
[Plot] The best cache design almost doubles the performance of the original VESPA. Clock frequency falls modestly across the larger configurations (129MHz down to 122MHz); more pipelining/retiming could reduce this penalty. Cache line size matters more than cache depth, since the benchmarks do a lot of streaming.

14 Cache Design Space – Area
[Diagram] A 64B (512-bit) cache line is built from M4K block RAMs, each configured 16 bits wide with 4096 bits of storage; 32 M4Ks are needed in parallel, giving a minimum of 16KB of storage. System area almost doubled in the worst case.

15 Cache Design Space – Area
Lessons: a) choose the cache depth to fill the block RAMs already needed for the line size; b) don't use MRAMs: they are big, few in number, and overkill here.
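The sizing arithmetic behind slide 14 can be checked directly, assuming M4Ks configured 16 bits wide with 4096 data bits each, as drawn:

#include <stdio.h>

int main(void)
{
    unsigned line_bits = 64 * 8;                  /* 64B line = 512 bits   */
    unsigned m4k_width = 16;                      /* output width per M4K  */
    unsigned m4k_bits  = 4096;                    /* data bits per M4K     */
    unsigned num_m4ks  = line_bits / m4k_width;   /* 512/16 = 32 blocks    */
    unsigned min_bytes = num_m4ks * m4k_bits / 8; /* 32*4096/8 = 16384     */
    printf("%u M4Ks => %u KB minimum storage\n", num_m4ks, min_bytes / 1024);
    return 0;
}

This is why a 64B line implies at least 16KB of storage: any shallower cache would leave the 32 required block RAMs partly empty.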

16 Hardware Prefetching Example
[Diagram] Without prefetching, each missing vld.w pays the 9-cycle DDR penalty. With prefetching of 3 blocks, the first vld.w misses (9-cycle penalty) but also brings in the following blocks, so the next vld.w hits.

17 Hardware Data Prefetching
Advantages: little area overhead; parallelizes memory fetching with computation; uses the full memory bandwidth.
Disadvantage: cache pollution.
We use sequential prefetching triggered on: a) any miss, or b) a miss by a sequential vector memory instruction. We measure performance/area using a 16KB dcache with 64B lines.
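A minimal sketch of the sequential policy, assuming a demand fetch followed by K sequential line prefetches; fetch_line and handle_miss are hypothetical names, not VESPA's actual interface:

#include <stdint.h>

extern void fetch_line(uint32_t line_addr);  /* hypothetical: request one line from DDR */

void handle_miss(uint32_t miss_addr, unsigned line_bytes, unsigned K)
{
    /* Align to the start of the missing line, demand-fetch it, then
       prefetch the next K sequential lines so later accesses hit. */
    uint32_t line = miss_addr & ~(uint32_t)(line_bytes - 1);
    fetch_line(line);
    for (unsigned k = 1; k <= K; k++)
        fetch_line(line + k * line_bytes);
}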

18 Prefetching K Blocks – Any Miss
[Plot] Only half the benchmarks are significantly sped up; the rest are not receptive. Maximum speedup is 2.2x, with a peak average speedup of 28%.

19 Prefetching Area Cost: Writeback Buffer
[Diagram] When a prefetch of 3 blocks would displace dirty lines, there are two options: deny the prefetch, or buffer all the dirty lines in a writeback buffer. The area cost is small: 1.6% of system area, mostly block RAMs with little logic, and no clock frequency impact.

20 Any Miss vs Sequential Vector Miss
[Plot] The two trigger policies are collinear: nearly all misses in our benchmarks come from sequential vector memory instructions.

21 Vector Length Prefetching
Previously a constant number of cache lines was prefetched. Now prefetch a multiple of the current vector length, and only for sequential vector memory instructions. E.g., for a vector load of 32 elements, fetch the missing line and prefetch the lines covering the remaining elements (scaled by a factor k). This guarantees at most 1 miss per vector memory instruction.
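A sketch of the prefetch-amount computation, assuming k*VL elements are covered on a miss; lines_to_fetch is a hypothetical helper:

unsigned lines_to_fetch(unsigned vl, unsigned elem_bytes,
                        unsigned line_bytes, unsigned k)
{
    unsigned bytes = k * vl * elem_bytes;          /* e.g. 1 * 32 * 4 = 128B  */
    return (bytes + line_bytes - 1) / line_bytes;  /* 128B / 64B -> 2 lines   */
}

With k=1, a 32-element word load against 64B lines covers 2 lines, so one demand fetch plus one prefetched line is enough to ensure at most one miss for the whole instruction.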

22 Vector Length Prefetching – Performance
[Plot] 1*VL prefetching provides a good speedup (21% on average) without tuning and with no cache pollution; 8*VL is best, with a peak average speedup of 29% and a maximum of 2.2x. Some benchmarks remain unreceptive.

23 Overall Memory System Performance
[Plot] Widening the cache line (4KB cache → 16KB cache) reduces memory unit stall cycles from 67% to 48%; adding prefetching reduces them further to 31% and eliminates all but 4% of the miss cycles.

24 Improved Scalability
Previously: a 3-8x speedup range, averaging 5x for 16 lanes. Now: a 6-13x range, averaging 10x for 16 lanes.

25 Summary
Explored the cache design space: ~2x performance for ~2x system area, with the area growth due largely to the memory crossbar; widened the cache line size to 64B and the depth to 16KB.
Enhanced VESPA with hardware data prefetching: up to 2.2x performance, average of 28% for K=15.
The vector length prefetcher gains 21% on average for 1*VL – good for mixed workloads, with no tuning and no cache pollution; it peaks at 8*VL with an average speedup of 29%.
Overall, improved the VESPA memory system and scalability: miss cycles decreased to 4%, memory unit stall cycles to 31%.

26 Vector Memory Unit
[Block diagram, drawn with 4 memory lanes] Per-lane address generation (base + stride*i, or index_i for indexed accesses, for i = 0 … L where L = #lanes − 1) feeds through per-lane MUXes into a memory request queue ahead of the Dcache. A read crossbar routes rddata0 … rddataL from the Dcache back to the lanes; a write crossbar collects wrdata0 … wrdataL into a memory write queue.