Improving Memory System Performance for Soft Vector Processors


Improving Memory System Performance for Soft Vector Processors
Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose
WoSPS – Oct 26, 2008

Soft Processors in FPGA Systems
[Diagram: soft processor (programmed with C + compiler) alongside custom logic (built with HDL + CAD)]
Soft processors are easier to use; custom logic is faster, smaller, and uses less power. Soft processors are also configurable – how can we make use of this? Data-level parallelism → soft vector processors.

Vector Processing Primer

// C code
for (i = 0; i < 16; i++)
    b[i] += a[i];

// Vectorized code
set    vl, 16
vload  vr0, b
vload  vr1, a
vadd   vr0, vr0, vr1
vstore vr0, b

With 1 vector lane, the vadd works through b[0]+=a[0] up to b[15]+=a[15] one element at a time. Each vector instruction holds many units of independent operations.

Vector Processing Primer (cont'd)

With 16 vector lanes, the same vadd performs all 16 element operations, b[0]+=a[0] through b[15]+=a[15], in parallel: a 16x speedup. (A strip-mining sketch for general loop lengths follows below.)
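The primer assumes the trip count exactly matches the vector length. As an illustration not taken from the slides, here is how the same loop would typically be strip-mined in C for an arbitrary length n, assuming a maximum vector length of 16:

    #include <stdio.h>

    #define MVL 16  /* assumed maximum vector length (16 lanes) */

    /* Strip-mined b[i] += a[i] for arbitrary n: each outer pass
     * mirrors one "set vl / vload / vadd / vstore" sequence. */
    void vadd_stripmined(int *b, const int *a, int n) {
        for (int i = 0; i < n; i += MVL) {
            int vl = (n - i < MVL) ? (n - i) : MVL;  /* set vl */
            for (int j = 0; j < vl; j++)             /* in hardware, all */
                b[i + j] += a[i + j];                /* lanes run in parallel */
        }
    }

    int main(void) {
        int a[20], b[20];
        for (int i = 0; i < 20; i++) { a[i] = i; b[i] = 100; }
        vadd_stripmined(b, a, 20);
        printf("%d %d\n", b[0], b[19]);  /* prints: 100 119 */
        return 0;
    }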

Sub-Linear Scalability: speedup grows more slowly than the lane count because the vector lanes are not being fully utilized.

Where Are The Cycles Spent? With 16 lanes, 67% (2/3) of cycles are spent waiting on the memory unit, often due to cache misses.

Our Goals
- Improve the memory system: better cache design, hardware prefetching.
- Evaluate improvements for real: using a complete hardware design (in Verilog), on real FPGA hardware (Stratix 1S80C6), running full benchmarks (EEMBC), from off-chip memory (DDR-133MHz).

Current Infrastructure
[Diagram: software and hardware flows]
SOFTWARE: EEMBC C benchmarks compiled with GCC and ld into an ELF binary, plus vectorized assembly subroutines assembled with GNU as; cycle counts verified against the MINT instruction set simulator.
HARDWARE: Verilog for the scalar μP plus the VPU (decode, replicate, hazard check, vector register files, ALU with multiply & saturate and saturating round shift, memory unit); simulated in Modelsim (RTL simulator) for cycle verification, and synthesized with Altera Quartus II v8.0 for area and frequency.

VESPA Architecture Design
[Diagram: scalar, vector control, and vector pipelines sharing the instruction stream and the data cache]
- Scalar pipeline: 3 stages (Icache fetch, decode, ALU), with a shared Dcache.
- Vector control pipeline: 3 stages (decode, VC/VS register files, writeback).
- Vector pipeline: 6 stages (decode, replicate, hazard check, VR register file, per-lane ALU with multiply & saturate and saturating round shift, writeback); 32-bit datapaths.
Supports integer and fixed-point operations, and predication.

Memory System Design (original)
[Diagram: a vld.w (load of 16 contiguous 32-bit words) on a 16-lane VESPA: scalar core + vector coprocessor lanes 1-16 → vector memory crossbar → Dcache (4KB, 16B line) → DDR (9-cycle access)]
With a 16B cache line, the 16 lanes' 64 bytes of data require four cache accesses per vector load.

Memory System Design (improved)
[Diagram: the same 16-lane system, but with a 16KB Dcache and 64B lines]
A 64B line supplies all 16 lanes in a single access: 4x reduced cache accesses, plus some prefetching, since a wide line fetches data that adjacent accesses will need. (A sketch of this arithmetic follows below.)
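To see where the 4x comes from, a small sketch (my arithmetic, assuming aligned, unit-stride accesses) of the cache accesses needed by a 16-lane vld.w for a given line size:

    #include <stdio.h>

    /* Cache lines touched by an aligned, unit-stride vector load of
     * `lanes` 32-bit words: one cache access per line touched. */
    int cache_accesses(int lanes, int line_bytes) {
        int bytes = lanes * 4;
        return (bytes + line_bytes - 1) / line_bytes;  /* ceiling */
    }

    int main(void) {
        printf("16B line: %d accesses\n", cache_accesses(16, 16));  /* 4 */
        printf("64B line: %d accesses\n", cache_accesses(16, 64));  /* 1 */
        return 0;
    }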

Improving Cache Design
- Vary the cache depth & cache line size using a parameterized design (address-split sketch below):
  - Cache line size: 16, 32, 64, 128 bytes
  - Cache depth: 4, 8, 16, 32, 64 KB
- Measure performance on 9 benchmarks (6 from EEMBC), all executed in hardware.
- Measure area cost: equate silicon area of all resources used, reported in units of equivalent LEs.
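As a sketch of what the parameterization changes, here is how line size and depth determine the offset/index/tag split of a 32-bit address (assuming a direct-mapped cache; the slides do not state the associativity):

    #include <stdio.h>

    static int log2i(unsigned x) { int n = 0; while (x >>= 1) n++; return n; }

    /* Address split for a direct-mapped cache with power-of-two
     * capacity and line size. */
    void addr_split(unsigned cache_bytes, unsigned line_bytes) {
        int offset_bits = log2i(line_bytes);
        int index_bits  = log2i(cache_bytes / line_bytes);
        int tag_bits    = 32 - index_bits - offset_bits;
        printf("%2uKB cache, %3uB line: offset=%d index=%d tag=%d\n",
               cache_bytes >> 10, line_bytes, offset_bits, index_bits, tag_bits);
    }

    int main(void) {
        addr_split( 4 * 1024, 16);  /* original VESPA: 4KB, 16B line */
        addr_split(16 * 1024, 64);  /* chosen design: 16KB, 64B line */
        return 0;
    }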

Cache Design Space – Performance (Wall Clock Time)
[Plot: wall clock time across line sizes and depths; clock frequency ranges from 122MHz to 129MHz across configurations]
- The best cache design almost doubles the performance of the original VESPA.
- Cache line size matters more than cache depth (lots of streaming).
- More pipelining/retiming could reduce the clock frequency penalty.

Cache Design Space – Area
[Diagram: a 64B (512-bit) cache line striped across M4K block RAMs, each 16 bits wide and 4096 bits in total; 32 M4Ks => 16KB of storage; MRAM shown for comparison]
System area almost doubled in the worst case.

Cache Design Space – Area (cont'd)
Guidelines: a) choose the depth that fills the block RAMs already needed for the line size (worked sketch below); b) don't use MRAMs: they are big, few, and overkill.
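A worked sketch of guideline (a), using the M4K geometry from the previous slide (4096 bits per block, configured 16 bits wide): the line width fixes the number of M4Ks, and filling their depth gives the cache capacity for free.

    #include <stdio.h>

    /* Each M4K holds 4096 bits; at 16 bits wide it is 256 entries deep.
     * A line is striped across line_bits/16 blocks, so those blocks
     * already provide storage for 256 lines. */
    void m4k_usage(int line_bytes) {
        int num_m4k = (line_bytes * 8) / 16;      /* blocks side by side */
        int free_kb = num_m4k * 4096 / 8 / 1024;  /* capacity they give */
        printf("%3dB line: %2d M4Ks -> fill them as a %dKB cache\n",
               line_bytes, num_m4k, free_kb);
    }

    int main(void) {
        m4k_usage(64);  /* 32 M4Ks => 16KB of storage, as on the slide */
        return 0;
    }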

Hardware Prefetching Example
[Diagram: without prefetching, each vld.w misses in the Dcache and pays the 9-cycle DDR penalty; with prefetching of 3 blocks, the first miss also fetches the following blocks, so subsequent vld.w instructions hit]

Hardware Data Prefetching
Advantages: little area overhead; parallelizes memory fetching with computation; uses the full memory bandwidth.
Disadvantages: cache pollution.
We use sequential prefetching triggered on: a) any miss, or b) a sequential vector instruction miss. We measure performance/area using a 64B-line, 16KB dcache. (A simulator-style sketch of trigger (a) follows below.)
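A minimal, simulator-style sketch of trigger (a) — sequential prefetching on any miss — with a toy direct-mapped cache model (the model and constants are illustrative; only the policy follows the slide):

    #include <stdbool.h>
    #include <stdio.h>

    #define LINE_BYTES 64
    #define NUM_LINES  256   /* 16KB / 64B, the dcache used here */
    #define K 3              /* blocks prefetched per trigger */

    /* Toy direct-mapped cache model. */
    static unsigned tags[NUM_LINES];
    static bool     valid[NUM_LINES];

    static bool lookup(unsigned line) {
        return valid[line % NUM_LINES] && tags[line % NUM_LINES] == line;
    }
    static void fill(unsigned line) {
        valid[line % NUM_LINES] = true;
        tags[line % NUM_LINES]  = line;
    }

    /* On a miss, fetch the missed line (which would pay the 9-cycle
     * DDR penalty) plus the next K sequential lines. Trigger (b)
     * would first check that the miss came from a sequential vector
     * memory instruction. */
    static int access_with_prefetch(unsigned addr) {
        unsigned line = addr / LINE_BYTES;
        if (lookup(line)) return 0;   /* hit */
        fill(line);                   /* demand fetch */
        for (int i = 1; i <= K; i++)
            fill(line + i);           /* prefetch next K blocks */
        return 1;                     /* one miss */
    }

    int main(void) {
        int misses = 0;
        for (unsigned a = 0; a < 4096; a += 4)  /* streaming word accesses */
            misses += access_with_prefetch(a);
        printf("misses: %d of %d accesses\n", misses, 4096 / 4);  /* 16 of 1024 */
        return 0;
    }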

Prefetching K Blocks – Any Miss
[Plot: speedup vs K per benchmark; peak average speedup 28%; some benchmarks not receptive]
Only half the benchmarks are significantly sped up, with a maximum of 2.2x and an average of 28%.

Prefetching Area Cost: Writeback Buffer
[Diagram: a vld.w miss prefetching 3 blocks can evict dirty lines, which drain through a writeback buffer ahead of the 9-cycle DDR access]
Two options when a prefetch would evict dirty lines: deny the prefetch, or buffer all dirty lines. The area cost is small: 1.6% of system area, mostly block RAMs, little logic, and no clock frequency impact.

Any Miss vs Sequential Vector Miss
[Plot: the two trigger policies are collinear] Nearly all misses in our benchmarks come from sequential vector instructions, so both triggers perform the same.

Vector Length Prefetching
Previously, a constant number of cache lines was prefetched. Now we prefetch a multiple of the vector length (k*VL), and only for sequential vector memory instructions. E.g., on a miss by a vector load of 32 elements, the memory unit fetches the missed line and prefetches k*VL elements' worth of lines. This guarantees <= 1 miss per vector memory instruction. (A fetch-size sketch follows below.)
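A sketch of the fetch-size arithmetic this implies (assuming unit stride and 32-bit elements; k is the tunable multiple explored on the next slide):

    #include <stdio.h>

    #define LINE_BYTES 64

    /* On a sequential vector miss, fetch k * VL elements' worth of
     * cache lines; k = 1 already covers the whole vector, which is
     * what guarantees <= 1 miss per vector memory instruction. */
    int lines_to_fetch(int vl, int elem_bytes, int k) {
        return (k * vl * elem_bytes + LINE_BYTES - 1) / LINE_BYTES;  /* ceiling */
    }

    int main(void) {
        printf("k=1: %2d lines\n", lines_to_fetch(32, 4, 1));  /* 2  */
        printf("k=8: %2d lines\n", lines_to_fetch(32, 4, 8));  /* 16 */
        return 0;
    }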

Vector Length Prefetching – Performance
[Plot: speedup vs prefetch multiple k; peak average 29%; maximum 2.2x; some benchmarks not receptive; no cache pollution]
1*VL prefetching provides a good speedup (21% on average) without tuning; 8*VL is best (29% average).

Overall Memory System Performance
[Plot: memory unit stall cycles fall from 67% (original 4KB cache) to 48% (16KB, 64B-line cache) to 31% (with prefetching)]
The wider line + prefetching reduce memory unit stall cycles significantly and eliminate all but 4% of miss cycles.

Improved Scalability
Previous: 3-8x speedup range, average of 5x for 16 lanes. Now: 6-13x range, average of 10x for 16 lanes.

Summary
- Explored the cache design space: ~2x performance for ~2x system area (area growth due largely to the memory crossbar); widened the cache line to 64B and the depth to 16KB.
- Enhanced VESPA with hardware data prefetching: up to 2.2x performance, average of 28% for K=15.
- Vector length prefetcher: gains 21% on average for 1*VL (good for mixed workloads, no tuning, no cache pollution); peaks at 8*VL with a 29% average speedup.
- Overall, improved the VESPA memory system & scalability: decreased miss cycles to 4% and memory unit stall cycles to 31%.

Vector Memory Unit (backup)
[Diagram: per-lane address generation feeding a memory request queue, with Memory Lanes = 4; for each lane up to L = #Lanes - 1, a MUX selects base + stride*lane or base + index of that lane; read data rddata0..rddataL returns through a read crossbar, write data wrdata0..wrdataL passes through a write crossbar into a memory write queue ahead of the Dcache] (Behavioral sketch below.)
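A behavioral C sketch of the per-lane address generation shown in the diagram (names are illustrative; the per-lane MUX selects between strided and indexed addressing):

    #include <stdbool.h>
    #include <stdio.h>

    #define LANES 4  /* Memory Lanes = 4, as in the diagram */

    /* Each lane l computes base + stride*l (strided access) or
     * base + index[l] (indexed access), selected by a per-lane MUX;
     * the resulting addresses feed the memory request queue. */
    void gen_addresses(unsigned base, int stride, const int *index,
                       bool indexed, unsigned addr[LANES]) {
        for (int l = 0; l < LANES; l++)
            addr[l] = indexed ? base + index[l] : base + stride * l;
    }

    int main(void) {
        unsigned addr[LANES];
        gen_addresses(0x1000, 4, NULL, false, addr);  /* unit-stride words */
        for (int l = 0; l < LANES; l++)
            printf("lane %d: 0x%x\n", l, addr[l]);
        return 0;
    }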