Fine-Grain Performance Scaling of Soft Vector Processors
Peter Yiannacouras, Jonathan Rose, Gregory J. Steffan
ESWEEK – CASES 2009, Grenoble, France, Oct 13, 2009

2 FPGA Systems and Soft Processors
An FPGA digital system typically combines custom hardware (written in HDL and pushed through CAD: faster, smaller, less power, but weeks to months of effort) with a soft processor (programmed with software and a compiler: easier, and used in about 25% of FPGA designs [source: Altera, 2009]). A hard processor instead costs board space, latency, and power, or requires a specialized device at increased cost.
Goal: simplify FPGA design by customizing the configurable soft processor architecture so it can better compete with custom hardware for the computation.
Target: data-level parallelism → vector processors.

3 Vector Processing Primer

  // C code
  for (i = 0; i < 16; i++)
      c[i] = a[i] + b[i];

  // Vectorized code
  set    vl, 16
  vload  vr0, a
  vload  vr1, b
  vadd   vr2, vr0, vr1
  vstore vr2, c

Each vector instruction holds many units of independent operations: the vadd computes vr2[0]=vr0[0]+vr1[0], vr2[1]=vr0[1]+vr1[1], ..., vr2[15]=vr0[15]+vr1[15]. With 1 vector lane, these 16 element operations execute one at a time.

4 Vector Processing Primer (continued)
Same code as before, but with 16 vector lanes all 16 element operations of the vadd (vr2[0]=vr0[0]+vr1[0] through vr2[15]=vr0[15]+vr1[15]) execute simultaneously: a 16x speedup.
Previous work on soft vector processors [CASES'08]: 1. Scalability  2. Flexibility  3. Portability
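A minimal sketch of the first-order scaling model implied by these two slides (my own illustration, not code from the talk): a vector instruction over vl elements issues ceil(vl/lanes) element groups, so speedup grows with the lane count until it reaches the vector length.

  /* Illustrative lane-scaling model (assumptions mine, not the VESPA RTL). */
  #include <stdio.h>

  /* Cycles to execute one vector instruction of length vl on 'lanes' lanes:
   * one cycle per group of 'lanes' element operations. */
  static unsigned vector_op_cycles(unsigned vl, unsigned lanes) {
      return (vl + lanes - 1) / lanes;   /* ceil(vl / lanes) */
  }

  int main(void) {
      unsigned vl = 16;
      for (unsigned lanes = 1; lanes <= 16; lanes *= 2) {
          unsigned c = vector_op_cycles(vl, lanes);
          printf("lanes=%2u  cycles=%2u  speedup=%2ux\n",
                 lanes, c, vector_op_cycles(vl, 1) / c);
      }
      return 0;
  }

For vl = 16 this prints a 16x speedup at 16 lanes, matching the slide; measured speedups are lower once memory and control overheads are included.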

5 VESPA Architecture Design (Vector Extended Soft Processor Architecture)
Block diagram: a 3-stage scalar pipeline, a 3-stage vector control pipeline, and a 6-stage vector pipeline (decode, replicate, hazard check, vector register file read, execute in the lanes, writeback). The pipelines share the instruction cache and data cache; the 32-bit vector lanes share the Dcache, with lane 1 holding an ALU and the memory unit and lane 2 an ALU, memory port, and multiplier. VESPA supports integer and fixed-point operations [VIRAM].

6 In This Work
1. Evaluate for real using modern hardware: scale to 32 lanes (previous work did 16 lanes).
2. Add more fine-grain architectural parameters:
   a. Scale more finely: augment with parameterized vector chaining support.
   b. Customize to functional-unit demand: augment with heterogeneous lanes.
3. Explore a large design space.

7 Evaluation Infrastructure
Software side: EEMBC benchmarks with vectorized assembly subroutines are compiled with GCC, assembled and linked with GNU as/ld into a binary, then run both on an instruction set simulator and in RTL simulation (the two are checked against each other for verification) to obtain cycle counts.
Hardware side: the full Verilog design of the VESPA soft vector processor is pushed through FPGA CAD software, targeting a Stratix III 340 with DDR2 memory, to obtain area, power, and clock frequency.
Goal: evaluate soft vector processors with high accuracy.

8 VESPA Scalability
Speedup over one lane reaches up to 19x, with an average of 11x at 32 lanes → good scaling. Area grows with the lane count (normalized: 1, 1.3, 1.9, 3.2, 6.3, 12.3). The lane count is a powerful parameter, but it is coarse-grained.

9 Vector Lane Design Space
[Plot of the lane design space; area measured in equivalent ALMs, with 8% of the largest FPGA marked.] Scaling only by whole lanes is too coarse-grained; reprogrammability allows a more exact-fit architecture.

10 In This Work (outline, revisited)
1. Evaluate for real using modern hardware: scale to 32 lanes (previous work did 16 lanes).
2. Add more fine-grain architectural parameters:
   a. Scale more finely: augment with parameterized vector chaining support.  ← next
   b. Customize to functional-unit demand: augment with heterogeneous lanes.
3. Explore a large design space.

11 Vector Chaining
Vector chaining: simultaneous execution of independent element operations drawn from dependent vector instructions. Example:
  vadd vr10, vr1, vr2
  vmul vr20, vr10, vr11
The vmul depends on the vadd through vr10, but each of its element operations is independent and can start as soon as the corresponding element of the vadd has been produced.

12 Vector Chaining in VESPA
VESPA supports chaining by banking the vector register file. With a unified register file (B=1), only a single vector instruction executes at a time: no chaining. With two banks (B=2), two dependent instructions such as a vadd and a vmul can be in execution at once, each accessing a different bank in a given cycle (illustrated for Lanes=4, with ALU, memory, and multiply units). Performance increases if instructions are scheduled to exploit the banks.
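A rough cycle model of what banked chaining buys (my own sketch under simplifying assumptions: element groups interleave across banks, no bank conflicts, and the consumer trails its producer by one group):

  /* Illustrative chaining model; not the VESPA register-file RTL. */
  #include <stdio.h>

  static unsigned groups(unsigned mvl, unsigned lanes) {
      return (mvl + lanes - 1) / lanes;    /* element groups per vector instruction */
  }

  /* Two dependent instructions without chaining: strictly back to back. */
  static unsigned cycles_no_chaining(unsigned mvl, unsigned lanes) {
      return 2 * groups(mvl, lanes);
  }

  /* With chaining: the consumer starts one group behind the producer. */
  static unsigned cycles_chaining(unsigned mvl, unsigned lanes) {
      return groups(mvl, lanes) + 1;
  }

  int main(void) {
      unsigned mvl = 64, lanes = 4;
      printf("no chaining: %u cycles, with chaining: %u cycles\n",
             cycles_no_chaining(mvl, lanes), cycles_chaining(mvl, lanes));
      return 0;
  }

With long vectors a dependent pair approaches a 2x speedup in this idealized model; the 22-35% average measured on VESPA reflects real instruction mixes, scheduling, and bank conflicts.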

13 ALU Replication
With two banks but no ALU replication (B=2, APB=false), two ALU instructions such as a vadd and a vsub still execute one at a time, because they compete for the same ALUs even though the register file is banked. Replicating the ALUs per bank (B=2, APB=true) adds a second set of ALUs so both instructions execute simultaneously (illustrated for Lanes=4, alongside the memory and multiply units).

14 Vector Chaining Speedup (on an 8-lane VESPA)
[Plot: cycle speedup vs. no chaining across benchmarks, for configurations with more banks and more ALUs.]
Chaining can be quite costly in area: 27%-92%. The performance gain is application dependent: 5%-76%. On average it is still a significant improvement over no chaining (22-35%). Compared with doubling the lanes, these finer-grained options achieve 19-89% of the speed at 86% of the area.

15 In This Work (outline, revisited)
1. Evaluate for real using modern hardware: scale to 32 lanes (previous work did 16 lanes).
2. Add more fine-grain architectural parameters:
   a. Scale more finely: augment with parameterized vector chaining support.
   b. Customize to functional-unit demand: augment with heterogeneous lanes.  ← next
3. Explore a large design space.

16 Heterogeneous Lanes
Diagram: a vmul executing across 4 lanes (L=4), each lane shown with an ALU and a multiplier. The multiplier-lane parameter X sets how many lanes keep a multiplier; here X=2.

17 Heterogeneous Lanes
With 4 lanes (L=4) but only 2 multiplier lanes (X=2), a vmul stalls: all element operations must be funneled through the two remaining multipliers. Heterogeneous lanes save area, but reduce speed depending on the demand on the multiplier. (A first-order cost model is sketched below.)
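A minimal sketch of that cost (my own model, not from the paper): ALU instructions still use all L lanes, while multiply instructions are limited to the X lanes that kept a multiplier.

  /* Illustrative heterogeneous-lane cost model (assumptions mine). */
  #include <stdio.h>

  /* Cycles for one vector instruction of length vl given 'units' functional units. */
  static unsigned op_cycles(unsigned vl, unsigned units) {
      return (vl + units - 1) / units;   /* ceil(vl / units) */
  }

  int main(void) {
      unsigned vl = 64, L = 32, X = 16;  /* hypothetical 32-lane VESPA, half with multipliers */
      printf("vadd: %u cycles (uses all %u lanes)\n", op_cycles(vl, L), L);
      printf("vmul: %u cycles (limited to %u multiplier lanes)\n", op_cycles(vl, X), X);
      return 0;
  }

A benchmark with few multiplies sees almost no penalty (the 0% end of the measured range), while a multiply-heavy benchmark suffers more as X shrinks relative to L (toward the 85% end).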

18 Impact of Heterogeneous Lanes (on a 32-lane VESPA)
[Plot: per-benchmark impact, ranging from free through moderate to expensive.]
The performance penalty is application dependent: 0%-85%. The area savings are modest (6%-13%), since the multipliers are implemented in the FPGA's dedicated multiplier blocks.

19 In This Work (outline, revisited)
1. Evaluate for real using modern hardware: scale to 32 lanes (previous work did 16 lanes).
2. Add more fine-grain architectural parameters:
   a. Scale more finely: augment with parameterized vector chaining support.
   b. Customize to functional-unit demand: augment with heterogeneous lanes.
3. Explore a large design space.  ← next

20 Design Space Exploration using VESPA Architectural Parameters

  Description                  Symbol  Values
  Number of Lanes              L       1, 2, 4, 8, ...
  Memory Crossbar Lanes        M       1, 2, ..., L
  Multiplier Lanes             X       1, 2, ..., L
  Banks for Vector Chaining    B       1, 2, 4
  ALU Replicate Per Bank       APB     on/off
  Maximum Vector Length        MVL     2, 4, 8, ...
  Width of Lanes (in bits)     W       1-32
  Instruction Enable (each)    -       on/off
  Data Cache Capacity          DD      any
  Data Cache Line Size         DW      any
  Data Prefetch Size           DPK     < DD
  Vector Data Prefetch Size    DPV     < DD/MVL

The parameters span the compute architecture, the instruction set architecture, and the memory architecture.
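To make the parameter list concrete, here is a hypothetical configuration record (field names are mine, not the syntax of the VESPA generator):

  /* Hypothetical VESPA configuration record mirroring the table above. */
  struct vespa_config {
      unsigned lanes;            /* L                         */
      unsigned mem_xbar_lanes;   /* M   (1 .. L)              */
      unsigned mul_lanes;        /* X   (1 .. L)              */
      unsigned chain_banks;      /* B   (1, 2, 4)             */
      int      alu_per_bank;     /* APB (0 = off, 1 = on)     */
      unsigned max_vector_len;   /* MVL                       */
      unsigned lane_width_bits;  /* W   (1 .. 32)             */
      unsigned dcache_bytes;     /* DD                        */
      unsigned dcache_line;      /* DW                        */
      unsigned prefetch;         /* DPK (< DD)                */
      unsigned vector_prefetch;  /* DPV (< DD/MVL)            */
  };

  /* One made-up point in the design space. */
  static const struct vespa_config example = {
      .lanes = 8, .mem_xbar_lanes = 8, .mul_lanes = 4, .chain_banks = 2,
      .alu_per_bank = 1, .max_vector_len = 64, .lane_width_bits = 32,
      .dcache_bytes = 16384, .dcache_line = 64, .prefetch = 0, .vector_prefetch = 0,
  };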

21 VESPA Design Space (768 architectural configurations)
[Plot: normalized coprocessor area (roughly 1 to 64) vs. normalized wall clock time.]
The fine-grain design space allows a better-fit architecture: the configurations span a 28x range in area and an 18x range in performance, evidence of efficient trading of performance for area.
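Given measured (area, wall clock time) pairs for each configuration, a natural way to read such a plot is to extract the Pareto-optimal points. A small sketch of that filter (the data points below are made up, not VESPA measurements):

  /* Pareto filter over (area, time) design points; smaller is better in both. */
  #include <stdio.h>

  struct point { double area, time; };

  /* b dominates a if b is no worse in both dimensions and better in one. */
  static int dominates(struct point b, struct point a) {
      return b.area <= a.area && b.time <= a.time &&
             (b.area < a.area || b.time < a.time);
  }

  int main(void) {
      struct point cfg[] = { {1, 16}, {2, 9}, {3, 6}, {4, 5}, {4, 7}, {8, 3} };
      size_t n = sizeof cfg / sizeof cfg[0];
      for (size_t i = 0; i < n; i++) {
          int keep = 1;
          for (size_t j = 0; j < n && keep; j++)
              if (j != i && dominates(cfg[j], cfg[i])) keep = 0;
          if (keep) printf("Pareto point: area=%.0f time=%.0f\n", cfg[i].area, cfg[i].time);
      }
      return 0;
  }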

22 Summary
1. Evaluated VESPA on modern FPGA hardware: scales up to 32 lanes with an 11x average speedup.
2. Augmented VESPA with fine-tunable parameters:
   a. Vector chaining (by banking the register file): 22-35% better average performance than without; the impact of the chaining configuration is very application dependent.
   b. Heterogeneous lanes (lanes without multipliers): saves multipliers at some cost in performance (sometimes free).
3. Explored a vast architectural design space: an 18x range in performance and a 28x range in area.
Take-away: use software for non-critical data-parallel computation.

23 Thank You! VESPA release:

24 VESPA Parameters
(The same parameter table as slide 20: L, M, X, B, APB, MVL, W, per-instruction enables, DD, DW, DPK, DPV.)

25 VESPA Scalability
Speedup over one lane reaches up to 27x, with an average of 15x at 32 lanes → good scaling. Area grows with the lane count (normalized: 1, 1.3, 1.9, 3.2, 6.3, 12.3). The lane count is a powerful parameter, but too coarse-grained.

26 Proposed Soft Vector Processor System Design Flow
We propose adding vector extensions to existing soft processors. The digital system (soft processor plus custom HW, peripherals, and memory interface) runs user code together with portable, flexible, scalable vectorized software routines. If the soft processor is the bottleneck, increase the number of vector lanes. We want to evaluate soft vector processors for real.

27 Vector Memory Unit
Diagram (Lanes=4; L = #lanes - 1 in the labels): each lane computes its own address as base + stride*i or base + index_i, selected by a multiplexer, and the addresses feed a memory request queue. A read crossbar routes data returning from the Dcache/memory to the per-lane read ports (rddata0 .. rddataL); a write crossbar gathers the per-lane write data (wrdata0 .. wrdataL) into a memory write queue. (A sketch of the address generation follows.)
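A minimal sketch of the per-lane address generation implied by the diagram (unit-stride, strided, and indexed accesses; my own illustration, not the VESPA RTL):

  /* Illustrative per-lane address generation for a vector memory unit. */
  #include <stdio.h>
  #include <stdint.h>
  #include <stddef.h>

  enum vmem_mode { UNIT_STRIDE, STRIDED, INDEXED };

  /* Byte address issued by 'lane' for its element of a vector memory access. */
  static uintptr_t lane_address(uintptr_t base, ptrdiff_t stride,
                                const uint32_t *index, enum vmem_mode mode,
                                unsigned lane, size_t elem_bytes) {
      switch (mode) {
      case UNIT_STRIDE: return base + (uintptr_t)lane * elem_bytes;
      case STRIDED:     return base + (uintptr_t)((ptrdiff_t)lane * stride);
      case INDEXED:     return base + (uintptr_t)index[lane] * elem_bytes;
      }
      return base;
  }

  int main(void) {
      for (unsigned lane = 0; lane < 4; lane++)   /* Lanes = 4, as in the diagram */
          printf("lane %u -> 0x%lx\n", lane,
                 (unsigned long)lane_address(0x1000, 8, NULL, STRIDED, lane, 4));
      return 0;
  }

The crossbars in the diagram then reconcile these per-lane addresses with the cache line actually fetched.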

28 Overall Memory System Performance (16 lanes)
[Plot: memory cycle breakdown for 4KB and 16KB data caches, annotated with 67%, 48%, 31%, and 4%.]
A wider cache line plus prefetching reduces memory-unit stall cycles significantly, and eliminates all but 4% of the miss cycles.