

VESPA: Portable, Scalable, and Flexible FPGA-Based Vector Processors
Peter Yiannacouras, Univ. of Toronto
J. Gregory Steffan, Univ. of Toronto
Jonathan Rose, Univ. of Toronto

2 Soft Processors in FPGA Systems
Custom hardware (HDL + CAD): faster, smaller, less power
Soft processor (C + compiler): easier
Data-level parallelism → soft vector processors
Soft processors are configurable: how can we make use of this?

3 Vector Processing Primer
// C code
    for (i = 0; i < 16; i++)
        b[i] += a[i];
// Vectorized code
    set    vl,16
    vload  vr0,b
    vload  vr1,a
    vadd   vr0,vr0,vr1
    vstore vr0,b
Each vector instruction holds many units of independent operations: the vadd covers b[0]+=a[0] through b[15]+=a[15].
1 Vector Lane: the 16 element operations are processed one at a time.

4 Vector Processing Primer (continued)
Same C loop and vectorized code as on the previous slide.
Each vector instruction holds many units of independent operations.
16 Vector Lanes: all 16 element operations of the vadd (b[0]+=a[0] through b[15]+=a[15]) are processed in parallel: 16x speedup.
Soft vector processors are: 1) Portable 2) Flexible 3) Scalable
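
The lane model in the primer can be sketched in C as follows. This is a minimal illustration, not VESPA code: the function names and the `lanes` parameter are hypothetical, each lane is assumed to handle one element per cycle, and memory and pipeline effects are ignored, so the 16x figure is the ideal case.

```c
#include <assert.h>
#include <stdint.h>

/* Ideal cycle count for one vector instruction: each lane retires one
 * element per cycle, so vl elements need ceil(vl / lanes) cycles.
 * Assumes lanes > 0. */
unsigned vector_cycles(unsigned vl, unsigned lanes)
{
    return (vl + lanes - 1) / lanes;
}

/* Software model of "vadd vr0,vr0,vr1" (b[i] += a[i]).
 * Elements are processed in groups of `lanes`; the return value is the
 * number of groups, i.e. the ideal cycle count. Assumes lanes > 0. */
unsigned vadd_model(int32_t *b, const int32_t *a, unsigned vl, unsigned lanes)
{
    unsigned cycles = 0;
    for (unsigned start = 0; start < vl; start += lanes) {
        unsigned end = (start + lanes < vl) ? start + lanes : vl;
        for (unsigned i = start; i < end; i++)
            b[i] += a[i];   /* these element operations are independent */
        cycles++;
    }
    return cycles;
}
```

With vl = 16, the model takes 16 groups on 1 lane and 1 group on 16 lanes, which is where the ideal 16x comes from. The software (the vadd itself) is the same in both cases, which is the portability point of the slides.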

5 Soft Vector Processor Benefits
1. Portable
    SW: Agnostic to HW implementation, e.g. number of lanes
    HW: Can be implemented on any FPGA architecture
2. Flexible
    Many parameters to tune (by end-user, not vendor), e.g. number of lanes, width of lanes, etc.
3. Scalable
    SW: Applies to any code with data-level parallelism
    HW: Number of lanes can grow with capacity of device
    Parallelism can scale with Moore’s law
How would this fit in with the current FPGA design flow?

6 Conventional FPGA Design Flow
[Figure: system with a soft processor running software routines, a memory interface, peripherals, and custom accelerators]
Is the soft processor the bottleneck? If yes, find the hot code and replace it with a custom accelerator.
Three options:
    1) Manual hardware design
    2) Acquire RTL IP-core
    3) High level synthesis, e.g. Altera C2H: push button, but code dependent

7 Proposed Soft Vector Processor System Design Flow
[Figure: system with a soft processor extended by Vector Lanes 1 to 4, a memory interface, peripherals, a custom accelerator, and user code made of vectorized software routines]
We propose adding vector extensions to existing soft processors.
Is the soft processor the bottleneck? If yes, increase lanes.
Vector lanes: portable, flexible, scalable
Vectorized software routines: portable, easy to use

8 Our Goals
1. Evaluate soft vector processing for real:
    Using a complete hardware design (in Verilog)
    On real FPGA hardware (Stratix 1S80C6)
    Running full benchmarks (EEMBC)
    From off-chip memory (DDR-133MHz)
2. Quantify performance/area tradeoffs
    Across different vector processor configurations
3. Explore application-specific customizations
    Reduce generality of soft vector processors

9 Current Infrastructure
SOFTWARE: EEMBC C benchmarks (GCC) and vectorized assembly subroutines (GNU as + vector support) are linked (ld) into an ELF binary, which runs on instruction set simulation for verification.
HARDWARE: Verilog for the scalar μP (SPREE) + vpu (vector support, manually designed coprocessor) goes to RTL simulation (cycles, verification) and to CAD software (area, frequency), and runs on the TM4.
[Figure: scalar μP + vpu block diagram]
VESPA: Vector Extended Soft Processor Architecture

10 VESPA Architecture Design
[Figure: pipeline block diagram]
Scalar pipeline, 3-stage: Icache, Decode, RF, ALU, MUX, WB
Vector control pipeline, 3-stage: Decode, VC RF, VS RF, Logic, VC WB, VS WB
Vector pipeline, 6-stage: Decode, Replicate, Hazard check, then per lane VR RF, ALU, multiply & saturate, Rshift, Saturate, MUX, VR WB, plus a Mem Unit
Supports integer and fixed-point operations, and predication
32-bit datapaths
Shared Dcache

11 Experiment #1: Vector Lane Exploration
Vary the number of vector lanes implemented
    Using parameterized vector core
Measure speedup on 6 EEMBC benchmarks
    Directly on Stratix I 1S80C6 clocked at 50 MHz
    Was designed for Stratix III, runs at 135 MHz
    Using 32KB direct-mapped level 1 cache
    DDR 133MHz => 10 cycle miss penalty
Measure area cost
    Equate silicon area of all resources used
    Report in units of Equivalent LEs

12 Performance Scaling Across Vector Lanes
[Chart: cycle speedup normalized to 1 lane, per benchmark and lane count]
Good scaling: average of 1.85x for 2 lanes, up to 6.3x for 16 lanes
Scaling past 16 lanes is limited by the number of multipliers in the Stratix 1S80
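
To put the two averages in perspective, the speedup can be divided by the lane count to get a parallel efficiency. The helper below is a small sketch with a hypothetical name; the only inputs taken from the slide are 1.85x at 2 lanes and 6.3x at 16 lanes.

```c
#include <assert.h>

/* Parallel efficiency: measured speedup divided by the ideal speedup.
 * When speedup is normalized to 1 lane, the ideal equals the lane count.
 * Assumes lanes > 0. */
double lane_efficiency(double speedup, unsigned lanes)
{
    return speedup / (double)lanes;
}
```

With the slide's figures this gives 1.85 / 2 = 0.925 (about 93%) for 2 lanes and 6.3 / 16 ≈ 0.39 (about 39%) for 16 lanes, so speedup keeps rising with lanes while per-lane efficiency drops.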

13 Design Characteristics on Stratix III
Device: 3S200C2
[Table columns: Lanes, Clock Frequency (MHz), Logic Used (ALMs), Multipliers Used (18-bit DSPs), Block RAMs Used (M9Ks)]
Clock frequency steady … until 64 lanes
ALMs grow by 570 ALMs/lane
DSPs grow by 4(1+L)
Block RAMs unaffected … until 32 lanes, when port width dominates
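
The two growth rules can be written as simple estimators. This is a sketch under stated assumptions: it reads the slide's expression 4(1+L) as the total number of 18-bit DSP elements for L lanes, and it treats the lane-independent ALM cost as an unknown `base_alms` input because the slide gives only the per-lane increment. Function names are hypothetical.

```c
#include <assert.h>

/* Assumed reading of "DSPs grow by 4(1+L)": total 18-bit DSP elements
 * used by a configuration with `lanes` vector lanes. */
unsigned dsp_estimate(unsigned lanes)
{
    return 4u * (1u + lanes);
}

/* ALM estimate from the slide's 570 ALMs/lane increment.
 * base_alms (everything that does not scale with lanes) is not given
 * on the slide and must be supplied by the caller. */
unsigned long alm_estimate(unsigned long base_alms, unsigned lanes)
{
    return base_alms + 570ul * lanes;
}
```

Under these assumptions, going from 16 to 32 lanes adds 570 * 16 = 9120 ALMs and 4 * 16 = 64 DSP elements.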

14 Application-Specific Vector Processing
Customize to the application if:
    1. It is the only application that will run, OR
    2. The FPGA can be reconfigured between runs
Observations: not all applications
    1. Operate on 32-bit data types
    2. Use the entire vector instruction set
Eliminate unused hardware (reduce area)
    Reduce cost (buy smaller FPGA)
    Re-invest area savings into more lanes
    Speed up clock (nets span shorter distances)

15 Opportunity for Customization
Benchmark    Largest Data Type Size    Percentage of Vector ISA Used
autcor       4 bytes                   9.6%
conven       1 byte                    5.9%
fbital       2 bytes                   14.1%
viterb       2 bytes                   13.3%
rgbcmyk      1 byte                    5.9%
rgbyiq       2 bytes                   8.1%
Data type size: from 0% reduction up to 75% reduction
Vector ISA: <15% utilization
Lots of opportunity to customize width & ISA support
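
The "0% reduction up to 75% reduction" range follows from comparing each benchmark's largest data type against the 32-bit (4-byte) datapath. A minimal sketch, with a hypothetical function name; it computes the reduction in lane width only, not the resulting area saving.

```c
#include <assert.h>

/* Percentage by which the lane width can shrink relative to the
 * default 4-byte (32-bit) datapath, given the largest data type the
 * benchmark uses. Assumes 1 <= largest_bytes <= 4. */
unsigned width_reduction_percent(unsigned largest_bytes)
{
    return (4u - largest_bytes) * 100u / 4u;
}
```

For the table's values: autcor (4 bytes) gives 0%, the 2-byte benchmarks give 50%, and conven and rgbcmyk (1 byte) give 75%.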

16 Customizing the Vector Processor
Parameterized core can very easily change:
    L - Number of vector lanes
    W - Bit-width of the vector lanes
    M - Size of memory crossbar
    MVL - Maximum vector length
Instruction set automatically subsetted
    Each vector instruction individually enabled/disabled
    Control logic & datapath hardware automatically removed
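
One way to picture this parameter set is as a configuration record plus a per-instruction enable mask. The sketch below is hypothetical throughout: the struct, the five-opcode list, and the function names are illustrative stand-ins, not VESPA's actual Verilog parameters or instruction set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative opcode list only; the real vector ISA is larger. */
enum vop { OP_VADD, OP_VMUL, OP_VSHIFT, OP_VLOAD, OP_VSTORE, OP_COUNT };

/* Hypothetical mirror of the four core parameters named on the slide. */
struct vespa_config {
    unsigned lanes;      /* L: number of vector lanes */
    unsigned lane_width; /* W: bit-width of each lane */
    unsigned mem_xbar;   /* M: size of the memory crossbar */
    unsigned mvl;        /* MVL: maximum vector length */
    uint64_t enabled;    /* one enable bit per instruction */
};

/* Build the enable mask from the opcodes an application actually uses. */
uint64_t subset_isa(const enum vop *used, unsigned n)
{
    uint64_t mask = 0;
    for (unsigned i = 0; i < n; i++)
        mask |= (uint64_t)1 << used[i];
    return mask;
}

/* True if hardware for this instruction would be kept in the design. */
bool op_enabled(const struct vespa_config *cfg, enum vop op)
{
    return ((cfg->enabled >> op) & 1u) != 0;
}
```

An application that only adds, loads, and stores would set three bits; every instruction whose bit is clear is a candidate for having its control logic and datapath removed.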

17 Experiment #2: Reducing Area by Reducing Vector Width
[Chart: normalized vector coprocessor area vs. largest data type size (in bytes); labeled points: 54%, 38%]
Up to 54% of vector coprocessor area eliminated
Savings increase with more lanes => better scalability

18 Experiment #3: Reducing Area by Subsetting Instruction Set
[Chart: normalized vector coprocessor area; labeled points: 55%, 46%]
Up to 55% of VPU area eliminated, 46% on average
Again, savings increase with more lanes => better scalability

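19 Experiment #4: Combined Width Reduction and Instruction Set Subsetting
[Chart: normalized vector coprocessor area; labeled points: 61%, 70%]
Performance scaling (seen previously) at almost 1/3 the area cost (a 70% reduction leaves 30% of the area)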

20 Re-Invest Area Savings into Lanes (Improved VESPA)
[Chart: speedup; labeled points: 9.3x, 11.5x]
Area savings can be converted into better performance

21 Summary
Evaluated soft vector processors
    Real hardware, memory, and benchmarks
Observed significant performance scaling
    Average of 6.3x with 16 lanes
    Further scaling possible on newer devices
Explored measures to reduce area cost
    Reducing vector width
    Reducing supported instruction set
    Combining width and instruction set reduction: 61% area reduction on average, up to 70%
Soft vector processors provide a portable, flexible, and scalable framework for exploiting data-level parallelism that is easier to use than designing custom FPGA hardware

22 Future Work
Improve scalability bottlenecks
    Memory system
Evaluate scaling past 16 lanes
    Port to platform with newer FPGA
Compare against hardware
    What do we pay for simpler design?

23 Performance Impact of Cache Size
Measure impact of cache size on 16 lane VPU
[Chart: benchmarks annotated "Streaming"]
Streaming => prefetching could be fruitful

24 Combined Width Reduction and Instruction Set Subsetting
Close to 70% area reduction

25 Performance vs Scalar (C) Code
[Table: columns 1 Lane, 2 Lanes, 4 Lanes, 8 Lanes, 16 Lanes; rows autcor, conven, fbital, viterb, rgbcmyk, rgbyiq, GEOMEAN]

26 Vector Memory Unit
[Figure: vector memory unit block diagram]
Address generation: for each lane i from 0 to L, a MUX selects stride*i or index i, which is added to base (L = # Lanes - 1)
Reads: Memory Request Queue, Dcache, Read Crossbar, then rddata0 … rddataL
Writes: wrdata0 … wrdataL, Write Crossbar, then Memory Write Queue
Memory Lanes = 4
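
The per-lane address generation in the diagram can be sketched as a single function. This is an illustration under assumptions: the function name is hypothetical, offsets are taken to be in the same unit as the base address, and a NULL index array stands in for the MUX selecting the strided path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address requested by one lane: base plus either stride*lane
 * (strided access) or a per-lane index (indexed access), mirroring the
 * MUX and adder shown for each lane in the diagram.
 * Pass index == NULL to select the strided path. */
uint32_t lane_address(uint32_t base, uint32_t stride,
                      const uint32_t *index, unsigned lane)
{
    uint32_t offset = (index != NULL) ? index[lane] : stride * lane;
    return base + offset;
}
```

For example, with base 0x1000 and stride 4, lane 3 requests 0x100C; with an index array {8, 0, 16}, lane 2 requests 0x1010.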