VEGAS: Soft Vector Processor with Scratchpad Memory
Christopher Han-Yu Chou, Aaron Severance, Alex D. Brant, Zhiduo Liu, Saurabh Sant, Guy Lemieux
University of British Columbia

Motivation
Embedded processing on FPGAs
 High performance, computationally intensive
 Soft processors, e.g. Nios/MicroBlaze, are too slow
How to deliver high performance?
 Multiprocessor on FPGA
 Custom hardware accelerators (Verilog RTL)
 Synthesized accelerators (C to FPGA)

Motivation
Soft vector processors to the rescue
 Previous work has demonstrated the soft vector processor as a viable option that provides:
  Scalable performance and area
  Purely software-based acceleration
  Decoupled hardware/software development
Key performance bottlenecks
 Memory access latency
 On-chip data storage efficiency

Contribution
Key features of the VEGAS architecture
 Cacheless scratchpad memory
 Fracturable ALUs
 Concurrent memory access via DMA
Advantages
 Eliminates on-chip data replication
  Also: huge number of vectors, long vector lengths
 More parallel ALUs
 Fewer memory loads/stores

VEGAS Architecture
 Scalar core: 200 MHz
 DMA engine & external DDR2
 Vector core: 120 MHz
 Concurrent execution, FIFO-synchronized

Scratchpad Memory in Action
(Figure: vector scratchpad memory feeding Vector Lanes 0-3 with srcA, srcB, and Dest operands)

Scratchpad Memory in Action
(Figure: srcA and Dest regions within the scratchpad)

Scratchpad Advantage
Performance
 Huge working set (256 kB+)
 Explicitly managed by software
 Async load/store via concurrent DMA
Efficient data storage
 Double-clocked memory (traditional RF: 2x copies)
 8b data stays as 8b (traditional RF: 4x copies)
 No cache (traditional RF: +1 copy)

Scratchpad Advantage
Accessed by address register
 Huge number of vectors fit in the scratchpad
  VEGAS uses only 8 vector address registers (V0..V7)
  Modify their contents to access different vectors
  Auto-increment lessens the need to change V0..V7
 Long vector lengths
  Can fill the entire scratchpad
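A toy model of the address-register scheme above (hypothetical: the names, sizes, and C-level interface here are assumptions for illustration, not the VEGAS ISA) shows how modifying or auto-incrementing a single address register selects different vectors in the scratchpad without a large architectural register file.

```c
#include <stdint.h>

/* Toy model of scratchpad addressing. A vector is selected by the value
 * held in an address register; auto-increment moves the register past the
 * vector so the next access hits the following vector. */
enum { SCRATCH_WORDS = 1024, VL = 8 };   /* assumed scratchpad size, vector length */
static int32_t scratch[SCRATCH_WORDS];   /* models the scratchpad memory */

/* "Vector load" through address register *v0, with auto-increment. */
static void vload(int32_t dst[VL], uint32_t *v0) {
    for (int i = 0; i < VL; i++)
        dst[i] = scratch[*v0 + i];
    *v0 += VL;                           /* auto-increment: point at next vector */
}
```

The same load routine reads successive vectors simply because the register advances, which is what makes the compare-and-swap loops on the next slide expressible without unrolling.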

Scratchpad Advantage: Median Filter
Vector address registers: easier than unrolling
Traditional vector median filter:
  For J = 0 .. 24
    For I = J .. 24
      V1 = vector[i]   (vector load)
      V2 = vector[j]   (vector load)
      CompareAndSwap( V1, V2 )
      vector[j] = V2   (vector store)
      vector[i] = V1   (vector store)
Optimize away 1 vector load + 1 vector store using a temp
 Total of 222 loads and 222 stores
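The compare-and-swap loop above can be sketched in scalar C (a hypothetical model, not VEGAS code). Here each v[i] is a single value; on VEGAS each would be a whole vector register holding the i-th window element of many pixels, so one pass of the same loop structure medians many pixels at once.

```c
/* Scalar sketch of the compare-and-swap median over a 5x5 window. */
static void compare_and_swap(int *a, int *b) {
    if (*a > *b) { int t = *a; *a = *b; *b = t; }
}

/* Sort the 25 window elements; the middle element is the median. */
static int median25(int v[25]) {
    for (int j = 0; j < 25; j++)          /* outer loop: J = 0 .. 24 */
        for (int i = j; i < 25; i++)      /* inner loop: I = J .. 24 */
            compare_and_swap(&v[j], &v[i]);
    return v[12];                         /* middle of 25 sorted elements */
}
```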

Scratchpad Advantage: Median Filter
(Figure: median filter performance chart)

Fracturable ALUs
 Multiplier: uses 4 x 16b multipliers
 Multiplier also does shifts + rotates
 Adder: uses 4 x 8b adders

Fracturable ALUs Advantage
Increased processing power
 4-lane VEGAS
  4 x 32b operations / cycle
  8 x 16b operations / cycle
  16 x 8b operations / cycle
 Median filter example
  32b data: 184 cycles / pixel
  16b data: 93 cycles / pixel
  8b data: 47 cycles / pixel
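The idea behind a fracturable adder can be illustrated with a SWAR-style sketch (an analogy only, not the VEGAS RTL): one 32b add performs four independent 8b adds by masking off the top bit of each byte so that no carry crosses a lane boundary, then restoring the top bits with XOR.

```c
#include <stdint.h>

/* Four independent 8b adds carried out on one 32b datapath. */
static uint32_t add4x8(uint32_t a, uint32_t b) {
    const uint32_t mask = 0x7F7F7F7Fu;
    uint32_t low = (a & mask) + (b & mask);   /* 7-bit partial sums, no cross-byte carry */
    uint32_t msb = (a ^ b) & ~mask;           /* carry-less add of each byte's top bit */
    return low ^ msb;
}
```

A hardware fracturable adder does the equivalent by gating the carry chain at byte boundaries, which is why the same datapath can deliver 4x the 8b throughput.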

Area and Frequency
(Table: VEGAS resource usage by number of lanes: ALMs, DSPs, M9Ks, and Fmax)

ALM Usage
(Figure: ALM usage chart)

Performance
(Table: speedup of VEGAS V1 and V32 over Nios II/f on the fir, motest, median, autocor, conven, imgblend, and filt3x3 benchmarks)

Area-Delay Product
Area x delay measures "throughput per mm²"
 Compared to earlier vector processors, VEGAS offers 2-3x better throughput per unit area

Integer Matrix Multiply
 4096 x 4096 integers (64 MB data set)
Intel Core 2 (65nm), 2.5 GHz, 16 GB DDR2
 Vanilla IJK: 474 seconds
 Vanilla KIJ: 134 s
 Tiled IJK: 93 s
 Tiled KIJ: 68 s
VEGAS (65nm Altera Stratix III)
 Vector: 44 s (Nios only: 5407 s)
 256 kB scratchpad, 32 lanes (about 50% of the chip)
 200 MHz Nios, 100 MHz vector, 1 GB DDR2 SODIMM
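The IJK and KIJ loop orders compared above can be sketched as follows (sizes shrunk from the slide's 4096 x 4096 for illustration). IJK's inner loop walks a column of B with a large stride; KIJ's inner loop walks a row of B with unit stride, which is far friendlier both to caches and to streaming DMA, and is why KIJ is so much faster at the same operation count.

```c
#define N 64   /* shrunk from 4096 for illustration */

static void mm_ijk(int A[N][N], int B[N][N], int C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int sum = 0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];     /* strided column walk of B */
            C[i][j] = sum;
        }
}

static void mm_kij(int A[N][N], int B[N][N], int C[N][N]) {
    /* C must be zero on entry */
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) {
            int a = A[i][k];
            for (int j = 0; j < N; j++)
                C[i][j] += a * B[k][j];       /* unit-stride row walk of B */
        }
}
```

Tiling (the "Tiled" rows on the slide) further blocks these loops so each tile of B stays resident in fast memory, the same role the VEGAS scratchpad plays explicitly.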

Conclusions
Vector processor
 Purely software-based acceleration
  No hardware design / RTL recompile needed, just program
 Faster chip design
  Can build the vector processor before software algorithms are finalized
  Simple programming model
Maps well to FPGA
 Many small memories, multiplier blocks
 Should map well to ASIC

Conclusions
Key features
 Scratchpad memory
  Enhances performance with fewer loads/stores
  No on-chip data replication; efficient storage
  Double-clocked to hide memory latency
 Fracturable ALUs
  Operate on 8b, 16b, and 32b data efficiently
  A single vector core accelerates many applications
Results
 2-3x better area-delay product than VIPERS/VESPA
 Outperforms Intel Core 2 at integer matrix multiply

Issues / Future Work
No floating-point yet
 Adding "complex function" support, to include floating-point or similar operations
Algorithms with only short vectors
 Split the vector processor into 2, 4, or 8 pieces
 Run multiple instances of the algorithm
Multiple vector processors
 Connecting them to work cooperatively
 Goals: increase throughput, exploit task-level parallelism (i.e., chaining or pipelining)