1 VENICE A Soft Vector Processor Aaron Severance Advised by Prof. Guy Lemieux Zhiduo Liu, Chris Chou, Jason Yu, Alex Brant, Maxime Perreault, Chris Eagleston.

Slides:

Advertisements

Similar presentations

Vector Processing as a Soft-core CPU Accelerator Jason Yu, Guy Lemieux, Chris Eagleston {jasony, lemieux, University of British Columbia.

Advertisements

DSPs Vs General Purpose Microprocessors

Lecture 4 Introduction to Digital Signal Processors (DSPs) Dr. Konstantinos Tatas.

Slides Prepared from the CI-Tutor Courses at NCSA By S. Masoud Sadjadi School of Computing and Information Sciences Florida.

VEGAS: A Soft Vector Processor Aaron Severance Some slides from Prof. Guy Lemieux and Chris Chou 1.

TIE Extensions for Cryptographic Acceleration Charles-Henri Gros Alan Keefer Ankur Singla.

Parallell Processing Systems1 Chapter 4 Vector Processors.

VEGAS: Soft Vector Processor with Scratchpad Memory Christopher Han-Yu Chou Aaron Severance, Alex D. Brant, Zhiduo Liu, Saurabh Sant, Guy Lemieux University.

08/31/2001Copyright CECS & The Spark Project SPARK High Level Synthesis System Sumit GuptaTimothy KamMichael KishinevskyShai Rotem Nick SavoiuNikil DuttRajesh.

 Understanding the Sources of Inefficiency in General-Purpose Chips.

Reconfigurable Computing: What, Why, and Implications for Design Automation André DeHon and John Wawrzynek June 23, 1999 BRASS Project University of California.

PipeRench: A Coprocessor for Streaming Multimedia Acceleration Seth Goldstein, Herman Schmit et al. Carnegie Mellon University.

Advanced Processor Architectures for Embedded Systems Witawas Srisa-an CSCE 496: Embedded Systems Design and Implementation.

Instruction Level Parallelism (ILP) Colin Stevens.

Computational Astrophysics: Methodology 1.Identify astrophysical problem 2.Write down corresponding equations 3.Identify numerical algorithm 4.Find a computer.

1 EECS Components and Design Techniques for Digital Systems Lec 21 – RTL Design Optimization 11/16/2004 David Culler Electrical Engineering and Computer.

State Machines Timing Computer Bus Computer Performance Instruction Set Architectures RISC / CISC Machines.

Dynamic Hardware Software Partitioning A First Approach Komal Kasat Nalini Kumar Gaurav Chitroda.

Reduced Instruction Set Computers (RISC) Computer Organization and Architecture.

GallagherP188/MAPLD20041 Accelerating DSP Algorithms Using FPGAs Sean Gallagher DSP Specialist Xilinx Inc.

Octavo: An FPGA-Centric Processor Architecture Charles Eric LaForest J. Gregory Steffan ECE, University of Toronto FPGA 2012, February 24.

Study of AES Encryption/Decription Optimizations Nathan Windels.

Digital Signal Processors for Real-Time Embedded Systems By Jeremy Kohel.

Development in hardware – Why? Option: array of custom processing nodes Step 1: analyze the application and extract the component tasks Step 2: design.

Basics and Architectures

RICE UNIVERSITY Implementing the Viterbi algorithm on programmable processors Sridhar Rajagopal Elec 696

Titan: Large and Complex Benchmarks in Academic CAD

Chapter 1 Computer System Overview Dave Bremer Otago Polytechnic, N.Z. ©2008, Prentice Hall Operating Systems: Internals and Design Principles, 6/E William.

Embedded Supercomputing in FPGAs

Coarse and Fine Grain Programmable Overlay Architectures for FPGAs

Softcore Vector Processor Team ASP Brandon Harris Arpith Jacob.

COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.

A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT Nikolaos Vassiliadis N. Kavvadias, G. Theodoridis, S. Nikolaidis Section.

Automated Design of Custom Architecture Tulika Mitra

Data Parallel FPGA Workloads: Software Versus Hardware Peter Yiannacouras J. Gregory Steffan Jonathan Rose FPL 2009.

SPREE RTL Generator RTL Simulator RTL CAD Flow 3. Area 4. Frequency 5. Power Correctness1. 2. Cycle count SPREE Benchmarks Verilog Results 3. Architecture.

1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.

Introduction to FPGA Created & Presented By Ali Masoudi For Advanced Digital Communication Lab (ADC-Lab) At Isfahan University Of technology (IUT) Department.

ARM for Wireless Applications ARM11 Microarchitecture On the ARMv6 Connie Wang.

1 TPUTCACHE: HIGH-FREQUENCY, MULTI-WAY CACHE FOR HIGH- THROUGHPUT FPGA APPLICATIONS Aaron Severance University of British Columbia Advised by Guy Lemieux.

Maximizing Speed and Density of Tiled FPGA Overlays via Partitioning Charles Eric LaForest J. Gregory Steffan University of Toronto ICFPT 2013.

A few issues on the design of future multicores André Seznec IRISA/INRIA.

DSP Architectures Additional Slides Professor S. Srinivasan Electrical Engineering Department I.I.T.-Madras, Chennai –

© 2010 Altera Corporation - Public Lutiac – Small Soft Processors for Small Programs David Galloway and David Lewis November 18, 2010.

Copyright © 2004, Dillon Engineering Inc. All Rights Reserved. An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs  Architecture optimized.

Exploiting Parallelism

An Improved “Soft” eFPGA Design and Implementation Strategy

WARP PROCESSORS ROMAN LYSECKY GREG STITT FRANK VAHID Presented by: Xin Guan Mar. 17, 2010.

Soft Vector Processors with Streaming Pipelines Aaron Severance Joe Edwards Hossein Omidian Guy G. F. Lemieux.

CML Path Selection based Branching for CGRAs ShriHari RajendranRadhika Thesis Committee : Prof. Aviral Shrivastava (Chair) Prof. Jennifer Blain Christen.

Transactional Memory Coherence and Consistency Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu,

C.E. Goutis V.I.Kelefouras University of Patras Department of Electrical and Computer Engineering VLSI lab Date: 20/11/2015 Compilers for Embedded Systems.

W AVEFRONT S KIPPING USING BRAM S FOR C ONDITIONAL A LGORITHMS ON V ECTOR P ROCESSORS Aaron Severance Joe Edwards Guy G.F. Lemieux.

Vector computers.

PipeliningPipelining Computer Architecture (Fall 2006)

Buffering Techniques Greg Stitt ECE Department University of Florida.

Presenter: Darshika G. Perera Assistant Professor

Nios II Processor: Memory Organization and Access

Performance of Single-cycle Design

Embedded Systems Design

Christopher Han-Yu Chou Supervisor: Dr. Guy Lemieux

FPGAs in AWS and First Use Cases, Kees Vissers

Lecture 5: GPU Compute Architecture

Anne Pratoomtong ECE734, Spring2002

Lecture 5: GPU Compute Architecture for the last time

VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Encoder

Multivector and SIMD Computers

The performance requirements for DSP applications continue to grow and the traditional solutions do not adequately address this new challenge Paradigm.

Samuel Larsen Saman Amarasinghe Laboratory for Computer Science

Chapter 12 Pipelining and RISC

Presentation transcript:

1 VENICE A Soft Vector Processor Aaron Severance Advised by Prof. Guy Lemieux Zhiduo Liu, Chris Chou, Jason Yu, Alex Brant, Maxime Perreault, Chris Eagleston The University of British Columbia

Motivation FPGAs for embedded processing Exist in many embedded systems already Glue logic, high speed I/O’s Ideally do data processing on-chip Reduced power, footprint, bill of materials Several goals to balance Performance (must meet, no(?) bonus for exceeding) FPGA resource usage; more -> larger FPGA needed Human resource requirements & time (cost/time to market) Ease of programming, debugging Flexibility & reusability Both short (run-time) term and long term 2

Some Options: Custom accelerator High performance, low device usage Time-consuming to design and debug 1 hardware accelerator per function Hard processor Fixed number (1 currently) of relatively simple processors Fixed balance for all applications Data parallelism (SIMD) ILP, MLP, speculation, etc. High level synthesis Overlay architectures 3

HLS vs. Overlays High level synthesis (bottom up) Build from sequential code and make faster Only infer needed functions Run synthesis (minutes to hours) after algorithm change Overlay architectures (top down) Build from everything, prune unneeded paths later Difficult to prune to bare minimum of resources Recompile program (seconds) after algorithm change 4

Programming Model Both HLS and Overlays have restricted or flexible options Restricted Fixed starting point (e.g. C with vector intrinsics) Code can be rewritten to fit the model? Yes: Good performance No: Give up? (run on soft/hard processor) Flexible Arbitrary starting point (e.g. sequential C code) Code can be (is) rewritten to fit possible architectures? Yes: Good performance No: Poor performance 5

Our Focus Overlay architecture Start with full data parallel engine Prune unneeded functions later Fast design cycles (productivity) Compile software, don’t re-run synthesis/place & route Flexible, reusable Could use HLS techniques on top of it… Restricted programming model Forces programmer to write ‘good’ code Can be inferred from flexible (C code) where appropriate 6

Overlay Options: Soft processor: limited performance Single issue, in-order 2 or 4-way superscalar/VLIW register file maps inefficiently to FPGA Expensive to implement CAMs for OoOE Multiprocessor-on-FPGA: complexity Parallel programming and debugging Area overhead for interconnect Cache coherence, memory consistency Soft vector processor: balance For common embedded multimedia applications Easily scalable data parallelism; just add more lanes (ALUs) 7

Soft Vector Processor Change algorithm  same RTL, just recompile software Simple programming model Data-level parallelism, exposed to the programmer One hardware accelerator supports many applications Scalable performance and area Write once, run anywhere… Small FPGA: 1 ALU (smaller than Nios II/f) Large FPGA: 10’s to 100’s of parallel ALUs Parameterizable; remove unused functions to save area 8

Vector Processing Organize data as long vectors Replace inner loop with vector instruction Hardware loops until all elements are processed May execute repeatedly (sequentially) May process multiple at once (in parallel) 9 for( i=0; i<N; i++ ) a[i] = b[i] * c[i] set vl, N vmult a, b, c

Hybrid Vector-SIMD 10 for( i=0; i<8; i++ ) { C[i] = A[i] + B[i] E[i] = C[i] * D[i] } C E C E

Previous SVP Work 1 st Gen: VIPERS (UBC) & VESPA (UofT) Similar to Cray-1 / VIRAM Vector data in registers Load/store for memory accesses 2 nd Gen: VEGAS (UBC) Optimized for FPGAs Vector data in flat scratchpad Addresses in scratchpad stored in separate register file Concurrent DMA engine 11

VEGAS (2 nd Gen) Architecture Scalar Core: NiosII/f DMA engine: to/from external memory Vector Core: VEGAS Concurrent execution: in order dispatch, explicit syncing 12

VEGAS (2 nd Gen): Efficient On-Chip Memory Use Scratchpad-based “register file” (2kB..2MB) Vector address register file stores vector locations Very long vectors (no maximum length) Any number of vectors (no set number of “vector data registers”) Double-pumped to achieve 4 ports 2 Read, 1 Write, 1 DMA Read/Write Lanes support sub-word SIMD 32-bit ALU configurable as 1x32, 2x16, 4x8 Keeps data packed in scratchpad for larger working set 13

VENICE Architecture 3rd Generation Optimized for Embedded Multimedia Applications High Frequency, Low Area 14

VENICE Overview Vector Extensions to NIOS Implemented Compactly and Elegantly Continues from VEGAS (2 nd Gen) Scratchpad memory Asynchronous DMA transactions Sub-word SIMD (1x32, 2x16, 4x8 bit operations) Optimized for smaller implementations VEGAS achieves best performance/area at 4-8 lanes Vector programs don’t scale indefinitely Communications networks scale > O(N) VENICE targets 1-4 lanes About 50%.. 75% of the size of VEGAS 15

VENICE Architecture 16

Selected Contributions In-pipeline vector alignment network No need for separate move/element shift instructions Increased performance for sliding windows Ease of programming/compiling (data packing) Only shift/rotate vector elements Scales O( N * log(N) ) Acceptable to use multiple networks for few (1 to 4) lanes 17

Scratchpad Alignment 18

Selected Contributions In-pipeline vector alignment network 2D/3D vector operations Set # of repeated ops, increment on sources and dest Increased instruction dispatch rate for multimedia Replaces increment/window of address registers (2 nd Gen) Had to be implemented in registers instead of BRAM Modified 3 registers per cycle 19

VENICE Programming FIR using 2D Vector Ops int num_taps, num_samples; int16_t *v_output, *v_coeffs, *v_input; // Set up 2D vector parameters: vector_set_vl( num_taps ); // inner loop count vector_set_2D( num_samples, // outer loop count 1*sizeof(int16_t), // dest gap (-num_taps )*sizeof(int16_t), // srcA gap (-num_taps+1)*sizeof(int16_t) ); // srcB gap // Execute instruction; does 2D loop over entire input, multiplying and accumulating vector_acc_2D( VVH, VMULLO, v_output, v_coeffs, v_input ); 20

Area Breakdown VENICE: Lower control & overall area ICN (alignment) scales faster but does not dominate 21

Selected Contributions In-pipeline vector alignment network 2D/3D vector operations Single flag/condition code per instruction Previous SVPs had separate flag registers/scratchpad Instead, encode a flag bit with each byte/halfword/word Can be stored in 9 th bit of 9/18/36 bit wide BRAMs 22

Selected Contributions In-pipeline vector alignment network 2D/3D vector operations Single flag/condition code per instruction More efficient hybrid multiplier implementation To support 1x32-bit, 2x16-bit, 4x8-bit multiplies 23

Fracturable Multipliers in Stratix III/IV DSPs 2 DSP Blocks + extra logic2 DSP Blocks (no extra logic) 24

Selected Contributions In-pipeline vector alignment network 2D/3D vector operations Single flag/condition code per instruction More efficient hybrid multiplier implementation Increased frequency 200MHz+ vs ~125MHz of previous SVPs Deeper pipelining Optimized circuit design 25

Average Speedup vs. ALMs 26

Computational Density 27

Speeding up the 4x4 DCT (16-bit) 28

Future Work Scatter/gather functionality Vector indexed memory accesses Need to coalesce to get good bandwidth utilization Automatic pruning of unneeded overlay functionality HLS in reverse Hybrid HLS  overlays Start from overlay and synthesize in extra functionality? Overlay + custom instructions? 29

Conclusions Soft Vector Processors Scalable performance No hardware recompiling necessary VENICE Optimized for FPGAs, 1 to 4 lanes 5X Performance/Area of Nios II/f on multimedia benchmarks 30