Embedded Supercomputing in FPGAs

Presentation transcript:

Embedded Supercomputing in FPGAs with the VectorBlox MXP Matrix Processor
Aaron Severance, UBC, VectorBlox Computing
Prof. Guy Lemieux, UBC, CEO VectorBlox Computing
http://www.vectorblox.com

Typical Usage and Motivation
- Embedded processing
  - FPGAs often control custom devices: imaging, audio, radio, screens
  - Heavy data processing requirements
- FPGA tools for data processing
  - VHDL too difficult to learn and use
  - C-to-hardware tools too “VHDL-like”
  - FPGA-based CPUs (Nios/MicroBlaze) too slow
- Complications
  - Very slow recompiles of FPGA bitstream
  - Device control circuits may have sensitive timing requirements

FPGAs are used today in many embedded tasks, such as signal processing and multimedia. Soft processor systems are becoming more common because they simplify development: according to a 2007 survey by Embedded.com, 36% of respondents use a soft processor in their FPGA designs. Nios/MB are highly optimized, but still too slow for heavy data processing.

© 2012 VectorBlox Computing Inc.

A New Tool
MXP™ Matrix Processor
- Performance
  - 100x – 1000x over Nios II/f, MicroBlaze
- Easy to use, pure software
  - Just C, no VHDL/Verilog!
  - No FPGA recompilation for each algorithm change
  - No bitstream changes
  - Save time (FPGA place+route can take hours, run out of space, etc.)
- Correctness
  - Easy to debug, e.g. printf() or gdb
  - Simulator runs on PC, e.g. regression testing
  - Run on real FPGA hardware, e.g. real-time testing

Background: Vector Processing
- Data-level parallelism
  - Organize data as long vectors
- Vector instruction execution
  - Multiple vector lanes (SIMD)
  - Hardware automatically repeats SIMD operation over entire length of vector

C Code:

    for ( i=0; i<8; i++ )
        a[i] = b[i] * c[i];

Vector Assembly:

    set   vl, 8
    vmult a, b, c

[Figure: source vectors pass through 4 SIMD vector lanes to produce the destination vector]

Long vectors of 32 elements and above. Vector addressing is an efficient way to gather data and eliminates pack/unpack instructions. Will be discussed more on the next slide.
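The way the hardware repeats one SIMD operation across a whole vector can be modelled in plain C. The sketch below is a software model only, not MXP code: `model_vmult` and `NUM_LANES` are hypothetical names, and the lane count of 4 is taken from the figure. Each pass of the outer loop stands for one lane-wide step in which, on real hardware, all lanes would work at the same time.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_LANES 4 /* assumption: lane count taken from the 4-lane figure */

/* Hypothetical software model of "set vl, n; vmult a, b, c".
 * Computes a[i] = b[i] * c[i] for i in [0, vl) and returns the number
 * of lane-wide steps the model needed. */
size_t model_vmult(int *a, const int *b, const int *c, size_t vl)
{
    size_t steps = 0;

    assert(a != NULL && b != NULL && c != NULL);

    for (size_t base = 0; base < vl; base += NUM_LANES) {
        /* In hardware, the lanes in this inner loop operate in parallel. */
        for (size_t lane = 0; lane < NUM_LANES && base + lane < vl; lane++) {
            a[base + lane] = b[base + lane] * c[base + lane];
        }
        steps++;
    }
    return steps;
}
```

Under this model, a vector of length 8 takes two steps on 4 lanes, and a vector of length 5 also takes two steps, with the second step only partly filled.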

Why Vector Processing?
- Efficient for embedded computation
  - E.g. VIRAM for embedded media apps
- Maps well to FPGAs
  - Different tradeoffs than ASICs
  - Uses parallel, deep pipelines
  - Streams data through execution units
  - Distributed memories for registers/scratchpad
  - Avoids tight coupling, forwarding networks

Preview: MXP Internals

SYSTEM DESIGN WITH MXP™

MXP™ Processor: Configurable IP

Integrates into Existing Systems

Typical System

Programming MXP
- Libraries on top of vendor tools
  - Eclipse-based IDEs, command-line tools
  - GCC, GDB, etc.
- Functions and macros extend C, C++
  - Vector instructions: ALU, DMA, custom instructions
- Same software for different configurations
  - Wide MXP -> higher performance

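Example: Adding 3 Vectors

    #include "vbx.h"

    /* enum constant rather than "const int", so the arrays can take initializers in C */
    enum { length = 8 };

    int main()
    {
        int A[length] = {1,2,3,4,5,6,7,8};
        int B[length] = {10,20,30,40,50,60,70,80};
        int C[length] = {100,200,300,400,500,600,700,800};
        int D[length];

        /* flush the data cache before the DMA engine reads the arrays */
        vbx_dcache_flush_all();

        /* allocate three vectors in the scratchpad */
        const int data_len = length * sizeof(int);
        vbx_word_t *va = (vbx_word_t*)vbx_sp_malloc( data_len );
        vbx_word_t *vb = (vbx_word_t*)vbx_sp_malloc( data_len );
        vbx_word_t *vc = (vbx_word_t*)vbx_sp_malloc( data_len );

        /* DMA the inputs from main memory into the scratchpad */
        vbx_dma_to_vector( va, A, data_len );
        vbx_dma_to_vector( vb, B, data_len );
        vbx_dma_to_vector( vc, C, data_len );

        vbx_set_vl( length );              /* vector length, in elements */
        vbx( VVW, VADD, vb, va, vb );      /* vb = va + vb */
        vbx( VVW, VADD, vc, vb, vc );      /* vc = vb + vc = A + B + C */

        /* DMA the result back, then wait for outstanding operations */
        vbx_dma_to_host( D, vc, data_len );
        vbx_sync();

        vbx_sp_free();
        return 0;
    }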
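For regression testing on a PC, it can help to have a scalar reference to compare the vector result against. The sketch below is plain C with no MXP dependency; `add3_reference` and `arrays_equal` are hypothetical helper names, not part of the VectorBlox library. It computes the element-wise sum of the three input arrays, which is what the two VADD instructions in the example should leave in D.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: d[i] = a[i] + b[i] + c[i] for i in [0, n). */
void add3_reference(int *d, const int *a, const int *b, const int *c, size_t n)
{
    assert(d != NULL && a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        d[i] = a[i] + b[i] + c[i];
    }
}

/* Returns 1 if x and y hold the same n values, 0 otherwise. */
int arrays_equal(const int *x, const int *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (x[i] != y[i]) {
            return 0;
        }
    }
    return 1;
}
```

With the inputs from the slide, the reference result is {111, 222, 333, 444, 555, 666, 777, 888}; a test harness could call `arrays_equal` on that and on the D array copied back from the scratchpad.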

Algorithm Design on FPGAs
- HW and SW development is decoupled
- Select HW parameters and go
  - No VHDL required for computing
  - Only resynthesize when requirements change
- Design SW with these main concepts
  - Vectors of data
  - Scratchpad with DMA
- Same software can run on any FPGA

MXP™ Matrix Processor

MXP™ System Architecture
3-way Concurrency:
1. Scalar CPU
2. Concurrent DMA
3. Vector SIMD
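One common way to exploit concurrent DMA alongside vector execution is double buffering: while the vector engine works on one scratchpad buffer, DMA fills the other. The sketch below only models that schedule in sequential plain C. Every name in it (`CHUNK`, `model_dma_in`, `model_dma_out`, `model_vector_double`, `process_double_buffered`) is hypothetical, `memcpy` stands in for DMA, and nothing here is taken from the MXP library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4 /* hypothetical chunk size, in elements */

/* Stand-in for a DMA transfer from main memory into the scratchpad. */
void model_dma_in(int *sp, const int *mem, size_t n)
{
    memcpy(sp, mem, n * sizeof *sp);
}

/* Stand-in for a DMA transfer from the scratchpad back to main memory. */
void model_dma_out(int *mem, const int *sp, size_t n)
{
    memcpy(mem, sp, n * sizeof *mem);
}

/* Stand-in for a vector instruction: doubles each element in place. */
void model_vector_double(int *sp, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        sp[i] *= 2;
    }
}

/* Doubles data[0..total) chunk by chunk using two alternating buffers.
 * total must be a multiple of CHUNK. */
void process_double_buffered(int *data, size_t total)
{
    int buf[2][CHUNK];
    size_t chunks = total / CHUNK;

    assert(total % CHUNK == 0);
    if (chunks == 0) {
        return;
    }

    model_dma_in(buf[0], data, CHUNK); /* prefetch the first chunk */
    for (size_t k = 0; k < chunks; k++) {
        int *cur = buf[k % 2];
        int *next = buf[(k + 1) % 2];

        if (k + 1 < chunks) {
            /* On concurrent hardware this transfer would overlap the
             * compute step below; here it simply runs first. */
            model_dma_in(next, data + (k + 1) * CHUNK, CHUNK);
        }
        model_vector_double(cur, CHUNK);
        model_dma_out(data + k * CHUNK, cur, CHUNK);
    }
}
```

The scalar CPU's role in this picture is to issue the transfers and vector operations and to run the loop control, which is the third leg of the 3-way concurrency.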

MXP Internal Architecture (1)

Scratchpad Memory
- Multi-banked, parallel access
- Addresses striped across banks, like RAID disks

[Figure: data is striped across four memory banks; bank 0 holds elements 0, 4, 8, C; bank 1 holds 1, 5, 9, D; bank 2 holds 2, 6, A, E; bank 3 holds 3, 7, B, F]

Scratchpad Memory
- Multi-banked, parallel access
- Vector can start at any location

[Figure: the same four striped memory banks, with a marker labelled "Vector starts here" on one element]

Scratchpad Memory
- Multi-banked, parallel access
- Vector can start at any location
- Vector can have any length

[Figure: the same four striped memory banks, with a vector of length 10 highlighted from the "Vector starts here" marker]

Scratchpad Memory
- Multi-banked, parallel access
- Vector can start at any location
- Vector can have any length
- One “wave” of elements can be read every cycle

[Figure: the same four striped memory banks, shown twice; one clock cycle gives parallel access to one full “wave” of vector elements]
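The striping shown on these slides can be written down as simple index arithmetic. The sketch below is a model under stated assumptions, not a description of the actual hardware: it assumes four banks (as in the figure), addresses counted in whole elements, and that a wave can deliver one element per bank no matter which bank the vector starts in. The names `bank_loc`, `locate`, `waves_needed` and `wave_is_conflict_free` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BANKS 4 /* assumption: bank count taken from the figure */

typedef struct {
    size_t bank; /* which memory bank holds the element */
    size_t row;  /* position of the element inside that bank */
} bank_loc;

/* Maps an element address to its bank and row under simple striping. */
bank_loc locate(size_t element_addr)
{
    bank_loc loc;
    loc.bank = element_addr % NUM_BANKS;
    loc.row = element_addr / NUM_BANKS;
    return loc;
}

/* Number of waves needed for a vector of the given length, assuming
 * each wave delivers up to NUM_BANKS elements. */
size_t waves_needed(size_t length)
{
    return (length + NUM_BANKS - 1) / NUM_BANKS;
}

/* Returns 1 if the NUM_BANKS consecutive elements starting at 'start'
 * all fall in different banks, 0 otherwise. */
int wave_is_conflict_free(size_t start)
{
    int used[NUM_BANKS] = {0};

    for (size_t i = 0; i < NUM_BANKS; i++) {
        size_t b = locate(start + i).bank;
        if (used[b]) {
            return 0;
        }
        used[b] = 1;
    }
    return 1;
}
```

For example, element D (13) maps to bank 1, row 3, matching the figure, and the length-10 vector from the slide needs three waves under this model. Because consecutive elements always land in consecutive banks, any starting address gives a conflict-free wave, which is consistent with the claim that a vector can start at any location.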

Scratchpad-based Computing

    vbx_word_t *vdst, *vsrc1, *vsrc2;
    vbx( VVW, VADD, vdst, vsrc1, vsrc2 );

MXP Internal Architecture (2)

Custom Vector Instructions

MXP Internal Architecture (3)

Rich Feature Set

Feature                        MXP
Register file                  4kB to 2MB
# Vectors (registers)          unlimited
Max Vector Length
Max Element Width              32b
Sub-word SIMD                  2 x 16b, 4 x 8b
Automatic Dispatch/Increment   2D/3D
Parallelism                    1 to 128 (x4 for 8b)
Clock speed                    Up to 245 MHz
Latency-hiding                 Concurrent 1D/2D DMA
Floating-point                 Optional via Custom Instructions
User-configurable              DMA, ALUs, Multipliers, S/G Ports
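The "4 x 8b" sub-word SIMD entry means one 32-bit data path handles four independent 8-bit elements at once. As a rough software analogy only (this is a well-known bit trick, not how MXP is implemented), the sketch below adds four packed bytes inside a 32-bit word without letting carries cross lane boundaries. The function name `add_4x8` is hypothetical.

```c
#include <stdint.h>

/* Adds four packed 8-bit lanes of a and b; each lane wraps modulo 256
 * and no carry leaks into the neighbouring lane. */
uint32_t add_4x8(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of every lane; the cleared top bit of each
     * lane absorbs the carry from bit 6, so nothing crosses lanes. */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);

    /* Fold the top bit of each lane back in with XOR (add without carry). */
    return low7 ^ ((a ^ b) & 0x80808080u);
}
```

For instance, `add_4x8(0x01020304, 0x10203040)` gives `0x11223344`, and a lane holding `0xFF` plus `0x01` wraps to `0x00` without disturbing the lane above it. This is also why the table lists parallelism as "x4 for 8b": each 32-bit lane counts as four byte-wide lanes.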

Performance Examples
[Figure: application kernels; speedup (factor) vs. VectorBlox MXP™ processor size]

Chip Area Requirements

Stratix IV
        Nios II/f   V1 4k   V4 16k   V16 64k   V32 128k   V64 256k   Stratix IV-530
ALMs    1,223       3,433   7,811    21,211    46,411     80,720     212,480
DSPs    4           12      36       132       260        516        1,024
M9Ks    14          29      39       112       200        384        1,280

Cyclone IV
        Nios II/f   V1 4k   V4 16k   V16 64k   V32 128k   Cyclone IV-115
LEs     2,898       4,467   11,927   45,035    89,436     114,480
DSPs    4           12      48       192       388        532
M9Ks    21          32      36       97        165        432
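The last column of each table is the total resource count of the device, so dividing a configuration's usage by it gives the share of the chip that configuration occupies. The helper below is a minimal sketch of that arithmetic; `utilization_percent` is a hypothetical name.

```c
#include <assert.h>

/* Share of a device resource that a design uses, as a percentage. */
double utilization_percent(long used, long available)
{
    assert(available > 0);
    return 100.0 * (double)used / (double)available;
}
```

Using the figures above, the V64 configuration takes 80,720 of 212,480 ALMs on the Stratix IV-530, roughly 38%, while the V32 configuration takes 89,436 of 114,480 LEs on the Cyclone IV-115, roughly 78%.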

Average Speedup vs. Area (Relative to Nios II/f = 1.0)

Sobel Edge Detection
- MXP achieves high utilization
  - Long vectors keep data streaming through FUs
  - In-pipeline alignment, accumulate
  - Concurrent vector/DMA/scalar alleviate stalling

Current/Future Work
- Multiple-operand custom instructions
  - Custom RTL performance, vector control
- Modular Instruction Set
  - Application Specific Vector ISA Processor
- C++ object programming model

Conclusions
- Vector processing with MXP on FPGAs
  - Easy to use/deploy
  - Scalable performance (area vs speed)
  - Speedups up to 1000x
- No hardware recompiling necessary
  - Rapid algorithm development
  - Hardware purely ‘sandboxed’ from algorithm

The VectorBlox MXP™ Matrix Processor
- Scalable performance
- Pure C programming
- Direct device access
- No hardware design
- Easy to debug
- RTL

Application Performance
Comparison to Intel i7-2600 (running on one 3.4GHz core, without SSE/AVX instructions)

CPU             Fir     2Dfir   Life    Imgblend   Median   Motion Estimation   Matrix Multiply
Intel i7-2600   0.05s   0.36s   0.13s   0.09s      9.86s    0.25s               50.0s
MXP                     0.43s   0.19s   0.50s      2.50s    0.21s               15.8s
Speedup         1.0x    0.8x    0.7x    0.2x       3.9x     1.7x                3.2x

Benchmark Characteristics