Reconfigurable Architectures Andrea Lodi
SoC trends
- Increasing mask cost (~3 M$)
- Increasing design complexity
- Increasing design time
- Rapidly changing communication standards
- Low-power design in wireless environments
- Increasing algorithmic complexity requirements
Product life cycle (figure): sales over time through growth, maturity, and decline phases; meeting the time-to-market window captures the sales peak, while missing it leads to a loss.
Trends in wireless systems
- Increased on-chip transistor density (Moore's law) and increased design complexity (figure: millions of transistors per chip and technology node in nm, 1997-2009)
- Increased algorithmic complexity, with low battery capacity growth
- Demand for reusability and flexibility
- Demand for high performance and energy efficiency
Digital architecture design space
Parallelism in computation
- Thread-level parallelism
- Instruction-level parallelism (ILP)
- Pipeline (loop-level) parallelism
- Fine-grain parallelism (bit/byte level)
Instruction-level parallelism (figure: data flow graph of additions, multiplications, and subtractions on inputs b, c, d, e, and its ASIC implementation)
Spatial vs. Temporal Computing
Evaluating Ax^2 + Bx + C, written in Horner form as (Ax + B)x + C:
- Temporal (processor): a sequence of instructions reusing the same functional units over time
- Spatial (ASIC): dedicated multipliers and adders operating in parallel
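As a concrete reading of the two forms above: the temporal version executes the Horner form (Ax + B)x + C as a short instruction sequence sharing one multiplier and one adder over time, while the spatial version instantiates all the operators of Ax^2 + Bx + C side by side. A minimal C sketch of the temporal evaluation (names and the per-step breakdown are illustrative):

/* Temporal evaluation of Ax^2 + Bx + C in Horner form: one basic
 * operation per step, reusing the same functional units across cycles. */
int poly_horner(int A, int B, int C, int x)
{
    int t = A * x;   /* step 1: multiply */
    t = t + B;       /* step 2: add      */
    t = t * x;       /* step 3: multiply */
    t = t + C;       /* step 4: add      */
    return t;
}

A spatial (ASIC) implementation would instead wire the multipliers and adders into a combinational datapath, computing the same polynomial in parallel hardware and trading area for latency.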
Superscalar/VLIW processors
- Functional unit (FU) limitations
- Register file size limitation
- Crossbar inefficiency
Byte-level parallelism in processors
- MMX technology: 57 new instructions
- Byte- and half-word-parallel computation
- SIMD execution model
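A hedged sketch of the SIMD execution model mentioned above, using the MMX intrinsics from <mmintrin.h> (x86, GCC/Clang); the array-addition example and the memcpy-based loads are illustrative choices, not taken from the slides:

#include <mmintrin.h>   /* MMX intrinsics */
#include <stdint.h>
#include <string.h>

/* Add two byte arrays eight elements at a time: a single paddb-style
 * instruction performs eight byte-wide additions in parallel. */
void add_bytes_mmx(const uint8_t *a, const uint8_t *b, uint8_t *out, int n)
{
    int i;
    for (i = 0; i + 8 <= n; i += 8) {
        __m64 va, vb, vr;
        memcpy(&va, a + i, 8);
        memcpy(&vb, b + i, 8);
        vr = _mm_add_pi8(va, vb);     /* 8 byte additions in one instruction */
        memcpy(out + i, &vr, 8);
    }
    for (; i < n; i++)                /* scalar tail */
        out[i] = (uint8_t)(a[i] + b[i]);
    _mm_empty();                      /* clear MMX state before using the FPU */
}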
Bit-level parallelism

int reverse(int v) {
    int r = 0;
    for (int i = 0; i < WIDTH; i++) {   /* WIDTH: operand width in bits */
        r = (r << 1) | (v & 1);
        v >>= 1;
    }
    return r;
}

int popcount(int v) {
    int r = 0;
    while (v) {
        if (v & 1) r++;
        v >>= 1;
    }
    return r;
}

(figure: bit-serial datapaths operating on registers v and r)
Pipeline parallelism (figure: pipelined adder tree for popcount, with registers between stages)

for (j = 0; j < MAX; j++)
    b[j] = popcount(a[j]);

Successive iterations overlap in the pipelined datapath: a new element enters the adder tree every cycle.
FPGA
An FPGA (Field-Programmable Gate Array) is composed of two elements:
- An array of CLBs (configurable logic blocks), each containing:
  - one or a few small LUTs (4-input or 3-input)
  - control logic: multiplexers driven by configuration bits
  - dedicated computational logic (carry chain, ...)
- A configurable routing network connecting the CLBs, composed of:
  - wires of different lengths
  - connection blocks connecting the CLBs to the routing network
  - switch blocks connecting routing wires
The LUT contents and the configuration bits programming the CLBs and the routing network form the FPGA configuration, which determines the implemented function.
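As an illustration of how configuration bits determine a CLB's function, here is a minimal C sketch of a 4-input LUT: the 16 configuration bits are the truth table and the 4 input bits select one entry. Names and bit packing are assumptions for the sketch, not an FPGA vendor API:

#include <stdint.h>

/* 4-input LUT: 16 configuration bits form the truth table; the 4-bit
 * input value selects which entry drives the output. */
static int lut4_eval(uint16_t config, unsigned in /* 0..15 */)
{
    return (config >> (in & 0xFu)) & 1;
}

/* Example configuration: program the LUT as a 2-input XOR of inputs
 * a (bit 0) and b (bit 1), ignoring the other two inputs. */
static uint16_t lut4_config_xor2(void)
{
    uint16_t cfg = 0;
    for (unsigned in = 0; in < 16; in++) {
        unsigned a = in & 1u, b = (in >> 1) & 1u;
        cfg |= (uint16_t)((a ^ b) << in);
    }
    return cfg;
}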
Configurable logic block
Xilinx CLB
Xilinx 4000-series CLB: 11 input bits, 4 output bits, 3 LUTs, carry logic, 2 output registers
Configurable routing network
Example
Density Comparison
FPGA vs. Processor
FPGA (computing in space):
- Parallel execution
- Configurable in 10^2-10^3 cycles
- Fine-grained data
- Application-specific operators
- Large area (switches, SRAM)
- Entire applications don't fit
- Slow synthesis and place & route tools
Processor (computing in time):
- Sequential execution
- Programmable every cycle
- Fixed-size operands
- Basic operators (ALU)
- Compact
- Handles complex control flow
- Fast compilers
Reconfigurable processors
But:
- ~90% of execution time is spent in computational kernels
- FPGAs give 10-100x speed-up over processors on such kernels
- FPGAs are 10-100x denser than processors (in bit operations per unit area per second)
Reconfigurable processor: RISC + FPGA
Reconfigurable processor architecture
Hybrid architectures: RISC processor + FPGA
Computational models
- RC array as I/O processor / interface logic
- RC array as attached processor: PipeRench, T-Recs
- RC array for ISA extension:
  - Function unit: PRISC, OneChip, Chimaera
  - Coprocessor: Garp, NAPA, Molen
I/O Processor / Interface Logic
Case for:
- There is always some system adaptation to do
- Modern chips have the capacity to hold a processor plus glue logic, reducing part count
- Glue logic varies: many protocols and services exist, but only a few are needed at a time
- Logic is used in place of ASIC environment customization or external FPGA/PLD devices
- Looks like an I/O peripheral to the processor
Examples: protocol handling, stream computation (compression, encryption), peripherals (sensors, actuators)
Example: Interface/Peripherals Triscend E5
Instruction Set Extension: instruction bandwidth
- A processor can only describe a small number of basic computations in a cycle: I instruction bits select among 2^I operations
- This is a tiny fraction of the operations one could define even on w-bit operands: there are 2^(w*2^(2w)) distinct functions mapping two w-bit inputs to a w-bit output
- The processor may have to issue very long sequences of its 2^I base operations just to describe some of these computations
- An a priori selected base set of functions can be very bad for some applications
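A rough numeric illustration of the counting argument above (the numbers are mine, not from the slides), taking w = 8 and a 32-bit instruction word:

\[
2^{I} = 2^{32} \approx 4\times 10^{9} \ \text{selectable operations},
\qquad
2^{\,w\cdot 2^{2w}} = 2^{\,8\cdot 65536} = 2^{524288} \ \text{possible } 8\times 8 \to 8 \text{ functions},
\]

so any fixed base instruction set covers only a vanishingly small fraction of the operations one could define on the operands.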
Instruction Set Extension Idea: provide a way to augment the processor’s instruction set with operations needed by a particular application
Architectural models for ISA extension
- PLEIADES (Zhang et al., 2000): CPU surrounded by a collection of application-specific custom computing devices. Good performance, easy to program, but configured at mask level
- XTENSA (Tensilica Inc., 2002): RISC CPU featuring application-specific function units optionally inserted in the processor pipeline. High performance, but overdesigned for most applications and difficult to program
Dynamic ISA extension models
Standard processor coupled with embedded programmable logic, where application-specific functions are dynamically re-mapped depending on the algorithm being executed.
1. Coprocessor model
2. Function unit model
Coprocessor model: Garp (Callahan, Hauser, Wawrzynek, 2000)
- Explicit instructions move data to and from the array
- High communication overhead (long-latency array operations)
- Processor stalled whenever the array is active
- Array operates at task level (very coarse grain)
- 10-20x speed-up on streaming, feed-forward operations
- 2-3x when data dependencies limit pipelining
Function unit model: PRISC (Razdan, Smith, 1994)
- Array fits in the RISC pipeline
- No communication overhead
- Some degree of parallelism between function units
- Gate array performs combinational instructions only (very fine grain)
- Low speed-up figures (2-3x)
Function unit model: pros
- No communication overhead: strict synergy between the FPGA and the other function units; the FPGA can be used frequently, even for small functions
- Small reconfigurable array area: flow control and memory access are handled by the core
- Easy instruction set extension: configuration streams compiled from C
XiRisc: an extendible instruction set RISC architecture
- 32-bit load/store RISC architecture (5-stage pipeline)
- Set of specialized functional units: multiply/MAC unit, branch/decrement unit, ALU featuring "MMX"-style byte-wide concurrent operations
- VLIW elaboration: concurrent fetch and execution of two 32-bit instructions per cycle, fully bypassed to minimize pipeline stalls (average of 10-20% for most computational kernels)
- Embedded reconfigurable device for dynamic ISA extension: a DSP-oriented reconfigurable functional unit (PiCoGA), fully configurable at execution time; elaboration and configuration controlled by asm instructions inserted in the C source code; PiCoGA used as a programmable datapath with an independent pipeline structure
XiRisc Architecture
Dynamic Instruction Set Extension
Dynamic Instruction Set Extension (figure): the register file feeds the reconfigurable array; a pgaload instruction loads a configuration from the configuration memory, a pgaop $3, $4, $5 instruction executes the mapped function, and its result is consumed by ordinary instructions (e.g., add $8, $3).
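A hedged sketch of how such instructions might be inserted in C source through inline assembly; the pgaload/pgaop mnemonics come from the slide, while the operand layout and GCC-style constraint strings are assumptions for illustration:

/* Issue a PiCoGA operation on two register operands.
 * Mnemonics from the slides; operand syntax is assumed. */
static inline int pga_op_example(int a, int b)
{
    int r;
    asm volatile ("pgaload");                    /* load the configuration (operands omitted) */
    asm volatile ("pgaop %0, %1, %2"             /* execute the mapped function */
                  : "=r"(r) : "r"(a), "r"(b));
    return r;                                    /* result then usable by ordinary instructions */
}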
PiCoGA Architecture
PiCoGA (Pipelined Configurable Gate Array): embedded datapath for dynamic ISA extension
- Dynamically reconfigurable
- Structured in rows, activated in data-flow fashion by the PiCoGA control unit
- Can hold state
- pGA-op latency depends on the specific mapped function
- Functionality is determined from the DFG extracted from the C code
(figure: processor interface, PiCoGA control unit, PiCoGA rows as synchronous elements)
Pico-cell description (figure: reconfigurable logic cell, RLC)
- 4x32-bit input data from the register file, 2x32-bit output data to the register file, 12 global lines to/from the register file
- Each cell: input logic, 16x2 LUT, carry chain, output logic and registers, loop-back path
- Enable and control signals from the PiCoGA control unit; configuration bus
- Connect block and switch block link the cell to the routing network
Computing on PiCoGA (figure): the data flow graphs of two operations (pga_op1, pga_op2) are mapped onto the array; the PiCoGA control unit sequences data in and data out for each mapped operation.
Multi-context Array
- The PiCoGA configuration cache holds multiple functions (Func. 1 ... Func. n)
- Four configuration planes are available, one of them executing
- While one plane is executing, another may be reconfigured → no reconfiguration time overhead
- A plane switch takes just 1 clock cycle
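An illustrative software model of the multi-context scheme, assuming only what the slide states (four planes, one active, background loading, single-cycle switch); structure and names are invented for the sketch:

#include <stdint.h>
#include <string.h>

#define N_PLANES  4
#define CFG_WORDS 1024   /* size of one configuration plane (assumed) */

typedef struct {
    uint32_t plane[N_PLANES][CFG_WORDS];  /* configuration planes */
    int      active;                      /* plane currently executing */
} pga_cache_model;

/* Background load: write a bitstream into a non-active plane while the
 * active plane keeps executing, hiding the reconfiguration time. */
static void pga_load_plane(pga_cache_model *g, int p, const uint32_t *bits)
{
    if (p != g->active)
        memcpy(g->plane[p], bits, sizeof(g->plane[p]));
}

/* Plane switch: selecting an already-loaded plane costs one clock cycle. */
static void pga_switch_plane(pga_cache_model *g, int p)
{
    g->active = p;
}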
Architecture flexibility (decision flow)
- Parallelism to exploit? (e.g., turbo decoding, motion estimation) → yes: speed-up from the pGA (5x-100x)
- Bit-level operations? (e.g., DES, Reed-Solomon) → yes: speed-up from the pGA (5x-100x)
- MAC intensive? (e.g., FFT, scalar product) → yes: speed-up from DSP instructions and VLIW (1.5x-2x)
- Memory intensive? (e.g., DCT, motion estimation) → yes: speed-up from DSP instructions and VLIW (1.5x-2x)
Improvements for a large number of data and signal processing algorithms
Programming XiRisc: restrictions
- Fixed-point algorithms
- Variable size specification at the bit level
- Not supported yet: dynamic memory allocation, math library, operating system
XiRisc compilation flow (figure): file.c → C compiler, with profiling and software simulation; PiCoGA operations → PiCoGA configurator → configuration bitstream, stored in the configuration library.
Example: Motion Estimation Sum of Absolute Difference (SAD) - High instruction-level and inter-iteration parallelism
Data flow graph (figure): pixel-by-pixel absolute differences Abs(p1[i] - p2[i]), where p1[i] and p2[i] are pixels, feeding a sum tree.
Sum of Absolute Difference (figure): four absolute-difference units (AD1-AD4) fed from the register file, partial SAD8 sums combined into the final SAD, which is written back to the register file.
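For reference, a scalar C version of the SAD kernel that the data flow graph above parallelizes; the block length and types are illustrative, while the PiCoGA mapping evaluates many absolute differences and the sum tree concurrently:

#include <stdint.h>
#include <stdlib.h>   /* abs */

/* Sum of absolute differences over a block of n pixels. */
int sad(const uint8_t *p1, const uint8_t *p2, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += abs((int)p1[i] - (int)p2[i]);
    return acc;
}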
Latency and Issue Delay (tool flow figure): high-level C compiler → DFG-based description → Griffy compiler → mapping → place & route → configuration bits, plus an emulation function annotated with latency and issue delay.
Performance evaluation
- Emulation function
- Latency and issue-delay back-annotation
- Profiling
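A hedged sketch of what such an emulation function might look like: the mapped operation is emulated bit-exactly in C, and the back-annotated latency and issue-delay constants update a cycle counter during profiling. The constant names, their values, and the counter are assumptions for illustration:

#include <stdint.h>
#include <stdlib.h>

/* Back-annotated timing for one pgaop (produced by mapping and place &
 * route in the real flow; the numbers here are placeholders). */
#define PGAOP_SAD_LATENCY     3   /* cycles from issue to result */
#define PGAOP_SAD_ISSUE_DELAY 1   /* cycles before the next pgaop can issue */

static unsigned long sim_cycles;  /* profiling cycle counter */

/* Bit-exact software emulation of a mapped SAD operation, used during
 * software simulation and profiling on the host. */
static int emulate_pgaop_sad(const uint8_t *p1, const uint8_t *p2, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += abs((int)p1[i] - (int)p2[i]);
    sim_cycles += PGAOP_SAD_LATENCY + PGAOP_SAD_ISSUE_DELAY;
    return acc;
}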
Motion Estimation: results
- 16 SAD operations computed in parallel
- PiCoGA occupation: ~100%
- Speed-up: 7x (with respect to standard XiRisc)
- Preliminary MPEG result: H.261 standard, QCIF (176x144), 10 frames/sec
Reed-Solomon Encoder: results
- RS(15,9) encoder, 4-bit symbols: PiCoGA occupation ~25%, speed-up 37x, throughput 70.6 Mb/s
- RS(255,239) encoder (widely used), 8-bit symbols: PiCoGA occupation ~60%, speed-up 135x, throughput 187.1 Mb/s
Speed-up and Power Consumption (vs. standard XiRisc)

Algorithm            Energy consumption reduction   Speed-up
DES encryption       89%                            13.5x
Turbo decoder        75%                            11.7x
Motion prediction    46%                            4.5x
Median filter        60%                            7.7x
CRC                  49%                            4.3x