Reconfigurable Architectures Andrea Lodi
SoC trends
- Increasing mask cost (~3 M$)
- Increasing design complexity
- Increasing design time
- Rapidly changing communication standards
- Low-power design in wireless environments
- Increasing algorithmic complexity requirements
Product life cycle (figure): sales over time through growth, maturity, and decline phases; meeting the time-to-market window captures the sales peak, while missing it leads to a loss.
Trends in wireless systems
- Increased on-chip transistor density (Moore's law) and increased design complexity (figure: millions of transistors per chip and technology node in nm, 1997-2009)
- Increased algorithmic complexity, with low battery capacity growth
- Demand for reusability and flexibility
- Demand for high performance and energy efficiency
Digital architecture design space
Parallelism in computation
- Thread-level parallelism
- Instruction-level parallelism (ILP)
- Pipeline (loop-level) parallelism
- Fine-grain parallelism (bit/byte level)
Instruction-level parallelism (figure: data flow graph of additions, multiplications, and subtractions on inputs b, c, d, e, and its ASIC implementation)
Spatial vs. Temporal Computing
Evaluating Ax^2 + Bx + C, written in Horner form as (Ax + B)x + C:
- Temporal (processor): a sequence of instructions reusing the same functional units over time
- Spatial (ASIC): dedicated multipliers and adders operating in parallel
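As a concrete reading of the two forms above: the temporal version executes the Horner form (Ax + B)x + C as a short instruction sequence sharing one multiplier and one adder over time, while the spatial version instantiates all the operators of Ax^2 + Bx + C side by side. A minimal C sketch of the temporal evaluation (names and the per-step breakdown are illustrative):

/* Temporal evaluation of Ax^2 + Bx + C in Horner form: one basic
 * operation per step, reusing the same functional units across cycles. */
int poly_horner(int A, int B, int C, int x)
{
    int t = A * x;   /* step 1: multiply */
    t = t + B;       /* step 2: add      */
    t = t * x;       /* step 3: multiply */
    t = t + C;       /* step 4: add      */
    return t;
}

A spatial (ASIC) implementation would instead wire the multipliers and adders into a combinational datapath, computing the same polynomial in parallel hardware and trading area for latency.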
Superscalar/VLIW processors
- Functional unit (FU) limitations
- Register file size limitation
- Crossbar inefficiency
Byte-level parallelism in processors
- MMX technology: 57 new instructions
- Byte- and half-word-parallel computation
- SIMD execution model
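A hedged sketch of the SIMD execution model mentioned above, using the MMX intrinsics from <mmintrin.h> (x86, GCC/Clang); the array-addition example and the memcpy-based loads are illustrative choices, not taken from the slides:

#include <mmintrin.h>   /* MMX intrinsics */
#include <stdint.h>
#include <string.h>

/* Add two byte arrays eight elements at a time: a single paddb-style
 * instruction performs eight byte-wide additions in parallel. */
void add_bytes_mmx(const uint8_t *a, const uint8_t *b, uint8_t *out, int n)
{
    int i;
    for (i = 0; i + 8 <= n; i += 8) {
        __m64 va, vb, vr;
        memcpy(&va, a + i, 8);
        memcpy(&vb, b + i, 8);
        vr = _mm_add_pi8(va, vb);     /* 8 byte additions in one instruction */
        memcpy(out + i, &vr, 8);
    }
    for (; i < n; i++)                /* scalar tail */
        out[i] = (uint8_t)(a[i] + b[i]);
    _mm_empty();                      /* clear MMX state before using the FPU */
}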
Bit-level parallelism

int reverse(int v) {
    int r = 0;
    for (int i = 0; i < WIDTH; i++) {   /* WIDTH: operand width in bits */
        r = (r << 1) | (v & 1);
        v >>= 1;
    }
    return r;
}

int popcount(int v) {
    int r = 0;
    while (v) {
        if (v & 1) r++;
        v >>= 1;
    }
    return r;
}

(figure: bit-serial datapaths operating on registers v and r)
Pipeline parallelism (figure: pipelined adder tree for popcount, with registers between stages)

for (j = 0; j < MAX; j++)
    b[j] = popcount(a[j]);

Successive iterations overlap in the pipelined datapath: a new element enters the adder tree every cycle.
FPGA
An FPGA (Field-Programmable Gate Array) is composed of two elements:
- An array of CLBs (configurable logic blocks), each containing:
  - one or a few small LUTs (4-input or 3-input)
  - control logic: multiplexers driven by configuration bits
  - dedicated computational logic (carry chain, ...)
- A configurable routing network connecting the CLBs, composed of:
  - wires of different lengths
  - connection blocks connecting the CLBs to the routing network
  - switch blocks connecting routing wires
The LUT contents and the configuration bits programming the CLBs and the routing network form the FPGA configuration, which determines the implemented function.
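As an illustration of how configuration bits determine a CLB's function, here is a minimal C sketch of a 4-input LUT: the 16 configuration bits are the truth table and the 4 input bits select one entry. Names and bit packing are assumptions for the sketch, not an FPGA vendor API:

#include <stdint.h>

/* 4-input LUT: 16 configuration bits form the truth table; the 4-bit
 * input value selects which entry drives the output. */
static int lut4_eval(uint16_t config, unsigned in /* 0..15 */)
{
    return (config >> (in & 0xFu)) & 1;
}

/* Example configuration: program the LUT as a 2-input XOR of inputs
 * a (bit 0) and b (bit 1), ignoring the other two inputs. */
static uint16_t lut4_config_xor2(void)
{
    uint16_t cfg = 0;
    for (unsigned in = 0; in < 16; in++) {
        unsigned a = in & 1u, b = (in >> 1) & 1u;
        cfg |= (uint16_t)((a ^ b) << in);
    }
    return cfg;
}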
Configurable logic block
Xilinx CLB
Xilinx 4000-series CLB: 11 input bits, 4 output bits, 3 LUTs, carry logic, 2 output registers
Configurable routing network
Example
Density Comparison
FPGA vs. Processor
FPGA (computing in space):
- Parallel execution
- Configurable in 10^2-10^3 cycles
- Fine-grained data
- Application-specific operators
- Large area (switches, SRAM)
- Entire applications don't fit
- Slow synthesis and place & route tools
Processor (computing in time):
- Sequential execution
- Programmable every cycle
- Fixed-size operands
- Basic operators (ALU)
- Compact
- Handles complex control flow
- Fast compilers
Reconfigurable processors
But:
- ~90% of execution time is spent in computational kernels
- FPGAs give 10-100x speed-up over processors on such kernels
- FPGAs are 10-100x denser than processors (in bit operations per unit area per second)
Reconfigurable processor: RISC + FPGA
Reconfigurable processor architecture
Hybrid architectures: RISC processor + FPGA
Computational models
- RC array as I/O processor / interface logic
- RC array as attached processor: PipeRench, T-Recs
- RC array for ISA extension:
  - Function unit: PRISC, OneChip, Chimaera
  - Coprocessor: Garp, NAPA, Molen
I/O Processor / Interface Logic
Case for:
- There is always some system adaptation to do
- Modern chips have the capacity to hold a processor plus glue logic, reducing part count
- Glue logic varies: many protocols and services exist, but only a few are needed at a time
- Logic is used in place of ASIC environment customization or external FPGA/PLD devices
- Looks like an I/O peripheral to the processor
Examples: protocol handling, stream computation (compression, encryption), peripherals (sensors, actuators)
Example: Interface/Peripherals Triscend E5
Instruction Set Extension: instruction bandwidth
- A processor can only describe a small number of basic computations in a cycle: I instruction bits select among 2^I operations
- This is a tiny fraction of the operations one could define even on w-bit operands: there are 2^(w*2^(2w)) distinct functions mapping two w-bit inputs to a w-bit output
- The processor may have to issue very long sequences of its 2^I base operations just to describe some of these computations
- An a priori selected base set of functions can be very bad for some applications
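A rough numeric illustration of the counting argument above (the numbers are mine, not from the slides), taking w = 8 and a 32-bit instruction word:

\[
2^{I} = 2^{32} \approx 4\times 10^{9} \ \text{selectable operations},
\qquad
2^{\,w\cdot 2^{2w}} = 2^{\,8\cdot 65536} = 2^{524288} \ \text{possible } 8\times 8 \to 8 \text{ functions},
\]

so any fixed base instruction set covers only a vanishingly small fraction of the operations one could define on the operands.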
Instruction Set Extension Idea: provide a way to augment the processor’s instruction set with operations needed by a particular application
Architectural models for ISA extension
- PLEIADES (Zhang et al., 2000): CPU surrounded by a collection of application-specific custom computing devices. Good performance, easy to program, but configured at mask level
- XTENSA (Tensilica Inc., 2002): RISC CPU featuring application-specific function units optionally inserted in the processor pipeline. High performance, but overdesigned for most applications and difficult to program
Dynamic ISA extension models
Standard processor coupled with embedded programmable logic, where application-specific functions are dynamically re-mapped depending on the algorithm being executed.
1. Coprocessor model
2. Function unit model
Coprocessor model: Garp (Callahan, Hauser, Wawrzynek, 2000)
- Explicit instructions move data to and from the array
- High communication overhead (long-latency array operations)
- Processor stalled whenever the array is active
- Array operates at task level (very coarse grain)
- 10-20x speed-up on streaming, feed-forward operations
- 2-3x when data dependencies limit pipelining
Function unit model: PRISC (Razdan, Smith, 1994)
- Array fits in the RISC pipeline
- No communication overhead
- Some degree of parallelism between function units
- Gate array performs combinational instructions only (very fine grain)
- Low speed-up figures (2-3x)
Function unit model: pros
- No communication overhead: strict synergy between the FPGA and the other function units; the FPGA can be used frequently, even for small functions
- Small reconfigurable array area: flow control and memory access are handled by the core
- Easy instruction set extension: configuration streams compiled from C
XiRisc: an extendible instruction set RISC architecture
- 32-bit load/store RISC architecture (5-stage pipeline)
- Set of specialized functional units: multiply/MAC unit, branch/decrement unit, ALU featuring "MMX"-style byte-wide concurrent operations
- VLIW elaboration: concurrent fetch and execution of two 32-bit instructions per cycle, fully bypassed to minimize pipeline stalls (average of 10-20% for most computational kernels)
- Embedded reconfigurable device for dynamic ISA extension: a DSP-oriented reconfigurable functional unit (PiCoGA), fully configurable at execution time; elaboration and configuration controlled by asm instructions inserted in the C source code; PiCoGA used as a programmable datapath with an independent pipeline structure
XiRisc Architecture
Dynamic Instruction Set Extension
Dynamic Instruction Set Extension (figure): the register file feeds the reconfigurable array; a pgaload instruction loads a configuration from the configuration memory, a pgaop $3, $4, $5 instruction executes the mapped function, and its result is consumed by ordinary instructions (e.g., add $8, $3).
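A hedged sketch of how such instructions might be inserted in C source through inline assembly; the pgaload/pgaop mnemonics come from the slide, while the operand layout and GCC-style constraint strings are assumptions for illustration:

/* Issue a PiCoGA operation on two register operands.
 * Mnemonics from the slides; operand syntax is assumed. */
static inline int pga_op_example(int a, int b)
{
    int r;
    asm volatile ("pgaload");                    /* load the configuration (operands omitted) */
    asm volatile ("pgaop %0, %1, %2"             /* execute the mapped function */
                  : "=r"(r) : "r"(a), "r"(b));
    return r;                                    /* result then usable by ordinary instructions */
}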
PiCoGA Architecture
PiCoGA (Pipelined Configurable Gate Array): embedded datapath for dynamic ISA extension
- Dynamically reconfigurable
- Structured in rows, activated in data-flow fashion by the PiCoGA control unit
- Can hold state
- pGA-op latency depends on the specific mapped function
- Functionality is determined from the DFG extracted from the C code
(figure: processor interface, PiCoGA control unit, PiCoGA rows as synchronous elements)
Pico-cell description (figure: reconfigurable logic cell, RLC)
- 4x32-bit input data from the register file, 2x32-bit output data to the register file, 12 global lines to/from the register file
- Each cell: input logic, 16x2 LUT, carry chain, output logic and registers, loop-back path
- Enable and control signals from the PiCoGA control unit; configuration bus
- Connect block and switch block link the cell to the routing network
Computing on PiCoGA (figure): the data flow graphs of two operations (pga_op1, pga_op2) are mapped onto the array; the PiCoGA control unit sequences data in and data out for each mapped operation.
Multi-context Array
- The PiCoGA configuration cache holds multiple functions (Func. 1 ... Func. n)
- Four configuration planes are available, one of them executing
- While one plane is executing, another may be reconfigured → no reconfiguration time overhead
- A plane switch takes just 1 clock cycle
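An illustrative software model of the multi-context scheme, assuming only what the slide states (four planes, one active, background loading, single-cycle switch); structure and names are invented for the sketch:

#include <stdint.h>
#include <string.h>

#define N_PLANES  4
#define CFG_WORDS 1024   /* size of one configuration plane (assumed) */

typedef struct {
    uint32_t plane[N_PLANES][CFG_WORDS];  /* configuration planes */
    int      active;                      /* plane currently executing */
} pga_cache_model;

/* Background load: write a bitstream into a non-active plane while the
 * active plane keeps executing, hiding the reconfiguration time. */
static void pga_load_plane(pga_cache_model *g, int p, const uint32_t *bits)
{
    if (p != g->active)
        memcpy(g->plane[p], bits, sizeof(g->plane[p]));
}

/* Plane switch: selecting an already-loaded plane costs one clock cycle. */
static void pga_switch_plane(pga_cache_model *g, int p)
{
    g->active = p;
}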
Architecture flexibility (decision flow)
- Parallelism to exploit? (e.g., turbo decoding, motion estimation) → yes: speed-up from the pGA (5x-100x)
- Bit-level operations? (e.g., DES, Reed-Solomon) → yes: speed-up from the pGA (5x-100x)
- MAC intensive? (e.g., FFT, scalar product) → yes: speed-up from DSP instructions and VLIW (1.5x-2x)
- Memory intensive? (e.g., DCT, motion estimation) → yes: speed-up from DSP instructions and VLIW (1.5x-2x)
Improvements for a large number of data and signal processing algorithms
Programming XiRisc: restrictions
- Fixed-point algorithms
- Variable size specification at the bit level
- Not supported yet: dynamic memory allocation, math library, operating system
XiRisc compilation flow (figure): file.c → C compiler, with profiling and software simulation; PiCoGA operations → PiCoGA configurator → configuration bitstream, stored in the configuration library.
Example: Motion Estimation Sum of Absolute Difference (SAD) - High instruction-level and inter-iteration parallelism
Data flow graph (figure): pixel-by-pixel absolute differences Abs(p1[i] - p2[i]), where p1[i] and p2[i] are pixels, feeding a sum tree.
Sum of Absolute Difference (figure): four absolute-difference units (AD1-AD4) fed from the register file, partial SAD8 sums combined into the final SAD, which is written back to the register file.
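For reference, a scalar C version of the SAD kernel that the data flow graph above parallelizes; the block length and types are illustrative, while the PiCoGA mapping evaluates many absolute differences and the sum tree concurrently:

#include <stdint.h>
#include <stdlib.h>   /* abs */

/* Sum of absolute differences over a block of n pixels. */
int sad(const uint8_t *p1, const uint8_t *p2, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += abs((int)p1[i] - (int)p2[i]);
    return acc;
}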
Latency and Issue Delay (tool flow figure): high-level C compiler → DFG-based description → Griffy compiler → mapping → place & route → configuration bits, plus an emulation function annotated with latency and issue delay.
Performance evaluation
- Emulation function
- Latency and issue-delay back-annotation
- Profiling
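A hedged sketch of what such an emulation function might look like: the mapped operation is emulated bit-exactly in C, and the back-annotated latency and issue-delay constants update a cycle counter during profiling. The constant names, their values, and the counter are assumptions for illustration:

#include <stdint.h>
#include <stdlib.h>

/* Back-annotated timing for one pgaop (produced by mapping and place &
 * route in the real flow; the numbers here are placeholders). */
#define PGAOP_SAD_LATENCY     3   /* cycles from issue to result */
#define PGAOP_SAD_ISSUE_DELAY 1   /* cycles before the next pgaop can issue */

static unsigned long sim_cycles;  /* profiling cycle counter */

/* Bit-exact software emulation of a mapped SAD operation, used during
 * software simulation and profiling on the host. */
static int emulate_pgaop_sad(const uint8_t *p1, const uint8_t *p2, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += abs((int)p1[i] - (int)p2[i]);
    sim_cycles += PGAOP_SAD_LATENCY + PGAOP_SAD_ISSUE_DELAY;
    return acc;
}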
Motion Estimation: results
- 16 SAD operations computed in parallel
- PiCoGA occupation: ~100%
- Speed-up: 7x (with respect to standard XiRisc)
- Preliminary MPEG result: H.261 standard, QCIF (176x144), 10 frames/sec
Reed-Solomon Encoder: results
- RS(15,9) encoder, 4-bit symbols: PiCoGA occupation ~25%, speed-up 37x, throughput 70.6 Mb/s
- RS(255,239) encoder (widely used), 8-bit symbols: PiCoGA occupation ~60%, speed-up 135x, throughput 187.1 Mb/s
Speed-up and Power Consumption (vs. standard XiRisc)

Algorithm            Energy consumption reduction   Speed-up
DES encryption       89%                            13.5x
Turbo decoder        75%                            11.7x
Motion prediction    46%                            4.5x
Median filter        60%                            7.7x
CRC                  49%                            4.3x