Reconfigurable Computing Nehir Sönmez
Reconfigurable Computing Standard Definition: A reconfigurable computer is a device which computes by using post-fabrication spatial connections of compute elements. [DeHon] An FPGA implementation of a processor core that runs a program is excluded: it is not a spatial mapping of the problem. ASIC implementations are excluded: they are not post-fabrication programmable. The definition restricts RC to mapping onto fine-grained devices (such as FPGAs), whereas general-purpose computers compute by making connections in time.
What is Reconfigurable Computing? Computation using hardware that can adapt at the logic level to solve specific problems. Why is this interesting? –Some applications are poorly suited to microprocessors. –VLSI “explosion” provides increasing resources. –Hardware/Software –Relatively new research area.
Spatial Computation Example: grade = 0.2 × mt × mt × mt × project; A hardware resource (multiplier or adder) is allocated for each operator in the compute graph. The abstract computation graph becomes the implementation template.
Temporal Computation A hardware resource is time-multiplexed to implement the actions of the operators in the compute graph. This is close to a sequential processor/software solution. Many in-between cases exist.
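The contrast between the two styles can be sketched with a small cost model. The Python below is a minimal illustration, not a hardware description. It uses the expression exactly as printed on the spatial-computation slide and assumes, hypothetically, that every multiplier takes one cycle and that the spatial version is pipelined. The names OPS, evaluate, spatial_cost, and temporal_cost are invented for this sketch.

```python
# Compute graph for grade = 0.2 * mt * mt * mt * project (as printed on the slide).
# Each entry is (result, left operand, right operand); all operators are multiplies.
OPS = [
    ("t1", "k", "mt"),          # k stands for the constant 0.2
    ("t2", "t1", "mt"),
    ("t3", "t2", "mt"),
    ("grade", "t3", "project"),
]


def evaluate(ops, inputs):
    """Evaluate the graph in dependency order and return the final value."""
    values = dict(inputs)
    for result, left, right in ops:
        values[result] = values[left] * values[right]
    return values[ops[-1][0]]


def spatial_cost(ops):
    """Spatial mapping: one multiplier is allocated per operator.

    Assumption: the chain is pipelined, so one result can complete per cycle.
    """
    return {"multipliers": len(ops), "cycles_per_result": 1}


def temporal_cost(ops):
    """Temporal mapping: a single multiplier is reused for each operator in turn."""
    return {"multipliers": 1, "cycles_per_result": len(ops)}


if __name__ == "__main__":
    print("grade   :", evaluate(OPS, {"k": 0.2, "mt": 3.0, "project": 4.0}))
    print("spatial :", spatial_cost(OPS))
    print("temporal:", temporal_cost(OPS))
```

Under these assumptions the spatial mapping spends four times the multipliers to deliver four times the throughput; the in-between cases correspond to sharing two or three multipliers.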
Why is Custom Logic Faster Than Software? Spatial vs. Temporal Computation –Processors divide computation across time, dedicated logic divides across space
Why is Custom Logic Faster Than Software? Specialization –Instruction set may not provide the operations your program needs. –Processors provide hardware (multipliers, dividers) that may not be useful in every program or in every cycle of a given program. Instruction Memory –Processors need lots of memory to hold the instructions that make up a program and to hold intermediate results. Bit Width Mismatches –In general, processors have a fixed bit width, and all computations are performed on that many bits. –Multimedia vector instructions (MMX) are a response to this.
Microprocessor-based Systems –Generalized to perform many functions well. –Operates on fixed data sizes. –Inherently sequential. [Figure: Data Storage (Register File), ALU, operands A, B, C, width 64]
Reconfigurable Computing –Create specialized hardware for each application. –Functional units optimized to perform a special task. if (A > B) { H = A; L = B; } else { H = B; L = A; } [Figure: Functional Unit with inputs A, B and outputs H, L]
Dataflow –Superscalar must find the dataflow graph at run time. –RC constructs the dataflow graph at compile time: no control logic overhead, no window size limitations.
Implementation Spectrum –ASIC gives high performance at cost of inflexibility. –Processor is very flexible but not tuned to the application. –Reconfigurable hardware is a nice compromise. [Figure: spectrum of Microprocessor, Reconfigurable Hardware, ASIC]
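What "constructing the dataflow graph at compile time" buys can be sketched as a levelization pass: once the dependencies are known statically, every operation can be assigned the earliest step at which its inputs are ready, and operations on the same level can run side by side. This is a minimal sketch assuming an acyclic graph. The function name asap_levels and the example graph DEPS are hypothetical.

```python
def asap_levels(deps):
    """Assign each operation the earliest level at which all its inputs are ready.

    deps maps an operation name to the list of operations it depends on.
    The graph is assumed to be acyclic; no cycle detection is performed.
    """
    levels = {}

    def level(op):
        if op not in levels:
            preds = deps[op]
            levels[op] = 0 if not preds else 1 + max(level(p) for p in preds)
        return levels[op]

    for op in deps:
        level(op)
    return levels


# Hypothetical graph for y = (a * b) + (c * d): the two multiplies are independent.
DEPS = {
    "mul1": [],
    "mul2": [],
    "add": ["mul1", "mul2"],
}

if __name__ == "__main__":
    print(asap_levels(DEPS))
```

A superscalar processor has to rediscover the same independence at run time within a limited instruction window, whereas here it is fixed before execution starts.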
Flexibility vs Data-Processing Rate
Field-Programmable Gate Array –Each logic element outputs one data bit. –Interconnect programmable between elements. –Interconnect tracks grouped into channels. [Figure: array of Logic Elements (LE) with Tracks]
FPGA Architecture Issues –Need to explore architectural issues. –How much functionality should go in a logic element? –How many routing tracks per channel? –Switch “population”? [Figure: Logic Element]
Real World Physical Issues –Modelling FPGA delay. –Improving performance through buffering/segmentation. –Technology dependent. –The cost of reconfigurability. –Wires have real cost.
Translating a Design to an FPGA –CAD to translate a circuit from a text description to a physical implementation is well understood. –CAD to translate from a C program to a circuit is not well understood. –Very difficult for application designers to successfully write high-performance applications. –Need for design automation! [Figure: C program (C = A+B), Circuit (A, B, +, C), Array]
High-level Compilers –Difficult to estimate hardware resources. –Some parts of a program are more appropriate for the processor (hardware/software codesign). –Compiler must parallelize computation across many resources. –Engineers like to write in C rather than pushing little blocks around. [Figure: C = A+B as an adder with inputs A, B and output C; for (i = 0; i < n; i++) { ... }]
Reconfigurable Hardware –Each logic element operates on four one-bit inputs. –Output is one data bit. –Can perform any Boolean function of four inputs: $2^{2^4} = 2^{16}$ = 64K functions! [Figure: Logic Element with inputs A, B, C, D and output Out]
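The 64K figure follows from treating the logic element as a 16-entry truth table: four inputs give 2^4 = 16 input patterns, and each pattern can independently map to 0 or 1, giving 2^16 = 65,536 possible tables. A minimal Python sketch of such a look-up table follows. The names make_lut and lut_eval are hypothetical, and the bit ordering (first input as most significant address bit) is an arbitrary choice for illustration.

```python
from itertools import product


def make_lut(fn, n_inputs=4):
    """Build the configuration word of an n-input LUT from a Boolean function.

    Bit i of the result holds fn's output for the input pattern whose
    binary encoding (first input = most significant bit) is i.
    """
    config = 0
    for index, bits in enumerate(product((0, 1), repeat=n_inputs)):
        if fn(*bits):
            config |= 1 << index
    return config


def lut_eval(config, *inputs):
    """Evaluate the LUT: the inputs form an address into the configuration word."""
    address = 0
    for bit in inputs:
        address = (address << 1) | bit
    return (config >> address) & 1


if __name__ == "__main__":
    and4 = make_lut(lambda a, b, c, d: a & b & c & d)
    parity = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)
    print(hex(and4), hex(parity))
    print("distinct 4-input functions:", 2 ** (2 ** 4))
```

Note that the AND and the parity function cost exactly the same here, one 16-bit word each, which is why LUT capacity depends on the number of inputs rather than on how complicated the function looks.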
Basic Logic Block Architecture
Xilinx - Spartan II Architecture –IOBs provide the interface between the package pins and the internal logic. –CLBs provide the functional elements for constructing most logic. –Dedicated block RAM memories of 4096 bits each. –Clock DLLs for clock-distribution delay compensation and clock domain control. –Versatile multi-level interconnect structure.
Spartan II Configurable Logic Block –LUT capacity is completely determined by the number of inputs, not the complexity of the function. –Basic block is a logic cell (LC): a 4-input function generator (LUT), carry logic, and a storage element. –Each CLB contains four LCs, organized in two similar slices, plus logic that combines function generators to provide functions of five or six inputs.
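One standard way to obtain a five-input function from two 4-input function generators is Shannon expansion: fix the fifth input to 0 and to 1, store the two resulting 4-input functions in separate LUTs, and let the fifth input drive a 2:1 multiplexer between them. The sketch below illustrates that general technique only; it does not model the exact Spartan II combining circuitry, and the names lut4, read4, and make_5input are hypothetical.

```python
from itertools import product


def lut4(fn):
    """Return the 16-entry truth table of a 4-input Boolean function."""
    return [int(bool(fn(a, b, c, d))) for a, b, c, d in product((0, 1), repeat=4)]


def read4(table, a, b, c, d):
    """Read one entry of a 4-input truth table (a is the most significant bit)."""
    return table[(a << 3) | (b << 2) | (c << 1) | d]


def make_5input(fn5):
    """Implement a 5-input function with two 4-input LUTs and a 2:1 multiplexer."""
    low = lut4(lambda a, b, c, d: fn5(a, b, c, d, 0))    # cofactor for e = 0
    high = lut4(lambda a, b, c, d: fn5(a, b, c, d, 1))   # cofactor for e = 1

    def combined(a, b, c, d, e):
        # The fifth input selects which LUT output is passed through.
        return read4(high, a, b, c, d) if e else read4(low, a, b, c, d)

    return combined


if __name__ == "__main__":
    majority5 = lambda *bits: sum(bits) >= 3
    f = make_5input(majority5)
    print(f(1, 1, 0, 0, 1), f(1, 0, 0, 0, 1))
```

Applying the same split once more, with four LUTs and a second level of multiplexing, gives a six-input function.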
Spartan II CLB
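Example: Two Bit Adder –Made of full adders (FA): A + B = D. –Logic synthesis tool reduces the circuit to SOP form:
$C_o = \bar{A} B C_i + A \bar{B} C_i + A B \bar{C_i} + A B C_i$
$S = \bar{A} \bar{B} C_i + \bar{A} B \bar{C_i} + A \bar{B} \bar{C_i} + A B C_i$
[Figure: full adder with inputs A, B, C_i and outputs S, C_o; two LUTs, each with inputs A, B, C_i, producing S and C_o]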
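The sum-of-products form of a full adder can be checked exhaustively, since there are only eight input combinations. The sketch below writes out the four minterms of each output (carry-out is 1 for inputs 011, 101, 110, 111; sum is 1 for inputs 001, 010, 100, 111) and chains two full adders into a two-bit adder. Function names are hypothetical; complement is written as 1 - x on single bits.

```python
from itertools import product


def carry_out_sop(a, b, ci):
    """Carry-out as the OR of its four minterms (inputs 011, 101, 110, 111)."""
    na, nb, nci = 1 - a, 1 - b, 1 - ci
    return (na & b & ci) | (a & nb & ci) | (a & b & nci) | (a & b & ci)


def sum_sop(a, b, ci):
    """Sum as the OR of its four minterms (inputs 001, 010, 100, 111)."""
    na, nb, nci = 1 - a, 1 - b, 1 - ci
    return (na & nb & ci) | (na & b & nci) | (a & nb & nci) | (a & b & ci)


def full_adder(a, b, ci):
    """Return (sum, carry_out) for one bit position."""
    return sum_sop(a, b, ci), carry_out_sop(a, b, ci)


def add2(a, b):
    """Two-bit adder built from two chained full adders; returns a 3-bit result."""
    s0, c0 = full_adder(a & 1, b & 1, 0)
    s1, c1 = full_adder((a >> 1) & 1, (b >> 1) & 1, c0)
    return (c1 << 2) | (s1 << 1) | s0


if __name__ == "__main__":
    for a, b, ci in product((0, 1), repeat=3):
        print(a, b, ci, "->", full_adder(a, b, ci))
```

Each of the two outputs is a function of three inputs, so each fits in a single LUT with one input to spare.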
Circuit Compilation 1. Technology Mapping. 2. Placement: assign a logical LUT to a physical location. 3. Routing: select wire segments and switches for interconnection.
Processor + FPGA Three possibilities: 1. FPGA serves as coprocessor for data-intensive applications. [Figure: processor chip and FPGA daughtercard on a backplane bus (e.g. PCI)] 2. FPGA serves as embedded computer for low-latency transfer: “Reconfigurable Functional Unit”. [Figure: FPGA chip and processor]
Processor + FPGA (cont.) 3. Processor integration –FPGA logic embedded inside processor. –A number of problems with 2 and 3: process technology is an issue; ALU much faster than FPGA generally; FPGA much faster than the entire processor. [Figure: processor containing RF, ALU, and FPGA]
Multi-FPGA Systems –Most applications don’t fit on one device. –Creates a need for partitioning designs across many devices. –Effectively a “netlist computer”: each FPGA is a logic processor interconnected in a given topology. [Figure: nine interconnected FPGAs (F)]
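The placement step can be illustrated with a toy cost function. The sketch below assigns logical LUTs to physical sites and scores each assignment by total Manhattan wire length, trying every assignment exhaustively. This is an illustration under simplifying assumptions (two-pin nets, no routing congestion, a tiny array); exhaustive search does not scale, and practical tools rely on heuristics instead. The netlist, site list, and function names are all hypothetical.

```python
from itertools import permutations


def wirelength(placement, nets):
    """Total Manhattan distance over all two-pin nets.

    placement maps a LUT name to an (x, y) site;
    nets is a list of (driver, sink) pairs.
    """
    total = 0
    for src, dst in nets:
        (x1, y1), (x2, y2) = placement[src], placement[dst]
        total += abs(x1 - x2) + abs(y1 - y2)
    return total


def best_placement(luts, sites, nets):
    """Try every assignment of LUTs to sites and keep the shortest one."""
    best = None
    for perm in permutations(sites, len(luts)):
        candidate = dict(zip(luts, perm))
        cost = wirelength(candidate, nets)
        if best is None or cost < best[0]:
            best = (cost, candidate)
    return best


# Hypothetical netlist: a chain of four LUTs on a 2x2 array of sites.
LUTS = ["L0", "L1", "L2", "L3"]
SITES = [(0, 0), (0, 1), (1, 0), (1, 1)]
NETS = [("L0", "L1"), ("L1", "L2"), ("L2", "L3")]

if __name__ == "__main__":
    cost, placement = best_placement(LUTS, SITES, NETS)
    print(cost, placement)
```

Routing would then pick concrete wire segments and switches for each net, given the sites chosen here.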
Xilinx XC4000 Cell –2 4-input look-up tables –1 3-input look-up table –2 D flip flops
Altera Flex10K
Xilinx Virtex CLB
Reconfiguration Reconfiguration methodology Static Partially static (=partial reconfiguration) Dynamic
The Design Process 1. Partition a program into sections to be implemented in hardware and in software separately. 2. Synthesize the computations destined for reconfigurable hardware into a gate-level or circuit-level description. 3. Map the circuit onto reconfigurable blocks and connect them using reconfigurable routing. 4. After compilation, the circuit is ready for configuration onto the hardware at runtime.
RC Objectives RC objectives: specialization, performance, flexibility. Basic idea: “Programmable Hardware” –Specialization –Performance –Power consumption –Flexibility –Programming
Routing strategies [Figure: blocks A, B, C connected with Continuous Routing and with Structured Routing]
Xilinx XC4000 Routing
Reconfigurable Instruction Set Processors By including reconfigurability we can increase flexibility with high specialization. [Figure: Processor, PLD, Reconfigurable Processor]
Coprocessor based approach vs. ASIP based approach [Figure: coprocessor based approach with Task 1 ... Task K in software and Task K+1 ... Task N in hardware; ASIP based approach with Task 1, Task 2, ... Task N each spanning software and hardware]
Coprocessor based approach (I) Typical example: CPU + PCI board –Altera ARC-PCI –Compaq Pamette. System on Chip (SoC) –Altera's Excalibur device –Chameleon Systems, Inc.
Coprocessor based approach (II) Altera ARC-PCI
Coprocessor based approach (III) Compaq Pamette
Coprocessor based approach (IV) Altera's Excalibur device –Embedded Processor: ARM, MIPS or NIOS
Coprocessor based approach (V) Chameleon Systems, Inc.
ASIP based approach (I) Reconfigurable unit within CPU [Figure: Fetch, Decode, Issue stages with Integer Unit, FP Unit, Branch Unit, LD/ST Unit, and Reconfigurable Unit]
ASIP based approach (II) Challenge: CAD tools [Figure: C Code, Compiler, Assembly Code, Instruction Description (Configuration bits)]
ASIP based approach (III) Compiler Structure [Figure: C Code; stages C Parsing, Optimizations, Inst. Identification, Inst. Selection, Config. Scheduling, Code Generation; Hardware Estimator; Hardware Generation; outputs Assembly Code and Configuration bits]
ASIP based approach (IV) Example: Philips ConCISe Architecture [Figure: Encoded Instruction Word, Register File, ALU, RFU, MUX]
Why Compute With FPGAs? Huge performance gap between software and hand-designed hardware systems –Often 100-to-1 ratio of performance or performance/area Hardware systems not so good for general computing –Big design, cost barriers to implementation –Not practical to buy a new machine every time you want to run a different program Reconfigurable systems offer best-of-both-worlds –Run-time programmability –Hardware-level performance
Good Applications for Reconfigurable Computing Relatively small application graph –FPGAs have limited capacity –Simple control flow helps a lot Data Parallelism –Execute same computations on many independent data elements –Pipeline computations through the hardware Small and/or varying bit widths –Take advantage of the ability to customize the size of operators
Reconfigurable Computing Successes RSA Decryption –Programmable-Active-Memory machine set record for decryption of RSA-encrypted data DNA Sequence Matching –Reconfigurable hardware has achieved 100x better performance than contemporary supercomputers Signal Processing –FPGA-based filters often get 10x better performance than DSP chips –Benefit from customization of hardware to the application Emulation –Use reconfigurable logic to simulate new processors at high speeds Cryptographic Attacks –High-performance low-cost implementations for breaking encryption algorithms
FPGAs vs CPUs Capacity: instructions are a very dense representation, logic blocks aren’t. Tools: compilers for reconfigurable logic aren’t very good –Some operations are hard to implement on FPGAs. One approach to capacity is to exploit the 90/10 rule of software –Run the 90% of code that takes 10% of execution time on a conventional processor –Run the 10% of code that takes 90% of execution time on reconfigurable logic. Programmable-reconfigurable processors
Fine-Grained System: Chimaera Treat reconfigurable array as ALU within superscalar –Array implements some number of custom instructions for each program –Register file is the interface between programmable and reconfigurable logic
Chimaera Programmed in C –Instruction combining –Control localization –SIMD Within a Register. Simulation Studies –Example applications only require 8 RFUOPs in the reconfigurable array –Equivalent to 32 rows in RFU. Performance Results –Vary strongly from application to application –Also dependent on model used for RFU delay –Average speedup of 20-30%; one application sees >2x improvement
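The payoff of the 90/10 split can be estimated with Amdahl's law: if a fraction f of execution time is accelerated by a factor s, the overall speedup is 1 / ((1 - f) + f / s). The sketch below applies it to f = 0.9. The hardware speedup values are hypothetical inputs, not measurements, and the function name overall_speedup is invented for illustration.

```python
def overall_speedup(hot_fraction, hot_speedup):
    """Amdahl's law: overall speedup when only part of the run time is accelerated.

    hot_fraction: share of execution time moved to reconfigurable logic (0..1).
    hot_speedup:  how much faster that share runs in hardware (> 0).
    """
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    if hot_speedup <= 0:
        raise ValueError("hot_speedup must be positive")
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / hot_speedup)


if __name__ == "__main__":
    # Hypothetical hardware speedups for the 10% of code that takes 90% of the time.
    for s in (2, 10, 100):
        print(f"hardware {s:>3}x -> overall {overall_speedup(0.9, s):.2f}x")
```

However fast the reconfigurable logic becomes, the 10% of time left on the processor caps the overall gain at 10x.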
Coarse-Grained System: Garp Small programmable processor with large reconfigurable array –Interface through memory system
Garp Again, Programmed in C –Compiler attempts to map loop nests onto the reconfigurable array Data Encryption Standard –Estimate 24x speedup over UltraSPARC Image Dithering –9x Speedup Sorting –2x Speedup
Advantages of RC Relative to microprocessors:
–On average, a higher percentage of peak (or raw) computational density is achieved with reconfigurable devices.
–Fine-grain flexibility leads to exploitation of problem-specific parallelism at many levels. Also, many different computation models (or patterns) can be supported. In general, it is possible to match problem characteristics to hardware, through the use of problem-specific architectures and low-level circuit specialization.
–Spatial mapping of computation, versus multiplexing of function units (as in processors), relieves pressure for memory capacity, bandwidth, and low-latency and local communication patterns.
–Modern FPGAs make good system-level components: a relatively large number of I/Os (many parallel memory ports) and high-bandwidth communications. Machines based on these components can easily scale peak performance by riding Moore’s curve (FPGAs are process drivers). Low-level redundancy permits fault tolerance and great cost savings. Built-in microprocessors.
Is there still room for research in novel devices for RC?
Advantages of RC Even in an application with fixed algorithms, reconfigurable devices may offer advantages over a full-custom or ASIC approach:
–FPGAs are process drivers, therefore a generation ahead of ASIC.
–Increasing NREs for ASIC and full-custom have pushed the "cross-over" point way out.
–Time-to-market advantage.
–Programmability leads to project risk management and extended product lifetimes.
–Dynamic reconfiguration might permit even higher efficiency through hardware sharing (multiplexing) and on-the-fly circuit specialization. Largely unexploited (unproven) to date. A few research projects have explored this idea.
RC Disadvantages Reconfiguration time might be critical in run-time reconfigurable systems. Low utilization of hardware resources in configurable systems.
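Whether reconfiguration time is critical depends on how much work each configuration performs before it is replaced. A simple break-even model, sketched below, compares staying in software against paying a one-time reconfiguration cost and then running each item in hardware. The model and all numbers are hypothetical; the function name break_even_items is invented for this sketch.

```python
import math


def break_even_items(t_reconfig, t_sw_per_item, t_hw_per_item):
    """Smallest item count for which reconfigure-then-run-in-hardware
    is no slower than running everything in software.

    All times must use the same unit. Returns None when hardware is not
    faster per item, since reconfiguring can then never pay off.
    """
    gain_per_item = t_sw_per_item - t_hw_per_item
    if gain_per_item <= 0:
        return None
    return math.ceil(t_reconfig / gain_per_item)


if __name__ == "__main__":
    # Hypothetical figures in nanoseconds: 10 ms to reconfigure,
    # 1000 ns per item in software, 100 ns per item in hardware.
    print(break_even_items(10_000_000, 1000, 100))
```

With these made-up figures a 10x per-item speedup only pays off after roughly eleven thousand items, which is why short-lived configurations can erase the benefit of run-time reconfiguration.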
FPGAs are Reconfigurable 1. Commercial applications have not taken advantage of reconfigurability. Xilinx/Altera haven’t done much to help. Methodologies/tools nearly nonexistent. 2. Volume/cost graphs don’t accurately capture the potential real costs and other advantages. Reconfiguration uses: –Field upgrades, product life extension, changing requirements. –In-system board-level testing and field diagnostics. –Tolerance to manufacturing faults. –Risk management in system development. –Runtime reconfiguration: higher silicon efficiency. Time-multiplexed pre-designed circuits make maximum use of resources. –Runtime specialized circuit generation.
Silicon Usage
Performance: ~10x speedup. Efficiency: ~10x. Lower chip costs: ~0.5x (increased yield). Decreased complexity. Decreased design cost.