An Introduction to Reconfigurable Computing Mitch Sukalski and Craig Ulmer Dean R&D Seminar 11 December 2003
Reconfigurable Computing… is computation on a platform with reconfigurable (i.e., modifiable at run-time) hardware capable of implementing application-specific algorithms and functionality on demand.
Computing Spectrum Software: General-Purpose CPU –Easily reprogrammed –Low cost –Fundamental bottlenecks Hardware: Application-Specific Integrated Circuit (ASIC) –Not modifiable –High cost –Extremely fast Soft-Hardware: Field Programmable Gate Arrays (FPGAs) –Reconfigurable hardware –Medium cost –Speedup potential [Figures: CPU pipeline with Fetch, Decode, Registers, Execute, Memory, and Writeback stages; fixed-function datapath circuit producing a result]
History –1945: Eckert, Mauchly, von Neumann: ENIAC –1945: “von Neumann architecture” –1960: Estrin: Fixed+Variable Structure Computer –1970’s: Simple PLDs –1985: Xilinx introduces first FPGA –1990’s: Custom Computing Machines (CCMs) –1999: FPGAs exceed million logic gates –2002: FPGAs include complex cores [Figures: ENIAC (connecting computational blocks for an algorithm); Fixed+Variable CPU (users can attach new computational circuits to a fixed ALU); the Teramac CCM (multi-chip module of FPGAs); Xilinx Virtex FPGA; Xilinx Virtex II Pro (image courtesy of rapidio.org)]
Reconfigurable Computing in Modern HPC Stand-alone platforms –OctigaBay 12K –SRC-6 –Starbridge Hypercomputer Accelerator cards –Timelogic’s DeCypher –Nallatech’s BenNUEY –Annapolis Micro Systems WILDSTAR II
Example: Computational Fluid Dynamics William Smith & Austars Schnore at GE Global Research From: “Towards an RCC-based Accelerator for Computational Fluid Dynamics,” ERSA 2003
And now for some details… Field Programmable Gate Arrays (FPGAs) Common RC design techniques Reported examples
Field-Programmable Gate Arrays (FPGAs) FPGAs emulate digital logic circuitry –Large array of configurable logic blocks –Internal routing through programmable interconnection network FPGAs hold hardware configuration in SRAM –Change the digital circuitry by loading new configuration Design approach: –User designs in hardware description language –Synthesis tools translate to logic gates –Mapping tools target specific FPGA
Simplified Logic Block Emulates logic function –Thousands per chip Lookup Table (LUT) –Holds truth table –Inputs produce outputs 1-bit registers –Hold data between cycles Note: Greatly simplified [Figure: logic block containing a LUT and a register]
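A minimal software sketch of the simplified logic block described above: a lookup table holding a truth table, followed by a 1-bit register. The class name `LogicBlock` and its methods are hypothetical, chosen only for illustration; real logic blocks contain more than this.

```python
class LogicBlock:
    """Greatly simplified logic block: one LUT feeding a 1-bit register."""

    def __init__(self, truth_table):
        # truth_table[i] is the output bit for the input pattern whose
        # binary value is i (first input is the most significant bit).
        n = len(truth_table)
        if n < 2 or n & (n - 1):
            raise ValueError("truth table length must be a power of two")
        self.truth_table = list(truth_table)
        self.num_inputs = n.bit_length() - 1
        self.register = 0  # 1-bit register, holds data between cycles

    def lookup(self, *inputs):
        """Combinational path: inputs select one entry of the truth table."""
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        index = 0
        for bit in inputs:
            index = (index << 1) | (bit & 1)
        return self.truth_table[index]

    def clock(self, *inputs):
        """Clock edge: capture the LUT output in the register."""
        self.register = self.lookup(*inputs)
        return self.register


# "Configuring" the block means loading a different truth table.
xor_block = LogicBlock([0, 1, 1, 0])   # 2-input XOR
and_block = LogicBlock([0, 0, 0, 1])   # 2-input AND
```

The same block implements XOR or AND depending only on the table it is loaded with, which is the sense in which an FPGA "emulates" logic.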
LUT Example: 1-bit Adder Truth table columns: A, B, C in (inputs); C out, Sum (outputs) [Figure: logic block with register and LUTs; LUT inputs A, B, C, 0; outputs C out and Sum]
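The 1-bit adder can be sketched as two lookup tables, one for Sum and one for C out, each indexed by A, B, and C in. The helper names `build_lut`, `lut_read`, and `full_adder` are hypothetical; the truth tables themselves are the standard full-adder tables.

```python
from itertools import product


def build_lut(func, num_inputs):
    """Fill a truth table by evaluating func on every input pattern."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=num_inputs)]


def lut_read(table, *bits):
    """Use the input bits (first bit most significant) as the table index."""
    index = 0
    for b in bits:
        index = (index << 1) | (b & 1)
    return table[index]


# One LUT per output of the 1-bit adder.
SUM_LUT = build_lut(lambda a, b, cin: a ^ b ^ cin, 3)
COUT_LUT = build_lut(lambda a, b, cin: (a & b) | (a & cin) | (b & cin), 3)


def full_adder(a, b, cin):
    """Return (cout, sum) by table lookup only; no arithmetic at run time."""
    return lut_read(COUT_LUT, a, b, cin), lut_read(SUM_LUT, a, b, cin)


def print_truth_table():
    print("A B Cin | Cout Sum")
    for a, b, cin in product((0, 1), repeat=3):
        cout, s = full_adder(a, b, cin)
        print(f"{a} {b}  {cin}  |  {cout}    {s}")
```

Calling `print_truth_table()` lists all eight input patterns with their C out and Sum bits.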
Routing Data between Logic Blocks Need to connect logic blocks Wires and Switchboxes –LBs connect to local wires –Switchboxes route long connections Routing set at compile time –Performed by tools
Reconfiguration Modern FPGAs are SRAM based –Can be loaded with new circuitry Full reconfiguration –Few megabytes of configuration –Milliseconds Partial reconfiguration –Reprogram only a portion of chip –Reduces configuration time –Non-trivial, poorly supported [Figure: FPGA loaded from a full configuration image vs. a partial configuration image]
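A back-of-the-envelope sketch of why partial reconfiguration reduces configuration time: load time scales with image size divided by configuration-port bandwidth. The image size (2 MiB), the port rate (50 MB/s), and the one-eighth partial region are all assumed values for illustration, not figures from any particular device.

```python
def config_time_ms(image_bytes, port_bytes_per_second):
    """Estimated time to load a configuration image, in milliseconds."""
    return 1000.0 * image_bytes / port_bytes_per_second


# Assumed, illustrative numbers (not taken from a datasheet).
FULL_IMAGE_BYTES = 2 * 1024 * 1024       # "few megabytes" full image
PARTIAL_IMAGE_BYTES = FULL_IMAGE_BYTES // 8  # reprogram one eighth of the chip
PORT_RATE = 50_000_000                   # hypothetical 50 MB/s config port

full_ms = config_time_ms(FULL_IMAGE_BYTES, PORT_RATE)
partial_ms = config_time_ms(PARTIAL_IMAGE_BYTES, PORT_RATE)

print(f"full:    {full_ms:.1f} ms")
print(f"partial: {partial_ms:.1f} ms")
```

Under these assumptions both loads land in the millisecond range, with the partial image proportionally faster.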
Design Techniques Digital logic design techniques for exploiting FPGAs
FPGAs as Computational Accelerators Use FPGAs as soft-hardware –Port algorithm to hardware –Run inside FPGA –Reuse hardware Techniques –Concurrency, memory, partial evaluation
1. Concurrency Load FPGA with multiple computational circuits –Hardware state machines are like threads, but… –All tasks are always running Raw parallelism –Units run in parallel –Example: Key breaking Pipelining –Chain units together in series –Example: Streaming computations, data-flow
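Pipelining can be sketched as a cycle-by-cycle simulation in which every stage does work on every clock tick. The function `run_pipeline` and the three example stages are hypothetical; the sketch assumes one register between stages and uses `None` to mark an empty slot, so stage functions must not return `None`.

```python
def run_pipeline(stages, inputs):
    """Simulate units chained in series; returns (outputs, clock cycles)."""
    depth = len(stages)
    regs = [None] * depth          # regs[i] holds the latched output of stage i
    feed = list(inputs)
    outputs = []
    pos = 0
    cycles = 0
    while len(outputs) < len(feed):
        # All stages compute from the *current* register values at once.
        new_regs = [None] * depth
        for i, stage in enumerate(stages):
            if i == 0:
                operand = feed[pos] if pos < len(feed) else None
            else:
                operand = regs[i - 1]
            new_regs[i] = stage(operand) if operand is not None else None
        if pos < len(feed):
            pos += 1
        regs = new_regs            # clock edge: every register updates together
        cycles += 1
        if regs[-1] is not None:
            outputs.append(regs[-1])
    return outputs, cycles


# Three hypothetical units computing (x + 1) * 2 - 3 on a stream.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
results, cycles = run_pipeline(stages, range(5))
```

After the pipeline fills, one result emerges per cycle: five inputs through three stages take 5 + 3 - 1 = 7 cycles here, versus 15 stage-steps if each input ran through all stages before the next one started.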
2. Custom Memory Interactions Most FPGA cards have multiple memory banks –Fetch/store multiple data values at same time –Predictable performance (as opposed to caches) –Hide address generation [Figure: FPGA connected to SRAM Banks 0 through 4]
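A small model of why multiple banks help: if each operand array lives in its own bank and each bank serves one read per cycle, a circuit needing four operands per result can produce one result per cycle. The class `SramBank`, the functions below, and the one-read-per-bank-per-cycle timing are assumptions made for this sketch.

```python
class SramBank:
    """One independent memory bank; assumed to serve one read per cycle."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0

    def read(self, addr):
        self.reads += 1
        return self.data[addr]


def multi_bank_compute(bank_a, bank_b, bank_c, bank_d, n):
    """Compute a*b + c*d for n elements, reading all four banks each cycle."""
    results = []
    cycles = 0
    for addr in range(n):          # a hardware counter generates the address
        a = bank_a.read(addr)      # these four reads happen in the same cycle
        b = bank_b.read(addr)
        c = bank_c.read(addr)
        d = bank_d.read(addr)
        results.append(a * b + c * d)
        cycles += 1
    return results, cycles


def single_bank_cycles(n, operands_per_result=4):
    """Cycles needed if all operands shared one bank (one read per cycle)."""
    return n * operands_per_result


banks = [SramBank(range(k, k + 8)) for k in (0, 10, 20, 30)]
values, cycles = multi_bank_compute(*banks, 8)
```

With eight elements the four-bank version finishes in 8 cycles, where a single shared bank would need 32 cycles for the same reads.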
3. Partial Evaluation Know data constants at design time –Apply to circuits and reduce hardware –Synthesis tools perform automatically Note: FPGAs are unique because we can easily generate new, optimized hardware configurations for each set of constants. Example: 4-bit Ripple-Carry Adder
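The 4-bit ripple-carry adder example can be sketched as follows: when one operand is a constant known at design time, each full-adder cell collapses to a wire, an inverter, or a two-gate cell. The function `specialize_adder`, the cell names, and the gate tally (5 two-input gates per general full adder, XNOR counted as one gate) are illustrative assumptions, not what any particular synthesis tool reports.

```python
def ripple_add(a, b, width=4):
    """General ripple-carry adder: one full adder per bit."""
    carry, result = 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (ai & carry) | (bi & carry)
    return result, carry


def specialize_adder(b_const, width=4):
    """Partially evaluate the adder for a fixed b; return (add_fn, gates)."""
    cells, gates = [], 0
    known_carry = 0                 # constant carry-in, or None once it varies
    for i in range(width):
        bi = (b_const >> i) & 1
        if known_carry is not None:
            ones = bi + known_carry
            if ones == 1:           # sum = NOT a, carry-out = a
                cells.append("not")
                gates += 1
                known_carry = None
            else:                   # sum = a, carry-out is the constant bi
                cells.append("wire1" if ones == 2 else "wire0")
                known_carry = 1 if ones == 2 else 0
        else:                       # carry varies: XOR+AND or XNOR+OR
            cells.append("half_b1" if bi else "half_b0")
            gates += 2

    def add(a):
        carry, result = 0, 0
        for i, kind in enumerate(cells):
            ai = (a >> i) & 1
            if kind == "wire0":
                s, carry = ai, 0
            elif kind == "wire1":
                s, carry = ai, 1
            elif kind == "not":
                s, carry = ai ^ 1, ai
            elif kind == "half_b0":
                s, carry = ai ^ carry, ai & carry
            else:                   # half_b1
                s, carry = (ai ^ carry) ^ 1, ai | carry
            result |= s << i
        return result, carry

    return add, gates


GENERAL_GATES = 4 * 5               # assumed 5 gates per full adder
add_one, gates_for_one = specialize_adder(1)
add_eight, gates_for_eight = specialize_adder(8)
```

Under this tally the general adder uses 20 gates, the "add 1" circuit uses 7, and the "add 8" circuit uses 1, while producing the same sums and carries as the general adder.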
RC Performance Examples CFD: 23 GFLOPS sustained –“Towards an RCC-based Accelerator for Computational Fluid Dynamics,” Smith & Schnore, 2003 Adaptive beamforming: 20 GFLOPS –Parallel systolic array architecture –“20 GFLOPS QR processor on a Xilinx Virtex-E FPGA,” Walke et al., 2000 Real-time holographic video display at 30 fps –“Using field programmable gate arrays to scale up the speed of holographic video computation,” Nwodoh
In Summary Reconfigurable computing uses FPGAs to emulate application-specific hardware –Achieve performance gains with dedicated hardware It is possible to implement just about any kind of digital hardware in the FPGA. –Limited by capacity and effort –Resurrect application-specific hardware architectures –SIMD, MIMD, Systolic Processor Arrays, Data-Flow…