FPGA Boards and FPGA-based Supercomputers
Resources
- PCI: http://en.wikipedia.org/wiki/Peripheral_Component_Interconnect
- PCI-X: http://en.wikipedia.org/wiki/PCI-X
- Reconfigurable Supercomputing: T. El-Ghazawi, K. Gaj, D. Buell, D. Pointer, tutorial at the Supercomputing 2005 conference, http://hpcl.seas.gwu.edu/openfpga/tutorial_html/index.html
FPGA Device Capacity Trends
Xilinx device complexity (device: maximum clock, capacity):
- Virtex-5: 550 MHz, 24M gates*
- Virtex-4: 500 MHz, 16M gates*
- Virtex-II Pro: 450 MHz, 8M gates*
- Virtex-II: 450 MHz, 8M gates
- Spartan-3: 326 MHz, 5M gates
- Virtex-E: 240 MHz, 4M gates
- Virtex: 200 MHz, 1M gates
- XC4000: 100 MHz, 250K gates
- Spartan-II: 200 MHz, 200K gates
- Spartan: 80 MHz, 40K gates
- XC5200: 50 MHz, 23K gates
- XC3000: 85 MHz, 7.5K gates
- XC2000: 50 MHz, 1K gates
Year axis: 1985, 1987, 1991, 1995, 1998, 1999, 2000, 2002, 2003, 2004, 2006
Source: http://class.ece.iastate.edu/cpre583/lectures/Lect-01.ppt
FPGA Boards
General Architecture of an FPGA-Based Board
[Figure: block diagram. Components shown: processing elements PE#0, PE#1, ..., PE#N-1; local memory; common memory / interconnect network; CLK; bus interface controller attached to the bus; I/O card]
Reconfigurable Computing Boards (Accelerators)
- Boards may have one or several interconnected FPGA chips
- Boards support different bus standards, e.g. PCI, PCI-X, VME
- Boards may have direct real-time data I/O through a daughter board
- Boards may have local onboard memory (OBM) to handle large data while avoiding the system bus (e.g. PCI) bottleneck
Reconfigurable Computing Boards (Accelerators) (cont.)
- Many boards per node can be supported
- A host program (e.g. in C) interfaces the user (and the μP) with the board via a board API
- Driver API functions may include Reset, Open, Close, Set Clocks, DMA, Read, Write, Download Configurations, Interrupt, Readback
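To make the host-program role concrete, here is a minimal sketch of the typical call sequence (open, reset, set clock, download configuration, write, run, read, close). `MockBoard` is a purely hypothetical in-memory stand-in, not any vendor's API. The method names loosely follow the driver functions listed on the slide. The `run` method and the idea of modeling a configuration as a Python function are assumptions made only for illustration.

```python
class MockBoard:
    """Hypothetical simulated board; no real driver or hardware is involved."""

    def __init__(self, obm_words=1024):
        self.is_open = False
        self.clock_mhz = None
        self.kernel = None             # stands in for a downloaded configuration
        self.obm = [0] * obm_words     # local onboard memory (OBM)

    def _require_open(self):
        if not self.is_open:
            raise RuntimeError("board is not open")

    def open(self):
        self.is_open = True

    def close(self):
        self.is_open = False

    def reset(self):
        self._require_open()
        self.kernel = None
        self.obm = [0] * len(self.obm)

    def set_clock(self, mhz):
        self._require_open()
        self.clock_mhz = mhz

    def download_configuration(self, kernel):
        self._require_open()
        self.kernel = kernel

    def write(self, offset, data):
        self._require_open()
        if offset < 0 or offset + len(data) > len(self.obm):
            raise ValueError("write outside onboard memory")
        self.obm[offset:offset + len(data)] = data

    def run(self, offset, length):
        # Assumed helper: apply the configured kernel in place to a region of OBM.
        self._require_open()
        if self.kernel is None:
            raise RuntimeError("no configuration downloaded")
        for i in range(offset, offset + length):
            self.obm[i] = self.kernel(self.obm[i])

    def read(self, offset, length):
        self._require_open()
        return self.obm[offset:offset + length]


def host_program(data):
    """Drive the simulated board through one complete offload cycle."""
    board = MockBoard()
    board.open()
    try:
        board.reset()
        board.set_clock(100)                            # arbitrary example value
        board.download_configuration(lambda x: 2 * x)   # toy "bitstream"
        board.write(0, data)
        board.run(0, len(data))
        return board.read(0, len(data))
    finally:
        board.close()                                   # always release the board


if __name__ == "__main__":
    print(host_program([1, 2, 3]))
```

Keeping the data in onboard memory between `write` and `read` mirrors the point about OBM: the bus is crossed once in each direction rather than once per operation.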
Common Interface - PCI
PCI = Peripheral Component Interconnect
[Figure labels: 64-bit bus, 32-bit bus]
PCI - Conventional hardware specifications
- 32-bit or 64-bit bus width
- 33.33 MHz clock with synchronous transfers
- Peak transfer rate of 133 MB/s for 32-bit bus width (33.33 MHz × 32 bits × (1 byte ÷ 8 bits) = 133 MB/s)
- Peak transfer rate of 266 MB/s for 64-bit bus width
- 32-bit address space (4 gigabytes)
- 32-bit port space (now deprecated)
- 5-volt signaling
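The peak-rate arithmetic for conventional PCI can be checked with a few lines of Python. This is only a sketch of the formula quoted on the slide (clock × bus width ÷ 8). The function name is made up for illustration, and MB is taken to mean 10^6 bytes.

```python
def peak_transfer_rate_mb_s(clock_mhz, bus_width_bits):
    """Peak rate in MB/s, assuming one full-width transfer per clock cycle."""
    return clock_mhz * bus_width_bits / 8   # 8 bits per byte


if __name__ == "__main__":
    for width in (32, 64):
        rate = peak_transfer_rate_mb_s(33.33, width)
        print(f"{width}-bit PCI at 33.33 MHz: about {int(rate)} MB/s")
```

These are peak figures; sustained throughput on a shared bus is lower because of arbitration and protocol overhead.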
PCI-X (PCI eXtended)
- PCI-X doubles the bus width to 64 bits, revises the protocol, and increases the maximum signaling frequency to 133 MHz (peak transfer rate of about 1064 MB/s: 133 MHz × 64 bits × (1 byte ÷ 8 bits))
- PCI-X 2.0 permits a 266 MHz rate (peak transfer rate of about 2128 MB/s) as well as a 533 MHz rate, expands the configuration space to 4096 bytes, adds a 16-bit bus variant, and allows for 1.5-volt signaling
Some Reconfigurable Board Vendors
- ANNAPOLIS MICRO SYSTEMS, INC. (www.annapmicro.com)
- University of Southern California, USC/ISI (http://www.east.isi.edu)
- AMONTEC (www.amontec.com/chameleon.shtml)
- XESS Corporation (www.xess.com)
- CELOXICA (www.celoxica.com)
- CESYS (www.cesys.com)
- TRAQUAIR (www.traquair.com)
- SILICON SOFTWARE (www.silicon-software.com)
- COMPAQ (www.research.compaq.com/SRC/pamette/)
- ALPHA DATA (www.alpha-data.com)
- Associated Professional Systems (www.associatedpro.com)
- NALLATECH (www.nallatech.com)
Representative Example Boards From Annapolis Micro Systems (AMI) http://www.annapmicro.com & Nallatech http://www.nallatech.com
ZBT (zero bus turnaround) memory: no idle cycles are needed between read-to-write and write-to-read transitions
Source: [AMS02]
Source: [AMS02]
WILDSTAR™ II Pro Reproduced and displayed with permission
WILDSTAR™ II Pro
- QDR: up to 400 MHz (typically 133 MHz)
- Each chip has six banks of up to 8 MB per bank, i.e. up to 48 MB per chip
- Rocket I/O: 3.2 Gbps
- Differential I/O pairs run in parallel, so the aggregate speed depends on how many pairs are used and the clock at which they run
- Differential pairs have higher noise immunity
Reproduced and displayed with permission
Nallatech's BenNUEY-PCI-4E
Up to seven Virtex-II Pro FPGAs, six of which are for the DIME-II modular architecture; inter-card communication through Rapid I/O; all connected to PCI
Reconfigurable Supercomputers
Scalable Reconfigurable Systems
- Large numbers of reconfigurable processors and microprocessors
- Everything can be configured: functional units, interconnects, interfaces
- High level of scalability
- Suitable for a wide range of applications
- Everything can be reconfigured over and over at run time (Run-Time Reconfiguration) to suit the underlying applications
- Can be programmed easily by application scientists, at least as easily as conventional parallel computers
Early Reconfigurable Architecture
[Figure: a microprocessor system (μP, memory, ..., I/O) connected through an interface to a reconfigurable system (FPGA)]
Current Reconfigurable Architecture
[Figure: several μPs with μP memory and several FPGAs with FPGA memory, connected through shared memory and/or a NIC]
Possible Classes of Reconfigurable Supercomputers
In order of increasingly tight integration:
- Independent board design: μP 1 … μP N on a μP board, RP 1 … RP N on a separate RP board
- Joint board design: μP 1 … μP N and RP 1 … RP N on a joint μP/RP board
- μP inside of RP design: RP 1 … RP N, each containing a μP, on a joint μP/RP board
- RP inside of μP design: μP 1 … μP N, each containing an RP, on a joint μP/RP board
FPGA-based supercomputers
Machine: SRC 6 from SRC Computers; Cray XD1 from Cray; SGI Altix from SGI; SRC 7 from SRC Computers, Inc.
Released: 2002, 2005, 2006
How to choose the system that best suits your needs?
Typical users’ criteria:
1. Clock speed
2. Amount of memory
3. Cost of ownership
How to choose the system that best suits your needs?
Recommended users’ criteria:
1. Tools
- right level of abstraction
- ease of development & verification
- progress & backward compatibility
2. Libraries
- basic operations
- examples of full applications
3. Technical support
How to choose the system that best suits your needs?
Recommended users’ criteria (cont.):
4. Data bandwidth
[Figure: data paths among the reconfigurable processor system, the μP system, and external I/O devices]
How to choose the system that best suits your needs?
Recommended users’ criteria (cont.):
5. Scalability
- variable power and price
- efficient communication among the modules
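Recommended users’ criteria (cont.):
6. Transfer of control overhead
[Figure: μP/FPGA execution timelines comparing theoretical behavior with actual behavior; the actual timeline includes control transfer overhead between μP and FPGA segments. Axis: time]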
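The gap between theoretical and actual behavior can be estimated with a simple additive model. The sketch below assumes that μP and FPGA segments strictly alternate and that every FPGA segment costs two control transfers (μP to FPGA and back). The function name, the segment durations, and the overhead value are illustrative assumptions, not measurements from any system.

```python
def total_time(up_segments, fpga_segments, transfer_overhead=0.0):
    """Total execution time in arbitrary time units.

    up_segments / fpga_segments: durations of the work done on each side.
    transfer_overhead: cost of one hand-off of control between μP and FPGA.
    """
    handoffs = 2 * len(fpga_segments)   # one hand-off in, one hand-off out
    return sum(up_segments) + sum(fpga_segments) + handoffs * transfer_overhead


if __name__ == "__main__":
    up = [1.0, 1.0, 1.0]    # assumed μP segment durations
    fpga = [0.5, 0.5]       # assumed FPGA segment durations
    theoretical = total_time(up, fpga)
    actual = total_time(up, fpga, transfer_overhead=0.1)
    print(f"theoretical: {theoretical:.2f}, actual: {actual:.2f}")
    print(f"overhead share: {(actual - theoretical) / actual:.1%}")
```

Under this model the overhead grows with the number of hand-offs rather than with the amount of work, so many short FPGA calls are penalized more than a few long ones.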
7. Reconfiguration overhead
[Figure: three μP/FPGA timelines on a single FPGA, showing reconfigurations (Reconf A, Reconf B, Reconf C) interleaved with the tasks that need them (Task A, Task B, Task C)]
7. Reconfiguration overhead (cont.)
[Figure: timeline with a μP and two FPGAs (FPGA 1, FPGA 2), in the order Reconf A, Reconf B, Task A, Reconf C, Task B, Task C]
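The benefit of a second FPGA can be sketched with a small timeline model. The assumptions here are mine: tasks run strictly one after another, a configuration that is already loaded is not reloaded, and with two FPGAs the tasks alternate between devices so that the idle device can be reconfigured while the other one computes. All durations are made-up example values.

```python
def single_fpga_time(tasks):
    """tasks: list of (name, reconf_time, run_time), executed in order."""
    total, loaded = 0.0, None
    for name, reconf, run in tasks:
        if name != loaded:          # reconfigure only when the task changes
            total += reconf
            loaded = name
        total += run
    return total


def two_fpga_time(tasks):
    """Ping-pong between two FPGAs, hiding reconfiguration behind execution."""
    free_at = [0.0, 0.0]            # when each FPGA finished its last task
    loaded = [None, None]           # configuration currently on each FPGA
    prev_done = 0.0                 # completion time of the previous task
    for i, (name, reconf, run) in enumerate(tasks):
        f = i % 2
        ready = free_at[f] if loaded[f] == name else free_at[f] + reconf
        loaded[f] = name
        start = max(ready, prev_done)   # tasks still execute in program order
        prev_done = start + run
        free_at[f] = prev_done
    return prev_done


if __name__ == "__main__":
    tasks = [("A", 2.0, 3.0), ("B", 2.0, 3.0), ("C", 2.0, 3.0)]
    print("one FPGA :", single_fpga_time(tasks))
    print("two FPGAs:", two_fpga_time(tasks))
```

With these example numbers only the first reconfiguration remains on the critical path in the two-FPGA case. If a reconfiguration took longer than the task running on the other device, part of it would become visible again.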
Recommended users’ criteria (cont.):
8. Number of FPGAs & number of microprocessors
9. Clock speed
- maximum
- variable vs. fixed
10. Amount of memory
Programming Reconfigurable Computers
SRC Programming Model
- Microprocessor: main.c, written in ANSI C, calls function_1() and function_2()
- FPGA: function_1 and function_2, written in MAP C (a subset of ANSI C), call macros: macro_1(a, b, c), macro_2(b, d), macro_2(c, e) in function_1; macro_3(s, t), macro_1(n, b), macro_4(t, k) in function_2
- Libraries of macros (macro_1, macro_2, macro_3, macro_4, ...) are written in VHDL
[Figure: data flow for function_1: I/O → a → Macro_1 → b, c → two Macro_2 instances → d, e → I/O]
SRC Program Partitioning
- μP system: C function for μP (HLL)
- FPGA system: C function for MAP (HLL), VHDL macro (HDL)
SRC Compilation Process
- Application sources (.c or .f files) → μP Compiler → object .o files → Linker
- Application sources (.mc or .mf files) → MAP Compiler → .v files → Logic synthesis → netlists (.ngo files) → Place & Route → .bin files (configuration bitstreams) → Linker
- Macro sources (HDL sources, .vhd or .v files) → Logic synthesis → netlists (.ngo files)
- Linker → application executable
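One plausible reading of this flow can be written down as a small lookup table and traced per source type. The stage ordering below is an assumption reconstructed from the labels on the slide (compilers, logic synthesis, place & route, linker) and is not taken from SRC documentation. The artifact labels such as `"macro .vhd/.v"` are made-up keys used only to keep the table unambiguous.

```python
# Maps an artifact kind to (tool that consumes it, artifact kind produced).
SRC_FLOW = {
    ".c/.f": ("uP compiler", ".o"),
    ".mc/.mf": ("MAP compiler", ".v"),
    "macro .vhd/.v": ("logic synthesis", ".ngo"),
    ".v": ("logic synthesis", ".ngo"),
    ".ngo": ("place & route", ".bin"),
    ".o": ("linker", "executable"),
    ".bin": ("linker", "executable"),
}


def trace(source_kind):
    """Return the list of (tool, output) steps starting from a source kind."""
    steps = []
    kind = source_kind
    while kind in SRC_FLOW:
        tool, kind = SRC_FLOW[kind]
        steps.append((tool, kind))
    return steps


if __name__ == "__main__":
    for source in (".c/.f", ".mc/.mf", "macro .vhd/.v"):
        path = " -> ".join(f"{tool} [{out}]" for tool, out in trace(source))
        print(f"{source}: {path}")
```

Tracing the `.mc/.mf` path makes the main practical point visible: code destined for the FPGA passes through synthesis and place & route, which typically dominate build time, while the microprocessor side follows an ordinary compile-and-link path.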
Star Bridge Programming Environment - Viva
[Figure labels: Sheets, Library, Object]
Star Bridge Compilation Process
User input → Graphical User Interface (VIVA) → netlists (.ngo files) → Xilinx Place & Route → .bin files (configuration bitstreams) → application executable
Cray XD1 Programming Flows
- High-level flow: MATLAB/Simulink (The MathWorks) with Xilinx System Generator, or Mitrion-C with Mitrion synthesis. Example: int mask (a, m) { return (a & m); }
- Standard flow: VHDL or Verilog. Example: process (a, m) is begin z <= a and m; end process;
- Both flows continue through VHDL/Verilog synthesis (Mentor Graphics, Synopsys, Synplicity, Xilinx) to gate-level EDIF (inputs a, m; output z), then Xilinx Place & Route to the configuration bitstream
Source: [Cray, MAPLD05]
Xtreme DSP Design Flow
HDL-based SGI Altix Programming Flow
On the IA-32 Linux machine (with design iterations):
- Design Entry (Verilog, VHDL) → .v, .vhd
- Design Verification: Behavioral Simulation (VCS, Modelsim) on .v, .vhd
- Design Synthesis (Synplify Pro, Amplify): .v, .vhd → .edf
- Metadata Processing (Python) → .cfg
- Design Implementation (ISE) → .ncd, .pcf, .bin
- Static Timing Analysis (ISE Timing Analyzer) on .ncd, .pcf
On the Altix:
- Device Programming (RASC Abstraction Layer, Device Manager, Device Driver) with .cfg and .bin
- Real-time Verification (gdb) with the .c application
HLL-based SGI Altix Programming Flow
On the IA-32 Linux machine:
- HLL Design Entry (Handel-C, Mitrion C, Viva)
- RTL Generation and Integration with Core Services → .v, .vhd
- Design Verification: Behavioral Simulation (VCS, Modelsim) on .v, .vhd
- Design Synthesis (Synplify Pro, Amplify): .v, .vhd → .edf
- Metadata Processing (Python) → .cfg
- Design Implementation (ISE) → .ncd, .pcf, .bin
- Static Timing Analysis (ISE Timing Analyzer) on .ncd, .pcf
On the Altix:
- Device Programming (RASC Abstraction Layer, Device Manager, Device Driver) with .cfg and .bin
- Real-time Verification (gdb) with the .c application
Mitrion-C Programming Model for Cray & SGI
- Microprocessor: main.c, written in ANSI C based on the Mitrion API, calls function_1(in1), start_fpga(), function_1(in2), start_fpga()
- FPGA: application code written in Mitrion-C (platform independent) passes through the Mitrion Compiler & Configurator, which produces VHDL
- The application runs on the Mitrion Distributed Processor Architecture (platform dependent), with access to FPGA RAM and input & output (I/O)
Program Entry for FPGA Accelerator Boards
Entry methods: HLL, HDL, Graphical Data Flow Diagram
- Traditional: software in HLL, hardware in HDL
- Extended (e.g. Corefire): software in HLL, hardware as a graphical data flow diagram
Axes: increased productivity; increased capability to describe parallel execution
Program Entry for Reconfigurable Computers
Entry methods: HLL, HDL, Graphical Data Flow Diagram
- Star Bridge: software (COM objects), hardware (porting EDIF)
- SRC: software, hardware (HDL macros)
Axes: increased productivity; increased capability to describe parallel execution
Program Entry for Reconfigurable Computers (cont.)
Entry methods: HLL, HDL, Graphical Data Flow Diagram
- Cray XD1 with Simulink: software (Simulink), hardware (Xilinx System Generator)
- SGI or Cray with Mitrion: software (Mitrion Processor), hardware (Mitrion-C)
Axes: increased productivity; increased capability to describe parallel execution