Download presentation
Presentation is loading. Please wait.
Published byEverett Goodwin Modified over 8 years ago
1
Reconfigurable Computing Leveraging FPGA Accelerators in High-Performance Computing Applications Craig Ulmer cdulmer@sandia.gov June 2, 2005 Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. Craig Ulmer, Ryan Hilles, and David ThompsonSNL/CA Keith Underwood and K. Scott HemmertSNL/NM
2
What is Center 8900 doing with FPGAs? Computer Sciences & Information Technologies 8900 –High-Performance Computing –Viz & Scientific computing (8963) FPGA LDRD for accelerating HPC –NM:Computational aspects –CA:Communication aspects FPGA LDRD for network security –NM/CA: Smart routers Catalyst
3
Outline Reconfigurable Computing –Communication Aspects –Computational Aspects New Platforms: Cray XD1 Summary
4
In a nutshell.. Reconfigurable Computing –Use FPGAs as a means of accelerating simulations –Implement key computations in hardware: soft-hardware double doX( double *a, int n) { int i; double x; x=0; for(i=0;i<n;i+=3){ x+= a[i] * a[i+1] + a[i+2]; … } … return x; } * + + a[i]a[i+1] Z -1 a[i+2] SoftwareHardwareSynthesized Hardware FPGA CPU RC Platform
5
Trivial Example: Sorting Sort a matrix of 64b values Hardware implementation –Sorting element –Array of elements –Data transfer engine Exploits parallelism –Do n comparisons each clock –Requires 3n clock cycles Limited to n values –Use as building block Sorting Array Address Wr Req Rd Req DMA Engine Sorting Control Double-Buffered Outgoing Memory Reg > Sorting Element 11.532.994.188.7 79.434.539.254.2 24.354.14.235.1 14.83.019.791.4 17.581.39.431.3 94.188.732.911.5 79.454.239.234.5 54.135.124.34.2 91.419.714.83.0 81.331.317.59.4 unsortedsorted
6
Sorting Performance Xilinx Virtex II/Pro50-7 FPGA –128 elements –110 MHz Data transfer by FPGA –Read/write concurrency –Pinned memory 2-4x Rate of Quicksort(128) –Host is 2.2GHz AMD Opteron
7
RC: Communication Aspects
8
New FPGAs are “Network Ready” Multi-gigabit transceivers (MGTs) –V2Pro:twenty-four 3Gbps channels –V2ProX:twenty 10Gbps channels –InfiniBand, FC, GigE, 10GigE Multiple embedded processors –PowerPC 405 400MHz Direct connection to network –Build NI in hardware –Use PPC core for complex protocols –User-defined computational circuits Xilinx Virtex II/Pro-7 Network Interface PowerPC CPU User- Defined Circuits
9
TCP/GigE Network Interface for FPGAs Network interface (NI) in hardware –TCP Offload Engine(TOE) –Gigabit Ethernet(GigE) Refined TCP functionality –Slow start, NACK detection, Throttling… –User data rate to host: 600 Mb/s Outgoing TCP Message Control Rocket I/O TxRx Align Decode ARP Cache Framer MAC Header Control ARP Reply Ping ReplyIP Header Incoming Byte Stream Timeout Monitor CRC Outgoing Byte Stream Incoming TCP Message Control TOE GigE UnitSlicesV2P20 TOE2,06822% GigE1,10212%
10
Transport Framework for Visualization Need a framework for supporting Visualization in FPGAs –Want to perform viz tasks, not rendering –Built Chromium interface to transport OpenGL graphics Results –Functional system where visualization in FPGA, Rendering in Host –Saturate GigE link with OpenGL data Viz App GigE Network GigE TCP Offload Engine Chromium Packetizer FPGA
11
Network Intrusion Detection System (NIDS) Collaboration with Georgia Tech –GT:Pattern matching cores –SNL: GigE NI cores Network Intrusion Detection System –Examine packets & remove malicious –Based on Snort rule set (1,500 rules) Filter multiple GigE links transparently –Demonstrated for two links NI Malicious Packet Pattern Matching NI OKDrop Scheduler
12
RC: Computational Aspects
13
Floating Point Computations Floating point has been weakness for FPGAs –Complex to implement, consume many gates Keith Underwood and K. Scott Hemmert –Developed FP library at SNL/NM FP Library –Highly-optimized and compact –Single/Double Precision –Add, Multiply, Divide, Square Root –V2P50: 10-20 DP cores, 130-210 MHz
14
Example: FP Distance Computation Compute c=sqrt(a 2 + b 2 ) –Simple pipeline (88 stages) Implemented on Xilinx Virtex II/Pro 50 –159 MHz clock rate –39% of V2P50 –75 MiOperations/s Incoming SRAM Outgoing SRAM DMA Engine Incoming SRAM From Host To Host
15
New Platforms: Cray XD1
16
Cray XD1 Dense multiprocessor system –12 AMD Opterons on 6 blades –6 Xilinx Virtex-II/Pro FPGAs –InfiniBand-like interconnect –6 SATA hard drives –4 PCI-X slots –3U Rack
17
Cray XD1: Placing FPGAs Near Memory Blades use HyperTransport –High bandwidth I/O FPGA accelerator –Xilinx Virtex II/Pro 50 (-7) –HT connection: 1.6+1.6GB/s –10x performance of PCI First of a new breed –Commodity HT accelerators CPU Main Memory Main Memory NI SRAM Expansion Module SRAM FPGA NI 1.6 GB/s HT Processor Blade
18
Summary FPGAs are getting faster and less expensive –Finally have capacity for interesting problems –Million gate parts for $5 Large amount of reconfigurable computing work at SNL –Experience with Virtex II/Pro Platform FPGAs –Networks & embedded CPUs –Willing to share cores, work with others
19
Deliverables NetworkingVisualizationComputationsXD1General OpenGigE OpenTOE Raw GigE GigE Bridge NIDS Filter Isosurfacer Chromium Packetizer Sorting Array MD5 2D Distance DMA Engine Computation Framework Asynch. Message FIFOs Clock domain data pass PowerPC instantiation SNL/CA Cores 32b/64b Floating-PointComputations Add Multiply Divide Square root Vector/Matrix Multiply SNL/NM Cores cdulmer@sandia.gov http://cdulmer.ran.sandia.gov
20
FPGA Slices Relative V2P Price & Density V2P100 V2P70 V2P7 V2P40 Relative Price & Density
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.