Published by Gabriel Leonard. Modified over 9 years ago.
1
Reconfigurable Computing: A First Look at the Cray-XD1
Mitch Sukalski, David Thompson, Rob Armstrong, Curtis Janssen, and Matt Leininger
Orgs: 8961 & 8963
September 1, 2004
Craig Ulmer
2
Outline
Reconfigurable computing refresher
–Progress update
Cray XD1
–Architecture
–General message passing
–Reconfigurable Computing and the XD1
3
Reconfigurable Computing Update
4
Reconfigurable Computing
Use reconfigurable hardware devices to implement key computations in hardware

double doX(double *a, int n) {
    int i;
    double x;
    x = 0;
    for (i = 0; i < n; i += 3) {
        x += a[i] * a[i+1] + a[i+2];
        …
    }
    …
    return x;
}

[Figure: hardware dataflow for the loop body. Labels: a[i], a[i+1], a[i+2], *, +, +, Z^-1]
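The loop on this slide can be read as a multiply, an add, and an accumulate with a feedback register (the Z^-1 block). The following is a minimal software sketch of that reading, not the circuit from the talk. The names `do_x_reference`, `mac_datapath`, `mac_reset`, and `mac_step` are hypothetical, the elided parts of the slide's code are simply left out, and `n` is assumed to be a multiple of 3.

```c
#include <assert.h>
#include <stddef.h>

/* Software reference for the slide's kernel with the elided parts omitted.
 * Assumption: n is a multiple of 3, so every triple is complete. */
double do_x_reference(const double *a, size_t n)
{
    assert(n % 3 == 0);
    double x = 0.0;
    for (size_t i = 0; i < n; i += 3) {
        x += a[i] * a[i + 1] + a[i + 2];
    }
    return x;
}

/* Hypothetical model of the datapath: one multiplier, one adder, and an
 * accumulator whose register plays the role of the Z^-1 feedback block. */
typedef struct {
    double acc; /* value held in the feedback register */
} mac_datapath;

void mac_reset(mac_datapath *d)
{
    assert(d != NULL);
    d->acc = 0.0;
}

/* One step: consume a triple (a0, a1, a2) and update the register. */
void mac_step(mac_datapath *d, double a0, double a1, double a2)
{
    assert(d != NULL);
    double product = a0 * a1;   /* multiplier stage */
    double sum = product + a2;  /* first adder */
    d->acc = d->acc + sum;      /* second adder with register feedback */
}
```

Feeding the triples of an array through `mac_step` one at a time gives the same running sum as `do_x_reference`; in hardware each stage would be pipelined, which this sequential sketch does not model.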
5
First Year Progress
Computation (Underwood SNL/NM)
–Double-precision Floating Point Cores
Communication
–Multi-gigabit Transceiver (MGT) interface
–Gigabit Ethernet work
Early application experiments
–Simplified isosurfacing
–Networked pattern matching
6
Peak Floating-Point Performance

                 Single Precision                               Double Precision
Core             Speed    Cores per V2P100-6  Peak Performance  Speed    Cores per V2P100-6  Peak Performance
Addition         195 MHz  89                  17 GFLOPS         143 MHz  40                  5.7 GFLOPS
Multiplication   176 MHz  74                  13 GFLOPS         142 MHz  27                  3.8 GFLOPS
Division         120 MHz  22                  2.6 GFLOPS        98 MHz   6                   0.58 GFLOPS

From Underwood's "FPGAs vs. CPUs: Trends in Peak Floating-Point Performance," in FPGA'04
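The peak figures in the table are consistent with multiplying clock speed by the number of cores that fit on the device. A minimal sketch of that arithmetic follows; the function name `peak_gflops` is hypothetical, and the one-result-per-core-per-clock model is an assumption that matches the table rather than something the slide states.

```c
#include <assert.h>

/* Peak rate in GFLOPS, assuming every core delivers one result per clock.
 * clock_mhz: core clock in MHz; cores: number of cores on the device. */
double peak_gflops(double clock_mhz, int cores)
{
    assert(clock_mhz >= 0.0 && cores >= 0);
    return clock_mhz * (double)cores / 1000.0; /* MFLOPS -> GFLOPS */
}
```

For example, `peak_gflops(195.0, 89)` gives about 17.4, in line with the 17 GFLOPS listed for single-precision addition, and `peak_gflops(98.0, 6)` gives about 0.59 for double-precision division.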
7
Connecting FPGAs to the Network Fabric
Modern FPGAs feature multi-gigabit transceivers
–Experimented with GigE, Myrinet 2000, and IB
–Implemented TCP Offload Engine (TOE) in hardware
–Working on OpenTOE and OpenGigE cores

[Figure: block diagrams of the SNL_OpenGigE and SNL_OpenTOE cores. Labels: Rocket I/O MGT, GT_Ethernet_2, MGT Control, Align, Pad, Tx CRC, Rx CRC, MAC Framer, IP Header, ARP, ARP Cache, ARP Reply, Ping, Ping Reply, TCP I/F, Socket I/F, Incoming Data Queue, Outgoing Data Queue, Timeout Monitor, SEQ Gen, ACK Monitor, CRC Gen, CRC Decode]
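The Ethernet blocks in the figure include Tx and Rx CRC stages. As a software point of reference only, here is a bit-at-a-time sketch of the standard IEEE 802.3 CRC-32 used for the Ethernet frame check sequence. It is not the SNL core; the function name `crc32_ieee` is hypothetical, and a hardware version would typically process a full word per clock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (IEEE 802.3): reflected polynomial 0xEDB88320,
 * initial value 0xFFFFFFFF, final bitwise inversion. */
uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return ~crc;
}
```

The commonly used check string "123456789" (9 bytes) yields 0xCBF43926, which is a quick way to validate any hardware or software implementation against this reference.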
8
Cray XD1 Overview
9
NDA Notice
We do have an NDA with Cray Canada
The XD1 we have on loan is an early Beta system
10
Cray XD1 Overview
Dense MP system
–12 AMD Opterons on 6 blades
–6 Xilinx Virtex-II/Pro FPGAs
–InfiniBand-like interconnect
–6 SATA hard drives
–4 PCI-X slots
–3U Rack
11
Individual Blade

[Figure: blade block diagram. Labels: DDR Memory (x2), Opteron, RAP NI (x2), “Einstein” Chip, RapidArray Fabric (24 4x IB Ports) (x2). Link rates shown: HT: 3.2 GB/s; 4xIB: 2 GB/s; HT: 6.4 GB/s; “HT”: 3.2 GB/s]

* All data rates are aggregates (i.e., 3.2 GB/s = 1.6 GB/s + 1.6 GB/s)
12
Message Passing
MPICH 1.2.5
–Latency: 2.25 μs
–Bandwidth: 1.3 GB/s (82% of HT-IB link)
RapidArray message layer
–Open source
–MP, RDMA
–Global address space

[Figure: MPI Bandwidth plot. Axes: Message Size (Bytes) vs. Bandwidth (Million Bytes/s). Reference labels: PCI-X 133, 1.6GB/s HT]
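To relate the latency and bandwidth figures to message size, a common first-order model is t(n) = latency + n / bandwidth. The sketch below applies that model; it is an assumption for illustration, not the method used to produce the slide's plot. The function names are hypothetical, and 1 GB/s is taken as 10^9 bytes/s.

```c
#include <assert.h>

/* Transfer time in seconds for a message of `bytes` bytes under the
 * linear model t = latency + bytes / bandwidth. */
double transfer_time_s(double bytes, double latency_s, double bandwidth_Bps)
{
    assert(bytes >= 0.0 && latency_s >= 0.0 && bandwidth_Bps > 0.0);
    return latency_s + bytes / bandwidth_Bps;
}

/* Effective bandwidth in bytes/s seen by a message of `bytes` bytes. */
double effective_bandwidth_Bps(double bytes, double latency_s, double bandwidth_Bps)
{
    assert(bytes > 0.0);
    return bytes / transfer_time_s(bytes, latency_s, bandwidth_Bps);
}

/* Message size at which the model reaches half of the peak bandwidth. */
double half_bandwidth_bytes(double latency_s, double bandwidth_Bps)
{
    assert(latency_s >= 0.0 && bandwidth_Bps > 0.0);
    return latency_s * bandwidth_Bps;
}
```

With the slide's 2.25 μs and 1.3 GB/s, `half_bandwidth_bytes(2.25e-6, 1.3e9)` is about 2925 bytes, so under this model messages of a few kilobytes already see roughly half of the peak rate.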
13
System Administration
Active manager
–Synchronize each node’s OS
–Partition blade functionality
–Control access rights
Embedded processor
–Monitors health (heartbeats)
–Can restart nodes
Issues?
14
Reconfigurable Computing and the Cray XD1
15
Connecting to the “Einstein” Accelerator

[Figure: accelerator block diagram. Labels: Host, HT, RAP NI, Net, IB, Fabric Port, FPGA Port, FPGA, HT I/F, User-defined Circuits, four QDR2 I/F blocks, four 2MB SRAMs. Link rates shown: 1.6+1.6GB/s (twice)]
16
Example: Random Number Generator
Monte Carlo app in need of good random numbers
–Mersenne twister
Implemented in FPGA
–FPGA pushes to host memory
–301 vs 101 Million Integers/s
–~1.2 GB/s

[Figure: data path diagram. Labels: NI, CPU, Host Memory, RNG, FPGA]
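For readers unfamiliar with the generator named on this slide, the following is a compact software sketch of the widely published 32-bit Mersenne Twister (MT19937). The slide does not say which variant or parameters the FPGA design used, so treating it as MT19937 is an assumption; the type and function names (`mt19937`, `mt_seed`, `mt_next`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MT_N 624 /* state size in 32-bit words */
#define MT_M 397 /* offset used by the recurrence */

typedef struct {
    uint32_t state[MT_N];
    int index; /* next word to temper; MT_N means "regenerate first" */
} mt19937;

/* Fill the state from a 32-bit seed using the standard initializer. */
void mt_seed(mt19937 *g, uint32_t seed)
{
    assert(g != 0);
    g->state[0] = seed;
    for (int i = 1; i < MT_N; i++) {
        uint32_t prev = g->state[i - 1];
        g->state[i] = 1812433253u * (prev ^ (prev >> 30)) + (uint32_t)i;
    }
    g->index = MT_N;
}

/* Return the next 32-bit output, regenerating the state when exhausted. */
uint32_t mt_next(mt19937 *g)
{
    assert(g != 0);
    if (g->index >= MT_N) {
        for (int k = 0; k < MT_N; k++) {
            /* Combine the top bit of word k with the low 31 bits of word k+1. */
            uint32_t y = (g->state[k] & 0x80000000u) |
                         (g->state[(k + 1) % MT_N] & 0x7FFFFFFFu);
            uint32_t next = g->state[(k + MT_M) % MT_N] ^ (y >> 1);
            if (y & 1u) {
                next ^= 0x9908B0DFu;
            }
            g->state[k] = next;
        }
        g->index = 0;
    }
    uint32_t y = g->state[g->index++];
    /* Tempering improves the equidistribution of the output bits. */
    y ^= y >> 11;
    y ^= (y << 7) & 0x9D2C5680u;
    y ^= (y << 15) & 0xEFC60000u;
    y ^= y >> 18;
    return y;
}
```

If the integers on the slide are 32 bits wide, 301 million integers/s times 4 bytes is about 1.2 GB/s, which agrees with the quoted transfer rate.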
17
General XD1 Comments
Reconfigurable computing
–FPGA in memory
–Fast local memory
Other accelerators
–ClearSpeed
Global address space
–Opteron limits (40b PA)
Vendor lock-in
–Incompatible network
–All-in-one box?
Current NI is a bottleneck
Density vs. Reliability
Value-added features

[Slide layout: items are arranged under two column labels, Good and Not-so-Good]
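The "40b PA" note refers to a 40-bit physical address. As a small worked example of what that bounds, the sketch below computes the number of bytes addressable with a given address width; the function name `addressable_bytes` is hypothetical, and byte addressing is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct byte addresses reachable with `address_bits` bits.
 * Limited to widths below 64 so the result fits in a uint64_t. */
uint64_t addressable_bytes(unsigned address_bits)
{
    assert(address_bits < 64);
    return (uint64_t)1 << address_bits;
}
```

`addressable_bytes(40)` is 2^40 bytes, or 1 TiB, which is the ceiling a 40-bit physical address would place on any single shared address space.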
18
Friendly Users?
We have a month left on evaluation
–Could use feedback from other users
http://cdulmer.ran.sandia.gov/xd1
cdulmer@sandia.gov