Reconfigurable Computing Aspects of the Cray XD1 Sandia National Laboratories / California Craig Ulmer Cray User Group (CUG 2005) May.

Slides:



Advertisements
Similar presentations
ECE 506 Reconfigurable Computing Lecture 2 Reconfigurable Architectures Ali Akoglu.
Advertisements

FPGA (Field Programmable Gate Array)
Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
Lecture 6: Multicore Systems
AMD OPTERON ARCHITECTURE Omar Aragon Abdel Salam Sayyad This presentation is missing the references used.
Programmable Interval Timer
Efficacy of GPUs in RAID Parity Calculation 8/8/2007 Matthew Curry and Lee Ward Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed.
An Introduction to Reconfigurable Computing Mitch Sukalski and Craig Ulmer Dean R&D Seminar 11 December 2003.
Implementation methodology for Emerging Reconfigurable Systems With minimum optimization an appreciable speedup of 3x is achievable for this program with.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear.
Computes the partial dot products for only the diagonal and upper triangle of the input matrix. The vector computed by this architecture is added to the.
ENGIN112 L38: Programmable Logic December 5, 2003 ENGIN 112 Intro to Electrical and Computer Engineering Lecture 38 Programmable Logic.
Computational Astrophysics: Methodology 1.Identify astrophysical problem 2.Write down corresponding equations 3.Identify numerical algorithm 4.Find a computer.
Configurable System-on-Chip: Xilinx EDK
I/O Subsystem Organization and Interfacing Cs 147 Peter Nguyen
Embedded DRAM for a Reconfigurable Array S.Perissakis, Y.Joo 1, J.Ahn 1, A.DeHon, J.Wawrzynek University of California, Berkeley 1 LG Semicon Co., Ltd.
CS 151 Digital Systems Design Lecture 38 Programmable Logic.
Chapter 6 Memory and Programmable Logic Devices
Router Architectures An overview of router architectures.
Raghu Machiraju Slides: Courtesy - Prof. Huamin Wang, CSE, OSU
Router Architectures An overview of router architectures.
Using Programmable Logic to Accelerate DSP Functions 1 Using Programmable Logic to Accelerate DSP Functions “An Overview“ Greg Goslin Digital Signal Processing.
General FPGA Architecture Field Programmable Gate Array.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear.
February 12, 1998 Aman Sareen DPGA-Coupled Microprocessors Commodity IC’s for the Early 21st Century by Aman Sareen School of Electrical Engineering and.
Lecture 2: Field Programmable Gate Arrays September 13, 2004 ECE 697F Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays.
Dax: Rethinking Visualization Frameworks for Extreme-Scale Computing DOECGF 2011 April 28, 2011 Kenneth Moreland Sandia National Laboratories SAND P.
Network Intrusion Detection Systems on FPGAs with On-Chip Network Interfaces Christopher ClarkGeorgia Institute of Technology Craig UlmerSandia National.
LOGO BUS SYSTEM Members: Bui Thi Diep Nguyen Thi Ngoc Mai Vu Thi Thuy Class: 1c06.
1 Hardware Support for Collective Memory Transfers in Stencil Computations George Michelogiannakis, John Shalf Computer Architecture Laboratory Lawrence.
Silicon Building Blocks for Blade Server Designs accelerate your Innovation.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Archs, VHDL 3 Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.
COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.
Efficient FPGA Implementation of QR
RiceNIC: A Reconfigurable and Programmable Gigabit Network Interface Card Jeff Shafer, Dr. Scott Rixner Rice Computer Architecture:
Logic and Computer Design Dr. Sanjay P. Ahuja, Ph.D. FIS Distinguished Professor of CIS ( ) School of Computing, UNF.
The Red Storm High Performance Computer March 19, 2008 Sue Kelly Sandia National Laboratories Abstract: Sandia National.
Floating-Point Reuse in an FPGA Implementation of a Ray-Triangle Intersection Algorithm Craig Ulmer June 27, 2006 Sandia is a multiprogram.
Design and Performance of a PCI Interface with four 2 Gbit/s Serial Optical Links Stefan Haas, Markus Joos CERN Wieslaw Iwanski Henryk Niewodnicznski Institute.
Reconfigurable Computing: A First Look at the Cray-XD1 Mitch Sukalski, David Thompson, Rob Armstrong, Curtis Janssen, and Matt Leininger Orgs: 8961 & 8963.
FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR Topics n FPGA fabric architecture concepts.
Reminder Lab 0 Xilinx ISE tutorial Research Send me an if interested Looking for those interested in RC with skills in compilers/languages/synthesis,
CPEN Digital System Design
1 Fly – A Modifiable Hardware Compiler C. H. Ho 1, P.H.W. Leong 1, K.H. Tsoi 1, R. Ludewig 2, P. Zipf 2, A.G. Oritz 2 and M. Glesner 2 1 Department of.
Reconfigurable Computing: FPGAs for Ultrascale Science Sandia National Laboratories Keith UnderwoodSNL/NM Craig Ulmer SNL/CA SOS-8 Workshop.
STMIK Jakarta STI&K, Jakarta - September Designing Image Processing Component using FPGA Device By : Sunny Arief Sudiro.
Lecture 16: Reconfigurable Computing Applications November 3, 2004 ECE 697F Reconfigurable Computing Lecture 16 Reconfigurable Computing Applications.
An FPGA Implementation of the Ewald Direct Space and Lennard-Jones Compute Engines By: David Chui Supervisor: Professor P. Chow.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) Reconfigurable Architectures Forces that drive.
EE204 L12-Single Cycle DP PerformanceHina Anwar Khan EE204 Computer Architecture Single Cycle Data path Performance.
An FX software correlator for VLBI Adam Deller Swinburne University Australia Telescope National Facility (ATNF)
Floating-Point Divide and Square Root for Efficient FPGA Implementation of Image and Signal Processing Algorithms Xiaojun Wang, Miriam Leeser
Algorithm and Programming Considerations for Embedded Reconfigurable Computers Russell Duren, Associate Professor Engineering And Computer Science Baylor.
CS 4396 Computer Networks Lab Router Architectures.
Copyright  2005 SRC Computers, Inc. ALL RIGHTS RESERVED Overview.
Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.
Threading Opportunities in High-Performance Flash-Memory Storage Craig Ulmer Sandia National Laboratories, California Maya GokhaleLawrence Livermore National.
Site Report DOECGF April 26, 2011 W. Alan Scott Sandia National Laboratories Sandia National Laboratories is a multi-program laboratory managed and operated.
The Alpha Thomas Daniels Other Dude Matt Ziegler.
Copyright © 2004, Dillon Engineering Inc. All Rights Reserved. An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs  Architecture optimized.
Reconfigurable Computing Leveraging FPGA Accelerators in High-Performance Computing Applications Craig Ulmer June 2, 2005 Sandia is.
Survey of multicore architectures Marko Bertogna Scuola Superiore S.Anna, ReTiS Lab, Pisa, Italy.
Reconfigurable Computing: HPC Network Aspects Mitch Sukalski (8961) David Thompson (8963) Craig Ulmer (8963) Pete Dean R&D Seminar December.
1 Advanced Digital Design Reconfigurable Logic by A. Steininger and M. Delvai Vienna University of Technology.
Fast Lookup for Dynamic Packet Filtering in FPGA REPORTER: HSUAN-JU LI 2014/09/18 Design and Diagnostics of Electronic Circuits & Systems, 17th International.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.
Instructor: Dr. Phillip Jones
Chapter 1 Introduction.
Five Key Computer Components
ADSP 21065L.
Presentation transcript:

Reconfigurable Computing Aspects of the Cray XD1 Sandia National Laboratories / California Craig Ulmer Cray User Group (CUG 2005) May 16, 2005 Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL Craig Ulmer, Ryan Hilles, and David ThompsonSNL/CA Keith Underwood and K. Scott HemmertSNL/NM

Overview Cray XD1 –Dense multiprocessor –Commodity parts –Good HPC performance FPGA Accelerators in XD1 –Close to host memory –Reconfigurable Computing –Our early experiences Message Size (Bytes) Bandwidth (Million Bytes/s) MPI Bandwidth PCI-X GB/s HT

Outline The Cray XD1 & Reconfigurable Computing Four simple applications –DMA, MD5, Sort, and Floating Point Observations Conclusions

Cray XD1 Architecture

Cray XD1 Dense multiprocessor system –12 AMD Opterons on 6 blades –6 Xilinx Virtex-II/Pro FPGAs –InfiniBand-like interconnect –6 SATA hard drives –4 PCI-X slots –3U Rack

XD1 Blade Architecture CPU Main Memory Main Memory NI SRAM Expansion Module SRAM FPGA NI Rapid Array Fabric 1.6 GB/s HT 1+1 GB/s “4x IB”3.2 GB/s HT 1.6 GB/s “HT”

Reconfigurable Computing (RC) Use reconfigurable hardware devices to implement key computations in hardware double doX( double *a, int n) { int i; double x; x=0; for(i=0;i<n;i+=3){ x+= a[i] * a[i+1] + a[i+2]; … } … return x; } * + + a[i]a[i+1] Z -1 a[i+2] SoftwareHardwareFPGA

Reconfigurable Computing and the XD1 There have always been three challenges in RC: 1.System Integration 2.Capacity 3.Development Environment XD1 is interesting because of (1) and possibly (2) –Describe our early experiences –Four simple applications to expose performance

Application 1: Data Exchange

Data Exchange How fast can we move data into/out of FPGA? –Cray provides basic HyperTransport interface –Host/FPGA initiated transfers HT-Compliance –64B burst boundaries –64b words FPGA-Initiated DMA Engine –Read/Write large blocks of data –Based on memory interface FPGA User’s Circuits Host HT Op DMA FPGA-Initiated Transfers

Data Transfer Performance

FPGA Frequency Affects Performance

Application 2: Data Hashing

Data Hashing Generate fingerprint for block of data MD5 Message Digest function –512b blocks of data –64 computations per block –Computation: Boolean, add, and rotate Two methods –Single-cycle computation (64) –Multi-cycle computation (320)

MD5 Performance MD5 is serial –Read/Modify fingerprint –Host does better FPGA clock rates –Multi-cycle: 190 MHz –Single-cycle: 66 MHz Multi-cycle has 5x stages but only 3x clock rate of single-cycle

Application 3: Sorting

Sorting Sort a matrix of 64b values Hardware implementation –Sorting element –Array of elements –Data transfer engine Exploits parallelism –Do n comparisons each clock –Requires 3n clock cycles Limited to n values –Use as building block Sorting Array Address Wr Req Rd Req DMA Engine Sorting Control Double-Buffered Outgoing Memory Reg > Sorting Element unsortedsorted

Sorting Performance V2P50-7 FPGA –128 elements –110 MHz Data transfer by FPGA –Read/write concurrency –Pinned memory 2-4x Rate of Quicksort(128)

Application 4: 64b Floating Point

Floating Point Computations Floating point has been weakness for FPGAs –Complex to implement, consume many gates Keith Underwood and K. Scott Hemmert –Developed FP library at SNL/NM FP Library –Highly-optimized and compact –Single/Double Precision –Add, Multiply, Divide, Square Root –V2P50: DP cores, 130+ MHz

Distance Computation Compute Pythagorean Theorem: c=sqrt(a 2 + b 2 ) –Implemented as simple pipeline (88 stages) –Fetch two inputs, store one output each clock –Host pushes data in, FPGA pushes data out Incoming SRAM Outgoing SRAM DMA Engine Control Incoming SRAM From Host To Host

Distance Performance FPGA Design –159 MHz clock rate –39% of V2P50 CPU and FPGA converge –75 MiOperations/s –1,200 MiB/s of input For Square Root and Divide, FPGA can be competitive

Observations FPGAs have high-bandwidth access to host memory –FPGAs can get 1,300 MiB/s – 10x PCI –Still need planning for data transfers –Note: host’s HT links are 2x Some applications work better than other –Need to exploit parallelism V2P50 is approaching a capacity for interesting work –Small number of FP cores –Encourage use of larger FPGAs

Conclusions XD1 is an interesting platform for RC research –FPGAs require careful planning –High-bandwidth access to memory –Sign of interesting HT architectures Acknowledgments –Steve Margerm of Cray Canada for XD1 Support –Keith Underwood and K. Scott Hemmert SNL/NM for FP library