Reconfigurable Computing Leveraging FPGA Accelerators in High-Performance Computing Applications Craig Ulmer June 2, 2005 Sandia is.

Slides:



Advertisements
Similar presentations
Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
Advertisements

Lecture 6: Multicore Systems
Protocols and software for exploiting Myrinet clusters Congduc Pham and the main contributors P. Geoffray, L. Prylli, B. Tourancheau, R. Westrelin.
GPU System Architecture Alan Gray EPCC The University of Edinburgh.
Efficacy of GPUs in RAID Parity Calculation 8/8/2007 Matthew Curry and Lee Ward Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed.
Implementation methodology for Emerging Reconfigurable Systems With minimum optimization an appreciable speedup of 3x is achievable for this program with.
Graduate Computer Architecture I Lecture 15: Intro to Reconfigurable Devices.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear.
Computes the partial dot products for only the diagonal and upper triangle of the input matrix. The vector computed by this architecture is added to the.
CS 213 Commercial Multiprocessors. Origin2000 System – Shared Memory Directory state in same or separate DRAMs, accessed in parallel Upto 512 nodes (1024.
Introduction CS 524 – High-Performance Computing.
Seven Minute Madness: Special-Purpose Parallel Architectures Dr. Jason D. Bakos.
Introduction to Reconfigurable Computing CS61c sp06 Lecture (5/5/06) Hayden So.
1 Performed By: Khaskin Luba Einhorn Raziel Einhorn Raziel Instructor: Rivkin Ina Spring 2004 Spring 2004 Virtex II-Pro Dynamical Test Application Part.
Computational Astrophysics: Methodology 1.Identify astrophysical problem 2.Write down corresponding equations 3.Identify numerical algorithm 4.Find a computer.
Configurable System-on-Chip: Xilinx EDK
Heterogeneous Computing Dr. Jason D. Bakos. Heterogeneous Computing 2 “Traditional” Parallel/Multi-Processing Large-scale parallel platforms: –Individual.
Router Architectures An overview of router architectures.
Field Programmable Gate Array (FPGA) Layout An FPGA consists of a large array of Configurable Logic Blocks (CLBs) - typically 1,000 to 8,000 CLBs per chip.
System Architecture A Reconfigurable and Programmable Gigabit Network Interface Card Jeff Shafer, Hyong-Youb Kim, Paul Willmann, Dr. Scott Rixner Rice.
Router Architectures An overview of router architectures.
Sven Ubik, Petr Žejdl CESNET TNC2008, Brugges, 19 May 2008 Passive monitoring of 10 Gb/s lines with PC hardware.
Extensible Message Layers for Resource-Rich Cluster Computers Craig Ulmer Center for Experimental Research in Computer Systems A Doctoral Thesis.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear.
1 Lecture 7: Part 2: Message Passing Multicomputers (Distributed Memory Machines)
Introduction to Interconnection Networks. Introduction to Interconnection network Digital systems(DS) are pervasive in modern society. Digital computers.
ECE 526 – Network Processing Systems Design Network Processor Architecture and Scalability Chapter 13,14: D. E. Comer.
Extensible Message Layers for Multimedia Cluster Computers Dr. Craig Ulmer Center for Experimental Research in Computer Systems.
© 2005 Mercury Computer Systems, Inc. Yael Steinsaltz, Scott Geaghan, Myra Jean Prelle, Brian Bouzas,
Network Intrusion Detection Systems on FPGAs with On-Chip Network Interfaces Christopher ClarkGeorgia Institute of Technology Craig UlmerSandia National.
Mapping of scalable RDMA protocols to ASIC/FPGA platforms
Design and Characterization of TMD-MPI Ethernet Bridge Kevin Lam Professor Paul Chow.
1 Hardware Support for Collective Memory Transfers in Stencil Computations George Michelogiannakis, John Shalf Computer Architecture Laboratory Lawrence.
Silicon Building Blocks for Blade Server Designs accelerate your Innovation.
Principles of Scalable HPC System Design March 6, 2012 Sue Kelly Sandia National Laboratories Abstract: Sandia National.
SLAAC SV2 Briefing SLAAC Retreat, May 2001 Heber, UT Brian Schott USC Information Sciences Institute.
COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.
RiceNIC: A Reconfigurable and Programmable Gigabit Network Interface Card Jeff Shafer, Dr. Scott Rixner Rice Computer Architecture:
Computer Architecture Lecture10: Input/output devices Piotr Bilski.
The Red Storm High Performance Computer March 19, 2008 Sue Kelly Sandia National Laboratories Abstract: Sandia National.
To be smart or not to be? Siva Subramanian Polaris R&D Lab, RTP Tal Lavian OPENET Lab, Santa Clara.
Floating-Point Reuse in an FPGA Implementation of a Ray-Triangle Intersection Algorithm Craig Ulmer June 27, 2006 Sandia is a multiprogram.
High Performance Computing Processors Felix Noble Mirayma V. Rodriguez Agnes Velez Electric and Computer Engineer Department August 25, 2004.
Reconfigurable Computing: A First Look at the Cray-XD1 Mitch Sukalski, David Thompson, Rob Armstrong, Curtis Janssen, and Matt Leininger Orgs: 8961 & 8963.
Using FPGAs to Supplement Ray-Tracing Computations on the Cray XD-1 Charles B. Cameron United States Naval Academy Department of Electrical Engineering.
Lecture 1 1 Computer Systems Architecture Lecture 1: What is Computer Architecture?
Jump to first page One-gigabit Router Oskar E. Bruening and Cemal Akcaba Advisor: Prof. Agarwal.
Reconfigurable Computing: FPGAs for Ultrascale Science Sandia National Laboratories Keith UnderwoodSNL/NM Craig Ulmer SNL/CA SOS-8 Workshop.
By V. Koutsoumpos, C. Kachris, K. Manolopoulos, A. Belias NESTOR Institute – ICS FORTH Presented by: Kostas Manolopoulos.
An Architecture and Prototype Implementation for TCP/IP Hardware Support Mirko Benz Dresden University of Technology, Germany TERENA 2001.
An FX software correlator for VLBI Adam Deller Swinburne University Australia Telescope National Facility (ATNF)
Floating-Point Divide and Square Root for Efficient FPGA Implementation of Image and Signal Processing Algorithms Xiaojun Wang, Miriam Leeser
Lecture 12: Reconfigurable Systems II October 20, 2004 ECE 697F Reconfigurable Computing Lecture 12 Reconfigurable Systems II: Exploring Programmable Systems.
CS 4396 Computer Networks Lab Router Architectures.
Department of Computer Science and Engineering Applied Research Laboratory Architecture for a Hardware Based, TCP/IP Content Scanning System David V. Schuehler.
ECE 526 – Network Processing Systems Design Network Processor Introduction Chapter 11,12: D. E. Comer.
Reconfigurable Computing Aspects of the Cray XD1 Sandia National Laboratories / California Craig Ulmer Cray User Group (CUG 2005) May.
Hardware Benchmark Results for An Ultra-High Performance Architecture for Embedded Defense Signal and Image Processing Applications September 29, 2004.
Threading Opportunities in High-Performance Flash-Memory Storage Craig Ulmer Sandia National Laboratories, California Maya GokhaleLawrence Livermore National.
1 of 20 Smart-NICs: Power Proxying for Reduced Power Consumption in Network Edge Devices Karthikeyan Sabhanatarajan, Ann Gordon-Ross +, Mark Oden, Mukund.
Site Report DOECGF April 26, 2011 W. Alan Scott Sandia National Laboratories Sandia National Laboratories is a multi-program laboratory managed and operated.
Reconfigurable Computing: HPC Network Aspects Mitch Sukalski (8961) David Thompson (8963) Craig Ulmer (8963) Pete Dean R&D Seminar December.
Hardened IDS using IXP Didier Contis, Dr. Wenke Lee, Dr. David Schimmel Chris Clark, Jun Li, Chengai Lu, Weidong Shi, Ashley Thomas, Yi Zhang  Current.
Cray XD1 Reconfigurable Computing for Application Acceleration.
Mohamed ABDELFATTAH Andrew BITAR Vaughn BETZ. 2 Module 1 Module 2 Module 3 Module 4 FPGAs are big! Design big systems High on-chip communication.
Exploiting Task-level Concurrency in a Programmable Network Interface June 11, 2003 Hyong-youb Kim, Vijay S. Pai, and Scott Rixner Rice Computer Architecture.
R&D on data transmission FPGA → PC using UDP over 10-Gigabit Ethernet Domenico Galli Università di Bologna and INFN, Sezione di Bologna XII SuperB Project.
Fermi National Accelerator Laboratory & Thomas Jefferson National Accelerator Facility SciDAC LQCD Software The Department of Energy (DOE) Office of Science.
Presented by Reconfigurable HPC Research at ORNL using Field-Programmable Gate Arrays (FPGAs) Olaf O. Storaasli Future Technologies Group Computer Science.
Chapter 1 Introduction.
Presentation transcript:

Reconfigurable Computing Leveraging FPGA Accelerators in High-Performance Computing Applications Craig Ulmer June 2, 2005 Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL Craig Ulmer, Ryan Hilles, and David ThompsonSNL/CA Keith Underwood and K. Scott HemmertSNL/NM

What is Center 8900 doing with FPGAs? Computer Sciences & Information Technologies 8900 –High-Performance Computing –Viz & Scientific computing (8963) FPGA LDRD for accelerating HPC –NM:Computational aspects –CA:Communication aspects FPGA LDRD for network security –NM/CA: Smart routers Catalyst

Outline Reconfigurable Computing –Communication Aspects –Computational Aspects New Platforms: Cray XD1 Summary

In a nutshell.. Reconfigurable Computing –Use FPGAs as a means of accelerating simulations –Implement key computations in hardware: soft-hardware double doX( double *a, int n) { int i; double x; x=0; for(i=0;i<n;i+=3){ x+= a[i] * a[i+1] + a[i+2]; … } … return x; } * + + a[i]a[i+1] Z -1 a[i+2] SoftwareHardwareSynthesized Hardware FPGA CPU RC Platform

Trivial Example: Sorting Sort a matrix of 64b values Hardware implementation –Sorting element –Array of elements –Data transfer engine Exploits parallelism –Do n comparisons each clock –Requires 3n clock cycles Limited to n values –Use as building block Sorting Array Address Wr Req Rd Req DMA Engine Sorting Control Double-Buffered Outgoing Memory Reg > Sorting Element unsortedsorted

Sorting Performance Xilinx Virtex II/Pro50-7 FPGA –128 elements –110 MHz Data transfer by FPGA –Read/write concurrency –Pinned memory 2-4x Rate of Quicksort(128) –Host is 2.2GHz AMD Opteron

RC: Communication Aspects

New FPGAs are “Network Ready” Multi-gigabit transceivers (MGTs) –V2Pro:twenty-four 3Gbps channels –V2ProX:twenty 10Gbps channels –InfiniBand, FC, GigE, 10GigE Multiple embedded processors –PowerPC MHz Direct connection to network –Build NI in hardware –Use PPC core for complex protocols –User-defined computational circuits Xilinx Virtex II/Pro-7 Network Interface PowerPC CPU User- Defined Circuits

TCP/GigE Network Interface for FPGAs Network interface (NI) in hardware –TCP Offload Engine(TOE) –Gigabit Ethernet(GigE) Refined TCP functionality –Slow start, NACK detection, Throttling… –User data rate to host: 600 Mb/s Outgoing TCP Message Control Rocket I/O TxRx Align Decode ARP Cache Framer MAC Header Control ARP Reply Ping ReplyIP Header Incoming Byte Stream Timeout Monitor CRC Outgoing Byte Stream Incoming TCP Message Control TOE GigE UnitSlicesV2P20 TOE2,06822% GigE1,10212%

Transport Framework for Visualization Need a framework for supporting Visualization in FPGAs –Want to perform viz tasks, not rendering –Built Chromium interface to transport OpenGL graphics Results –Functional system where visualization in FPGA, Rendering in Host –Saturate GigE link with OpenGL data Viz App GigE Network GigE TCP Offload Engine Chromium Packetizer FPGA

Network Intrusion Detection System (NIDS) Collaboration with Georgia Tech –GT:Pattern matching cores –SNL: GigE NI cores Network Intrusion Detection System –Examine packets & remove malicious –Based on Snort rule set (1,500 rules) Filter multiple GigE links transparently –Demonstrated for two links NI Malicious Packet Pattern Matching NI OKDrop Scheduler

RC: Computational Aspects

Floating Point Computations Floating point has been weakness for FPGAs –Complex to implement, consume many gates Keith Underwood and K. Scott Hemmert –Developed FP library at SNL/NM FP Library –Highly-optimized and compact –Single/Double Precision –Add, Multiply, Divide, Square Root –V2P50: DP cores, MHz

Example: FP Distance Computation Compute c=sqrt(a 2 + b 2 ) –Simple pipeline (88 stages) Implemented on Xilinx Virtex II/Pro 50 –159 MHz clock rate –39% of V2P50 –75 MiOperations/s Incoming SRAM Outgoing SRAM DMA Engine Incoming SRAM From Host To Host

New Platforms: Cray XD1

Cray XD1 Dense multiprocessor system –12 AMD Opterons on 6 blades –6 Xilinx Virtex-II/Pro FPGAs –InfiniBand-like interconnect –6 SATA hard drives –4 PCI-X slots –3U Rack

Cray XD1: Placing FPGAs Near Memory Blades use HyperTransport –High bandwidth I/O FPGA accelerator –Xilinx Virtex II/Pro 50 (-7) –HT connection: GB/s –10x performance of PCI First of a new breed –Commodity HT accelerators CPU Main Memory Main Memory NI SRAM Expansion Module SRAM FPGA NI 1.6 GB/s HT Processor Blade

Summary FPGAs are getting faster and less expensive –Finally have capacity for interesting problems –Million gate parts for $5 Large amount of reconfigurable computing work at SNL –Experience with Virtex II/Pro Platform FPGAs –Networks & embedded CPUs –Willing to share cores, work with others

Deliverables NetworkingVisualizationComputationsXD1General OpenGigE OpenTOE Raw GigE GigE Bridge NIDS Filter Isosurfacer Chromium Packetizer Sorting Array MD5 2D Distance DMA Engine Computation Framework Asynch. Message FIFOs Clock domain data pass PowerPC instantiation SNL/CA Cores 32b/64b Floating-PointComputations Add Multiply Divide Square root Vector/Matrix Multiply SNL/NM Cores

FPGA Slices Relative V2P Price & Density V2P100 V2P70 V2P7 V2P40 Relative Price & Density