High Performance Computing


High Performance Computing: Architecture Overview
Computational Methods

Outline
- Desktops, servers, and computational power
- Clusters and interconnects
- Supercomputer architecture and design
- Future systems

What is available on your desktop?
- Intel Core i7 4790 (Haswell) processor, 4 cores at 4.0 GHz
- 8 double precision floating point operations (FLOPs) per cycle per floating point unit (FPU) with fused multiply-add (FMA) instructions
- 2 FPUs per core, so 16 FLOPs per cycle per core
- FMA example: $0 = $0 x $2 + $1
- 64 GFLOP/s per core theoretical peak
- 256 GFLOP/s full system theoretical peak
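To make the arithmetic concrete, here is a minimal sketch (not from the original slides) that reproduces the peak numbers above; the clock rate, core count, and FLOPs-per-cycle figure are taken directly from the slide:

```c
/* Back-of-envelope theoretical peak for the desktop described above.
 * Illustrative sketch; numbers match the slide (Core i7 4790). */
#include <stdio.h>

int main(void) {
    double clock_ghz       = 4.0;  /* GHz */
    int    cores           = 4;
    int    flops_per_cycle = 16;   /* 2 FPUs x 4-wide AVX FMA x 2 FLOPs */

    double per_core_gflops = clock_ghz * flops_per_cycle;   /* 64 GFLOP/s  */
    double system_gflops   = per_core_gflops * cores;       /* 256 GFLOP/s */

    printf("Per-core peak: %.0f GFLOP/s\n", per_core_gflops);
    printf("System peak:   %.0f GFLOP/s\n", system_gflops);
    return 0;
}
```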

What can a desktop do?
- A 1-d central difference corresponds to 1 subtraction (1 FLOP), 1 multiplication (1 FLOP), and 1 division (4 FLOPs), i.e. 6 FLOPs per zone
- A single 1-d grid with 512 zones = 3072 FLOPs
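As a concrete illustration (a sketch, not code from the lecture), a 1-d central-difference loop; the per-zone FLOP count in the comments follows the slide's accounting:

```c
/* 1-d central difference: df/dx ~ (f[i+1] - f[i-1]) / (2*dx). */
#include <stddef.h>

void central_diff_1d(const double *f, double *dfdx, size_t n, double dx) {
    for (size_t i = 1; i + 1 < n; ++i) {
        /* 1 subtraction, 1 multiplication (2*dx), 1 division per zone;
         * the slide charges the division at 4 FLOPs, giving 6 FLOPs/zone. */
        dfdx[i] = (f[i + 1] - f[i - 1]) / (2.0 * dx);
    }
    /* Boundary zones i = 0 and i = n-1 would use one-sided differences. */
}
```

Note that a compiler will typically hoist the 2*dx and may replace the division by a multiplication, which is one reason naive FLOP counts and measured performance differ.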

What can a desktop do?
- Consider the central difference on a 512³ 3-d mesh
- 512³ zones x 3 directions x 6 FLOPs ≈ 2.4 GFLOPs per update
- On a single core of a 4.0 GHz Core i7: 0.03 seconds of run time for 1 update at 100% efficiency (assumes 100% fused multiply-add instructions)
- With perfect on-chip parallelization on 4 cores: 0.008 seconds per update, 1.25 minutes for 10,000 updates
- Nothing, not even HPL, gets 100% efficiency! A more realistic efficiency is ~10%
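A small sketch (illustrative only; the peak rates and the ~10% efficiency are the assumptions from the slides above) that reproduces these run-time estimates:

```c
/* Estimate wall-clock time per update for a 512^3 central-difference sweep. */
#include <stdio.h>

int main(void) {
    double zones        = 512.0 * 512.0 * 512.0;
    double flops_update = zones * 3.0 * 6.0;   /* 3 directions, 6 FLOPs each: ~2.4 GFLOPs */

    double core_peak    = 64e9;                /* FLOP/s, single Haswell core */
    double chip_peak    = 256e9;               /* 4 cores */

    printf("Per update, 1 core  @100%%: %.3f s\n", flops_update / core_peak);
    printf("Per update, 4 cores @100%%: %.3f s\n", flops_update / chip_peak);
    printf("Per update, 4 cores @10%%:  %.3f s\n", flops_update / (0.10 * chip_peak));
    printf("10,000 updates, 4 cores @100%%: %.1f min\n",
           10000.0 * flops_update / chip_peak / 60.0);
    return 0;
}
```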

Efficiency considerations
No application ever gets 100% of theoretical peak, for the following (not comprehensive) reasons:
- 100% of peak assumes 100% FMA instructions running on processors with AVX SIMD instructions. Does the algorithm in question even map well to FMAs? If not, the remaining vector instructions run at 50% of the FMA peak.
- Data must be moved from main memory through the CPU memory system. This motion has latency and a fixed bandwidth that may be shared with other cores.
- The algorithm may not map well onto vector instructions at all. The code is then "serial" and runs at 1/8th of peak at best.
- The application may require I/O, which can be very expensive and stall computation. This alone can reduce efficiency by an order of magnitude in some cases.

A more realistic problem
- Consider numerical hydrodynamics: updating 5 variables; computing wave speeds, solving an eigenvalue problem, computing numerical fluxes
- Roughly 1000 FLOPs per zone per update in 3-d
- 134 GFLOPs per update for 512³ zones
- Runs take on the order of tens of hours: ~10,000 updates is about 14 hours on 4 cores at 10% efficiency
- Memory footprint: 8 bytes x 512³ zones x 5 variables = 5 GB
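A back-of-envelope sketch for these estimates (illustrative only; the 1000 FLOPs/zone, the 10% efficiency, and the update count are the slide's assumptions, not measurements):

```c
/* Rough cost model for the 512^3 hydrodynamics example above. */
#include <stdio.h>

int main(void) {
    double zones          = 512.0 * 512.0 * 512.0;
    double vars           = 5.0;
    double flops_per_zone = 1000.0;                 /* per zone per update, from the slide */

    double flops_update   = flops_per_zone * zones; /* ~134 GFLOPs */
    double mem_bytes      = 8.0 * zones * vars;     /* ~5 GB for the state alone */

    double sustained      = 0.10 * 256e9;           /* 10% of the 4-core peak */
    double sec_per_update = flops_update / sustained;
    double hours_10k      = 10000.0 * sec_per_update / 3600.0;

    printf("FLOPs per update: %.0f GFLOPs\n", flops_update / 1e9);
    printf("State memory:     %.1f GB\n", mem_bytes / 1e9);
    printf("Time per update:  %.1f s\n", sec_per_update);
    printf("10,000 updates:   %.1f hours\n", hours_10k);
    return 0;
}
```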

Consider these simulations…
- Numerical MHD, ~6K FLOPs per zone per update, 12K time steps
- 1088 x 448 x 1088 grid = 530 million zones
- 3.2 TFLOPs per update = 38 PFLOPs for the full calculation
- ~17 days on the Haswell desktop (and it wouldn't even fit in memory anyway)
- The actual simulation was run on the MSI Itasca system using 2,032 Nehalem cores (roughly 6x less FLOP/s per core) for 6 hours

HPC Clusters
- A group of servers connected with a dedicated high-speed network
- Example: Mesabi at MSI
- 741 servers (or nodes), built by HP with an InfiniBand network from Mellanox
- 2.5 GHz, 12-core Haswell server processors, 2 sockets per node
- 960 GFLOP/s per node theoretical peak
- 711 TFLOP/s total system theoretical peak
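The same peak-rate arithmetic as in the desktop sketch, applied to a Mesabi node (illustrative; the 16 FLOPs/cycle figure is assumed to carry over from the Haswell discussion above):

```c
/* Theoretical peak for a dual-socket, 12-core, 2.5 GHz Haswell node
 * and for the 741-node Mesabi system. Illustrative sketch only. */
#include <stdio.h>

int main(void) {
    double clock_ghz       = 2.5;
    int    cores_per_sock  = 12;
    int    sockets         = 2;
    int    flops_per_cycle = 16;   /* AVX2 FMA, as on the desktop Haswell */
    int    nodes           = 741;

    double node_gflops   = clock_ghz * flops_per_cycle * cores_per_sock * sockets;
    double system_tflops = node_gflops * nodes / 1000.0;

    printf("Node peak:   %.0f GFLOP/s\n", node_gflops);    /* 960  */
    printf("System peak: %.0f TFLOP/s\n", system_tflops);  /* ~711 */
    return 0;
}
```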

Intel Haswell 12-core server chip (die diagram, from cyberparse.co.uk)

HPC Clusters
- The goal is to have a network fast enough to ignore… why? Ideally all cores are "close" enough to appear to be on one machine
- Keep the problem FLOP (or at least node) limited as much as possible
- On-node memory (Intel) offers: bandwidth of ~68 GB/s from memory shared across 12 cores (depends on memory speed); latency on the order of tens of nanoseconds
- EDR InfiniBand offers: bandwidth of 100 Gb/s = 12.5 GB/s; latency that varies with topology and location in the network (~1-10 microseconds)
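To see why the network can sometimes be "ignored," here is a toy transfer-time model (a sketch using the rough bandwidth and latency figures quoted above; the 80 ns memory latency and 2 us InfiniBand latency are assumed values within those ranges, and real behavior also depends on contention, protocol, and topology):

```c
/* Toy model: time to move a message = latency + bytes / bandwidth. */
#include <stdio.h>

static double transfer_time(double bytes, double latency_s, double bw_bytes_per_s) {
    return latency_s + bytes / bw_bytes_per_s;
}

int main(void) {
    double sizes[] = { 8.0, 8e3, 8e6 };   /* one double, 1K doubles, 1M doubles */

    for (int i = 0; i < 3; ++i) {
        double mem = transfer_time(sizes[i], 80e-9, 68e9);    /* ~tens of ns, ~68 GB/s  */
        double ib  = transfer_time(sizes[i], 2e-6,  12.5e9);  /* ~1-10 us,   ~12.5 GB/s */
        printf("%8.0f bytes: memory %9.2f us, EDR IB %9.2f us\n",
               sizes[i], mem * 1e6, ib * 1e6);
    }
    return 0;
}
```

For small messages latency dominates and the network is orders of magnitude slower than local memory; for large messages the bandwidth ratio (~5x here) is what matters.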

HPC Clusters: Network Types
- InfiniBand: FDR, EDR, etc.
- Ethernet: up to 10 Gb/s bandwidth; cheaper than InfiniBand but also slower
- Custom: Cray, IBM, and Fujitsu, for example, all have custom networks (not "vanilla" clusters, though)

Network Topology
- The layout of connections between servers in a cluster
- Latency, and often bandwidth, are best for pairs of servers "closer" in the network
- The network is tapered, providing less bandwidth at level-2 routers
- Example: dual fat tree for an SGI Altix system (diagram from www.csm.ornl.gov)

Supercomputers
What's the difference from a cluster? The distinction is a bit subjective:
- Presented as a single system (or mainframe) to the user
- Individual servers are stripped down to the absolute basics
- Nodes typically cannot operate as independent computers apart from the rest of the system

Cray XC40 node (diagram)

Cray XC40 topology (diagram)

Cray XC40 system: the Trinity system at Los Alamos National Laboratory (photo)

Trinity
- Provides mission-critical computing to the National Nuclear Security Administration (NNSA)
- Currently #6 on the Top500.org November 2015 list
- Phase 1 is ~9600 XC40 Haswell nodes (dual socket, 16 cores per socket), ~307,000 cores, with a theoretical peak of ~11 PFLOP/s
- Phase 2 is ~9600 XC KNL nodes (single socket, >60 cores per socket), with a theoretical peak of ~18 PFLOP/s in addition to the 11 PFLOP/s from the Haswell nodes (a conservative estimate; the exact number cannot be released yet)
- 80 petabytes of near-line storage (Cray Sonexion Lustre)
- Draws up to 10 MW at peak utilization!

Accelerators and Many Integrated Cores
- Typical CPUs are not very energy efficient: they are meant to be general purpose, which requires many additional pieces on the chip
- Accelerators pack much more FLOP capability with less energy consumption, but are not exactly general purpose
- They always require more work from the programmer; performance improvements and application porting may take significant effort

Accelerators: GPUs as accelerators
- NVIDIA GPUs can be used to perform calculations
- The latest Kepler generation offers ~1.4 TFLOP/s in a single package

Cray XC GPU node with NVIDIA K20X (diagram)

Cray XC GPU system: Piz Daint at CSCS (#7 on the Top500.org November 2015 list)

Petascale and Beyond
Current Top500 systems (with the usual complaint about the Top500 metric)

Petascale and Beyond
- The next step is "exascale." In the US, the only near-exascale systems are part of the DOE CORAL project
- Cray + Intel are building Aurora: Knights Hill (KNH) processors + a new interconnect from Intel based on the Cray Aries interconnect, ~180 PFLOP/s
- IBM + NVIDIA are building Summit and Sierra: NVIDIA GPUs + POWER CPUs; no precise performance estimate is available (anywhere from 100-300 PFLOP/s)

Petascale and Beyond
- These systems will vet MIC and GPUs as possible technologies on the path to exascale. Both have their risks, and neither may end up getting us there.
- Alternative technologies are being investigated as part of DOE-funded projects:
- ARM (yes, cell phones have ARM, but that version of ARM will NOT be used in HPC, probably ever)
- Both NVIDIA and Intel have DOE-funded projects on GPUs and MIC
- FPGAs (forget it unless something revolutionary happens with that technology)
- "Quantum" computers: D-Wave systems are quantum-like analog computers able to solve exactly one class of problems that fit into minimization by annealing. These systems will NEVER be useful for forecasting the weather or simulating any physics. True quantum gates are being developed but are not scalable right now. This technology may be decades off yet.

Programming
Tomorrow will be a very brief overview of how one programs a supercomputer. 