Kyle Spafford, Jeremy S. Meredith, Jeffrey S. Vetter

Early work: S3D and DCA++

"An experimental high performance computing system of innovative design." "Outside the mainstream of what is routinely available from computer vendors." - National Science Foundation, Track 2D call, Fall 2008

Keeneland (GT/ORNL)

Inside a Node
- 4 hot-plug SFF (2.5") HDDs
- 1 GPU module in the rear, lower 1U
- 2 GPU modules in the upper 1U
- Dual 1 GbE
- Dedicated management iLO3 LAN and 2 USB ports
- VGA, UID LED and button, health LED, serial (RJ45), power button
- QSFP (QDR InfiniBand)
- 2 non-hot-plug SFF (2.5") HDDs

Node Block Diagram (diagram: two CPUs with DDR3 RAM, linked by QPI to each other and to two I/O hubs; the GPUs (6 GB each) attach to the I/O hubs over PCIe x16, with integrated InfiniBand on one of the hubs)
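Not from the slides, but a quick way to see how the GPUs map onto this topology is to print each device's PCI bus ID with the CUDA runtime and compare it against the node's NUMA layout (for example, from numactl --hardware). A minimal sketch in C; matching bus IDs to a particular I/O hub is left to the reader:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA devices visible\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        struct cudaDeviceProp prop;
        char busid[32];
        cudaGetDeviceProperties(&prop, dev);
        cudaDeviceGetPCIBusId(busid, sizeof(busid), dev);
        /* The PCI bus ID shows which root complex (I/O hub) the GPU sits
           behind; pair it with `numactl --hardware` to find the nearest
           socket. */
        printf("GPU %d: %s  PCI %s\n", dev, prop.name, busid);
    }
    return 0;
}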

Why a dual I/O hub? (Diagram: in a Tesla 1U setup, GPU #0 and GPU #1 share a PCIe switch behind a single I/O hub, so the one 8.0 GB/s link into the IOH becomes the bottleneck; with two I/O hubs, each CPU/IOH pair gets its own GPU links.)

Introduction of NUMA (Diagram: the short path runs from CPU #0 through its local IOH to GPU #0; the long path runs from CPU #0 across QPI to the remote IOH and GPU #1; the links are labeled 8.0 and 12.8 GB/s.)

Bandwidth Penalty (chart: host-to-device, H->D, copy bandwidth from CPU #0)

Bandwidth Penalty (chart: device-to-host, D->H, copy bandwidth from CPU #0; the chart calls out ~2 GB/s)
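The numbers above come from the authors' measurements; a rough way to reproduce the effect yourself (a sketch, not their benchmark) is to time pinned-memory copies while forcing the process onto each socket with numactl, e.g. numactl --cpunodebind=0 --membind=0 ./bw 0 versus numactl --cpunodebind=1 --membind=1 ./bw 0. The device index, transfer size, and repetition count below are illustrative choices:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    int dev = (argc > 1) ? atoi(argv[1]) : 0;   /* GPU to test */
    size_t bytes = 64UL * 1024 * 1024;          /* 64 MB per transfer */
    int reps = 20;

    cudaSetDevice(dev);
    void *h, *d;
    cudaMallocHost(&h, bytes);                  /* pinned host buffer */
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    /* Host-to-device bandwidth */
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("H->D: %.2f GB/s\n", reps * bytes / (ms * 1e6));

    /* Device-to-host bandwidth */
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("D->H: %.2f GB/s\n", reps * bytes / (ms * 1e6));

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}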

Other Benchmark Results: MPI latency shows a 26% penalty for large messages and 12% for small messages. SHOC benchmarks: the mismap penalty (charted on the slide) puts this effect in context.
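For context on where such MPI latency numbers come from, here is a minimal ping-pong (again a sketch, not the benchmark used in the talk) that can be pinned to different sockets with the numactl recipe shown later; the message size and repetition count are arbitrary:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 1000;
    char buf[8] = {0};                      /* small message */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; ++i) {
        if (rank == 0) {
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("half round-trip latency: %.2f us\n",
               (MPI_Wtime() - t0) / (2.0 * reps) * 1e6);
    MPI_Finalize();
    return 0;
}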

Given a multi-GPU app, how should processes be pinned?

Maximize GPU Bandwidth (Diagram: CPU #0 and CPU #1 each with an IOH; GPU #0 and the InfiniBand adapter hang off one IOH, GPU #1 and GPU #2 off the other; each rank is pinned to the socket nearest its GPU.)

Maximize MPI Bandwidth (Diagram: the same node, with ranks instead placed on the socket nearest the IOH that hosts the InfiniBand adapter.) Pretty easy, right?

Pinning with numactl: numactl --cpunodebind=0 --membind=0 ./program

Pinning with numactl (per-rank wrapper script):

if [[ $OMPI_COMM_WORLD_LOCAL_RANK == "2" ]]; then
    numactl --cpunodebind=1 --membind=1 ./prog
elif [[ $OMPI_COMM_WORLD_LOCAL_RANK == "1" ]]; then
    numactl --cpunodebind=1 --membind=1 ./prog
else
    # rank = 0
    numactl --cpunodebind=0 --membind=0 ./prog
fi
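The wrapper pins CPUs and memory; each rank still has to select the GPU on its side of the node. A common pattern, shown here as an illustration rather than as the Keeneland recipe, is to read Open MPI's local-rank environment variable and map it to a device consistent with the 0/1/1 socket binding above:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    /* OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI; the rank-to-GPU
       mapping below is an illustrative choice, not a fixed rule. */
    const char *env = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = env ? atoi(env) : 0;

    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    int dev = (ngpus > 0) ? local_rank % ngpus : 0;

    if (cudaSetDevice(dev) != cudaSuccess) {
        fprintf(stderr, "local rank %d: cudaSetDevice(%d) failed\n",
                local_rank, dev);
        return 1;
    }
    printf("local rank %d -> GPU %d of %d\n", local_rank, dev, ngpus);
    /* ... rest of the application ... */
    return 0;
}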

HPL Scaling: sustained MPI and GPU operations; uses the other CPU cores via Intel MKL.

What Happened with 0-1-1? (Diagram sequence: the MPI tasks land on CPU #0 and CPU #1 as intended, but the MKL threads they spawn inherit the numactl pinning, so the threads pile onto the already-bound cores: two idle cores and one oversubscribed socket!)

NUMA Impact on Apps

Well… (chart; axis label: time)

Can we improve utilization by sharing a Fermi among multiple tasks?

Bandwidth of the Most Bottlenecked Task (chart)

Is the second I/O hub worth it? Aggregate bandwidth to the GPUs is 16.9 GB/s. What about real app behavior?
- Scenario A: "HPL"-style, 1 MPI and 1 GPU task per GPU
- Scenario B: A, plus 1 MPI task for each other core

Contention Penalty (chart)

Puzzler, Pinning Redux: do ranks 1 and 2 always have a long path? (Diagrams: CPU #0 reaching GPU #1 through the remote IOH, while CPU #1 reaches the InfiniBand adapter through the other IOH; wherever those ranks are pinned, either the GPU traffic or the MPI traffic takes the long path.)

Split MPI and GPU – MPI latency (chart)

Split MPI and GPU – PCIe bandwidth (chart)

Takeaways
- Dual I/O hubs deliver, but they add complexity.
- Ignoring that complexity will sink some apps: wrong pinning sank HPL, and bandwidth-bound kernels and "function offload" apps are the most exposed.
- Threads and libnuma can help, but they can be tedious to use (see the libnuma sketch below).
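As a concrete illustration of that last point (a sketch, not code from the talk): libnuma lets a process, or a thread it creates later, rebind itself at run time instead of relying on an external numactl wrapper. The node number passed in below assumes the two-socket layout discussed above; link with -lnuma.

#include <stdio.h>
#include <numa.h>

/* Bind the calling thread and its future allocations to one socket.
   Call this early in each MPI rank (or worker thread) with the NUMA
   node nearest that rank's GPU. */
static int bind_to_node(int node) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not available on this system\n");
        return -1;
    }
    if (numa_run_on_node(node) != 0) {      /* restrict CPU placement */
        perror("numa_run_on_node");
        return -1;
    }
    numa_set_preferred(node);               /* prefer memory on this node */
    return 0;
}

int main(void) {
    /* Example: pin to node 0 (the socket next to GPU #0 in the node
       diagram above); adjust per rank. */
    if (bind_to_node(0) == 0)
        printf("bound to NUMA node 0\n");
    return 0;
}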
