Early Performance Evaluation of Lattice QCD on POWER+GPU Cluster
Jun Doi, IBM Research – Tokyo
17 July 2015

POWER Gains New Power of Computation
- POWER + GPU
  - OpenPOWER Foundation: open architecture, collaborative development of systems and software
  - NVIDIA GPUs now support POWER processors
- Next-generation supercomputers
  - CORAL collaboration: more than 100 PFlops (SUMMIT and SIERRA)
  - NVLINK: faster data link between hosts and GPUs

IBM POWER8 and NVIDIA Tesla GPU
- The first IBM POWER product with GPUs: IBM Power System S824L
  - 2 sockets of POWER8 processors per node
  - Up to 2 Tesla GPUs per node, connected via PCI Express (Gen3 x16)
  - Tesla K40 or K80
- Linux and little-endian support: Ubuntu 14.04

Overview of the POWER8 Processor
- 12 cores per socket at 3.02 GHz (S824L)
  - 8 hardware threads (SMT) per core: 96 threads per socket, 192 threads per node
- SIMD instructions (VMX/VSX)
  - 2x 2-way double-precision FMA instructions per cycle (2x 4-way for single precision)
  - Peak per socket: 12 cores x 3.02 GHz x 8 flops/cycle, roughly 290 GFlops (double) and 580 GFlops (single)

Development Environment of POWER+GPU Systems
- Very similar to a typical x86+GPU Linux cluster
  - The POWER processor supports little endian
  - Most software runs much as it does on x86 clusters
- Compilers
  - gcc: can optimize for POWER8
  - IBM XL compilers: strong optimizers for C/C++ and Fortran
- CUDA 7.0

Performance Consideration of the Wilson-Dirac Operator
- Per lattice site: 1,488 flops (1,464 flops with even-odd preconditioning)
  - Double precision: 2,688-byte load, 192-byte store (single precision: half of that)
  - Required byte/flop, double: 2.06 (2.03 even-odd); single: 1.03 (1.01 even-odd)
- Hardware byte/flop (peak memory bandwidth / peak flops):

  Byte/flop     Double    Single
  POWER8        ~0.66     ~0.33
  Tesla K40     ~0.20     ~0.067

- Memory bandwidth is the bottleneck: reducing memory access is important for performance, especially on GPUs

Data Structure Comparison
- The spinor field stores 3x4 complex elements per lattice site
- SoA (Structure of Arrays): each of the 12 spinor elements is its own array over all sites (Spin1_1[0..N-1], Spin1_2[0..N-1], ...)
  - Better for memory coalescing on the GPU
- AoS (Array of Structures): all 3x4 elements of one site are stored together, site after site
  - Better for cache optimization on the CPU (POWER8)
(Both layouts are sketched below.)
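The two layouts can be written down roughly as follows. This is a minimal sketch in CUDA C++; the type and function names (SpinorSiteAoS, SpinorFieldSoA, load_soa) are illustrative placeholders, not names from the original code.

```cuda
#include <cuComplex.h>

// AoS: the 12 complex spinor components of one site are contiguous.
// Good for CPU cache reuse when one core processes a whole site.
struct SpinorSiteAoS {
    cuDoubleComplex s[4][3];       // [spin][color] for a single site
};
// Usage: SpinorSiteAoS *field_aos; field_aos[site].s[spin][color]

// SoA: every spinor component has its own array over all N sites.
// Consecutive GPU threads (consecutive 'site') read consecutive
// addresses, so the loads coalesce.
struct SpinorFieldSoA {
    cuDoubleComplex *s[4][3];      // s[spin][color][site], each of length N
};

__device__ cuDoubleComplex load_soa(const SpinorFieldSoA &f,
                                    int spin, int color, int site) {
    return f.s[spin][color][site]; // coalesced across a warp
}
```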

Offloading the Wilson-Dirac Operator to the GPU
- All data stays resident on the GPU
  - The problem size is limited so that it fits in GPU memory
  - Gauge matrices are transferred to the GPU in advance
  - Spinor fields used in the BiCGStab solver are allocated and transferred to the GPU in advance
  - No data transfer between host and GPU during the solve, except for the boundary exchange needed for parallelization
- Multi-GPU offloading
  - No GPUDirect (peer-to-peer copy) support, so GPU-to-GPU transfers go through host memory
  - Single-thread / multi-GPU execution (sketched below)
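A minimal sketch of this single-thread, multi-GPU setup with persistent device buffers, assuming one host thread switches between GPUs with cudaSetDevice; the function and variable names are placeholders, and each GPU is assumed to receive its own sub-lattice of the gauge field.

```cuda
#include <cuda_runtime.h>

// One host thread drives all GPUs in the node by switching the current
// device; fields are allocated once and stay resident on each GPU.
void setup_fields(int nGPU, size_t gaugeBytes, size_t spinorBytes,
                  void *const gauge_h[], void *gauge_d[], void *spinor_d[]) {
    for (int g = 0; g < nGPU; ++g) {
        cudaSetDevice(g);                        // single thread, many GPUs
        cudaMalloc(&gauge_d[g], gaugeBytes);     // persistent gauge field
        cudaMalloc(&spinor_d[g], spinorBytes);   // persistent solver spinors
        // Each GPU's sub-lattice of gauge links is copied once, up front;
        // inside the solver only the boundary exchange touches the host.
        cudaMemcpy(gauge_d[g], gauge_h[g], gaugeBytes, cudaMemcpyHostToDevice);
    }
}
```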

Optimization of the Wilson-Dirac Operator on the GPU
- Warp-size fitting
  - The number of threads per thread block is set to a multiple of 32 (the warp size)
  - It is derived from the lattice size in X as the least common multiple of X and 32
  - E.g., for X = 48, use 2 x 48 = 96 threads (3 warps)
- Gauge field parameterization
  - Load 6 of the 9 complex elements of each SU(3) gauge matrix and reconstruct the remaining 3 by calculation
  - Costs 42 extra flops but cuts memory traffic for the gauge field to 2/3
(Both optimizations are sketched below.)
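A sketch of both optimizations, assuming the standard SU(3) identity that the third row is the complex conjugate of the cross product of the first two rows; the helper names are illustrative.

```cuda
#include <cuComplex.h>

// Threads per block = least common multiple of the local X size and the
// warp size (32); e.g. X = 48 -> lcm(48, 32) = 96 threads = 3 warps.
static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }
static int threads_per_block(int x) { return x / gcd(x, 32) * 32; }

// Reconstruct row 2 of an SU(3) matrix from rows 0 and 1:
//   u[2] = conj(u[0] x u[1])
// Each of the 3 elements costs 2 complex multiplies + 1 complex subtract
// = 14 flops, i.e. 42 extra flops per matrix, while only 6 of the 9
// complex elements have to be loaded from memory (2/3 of the traffic).
__device__ void reconstruct_third_row(cuDoubleComplex u[3][3]) {
    for (int i = 0; i < 3; ++i) {
        int j = (i + 1) % 3, k = (i + 2) % 3;
        cuDoubleComplex a = cuCmul(u[0][j], u[1][k]);
        cuDoubleComplex b = cuCmul(u[0][k], u[1][j]);
        u[2][i] = cuConj(cuCsub(a, b));
    }
}
```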

Parallelization of Lattice QCD on the GPU Cluster
- The lattice is not divided in the X dimension, to avoid non-sequential access on the innermost boundary
- Within a POWER8 node, the lattice is divided in the T dimension between the 2 GPUs
- Across nodes, the lattice is divided in the T dimension and in the Z and Y dimensions
(Figure: two POWER8 nodes, each with GPU0 and GPU1, illustrating the decomposition; the resulting local sizes are sketched below.)
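A minimal sketch of the local sub-lattice size this decomposition implies, assuming global dimensions GX x GY x GZ x GT and a PY x PZ x PT grid of nodes with 2 GPUs per node; all variable names here are placeholders.

```cuda
// X is never divided; Y, Z and T are divided across nodes, and the
// local T extent is split again between the two GPUs of a node.
struct LocalDims { int x, y, z, t; };

LocalDims local_dims_per_gpu(int GX, int GY, int GZ, int GT,
                             int PY, int PZ, int PT, int gpusPerNode) {
    LocalDims d;
    d.x = GX;                       // kept whole on every GPU
    d.y = GY / PY;                  // divided across nodes
    d.z = GZ / PZ;                  // divided across nodes
    d.t = GT / PT / gpusPerNode;    // nodes first, then the 2 GPUs per node
    return d;
}
```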

Offloading and Data Transfer of the Wilson-Dirac Operator
(Figure: the GPU holds the spinor, the gauge field, the output spinor, and a send buffer; the host holds send and receive buffers.)
(1) Make the half-spinors of the boundary (pack the send buffer on the GPU)
(2) GPU-to-host transfer of the send buffer
(3) Node-to-node transfer (MPI)
(4) Inner calculation
(5) Host-to-GPU transfer of the received boundary
(6) Boundary calculation
(The host-staged transfer path of steps 2, 3, and 5 is sketched below.)
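Because GPUDirect is not available, the boundary exchange of steps (2), (3), and (5) is staged through host memory. The sketch below shows that path for one direction; the buffer names and neighbor ranks are illustrative, and the real code overlaps these steps asynchronously rather than blocking as this simplified version does.

```cuda
#include <cuda_runtime.h>
#include <mpi.h>

// Stage one boundary direction through the host: copy the packed
// half-spinors off the GPU, exchange them with the neighboring rank,
// then copy the received half-spinors back onto the GPU.
void exchange_boundary(void *sendbuf_d, void *sendbuf_h,
                       void *recvbuf_h, void *recvbuf_d, size_t bytes,
                       int neighborFwd, int neighborBwd,
                       MPI_Comm comm, cudaStream_t stream) {
    cudaMemcpyAsync(sendbuf_h, sendbuf_d, bytes,
                    cudaMemcpyDeviceToHost, stream);      // (2) GPU -> host
    cudaStreamSynchronize(stream);
    MPI_Sendrecv(sendbuf_h, (int)bytes, MPI_BYTE, neighborFwd, 0,
                 recvbuf_h, (int)bytes, MPI_BYTE, neighborBwd, 0,
                 comm, MPI_STATUS_IGNORE);                // (3) node to node
    cudaMemcpyAsync(recvbuf_d, recvbuf_h, bytes,
                    cudaMemcpyHostToDevice, stream);      // (5) host -> GPU
}
```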

Asynchronous Data Transfer Using CUDA Streams
- (Number of parallelized dimensions) x 2 + 1 CUDA streams execute asynchronously
  - One stream runs the inner calculation
  - Each parallelized dimension gets one stream per direction (plus and minus), which pipelines: making the half-spinors, the GPU-host transfer, the MPI transfer (asynchronous), and the boundary calculation (the gauge-matrix multiply is applied when making the half-spinors for one direction and during the boundary calculation for the other)
(Figure: timeline of the Y+/Y-, Z+/Z-, T+/T- streams overlapping with the inner calculation; a stream-setup sketch follows.)
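A minimal sketch of the stream setup this implies; the function and array names are placeholders.

```cuda
#include <cuda_runtime.h>

// One stream for the inner (bulk) calculation plus two streams per
// parallelized dimension (minus and plus direction): 2*ndim + 1 streams
// in total, so boundary packing, host transfers, and boundary updates
// can overlap with the inner kernel.
void create_dslash_streams(int ndim, cudaStream_t *innerStream,
                           cudaStream_t boundaryStreams[][2]) {
    cudaStreamCreate(innerStream);                 // inner calculation
    for (int d = 0; d < ndim; ++d) {
        cudaStreamCreate(&boundaryStreams[d][0]);  // minus direction of dim d
        cudaStreamCreate(&boundaryStreams[d][1]);  // plus direction of dim d
    }
}
```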

Testing Environment
(Figure: node schematic with the POWER8 (P8) host and two Tesla K40 GPUs.)

                        CPU: IBM POWER8     GPU: NVIDIA Tesla K40
  # of cores            12                  2,880
  Clock speed           3.02 GHz            0.745 GHz
  Peak perf. (double)   ~290 GFlops         1,430 GFlops
  Peak perf. (single)   ~580 GFlops         4,290 GFlops
  Memory                512 GB              12 GB
  Memory bandwidth      192 GB/s            288 GB/s

- IBM Power System S824L, 4 nodes, connected via InfiniBand
- Ubuntu 14.04, CUDA Toolkit version 7.0

POWER8 Performance: Strong Scaling (Double)
(Chart: BiCGStab solver with the even-odd preconditioned Wilson-Dirac operator.)

POWER8 Performance: Weak Scaling (Double)
(Chart: BiCGStab solver with the even-odd preconditioned Wilson-Dirac operator.)

K40 Performance: Strong Scaling (Double)
(Chart: BiCGStab solver with the even-odd preconditioned Wilson-Dirac operator.)

K40 Performance: Weak Scaling (Double)
(Chart: BiCGStab solver with the even-odd preconditioned Wilson-Dirac operator.)

K40 Performance: Strong Scaling (Single)
(Chart: BiCGStab solver with the even-odd preconditioned Wilson-Dirac operator.)

K40 Performance: Weak Scaling (Single)
(Chart: BiCGStab solver with the even-odd preconditioned Wilson-Dirac operator.)

Summary
- POWER has gained new computational power
  - POWER8: high bandwidth per flop, high efficiency
    - About 350 GFlops on 4 nodes
    - Higher efficiency than the GPU
    - Can be improved further by hand-coding the SIMD
  - Tesla GPU: high computational capacity, high performance
    - ... GFlops with 8 K40s (double precision)
    - ... GFlops with 8 K40s (single precision)
- Future work
  - Optimization for newer GPUs and NVLINK
  - Workload balancing between POWER and GPUs
  - Other solvers and algorithms for more efficiency