Performance evaluation and tuning of lattice QCD on the next generation Blue Gene
Jun Doi (doichan@jp.ibm.com), Tokyo Research Laboratory, IBM Research

Background
- We have tuned lattice QCD on Blue Gene/L:
  - We developed the Wilson kernel installed on KEK's Blue Gene; it supports the Wilson and even-odd preconditioned Wilson operators.
  - It sustained 35% of peak performance on Blue Gene/L.
- IBM has announced the next generation Blue Gene, Blue Gene/P; lattice QCD is one of the most important applications for Blue Gene.
- We are porting the Wilson kernel to Blue Gene/P, and we are tuning and evaluating the performance of the Wilson Dirac operator using the new features added in Blue Gene/P.

Major changes from Blue Gene/L to Blue Gene/P
- PowerPC core and clock speed: BG/L: PowerPC 440 at 700 MHz; BG/P: PowerPC 450 at 850 MHz.
- Processor cores per compute node: BG/L: dual core; BG/P: quad core.
- SMP support: BG/L: none; BG/P: 4-way SMP, enabling hybrid parallelization with OpenMP.
- DMA for the 3D torus network: BG/L: no DMA; BG/P: direct remote put and get by DMA.

Comparison of lattice QCD tuning on Blue Gene/L and Blue Gene/P
- Optimization of complex number calculation: BG/L: double FPU instructions applied via inline assembly; BG/P: the same optimization applies.
- Parallelization of lattice QCD: BG/L: mapping the lattice onto the 3D torus network; BG/P: the same mapping applies.
- Usage of the processor cores in a compute node: BG/L: virtual node mode to use all cores for computation; BG/P: choice of virtual node (VN) mode or SMP mode.
- Optimization of boundary data exchange: BG/L: direct access to the torus network, with each core handling its own communication; BG/P: modified to use DMA, so the cores no longer manage the communication themselves.

Optimizing complex number calculation by double FPU (Same as Blue Gene/L)

Calculation of the Wilson Dirac operator on Blue Gene
The multiplication by the gauge matrix for each direction is calculated in 3 steps, following the gamma-matrix structure:
(1) Making half spinors (multiplying by the spin projection defined by the gamma matrix)
(2) Multiplying the gauge matrix U
(3) Multiplying by -Kappa and adding to the spinor
[Diagram: spins 1-4 are combined into half spinors 1 and 2, each half spinor is multiplied by the gauge matrix U, scaled by -Kappa, and accumulated back into spins 1-4.]
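For reference, the three steps implement the hopping term of the Wilson Dirac operator, which (up to normalization and gamma-matrix conventions, which vary between codes) reads:

    D\,\psi(x) = \psi(x) - \kappa \sum_{\mu=1}^{4} \left[ (1 - \gamma_\mu)\, U_\mu(x)\, \psi(x + \hat\mu) + (1 + \gamma_\mu)\, U_\mu^\dagger(x - \hat\mu)\, \psi(x - \hat\mu) \right]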

Step 1: Making half spinors
Two spins are merged into a new spin: according to the gamma matrix, one spin is multiplied by i or by 1 and added to another spin, so spins 1-4 are combined into half spinors 1 and 2.
Multiplying by i or by 1 and adding to a complex number can be done with one double FPU instruction, using the constant value 1.0:
- v = FXCXNPMA(s,t,1.0) or v = FXCXNSMA(s,t,1.0): Re(v) = Re(s) - Im(t), Im(v) = Im(s) + Re(t), i.e. v = s + i * t
- v = FXCPMADD(s,t,1.0) or v = FXCPNMSUB(s,t,1.0): Re(v) = Re(s) + Re(t), Im(v) = Im(s) + Im(t), i.e. v = s + 1 * t
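As a minimal sketch of what these operations do, here is a plain-C model of the two instruction classes written with a real constant k (k = 1.0 here, k = -Kappa in step 3), applied to one color component of a half spinor. The spin pairing shown is one representative case, not necessarily the pairing of the actual kernel, which issues the hardware instructions via inline assembly:

    #include <complex.h>
    #include <stdio.h>

    /* Model of FXCXNPMA-style cross multiply-add: v = s + k*i*t
     * (Re(v) = Re(s) - k*Im(t), Im(v) = Im(s) + k*Re(t)). */
    static double _Complex cxnpma(double _Complex s, double _Complex t, double k)
    {
        return s + k * I * t;
    }

    /* Model of FXCPMADD-style multiply-add: v = s + k*t. */
    static double _Complex cpmadd(double _Complex s, double _Complex t, double k)
    {
        return s + k * t;
    }

    /* Step 1 for one color component: merge the four spins into two half
     * spinors, one FPU operation per output component.  Which spins are
     * paired, and whether i or 1 is used, depends on the direction mu and
     * the gamma-matrix convention. */
    static void make_half_spinor(double _Complex h[2], const double _Complex s[4])
    {
        h[0] = cxnpma(s[0], s[3], 1.0);   /* h1 = spin1 + i*spin4 */
        h[1] = cxnpma(s[1], s[2], 1.0);   /* h2 = spin2 + i*spin3 */
    }

    int main(void)
    {
        double _Complex s[4] = {1 + 2*I, 3 + 4*I, 5 + 6*I, 7 + 8*I}, h[2];
        make_half_spinor(h, s);
        /* step-3 style accumulate with a real constant (here -0.5 as a stand-in for -Kappa) */
        double _Complex acc = cpmadd(h[0], h[1], -0.5);
        printf("h1 = %g%+gi, h2 = %g%+gi, acc = %g%+gi\n",
               creal(h[0]), cimag(h[0]), creal(h[1]), cimag(h[1]),
               creal(acc), cimag(acc));
        return 0;
    }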

Step 2: Multiplying the gauge matrix
Multiplying the matrix for the forward direction:
    v[0] = u[0][0] * w[0] + u[0][1] * w[1] + u[0][2] * w[2];
    v[1] = u[1][0] * w[0] + u[1][1] * w[1] + u[1][2] * w[2];
    v[2] = u[2][0] * w[0] + u[2][1] * w[1] + u[2][2] * w[2];
A complex number multiplication can be calculated by 2 double FPU instructions. For u[0][0] * w[0] (multiplying 2 complex numbers):
    v[0] = FXPMUL   (u[0][0],w[0])        // re(v[0]) = re(w[0])*re(u[0][0]);  im(v[0]) = re(w[0])*im(u[0][0])
    v[0] = FXCXNPMA (v[0],u[0][0],w[0])   // re(v[0]) += -im(w[0])*im(u[0][0]); im(v[0]) += im(w[0])*re(u[0][0])
The remaining terms (+ u[0][1]*w[1] + u[0][2]*w[2]) use FMA instructions:
    v[0] = FXCPMADD (v[0],u[0][1],w[1]);  v[0] = FXCXNPMA (v[0],u[0][1],w[1])
    v[0] = FXCPMADD (v[0],u[0][2],w[2]);  v[0] = FXCXNPMA (v[0],u[0][2],w[2])
Multiplying the Hermitian conjugate matrix for the backward direction, e.g. v[0] = ~u[0][0] * w[0] + ~u[1][0] * w[1] + ~u[2][0] * w[2], likewise takes 2 double FPU instructions per conjugate-complex multiplication:
    v[0] = FXPMUL   (u[0][0],w[0])        // re(v[0]) = re(w[0])*re(u[0][0]);  im(v[0]) = re(w[0])*im(u[0][0])
    v[0] = FXCXNSMA (v[0],u[0][0],w[0])   // re(v[0]) += im(w[0])*im(u[0][0]);  im(v[0]) += -im(w[0])*re(u[0][0])
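To make the 2-instructions-per-complex-multiply pairing concrete, here is a plain-C model and one row of the 3x3 multiply. The operand conventions of the real hardware instructions are simplified here; only the arithmetic pattern (6 FMA-type operations per row, sign-flipped cross term for the conjugate case) is the point:

    #include <complex.h>
    #include <stdio.h>

    /* Simplified models of the four primitives used above. */
    static double _Complex fxpmul  (double _Complex a, double _Complex b)                    { return creal(a) * b; }
    static double _Complex fxcpmadd(double _Complex t, double _Complex a, double _Complex b) { return t + creal(a) * b; }
    static double _Complex fxcxnpma(double _Complex t, double _Complex a, double _Complex b) { return t + I * cimag(a) * b; }
    static double _Complex fxcxnsma(double _Complex t, double _Complex a, double _Complex b) { return t - I * cimag(a) * b; }

    /* One row of v = U * w: the three matrix elements of the row are passed in u[]. */
    static double _Complex row_forward(const double _Complex u[3], const double _Complex w[3])
    {
        double _Complex v = fxpmul(u[0], w[0]);
        v = fxcxnpma(v, u[0], w[0]);
        v = fxcpmadd(v, u[1], w[1]);  v = fxcxnpma(v, u[1], w[1]);
        v = fxcpmadd(v, u[2], w[2]);  v = fxcxnpma(v, u[2], w[2]);
        return v;   /* = u[0]*w[0] + u[1]*w[1] + u[2]*w[2] */
    }

    /* One row of v = U^dagger * w: the elements come from a column of U, and the
     * cross add is replaced by the cross subtract, which conjugates the element. */
    static double _Complex row_backward(const double _Complex u[3], const double _Complex w[3])
    {
        double _Complex v = fxpmul(u[0], w[0]);
        v = fxcxnsma(v, u[0], w[0]);
        v = fxcpmadd(v, u[1], w[1]);  v = fxcxnsma(v, u[1], w[1]);
        v = fxcpmadd(v, u[2], w[2]);  v = fxcxnsma(v, u[2], w[2]);
        return v;   /* = conj(u[0])*w[0] + conj(u[1])*w[1] + conj(u[2])*w[2] */
    }

    int main(void)
    {
        const double _Complex u[3] = {1 + 1*I, 2 - 1*I, 0.5 + 0.25*I};
        const double _Complex w[3] = {1 - 2*I, 3 + 1*I, -1 + 1*I};
        double _Complex f = row_forward(u, w), b = row_backward(u, w);
        printf("forward:  %g%+gi\nbackward: %g%+gi\n", creal(f), cimag(f), creal(b), cimag(b));
        return 0;
    }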

Step 3: Multiplying by -Kappa and adding to the spinor
The same instructions as in Step 1 are used, with the constant value -Kappa instead of 1.0:
- For spins 1 and 2: multiply by -Kappa and add.
- For spins 3 and 4: multiply by -Kappa and by i or by 1, according to the gamma matrix.
Double FPU instructions:
- v = FXCXNPMA(v,w,-Kappa) or v = FXCXNSMA(v,w,-Kappa): v += -Kappa * i * w (spins 3 and 4, when the gamma-matrix entry is imaginary)
- v = FXCPMADD(v,w,-Kappa) or v = FXCPNMSUB(v,w,-Kappa): v += -Kappa * 1 * w (spins 1 and 2, and spins 3 and 4 when the entry is real)

Parallelization and optimization of communication

Mapping the lattice onto the 3D torus network (same as BG/L)
- The Wilson operator is parallelized by dividing the global lattice into small local lattices; boundary data exchange is needed between neighboring local lattices.
- The lattice is mapped onto the topology of the 3D torus network: dividing the lattice by the torus size limits the communication to neighboring compute nodes.
- Core-to-core communication within a node is used as a 4th dimension of the torus: the lattice X direction is mapped to core-to-core communication, and the lattice Y, Z, and T directions are mapped to the 3D torus network.
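A minimal sketch of this decomposition, assuming the 512-node partition used in the measurements below (8x8x8 torus, 4 cores per node); it reproduces the per-core lattice sizes listed in the tables:

    #include <stdio.h>

    int main(void)
    {
        const int torus[3]  = {8, 8, 8};            /* 3D torus dimensions (512 nodes)   */
        const int cores     = 4;                    /* cores per node = 4th dimension    */
        const int global[4] = {32, 32, 32, 64};     /* global lattice X, Y, Z, T         */

        int local[4];
        local[0] = global[0] / cores;               /* X split across the cores of a node */
        local[1] = global[1] / torus[0];            /* Y, Z, T split across the torus     */
        local[2] = global[2] / torus[1];
        local[3] = global[3] / torus[2];

        printf("local lattice per core: %dx%dx%dx%d\n",
               local[0], local[1], local[2], local[3]);   /* -> 8x4x4x8 */
        return 0;
    }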

Boundary data exchange by DMA direct remote put
- The DMA direct put operation puts data directly from local memory into the destination node's local memory.
- We prepare an injection descriptor (destination node, data size, address of the source data, and the address to store it at the destination) and pass it to the DMA; the DMA loads the data and writes it into the torus FIFO, so we can overlap computation, or puts to other destinations, with the transfer.
- At the destination, the DMA stores the data into local memory and decreases a DMA counter; by polling the counter value we know when the necessary data has been received and stored.
- We put all the boundary data for a destination at once: the half spinors of the boundary sites are made and stored into a send buffer for each direction, and then the DMA puts the data to the destination.
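A hypothetical sketch of this flow. The struct fields, function names, and the simulated counter are illustrative only; the real kernel uses the Blue Gene/P DMA interface, whose descriptor layout and calls differ:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        int      dest;         /* destination node (torus neighbour)           */
        size_t   size;         /* bytes of boundary half spinors to put        */
        void    *src;          /* send buffer in local memory                  */
        size_t   dst_offset;   /* where to store in the destination's memory   */
        int      counter_id;   /* reception counter decremented at destination */
    } put_desc_t;

    /* Simulated reception counters: preloaded with the expected byte count and
     * decremented as data "arrives".  In reality the counter lives on the
     * destination node and is decremented by its DMA engine. */
    static size_t counters[8];

    static void dma_direct_put(const put_desc_t *d)
    {
        /* A real direct put returns immediately while the DMA streams the data
         * into the torus FIFO; here the transfer completes instantly. */
        counters[d->counter_id] -= d->size;
    }

    static int boundary_arrived(int counter_id)
    {
        return counters[counter_id] == 0;   /* poll the counter, no MPI call */
    }

    int main(void)
    {
        char sendbuf[1024];                       /* half spinors for one face */
        memset(sendbuf, 0, sizeof sendbuf);

        counters[0] = sizeof sendbuf;             /* expect one face's worth   */
        put_desc_t d = { .dest = 1, .size = sizeof sendbuf,
                         .src = sendbuf, .dst_offset = 0, .counter_id = 0 };

        dma_direct_put(&d);                       /* computation overlaps here */
        while (!boundary_arrived(0))
            ;                                     /* spin until data has landed */
        printf("boundary data received\n");
        return 0;
    }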

Overlapping communication and computation
The flow of one application of the operator (the send buffers hold the half spinor arrays):
1. Set the DMA counter values and synchronize with a global barrier.
2. Make the half spinor arrays for the forward and backward directions into the send buffers and direct put them to the neighbors: Y+ and Y-, then Z+ and Z-, then T+ and T-, so that each transfer overlaps the packing of the next buffer.
3. Make the half spinor array for X+ and X- and exchange it between the cores through shared memory.
4. Overlapping the communication, perform the computation: first for X+ and X-, then for Y+ and Y-, Z+ and Z-, and T+ and T- as their boundary half spinors arrive, multiplying the gauge matrices and accumulating into the spinors.
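A structure-only sketch of this ordering; every helper below is a placeholder stub standing in for the packing, DMA puts, counter polling, and Dirac computation of the real kernel:

    #include <stdio.h>

    static void set_counters(void)             { }                              /* expected bytes per direction */
    static void global_barrier(void)           { }
    static void pack_and_put(const char *dir)  { printf("put %s\n", dir); }     /* build half spinors, DMA put  */
    static void exchange_x_shared_memory(void) { printf("exchange X via shared memory\n"); }
    static void wait_counter(const char *dir)  { (void)dir; }                   /* poll DMA counter for dir     */
    static void compute(const char *dir)       { printf("compute %s\n", dir); }

    int main(void)
    {
        set_counters();
        global_barrier();

        /* Boundary half spinors are put one face at a time, so each DMA
         * transfer overlaps the packing of the next buffer. */
        pack_and_put("Y+/Y-");
        pack_and_put("Z+/Z-");
        pack_and_put("T+/T-");
        exchange_x_shared_memory();     /* X is the core-to-core dimension */

        /* Computation overlaps the transfers; each direction waits only on
         * its own reception counter. */
        compute("X+/X-");
        wait_counter("Y"); compute("Y+/Y-");
        wait_counter("Z"); compute("Z+/Z-");
        wait_counter("T"); compute("T+/T-");
        return 0;
    }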

Performance measurement in virtual node mode

The performance of Wilson Dirac on Blue Gene/P: 512 nodes, virtual node mode (4 cores in node); node mapping: 4x8x8x8 (cores x torus XYZ)

| Global lattice size | 16x16x16x16 | 16x16x16x32 | 24x24x24x24 | 24x24x24x48 | 32x32x32x32 | 32x32x32x64 |
|---|---|---|---|---|---|---|
| Lattice size / core | 4x2x2x2 | 4x2x2x4 | 6x3x3x3 | 6x3x3x6 | 8x4x4x4 | 8x4x4x8 |
| Wilson Dirac: MFLOPS/core (vs peak) | 596.3 (17.54 %) | 709.9 (20.79 %) | 928.7 (27.31 %) | 1037.4 (30.51 %) | 1132.3 (33.30 %) | 1192.4 (35.07 %) |
| Wilson Dirac w/ CG iteration: MFLOPS/core | 463.5 (13.63 %) | 575.7 (16.93 %) | 744.9 (21.91 %) | 827.32 (24.33 %) | 889.6 (26.16 %) | 918.4 (27.01 %) |
| Even-odd prec. Wilson Dirac: MFLOPS/core (vs peak) | 397.8 (11.70 %) | 524.4 (15.42 %) | 735.4 (21.63 %) | 834.0 (24.53 %) | 910.5 (26.78 %) | 974.4 (28.66 %) |
| Even-odd prec. w/ CG iteration: MFLOPS/core | 313.3 (9.21 %) | 431.5 (12.69 %) | 597.5 (17.57 %) | 670.9 (19.73 %) | 730.8 (21.49 %) | 777.7 (22.87 %) |

[Figure: Weak scaling on Blue Gene/P in VN mode for the Wilson Dirac and even-odd preconditioned Wilson Dirac operators, plotted against lattice size (X*Y*Z*T) per core.]

[Figure: Strong scaling on Blue Gene/P in VN mode for the Wilson Dirac and even-odd preconditioned Wilson Dirac operators, plotted for each global lattice size.]

Comparing Blue Gene/P vs Blue Gene/L: 512 nodes, virtual node mode; ideal speed up is x2.43. Blue Gene/L: L1 cache write back mode; Blue Gene/P: L1 cache write through mode.

| Global lattice size | 16x16x16x16 | 16x16x16x32 | 24x24x24x24 | 24x24x24x48 | 32x32x32x32 | 32x32x32x64 |
|---|---|---|---|---|---|---|
| Wilson Dirac: Blue Gene/L MFLOPS/core (vs peak) | 776.7 (27.74 %) | 856.6 (30.59 %) | 933.4 (33.34 %) | 940.1 (33.58 %) | 1015.9 (36.28 %) | 1019.1 (36.40 %) |
| Wilson Dirac: Blue Gene/P MFLOPS/core (vs peak) | 596.3 (17.54 %) | 709.9 (20.79 %) | 928.7 (27.31 %) | 1037.4 (30.51 %) | 1132.3 (33.30 %) | 1192.4 (35.07 %) |
| Wilson Dirac: speed up / node | x 1.54 | x 1.66 | x 1.99 | x 2.21 | x 2.23 | x 2.34 |
| Even-odd prec.: Blue Gene/L MFLOPS/core (vs peak) | 691.6 (24.70 %) | 749.7 (26.77 %) | 873.4 (31.19 %) | 872.6 (31.17 %) | 961.0 (34.32 %) | 942.3 (34.58 %) |
| Even-odd prec.: Blue Gene/P MFLOPS/core (vs peak) | 397.8 (11.70 %) | 524.4 (15.42 %) | 735.4 (21.63 %) | 834.0 (24.53 %) | 910.5 (26.78 %) | 974.4 (28.66 %) |
| Even-odd prec.: speed up / node | x 1.15 | x 1.40 | x 1.68 | x 1.91 | x 1.89 | x 2.07 |

Performance measurement in SMP mode

2 approaches to parallelization using OpenMP

Outer-most loop parallelization: the same code as VN mode with directives added; each thread keeps the same data access pattern as a core in VN mode (the X range is divided among the threads).

    #pragma omp parallel
    {
      np  = omp_get_num_threads();
      pid = omp_get_thread_num();
      nx  = Nx / np;
      sx  = Nx * pid / np;
      for(i=0;i<Nt*Nz*Ny;i++){
        for(x=sx;x<sx+nx;x++){
          // computation for X
        }
      }
      for(i=0;i<Nt*Nz;i++){
        // computation for Y
      }
      for(i=0;i<Nt*Ny;i++){
        // computation for Z
      }
      for(i=0;i<Nz*Ny;i++){
        // computation for T
      }
    }

Inner-most loop parallelization: each site loop is parallelized with an OpenMP work-sharing directive (shown here for the X loop; the Y, Z, and T loops are parallelized in the same way).

    #pragma omp parallel for private(x)
    for(i=0;i<Nt*Nz*Ny;i++){
      for(x=0;x<Nx;x++){
        // computation for X
      }
    }
    for(i=0;i<Nt*Nz;i++){
      // computation for Y
    }
    for(i=0;i<Nt*Ny;i++){
      // computation for Z
    }
    for(i=0;i<Nz*Ny;i++){
      // computation for T
    }
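For completeness, a small self-contained illustration of the inner-most-loop approach; the array sizes and the trivial loop body are placeholders, not the kernel's actual data layout:

    #include <omp.h>
    #include <stdio.h>

    #define Nx 8
    #define Ny 4
    #define Nz 4
    #define Nt 8

    static double field[Nt * Nz * Ny][Nx];

    int main(void)
    {
        int i, x;

        /* The four cores of a BG/P node split the Nt*Nz*Ny iteration space,
         * while the loop body stays identical to the VN-mode kernel. */
        #pragma omp parallel for private(x)
        for (i = 0; i < Nt * Nz * Ny; i++) {
            for (x = 0; x < Nx; x++) {
                field[i][x] += 1.0;   /* stands in for the X-direction computation */
            }
        }

        printf("threads used: %d\n", omp_get_max_threads());
        return 0;
    }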

The performance of Wilson Dirac on Blue Gene/P: 512 nodes, SMP mode

| Global / local lattice size | 16x16x16x16 / 16x2x2x2 | 16x16x16x32 / 16x2x2x4 | 24x24x24x24 / 24x3x3x3 | 24x24x24x48 / 24x3x3x6 | 32x32x32x32 / 32x4x4x4 | 32x32x32x64 / 32x4x4x8 |
|---|---|---|---|---|---|---|
| Wilson Dirac, outer-most loop parallelization: MFLOPS/node (vs peak) | 1620.3 (11.91 %) | 2333.8 (17.16 %) | 2613.9 (19.22 %) | 3275.0 (24.08 %) | 3715.48 (27.32 %) | 3925.5 (28.86 %) |
| Wilson Dirac, outer-most: w/ CG iteration MFLOPS/node | 1048.1 (7.71 %) | 1625.9 (11.96 %) | 2079.6 (15.29 %) | 2619.6 (19.26 %) | 2872.1 (21.12 %) | 2985.7 (21.95 %) |
| Wilson Dirac, inner-most loop parallelization: MFLOPS/node (vs peak) | 2426.0 (17.84 %) | 2869.6 (21.10 %) | 3622.2 (26.63 %) | 4037.2 (29.69 %) | 4268.5 (31.39 %) | 4631.4 (34.05 %) |
| Wilson Dirac, inner-most: w/ CG iteration MFLOPS/node | 1434.9 (10.55 %) | 2062.2 (15.16 %) | 2595.3 (19.08 %) | 3065.1 (22.54 %) | 3339.3 (24.55 %) | 3515.0 (25.85 %) |
| Even-odd prec. Wilson Dirac, inner-most loop parallelization: MFLOPS/node (vs peak) | 1614.8 (11.87 %) | 2062.2 (15.16 %) | 2745.9 (20.19 %) | 3037.1 (22.33 %) | 3349.2 (24.63 %) | 3608.0 (26.53 %) |
| Even-odd prec., inner-most: w/ CG iteration MFLOPS/node | 899.9 (6.62 %) | 1322.1 (9.72 %) | 1940.5 (14.27 %) | 2294.5 (16.87 %) | 2556.5 (18.79 %) | 2796.3 (20.56 %) |

[Figure: Weak scaling on Blue Gene/P in SMP mode (inner-most loop parallelization) for the Wilson Dirac and even-odd preconditioned Wilson Dirac operators, plotted against lattice size (X*Y*Z*T) per node.]

[Figure: Strong scaling on Blue Gene/P in SMP mode (inner-most loop parallelization) for the Wilson Dirac and even-odd preconditioned Wilson Dirac operators, plotted for each global lattice size.]

[Figure: Comparison of VN mode and SMP mode for the Wilson Dirac and even-odd preconditioned Wilson Dirac operators.]

Summary
- Blue Gene/P shows good performance for lattice QCD:
  - Roughly 2x the per-node performance of Blue Gene/L, even though Blue Gene/L ran with the L1 cache in write back mode while Blue Gene/P runs in write through mode.
  - Very good weak scaling, and good strong scaling for large lattices.
- Blue Gene/P offers more flexible tuning opportunities than Blue Gene/L:
  - The DMA makes it easier to optimize communication and computation.
  - SMP mode has more potential for optimization.
- Future work:
  - Increase the performance of the even-odd preconditioned Wilson operator and of SMP mode.
  - Test in an actual application.

Acknowledgement
We have tuned and run our codes on the Blue Gene/P at the IBM Watson Research Center. Thanks to:
- IBM Tokyo Research Laboratory: Kei Kawase
- IBM Watson Research Center: James Sexton, John Gunnels, Philip Heidelberger
- IBM Rochester: Jeff Parker
- LLNL: Pavlos Vranas
- KEK: Hideo Matsufuru, Shoji Hashimoto

Backup

Comparing gauge matrix array layouts
- Our gauge array layout (JLQCD collaboration): double _Complex U[3][3][X][Y][Z][T][Mu]; with Mu outermost.
  - Good for hardware prefetching (vector access).
  - Good for reusing data in cache for large lattices.
- Gauge array layout of CPS: double _Complex U[3][3][Mu][X][Y][Z][T]; with Mu innermost.
  - The strided memory access is bad for prefetching because the 3x3 matrix is not aligned to the L1 cache line.
  - Data in cache cannot be reused for large lattices.
- We recommend our gauge layout for Blue Gene.

Comparison of the performance of the gauge array layouts: 512 nodes, virtual node mode

| Global lattice size | 16x16x16x16 | 16x16x16x32 | 24x24x24x24 | 24x24x24x48 | 32x32x32x32 | 32x32x32x64 |
|---|---|---|---|---|---|---|
| Lattice size / core | 4x2x2x2 | 4x2x2x4 | 6x3x3x3 | 6x3x3x6 | 8x4x4x4 | 8x4x4x8 |
| Our gauge array layout: MFLOPS/core (vs peak) | 596.3 (17.54 %) | 709.9 (20.79 %) | 928.7 (27.31 %) | 1037.4 (30.51 %) | 1132.3 (33.30 %) | 1192.4 (35.07 %) |
| Our layout w/ CG iteration: MFLOPS/core | 463.5 (13.63 %) | 575.7 (16.93 %) | 744.9 (21.91 %) | 827.32 (24.33 %) | 889.6 (26.16 %) | 918.4 (27.01 %) |
| CPS's gauge array layout: MFLOPS/core (vs peak) | 563.1 (16.56 %) | 647.8 (19.05 %) | 815.7 (23.99 %) | 896.1 (26.36 %) | 959.7 (28.23 %) | 1013.7 (29.82 %) |
| CPS's layout w/ CG iteration: MFLOPS/core | 440.4 (12.95 %) | 533.7 (15.70 %) | 669.7 (19.70 %) | 745.6 (21.93 %) | 793.4 (23.33 %) | 798.5 (23.48 %) |