Published by Τιτάνια Δημαράς. Modified over 6 years ago.
1
Performance evaluation and tuning of lattice QCD on the next generation Blue Gene
Jun Doi (doichan@jp.ibm.com), Tokyo Research Laboratory, IBM Research
2
Background
We have tuned lattice QCD on Blue Gene/L
  We developed the Wilson kernel installed on KEK's Blue Gene
  Our kernel supports Wilson and even-odd preconditioned Wilson
  Sustained 35% of peak performance on Blue Gene/L
IBM has announced the next generation Blue Gene, Blue Gene/P
  Lattice QCD is one of the most important applications for Blue Gene
We are porting the Wilson kernel to Blue Gene/P
  We are tuning and evaluating the performance of the Wilson Dirac operator using new features added to Blue Gene/P
3
Major changes from Blue Gene/L to Blue Gene/P
PowerPC core changes, clock speed increased
  BG/L: PowerPC MHz
  BG/P: PowerPC MHz
Number of processor cores per compute node increased
  BG/L: Dual core
  BG/P: Quad core
SMP support
  BG/L: No SMP support
  BG/P: 4-way SMP support, hybrid parallelization using OpenMP
DMA for 3D torus network
  BG/L: No DMA
  BG/P: Direct remote put and get by DMA
4
Comparison of Lattice QCD tuning on Blue Gene/L and /P
Optimization of complex number calculation
  BG/L: Applying double FPU instructions using inline assembly
  BG/P: We can do the same optimization
Parallelization of lattice QCD
  BG/L: Mapping the lattice onto the 3D torus network
  BG/P: We can do the same
Usage of processor cores in a compute node
  BG/L: Using virtual node mode to use all cores for computation
  BG/P: We can select virtual node (VN) mode or SMP mode
Optimization of boundary data exchange
  BG/L: Direct access to the torus network; each core has to handle communication by itself
  BG/P: We modified the code to use DMA; the cores do not have to handle communication
5
Optimizing complex number calculation by double FPU
(Same as Blue Gene/L)
6
Calculation of Wilson Dirac operator on Blue Gene
Calculation of the gauge matrix multiplication for each direction in 3 steps
  (1) Making half spinor (multiplying by the gamma matrix)
  (2) Multiplying gauge matrix
  (3) Multiplying -Kappa and adding to spinor
[Figure: spin 1 to spin 4 are merged into h-spin 1 and h-spin 2; each h-spin is multiplied by matrix U; -Kappa x h-spin 1 and -Kappa x h-spin 2 are added to spin 1 to spin 4]
7
Step 1 : Making half spinor
Merging 2 spins into a new spin (half spinor) by multiplying one spin by i or 1 according to the gamma matrix and adding it to the other spin
Multiplying by i or 1 and adding to a complex number can be done by one double FPU instruction using the constant value 1.0
[Figure: spin 1 to spin 4 are added pairwise to form h-spin 1 and h-spin 2]
Double FPU instructions
  v = s + i * t
    v = FXCXNPMA(s,t,1.0) or v = FXCXNSMA(s,t,1.0)
    Re(v) = Re(s) - Im(t)
    Im(v) = Im(s) + Re(t)
  v = s + 1 * t
    v = FXCPMADD(s,t,1.0) or v = FXCPNMSUB(s,t,1.0)
    Re(v) = Re(s) + Re(t)
    Im(v) = Im(s) + Im(t)
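The two instruction forms above can be mimicked in portable C to check the arithmetic. This is only a sketch: `cpair`, `emu_fxcxnpma` and `emu_fxcpmadd` are hypothetical names that follow the Re/Im equations on this slide, not the compiler built-ins, and the constant is passed as a plain double.

```c
/* One complex number held as the two halves of a double FPU register pair. */
typedef struct { double re, im; } cpair;

/* Hypothetical stand-in for FXCXNPMA(s, t, c) as written on the slide:
   Re(v) = Re(s) - c * Im(t), Im(v) = Im(s) + c * Re(t), i.e. v = s + i * c * t. */
cpair emu_fxcxnpma(cpair s, cpair t, double c) {
    cpair v = { s.re - c * t.im, s.im + c * t.re };
    return v;
}

/* Hypothetical stand-in for FXCPMADD(s, t, c):
   Re(v) = Re(s) + c * Re(t), Im(v) = Im(s) + c * Im(t), i.e. v = s + c * t. */
cpair emu_fxcpmadd(cpair s, cpair t, double c) {
    cpair v = { s.re + c * t.re, s.im + c * t.im };
    return v;
}
```

With s = 1 + 2i and t = 3 + 4i, `emu_fxcxnpma(s, t, 1.0)` gives s + i * t = -3 + 5i and `emu_fxcpmadd(s, t, 1.0)` gives s + t = 4 + 6i, each from a single call.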
8
Step 2: Multiplying gauge matrix
Multiplying matrix for forward
  v[0] = u[0][0] * w[0] + u[0][1] * w[1] + u[0][2] * w[2];
  v[1] = u[1][0] * w[0] + u[1][1] * w[1] + u[1][2] * w[2];
  v[2] = u[2][0] * w[0] + u[2][1] * w[1] + u[2][2] * w[2];
Complex number multiplication can be calculated by 2 double FPU instructions
  u[0][0] * w[0] : Multiplying 2 complex numbers
    v[0] = FXPMUL (u[0][0],w[0])
      re(v[0]) = re(w[0])*re(u[0][0])
      im(v[0]) = re(w[0])*im(u[0][0])
    v[0] = FXCXNPMA (v[0],u[0][0],w[0])
      re(v[0]) += -im(w[0])*im(u[0][0])
      im(v[0]) += im(w[0])*re(u[0][0])
  + u[0][1] * w[1] + u[0][2] * w[2] : Using FMA instructions
    v[0] = FXCPMADD (v[0],u[0][1],w[1])
    v[0] = FXCXNPMA (v[0],u[0][1],w[1])
    v[0] = FXCPMADD (v[0],u[0][2],w[2])
    v[0] = FXCXNPMA (v[0],u[0][2],w[2])
Multiplying Hermitian matrix for backward
  v[0] = ~u[0][0] * w[0] + ~u[1][0] * w[1] + ~u[2][0] * w[2];
Conjugate complex and complex multiplication also can be calculated by 2 double FPU instructions
    v[0] = FXPMUL (u[0][0],w[0])
      re(v[0]) = re(w[0])*re(u[0][0])
      im(v[0]) = re(w[0])*im(u[0][0])
    v[0] = FXCXNSMA (v[0],u[0][0],w[0])
      re(v[0]) += im(w[0])*im(u[0][0])
      im(v[0]) += -im(w[0])*re(u[0][0])
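A portable C sketch of the two-instruction complex multiply and of the forward and backward 3x3 products. All `emu_*` functions are hypothetical emulations written from the equations on this slide, with the last argument supplying the scalar: its real part for the parallel forms, its imaginary part for the cross forms. For the conjugate product the sketch swaps the roles of u and w in the two calls, because that order is what reproduces conj(u) * w under these emulated semantics; treat the operand order as an assumption of the sketch.

```c
/* One complex number held as the two halves of a double FPU register pair. */
typedef struct { double re, im; } cpair;

/* Hypothetical emulations: x is the vector operand, a supplies the scalar. */
cpair emu_fxpmul(cpair x, cpair a) {                 /* a.re * x */
    cpair v = { a.re * x.re, a.re * x.im };
    return v;
}
cpair emu_fxcpmadd(cpair acc, cpair x, cpair a) {    /* acc + a.re * x */
    cpair v = { acc.re + a.re * x.re, acc.im + a.re * x.im };
    return v;
}
cpair emu_fxcxnpma(cpair acc, cpair x, cpair a) {    /* acc + i * a.im * x */
    cpair v = { acc.re - a.im * x.im, acc.im + a.im * x.re };
    return v;
}
cpair emu_fxcxnsma(cpair acc, cpair x, cpair a) {    /* acc - i * a.im * x */
    cpair v = { acc.re + a.im * x.im, acc.im - a.im * x.re };
    return v;
}

/* u * w in two emulated instructions. */
cpair cmul(cpair u, cpair w) {
    cpair v = emu_fxpmul(u, w);
    return emu_fxcxnpma(v, u, w);
}

/* conj(u) * w in two emulated instructions; here u supplies the scalars. */
cpair cmul_conj(cpair u, cpair w) {
    cpair v = emu_fxpmul(w, u);
    return emu_fxcxnsma(v, w, u);
}

/* Forward: v = U w. The first term is a multiply, the others accumulate. */
void su3_mul_forward(cpair u[3][3], const cpair w[3], cpair v[3]) {
    for (int r = 0; r < 3; r++) {
        cpair acc = cmul(u[r][0], w[0]);
        for (int c = 1; c < 3; c++) {
            acc = emu_fxcpmadd(acc, u[r][c], w[c]);
            acc = emu_fxcxnpma(acc, u[r][c], w[c]);
        }
        v[r] = acc;
    }
}

/* Backward: v = U^dagger w, so row r sums conj(u[c][r]) * w[c]. */
void su3_mul_backward(cpair u[3][3], const cpair w[3], cpair v[3]) {
    for (int r = 0; r < 3; r++) {
        cpair acc = cmul_conj(u[0][r], w[0]);
        for (int c = 1; c < 3; c++) {
            acc = emu_fxcpmadd(acc, w[c], u[c][r]);
            acc = emu_fxcxnsma(acc, w[c], u[c][r]);
        }
        v[r] = acc;
    }
}
```

Each output row costs 6 emulated instructions (2 per complex product), which matches the count implied by the instruction listing on the slide.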
9
Step 3: Multiplying minus Kappa and adding to spinor
The same instructions as in Step 1 can be used, with the constant value -Kappa instead of 1.0
For spin 1 and 2, multiplying by -Kappa and adding
For spin 3 and 4, multiplying by -Kappa and by i or 1 according to the gamma matrix, then adding
[Figure: -Kappa x h-spin 1 and -Kappa x h-spin 2 are added to spin 1 to spin 4]
Double FPU instructions
  v += -Kappa * i * w (for spin 3 and 4, depending on the gamma matrix)
    v = FXCXNPMA(v,w,-Kappa) or v = FXCXNSMA(v,w,-Kappa)
  v += -Kappa * 1 * w (for spin 1 and 2, and for spin 3 and 4 depending on the gamma matrix)
    v = FXCPMADD(v,w,-Kappa) or v = FXCPNMSUB(v,w,-Kappa)
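A minimal C sketch of this accumulation step, assuming the same portable emulation idea as for Step 1. The names `cpair`, `emu_fxcpmadd`, `emu_fxcxnpma` and `add_minus_kappa` are hypothetical, and the `times_i` flag stands in for the gamma matrix dependent choice between the factor i and the factor 1.

```c
/* One complex number held as the two halves of a double FPU register pair. */
typedef struct { double re, im; } cpair;

/* v + c * w, modeled on the slide's FXCPMADD(v, w, c) with a real constant c. */
cpair emu_fxcpmadd(cpair v, cpair w, double c) {
    cpair r = { v.re + c * w.re, v.im + c * w.im };
    return r;
}

/* v + i * c * w, modeled on the slide's FXCXNPMA(v, w, c). */
cpair emu_fxcxnpma(cpair v, cpair w, double c) {
    cpair r = { v.re - c * w.im, v.im + c * w.re };
    return r;
}

/* Adds -kappa * h to one spin component, or -kappa * i * h when times_i is nonzero.
   Either way the update is a single emulated instruction with constant -kappa. */
cpair add_minus_kappa(cpair spin, cpair h, double kappa, int times_i) {
    return times_i ? emu_fxcxnpma(spin, h, -kappa)
                   : emu_fxcpmadd(spin, h, -kappa);
}
```

For spin = 1 + i, h = 2 + 4i and kappa = 0.5, the plain form yields -i and the times-i form yields 3.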
10
Parallelization and optimization of communication
11
Mapping Lattice onto 3D torus network (same as BG/L)
Parallelizing the Wilson operator by dividing the global lattice into small local lattices
  Boundary data exchange is needed between neighboring local lattices
Mapping the lattice onto the topology of the 3D torus network
  Dividing the lattice by the torus size
  Communication can be limited to neighboring compute nodes
We use core-to-core communication as the 4th dimension of the torus
  We map the lattice's X to core-to-core communication and the lattice's Y, Z and T to the 3D torus network
[Figure: torus network of Blue Gene with axes X, Y and Z; core0 to core3 inside a compute node form the 4th dimension]
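The division of the global lattice can be written down in a few lines of C. The sketch below assumes a 4D process grid such as the 4x8x8x8 node mapping quoted with the performance tables (4 cores, then the three torus axes); `local_lattice` and `boundary_sites` are hypothetical helper names.

```c
/* Divides a global lattice (X, Y, Z, T) by a 4D process grid.
   Returns 0 on success, -1 if some extent is not divisible by the grid. */
int local_lattice(const int global[4], const int grid[4], int local[4]) {
    for (int d = 0; d < 4; d++) {
        if (grid[d] <= 0 || global[d] % grid[d] != 0) {
            return -1;
        }
        local[d] = global[d] / grid[d];
    }
    return 0;
}

/* Number of sites on the face of the local lattice orthogonal to direction d,
   i.e. how many boundary sites are exchanged with one neighbor in that direction. */
int boundary_sites(const int local[4], int d) {
    int n = 1;
    for (int k = 0; k < 4; k++) {
        if (k != d) {
            n *= local[k];
        }
    }
    return n;
}
```

A 16x16x16x32 global lattice on a 4x8x8x8 grid gives a 4x2x2x4 local lattice per core, which is the size listed in the performance tables, and 32 boundary sites per neighbor in the Y direction.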
12
Boundary data exchange by DMA direct remote put
We can put data directly from local memory to the destination node's local memory by a DMA direct put operation
We prepare an injection descriptor and pass it to the DMA; then we can overlap computation or puts to other destinations
We can tell when the necessary data has been received and stored by checking the DMA counter at the destination node
We put all the boundary data to a destination at once
  We make half spinors of the boundary sites and store them into a send buffer for each direction; then the DMA puts the data to the destination
Injection descriptor: destination node, data size, address of source data, address to store at destination
[Figure: direct put request. On the source node the DMA loads data from the data array in local memory and writes it into the torus FIFO. On the destination node the DMA stores data from the torus FIFO into the buffer's address in local memory and decreases the DMA counter value; the core reads the counter value to poll for data]
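The descriptor and counter protocol can be illustrated with a single-process simulation in C. Nothing here is the Blue Gene/P DMA interface: `put_descriptor`, `recv_counter`, `sim_direct_put` and `sim_data_arrived` are hypothetical names, the put is a plain `memcpy`, and the counter is assumed to start at the number of expected bytes and count down.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor with the four fields named on the slide. */
typedef struct {
    int         dest_node;  /* destination node (unused in this one-process simulation) */
    size_t      size;       /* data size in bytes */
    const void *src;        /* address of source data */
    void       *dst;        /* address to store at destination */
} put_descriptor;

/* Hypothetical reception counter: preset to the expected byte count. */
typedef struct {
    long remaining;
} recv_counter;

/* Simulated direct put: copy the payload, then decrease the destination counter. */
void sim_direct_put(const put_descriptor *d, recv_counter *c) {
    memcpy(d->dst, d->src, d->size);
    c->remaining -= (long)d->size;
}

/* Polling check used by the receiver before it touches the buffer. */
int sim_data_arrived(const recv_counter *c) {
    return c->remaining <= 0;
}
```

In the real machine the put runs asynchronously, so the sender can continue with computation and the receiver polls the counter only when it needs the data.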
13
Overlapping communication and computation
Setting DMA counter values
Global barrier
Making half spinor array for Y+ and Y-, then direct put to Y+ and Y-
Making half spinor array for Z+ and Z-, then direct put to Z+ and Z-
Making half spinor array for T+ and T-, then direct put to T+ and T-
Making half spinor array for X+ and X-, then exchange using shared memory
Computation for X+ and X- (overlapping communication)
Computation for Y+ and Y-
Computation for Z+ and Z-
Computation for T+ and T-
[Figure: making the half spinor array for forward (spinor 1 to spinor 4 merged into h-spinor 1 and h-spinor 2) and for backward (matrix U x h-spinor 1 and matrix U x h-spinor 2); the half spinor array is the send buffer]
14
Performance measurement in virtual node mode
15
The performance of Wilson Dirac on Blue Gene/P
512 nodes, virtual node mode
Node mapping: 4x8x8x8 (4 cores in node x torus XYZ)

Wilson Dirac
Global lattice size   Lattice size / core   MFLOPS / core (vs peak)   w/ CG iteration MFLOPS / core
16x16x16x16           4x2x2x2               596.3 (17.54 %)           463.5 (13.63 %)
16x16x16x32           4x2x2x4               709.9 (20.79 %)           575.7 (16.93 %)
24x24x24x24           6x3x3x3               928.7 (27.31 %)           744.9 (21.91 %)
24x24x24x48           6x3x3x6                     (30.51 %)                 (24.33 %)
32x32x32x32           8x4x4x4                     (33.30 %)           889.6 (26.16 %)
32x32x32x64           8x4x4x8                     (35.07 %)           918.4 (27.01 %)

Even-odd preconditioned Wilson Dirac
Global lattice size   Lattice size / core   MFLOPS / core (vs peak)   w/ CG iteration MFLOPS / core
16x16x16x16           4x2x2x2               397.8 (11.70 %)           313.3 (9.21 %)
16x16x16x32           4x2x2x4               524.4 (15.42 %)           431.5 (12.69 %)
24x24x24x24           6x3x3x3               735.4 (21.63 %)           597.5 (17.57 %)
24x24x24x48           6x3x3x6               834.0 (24.53 %)           670.9 (19.73 %)
32x32x32x32           8x4x4x4               910.5 (26.78 %)           730.8 (21.49 %)
32x32x32x64           8x4x4x8               974.4 (28.66 %)           777.7 (22.87 %)
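The "vs peak" percentages can be cross-checked with a one-line calculation. The helper names below are hypothetical; the only inputs are pairs taken from the table, such as 596.3 MFLOPS at 17.54 % or 928.7 MFLOPS at 27.31 %, which both imply a per-core peak of about 3400 MFLOPS.

```c
/* Peak rate implied by a measured rate and its quoted percentage of peak. */
double implied_peak_mflops(double mflops, double percent_of_peak) {
    return mflops / (percent_of_peak / 100.0);
}

/* Percentage of peak for a measured rate, given the peak rate. */
double percent_of_peak(double mflops, double peak_mflops) {
    return 100.0 * mflops / peak_mflops;
}
```

The same check applied to the Blue Gene/L figures quoted later (for example 776.7 MFLOPS at 27.74 %) implies about 2800 MFLOPS per core.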
16
Weak scaling on Blue Gene/P VN mode
[Figure: two panels, Wilson Dirac and even-odd preconditioned Wilson Dirac; axis label: lattice size (X*Y*Z*T) / core]
17
Strong scaling on Blue Gene/P VN mode
[Figure: two panels, Wilson Dirac and even-odd preconditioned Wilson Dirac; axis label: global lattice size]
18
Comparing Blue Gene/P vs Blue Gene/L
512 nodes, virtual node mode
Ideal speed up is x2.43

Wilson Dirac
Global lattice size   Blue Gene/L MFLOPS/core   Blue Gene/P MFLOPS/core   Speed up / node
16x16x16x16           776.7 (27.74 %)           596.3 (17.54 %)           x 1.54
16x16x16x32           856.6 (30.59 %)           709.9 (20.79 %)           x 1.66
24x24x24x24           933.4 (33.34 %)           928.7 (27.31 %)           x 1.99
24x24x24x48           940.1 (33.58 %)                 (30.51 %)           x 2.21
32x32x32x32                 (36.28 %)                 (33.30 %)           x 2.23
32x32x32x64                 (36.40 %)                 (35.07 %)           x 2.34

Even-odd preconditioned Wilson Dirac
Global lattice size   Blue Gene/L MFLOPS/core   Blue Gene/P MFLOPS/core   Speed up / node
16x16x16x16           691.6 (24.70 %)           397.8 (11.70 %)           x 1.15
16x16x16x32           749.7 (26.77 %)           524.4 (15.42 %)           x 1.40
24x24x24x24           873.4 (31.19 %)           735.4 (21.63 %)           x 1.68
24x24x24x48           872.6 (31.17 %)           834.0 (24.53 %)           x 1.91
32x32x32x32           961.0 (34.32 %)           910.5 (26.78 %)           x 1.89
32x32x32x64           942.3 (34.58 %)           974.4 (28.66 %)           x 2.07

Blue Gene/L: L1 cache write back mode
Blue Gene/P: L1 cache write through mode
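The "Speed up / node" figures follow from the per-core rates and the core counts (2 cores per node on Blue Gene/L, 4 on Blue Gene/P). The helper `node_speedup` is a hypothetical name; the peak values of about 2800 and 3400 MFLOPS per core used for the ideal case are not stated on the slide but are implied by the quoted percentages.

```c
/* Ratio of per-node throughput: (rate per core x cores per node) on Blue Gene/P
   divided by the same product on Blue Gene/L. */
double node_speedup(double bgp_mflops_per_core, int bgp_cores,
                    double bgl_mflops_per_core, int bgl_cores) {
    return (bgp_mflops_per_core * bgp_cores) /
           (bgl_mflops_per_core * bgl_cores);
}
```

For the 16x16x16x16 Wilson Dirac case, `node_speedup(596.3, 4, 776.7, 2)` is about 1.54, and with the implied peaks `node_speedup(3400.0, 4, 2800.0, 2)` is about 2.43, the ideal speed up quoted above.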
19
Performance measurement in SMP mode
20
2 approaches of parallelization using OpenMP
Outer most loop parallelization
  Same code as VN mode with directives

    #pragma omp parallel for private(x)
    for(i=0;i<Nt*Nz*Ny;i++){
      for(x=0;x<Nx;x++){
        // computation for X
      }
    }
    for(i=0;i<Nt*Nz;i++){
      // computation for Y
    }
    for(i=0;i<Nt*Ny;i++){
      // computation for Z
    }
    for(i=0;i<Nz*Ny;i++){
      // computation for T
    }

Inner most loop parallelization
  Same data access as VN mode for each core

    #pragma omp parallel
    {
      // declared inside the region so that every thread has its own copy
      int np = omp_get_num_threads();
      int pid = omp_get_thread_num();
      int nx = Nx / np;        // number of X sites per thread
      int sx = Nx * pid / np;  // first X site of this thread
      int i, x;
      for(i=0;i<Nt*Nz*Ny;i++){
        for(x=sx;x<sx+nx;x++){  // upper bound is sx+nx, not nx
          // computation for X
        }
      }
      for(i=0;i<Nt*Nz;i++){
        // computation for Y
      }
      for(i=0;i<Nt*Ny;i++){
        // computation for Z
      }
      for(i=0;i<Nz*Ny;i++){
        // computation for T
      }
    }
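The X range that each thread owns in the inner most loop approach can be checked without OpenMP. The sketch below uses a hypothetical helper `thread_x_range` that returns the half-open range for thread `pid` out of `np`; it uses the same start formula as the code above and computes the end from the next thread's start, so it also covers the case where Nx is not a multiple of np.

```c
/* Half-open X range [*sx, *ex) handled by thread pid out of np threads.
   Consecutive threads get consecutive, non-overlapping ranges covering 0..Nx-1. */
void thread_x_range(int Nx, int np, int pid, int *sx, int *ex) {
    *sx = Nx * pid / np;
    *ex = Nx * (pid + 1) / np;
}
```

For Nx = 16 and 4 threads the ranges are [0,4), [4,8), [8,12) and [12,16), so each thread touches a 4-site slice in X, like a core in VN mode with the 4x8x8x8 mapping.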
21
The performance of Wilson Dirac on Blue Gene/P
512 nodes, SMP mode

Wilson Dirac: Outer most loop parallelization
Global / local lattice size   MFLOPS/node (vs peak)   w/ CG iteration MFLOPS/node
16x16x16x16 / 16x2x2x2        (11.91 %)               (7.71 %)
16x16x16x32 / 16x2x2x4        (17.16 %)               (11.96 %)
24x24x24x24 / 24x3x3x3        (19.22 %)               (15.29 %)
24x24x24x48 / 24x3x3x6        (24.08 %)               (19.26 %)
32x32x32x32 / 32x4x4x4        (27.32 %)               (21.12 %)
32x32x32x64 / 32x4x4x8        (28.86 %)               (21.95 %)

Wilson Dirac: Inner most loop parallelization
Global / local lattice size   MFLOPS/node (vs peak)   w/ CG iteration MFLOPS/node
16x16x16x16 / 16x2x2x2        (17.84 %)               (10.55 %)
16x16x16x32 / 16x2x2x4        (21.10 %)               (15.16 %)
24x24x24x24 / 24x3x3x3        (26.63 %)               (19.08 %)
24x24x24x48 / 24x3x3x6        (29.69 %)               (22.54 %)
32x32x32x32 / 32x4x4x4        (31.39 %)               (24.55 %)
32x32x32x64 / 32x4x4x8        (34.05 %)               (25.85 %)

Even-odd preconditioned Wilson Dirac: Inner most loop parallelization
Global / local lattice size   MFLOPS/node (vs peak)   w/ CG iteration MFLOPS/node
16x16x16x16 / 16x2x2x2        (11.87 %)               899.9 (6.62 %)
16x16x16x32 / 16x2x2x4        (15.16 %)               (9.72 %)
24x24x24x24 / 24x3x3x3        (20.19 %)               (14.27 %)
24x24x24x48 / 24x3x3x6        (22.33 %)               (16.87 %)
32x32x32x32 / 32x4x4x4        (24.63 %)               (18.79 %)
32x32x32x64 / 32x4x4x8        (26.53 %)               (20.56 %)
22
Weak scaling on Blue Gene/P SMP mode
[Figure: two panels, Wilson Dirac and even-odd preconditioned Wilson Dirac, both with inner most loop parallelization; axis label: lattice size (X*Y*Z*T) / node]
23
Strong scaling on Blue Gene/P SMP mode
[Figure: two panels, Wilson Dirac and even-odd preconditioned Wilson Dirac, both with inner most loop parallelization; axis label: global lattice size]
24
Comparison of VN mode and SMP mode
[Figure: two panels, Wilson Dirac and even-odd preconditioned Wilson Dirac]
25
Summary
Blue Gene/P shows good performance for lattice QCD
  About x2 the performance of Blue Gene/L in L1 write back mode, even though Blue Gene/P runs in L1 write through mode
  Very good weak scaling, and good strong scaling for large lattices
More flexible tuning opportunities than on Blue Gene/L
  DMA makes it easier to optimize communication and computation
  SMP mode has more potential for optimization
Future work
  Increase the performance of even-odd preconditioned Wilson and of SMP mode
  Test in the actual application
26
Acknowledgement
We have tuned and run our codes on Blue Gene/P at IBM Watson Research Center
Thanks to:
  IBM Tokyo Research Laboratory: Kei Kawase
  IBM Watson Research Center: James Sexton, John Gunnels, Philip Heidelberger
  IBM Rochester: Jeff Parker
  LLNL: Pavlos Vranas
  KEK: Hideo Matsufuru, Shoji Hashimoto
27
Backup
28
Comparing gauge matrix array layout
Our gauge array layout (JLQCD collaboration)
  double _Complex U[3][3][X][Y][Z][T][Mu];
  Mu is outer most
  Good for hardware prefetching (vector access)
  Good for reusing data in cache for large lattices
Gauge array layout of CPS
  double _Complex U[3][3][Mu][X][Y][Z][T];
  Mu is inner most
  Strided memory access is not good for prefetching because the 3x3 matrix is not aligned to the L1 cache line
  Data in cache cannot be reused for large lattices
We recommend our gauge layout for Blue Gene
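The difference between the two layouts shows up in the distance between the matrices of neighboring sites for a fixed direction. The sketch below assumes that the index lists on the slide run from fastest to slowest varying (which is how "Mu is outer most" reads for the first declaration; a literal C declaration would order them the other way round), that sites are linearized into one index, and that a double takes 8 bytes. All function names are hypothetical.

```c
#include <stddef.h>

enum { NC = 9, NDIR = 4 };  /* complex entries per 3x3 matrix, number of directions */

/* Size of one 3x3 complex double matrix in bytes. */
size_t matrix_bytes(void) {
    return NC * 2 * sizeof(double);
}

/* Offset (in complex numbers) of the matrix at linear site index `site`,
   direction `mu`, when Mu is the outer most index: [matrix][site][mu]. */
size_t offset_mu_outer(size_t site, size_t mu, size_t volume) {
    return NC * (site + volume * mu);
}

/* Same offset when Mu is the inner most index after the matrix: [matrix][mu][site]. */
size_t offset_mu_inner(size_t site, size_t mu) {
    return NC * (mu + NDIR * site);
}
```

With Mu outer most, consecutive sites of one direction are 9 complex numbers (144 bytes) apart, so a sweep over one direction is a contiguous stream. With Mu inner most they are 36 complex numbers (576 bytes) apart, which is the strided access pattern mentioned above.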
29
Comparison of the performance of gauge array layout
512 nodes, virtual node mode

Our gauge array layout
Global lattice size   Lattice size / core   MFLOPS / core (vs peak)   w/ CG iteration MFLOPS / core
16x16x16x16           4x2x2x2               596.3 (17.54 %)           463.5 (13.63 %)
16x16x16x32           4x2x2x4               709.9 (20.79 %)           575.7 (16.93 %)
24x24x24x24           6x3x3x3               928.7 (27.31 %)           744.9 (21.91 %)
24x24x24x48           6x3x3x6                     (30.51 %)                 (24.33 %)
32x32x32x32           8x4x4x4                     (33.30 %)           889.6 (26.16 %)
32x32x32x64           8x4x4x8                     (35.07 %)           918.4 (27.01 %)

CPS's gauge array layout
Global lattice size   Lattice size / core   MFLOPS / core (vs peak)   w/ CG iteration MFLOPS / core
16x16x16x16           4x2x2x2               563.1 (16.56 %)           440.4 (12.95 %)
16x16x16x32           4x2x2x4               647.8 (19.05 %)           533.7 (15.70 %)
24x24x24x24           6x3x3x3               815.7 (23.99 %)           669.7 (19.70 %)
24x24x24x48           6x3x3x6               896.1 (26.36 %)           745.6 (21.93 %)
32x32x32x32           8x4x4x4               959.7 (28.23 %)           793.4 (23.33 %)
32x32x32x64           8x4x4x8                     (29.82 %)           798.5 (23.48 %)