
1 Using VASP on Ranger
Hang Liu

2 About this work and talk
– Part of an AUS project for VASP users from the UCSB computational materials science group led by Prof. Chris Van de Walle
– Collaborative effort with Dodi Heryadi and Mark Vanmoer at NCSA, Anderson Janotti and Maosheng Miao at UCSB, coordinated by Amitava Majumdar at SDSC and Bill Barth at TACC
– Many heuristics from the HPC group at TACC, other users, and their tickets
– Goal: get VASP running on Ranger with reasonable performance

3 VASP Basics
– An ab initio quantum mechanical molecular dynamics package. The current version is 4.6; many users run the latest development version
– Straightforward compilation with both the Intel and PGI compilers plus MVAPICH
– Several performance libraries are needed: BLAS, LAPACK, FFT, and ScaLAPACK

4 Standard Compilation
Intel + MVAPICH: FFLAGS = -O1 -xW
PGI + MVAPICH: FFLAGS = -tp barcelona-64
Libraries: GotoBLAS + LAPACK + FFTW3
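As an illustration, these settings map onto a VASP 4.6 makefile fragment roughly as follows. This is only a sketch: the compiler wrapper name, the serial GotoBLAS library name, the LAPACK location, and the FFTW path variable are assumptions, not values taken from the slide.

FC      = mpif90                                           # MVAPICH Fortran wrapper (assumed name)
FFLAGS  = -O1 -xW                                          # Intel flags from this slide
BLAS    = -L$(TACC_GOTOBLAS_LIB) -lgoto_lp64 -lpthread     # GotoBLAS (serial library name assumed)
LAPACK  = -L$(TACC_GOTOBLAS_LIB) -llapack                  # LAPACK (location assumed)
FFT3D   = fftw3d.o fft3dlib.o -L$(TACC_FFTW3_LIB) -lfftw3  # FFTW3 interface (path variable assumed)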

5 Performance profiling of a test run with 120 MPI tasks, measured with IPM:

#            [total]    <avg>      min        max
# wallclock  136076     1133.97    1133.97    1133.98
# user       135664     1130.54    1127.63    1131.5
# system     65.5808    0.546507   0.36       2.432
# mpi        31809.9    265.083    235.753    299.535
# %comm                 23.3763    20.7899    26.4147
# gflop/sec  222.097    1.85081    1.71003    1.8556
# gbytes     78.4051    0.653376   0.646839   0.684002
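For reference, IPM is usually attached to an uninstrumented binary through LD_PRELOAD; a hedged sketch of such a run is below. The module name and the library path variable are assumptions about the Ranger setup, not taken from the slide.

module load ipm                              # assumed module name
setenv LD_PRELOAD $TACC_IPM_LIB/libipm.so    # preload the IPM library (path variable assumed)
setenv IPM_REPORT full                       # print the full profile summary at job end
ibrun tacc_affinity ./vasp                   # run VASP as usual; IPM writes its report at exit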

6 Observations:
– Reasonable performance: 1.9 GFLOPS/task
– Not memory intensive: 0.7 GB/task
– Somewhat communication intensive: 23% of time in MPI
– Balanced instructions, communications, and timings

7 Observing the performance bottlenecks in VASP with TAU
– The most instruction-intensive routines execute with very good performance
– The most time-consuming routines look like random number generation and MPI communication; what does wave.f90 do?
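A common way to get this kind of per-routine breakdown with TAU is to rebuild with the TAU compiler wrappers and browse the resulting profile with paraprof. The sketch below is a guess at the Ranger setup; the module name and the Makefile stub path are assumptions.

module load tau                                          # assumed module name
setenv TAU_MAKEFILE $TACC_TAU_LIB/Makefile.tau-mpi-pdt   # TAU configuration stub (path assumed)
# rebuild VASP with FC = tau_f90.sh in the makefile, run the job as usual, then:
paraprof                                                 # browse the generated profile.* files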

8 Hybrid Compilation and NUMA Control
– Prompted by a user ticket: VASP runs much slower on Ranger than on Lonestar

9 The user reported the following timings for a VASP calculation.

On Ranger:
LOOP: VPU time 226.76: CPU time 173.41
LOOP: VPU time 288.46: CPU time 218.79
LOOP: VPU time 383.08: CPU time 287.26
LOOP: VPU time 385.57: CPU time 275.94
LOOP: VPU time 405.54: CPU time 303.41
LOOP: VPU time 378.68: CPU time 279.45
LOOP: VPU time 383.70: CPU time 283.03
LOOP: VPU time 353.88: CPU time 259.22
LOOP: VPU time 407.47: CPU time 300.22
LOOP: VPU time 378.02: CPU time 276.17
LOOP: VPU time 414.57: CPU time 307.78
LOOP: VPU time 382.99: CPU time 282.14
LOOP: VPU time 248.97: CPU time 188.06
LOOP+: VPU time 4754.26: CPU time 3591.14

On Lonestar:
LOOP: VPU time 62.07: CPU time 62.52
LOOP: VPU time 76.34: CPU time 76.
LOOP: VPU time 101.73: CPU time 101.83
LOOP: VPU time 101.84: CPU time 102.15
LOOP: VPU time 113.80: CPU time 114.01
LOOP: VPU time 105.28: CPU time 105.38
LOOP: VPU time 102.89: CPU time 103.00
LOOP: VPU time 94.83: CPU time 94.93
LOOP: VPU time 113.42: CPU time 113.53
LOOP: VPU time 102.02: CPU time 102.08
LOOP: VPU time 113.96: CPU time 114.04
LOOP: VPU time 102.45: CPU time 102.53
LOOP: VPU time 66.74: CPU time 66.84
LOOP+: VPU time 1365.13: CPU time 1389.13

Almost 3 times slower on Ranger; something must not be right.

10 In the user's makefile:
MKLPATH = ${TACC_MKL_LIB}
BLAS   = -L$(MKLPATH) $(MKLPATH)/libmkl_em64t.a $(MKLPATH)/libguide.a -lpthread
LAPACK = $(MKLPATH)/libmkl_lapack.a
Is MKL on Ranger multi-threaded? It looks like it is.

With the right number of threads and NUMA control commands to get proper core-memory affinity:
-pe 8way 192
setenv OMP_NUM_THREADS 1
ibrun tacc_affinity ./vasp

LOOP: VPU time 61.31: CPU time 62.44
LOOP: VPU time 75.21: CPU time 75.33
LOOP: VPU time 97.97: CPU time 98.02
LOOP: VPU time 98.58: CPU time 98.65
LOOP: VPU time 108.35: CPU time 108.50
LOOP: VPU time 102.18: CPU time 102.45
LOOP: VPU time 99.29: CPU time 99.37
LOOP: VPU time 92.51: CPU time 92.57
LOOP: VPU time 108.44: CPU time 108.50
LOOP: VPU time 99.44: CPU time 99.51
LOOP: VPU time 108.74: CPU time 108.82
LOOP: VPU time 99.13: CPU time 99.23
LOOP: VPU time 64.47: CPU time 64.54
LOOP+: VPU time 1336.91: CPU time 1378.16

The performance is now comparable to that on Lonestar.
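Putting those pieces together, a complete SGE job script for this 8-task-per-node, single-threaded case might look like the sketch below; the job name, queue name, and run-time limit are assumptions added only to make the example self-contained.

#!/bin/csh
#$ -N vasp_8x1              # job name (assumed)
#$ -q normal                # queue (assumed)
#$ -pe 8way 192             # 8 MPI tasks per node, 192 slots requested (from this slide)
#$ -l h_rt=02:00:00         # wall-clock limit (assumed)
setenv OMP_NUM_THREADS 1    # one thread per MPI task with single-threaded BLAS
ibrun tacc_affinity ./vasp  # tacc_affinity pins tasks and memory for NUMA locality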

11 How can multi-threaded BLAS improve VASP performance?
– The VASP guide says: for good performance, VASP requires highly optimized BLAS routines
– Multi-threaded BLAS is available on Ranger: MKL and GotoBLAS
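For reference, the two threaded link options compared in the following test cases are collected below; both lines are taken from the makefile fragments shown on the surrounding slides, and library names may differ for other MKL or GotoBLAS versions.

# Multi-threaded MKL (threading managed by libguide):
BLAS   = -L$(TACC_MKL_LIB) $(TACC_MKL_LIB)/libmkl_em64t.a $(TACC_MKL_LIB)/libguide.a -lpthread
LAPACK = $(TACC_MKL_LIB)/libmkl_lapack.a
# Multi-threaded GotoBLAS:
BLAS   = -L$(TACC_GOTOBLAS_LIB) -lgoto_lp64_mp -lpthread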

12 Case-1: both BLAS and LAPACK are from MKL; 4way, 4 threads per task

LOOP: VPU time 123.00: CPU time 66.00
LOOP: VPU time 157.92: CPU time 82.67
LOOP: VPU time 190.06: CPU time 97.56
LOOP: VPU time 179.26: CPU time 93.55
LOOP: VPU time 193.40: CPU time 99.40
LOOP: VPU time 209.87: CPU time 107.05
LOOP: VPU time 185.16: CPU time 95.44
LOOP: VPU time 185.77: CPU time 96.56
LOOP: VPU time 190.02: CPU time 99.51
LOOP: VPU time 201.10: CPU time 105.07
LOOP: VPU time 191.07: CPU time 99.26
LOOP: VPU time 195.49: CPU time 101.31
LOOP: VPU time 193.11: CPU time 99.38
LOOP: VPU time 147.91: CPU time 76.65
LOOP+: VPU time 2677.34: CPU time 1477.63

==> Almost the same as the 8x1 case; no improvement.

13 Case-2: both BLAS and LAPACK are from GotoBLAS; 4way, 4 threads per task
BLAS = -L$(TACC_GOTOBLAS_LIB) -lgoto_lp64_mp -lpthread

LOOP: VPU time 153.81: CPU time 46.27
LOOP: VPU time 198.21: CPU time 58.55
LOOP: VPU time 235.09: CPU time 69.63
LOOP: VPU time 225.93: CPU time 66.80
LOOP: VPU time 236.93: CPU time 71.55
LOOP: VPU time 256.36: CPU time 77.62
LOOP: VPU time 226.96: CPU time 68.61
LOOP: VPU time 230.06: CPU time 69.34
LOOP: VPU time 236.31: CPU time 71.27
LOOP: VPU time 251.50: CPU time 76.00
LOOP: VPU time 236.78: CPU time 71.45
LOOP: VPU time 241.77: CPU time 73.01
LOOP: VPU time 236.59: CPU time 71.39
LOOP: VPU time 182.20: CPU time 54.38
LOOP+: VPU time 3404.57: CPU time 1075.91

==> The BLAS in GotoBLAS is much better than that in MKL: 30% faster for this case.
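For these 4-task, 4-thread runs, the corresponding launch settings would look roughly like the sketch below; the total slot count is an assumption chosen only to illustrate the 4way layout.

#$ -pe 4way 192             # 4 MPI tasks per node (slot count assumed)
setenv OMP_NUM_THREADS 4    # 4 BLAS threads per MPI task
ibrun tacc_affinity ./vasp  # tacc_affinity keeps each task and its memory together for NUMA locality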

14 4way, 1 thread per task:
LOOP: VPU time 63.08: CPU time 63.72
LOOP: VPU time 80.91: CPU time 80.98
LOOP: VPU time 95.91: CPU time 96.00
LOOP: VPU time 91.77: CPU time 91.86
LOOP: VPU time 97.23: CPU time 97.42
LOOP: VPU time 105.29: CPU time 105.42
LOOP: VPU time 93.45: CPU time 93.57
LOOP: VPU time 94.48: CPU time 94.57
LOOP: VPU time 97.43: CPU time 97.52
LOOP: VPU time 103.22: CPU time 103.31
LOOP: VPU time 97.28: CPU time 97.35
LOOP: VPU time 99.45: CPU time 99.55
LOOP: VPU time 97.44: CPU time 97.50
LOOP: VPU time 74.86: CPU time 74.92
LOOP+: VPU time 1418.64: CPU time 1443.82

4way, 2 threads per task:
LOOP: VPU time 89.57: CPU time 49.98
LOOP: VPU time 115.40: CPU time 63.41
LOOP: VPU time 136.96: CPU time 75.27
LOOP: VPU time 131.39: CPU time 72.32
LOOP: VPU time 138.68: CPU time 77.14
LOOP: VPU time 149.38: CPU time 83.26
LOOP: VPU time 132.74: CPU time 73.71
LOOP: VPU time 134.08: CPU time 74.40
LOOP: VPU time 138.26: CPU time 76.82
LOOP: VPU time 146.69: CPU time 81.61
LOOP: VPU time 138.44: CPU time 76.86
LOOP: VPU time 140.98: CPU time 78.38
LOOP: VPU time 138.54: CPU time 76.99
LOOP: VPU time 106.45: CPU time 58.89
LOOP+: VPU time 1998.97: CPU time 1148.74

15 Summary and Outlook
– VASP can be compiled straightforwardly and has reasonable performance
– When linking with multi-threaded libraries, set the proper number of threads and use NUMA control commands
– Multi-threaded GotoBLAS gives an obvious performance improvement
– ScaLAPACK: may not scale very well
– Task geometry: can a specific process-thread arrangement minimize communication cost?

