Using VASP on Ranger
Hang Liu
About this work and talk – Part of an AUS project for VASP users from the UCSB computational materials science group led by Prof. Chris Van de Walle – A collaborative effort with Dodi Heryadi and Mark Vanmoer at NCSA and Anderson Janotti and Maosheng Miao at UCSB, coordinated by Amitava Majumdar at SDSC and Bill Barth at TACC – Many heuristics from the HPC group at TACC, other users, and their tickets – Goal: have VASP running on Ranger with reasonable performance
VASP Basics – An ab initio quantum mechanical molecular dynamics package. The current version is 4.6; many users have the latest development version – Straightforward compilation with both the Intel and PGI compilers plus MVAPICH – Some performance libraries are needed: BLAS, LAPACK, FFT, and ScaLAPACK
Standard Compilation – Intel + MVAPICH: FFLAGS = -O1 -xW – PGI + MVAPICH: FFLAGS = -tp barcelona-64 – Linked against GotoBLAS + LAPACK + FFTW3
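For concreteness, a minimal makefile fragment for the standard Intel + MVAPICH build might look like the sketch below. Only the compiler flags and the GotoBLAS/LAPACK/FFTW3 choice come from this slide; the library and object names (lgoto_lp64, TACC_FFTW3_LIB, the FFT3D objects) are assumptions based on typical VASP 4.6 makefiles and Ranger module conventions.

FC     = mpif90
FFLAGS = -O1 -xW                                  # Intel; a PGI build would use -tp barcelona-64
BLAS   = -L$(TACC_GOTOBLAS_LIB) -lgoto_lp64       # serial GotoBLAS (library name assumed)
LAPACK = ../vasp.4.lib/lapack_double.o            # LAPACK shipped with VASP 4.6 (assumed)
FFT3D  = fftw3d.o fft3dlib.o -L$(TACC_FFTW3_LIB) -lfftw3   # FFTW3 interface (names/paths assumed)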
Performance profiling of a test run with 120 MPI tasks using IPM – [IPM summary table: total/min/max of wallclock, user, system, and MPI time, %comm, gflop/sec, and gbytes per task; numeric values not recovered]
Reasonable performance: 1.9 GFLOPS/task – Not memory intensive: 0.7 GB/task – Somewhat communication intensive: 23% of time in MPI – Balanced instructions, communication, and timings across tasks
Observing the performance bottlenecks in VASP with TAU – The most instruction-intensive routines execute with very good performance – The most time-consuming routines look like random number generation and MPI communication; what does wave.f90 do?
Hybrid Compilation and NUMA Control – prompted by a user ticket: VASP running much slower on Ranger than on Lonestar
The user reported the LOOP timings of a VASP calculation on Ranger vs. on Lonestar – [per-iteration "LOOP: VPU time : CPU time" lines from the VASP output; most values lost in extraction] – Ranger is almost 3 times slower; something must not be right
Fix on Ranger: 8 tasks per node, one thread each, with proper affinity (full job script sketched below) – -pe 8way 192; setenv OMP_NUM_THREADS 1; ibrun tacc_affinity ./vasp – [resulting per-iteration LOOP: VPU/CPU timings; values partially lost in extraction] – In the user's makefile: MKLPATH = ${TACC_MKL_LIB}; BLAS = -L$(MKLPATH) $(MKLPATH)/libmkl_em64t.a $(MKLPATH)/libguide.a -lpthread; LAPACK = $(MKLPATH)/libmkl_lapack.a – Is MKL on Ranger multi-threaded? It looks like it is – With the right number of threads, NUMA control commands, and proper core-memory affinity, performance is comparable to that on Lonestar
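For reference, a minimal Ranger SGE job script applying these settings could look like the sketch below. The 8way / OMP_NUM_THREADS / tacc_affinity combination comes from the slide; the job name, queue, and run-time limit are placeholders.

#!/bin/csh
#$ -N vasp_run                 # job name (placeholder)
#$ -q normal                   # queue (assumed)
#$ -pe 8way 192                # 8 MPI tasks per 16-core node
#$ -l h_rt=04:00:00            # wallclock limit (placeholder)
#$ -cwd
setenv OMP_NUM_THREADS 1       # one thread per MPI task
ibrun tacc_affinity ./vasp     # launch VASP with core-memory affinity control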
How can multi-threaded BLAS improve VASP performance? – The VASP guide says that for good performance, VASP requires highly optimized BLAS routines – Multi-threaded BLAS libraries are available on Ranger: MKL and GotoBLAS (link lines sketched below)
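The two threaded link lines, collected from the makefile fragments shown on the surrounding slides, are roughly as follows; the exact library names reflect Ranger-era installations and are not a current recommendation.

# Multi-threaded MKL (from the user's makefile above):
MKLPATH = ${TACC_MKL_LIB}
BLAS    = -L$(MKLPATH) $(MKLPATH)/libmkl_em64t.a $(MKLPATH)/libguide.a -lpthread
LAPACK  = $(MKLPATH)/libmkl_lapack.a
# Multi-threaded GotoBLAS (from Case-2 below):
BLAS    = -L$(TACC_GOTOBLAS_LIB) -lgoto_lp64_mp -lpthread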
Case-1: both BLAS and LAPACK from MKL, 4way with 4 threads per task – [per-iteration LOOP: VPU/CPU timings; values lost in extraction] ==> almost the same as the 8x1 case; no improvement
Case-2: both BLAS and LAPACK from GotoBLAS, 4way with 4 threads per task – BLAS = -L$(TACC_GOTOBLAS_LIB) -lgoto_lp64_mp -lpthread – [per-iteration LOOP: VPU/CPU timings; values lost in extraction] ==> the BLAS in GotoBLAS is much better than that in MKL: 30% faster for this case
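A hybrid run like Case-2 could be launched along the lines sketched below; only the 4way / 4-thread combination is from the slide, and the slot count simply mirrors the earlier 8way example.

#$ -pe 4way 192                # 4 MPI tasks per 16-core node
setenv OMP_NUM_THREADS 4       # 4 BLAS threads per task, filling the node
ibrun tacc_affinity ./vasp     # keep core-memory affinity under NUMA control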
4way with 1 thread per task vs. ...way with 2 threads per task – [per-iteration LOOP: VPU/CPU timings for the two configurations; values mostly lost in extraction]
Summary and Outlook – VASP can be compiled straightforwardly and has reasonable performance – When linking with multi-threaded libraries, set the proper number of threads and use NUMA control commands – Multi-threaded GotoBLAS leads to an obvious performance improvement – ScaLAPACK: may not scale very well – Task geometry: can a specific process-thread arrangement minimize communication cost?