Using VASP on Ranger Hang Liu

About this work and talk
– Part of an AUS project for VASP users from the UCSB computational materials science group led by Prof. Chris Van de Walle
– Collaborative effort with Dodi Heryadi and Mark Vanmoer at NCSA and with Anderson Janotti and Maosheng Miao at UCSB, coordinated by Amitava Majumdar at SDSC and Bill Barth at TACC
– Many heuristics from the HPC group at TACC and from other users and their tickets
– Goal: have VASP running on Ranger with reasonable performance

VASP Basics
– An ab initio quantum mechanical molecular dynamics package. The current version is 4.6; many users have the latest development version
– Straightforward compilation with both the Intel and PGI compilers plus MVAPICH
– Some performance libraries are needed: BLAS, LAPACK, FFT and ScaLAPACK

Standard Compilation
Intel + MVAPICH: FFLAGS = -O1 -xW
PGI + MVAPICH: FFLAGS = -tp barcelona-64
GotoBLAS + LAPACK + FFTW3
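A minimal sketch of how these choices might look in the VASP 4.6 makefile for the Intel + MVAPICH build. The library locations (TACC_GOTOBLAS_LIB, TACC_FFTW3_LIB), the serial GotoBLAS library name and the vasp.4.lib LAPACK object are assumptions for illustration, not taken from the slides:

    # Sketch only: Intel + MVAPICH build of VASP 4.6 (paths and library names are assumptions)
    FC      = mpif90                # Intel ifort through the MVAPICH wrapper
    FFLAGS  = -O1 -xW               # optimization flags quoted on this slide
    BLAS    = -L$(TACC_GOTOBLAS_LIB) -lgoto_lp64 -lpthread      # serial GotoBLAS (name assumed)
    LAPACK  = ../vasp.4.lib/lapack_double.o                     # LAPACK shipped with VASP (assumed choice)
    FFT3D   = fftw3d.o fft3dlib.o -L$(TACC_FFTW3_LIB) -lfftw3   # FFTW3 interface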

Performance profiling of a test run with 120 MPI tasks by IPM
[IPM summary table: wallclock, user, system and MPI time, %comm, gflop/sec and gbytes per task, with total/min/max columns]
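An IPM profile like this is usually collected without modifying the code. A hedged sketch of one common way to do it; the module name and the TACC_IPM_LIB variable are assumptions, while preloading libipm.so and the IPM_REPORT variable are standard IPM usage:

    # Sketch only: collect an IPM profile of a VASP run (csh syntax)
    module load ipm                              # assumed module name on Ranger
    setenv IPM_REPORT full                       # request the full IPM report
    setenv LD_PRELOAD ${TACC_IPM_LIB}/libipm.so  # intercept MPI calls without relinking (path assumed)
    ibrun ./vasp                                 # run as usual; IPM prints its summary at exit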

Reasonable performance: 1.9 GFLOPS/task
Not memory intensive: 0.7 GB/task
Somewhat communication intensive: 23% MPI
Balanced instructions, communications and timings

Observing the performance bottlenecks in VASP with TAU
Most instruction-intensive routines execute with very good performance. The most time-consuming routines look like random number generation and MPI communication; what does wave.f90 do?
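The routine-level data above come from TAU. A hedged sketch of one way such a profile could be produced with TAU's compiler wrappers; the module name and the Makefile path are assumptions, while tau_f90.sh, TAU_MAKEFILE and pprof are standard TAU components:

    # Sketch only: build an instrumented VASP with TAU and inspect the profile (csh syntax)
    module load tau                                        # assumed module name
    setenv TAU_MAKEFILE ${TAU_LIB}/Makefile.tau-mpi-pdt    # choose an MPI+PDT configuration (path assumed)
    # In the VASP makefile, switch the compiler wrapper:  FC = tau_f90.sh
    # Rebuild, run the job as usual, then summarize the generated profile files:
    pprof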

Hybrid Compilation and NUMA Control
– Prompted by a user ticket: VASP running much slower on Ranger than on Lonestar

The user reported the following timings of a VASP calculation (VASP's LOOP / LOOP+ lines, VPU and CPU time per step):

On Ranger: [LOOP / LOOP+ timing listing]
On Lonestar: [LOOP / LOOP+ timing listing; VPU times per LOOP include 62.07, 76.34, 94.83 and 66.74 s]

The Ranger run is almost 3 times slower; something must not be right.

The user's makefile linked MKL:
  MKLPATH = ${TACC_MKL_LIB}
  BLAS    = -L$(MKLPATH) $(MKLPATH)/libmkl_em64t.a $(MKLPATH)/libguide.a -lpthread
  LAPACK  = $(MKLPATH)/libmkl_lapack.a

With the right number of threads and NUMA control commands in the job script:
  -pe 8way 192
  setenv OMP_NUM_THREADS 1
  ibrun tacc_affinity ./vasp

[LOOP / LOOP+ timing listing; VPU times per LOOP include 61.31, 75.21, 97.97, 98.58, 99.29, 92.51, 99.44, 99.13 and 64.47 s]

The right number of threads, NUMA control commands and proper core-memory affinity give performance comparable to that on Lonestar. Is MKL on Ranger multi-threaded? It looks like it is.
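If the MKL-linked binary is kept, the thread count can also be pinned explicitly rather than relying on defaults. A small sketch in the same csh job-script style; MKL_NUM_THREADS is a standard MKL environment variable that is not quoted on the slides:

    # Sketch only: make the MKL thread count explicit for an 8way (8 tasks per node) run
    setenv OMP_NUM_THREADS 1     # one thread per MPI task
    setenv MKL_NUM_THREADS 1     # MKL honors this even when OMP_NUM_THREADS is unset
    ibrun tacc_affinity ./vasp   # bind each task and its memory to one socket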

How can multi-threaded BLAS improve VASP performance?
– The VASP guide says: for good performance, VASP requires highly optimized BLAS routines
– Multi-threaded BLAS is available on Ranger: MKL and GotoBLAS

Case-1: both BLAS and LAPACK are from MKL, 4 way, 4 threads in each way
[LOOP / LOOP+ timing listing]
==> Almost the same as the 8x1 case. No improvement.

Case-2: both BLAS and LAPACK are from GotoBLAS, 4 way, 4 threads in each way
  BLAS = -L$(TACC_GOTOBLAS_LIB) -lgoto_lp64_mp -lpthread
[LOOP / LOOP+ timing listing]
==> The BLAS in GotoBLAS is much better than that in MKL: 30% faster for this case.
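A hedged sketch of how the 4 tasks per node, 4 threads per task case above might be submitted on Ranger, in the same SGE/csh style as the earlier job script; the total core count after -pe is illustrative rather than taken from the slides:

    # Sketch only: hybrid MPI + threaded-GotoBLAS run, 4 MPI tasks per 16-core node
    #$ -pe 4way 64               # 4 tasks per node; total core count is illustrative
    setenv OMP_NUM_THREADS 4     # 4 GotoBLAS threads per MPI task
    ibrun tacc_affinity ./vasp   # keep each task's threads and memory on its own socket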

4 way, 1 thread in each way:
[LOOP / LOOP+ timing listing; VPU times per LOOP include 63.08, 80.91, 95.91, 91.77, 97.23, 93.45, 94.48, 97.43, 97.28, 99.45, 97.44 and 74.86 s]

[?] way, 2 threads in each way:
[LOOP / LOOP+ timing listing; VPU times per LOOP include 89.57 s]

Summary and Outlook
VASP can be compiled straightforwardly and has reasonable performance
When linking with multi-threaded libraries, set the proper number of threads and use NUMA control commands
Multi-threaded GotoBLAS gives a clear performance improvement
ScaLAPACK: may not scale very well
Task geometry: can a specific process-thread arrangement minimize the communication cost?