Introduction to Quantum ESPRESSO at NERSC
Taylor Barnes, June 10, 2016

Introduction

Sub-codes:
- PW: plane-wave DFT
- CP: Car-Parrinello MD
- PH: phonons
- XSPECTRA: X-ray absorption
- NEB: reaction pathways (nudged elastic band)
- TDDFT: time-dependent DFT
- PP: post-processing

Supported features:
- Both Γ-point-only and k-point calculations
- Norm-conserving and ultrasoft pseudopotentials
- Most commonly used functionals, including hybrids
- Spin polarization and noncolinear magnetism

Sample Script

#!/bin/bash -l
#SBATCH --partition=debug
#SBATCH --nodes=96
#SBATCH --time=0:30:00
#SBATCH --job-name=qe_test

module load espresso
export OMP_NUM_THREADS=1
srun -n 2304 pw.x -npools 4 -ntg 3 -in infile.in > outfile.out
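The srun line reads the calculation from infile.in. For orientation, here is a minimal sketch of what such a pw.x input might contain; the silicon test system, pseudopotential file name, and numeric values are illustrative assumptions, not part of the original slides:

&CONTROL
  calculation = 'scf'
  prefix      = 'si'
  outdir      = './tmp'
  pseudo_dir  = './pseudo'   ! assumed location of the pseudopotential file
/
&SYSTEM
  ibrav     = 2              ! fcc lattice
  celldm(1) = 10.26          ! lattice parameter in bohr
  nat       = 2
  ntyp      = 1
  ecutwfc   = 40.0           ! Ry
/
&ELECTRONS
  conv_thr = 1.0d-8
/
ATOMIC_SPECIES
  Si  28.086  Si.pbe-n-rrkjus_psl.UPF
ATOMIC_POSITIONS crystal
  Si  0.00  0.00  0.00
  Si  0.25  0.25  0.25
K_POINTS automatic
  4 4 4  0 0 0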

Sample Script (continued): "module load espresso" loads the default Quantum ESPRESSO module; the installed versions can be listed with module avail:

% module avail espresso
/usr/common/usg/Modules/modulefiles
espresso/5.0.0    espresso/ (default)   espresso/5.2.0    espresso/5.0.2
espresso/5.1      espresso/5.4.0        espresso/         espresso/5.1.1
espresso/5.0.3    espresso/5.1.2

Sample Script (continued): the line export OMP_NUM_THREADS=1 controls the number of OpenMP threads per MPI task.

Sample Script (continued): the -npools 4 and -ntg 3 flags on the srun line are pw.x command-line options that control the way that QE handles parallelization.

Parallelization Options Matter!

QE Parallelization Hierarchy: Pools

[Diagram: Pools -> Task Groups (only for local DFT) / Band Groups (only for hybrid DFT) -> FFTs]

Pool parallelization:
- Parallelization over k-points
- Nearly perfect efficiency
- Set by the -npools option
- Can use at most 1 pool per k-point
- npools should be a divisor of the number of nodes
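For example (a sketch; the k-point count is an assumption, while the node and rank counts follow the sample script above), a calculation with 8 k-points on 96 Edison nodes can run one pool per k-point:

# 2304 MPI ranks split into 8 pools of 288 ranks, one pool per k-point.
# 8 divides 96, so each pool spans exactly 12 nodes.
srun -n 2304 pw.x -npools 8 -in infile.in > outfile.out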

QE Parallelization Hierarchy: Task Group Parallelization

- The number of task groups per pool is set by the -ntg option
- Processors in different task groups work on FFTs corresponding to different bands
- ntg should be a divisor of the number of nodes / npools
- Only works with local/semi-local DFT functionals
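As a concrete sketch, the sample script above already follows this rule: with 96 nodes and -npools 4, each pool spans 24 nodes, and -ntg 3 (one task group per 8 nodes) divides that evenly:

# 96 nodes / 4 pools = 24 nodes per pool; -ntg 3 -> one task group per 8 nodes.
srun -n 2304 pw.x -npools 4 -ntg 3 -in infile.in > outfile.out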

QE Parallelization Hierarchy: Band Group Parallelization

- The number of band groups per pool is set by the -nbgrp option
- Similar to task group parallelization, except it applies only to the calculation of exact exchange
- nbgrp should be a divisor of the number of nodes / npools
- Only works with hybrid DFT functionals
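A sketch for a hybrid-DFT run with a single pool on 96 Edison nodes, following the one-band-group-per-node rule of thumb from the summary slide (the node count is carried over from the sample script):

# Hybrid functional (e.g. PBE0): -nbgrp 96 -> one band group per node;
# 96 divides 96 nodes / 1 pool, as required.
srun -n 2304 pw.x -nbgrp 96 -in infile.in > outfile.out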

QE Parallelization Hierarchy: FFT Parallelization

- All processors within a band group / task group work simultaneously on the same FFT
- FFT parallelization is best suited to on-node parallelism
- Best performance: use band (or task) group parallelization between nodes and FFT parallelization within nodes
- Default QE behavior: only FFT parallelization is used
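To make the contrast concrete, here is a sketch of the two approaches for a local-DFT run on 96 Edison nodes; the flag values are illustrative assumptions, with -ntg 12 matching the scripts on the MPI vs. OpenMP slide below:

# Default: no grouping flags, so all 2304 ranks in the pool cooperate on each FFT.
srun -n 2304 pw.x -in infile.in > outfile.out

# Grouped: 12 task groups of 192 ranks (8 nodes each); each group handles the FFTs
# for a different set of bands, keeping FFT parallelization closer to node scale.
srun -n 2304 pw.x -ntg 12 -in infile.in > outfile.out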

QE Parallelization Hierarchy: Example

SCF calculation on a box of 64 water molecules, using the PBE functional and 96 Edison nodes:
- Walltime using -ntg 6: 28.4 seconds
- Walltime with QE defaults: seconds

MPI vs. OpenMP

Without OMP_NUM_THREADS set (Cori default: 2 OpenMP threads per task):

#!/bin/bash -l
#SBATCH --partition=regular
#SBATCH --nodes=96
#SBATCH --time=0:30:00
#SBATCH --job-name=qe_test

module load espresso
srun -n 2304 pw.x -ntg 12 -in infile.in > outfile.out

Walltime: 198 seconds

With OMP_NUM_THREADS=1:

#!/bin/bash -l
#SBATCH --partition=regular
#SBATCH --nodes=96
#SBATCH --time=0:30:00
#SBATCH --job-name=qe_test

module load espresso
export OMP_NUM_THREADS=1
srun -n 2304 pw.x -ntg 12 -in infile.in > outfile.out

Walltime: 108 seconds

Best performance: OMP_NUM_THREADS=1. Default on Cori: OMP_NUM_THREADS=2.
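Because the Cori default is two OpenMP threads per task, it can help to make the pure-MPI intent explicit. A minimal sketch follows; the -c 1 option is a generic Slurm cpus-per-task setting, an assumption rather than something shown on the slides:

export OMP_NUM_THREADS=1            # one OpenMP thread per MPI task (pure MPI)
srun -n 2304 -c 1 pw.x -ntg 12 -in infile.in > outfile.out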

Energy Cutoffs

- ecutwfc: kinetic energy cutoff for wavefunctions
- ecutfock: kinetic energy cutoff for the exact exchange operator (default: 4*ecutwfc)

[Figure: convergence test showing that the default value of ecutfock is often larger than necessary (ecutwfc = 80 Ry)]
[Figure: timings showing that reducing ecutfock can lead to large speed benefits]

When using hybrid DFT, it is a good idea to test for the optimal value of ecutfock.
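For reference, here is a sketch of where these cutoffs appear in the &SYSTEM namelist of a pw.x input; the specific values and the PBE0 choice are illustrative assumptions:

&SYSTEM
  ! ... other system variables (ibrav, nat, ntyp, ...) ...
  input_dft = 'PBE0'    ! hybrid functional, so exact exchange is evaluated
  ecutwfc   = 80.0      ! Ry, wavefunction cutoff
  ecutfock  = 160.0     ! Ry, exact-exchange cutoff; test convergence rather than
                        ! relying on the 4*ecutwfc default (320 Ry here)
/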

NESAP Efforts

QE is a Tier 1 NESAP project, and is currently being prepared for execution on many-core architectures, such as NERSC's upcoming Cori Phase II.

Code improvements:
- Independent parallelization of the exact exchange code
- Task groups and band groups can now be used simultaneously
- Improved band parallelization
- Improved communication within the exact exchange code

[Figure: improved parallelization of a PBE0 calculation on Edison]

Summary of Best Practices

- Pools: set npools to the number of k-points whenever possible.
- Task groups: set ntg when using local or semi-local DFT functionals. A total of one task group per 8 nodes is usually reasonable.
- Band groups: set nbgrp when using a hybrid DFT functional. A total of one band group per node is usually reasonable.
- MPI vs. OpenMP: when in doubt, use pure MPI by setting OMP_NUM_THREADS=1.
- Energy cutoffs: find the optimal value of all cutoffs, including both ecutwfc and ecutfock (if using hybrid functionals).
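Putting these guidelines together, here is a sketch of a job script for a hybrid-functional run on 96 Edison nodes; the k-point count, node count, and file names are illustrative assumptions rather than recommendations from the slides:

#!/bin/bash -l
#SBATCH --partition=regular
#SBATCH --nodes=96
#SBATCH --time=0:30:00
#SBATCH --job-name=qe_hybrid

module load espresso

# Pure MPI: one OpenMP thread per MPI task.
export OMP_NUM_THREADS=1

# Assume 4 k-points -> one pool per k-point (-npools 4).
# 96 nodes / 4 pools = 24 nodes per pool; one band group per node -> -nbgrp 24.
# No -ntg, since task groups apply only to local/semi-local functionals.
srun -n 2304 pw.x -npools 4 -nbgrp 24 -in infile.in > outfile.out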