PARATEC and the Generation of the Empty States (Starting Point for GW/BSE)
Andrew Canning
Computational Research Division, LBNL, and Chemical Engineering and Materials Science Dept., UC Davis

GW/BSE Method Overview
DFT Kohn-Sham (SCF and NSCF): {φ^DFT_nk(r), E^DFT_nk}
Compute dielectric function
GW: quasiparticle properties {φ^QP_nk(r), E^QP_nk}
BSE: construct kernel on the coarse k-grid, K(k,c,v; k',c',v')
Interpolate kernel to the fine grid / diagonalize the BSE Hamiltonian: {A^s_cvk, E^s_cvk}
Expt.: G.E. Jellison, M.F. Chisholm, S.M. Gorbatkin, Appl. Phys. Lett. 62, 3348 (1993).

Computational Cost: GW Method for a Carbon Nanotube
80 carbon atoms, 80 x 80 x 4.6 a.u. cell
160 occupied (valence) bands, 800 unoccupied (conduction) bands
k-points: 1x1x32 (coarse grid), 1x1x256 (fine grid)
Running on the Cray XE6 Hopper
Generation of the empty states is ~30% of the computational cost and the largest contribution in terms of wall-clock time
Scaling issues arise when running DFT codes for a large number of bands on a relatively small system

Features of Different Codes for Generation of Empty States (what to use for GW/BSE?)
SIESTA (Spanish Initiative for Electronic Simulations with Thousands of Atoms):
  LCAO basis set (Linear Combination of Atomic Orbitals)
  Less accurate basis allows larger systems to be studied (thousands of atoms)
  Good for non-periodic systems and large molecules
  O(N) algorithms implemented in the LCAO basis
PARSEC (Pseudopotential Algorithm for Real-Space Electronic Structure Calculations):
  Grid-based real-space representation, finite-difference approach
  Easy to implement non-periodic boundary conditions
  Good for large molecules etc.
Quantum ESPRESSO:
  Plane-wave basis set (same as the BerkeleyGW code)
  PAW (Projector Augmented Wave) option
  Hybrid functionals
PARATEC (PARAllel Total Energy Code):
  Plane-wave basis set (same as the BerkeleyGW code)
  Good for periodic systems (crystals, metallic systems, etc.)
  Hybrid functionals, static COHSEX
  Hybrid OpenMP/MPI implementation

PARATEC (PARAllel Total Energy Code)
PARATEC performs first-principles quantum mechanical total energy calculations using pseudopotentials and a plane-wave basis set
Written in F90 and MPI; designed to run on large parallel machines (Cray, IBM, etc.) but also runs on PCs
PARATEC uses an all-band CG approach to obtain the electron wavefunctions (blocked communications, specialized 3D FFTs)
Generally obtains a high percentage of peak performance on different platforms (uses BLAS3 and 1D FFT libraries)
Developed by the Louie and Cohen groups (UC Berkeley, LBNL) in collaboration with CRD, NERSC

Breakdown of Computational Costs for Solving the Kohn-Sham LDA/GGA Equations in PARATEC

Computational Task (CG solver)        Scaling
Orthogonalization                     M N^2
Subspace diagonalization              N^3
3D FFTs (most communications)         N M log M
Nonlocal pseudopotential              M N^2 (N^2 in real space)

N: number of eigenpairs required (lowest in the spectrum)
M: matrix (Hamiltonian) dimension, i.e. basis set size (M >> N)
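To make the scalings above concrete, here is a minimal sketch (not part of PARATEC) that evaluates the four asymptotic terms for one assumed problem size. The band count mirrors the nanotube example (160 valence + 800 conduction bands); M = 20 N is an illustrative guess and all prefactors are ignored.

```c
/* Sketch: compare the asymptotic cost terms from the table above.
 * N, M and the M = 20*N relation are assumed example values. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double N = 960.0;      /* e.g. 160 valence + 800 conduction bands */
    double M = 20.0 * N;   /* assumed plane-wave basis size           */

    double ortho    = M * N * N;        /* orthogonalization        ~ M N^2    */
    double subdiag  = N * N * N;        /* subspace diagonalization ~ N^3      */
    double fft      = N * M * log(M);   /* 3D FFTs                  ~ N M logM */
    double nonlocal = M * N * N;        /* nonlocal pseudopotential ~ M N^2    */

    printf("ortho %.3e  subdiag %.3e  fft %.3e  nonlocal %.3e\n",
           ortho, subdiag, fft, nonlocal);
    return 0;
}
```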

Load Balancing, Parallel Data Layout
Wavefunctions stored as spheres of points in Fourier space (due to the energy cutoff)
Data-intensive parts (BLAS) proportional to the number of Fourier components
Pseudopotential calculation and orthogonalization scale as N^3 (N-atom system)
FFT part scales as N^2 log N
FFT data distribution: load-balancing constraints (Fourier space):
  each processor should have the same number of Fourier coefficients (for the N^3 calculations)
  each processor should have complete columns of Fourier coefficients (for the 3D FFT)
Give out sets of columns of data to each processor (sketched below)
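The column hand-out in the last item can be illustrated with a small greedy assignment: whole columns of Fourier coefficients go to whichever processor currently holds the fewest coefficients. The column lengths, processor count, and longest-first ordering below are illustrative assumptions, not PARATEC's actual layout code.

```c
/* Sketch: distribute whole columns of Fourier coefficients so that each
 * processor ends up with roughly the same number of coefficients. */
#include <stdio.h>

#define NCOL  10
#define NPROC 3

int main(void)
{
    int col_len[NCOL] = {95, 90, 80, 72, 60, 55, 40, 30, 20, 10}; /* coeffs per column, longest first */
    int owner[NCOL];
    long load[NPROC] = {0};

    for (int c = 0; c < NCOL; ++c) {
        int best = 0;
        for (int p = 1; p < NPROC; ++p)      /* pick the least-loaded processor */
            if (load[p] < load[best]) best = p;
        owner[c] = best;                     /* whole column stays on one processor (3D FFT) */
        load[best] += col_len[c];            /* balance coefficient counts (N^3 parts) */
    }

    for (int p = 0; p < NPROC; ++p)
        printf("proc %d: %ld coefficients\n", p, load[p]);
    return 0;
}
```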

PARATEC: Performance
All architectures generally achieve high performance due to the computational intensity of the code (BLAS3, FFT)
The Earth Simulator achieves the highest overall performance: 5.5 Tflop/s on 2048 processors (5.3 Tflop/s on the XT4 on 2048 processors in single-processor-per-node mode)
Used as a benchmark code for NERSC procurements (run on up to 18K processors on the Cray XT4, weak scaling)
Vectorisation directives and multiple 1D FFTs required for the NEC SX6
Developed with the Louie and Cohen groups (UC Berkeley, LBNL); also work with L. Oliker, J. Carter
[Performance table: Gflop/s per processor and % of peak vs. processor count for a 488-atom CdSe quantum dot on Bassi (IBM Power5, NERSC), Jacquard (Opteron, NERSC), Thunder (Itanium2), Franklin (Cray XT4, NERSC), the NEC Earth Simulator (SX6), and IBM BG/L.]

Parallelization in PW DFT Codes
Four levels (k-points, bands, PWs, OpenMP):
  Band parallelization: the n nodes are divided into groups
  k-point parallelization: divide k-points among groups of nodes (limited for large systems, molecules, nanostructures, etc.)
  PW parallelization: each group parallelizes over PWs
  OpenMP, threaded libraries on the node/chip
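A minimal sketch of how the first two MPI levels of such a hierarchy can be set up with nested communicator splits; the remaining ranks in the innermost communicator then parallelize over plane waves. The group counts nkgrp and nbgrp are made-up example values, and this is a generic illustration, not PARATEC's communicator code.

```c
/* Sketch: nested communicator splits for k-point groups -> band groups -> PW group. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int nkgrp = 2, nbgrp = 2;   /* assumed numbers of k-point and band groups */
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Comm kcomm;                                   /* split world into k-point groups  */
    MPI_Comm_split(MPI_COMM_WORLD, rank % nkgrp, rank, &kcomm);

    int krank;
    MPI_Comm_rank(kcomm, &krank);
    MPI_Comm bcomm;                                   /* split each k-point group by band */
    MPI_Comm_split(kcomm, krank % nbgrp, krank, &bcomm);

    int brank, bsize;                                 /* ranks here share the plane waves */
    MPI_Comm_rank(bcomm, &brank);
    MPI_Comm_size(bcomm, &bsize);
    printf("world %d/%d -> PW rank %d/%d\n", rank, size, brank, bsize);

    MPI_Comm_free(&bcomm);
    MPI_Comm_free(&kcomm);
    MPI_Finalize();
    return 0;
}
```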

OpenMP, Threading for On-Node/Chip Parallelism
Fewer MPI messages, avoiding communication bottlenecks
Aggregation of messages per node reduces latency issues
Smaller memory footprint (from code and MPI buffers)
No on-node MPI messaging
Extra level of parallelism to improve scaling to larger core counts
Timing results shown for the threaded version of PARATEC used to generate VB and CB states as input to the GW code: 686 Si atoms on Jaguar, the Cray XT5 at ORNL (224,162 cores; each node has two 2.6 GHz 6-core AMD Istanbul chips, 12 cores in total)
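A generic hybrid MPI/OpenMP pattern of the kind described above, reduced to a toy reduction: one MPI rank per node (or chip) with OpenMP threads inside it, so only the ranks exchange messages. This illustrates the programming model only and is not PARATEC code.

```c
/* Sketch: funneled MPI with OpenMP threading on the node.  Build with: mpicc -fopenmp */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* only the master thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    #pragma omp parallel for reduction(+:local)   /* on-node threading */
    for (int i = 0; i < 1000000; ++i)
        local += 1.0e-6;

    double global;
    /* one message per rank (node) instead of one per core */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("threads/rank = %d, global sum = %f\n",
               omp_get_max_threads(), global);
    MPI_Finalize();
    return 0;
}
```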

Non-SCF Problem to Generate Empty CB States
Solve self-consistently for the N_VB valence states
Solve non-self-consistently for N_VB + N_CB states
Output for the GW/BSE codes
The non-SCF problem is like simulating a metallic system (no gap above the top of the computed spectrum):
  slow convergence requires a convergence criterion for the empty states
N_VB + N_CB can be very large:
  operations on the subspace matrix can dominate
  a high percentage of eigenpairs is calculated compared to an SCF calculation
Typically almost all of the runtime is in the non-SCF calculation
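As a sketch of the kind of convergence criterion mentioned above, the fragment below requires a small residual for every requested valence and conduction band before the non-self-consistent solve stops. The residual values and tolerance are made up, and PARATEC's actual criterion may differ.

```c
/* Sketch: per-band convergence test; resid[n] stands in for ||H*psi_n - eps_n*psi_n||. */
#include <stdio.h>

static int all_bands_converged(const double *resid, int nbands, double tol)
{
    for (int n = 0; n < nbands; ++n)
        if (resid[n] > tol) return 0;   /* the highest (empty) bands converge slowest */
    return 1;
}

int main(void)
{
    double resid[5] = {1e-9, 3e-9, 8e-8, 4e-6, 2e-5};  /* illustrative residuals */
    printf("converged (tol 1e-5): %d\n", all_bands_converged(resid, 5, 1e-5));
    printf("converged (tol 1e-4): %d\n", all_bands_converged(resid, 5, 1e-4));
    return 0;
}
```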

Breakdown of Computational Costs for Solving the Kohn-Sham LDA/GGA Equations in PARATEC: NSCF Calculation for GW/BSE (compared to standard SCF)

Computational Task (CG solver)        Scaling
Orthogonalization                     M N^2
Subspace diagonalization              N^3
3D FFTs (most communications)         N M log M
Nonlocal pseudopotential              M N^2 (N^2 in real space)

N = N_VB + N_CB (vs. N_VB for SCF): number of eigenpairs required
M: matrix (Hamiltonian) dimension, basis set size (M ~ 10-20 N)

PARATEC Features for the Non-SCF Problem
Efficient distributed implementation of operations on the subspace matrix using ScaLAPACK
Extra states calculated above the required number to improve convergence of the CG solver
Option to use a direct solver on the Hamiltonian when the percentage of eigenpairs required is high (> 10%); this can be faster than the CG iterative solver (P. Zhang)
  scaling of an iterative solver (e.g. CG): N^2 M, compared to a direct solver (LAPACK, ScaLAPACK): M^3
  (M = matrix size (basis, number of PWs), N = number of states); see the cost-comparison sketch below
Block-block data layout; block size chosen for optimal performance
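A sketch of that iterative-vs-direct decision, under the assumption that the iterative route carries an effective prefactor of roughly 100 CG sweeps; with that assumption the crossover M^3 < 100 N^2 M reduces to M < 10 N, which matches the > 10% fraction-of-eigenpairs rule of thumb quoted above. The numbers are illustrative, not PARATEC's internal logic.

```c
/* Sketch: pick the cheaper solver from the asymptotic costs, prefactors assumed. */
#include <stdio.h>

static const char *choose_solver(double N, double M, double n_sweeps)
{
    double iterative = n_sweeps * N * N * M;  /* CG-type solver, n_sweeps passes */
    double direct    = M * M * M;             /* dense LAPACK/ScaLAPACK solve    */
    return (direct < iterative) ? "direct" : "iterative";
}

int main(void)
{
    /* NSCF-like case: many conduction bands, basis only ~8x the band count */
    printf("N=960,  M=8000 : %s\n", choose_solver(960.0, 8000.0, 100.0));
    /* SCF-like case: few bands relative to the basis size */
    printf("N=160,  M=40000: %s\n", choose_solver(160.0, 40000.0, 100.0));
    return 0;
}
```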

PARATEC Summary and Future Developments
PARATEC is optimized for large parallel machines (Cray, IBM)
OpenMP/threaded version under development (important to expose more parallelism, particularly for the small systems used in GW/BSE, giving faster time to solution)
Hybrid functionals and static COHSEX (starting point for GW/BSE)
Some optimization for the generation of empty states for GW/BSE
Direct diagonalization of H for cases where a high percentage of eigenstates is required for GW/BSE (to be in a released version soon)