First-principles modeling with Octopus: massive parallelization towards petaflop computing and more
A. Castro, J. Alberdi and A. Rubio
Outline: Theoretical Spectroscopy. The octopus code. Parallelization.
Theoretical Spectroscopy
Electronic excitations: Optical absorption. Electron energy loss. Inelastic X-ray scattering. Photoemission. Inverse photoemission. …
Theoretical Spectroscopy. Goal: first-principles (from electronic structure) theoretical description of the various spectroscopies (“theoretical beamlines”).
Theoretical Spectroscopy. Role: interpretation of (complex) experimental findings. Theoretical atomistic structures, and corresponding TEM images.
Theoretical Spectroscopy. The European Theoretical Spectroscopy Facility (ETSF): Networking. Integration of tools (formalism, software). Maintenance of tools. Support, service, training.
Theoretical Spectroscopy. The octopus code is a member of a family of free software codes developed, to a large extent, within the ETSF: abinit, octopus, dp.
The octopus code
The octopus code. Targets: Optical absorption spectra of molecules, clusters, nanostructures, solids. Response to lasers (non-perturbative response to high-intensity fields). Dichroic spectra, and other mixed (electric-magnetic) responses. Adiabatic and non-adiabatic molecular dynamics (for, e.g., infrared and vibrational spectra, or photochemical reactions). Quantum optimal control theory for molecular processes.
The octopus code. Physical approximations and techniques: Density-functional theory and time-dependent density-functional theory to describe the electronic structure. Comprehensive set of functionals through the libxc library. Mixed quantum-classical systems. Both real-time and frequency-domain response (“Casida” and “Sternheimer” formulations).
The octopus code. Numerics: Basic representation: real-space grid, usually regular and rectangular, occasionally curvilinear. Plane waves for some procedures (especially for periodic systems). Atomic orbitals for some procedures.
The octopus code. Derivative at a point: sum over neighboring points. The coefficients c_ij depend on the points used (the stencil). More points -> more precision. Semi-local operation.
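The stencil idea above can be illustrated with a minimal one-dimensional sketch. It is not taken from the octopus source: the weights are the standard textbook central-difference coefficients for the second derivative, and the names `STENCILS`, `second_derivative` and `stencil_errors` are hypothetical. It only shows that the derivative is a weighted sum over neighbors and that a wider stencil gives a smaller error.

```python
import math

# Central finite-difference weights for the second derivative on a uniform grid.
# Keys: number of stencil points. Values: weights for offsets -k..k.
# These are standard textbook coefficients, assumed here for illustration.
STENCILS = {
    3: [1.0, -2.0, 1.0],
    5: [-1.0 / 12, 4.0 / 3, -5.0 / 2, 4.0 / 3, -1.0 / 12],
}


def second_derivative(f, x, h, npoints):
    """Approximate f''(x) as a weighted sum over neighboring grid points."""
    weights = STENCILS[npoints]
    k = npoints // 2
    total = sum(c * f(x + (j - k) * h) for j, c in enumerate(weights))
    return total / h**2


def stencil_errors(x=0.7, h=0.1):
    """Absolute error of each stencil for f = sin, whose exact f'' is -sin."""
    exact = -math.sin(x)
    return {n: abs(second_derivative(math.sin, x, h, n) - exact) for n in STENCILS}


if __name__ == "__main__":
    for n, err in stencil_errors().items():
        print(f"{n}-point stencil: error = {err:.2e}")
```

Each evaluation touches only a few neighbors of the point, which is what makes the operation semi-local.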
The octopus code. The key equations. Ground-state DFT: Kohn-Sham equations. Time-dependent DFT: time-dependent Kohn-Sham equations.
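The equations themselves did not survive in the slide text. For reference, the standard textbook forms are given below in atomic units; the notation ($\varphi_i$ for the Kohn-Sham orbitals, $\varepsilon_i$ for their eigenvalues, $n$ for the density, $v_{\rm KS}$ for the Kohn-Sham potential) is an assumption and may differ from the one used on the original slide.

```latex
% Ground-state Kohn-Sham equations
\left[ -\tfrac{1}{2}\nabla^2 + v_{\rm KS}[n](\mathbf{r}) \right] \varphi_i(\mathbf{r})
  = \varepsilon_i \, \varphi_i(\mathbf{r}),
\qquad
n(\mathbf{r}) = \sum_{i}^{\rm occ} |\varphi_i(\mathbf{r})|^2

% Time-dependent Kohn-Sham equations
i \frac{\partial}{\partial t} \varphi_i(\mathbf{r}, t)
  = \left[ -\tfrac{1}{2}\nabla^2 + v_{\rm KS}[n](\mathbf{r}, t) \right] \varphi_i(\mathbf{r}, t),
\qquad
n(\mathbf{r}, t) = \sum_{i}^{\rm occ} |\varphi_i(\mathbf{r}, t)|^2
```

The orbitals are coupled only through the density $n$, which enters the potential.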
The octopus code. Key numerical operations: Linear systems with sparse matrices. Eigenvalue systems with sparse matrices. Non-linear eigenvalue systems. Propagation of “Schrödinger-like” equations. The dimension can go up to 10 million points. The storage needs can go up to 10 GB.
The octopus code. Use of libraries: BLAS, LAPACK. GNU GSL mathematical library. FFTW. NetCDF. ETSF input/output library. Libxc exchange and correlation library. Other optional libraries.
Parallelization
Objective: Reach petaflops computing with a scientific code. Simulate the absorption of light by chlorophyll in photosynthesis.
Simulation objective: Photovoltaic materials. Biomolecules.
The Octopus code: Software package for electron dynamics. Developed at the UPV/EHU. Ground-state and excited-state properties. Real-time, Casida and Sternheimer TDDFT. Quantum transport and optimal control. Free software: GPL license.
Octopus simulation strategy: Pseudopotential approximation. Real-space grids. Main operation: the finite-difference Laplacian.
Libraries: Intensive use of libraries. General libraries: BLAS, LAPACK, FFT, Zoltan/Metis, ... Specific libraries: Libxc, ETSF_IO.
Multilevel parallelization. MPI: Kohn-Sham states, real-space domains. In-node: OpenMP threads, OpenCL tasks, vectorization. CPU, GPU.
Target systems: Massive number of execution units. Multicore processors with vector FPUs. IBM Blue Gene architecture. Graphics processing units.
High-level parallelization: MPI parallelization
Parallelization by states/orbitals: Assign each processor a group of states. Time propagation is independent for each state. Little communication required. Limited by the number of states in the system.
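The assignment of states to processors can be sketched as a simple block distribution. This is a minimal illustration in plain Python, not the scheme used in octopus; `distribute_states` and `idle_ranks` are hypothetical names. It also shows the limit named above: once there are more processors than states, the extra processors receive no work.

```python
def distribute_states(n_states, n_procs):
    """Split state indices 0..n_states-1 into contiguous, near-equal groups.

    Returns one range per processor; ranges may be empty when
    n_procs > n_states.
    """
    base, extra = divmod(n_states, n_procs)
    groups = []
    start = 0
    for rank in range(n_procs):
        # The first `extra` ranks take one additional state each.
        size = base + (1 if rank < extra else 0)
        groups.append(range(start, start + size))
        start += size
    return groups


def idle_ranks(n_states, n_procs):
    """Count processors that receive no state at all."""
    return sum(1 for g in distribute_states(n_states, n_procs) if len(g) == 0)


if __name__ == "__main__":
    for rank, group in enumerate(distribute_states(10, 4)):
        print(f"rank {rank}: states {list(group)}")
    print("idle ranks with 3 states on 5 processors:", idle_ranks(3, 5))
```

Because each state is propagated independently, every rank can advance its own group and only the density needs to be combined across ranks.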
Domain parallelization: Assign each processor a set of grid points. Partition libraries: Zoltan or Metis.
Main operations in domain parallelization. Laplacian: copy points in domain boundaries; overlap computation and communication. Integration: global sums (reductions); group reduction operations.
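Both operations can be mimicked without MPI in a small serial sketch. The assumptions are loud ones: a one-dimensional grid, a 3-point Laplacian, zero values outside the grid, contiguous domains instead of a Zoltan or Metis partition, and no overlap of computation with communication. All function names are hypothetical. The list lookups in `exchange_boundaries` stand in for the boundary copies between neighboring processes, and the final `sum` in `distributed_integral` stands in for a global reduction.

```python
def split_domain(values, n_domains):
    """Cut a 1D grid into contiguous domains of near-equal size.

    Assumes n_domains <= len(values), so that no domain is empty.
    """
    base, extra = divmod(len(values), n_domains)
    domains, start = [], 0
    for d in range(n_domains):
        size = base + (1 if d < extra else 0)
        domains.append(values[start:start + size])
        start += size
    return domains


def exchange_boundaries(domains):
    """Return (left_ghost, right_ghost) per domain; zero outside the grid."""
    ghosts = []
    for d in range(len(domains)):
        left = domains[d - 1][-1] if d > 0 else 0.0
        right = domains[d + 1][0] if d < len(domains) - 1 else 0.0
        ghosts.append((left, right))
    return ghosts


def local_laplacian(local, ghost, h):
    """3-point Laplacian on one domain, using the copied boundary points."""
    left, right = ghost
    padded = [left] + list(local) + [right]
    return [(padded[i - 1] - 2.0 * padded[i] + padded[i + 1]) / h**2
            for i in range(1, len(padded) - 1)]


def distributed_laplacian(values, n_domains, h):
    """Apply the Laplacian domain by domain and gather the result."""
    domains = split_domain(values, n_domains)
    ghosts = exchange_boundaries(domains)
    result = []
    for local, ghost in zip(domains, ghosts):
        result.extend(local_laplacian(local, ghost, h))
    return result


def distributed_integral(values, n_domains, h):
    """Each domain sums its own points; the partial sums are then reduced."""
    partial = [sum(local) * h for local in split_domain(values, n_domains)]
    return sum(partial)  # stands in for a global sum over all processes


if __name__ == "__main__":
    grid = [float(i * i) for i in range(8)]
    print(distributed_laplacian(grid, 3, 1.0))
    print(distributed_integral(grid, 3, 0.5))
```

The Laplacian needs only one neighbor value per domain edge here, while the integral needs a single number from every domain, which is why the two operations have different communication patterns.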
Low-level parallelization and vectorization: OpenMP and GPU
Two approaches. OpenMP: Thread programming based on compiler directives. In-node parallelization. Little memory overhead compared to MPI. Scaling limited by memory bandwidth. Multithreaded BLAS and LAPACK. OpenCL: Hundreds of execution units. High memory bandwidth but with long latency. Behaves like a vector processor (length > 16). Separate memory: copy from/to main memory.
Supercomputers. Corvo cluster: x86_64. VARGAS (at IDRIS): Power6, 67 teraflops. MareNostrum: PowerPC 970, 94 teraflops. Jugene: 1 petaflops.
Test Results
Laplacian operator: Performance comparison for the finite-difference Laplacian operator. CPU uses 4 threads. GPU is 4 times faster. Cache effects are visible.
Time propagation: Performance comparison for a time propagation. Fullerene molecule. The GPU is 3 times faster. Limited by copying and non-GPU code.
Multilevel parallelization. Chlorophyll molecule: 650 atoms. Jugene, Blue Gene/P. Sustained throughput: > 6.5 teraflops. Peak throughput: 55 teraflops.
Scaling
Scaling (II): Comparison of two atomic systems on Jugene.
Target system: Jugene, all nodes. processor cores = nodes. Maximum theoretical performance of 1002 TFlops. 5879-atom chlorophyll system. Complete molecule from spinach.
Test systems. Smaller molecules: 180 atoms, 441 atoms, 650 atoms, 1365 atoms. Partitions of the Jugene and Corvo machines.
Profiling: Profiled within the code. Profiled with the Paraver tool.
1 TD iteration Poisson
Some “inner” iterations
One “inner” iteration: Ireceive, Isend, Iwait
Poisson solver: 2 x Alltoall, Allgather, Scatter
Improvements. Memory improvements in GS: split the memory among the nodes; use of ScaLAPACK. Improvements in the Poisson solver for TD: pipeline execution (execute the Poisson solver while the propagation continues with an approximation); use of new algorithms like FMM; use of parallel FFTs.
Conclusions: The Kohn-Sham scheme is inherently parallel. It can be exploited for parallelization and vectorization. Suited to current and future computer architectures. Theoretical improvements for large-system modeling.