First principles modeling with Octopus: massive parallelization towards petaflop computing and more. A. Castro, J. Alberdi and A. Rubio.

Outline: Theoretical Spectroscopy, The octopus code, Parallelization

Theoretical Spectroscopy

Electronic excitations:
- Optical absorption
- Electron energy loss
- Inelastic X-ray scattering
- Photoemission
- Inverse photoemission
- …

Theoretical Spectroscopy. Goal: a first-principles (i.e. from the electronic structure) theoretical description of the various spectroscopies (“theoretical beamlines”).

Theoretical Spectroscopy. Role: interpretation of (complex) experimental findings.

Theoretical Spectroscopy. Example: theoretical atomistic structures, and the corresponding TEM images.

Theoretical Spectroscopy: the European Theoretical Spectroscopy Facility (ETSF).

The European Theoretical Spectroscopy Facility (ETSF):
- Networking
- Integration of tools (formalism, software)
- Maintenance of tools
- Support, service, training

The octopus code is a member of a family of free-software codes developed, to a large extent, within the ETSF:
- abinit
- octopus
- dp

Outline: Theoretical Spectroscopy, The octopus code, Parallelization

The octopus code. Targets:
- Optical absorption spectra of molecules, clusters, nanostructures, solids.
- Response to lasers (non-perturbative response to high-intensity fields).
- Dichroic spectra, and other mixed electric-magnetic responses.
- Adiabatic and non-adiabatic molecular dynamics (e.g. for infrared and vibrational spectra, or photochemical reactions).
- Quantum optimal control theory for molecular processes.

The octopus code. Physical approximations and techniques:
- Density-functional theory (DFT) and time-dependent density-functional theory (TDDFT) to describe the electronic structure; comprehensive set of functionals through the libxc library.
- Mixed quantum-classical systems.
- Both real-time and frequency-domain response (“Casida” and “Sternheimer” formulations).

The octopus code. Numerics:
- Basic representation: a real-space grid, usually regular and rectangular, occasionally curvilinear.
- Plane waves for some procedures (especially for periodic systems).
- Atomic orbitals for some procedures.

The octopus code. The derivative at a grid point is a weighted sum over neighboring points; the coefficients c_ij depend on which points are used: the stencil. More points give more precision. It is a semi-local operation.
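Schematically, in standard finite-difference notation (the precise stencil and coefficients depend on the chosen discretization order):

\nabla^2 f(\mathbf{r}_i) \approx \sum_{j \in \mathrm{stencil}(i)} c_{ij}\, f(\mathbf{r}_j)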

The octopus code. The key equations:
- Ground-state DFT: the Kohn-Sham equations.
- Time-dependent DFT: the time-dependent Kohn-Sham equations.
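The equation images of this slide are not reproduced in the transcript; in standard notation (atomic units) the equations referred to are

\left[-\tfrac{1}{2}\nabla^2 + v_{\mathrm{ext}}(\mathbf{r}) + v_{\mathrm{H}}[n](\mathbf{r}) + v_{\mathrm{xc}}[n](\mathbf{r})\right]\varphi_i(\mathbf{r}) = \varepsilon_i\,\varphi_i(\mathbf{r}), \qquad n(\mathbf{r}) = \sum_i |\varphi_i(\mathbf{r})|^2

for the ground state, and

i\,\frac{\partial}{\partial t}\,\varphi_i(\mathbf{r},t) = \left[-\tfrac{1}{2}\nabla^2 + v_{\mathrm{KS}}[n](\mathbf{r},t)\right]\varphi_i(\mathbf{r},t)

for the time-dependent case.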

The octopus code. Key numerical operations:
- Linear systems with sparse matrices.
- Eigenvalue problems with sparse matrices.
- Non-linear eigenvalue problems.
- Propagation of “Schrödinger-like” equations.
- The dimension can go up to 10 million grid points; the storage needs can go up to 10 GB.
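As an illustration of how these operations fit together (one common strategy in real-time TDDFT codes, not necessarily the exact propagator used in the runs reported later), a time step can be taken with an approximate exponential of the Kohn-Sham Hamiltonian, so that each step reduces to repeated applications of a sparse operator:

\varphi_i(t+\Delta t) \approx e^{-i\,\hat H_{\mathrm{KS}}\,\Delta t}\,\varphi_i(t) \approx \sum_{n=0}^{N} \frac{(-i\,\Delta t)^n}{n!}\,\hat H_{\mathrm{KS}}^{\,n}\,\varphi_i(t)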

The octopus code. Use of libraries:
- BLAS, LAPACK
- GNU GSL mathematical library
- FFTW
- NetCDF
- ETSF input/output library
- libxc exchange and correlation library
- Other optional libraries

Outline: Theoretical Spectroscopy, The octopus code, Parallelization

Objective: reach petaflop computing with a scientific code, and simulate the absorption of light by chlorophyll in photosynthesis.

Simulation objectives: photovoltaic materials and biomolecules.

The Octopus code:
- Software package for electron dynamics
- Developed at the UPV/EHU
- Ground-state and excited-state properties
- Real-time, Casida and Sternheimer TDDFT
- Quantum transport and optimal control
- Free software: GPL license

Octopus simulation strategy:
- Pseudopotential approximation
- Real-space grids
- Main operation: the finite-difference Laplacian

Libraries: intensive use of libraries.
General libraries:
- BLAS
- LAPACK
- FFT
- Zoltan/METIS
- ...
Specific libraries:
- libxc
- ETSF_IO

Multi-level parallelization:
- MPI: Kohn-Sham states, real-space domains
- In node: OpenMP threads, OpenCL tasks, vectorization (CPU and GPU)

Target systems: a massive number of execution units
- Multi-core processors with vector FPUs
- IBM Blue Gene architecture
- Graphics processing units

High-level parallelization: MPI parallelization.

Parallelization by states/orbitals:
- Assign each processor a group of states
- Time propagation is independent for each state
- Little communication required
- Limited by the number of states in the system
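A minimal sketch of this idea (hypothetical code and names, not Octopus's actual data structures): each MPI rank owns a block of Kohn-Sham states and propagates them without communicating; communication only appears when state-summed quantities such as the density are needed.

/* Hypothetical sketch of state (orbital) parallelization. */
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int nstates = 1024;                      /* total number of KS states (illustrative) */
  int base = nstates / nprocs, rem = nstates % nprocs;
  int nlocal = base + (rank < rem ? 1 : 0);      /* states owned by this rank */
  int first  = rank * base + (rank < rem ? rank : rem);

  for (int i = 0; i < nlocal; i++) {
    int ist = first + i;
    /* propagate_state(ist, dt);  -- each state advances independently, no communication */
    (void)ist;
  }

  /* State-summed quantities (e.g. the density) would need an MPI reduction here. */
  MPI_Finalize();
  return 0;
}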

Domain parallelization:
- Assign each processor a set of grid points
- Partition libraries: Zoltan or METIS

Main operations in domain parallelization:
- Laplacian: copy points at the domain boundaries; overlap computation and communication
- Integration: global sums (reductions); group the reduction operations
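A minimal sketch of these two communication patterns, assuming a 1-D domain decomposition with a 3-point stencil (the real code uses general Zoltan/METIS partitions and wider stencils; routine and variable names are illustrative, and left/right may be MPI_PROC_NULL at the ends):

#include <mpi.h>

/* Laplacian with boundary (ghost) exchange overlapped with interior work. */
void laplacian_step(double *u, double *lap, int nlocal, double h2,
                    int left, int right, MPI_Comm comm) {
  double ghost_l = 0.0, ghost_r = 0.0;
  MPI_Request req[4];

  /* 1. Start non-blocking exchange of the boundary points. */
  MPI_Irecv(&ghost_l, 1, MPI_DOUBLE, left,  0, comm, &req[0]);
  MPI_Irecv(&ghost_r, 1, MPI_DOUBLE, right, 1, comm, &req[1]);
  MPI_Isend(&u[0],        1, MPI_DOUBLE, left,  1, comm, &req[2]);
  MPI_Isend(&u[nlocal-1], 1, MPI_DOUBLE, right, 0, comm, &req[3]);

  /* 2. Overlap: apply the stencil to interior points while messages are in flight. */
  for (int i = 1; i < nlocal - 1; i++)
    lap[i] = (u[i-1] - 2.0*u[i] + u[i+1]) / h2;

  /* 3. Finish communication, then handle the two boundary points. */
  MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
  lap[0]        = (ghost_l     - 2.0*u[0]        + u[1])    / h2;
  lap[nlocal-1] = (u[nlocal-2] - 2.0*u[nlocal-1] + ghost_r) / h2;
}

/* Integration over the full grid: local sum plus a global reduction. */
double integrate(const double *f, int nlocal, double dv, MPI_Comm comm) {
  double local = 0.0, global = 0.0;
  for (int i = 0; i < nlocal; i++) local += f[i] * dv;
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
  return global;
}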

Low-level parallelization and vectorization: OpenMP and GPU.

Two approaches:
OpenMP:
- Thread programming based on compiler directives
- In-node parallelization
- Little memory overhead compared to MPI
- Scaling limited by memory bandwidth
- Multithreaded BLAS and LAPACK
OpenCL:
- Hundreds of execution units
- High memory bandwidth, but with long latency
- Behaves like a vector processor (vector length > 16)
- Separate memory: copy from/to main memory
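For the OpenMP side, a minimal illustration of directive-based in-node threading over the local grid points (a hypothetical routine, not an actual Octopus function):

/* Apply a local potential to a state: the grid loop is split among threads.
 * Compile with OpenMP enabled (e.g. -fopenmp); names are illustrative. */
#include <omp.h>

void apply_potential(int np, const double *v, const double *psi_in, double *psi_out) {
  #pragma omp parallel for schedule(static)
  for (int ip = 0; ip < np; ip++)
    psi_out[ip] = v[ip] * psi_in[ip];   /* psi_out(r) = v(r) * psi_in(r) */
}

An OpenCL version of the same operation would map each grid point to a work item, after copying the arrays to device memory and back.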

Supercomputers:
- Corvo cluster: x86_64
- VARGAS (at IDRIS): Power6, 67 teraflops
- MareNostrum: PowerPC 970, 94 teraflops
- Jugene (pictured): 1 petaflop

Test results

Laplacian operator: performance comparison of the finite-difference Laplacian operator.
- The CPU uses 4 threads
- The GPU is 4 times faster
- Cache effects are visible

Time propagation: performance comparison for a time propagation of a fullerene molecule.
- The GPU is 3 times faster
- Limited by memory copies and non-GPU code

Multi-level parallelization: chlorophyll molecule with 650 atoms, on Jugene (Blue Gene/P).
- Sustained throughput: > 6.5 teraflops
- Peak throughput: 55 teraflops

Scaling

Scaling (II): comparison of two atomic systems on Jugene.

Target system: Jugene with all of its nodes and processor cores, with a maximum theoretical performance of 1002 teraflops (about 1 petaflop). Test case: a 5879-atom chlorophyll system, the complete molecule from spinach.

Test systems: smaller molecules of 180, 441, 650 and 1365 atoms, run on partitions of the machines (Jugene and Corvo).

Profiling: profiled within the code, and with the Paraver tool.

One TD iteration (profiling trace), with the Poisson solver marked.

Some “inner” iterations

One “inner” iteration (profiling trace; legend: Ireceive, Isend, Iwait).

Poisson solver (profiling trace): 2 x Alltoall, Allgather, Scatter.

Improvements:
Memory improvements in the ground state (GS):
- Split the memory among the nodes
- Use of ScaLAPACK
Improvements in the Poisson solver for time propagation (TD):
- Pipelined execution: run the Poisson solver while the propagation continues with an approximation
- Use of new algorithms such as FMM
- Use of parallel FFTs

Conclusions:
- The Kohn-Sham scheme is inherently parallel
- This can be exploited for parallelization and vectorization
- It is suited to current and future computer architectures
- Theoretical improvements for large-system modeling