Accelerating Quantum Chemistry with Batched and Vectorized Integrals


Accelerating Quantum Chemistry with Batched and Vectorized Integrals
Hua Huang, Edmond Chow
Georgia Institute of Technology
November 14, 2018

Quantum Chemistry Calculation: a "Hotspot"
(Figure: materials science and chemistry workload breakdown)
Source: NERSC 2017 annual report, https://www.nersc.gov/assets/Uploads/2017NERSC-AnnualReport.pdf

Vectorization
Software must exploit SIMD vectorization to obtain high performance.

ISA        SIMD width    FP32 ops / cycle
SSE4       128 bits      8 (no FMA)
AVX        256 bits      16 (no FMA)
AVX2       256 bits      32 (with FMA)
AVX-512    512 bits      64 (with FMA)

Source: https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

Vectorization
Vectorizable loop (one SIMD word holds consecutive elements x[i], x[i+1], ...):

    double *x, *y, alpha;
    for (int i = 0; i < 1024; i++)
        x[i] += alpha * y[i];

"Horizontal vectorization" (one SIMD word holds the corresponding elements x0[i], x1[i], x2[i], x3[i] from four different arrays):

    double *x0, *x1, *x2, *x3;
    double *y0, *y1, *y2, *y3;
    for (int i = 0; i < 1024; i++) {
        x0[i] += alpha * y0[i];
        x1[i] += alpha * y1[i];
        x2[i] += alpha * y2[i];
        x3[i] += alpha * y3[i];
    }
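To make the lane assignment concrete, here is a minimal sketch of the two loops above written with 256-bit AVX/FMA intrinsics (4 doubles per SIMD word). This is illustrative only and not code from the talk: a compiler will auto-vectorize the first loop on its own, and in a real horizontally vectorized kernel (e.g., Simint) the data for the different streams is laid out contiguously so ordinary loads can be used instead of the per-element gathers shown here.

    /* Compile with AVX2/FMA support, e.g. -mavx2 -mfma; assumes n % 4 == 0. */
    #include <immintrin.h>

    /* Vectorizable loop: one SIMD word holds x[i..i+3]. */
    void axpy_vertical(double *x, const double *y, double alpha, int n)
    {
        __m256d va = _mm256_set1_pd(alpha);
        for (int i = 0; i < n; i += 4) {
            __m256d vx = _mm256_loadu_pd(x + i);
            __m256d vy = _mm256_loadu_pd(y + i);
            vx = _mm256_fmadd_pd(va, vy, vx);        /* x += alpha * y */
            _mm256_storeu_pd(x + i, vx);
        }
    }

    /* Horizontal vectorization: one SIMD word holds x0[i], x1[i], x2[i], x3[i]. */
    void axpy_horizontal(double *x0, double *x1, double *x2, double *x3,
                         const double *y0, const double *y1,
                         const double *y2, const double *y3,
                         double alpha, int n)
    {
        __m256d va = _mm256_set1_pd(alpha);
        for (int i = 0; i < n; i++) {
            __m256d vx = _mm256_set_pd(x3[i], x2[i], x1[i], x0[i]);
            __m256d vy = _mm256_set_pd(y3[i], y2[i], y1[i], y0[i]);
            vx = _mm256_fmadd_pd(va, vy, vx);
            double out[4];
            _mm256_storeu_pd(out, vx);
            x0[i] = out[0]; x1[i] = out[1]; x2[i] = out[2]; x3[i] = out[3];
        }
    }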

Electron Repulsion Integral (ERI)
ERI calculation is one of the main kernels in quantum chemistry.
An ERI (AB|CD) is defined over four basis functions (known). Each contracted basis function is a linear combination of primitive functions, so each contracted integral is a sum of primitive integrals.
8-way symmetry: (AB|CD) = (BA|CD) = (AB|DC) = (CD|AB) = ... (eight equivalent orderings in total)
Number of primitive integrals per contracted integral: K_A K_B K_C K_D
Lightly / heavily contracted basis set: K_A K_B K_C K_D is small / large
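For reference, the standard textbook forms behind this slide (the notation on the original slide may differ) are

$$(AB|CD) = \iint \phi_A(\mathbf{r}_1)\,\phi_B(\mathbf{r}_1)\,\frac{1}{r_{12}}\,\phi_C(\mathbf{r}_2)\,\phi_D(\mathbf{r}_2)\,d\mathbf{r}_1\,d\mathbf{r}_2,$$

where each contracted basis function is a fixed linear combination of primitives, e.g. $\phi_A = \sum_{a=1}^{K_A} c_a \chi_a$, so the contracted integral expands into $K_A K_B K_C K_D$ primitive integrals:

$$(AB|CD) = \sum_{a=1}^{K_A}\sum_{b=1}^{K_B}\sum_{c=1}^{K_C}\sum_{d=1}^{K_D} c_a c_b c_c c_d\,(\chi_a\chi_b|\chi_c\chi_d).$$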

Electron Repulsion Integral (ERI)
Angular momentum (AM) number: AM = 0, 1, 2, 3, ... corresponds to s, p, d, f, ... shells
Shell: a group of basis functions with the same AM and center coordinate
Shell quartet (MN|PQ): the set of integrals defined by four shells M, N, P, Q

Vectorizing ERI Calculation
A huge number (10^7 ~ 10^10) of ERIs must be computed, so vectorization is needed.
Previous attempts to improve the vectorization of existing ERI codes:
Shan, Austin, De Jong, et al., 2013: improved TEXAS
Chow, Liu, Misra, et al., 2015: optimized ERD into OptERD
These efforts improved cache performance, but SIMD performance was still poor.

Vectorizing ERI Calculation
Simint: based on the Obara-Saika (OS) recurrence relations
Achieves a large speedup from SIMD vectorization
Being integrated into Psi4, NWChem, and GAMESS
Simint's vectorization target: the computation of multiple primitive integrals at once (horizontal vectorization)

Batching ERIs in Quantum Chemistry Codes
Challenge: in a real application (unlike a micro-benchmark, which starts from a precomputed list of all shell quartets with the same AM class), construct lists of shell quartets that:
are unique under the 8-way symmetry property,
will not be screened (their absolute ERI values are large enough),
and maximize the number of shell quartets that can be batched together.
A sketch of this grouping follows below.
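A minimal sketch of how such batches might be built (illustrative names only: ShellQuartet, Batch, am_of, screened, and compute_eri_batch are assumptions, not the actual GTFock/Simint API). Unique, unscreened shell quartets are binned by their AM class so that each bin can be handed to the vectorized ERI kernel in a single call:

    #include <stdlib.h>

    typedef struct { int M, N, P, Q; } ShellQuartet;
    typedef struct { ShellQuartet *quartets; int count, capacity; } Batch;

    static void add_to_batch(Batch *b, ShellQuartet sq)
    {
        if (b->count == b->capacity) {
            b->capacity = b->capacity ? 2 * b->capacity : 64;
            b->quartets = realloc(b->quartets, b->capacity * sizeof(ShellQuartet));
        }
        b->quartets[b->count++] = sq;
    }

    /* batches must hold NUM_AM^4 zero-initialized entries, one per AM class. */
    void batch_and_compute(int nshells, Batch *batches, int NUM_AM,
                           int (*am_of)(int shell),
                           int (*screened)(int M, int N, int P, int Q),
                           void (*compute_eri_batch)(const ShellQuartet *, int))
    {
        /* Enumerate quartets that are unique under the 8-way symmetry
         * (M >= N, P >= Q, (M,N) >= (P,Q)) and skip screened ones. */
        for (int M = 0; M < nshells; M++)
        for (int N = 0; N <= M; N++)
        for (int P = 0; P <= M; P++)
        for (int Q = 0; Q <= ((P == M) ? N : P); Q++) {
            if (screened(M, N, P, Q)) continue;
            int key = ((am_of(M) * NUM_AM + am_of(N)) * NUM_AM + am_of(P)) * NUM_AM + am_of(Q);
            add_to_batch(&batches[key], (ShellQuartet){M, N, P, Q});
        }
        /* Each non-empty batch becomes one long, SIMD-friendly ERI call. */
        for (int k = 0; k < NUM_AM * NUM_AM * NUM_AM * NUM_AM; k++)
            if (batches[k].count > 0)
                compute_eri_batch(batches[k].quartets, batches[k].count);
    }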

Batching ERIs in Quantum Chemistry Codes
GTFock and Simint:
GTFock: a library for distributed-memory parallel Fock matrix computation, with a demo HF-SCF program
GTFock previously used OptERD; it now uses Simint for ERI calculation
(Figures: a protein-ligand system simulated using GTFock; GTFock strong scaling on Tianhe-2)

Batching ERIs in Quantum Chemistry Codes
A compute task in GTFock (illustrated on the original slide): the shell quartets assigned to the task are grouped into batches, their ERIs are computed with Simint, and the results are accumulated into the Fock matrix.

Performance Test Setting
Test platform: NERSC Cori supercomputer
Intel Xeon Phi 7250 (KNL), 68 cores / 272 threads @ 1.4 GHz, fixed to quadrant clustering mode
16 GB MCDRAM, fixed to cache mode
Intel C/C++/Fortran compiler 2017v4
Cray Aries interconnect, Cray MPI 7.6.2, and Cray LibSci (math library)

Performance Test Setting
Test molecules: taken from a protein-ligand complex consisting of an HIV drug molecule bound to HIV-II protease
Basis sets:
aug-cc-pVTZ: a lightly contracted basis set
cc-pVDZ: a moderately contracted basis set with few high-AM shells
ANO-DZ: a heavily contracted basis set

Batching ERIs in Quantum Chemistry Codes
Why batching ERIs is important for vectorization:

Table: Average number of primitive integrals per Simint call

Basis Set      w/o Batching    w/ Batching    Ratio
aug-cc-pVTZ    2.7             40.0           14.81
cc-pVDZ        7.4             71.5           9.66
ANO-DZ         79.3            1184.8         14.94

Test molecule: protein_28
ERI batching greatly increases the SIMD loop length.

Batching ERIs in Quantum Chemistry Codes
ERI calculation speedup from ERI batching:

Table: Average ERI calculation time (in seconds) per SCF iteration

Basis Set      OptERD    Scalar Simint     Vectorized Simint    Vectorized Simint
                         w/o Batching      w/o Batching         w/ Batching
aug-cc-pVTZ    251       128               143                  39.4
cc-pVDZ        2.03      0.75              0.67                 0.33
ANO-DZ         3511      1677              393                  406

Test molecule: protein_28
Vectorized Simint with ERI batching is:
6~8x faster than OptERD
2~3x faster than Simint without ERI batching

Efficient Parallel Fock Matrix Accumulation
Suppose a shell quartet (MN|PQ) has been computed; its contributions to the Fock matrix (shown as equations on the original slide) touch six blocks of F.
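A hedged reconstruction of those contributions, using the standard closed-shell pattern (the exact factors of 2 and 1/2, and how duplicate index combinations under the 8-way symmetry are counted, are convention dependent and may differ from the original slide):

$$\begin{aligned}
F_{MN} &\mathrel{+}= 2\,(MN|PQ)\,D_{PQ}, & F_{PQ} &\mathrel{+}= 2\,(MN|PQ)\,D_{MN},\\
F_{MP} &\mathrel{-}= \tfrac{1}{2}\,(MN|PQ)\,D_{NQ}, & F_{NP} &\mathrel{-}= \tfrac{1}{2}\,(MN|PQ)\,D_{MQ},\\
F_{MQ} &\mathrel{-}= \tfrac{1}{2}\,(MN|PQ)\,D_{NP}, & F_{NQ} &\mathrel{-}= \tfrac{1}{2}\,(MN|PQ)\,D_{MP}.
\end{aligned}$$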

Efficient Parallel Fock Matrix Accumulation
After speeding up the ERI calculation, Fock matrix accumulation becomes the new performance bottleneck.
(Figure: timing breakdown; test molecule: protein_28)

Efficient Parallel Fock Matrix Accumulation
Suppose we have computed the shell quartet (MN|PQ). Number of atomic operations after moving lines 6-8 outside the 4th (innermost) loop:
AO_1 = dimM × dimN + 2 × dimM × dimN × dimP + 3 × dimM × dimN × dimP × dimQ
A hedged sketch of this accumulation pattern follows below.
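For concreteness, a sketch of what this baseline accumulation might look like (not GTFock's actual Algorithm 2; the indexing, scaling factors, and the exact lines that get hoisted may differ). Contributions that do not depend on the inner loop indices are summed in local scalars and committed with a single atomic operation at the appropriate loop level:

    /* eri holds the dimM*dimN*dimP*dimQ integrals of one quartet (MN|PQ);
     * oM..oQ are the offsets of shells M..Q in the nbf x nbf matrices F and D. */
    void accum_quartet_atomic(double *F, const double *D, const double *eri,
                              int nbf, int oM, int oN, int oP, int oQ,
                              int dimM, int dimN, int dimP, int dimQ)
    {
        for (int m = 0; m < dimM; m++)
        for (int n = 0; n < dimN; n++) {
            double f_mn = 0.0;                      /* committed once per (m,n)   */
            for (int p = 0; p < dimP; p++) {
                double f_mp = 0.0, f_np = 0.0;      /* committed once per (m,n,p) */
                for (int q = 0; q < dimQ; q++) {
                    double I = eri[((m * dimN + n) * dimP + p) * dimQ + q];
                    f_mn += 2.0 * I * D[(oP + p) * nbf + oQ + q];
                    f_mp -= 0.5 * I * D[(oN + n) * nbf + oQ + q];
                    f_np -= 0.5 * I * D[(oM + m) * nbf + oQ + q];
                    #pragma omp atomic              /* 3 atomics per (m,n,p,q)    */
                    F[(oP + p) * nbf + oQ + q] += 2.0 * I * D[(oM + m) * nbf + oN + n];
                    #pragma omp atomic
                    F[(oM + m) * nbf + oQ + q] -= 0.5 * I * D[(oN + n) * nbf + oP + p];
                    #pragma omp atomic
                    F[(oN + n) * nbf + oQ + q] -= 0.5 * I * D[(oM + m) * nbf + oP + p];
                }
                #pragma omp atomic                  /* 2 atomics per (m,n,p)      */
                F[(oM + m) * nbf + oP + p] += f_mp;
                #pragma omp atomic
                F[(oN + n) * nbf + oP + p] += f_np;
            }
            #pragma omp atomic                      /* 1 atomic per (m,n)         */
            F[(oM + m) * nbf + oN + n] += f_mn;
        }
    }

Counting the atomic updates in this sketch reproduces AO_1 = dimM × dimN + 2 × dimM × dimN × dimP + 3 × dimM × dimN × dimP × dimQ.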

Efficient Parallel Fock Matrix Accumulation
Approach 1: split Algorithm 2 into six four-fold loops, such that each four-fold loop updates only one block of the Fock matrix.
Number of atomic operations:
AO_2 = dimP × dimQ + dimN × (dimP + dimQ) + dimM × (dimN + dimP + dimQ)
Problem: the ERI array is read six times, with discontiguous memory accesses.

Efficient Parallel Fock Matrix Accumulation
Approach 2:
Important observation: each updated block is small, usually ≤ 15×15.
Each thread accumulates ERI results into a private buffer of size AO_2 and adds the thread-private buffers into the shared Fock matrix later.
Further optimization enabled by ERI batching ("lazy update"): within a batch
(MN|PQ_1): F_MN, F_MP, F_NP, F_MQ_1, F_NQ_1, F_PQ_1
(MN|PQ_2): F_MN, F_MP, F_NP, F_MQ_2, F_NQ_2, F_PQ_2
...
(MN|PQ_k): F_MN, F_MP, F_NP, F_MQ_k, F_NQ_k, F_PQ_k
the blocks F_MN, F_MP, F_NP are common to every quartet, so their contributions can be accumulated across the whole batch and written back just once. A sketch follows below.
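A condensed sketch of the lazy update over one batch (MN|PQ_1), ..., (MN|PQ_k); names, data layout, and scaling factors are illustrative, not GTFock's actual code. The F_MN, F_MP, F_NP contributions common to every quartet in the batch are summed in thread-private buffers and written back only once per batch; the per-quartet blocks F_MQ_i, F_NQ_i, F_PQ_i would be handled analogously after each quartet:

    #include <string.h>

    /* F_MN, F_MP, F_NP are thread-private buffers; eri[i] holds the integrals
     * of (MN|PQ_i); D_PQ[i], D_NQ[i], D_MQ[i] are the matching density blocks. */
    void accum_batch_lazy(double *F_MN, double *F_MP, double *F_NP,
                          int dimM, int dimN, int dimP, int nq, const int *dimQ,
                          const double *const *eri, const double *const *D_PQ,
                          const double *const *D_NQ, const double *const *D_MQ)
    {
        memset(F_MN, 0, dimM * dimN * sizeof(double));   /* zeroed once per batch */
        memset(F_MP, 0, dimM * dimP * sizeof(double));
        memset(F_NP, 0, dimN * dimP * sizeof(double));

        for (int i = 0; i < nq; i++) {                   /* loop over Q_1 .. Q_k  */
            int dQ = dimQ[i];
            for (int m = 0; m < dimM; m++)
            for (int n = 0; n < dimN; n++)
            for (int p = 0; p < dimP; p++)
            for (int q = 0; q < dQ; q++) {
                double I = eri[i][((m * dimN + n) * dimP + p) * dQ + q];
                F_MN[m * dimN + n] += 2.0 * I * D_PQ[i][p * dQ + q];
                F_MP[m * dimP + p] -= 0.5 * I * D_NQ[i][n * dQ + q];
                F_NP[n * dimP + p] -= 0.5 * I * D_MQ[i][m * dQ + q];
                /* The per-quartet blocks F_MQ_i, F_NQ_i, F_PQ_i receive
                 * analogous updates here (elided for brevity).          */
            }
            /* F_MQ_i, F_NQ_i, F_PQ_i are complete at this point and can be
             * written to the thread's Fock buffer right away.           */
        }
        /* Lazy update: F_MN, F_MP, F_NP are written back only once, here,
         * after the whole batch has been processed.                     */
    }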

Efficient Parallel Fock Matrix Accumulation

Table: Fock matrix accumulation timings (in seconds) and speedup after optimization

Basis Set      Batched & Vectorized ERI    Fock Accum. w/o Optimization    Fock Accum. w/ Optimization    Speedup
aug-cc-pVTZ    39.4                        136.3                           36.5                           3.73
cc-pVDZ        0.336                       0.432                           0.197                          2.19
ANO-DZ         405.5                       13.3                            6.15                           2.16

Test molecule: protein_28
Optimized Fock matrix accumulation: 2~3x speedup

Overall Results
Fock build using vectorized Simint with ERI batching:

Table: Fock build timings (in seconds)

Basis Set      OptERD    Scalar Simint     Vectorized Simint    Vectorized Simint
                         w/o Batching      w/o Batching         w/ Batching
aug-cc-pVTZ    345       221               250                  92.5
cc-pVDZ        2.98      1.66              1.56                 0.68
ANO-DZ         3578      1722              427                  436

Test molecule: protein_28
Fock build using vectorized Simint with ERI batching is:
4~8x faster than OptERD
2.3~2.7x faster than without ERI batching

Overall Results
Multi-thread efficiency (9 KNL nodes, 1hsg_28 molecule):
(Figure legend: circle = ERI calculation, cross = Fock accumulation, diamond = Fock build)

Software Availability
Simint: https://github.com/simint-chem
GTFock: https://github.com/gtfock-chem
IPCC at Georgia Tech: https://www.cc.gatech.edu/~echow/ipcc

Thank You!