Beyond GEMM: How Can We Make Quantum Chemistry Fast? or: Why Computer Scientists Don’t Like Chemists
Devin Matthews, BLIS Retreat, 9/25/2014

Presentation transcript:


A Motivating Example

Equation-of-Motion Coupled Cluster Theory: what is the difference in energy between the ground state (S0) and an excited state (S1) of some molecule?

“matrix” (H̄): describes the interactions in the system. The bar means it is “dressed” (i.e. tuned to a specific ground state).
“vector” (R): describes the excited state. Should be an eigenvector of H̄.
scalar (ΔE): the energy difference.
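In the simplest terms, the slide describes an eigenproblem H̄R = ΔE·R. A toy numpy sketch, with a small random matrix standing in for the (non-Hermitian) dressed Hamiltonian — in practice H̄ is a huge, structured tensor operator, not an explicit matrix:

```python
import numpy as np

# Toy stand-in for the EOM-CC eigenproblem  Hbar @ R = dE * R.
# A random matrix plays the role of the dressed Hamiltonian; the
# eigenvalues play the role of energy differences.
rng = np.random.default_rng(0)
n = 6
Hbar = rng.standard_normal((n, n))

w, R = np.linalg.eig(Hbar)   # w: "energy differences", R: "excited states"
i = np.argmin(w.real)        # pick, say, the lowest one

# Verify the eigenvector relation for the chosen state.
residual = Hbar @ R[:, i] - w[i] * R[:, i]
assert np.allclose(residual, 0.0, atol=1e-10)
```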

This is Linear Algebra, But…

Tensors! The “vector” R is actually a family of tensors R1, R2, R3, R4 (one per excitation level).

This is Linear Algebra, But…

(+ all permutations!)

…It’s Really Multi-(non)-linear Algebra

Hundreds of tensor contractions in a single “matrix-vector multiply”…
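Each such contraction can be mapped to a single GEMM by permuting and reshaping the operands so the contracted indices form the inner matrix dimension. A minimal numpy sketch (the index names and shapes are illustrative, not taken from the talk):

```python
import numpy as np

# One tensor contraction from a CC "matrix-vector multiply",
#   Z[a,b,i,j] = sum_{e,m} T[a,e,i,m] * W[e,m,b,j]
# cast as a single GEMM: the contracted indices (e,m) become the
# inner dimension of a (a*i, e*m) x (e*m, b*j) matrix product.
na, nb, ni, nj, ne, nm = 3, 4, 5, 6, 7, 8
rng = np.random.default_rng(1)
T = rng.standard_normal((na, ne, ni, nm))
W = rng.standard_normal((ne, nm, nb, nj))

Tm = T.transpose(0, 2, 1, 3).reshape(na * ni, ne * nm)  # (ai, em)
Wm = W.reshape(ne * nm, nb * nj)                        # (em, bj)
Z = (Tm @ Wm).reshape(na, ni, nb, nj).transpose(0, 2, 1, 3)

# Reference: the same contraction written directly as an einsum.
Zref = np.einsum('aeim,embj->abij', T, W)
assert np.allclose(Z, Zref)
```

Note the explicit transpose/reshape steps: this is exactly the extra memory movement that a BLAS-based workflow pays for on every contraction.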

Oh Yeah, It’s Sparse Too…

For O2: ~0.002% non-zero… ~0.39% non-zero…

Oh Yeah, It’s Sparse Too…

Non-zero fraction as more structure is exploited:
Spin-orbital: 100.0%
+ Symmetry: 0.174%
+ Spin-integration: 0.047%
+ Non-orthogonal spin-adaptation: 0.016%
+ More symmetry: ~0.002%

Oh Yeah, It’s Sparse Too…

This symmetry is very unwieldy to use and maintain when using GEMM, and the tensor may be very large, needing to be split amongst several processors or cached to disk.

(figure: a tensor over indices ijkl split into blocks A, B, E, F, …)

Blocks may be distributed to disk or other processors. Having no symmetry within a block makes using GEMM easier.
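The blocked storage idea can be sketched as a dictionary mapping block coordinates to small dense arrays: only non-zero blocks are stored, and each block is symmetry-free, so every block-pair product is an ordinary GEMM. All names and sizes here are illustrative:

```python
import numpy as np

BS = 4  # block edge length

def block_matmul(A_blocks, B_blocks, nblk):
    """C = A @ B over dict-of-blocks operands; keys are (row, col)."""
    C_blocks = {}
    for (i, k), Ablk in A_blocks.items():
        for j in range(nblk):
            Bblk = B_blocks.get((k, j))
            if Bblk is None:
                continue  # zero block: skip the GEMM entirely
            C_blocks[(i, j)] = C_blocks.get((i, j), np.zeros((BS, BS))) + Ablk @ Bblk
    return C_blocks

rng = np.random.default_rng(2)
A = {(0, 0): rng.standard_normal((BS, BS)), (1, 1): rng.standard_normal((BS, BS))}
B = {(0, 1): rng.standard_normal((BS, BS))}
C = block_matmul(A, B, nblk=2)
assert set(C) == {(0, 1)}  # only one non-zero product block survives
assert np.allclose(C[(0, 1)], A[(0, 0)] @ B[(0, 1)])
```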

Oh Yeah, It’s Sparse Too…

The final reduction from 0.016% to ~0.002% in the previous example is due to point group symmetry:

Oh Yeah, It’s Sparse Too…

(figure: the block structure of a tensor over indices ab and ij induced by point group symmetry)

Adding It All Up

1 matrix-vector multiply → 100s-1000s of tensor contractions
1 complicated tensor → 100s-1000s of simpler tensors
Point group symmetry → multiple GEMMs per contraction
Column symmetry → 10s of permutations
Solution of eigenproblem → 10s of iterations

Multiplied together: potentially billions (!!) of calls to GEMM.
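A quick back-of-the-envelope product makes the scale concrete (the per-factor counts below are midpoints chosen purely for illustration, not numbers from the talk):

```python
# Rough count of GEMM calls per EOM-CC energy, multiplying the factors
# from the slide with illustrative midpoint values.
contractions_per_matvec = 500   # 100s-1000s of tensor contractions
gemms_per_contraction   = 8     # point group symmetry blocks
permutations            = 20    # column (index-permutation) symmetry
iterations              = 30    # eigensolver iterations

total = (contractions_per_matvec * gemms_per_contraction
         * permutations * iterations)
print(total)  # 2,400,000 GEMMs per energy -- and since each contraction
              # touches many small blocks, realistic counts climb toward
              # the billions the slide mentions
```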


The Big Picture

From chemistry down to linear algebra:
“Simple” eigenproblem…
In terms of tensors…
In terms of other tensors…
With structured sparsity…
With symmetry…
With slicing (or blocking, etc.)…
With more sparsity…
In terms of matrices.

Status Quo (CFOUR)

The same stack, split into layers (Layer 4 down to Layer 1). Everything above the matrix level is written by me; only the bottom layer (matrices, parallelized with MPI + OMP) comes from someone else:

“Simple” eigenproblem…
In terms of tensors…
In terms of other tensors…
With structured sparsity…
With symmetry…
With slicing (or blocking, etc.)…
With more sparsity…
In terms of matrices.

Dealing With Chemistry: Large Scale

Approach 1: split the tensor into regular blocks distributed across nodes (Node 1 … Node 9).

Pros:
Each block has little to no symmetry/sparsity.
Blocks can be distributed in many ways.
Load balancing can be static or dynamic.

Cons:
Blocks require padding for edge cases.
Padding can be excessive for many dimensions or short edge lengths.
To avoid padding, some blocks must keep complex structure.
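The padding cost called out above grows quickly with tensor order. A small sketch, with edge length and block size chosen only for illustration:

```python
import math

# Fraction of a padded, blocked tensor that is pure padding, as a
# function of tensor order d. Each dimension of length 10 is rounded
# up to a multiple of the block size 4 (i.e. to 12), so the padded
# volume is 12**d while only 10**d elements are real data.
edge, block = 10, 4

def padding_fraction(d):
    padded_edge = math.ceil(edge / block) * block   # 12
    return 1.0 - (edge ** d) / (padded_edge ** d)

for d in (2, 4, 6, 8):
    print(d, round(padding_fraction(d), 3))
# For order-8 tensors (e.g. quadruple-excitation amplitudes), well over
# half the stored elements are padding at these sizes.
```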

Dealing With Chemistry: Large Scale

Approach 2: distribute elements cyclically across nodes (Node 1 … Node 9).

Pros:
Load balancing is automatic.
Communication is regular.
Little to no padding is needed.
Can be composed with blocking.

Cons:
Complex structure is retained at all levels.
Communication and local computation need to take this structure into account.
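A cyclic distribution can be sketched as a simple owner function over a node grid; the 3×3 grid below mirrors the slide's nine nodes, but the mapping itself is a generic illustration:

```python
# Cyclic distribution over a 3x3 node grid: element (i, j) of a matrix
# (or of a pair of tensor dimensions) is owned by node (i % 3, j % 3).
# Every node gets an equal share regardless of the global shape, which
# is what makes load balancing automatic.
GRID = 3

def owner(i, j):
    return (i % GRID) * GRID + (j % GRID) + 1   # node numbers 1..9

counts = {}
for i in range(12):
    for j in range(12):
        n = owner(i, j)
        counts[n] = counts.get(n, 0) + 1
print(sorted(counts.values()))  # nine equal shares of 16 elements each
```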

Dealing With Chemistry: Small Scale

(figure: a small contraction over index pairs ai, ck, em computed two ways — “The Old Way” with BLAS, where tensors are explicitly permuted into matrix layout before GEMM, at the cost of extra memory movement; and “The New Way?” with BLIS, where the permutation can be folded into BLIS’s packing step)

Dealing With Chemistry: Small Scale

AXPY!

(figure: a contraction Z ← W · R over indices kl, mn, abcd, with each small block update performed as an AXPY using BLIS building blocks)
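The AXPY view of a block update — scale a (permuted) source block and accumulate it into the result, rather than repacking it for a standalone GEMM — can be sketched as follows (shapes and the particular permutation are illustrative):

```python
import numpy as np

# AXPY-style accumulation: instead of copying a block into GEMM layout,
# scale it and add it directly into the result,
#   Z += alpha * permute(W_block)
rng = np.random.default_rng(3)
Z = np.zeros((4, 4, 4, 4))
W = rng.standard_normal((4, 4, 4, 4))
alpha = 0.5

# One update: accumulate the (2,3,0,1)-permuted block of W into Z.
Z += alpha * W.transpose(2, 3, 0, 1)
assert np.allclose(Z, alpha * W.transpose(2, 3, 0, 1))
```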

Flexibility Through Interfaces

(figure: a stack of composable tensor interfaces — basic tensor functionality, index permutation symmetry, point group symmetry, blocking/packing, distributed storage — with operator layers on top: spin-orbital operators providing spin-integration or spin-adaptation, and similarity-transformed operators providing commutator expansion, factorization, and operator resolution)

Summary

Chemistry is hard. A fast GEMM implementation is nice, but doesn’t go far enough.

Complex structure can be dealt with:
– by breaking the problem into simple blocks,
– by incorporating the structure into communication and computation,
– by relating a complex object to a simpler one (a matrix) bit by bit.

Layered and composable interfaces are important:
– Implementations written at a “high level” can use “low level” interfaces through intermediate ones.
– Adapters can go from one well-defined interface to another.

Thanks!

BLIS: Field Van Zee, Tyler Smith, many others…
CTF/AQ: Edgar Solomonik, Jeff Hammond
Tensormental: Martin Schatz, Bryan Marker
Tensor packing: Woody Austin, Martin Schatz, Robert van de Geijn
John Stanton and the CFOUR developers