Compiler Support for Optimizing Tensor Contraction Expressions in Quantum Chemistry Computations *

Gerald Baumgartner, Ohio State University
Daniel Cociorva, Ohio State University
Chi-Chung Lam, Ohio State University
P. Sadayappan, Ohio State University
J. Ramanujam, Louisiana State University

* Supported by the National Science Foundation (NSF) through the ITR program

Introduction/Motivation

Accurate computational models of electronic structure are very time consuming:
– Time/effort to develop code, due to complexity of models for multiple-electron interactions
– Execution time of code, even on high-end systems

Goal of multi-disciplinary project: develop a program synthesis tool for quantum chemists
– Reduce model development time from weeks/months to hours/days
– Improve execution time of quantum chemistry code

Focus on a particular computational form of great importance in accurate modeling of electronic structure: tensor contractions

The Computational Context

Accurate electronic structure models often involve tensor contractions: multi-dimensional summations over products of large arrays (e.g. the contraction sketched below).

Many compile-time optimization issues:
– Operation minimization via algebraic transformations
– Memory minimization via loop fusion and array contraction
– Space-time trade-off: Store and re-use versus redundant re-computation of temporary intermediate quantities
– Data Locality Optimization: Tile-size selection and partial array expansion for tiling imperfectly nested loops
– Data and Computation Partitioning: Minimize communication overhead
– Semantic information from the high-level expression of the computation enables global optimizations across multiple imperfect loop nests
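As a concrete example of what one such contraction looks like as code, here is a minimal C sketch of a single binary contraction, T1[b,c,d,f] = sum over e,l of B[b,e,f,l]*D[c,d,e,l] (the same contraction that appears in the running example on later slides). This is an illustrative hand-written sketch, not output of the synthesis tool; the function name contract_T1, the index range N, and the flattening macro IDX4 are assumptions of the sketch.

/* Minimal illustrative sketch (not generated TCE output): one tensor
 * contraction T1[b,c,d,f] = sum_{e,l} B[b,e,f,l] * D[c,d,e,l],
 * written as a plain loop nest over a common index range N. */
#include <stddef.h>

#define N 40  /* assumed index range, for illustration only */

/* flatten a 4-index tensor into a linear array */
#define IDX4(i, j, k, l) ((((size_t)(i) * N + (j)) * N + (k)) * N + (l))

void contract_T1(const double *B, const double *D, double *T1)
{
    for (int b = 0; b < N; b++)
      for (int c = 0; c < N; c++)
        for (int d = 0; d < N; d++)
          for (int f = 0; f < N; f++) {
            double sum = 0.0;
            for (int e = 0; e < N; e++)
              for (int l = 0; l < N; l++)
                sum += B[IDX4(b, e, f, l)] * D[IDX4(c, d, e, l)];
            T1[IDX4(b, c, d, f)] = sum;
          }
}

Evaluating this single contraction costs about 2N^6 floating-point operations, which is the per-contraction cost quoted on the operation-minimization slides.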

hbar[a,b,i,j] == sum[f[b,c]*t[i,j,a,c],{c}] -sum[f[k,c]*t[k,b]*t[i,j,a,c],{k,c}] +sum[f[a,c]*t[i,j,c,b],{c}] -sum[f[k,c]*t[k,a]*t[i,j,c,b],{k,c}] - sum[f[k,j]*t[i,k,a,b],{k}] -sum[f[k,c]*t[j,c]*t[i,k,a,b],{k,c}] -sum[f[k,i]*t[j,k,b,a],{k}] -sum[f[k,c]*t[i,c]*t[j,k,b,a],{k,c}] +sum[t[i,c]*t[j,d]*v[a,b,c,d],{c,d}] +sum[t[i,j,c,d]*v[a,b,c,d],{c,d}] +sum[t[j,c]*v[a,b,i,c],{c}] -sum[t[k,b]*v[a,k,i,j],{k}] +sum[t[i,c]*v[b,a,j,c],{c}] -sum[t[k,a]*v[b,k,j,i],{k}] -sum[t[k,d]*t[i,j,c,b]*v[k,a,c,d],{k,c,d}] -sum[t[i,c]*t[j,k,b,d]*v[k,a,c,d],{k,c,d}] - sum[t[j,c]*t[k,b]*v[k,a,c,i],{k,c}] +2*sum[t[j,k,b,c]*v[k,a,c,i],{k,c}] -sum[t[j,k,c,b]*v[k,a,c,i],{k,c}] - sum[t[i,c]*t[j,d]*t[k,b]*v[k,a,d,c],{k,c,d}] +2*sum[t[k,d]*t[i,j,c,b]*v[k,a,d,c],{k,c,d}] -sum[t[k,b]*t[i,j,c,d]*v[k,a,d,c],{k,c,d}] - sum[t[j,d]*t[i,k,c,b]*v[k,a,d,c],{k,c,d}] +2*sum[t[i,c]*t[j,k,b,d]*v[k,a,d,c],{k,c,d}] -sum[t[i,c]*t[j,k,d,b]*v[k,a,d,c],{k,c,d}] - sum[t[j,k,b,c]*v[k,a,i,c],{k,c}] -sum[t[i,c]*t[k,b]*v[k,a,j,c],{k,c}] -sum[t[i,k,c,b]*v[k,a,j,c],{k,c}] -sum[t[i,c]*t[j,d]*t[k,a]*v[k,b,c,d],{k,c,d}] -sum[t[k,d]*t[i,j,a,c]*v[k,b,c,d],{k,c,d}] -sum[t[k,a]*t[i,j,c,d]*v[k,b,c,d],{k,c,d}] +2*sum[t[j,d]*t[i,k,a,c]*v[k,b,c,d],{k,c,d}] - sum[t[j,d]*t[i,k,c,a]*v[k,b,c,d],{k,c,d}] -sum[t[i,c]*t[j,k,d,a]*v[k,b,c,d],{k,c,d}] -sum[t[i,c]*t[k,a]*v[k,b,c,j],{k,c}] +2*sum[t[i,k,a,c]*v[k,b,c,j],{k,c}] -sum[t[i,k,c,a]*v[k,b,c,j],{k,c}] +2*sum[t[k,d]*t[i,j,a,c]*v[k,b,d,c],{k,c,d}] - sum[t[j,d]*t[i,k,a,c]*v[k,b,d,c],{k,c,d}] -sum[t[j,c]*t[k,a]*v[k,b,i,c],{k,c}] -sum[t[j,k,c,a]*v[k,b,i,c],{k,c}] -sum[t[i,k,a,c]*v[k,b,j,c],{k,c}] +sum[t[i,c]*t[j,d]*t[k,a]*t[l,b]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[k,l,c,d],{k,l,c,d}] - 2*sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[k,l,c,d],{k,l,c,d}] +sum[t[k,a]*t[l,b]*t[i,j,c,d]*v[k,l,c,d],{k,l,c,d}] - 2*sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,c,d],{k,l,c,d}] +sum[t[j,d]*t[l,b]*t[i,k,c,a]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,c]*t[l,b]*t[j,k,d,a]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,c,d],{k,l,c,d}] +4*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,k,c,a]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,c]*t[j,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,j,c,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,j,c,b]*t[k,l,a,d]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,j,a,c]*t[k,l,b,d]*v[k,l,c,d],{k,l,c,d}] +sum[t[j,c]*t[k,b]*t[l,a]*v[k,l,c,i],{k,l,c}] +sum[t[l,c]*t[j,k,b,a]*v[k,l,c,i],{k,l,c}] -2*sum[t[l,a]*t[j,k,b,c]*v[k,l,c,i],{k,l,c}] +sum[t[l,a]*t[j,k,c,b]*v[k,l,c,i],{k,l,c}] -2*sum[t[k,c]*t[j,l,b,a]*v[k,l,c,i],{k,l,c}] +sum[t[k,a]*t[j,l,b,c]*v[k,l,c,i],{k,l,c}] +sum[t[k,b]*t[j,l,c,a]*v[k,l,c,i],{k,l,c}] +sum[t[j,c]*t[l,k,a,b]*v[k,l,c,i],{k,l,c}] +sum[t[i,c]*t[k,a]*t[l,b]*v[k,l,c,j],{k,l,c}] +sum[t[l,c]*t[i,k,a,b]*v[k,l,c,j],{k,l,c}] -2*sum[t[l,b]*t[i,k,a,c]*v[k,l,c,j],{k,l,c}] +sum[t[l,b]*t[i,k,c,a]*v[k,l,c,j],{k,l,c}] +sum[t[i,c]*t[k,l,a,b]*v[k,l,c,j],{k,l,c}] +sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,d,c],{k,l,c,d}] +sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,d,c],{k,l,c,d}] +sum[t[j,d]*t[l,a]*t[i,k,c,b]*v[k,l,d,c],{k,l,c,d}] -2*sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,d,c],{k,l,c,d}] -2*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}] +sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}] 
+sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,d,c],{k,l,c,d}] +sum[t[i,k,c,b]*t[j,l,d,a]*v[k,l,d,c],{k,l,c,d}] +sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,d,c],{k,l,c,d}] +sum[t[k,a]*t[l,b]*v[k,l,i,j],{k,l}] +sum[t[k,l,a,b]*v[k,l,i,j],{k,l}] +sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[l,k,c,d],{k,l,c,d}] +sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[l,k,c,d],{k,l,c,d}] +sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[l,k,c,d],{k,l,c,d}] -2*sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[l,k,c,d],{k,l,c,d}] +sum[t[i,c]*t[l,a]*t[j,k,d,b]*v[l,k,c,d],{k,l,c,d}] +sum[t[i,j,c,b]*t[k,l,a,d]*v[l,k,c,d],{k,l,c,d}] +sum[t[i,j,a,c]*t[k,l,b,d]*v[l,k,c,d],{k,l,c,d}] - 2*sum[t[l,c]*t[i,k,a,b]*v[l,k,c,j],{k,l,c}] +sum[t[l,b]*t[i,k,a,c]*v[l,k,c,j],{k,l,c}] +sum[t[l,a]*t[i,k,c,b]*v[l,k,c,j],{k,l,c}] +v[a,b,i,j]

Realistic Quantum Chemistry Example

Operation Minimization

[Figure; the only legible label is "2N^6 Ops".]

Operation Minimization

Direct evaluation requires 4N^10 operations if indices a–l have range N.
Using associative, commutative and distributive laws, constant sub-expressions can be factored out to reduce operations.
The optimal formula sequence requires only 6N^6 operations (2N^6 per binary contraction, as labeled in the figure).
But additional space is required for the temporary intermediates T1 and T2; often this is much larger than the input/output arrays.
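The expression these operation counts refer to appears only as a figure in the original slides; the following LaTeX reconstruction is inferred from the unfused loop code on the next slide.

% Reconstructed from the unfused loop code on the following slide.
% Direct evaluation of the full summation costs on the order of 4N^{10} operations:
S_{abij} \;=\; \sum_{c,d,e,f,k,l} A_{acik}\, B_{befl}\, C_{dfjk}\, D_{cdel}
% Factored into a sequence of binary contractions, each costing 2N^6 operations,
% for a total of 6N^6:
T1_{bcdf} \;=\; \sum_{e,l} B_{befl}\, D_{cdel}, \qquad
T2_{bcjk} \;=\; \sum_{d,f} T1_{bcdf}\, C_{dfjk}, \qquad
S_{abij}  \;=\; \sum_{c,k} T2_{bcjk}\, A_{acik}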

Loop Fusion for Memory Reduction

Formula sequence: (shown graphically on the original slide; reconstructed above)

Unfused code:
  T1 = 0; T2 = 0; S = 0
  for b, c, d, e, f, l
    T1_bcdf += B_befl D_cdel
  for b, c, d, f, j, k
    T2_bcjk += T1_bcdf C_dfjk
  for a, b, c, i, j, k
    S_abij += T2_bcjk A_acik

Fused code:
  S = 0
  for b, c
    T1f = 0; T2f = 0
    for d, f
      for e, l
        T1f += B_befl D_cdel
      for j, k
        T2f_jk += T1f C_dfjk
    for a, i, j, k
      S_abij += T2f_jk A_acik
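To make the memory effect of fusion concrete, here is a minimal C sketch of the fused version above. The index range N, the IDX4 macro, and the placement of the T1f reset inside the d,f loop (needed for the scalar accumulation to be correct) are assumptions of this sketch, not part of the original slide.

/* Illustrative sketch of the fused code: the 4-D intermediates T1 and T2
 * shrink to a scalar (T1f) and a 2-D buffer (T2f[N][N]) that live only
 * inside the b,c loops.  Assumed common index range N. */
#include <string.h>

#define N 40
#define IDX4(i, j, k, l) ((((size_t)(i) * N + (j)) * N + (k)) * N + (l))

void fused(const double *A, const double *B, const double *C,
           const double *D, double *S)
{
    static double T2f[N][N];                 /* O(N^2) instead of O(N^4) */
    memset(S, 0, (size_t)N * N * N * N * sizeof(double));

    for (int b = 0; b < N; b++)
      for (int c = 0; c < N; c++) {
        memset(T2f, 0, sizeof T2f);
        for (int d = 0; d < N; d++)
          for (int f = 0; f < N; f++) {
            double T1f = 0.0;                /* scalar instead of O(N^4) */
            for (int e = 0; e < N; e++)
              for (int l = 0; l < N; l++)
                T1f += B[IDX4(b, e, f, l)] * D[IDX4(c, d, e, l)];
            for (int j = 0; j < N; j++)
              for (int k = 0; k < N; k++)
                T2f[j][k] += T1f * C[IDX4(d, f, j, k)];
          }
        for (int a = 0; a < N; a++)
          for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
              for (int k = 0; k < N; k++)
                S[IDX4(a, b, i, j)] += T2f[j][k] * A[IDX4(a, c, i, k)];
      }
}

The operation count is unchanged (6N^6), but the intermediate storage drops from two N^4 arrays to one scalar plus one N^2 buffer.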

Tensor Contraction Engine

[Pipeline; stages and intermediate forms as labeled on the slide:]
  Tensor contraction expressions
    -> Operation Minimization
  Sequence of loop nests (expression tree/DAG)
    -> Memory Minimization       (storage requirements within limits)
    -> Space-Time Trade-Offs     (storage required exceeds limits)
  Imperfectly nested loops
    -> Communication Minimization
  Imperfectly nested parallel loops
    -> Data Locality Optimization
  Partitioned, tiled, parallel Fortran/C code

Example to Illustrate Space-Time Trade-Offs

  for a, e, c, f
    for i, j
      X_aecf += T_ijae T_ijcf
  for c, e, b, k
    T1_cebk = f1(c, e, b, k)
  for a, f, b, k
    T2_afbk = f2(a, f, b, k)
  for c, e, a, f
    for b, k
      Y_ceaf += T1_cebk T2_afbk
  for c, e, a, f
    E += X_aecf Y_ceaf

  array   space    time
  X       V^4      V^4 O^2
  T1      V^3 O    C_f1 V^3 O
  T2      V^3 O    C_f2 V^3 O
  Y       V^4      V^5 O
  E       1        V^4

  a..f: range V;  i..k: range O
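A quick sanity check on the space column (an editorial note; the 8-byte element size is an assumption), using the index ranges V = 3000 and O = 100 given on the next slide:

% The unfused intermediate X alone needs
|X| = V^4 = 3000^4 = 8.1 \times 10^{13}\ \text{elements}
    \approx 6.5 \times 10^{14}\ \text{bytes} \approx 650\ \text{TB (8-byte doubles)}
% which is far beyond available memory; this motivates the memory-minimal and
% space-time trade-off variants on the following slides.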

Memory-Minimal Form

  for a, f, b, k
    T2_afbk = f2(a, f, b, k)
  for c, e
    for b, k
      T1_bk = f1(c, e, b, k)
    for a, f
      for i, j
        X += T_ijae T_ijcf
      for b, k
        Y += T1_bk T2_afbk
      E += X Y

  array   space    time
  X       1        V^4 O^2
  T1      V O      C_f1 V^3 O
  T2      V^3 O    C_f2 V^3 O
  Y       1        V^5 O
  E       1        V^4

  a..f: range V = 3000;  i..k: range O = 100

Redundant Computation to Allow Full Fusion

  for a, e, c, f
    for i, j
      X += T_ijae T_ijcf
    for b, k
      T1 = f1(c, e, b, k)
      T2 = f2(a, f, b, k)
      Y += T1 T2
    E += X Y

  array   space   time
  X       1       V^4 O^2
  T1      1       C_f1 V^5 O
  T2      1       C_f2 V^5 O
  Y       1       V^5 O
  E       1       V^4
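The price of full fusion can be read off by comparing this table with the previous one (an editorial note, derived directly from the two tables):

% Recomputation factor for f1 when moving from the memory-minimal form
% (time C_{f1} V^3 O) to the fully fused form (time C_{f1} V^5 O):
\frac{C_{f1}\,V^5 O}{C_{f1}\,V^3 O} \;=\; V^2
% f1(c,e,b,k) is re-evaluated once per (a,f) pair, and f2(a,f,b,k) likewise
% once per (c,e) pair, while every intermediate shrinks to a scalar.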

Tiling for Reducing Recomputation

  for a_t, e_t, c_t, f_t
    for a, e, c, f
      for i, j
        X_aecf += T_ijae T_ijcf
    for b, k
      for c, e
        T1_ce = f1(c, e, b, k)
      for a, f
        T2_af = f2(a, f, b, k)
      for c, e, a, f
        Y_ceaf += T1_ce T2_af
    for c, e, a, f
      E += X_aecf Y_ceaf

  array   space   time
  X       B^4     V^4 O^2
  T1      B^2     C_f1 (V/B)^2 V^3 O
  T2      B^2     C_f2 (V/B)^2 V^3 O
  Y       B^4     V^5 O
  E       1       V^4
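The tile size B is the knob that trades space against recomputation (an editorial summary derived from the table above):

% Space for X and Y grows as B^4, while the recomputation cost of f1 falls:
\text{time}(f1) = C_{f1}\,(V/B)^2\,V^3 O
% B = 1 recovers the fully fused form (scalar intermediates, V^2-fold recomputation);
% B = V removes the recomputation entirely (cost C_{f1} V^3 O) but requires V^4
% storage for X and Y, as in the first, unfused version of the example.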

The Fusion Graph

[Figure: fusion graph for the expression sum_j A(i,j) * B(j,k); the nodes are labeled with the loop indices i j, j k, and j.]

Making Fused Loops Explicit

[Figure: the same fusion graph for sum_j A(i,j) * B(j,k), with the fused loops drawn explicitly.]
  i: range 10,  j: range 10,  k: range 100

Adding Recomputation Loops

[Figure: fusion graph for sum_j A(i,j) * B(j,k) with an added recomputation loop, annotated with a solution set { (, 3, 9000), ... } of (fusion, mem cost, rec cost) tuples; the fusion component is a drawing and not recoverable here.]
  i: range 10,  j: range 10,  k: range 100

Tiling Recomputation Loops

[Figure: fusion graph with the recomputation loop split into a tiling loop i_t and an intra-tile loop i, for sum_j A(i_t,i,j) * B(j,k).]
  i: range 10,  j: range 10,  k: range 100

Space-Time Trade-Off Algorithm

For e1(i,j,k) * e2(i,j), consider recomputation of e2 in the k loop.

For each expression tree node, in a bottom-up traversal:
– Construct a set of solutions { (fusion, mem cost, rec cost), ... }
– Prune inferior solutions (a sketch of this pruning step follows below)

For all solutions for the root of the expression tree:
– Split recomputation loops into tiling/intra-tile loops
– Move intra-tile loops inside tiling loops where needed
– Search for tile sizes that minimize recomputation cost
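The per-node solution sets can be pruned with a simple dominance test. The following C sketch is an editorial illustration, not the authors' implementation; the Solution struct, the fusion_id placeholder, and the quadratic scan are assumptions.

/* Sketch: prune inferior (dominated) solutions from a node's solution set.
 * A solution is a (fusion choice, memory cost, recomputation cost) tuple;
 * it is inferior if some other solution is no worse in both costs and
 * strictly better in at least one. */
#include <stddef.h>

typedef struct {
    int    fusion_id;   /* stand-in for the chosen fusion configuration */
    double mem;         /* memory cost of this choice */
    double rec;         /* recomputation cost of this choice */
} Solution;

/* Keep only Pareto-optimal (non-dominated) solutions; returns the new count. */
size_t prune_inferior(Solution *sols, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        int inferior = 0;
        for (size_t j = 0; j < n; j++) {
            if (j == i) continue;
            /* sols[j] dominates sols[i] if it is no worse in both costs
               and strictly better in at least one */
            if (sols[j].mem <= sols[i].mem && sols[j].rec <= sols[i].rec &&
                (sols[j].mem < sols[i].mem || sols[j].rec < sols[i].rec)) {
                inferior = 1;
                break;
            }
        }
        if (!inferior)
            sols[kept++] = sols[i];
    }
    return kept;
}

With memory cost and recomputation cost as the two objectives, only the Pareto-optimal tuples need to be carried up the expression tree, which keeps the bottom-up search over fusion configurations manageable.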

Conclusions and Status

Computational domain with exploitable structure.

Several performance optimization issues being addressed:
– algebraic transformations for minimizing # operations
– loop fusion
– loop tiling (with partial array expansion)
– space-time trade-off
– data partitioning

First prototype of synthesis system by end of 2002.

Future extensions:
– Sparse arrays, symmetry, common subexpressions