Monolithic Compiler Experiments Using C++ Expression Templates*

Lenore R. Mullin**, Edward Rutledge, Robert Bond
MIT Lincoln Laboratory
HPEC, September 2002, Lexington, MA

* This work is sponsored by the Department of Defense, under Air Force Contract F C. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the Department of Defense.

** Dr. Mullin participated in this work while on sabbatical leave from the Dept. of Computer Science, University at Albany, State University of New York, Albany, NY.

Outline

- Overview
  - Motivation
  - The Psi Calculus
  - Expression Templates
- Implementing the Psi Calculus with Expression Templates
- Experiments
- Future Work and Conclusions

Motivation: The Mapping Problem

An expression such as y = conv(x) combines intricate math with intricate memory accesses (indexing), and must be mapped onto the memory hierarchy (main memory, L2 cache, L1 cache) and onto parallel resources.

- Math and indexing operations in the same expression
- Framework for design-space search
  - Rigorous and provably correct
  - Extensible to complex architectures

Approach: a Mathematics of Arrays. Example: "raising" array dimensionality to model levels of the memory hierarchy and parallelism.

Basic Idea: PETE-Style Array Operations

Theory: combining Expression Templates and Psi Calculus yields an optimal implementation of array operations.

- Expression Templates: efficient high-level container operations in C++.
- Psi Calculus: array operations that compose efficiently, with the minimum number of memory reads/writes.

Benefits: theory based, high-level API, efficient.

Psi Calculus [1] Key Concepts

- Denotational Normal Form (DNF): the minimum number of memory reads/writes for a given array expression, independent of data storage order. Psi Calculus rules are applied mechanically to produce the DNF, which is optimal in terms of memory accesses.
- Gamma function: specifies the data storage order. The Gamma function is applied to the DNF to produce the ONF.
- Operational Normal Form (ONF): like the DNF, but takes data storage into account. For 1-d expressions, it consists of one or more loops of the form x[i] = y[stride*i + offset], l ≤ i < u, which is easily translated into an efficient implementation.

[1] L. M. R. Mullin. A Mathematics of Arrays. PhD thesis, Syracuse University, December 1988.
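As a concrete reading of the ONF loop form, here is a sketch of the generic 1-d loop in C++ (all names are illustrative, not from the original code):

```cpp
// Sketch: the general shape of a 1-d ONF loop,
//   x[i] = y[stride*i + offset],  l <= i < u.
void onf_loop(double* x, const double* y,
              int stride, int offset, int l, int u) {
    for (int i = l; i < u; ++i)
        x[i] = y[stride * i + offset];
}
```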

Some Psi Calculus Operations

- take(Vector A, int N): forms a Vector of the first N elements of A.
- drop(Vector A, int N): forms a Vector of the last (A.size - N) elements of A.
- rotate(Vector A, int N): forms a Vector of the last N elements of A concatenated with the remaining elements of A.
- cat(Vector A, Vector B): forms a Vector that is the concatenation of A and B.
- unaryOmega(Operation Op, dimension D, Array A): applies unary operator Op to the D-dimensional components of A (like a forall loop).
- binaryOmega(Operation Op, dimension Adim, Array A, dimension Bdim, Array B): applies binary operator Op to the Adim-dimensional components of A and the Bdim-dimensional components of B (like a forall loop).
- reshape(Vector A, Vector B): reshapes B into an array having A.size dimensions, where the length in each dimension is given by the corresponding element of A.
- iota(int N): forms a vector of size N containing the values 0 .. N-1.

The operations fall into four categories: index permutation, operators, restructuring, and index generation.
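For reference, a minimal eager (copying) sketch of several of these operations on plain vectors; the expression-template implementation described below never materializes these intermediates, so this shows the semantics only, not the actual implementation:

```cpp
#include <vector>

typedef std::vector<double> Vec;

// take(A, N): the first N elements of A
Vec take(const Vec& a, int n) { return Vec(a.begin(), a.begin() + n); }

// drop(A, N): the last (A.size - N) elements of A
Vec drop(const Vec& a, int n) { return Vec(a.begin() + n, a.end()); }

// rotate(A, N): the last N elements of A, then the rest
Vec rotate(const Vec& a, int n) {
    Vec r(a.end() - n, a.end());
    r.insert(r.end(), a.begin(), a.end() - n);
    return r;
}

// cat(A, B): the concatenation of A and B
Vec cat(const Vec& a, const Vec& b) {
    Vec r(a);
    r.insert(r.end(), b.begin(), b.end());
    return r;
}

// iota(N): the vector <0 1 ... N-1>
Vec iota(int n) {
    Vec r(n);
    for (int i = 0; i < n; ++i) r[i] = i;
    return r;
}
```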

Convolution: Psi Calculus Decomposition

Definition of y = conv(h, x):

y[n] = Σ(k = 0 .. M-1) h[k] * x'[n + M-1 - k]

where x has N elements, h has M elements, 0 ≤ n < N+M-1, and x' is x padded by M-1 zeros on either end.

Algorithm and Psi Calculus decomposition (Psi Calculus reduces this to the DNF with the minimum number of memory accesses):

1. Form x': x' = cat(reshape(<M-1>, <0>), cat(x, reshape(<M-1>, <0>)))
2. Rotate x' (N+M-1) times: x'_rot = binaryOmega(rotate, 0, iota(N+M-1), 1, x')
3. Take the "interesting" part of x'_rot: x'_final = binaryOmega(take, 0, reshape(<N+M-1>, <M>), 1, x'_rot)
4. Multiply: Prod = binaryOmega(*, 1, h, 1, x'_final)
5. Sum: Y = unaryOmega(sum, 1, Prod)
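A direct loop translation of this definition, for reference (a sketch; the point of the decomposition above is precisely to avoid writing such problem-specific code by hand):

```cpp
#include <vector>

// Reference convolution: y[n] = sum_{k=0}^{M-1} h[k] * x'[n + M-1 - k],
// with x' equal to x padded by M-1 zeros on either end.
std::vector<double> conv(const std::vector<double>& h,
                         const std::vector<double>& x) {
    const int M = (int)h.size(), N = (int)x.size();
    std::vector<double> xp(N + 2 * (M - 1), 0.0);   // x', the padded x
    for (int i = 0; i < N; ++i) xp[M - 1 + i] = x[i];
    std::vector<double> y(N + M - 1, 0.0);
    for (int n = 0; n < N + M - 1; ++n)
        for (int k = 0; k < M; ++k)
            y[n] += h[k] * xp[n + M - 1 - k];
    return y;
}
```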

Typical C++ Operator Overloading

Example: A = B + C (vector add)

1. Pass B and C references to operator+.
2. Create a temporary result vector.
3. Calculate results, store them in the temporary.
4. Return a copy of the temporary.
5. Pass the result reference to operator=.
6. Perform the assignment.

Two temporary vectors are created.

Additional memory use:
- Static memory
- Dynamic memory (also affects execution time)
- Cache misses / page faults

Additional execution time:
- Time to create a new vector
- Time to create a copy of a vector
- Time to destruct both temporaries
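A sketch of this "typical" overloading pattern (illustrative, not the paper's code): operator+ materializes a temporary and returns a copy, so every composed operation pays for allocation, traversal, and destruction:

```cpp
#include <cstddef>
#include <vector>

struct Vector {
    std::vector<double> data;
    explicit Vector(std::size_t n = 0) : data(n) {}
};

Vector operator+(const Vector& b, const Vector& c) {
    Vector temp(b.data.size());               // 2. temporary result vector
    for (std::size_t i = 0; i < b.data.size(); ++i)
        temp.data[i] = b.data[i] + c.data[i]; // 3. compute into temporary
    return temp;                              // 4. copy of the temporary
}                                             //    (pre-RVO compilers copy)

// A = B + C then invokes assignment: steps 5 and 6.
```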

C++ Expression Templates and PETE

With expression templates, parse trees, not vectors, are created. For A = B + C, the expression type is a parse tree along the lines of BinaryNode<OpAdd, Reference<Vector>, Reference<Vector>>:

1. Pass B and C references to operator+.
2. Create the expression parse tree.
3. Return the expression parse tree.
4. Pass the expression tree reference to operator=.
5. Calculate the result and perform the assignment.

Reduced memory use: the parse tree contains only references; better cache use.
Reduced execution time: loop-fusion-style optimization; compile-time expression tree manipulation.

PETE, the Portable Expression Template Engine, is available from the Advanced Computing Laboratory at Los Alamos National Laboratory. PETE provides:
- Expression template capability
- Facilities to help navigate and evaluate parse trees
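A minimal, self-contained sketch of the technique (in the spirit of PETE, not PETE's actual API): operator+ returns a lightweight tree node holding references, and all the work happens in a single fused loop inside operator=:

```cpp
#include <cstddef>
#include <vector>

// Parse-tree node for "+": holds references, evaluates element i on demand.
template <typename L, typename R>
struct AddNode {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n) : data(n) {}
    double operator[](std::size_t i) const { return data[i]; }

    // Assignment walks the tree once: one loop, no temporary vectors.
    template <typename Expr>
    Vec& operator=(const Expr& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Unconstrained for brevity; a real library restricts this overload.
template <typename L, typename R>
AddNode<L, R> operator+(const L& l, const R& r) { return AddNode<L, R>{l, r}; }
```

With these definitions, a = b + c builds an AddNode<Vec, Vec> (the parse tree) at compile time and evaluates it in a single loop during assignment; a = b + c + d nests nodes instead of creating temporaries.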

Outline

- Overview
  - Motivation
  - The Psi Calculus
  - Expression Templates
- Implementing the Psi Calculus with Expression Templates
- Experiments
- Future Work and Conclusions

Implementing Psi Calculus with Expression Templates

Example: A = take(4, drop(3, rev(B))), where B is a 10-element vector.

1. Form the expression tree: take(4, drop(3, rev(B))).

2. Add size information, working up from the leaf:
   - rev(B): size = 10
   - drop(3, rev(B)): size = 7
   - take(4, drop(3, rev(B))): size = 4

3. Apply Psi Reduction rules, working up from the leaf:
   - B:            size = 10, A[i] = B[i]
   - rev(B):       size = 10, A[i] = B[-i + B.size - 1] = B[-i + 9]
   - drop(3, ...): size = 7,  A[i] = B[-(i + 3) + 9] = B[-i + 6]
   - take(4, ...): size = 4,  A[i] = B[-i + 6]

   Recall: Psi Reduction for 1-d arrays always yields one or more expressions of the form x[i] = y[stride*i + offset], l ≤ i < u.

4. Rewrite as sub-expressions with iterators at the leaves and loop-bounds information at the root: size = 4, iterator offset = 6, stride = -1.
   - Iterators are used for efficiency, rather than recalculating indices for each i.
   - One "for" loop evaluates each sub-expression; a sketch of the resulting loop follows.
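A sketch of the single loop that step 4 reduces to for this example: rev, drop, and take are fused into one stride/offset traversal rather than three separate passes:

```cpp
// A = take(4, drop(3, rev(B))) after Psi Reduction:
// size = 4, stride = -1, offset = 6.
void eval_take_drop_rev(double* A, const double* B) {
    const int size = 4, stride = -1, offset = 6;
    for (int i = 0; i < size; ++i)
        A[i] = B[stride * i + offset];   // A[i] = B[-i + 6]
}
```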

Outline

- Overview
  - Motivation
  - The Psi Calculus
  - Expression Templates
- Implementing the Psi Calculus with Expression Templates
- Experiments
- Future Work and Conclusions

Experiments

Tests of the ability to compose operations:
- A = rev(B)
- A = rev(take(N, drop(M, rev(B))))
- A = cat(B+C, D+E)
- Convolution (projected)

(Chart: execution time normalized to the loop implementation; vector size = 1024.)

Results:
- The loop implementation achieves good performance, but is problem specific and low level.
- The traditional C++ operator implementation is general and high level, but performs poorly when composing many operations.
- PETE/Psi array operators perform almost as well as the loop implementation, compose well, are general, and are high level.

Experimental Platform and Method

Hardware:
- DY4 CHAMP-AV board, containing 4 MPC7400s and 1 MPC8420
- MPC7400 (G4): 450 MHz, 32 KB L1 data cache, 2 MB L2 cache, 64 MB memory per processor
- Only one G4 processor used

Software:
- VxWorks 5.2 real-time OS
- GCC (non-official release: GCC with patches for VxWorks)
- Optimization flags: -O3 -funroll-loops -fstrict-aliasing

Method:
- Run many iterations and report the average, minimum, and maximum time: from 10,000,000 iterations for small data sizes down to 1000 for large data sizes
- All approaches run on the same data; only average times are shown here
- Use of the VxWorks OS resulted in very low variability in timing, giving a high degree of confidence in the results
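A sketch of the kind of measurement harness this method describes (illustrative only; std::clock() stands in for whatever high-resolution timer the original VxWorks experiments used):

```cpp
#include <climits>
#include <cstdio>
#include <ctime>

// Time op() over many iterations; report average, minimum, and maximum.
template <typename Op>
void time_op(Op op, long iterations) {
    long total = 0, lo = LONG_MAX, hi = 0;
    for (long it = 0; it < iterations; ++it) {
        std::clock_t t0 = std::clock();
        op();
        long dt = (long)(std::clock() - t0);
        total += dt;
        if (dt < lo) lo = dt;
        if (dt > hi) hi = dt;
    }
    std::printf("avg=%g min=%ld max=%ld clock ticks\n",
                (double)total / iterations, lo, hi);
}
```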

Experiment 1: A = rev(B)

- The PETE/Psi implementation performs nearly as well as the hand-coded loop, and much better than the regular C++ implementation.
- Some overhead is associated with expression tree manipulation.

Experiment 2: A = rev(take(N, drop(M, rev(B))))

- There is a larger gap between regular C++ performance and that of the other implementations: regular C++ operators do not compose efficiently.
- There is larger overhead associated with expression-tree manipulation, due to the more complex expression.

Experiment 3: A = cat(B+C, D+E)

- Still larger overhead is associated with tree manipulation, due to cat().
- This overhead can be mitigated by a "setup" step prior to assignment.

Outline

- Overview
  - Motivation
  - The Psi Calculus
  - Expression Templates
- Implementing the Psi Calculus with Expression Templates
- Experiments
- Future Work and Conclusions

Future Work

- Multiple dimensions: extend this work to N-dimensional arrays, where N is any non-negative integer.
- Parallelism: explore dimension lifting to exploit multiple processors.
- Memory hierarchy: explore dimension lifting to exploit levels of memory.
- Mechanize index decomposition: currently a time-consuming process done by hand.
- Program block optimizations: PETE-style optimizations across statements to eliminate unnecessary temporaries.

Conclusions

- Psi calculus provides rules to reduce array expressions to the minimum number of reads and writes.
- Expression templates provide the ability to perform compiler-preprocessor-style optimizations (expression tree manipulation).
- Combining Psi calculus with expression templates results in array operators that compose efficiently, are high performance, and are high level.
- The C++ template mechanism can be applied to a wide variety of problems (e.g., tree traversal a la PETE, graph traversal, list traversal) to gain run-time speedup at the expense of compile time/space.