Can operator-overloading ever have a speed approaching source-code transformation for reverse-mode automatic differentiation?
Robin Hogan
Department of Meteorology, School of Mathematical and Physical Sciences, University of Reading

Source-code transformation versus operator overloading
Source-code transformation:
– Generates quite efficient code (3-4 times the cost of the original algorithm?)
– Most/all good tools are non-free (?)
– Limited or no support for modern language features (e.g. classes and C++ templates)
Operator overloading:
– In principle can work with any language features
– Free C++ tools (e.g. ADOL-C, CppAD, Sacado)
– Not much available for Fortran for reverse mode
– Typically many times slower than the original algorithm!
This talk is about how to speed up operator overloading in C++.

Free C++ operator-overloading tools
ADOL-C and CppAD for reverse mode:
– In the forward pass they store the whole algorithm symbolically
– Every operator and function needs to be stored symbolically (e.g. 0 for plus, 1 for minus, 42 for atan, etc.)
– The adjoint function (and higher-order derivatives) can then be generated
– Flexibility comes at the cost of speed
Sacado::Rad for reverse mode:
– Differential statements (only) are stored as a tree of elemental operations linked by pointers
Sacado::ELRFad for forward mode:
– (ELR = expression-level reverse mode, Fad = forward-mode automatic differentiation)
– Uses expression templates to optimize the processing of each expression
– But it only works in forward-mode automatic differentiation: for n independent variables x, each intermediate variable q is replaced by an object containing the vector of partial derivatives (∂q/∂x1, …, ∂q/∂xn); a sketch of this idea follows below
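
To make the forward-mode idea concrete, here is a minimal sketch of an overloaded type that carries its gradient vector with respect to N inputs. It is illustrative only, assuming nothing about Sacado's actual interface; the type and function names are invented for this example:

#include <cmath>

// Hypothetical forward-mode type (not Sacado's real API): each variable
// carries its value plus d(value)/dx_i for each of the N inputs.
template <int N>
struct FwdVar {
  double value;
  double grad[N];
};

// Product rule: d(ab)/dx_i = b da/dx_i + a db/dx_i
template <int N>
FwdVar<N> operator*(const FwdVar<N>& a, const FwdVar<N>& b) {
  FwdVar<N> r;
  r.value = a.value * b.value;
  for (int i = 0; i < N; ++i)
    r.grad[i] = b.value*a.grad[i] + a.value*b.grad[i];
  return r;
}

// Chain rule: d(sin a)/dx_i = cos(a) da/dx_i
template <int N>
FwdVar<N> sin(const FwdVar<N>& a) {
  FwdVar<N> r;
  r.value = std::sin(a.value);
  for (int i = 0; i < N; ++i)
    r.grad[i] = std::cos(a.value) * a.grad[i];
  return r;
}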

Overview
Optimizing reverse-mode operator-overloading implementations:
– Efficient tape structure to store the differential statements
– Efficient adjoint calculation from the tape
– Using expression templates to efficiently build the tape
– Other optimizations
Benchmark of a new, free tool "Adept" (Automatic Differentiation using Expression Templates) against ADOL-C, CppAD and Sacado:
– Optimizing the computation of full Jacobian matrices
Remaining challenges

Simple example
Consider a simple algorithm y(x0, x1) contrived for didactic purposes, shown here in C and Fortran:

double algorithm(const double x[2]) {
  double y = 4.0;
  double s = 2.0*x[0] + 3.0*x[1]*x[1];
  y *= sin(s);
  return y;
}

function algorithm(x) result(y)
  implicit none
  real, intent(in) :: x(2)
  real :: y
  real :: s
  y = 4.0
  s = 2.0*x(1) + 3.0*x(2)*x(2)
  y = y * sin(s)
end function

We want the automatic differentiation code to look like this; the only change to the algorithm itself is to label "active" variables with a new type:

adouble algorithm(const adouble x[2]) {
  adouble y = 4.0;
  adouble s = 2.0*x[0] + 3.0*x[1]*x[1];
  y *= sin(s);
  return y;
}

// Main code
Stack stack;                    // Object where info will be stored
adouble x[2] = {…, …};          // Set algorithm inputs
adouble y = algorithm(x);       // Run algorithm and store info in stack
y.set_gradient(y_AD);           // Set dJ/dy
stack.reverse();                // Run adjoint code from stored info
x_AD[0] = x[0].get_gradient();  // Save resulting values of dJ/dx0
x_AD[1] = x[1].get_gradient();  // ... and dJ/dx1

Minimum necessary storage
The differential statements equivalent to the algorithm are:
  δs = 2.0 δx0 + 6.0 x1 δx1
  δy = sin(s) δy + y cos(s) δs
What is the minimum necessary storage for these statements? If each gradient is labelled by a unique integer (since the gradients themselves are unknown in the forward pass; here x0 = 0, x1 = 1, y = 2, s = 3) then we need to build two stacks:

Statement stack:
  Index to LHS gradient (unsigned int) | Index to first operation (unsigned int)
  3 (s)                                | 0
  2 (y)                                | 2
  …                                    | …

Operation stack:
  #  | Multiplier (double) | Index to RHS gradient (unsigned int)
  0  | 2.0                 | 0 (x0)
  1  | 6.0 x1              | 1 (x1)
  2  | sin(s)              | 2 (y)
  3  | y cos(s)            | 3 (s)
  4  | …                   | …

Total of 120 bytes in this case. We can then run backwards through the stacks to compute the adjoints.
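
As a plain-data sketch, the two stacks above might be laid out as follows (an illustrative layout following the tables; Adept's internal names differ):

#include <vector>

struct Statement {
  unsigned int lhs_index;    // index of the gradient on the left-hand side
  unsigned int first_op;     // index of this statement's first operation
};

struct Operation {
  double multiplier;         // e.g. 6.0*x1 or y*cos(s)
  unsigned int rhs_index;    // index of the gradient it multiplies
};

struct Stack {
  std::vector<Statement> statements;
  std::vector<Operation> operations;
  std::vector<double> gradients;  // one adjoint per labelled gradient index
};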

The adjoint algorithm is simple
We need to cope with three different types of differential statement:
  δy = 0                          (zero on the right-hand side)
  δs = 2.0 δx0 + 6.0 x1 δx1       (one or more gradients on the RHS)
  δy = sin(s) δy + y cos(s) δs    (the same gradient on the LHS and RHS)
Forward mode: the general differential statement is
  δa = m_0 δb_0 + m_1 δb_1 + … + m_n δb_n
Reverse mode: the equivalent adjoint statements are
  for i = 0 to n:  adjoint(b_i) += m_i × adjoint(a)
  adjoint(a) = 0
where the updates use the value of adjoint(a) saved before it is reset, so that the case in which a also appears on the RHS is handled correctly.

…which can be coded as follows
The reverse pass performs these steps:
1. Loop over the differential statements in reverse order
2. Save the gradient of the LHS and reset it to zero
3. Skip the statement if the saved gradient equals 0 (big optimization)
4. Loop over the statement's operations
5. Update a gradient for each operation
This does the right thing in our three cases:
– Zero on the RHS
– One or more gradients on the RHS
– The same gradient on the LHS and RHS
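
A sketch of this reverse pass, using the Stack layout sketched earlier (simplified relative to Adept's real implementation), with comments matching the numbered steps:

void reverse_pass(Stack& s) {
  // 1. Loop over differential statements in reverse order
  for (int ist = (int)s.statements.size()-1; ist >= 0; --ist) {
    const Statement& st = s.statements[ist];
    // 2. Save the LHS gradient and reset it to zero
    double a = s.gradients[st.lhs_index];
    s.gradients[st.lhs_index] = 0.0;
    // 3. Skip if gradient equals 0 (big optimization)
    if (a == 0.0) continue;
    // 4. Loop over the operations belonging to this statement
    unsigned int end_op = (ist+1 < (int)s.statements.size())
                        ? s.statements[ist+1].first_op
                        : (unsigned int)s.operations.size();
    for (unsigned int iop = st.first_op; iop < end_op; ++iop) {
      // 5. Update a gradient: adjoint(b_i) += m_i * adjoint(a)
      const Operation& op = s.operations[iop];
      s.gradients[op.rhs_index] += op.multiplier * a;
    }
  }
}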

Computational graphs
Standard operator overloading can only pass information from the most nested operation outwards. For the statement y = y*sin(s):
[Figure: expression tree with operator* at the root and y and sin(s) as children; the sin node passes the value of sin(s) up to operator*, which passes y sin(s) up to become the new y.]
Differentiation involves passing information in the opposite sense: a node f(x) takes a real number w and passes w df/dx down the chain.
[Figure: the same tree traversed downwards; operator* passes sin(s) down the y branch, adding "sin(s) → δy" to the stack, and passes y down the sin branch, which multiplies by cos(s) and adds "y cos(s) → δs" to the stack.]

Solution using expression templates
C++ supports class templates:
– A class template is a generic recipe for a class that works with an arbitrary type
– Veldhuizen (1995) used this feature to introduce expression templates to optimize array operations and make C++ as fast as Fortran-90 for array-wise operations
We use them as a way to pass information in both directions through the expression tree:
– sin(A) for an argument of arbitrary type A is overloaded to return an object of type Sin<A>
– operator*(A,B) for arguments of arbitrary types A and B is overloaded to return an object of type Multiply<A,B>
A minimal sketch of this machinery is shown below.
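
The sketch below shows the CRTP ("curiously recurring template pattern") base class and the two overloads just described. It is simplified relative to Adept itself; the sin overload is repeated alongside the full Sin definition on a later slide:

struct Stack;  // the tape object sketched earlier

template <class A>
struct Expression {
  // Recover the concrete derived type (the CRTP idiom)
  const A& cast() const { return static_cast<const A&>(*this); }
  double value() const { return cast().value(); }
  void calc_gradient(Stack& stack, double multiplier) const {
    cast().calc_gradient(stack, multiplier);
  }
};

template <class A> class Sin;               // defined on a later slide
template <class A, class B> class Multiply; // likewise

// sin() applied to any expression type A returns a Sin<A> node
template <class A>
inline Sin<A> sin(const Expression<A>& a) { return Sin<A>(a); }

// operator* applied to expression types A and B returns a Multiply<A,B> node
template <class A, class B>
inline Multiply<A,B> operator*(const Expression<A>& a,
                               const Expression<B>& b) {
  return Multiply<A,B>(a, b);
}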

Expression templates continued
The following types are passed up the chain at compile time for y*sin(s): s is an adouble, sin(s) is a Sin<adouble>, and the whole right-hand side is a Multiply<adouble, Sin<adouble> >.
Now when we compile the statement "y = y*sin(s)":
– The right-hand side resolves to an object "RHS" of type Multiply<adouble, Sin<adouble> >
– The overloaded assignment operator first calls RHS.value() to get the new value of y
– It then calls RHS.calc_gradient() to add entries to the operation stack
– Multiply and Sin are defined with calc_gradient() member functions so that they can correctly pass information up and down the expression tree
A sketch of such an assignment operator follows.
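
Here is a sketch of an active scalar type and its assignment operator. Names such as push_statement, push_operation and gradient_index_ are illustrative, not Adept's exact internals, and current_stack() is the thread-local helper sketched on the Optimizations slide below:

class adouble : public Expression<adouble> {
public:
  adouble(double val = 0.0);   // registers a new gradient index on the stack

  template <class A>
  adouble& operator=(const Expression<A>& rhs) {
    // Compute the value of the whole RHS *before* overwriting value_,
    // so that statements like y = y*sin(s) behave correctly
    double new_value = rhs.value();
    // Traverse the expression, pushing one (multiplier, index) pair
    // per active variable onto the operation stack
    rhs.calc_gradient(current_stack(), 1.0);
    // Record the statement itself: LHS index plus operation offset
    current_stack().push_statement(gradient_index_);
    value_ = new_value;
    return *this;
  }

  double value() const { return value_; }

  // An adouble is a leaf of the expression tree: just store the operation
  void calc_gradient(Stack& stack, double multiplier) const {
    stack.push_operation(multiplier, gradient_index_);
  }

private:
  double value_;
  unsigned int gradient_index_;  // unique label for this variable's gradient
};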

Implementation of Sin

// Definition of the Sin class
template <class A>
class Sin : public Expression<Sin<A> > {
public:
  // Constructor: store reference to a and its numerical value
  Sin(const Expression<A>& a)
    : a_(a.cast()), a_value_(a_.value()) { }
  // Return the value
  double value() const { return sin(a_value_); }
  // Compute derivative and pass to a
  void calc_gradient(Stack& stack, double multiplier) const {
    a_.calc_gradient(stack, cos(a_value_)*multiplier);
  }
private:
  const A& a_;       // A reference to the object
  double a_value_;   // The numerical value of the object
};

// Overload the sin function: it returns a Sin object
template <class A>
inline Sin<A> sin(const Expression<A>& a) {
  return Sin<A>(a);
}

…the Adept library has done this for all operators and functions.
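
Binary operators follow the same pattern. Here is a sketch of Multiply, analogous to Sin above and simplified relative to the actual Adept source:

template <class A, class B>
class Multiply : public Expression<Multiply<A,B> > {
public:
  Multiply(const Expression<A>& a, const Expression<B>& b)
    : a_(a.cast()), b_(b.cast()),
      a_value_(a_.value()), b_value_(b_.value()) { }
  double value() const { return a_value_ * b_value_; }
  // d(ab) = b da + a db: pass w*b down the a branch, w*a down the b branch
  void calc_gradient(Stack& stack, double multiplier) const {
    a_.calc_gradient(stack, b_value_ * multiplier);
    b_.calc_gradient(stack, a_value_ * multiplier);
  }
private:
  const A& a_;
  const B& b_;
  double a_value_, b_value_;
};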

Optimizations
Why are expression templates fast?
– Compound types representing complex expressions are known at compile time
– C++ automatically inlines the function calls between the objects in an expression, leaving little more than the operations you would put in a hand-coded application of the chain rule
Further optimizations:
– The Stack object keeps its memory allocated between calls, to avoid time spent incrementally allocating more memory
– The current stack is accessed via a global but thread-local variable (sketched below), rather than storing a link to the stack in every adouble object (as in CppAD and ADOL-C)
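
A minimal sketch of the global-but-thread-local stack access, with illustrative names (Adept's internals differ):

// gcc's pre-C++11 thread-local syntax; C++11 would use thread_local
extern __thread Stack* g_current_stack;

inline Stack& current_stack() { return *g_current_stack; }

// By contrast, storing a Stack* inside every adouble would enlarge each
// object and add an indirection on every recorded operation.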

Algorithms 1 & 2: linear advection
One simple PDE, in which the speed c is a constant:
  ∂q/∂t + c ∂q/∂x = 0

Algorithm 1: Lax-Wendroff
Lax and Wendroff (Comm. Pure Appl. Math., 1960):

#define NX 100
void lax_wendroff(int nt, double c, const adouble q_init[NX], adouble q[NX]) {
  adouble flux[NX-1];                          // Fluxes between boxes
  for (int i=0; i<NX; i++) q[i] = q_init[i];   // Initialize q
  for (int j=0; j<nt; j++) {                   // Main loop in time
    for (int i=0; i<NX-1; i++)
      flux[i] = 0.5*c*(q[i]+q[i+1] + c*(q[i]-q[i+1]));
    for (int i=1; i<NX-1; i++)
      q[i] += flux[i-1]-flux[i];
    q[0] = q[NX-2]; q[NX-1] = q[1];            // Treat boundary conditions
  }
}

This algorithm is linear and uses no mathematical functions. It has 100 inputs (independent variables) corresponding to the initial distribution of q, and 100 outputs (dependent variables) corresponding to the final distribution of q.

Algorithm 2: Toon et al.
Toon et al. (J. Atmospheric Sci., 1988):

#define NX 100
void toon_et_al(int nt, double c, const adouble q_init[NX], adouble q[NX]) {
  adouble flux[NX-1];                          // Fluxes between boxes
  for (int i=0; i<NX; i++) q[i] = q_init[i];   // Initialize q
  for (int j=0; j<nt; j++) {                   // Main loop in time
    for (int i=0; i<NX-1; i++)
      flux[i] = (exp(c*log(q[i]/q[i+1]))-1.0) * q[i]*q[i+1] / (q[i]-q[i+1]);
    for (int i=1; i<NX-1; i++)
      q[i] += flux[i-1]-flux[i];
    q[0] = q[NX-2]; q[NX-1] = q[1];            // Treat boundary conditions
  }
}

This algorithm assumes exponential variation of q between gridpoints (appropriate for certain types of tracer transport). It is non-linear and calls the mathematical functions exp and log from within the main loop. It has the same number of independents and dependents as Algorithm 1.

Real-world algorithms
How does a lidar/radar pulse spread through a cloud?
Algorithm 3: Photon Variance-Covariance method (PVC)
– Hogan (J. Atmos. Sci., 2008)
– Treats small-angle scattering
– Solves four coupled ODEs
– Efficiency O(N), where N is the number of points in the vertical
– 5N independent variables, N dependent variables
– We use N = 50
Algorithm 4: Time-Dependent Two-Stream method (TDTS)
– Hogan & Battaglia (J. Atmos. Sci., 2008)
– Treats wide-angle scattering
– Solves four coupled PDEs
– Efficiency O(N²)
– 4N independent variables, N dependent variables
– We use N = 50

Computational cost: Algorithms 1 & 2
[Bar charts: time relative to the original code for Linux, gcc-4.4, -O3 optimization, Pentium 2.5 GHz, 2 MB cache; left panel Algorithm 1 (Lax-Wendroff), right panel Algorithm 2 (Toon et al.).]
Lax-Wendroff: all AD tools are much slower than hand-coding! Because there are no mathematical functions, the compiler can aggressively optimize the loops in the original algorithm.
Toon et al.: Adept is only a little slower than hand-coding, and significantly faster than ADOL-C, CppAD and Sacado::Rad.

Computational cost: Algorithms 3 & 4
[Bar charts: left panel Algorithm 3 (PVC), right panel Algorithm 4 (TDTS).]
The results for the real-world algorithms are similar to those for Toon et al., since their loops also contain mathematical functions.
Note that ADOL-C and CppAD can reuse the same tape with different inputs (reverse pass only), while Adept and Sacado::Rad cannot:
– Adept is typically still faster than the reverse-pass-only timings of ADOL-C and CppAD
– Tapes cannot be reused for any algorithm containing "if" statements or look-up tables

Memory usage per operation
For each mathematical operation (+, *, sin, etc.), Adept stores the equivalent of around 1.75 double-precision numbers.
A hand-coded adjoint can be much more efficient, and for linear algorithms like Lax-Wendroff no data need be stored at all!
ADOL-C and CppAD store the entire algorithm, so require somewhat more storage.
Like Adept, Sacado::Rad stores only the differential information, but it stores the equivalent of more double-precision numbers per operation.

Jacobian matrices
For n independent and m dependent variables, the Jacobian is m×n.
If m < n:
– Run the algorithm once to create the tape, followed by m reverse accumulations, one for each row of the matrix
– Optimization: if a strip of rows is accumulated together, the compiler can take advantage of vectorization (SSE2) and loop unrolling
– Further optimization: parallelize the reverse accumulations
If m > n, with a tape:
– Run the algorithm once to create the tape, followed by n forward accumulations, one for each column of the matrix
– The same optimizations are possible
If m > n, without a tape (e.g. Sacado::ELRFad):
– Each intermediate variable q is replaced by a vector containing the partial derivatives (∂q/∂x1, …, ∂q/∂xn)
– The Jacobian matrix is generated in a single pass
A naive (strip-free) version of the reverse, row-by-row approach is sketched below.
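
This sketch builds an m×n Jacobian by repeated reverse accumulation, using the interface from the "Simple example" slide (set_gradient, reverse, get_gradient); zero_gradients() is an assumed helper that clears all adjoints between passes, and the strip-mining and SSE2 optimizations are omitted for clarity:

void jacobian_reverse(Stack& stack, adouble y[], int m,
                      adouble x[], int n, double jac[] /* m*n, row-major */) {
  for (int i = 0; i < m; ++i) {
    stack.zero_gradients();      // clear adjoints from the previous pass
    y[i].set_gradient(1.0);      // seed dJ/dy_i = 1
    stack.reverse();             // reverse accumulation through the tape
    for (int j = 0; j < n; ++j)
      jac[i*n + j] = x[j].get_gradient();  // row i = dy_i/dx_j
  }
}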

Benchmark using Toon et al.
Consider the Toon et al. algorithm with a 100×100 Jacobian matrix:
[Bar chart comparing Jacobian computation times for Adept, ADOL-C, CppAD, Sacado::Rad and Sacado::ELRFad.]
– Adept and Sacado::ELRFad are fastest overall
– CppAD and Sacado::Rad treat one strip of the matrix at a time: their reverse accumulations cost 100 times as much as one adjoint
– Adept and ADOL-C treat multiple strips at once: they achieve a 3-5 times speed-up compared to the naive approach
– Sacado::ELRFad is a very fast tapeless implementation, although Adept is faster for m < n

Summary and outlook
Can operator overloading compete with source-code transformation?
– Yes, for loops containing mathematical functions: the optimized operator-overloading implementation was found to be only a few times slower than the original algorithm, approaching the cost of a hand-coded adjoint
– Not yet, for loops free of mathematical functions: 32 times slower (at best); one tool was 240 times slower
Adept is freely available online:
– Significantly faster than the other free operator-overloading tools tested
– No knowledge of templates is required to use it!
Future work:
– Merge Adept with a matrix library using expression templates: potentially overcome the slowness of loops free of mathematical functions?
– Complex numbers, higher-order derivatives
– Will Fortran have templates one day?
Hogan, R. J., 2014: Fast reverse-mode automatic differentiation using expression templates in C++. ACM Trans. Math. Softw., in review.

Creating the adjoint code
Notation:
– Consider δy as the derivative of y with respect to "something"
– Consider adjoint(y) as dJ/dy
Differentiate the algorithm:
  δs = 2.0 δx0 + 6.0 x1 δx1
  δy = sin(s) δy + y cos(s) δs
Write each statement in matrix form: for the second statement, acting on the vector (δx0, δx1, δy, δs), the matrix is the identity except that the row for δy is (0, 0, sin(s), y cos(s)).
Transpose the matrix to get the equivalent adjoint statements:
  adjoint(s) += y cos(s) × adjoint(y)
  adjoint(y) = sin(s) × adjoint(y)

What is a template?
Templates are a key ingredient of generic programming in C++. Imagine we have a function like this:

double cube(const double x) {
  double y = x*x*x;
  return y;
}

We want it to work with any numerical type (single precision, complex numbers, etc.) but don't want to laboriously define a new overloaded function for each possible type. We can use a function template:

template <typename Type>
Type cube(const Type x) {
  Type y = x*x*x;
  return y;
}

double a = 1.0;
b = cube(a);                  // compiler creates function cube<double>
complex<double> c(1.0, 2.0);  // c = 1 + 2i
d = cube(c);                  // compiler creates function cube<complex<double> >

Implementing the chain rule
Differentiate the multiply operator: for y = a×b, δy = b δa + a δb, so a node receiving the multiplier w passes w×b down the a branch and w×a down the b branch.
Differentiate the sine function: for y = sin(a), δy = cos(a) δa, so a node receiving w passes w×cos(a) down to a.

Computational graph
Differentiation most naturally involves passing information in the opposite sense to ordinary evaluation.
[Figure: expression tree for y = y*sin(s); operator* passes sin(s) down the y branch, adding "sin(s) → δy" to the stack, and passes y down the sin branch, which multiplies by cos(s) and adds "y cos(s) → δs" to the stack.]
– Each node representing an arbitrary function or operator y(a) needs to be able to take a real number w and pass w×dy/da down the chain
– A binary function or operator y(a,b) would pass w×dy/da to one argument and w×dy/db to the other
– At the end of the chain, the result is stored on the stack
But how do we implement this?