Automatic Differentiation Tutorial

Slides:

Advertisements

Similar presentations

ATMOSPHERIC TRANSMISSION AND RADIANCE MODELS

Advertisements

Yi Heng Second Order Differentiation Bommerholz – Summer School 2006.

Agenda Definitions Evolution of Programming Languages and Personal Computers The C Language.

The C ++ Language BY Shery khan. The C++ Language Bjarne Stroupstrup, the language’s creator C++ was designed to provide Simula’s facilities for program.

Beginning Programming for Engineers

Jaewook Shin, Priyadarshini Malusare and Paul D. Hovland Mathematics and Computer Science Division Argonne National Laboratory Design and Implementation.

1 EMT 101 – Engineering Programming Dr. Farzad Ismail School of Aerospace Engineering Universiti Sains Malaysia Nibong Tebal Pulau Pinang Week 10.

July 31, 2013 Jason Su. Background and Tools Cramér-Rao Lower Bound (CRLB) Automatic Differentiation (AD) Applications in Parameter Mapping Evaluating.

Semantic Analysis Chapter 6. Two Flavors  Static (done during compile time) –C –Ada  Dynamic (done during run time) –LISP –Smalltalk  Optimization.

Chapter 7 Introduction to Procedures. So far, all programs written in such way that all subtasks are integrated in one single large program. There is.

Chapter 3: Beginning Problem Solving Concepts for the Computer Programming Computer Programming Skills /1436 Department of Computer Science.

Chapter 9 Subprogram Control Consider program as a tree- –Each parent calls (transfers control to) child –Parent resumes when child completes –Copy rule.

Telescoping Languages: A Compiler Strategy for Implementation of High-Level Domain-Specific Programming Systems Ken Kennedy Rice University.

ECIV 301 Programming & Graphics Numerical Methods for Engineers Lecture 3 Programming and Software.

Richard Fateman CS 282 Lecture 18b1 Automatic Differentiation Lecture 18b.

Templates. Objectives At the conclusion of this lesson, students should be able to Explain how function templates are used Correctly create a function.

In conclusion our tool: can be used with any operator overloading AD package replaces the manual process, which is slow and overestimates the number active.

 2007 Pearson Education, Inc. All rights reserved C++ as a Better C; Introducing Object Technology.

CSE 1301 J Lecture 2 Intro to Java Programming Richard Gesick.

AceGen and AceFEM packages

© 2003 G. Drew Kessler and William M. Pottenger1 Subroutines (Part 1) CSE 262, Spring 2003.

Computer Science 210 Computer Organization Introduction to C.

Robin Hogan Department of Meteorology School of Mathematical and Physical Sciences University of Reading Can operator-overloading ever have a speed approaching.

5.3 Machine-Independent Compiler Features

Introduction to FORTRAN

McLab Tutorial Part 6 – Introduction to the McLab Backends MATLAB-to-MATLAB MATLAB-to-Fortran90 (McFor) McVM with JIT 6/4/2011Backends-

A Brief Overview of Methods for Computing Derivatives Wenbin Yu Department of Mechanical & Aerospace Engineering Utah State University, Logan, UT.

C++ Code Analysis: an Open Architecture for the Verification of Coding Rules Paolo Tonella ITC-irst, Centro per la Ricerca Scientifica e Tecnologica

Data Flow in Static Profiling Cathal Boogerd, Delft University, The Netherlands Leon Moonen, Simula Research Lab, Norway ?

Michelle Mills Strout OpenAnalysis: Representation- Independent Program Analysis CCA Meeting January 17, 2008.

Chapter 1 Introduction Dr. Frank Lee. 1.1 Why Study Compiler? To write more efficient code in a high-level language To provide solid foundation in parsing.

Combinatorial Aspects of Automatic Differentiation Paul D. Hovland Mathematics & Computer Science Division Argonne National Laboratory.

C Tokens Identifiers Keywords Constants Operators Special symbols.

The string data type String. String (in general) A string is a sequence of characters enclosed between the double quotes "..." Example: Each character.

Aerodynamic Design Using VLM Gradient Generation Using ADIFOR (Automatic Differentiation in Fortran) Santosh N. Abhyankar Prof. K. Sudhakar.

Copyright © 2006 The McGraw-Hill Companies, Inc. Permission required for reproduction or display. 1 Chapter 2.

8-1 Embedded Systems Fixed-Point Math and Other Optimizations.

CSC3315 (Spring 2008)1 CSC 3315 Subprograms Hamid Harroud School of Science and Engineering, Akhawayn University

MT311 Java Application Development and Programming Languages Li Tak Sing( 李德成 )

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University IWPSE 2003 Program.

Texas A&M University, Department of Aerospace Engineering AN EMBEDDED FUNCTION TOOL FOR MODELING AND SIMULATING ESTIMATION PROBLEMS IN AEROSPACE ENGINEERING.

Arithmetic expressions containing Mathematical functions.

FORTRAN History. FORTRAN - Interesting Facts n FORTRAN is the oldest Language actively in use today. n FORTRAN is still used for new software development.

Engineering Optimization Most engineering design involves using optimization software Minimizes or maximizes a merit function Applies functional constraints.

Computer Science A 1. Course plan Introduction to programming Basic concepts of typical programming languages. Tools: compiler, editor, integrated editor,

PYTHON FOR HIGH PERFORMANCE COMPUTING. OUTLINE  Compiling for performance  Native ways for performance  Generator  Examples.

ECE 103 Engineering Programming Chapter 30 C Functions Herbert G. Mayer, PSU CS Status 8/9/2014 Initial content copied verbatim from ECE 103 material developed.

CSCAPES Mission Research and development Provide load balancing and parallelization toolkits for petascale computation Develop advanced automatic differentiation.

Texas A&M University, Department of Aerospace Engineering AUTOMATIC GENERATION AND INTEGRATION OF EQUATIONS OF MOTION BY OPERATOR OVER- LOADING TECHNIQUES.

Computer Science 210 Computer Organization

APPENDIX a WRITING SUBROUTINES IN C

Lecture: MATLAB Chapter 1 Introduction

MatLab Programming By Kishan Kathiriya.

Code-Based Sensitivities for Verification and Validation

Organization of Programming Languages

Introduction to C Programming Language

Semantic Analysis Chapter 6.

INTRODUCTION TO BASIC MATLAB

Computer Science 210 Computer Organization

CSC 533: Programming Languages Spring 2015

Semantic Analysis Chapter 6.

Activation records Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Section

Simulation And Modeling

Verification and Validation Using Code-Based Sensitivity Techniques

Activation records Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Section

CSC 533: Programming Languages Spring 2018

CSC 533: Programming Languages Spring 2019

Activation records Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Section

Activation records Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Section

Presentation transcript:

Automatic Differentiation Tutorial Paul Hovland Andrew Lyons Boyana Norris Jean Utke Mathematics & Computer Science Division Argonne National Laboratory

AD in a Nutshell Technique for computing analytic derivatives of programs (millions of loc) Derivatives used in optimization, nonlinear PDEs, sensitivity analysis, inverse problems, etc. AD = analytic differentiation of elementary functions + propagation by chain rule Every programming language provides a limited number of elementary mathematical functions Thus, every function computed by a program may be viewed as the composition of these so-called intrinsic functions Derivatives for the intrinsic functions are known and can be combined using the chain rule of differential calculus Associativity of the chain rule leads to two main modes: forward and reverse Can be implemented using source transformation or operator overloading

Modes of AD Forward Mode Propagates derivative vectors, often denoted u or g_u Derivative vector u contains derivatives of u with respect to independent variables Time and storage proportional to vector length (# indeps) Reverse Mode Propagates adjoints, denoted ū or u_bar Adjoint ū contains derivatives of dependent variables with respect to u Propagation starts with dependent variables—must reverse flow of computation Time proportional to adjoint vector length (# dependents) Storage proportional to number of operations Because of this limitation, often applied to subprograms

Which mode to use? Use forward mode when # independents is very small Only a directional derivative (Jv) is needed Reverse mode is not tractable Use reverse mode when # dependents is very small Only JTv is needed

Ways of Implementing AD Operator Overloading Use language features to implement differentiation rules or to generate trace (“tape”) of computation Implementation can be very simple Difficult to go beyond one operation/statement at a time in doing AD Potential overhead due to compiler-generated temporaries, e.g. w = x*y*z è tmp = x*y; w = tmp*z. Examples: ADOL-C, ADMAT, TFAD<>,SACADO Source Transformation Requires significant (compiler) infrastructure More flexibility in exploiting chain rule associativity Examples: ADIFOR, ADIC, OpenAD, TAF, TAPENADE

Operator Overloading: simple example (implementation) class a_double{ private: double value, grad; public: /* constructors */ a_double(double v=0.0, double g=0.0){value=v; grad = g;} /* operators */ friend a_double operator+(const a_double &g1, const a_double &g2) { return a_double(g1.value+g2.value,g1.grad+g2.grad); } friend a_double operator-(const a_double &g1, const a_double &g2) { return a_double(g1.value-g2.value,g1.grad-g2.grad); friend a_double operator*(const a_double &g1, const a_double &g2) { return a_double(g1.value*g2.value,g2.value*g1.grad+g1.value*g2.grad); friend a_double sin(const a_double &g1) { return a_double(sin(g1.value),cos(g1.value)*g1.grad); friend a_double cos(const a_double &g1) { return a_double(cos(g1.value),-sin(g1.value)*g1.grad); // ...

Operator Overloading: simple example (use) #include <math.h> #include <stream.h> #include "adouble.hxx" void func(a_double *f, a_double x, a_double y){ a_double a,b; if (x > y) { a = cos(x); b = sin(y)*y*y; } else { a = x*sin(x)/y; b = exp(y); } *f = exp(a*b);

Source Transformation: simple example C Generated by TAPENADE (INRIA, Tropics team) C ... SUBROUTINE FUNC_D(f, fd, x, xd, y, yd) DOUBLE PRECISION f, fd, x, xd, y, yd DOUBLE PRECISION a, ad, arg1, arg1d, b, bd INTRINSIC COS, EXP, SIN C IF (x .GT. y) THEN ad = -(xd*SIN(x)) a = COS(x) bd = yd*COS(y)*y*y + SIN(y)*(yd*y+y*yd) b = SIN(y)*y*y ELSE ad = ((xd*SIN(x)+x*xd*COS(x))*y-x*SIN(x)*yd)/y**2 a = x*SIN(x)/y bd = yd*EXP(y) b = EXP(y) END IF arg1d = ad*b + a*bd arg1 = a*b fd = arg1d*EXP(arg1) f = EXP(arg1) RETURN END

The ADIFOR/ADIC Process # subroutine to be differentiated AD_TOP = deba # number of derivatives to be computed AD_PMAX = 5 # independent variables AD_IVARS = b,c,d,nu100,nu210 # dependent variables AD_OVARS = hmin,pmaxe,cfhp # Name of file listing names of files # containing deba and all subroutines called by it. AD_PROG = deba.cmp ANSI-C or Fortran Code Control Files ADIC or ADIFOR Code with Derivatives Compile & Link User’s Derivative Driver Libraries, e.g., SparsLinC ADIntrinsics Derivative Program After initial setup, usually only input code changes. http://www.mcs.anl.gov/ autodiff/adic-demo and …/adifor-demo

Tools Fortran 95 C/C++ Fortran 77 MATLAB Other languages: Ada, Python, ... More tools at http://www.autodiff.org/

Tools: Fortran 95 TAF (FastOpt) Commercial tool Support for (almost) all of Fortran 95 Used extensively in geophysical sciences applications Tapenade (INRIA) Support for many Fortran 95 features Developed by a team with extensive compiler experience OpenAD/F (Argonne/UChicago/Rice) Developed by a team with expertise in combintorial algorithms, compilers, software engineering, and numerical analysis Development driven by climate model & astrophysics code All three: forward and reverse; source transformation

Tools: C/C++ ADOL-C (Dresden) Mature tool Support for all of C++ Operator overloading; forward and reverse modes ADIC (Argonne/UChicago) Support for all of C, some C++ Source transformation; forward mode (reverse under development) New version (2.0) based on industrial strength compiler infrastructure Shares some infrastructure with OpenAD/F SACADO: See Phipps presentation TAC++ (FastOpt) Commercial tool (under development) Support for much of C/C++ Source transformation; forward and reverse modes Shares some infrastructure with TAF

Tools: Fortran 77 Tools: MATLAB ADIFOR (Rice/Argonne) Mature and very robust tool Support for all of Fortran 77 Forward and (adequate) reverse modes Hundreds of users; ~150 citations AdiMat (Aachen): source transformation MAD (Cranfield/TOMLAB): operator overloading Various research prototypes Tools: MATLAB

Simple example revisited #include <math.h> #include <stream.h> #include "adouble.hxx" void func(double *f, double x, double y){ double a,b; if (x > y) { a = cos(x); b = sin(y)*y*y; } else { a = x*sin(x)/y; b = exp(y); } *f = exp(a*b);

ADIFOR 2.0: simple example (x only) C This file was generated on 09/15/03 by the version of C ADIFOR compiled on June, 1998. C ... subroutine g_func(f, g_f, x, g_x, y) double precision f, x, y, a, b double precision d1_w, d2_v, d1_p, d2_b, g_a, g_x, g_d1_w, g_f save g_a, g_d1_w if (x .gt. y) then d2_v = cos(x) d1_p = -sin(x) g_a = d1_p * g_x a = d2_v C-------- b = sin(y) * y * y else d2_v = sin(x) d1_p = cos(x) d2_b = 1.0d0 / y g_a = (d2_b * d2_v + d2_b * x * d1_p) * g_x a = x * d2_v / y b = exp(y) endif g_d1_w = b * g_a d1_w = a * b d2_v = exp(d1_w) d1_p = d2_v g_f = d1_p * g_d1_w f = d2_v C return end

ADIFOR 3.0: simple example subroutine g_func_proc(ad_p_, f, g_f, x, g_x, y, g_y) include 'g_maxs.h' c+declarations here integer ad_p_ double precision d_tmp_s_0_a double precision g_a(ad_pmax_) double precision g_b(ad_pmax_) double precision g_f(ad_pmax_) double precision g_X2_dtmp1(ad_pmax_) double precision g_x(ad_pmax_) double precision g_y(ad_pmax_) double precision d_tmp_ipar_0 double precision d_tmp_val_0 double precision d_tmp_val_1 double precision f, x, y, a, b double precision X2_dtmp1 C+AD4 Start of executable statements. d_tmp_ipar_0 = (-sin(x)) call g_accum_d_2(ad_p_, g_a, d_tmp_ipar_0, g_x) a = cos(x) d_tmp_val_0 = sin(y) d_tmp_val_1 = d_tmp_val_0 * y d_tmp_ipar_0 = cos(y) d_tmp_s_0_a = d_tmp_val_1 + y * d_tmp_val_0 + y * y * d_tmp_ipar +_0 call g_accum_d_2(ad_p_, g_b, d_tmp_s_0_a, g_y) b = d_tmp_val_1 * y call g_accum_d_4(ad_p_, g_X2_dtmp1, a, g_b, b, g_a) X2_dtmp1 = a * b d_tmp_ipar_0 = exp(X2_dtmp1) call g_accum_d_2(ad_p_, g_f, d_tmp_ipar_0, g_X2_dtmp1) f = exp(X2_dtmp1) C return end

TAPENADE reverse mode: simple example C Generated by TAPENADE (INRIA, Tropics team) C Version 2.0.6 - (Id: 1.14 vmp Stable - Fri Sep 5 14:08:23 MEST 2003) C ... SUBROUTINE FUNC_B(f, fb, x, xb, y, yb) DOUBLE PRECISION f, fb, x, xb, y, yb DOUBLE PRECISION a, ab, arg1, arg1b, b, bb INTEGER branch INTRINSIC COS, EXP, SIN C IF (x .GT. y) THEN a = COS(x) b = SIN(y)*y*y CALL PUSHINTEGER4(0) ELSE a = x*SIN(x)/y b = EXP(y) CALL PUSHINTEGER4(1) END IF arg1 = a*b f = EXP(arg1) arg1b = EXP(arg1)*fb ab = b*arg1b bb = a*arg1b CALL POPINTEGER4(branch) IF (branch .LT. 1) THEN yb = yb + (SIN(y)*y+y*SIN(y)+y*y*COS(y))*bb xb = xb - SIN(x)*ab yb = yb + EXP(y)*bb - x*SIN(x)*ab/y**2 xb = xb + (x*COS(x)/y+SIN(x)/y)*ab fb = 0.D0 END

OpenAD system architecture

OpenAD: simple example SUBROUTINE func(F, X, Y) use w2f__types use active_module C Declarations C ... IF (X%v .GT. Y%v) THEN OpenAD_Symbol_1 = COS(X%v) OpenAD_Symbol_0 = (-SIN(X%v)) A%v = OpenAD_Symbol_1 OpenAD_Symbol_5 = SIN(Y%v) OpenAD_Symbol_2 = (Y%v*OpenAD_Symbol_5) OpenAD_Symbol_9 = (Y%v*OpenAD_Symbol_2) OpenAD_Symbol_3 = OpenAD_Symbol_2 OpenAD_Symbol_6 = OpenAD_Symbol_5 OpenAD_Symbol_8 = COS(Y%v) OpenAD_Symbol_7 = Y%v OpenAD_Symbol_4 = Y%v B%v = OpenAD_Symbol_9 OpenAD_Symbol_25 = (OpenAD_Symbol_6 + OpenAD_Symbol_8 * > OpenAD_Symbol_7) OpenAD_Symbol_26 = (OpenAD_Symbol_3 + OpenAD_Symbol_25 * > OpenAD_Symbol_4) OpenAD_Symbol_28 = OpenAD_Symbol_0 CALL setderiv(OpenAD_Symbol_29,X) CALL setderiv(OpenAD_Symbol_27,Y) CALL sax(OpenAD_Symbol_26,OpenAD_Symbol_27,B) CALL sax(OpenAD_Symbol_28,OpenAD_Symbol_29,A) ELSE

ADIC: simple example #include "ad_deriv.h" #include <math.h> #include "adintrinsics.h" void ad_func(DERIV_TYPE *f,DERIV_TYPE x,DERIV_TYPE y) { DERIV_TYPE a, b, ad_var_0, ad_var_1, ad_var_2; double ad_adji_0,ad_loc_0,ad_loc_1,ad_adj_0,ad_adj_1,ad_adj_2,ad_adj_3; if (DERIV_val(x) > DERIV_val(y)) { DERIV_val(a) = cos( DERIV_val(x)); /*cos*/ ad_adji_0 = -sin( DERIV_val(x)); { ad_grad_axpy_1(&(a), ad_adji_0, &(x)); } DERIV_val(ad_var_0) = sin( DERIV_val(y)); /*sin*/ ad_adji_0 = cos( DERIV_val(y)); ad_grad_axpy_1(&(ad_var_0), ad_adji_0, &(y)); ad_loc_0 = DERIV_val(ad_var_0) * DERIV_val(y); ad_loc_1 = ad_loc_0 * DERIV_val(y); ad_adj_0 = DERIV_val(ad_var_0) * DERIV_val(y); ad_adj_1 = DERIV_val(y) * DERIV_val(y); ad_grad_axpy_3(&(b), ad_adj_1, &(ad_var_0), ad_adj_0, &(y), ad_loc_0, &(y)); DERIV_val(b) = ad_loc_1; else { // ...

ADIC-Generated Code: Interpretation Original code: y = x1*x2*x3*x4; typedef struct { double value; double grad[ad_GRAD_MAX]; } DERIV_TYPE; DERIV_val(y): value of program variable y DERIV_grad(y):derivative object associated with y ad_loc_0 = DERIV_val(x1) * DERIV_val(x2); ad_loc_1 = ad_loc_0 * DERIV_val(x3); dy/dx4 ad_loc_2 = ad_loc_1 * DERIV_val(x4); y ad_adj_0 = ad_loc_0 * DERIV_val(x4); dy/dx3 ad_adj_1 = DERIV_val(x3) * DERIV_val(x4); ad_adj_2 = DERIV_val(x1) * ad_adj_1; dy/dx2 ad_adj_3 = DERIV_val(x2) * ad_adj_1; dy/dx1 reverse (or adjoint) mode of AD The forward mode: Associate a gradient with every program variable (of type double) and apply the rules of differential calculus. The computation of p directional derivatives requires gradients of length p, therefore, many vector linear combinations of length p, e.g., grad(x) = (0,0,…,1,…0) leads to grad(t) = dt/dx(1:p). The reverse (adjoint) mode: Associate an “adjoint” alpha with every program variable and apply the following rule to the inverted program: s = f(v,w) ® a_v += ds/dv* a_s; a_w += ds/dw* a_s; 1st step: Reversible program: single assignment form 2nd step: Adjoint computation for(i=0;i<nindeps;i++) DERIV_grad(y)[i] = ad_adj_3*DERIV_grad(x1)[i]+ ad_adj_2*DERIV_grad(x2)[i]+ad_adj_0*DERIV_grad(x3)[i]+ ad_loc_1*DERIV_grad(x4)[i]; forward mode of AD DERIV_val(y) = ad_loc_2; original value

Matrix Coloring Jacobian matrices are often sparse The forward mode of AD computes J × S, where S is usually an identity matrix or a vector Can “compress” Jacobian by choosing S such that structurally orthogonal columns are combined A set of columns are structurally orthogonal if no two of them have nonzeros in the same row Equivalent problem: color the graph whose adjacency matrix is JTJ Equivalent problem: distance-2 color the bipartite graph of J

Compressed Jacobian

What is feasible & practical Key point: forward mode computes JS at cost proportional to number of columns in S; reverse mode computes JTW at cost proportional to number of columns in W Jacobians of functions with small number (1—1000) of independent variables (forward mode, S=I) Jacobians of functions with small number (1—100) of dependent variables (reverse/adjoint mode, S=I) Very (extremely) large, but (very) sparse Jacobians and Hessians (forward mode plus coloring) Jacobian-vector products (forward mode) Transposed-Jacobian-vector products (adjoint mode) Hessian-vector products (forward + adjoint modes) Large, dense Jacobian matrices that are effectively sparse or effectively low rank (e.g., see Abdel-Khalik et al., AD2008)

A B C p k n m Scenarios N small: use forward mode on full computation M small: use reverse mode on full computation M & N large, P small: use reverse mode on A, forward mode on B&C M & N large, K small: use reverse mode on A&B, forward mode on C N, P, K, M large, Jacobians of A, B, C sparse: compressed forward mode N, P, K, M large, Jacobians of A, B, C low rank: scarce forward mode A B C n m

A B C p k n m Scenarios N small: use forward mode on full computation M small: use reverse mode on full computation M & N large, P small: use reverse mode on A, forward mode on B&C M & N large, K small: use reverse mode on A&B, forward mode on C N, P, K, M large, Jacobians of A, B, C sparse: compressed forward mode N, P, K, M large, Jacobians of A, B, C low rank: scarce forward mode N, P, K, M large: Jacobians of A, B, C dense: what to do? A B C n m

Automatic Differentiation: What Can Go Wrong (and what to do about it)

Issues with Black Box Differentiation Source code may not be available or may be difficult to work with Simulation may not be (chain rule) differentiable Feedback due to adaptive algorithms Nondifferentiable functions Noisy functions Convergence rates Etc. Accurate derivatives may not be needed (FD might be cheaper) Differentiation and discretization do not commute

Difficulties in ADIFOR 2.0 processing of FLUENT Dubious programming techniques: Type mismatches in actual & declared parameters Bugs: inconsistent number of arguments in subroutine calls Not conforming to Fortran77 standard while statement in one subroutine ADIFOR2.0 limitations: I/O statements containing function invocations

Note: The value of „zero“ is a small number, not 0.0 Overflows in Fluent.AD Dynamic range of derivative code often is larger than that of original code. This may lead to overflows in the derivative code, in particular in 32-bit arithmetic. AD-generated code: r4_v = ap / cendiv r4_b = 1.0 / cendiv r5_b = (-r4_v) / cendiv do g_i_ = 1, g_pmax_ g_axp(g_i_) = g_axp(g_i_) + r4_b * g_ap(g_i_) + r5_b * g_cendiv(g_i_) enddo axp = axp + r4_v Original Code: if (cendiv.eq.0.0) cendiv = zero endif axp = axp+ap/cendiv Note: The value of „zero“ is a small number, not 0.0

Wisconsin Sea Ice Model

Chain rule (non-)differentiability if (x .eq. 1.0) then a = y else if ((x .eq. 0.0) then a = 0 else a = x*y endif b = sqrt(x**4 + y**4)

Mathematical Challenges Derivatives of intrinsic functions at points of non-differentiability Derivatives of implicitly defined functions Derivatives of functions computed using numerical methods

Points of Nondifferentiability Due to intrinsic functions Several intrinsic functions are defined at points where their derivatives are not, e.g.: abs(x), sqrt(x) at x=0 max(x,y) at x=y Requirements: Record/report exceptions Optionally, continue computation using some generalized gradient ADIFOR/ADIC approach User-selected reporting mechanism User-defined generalized gradients, e.g.: [1.0,0.0] for max(x,0) [0.5,0.5] for max(x,y) Various ways of handling Verbose reports (file, line, type of exception) Terse summary (like IEEE flags) Ignore Due to conditional branches May be able to handle using trust regions

Template at F77 intrinsic call site (gen. by AD Tool) ADIntrinsics System Template at F77 intrinsic call site (gen. by AD Tool) Translation Blueprints (default or by user) Reporting Flavor (default or user) PURSE Template Instantiator Directives in Code (optional by user) ADIntrinsics Runtime Library F77 Derivative Code Library Calls (optional by user) Compile and Link Executable

Implicitly Defined Functions Implicitly defined functions often computed using iterative methods Function and derivatives may converge at different rates Derivative may not be “accurate” if iteration halted upon function convergence Solutions: Tighten function convergence criteria Add derivative convergence to stopping criteria Compute derivatives directly, e.g. A x = b Verify that implicit function theorem is the correct thing to be using.

Derivatives of Functions Computed Using Numerical Methods Differentiation and approximation may not commute Need to be careful about how derivatives of numerical approximations are used E.g., differentiating through an ODE integrator can provide unexpected results due to feedback induced by adaptive stepsize control:

F(y,p) p ODE Solver y|t=t1, t2, ... (CVODE) y|t=0 Sensitivity CVODE_S: Objective F(y,p) p ODE Solver (CVODE) y|t=t1, t2, ... y|t=0 automatically Sensitivity Solver y|t=t1, t2, ... p dy/dp|t=t1, t2, ... y|t=0

ad_F(y,ad_y,p,ad_p) p,ad_p ad_CVODE y, ad_y|t=t1, t2, ... y, ad_y|t=0 Possible Approaches Apply AD to CVODE: ad_F(y,ad_y,p,ad_p) p,ad_p ad_CVODE y, ad_y|t=t1, t2, ... y, ad_y|t=0 Solve sensitivity eqns: ad_F(y,ad_y,p,ad_p) p CVODE_S y, dy/dp |t=t1, t2, ... y|t=0

CVODE as ODE + sensitivity solver Augmented ODE initial-value problem:

CVODE_S: Test Problem Diurnal kinetics advection-diffusion equation 100x100 structured grid 16 Pentium III nodes

SensPVODE: Number of Timesteps

SensPVODE: Time/Timestep

Addressing Limitations in Black Box AD Detect points of nondifferentiability proceed with a subgradient currently supported for intrinsic functions, but not conditional statements Exploit mathematics to avoid differentiating through an adaptive algorithm Modify termination criterion for implicitly defined functions Tighten tolerance Add derivatives to termination test (preferred)

Conclusions There are some potential “gotchas” when applying AD in a black box fashion Some care should be taken to ensure that the desired quantity is computed There are usually workarounds

AD Intro: Follow Up

Practical Matters: constructing computational graphs At compile time (source transformation) Structure of graph is known, but edge weights are not: in effect, implement inspector (symbolic) phase at compile time (offline), executor (numeric) phase at run time (online) In order to assemble graph from individual statements, must be able to resolve aliases, be able to match variable definitions and uses Scope of computational graph construction is usually limited to statements or basic blocks Computational graph usually has O(10)—O(100) vertices At run time (operator overloading) Structure and weights both discovered at runtime Completely online—cannot afford polynomial time algorithms to analyze graph Computational graph may have O(10,000) vertices

Checkpointing real applications In practice, need a combination of all of these techniques At the timestep level, 2- or 3-level checkpointing is typical: too many timesteps to checkpoint every timestep At the call tree level, some mixture of joint and split mode is desirable Pure split mode consumes too much memory Pure joint mode wastes time recomputing at the lowest levels of the call tree Currently, OpenAD provides a templating mechanism to simplify the use of mixed checkpointing strategies Future research will attempt to automate some of the checkpointing strategy selection, including dynamic adaptation

What to do for very large, dense Jacobian matrices? Jacobian matrix might be large and dense, but “effectively sparse” Many entries below some threshold ε (“almost zero”) Can tolerate errors up to δ in entries greater than ε Example: advection-diffusion for finite time step: nonlocal terms fall off exponentially Solution: do a partial coloring to compress this dense Jacobian: requires solving a modified graph coloring problem Jacobian might be large and dense, but “effectively low rank” Can be well approximated by a low rank matrix Jacobian-vector products (and JTv) are cheap Adaptation of efficient subspace method uses Jv and JTv in random directions to build low rank approximation (Abdel-Khalik et al., AD08)

Application highlights Atmospheric chemistry Breast cancer biostatistical analysis CFD: CFL3D, NSC2KE, (Fluent 4.52: Aachen) ... Chemical kinetics Climate and weather: MITGCM, MM5, CICE Semiconductor device simulation Water reservoir simulation

Parameter tuning: sea ice model - Simulated (yellow) and observed (green) March ice thickness (m) Tuned parameters Standard parameters

Differentiated Toolkit: CVODES (nee SensPVODE) Diurnal kinetics advection-diffusion equation 100x100 structured grid 16 Pentium III nodes

Sensitivity Analysis: mesoscale weather model Add one or two applications

Conclusions & Future Work Automatic differentiation research involves a wide range of combinatorial problems AD is a powerful tool for scientific computing Modern automatic differentiation tools are robust and produce efficient code for complex simulation codes Robustness requires an industrial-strength compiler infrastructure Efficiency requires sophisticated compiler analysis Effective use of automatic differentiation depends on insight into problem structure Future Work Further develop and test techniques for computing Jacobians that are effectively sparse or effectively low rank Develop techniques to automatically generate complex and adaptive checkpointing strategies

About Argonne First national laboratory 25 miles southwest of Chicago ~3000 employees (~1000 scientists) Mathematics and Computer Science (MCS) Division ~100 scientific staff Research in parallel programming models, grid computing, scientific computing, component-based software engineering, applied math, … Summer programs for undergrads and grad students (Jan/Feb deadlines) Research semester and co-op opportunities for undergrads Postdoc, programmer, and scientist positions available AD group has 1 opening

For More Information Andreas Griewank, Evaluating Derivatives, SIAM, 2000. Griewank, “On Automatic Differentiation”; this and other technical reports available online at: http://www.mcs.anl.gov/autodiff/tech_reports.html AD in general: http://www.mcs.anl.gov/autodiff/, http://www.autodiff.org/ ADIFOR: http://www.mcs.anl.gov/adifor/ ADIC: http://www.mcs.anl.gov/adic/ OpenAD: http://www.mcs.anl.gov/openad/ Other tools: http://www.autodiff.org/ E-mail: hovland@mcs.anl.gov