LLNL-PRES
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA. Lawrence Livermore National Security, LLC

Considerations on efficient implementation of Anderson acceleration on parallel architectures
ICERM workshop on methods for large-scale nonlinear problems
J. Loffeld and C.S. Woodward
9/3/2015

We want to implement Anderson acceleration efficiently in SUNDIALS
- SUNDIALS: SUite of Nonlinear and DIfferential/ALgebraic Solvers
- Kernels implemented on top of an abstract vector library in SIMD fashion
- Supports serial, thread-parallel, and MPI-parallel vectors
- Starting work on accelerators such as GPUs and the Intel Xeon Phi
- Currently being tested on two large-scale applications:
  - ParaDiS: parallel dislocation dynamics
  - Ardra: deterministic neutron transport

Quick review of Anderson acceleration
- Least-squares problem solved through QR factorization
- New vectors added to QR using modified Gram-Schmidt (MGS)
- Oldest vectors removed using Givens rotations

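The iteration reviewed on this slide can be sketched as follows. This is a minimal illustration, not the SUNDIALS implementation: it uses the difference form x_{k+1} = g(x_k) - dG*gamma with gamma minimizing ||f_k - dF*gamma||, and for brevity it refactors the QR from scratch with MGS at every step instead of updating it. The names `anderson`, `mgs_qr`, and `back_substitute` are hypothetical, and the sketch assumes the stored differences stay linearly independent.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mgs_qr(cols):
    """Thin QR of a matrix stored as a list of columns, by modified Gram-Schmidt."""
    n = len(cols)
    Q, R = [], [[0.0] * n for _ in range(n)]
    for j, a in enumerate(cols):
        v = list(a)
        for i, q in enumerate(Q):
            R[i][j] = dot(q, v)
            v = [vk - R[i][j] * qk for vk, qk in zip(v, q)]
        R[j][j] = math.sqrt(dot(v, v))  # assumes full column rank
        Q.append([vk / R[j][j] for vk in v])
    return Q, R

def back_substitute(R, y):
    """Solve the upper-triangular system R x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))) / R[i][i]
    return x

def anderson(g, x0, m=2, max_iters=50, tol=1e-10):
    """Anderson acceleration of the fixed-point iteration x = g(x), window size m."""
    x = list(x0)
    gx = g(x)
    f = [a - b for a, b in zip(gx, x)]  # residual f = g(x) - x
    dF, dG = [], []                     # differences of residuals and of g values
    for it in range(max_iters):
        if math.sqrt(dot(f, f)) < tol:
            return x, it
        if dF:
            # Least-squares problem min ||f - dF gamma|| solved through QR.
            Q, R = mgs_qr(dF)
            gamma = back_substitute(R, [dot(q, f) for q in Q])
            x_new = [gx[k] - sum(c * col[k] for c, col in zip(gamma, dG))
                     for k in range(len(x))]
        else:
            x_new = gx                  # first step is a plain fixed-point step
        g_new = g(x_new)
        f_new = [a - b for a, b in zip(g_new, x_new)]
        dF.append([a - b for a, b in zip(f_new, f)])
        dG.append([a - b for a, b in zip(g_new, gx)])
        if len(dF) > m:                 # sliding window: drop the oldest vector
            dF.pop(0)
            dG.pop(0)
        x, gx, f = x_new, g_new, f_new
    return x, max_iters
```

A production version would update the factorization in place (MGS to add a column, Givens rotations to delete the oldest) rather than recomputing it.
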
Only matters if cost of the function evaluation does not dominate
- When the QR problem is tall and skinny, but not too skinny: m > 1
- Number of unknowns per processor is high enough to exceed cache
  - Machine balances now exceeding 100 flops/double
- Number of processors is high enough for communication to matter
- Two costs:
  - Communication
  - Local BLAS Level 1 operations
An “efficient” implementation of AA is about minimizing the cost of QR

Data reuse is key to efficiency on modern architectures
- BLAS Level-1 ops have no data reuse [diagram: vector = vector + vector]
- BLAS Level-2 ops have some reuse [diagram: vector = matrix * vector]
- BLAS Level-3 ops have lots of reuse [diagram: matrix = matrix * matrix]

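The reuse argument can be made concrete by counting flops per double moved to or from memory for one representative operation at each BLAS level. The sketch below is a back-of-envelope count assuming each operand is read once and each result written once (an idealized cache); the function name `blas_intensities` is hypothetical.

```python
def blas_intensities(n):
    """Flops per double of memory traffic for size-n operands, idealized cache."""
    return {
        # y = a*x + y: 2n flops; read x and y, write y (3n doubles)
        "level1_axpy": (2 * n) / (3 * n),
        # y = A*x + y: 2n^2 flops; read A (n^2), read x and y, write y (3n)
        "level2_gemv": (2 * n * n) / (n * n + 3 * n),
        # C = A*B + C: 2n^3 flops; read A, B, C and write C (4n^2 doubles)
        "level3_gemm": (2 * n ** 3) / (4 * n * n),
    }

for name, value in blas_intensities(1000).items():
    print(f"{name}: {value:.2f} flops/double")
```

For n = 1000 the Level-1 and Level-2 operations stay at or below about 2 flops per double, while the Level-3 operation reaches n/2 = 500. Against a machine balance above 100 flops/double, only the Level-3 operation has enough reuse to keep the processor busy.
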
TSQR is an example of a blocked QR algorithm
[Figure: TSQR reduction tree over row blocks A0 to A3; each block is factored locally into Q and R, and the R factors are combined pairwise up the tree into a single R]
- Local QR computed in LAPACK
- Communication avoiding

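The reduction tree can be sketched in a few lines. This is an illustration under stated assumptions, not the TSQR code referenced in the talk: the local factorization uses modified Gram-Schmidt in place of LAPACK, only the R factor is formed, every block is assumed to have at least as many rows as columns and full column rank, and the names `local_r` and `tsqr_r` are hypothetical.

```python
import math

def local_r(rows):
    """R factor of a thin QR of a dense matrix (list of rows), via MGS.

    Stands in for the local LAPACK QR; Q is formed but only R is returned.
    """
    n = len(rows[0])
    cols = [[row[j] for row in rows] for j in range(n)]
    Q, R = [], [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for i, q in enumerate(Q):
            R[i][j] = sum(a * b for a, b in zip(q, v))
            v = [a - R[i][j] * b for a, b in zip(v, q)]
        R[j][j] = math.sqrt(sum(a * a for a in v))
        Q.append([a / R[j][j] for a in v])
    return R

def tsqr_r(blocks):
    """R factor of the matrix whose row blocks are `blocks`, by a binary tree."""
    rs = [local_r(b) for b in blocks]        # leaf level: independent local QRs
    while len(rs) > 1:
        merged = []
        for i in range(0, len(rs) - 1, 2):
            # Stack two n x n R factors and factor the 2n x n result.
            merged.append(local_r(rs[i] + rs[i + 1]))
        if len(rs) % 2:
            merged.append(rs[-1])            # odd block passes up unchanged
        rs = merged
    return rs[0]
```

In a distributed setting each tree level would exchange only small n-by-n triangles between processes, which is where the communication savings come from.
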
Blocked QR factorization would mean acceleration only applied after every k iterations
- Two problems from Walker-Ni (2011):
  - 1D convection-diffusion using restricted additive Schwarz: 1024 unknowns, m = 5
  - Expectation-maximization: 100,000 unknowns, m = 4
- Blocking can have a modest impact on convergence
Blocking compromises acceleration

MPI cost of Anderson is low
- Test problem to measure MPI and local costs
  - Dummy function evaluation
  - 16 iterations, m = 16
- Vulcan: IBM BG/Q
  - 5D torus, optimized reductions
  - 8000 processors give similar results
- Cab: x86 cluster
  - Two-stage federated fat tree

Blocked QR reduces local cost
- TSQR by Solomonik, et al.
- Matrix 4 columns wide

Nodes now include accelerators
[Figure: node diagram showing a CPU with CPU memory, connected over PCI-E to an accelerator with global memory, a scheduler, and cores that each have local memory]

GPU programming model is similar to vector machines
- SIMD programming model
- Unit stride memory access highly preferred
- Need very high concurrency to hide memory latency
- Local memories are small
[Figure: elementwise vector addition, one element per thread (Thread 1, Thread 2, Thread 3)]

Initial GPU implementation using cuBLAS
- Assumes function evaluation on GPU
- Maintains approach of Walker (2011)
  - Modified Gram-Schmidt based QR factorization
- cuBLAS-based implementation
  - Main loop on CPU
  - R matrix on CPU, Q matrix on GPU
  - 10 µs latency for each kernel launch
- Expect to be highly bandwidth bound

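A rough cost model shows when the launch latency quoted above matters and when bandwidth takes over. The sketch below is a back-of-envelope estimate, not a measurement: it assumes one MGS column update against j existing columns costs j dot products plus j axpy updates, one kernel launch each, and the 200 GB/s device bandwidth in the usage example is a hypothetical placeholder rather than a figure from the talk.

```python
LAUNCH_LATENCY_S = 10e-6   # 10 microseconds per kernel launch, as on the slide
BYTES_PER_DOUBLE = 8

def mgs_update_cost(n, j, bandwidth_bytes_per_s):
    """Estimated (latency_s, transfer_s) for orthogonalizing one length-n vector
    against j columns with separate dot and axpy kernels."""
    launches = 2 * j                                   # j dots + j axpys
    doubles_moved = j * (2 * n) + j * (3 * n)          # dot reads 2n; axpy reads 2n, writes n
    latency = launches * LAUNCH_LATENCY_S
    transfer = doubles_moved * BYTES_PER_DOUBLE / bandwidth_bytes_per_s
    return latency, transfer

def latency_fraction(n, j, bandwidth_bytes_per_s):
    latency, transfer = mgs_update_cost(n, j, bandwidth_bytes_per_s)
    return latency / (latency + transfer)

assumed_bandwidth = 200e9  # hypothetical device bandwidth in bytes/s
for n in (1_000, 1_000_000, 10_000_000):
    print(n, round(latency_fraction(n, 4, assumed_bandwidth), 3))
```

Under these assumptions, short vectors are dominated by launch latency, while vectors with millions of entries are dominated by memory traffic, consistent with the expectation of being bandwidth bound.
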
Speedup equals the ratio of memory bandwidths
[Plots: 4 iterations, m = 4; 16 iterations, m = 16]
- Tested on two machines:
  - Tesla K40m + dual 8-core Xeon: bandwidth ratio of 5.6x
  - GTX core Xeon: bandwidth ratio of 9x
- Dummy test problem
  - 4 and 16 iterations, m = 4, 8, 16
- Processor utilization extremely low
Higher bandwidth of GPU gives higher performance, but less than 50% peak

Classical Gram-Schmidt allows more concurrency and data reuse
- Can be implemented with BLAS Level-2 operations
- Instability countered using reorthogonalization
  - As long as the matrix is full rank, only one reorthogonalization is needed
  - Various cheap tests to determine whether the correction is needed

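Classical Gram-Schmidt with one reorthogonalization pass can be sketched as below. This is an illustration only: it always performs the second pass instead of applying one of the cheap tests mentioned above, it assumes the input has numerically full column rank, and the name `cgs2` is hypothetical. The two inner steps correspond to the BLAS Level-2 operations h = Q^T v and v = v - Q h.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cgs2(cols):
    """QR by classical Gram-Schmidt with one reorthogonalization pass.

    cols is a list of columns; returns Q as a list of orthonormal columns and
    R as a list of rows.
    """
    n = len(cols)
    Q, R = [], [[0.0] * n for _ in range(n)]
    for j, a in enumerate(cols):
        v = list(a)
        for _ in range(2):                 # second pass is the reorthogonalization
            # All coefficients use the same v, so this is one product h = Q^T v.
            h = [dot(q, v) for q in Q]
            # v = v - Q h, also a single matrix-vector product.
            v = [vk - sum(hi * q[k] for hi, q in zip(h, Q))
                 for k, vk in enumerate(v)]
            for i, hi in enumerate(h):
                R[i][j] += hi              # accumulate corrections into R
        R[j][j] = math.sqrt(dot(v, v))
        Q.append([vk / R[j][j] for vk in v])
    return Q, R
```

Because every coefficient in a pass is computed from the same vector, the dot products are independent of each other, unlike in MGS where each projection must finish before the next begins.
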
Transposed multiply
[Figure: thread blocks (Block 1, Block 2) whose threads hold the cached vector while sweeping the columns]
- Vector is loaded only once over m columns
- Coalesced data access maximizes bandwidth
- Bandwidth slightly improved over Ddot

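The difference in access pattern between m separate dot products and a single transposed multiply can be sketched on the CPU. The code below only illustrates the order in which data is touched; it says nothing about GPU performance, and the function names are hypothetical.

```python
def separate_dots(Q, v):
    """m independent dot products: v is traversed once per column of Q."""
    return [sum(qk * vk for qk, vk in zip(q, v)) for q in Q]

def fused_transposed_multiply(Q, v):
    """h = Q^T v in one sweep: each v[k] is loaded once and reused for all m columns."""
    h = [0.0] * len(Q)
    for k, vk in enumerate(v):         # single pass over the long vector
        for j, q in enumerate(Q):      # reuse vk against every column
            h[j] += q[k] * vk
    return h

Q = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.0, -1.0, 2.0], [2.0, 1.0, 0.0, -1.0]]
v = [1.0, -1.0, 2.0, 0.5]
print(separate_dots(Q, v))
print(fused_transposed_multiply(Q, v))
```

Both functions return the same coefficients. The fused form reads the vector once instead of m times, so the traffic for the vector itself drops by a factor of m while the traffic for the columns is unchanged.
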
Conclusions
- MPI cost is not a bottleneck for the QR implementation of Anderson
- On GPUs, kernels are completely memory bound
  - High bandwidth of the device means high performance relative to the CPU
- CGS can minimize redundant loads of vectors
  - Need to test whether reorthogonalization ruins the benefit
- Kernels still need to be refined
- Having Anderson close to the function evaluation is likely the real benefit

The QR update is done in-place
- New vectors added using modified Gram-Schmidt
  - Implemented using dot products
  - Communication from MPI_Allreduce
- Oldest vector removed from QR through Givens rotations
  - No MPI communication, but non-trivial local cost

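Both update operations can be sketched serially as follows. This is a minimal illustration, not the SUNDIALS code: Q is stored as a list of columns and R as a list of rows, the function names are hypothetical, and the factorization is assumed to have full column rank. In an MPI-parallel version each `dot` in the add step would require an MPI_Allreduce, while the delete step touches only R and local pieces of Q.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def qr_add_column(Q, R, a):
    """Append column a to the factorization A = Q R in place, using MGS."""
    v = list(a)
    coeffs = []
    for q in Q:                            # project out one column at a time
        r = dot(q, v)
        v = [vk - r * qk for vk, qk in zip(v, q)]
        coeffs.append(r)
    nrm = math.sqrt(dot(v, v))
    for row, r in zip(R, coeffs):
        row.append(r)                      # new last column of R
    R.append([0.0] * len(Q) + [nrm])       # new last row of R
    Q.append([vk / nrm for vk in v])

def qr_delete_first(Q, R):
    """Remove the first (oldest) column of A = Q R in place, using Givens rotations."""
    m = len(R)
    for row in R:
        del row[0]                         # R is now m x (m-1) upper Hessenberg
    for i in range(m - 1):
        a, b = R[i][i], R[i + 1][i]
        r = math.hypot(a, b)
        c, s = a / r, b / r
        for j in range(i, m - 1):          # rotate rows i, i+1 of R; zeroes R[i+1][i]
            t = c * R[i][j] + s * R[i + 1][j]
            R[i + 1][j] = -s * R[i][j] + c * R[i + 1][j]
            R[i][j] = t
        for k in range(len(Q[i])):         # same rotation on columns i, i+1 of Q
            t = c * Q[i][k] + s * Q[i + 1][k]
            Q[i + 1][k] = -s * Q[i][k] + c * Q[i + 1][k]
            Q[i][k] = t
    R.pop()                                # last row of R is now zero
    Q.pop()                                # matching column of Q is no longer needed
```

The rotations applied to Q sweep every stored column once, which is the non-trivial local cost noted above.
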
Open questions
- How often is reorthogonalization needed?

Restarting not compelling versus “sliding”
- 2D Poisson using RAS
  - 4 subdomains per direction
  - 3 cells overlap
  - 16 processors
- Non-blocked linear algebra