
Considerations on efficient implementation of Anderson acceleration on parallel architectures
ICERM workshop on methods for large-scale nonlinear problems
J. Loffeld and C.S. Woodward, 9/3/2015
LLNL-PRES-668437. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

We want to implement Anderson acceleration efficiently in SUNDIALS
- SUNDIALS: SUite of Nonlinear and DIfferential/ALgebraic Solvers
  - Kernels are implemented on top of an abstract vector library in SIMD fashion
  - Supports serial, thread-parallel, and MPI-parallel vectors
  - Work is starting on accelerators such as GPUs and the Intel Xeon Phi
- Currently being tested on two large-scale applications
  - ParaDiS – parallel dislocation dynamics
  - Ardra – deterministic neutron transport
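To make "kernels on top of an abstract vector library" concrete, here is a minimal sketch of one such kernel (an MGS orthogonalization step) written against the standard SUNDIALS N_Vector operations N_VDotProd, N_VLinearSum, and N_VScale. The helper mgs_append and its argument layout are illustrative, not part of SUNDIALS, and realtype is assumed to be double.

```c
/* Sketch: orthogonalize v against the m existing columns of Q and append it
 * as column m, using only generic N_Vector operations.  Q is assumed to be
 * an array of at least m+1 allocated N_Vectors of the same length as v. */
#include <sundials/sundials_nvector.h>
#include <math.h>

static void mgs_append(N_Vector *Q, double *R, int m, N_Vector v)
{
  for (int j = 0; j < m; j++) {
    R[j] = N_VDotProd(Q[j], v);            /* r_j = <q_j, v>   (global reduction) */
    N_VLinearSum(1.0, v, -R[j], Q[j], v);  /* v   = v - r_j q_j (axpy)            */
  }
  R[m] = sqrt(N_VDotProd(v, v));           /* new diagonal entry ||v||_2          */
  N_VScale(1.0 / R[m], v, Q[m]);           /* q_m = v / ||v||_2                   */
}
```

Because every operation goes through the N_Vector interface, the same loop runs unchanged on serial, threaded, and MPI-parallel vectors.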

Quick review of Anderson acceleration
- The least-squares problem is solved through a QR factorization
- New vectors are added to the QR factorization using modified Gram-Schmidt (MGS)
- The oldest vectors are removed using Givens rotations
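For reference, the least-squares problem in question is the one from the Walker-Ni (2011) formulation of Anderson acceleration that the later slides build on (undamped form shown; $f_i = g(x_i) - x_i$ is the fixed-point residual and $m_k = \min(m, k)$ is the acceleration depth):

$$
\gamma^{(k)} = \arg\min_{\gamma \in \mathbb{R}^{m_k}} \left\| f_k - \mathcal{F}_k \gamma \right\|_2,
\qquad
\mathcal{F}_k = \left[\, \Delta f_{k-m_k} \;\cdots\; \Delta f_{k-1} \,\right],
\quad \Delta f_i = f_{i+1} - f_i,
$$
$$
x_{k+1} = g(x_k) - \mathcal{G}_k \gamma^{(k)},
\qquad
\mathcal{G}_k = \left[\, \Delta g_{k-m_k} \;\cdots\; \Delta g_{k-1} \,\right],
\quad \Delta g_i = g(x_{i+1}) - g(x_i).
$$

$\mathcal{F}_k$ is the tall-and-skinny matrix whose QR factorization is updated in place: MGS appends the newest $\Delta f$ column, and Givens rotations drop the oldest one.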

Lawrence Livermore National Laboratory LLNL-PRES  Only matters if cost of the function evaluation does not dominate  When QR problem is tall and skinny but not too skinny m > 1 # Unknowns per processor is high enough to exceed cache Machine balances now exceeding 100 flops/double # of processors is high enough for communication to matter  Two costs: Communication Local BLAS Level 1 operations An “efficient” implementation of AA is about minimizing the cost of QR

Data reuse is key to efficiency on modern architectures
- BLAS level-1 ops (vector-vector, e.g., axpy) have no data reuse
- BLAS level-2 ops (matrix-vector, e.g., gemv) have some reuse
- BLAS level-3 ops (matrix-matrix, e.g., gemm) have lots of reuse
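A sketch that makes the contrast concrete, using the cuBLAS calls that come up later in the talk. The flops-to-bytes figures in the comments are the standard double-precision counts; the function itself is illustrative and assumes the data is already resident on the GPU.

```c
/* Sketch: the three BLAS levels and their approximate arithmetic intensity
 * (double precision), assuming x, y, A, B, C already live in device memory. */
#include <cublas_v2.h>

void blas_levels_demo(cublasHandle_t h, int n,
                      double *x, double *y,            /* vectors, length n  */
                      double *A, double *B, double *C) /* n x n matrices     */
{
  const double one = 1.0;

  /* Level 1: y = 1.0*x + y.  ~2n flops, ~3n doubles moved -> ~0.08 flops/byte. */
  cublasDaxpy(h, n, &one, x, 1, y, 1);

  /* Level 2: y = A*x + y.  ~2n^2 flops, ~n^2 doubles moved -> ~0.25 flops/byte. */
  cublasDgemv(h, CUBLAS_OP_N, n, n, &one, A, n, x, 1, &one, y, 1);

  /* Level 3: C = A*B + C.  ~2n^3 flops, ~4n^2 doubles moved -> ~n/16 flops/byte;
   * reuse grows with n, which is why blocked (level-3) algorithms pay off. */
  cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &one, A, n, B, n, &one, C, n);
}
```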

TSQR is an example of a blocked QR algorithm
[Figure: TSQR reduction tree — each block A_i is factored locally (A_i = Q_i R_i), and the small R factors are combined pairwise, re-factoring at each level, until a single R remains]
- Local QR factorizations are computed with LAPACK
- Communication avoiding
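A sketch of the TSQR reduction that computes only the global R factor, assuming column-major storage, local blocks with at least n rows, and that the Q factors do not need to be accumulated. tsqr_R is an illustrative helper, not the Solomonik et al. code used for the measurements later in the talk.

```c
/* Sketch of TSQR over MPI: local QR with LAPACK, then a binomial tree that
 * combines pairs of n x n R factors (log2(p) messages of n*n doubles each).
 * A_local (local_m x n, column-major, local_m >= n) is overwritten.
 * On return, rank 0 holds the global R factor in R (n x n, column-major). */
#include <mpi.h>
#include <lapacke.h>
#include <stdlib.h>
#include <string.h>

void tsqr_R(double *A_local, int local_m, int n, double *R, MPI_Comm comm)
{
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  /* Local QR of the tall block; its upper triangle is the local R. */
  double *tau = malloc(n * sizeof(double));
  LAPACKE_dgeqrf(LAPACK_COL_MAJOR, local_m, n, A_local, local_m, tau);
  for (int j = 0; j < n; j++)
    for (int i = 0; i < n; i++)
      R[i + j * n] = (i <= j) ? A_local[i + j * local_m] : 0.0;

  /* Binomial reduction tree over the R factors. */
  double *Rrecv   = malloc(n * n * sizeof(double));
  double *stacked = malloc(2 * n * n * sizeof(double));
  for (int dist = 1; dist < size; dist *= 2) {
    if (rank % (2 * dist) != 0) {                 /* sender: ship R up the tree  */
      MPI_Send(R, n * n, MPI_DOUBLE, rank - dist, 0, comm);
      break;
    } else if (rank + dist < size) {              /* receiver: combine two R's   */
      MPI_Recv(Rrecv, n * n, MPI_DOUBLE, rank + dist, 0, comm, MPI_STATUS_IGNORE);
      for (int j = 0; j < n; j++) {               /* stack [R; Rrecv], ld = 2n   */
        memcpy(stacked +     j * 2 * n, R     + j * n, n * sizeof(double));
        memcpy(stacked + n + j * 2 * n, Rrecv + j * n, n * sizeof(double));
      }
      LAPACKE_dgeqrf(LAPACK_COL_MAJOR, 2 * n, n, stacked, 2 * n, tau);
      for (int j = 0; j < n; j++)                 /* new R = upper triangle      */
        for (int i = 0; i < n; i++)
          R[i + j * n] = (i <= j) ? stacked[i + j * 2 * n] : 0.0;
    }
  }
  free(tau); free(Rrecv); free(stacked);
}
```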

Lawrence Livermore National Laboratory LLNL-PRES  Blocked QR factorization would mean acceleration only applied after every k iterations Two problems from Walker-Ni (2011):  1D Convection-Diffusion using restricted additive Schwarz 1024 unknowns, m = 5  Expectation-maximization 100,000 unknowns, m = 4 Blocking can have a modest impact on convergence Blocking compromises acceleration

The MPI cost of Anderson acceleration is low
- Test problem to measure MPI and local costs: dummy function evaluation, 16 iterations, m = 16
- Vulcan – IBM BG/Q: 5D torus, optimized reductions; 8000 processors give similar results
- Cab – x86 cluster: two-stage federated fat tree

Blocked QR reduces the local cost
- TSQR implementation by Solomonik et al.
- Matrix is 4 columns wide

Nodes now include accelerators
[Figure: node diagram — a multicore CPU with its own memory connected over PCI-E to an accelerator; the accelerator has its own global memory, a scheduler, and many cores, each with a small local memory]

The GPU programming model is similar to that of vector machines
- SIMD programming model
- Unit-stride memory access is highly preferred
- Very high concurrency is needed to hide memory latency
- Local memories are small
[Figure: threads 1, 2, 3, ... each operating on one element of a vector addition]
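A minimal CUDA sketch of the preferred access pattern: consecutive threads touch consecutive elements, so each warp's loads and stores coalesce, and the grid-stride loop supplies the concurrency needed to hide memory latency. The kernel is illustrative (an axpy-style update), not code from the talk.

```cuda
// Unit-stride (coalesced) access: thread i reads and writes element i.
__global__ void axpy_coalesced(int n, double a, const double *x, double *y)
{
  // Grid-stride loop: keeps many threads in flight to hide memory latency.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    y[i] += a * x[i];
  }
}
```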

Initial GPU implementation using cuBLAS
- Assumes the function evaluation is done on the GPU
- Maintains the approach of Walker (2011): modified Gram-Schmidt based QR factorization
- cuBLAS-based implementation
  - Main loop on the CPU; R matrix on the CPU, Q matrix on the GPU
  - ~10 µs latency per kernel launch
  - Expected to be highly bandwidth bound
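A sketch of what appending one column with MGS looks like under this split (host loop, R on the CPU, Q on the GPU), using only standard cuBLAS level-1 calls; the helper name and argument layout are illustrative. Each call is a separate kernel launch, and the dot products synchronize because the result is returned to a host pointer, which is where the launch latency and the bandwidth-bound level-1 traffic show up.

```c
/* Sketch: append device vector d_v as column m of d_Q (column-major, leading
 * dimension n, at least m+1 columns allocated), keeping R on the host
 * (column-major, leading dimension ldr, default CUBLAS_POINTER_MODE_HOST). */
#include <cublas_v2.h>

void mgs_append_cublas(cublasHandle_t h, int n, int m,
                       double *d_Q, double *d_v, double *R_host, int ldr)
{
  for (int j = 0; j < m; j++) {
    double r;
    cublasDdot(h, n, d_Q + (size_t)j * n, 1, d_v, 1, &r);       /* r = q_j . v */
    R_host[j + m * ldr] = r;
    double neg_r = -r;
    cublasDaxpy(h, n, &neg_r, d_Q + (size_t)j * n, 1, d_v, 1);  /* v -= r q_j  */
  }
  double nrm, inv;
  cublasDnrm2(h, n, d_v, 1, &nrm);                              /* R(m,m)      */
  R_host[m + m * ldr] = nrm;
  inv = 1.0 / nrm;
  cublasDscal(h, n, &inv, d_v, 1);                              /* v /= ||v||  */
  cublasDcopy(h, n, d_v, 1, d_Q + (size_t)m * n, 1);            /* q_m = v     */
}
```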

Speedup equals the ratio of memory bandwidths
[Plots: runtimes for 4 iterations with m = 4 and 16 iterations with m = 16]
- Tested on two machines:
  - Tesla K40m + dual 8-core Xeon — bandwidth ratio of 5.6x
  - GTX-class GPU + Xeon — bandwidth ratio of 9x
- Dummy test problem: 4 and 16 iterations; m = 4, 8, 16
- Processor utilization is extremely low
- The higher bandwidth of the GPU gives higher performance, but less than 50% of peak

Classical Gram-Schmidt allows more concurrency and data reuse
- Can be implemented with BLAS level-2 operations (see the sketch below)
- Its instability is countered using reorthogonalization
- As long as the basis is full rank, only one reorthogonalization is needed
- Various cheap tests can determine whether the correction is needed
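A sketch of the level-2 formulation with one unconditional reorthogonalization pass ("CGS2"), again using cuBLAS. The function name and the always-reorthogonalize structure are illustrative: the cheap tests mentioned on the slide, and the bookkeeping that accumulates the coefficients from both passes into R, are omitted for brevity.

```c
/* Sketch: classical Gram-Schmidt with one reorthogonalization pass.  Both the
 * projection s = Q^T v and the update v -= Q s are single level-2 calls, so Q
 * is streamed through memory twice per pass instead of once per column. */
#include <cublas_v2.h>

void cgs2_orthogonalize(cublasHandle_t h, int n, int m,
                        const double *d_Q,  /* n x m, column-major, ld = n */
                        double *d_v,        /* vector being orthogonalized */
                        double *d_s)        /* workspace, length m         */
{
  const double one = 1.0, zero = 0.0, minus_one = -1.0;
  for (int pass = 0; pass < 2; pass++) {   /* pass 1: CGS, pass 2: correction */
    /* s = Q^T v : all m coefficients from one transposed gemv */
    cublasDgemv(h, CUBLAS_OP_T, n, m, &one, d_Q, n, d_v, 1, &zero, d_s, 1);
    /* v = v - Q s */
    cublasDgemv(h, CUBLAS_OP_N, n, m, &minus_one, d_Q, n, d_s, 1, &one, d_v, 1);
  }
}
```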

Transposed multiply
[Figure: thread blocks 1 and 2 each sweep a contiguous range of rows of Q, with the corresponding piece of the vector cached]
- The vector is loaded only once across all m columns
- Coalesced data access maximizes bandwidth
- Bandwidth is slightly improved over Ddot
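A sketch of a fused kernel in this spirit for s = Q^T v: each thread keeps its element of v in a register and reuses it against all m columns, and all global accesses are unit-stride within a warp, hence coalesced. The kernel name, the MAX_M bound, the grid-stride loop, and the two-stage reduction are all illustrative choices, not the implementation behind the slide.

```cuda
// Fused transposed multiply s = Q^T v.  Q is n x m, column-major, ld = n.
// Launch as: qtv_fused<<<blocks, threads, threads * sizeof(double)>>>(...);
// threads must be a power of two; a tiny follow-up kernel (or host loop) sums
// the per-block partial results into the final s of length m.
#include <cuda_runtime.h>

#define MAX_M 16   // assumed upper bound on the number of columns

__global__ void qtv_fused(int n, int m, const double *Q, const double *v,
                          double *s_partial /* gridDim.x * m entries */)
{
  extern __shared__ double sdata[];            // blockDim.x doubles at launch
  double acc[MAX_M];
  for (int j = 0; j < m; j++) acc[j] = 0.0;

  // Grid-stride loop: v[i] is read once and reused for every column of Q.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    double vi = v[i];
    for (int j = 0; j < m; j++)
      acc[j] += Q[i + (size_t)j * n] * vi;
  }

  // Block-level tree reduction, one column at a time.
  for (int j = 0; j < m; j++) {
    sdata[threadIdx.x] = acc[j];
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
      if (threadIdx.x < stride) sdata[threadIdx.x] += sdata[threadIdx.x + stride];
      __syncthreads();
    }
    if (threadIdx.x == 0) s_partial[(size_t)blockIdx.x * m + j] = sdata[0];
    __syncthreads();                           // before sdata is reused
  }
}
```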

Conclusions
- MPI cost is not a bottleneck for the QR implementation of Anderson acceleration
- On GPUs, the kernels are completely memory bound
  - The high bandwidth of the device means high performance relative to the CPU
  - CGS can minimize redundant loads of the vectors
  - Need to test whether reorthogonalization ruins that benefit
  - The kernels still need to be refined
- Having Anderson acceleration close to the function evaluation is likely the real benefit

The QR update is done in place
- New vectors are added using modified Gram-Schmidt
  - Implemented using dot products
  - Communication comes from MPI_Allreduce
- The oldest vector is removed from the QR factorization through Givens rotations (sketched below)
  - No MPI communication, but a non-trivial local cost
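A minimal sketch of that Givens-rotation downdate, assuming Q (n x m) and R (m x m, upper triangular) are stored column-major; qr_delete_oldest is an illustrative helper, not the SUNDIALS code. Everything here is processor-local: the column rotations applied to Q are the non-trivial O(n*m) cost the slide refers to, and no MPI is involved.

```c
/* Sketch: delete the oldest (first) column of an in-place QR factorization.
 * After the call, the leading m-1 columns of Q and the leading (m-1)x(m-1)
 * block of R hold the QR factorization of the matrix with column 0 dropped. */
#include <math.h>

static void qr_delete_oldest(double *Q, int n, double *R, int ldr, int m)
{
  /* Shift columns of R left by one: R now holds an upper Hessenberg matrix. */
  for (int j = 0; j < m - 1; j++)
    for (int i = 0; i <= j + 1; i++)
      R[i + j * ldr] = R[i + (j + 1) * ldr];

  /* Restore triangular form with Givens rotations; apply each one to Q too. */
  for (int i = 0; i < m - 1; i++) {
    double a = R[i + i * ldr], b = R[(i + 1) + i * ldr];
    double r = hypot(a, b);              /* assumes r != 0; guard in practice */
    double c = a / r, s = b / r;
    for (int j = i; j < m - 1; j++) {    /* rotate rows i, i+1 of R           */
      double t1 = R[i + j * ldr], t2 = R[(i + 1) + j * ldr];
      R[i + j * ldr]       =  c * t1 + s * t2;
      R[(i + 1) + j * ldr] = -s * t1 + c * t2;
    }
    for (int k = 0; k < n; k++) {        /* rotate columns i, i+1 of Q: O(n)  */
      double t1 = Q[k + i * n], t2 = Q[k + (i + 1) * n];
      Q[k + i * n]       =  c * t1 + s * t2;
      Q[k + (i + 1) * n] = -s * t1 + c * t2;
    }
  }
}
```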

Open questions
- How often is reorthogonalization needed?

Restarting is not compelling versus “sliding”
- 2D Poisson using restricted additive Schwarz (RAS): 4 subdomains per direction, 3 cells of overlap
- 16 processors
- Non-blocked linear algebra