The Library Approach to GPU Computations of Initial Value Problems
Dave Yuen, University of Minnesota, U.S.A.
with Larry Hanyk and Radek Matyska, Charles University, Prague, Czech Republic

NVIDIA GPUs
CUDA: Compute Unified Device Architecture
- the most popular GPGPU parallel programming model today
- spans notebooks and personal desktops up to high-performance computing
- the GPU runs a kernel concurrently on many cores, by many threads
- two-level hardware parallelism on a GPU: SIMD and MIMD
- the programming model reflects the hardware parallelism: thread blocks and grids
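To make the blocks-and-grids model concrete, here is a minimal CUDA Fortran sketch (the dialect treated below): a saxpy kernel launched over a one-dimensional grid of thread blocks. The module and program names and the sizes are illustrative, not taken from the slides.

    module mathOps
    contains
      attributes(global) subroutine saxpy(x, y, a)
        ! each thread of the grid updates one array element
        real :: x(:), y(:)
        real, value :: a
        integer :: i
        i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
        if (i <= size(x)) y(i) = y(i) + a * x(i)
      end subroutine saxpy
    end module mathOps

    program test_saxpy
      use cudafor
      use mathOps
      integer, parameter :: n = 1024*1024
      real :: x(n), y(n)
      real, device :: x_d(n), y_d(n)   ! arrays in GPU memory
      x = 1.0; y = 2.0
      x_d = x; y_d = y                 ! host-to-device copies by assignment
      ! launch (n+255)/256 blocks of 256 threads each
      call saxpy<<<(n+255)/256, 256>>>(x_d, y_d, 2.0)
      y = y_d                          ! device-to-host copy
      print *, 'y(1) =', y(1)          ! expect 4.0
    end program test_saxpy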

NVIDIA GPUs: Architectures and compute capability
- G80 (since 2006): compute capability 1.0, 1.1
  features of 1.1: 8 cores/multiprocessor, single-precision real arithmetic
- GT200 (since 2008): compute capability 1.2, 1.3
  features of 1.3: 8 cores/MP, max. 30 x 8 = 240 cores, first double precision
- Fermi (since 2010): compute capability 2.0, 2.1
  features of 2.0: 32 cores/MP, max. 16 x 32 = 512 cores, fast double precision, cache
  features of 2.1: 48 cores/MP, max. 8 x 48 = 384 cores, slower double precision
- Kepler (since 2012): compute capability 3.0
  features of 3.0: 192 cores/MP, max. 8 x 192 = 1536 cores, slower double precision
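The compute capability and multiprocessor count can be queried at run time. A minimal sketch with the CUDA Fortran cudafor module (the fields mirror the CUDA cudaDeviceProp structure):

    program devices
      use cudafor
      type(cudaDeviceProp) :: prop
      integer :: istat, ndev, i
      istat = cudaGetDeviceCount(ndev)
      do i = 0, ndev - 1
        istat = cudaGetDeviceProperties(prop, i)
        print "(a,i0,2a)", "device ", i, ": ", trim(prop%name)
        print "(a,i0,a,i0)", "  compute capability ", prop%major, ".", prop%minor
        print "(a,i0)", "  multiprocessors: ", prop%multiProcessorCount
      end do
    end program devices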

NVIDIA GPUs: Product lines (classes)
- GeForce: for games and PC graphics, most often found in desktops and notebooks
- Quadro: for professional graphics
- Tesla: for high-performance computing (CC 1.3, 2.0)

NVIDIA GPUs: GeForce class and GPU computing
GeForce for desktops
- fast double precision (throughput lower than on Tesla)
- compute capability 2.0 is most appropriate: in DP, better than 2.1 and 3.0
- up to 2 x 16 multiprocessors, 32 x 32 = 1024 cores (e.g., GTX 590)
GeForce for notebooks
- most often equipped with compute capability 2.1
- up to 8 multiprocessors, 8 x 48 = 384 cores (e.g., GTX 675M)

CUDA Developer Tools: Library approach
Languages vs. libraries:
1. Languages: the intrinsic difficulty of learning the low-level CUDA model; NVIDIA toolkit, PGI compiler suite (Fortran and C), OpenCL
2. Directives: a higher-level, OpenMP-like model; PGI and Cray directives, standardized OpenACC directives
3. Libraries: fundamental libraries by NVIDIA (CUBLAS, CUFFT, ...); libraries for linear algebra (MAGMA, CULA Tools, ...); the MATLAB approach (Jacket, Parallel Computing Toolbox, ...); GPU support in NAG and IMSL
Overview: directives and libraries simplify the programmer's effort and bring great benefits without requiring CUDA expertise.

CUDA Developer Tools: Fundamental libraries by NVIDIA
CUDA Toolkit
- C/C++ low-level compiler (nvcc) and API library
- debugger, profiler
- numerical libraries:
  CUBLAS (accelerated BLAS)
  CUSPARSE (accelerated sparse BLAS)
  CUFFT (accelerated FFT in 1-2-3D)
  CURAND (accelerated random-number generators)
  CUDA Math, Performance Primitives, Thrust, ...
- current version: 4.2 (incl. Kepler support), coming soon: 5.0
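A sketch of the "drop-in" flavor of these libraries: with the cublas module shipped with PGI CUDA Fortran, a standard BLAS call operates on device arrays directly. The argument list follows the usual sgemm convention; the module interface is assumed here as documented by PGI, and the sizes are illustrative.

    program gemm_gpu
      use cudafor
      use cublas                       ! PGI interface module to CUBLAS
      integer, parameter :: n = 512
      real :: a(n,n), b(n,n), c(n,n)
      real, device :: a_d(n,n), b_d(n,n), c_d(n,n)
      call random_number(a); call random_number(b)
      a_d = a; b_d = b; c_d = 0.0      ! host-to-device copies
      ! same arguments as the standard BLAS sgemm, executed by CUBLAS on the GPU
      call cublasSgemm('N', 'N', n, n, n, 1.0, a_d, n, b_d, n, 0.0, c_d, n)
      c = c_d                          ! device-to-host copy
      print *, 'c(1,1) =', c(1,1)
    end program gemm_gpu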

CUDA Developer Tools: OpenCL and MATLAB tools
OpenCL (Open Computing Language)
- a C-based language
Jacket and ArrayFire
- Jacket: a GPU platform for MATLAB ("better than the Parallel Computing Toolbox" by MathWorks)
- ArrayFire: a GPU library for C, C++, Fortran and Python

CUDA Developer Tools: Linear algebra libraries
- MAGMA (Matrix Algebra on GPU and Multicore Architectures): a hybrid library for dense linear algebra (LAPACK) on multicore CPUs and GPUs
- CULA Tools (CUDA Linear Algebra):
  CULA Dense: dense solvers (LU, QR, LSQ, EV, SVD)
  CULA Sparse: sparse solvers (CG, BCG, GMRES, preconditioners)
- FLAME, SuperLU, Iterative CUDA and more

CUDA Developer Tools: Languages and directives by the Portland Group
PGI compiler suite
- PGI Workstation, Server, CUDA Fortran, CUDA-x86, Accelerator
- current version 12.5 (5th minor version in 2012, 10 versions in 2011)
- for Linux, Mac OS X and Microsoft Windows, 32/64 bit
- with OpenMP, MPI, a parallel debugger and a profiler
- with ACML and IMSL libraries, linkable with Intel MKL
- with the Eclipse IDE (Linux) or Microsoft Visual Studio (Windows)

CUDA Developer Tools: Languages and directives by the Portland Group
PGI compilers
- CUDA Fortran and C extensions
- CUDA-x86: emulation of CUDA Fortran/C codes on multicore CPUs
- NVIDIA compiler and libraries included
Directives
- PGI Accelerator: OpenMP-like prototype directives
- OpenACC directives: a new (2012) standardized specification by NVIDIA, PGI, Cray and CAPS
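Typical compile lines for the two models, with flags as documented for the PGI 12.x compilers (file names illustrative):

    pgfortran -Mcuda prog.cuf              # CUDA Fortran (.cuf implies -Mcuda)
    pgfortran -acc -Minfo=accel prog.f90   # OpenACC directives, with compiler feedback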

CUDA Developer Tools: OpenACC directives
OpenACC example
CPU serial code:

    do i=1,imax
      a(i)=a(i)+z
    enddo

GPU accelerated code:

    !$ACC KERNELS
    !$ACC LOOP
    do i=1,imax
      a(i)=a(i)+z
    enddo
    !$ACC END KERNELS

For finer control, the LOOP directive takes scheduling clauses: GANG (~ grid), WORKER (~ warp), VECTOR (~ thread) or SEQ (sequential run), e.g.

    !$ACC LOOP GANG VECTOR
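Put together, a complete compilable version of the fragment might read as follows (array size and values are illustrative); with the KERNELS construct the compiler manages the data movement of a and z automatically:

    program acc_demo
      implicit none
      integer, parameter :: imax = 1000000
      integer :: i
      real :: a(imax), z
      a = 1.0
      z = 2.0
    !$ACC KERNELS
    !$ACC LOOP
      do i = 1, imax
        a(i) = a(i) + z
      enddo
    !$ACC END KERNELS
      print *, 'a(1) =', a(1)    ! expect 3.0
    end program acc_demo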

A Library GPU Approach to Initial Value Problems
L. Hanyk, D. Yuen and C. Matyska, for Lecture Notes in Earth Sciences, Springer Verlag
Chapter 1: Basic framework
Chapter 2: Introduction to GPU
Chapter 3: PGI CUDA Fortran
Chapter 4: OpenACC and PGI Accelerator directives
Chapter 5: Libraries for linear algebra and FFT
Chapter 6: Elliptic PDEs, FFT-based solvers, eigenvalue problems
Chapter 7: Initial value problems for ODEs
Chapter 8: Method of lines
Chapter 9: Initial value problems for diffusion PDEs
Chapter 10: Initial value problems for advection-diffusion PDEs
Chapter 11: Initial value problems for hyperbolic PDEs
Chapter 12: Future perspectives of GPUs in geosciences

A Library GPU Approach to Initial Value Problems: Also included
Introductory examples
- CUDA templates, Mandelbrot set, Lyapunov fractals
Linear algebra and static solvers
- linear algebra and FFT with CPU libraries and GPU libraries
- classical and fast Poisson solvers
- eigenmodes of a 1D string and a 2D drum
Initial value solvers
- ODEs: Lorenz attractor by Runge-Kutta methods and BDFs
- method of lines and implicit ODE solvers
- heat equation by Crank-Nicolson and ADI schemes
- Stokes problem
- wave equation
Multi-GPU solutions
- approaches by OpenMP and MPI
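To give a flavor of the initial value solvers, here is a plain Fortran sketch of a classical fourth-order Runge-Kutta integration of the Lorenz system; it is illustrative only and not taken from the book (the parameters are the classical Lorenz values):

    program lorenz_rk4
      implicit none
      real, parameter :: sigma = 10.0, rho = 28.0, beta = 8.0/3.0
      real, parameter :: h = 0.01          ! time step
      real :: y(3), k1(3), k2(3), k3(3), k4(3)
      integer :: n
      y = (/ 1.0, 1.0, 1.0 /)              ! initial condition
      do n = 1, 5000
        k1 = f(y)
        k2 = f(y + 0.5*h*k1)
        k3 = f(y + 0.5*h*k2)
        k4 = f(y + h*k3)
        y = y + h/6.0 * (k1 + 2.0*k2 + 2.0*k3 + k4)
      end do
      print *, 'state after 5000 steps:', y
    contains
      function f(u) result(du)             ! Lorenz right-hand side
        real, intent(in) :: u(3)
        real :: du(3)
        du(1) = sigma * (u(2) - u(1))
        du(2) = u(1) * (rho - u(3)) - u(2)
        du(3) = u(1) * u(2) - beta * u(3)
      end function f
    end program lorenz_rk4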

Conclusions
Languages
- high expertise required
- fast code possible
Directives
- low workload on users
- demanding implementation (work in progress)
Libraries
- highly optimized building blocks, most often for linear algebra and FFT
- general algorithms such as ODE and PDE solvers, built from standard building blocks, pay off best in the hands of both experts and regular users