Case Study: Accelerating Full Waveform Inversion via OpenCL™ on AMD GPUs
©2014 Acceleware Ltd. All rights reserved.
Chris Mason, Acceleware Product Manager
March 5, 2014

About Acceleware
• Software and services company specializing in HPC product development, developer training and consulting services
• OpenCL training for AMD GPUs
  – Progressive lectures and hands-on lab exercises
  – Experienced instructors
  – Delivered worldwide
• High performance consulting
  – Feasibility studies
  – Porting and optimization
  – Code commercialization

Outline
• What is Full Waveform Inversion?
• The Project
• OpenCL
• Optimizations
  – Coalescing
  – Iterative kernel for stencil operations
  – Fusing kernels together to eliminate redundant memory accesses
• Key Performance Results

What is Full Waveform Inversion?
• Seismic inversion technique
• Used to build Earth models from recorded seismic data
• Uses a finite-difference solution to the acoustic wave equation
• Computationally expensive

What is FWI?
From a basic starting point to an accurate velocity model

FWI Algorithm
• Initial model
• Estimate residuals (forward propagate: source → residuals)
• Gradient (back propagate: residuals → gradient)
• Step length (forward propagation(s) → step length)
• Update model
• Increase frequency
• Nested loops: over shots, over frequencies, until convergence

FWI Compute Cost
• Cluster size of 10s to 100s of CPU nodes
• Many days of runtime
• Accuracy and quality reduced to keep runtime acceptable

The Project
• GeoTomo develops high-end geophysical software products that help geophysicists around the world image the subsurface
• GeoTomo had a pre-existing, cluster-ready, multi-threaded (OpenMP-based) CPU FWI solution
• GeoTomo needed their FWI application to run faster so they could deliver results to their clients sooner
  – Looked to AMD GPUs to accelerate their FWI and approached Acceleware for help making it happen

Why use GPUs? Performance!

OpenCL Overview
• Parallel computing framework standardized by the Khronos Group
• OpenCL:
  – Is a royalty-free standard
  – Provides an API to coordinate parallel computation across heterogeneous processors
  – Defines a cross-platform programming language
  – Is used on everything from handheld/embedded devices through to supercomputers
• Of interest because heterogeneous devices can significantly accelerate certain (primarily data-parallel) workloads

OpenCL Programming Model
• Heterogeneous model with provisions for a host connected to one or more devices
  – Example devices: GPUs, CPUs
Diagram: Host connected to Device 1 (GPU), Device 2 (GPU), … Device N (GPU)
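To make the host/device relationship concrete, here is a minimal host-side sketch (not from the original deck) that discovers the GPUs on a platform and creates one context spanning them, assuming the standard OpenCL 1.x C API; error checking is omitted.

```c
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    /* Take the first available platform */
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    /* Query how many GPU devices the platform exposes, then fetch them */
    cl_uint numDevices = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &numDevices);

    cl_device_id devices[16];
    if (numDevices > 16) numDevices = 16;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, numDevices, devices, NULL);

    /* A single context can span several devices (Device 1 … Device N) */
    cl_context ctx = clCreateContext(NULL, numDevices, devices, NULL, NULL, NULL);

    printf("Host sees %u GPU device(s)\n", numDevices);
    clReleaseContext(ctx);
    return 0;
}
```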

The OpenCL Programming Model
• Data-parallel portions of an algorithm are executed on the device as kernels
  – Kernels are C functions with some restrictions and a few language extensions
  – Many (parallel) work-items execute the kernel
• The host executes serial code between device kernel launches
  – Memory management
  – Data exchange to/from the device (usually)
  – Error handling
Diagram: on the device, an NDRange divided into work-groups, e.g. Work-Group (0,0) through Work-Group (2,1); the host runs serial code between launches.

OpenCL Memory Model
• OpenCL kernels have access to four distinct memory regions:
  – Global: read/write access from all work-items in all work-groups; persistent across kernels
  – Local: memory shared by the work-items within a work-group
  – Constant: a region of memory that remains constant (read-only) during the execution of a kernel
  – Private: memory that is private to a single work-item
• OpenCL vendors map these memory regions onto physical resources
  – Local/constant/private memory usually has orders of magnitude lower capacity, but is orders of magnitude faster, than global memory
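As an added illustration (not part of the deck), the toy kernel below shows how each of the four regions appears in OpenCL C; the scaling operation and all names are arbitrary.

```c
/* Illustrative kernel: each address-space qualifier used once. */
__constant float kGain = 2.0f;                       /* constant: read-only while the kernel runs */

__kernel void scale_with_tile(__global const float* in,   /* global: visible to all work-items   */
                              __global float* out,
                              __local float* tile)         /* local: shared within one work-group */
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    float x = in[gid];                 /* private: x lives in per-work-item registers */
    tile[lid] = x;                     /* stage the value in fast local memory        */
    barrier(CLK_LOCAL_MEM_FENCE);      /* make it visible to the whole work-group     */

    out[gid] = kGain * tile[lid];
}
```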

OpenCL Syntax – Memory Spaces
• Host and device have separate memory spaces
  – Data is explicitly moved between them, typically over the PCIe bus
• Host functions allocate, copy, and free memory on the device, e.g.:
  – clCreateBuffer()
  – clEnqueueReadBuffer()
  – clEnqueueWriteBuffer()
  – clReleaseMemObject()

Putting It All Together
Operation: C[x] = A[x] + B[x], with one work-item per element (diagram shows vectors A, B and C, elements 0–7).

    __kernel void VectorAdd(__global float* a, __global float* b, __global float* c)
    {
        int idx = get_global_id(0);
        c[idx] = a[idx] + b[idx];
    }

Each work-item has a unique index, typically used to index into arrays.

Vector Add – Host Code

    void VectorAdd(float* aH, float* bH, float* cH, int N)
    {
        int N_BYTES = N * sizeof(float);

        // Device management code
        …

        // Allocate memory on device
        cl_mem aD = clCreateBuffer(…, N_BYTES, …);
        cl_mem bD = clCreateBuffer(…, N_BYTES, …);
        cl_mem cD = clCreateBuffer(…, N_BYTES, …);

        // Transfer input arrays to device
        clEnqueueWriteBuffer(…, aD, …, N_BYTES, aH, …);
        clEnqueueWriteBuffer(…, bD, …, N_BYTES, bH, …);

        // Pass kernel arguments and launch kernel
        …
        clEnqueueNDRangeKernel(…, &N, …);

        // Transfer output array to host
        clEnqueueReadBuffer(…, cD, …, N_BYTES, cH, …);
    }
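The slide elides the actual arguments. One way they might be filled in is sketched below, assuming a context, command queue and built VectorAdd kernel already exist; the blocking transfers and buffer flags are illustrative choices, not GeoTomo's code, and error checking is omitted.

```c
#include <CL/cl.h>

/* Sketch only: ctx, queue and kernel are assumed to be set up elsewhere. */
void VectorAddFull(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                   const float* aH, const float* bH, float* cH, size_t N)
{
    size_t nBytes = N * sizeof(float);

    /* Allocate memory on device */
    cl_mem aD = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  nBytes, NULL, NULL);
    cl_mem bD = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  nBytes, NULL, NULL);
    cl_mem cD = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, nBytes, NULL, NULL);

    /* Transfer input arrays to device (blocking writes for simplicity) */
    clEnqueueWriteBuffer(queue, aD, CL_TRUE, 0, nBytes, aH, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, bD, CL_TRUE, 0, nBytes, bH, 0, NULL, NULL);

    /* Pass kernel arguments and launch one work-item per element */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &aD);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bD);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &cD);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &N, NULL, 0, NULL, NULL);

    /* Transfer output array back to host */
    clEnqueueReadBuffer(queue, cD, CL_TRUE, 0, nBytes, cH, 0, NULL, NULL);

    clReleaseMemObject(aD);
    clReleaseMemObject(bD);
    clReleaseMemObject(cD);
}
```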

Project Steps
• 1) Profiling
  – Acquired code, datasets and reference benchmarks from GeoTomo
  – Set up local machines with near-equivalent hardware, compiled the code and confirmed reference benchmark numbers
  – Augmented the code with timers to determine time spent in parallel regions and identify areas of interest

Project Steps
• 2) Feasibility Analysis
  – Investigated memory footprint for FWI jobs: GPU memory limited to 6 GB per card
  – Investigated potential speedup and time to port the code
  – Maximum speedup determined by time spent in parallel regions (Amdahl's Law)
  – Time to port dependent on feature set, e.g. domain decomposition across multiple GPUs
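For reference (not on the slide), the bound behind the speedup estimate is Amdahl's Law, where p is the fraction of runtime spent in the accelerated parallel regions and s is the speedup achieved on them; the numbers in the comment are purely illustrative.

```latex
S_{\text{overall}} = \frac{1}{(1 - p) + \dfrac{p}{s}}
% e.g. p = 0.9, s = 20  =>  S_overall = 1 / (0.1 + 0.045) ≈ 6.9
```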

Project Steps
• 3) Implementation
  – Creating test harnesses
  – Kernel implementation
  – Resolving hardware driver issues
  – Enabling multi-GPU device support
  – Optimization iterations
• 4) Wrap-up
  – Delivery of the port, along with installation documentation
  – Trained a GeoTomo developer on OpenCL

Key GeoTomo Optimizations
• 1) Coalescing
  – Changing memory access patterns in the kernels to those best suited for GPUs
  – Global memory is accessed via requests for multi-byte words
  – Combine load/store requests from consecutive work-items to reduce the number of requested words (fewer requests → less contention on global memory)
  – Make one big multi-word burst request to global memory whenever possible (contiguous bursts → less global memory overhead)
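An added illustration (not from the deck): in the two hypothetical kernels below, consecutive work-items either read consecutive addresses, letting the hardware combine their loads into a few contiguous bursts, or stride through memory and generate many separate requests.

```c
/* Coalesced: work-item gid touches element gid, so a wavefront reads one contiguous burst. */
__kernel void copy_coalesced(__global const float* in, __global float* out)
{
    int gid = get_global_id(0);
    out[gid] = in[gid];
}

/* Uncoalesced: consecutive work-items are 'stride' elements apart, so their
 * loads fall in different memory segments and cannot be combined. */
__kernel void copy_strided(__global const float* in, __global float* out, int stride, int n)
{
    int gid = get_global_id(0);
    int idx = (gid * stride) % n;   /* scattered access pattern */
    out[gid] = in[idx];
}
```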

Key GeoTomo Optimizations
• 2) Iterative kernel for stencil operations
Diagram: input volumes feed stencil kernels; outputs are weighted combinations of surrounding elements from the input volumes; off-axis weights are zero.
Acknowledgement: Paulius Micikevicius

Key GeoTomo Optimizations
• A naïve implementation would have each work-item read all of its neighboring elements directly from global memory
  – Possible to hit maximum GPU memory bandwidth, but redundant reads hurt performance

Key GeoTomo Optimizations
• Alternative: iterate over 2D slices along the slowest dimension
  – Each work-item is responsible for a column of the output array
  – The work-group caches a 2D plane of the input in local memory
  – Each work-item keeps its inputs along the direction of iteration in registers
  – Significantly reduces the required number of global memory reads
Diagram (single work-item view): values along the iteration direction held in registers; the current plane held in local memory.
Acknowledgement: Paulius Micikevicius
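A simplified sketch of this technique (not GeoTomo's production kernel): a radius-1, 7-point stencil in which each work-item marches along z, keeps its z-neighbours in registers, and shares the current xy-plane through local memory. The coefficient layout, the zero-valued domain boundary, and the assumption that nx and ny are multiples of the work-group dimensions are all illustrative choices.

```c
/* Sketch: the host allocates (local_x + 2) * (local_y + 2) floats of local memory for 'tile'
 * and ensures nx, ny are multiples of the work-group dimensions. */
__kernel void stencil_z_march(__global const float* in, __global float* out,
                              __constant float* coeff,          /* coeff[0], coeff[1] */
                              int nx, int ny, int nz,
                              __local float* tile)
{
    const int x  = get_global_id(0);
    const int y  = get_global_id(1);
    const int lx = get_local_id(0) + 1;                 /* +1 leaves room for the halo */
    const int ly = get_local_id(1) + 1;
    const int tw = get_local_size(0) + 2;               /* tile width including halo   */
    const int plane = nx * ny;

    /* Registers hold the column neighbours along z, the direction of iteration */
    float below  = in[0 * plane + y * nx + x];
    float center = in[1 * plane + y * nx + x];
    float above;

    for (int z = 1; z < nz - 1; ++z) {
        above = in[(z + 1) * plane + y * nx + x];

        /* Stage the current xy-plane in local memory; edge work-items also load the halo */
        tile[ly * tw + lx] = center;
        if (get_local_id(0) == 0)
            tile[ly * tw + lx - 1] = (x > 0) ? in[z * plane + y * nx + (x - 1)] : 0.0f;
        if (get_local_id(0) == get_local_size(0) - 1)
            tile[ly * tw + lx + 1] = (x < nx - 1) ? in[z * plane + y * nx + (x + 1)] : 0.0f;
        if (get_local_id(1) == 0)
            tile[(ly - 1) * tw + lx] = (y > 0) ? in[z * plane + (y - 1) * nx + x] : 0.0f;
        if (get_local_id(1) == get_local_size(1) - 1)
            tile[(ly + 1) * tw + lx] = (y < ny - 1) ? in[z * plane + (y + 1) * nx + x] : 0.0f;
        barrier(CLK_LOCAL_MEM_FENCE);

        /* In-plane neighbours come from local memory; z-neighbours come from registers */
        out[z * plane + y * nx + x] =
              coeff[0] * center
            + coeff[1] * (below + above
                        + tile[ly * tw + lx - 1] + tile[ly * tw + lx + 1]
                        + tile[(ly - 1) * tw + lx] + tile[(ly + 1) * tw + lx]);

        below  = center;                                 /* slide the register window  */
        center = above;
        barrier(CLK_LOCAL_MEM_FENCE);                    /* tile is reused next iteration */
    }
}
```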

Key GeoTomo Optimizations
• 3) Kernel Fusion
  – Reduce redundant memory accesses by fusing together kernels that operate on the same volume
  – Improves performance by reducing redundant global memory reads
• 4) Kernel Fission
  – Improve occupancy by lowering kernel resource requirements (registers) via kernel simplification
  – Allows more work-items to run concurrently on the GPU, improving masking of global memory latency
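A toy illustration of fusion (not GeoTomo's actual kernels): two passes over the same field are merged so the intermediate value stays in a register rather than making an extra round trip through global memory. Fission is the reverse: splitting a large kernel so each piece needs fewer registers and more work-items can be resident at once.

```c
/* Before fusion: 'field' is written to global memory by one kernel and re-read by the next. */
__kernel void scale(__global float* field, float a)
{
    int i = get_global_id(0);
    field[i] = a * field[i];
}

__kernel void add_source(__global float* field, __global const float* src)
{
    int i = get_global_id(0);
    field[i] += src[i];
}

/* After fusion: one read and one write of 'field' instead of two of each. */
__kernel void scale_and_add_source(__global float* field, __global const float* src, float a)
{
    int i = get_global_id(0);
    field[i] = a * field[i] + src[i];
}
```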

Performance Results
• FWI, 15 Hz, 15 shots
  – GPU version: 7997 seconds
  – CPU (5 cores per shot): … seconds [8.4X]
  – CPU (30 cores per shot): … seconds [20.9X]
• GPU: Sapphire Radeon HD 7970 GHz Edition (6 GB model)

Performance Results
“Using GPUs we can use higher frequencies and more if not all of the shots to improve the resolution and coverage.”
— James Jackson, President, GeoTomo

Questions?
OpenCL Courses
• June 3–6, 2014, Calgary, Canada
• Private onsite classes also available
• Acceleware.com/opencl-training
OpenCL Consulting
• Feasibility studies
• Code commercialization
• Porting and optimization
• Mentoring
• Acceleware.com/services
Contact Us
• Tel: