GAIN: GPU Accelerated Intensities
Ahmed F. Al-Refaie, S. N. Yurchenko, J. Tennyson
Department of Physics and Astronomy, University College London, Gower Street, London, WC1E 6BT
Computing Intensities
- The linestrength expression involves three-j symbols (precomputed)
- Evaluating it for every transition is time-consuming
TROVE
- Doing this for each transition is tough!
- However, we can split it into two parts:
  - A half-linestrength for a particular initial state
  - A simple dot product to complete it
TROVE
- Relegate the majority of the computation to each initial state
- Each transition therefore reduces to a simple dot product (sketched below)
- However, the half-linestrength can still take a long time
- ExoMol line lists can have billions of transitions as well
- This situation is common for particularly dense J: roughly a thousand hours (≈1.5 months) for one J′–J′′ pair!
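As a sketch of that split, in generic eigenvector notation (the coefficient and dipole symbols here are illustrative, not TROVE's exact ones): expanding both eigenstates in their basis sets factorises the transition moment so that the inner sum depends only on the initial state.

```latex
\[
\langle \Psi_f | \bar{\mu} | \Psi_i \rangle
  = \sum_{f'} c^{(f)}_{f'} \sum_{i'} c^{(i)}_{i'} \, \bar{\mu}_{f'i'}
  = \sum_{f'} c^{(f)}_{f'} \, h^{(i)}_{f'}
  = \mathbf{c}^{(f)} \cdot \mathbf{h}^{(i)},
\qquad
h^{(i)}_{f'} \equiv \sum_{i'} c^{(i)}_{i'} \, \bar{\mu}_{f'i'}
\]
```

The half-linestrength vector h⁽ⁱ⁾ is built once per initial state; every transition out of that state then costs only a dot product with the relevant final-state eigenvector.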
Life is too short to wait around for transitions
Question: how can you complete a line list quickly?
(1) Reduce the quality of the line lists
(2) Make it faster
Hint: the answer is not (1)
The half-linestrength
- The focus of this talk will be here
- High-J times:
  - H₂CO: 30 seconds
  - PH₃: 1 minute
  - SO₃: 7–8 mins!
- Tens of thousands of initial states!!
Half line strength
[Diagram: OpenMP threads T:0, T:1, T:2, …, T:9 each take elements of the initial basis set, paired against every element of the final basis set]
- Those hours-long high-J times were with 16 cores!
Enter the GPU
- Graphics Processing Units can have around 2000 cores
- Highly parallel in nature, with lots of arithmetic capability
Half line strength (one OpenMP thread)
for all elements in the J′′ basis set:
    get K_f, τ_f
    for all elements in the J′ basis set:
        get K_i, τ_i, c_i
        get dipole
        do maths
        accumulate half-ls vector
(see the CUDA sketch below)
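A minimal CUDA sketch of such a baseline kernel, assuming one thread per final-basis element; the array layout, the folded-in dipole matrix, and the phase line standing in for the real three-j maths are illustrative assumptions, not GAIN's actual code:

```cuda
// Baseline: one GPU thread per element of the J'' (final) basis set.
// Every thread walks the entire J' (initial) basis set, pulling quanta,
// coefficients and dipole elements straight from global memory.
__global__ void half_ls_baseline(int nf, int ni,
                                 const int*    tauf,    // final-state tau quanta
                                 const int*    ki,      // initial-state K quanta
                                 const int*    taui,    // initial-state tau quanta
                                 const double* ci,      // initial-state coefficients
                                 const double* dipole,  // nf x ni dipole elements
                                 double*       half_ls) // output, length nf
{
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= nf) return;

    double acc = 0.0;
    for (int i = 0; i < ni; ++i) {
        int    k   = ki[i];                      // global read 1: quanta
        int    tau = taui[i];                    // global read 2
        double c   = ci[i];                      // global read 3: coefficient
        double d   = dipole[(size_t)f * ni + i]; // global read 4: dipole
        // Illustrative placeholder for the real phase/three-j maths:
        double sign = ((tau + tauf[f] + k) & 1) ? -1.0 : 1.0;
        acc += sign * c * d;                     // do maths and accumulate
    }
    half_ls[f] = acc;
}
```

Note the four global-memory reads per inner iteration: the next slides diagnose exactly this.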
Baseline Kernel
[Performance plot]
- Why?
Optimising
- But we have so many cores! Why is it slow?
- Each thread must:
  1. Read J_i, K_i, τ_i
  2. Read the dipole matrix
  3. Read the coefficients
  4. Do the maths and accumulate
- It turns out memory operations are fairly slow, and we are doing a lot of them
- CPUs have really large and multiple caches; GPUs have very simple caches
Optimising
- We are provided a user-managed cache called shared memory
- It's a small chunk of memory that's REALLY fast
- A lot of the global memory reads are redundant
Optimising
[Diagram: initial basis set vs. final basis set]
- Each thread is reading the same J_i, K_i, τ_i and coefficients
Optimising
- Why not have the threads cache it instead?
[Diagram: threads cooperatively cache quanta and coefficients from the initial basis set, do the maths, and repeat]
- This is the Cache and Reduce (CR) kernel
Cache and Reduce (one GPU thread; block of 256 threads)
for all elements in the J′′ basis set:
    get K_f, τ_f
    for all elements in the J′ basis set, step 256:
        get K_i, τ_i, c_i at the thread's point
        store in shared memory
        for all elements in shared memory:
            get K_i, τ_i, c_i
            get dipole
            do maths
            accumulate half-ls vector
(see the CUDA sketch below)
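A CUDA sketch of the cache-and-reduce idea, under the same illustrative assumptions as the baseline sketch above (256-thread blocks; the phase line again stands in for the real three-j maths):

```cuda
#define TILE 256  // one block = 256 threads, as on the slide

__global__ void half_ls_cache_reduce(int nf, int ni,
                                     const int*    tauf,
                                     const int*    ki,
                                     const int*    taui,
                                     const double* ci,
                                     const double* dipole,  // nf x ni
                                     double*       half_ls)
{
    // User-managed cache: each thread deposits one initial-basis element.
    __shared__ int    s_k[TILE];
    __shared__ int    s_tau[TILE];
    __shared__ double s_c[TILE];

    int f = blockIdx.x * blockDim.x + threadIdx.x;
    double acc = 0.0;

    for (int base = 0; base < ni; base += TILE) {
        int i = base + threadIdx.x;
        if (i < ni) {                     // cooperative fill of the tile
            s_k[threadIdx.x]   = ki[i];
            s_tau[threadIdx.x] = taui[i];
            s_c[threadIdx.x]   = ci[i];
        }
        __syncthreads();                  // tile is ready for everyone

        if (f < nf) {
            int tile = min(TILE, ni - base);
            for (int j = 0; j < tile; ++j) {
                double d    = dipole[(size_t)f * ni + (base + j)];
                double sign = ((s_tau[j] + tauf[f] + s_k[j]) & 1) ? -1.0 : 1.0;
                acc += sign * s_c[j] * d; // do maths and accumulate
            }
        }
        __syncthreads();                  // don't refill while others read
    }
    if (f < nf) half_ls[f] = acc;
}
```

Each quanta/coefficient read from global memory is now shared by all 256 threads in the block instead of being repeated per thread.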
Optimising
- Have each thread cache a part of the initial basis set
[Diagram: final vs. initial basis set; cache quanta and coefficients]
Optimising
SO₃ molecule:
[Performance plot]
Porting to the GPU
- Half line strength
- Line strength completion
Line strength completion
- Simple dot product: replace with the cuBLAS version (~5× faster for H₂CO; sketch below)
- However, we have lots of final-state eigenvectors
- Strategy is to get lots done in 'parallel':
  - Use stream execution
  - Use multiple GPUs
  - Why not both?
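The cuBLAS replacement in sketch form (double precision, device-resident vectors, error handling trimmed; the function and variable names are mine, not GAIN's):

```cuda
#include <cublas_v2.h>

// Complete one linestrength: dot the half-ls vector with one
// final-state eigenvector, both already resident on the GPU.
double complete_one(cublasHandle_t handle, int n,
                    const double* d_half_ls, const double* d_cf)
{
    double result = 0.0;
    cublasDdot(handle, n, d_half_ls, 1, d_cf, 1, &result);
    return result;  // squared and weighted later to form the linestrength
}
```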
Stream execution
- Run multiple independent kernels simultaneously
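A sketch of stream execution for the completion step, assuming the final-state eigenvectors are already on the device (NSTREAMS and the pointer layout are illustrative choices):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Overlap many independent dot products by rotating them over streams.
// d_cf[v] is the device-resident eigenvector of final state v; results
// go to device memory so the cuBLAS calls stay asynchronous.
void complete_streamed(cublasHandle_t handle, int n, int n_final,
                       const double* d_half_ls,
                       const double* const* d_cf,  // host array of device ptrs
                       double* d_results)          // device array, length n_final
{
    const int NSTREAMS = 4;
    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamCreate(&streams[s]);

    // Device pointer mode: cublasDdot writes its result on the GPU
    // instead of blocking to return it to the host.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

    for (int v = 0; v < n_final; ++v) {
        cublasSetStream(handle, streams[v % NSTREAMS]);
        cublasDdot(handle, n, d_half_ls, 1, d_cf[v], 1, &d_results[v]);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamDestroy(streams[s]);
}
```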
Multiple GPUs
- Run multiple initial states on multiple GPUs
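And a sketch of the multi-GPU side, with one OpenMP host thread driving each device; process_initial_state() is a hypothetical stand-in for the half-ls kernel plus completion for one initial state:

```cuda
#include <cuda_runtime.h>

void process_initial_state(int state);  // hypothetical per-state pipeline

// Distribute initial states round-robin over all visible GPUs.
void run_all_states(int n_initial)
{
    int n_gpus = 0;
    cudaGetDeviceCount(&n_gpus);

    #pragma omp parallel for num_threads(n_gpus)
    for (int i = 0; i < n_initial; ++i) {
        cudaSetDevice(i % n_gpus);   // subsequent CUDA calls target this GPU
        process_initial_state(i);
    }
}
```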
Line strength completion
Porting to the GPU
- Half line strength
- Line strength completion
Result:
[Performance plot]
Future Work
- Port the code to DVR3D
- Remove the dot product and switch to DGEMM (see the sketch below)
- Integrate fully into TROVE
- Finish my PhD
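The DGEMM idea in sketch form: stacking eigenvectors and half-ls vectors as matrix columns turns many dot products into a single matrix-matrix product (an illustrative cuBLAS call, not the actual planned code):

```cuda
#include <cublas_v2.h>

// C: n x n_final   final-state eigenvectors, one per column (column-major)
// H: n x n_init    half-ls vectors, one column per initial state
// R = C^T * H      n_final x n_init block of completed linestrength sums
void complete_with_dgemm(cublasHandle_t handle, int n, int n_final, int n_init,
                         const double* d_C, const double* d_H, double* d_R)
{
    const double one = 1.0, zero = 0.0;
    cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                n_final, n_init, n,
                &one,  d_C, n,
                       d_H, n,
                &zero, d_R, n_final);
}
```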
Thanks