Accelerating Spherical Harmonic Transforms on the NVIDIA® GPGPU

Presentation transcript:

Accelerating Spherical Harmonic Transforms on the NVIDIA® GPGPU. ECE 734 Project. Vikrant Soman

Agenda: Problem statement; Motivation; Introduction to SHT (analysis and synthesis); Overview of GPU architecture; CPU-GPU implementation; Results; Conclusions and future work; References and acknowledgements

Problem Statement: The spherical harmonic transform (SHT) is a critical computational kernel in numerical weather prediction, climate modeling, and other global geopotential applications. Satellite resolution keeps improving, so enormous global datasets of very high degree and order are becoming available.

Motivation: The computational aspects of SHTs have become challenging and time-consuming. Larger datasets make the SHT more data-intensive and slower. No one appears to have used a GPU for the SHT before; try a Google search for “Spherical Harmonic Transforms on GPU”.

Spherical Harmonic Transforms: Spherical harmonic transforms (SHTs) are essentially Fourier transforms on the sphere. An SHT consists of an analysis step and a synthesis step. Analysis: project grid-point data on the sphere onto the spectral modes. Synthesis: the inverse transform reconstructs grid-point data from the spectral information.
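
For reference, the two steps can be written as follows. This is one common convention, not taken from the slides: it assumes μ = cos θ (θ the colatitude), longitude λ, truncation degree N, and Legendre functions normalized so that their squared integral over [-1, 1] is 1. The symbols f, a and the bar over P are illustrative.

```latex
\begin{align}
  % Synthesis: spectral coefficients -> grid-point values
  f(\mu,\lambda) &= \sum_{n=0}^{N} \sum_{m=-n}^{n}
      a_n^m \, \bar{P}_n^m(\mu) \, e^{i m \lambda} \\
  % Analysis: grid-point values -> spectral coefficients
  a_n^m &= \frac{1}{2\pi} \int_{-1}^{1} \int_{0}^{2\pi}
      f(\mu,\lambda) \, \bar{P}_n^m(\mu) \, e^{-i m \lambda} \, d\lambda \, d\mu \\
  % Assumed normalization of the Legendre functions
  \int_{-1}^{1} \bar{P}_n^m(\mu) \, \bar{P}_{n'}^m(\mu) \, d\mu &= \delta_{n n'}
\end{align}
```

With this normalization, substituting the synthesis sum into the analysis integral returns the same coefficient, since the longitude integral selects the order m and the latitude integral selects the degree n.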

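Analysis: take the FFT of the grid-point data along longitudes (F), multiply by the Gaussian weights (G), and combine with the Legendre polynomial functions to obtain the spectral values (S). Synthesis: combine the spectral values (X) with the Legendre polynomial functions, compute the IFFT, and normalize the results.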
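
The Legendre part of the analysis step is a weighted sum over latitudes: S(n, m) = sum over j of w(j) * F(j, m) * P(j, n, m). A minimal CUDA sketch of that sum is given below, with one thread per (n, m) pair. The kernel name `legendreAnalysis`, the array layouts, and the split into real and imaginary parts are assumptions made for illustration; they are not the author's kernel, and the half-latitude symmetry used in the MATLAB code is ignored.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per spectral coefficient (n, m), m <= n <= nMax.
// Assumed row-major layouts, with L = nMax + 1:
//   Fre, Fim : Fourier coefficients per latitude, index j*L + m
//   w        : Gaussian weights, index j
//   P        : normalized Legendre functions, index (j*L + n)*L + m
//   Sre, Sim : spectral output, index n*L + m
__global__ void legendreAnalysis(const float* Fre, const float* Fim,
                                 const float* w, const float* P,
                                 float* Sre, float* Sim,
                                 int nLat, int nMax)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    int n = blockIdx.y * blockDim.y + threadIdx.y;
    if (n > nMax || m > n) return;   // threads outside the triangle do nothing

    int L = nMax + 1;
    float re = 0.0f, im = 0.0f;
    for (int j = 0; j < nLat; ++j) {
        float wp = w[j] * P[(j * L + n) * L + m];  // weight times Legendre value
        re += wp * Fre[j * L + m];
        im += wp * Fim[j * L + m];
    }
    Sre[n * L + m] = re;
    Sim[n * L + m] = im;
}
```

The kernel would be launched on a 2D grid that covers (nMax+1) x (nMax+1) threads; entries with m > n are left untouched.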

GPU architecture overview: The GPU provides four types of memory: global (device), shared, constant, and texture.

CUDA: CUDA extends C by allowing the programmer to define C functions, called kernels, which are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.
    // Kernel definition
    __global__ void vecAdd(float* A, float* B, float* C) { }
    int main() {
        // Kernel invocation
        vecAdd<<<1, N>>>(A, B, C);
    }
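
The slide leaves the kernel body empty. A minimal complete sketch of the same idea follows, assuming a single block of n threads as in the slide's launch configuration. The host helper `vecAddOnDevice` is a hypothetical name introduced here, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float* A, const float* B, float* C)
{
    int i = threadIdx.x;   // single block, so threadIdx.x is the element index
    C[i] = A[i] + B[i];
}

// Hypothetical host helper: copies the inputs to the GPU, launches
// vecAdd<<<1, n>>>, and copies the result back to the host.
// n must not exceed the device's maximum number of threads per block.
void vecAddOnDevice(const float* hA, const float* hB, float* hC, int n)
{
    size_t bytes = n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);

    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<1, n>>>(dA, dB, dC);   // n threads, one per element

    // The blocking copy waits for the kernel to finish.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
```

For vectors longer than one block allows, the usual approach is to launch several blocks and compute the index as blockIdx.x * blockDim.x + threadIdx.x, with a bounds check.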

Heterogeneous programming: One of the best parts of the GPGPU. It supports acceleration of BLAS operations and allows the CPU-GPU architecture that I have used.

Implementation Details: Exploit the heterogeneous programming model. The CPU code is implemented in MATLAB. Identify the data-intensive loops in the code. Map the loop indexing onto the GPGPU architecture to exploit parallelism. Offload the computation to the GPU, then retrieve the data back to the CPU.

Part of the kernel program:
    AS(ty, tx) = A[k*wA*wA + aBegin + wA * ty + tx];
    BS(ty, tx) = B[bBegin_x + wB * ty + tx];
    Csub(ty, tx) = 0;
    // Synchronize to make sure the matrices are loaded
    __syncthreads();
    Csub(ty, tx) = AS(ty, tx) * BS(ty, tx);
    int c = bx*BLOCK_SIZE + by*BLOCK_SIZE*BLOCK_SIZE*(wA/BLOCK_SIZE);
    A[k*wA*wA + c + tx + ty*wA] = Csub(ty, tx);
Loop mapped to GPU (MATLAB):
    for n=0:nn
        Pn = (legendre(n,yg))';
        % Note error in Matlab normalization
        for m=0:n
            Nmn = (-1)^m * sqrt((2*n+1)/2 * factorial(n-m)/factorial(n+m));
            P(1:njo2,n+1,m+1) = Nmn*Pn(1:njo2,m+1);
        end
    end
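
The index arithmetic in the kernel fragment can be read as a 2D tiling: block (bx, by) owns a BLOCK_SIZE x BLOCK_SIZE tile, and thread (tx, ty) owns one element of it. The sketch below shows that mapping for an in-place element-wise product on slab k of A. It is an illustration under stated assumptions, not the author's kernel: the names `scaleSlab` and `launchScaleSlab` are hypothetical, B is assumed to be a wA x wA matrix with the same layout as one slab of A, wA is assumed to be a multiple of BLOCK_SIZE, and the shared-memory staging (AS, BS) is left out because each element is read only once.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 16

// Multiplies slab k of A (a stack of wA x wA row-major matrices) element-wise
// by the wA x wA matrix B, in place. Assumes wA % BLOCK_SIZE == 0.
__global__ void scaleSlab(float* A, const float* B, int wA, int k)
{
    int bx = blockIdx.x, by = blockIdx.y;    // tile coordinates
    int tx = threadIdx.x, ty = threadIdx.y;  // coordinates inside the tile

    // Offset of the tile's first element within one slab. Under the
    // divisibility assumption this equals the fragment's
    // bx*BLOCK_SIZE + by*BLOCK_SIZE*BLOCK_SIZE*(wA/BLOCK_SIZE).
    int c = bx * BLOCK_SIZE + by * BLOCK_SIZE * wA;

    // Element index within the slab: row*wA + col, where
    // row = by*BLOCK_SIZE + ty and col = bx*BLOCK_SIZE + tx.
    int idx = c + tx + ty * wA;

    A[k * wA * wA + idx] *= B[idx];
}

// Hypothetical launch helper: one thread per element of the slab.
// dA and dB are device pointers.
void launchScaleSlab(float* dA, const float* dB, int wA, int k)
{
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(wA / BLOCK_SIZE, wA / BLOCK_SIZE);
    scaleSlab<<<grid, threads>>>(dA, dB, wA, k);
    cudaDeviceSynchronize();   // wait for the kernel before returning
}
```

Because every (bx, by, tx, ty) combination maps to a distinct idx, each element of the slab is written by exactly one thread and no synchronization between threads is needed in this simplified form.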

Legendre polynomial calculation: Offload the data-intensive operation to the GPU.
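
The normalization applied in the MATLAB loop, Nmn = (-1)^m * sqrt((2n+1)/2 * (n-m)!/(n+m)!), can be sketched on the GPU as below. The function and kernel names, the row-major layout, and the use of double precision are assumptions for illustration (double precision needs a device that supports it). The factorial ratio is evaluated through log-gamma, which is a substitution made here to avoid overflowing the factorials at high degree; the slides call `factorial` directly.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// N_n^m = (-1)^m * sqrt((2n+1)/2 * (n-m)!/(n+m)!).
// lgamma(k + 1) equals log(k!), so the ratio is formed in log space.
__host__ __device__ inline double shtNorm(int n, int m)
{
    double logRatio = lgamma(n - m + 1.0) - lgamma(n + m + 1.0);
    double mag = sqrt(0.5 * (2.0 * n + 1.0)) * exp(0.5 * logRatio);
    return (m % 2 == 0) ? mag : -mag;
}

// Hypothetical kernel for a fixed degree n: one thread per (latitude j, order m).
// Pn and Pout are assumed row-major with index j*(n+1) + m, for m = 0..n.
__global__ void normalizeLegendre(const double* Pn, double* Pout, int n, int nLat)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (m <= n && j < nLat)
        Pout[j * (n + 1) + m] = shtNorm(n, m) * Pn[j * (n + 1) + m];
}
```

For very high degree and order the factor itself can still underflow in double precision; handling that would need a scaled recurrence, which is outside the scope of this sketch.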

Analysis step: Compute the FFT on the CPU side; MATLAB has a highly optimized FFT operation.

Synthesis step: The IFFT is again given to the CPU. The GPU FFT pays off only for very large transforms (more than about 10,000 points).

CPU side: Dell, Intel quad-core @ 2.5 GHz, 2.5 GB RAM. GPU: NVIDIA® 8800 GT. CPU-side code runs in MATLAB. GPU code is written with the MATLAB extensions provided by NVIDIA®, called NVMEX. Interfacing between CPU and GPU is via the plug-in for MATLAB.

Results: For a grid size of 512, the speedup is almost 42x, and the trend is upward for larger sizes. The analysis kernel shows little speedup, though the values are comparable.

Conclusions and Future work: The GPU improves the on-the-fly Legendre polynomial calculation. Good speedup overall. Errors are low (less than 1e-10 on average). Need to look into performance for larger grid sizes. Complete the synthesis-step results. Possible exchange of ideas with a PhD student at SMU, Dallas.

References:
1. Drake, J. B., Worley, P., and D’Azevedo, E. 2008. Algorithm 888: Spherical harmonic transform algorithms. ACM Trans. Math. Softw. 35, 3, Article 23 (October 2008).
2. Akshara Kaginalkar, Sharad Purohit. Benchmarking of Medium Range Weather Forecasting Model on PARAM - A parallel machine. Center for Development of Advanced Computing (C-DAC), Pune University Campus, Pune 411007, India.
3. Martin J. Mohlenkamp. A Fast Transform for Spherical Harmonics. The Journal of Fourier Analysis and Applications, 1999.
4. Huadong Xiao, Yang Lu. Parallel computation for spherical harmonic synthesis and analysis. Computers & Geosciences, Volume 33, Issue 3, March 2007.
5. NVIDIA CUDA Programming Guide 2.0.
Acknowledgements: Special thanks to Prof. Dan Negrut and Makarand Datar, UW Mech department, for access to their GPU machines.