Parallelization and CUDA Libraries
Lei Zhou, Yafeng Yin, Hong Man
07/21/2010


Outline
- GPU & CUDA
- Manual CUDA Coding
- CUDA Libraries
- FIR Realization
- Auto-Parallelizing Tools

GPU & CUDA
- GPUs are massively multithreaded many-core chips:
  - hundreds of scalar processors
  - tens of thousands of concurrent threads
- CUDA (Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA; it is the computing engine in the GPU.
- CUDA is accessible to software developers through industry-standard programming languages.
Examples: GeForce 8800 GTX (128 cores), Tesla C1060 (240 cores)

Processing Flow
Serial code executes on the host, while parallel code executes on the device.
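As a concrete illustration of that host/device split (not from the original slides; the kernel name `scale` and the sizes are illustrative), a minimal CUDA program might look like this:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Kernel: runs in parallel on the device, one thread per array element
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1024;
    float h_data[n];
    for (int i = 0; i < n; ++i)        // serial code on the host
        h_data[i] = (float)i;

    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // parallel code on the device: 256 threads per block
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);

    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    printf("h_data[10] = %f\n", h_data[10]);  // 10 * 2 = 20
    return 0;
}
```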

Manual CUDA Coding
- Find parallel kernels
- Improve data reuse inside kernels for better compute intensity
- Access memory in a GPU-friendly pattern
- Take advantage of the complex memory hierarchy that makes the GPU fast
- Reduce the copy-in and copy-out transfers that pile up on the PCIe bus
- Reduce memory usage on the GPU
- Limit inter-block synchronization
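The "GPU-friendly memory access" point refers to coalescing: consecutive threads in a warp should touch consecutive addresses so reads combine into few memory transactions. A sketch (illustrative kernels, not from the slides) contrasting the two patterns:

```cuda
// Coalesced: thread i reads address i, so a warp's 32 loads fall in
// adjacent words and are serviced by a small number of transactions.
__global__ void copy_coalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` elements apart,
// scattering a warp's loads across many memory segments (slow).
__global__ void copy_strided(float *out, const float *in, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
```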

CUDA Libraries
- Basic CUDA computation libraries
  - CUBLAS
  - CUFFT
  - GPULib
- Advanced CUDA computation libraries
  - CULA
  - MAGMA
  - VSIPL

Basic Libraries
- CUBLAS provides a set of functions for basic vector and matrix operations
  - vector/matrix copy, dot product, Euclidean norm, etc.
- CUFFT is the CUDA FFT library
  - cufftPlan1d(), cufftPlan2d(), cufftPlan3d()
- GPULib provides a library of mathematical functions
  - addition, subtraction, multiplication, and division, as well as unary functions such as sin(), cos(), gamma(), and exp()
  - interpolation, array reshaping, array slicing, and reduction operations
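As a hedged illustration of calling CUBLAS (this sketch uses the modern cuBLAS v2 API and `cublasSdot`; the legacy `cublas.h` API available when these slides were written differed slightly), a dot product of two device vectors:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int n = 4;
    float h_x[] = {1.0f, 2.0f, 3.0f, 4.0f};
    float h_y[] = {1.0f, 1.0f, 1.0f, 1.0f};
    float *d_x, *d_y, result;

    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // result = x . y = 1 + 2 + 3 + 4 = 10
    cublasSdot(handle, n, d_x, 1, d_y, 1, &result);
    cublasDestroy(handle);

    cudaFree(d_x);
    cudaFree(d_y);
    printf("dot = %f\n", result);
    return 0;
}
```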

Advanced Libraries
- CULA: GPU-Accelerated Linear Algebra
  - provides LAPACK (Linear Algebra PACKage) functions on CUDA GPUs
- MAGMA: Matrix Algebra on GPU and Multicore Architectures
  - a dense linear algebra library similar to LAPACK, but for heterogeneous/hybrid "multicore + GPU" architectures

Advanced Libraries: VSIPL
- VSIPL: Vector Signal Image Processing Library
- Generalized matrix product
- Fast FIR filtering
- Correlation
- Fast Fourier transform
- QR decomposition
- Random number generation
- Elementwise arithmetic, logical, and comparison operators; linear algebra procedures

Example

    // Allocate device memory for filter kernel
    Complex *d_filter_kernel;
    cutilSafeCall(cudaMalloc((void **)&d_filter_kernel, mem_size));

    // Copy host memory to device
    cutilSafeCall(cudaMemcpy(d_filter_kernel, h_padded_filter_kernel,
                             mem_size, cudaMemcpyHostToDevice));

    // CUFFT plan
    cufftHandle plan;
    cufftSafeCall(cufftPlan1d(&plan, new_size, CUFFT_C2C, 1));

    // Transform signal
    cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_signal,
                               (cufftComplex *)d_signal, CUFFT_FORWARD));
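The slide's example stops after the forward transform of the signal. For FFT-based FIR filtering, a hypothetical continuation might look like the following (variable names `d_signal`, `d_filter_kernel`, `new_size`, and `plan` reused from the slide; the `pointwise_mul` kernel is an assumption, and the 1/new_size scale is needed because CUFFT transforms are unnormalized):

```cuda
// Multiply two spectra elementwise (complex product), applying a scale factor.
__global__ void pointwise_mul(cufftComplex *a, const cufftComplex *b,
                              int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex r;
        r.x = (a[i].x * b[i].x - a[i].y * b[i].y) * scale;
        r.y = (a[i].x * b[i].y + a[i].y * b[i].x) * scale;
        a[i] = r;
    }
}

// Host side: transform the filter, multiply spectra, inverse-transform.
cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_filter_kernel,
                           (cufftComplex *)d_filter_kernel, CUFFT_FORWARD));
pointwise_mul<<<(new_size + 255) / 256, 256>>>(
    (cufftComplex *)d_signal, (cufftComplex *)d_filter_kernel,
    new_size, 1.0f / new_size);
cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_signal,
                           (cufftComplex *)d_signal, CUFFT_INVERSE));
```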

FIR Realization on CUDA

(Figure: FIR computation laid out across threads over time t.)
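A minimal time-domain FIR kernel matching the slide's idea of one thread per output sample (an illustrative sketch, not the authors' implementation; `TAPS` and all names are assumptions):

```cuda
#define TAPS 32  // illustrative tap count

// Filter coefficients in fast, cached constant memory, read by all threads.
__constant__ float d_coeff[TAPS];

// One thread per output sample: y[i] = sum over k of h[k] * x[i - k]
__global__ void fir(float *y, const float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    float acc = 0.0f;
    for (int k = 0; k < TAPS && k <= i; ++k)
        acc += d_coeff[k] * x[i - k];
    y[i] = acc;
}

// Host side (sketch):
//   cudaMemcpyToSymbol(d_coeff, h_coeff, TAPS * sizeof(float));
//   fir<<<(n + 255) / 256, 256>>>(d_y, d_x, n);
```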

CUDA Demo (FIR)
- GPU: NVIDIA GeForce 8600 GT
- CPU: Intel Duo CPU, 2.33 GHz
- Software: Visual Studio 2005

CUDA Demo (FIR)

Auto-Parallelizing Tools
- Par4All (open-source environment): C and Fortran to CUDA C
- PGI Accelerator: Fortran and C to CUDA C auto-parallelizing compiler
- CAPS HMPP: C and Fortran to CUDA C auto-parallelizing compiler
- Goose: C to CUDA C auto-parallelizing compiler
- NOAA F2C: Fortran to CUDA C translator
