GPGPUs and CUDA: Guest Lecture
Computing with GPGPUs
Raj Singh, National Center for Microscopy and Imaging Research



Graphics Processing Unit (GPU)
- Development driven by the multi-billion-dollar game industry (bigger than Hollywood)
- Games need physics, AI, and complex lighting models
- Impressive flops-per-dollar performance: the hardware has to be affordable
- Evolution outpaces Moore's law, with performance doubling roughly every 6 months

GPU evolution curve (courtesy: Nvidia Corporation)

GPGPUs (General-Purpose GPUs)
- A natural evolution of GPUs to support a wider range of applications
- Widely accepted by the scientific community
- Cheap, high-performance GPGPUs are now available: it's possible to buy a $500 card that provides almost 2 TFlops of computing

Teraflop computing
- Supercomputers are still rated in teraflops
- Expensive and power hungry
- Not exclusive: they have to be shared by several organizations
- Custom built in several cases
- The National Center for Atmospheric Research, Boulder, installed a 12 TFlop supercomputer in 2007

What does it mean for the scientist?
- Desktop supercomputers are possible
- Energy efficient: approximately 200 watts per teraflop
- Turnaround time can be cut by orders of magnitude; simulations and jobs that currently take several days can finish far sooner

GPU hardware
- Highly parallel architecture, akin to SIMD
- Designed initially for efficient matrix operations and pixel-manipulation pipelines
- The computing core is a lot simpler: no memory-management support, 64-bit native cores, little or no cache
- Double-precision support

Multi-core horsepower
- The latest Nvidia card has 480 cores for simultaneous processing
- Very high memory bandwidth: more than 100 GB/s and increasing
- Perfect for embarrassingly parallel, compute-intensive problems
- Clusters of GPGPUs available in GreenLight
(Figures courtesy: Nvidia programming guide 2.0)
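To put that bandwidth figure in perspective, a quick back-of-the-envelope calculation (illustrative arithmetic only, using the >100 GB/s number quoted above; not a benchmark):

```python
# Streaming a 4 GB dataset once through device memory at the
# >100 GB/s the slide cites takes on the order of tens of ms.
GPU_BANDWIDTH_GB_S = 100   # GB/s, figure from the slide
DATA_GB = 4                # illustrative dataset size

transfer_ms = DATA_GB / GPU_BANDWIDTH_GB_S * 1000
print(f"{transfer_ms:.0f} ms to stream {DATA_GB} GB at {GPU_BANDWIDTH_GB_S} GB/s")
# prints "40 ms to stream 4 GB at 100 GB/s"
```

The point of the arithmetic: once data is resident on the card, memory bandwidth rarely dominates; getting the data there over the PCIe bus usually does.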

CPU vs. GPU

Programming model
- The GPU is seen as a compute device that executes the portion of an application that has to be executed many times, can be isolated as a function, and works independently on different data
- Such a function can be compiled to run on the device; the resulting program is called a kernel
- The C-like language helps in porting existing code
- Copies of the kernel execute simultaneously as threads
(Figure courtesy: Nvidia programming guide 2.0)
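A minimal sketch of this model (an illustrative SAXPY kernel, not code from the lecture): the function below is compiled for the device, and each of its many simultaneous thread copies works on a different array element.

```cuda
#include <cuda_runtime.h>

// Device kernel: each thread computes one element of y = a*x + y.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    // Global index of this thread across all blocks.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: the grid may overshoot n
        y[i] = a * x[i] + y[i];
}

// Host side: launch enough 256-thread blocks to cover n elements.
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```

The triple-angle-bracket launch syntax is where the "executed many times" property becomes explicit: the same function body runs once per thread.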

Look Ma, no cache
- Cache is expensive
- By running thousands of fast-switching, lightweight threads, large memory latency can be masked
- Context switching of threads is handled by CUDA; users have little control beyond synchronization

CUDA / OpenCL
- A non-OpenGL-oriented API for programming the GPUs
- The compiler and tools allow porting of existing C code fairly rapidly
- Libraries for common math functions such as the trigonometric functions, pow(), and exp()
- Support for general DRAM memory addressing: scatter/gather operations
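A hedged sketch of the typical host-side workflow (the runtime calls are the standard CUDA API; the kernel and sizes are illustrative): allocate device DRAM, copy data in, launch, copy results back.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Illustrative kernel: scale every element of x in place.
__global__ void scale(int n, float a, float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);            // host buffer
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);                        // device DRAM
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(n, 2.0f, d);  // launch kernel

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);                  // expect 2.0

    cudaFree(d);
    free(h);
    return 0;
}
```

This explicit copy-in / compute / copy-out structure is what "porting C code fairly rapidly" looks like in practice: the computation loop becomes a kernel, and the surrounding code stays plain C.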

What do we do at NCMIR / CALIT2?
- Research on large-data visualization, optical networks, and distributed systems
- Collaborate with Earth sciences, neuroscience, gene research, and the movie industry
- Large projects funded by NSF / NIH, e.g. NSF EarthScope

Electron and Light Microscopes

Cluster-driven high-resolution displays as data end-points

Electron tomography
- Used for constructing a 3D view of thin biological samples
- The sample is rotated around an axis and images are acquired at each 'tilt' angle
- Electron tomography enables high-resolution views of cellular and neuronal structures
- 3D reconstruction is a complex problem due to the low signal-to-noise ratio, curvilinear electron path, sample deformation, scattering, magnetic lens aberrations, and more
(Figures: biological sample; tilt-series images; curvilinear electron path)

Challenges
- A bundle-adjustment procedure corrects for the curvilinear electron path and sample deformation
- Evaluation of electron-micrograph correspondences needs to be done in double precision when using high-order polynomial mappings
- Non-linear electron projection makes reconstruction computationally intensive
- Wide field of view for large datasets: CCD cameras go up to 8K x 8K

Reconstruction on GPUs
- Large datasets take up to several days to reconstruct on a fast serial processor; the goal is real-time reconstruction
- Computation is embarrassingly parallel at the tilt level
- The GTX 280, with double-precision support and 240 cores, has shown speedups between 10x and 50x for large data
- Tesla units with 4 TFlops are the next target for the code

Really? Free lunch?
- C-like language support, but with missing pieces: no function pointers, no recursion, double precision not as accurate as on CPUs, no direct access to I/O
- Structures and unions cannot be passed
- Code has to be fairly simple and free of dependencies: completely self-contained in terms of data and variables
- Speedups depend on efficient code; programmers have to code the parallelism, and no magic spells are available for download
- Combining CPU and GPU code might be better in some cases

And more cons...
- Performance is best for compute-intensive apps; data-intensive apps can be tricky
- Bank conflicts hurt performance
- It's a black box with little support for runtime debugging
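Bank conflicts arise when several threads of a warp access different addresses that fall in the same shared-memory bank, serializing the accesses. A common mitigation, shown here as a hypothetical tile-transpose sketch (assuming 32 banks and a matrix width that is a multiple of 32), is to pad the shared array by one column:

```cuda
#include <cuda_runtime.h>

#define TILE 32

// With tile[TILE][TILE], reading a column makes all 32 threads of a
// warp hit the same shared-memory bank (a 32-way conflict). Padding
// each row to TILE + 1 floats shifts successive rows by one bank,
// so a column read touches 32 different banks instead.
__global__ void transpose(const float *in, float *out, int width)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column of padding

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced read
    __syncthreads();

    // Swap block coordinates, then write the transposed element;
    // the column access into tile is now conflict-free.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}
```

The padding costs one unused float per row of shared memory; in exchange, the column reads are no longer serialized, which is exactly the kind of hand-tuning the "programmers have to code the parallelism" caveat refers to.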

Resources