
GPU-based Computing

Tesla C870 GPU: 8 KB of cache per multiprocessor, 16 KB of shared memory per multiprocessor, 8,192 registers per multiprocessor, and 1.5 GB of memory per GPU. With up to 768 threads per multiprocessor, that leaves about 21 bytes of shared memory and roughly 10 registers per thread.

Tesla C870 Threads

Example: B blocks of T threads. Which element am I processing?
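The usual answer can be sketched host-side in plain C (a model of the CUDA expression blockIdx.x * blockDim.x + threadIdx.x; the helper names are made up for illustration):

```c
/* Host-side model of the standard CUDA index calculation:
 *     int i = blockIdx.x * blockDim.x + threadIdx.x;
 * Block b of B, thread t of T, processes element b*T + t. */
int global_index(int block, int threads_per_block, int thread)
{
    return block * threads_per_block + thread;
}

/* Check that B blocks of T threads cover 0 .. B*T-1 exactly once. */
int covers_exactly_once(int B, int T)
{
    int count[1024] = { 0 };
    if (B * T > 1024)
        return 0;
    for (int b = 0; b < B; b++)          /* plays the role of blockIdx.x  */
        for (int t = 0; t < T; t++)      /* plays the role of threadIdx.x */
            count[global_index(b, T, t)]++;
    for (int i = 0; i < B * T; i++)
        if (count[i] != 1)
            return 0;
    return 1;
}
```

Because every element is touched exactly once, no two threads write the same output location and no synchronization between blocks is needed.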

Case Study: Irregular Terrain Model. Each profile is 1 KB. Configurations: 16*16 threads; 128*16 threads, 45 registers; 192*16 threads, 37 registers.

GPU Strategies for ITM

OpenMP vs CUDA
#pragma omp parallel for shared(A) private(i,j)
for (i = 0; i < 32; i++)
    for (j = 0; j < 32; j++)
        value = some_function(A[i][j]);
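In a CUDA port of this loop nest, the 32x32 iteration space is typically flattened to one thread per (i, j) pair; a host-side C sketch of that mapping (function names are hypothetical):

```c
/* Flat thread id k in [0, N*N) maps back to the nested loop
 * indices as i = k / N (row) and j = k % N (column). */
#define N 32

int row_of(int k) { return k / N; }
int col_of(int k) { return k % N; }

/* The flat launch covers the same 32x32 iteration space as the
 * nested OpenMP loops: every (i, j) pair is hit exactly once. */
int covers_iteration_space(void)
{
    int hits[N][N] = { { 0 } };
    for (int k = 0; k < N * N; k++)
        hits[row_of(k)][col_of(k)]++;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (hits[i][j] != 1)
                return 0;
    return 1;
}
```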

Easy to learn, takes time to master

For further discussion and additional information

Tesla C870 GPU vs. Cell BE: 3,072 threads on the GPU; 15x faster than the Cell BE, 128x faster than a general-purpose processor. Tesla C1060: 30 multiprocessors and double the number of registers ($1,500), with 8x the GFLOPS of an Intel Xeon W5590 quad-core ($1,600).

Fermi: Next-Generation GPU
32 cores per multiprocessor, up to 1,536 threads per multiprocessor
No complex mechanisms – no speculation, out-of-order execution, superscalar issue, instruction-level parallelism, or branch predictors
Column access to memory – faster search and read
First GPU with error-correcting code – on registers, shared memory, caches, and DRAM
Languages supported – C, C++, FORTRAN, Java, MATLAB, and Python
IEEE 754-2008 double-precision floating point – fused multiply-add operations
Streaming and task switching (25 microseconds) – up to 16 kernels launched simultaneously; combined visualization and computing

Sequence Alignment
H[i,j] = max { H[i-1,j-1] + S[i,j], H[i-1,j] - G, H[i,j-1] - G, 0 }, with gap penalty G = 10.
(The slide animates filling the alignment matrix over the alphabet A, B, C, D, F, W against a substitution matrix; the repeated grids are animation frames.)
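The recurrence can be checked directly in C; a minimal sketch, assuming a toy substitution score (+5 match, -3 mismatch) in place of the slide's full matrix, with G = 10 as above:

```c
#include <string.h>

#define G 10                       /* linear gap penalty from the slide */

int max4(int a, int b, int c, int d)
{
    int m = a > b ? a : b;
    if (c > m) m = c;
    if (d > m) m = d;
    return m;
}

/* Smith-Waterman local-alignment score of sequences a and b.
 * S[i,j] is simplified to +5 for a match, -3 for a mismatch. */
int sw_score(const char *a, const char *b)
{
    int la = (int)strlen(a), lb = (int)strlen(b);
    int H[64][64] = { { 0 } };     /* H[0][*] = H[*][0] = 0 */
    int best = 0;

    if (la > 63 || lb > 63)
        return -1;                 /* sketch limited to short inputs */

    for (int i = 1; i <= la; i++)
        for (int j = 1; j <= lb; j++) {
            int s = (a[i - 1] == b[j - 1]) ? 5 : -3;
            H[i][j] = max4(H[i - 1][j - 1] + s,   /* diagonal        */
                           H[i - 1][j] - G,       /* gap in b        */
                           H[i][j - 1] - G,       /* gap in a        */
                           0);                    /* local restart   */
            if (H[i][j] > best)
                best = H[i][j];
        }
    return best;                   /* best local alignment score */
}
```

Each cell depends only on its left, upper, and upper-left neighbors, which is why GPU implementations can sweep the matrix anti-diagonal by anti-diagonal in parallel.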

Cost Function
A 26x26 substitution matrix over A..Z: only space for a 26x26 matrix is needed. (The slide's example entries are flattened in the transcript.)
New cost function: a sorted substitution table computed from the substitution matrix, with the substitution characters along the top row and the query sequence down the column; it does not use modulo arithmetic.
Previous methods: space needed is 23 x (query length).
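The table layout described here (substitution characters across the top row, query sequence down the column) amounts to precomputing a per-query lookup table; a hedged C sketch with a made-up 4-letter alphabet and illustrative scores (build_profile and all values are assumptions, not the slide's actual table):

```c
#include <string.h>

/* Toy alphabet and substitution matrix (illustrative values only). */
static const char ALPHA[] = "ACGT";
#define NA 4
static const int SUBST[NA][NA] = {
    {  5, -1, -2, -1 },
    { -1,  5, -1, -2 },
    { -2, -1,  5, -1 },
    { -1, -2, -1,  5 },
};

/* Assumes c is one of the ALPHA letters. */
int idx_of(char c) { return (int)(strchr(ALPHA, c) - ALPHA); }

/* Build profile[j][a] = score of query position j against letter a.
 * The inner DP loop can then index by (query position, database
 * letter) directly, with no per-cell 2-D substitution lookup. */
void build_profile(const char *query, int qlen, int profile[][NA])
{
    for (int j = 0; j < qlen; j++)
        for (int a = 0; a < NA; a++)
            profile[j][a] = SUBST[idx_of(query[j])][a];
}
```

The profile is built once per query on the host, so the GPU inner loop becomes a straight array read.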

Results
Alignment database: Swissprot (Aug 2008), containing 392,768 sequences. GSW vs. SSEARCH.
Table columns: protein length; GPU (1.35 GHz) time (s); SSEARCH (3.2 GHz) time (s); speedup; GPU cycles (billions); SSEARCH cycles (billions); cycles ratio. (Numeric rows omitted in the transcript.)

From the Software Perspective
A kernel is an SPMD-style program. The programmer organizes threads into thread blocks; a kernel consists of a grid of one or more blocks. A thread block is a group of concurrent threads that can cooperate amongst themselves through
–barrier synchronization
–a per-block private shared memory space
The programmer specifies
–the number of blocks
–the number of threads per block
Each thread block is a virtual multiprocessor:
–each thread has a fixed register allocation
–each block has a fixed allocation of per-block shared memory
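A host-side C sketch of this organization, with a grid of blocks each owning a private "shared memory" buffer (barriers are implicit because the emulation runs threads sequentially; all names are illustrative):

```c
#define BLOCKS  4
#define THREADS 8

/* Each block stages its slice of 'in' through a per-block shared
 * buffer, then one thread of the block writes a partial result. */
void grid_sum(const int *in, int *block_out)
{
    for (int b = 0; b < BLOCKS; b++) {       /* grid of blocks          */
        int shared[THREADS];                 /* per-block shared memory */
        for (int t = 0; t < THREADS; t++)    /* "parallel" threads      */
            shared[t] = in[b * THREADS + t];
        /* __syncthreads() would go here on a real GPU. */
        int sum = 0;
        for (int t = 0; t < THREADS; t++)
            sum += shared[t];
        block_out[b] = sum;                  /* done by thread 0        */
    }
}
```

Note that 'shared' is reallocated per block: blocks cannot see each other's buffers, exactly as per-block shared memory is private on the device.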

Efficiency Considerations
Avoid execution divergence
–divergence occurs when threads within a warp follow different execution paths
–divergence between warps is OK
Load a block of data into the SM
–process it there, then write the final result back out to external memory
Coalesce memory accesses
–access consecutive words instead of gather/scatter
Create enough parallel work
–5K to 10K threads
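Coalescing means the 32 threads of a warp touch consecutive words, d[base + t], rather than strided or gathered addresses. A small C check of whether a warp's 4-byte accesses stay inside one aligned segment (the 128-byte segment size is a common figure for these GPUs, used here only as a constant):

```c
/* Does a warp of 32 threads, each reading one 4-byte word at
 * addr(t), fall within a single 128-byte aligned segment? */
#define WARP 32
#define SEG  128

int single_segment(long (*addr)(int))
{
    long first = addr(0) / SEG;
    for (int t = 1; t < WARP; t++)
        if (addr(t) / SEG != first)
            return 0;               /* extra memory transactions */
    return 1;                       /* one transaction serves the warp */
}

long coalesced(int t) { return 4L * t; }        /* d[t]: consecutive */
long strided(int t)   { return 4L * t * 32; }   /* d[32*t]: strided  */
```

The strided pattern touches 32 different segments, so the hardware issues one transaction per thread instead of one per warp.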

Most important factor: the number of thread blocks = "processor count". At the thread-block (or SM) level, internal communication is cheap. Focus on decomposing the work between the p thread blocks.
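A common way to decompose N items over p thread blocks is contiguous chunks, with any remainder spread over the first blocks; a minimal C sketch (function names are hypothetical):

```c
/* Contiguous-chunk decomposition of n items over p blocks:
 * the first (n % p) blocks each get one extra item. */
long chunk_start(long n, int p, int b)
{
    long base = n / p, rem = n % p;
    return b * base + (b < rem ? b : rem);
}

long chunk_size(long n, int p, int b)
{
    return n / p + (b < n % p ? 1 : 0);
}
```

Every item belongs to exactly one block, and chunk sizes differ by at most one, so the blocks stay load-balanced.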