Cijo Thomas Janith Kaiprath Valiyalappil CS566 Parallel Programming, Spring '13

GPU vs CPU: A GPU has hundreds of cores, compared with 4-8 cores for a CPU. A CPU executes a single thread very quickly. A GPU executes many concurrent threads, each comparatively slowly, and traditionally excels at embarrassingly parallel tasks. The GPU and the CPU have complementary properties.

Solving general-purpose problems using the GPU: The core idea was to map data-parallel algorithms onto equivalent graphics concepts, which required heavy use of graphics APIs. This was traditionally a cumbersome task and never gained prominence among developers. Until...

Compute Unified Device Architecture (CUDA): Released in 2006 by NVIDIA. It allows easy programming of the GPU through an extension to C. It scales transparently, harnessing the ever-growing power of NVIDIA GPUs, and programs are portable to newer GPU releases.

A CUDA GPU is a scalable array of multi-threaded SMs (Streaming Multiprocessors). Each SM consists of multiple Streaming Processors (SPs). Threads communicate with each other through shared memory. CUDA terms: the host is the CPU and the device is the GPU.
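As a rough illustration of the SM-based organisation described above, the sketch below asks the CUDA runtime how many SMs a device has. The helper name smCount is hypothetical; cudaGetDeviceProperties and the cudaDeviceProp fields used are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the slides): returns the number of
// streaming multiprocessors on the given device, or -1 if the query fails.
int smCount(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    // Print a few of the properties that relate to the slide's terms.
    std::printf("%s: %d SMs, warp size %d, %zu bytes of shared memory per block\n",
                prop.name, prop.multiProcessorCount, prop.warpSize,
                prop.sharedMemPerBlock);
    return prop.multiProcessorCount;
}
```

Calling smCount(0) on a machine with at least one CUDA-capable GPU should print one line and return a positive count; the actual numbers depend on the hardware.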

[Nickolls,ACM,2008]

Threads are grouped into thread blocks; the threads of a block execute concurrently on a single SM. Thread blocks are grouped into grids and are executed independently and in parallel. The execution model is SIMT (Single Instruction, Multiple Thread): thread creation, management, scheduling, and execution occur in groups of 32 threads called warps.
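The grid, block, and warp grouping can be made concrete with a small sketch. Here each thread computes its global index from the built-in blockIdx, blockDim, and threadIdx variables and records which warp of its block it falls in. The names recordWarp and warpIds are illustrative, and error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread stores the index of the warp it belongs to within its block.
__global__ void recordWarp(int *warpInBlock, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        warpInBlock[i] = threadIdx.x / warpSize;    // warpSize is 32 on current GPUs
    }
}

// Hypothetical host helper: launches enough blocks to cover n threads
// and returns the per-thread warp indices.
std::vector<int> warpIds(int n, int threadsPerBlock) {
    int *dWarp = nullptr;
    cudaMalloc(&dWarp, n * sizeof(int));
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    recordWarp<<<blocks, threadsPerBlock>>>(dWarp, n);
    std::vector<int> warp(n);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(warp.data(), dWarp, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dWarp);
    return warp;
}
```

With warpIds(128, 64), two blocks of 64 threads are launched, so threads 0-31 of each block report warp 0 and threads 32-63 report warp 1.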

[Nickolls,ACM,2008]

Each thread has its own local memory, apart from register and stack space; it is physically located in off-chip device memory. Next in the hierarchy is a low-latency shared memory, shared between the threads of a thread block. Then there is high-latency global memory, shared by all threads. All of these memories are physically and logically separate from system memory.
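A minimal sketch of how the levels of this hierarchy appear in code, assuming a fixed block size of 64 threads: per-thread variables live in registers, the __shared__ array is visible to one thread block, and the data pointer refers to global memory. The kernel reverses each 64-element chunk of an array; the names reverseInBlock and reverseBlocks are illustrative.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlock = 64;  // assumed block size for this sketch

// Reverses each kBlock-sized chunk of data in place.
__global__ void reverseInBlock(int *data) {
    __shared__ int tile[kBlock];          // shared memory: one copy per thread block
    int t = threadIdx.x;                  // per-thread values, normally held in registers
    int base = blockIdx.x * blockDim.x;
    tile[t] = data[base + t];             // read from global memory
    __syncthreads();                      // wait until the whole block has written its tile
    data[base + t] = tile[blockDim.x - 1 - t];
}

// Hypothetical host helper; assumes v.size() is a non-zero multiple of kBlock.
std::vector<int> reverseBlocks(std::vector<int> v) {
    size_t bytes = v.size() * sizeof(int);
    int *d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, v.data(), bytes, cudaMemcpyHostToDevice);
    reverseInBlock<<<static_cast<int>(v.size() / kBlock), kBlock>>>(d);
    cudaMemcpy(v.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return v;
}
```

The __syncthreads() barrier is what makes the shared tile safe to read: without it, a thread could read a slot that another thread in the block has not yet written.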

[Source: Nvidia]

cudaMalloc and cudaFree are used to allocate and release memory on the device. cudaMemcpy is used to transfer data in two directions: a) host to device memory, with cudaMemcpyHostToDevice; b) device to host memory, with cudaMemcpyDeviceToHost. Device memory here refers to global memory, not thread-block shared memory.
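The following sketch exercises these calls in isolation: it allocates a device buffer, copies data from host to device with cudaMemcpyHostToDevice, copies it back with cudaMemcpyDeviceToHost, and frees the buffer. The helper name roundTrip is hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: sends host data to device global memory and back.
// Returns an empty vector if the device allocation fails.
std::vector<float> roundTrip(const std::vector<float> &host) {
    size_t bytes = host.size() * sizeof(float);
    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        return {};
    }
    // Host -> device: destination first, then source.
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    std::vector<float> back(host.size());
    // Device -> host.
    cudaMemcpy(back.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return back;
}
```

Note the argument order of cudaMemcpy: destination, source, size in bytes, then the direction flag.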

CUDA programs are heterogeneous CPU+GPU co-processing systems: use CPU cores for the serial portions and the GPU for the parallel portions. A CUDA kernel can be a simple function or a program on its own. The GPU needs thousands of threads for full efficiency. CUDA threads are extremely lightweight, with little or no overhead in creation and switching.

Allocate memory on the device (GPU). Copy data from system memory into device memory. Invoke the CUDA kernel, which processes the data. Copy the results back from device memory to system memory.
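The four steps above can be sketched end to end with a vector addition. This is only an illustration: the names vecAdd and addOnGpu and the block size of 256 are assumptions, and error checking is left out to keep the steps visible.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Kernel: one thread per element, c[i] = a[i] + b[i].
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Hypothetical host helper; assumes a and b have the same length.
std::vector<float> addOnGpu(const std::vector<float> &a, const std::vector<float> &b) {
    int n = static_cast<int>(a.size());
    if (n == 0) return {};
    size_t bytes = n * sizeof(float);

    // Step 1: allocate memory on the device.
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    // Step 2: copy the inputs from system memory to device memory.
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    // Step 3: invoke the kernel with enough blocks to cover n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    // Step 4: copy the result back to system memory.
    std::vector<float> c(n);
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

The bounds check in the kernel matters because the launch rounds the thread count up to a whole number of blocks, so some threads may have no element to process.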

[Kirk,2010]

[Nickolls,ACM,2008]

[Kirk,2010]

CUDA 5: the latest release of CUDA, released in October 2012. Kepler architecture vs. Fermi architecture.

Dynamic parallelism: a GPU thread can launch parallel GPU kernels [Harris, GPU Tech Conf,2012]

Advantages: Recursive parallel algorithms become possible. Execution is more efficient, since the GPU is kept more occupied. The CPU/GPU divide is simplified. Library calls can be made from a kernel.
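A minimal sketch of a kernel launching another kernel, assuming a device of compute capability 3.5 or higher and a build with relocatable device code (for example nvcc -rdc=true, linked against cudadevrt). The names child, parent, and nestedFill are illustrative, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Child kernel: fills one slice of the output with its own indices.
__global__ void child(int *out, int offset) {
    out[offset + threadIdx.x] = offset + threadIdx.x;
}

// Parent kernel: every parent thread launches its own child grid from the GPU.
__global__ void parent(int *out, int childThreads) {
    child<<<1, childThreads>>>(out, threadIdx.x * childThreads);
}

// Hypothetical host helper: returns an array of parents * childThreads values.
std::vector<int> nestedFill(int parents, int childThreads) {
    int n = parents * childThreads;
    int *d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    parent<<<1, parents>>>(d, childThreads);
    // Host-side wait: the parent grid does not complete until its children do.
    cudaDeviceSynchronize();
    std::vector<int> out(n);
    cudaMemcpy(out.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return out;
}
```

The host launches only the parent grid; the work is divided among child grids on the device without a round trip to the CPU, which is the simplification of the CPU/GPU divide mentioned above.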

GPU Object Linking [Harris, GPU Tech Conf,2012]

RDMA: Remote Direct Memory Access between any GPUs in a cluster [Harris, GPU Tech Conf,2012]

CUDA-Lite: a source-to-source translation tool that relieves the programmer from handling the memory hierarchy [Ueng, LCPC, 2008]

m-CUDA makes the CUDA architecture run on regular multi-core CPU systems. It demonstrates the effectiveness of the CUDA model on non-GPU systems as well [Buck,SC08,2008]

CUDA is not as simple as it sounds, and people have questioned its future. CUDA has a strong reputation for performance, but at the expense of ease of programming. Alternatives such as XMT have been developed to challenge CUDA. XMT is a many-core, general-purpose parallel architecture. [Caragea,,Hotpar 2010]

375 million CUDA-capable GPUs sold by NVIDIA. 1 million toolkit downloads. More than 120,000 active developers. An active research community. New domains such as big-data analytics: Shazam, a top-5 music app in the Apple Store; SalesForce.com, real-time Twitter data analysis; and many more. Source: NVIDIA

[Nickolls,IEEE,2010]

CUDA is promising but supports only NVIDIA GPUs. OpenCL and AMD Brook are not mainstream yet. Still needed: automatic extraction of parallelism; automatic conversion of existing code bases written in popular models, e.g. Java threads; more support for higher-level languages.

[Buck,SC08,2008]: Massimiliano Fatica (NVIDIA), Patrick LeGresley (NVIDIA), Ian Buck (NVIDIA), John Stone (University of Illinois at Urbana-Champaign), Jim Phillips (University of Illinois at Urbana-Champaign), Scott Morton (Hess Corporation), Paulius Micikevicius (NVIDIA), "High Performance Computing with CUDA", Nov. 2008.
[Ueng, LCPC, 2008]: Sain-Zee Ueng, Melvin Lathara, Sara S, Wen-mei W. Hwu, "CUDA-Lite: Reducing GPU Programming Complexity", International Workshop, LCPC 2008, Edmonton, Canada, July 31 - August 2, 2008.
[Nickolls,IEEE,2010]: J. Nickolls, "The GPU Computing Era", Micro, IEEE, 2010.
[Harris, GPU Tech Conf,2012]: Mark Harris, "CUDA 5 and Beyond", GPU Tech Conference 2012.
[Nickolls,ACM,2008]: John Nickolls, Ian Buck, Michael Garland, Kevin Skadron, "Scalable Parallel Programming with CUDA", Queue: GPU Computing, Vol. 6, Issue 2, ACM Digital Library, April 2008.
[Kirk,2010]: David B. Kirk, Wen-mei W. Hwu, "Programming Massively Parallel Processors: A Hands-on Approach", 2010.
[Caragea,,Hotpar 2010]: GC Caragea, F Keceli, A Tzannes, U Vishkin, Proc. HotPar, 2010.

Thank You!