GPUs: Overview of Architecture and Programming Options Lee Barford firstname dot lastname at gmail dot com.


Outline
– Why parallel computing is now important
– What GPUs are and what they provide
– Overview of GPU architecture: enough to orient the discussion of programming them; future changes
– Three “languages” for programming GPUs
– Those we’re not doing include CUDA Fortran, Python CUDA & OpenCL bindings, WebCL

[Figure: graph from UC Berkeley ParLab showing serial application performance, with an exponentially growing gap]

Graphics Processor (GPU) as Parallel Accelerator
– Commodity priced, massively parallel floating point
– Claimed performance on various problems is many times that of a CPU running serial code

The GPU as a Co-Processor to the CPU: the physical and logical connections
[Diagram: CPU with main memory and chipset; GPU with its own memory; the two sides joined by a slow PCIe link carrying control actions and code (kernels) to run. I/Os attach on the CPU side: video, Ethernet, USB hub, Firewire, …]
Running GPU code is like requesting asynchronous I/O.
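A minimal CUDA C sketch of this pattern (the kernel name `scale` and the values are illustrative, not from the slides): the kernel launch returns immediately, like posting an asynchronous I/O request, and the CPU blocks only when it asks for the result.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Trivial kernel: multiply every array element by a (runs on the GPU).
__global__ void scale(float *d_x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *h_x = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));                             // allocate GPU memory
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice); // slow trip across PCIe

    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n); // launch returns immediately (asynchronous)

    // ...the CPU is free to do other work here, as after requesting async I/O...

    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost); // blocks until kernel is done
    printf("h_x[0] = %f\n", h_x[0]);
    cudaFree(d_x);
    free(h_x);
    return 0;
}
```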

0.5-3 years from now: fusion of CPU and GPU
[Diagram: one chip containing multiple CPU cores, a GPU, and a hardware task scheduler, sharing main memory and the I/O subsystem]
Running GPU code will be like queuing method pointers for future execution (like C++11, TBB, TPL, PPL).

Programming Tomorrow’s CPU will be Like Programming Today’s GPU
– GPUs that compute will come “for free” with computers
– The slow step of moving data to/from the GPU will be eliminated
– A hardware task scheduler for both CPU and GPU will almost eliminate OS & I/O overhead for invoking GPU kernels, and also almost eliminate OS overhead for invoking parallel tasks on the CPU
– An AMD laptop chip is available now (but no boards/systems); an NVIDIA GPU+ARM chip is available now for battery-operated devices; both promise desktop chips in the next year or two
– Programming models will probably evolve from what we’ll cover
– This course will use current, PCIe-based GPUs, so we will be dealing with overheads that will pass away over the next few years

CUDA (NVIDIA) GPU Compute Architecture: Many Simple, Floating-Point Cores

32 cores (a Streaming Multiprocessor, SM) share an instruction stream and registers, and execute the same program (kernel)
– SPMD: roughly, at the same place in the same kernel at the same time
– An SM acts like many more cores by switching context instead of waiting for memory
– 1000’s of virtual cores execute the same lines of code together, but share limited resources
– Cores are organized into groups

The GPU has multiple SMs, and the SMs run in parallel
– SMs do not need to be executing the same location in the same program at the same time
– In aggregate, many 1000’s of parallel copies of the same kernel run simultaneously
– Total of up to 1 Tflop/s at peak
– Central software issue: how to generate and control this much parallelism
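A small CUDA sketch of where this parallelism comes from (the device-query calls are standard CUDA runtime API; the launch-sizing arithmetic is the usual idiom, not taken from the slides): the programmer asks for one thread per data element, and the hardware farms whole blocks out to SMs as they become free.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Each SM executes thread blocks independently; ask how many SMs this GPU has.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs: %d, max threads per block: %d\n",
           prop.multiProcessorCount, prop.maxThreadsPerBlock);

    // Generating 1000's of parallel kernel copies: one thread per element,
    // rounded up to whole blocks.
    int n = 1 << 20;                     // e.g., one million elements
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // ceil(n / 256)
    printf("launch <<<%d, %d>>> = %d threads\n",
           blocks, threadsPerBlock, blocks * threadsPerBlock);
    return 0;
}
```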

GPUs: Programming Options
– Libraries, called from CPU code: write no GPU code. Examples: image/video processing, dense & sparse matrix, FFT, random numbers
– Generic programming for the GPU: Thrust. Like the C++ Standard Template Library: specialize & use built-in data structures and algorithms. NVIDIA GPUs only
– Programming the GPU directly: CUDA C/C++, OpenCL, WebCL, CUDA Fortran, various Python libraries. Write code that runs on the GPU (kernels), plus CPU code that directly controls and coordinates:
–Data movement between CPU memory and GPU memory
–Startup of kernels on the GPU
–CPU processing of results from the GPU when they become available
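As a sketch of the Thrust style of generic programming (assumes the CUDA toolkit, which bundles Thrust; the data values are illustrative): no kernel is written by hand, and the STL-like containers handle the CPU-to-GPU data movement.

```cuda
#include <cstdio>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

int main() {
    // Build data on the CPU side, STL-style.
    thrust::host_vector<int> h(4);
    h[0] = 3; h[1] = 1; h[2] = 4; h[3] = 2;

    // Assignment copies across PCIe into GPU memory.
    thrust::device_vector<int> d = h;

    // The library generates and launches the GPU kernels for the sort.
    thrust::sort(d.begin(), d.end());

    // Copy the sorted result back and print it: 1 2 3 4.
    thrust::copy(d.begin(), d.end(), h.begin());
    for (int i = 0; i < 4; ++i) printf("%d ", h[i]);
    printf("\n");
    return 0;
}
```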

CUDA C/C++ vs OpenCL
CUDA C/C++:
– Proprietary (NVIDIA); code runs on NVIDIA GPUs
– Reportedly 10-50% faster than OpenCL
– Compiles at build time to binary code for the particular targeted hardware (specific NVIDIA hardware architecture versions)
– No compiler available at run time
OpenCL:
– Open standard (Khronos); the same code runs on NVIDIA & AMD GPUs, x86 multicore, and (in academic research) FPGAs
– Compiles at build time to an intermediate form that is compiled at run time for the hardware that is present
– Compiler is available at run time, so it can execute downloaded or dynamically generated source code

The Three Programming Environments We’ll Cover
– OpenCL: write once, run many. Supports heterogeneous parallel machines (fusion). Tool chains good enough for research. IMHO, will eventually replace CUDA C/C++
– CUDA C/C++: very efficient code, but lots of fussy detail to get that efficiency. Robust tool chains for Linux, Windows, MacOS. Specific to NVIDIA
– Thrust: easy to write. The algorithms provided are among the fastest (e.g., sort). NVIDIA GPUs only

Class Project Idea: accurate edge finding in a 1D signal
– A journal paper has been published on a multicore version; a student project last year did a Thrust implementation
– Project: do a CUDA version + performance tests
– A paper combining the previous student’s work with the above has a 60% probability of getting accepted in a particular IEEE conference; 3 co-authors, including the previous student & Lee
– Extended abstract due: Nov 6. Class project due during finals, same as everyone else. Camera-ready paper due: March 4
– See or email me in the next week or two if interested

Questions