GPU Introduction: Uses, Architecture, and Programming Model


GPU Introduction: Uses, Architecture, and Programming Model Lee Barford firstname dot lastname at gmail dot com

Outline
- Why GPUs? What GPUs are and what they provide
- Overview of GPU architecture: enough to orient the discussion of programming them
- Future changes
- Overview of tool chains: we will cover NVIDIA's CUDA

Power: energy used per unit time. Power is the dominant practicality and cost constraint.

Insatiable need for floating point computing
- Graphics: gaming, animation
- Simulation: electronics, aerodynamics, automotive, biochemistry
- Machine learning
- More flops/sec → more realistic, more accurate results
- Power & cooling are the limits on more flops: the key metric is flops/sec/Watt
- Supercomputer (dominant to c. 1990) → compute cluster of unaccelerated CPUs (dominant to c. 2010)

Extreme throughput integer applications
- Cryptography
- Cryptocurrency mining
- Blockchain
- Profit is set by the rate of computations offset by energy costs → want to maximize integer operations/s/W

Exponentially growing gap between serial application performance and parallel throughput: GPU designs maximize the number of cores to improve ops/s/W. (Graph from UC Berkeley ParLab)

Graphics Processor (GPU) as Parallel Accelerator
- Commodity-priced, massively parallel floating point
- Claimed performance on various problems: 50-2500x a CPU running serial code
- Graph from http://drdobbs.com/high-performance-computing/231500166

The GPU as a Co-Processor to the CPU: the physical and logical connections
- The CPU sends control actions and code (kernels) to run on the GPU
- The CPU chipset connects to the GPU and its memory over a (slow) PCIe link; the CPU keeps its own main memory and I/Os (video, Ethernet, USB hub, Firewire, ...)
- Running GPU code is like requesting asynchronous I/O
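The asynchronous-I/O analogy can be made concrete with CUDA streams. In this minimal sketch (the kernel name `scale` and the sizes are illustrative, not from the slides), the CPU enqueues a copy, a kernel, and a copy-back, then waits, much like posting an I/O request and waiting for completion:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial illustrative kernel: double each element.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // pinned host memory, needed for async copies
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t s;
    cudaStreamCreate(&s);
    // Like posting asynchronous I/O: enqueue the work, then the CPU is free to continue.
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, s);
    scale<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, s);
    // ... unrelated CPU work can overlap with the GPU here ...
    cudaStreamSynchronize(s);                // "wait for the I/O to complete"
    printf("h[0] = %f\n", h[0]);
    cudaFree(d); cudaFreeHost(h); cudaStreamDestroy(s);
    return 0;
}
```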

Now from AMD & Intel: fusion of CPU and GPU
- Multiple cores, a hardware task scheduler, main memory, and the I/O subsystem on one chip
- Running GPU code will be like passing method pointers for future execution (like C++11, TBB, TPL, PPL)

Programming implications
- Write two programs, in two languages
- Main program on the CPU: startup, shutdown, I/O, networking, databases, other non-GPU functionality; controls the passing of data between CPU and GPU; invokes code to run on the GPU
- Kernels on the GPU: computation-heavy subroutines (the term comes from simulation of partial differential equations)
- The GPU must save enough time to make the work of moving data between CPU and GPU pay off

CUDA (NVIDIA) GPU Compute Architecture: Many Simple, Floating-Point Cores

Cores organized into groups
- The 32 cores of a Streaming Multiprocessor (SM) share an instruction stream and registers, and execute the same program (kernel) in lock step
- SPMD: roughly, the same place in the same kernel at the same time
- An SM acts as 100-1000's more cores by switching context instead of waiting for memory
- 1000's of virtual cores execute the same lines of code together, but share limited resources

GPU has multiple SMs
- SMs run in parallel, and do not need to be executing the same location in the same program at the same time
- In aggregate, many 1000's of parallel copies of the same kernel run simultaneously: a total of up to 1 Tflop/s at peak
- CENTRAL SOFTWARE ISSUES: how to generate and control this much parallelism, and how to avoid slowing down while waiting for off-chip GPU DRAM memory accesses
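The scale of parallelism involved can be seen in a SAXPY sketch (a standard example, not from the slides; all names and sizes are illustrative). The launch below creates one logical thread per element, millions of them, far more than the physical cores; the hardware hides memory latency by switching among them:

```cuda
#include <cuda_runtime.h>

// y = a*x + y, one logical thread per element.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard: the grid may overshoot n
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 24;                 // 16M elements -> 16M logical threads
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));   // zero-fill so the kernel reads defined data
    cudaMemset(y, 0, n * sizeof(float));

    int block = 256;                       // 8 warps of 32 lock-step threads per block
    int grid  = (n + block - 1) / block;   // enough blocks to cover every element
    saxpy<<<grid, block>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();               // kernel launches are asynchronous

    cudaFree(x); cudaFree(y);
    return 0;
}
```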

GPU Programming Options
- Libraries, called from CPU code: write no GPU code. Examples: image/video processing, dense & sparse matrix, FFT, random numbers
- Generic programming for the GPU: Thrust. Like the C++ Standard Template Library; specialize & use built-in data structures and algorithms; NVIDIA GPUs only
- Programming GPU kernels in a special-purpose language (emphasis in this course): CUDA C/C++, PyCUDA, CUDA Fortran; OpenCL, WebCL, ...
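As a taste of the Thrust option, a parallel sort on the GPU reads like STL code. This sketch assumes a CUDA toolkit with Thrust installed:

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cstdlib>

int main() {
    // Generate 1M random integers on the host.
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i) h[i] = rand();

    thrust::device_vector<int> d = h;   // copy to the GPU
    thrust::sort(d.begin(), d.end());   // sort runs in parallel on the device
    h = d;                              // copy the sorted data back
    return 0;
}
```

No kernel is written by hand here; Thrust picks the device algorithm, which is part of why its provided algorithms are hard to beat.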

Questions

Two Programming Environments that We'll Cover
- CUDA C/C++: very efficient code, but lots of fussy detail to get that efficiency; robust tool chains for Linux, Windows, MacOS; specific to NVIDIA
- Thrust: easy to write; the algorithms provided are among the fastest (e.g., sort); NVIDIA GPUs only

BACKUP SLIDES

CUDA C/C++ vs OpenCL

CUDA C/C++:
- Proprietary (NVIDIA); code runs on NVIDIA GPUs
- Reportedly 10-50% faster than OpenCL
- Compiles at build time to binary code for the particular targeted hardware (specific NVIDIA hardware architecture versions)
- No compiler available at run time

OpenCL:
- Open standard (Khronos)
- Code runs on NVIDIA & AMD GPUs, x86 multicore, and FPGAs (academic research) at the same time
- Compiles at build time to an intermediate form that is compiled at run time for the hardware that is present
- Compiler is available at run time, so it can execute downloaded or dynamically generated source code

Class Project Idea
- Accurate edge finding in a 1D signal
- A journal paper was published on the multicore version; a student project last year did a Thrust implementation
- Project: do a CUDA version + performance tests
- A paper combining the previous student's work with the above has a 60% probability of getting accepted in a particular IEEE conference; 3 co-authors, including the previous student & Lee
- Extended abstract due: Nov 6; class project due during finals, same as everyone else; camera-ready paper due: March 4
- See or email me in the next week or two if interested

Programming Tomorrow's CPU will be Like Programming Today's GPU
- GPUs that compute will come "for free" with computers; the slow step of moving data to/from the GPU will be eliminated
- A hardware task scheduler for both CPU and GPU will almost eliminate OS & I/O overhead for invoking GPU kernels, and also almost eliminate OS overhead for invoking parallel tasks on the CPU
- An AMD laptop chip and Intel laptops (e.g., the fall '12 refresh MacBook Pros) exist; an NVIDIA GPU+ARM chip is available now for battery-operated devices; both promise desktop chips in the next year or two
- Programming models will probably evolve from what we'll cover
- This course will use current, PCIe-based GPUs, so we will be dealing with overheads that will pass away over the next few years

Teraflop GPU that runs on a (biggish) battery