Heterogeneous Computing using OpenCL, Lecture 1. F21DP Distributed and Parallel Technology. Sven-Bodo Scholz

My Coordinates: Office EM G.27, email: S.Scholz@hw.ac.uk, contact time: Thursday after the lecture or by appointment

The Big Picture: Introduction to Heterogeneous Systems, OpenCL Basics, Memory Issues, Scheduling, Optimisations

Reading: + all about OpenCL

Heterogeneous Systems: a heterogeneous system is a distributed system containing different kinds of hardware that jointly solve problems.

Focus: Hardware: SMP + GPUs. Programming model: data parallelism.

Recap Concurrency: "Concurrency describes a situation where two or more activities happen in the same time interval and we are not interested in the order in which they happen."

Recap Parallelism: "In CS we speak of parallelism when two or more concurrent activities are executed by two or more physical processing entities."

Recap Data-Parallelism: the same operation is applied independently to every element of an array, e.g. the 3-point stencil b[i] = 0.25 * a[i-1] + 0.5 * a[i] + 0.25 * a[i+1] for interior elements, with boundary elements simply copied (b[i] = a[i]).
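A minimal sketch of this data-parallel stencil in plain C (the loop is written sequentially here, but every iteration is independent; the array contents and size are illustrative, not taken from the slide):

    #include <stdio.h>

    #define N 8

    int main(void) {
        double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};   /* example input, chosen arbitrarily */
        double b[N];

        /* boundary elements are copied unchanged */
        b[0]     = a[0];
        b[N - 1] = a[N - 1];

        /* every b[i] depends only on a fixed neighbourhood of a,
           so all iterations are independent and could run in parallel */
        for (int i = 1; i < N - 1; i++)
            b[i] = 0.25 * a[i - 1] + 0.5 * a[i] + 0.25 * a[i + 1];

        for (int i = 0; i < N; i++)
            printf("%g ", b[i]);
        printf("\n");
        return 0;
    }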

A Peek Preview: the same vector addition expressed three ways (in the original figure, each box is one loop iteration laid out over time).

Single-threaded (CPU):
    // there are N elements
    for (i = 0; i < N; i++) C[i] = A[i] + B[i];
One thread (T0) works through all iterations, 0 to 15, one after another.

Multi-threaded (CPU):
    // tid is the thread id, P is the number of cores;
    // each thread handles a contiguous chunk of the array
    for (i = tid * N / P; i < (tid + 1) * N / P; i++) C[i] = A[i] + B[i];
With N = 16 and P = 4: T0 handles elements 0-3, T1 handles 4-7, T2 handles 8-11, T3 handles 12-15.

Massively Multi-threaded (GPU):
    // tid is the thread id
    C[tid] = A[tid] + B[tid];
One thread per element: T0 through T15 each compute a single sum.

Conventional CPU Architecture: space is devoted to control logic instead of ALUs. CPUs are optimized to minimize the latency of a single thread and can efficiently handle control-flow-intensive workloads. Multi-level caches are used to hide latency; the number of registers is limited because only a small number of threads is active; control logic reorders execution, provides ILP and minimizes pipeline stalls.
[Block diagram: control logic and ALU, backed by L1, L2 and L3 caches and system memory over a roughly 25 GB/s link.]
A present-day multicore CPU can have more than one ALU per core (typically < 32), and part of the cache hierarchy is usually shared across cores.

Modern GPGPU Architecture: a generic many-core GPU devotes less space to control logic and caches. Large register files support multiple thread contexts, thread switching is hardware-managed and low-latency, and each "core" contains a large number of ALUs with a small user-managed cache. The memory bus is optimized for bandwidth: roughly 150 GB/s allows a large number of ALUs to be serviced simultaneously.
[Block diagram: on-board system memory connected through a high-bandwidth bus to many simple ALUs, each group with a small cache.]

Typical System: the GPU is attached to the host over PCI-Express. Memory transfers between host and GPU are initiated by the host, as are computations on the GPU (kernels).
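A hedged host-side sketch of this pattern in OpenCL (error checking omitted; the function run_vec_add and all variable names are illustrative, and the context, queue, buffers and kernel are assumed to have been created elsewhere):

    #include <CL/cl.h>

    void run_vec_add(cl_command_queue queue, cl_kernel kernel,
                     cl_mem d_a, cl_mem d_b, cl_mem d_c,
                     const float *h_a, const float *h_b, float *h_c, size_t n)
    {
        /* host-initiated transfers: copy the inputs across PCI-Express */
        clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0, n * sizeof(float), h_a, 0, NULL, NULL);
        clEnqueueWriteBuffer(queue, d_b, CL_TRUE, 0, n * sizeof(float), h_b, 0, NULL, NULL);

        /* host-initiated computation: launch the kernel, one work-item per element */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
        clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
        size_t global_size = n;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);

        /* host-initiated read-back of the result */
        clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0, n * sizeof(float), h_c, 0, NULL, NULL);
    }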

Nvidia GPUs, Fermi Architecture: the GTX 480 has compute capability 2.0 and 15 cores, or Streaming Multiprocessors (SMs); each SM features 32 CUDA processors, giving 480 CUDA processors in total, and global memory is protected by ECC.
[SM diagram: instruction cache, warp scheduler and dispatch unit, 32768 x 32-bit register file, CUDA cores, load/store units, SFUs, interconnect network, 64 kB L1 cache / shared memory, and L2 cache. A CUDA core contains a dispatch port, operand collector, FP unit, int unit and result queue.]

Nvidia GPUs, Fermi Architecture (continued): an SM executes threads in groups of 32 called warps, with two warp issue units per SM. Concurrent kernel execution lets multiple kernels run simultaneously to improve efficiency. A CUDA core consists of a single ALU and a floating-point unit (FPU).
[Same SM and CUDA-core diagram as on the previous slide.]

SIMD vs SIMT: SIMT denotes scalar instructions with multiple threads sharing an instruction stream; the hardware determines how the instruction stream is shared across the ALUs. Examples are NVIDIA GeForce ("SIMT" warps) and AMD Radeon architectures ("wavefronts"), where all the threads in a warp/wavefront proceed in lockstep and divergence between threads is handled using predication. SIMT instructions specify the execution and branching behaviour of a single thread. SIMD instructions, by contrast, expose the vector width explicitly; an example of SIMD is explicit vector instructions such as x86 SSE.
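To make the contrast concrete, here is a hedged sketch of the same element-wise addition written both ways (the function names are illustrative; the SSE version is plain C, while the SIMT version is an OpenCL C kernel that lives in its own separately compiled source):

    /* SIMD: the vector width (4 floats per SSE register) is explicit in the code */
    #include <xmmintrin.h>

    void add_sse(const float *a, const float *b, float *c, int n)
    {
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
        }
        for (; i < n; i++)               /* scalar remainder */
            c[i] = a[i] + b[i];
    }

    /* SIMT: each work-item runs scalar code; the hardware groups work-items
       into warps/wavefronts that share one instruction stream */
    __kernel void add_simt(__global const float *a,
                           __global const float *b,
                           __global float *c)
    {
        int i = get_global_id(0);
        c[i] = a[i] + b[i];
    }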

SIMT Execution Model: all ALUs execute the same instruction, and SIMD execution can be combined with pipelining. Each instruction is broken into phases, and when the first instruction completes (4 cycles in the example), the next instruction is ready to execute.
[Figure: successive Add and Mul instructions of a wavefront pipelined across the SIMD width over cycles 1-9.]

NVIDIA Memory Hierarchy: the L1 cache per SM is configurable to support shared memory and caching of global memory, either 48 kB shared / 16 kB L1 cache or 16 kB shared / 48 kB L1 cache. Data is shared between the work-items of a group using shared memory. Each SM has a 32K register bank. A 768 kB L2 cache services all operations (load, store and texture) and provides a unified path to global memory for loads and stores.
[Diagram: per-thread-block registers and shared memory / L1 cache, backed by the L2 cache and global memory.]

NVIDIA Memory Model in OpenCL: like AMD, only a subset of the hardware memory is exposed in OpenCL. The configurable shared memory is usable as local memory; local memory is used to share data between the work-items of a work-group at lower latency than global memory, and private memory maps to registers per work-item.
[Diagram: each compute unit has local memory plus private memory per work-item; all compute units share the global/constant memory data cache and the compute device memory (global memory).]
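A hedged OpenCL C kernel sketch that touches all three memory spaces; the kernel name block_sum and the work-group reduction it performs are illustrative only (it assumes a power-of-two work-group size):

    __kernel void block_sum(__global const float *in,    /* global memory */
                            __global float *partial,     /* global memory */
                            __local float *scratch)      /* local memory, shared by the work-group */
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        int lsz = get_local_size(0);

        float x = in[gid];               /* private memory: held in registers */

        scratch[lid] = x;                /* stage data in low-latency local memory */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* tree reduction within the work-group */
        for (int stride = lsz / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if (lid == 0)
            partial[get_group_id(0)] = scratch[0];
    }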

Mapping Threads to Data: consider a simple vector addition of 16 elements. Two input buffers (A, B) and one output buffer (C) are required, with C = A + B computed element-wise over array indices 0-15.

Mapping Threads to Data: create a thread structure to match the problem, a 1-dimensional one in this case, with one thread per array index (thread IDs 0-15). Thread code: C[tid] = A[tid] + B[tid].
[Figure: the thread structure maps each thread ID to one index of the vector addition A + B = C.]
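Spelled out as a complete OpenCL C kernel, the one-line thread body looks roughly like this (the kernel name vec_add is an assumption, not taken from the slides):

    __kernel void vec_add(__global const float *A,
                          __global const float *B,
                          __global float *C)
    {
        int tid = get_global_id(0);   /* one work-item per vector element */
        C[tid] = A[tid] + B[tid];
    }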

Thread Structure: OpenCL's thread structure is designed to be scalable. Each instance of a kernel is called a work-item (though "thread" is commonly used as well). Work-items are organized into work-groups, and work-groups are independent of one another (this is where the scalability comes from). An index space defines the hierarchy of work-groups and work-items.

Thread Structure: work-items can uniquely identify themselves either by a global ID (unique within the index space) or by a work-group ID together with a local ID within the work-group.
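A small sketch of the corresponding OpenCL C built-ins (the kernel name is illustrative; the identity in the comment holds for a 1-D index space):

    __kernel void show_ids(__global int *out)
    {
        size_t gid   = get_global_id(0);    /* global id within the index space */
        size_t group = get_group_id(0);     /* which work-group this item belongs to */
        size_t lid   = get_local_id(0);     /* local id within that work-group */
        size_t lsz   = get_local_size(0);   /* work-group size */

        /* for a 1-D index space: gid == group * lsz + lid */
        out[gid] = (int)(group * lsz + lid);
    }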

If you are excited... you can get started on your own: google AMD OpenCL, Intel OpenCL, Apple OpenCL or Khronos, download an SDK, and off you go!
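Once an SDK is installed, a first sanity check is just a platform and device query; a minimal sketch, assuming the header lives at CL/cl.h (on macOS it is OpenCL/opencl.h):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        char name[256];

        /* pick the first platform and the first device it offers */
        if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS) {
            fprintf(stderr, "no OpenCL platform found\n");
            return 1;
        }
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
        clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("first OpenCL device: %s\n", name);
        return 0;
    }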