
GPGPU introduction

Why is the GPU in the picture?
Seeking an exa-scale computing platform: minimize power per operation.
– Power is directly correlated to the area of the processing chips.
– Regular CPU: where does the chip area go? Mostly control logic and cache, so operations per watt are low.
– How to maximize operations per watt? Give most of the area to building lots of ALUs, and minimize the area for control and cache.
– This is how a GPU is built, so its power per operation is much better than a regular CPU's.
The most power-efficient system built using CPUs only, IBM Blue Gene, reaches about 2 Gflops/Watt; a system built using CPU+GPU can reach close to 4 Gflops/Watt.
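The efficiency comparison above is just operations divided by energy. A back-of-the-envelope sketch (the flop rates and power draws below are illustrative assumptions chosen to reproduce the slide's 2 vs. 4 Gflops/Watt figures, not measurements):

```python
def gflops_per_watt(gflops, watts):
    """Operations per unit of energy: higher is better."""
    return gflops / watts

# Illustrative numbers in the spirit of the slide: a CPU-only
# (Blue Gene class) system vs. a CPU+GPU system of the same power budget.
cpu_only = gflops_per_watt(gflops=200_000, watts=100_000)  # 2.0 Gflops/Watt
cpu_gpu  = gflops_per_watt(gflops=400_000, watts=100_000)  # 4.0 Gflops/Watt
print(cpu_only, cpu_gpu)
```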

Graphics Processing Unit (GPU)
The GPU is the chip in computer video cards, the PS3, the Xbox, etc.
– Designed to realize the 3D graphics pipeline: Application → Geometry → Rasterizer → Image
GPU development:
– Fixed-function graphics hardware
– Programmable vertex/pixel shaders
– GPGPU: general-purpose computation (beyond graphics), using the GPU in applications other than 3D graphics
The GPGPU can be treated as a co-processor for compute-intensive tasks, provided there is sufficiently large bandwidth between CPU and GPU.

CPU and GPU
The GPU is specialized for compute-intensive, highly data-parallel computation.
– More die area is dedicated to processing.
– Good for high-arithmetic-intensity programs, i.e., programs with a high ratio of arithmetic operations to memory operations.
(Figure: CPU die dominated by control logic and cache with a few ALUs, vs. GPU die dominated by ALUs; each attached to its own DRAM.)
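Arithmetic intensity can be estimated by counting operations and bytes moved. A minimal sketch for the SAXPY kernel y = a*x + y, using the standard per-element counts for that kernel:

```python
def arithmetic_intensity(flops, bytes_moved):
    """Ratio of arithmetic operations to memory traffic (flops/byte)."""
    return flops / bytes_moved

# SAXPY on n single-precision elements: per element, 1 multiply + 1 add,
# with 2 loads (x[i], y[i]) and 1 store (y[i]) of 4 bytes each.
n = 1_000_000
flops = 2 * n
bytes_moved = 3 * 4 * n
print(arithmetic_intensity(flops, bytes_moved))  # ~0.167 flops/byte: memory-bound
```

A low ratio like this means performance is limited by memory bandwidth rather than by how many ALUs the chip has, which is why arithmetic intensity matters for GPU suitability.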

One powerful CPU, or many less powerful cores?

Flop rate of CPU and GPU
(Figure: peak GFlop/sec over time for CPU vs. GPU, single and double precision.)

Compute Unified Device Architecture (CUDA)
A hardware/software architecture that lets NVIDIA GPUs execute programs written in different languages.
– Main concept: hardware support for a hierarchy of threads (a grid of blocks of threads).

Fermi architecture
The first generation (GTX 465, GTX 480, Tesla C2050, etc.) has 512 CUDA cores:
– 16 streaming multiprocessors (SMs) of 32 processing units (cores) each.
– Each core executes one floating-point or integer instruction per clock for a thread.
The latest Tesla GPU, the Tesla K40, has 2880 CUDA cores based on the Kepler architecture (15 SMX units). The warp size is still the same, but there is more of everything.
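The core counts quoted above follow directly from the per-multiprocessor layout (numbers taken from the slide):

```python
fermi_cores  = 16 * 32   # 16 SMs  x 32 cores each  -> 512  (GTX 480 / Tesla C2050)
kepler_cores = 15 * 192  # 15 SMXs x 192 cores each -> 2880 (Tesla K40)
print(fermi_cores, kepler_cores)
```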

Fermi Streaming Multiprocessor (SM)
32 CUDA processors with pipelined ALU and FPU.
– Executes a group of 32 threads called a warp.
– Supports IEEE 754 single- and double-precision floating point with a fused multiply-add (FMA) instruction.
– Configurable shared memory and L1 cache.

Kepler Streaming Multiprocessor (SMX)
192 CUDA processors with pipelined ALU and FPU.
– 4 warp schedulers.
– Executes groups of 32 threads called warps.
– Supports IEEE 754 single- and double-precision floating point with a fused multiply-add (FMA) instruction.
– Configurable shared memory and L1 cache.
– 48 KB data cache.

SIMT and warp scheduler
SIMT: single instruction, multiple threads.
– Threads are scheduled together in groups (of 16 or 32) called warps.
– All threads in a warp start at the same PC, but are free to branch and execute independently.
– A warp executes one common instruction at a time; when threads in a warp need to execute different instructions, the different paths are executed serially.
– For efficiency, we want all threads in a warp to execute the same instruction.
– SIMT is basically SIMD that emulates MIMD (programmers don't feel they are using SIMD).
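The cost of branch divergence can be illustrated by counting issue slots: when threads in a warp take different paths, each distinct path is executed serially for the whole warp, so the warp pays for all of them. A simplified model (assuming full serialization of divergent paths, as the slide describes, and ignoring re-convergence overhead):

```python
def warp_issue_slots(paths):
    """paths: one (path_id, instruction_count) pair per thread in the warp.
    Distinct paths execute serially, so the warp pays for each one once."""
    return sum(count for _path_id, count in set(paths))

# Uniform warp: all 32 threads take the same 10-instruction path.
print(warp_issue_slots([("A", 10)] * 32))                    # 10 slots

# Divergent warp: half take a 10-instruction path, half a 6-instruction one.
print(warp_issue_slots([("A", 10)] * 16 + [("B", 6)] * 16))  # 16 slots
```

In the divergent case every thread waits through both paths' issue slots, which is why keeping warps branch-uniform matters for throughput.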

Fermi warp scheduler
2 per SM, representing a compromise between cost and complexity.

Kepler warp scheduler
4 per SMX, each with 2 instruction dispatch units.

NVIDIA GPUs (toward general-purpose computing)
(Figure: the CPU-vs-GPU die layout again: control + cache + few ALUs vs. many ALUs, each with its DRAM.)

Typical CPU-GPU system
The main connection from the GPU to the CPU/memory is PCI Express (PCIe).
– PCIe 1.1 supports up to 8 GB/s (common systems support 4 GB/s).
– PCIe 2.0 supports up to 16 GB/s.
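These link rates translate directly into copy times. A sketch using the slide's figures (peak rates only; real transfers also pay latency and protocol overhead):

```python
def transfer_time_ms(num_bytes, bandwidth_gb_per_s):
    """Lower bound on copy time over a link at the given peak bandwidth."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3

one_gb = 1e9
print(transfer_time_ms(one_gb, 16))  # 62.5 ms at the PCIe 2.0 peak
print(transfer_time_ms(one_gb, 4))   # 250.0 ms on a common PCIe 1.1 setup
```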

Bandwidth in a CPU-GPU system

GPU as a co-processor
The CPU gives compute-intensive jobs to the GPU and stays busy with control of the execution.
Main bottleneck: the connection between main memory and GPU memory.
– Data must be copied over for the GPU to work on, and the results must come back from the GPU.
– PCIe is reasonably fast, but is often still the bottleneck.
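Whether offloading pays off can be estimated by comparing GPU compute time plus copy time against CPU-only time. A rough model (all rates below are illustrative assumptions, and the model ignores overlap of copies with compute):

```python
def offload_wins(flops, bytes_over_pcie,
                 cpu_gflops, gpu_gflops, pcie_gb_per_s):
    """True if GPU compute + PCIe copies beat CPU-only execution."""
    cpu_time = flops / (cpu_gflops * 1e9)
    gpu_time = (flops / (gpu_gflops * 1e9)
                + bytes_over_pcie / (pcie_gb_per_s * 1e9))
    return gpu_time < cpu_time

# Compute-heavy job: many flops per byte copied -> offloading wins.
print(offload_wins(1e12, 1e8, cpu_gflops=100, gpu_gflops=1000, pcie_gb_per_s=8))

# Copy-dominated job: little work per byte -> the PCIe copies dominate.
print(offload_wins(1e9, 1e9, cpu_gflops=100, gpu_gflops=1000, pcie_gb_per_s=8))
```

The second case shows the slide's point: even a 10x faster device loses when the job's data movement over PCIe outweighs its arithmetic.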

GPGPU constraints
– Dealing with programming models for the GPU, such as CUDA C or OpenCL.
– Dealing with limited capability and resources; code is often platform-dependent.
– The problem of mapping computation onto hardware that is designed for graphics.