GPU System Architecture Alan Gray EPCC The University of Edinburgh.

Outline
Why do we want/need accelerators such as GPUs?
Architectural reasons for accelerator performance advantages
Latest accelerator products
–NVIDIA and AMD GPUs
Accelerated systems

Why do we need accelerators?
The power used by a CPU core is proportional to Clock Frequency x Voltage²
In the past, computers got faster by increasing the frequency
–Voltage was decreased to keep power reasonable.
Now, voltage cannot be decreased any further
–1s and 0s in a system are represented by different voltages
–Reducing overall voltage further would reduce this difference to a point where 0s and 1s cannot be properly distinguished
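
The proportionality above can be tried out with a few numbers. The following is a minimal sketch, not part of the original slides: the function name `relative_power` and the scaling factors (a 40% clock increase, a 15% voltage reduction) are hypothetical values chosen only to illustrate the relation.

```python
def relative_power(freq_scale, volt_scale):
    """Power relative to a baseline core, using power ∝ frequency x voltage^2."""
    return freq_scale * volt_scale ** 2


# Hypothetical scaling factors, chosen only for illustration.
# Past: raise the clock by 40% while lowering the voltage by 15%.
with_voltage_drop = relative_power(1.4, 0.85)
# Now: the same clock increase with the voltage held fixed.
fixed_voltage = relative_power(1.4, 1.0)

print(f"40% faster clock, 15% lower voltage: {with_voltage_drop:.2f}x power")
print(f"40% faster clock, same voltage:      {fixed_voltage:.2f}x power")
```

With these assumed factors the first case leaves power almost unchanged (about 1.01x), while the second raises it by the full 1.4x, which is why frequency scaling stops being attractive once the voltage can no longer be lowered.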

Why do we need accelerators?
Instead, performance increases can be achieved through exploiting parallelism
Need a chip which can perform many parallel operations every clock cycle
–Many cores and/or many operations per core
Want to keep power/core as low as possible
Much of the power expended by CPU cores is on functionality not generally that useful for HPC
–e.g. branch prediction

Why do we need accelerators?
So, for HPC, we want chips with simple, low-power, number-crunching cores
But we need our machine to do other things as well as the number crunching
–Run an operating system, perform I/O, set up the calculation, etc.
Solution: “Hybrid” system containing both CPU and “accelerator” chips

Why do we need accelerators?
It costs a huge amount of money to design and fabricate new chips
–Not feasible for the relatively small HPC market
Luckily, over the last few years, Graphics Processing Units (GPUs) have evolved for the highly lucrative gaming market
–And largely possess the right characteristics for HPC
–Many number-crunching cores
GPU vendors NVIDIA and AMD have tailored existing GPU architectures to the HPC market
GPUs are now firmly established in the HPC industry

AMD 12-core CPU
Not much space on CPU is dedicated to compute
[Figure: chip layout; highlighted areas = compute unit (= core)]

NVIDIA Fermi GPU
GPU dedicates much more space to compute
–At expense of caches, controllers, sophistication etc.
[Figure: chip layout; highlighted areas = compute unit (= SM = 32 CUDA cores)]

Memory
GPUs use Graphics memory: much higher bandwidth than standard CPU memory
–CPUs use DRAM
–GPUs use Graphics DRAM
For many applications, performance is very sensitive to memory bandwidth

Latest Technology
NVIDIA
–Tesla HPC-specific GPUs have evolved from GeForce series
AMD
–FirePro HPC-specific GPUs have evolved from (ATI) Radeon series

NVIDIA Tesla Series GPU
Chip partitioned into Streaming Multiprocessors (SMs)
Multiple cores per SM
Not cache coherent. No communication possible across SMs.

NVIDIA GPU SM
Fewer scheduling units than cores
Threads are scheduled in groups of 32, called a warp
Threads within a warp always execute the same instruction in lock-step (on different data elements)
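
The warp grouping described above can be sketched in a few lines. This is a simplified illustrative model, not a description of the hardware scheduler: the helper names `split_into_warps` and `passes_for_branch` are hypothetical, and the idea that a warp needs one pass per distinct branch outcome is an assumption that follows from the lock-step rule rather than something stated in the slides.

```python
WARP_SIZE = 32  # threads per warp, as stated in the slide


def split_into_warps(num_threads):
    """Group thread indices 0..num_threads-1 into consecutive warps of up to 32."""
    return [list(range(start, min(start + WARP_SIZE, num_threads)))
            for start in range(0, num_threads, WARP_SIZE)]


def passes_for_branch(warp, takes_branch):
    """Simplified model: a warp in lock-step needs one pass per distinct
    branch outcome among its threads (1 if all agree, 2 if they differ)."""
    return len({takes_branch(thread) for thread in warp})


warps = split_into_warps(100)  # 100 threads -> warps of 32, 32, 32 and 4 threads

# Branch decided per block of 64 threads: every warp agrees internally.
uniform = [passes_for_branch(w, lambda t: t < 64) for w in warps]
# Branch decided by odd/even thread index: every warp is split.
divergent = [passes_for_branch(w, lambda t: t % 2 == 0) for w in warps]

print("threads per warp:", [len(w) for w in warps])
print("passes, uniform branch:  ", uniform)
print("passes, divergent branch:", divergent)
```

Under this model, arranging work so that the threads of a warp take the same path keeps each warp to a single pass.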

NVIDIA Tesla Series

                   “Fermi” 2050   “Fermi” 2070   “Fermi” 2090   “Kepler” K20   “Kepler” K20X
CUDA cores
DP Performance     515 GFlops     515 GFlops     665 GFlops     1.17 TFlops    1.31 TFlops
Memory Bandwidth   144 GB/s       144 GB/s       178 GB/s       208 GB/s       250 GB/s
Memory             3 GB           6 GB           6 GB           5 GB           6 GB
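
To see why memory bandwidth matters as much as peak compute, the figures for the first and last cards in the NVIDIA Tesla Series slide can be combined into a rough bandwidth-bound estimate. This is a sketch under stated assumptions, not part of the original slides: the simple min(peak, bandwidth x intensity) bound is a standard back-of-envelope model swapped in for illustration, and the example kernel (y = a*x + y on double-precision values, 2 floating-point operations per 24 bytes moved) is hypothetical.

```python
# (peak double-precision GFlops, memory bandwidth in GB/s), taken from the
# first and last columns of the Tesla Series figures in the slide.
cards = {
    "Fermi 2050": (515.0, 144.0),
    "Kepler K20X": (1310.0, 250.0),
}


def flops_per_byte_needed(peak_gflops, bandwidth_gbs):
    """Operations a kernel must perform per byte moved to reach peak compute."""
    return peak_gflops / bandwidth_gbs


def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Upper bound on performance: limited by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Hypothetical kernel: y = a*x + y on doubles.
# 2 operations per element; 3 x 8 bytes moved (read x, read y, write y).
axpy_intensity = 2.0 / 24.0

for name, (peak, bandwidth) in cards.items():
    balance = flops_per_byte_needed(peak, bandwidth)
    bound = attainable_gflops(peak, bandwidth, axpy_intensity)
    print(f"{name}: needs {balance:.2f} flop/byte for peak; "
          f"axpy-like kernel bounded at {bound:.1f} GFlops")
```

In this model a low-intensity kernel is limited by the memory bandwidth figure, not by the DP performance figure, on both cards.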

NVIDIA Roadmap
[Figure: NVIDIA roadmap]

AMD FirePro
AMD acquired ATI in 2006
AMD FirePro series: derivative of Radeon chips with HPC enhancements
FirePro S10000
–Card contains 2 GPUs
–Peak 1.48 TFLOPS (double precision)
–6 GB GDDR5 SDRAM
Much less widely used than NVIDIA, because of programming support issues

GPU accelerated systems
GPUs cannot be used instead of CPUs
–They must be used together
–GPUs act as accelerators
–Responsible for the computationally expensive parts of the code
[Figure: system diagram with labels DRAM, CPU, PCIe, GPU, GDRAM, I/O]

Programming
CUDA: Extensions to the C language which allow interfacing to the hardware (NVIDIA specific)
OpenCL: Similar to CUDA but cross-platform (including AMD and NVIDIA)
Directive-based approach: directives help the compiler to automatically create code for the GPU. OpenACC and now also new OpenMP

GPU Accelerated Systems
CPUs and Accelerators are used together
–Communicate over PCIe bus
[Figure: system diagram with labels DRAM, CPU, PCIe, Accelerator, GDRAM, I/O]

Scaling to larger systems
[Figure: node diagram with labels DRAM, CPU, PCIe, I/O, Accelerator + GDRAM (two of each), Interconnect]
Can have multiple CPUs and accelerators within each “workstation” or “shared memory node”
–E.g. 2 CPUs + 2 Accelerators (above)
–CPUs share memory, but Accelerators do not
Interconnect allows multiple nodes to be connected

GPU Accelerated Supercomputer
[Figure: grid of Acc.+CPU nodes]

To run on multiple accelerators in parallel
–Normally use one host CPU core (thread) per accelerator
–Program manages communication between host CPUs in the same fashion as traditional parallel programs
–e.g. MPI and/or OpenMP (latter shared memory node only)
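
The "one host core per accelerator" arrangement can be sketched as a mapping from host process number to node and local accelerator. This is a minimal illustration, not the presenter's code: it uses plain Python in place of a real MPI launch, the function name `host_layout` is hypothetical, and it assumes host processes are numbered consecutively within each node, which depends on how the job launcher places them.

```python
def host_layout(num_nodes, accels_per_node):
    """Return (rank, node, local_accelerator) for one host process per accelerator.

    Assumes ranks are placed consecutively on each node (an assumption about
    the launcher, not something guaranteed in general).
    """
    layout = []
    for rank in range(num_nodes * accels_per_node):
        node = rank // accels_per_node         # which shared memory node
        local_accel = rank % accels_per_node   # which accelerator on that node
        layout.append((rank, node, local_accel))
    return layout


# Two nodes, each with 2 accelerators, as in the "2 CPUs + 2 Accelerators" example.
layout = host_layout(num_nodes=2, accels_per_node=2)
for rank, node, accel in layout:
    print(f"host process {rank}: node {node}, accelerator {accel}")
```

In a real program each host process would select its accelerator using the local index and then exchange data with the other host processes through MPI, as in a traditional parallel code.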

DIY GPU Workstation
Just need to slot GPU card into a PCIe slot
Need to make sure there is enough space and power in the workstation

GPU Servers
Multiple servers can be connected via interconnect
Several vendors offer GPU Servers
Example Configuration:
–4 GPUs plus 2 (multicore) CPUs

Cray XK7
Each compute node contains 1 CPU + 1 GPU
–Can scale up to 500,000 nodes

Cray XK7
[Figure: Cray XK7 diagram, labelled PCIe Gen2]

Cray XK7 Compute Blade
Compute Blade: 4 Compute Nodes
4 CPUs (middle) + 4 GPUs (right) + 2 interconnect chips (left)
(2 compute nodes share a single interconnect chip)

Summary
GPUs have higher compute and memory bandwidth capabilities than CPUs
–Silicon dedicated to many simple cores
–Use of graphics memory
GPUs are typically not used alone, but work in tandem with CPUs
NVIDIA have the largest market share.
–AMD also have high performance GPUs, but these are not so widely used due to programming support
GPU accelerated systems scale from simple workstations to large-scale supercomputers