Modern GPU Architectures
Varun Sampath, University of Pennsylvania, CIS 565, Spring 2012

Agenda / GPU Decoder Ring
– Fermi / GF100 / GeForce GTX 480
  – "Fermi Refined" / GF110 / GeForce GTX 580
  – "Little Fermi" / GF104 / GeForce GTX 460
– Cypress / Evergreen / RV870 / Radeon HD 5870
  – Cayman / Northern Islands / Radeon HD 6970
– Tahiti / Southern Islands / GCN / Radeon HD 7970
– Kepler / GK104 / GeForce GTX 680
– Future
  – Project Denver
  – Heterogeneous System Architecture

From G80/GT200 to Fermi
GPU compute becomes a driver for innovation:
– Unified address space
– Control flow advancements
– Arithmetic performance
– Atomics performance
– Caching
– ECC (is this seriously a graphics card?)
– Concurrent kernel execution & fast context switching

Unified Address Space
– PTX 2.0 ISA supports 64-bit virtual addressing (40-bit in Fermi)
– CUDA 4.0+: address space shared with CPU
– Advantages?
Image from NVIDIA

Unified Address Space
cudaMemcpy(d_buf, h_buf, sizeof(h_buf), cudaMemcpyDefault)
– Runtime manages where buffers live
– Enables copies between different devices (not only GPUs) via DMA
  – Called GPUDirect
  – Useful for HPC clusters
– Pointers for global and shared memory are equivalent
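A minimal sketch of the slide's one-liner in context, assuming a pinned host buffer and an arbitrary size N (both my additions, not from the slides): with unified virtual addressing the runtime infers the copy direction from the pointer values themselves.

#include <cuda_runtime.h>

int main(void) {
    const size_t N = 1 << 20;                  // arbitrary example size
    float *h_buf = NULL, *d_buf = NULL;

    // Pinned host memory participates in the unified address space.
    cudaHostAlloc((void**)&h_buf, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void**)&d_buf, N * sizeof(float));

    // No cudaMemcpyHostToDevice flag needed: the runtime looks at the two
    // pointers, sees where each allocation lives, and picks the direction.
    cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyDefault);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}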

Control Flow Advancements
– Predicated instructions: avoid branching stalls (no branch predictor)
– Indirect function calls: call{.uni} fptr, flist;
– What does this enable support for?

Control Flow Advancements
– Predicated instructions: avoid branching stalls (no branch predictor)
– Indirect function calls: call{.uni} fptr, flist;
– What does this enable support for?
  – Function pointers
  – Virtual functions
  – Exception handling
– Fermi gains support for recursion
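As a hedged illustration of what indirect calls and recursion buy the CUDA programmer on sm_20+ (Fermi-class) hardware, here is a small sketch; the names factorial, unary_fn, and apply are hypothetical, not from the slides.

#include <cstdio>
#include <cuda_runtime.h>

__device__ int factorial(int n) {                 // device-side recursion (Fermi+)
    return (n <= 1) ? 1 : n * factorial(n - 1);
}

typedef int (*unary_fn)(int);
__device__ unary_fn d_fn = factorial;             // device-side function pointer

__global__ void apply(unary_fn fn, int x, int *out) {
    *out = fn(x);                                 // indirect call through the pointer
}

int main(void) {
    int *d_out, h_out;
    unary_fn h_fn;
    cudaMalloc((void**)&d_out, sizeof(int));
    // Fetch the device function pointer so it can be passed as a kernel argument.
    cudaMemcpyFromSymbol(&h_fn, d_fn, sizeof(unary_fn));
    apply<<<1, 1>>>(h_fn, 5, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("5! = %d\n", h_out);                   // expect 120
    cudaFree(d_out);
    return 0;
}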

Arithmetic
– Improved support for IEEE floating point standards
– Double precision performance at half speed
– Native 32-bit integer arithmetic
– Does any of this help for graphics code?

Cache Hierarchy
– 64KB L1 cache per SM
  – Split into 16KB and 48KB pieces
  – Developer chooses whether shared memory or cache gets the larger space
– 768KB L2 cache per GPU
  – Makes atomics really fast. Why?
  – 128B cache line
  – Loosens memory coalescing constraints
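The 16KB/48KB split is selectable per kernel through the runtime API; a brief sketch, assuming two hypothetical kernels (stencil_kernel, reduce_kernel) that are not from the slides.

#include <cuda_runtime.h>

__global__ void stencil_kernel(float *out, const float *in) { /* body omitted */ }
__global__ void reduce_kernel(float *out, const float *in)  { /* body omitted */ }

void configure_caches(void) {
    // Shared-memory-heavy kernel: request 48KB shared memory / 16KB L1.
    cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferShared);
    // Cache-friendly kernel that uses little shared memory: request 48KB L1.
    cudaFuncSetCacheConfig(reduce_kernel, cudaFuncCachePreferL1);
}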

The Fermi SM
– Dual warp schedulers – why?
– Two banks of 16 CUDA cores, 16 LD/ST units, 4 SFU units
– A warp can now complete as quickly as 2 cycles
Image from NVIDIA

The Stats
Image from Stanford CS193g

The Review in March 2010
– Compute performance unbelievable; gaming performance, on the other hand…
  – “The GTX 480… it’s hotter, it’s noisier, and it’s more power hungry, all for 10-15% more performance.” – AnandTech (article titled “6 months late, was it worth the wait?”)
– Massive 550 mm² die
  – Only 14/16 SMs could be enabled (480/512 cores)

“Fermi Refined” – GTX 580
– All 32-core SMs enabled
– Clocks ~10% higher
– Transistor mix enables lower power consumption

“Little Fermi” – GTX 460
– Smaller memory bus (256-bit vs. 384-bit)
– Much lower transistor count (1.95B)
– Superscalar execution: one scheduler dual-issues
  – Reduce overhead per core?
Image from AnandTech

A 2010 Comparison
NVIDIA GeForce GTX 480
– 480 cores
– 177.4 GB/s memory bandwidth
– 1.34 TFLOPS single precision
– 3 billion transistors
ATI Radeon HD 5870
– 1600 cores
– 153.6 GB/s memory bandwidth
– 2.72 TFLOPS single precision
– 2.15 billion transistors
Over double the FLOPS for fewer transistors! What is going on here?

VLIW Architecture
Very Long Instruction Word
– Each instruction clause contains up to 5 instructions for the ALUs to execute in parallel
– Pro: saves on scheduling and interconnect (clause “packing” done by the compiler)
– Con: utilization
Image from AnandTech

Execution Comparison
Image from AnandTech

Assembly Example
AMD VLIW IL vs. NVIDIA PTX

The Rest of Cypress
– 16 streaming processors (SPs) packed into a SIMD core / compute unit (CU)
  – Executes a “wavefront” (a 64-thread warp) over 4 cycles
– 20 CUs * 16 SPs * 5 ALUs = 1600 ALUs
Image from AnandTech

Performance Notes
– VLIW architecture relies on instruction-level parallelism
– Excels at heavy floating-point workloads with low register dependencies
  – Shaders?
– Memory hierarchy not as aggressive as Fermi
  – Read-only texture caches
  – Can’t feed all of the ALUs in an SP in parallel
    – Fewer registers per ALU
    – Lower LDS capacity and bandwidth per ALU

Optimizing for AMD Architectures
– Many of the ideas are the same – only the constants & names change (see the sketch after this list)
  – Staggered offsets (partition camping)
  – Local Data Share (shared memory) bank conflicts
  – Memory coalescing
  – Mapped and pinned memory
  – NDRange (grid) and work-group (block) sizing
  – Loop unrolling
– Big change: be aware of VLIW utilization
– Consult the OpenCL Programming Guide
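A compact CUDA sketch of two of the items above, memory coalescing and shared-memory (LDS) bank-conflict padding, written with CUDA names since the slides say the ideas carry over; the transpose kernel and the TILE constant are illustrative assumptions, not taken from the guide.

#define TILE 32   // launch with blockDim = (TILE, TILE)

__global__ void transpose_tiled(float *out, const float *in, int n) {
    // +1 column of padding so column-wise accesses do not all map to the
    // same shared-memory (LDS) bank.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Coalesced read: consecutive threads touch consecutive addresses.
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;

    // Coalesced write of the transposed tile.
    if (tx < n && ty < n)
        out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
}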

AMD’s Cayman Architecture – A Shift
– AMD found average VLIW utilization in games was 3.4/5
– Shift to a VLIW4 architecture
– Increased SIMD core count at the expense of VLIW width
– Found in the Radeon HD 6970 and 6950

Paradigm Shift – Graphics Core Next
– Switch to a SIMD-based instruction set architecture (no VLIW)
  – 16-wide SIMD units executing a wavefront
  – 4 SIMD units + 1 scalar unit per compute unit
  – Hardware scheduling
– Memory hierarchy improvements
  – Read/write L1 & L2 caches, larger LDS
– Programming goodies
  – Unified address space, exceptions, functions, recursion, fast context switching
Sound familiar?

Radeon HD 7970: 32 CUs * 4 SIMDs/CU * 16 ALUs/SIMD = 2048 ALUs
Image from AMD

Image from AMD

NVIDIA’s Kepler
NVIDIA GeForce GTX 680
– 1536 SPs
– 28nm process
– 192.3 GB/s memory bandwidth
– 195W TDP
– 1/24 double-precision rate
– 3.5 billion transistors
NVIDIA GeForce GTX 580
– 512 SPs
– 40nm process
– 192.4 GB/s memory bandwidth
– 244W TDP
– 1/8 double-precision rate
– 3 billion transistors
A focus on efficiency:
– Transistor scaling alone is not enough to account for the massive core-count increase or the power consumption
– Kepler die size is 56% of GF110’s

Kepler’s SMX
– Removal of the shader clock means a warp executes in 1 GPU clock cycle
  – Need 32 SPs per clock
  – Ideally 8 instructions issued every cycle (2 for each of 4 warps)
– Kepler SMX has 6x the SP count
  – 192 SPs, 32 LD/ST units, 32 SFUs
  – New FP64 block
– Memory (compared to Fermi)
  – Register file size has only doubled
  – Shared memory / L1 size is the same
  – L2 decreases
Image from NVIDIA

Performance
– The GTX 680 may be a compute regression, but it is a gaming leap
– “Big Kepler” expected to remedy the gap
  – Double performance necessary
Images from AnandTech

Future: Integration
– Hardware benefits from merging CPU & GPU
  – Mobile (smartphone / laptop): lower energy and area consumption
  – Desktop / HPC: higher density, interconnect bandwidth
– Software benefits
  – Mapped pointers, unified addressing, consistency rules, and programming languages all point toward the GPU as a vector co-processor (a small sketch follows below)
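As a concrete, hedged example of the “mapped pointers” item above, CUDA already exposes zero-copy host memory that device code can dereference directly; the kernel name scale and the size N below are illustrative assumptions, not from the slides.

#include <cuda_runtime.h>

__global__ void scale(float *data, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;       // the GPU touches host memory directly
}

int main(void) {
    const int N = 1 << 20;
    float *h_ptr, *d_ptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);                    // allow mapping
    cudaHostAlloc((void**)&h_ptr, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_ptr, h_ptr, 0);       // GPU-visible alias

    for (int i = 0; i < N; ++i) h_ptr[i] = 1.0f;
    scale<<<(N + 255) / 256, 256>>>(d_ptr, 2.0f, N);          // no explicit copies
    cudaDeviceSynchronize();

    cudaFreeHost(h_ptr);
    return 0;
}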

The Contenders
– AMD – Heterogeneous System Architecture
  – Virtual ISA, makes use of CPU or GPU transparent
  – Enabled by Fusion APUs, blending x86 and AMD GPUs
– NVIDIA – Project Denver
  – Desktop-class ARM processor effort
  – Target server/HPC market?
– Intel – Larrabee, now Intel MIC

References
– NVIDIA Fermi Compute Architecture Whitepaper
– NVIDIA GeForce GTX 680 Whitepaper
– Schroeder, Tim C. “Peer-to-Peer & Unified Virtual Addressing.” CUDA webinar slides
– AnandTech review of the GTX 460
– AnandTech review of the Radeon HD
– AnandTech review of the GTX 680
– AMD OpenCL Programming Guide (v1.3f)
– NVIDIA CUDA Programming Guide (v4.2)

Bibliography
– Beyond3D’s Fermi GPU and Architecture Analysis
– RWT’s article on Fermi
– AMD Financial Analyst Day Presentations