NVIDIA Fermi Architecture Joseph Kider University of Pennsylvania CIS 565 - Fall 2011

Administrivia Project checkpoint on Monday

Sources: Patrick Cozzi, Spring 2011; NVIDIA CUDA Programming Guide; CUDA by Example; Programming Massively Parallel Processors

G80, GT200, and Fermi November 2006: G80. June 2008: GT200. March 2010: Fermi (GF100). Fermi design started around 2005. Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

New GPU Generation What are the technical goals for a new GPU generation? Improve existing application performance. How? Advance programmability. In what ways? Given your experience with programming the G80, what features do you want?

Fermi: What’s More? More total cores (SPs) – not more SMs, though. More registers: 32K per SM. More shared memory: up to 48K per SM. More Special Function Units (SFUs).

Fermi: What’s Faster? Faster double precision – 8x over GT200: double precision runs at half the single-precision rate, whereas on GT200 it was 1/8. Faster atomic operations (5-20x). What for? Atomics are faster due to the L1/L2 caches and more atomic hardware units; faster atomics enable more CPU-like applications such as ray tracing and random scatter. Faster context switches: between applications – 10x (application context switches take under 25 microseconds); and between graphics and compute, e.g., OpenGL and CUDA.
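
Atomic operations are exposed in CUDA through intrinsics such as atomicAdd. A minimal sketch of the random-scatter pattern that fast atomics enable – a global-memory histogram (the kernel and buffer names are illustrative, not from the slides):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each thread reads one byte and atomically bumps the matching bin.
// On Fermi, global atomics are serviced near the L2 cache, so heavy
// contention is far cheaper than on GT200.
__global__ void histogram(const unsigned char* data, int n, unsigned int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}

int main()
{
    const int n = 1 << 20;
    unsigned char* data;
    unsigned int* bins;
    cudaMalloc(&data, n);
    cudaMalloc(&bins, 256 * sizeof(unsigned int));
    cudaMemset(data, 7, n);                          // dummy input: all bytes = 7
    cudaMemset(bins, 0, 256 * sizeof(unsigned int));

    histogram<<<(n + 255) / 256, 256>>>(data, n, bins);

    unsigned int h[256];
    cudaMemcpy(h, bins, sizeof(h), cudaMemcpyDeviceToHost);
    std::printf("bin 7 = %u\n", h[7]);               // expect 1048576
    cudaFree(data);
    cudaFree(bins);
    return 0;
}
```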

Fermi: What’s New? L1 and L2 caches – for compute or graphics? Previous GPUs had L1/L2 for textures only; the motivation for a general cache is more compute than graphics, since graphics is streaming – all data is brought on-chip each frame. Dual warp scheduling. Concurrent kernel execution. C++ support: virtual functions, new/delete, try/catch. Full IEEE 754-2008 support in hardware, including denormals, so results are not flushed to zero; this is a pain to do in hardware – CPUs would trap and handle it in software, costing thousands of cycles. Previous GPUs used IEEE 754-1985. Unified address space. Error Correcting Code (ECC) memory support. Fixed-function tessellation for graphics.
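
Fermi (compute capability 2.0) is what makes device-side C++ possible: virtual functions and new/delete inside kernels. A minimal sketch, assuming compilation with nvcc -arch=sm_20 or later (the Shape/Square classes and the demo kernel are illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

struct Shape {
    __device__ virtual float area() const = 0;
    __device__ virtual ~Shape() {}
};

struct Square : Shape {
    float side;
    __device__ Square(float s) : side(s) {}
    __device__ float area() const { return side * side; }
};

__global__ void demo(float* out)
{
    Shape* s = new Square(3.0f);  // device-side new: allocates from the GPU heap
    *out = s->area();             // virtual dispatch in device code
    delete s;                     // device-side delete
}

int main()
{
    float* d;
    float h;
    cudaMalloc(&d, sizeof(float));
    demo<<<1, 1>>>(d);
    cudaMemcpy(&h, d, sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("area = %f\n", h);  // expect 9.0
    cudaFree(d);
    return 0;
}
```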

G80, GT200, and Fermi The number of cores has increased at roughly the same rate as the number of transistors, following Moore’s Law. Granted, double-precision support was added, as were L1 and L2 caches and more complex dispatch (dual warp scheduling). Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

GT200 and Fermi Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fermi Block Diagram GF100: 16 SMs, each with 32 cores – 512 total cores. Each SM hosts up to 48 warps (1,536 threads), so up to 24,576 threads are in flight chip-wide. Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
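
As a worked check of the slide’s numbers, plus one common (not slide-prescribed) way to size a grid against them:

```cuda
#include <cstdio>

int main()
{
    const int warpsPerSM   = 48;
    const int threadsPerSM = warpsPerSM * 32;        // 1,536 resident threads/SM
    const int numSMs       = 16;                     // GF100
    const int inFlight     = numSMs * threadsPerSM;  // 24,576 threads chip-wide

    const int blockSize = 256;                       // 8 warps per block
    const int gridSize  = inFlight / blockSize;      // 96 blocks fill the chip

    std::printf("in flight: %d, grid: %d blocks of %d threads\n",
                inFlight, gridSize, blockSize);
    return 0;
}
```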

Fermi SM Why 32 cores per SM instead of 8? Why not more SMs? G80 – 8 cores GT200 – 8 cores GF100 – 32 cores Enables dual-warp scheduling. 2 cycles per warp now, instead of 4.

Fermi SM Dual warp scheduling. 32K registers (G80 had 8K, GT200 had 16K). 32 cores, each with a floating-point and an integer unit. 16 load/store units. 4 SFUs, used for log, sin, cos, etc. Why? Dual warp scheduling increases utilization of the execution units by allowing a better mix of instructions. Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
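
The SFUs are what execute CUDA’s fast-math intrinsics; a small sketch contrasting them with the precise library calls (the kernel name is illustrative):

```cuda
__global__ void transcendentals(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __sinf/__logf are hardware-approximated on the SFUs; sinf/logf
        // take the slower, more accurate path. With only 4 SFUs per SM,
        // SFU-heavy code can bottleneck before the 32 cores do.
        out[i] = __sinf(in[i]) + __logf(in[i] + 1.0f);
    }
}
```

Compiling with nvcc --use_fast_math maps the standard sinf/logf calls onto these intrinsics automatically.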

Fermi SM 16 SMs * 32 cores/SM = 512 floating-point operations per cycle. Why not in practice? In practice there are memory accesses and branches. Also, as the block diagram shows, there are only 16 load/store units and 4 SFUs, so many calls to log(), for example, will reduce parallelism. Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Fermi SM Each SM has 64KB of on-chip memory, configurable by the CUDA developer as 48KB shared memory / 16KB L1 cache, or 16KB shared memory / 48KB L1 cache. Cache is good for unpredictable or irregular memory access; shared memory is good for predictable, regular memory access. Cache makes it easier to port CPU applications, while more shared memory helps port applications from previous GPU generations. Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
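
The split is chosen per kernel through the runtime API; a minimal sketch using cudaFuncSetCacheConfig (the stencil kernel is a placeholder):

```cuda
#include <cuda_runtime.h>

__global__ void stencil(const float* in, float* out) { /* ... */ }

int main()
{
    // Prefer 48KB L1 / 16KB shared: suits irregular, pointer-chasing
    // kernels that make little use of shared memory.
    cudaFuncSetCacheConfig(stencil, cudaFuncCachePreferL1);

    // For shared-memory-heavy kernels, flip the split instead:
    //   cudaFuncSetCacheConfig(stencil, cudaFuncCachePreferShared);

    stencil<<<64, 256>>>(nullptr, nullptr);  // illustrative launch
    return (int)cudaDeviceSynchronize();
}
```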

Fermi Dual Warp Scheduling Double-precision instructions do not support dual dispatch with any other operation. Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Kernels from the same application execute in parallel. Slide from: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_luebke_Intro.pdf
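
Concurrency between kernels is requested from the host with CUDA streams; a hedged sketch (kernelA and kernelB are placeholders):

```cuda
#include <cuda_runtime.h>

__global__ void kernelA() { /* ... */ }
__global__ void kernelB() { /* ... */ }

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // On Fermi, kernels launched into different streams may run
    // concurrently if the first one leaves SMs idle; kernels in the
    // same stream still execute in order.
    kernelA<<<4, 256, 0, s1>>>();
    kernelB<<<4, 256, 0, s2>>>();

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```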

Fermi Caches Registers spill to L1 cache instead of directly to DRAM Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fermi: Unified Address Space 64-bit virtual addresses; 40-bit physical addresses (currently). One pointer can point to any memory type, which is useful for developers implementing libraries. CUDA 4: shared address space with the CPU. Why? No explicit CPU/GPU copies, direct GPU-GPU copies, and direct I/O-device-to-GPU copies. Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
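
With unified virtual addressing, the runtime can infer the direction of a copy from the pointers themselves; a minimal sketch using cudaMemcpyDefault (buffer names and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;

    float* host = 0;
    float* dev  = 0;
    cudaHostAlloc((void**)&host, bytes, cudaHostAllocDefault);  // pinned host memory
    cudaMalloc((void**)&dev, bytes);

    // With UVA (64-bit, Fermi+, CUDA 4+), cudaMemcpyDefault lets the
    // runtime work out host-to-device vs device-to-host from the pointers.
    cudaMemcpy(dev, host, bytes, cudaMemcpyDefault);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDefault);

    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```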

Fermi ECC ECC protects the register file, L1, L2, and DRAM. It uses redundancy to ensure data integrity against naturally occurring radiation (e.g., cosmic rays) flipping bits: for example, 64 bits are stored as 72 bits. It fixes single-bit errors and detects multi-bit errors – single-error correct, double-error detect (SECDED). What are the applications? Graphics doesn’t need ECC, but compute does – for example, medical imaging and financial options pricing.
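
Whether ECC is currently enabled can be queried from the CUDA runtime; a minimal sketch using cudaGetDeviceProperties (assumes a CUDA-capable device 0 is present):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // ECCEnabled is 1 when ECC is on; note that enabling ECC reserves
    // part of DRAM for the check bits, reducing usable memory.
    std::printf("%s: ECC %s\n", prop.name,
                prop.ECCEnabled ? "enabled" : "disabled");
    return 0;
}
```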

Fermi Tessellation Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fermi Tessellation 1.6 billion triangles per second for the water demo; the hair demo includes physics simulation and rendering. Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fermi Tessellation Fixed-function hardware on each SM for graphics: texture filtering, texture cache, tessellation, vertex fetch / attribute setup, stream output, viewport transform. Why? Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Observations It is becoming easier to port CPU code to the GPU: recursion, fast atomics, L1/L2 caches, faster global memory. In fact, GPUs are starting to look like CPUs: beefier SMs, L1 and L2 caches, dual warp scheduling, double precision, fast atomics.