A Complete GPU Compute Architecture by NVIDIA
Tamal Saha, Abhishek Rawat, Minh Le {ts4rq, ar8eb,

- Unprecedented FP performance
- Ideal for data-parallel applications
- Programmability

1999: GPU → Nov 2006: G80 → Jun 2008: GT200 → Sep 2009: Fermi

- Third-generation ("3G") Streaming Multiprocessor
- 3 billion transistors
- 512 CUDA cores
- 384-bit (6 × 64-bit) memory interface

- Improved double-precision performance: 256 FMA ops/clock
- ECC support, for the first time in a GPU
- True cache hierarchy: L1 cache, shared memory, and global memory
- More shared memory: 3x more than GT200, and configurable
- Faster context switching: under 25 µs
- Faster atomic operations: up to 20x faster than GT200

- 512 CUDA cores
- 32 cores per SM
- 16 SMs
- 4x more cores per SM than GT200
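Not part of the original slides: a minimal sketch showing how to confirm the SM count and compute capability at runtime through the CUDA runtime API (device 0 is assumed).

#include <cstdio>
#include <cuda_runtime.h>

// Query device 0 and print its SM count; a Fermi GF100 part reports
// 16 multiprocessors (32 CUDA cores each) and compute capability 2.0.
int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %d SMs, compute capability %d.%d\n",
           prop.name, prop.multiProcessorCount, prop.major, prop.minor);
    return 0;
}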

- Each SM has:
  - 2 warp schedulers
  - 2 instruction dispatch units
- Dual issue in each SM
- Most instructions can be dual-issued
- Exception: double precision

[Figure: the two warp schedulers issuing over time, each picking one ready warp per cycle, e.g. warp 8 / warp 9, warp 2 / warp 3, warp 14 / warp 15]

- 16 load/store units per SM
- Source and destination addresses are calculated by the load/store units

- Full IEEE support
- 16 double-precision FMA ops per SM per clock
- 8x the peak double-precision floating-point performance of GT200
- 4 Special Function Units (SFUs) per SM for transcendental instructions such as sine, cosine, reciprocal, and square root
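Not from the slides: a minimal sketch of how the SFUs are reached from CUDA C. The fast-math intrinsics (__sinf, __expf, etc.) map to SFU hardware, while the un-prefixed sinf/expf are slower but more accurate software sequences; the kernel and data layout here are illustrative.

// __sinf and __expf are CUDA fast-math intrinsics that execute on the
// SFUs rather than on the regular CUDA cores; precision is reduced
// relative to sinf/expf.
__global__ void sfu_demo(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(in[i]) * __expf(-in[i]);  // both run on the SFU
}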

- Fermi is the first architecture to support PTX 2.0.
- PTX 2.0 greatly improves GPU programmability, accuracy, and performance.
- Primary goals of PTX 2.0:
  - a stable ISA that can span multiple GPU generations
  - a machine-independent ISA for C, C++, Fortran, and other compiler targets
  - a common ISA for optimizing code generators and translators that map PTX to specific target machines
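As an aside (not in the original slides), the PTX that nvcc generates can be inspected directly; a minimal sketch with an illustrative kernel, for a Fermi-era toolkit (sm_20 support has been removed from recent CUDA toolkits).

// saxpy.cu -- compile with:  nvcc -arch=sm_20 -ptx saxpy.cu -o saxpy.ptx
// -arch=sm_20 targets Fermi; -ptx stops compilation after PTX generation,
// so saxpy.ptx contains the PTX 2.x text for this kernel.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];  // contracted to a single FMA on Fermi
}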

- Full IEEE 754-2008 32-bit and 64-bit precision.
- Fused multiply-add (FMA) for all FP precisions (prior generations used MAD for single-precision FP).
- Support for subnormal numbers and all four rounding modes (nearest, zero, positive infinity, negative infinity).
- Unified address space:
  - a single 1 TB address space spanning local (thread-private), shared (block-shared), and global memory
  - unified pointers can be used to pass objects in any memory space
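A sketch (ours, not the slides') of what the unified address space buys the programmer: one device function can take a generic pointer and be handed either global or shared memory, where earlier architectures needed separate code paths per memory space. Names are illustrative.

// With a unified address space, a single generic pointer can reference
// global, shared, or local memory, so one helper serves all of them.
__device__ float sum3(const float *p)   // p may point into any space
{
    return p[0] + p[1] + p[2];
}

__global__ void unified_demo(const float *gdata, float *out)
{
    __shared__ float sdata[3];
    if (threadIdx.x < 3)
        sdata[threadIdx.x] = gdata[threadIdx.x];
    __syncthreads();

    if (threadIdx.x == 0)
        out[0] = sum3(gdata) + sum3(sdata);  // same function, two spaces
}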

- Full support for object-oriented C++ code, not just procedural C code.
- Full 32-bit integer path with 64-bit extensions.
- Load/store ISA supports 64-bit addressing for future growth.
- Improved conditional performance through predication.
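To illustrate predication (an example of ours, not the slides'): for a short conditional like the clamp below, the compiler can attach a predicate to each path's instructions instead of emitting a real branch, so the warp never diverges.

// A short conditional body is a typical candidate for predication:
// both paths' instructions are issued with a predicate bit and only
// the selected result is committed, avoiding branch overhead.
__global__ void clamp_to_zero(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        if (v < 0.0f)   // likely predicated rather than branched
            v = 0.0f;
        data[i] = v;
    }
}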

- Optimized for OpenCL and DirectCompute.
- Shares key abstractions: threads, blocks, grids, barrier synchronization, per-block shared memory, global memory, and atomic operations.
- New append, bit-reverse, and surface instructions.
- Improved efficiency of atomic integer instructions (5x-20x faster than prior generations).
- Atomic instructions are handled by special integer units attached to the L2 cache controller.
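A sketch of the kind of atomic-heavy pattern those units accelerate (our illustration, assuming a 256-bin histogram over byte-valued data; bins must be zeroed before launch).

// Histogram with global-memory atomics: many threads contend on a few
// counters, so throughput tracks atomic performance directly -- exactly
// the case Fermi's L2-attached integer atomic units speed up.
__global__ void histogram256(const unsigned char *data, int n,
                             unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}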

- 64 KB of on-chip configurable memory per SM:
  - 16 KB L1 cache + 48 KB shared memory, or
  - 48 KB L1 cache + 16 KB shared memory
- 3x more shared memory than GT200.
- Unified L2 cache.

- 64 KB of on-chip memory per SM:
  - 16 KB L1 cache + 48 KB shared memory, OR
  - 48 KB L1 cache + 16 KB shared memory
- 3x more shared memory than GT200.
- Significant performance gain for existing apps that use shared memory only.
- Apps using a software-managed cache can be streamlined to use the hardware-managed cache.
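The split is chosen per kernel through the CUDA runtime API; a minimal sketch, with an illustrative kernel name.

#include <cuda_runtime.h>

__global__ void myKernel(float *data)   // illustrative kernel
{
    data[threadIdx.x] *= 2.0f;
}

void configure_cache(void)
{
    // Request the 48 KB L1 / 16 KB shared split for this kernel
    // (good for cache-heavy code); cudaFuncCachePreferShared requests
    // the 16 KB L1 / 48 KB shared split instead.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
}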

Single-error correction, double-error detection (SECDED) ECC protects:
- DRAM
- the chip's register files
- shared memories
- L1 and L2 caches

- Two-level thread scheduler:
  - Chip level: thread blocks → SMs
  - SM level: warps (32 threads) → execution units
- 24,576 simultaneously active threads
- 10x faster application context switching (< 25 µs)
- Concurrent kernel execution

[Figure: serial kernel execution vs. concurrent kernel execution over time]
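Concurrent kernel execution is exposed through CUDA streams; a minimal sketch (our example, with two illustrative independent kernels).

#include <cuda_runtime.h>

__global__ void kernelA(float *a) { a[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *b) { b[threadIdx.x] *= 2.0f; }

void launch_concurrently(float *d_a, float *d_b)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Launches in distinct streams have no ordering between them;
    // Fermi can run them concurrently when resources allow, where
    // earlier GPUs serialized all kernel launches.
    kernelA<<<1, 256, 0, s0>>>(d_a);
    kernelB<<<1, 256, 0, s1>>>(d_b);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}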

- The relatively small size of GPU memory
- Inability to do I/O directly to GPU memory
- Managing application-level parallelism

Questions?

References:
- NVIDIA Fermi architecture overview: …ture.html
- Fermi Compute Architecture White Paper: …_Compute_Architecture_Whitepaper.pdf
- "The Top 10 Innovations in the New NVIDIA Fermi Architecture, and the Top 3 Next Challenges" by Dave Patterson, co-author of Computer Architecture: A Quantitative Approach

Thank You