A Complete GPU Compute Architecture by NVIDIA
Tamal Saha, Abhishek Rawat, Minh Le {ts4rq, ar8eb,

- Unprecedented FP performance
- Ideal for data-parallel applications
- Programmability

1999: GPU → Nov 2006: G80 → Jun 2008: GT200 → Sep 2009: Fermi

- Third-generation ("3G") Streaming Multiprocessor
- 3 billion transistors
- 512 CUDA cores
- 384-bit (6 × 64-bit) memory interface

- Improved double-precision performance: 256 FMA ops/clock
- ECC support, for the first time in a GPU
- True cache hierarchy: L1 cache, shared memory, and global memory
- More shared memory: 3x more than GT200, and configurable
- Faster context switching: under 25 µs
- Faster atomic operations: up to 20x faster than GT200

- 512 CUDA cores
- 32 cores per SM
- 16 SMs
- 4x more cores per SM than GT200
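Not part of the original slides: a minimal sketch showing how to confirm the SM count and compute capability at runtime through the CUDA runtime API (device 0 is assumed).

#include <cstdio>
#include <cuda_runtime.h>

// Query device 0 and print its SM count; a Fermi GF100 part reports
// 16 multiprocessors (32 CUDA cores each) and compute capability 2.0.
int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %d SMs, compute capability %d.%d\n",
           prop.name, prop.multiProcessorCount, prop.major, prop.minor);
    return 0;
}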

- Each SM has:
  - 2 warp schedulers
  - 2 instruction dispatch units
- Dual issue in each SM
- Most instructions can be dual-issued
- Exception: double precision

[Figure: the two warp schedulers issuing over time, each picking one ready warp per cycle, e.g. warp 8 / warp 9, warp 2 / warp 3, warp 14 / warp 15]

- 16 load/store units per SM
- Source and destination addresses are calculated by the load/store units

- Full IEEE support
- 16 double-precision FMA ops per SM per clock
- 8x the peak double-precision floating-point performance of GT200
- 4 Special Function Units (SFUs) per SM for transcendental instructions such as sine, cosine, reciprocal, and square root
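Not from the slides: a minimal sketch of how the SFUs are reached from CUDA C. The fast-math intrinsics (__sinf, __expf, etc.) map to SFU hardware, while the un-prefixed sinf/expf are slower but more accurate software sequences; the kernel and data layout here are illustrative.

// __sinf and __expf are CUDA fast-math intrinsics that execute on the
// SFUs rather than on the regular CUDA cores; precision is reduced
// relative to sinf/expf.
__global__ void sfu_demo(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(in[i]) * __expf(-in[i]);  // both run on the SFU
}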

- Fermi is the first architecture to support PTX 2.0.
- PTX 2.0 greatly improves GPU programmability, accuracy, and performance.
- Primary goals of PTX 2.0:
  - a stable ISA that can span multiple GPU generations
  - a machine-independent ISA for C, C++, Fortran, and other compiler targets
  - a common ISA for optimizing code generators and translators that map PTX to specific target machines
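As an aside (not in the original slides), the PTX that nvcc generates can be inspected directly; a minimal sketch with an illustrative kernel, for a Fermi-era toolkit (sm_20 support has been removed from recent CUDA toolkits).

// saxpy.cu -- compile with:  nvcc -arch=sm_20 -ptx saxpy.cu -o saxpy.ptx
// -arch=sm_20 targets Fermi; -ptx stops compilation after PTX generation,
// so saxpy.ptx contains the PTX 2.x text for this kernel.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];  // contracted to a single FMA on Fermi
}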

- Full IEEE 754-2008 32-bit and 64-bit precision.
- Fused multiply-add (FMA) for all FP precisions (prior generations used MAD for single-precision FP).
- Support for subnormal numbers and all four rounding modes (nearest, zero, positive infinity, negative infinity).
- Unified address space:
  - a single 1 TB address space spanning local (thread-private), shared (block-shared), and global memory
  - unified pointers can be used to pass objects in any memory space
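A sketch (ours, not the slides') of what the unified address space buys the programmer: one device function can take a generic pointer and be handed either global or shared memory, where earlier architectures needed separate code paths per memory space. Names are illustrative.

// With a unified address space, a single generic pointer can reference
// global, shared, or local memory, so one helper serves all of them.
__device__ float sum3(const float *p)   // p may point into any space
{
    return p[0] + p[1] + p[2];
}

__global__ void unified_demo(const float *gdata, float *out)
{
    __shared__ float sdata[3];
    if (threadIdx.x < 3)
        sdata[threadIdx.x] = gdata[threadIdx.x];
    __syncthreads();

    if (threadIdx.x == 0)
        out[0] = sum3(gdata) + sum3(sdata);  // same function, two spaces
}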

- Full support for object-oriented C++ code, not just procedural C code.
- Full 32-bit integer path with 64-bit extensions.
- Load/store ISA supports 64-bit addressing for future growth.
- Improved conditional performance through predication.
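To illustrate predication (an example of ours, not the slides'): for a short conditional like the clamp below, the compiler can attach a predicate to each path's instructions instead of emitting a real branch, so the warp never diverges.

// A short conditional body is a typical candidate for predication:
// both paths' instructions are issued with a predicate bit and only
// the selected result is committed, avoiding branch overhead.
__global__ void clamp_to_zero(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        if (v < 0.0f)   // likely predicated rather than branched
            v = 0.0f;
        data[i] = v;
    }
}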

- Optimized for OpenCL and DirectCompute.
- Shares key abstractions: threads, blocks, grids, barrier synchronization, per-block shared memory, global memory, and atomic operations.
- New append, bit-reverse, and surface instructions.
- Improved efficiency of atomic integer instructions (5x-20x faster than prior generations).
- Atomic instructions are handled by special integer units attached to the L2 cache controller.
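A sketch of the kind of atomic-heavy pattern those units accelerate (our illustration, assuming a 256-bin histogram over byte-valued data; bins must be zeroed before launch).

// Histogram with global-memory atomics: many threads contend on a few
// counters, so throughput tracks atomic performance directly -- exactly
// the case Fermi's L2-attached integer atomic units speed up.
__global__ void histogram256(const unsigned char *data, int n,
                             unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}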

- 64 KB of on-chip configurable memory per SM:
  - 16 KB L1 cache + 48 KB shared memory, or
  - 48 KB L1 cache + 16 KB shared memory
- 3x more shared memory than GT200.
- Unified L2 cache.

- 64 KB of on-chip memory per SM:
  - 16 KB L1 cache + 48 KB shared memory, OR
  - 48 KB L1 cache + 16 KB shared memory
- 3x more shared memory than GT200.
- Significant performance gain for existing apps that use shared memory only.
- Apps using a software-managed cache can be streamlined to use the hardware-managed cache.
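The split is chosen per kernel through the CUDA runtime API; a minimal sketch, with an illustrative kernel name.

#include <cuda_runtime.h>

__global__ void myKernel(float *data)   // illustrative kernel
{
    data[threadIdx.x] *= 2.0f;
}

void configure_cache(void)
{
    // Request the 48 KB L1 / 16 KB shared split for this kernel
    // (good for cache-heavy code); cudaFuncCachePreferShared requests
    // the 16 KB L1 / 48 KB shared split instead.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
}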

Single-error correction, double-error detection (SECDED) ECC protects:
- DRAM
- the chip's register files
- shared memories
- L1 and L2 caches

- Two-level thread scheduler:
  - Chip level: thread blocks → SMs
  - SM level: warps (32 threads) → execution units
- 24,576 simultaneously active threads
- 10x faster application context switching (< 25 µs)
- Concurrent kernel execution

[Figure: serial kernel execution vs. concurrent kernel execution over time]
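Concurrent kernel execution is exposed through CUDA streams; a minimal sketch (our example, with two illustrative independent kernels).

#include <cuda_runtime.h>

__global__ void kernelA(float *a) { a[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *b) { b[threadIdx.x] *= 2.0f; }

void launch_concurrently(float *d_a, float *d_b)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Launches in distinct streams have no ordering between them;
    // Fermi can run them concurrently when resources allow, where
    // earlier GPUs serialized all kernel launches.
    kernelA<<<1, 256, 0, s0>>>(d_a);
    kernelB<<<1, 256, 0, s1>>>(d_b);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}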

- The relatively small size of GPU memory
- Inability to do I/O directly to GPU memory
- Managing application-level parallelism

Questions?

References:
- NVIDIA Fermi architecture overview: …ture.html
- Fermi Compute Architecture White Paper: …_Compute_Architecture_Whitepaper.pdf
- "The Top 10 Innovations in the New NVIDIA Fermi Architecture, and the Top 3 Next Challenges" by Dave Patterson, co-author of Computer Architecture: A Quantitative Approach

Thank You