CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming

CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming
Fall 2009
Jih-Kwon Peir
Computer & Information Science & Engineering, University of Florida

Chapter 12: Fermi – New NVIDIA GPU Architecture
http://www.pcper.com/article.php?aid=789
http://www.nvidia.com/object/fermi_architecture.html
http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA's_Fermi-The_First_Complete_GPU_Architecture.pdf
http://techreport.com/articles.x/17670/3

Fermi Implements CUDA
Definitions of memory scope, grid, thread block, and thread are the same as in Tesla
Grid: array of thread blocks
Thread block: a group of concurrent threads that communicate through shared memory (Fermi supports up to 1,536 resident threads per SM)
The GPU has an array of SMs; each executes one or more thread blocks, and each block is partitioned into warps of 32 threads
Other resource constraints are implementation dependent
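To make the hierarchy concrete, here is a minimal CUDA sketch (the kernel name vecAdd and the 256-thread block size are illustrative, not from the slides): a grid of blocks covers the data, each block holds the threads that can share memory, and the hardware groups those threads into 32-thread warps.

```cuda
#include <cuda_runtime.h>

// Each thread handles one element; a 256-thread block forms 8 warps.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

// Host-side launch: the grid is sized so the blocks cover all n elements.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```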

Fermi – GT300
First implementation of Fermi – GT300
3.0 billion transistors
512 CUDA cores: 16 SMs with 32 cores per SM (a 4x increase per SM)
L2 cache (new)
GDDR5 memory, supporting up to 6GB over six 64-bit channels
GigaThread scheduler
Host interface
For comparison, G80: 16x8 = 128 cores; GT200: 30x8 = 240 cores, 1.4 billion transistors

Fermi – GT300 Key Features
32 cores per SM, 512 cores total
Fully pipelined integer and floating-point units that implement the new IEEE 754-2008 standard, including fused multiply-add (FMA)
An unfused multiply-add computes the product b×c, rounds it to N significant bits, adds the result to a, and rounds back to N bits
A fused multiply-add (FMA) computes the entire sum a+b×c to full precision before rounding the final result to N significant bits
Fermi supports FMA for both single and double precision
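As a hedged illustration of the rounding difference (the helper names below are made up for the example): the CUDA math function fmaf maps to the hardware FMA, while the __fmul_rn/__fadd_rn intrinsics force the separately rounded multiply and add.

```cuda
// Unfused: round after the multiply, then round again after the add (two roundings).
__device__ float mad_unfused(float a, float b, float c)
{
    return __fadd_rn(a, __fmul_rn(b, c));
}

// Fused (IEEE 754-2008 FMA): compute a + b*c exactly, round once at the end.
__device__ float mad_fused(float a, float b, float c)
{
    return fmaf(b, c, a);   // fmaf(x, y, z) = x*y + z
}
```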

Fermi – GT300 Key Features
Two warps from different thread blocks (even different kernels) can be issued and executed concurrently
Linear addressing model with caching at all levels
Large shared memory / L1 cache
New 768KB L2 cache, shared by all SMs
Double-precision performance is 8x faster than GT200, reaching ~600 double-precision GFLOPS
ECC protection from the registers to DRAM

Fermi – GT300 Key Features (cont.)
Fermi supports simultaneous execution of multiple kernels from the same application, each kernel distributed to one or more SMs (see the streams sketch below)
The GigaThread hardware thread scheduler manages 1,536 simultaneously active threads for each SM across 16 kernels
Switching from one application to another is 20x faster on Fermi
Fermi supports OpenCL, Fortran, C++, Java, Matlab, and Python
Each SM has 32 cores, 16 load/store units, and 4 SFUs
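Concurrent kernel execution is exposed to the programmer through CUDA streams. A minimal sketch, assuming two independent kernels from the same application (kernelA/kernelB and the launch sizes are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float *x, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] *= 2.0f; }
__global__ void kernelB(float *y, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) y[i] += 1.0f; }

void launchConcurrently(float *d_x, float *d_y, int n)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Independent kernels in different streams: the GigaThread scheduler may
    // distribute them to different SMs and run them at the same time.
    kernelA<<<(n + 255) / 256, 256, 0, s1>>>(d_x, n);
    kernelB<<<(n + 255) / 256, 256, 0, s2>>>(d_y, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```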

Fermi – GT300 Key Features (cont.)
Per-SM local memory configurable as 16KB shared + 48KB L1 cache, or 48KB shared + 16KB L1 cache
768KB L2 cache, fully coherent across the chip and connected to all of the SMs
Fast atomic operations on Fermi, which Nvidia estimates are five to 20 times better than GT200, in part thanks to the presence of the L2 cache
Fermi's virtual and physical address spaces are 40 bits, but the true physical limits are dictated by the number of memory devices that can be attached; the practical limit will be 6GB with 2Gb memories and 12GB with 4Gb devices

Fermi – GT300 Key Features (cont.)
Fermi's native instruction set has been extended with hardware support for both OpenCL and DirectCompute; these changes have prompted an update to PTX, and Nvidia continues to look after OpenCL and DirectCompute support
Among the changes in PTX 2.0 is a 40-bit, 1TB unified address space. This single address space encompasses the per-thread (local), per-block (shared), and global memory spaces of the CUDA programming model, with a single set of load and store instructions; these instructions support 64-bit addressing
These changes should allow C++ pointers; PTX 2.0 adds other odds and ends to make C++ support feasible
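One practical consequence, sketched below under the assumption of a compute-capability-2.0 target (function and kernel names are illustrative): with generic addressing, a single pointer-based device function can be handed either a global or a shared-memory address, and one set of load/store instructions serves both.

```cuda
// The same loads work whether p points into global or shared memory.
__device__ float sum4(const float *p)
{
    return p[0] + p[1] + p[2] + p[3];
}

__global__ void unifiedAddrDemo(const float *g_in, float *g_out)
{
    __shared__ float tile[4];
    if (threadIdx.x < 4)
        tile[threadIdx.x] = g_in[threadIdx.x];
    __syncthreads();

    if (threadIdx.x == 0) {
        g_out[0] = sum4(g_in);   // pointer into global memory
        g_out[1] = sum4(tile);   // pointer into shared memory, same code path
    }
}
```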

Instruction Schedule Example
A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM: two blocks of 16 cores each, one block of four Special Function Units (SFUs), and one block of load/store units
The figure on this slide shows how instructions are issued to the four execution blocks
It takes two cycles for the 32 instructions in a warp to execute on the cores or load/store units; a warp of 32 special-function instructions is issued in a single cycle but takes eight cycles to complete on the four SFUs
Another major improvement in Fermi and PTX 2.0 is the new unified addressing model: all addresses in the GPU are allocated from a continuous 40-bit (one terabyte) address space. Global, shared, and local addresses are defined as ranges within this address space and can be accessed by common load/store instructions (which support 64-bit addresses to allow for future growth)
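From CUDA code, the SFUs are typically reached through the fast-math intrinsics; a small hedged sketch (kernel name illustrative) contrasting them with the slower but more accurate library calls that run as instruction sequences on the regular cores:

```cuda
__global__ void sfuDemo(const float *in, float *fast, float *precise, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __sinf/__expf compile to special-function instructions handled by
        // the SFUs (a warp of them retires over several cycles on 4 SFUs).
        fast[i]    = __sinf(in[i]) * __expf(in[i]);
        // sinf/expf expand to longer instruction sequences on the CUDA cores.
        precise[i] = sinf(in[i]) * expf(in[i]);
    }
}
```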

G80 Example: Thread Scheduling (cont.)
The SM implements zero-overhead warp scheduling
At any time, only one warp is executed by an SM
Warps whose next instruction has its operands ready for consumption are eligible for execution
Eligible warps are selected for execution using a prioritized scheduling policy
All threads in a warp execute the same instruction when the warp is selected
What about Fermi? More parallel execution…

The Cache and Memory Hierarchy
Like earlier GPUs, the Fermi architecture provides local (shared) memory in each SM
New to Fermi is the ability to use some of this local memory as a first-level (L1) cache for global memory references
The local memory is 64KB in size and can be split 16KB/48KB or 48KB/16KB between L1 cache and shared memory
The decision to allocate 16KB or 48KB of the local memory as cache depends on two factors: how much shared memory is needed, and how predictable the kernel's accesses to global memory are
A larger shared-memory requirement argues for less cache; more frequent or unpredictable accesses to larger regions of DRAM argue for more cache
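In the CUDA runtime this split is requested per kernel with cudaFuncSetCacheConfig; a minimal sketch, with the two kernels below standing in for the cache-heavy and shared-memory-heavy cases:

```cuda
#include <cuda_runtime.h>

// Stand-in kernels: one with scattered global reads and little shared memory,
// one built around a large __shared__ working set (bodies omitted for brevity).
__global__ void irregularKernel(float *g, int n) { /* scattered global accesses */ }
__global__ void tiledKernel(float *g, int n)     { /* large shared-memory tile  */ }

void chooseCacheSplit()
{
    // Unpredictable global accesses, little shared memory: 48KB L1 / 16KB shared.
    cudaFuncSetCacheConfig(irregularKernel, cudaFuncCachePreferL1);

    // Large shared-memory requirement: 48KB shared / 16KB L1.
    cudaFuncSetCacheConfig(tiledKernel, cudaFuncCachePreferShared);
}
```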

The L2 Cache on Fermi
Each Fermi GPU is equipped with an L2 cache (768KB in size for a 512-core chip); the L2 cache covers GPU local DRAM as well as system memory
L2 has a set of memory read-modify-write operations that are atomic, usable for synchronization across thread blocks, or even kernels?? Good for mutual exclusion, but not for a barrier
These operations are logically implemented by a set of integer ALUs that can lock access to a single memory address while the read-modify-write sequence completes
This memory address can be in system memory (cached data from the CPU??), in the GPU's locally connected DRAM, or even in the memory spaces of other PCI Express connected devices
According to NVIDIA, atomic operations on Fermi are 5× to 20× faster than on previous GPUs using conventional synchronization
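A hedged sketch of what "mutual exclusion, but not a barrier" looks like in practice (names are illustrative): one thread per block takes a global spin lock built on atomicCAS so blocks can safely update a shared result. This provides exclusion across blocks, but it cannot make blocks wait for each other the way __syncthreads() does within a block.

```cuda
__device__ int resultLock = 0;   // lives in global memory; atomics are serviced at the L2

__global__ void accumulatePartials(const float *partial, float *result)
{
    // Assume each block has already produced partial[blockIdx.x].
    if (threadIdx.x == 0) {
        // Acquire: atomic compare-and-swap on a global-memory word.
        while (atomicCAS(&resultLock, 0, 1) != 0) { /* spin */ }

        *result += partial[blockIdx.x];   // critical section across blocks

        __threadfence();                  // publish the update before releasing
        atomicExch(&resultLock, 0);       // release
    }
}
```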

Other Features
Fermi provides six 64-bit DRAM channels that support SDDR3 and GDDR5 DRAMs; up to 6GB of GDDR5 DRAM can be connected to the chip
Fermi is the first GPU to provide ECC protection for DRAM; the chip's register files, shared memories, and L1 and L2 caches are also ECC protected
The GigaThread controller that manages application context switching also provides a pair of streaming data-transfer engines, each of which can fully saturate Fermi's PCI Express host interface
Typically, one engine will be used to move data from system memory to GPU memory when setting up a GPU computation, while the other moves result data from GPU memory back to system memory
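The pair of transfer engines is exploited from CUDA by issuing the two copy directions asynchronously in separate streams; a minimal sketch, assuming the host buffers were pinned with cudaMallocHost (names are illustrative):

```cuda
#include <cuda_runtime.h>

void overlappedTransfers(float *d_in, float *d_out,
                         float *h_in, float *h_out, size_t bytes)
{
    cudaStream_t upload, download;
    cudaStreamCreate(&upload);
    cudaStreamCreate(&download);

    // Host-to-device and device-to-host copies in different streams: each
    // direction can be serviced by one of the two streaming transfer engines.
    cudaMemcpyAsync(d_in,  h_in,  bytes, cudaMemcpyHostToDevice, upload);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, download);

    cudaStreamSynchronize(upload);
    cudaStreamSynchronize(download);
    cudaStreamDestroy(upload);
    cudaStreamDestroy(download);
}
```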

GPU Road Map – Nvidia
Nvidia roadmap sketch: Jen-Hsun Huang (CEO of Nvidia) projects performance using an unusual, GPU-computing-focused metric, double-precision gigaFLOPS per watt, which indicates power efficiency rather than raw peak performance
Kepler: the next major GPU architecture from Nvidia, for release in 2011 using a 28-nanometer process. Nvidia intends it to deliver three times the DP FLOPS per watt of Fermi. The improvement goes "far beyond" what process technology advances alone can achieve; changes in chip architecture, design, and software will contribute to that advance as well
The Maxwell architecture will come in 2013, with a 22nm fabrication process. Maxwell promises nearly an 8x increase in DP FLOPS per watt beyond Fermi chips
Huang noted that, in parallel computing, power is the primary constraint, which is why he chose that metric to describe future architectures

Dave Patterson's Comments on Fermi
This is a preview of the top 10 most important innovations in the new Fermi architecture, as well as 3 challenges in bringing future GPUs even closer to mainstream computing

Top 10 Innovations in Fermi:
1. Real Floating Point in Quality and Performance
2. Error Correcting Codes on Main Memory and Caches
3. 64-bit Virtual Address Space
4. Caches
5. Fast Context Switching
6. Unified Address Space
7. Debugging Support
8. Faster Atomic Instructions to Support Task-Based Parallel Programming
9. A Brand New Instruction Set
10. Also, Fermi is Faster than G80

Top 3 Next Challenges:
1. The Relatively Small Size of GPU Memory
2. Inability to do I/O Directly to GPU Memory
3. No Glueless Multisocket Hardware and Software

Assignment
Group meeting to discuss the term assignment and answer the following two questions (note this is a group assignment)

Top 10 Innovations in Fermi from Patterson:
1. Rank the 10 innovations
2. Add 2 more that you think are missing

Top 3 Next Challenges from Patterson:
1. Rank the 3 next challenges
2. Add 2 more challenges that you think are missing

Turn in your answers (by email, one per group, to the TA) by 6pm, Friday (11/5). Extra credit for the assignment.