Appendix C Graphics and Computing GPUs


Copyright © 2014 Elsevier Inc. All rights reserved.

FIGURE C.2.1 Historical PC. VGA controller drives graphics display from framebuffer memory.

FIGURE C.2.2 Contemporary PCs with Intel and AMD CPUs. See Chapter 6 for an explanation of the components and interconnects in this figure.

FIGURE C.2.3 Graphics logical pipeline. Programmable graphics shader stages are blue, and fixed-function blocks are white.

FIGURE C.2.4 Logical pipeline mapped to physical processors. The programmable shader stages execute on the array of unified processors, and the logical graphics pipeline dataflow recirculates through the processors.

FIGURE C.2.5 Basic unified GPU architecture. Example GPU with 112 streaming processor (SP) cores organized in 14 streaming multiprocessors (SMs); the cores are highly multithreaded. It has the basic Tesla architecture of an NVIDIA GeForce 8800. The processors connect with four 64-bit-wide DRAM partitions via an interconnection network. Each SM has eight SP cores, two special function units (SFUs), instruction and constant caches, a multithreaded instruction unit, and a shared memory.

FIGURE C.3.1 Direct3D 10 graphics pipeline. Each logical pipeline stage maps to GPU hardware or to a GPU processor. Programmable shader stages are blue, fixed-function blocks are white, and memory objects are gray. Each stage processes a vertex, geometric primitive, or pixel in a streaming dataflow fashion.

FIGURE C.3.2 GPU-rendered image. To give the skin visual depth and translucency, the pixel shader program models three separate skin layers, each with unique subsurface scattering behavior. It executes 1400 instructions to render the red, green, blue, and alpha color components of each skin pixel fragment.

FIGURE C.3.3 Decomposing result data into a grid of blocks of elements to be computed in parallel.

FIGURE C.3.4 Sequential code (top) in C versus parallel code (bottom) in CUDA for SAXPY (see Chapter 6). CUDA parallel threads replace the C serial loop—each thread computes the same result as one loop iteration. The parallel code computes n results with n threads organized in blocks of 256 threads.
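Since the figure's code itself does not appear in this transcript, the following is a minimal sketch of the two versions; the names saxpy_serial and saxpy_parallel and the launch details follow common CUDA practice and are assumptions, not a reproduction of Figure C.3.4.

// Serial C version: one loop computes y[i] = a*x[i] + y[i] for all n elements.
void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// CUDA version: each thread performs the work of one loop iteration.
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global element index
    if (i < n)                                       // guard when n is not a multiple of 256
        y[i] = a * x[i] + y[i];
}

// Launch n threads in blocks of 256:
//   int nblocks = (n + 255) / 256;
//   saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, x, y);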

FIGURE C.3.5 Nested granularity levels—thread, thread block, and grid—have corresponding memory sharing levels—local, shared, and global. Per-thread local memory is private to the thread. Per-block shared memory is shared by all threads of the block. Per-application global memory is shared by all threads.
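As an illustrative sketch (an assumption, not part of the figure), the three sharing levels map directly onto CUDA declarations:

__device__ float global_data[1 << 20];     // per-application global memory, visible to every thread

__global__ void memory_levels_example(void) // assumes a launch with 256 threads per block
{
    float local_temp = 1.0f;                // per-thread local: a private automatic variable
    __shared__ float block_buffer[256];     // per-block shared: visible to all threads of this block

    block_buffer[threadIdx.x] = local_temp;
    __syncthreads();                        // barrier among the threads of the block
    global_data[blockIdx.x * blockDim.x + threadIdx.x] = block_buffer[threadIdx.x];
}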

FIGURE C.3.6 Sequence of kernel F instantiated on a 2D grid of 2D thread blocks, an interkernel synchronization barrier, followed by kernel G on a 1D grid of 1D thread blocks.
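A hedged host-side sketch of such a sequence (kernel names, arguments, and dimensions are illustrative): kernels launched into the same stream run in launch order, so the second launch only begins after every thread block of the first has completed, which provides the inter-kernel barrier.

__global__ void kernelF(float *data) { /* ... 2D grid work ... */ }
__global__ void kernelG(float *data) { /* ... 1D grid work ... */ }

void launch_sequence(float *d_data)
{
    dim3 gridF(3, 2), blockF(5, 3);          // 2D grid of 2D thread blocks (illustrative sizes)
    dim3 gridG(4),    blockG(6);             // 1D grid of 1D thread blocks

    kernelF<<<gridF, blockF>>>(d_data);      // all blocks of F finish before G starts,
    kernelG<<<gridG, blockG>>>(d_data);      // because both launches use the same (default) stream
}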

FIGURE C.4.1 Multithreaded multiprocessor with eight scalar processor (SP) cores. The eight SP cores each have a large multithreaded register file (RF) and share an instruction cache, multithreaded instruction issue unit, constant cache, two special function units (SFUs), interconnection network, and a multibank shared memory.

FIGURE C.4.2 SIMT multithreaded warp scheduling. The scheduler selects a ready warp and issues an instruction synchronously to the parallel threads composing the warp. Because warps are independent, the scheduler may select a different warp each time.

FIGURE C.4.3 Basic PTX GPU thread instructions.

FIGURE C.6.1 Special function approximation statistics for the NVIDIA GeForce 8800 special function unit (SFU).

FIGURE C.6.2 Double precision fused-multiply-add (FMA) unit. Hardware to implement floating-point A × B + C for double precision.
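In standard notation, the fused operation applies a single rounding to the exact product-plus-sum, unlike a separate multiply followed by an add:

\mathrm{FMA}(A, B, C) = \operatorname{round}(A \times B + C)
\qquad\text{versus}\qquad
\operatorname{round}\bigl(\operatorname{round}(A \times B) + C\bigr)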

FIGURE C.7.1 NVIDIA Tesla unified graphics and computing GPU architecture. This GeForce 8800 has 128 streaming processor (SP) cores in 16 streaming multiprocessors (SMs), arranged in eight texture/processor clusters (TPCs). The processors connect with six 64-bit-wide DRAM partitions via an interconnection network. Other GPUs implementing the Tesla architecture vary the number of SP cores, SMs, DRAM partitions, and other units.

FIGURE C.7.2 Texture/processor cluster (TPC) and a streaming multiprocessor (SM). Each SM has eight streaming processor (SP) cores, two SFUs, and a shared memory.

FIGURE C.7.3 SGEMM dense matrix-matrix multiplication performance rates. The graph shows single precision GFLOPS rates achieved in multiplying square N×N matrices (solid lines) and thin N×64 and 64×N matrices (dashed lines). Adapted from Figure 6 of Volkov and Demmel [2008]. The black lines are a 1.35 GHz GeForce 8800 GTX using Volkov’s SGEMM code (now in NVIDIA CUBLAS 2.0) on matrices in GPU memory. The blue lines are a quad-core 2.4 GHz Intel Core2 Quad Q6600, 64-bit Linux, Intel MKL 10.0 on matrices in CPU memory.

FIGURE C.7.4 Dense matrix factorization performance rates. The graph shows GFLOPS rates achieved in matrix factorizations using the GPU and using the CPU alone. Adapted from Figure 7 of Volkov and Demmel [2008]. The black lines are for a 1.35 GHz NVIDIA GeForce 8800 GTX, CUDA 1.1, Windows XP, attached to a 2.67 GHz Intel Core2 Duo E6700, including all CPU–GPU data transfer times. The blue lines are for a quad-core 2.4 GHz Intel Core2 Quad Q6600, 64-bit Linux, Intel MKL 10.0.

FIGURE C.7.5 Fast Fourier Transform throughput performance. The graph compares the performance of batched one-dimensional in-place complex FFTs on a 1.35 GHz GeForce 8800 GTX with a quad-core 2.8 GHz Intel Xeon E5462 series (code named “Harpertown”), 6 MB L2 cache, 4 GB memory, 1600 FSB, Red Hat Linux, Intel MKL 10.0.

FIGURE C.7.6 Parallel sorting performance. This graph compares sorting rates for parallel radix sort implementations on a 1.5 GHz GeForce 8800 Ultra and an 8-core 2.33 GHz Intel Core2 Xeon E5345 system.

FIGURE C.8.1 Compressed sparse row (CSR) matrix.
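For concreteness, a small hedged example of the CSR layout in C; the array names Aval, Aindx, and RowPtr follow a common convention and are assumptions rather than the exact labels of Figure C.8.1.

/* 4x4 sparse matrix (illustrative):
 *   [ 3 0 1 0 ]
 *   [ 0 0 0 0 ]
 *   [ 0 2 4 1 ]
 *   [ 1 0 0 1 ]
 * Only the nonzeros are stored, row by row.
 */
float        Aval[7]   = { 3, 1, 2, 4, 1, 1, 1 };   // nonzero values
unsigned int Aindx[7]  = { 0, 2, 1, 2, 3, 0, 3 };   // column index of each nonzero
unsigned int RowPtr[5] = { 0, 2, 2, 5, 7 };         // where each row starts in Aval/Aindx (last entry = total nonzeros)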

FIGURE C.8.2 Serial C code for a single row of sparse matrix-vector multiply.
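A hedged sketch of such a per-row routine, a dot product of one sparse row with the dense vector x using the CSR arrays above (add a __device__ qualifier to reuse it from the CUDA kernels sketched later):

float multiply_row(unsigned int rowsize,        // number of nonzeros in this row
                   const unsigned int *Aindx,   // column indices of this row's nonzeros
                   const float *Aval,           // this row's nonzero values
                   const float *x)              // dense source vector
{
    float sum = 0.0f;
    for (unsigned int col = 0; col < rowsize; ++col)
        sum += Aval[col] * x[Aindx[col]];
    return sum;
}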

FIGURE C.8.3 Serial code for sparse matrix-vector multiply.
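The full serial routine then walks every row and calls the per-row helper; this sketch assumes the same CSR arrays and multiply_row as above.

void csrmul_serial(const unsigned int *RowPtr, const unsigned int *Aindx,
                   const float *Aval, unsigned int num_rows,
                   const float *x, float *y)
{
    for (unsigned int row = 0; row < num_rows; ++row) {
        unsigned int row_begin = RowPtr[row];
        unsigned int row_end   = RowPtr[row + 1];
        y[row] = multiply_row(row_end - row_begin,
                              Aindx + row_begin, Aval + row_begin, x);
    }
}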

FIGURE C.8.4 CUDA version of sparse matrix-vector multiply.
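A hedged CUDA sketch: the serial outer loop becomes one thread per row, with multiply_row assumed to be compiled as a __device__ function.

__global__ void csrmul_kernel(const unsigned int *RowPtr, const unsigned int *Aindx,
                              const float *Aval, unsigned int num_rows,
                              const float *x, float *y)
{
    unsigned int row = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per matrix row

    if (row < num_rows) {
        unsigned int row_begin = RowPtr[row];
        unsigned int row_end   = RowPtr[row + 1];
        y[row] = multiply_row(row_end - row_begin,
                              Aindx + row_begin, Aval + row_begin, x);
    }
}

// Launch example: csrmul_kernel<<<(num_rows + 127) / 128, 128>>>(RowPtr, Aindx, Aval, num_rows, x, y);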

FIGURE C.8.5 Shared memory version of sparse matrix-vector multiply.

FIGURE C.8.6 Template for serial plus-scan.
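A hedged serial template for an inclusive plus-scan, in which each output element is the sum of all inputs up to and including that position:

// Inclusive plus-scan: out[i] = in[0] + in[1] + ... + in[i]
template <class T>
void plus_scan_serial(const T *in, T *out, unsigned int n)
{
    T running = T(0);
    for (unsigned int i = 0; i < n; ++i) {
        running = running + in[i];
        out[i] = running;
    }
}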

FIGURE C.8.7 CUDA template for parallel plus-scan.
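A hedged sketch of a block-wide inclusive plus-scan in the Hillis-Steele style, operating in place on an array x (normally in shared memory) with one thread per element; the separate read and write phases with barriers keep threads from racing on the same elements.

template <class T>
__device__ T plus_scan(T *x)                    // x holds blockDim.x elements
{
    unsigned int i = threadIdx.x;
    unsigned int n = blockDim.x;

    for (unsigned int offset = 1; offset < n; offset *= 2) {
        T t;
        if (i >= offset) t = x[i - offset];     // read phase
        __syncthreads();
        if (i >= offset) x[i] = t + x[i];       // write phase
        __syncthreads();
    }
    return x[i];                                // inclusive sum of x[0..i]
}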

FIGURE C.8.8 Tree-based parallel scan data references.

FIGURE C.8.9 CUDA implementation of plus-reduction.
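A hedged sketch of a plus-reduction: each block loads its elements into shared memory, performs a tree reduction that halves the number of active threads each step, and one thread per block then adds the partial sum into the global total.

#define BLOCKSIZE 256                            // assumed thread-block size

__global__ void plus_reduce(const int *input, unsigned int N, int *total)
{
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ int x[BLOCKSIZE];
    x[tid] = (i < N) ? input[i] : 0;             // pad out-of-range threads with the identity
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s /= 2) {
        if (tid < s)
            x[tid] += x[tid + s];                // tree reduction within the block
        __syncthreads();
    }

    if (tid == 0)
        atomicAdd(total, x[0]);                  // combine the per-block partial sums
}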

FIGURE C.8.10 CUDA code for radix sort.
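A hedged sketch of the overall sort loop for one thread block: the keys are repeatedly partitioned on each bit, from least to most significant, using the per-bit split sketched under the next figure.

__device__ void partition_by_bit(unsigned int *values, unsigned int bit);  // sketched below

__global__ void radix_sort(unsigned int *values)    // sorts blockDim.x keys held by one block
{
    for (unsigned int bit = 0; bit < 32; ++bit) {
        partition_by_bit(values, bit);
        __syncthreads();                            // finish this pass before the next bit
    }
}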

FIGURE C.8.11 CUDA code to partition data on a bit-by-bit basis, as part of radix sort.
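A hedged sketch of the per-bit split: each thread owns one key, an inclusive plus-scan (the plus_scan sketch above) counts the 1-bits up to each position, and the keys are scattered so that all keys with a 0 in the current bit precede all keys with a 1, preserving their relative order.

__device__ void partition_by_bit(unsigned int *values, unsigned int bit)
{
    unsigned int i    = threadIdx.x;
    unsigned int size = blockDim.x;              // one thread per key
    unsigned int x_i  = values[i];               // this thread's key
    unsigned int p_i  = (x_i >> bit) & 1;        // the bit being sorted on this pass

    values[i] = p_i;                             // temporarily replace keys by their bit values
    __syncthreads();

    unsigned int ones_up_to_me = plus_scan(values);   // inclusive scan: 1-bits at or before i
    unsigned int ones_total    = values[size - 1];    // total number of 1-bits
    unsigned int zeros_total   = size - ones_total;
    __syncthreads();                             // all reads done before any key is scattered

    if (p_i)                                     // 1-bit keys go after all 0-bit keys, in order
        values[zeros_total + ones_up_to_me - 1] = x_i;
    else                                         // 0-bit keys stay at the front, in order
        values[i - ones_up_to_me] = x_i;
}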

FIGURE C.8.12 Serial code to compute all pair-wise forces on N bodies.
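A hedged serial sketch of the all-pairs computation; the Body type, the force model (gravity with a softening term eps2), and all names here are assumptions for illustration.

#include <math.h>

typedef struct { float x, y, z, mass; } Body;

// Add the force that body j exerts on body i into f[3].
void body_body_force(const Body *bi, const Body *bj, float eps2, float f[3])
{
    float dx = bj->x - bi->x, dy = bj->y - bi->y, dz = bj->z - bi->z;
    float dist2    = dx*dx + dy*dy + dz*dz + eps2;     // softened squared distance
    float inv_dist = 1.0f / sqrtf(dist2);
    float s = bi->mass * bj->mass * inv_dist * inv_dist * inv_dist;
    f[0] += s * dx;  f[1] += s * dy;  f[2] += s * dz;
}

// All pair-wise forces: an O(N^2) double loop over every (i, j) pair.
void compute_forces_serial(const Body *bodies, float (*force)[3], int n, float eps2)
{
    for (int i = 0; i < n; ++i) {
        force[i][0] = force[i][1] = force[i][2] = 0.0f;
        for (int j = 0; j < n; ++j)
            if (j != i)
                body_body_force(&bodies[i], &bodies[j], eps2, force[i]);
    }
}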

FIGURE C.8.13 CUDA thread code to compute the total force on a single body.
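A hedged CUDA sketch in which each thread accumulates the total force on one body by reading every other body from global memory; it reuses the Body type and body_body_force helper from the serial sketch, assuming the helper is also compiled as a __device__ function.

__global__ void forces_kernel(const Body *bodies, float (*force)[3], int n, float eps2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per body
    if (i >= n) return;

    Body  bi   = bodies[i];
    float f[3] = { 0.0f, 0.0f, 0.0f };

    for (int j = 0; j < n; ++j)                      // each thread walks all N bodies
        if (j != i) {
            Body bj = bodies[j];
            body_body_force(&bi, &bj, eps2, f);
        }

    force[i][0] = f[0];  force[i][1] = f[1];  force[i][2] = f[2];
}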

FIGURE C.8.14 CUDA code to compute the total force on each body, using shared memory to improve performance.
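A hedged sketch of the shared-memory variant: the threads of a block cooperatively stage one tile of bodies into shared memory, every thread accumulates forces from that tile, and the block then moves on to the next tile, so each body is read from global memory once per block rather than once per thread.

__global__ void forces_tiled_kernel(const Body *bodies, float (*force)[3], int n, float eps2)
{
    extern __shared__ Body tile[];                   // blockDim.x bodies per tile
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    Body  bi   = bodies[(i < n) ? i : 0];
    float f[3] = { 0.0f, 0.0f, 0.0f };

    for (int base = 0; base < n; base += blockDim.x) {
        int j = base + threadIdx.x;
        tile[threadIdx.x] = bodies[(j < n) ? j : 0]; // each thread stages one body of the tile
        __syncthreads();

        int limit = blockDim.x;
        if (base + limit > n) limit = n - base;      // partial last tile
        for (int k = 0; k < limit; ++k)
            if (i < n && base + k != i)
                body_body_force(&bi, &tile[k], eps2, f);
        __syncthreads();                             // finish the tile before overwriting it
    }

    if (i < n) { force[i][0] = f[0]; force[i][1] = f[1]; force[i][2] = f[2]; }
}

// Launch with shared memory sized for one tile of bodies:
//   forces_tiled_kernel<<<blocks, threads, threads * sizeof(Body)>>>(bodies, force, n, eps2);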

FIGURE C.8.15 Performance measurements of the N-body application on a GeForce 8800 GTX and a GeForce 9600. The 8800 has 128 stream processors at 1.35 GHz, while the 9600 has 64 at 0.80 GHz (about 30% of the 8800). The peak performance is 242 GFLOPS. For a GPU with more processors, the problem needs to be bigger to achieve full performance (the 9600 peak is around 2048 bodies, while the 8800 doesn’t reach its peak until 16,384 bodies). For small N, more than one thread per body can significantly improve performance, but eventually incurs a performance penalty as N grows.

FIGURE C.8.16 Performance measurements on the N-body code on a CPU. The graph shows single precision N-body performance using Intel Core2 CPUs, denoted by their CPU model number. Note the much lower GFLOPS performance (y-axis), demonstrating how much faster the GPU is than the CPU. The performance on the CPU is generally independent of problem size, except for an anomalously low performance when N = 16,384 on the X9775 CPU. The graph also shows the results of running the CUDA version of the code (using the CUDA-for-CPU compiler) on a single CPU core, where it outperforms the C++ code by 24%. As a programming language, CUDA exposes parallelism and locality that a compiler can exploit. The Intel CPUs are a 3.2 GHz Extreme X9775 (code named “Penryn”), a 2.66 GHz E8200 (code named “Wolfdale”), a desktop, pre-Penryn CPU, and a 1.83 GHz T2400 (code named “Yonah”), a 2007 laptop CPU. The Penryn version of the Core 2 architecture is particularly interesting for N-body calculations with its 4-bit divider, allowing division and square root operations to execute four times faster than previous Intel CPUs.

FIGURE C.8.17 12 images captured during the evolution of an N-body system with 16,384 bodies.