NVIDIA’S FERMI: THE FIRST COMPLETE GPU COMPUTING ARCHITECTURE. A WHITE PAPER BY PETER N. GLASKOWSKY. Presented by: Ahmad Hammad. Course: CSE, Fall 2011

Outline  Introduction  What is GPU Computing?  Fermi  The Programming Model  The Streaming Multiprocessor  The Cache and Memory Hierarchy  Conclusion

Introduction  Traditional microprocessor technology is seeing diminishing returns.  Improvement in clock speeds and architectural sophistication is slowing.  Focus has shifted to multicore designs.  These, too, are reaching practical limits for personal computing.

Introduction (2)  CPUs are optimized for applications where the work is done by a limited number of threads  Threads exhibit high data locality  A mix of different operations  A high percentage of conditional branches.  CPUs are inefficient for high-performance computing applications  The number of integer and floating-point execution units is small  Most of the CPU’s die area, complexity, and the heat it generates are devoted to caches, instruction decoders, branch predictors, and other features that enhance single-threaded performance.

Introduction (3)  GPU design targets applications with many threads dominated by long sequences of computational instructions.  CPUs remain much better at thread handling, data caching, virtual memory management, flow control, and other CPU-like features.  CPUs will never go away, but  GPUs deliver more cost-effective and energy-efficient performance on such workloads.

Introduction (4)  The key GPU design goal is to maximize floating-point throughput.  Most of the circuitry within each core is dedicated to computation rather than to speculative features.  Most of the power consumed by a GPU therefore goes into the application’s actual algorithmic work.

What is GPU Computing?  The use of a graphics processing unit to do general-purpose scientific and engineering computing.  GPU computing is not a replacement for CPU computing.  Each approach has advantages for certain kinds of software.  The CPU and GPU work together in a heterogeneous co-processing computing model.  The sequential part of the application runs on the CPU  The computationally intensive part is accelerated by the GPU.

What is GPU Computing?  From the user’s perspective, the application simply runs faster because the GPU is used to boost performance.

History  GPU computing began with non-programmable 3D graphics accelerators.  Multi-chip 3D rendering engines were developed starting in the 1980s.  By the mid-1990s, all the essential elements had been integrated onto a single chip.  Over time, these chips progressed from the simplest pixel-drawing functions to implementing the full 3D pipeline.

History  NVIDIA’s GeForce 3 in 2001 introduced programmable pixel shading to the consumer market.  The programmability of this chip was very limited.  Later GeForce products became more flexible and faster,  adding separate programmable engines for vertex and geometry shading.  This evolution culminated in the GeForce 7800.

GeForce 7800  It had three kinds of programmable engines for different stages of the 3D pipeline,  plus several additional stages of configurable and fixed-function logic.

History  GPU computing evolved as a way to perform non-graphics processing on these graphics-optimized architectures  by running carefully crafted shader code against data presented as vertex or texture information  and retrieving the results from a later stage in the pipeline.

History  Managing three different programmable engines in a single 3D pipeline led to unpredictable bottlenecks;  too much effort went into balancing the throughput of each stage.  In 2006, NVIDIA introduced the GeForce 8800.  This design featured a “unified shader architecture” with 128 processing elements distributed among eight shader cores.  Each shader core could be assigned to any shader task, eliminating the need for stage-by-stage balancing and greatly improving overall performance.

History  To bring the advantages of the 8800 architecture and CUDA to new markets such as HPC, NVIDIA introduced the Tesla product line.  Current Tesla products use the more recent GT200 architecture.  The Tesla line begins with PCI Express add-in boards (essentially graphics cards without display outputs) with drivers optimized for GPU computing instead of 3D rendering.  With Tesla, programmers don’t have to worry about making tasks look like graphics operations;  the GPU can be treated like a many-core processor.

History  NVIDIA also introduced its parallel architecture, called “CUDA”.  It consists of hundreds of processor cores that operate together.  Programming is made easy by the associated CUDA parallel programming model.  Developers modify their application to take the compute-intensive kernels and map them to the GPU by adding “C” keywords.  The rest of the application remains on the CPU.  The developer launches tens of thousands of threads simultaneously.  The GPU hardware manages the threads and does the thread scheduling.  Although GPU computing is only a few years old,  there are already more programmers with direct GPU computing experience than have ever used a Cray supercomputer.  Academic support for GPU computing is also growing quickly: over 200 colleges and universities are teaching classes in CUDA programming.

Fermi  Code name for NVIDIA’s next-generation CUDA architecture.  It consists of:  16 streaming multiprocessors (SMs), each containing 32 cores (512 cores in total); each core can execute one floating-point or integer instruction per clock.  The SMs are supported by a second-level cache  A host interface  The GigaThread scheduler  Multiple DRAM interfaces.

Fermi  Code name for NVIDIA’s next-generation CUDA architecture

The Programming Model  The complexity of the Fermi architecture is managed by a multi-level programming model  that allows software developers to focus on algorithm design  without needing to know the details of how the algorithm maps to the hardware, which improves productivity.

The Programming Model Kernels  In NVIDIA’s CUDA software platform, the computational elements of algorithms are called kernels.  Kernels can be written in the C language,  extended with additional keywords to express parallelism directly.  Once compiled, a kernel consists of many threads that execute the same program in parallel, as in the sketch below.
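As an illustration (a hypothetical SAXPY kernel, not taken from the white paper), a CUDA kernel is ordinary C with a few extra keywords:

    #include <cuda_runtime.h>

    // __global__ marks a function that runs on the GPU; every thread
    // executes this same body on a different element of the arrays.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique index for this thread
        if (i < n)
            y[i] = a * x[i] + y[i];
    }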

The Programming Model Thread Blocks  Threads are grouped into thread blocks.  All threads in a block execute on the same SM and can cooperate through shared memory and barrier synchronization.  On Fermi, a thread block can contain up to 1,024 threads.

The Programming Model Warps  Thread blocks are divided into warps of 32 threads.  The warp is the fundamental unit of dispatch within a single SM.  Two warps from different thread blocks can be issued and executed concurrently,  which increases hardware utilization and energy efficiency.  Thread blocks are grouped into grids,  each of which executes a unique kernel.

The Programming Model IDs  Threads and thread blocks each have identifiers (IDs)  that specify their position within the kernel.  Each thread uses them as indexes into its input and output data and shared-memory locations, as in the launch sketch below.
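A hedged host-side sketch (it assumes device buffers d_x and d_y were already allocated with cudaMalloc and filled): the grid and block dimensions chosen at launch determine the blockIdx and threadIdx values each thread sees, which the saxpy kernel above turns into an array index.

    int n = 1 << 20;                                   // about one million elements
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocksPerGrid, threadsPerBlock>>>(n, 2.0f, d_x, d_y);
    cudaDeviceSynchronize();                           // wait for the kernel to finish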

The Programming Model  At any one time, the entire Fermi device is dedicated to a single application,  but an application may include multiple kernels.  Fermi supports simultaneous execution of multiple kernels from the same application,  each kernel distributed to one or more SMs.  This capability increases the utilization of the device; a small sketch follows.
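A minimal sketch of how an application exposes this (kernelA, kernelB, d_a, and d_b are hypothetical names): launching independent kernels into different CUDA streams allows them to execute concurrently when SM resources permit.

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    kernelA<<<32, 256, 0, s1>>>(d_a);   // may run concurrently with kernelB
    kernelB<<<32, 256, 0, s2>>>(d_b);
    cudaDeviceSynchronize();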

The Programming Model GigaThread  Switching from one application to another takes about 25 µs,  short enough to maintain high utilization even when running multiple applications.  This switching is managed by GigaThread, the hardware thread scheduler,  which manages 1,536 simultaneously active threads for each streaming multiprocessor across 16 kernels.

The Programming Model Languages  Fermi supports  C  FORTRAN (with independent solutions from The Portland Group and NOAA)  Java, MATLAB, and Python.  Fermi brings new instruction-level support for C++,  previously unsupported on GPUs,  which will make GPU computing more widely available than ever.

Supported software platforms  NVIDIA’s own CUDA development environment  The OpenCL standard managed by the Khronos Group  Microsoft’s DirectCompute API.

The Streaming Multiprocessor  Each SM comprises:  32 cores, each of which can perform floating-point and integer operations  16 load-store units for memory operations  Four special-function units  64KB of local SRAM, split between the L1 cache and shared memory.

The Streaming Multiprocessor core  Floating-point operations follow the IEEE 754-2008 floating-point standard.  Each core can perform  one single-precision fused multiply-add (FMA) operation in each clock period  one double-precision FMA in two clock periods.  The FMA does not round the intermediate product.  Fermi performs more than 8× as many double-precision operations per clock as previous GPU generations.
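Stated as a formula, the accuracy benefit is that an FMA applies a single rounding to the exact product-plus-addend, whereas a separate multiply followed by an add rounds twice:

    $$\mathrm{FMA}(a,b,c) = \mathrm{round}(a \times b + c) \qquad \text{vs.} \qquad \mathrm{round}\big(\mathrm{round}(a \times b) + c\big)$$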

The Streaming Multiprocessor core  FMA support increases the accuracy and performance of other mathematical operations:  division and square root  extended-precision arithmetic  interval arithmetic  linear algebra.  The integer ALU supports the usual arithmetic and logical operations,  including multiplication, on both 32-bit and 64-bit values.

The Streaming Multiprocessor Memory operations  Memory operations are handled by a set of 16 load-store units in each SM.  Load/store instructions can refer to memory in terms of two-dimensional arrays,  providing addresses as x and y values.  Data can be converted from one format to another as it passes between DRAM and the core registers, at the full rate.  These are examples of optimizations unique to GPUs.

The Streaming Multiprocessor Special Function Units  The four SFUs handle special operations such as sin, cos, and exp.  Four of these operations can be issued per cycle in each SM.

The Streaming Multiprocessor execution blocks  Each Fermi SM has four execution blocks:  the 32 cores are divided into two execution blocks of 16 cores each,  one block holds the 16 load-store units,  and one block holds the four SFUs.  In each cycle, a total of 32 instructions can be dispatched from one or two warps to these blocks.  It takes two cycles to execute the 32 instructions on the cores or load/store units.  32 special-function instructions can also be issued in a single cycle,  but they take eight cycles to complete on the four SFUs (32/4 = 8).

This figure shows how instructions are issued to the execution blocks.

ISA improvements (1)  Fermi debuts the Parallel Thread eXecution (PTX) 2.0 instruction-set architecture (ISA).  It defines the instruction set and a new virtual machine architecture.  Compilers supporting NVIDIA GPUs provide PTX-compliant binaries that act as a hardware-neutral distribution format,  so applications can be portable across GPU generations and implementations.

ISA improvements (2)  All instructions support predication.  Each instruction can be executed or skipped based on condition codes.  Each thread can perform different operations as needed while execution continues at full speed.  If predication isn’t sufficient, the usual if-then-else structure with branch statements is used; a sketch of the kind of code that benefits follows.
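For illustration only (a hypothetical kernel; whether the compiler emits predicated instructions or a branch is its own decision), a short per-thread conditional like this is the classic candidate for predication, since each thread simply commits or skips the store without the warp having to follow separate branch paths:

    __global__ void clamp_negatives(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && data[i] < 0.0f)   // short conditional body, well suited to predication
            data[i] = 0.0f;            // executed or skipped per thread
    }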

The Cache and Memory Hierarchy L1  The Fermi architecture provides local memory in each SM that can be split between  shared memory and  a first-level (L1) cache for global memory references.  The local memory is 64KB in size,  split 16KB/48KB or 48KB/16KB between the L1 cache and shared memory.  The choice depends on how much shared memory the kernel needs and how predictable its accesses to global memory are likely to be; a configuration sketch follows.
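A minimal sketch, reusing the saxpy kernel from earlier: the CUDA runtime lets a program state its preferred split per kernel. cudaFuncCachePreferShared requests 48KB shared / 16KB L1 and cudaFuncCachePreferL1 the reverse; the driver may override the preference if the kernel requires more shared memory than the request allows.

    // Ask for the 48KB L1 / 16KB shared-memory split for this kernel.
    cudaFuncSetCacheConfig(saxpy, cudaFuncCachePreferL1);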

The Cache and Memory Hierarchy  Shared memory provides low-latency access to moderate amounts of data.  Because the access latency to this memory is also completely predictable,  algorithms can be written to interleave loads, calculations, and stores with maximum efficiency, as in the sketch below.
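A hypothetical illustration of that interleaving (the names and the trivial computation are mine, and it assumes a block size of 256 threads): each block stages a tile in shared memory, synchronizes, then computes and stores.

    __global__ void scale_tile(const float *in, float *out, float s, int n)
    {
        __shared__ float tile[256];                  // low-latency on-chip storage
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = in[i];        // load phase
        __syncthreads();                             // make the tile visible to the whole block
        if (i < n) out[i] = s * tile[threadIdx.x];   // compute and store phase
    }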

The Cache and Memory Hierarchy  A larger shared-memory requirement argues for less cache;  more frequent or unpredictable accesses to larger regions of DRAM argue for more cache.

The Cache and Memory Hierarchy L2  Fermi comes with an L2 cache,  768KB in size for a 512-core chip.  It covers GPU local DRAM as well as system memory.  The L2 cache subsystem implements  a set of memory read-modify-write atomic operations  for managing access to data shared across thread blocks or kernels.  These atomic operations are 5× to 20× faster than on previous GPUs using conventional synchronization methods. A small example follows.
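A minimal sketch of such sharing (a hypothetical kernel; atomicAdd on a float in global memory is available from compute capability 2.0, i.e. Fermi): every thread, regardless of which block it belongs to, accumulates into one global result, and the read-modify-write is resolved through the L2 cache.

    __global__ void sum_all(const float *x, float *result, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(result, x[i]);   // atomic read-modify-write shared across all blocks
    }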

The Cache and Memory Hierarchy DRAM  DRAM is the final stage of the local memory hierarchy.  Fermi provides six 64-bit DRAM channels that support SDDR3 and GDDR5 DRAMs.  Up to 6GB of GDDR5 DRAM can be connected to the chip.

Error Correcting Code (ECC)  Fermi is the first GPU to provide ECC protection for  DRAM, register files, shared memories, and the L1 and L2 caches.  The level of protection is known as SECDED:  single-bit error correction, double-bit error detection.  Instead of each 64-bit memory channel carrying eight extra bits of ECC information,  NVIDIA uses a proprietary solution for packing the ECC bits into reserved lines of memory.

The Cache and Memory Hierarchy  The GigaThread controller also provides a pair of streaming data-transfer engines,  each of which can fully saturate Fermi’s PCI Express host interface.  Typically, one will be used to move data from system memory to GPU memory when setting up a GPU computation,  while the other will be used to move result data from GPU memory to system memory, as in the sketch below.
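A hedged sketch of how software exposes the two engines (h_in and h_out are assumed to be pinned host buffers, d_in and d_out device buffers of the same size in bytes): issuing the copies asynchronously in different streams lets an upload overlap a download.

    cudaStream_t upload, download;
    cudaStreamCreate(&upload);
    cudaStreamCreate(&download);
    cudaMemcpyAsync(d_in,  h_in,  bytes, cudaMemcpyHostToDevice, upload);    // engine 1: host to GPU
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, download);  // engine 2: GPU to host
    cudaStreamSynchronize(upload);
    cudaStreamSynchronize(download);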

Conclusion  CPUs are best for dynamic workloads with short sequences of computational operations and unpredictable control flow.  Workloads dominated by computational work performed within a simpler control flow are better suited to the GPU.