Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Topic 17 NVIDIA GPU Computational Structures Prof. Zhang Gang gzhang@tju.edu.cn School of Computer Sci. & Tech. Tianjin University, Tianjin, P. R. China

NVIDIA GPU Computational Structures
Similarities to vector machines:
- Works well with data-level parallel problems
- Scatter-gather transfers
- Mask registers
- Large register files
Differences:
- No scalar processor
- Uses multithreading to hide memory latency
- Has many functional units, as opposed to a few deeply pipelined units as in a vector processor

Grid, Thread block, SIMD thread
A Grid is the code that runs on a GPU; it consists of a set of Thread Blocks.
Running example: multiply two vectors together, each 8192 elements long.

Grid, Thread block, SIMD thread
A Grid is composed of Thread Blocks, each handling up to 512 elements; a SIMD instruction executes 32 elements at a time.
In this example the Grid has 16 Thread Blocks, since 8192 ÷ 512 = 16.
Each Thread Block contains 16 SIMD threads, since 512 ÷ 32 = 16.
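The decomposition arithmetic above can be sketched as a short Python function; the sizes (8192 elements, 512 elements per Thread Block, 32 elements per SIMD instruction) come from the example, and the function name is illustrative.

```python
ELEMENTS = 8192      # vector length in the running example
BLOCK_SIZE = 512     # elements handled by one Thread Block
SIMD_WIDTH = 32      # elements executed by one SIMD instruction

def grid_shape(n, block_size=BLOCK_SIZE, simd_width=SIMD_WIDTH):
    """Return (thread_blocks, simd_threads_per_block) for an n-element loop."""
    thread_blocks = n // block_size           # 8192 / 512 = 16 Thread Blocks
    simd_threads = block_size // simd_width   # 512 / 32 = 16 SIMD threads each
    return thread_blocks, simd_threads

print(grid_shape(ELEMENTS))  # (16, 16)
```

Changing the vector length or block size changes only the Grid dimensions; the SIMD thread count per block depends only on the block size and SIMD width.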

Thread Block Scheduler
A Thread Block is assigned to a multithreaded SIMD Processor by the Thread Block Scheduler, which plays a role similar to a control processor in a vector architecture. It determines the number of Thread Blocks needed for the loop and keeps allocating them to different multithreaded SIMD Processors until the loop is completed.
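A minimal sketch of that allocation loop, assuming a simple round-robin policy; real hardware also steers Thread Blocks toward processors whose local memories already hold the corresponding data, a detail omitted here.

```python
def thread_block_scheduler(num_blocks, num_processors):
    """Toy Thread Block Scheduler: keep allocating Thread Blocks to
    multithreaded SIMD Processors until every block is assigned.
    Round-robin stands in for the hardware's locality-aware policy."""
    assignment = {p: [] for p in range(num_processors)}
    for block in range(num_blocks):
        assignment[block % num_processors].append(block)
    return assignment

# The 16 Thread Blocks of the running example spread over 4 SIMD Processors:
print(thread_block_scheduler(16, 4))
```

With 16 Thread Blocks and 4 processors, each processor receives 4 blocks; block 0 and block 4 land on the same processor, and so on.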

Multithreaded SIMD Processor The figure shows a simplified block diagram of a multithreaded SIMD Processor. It has 16 SIMD lanes.

SIMD Thread Scheduler
The SIMD Thread Scheduler includes a scoreboard, so it knows which threads of SIMD instructions are ready to run; it sends them to a dispatch unit to be run on the multithreaded SIMD Processor. It is identical to a hardware thread scheduler in a traditional multithreaded processor, except that it schedules threads of SIMD instructions.
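The scoreboard idea can be sketched as follows; this is an illustrative simplification, not the actual hardware algorithm. Each SIMD thread has a remaining latency (e.g. an outstanding memory access); each cycle the scoreboard marks threads whose operands have arrived as ready, and the scheduler dispatches one of them while the others' latencies count down, which is how multithreading hides memory latency.

```python
def simd_thread_scheduler(latencies):
    """Toy SIMD Thread Scheduler with a scoreboard.
    latencies[i] = cycles until SIMD thread i's operands are ready.
    Returns the order in which threads are dispatched."""
    remaining = dict(enumerate(latencies))
    order = []
    while remaining:
        ready = [t for t, lat in remaining.items() if lat == 0]
        if ready:
            t = min(ready)     # simple policy: lowest-numbered ready thread
            order.append(t)
            del remaining[t]
        # every still-waiting thread gets one cycle closer to ready
        for t in remaining:
            if remaining[t] > 0:
                remaining[t] -= 1
    return order

print(simd_thread_scheduler([0, 2, 1]))  # [0, 2, 1]
```

Thread 0 dispatches immediately; while it runs, threads 1 and 2 finish their memory accesses and become ready in turn, so no cycle is wasted waiting.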

Two levels of hardware scheduler
GPU hardware has two levels of hardware schedulers:
(1) the Thread Block Scheduler, which assigns Thread Blocks to multithreaded SIMD Processors and ensures that Thread Blocks are assigned to processors whose local memories hold the corresponding data;
(2) the SIMD Thread Scheduler within a SIMD Processor, which schedules when threads of SIMD instructions should run.

Exercises
What is the meaning of Grid in GPUs?
What is the meaning of Thread Block in GPUs?
What is the meaning of SIMD thread in GPUs?
What are the hardware schedulers in GPUs?