Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Topic 22 Similarities & Differences between Vector Arch & GPUs Prof. Zhang Gang.

Slides:



Advertisements
Similar presentations
1 Review of Chapters 3 & 4 Copyright © 2012, Elsevier Inc. All rights reserved.
Advertisements

Threads, SMP, and Microkernels
© 2009 Fakultas Teknologi Informasi Universitas Budi Luhur Jl. Ciledug Raya Petukangan Utara Jakarta Selatan Website:
Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,
Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.
Lecture 6: Multicore Systems
Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.
1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Computer Architecture A.
The University of Adelaide, School of Computer Science
Parallel computer architecture classification
1 Threads, SMP, and Microkernels Chapter 4. 2 Process: Some Info. Motivation for threads! Two fundamental aspects of a “process”: Resource ownership Scheduling.
Chapter 17 Parallel Processing.
Jan 30, 2003 GCAFE: 1 Compilation Targets Ian Buck, Francois Labonte February 04, 2003.
GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.
Flynn’s Taxonomy of Computer Architectures Source: Wikipedia Michael Flynn 1966 CMPS 5433 – Parallel Processing.
Chapter One Introduction to Pipelined Processors.
1 Chapter 1 Parallel Machines and Computations (Fundamentals of Parallel Processing) Dr. Ranette Halverson.
1 Chapter 04 Authors: John Hennessy & David Patterson.
Multi-core architectures. Single-core computer Single-core CPU chip.
1 Multi-core processors 12/1/09. 2 Multiprocessors inside a single chip It is now possible to implement multiple processors (cores) inside a single chip.
Parallel Processing - introduction  Traditionally, the computer has been viewed as a sequential machine. This view of the computer has never been entirely.
Chapter 2 Parallel Architecture. Moore’s Law The number of transistors on a chip doubles every years. – Has been valid for over 40 years – Can’t.
© 2007 SET Associates Corporation SAR Processing Performance on Cell Processor and Xeon Mark Backues, SET Corporation Uttam Majumder, AFRL/RYAS.
1 Threads, SMP, and Microkernels Chapter Multithreading Operating system supports multiple threads of execution within a single process MS-DOS.
EKT303/4 Superscalar vs Super-pipelined.
Computer Architecture Lecture 24 Parallel Processing Ralph Grishman November 2015 NYU.
Advanced Computer Architecture pg 1 Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8) Henk Corporaal
GPGPU introduction. Why is GPU in the picture Seeking exa-scale computing platform Minimize power per operation. – Power is directly correlated to the.
1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2013 Branching.ppt Control Flow These notes will introduce scheduling control-flow.
Processor Level Parallelism 1
The Present and Future of Parallelism on GPUs
Prof. Zhang Gang School of Computer Sci. & Tech.
CS203 – Advanced Computer Architecture
A Level Computing – a2 Component 2 1A, 1B, 1C, 1D, 1E.
CHAPTER SEVEN PARALLEL PROCESSING © Prepared By: Razif Razali.
18-447: Computer Architecture Lecture 30B: Multiprocessors
Chapter 6 Parallel Processors from Client to Cloud
Parallel Processing - introduction
Multi-core processors
Parallel Computing Lecture
Flynn’s Classification Of Computer Architectures
Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Topic 14 The Roofline Visual Performance Model Prof. Zhang Gang
Chapter 6 Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism Topic 11 Amazon Web Services Prof. Zhang Gang
Chapter 6 Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism Topic 7 Physical Infrastructure of WSC Prof. Zhang Gang
Chapter 6 Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism Topic 13 Using Energy Efficiently Inside the Server Prof. Zhang.
Prof. Zhang Gang School of Computer Sci. & Tech.
Chapter 6 Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism Topic 4 Storage Prof. Zhang Gang School of.
Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Topic 13 SIMD Multimedia Extensions Prof. Zhang Gang School.
Prof. Zhang Gang School of Computer Sci. & Tech.
Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)
Prof. Zhang Gang School of Computer Sci. & Tech.
Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Topic 17 NVIDIA GPU Computational Structures Prof. Zhang Gang
COMP4211 : Advance Computer Architecture
Multi-Processing in High Performance Computer Architecture:
Array Processor.
Chapter 17 Parallel Processing
Mattan Erez The University of Texas at Austin
Parallel Computing’s Challenges
Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)
Introduction to CUDA.
Mattan Erez The University of Texas at Austin
Mattan Erez The University of Texas at Austin
Memory System Performance Chapter 3
Multicore and GPU Programming
The University of Adelaide, School of Computer Science
Lecture 20 Parallel Programming CSE /27/2019.
6- General Purpose GPU Programming
CSE 502: Computer Architecture
Multicore and GPU Programming
Presentation transcript:

Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Topic 22 Similarities & Differences between Vector Arch & GPUs Prof. Zhang Gang gzhang@tju.edu.cn School of Computer Sci. & Tech. Tianjin University, Tianjin, P. R. China

Comparison of vector processor and SIMD Processor Since both architectures are designed to execute data-level parallel programs, but take different paths, this comparison is in depth to try to gain better understanding of what is needed for DLP hardware. A SIMD Processor is like a vector processor. The multiple SIMD Processors in GPUs act as independent MIMD cores, just as many vector computers have multiple vector processors. This view would consider the NVIDIA GTX 480 as a 15-core machine with hardware support for multithreading, where each core has 16 lanes.

Differences between Vector Arch & GPUs The biggest difference is multithreading, which is fundamental to GPUs and missing from most vector processors. VMIPS register file holds entire vectors—that is, a contiguous block of 64 doubles. A VMIPS processor has 8 vector registers with 64 elements, or 512 elements total. In contrast, a single vector in a GPU would be distributed across the registers of all SIMD Lanes. A GPU thread of SIMD instructions has up to 64 registers with 32 elements each, or 2048 elements. These extra GPU registers support multithreading.

For pedagogic purposes, we assume the vector processor has four lanes and the multithreaded SIMD Processor also has four SIMD Lanes. Figure 4.22(a) A vector processor with four lanes Figure 4.22(b) A multithreaded SIMD Processor of a GPU with four SIMD Lanes

Similarities between vector processor & SIMD Processor This figure shows: The four SIMD Lanes act in concert much like a four-lane vector unit And a SIMD Processor acts much like a vector processor.

Similarities between vector processor & SIMD Processor In reality, there are many more lanes in GPUs, so GPU “chimes” are shorter. While a vector processor might have 2 to 8 lanes and a vector length of, say, 32—making a chime 4 to 16 clock cycles—a multithreaded SIMD Processor might have 8 or 16 lanes. A SIMD thread is 32 elements wide, so a GPU chime would just be 2 or 4 clock cycles. This difference is why we use “SIMD Processor” as the more descriptive term because it is closer to a SIMD design than it is to a traditional vector processor design.

Exercises Why are the GPUs’ “chimes” shorter? What are the Similarities between Vector Arch & GPUs? What are the differences between Vector Arch & GPUs?