Lecture 38: Chapter 7: Multiprocessors

Today's topics
– Vector processors
– GPUs
– An example

Multiprocessor Taxonomy
– SISD: single instruction stream, single data stream: a uniprocessor
– MISD: multiple instruction streams, single data stream: no commercial multiprocessor; imagine data passing through a pipeline of execution engines
– SIMD: single instruction stream, multiple data streams: vector architectures; less flexible
– MIMD: multiple instruction streams, multiple data streams: most multiprocessors today; easy to build from off-the-shelf computers and the most flexible

SIMD
Operate element-wise on vectors of data
– E.g., the MMX and SSE instructions in x86: multiple data elements packed into 128-bit-wide registers
All processing elements execute the same instruction at the same time
– Each operates on a different data address
– Simplifies synchronization
– Reduces instruction-control hardware
Works best for highly data-parallel applications, as in the sketch below
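To make this concrete, here is a minimal C sketch of element-wise addition using SSE2 intrinsics on the 128-bit registers mentioned above. The function name add_simd and the even-length assumption are ours, not from the slides:

#include <emmintrin.h>   /* SSE2 intrinsics (x86) */

/* Element-wise add of two double arrays using 128-bit registers.
   Each _mm_add_pd executes one instruction on two data elements.
   Sketch only: assumes n is even; a full version handles the tail. */
void add_simd(double *c, const double *a, const double *b, int n) {
    for (int i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);            /* load 2 doubles */
        __m128d vb = _mm_loadu_pd(&b[i]);
        _mm_storeu_pd(&c[i], _mm_add_pd(va, vb));    /* one add, 2 elements */
    }
}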

Vector Processors
Highly pipelined function units
Stream data between vector registers and the function units
– Data is collected from memory into vector registers
– Results are stored from registers back to memory
Example: the vector extension to MIPS
– 32 × 64-element vector registers (64-bit elements)
– Vector instructions:
  lv, sv: load/store a vector
  addv.d: add two vectors of doubles
  addvs.d: add a scalar to each element of a vector of doubles

Example: DAXPY (Y = a × X + Y)

Conventional MIPS code:
      l.d     $f0,a($sp)      ;load scalar a
      addiu   $t1,$s0,#512    ;upper bound of what to load
loop: l.d     $f2,0($s0)      ;load x(i)
      mul.d   $f2,$f2,$f0     ;a × x(i)
      l.d     $f4,0($s1)      ;load y(i)
      add.d   $f4,$f4,$f2     ;a × x(i) + y(i)
      s.d     $f4,0($s1)      ;store into y(i)
      addiu   $s0,$s0,#8      ;increment index to x
      addiu   $s1,$s1,#8      ;increment index to y
      subu    $t0,$t1,$s0     ;compute remaining bound
      bne     $t0,$zero,loop  ;check if done

Vector MIPS code:
      l.d     $f0,a($sp)      ;load scalar a
      lv      $v1,0($s0)      ;load vector x
      mulvs.d $v2,$v1,$f0     ;vector-scalar multiply
      lv      $v3,0($s1)      ;load vector y
      addv.d  $v4,$v2,$v3     ;add y to product
      sv      $v4,0($s1)      ;store the result
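For reference, both sequences compute the same loop; in C it is simply the following (a sketch: the MIPS code above hard-codes 64 doubles, i.e., 64 × 8 bytes = the #512 upper bound):

/* DAXPY, Y = a*X + Y, over n elements (n = 64 in the MIPS code above). */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}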

History of GPUs
Early video cards
– Frame buffer memory with address generation for video output
3D graphics processing
– Originally on high-end computers (e.g., SGI workstations)
– Moore's Law → lower cost, higher density
– 3D graphics cards for PCs and game consoles
Graphics Processing Units (GPUs)
– Processors oriented to 3D graphics tasks
– Vertex/pixel processing, shading, texture mapping, rasterization

GPU Architectures
Processing is highly data-parallel
– GPUs are highly multithreaded
– Thread switching hides memory latency
Less reliance on multi-level caches
– Graphics memory is wide and high-bandwidth
Trend toward general-purpose GPUs
– Heterogeneous CPU/GPU systems
– CPU for sequential code, GPU for parallel code
Programming languages/APIs
– DirectX, OpenGL
– C for Graphics (Cg), High Level Shader Language (HLSL)
– Compute Unified Device Architecture (CUDA); see the kernel sketch below
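As an illustration of the CUDA model, here is a minimal kernel for the DAXPY example from earlier: each thread handles one element, so the loop disappears into the thread grid. The names d_x and d_y and the launch geometry are our assumptions, not from the slides:

#include <cuda_runtime.h>

/* Each of n threads computes one element of Y = a*X + Y. */
__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                        /* the grid may be larger than n */
        y[i] = a * x[i] + y[i];
}

/* Host-side launch, assuming d_x and d_y already copied to device memory:
   daxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y); */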

Interconnection Networks
Network topologies
– Arrangements of processors, switches, and links
[Figure: example topologies: bus, ring, 2D mesh, N-cube (N = 3), fully connected]
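One rough way to compare these topologies is by total link count, a first-order proxy for the cost discussed on the next slide. A small C sketch using the standard formulas (the mesh and cube cases assume p is a perfect square and a power of two, respectively):

#include <math.h>
#include <stdio.h>

/* Total link counts for p processors under each topology. */
int main(void) {
    int p = 64;
    int k = (int)sqrt((double)p);            /* mesh side: assumes p = k*k */
    int d = (int)round(log2((double)p));     /* cube dimension: p = 2^d */
    printf("bus:             1 shared link\n");
    printf("ring:            %d links\n", p);
    printf("2D mesh:         %d links\n", 2 * k * (k - 1));
    printf("%d-cube:          %d links\n", d, d * p / 2);
    printf("fully connected: %d links\n", p * (p - 1) / 2);
    return 0;
}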

Network Characteristics
Performance
– Latency per message (on an unloaded network)
– Throughput
– Congestion delays (dependent on traffic)
Cost
Power
Routability in silicon

Example!

Consider two multiprocessor machines with the following configurations:
– Machine A: two processors, each with 512 MB of local memory; local memory access latency is 20 cycles per word, remote memory access latency is 60 cycles per word.
– Machine B: two processors sharing 1 GB of memory with an access latency of 40 cycles per word.
Suppose an application runs two threads, one on each processor, and each thread must access an entire array of 4096 words.
– Is it possible to partition this array across the local memories of machine A so that the application runs faster on A than on B? If so, specify the partitioning. If not, by how many cycles must B's memory latency worsen for some partitioning on A to run faster than B?
Assume that memory operations dominate the execution time.

Answer!

Suppose one processor holds x of the words locally and the other holds the remaining T - x, where T = 4096 is the array size. Each thread accesses all T words, so:

Execution time on A = max(20x + 60(T - x), 60x + 20(T - x)) = max(60T - 40x, 20T + 40x)

The max is minimized at x = T/2, where both terms equal 40T cycles.

Execution time on B = 40T cycles.

So no partitioning makes A faster than B; the best A can do is tie. However, if B's access latency were one cycle worse (41 cycles per word), B would take 41T cycles and the even split on A would be faster.
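A quick numeric check of this analysis (a plain C sketch): sweep every possible split x and confirm that the best machine-A time exactly matches machine B's 40T cycles:

#include <stdio.h>

int main(void) {
    const long T = 4096;                    /* words each thread accesses */
    long best_a = -1;
    for (long x = 0; x <= T; x++) {
        long t1 = 20 * x + 60 * (T - x);    /* thread local to the x words */
        long t2 = 60 * x + 20 * (T - x);    /* thread local to the T-x words */
        long t  = t1 > t2 ? t1 : t2;        /* threads run in parallel */
        if (best_a < 0 || t < best_a) best_a = t;
    }
    printf("best A: %ld cycles, B: %ld cycles\n", best_a, 40 * T);
    return 0;   /* prints 163840 for both: A can at best tie B */
}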