CSE 431 Computer Architecture Fall 2008
Chapter 7B: SIMDs, Vectors, and GPUs
Mary Jane Irwin ( www.cse.psu.edu/~mji )
[Adapted from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2008, MK]

Flynn's Classification Scheme
- SISD – single instruction, single data stream
  - aka uniprocessor – what we have been talking about all semester
- SIMD – single instruction, multiple data streams
  - a single control unit broadcasting operations to multiple datapaths
- MISD – multiple instruction, single data stream
  - no such machine (although some people put vector machines in this category)
- MIMD – multiple instruction, multiple data streams
  - aka multiprocessors (SMPs, MPPs, clusters, NOWs)
Now obsolete terminology except for . . .

SIMD Processors
- A single control unit (one copy of the code)
- Multiple datapaths (Processing Elements – PEs) running in parallel
  - Q1 – PEs are interconnected (usually via a mesh or torus) and exchange/share data as directed by the control unit
  - Q2 – Each PE performs the same operation on its own local data

Example SIMD Machines
Did SIMDs die out in the early 1990s?

  System     Maker              Year  # PEs   # b/PE  Max memory (MB)  PE clock (MHz)  System BW (MB/s)
  Illiac IV  UIUC               1972  64      64      1                13              2,560
  DAP        ICL                1980  4,096   1       2                5               2,560
  MPP        Goodyear           1982  16,384  1       2                10              20,480
  CM-2       Thinking Machines  1987  65,536  1       512              7               16,384
  MP-1216    MasPar             1989  16,384  4       1,024            25              23,000

No – the answer is that they are now EVERYWHERE.

Multimedia SIMD Extensions
- The most widely used variation of SIMD is found in almost every microprocessor today, as the basis of the MMX and SSE instructions added to improve the performance of multimedia programs
- A single, wide ALU is partitioned into many smaller ALUs that operate in parallel (e.g., one 32-bit adder can serve as two 16-bit adders or four 8-bit adders)
- Loads and stores are simply as wide as the widest ALU, so the same data transfer can move one 32-bit value, two 16-bit values, or four 8-bit values
- There are now hundreds of SSE instructions in the x86 to support multimedia operations
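To make the partitioned-ALU idea concrete, here is a minimal sketch (not from the original slides) using C's SSE2 intrinsics, in which a single instruction performs sixteen 8-bit additions at once; the operand values are made up for illustration:

    /* One SSE2 instruction adds sixteen 8-bit lanes in parallel. */
    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void) {
        __m128i a   = _mm_set1_epi8(10);      /* sixteen 8-bit lanes, all 10 */
        __m128i b   = _mm_set1_epi8(3);       /* sixteen 8-bit lanes, all 3  */
        __m128i sum = _mm_add_epi8(a, b);     /* 16 parallel 8-bit adds      */

        unsigned char out[16];
        _mm_storeu_si128((__m128i *)out, sum);
        printf("lane 0 = %d\n", out[0]);      /* prints "lane 0 = 13" */
        return 0;
    }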

Vector Processors
- A vector processor (e.g., a Cray) pipelines the ALUs to get good performance at lower cost. A key feature is a set of vector registers to hold the operands and results.
  - Collect the data elements from memory, put them in order into a large set of registers, operate on them sequentially in registers, and then write the results back to memory
- Vector processors formed the basis of supercomputers in the 1980s and '90s
- Consider extending the MIPS instruction set (VMIPS) to include vector instructions, e.g.:
  - addv.d adds two double-precision vector register values
  - addvs.d and mulvs.d add (or multiply) a scalar register to (by) each element in a vector register
  - lv and sv do vector load and vector store of an entire vector of double-precision data

MIPS vs. VMIPS DAXPY Code: Y = a × X + Y

DAXPY – Double-precision a × X Plus Y – forms the inner loop of the Linpack benchmark. Assume that the starting address of X is in $s0 and that of Y is in $s1.

MIPS code:

          l.d    $f0,a($sp)       ;load scalar a
          addiu  r4,$s0,#512      ;upper bound of what to load
    loop: l.d    $f2,0($s0)       ;load X(i)
          mul.d  $f2,$f2,$f0      ;a × X(i)
          l.d    $f4,0($s1)       ;load Y(i)
          add.d  $f4,$f4,$f2      ;a × X(i) + Y(i)
          s.d    $f4,0($s1)       ;store into Y(i)
          addiu  $s0,$s0,#8       ;increment X index
          addiu  $s1,$s1,#8       ;increment Y index
          subu   $t0,r4,$s0       ;compute bound
          bne    $t0,$zero,loop   ;check if done

VMIPS code:

          l.d     $f0,a($sp)      ;load scalar a
          lv      $v1,0($s0)      ;load vector X
          mulvs.d $v2,$v1,$f0     ;vector-scalar multiply
          lv      $v3,0($s1)      ;load vector Y
          addv.d  $v4,$v2,$v3     ;add Y to a × X
          sv      $v4,0($s1)      ;store vector result
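For reference, this is the computation both instruction sequences implement, written as a scalar C loop (a sketch added for clarity, not part of the slides); the MIPS code runs it for 64 doubles, hence the #512 upper bound (64 elements × 8 bytes):

    /* Scalar C rendering of DAXPY: Y = a*X + Y over n doubles. */
    void daxpy(double a, const double *x, double *y, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }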

Vector versus Scalar
- Instruction fetch and decode bandwidth is dramatically reduced (which also saves power)
  - Only 6 instructions in VMIPS versus almost 600 in MIPS for a 64-element DAXPY
- The hardware doesn't have to check for data hazards within a vector instruction. A vector instruction stalls only for the first element; subsequent elements flow smoothly down the pipeline. And control hazards are nonexistent.
  - The MIPS stall frequency is about 64 times higher than VMIPS for DAXPY
- It is easier to write code for data-level parallel applications
- Vector code has a known access pattern to memory, so heavily interleaved memory banks work well. The cost of latency to memory is seen only once for the entire vector.
- Recent announcements from Intel suggest that vectors will play a bigger role in commodity processors. Intel's Advanced Vector Extensions (AVX), due to arrive in 2010, will expand the width of the SSE registers from 128 bits to 256 bits and eventually to 1024 bits (16 double-precision floating-point numbers). And Larrabee is reputed to have vector instructions.
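To connect the AVX announcement to the DAXPY example, here is a hedged sketch of the loop written with 256-bit AVX intrinsics, four doubles per instruction (not from the original slides; it assumes n is a multiple of 4 and omits the tail loop a real version would need):

    #include <immintrin.h>   /* AVX intrinsics (256-bit registers) */

    /* DAXPY with AVX: each instruction operates on 4 double-precision lanes. */
    void daxpy_avx(double a, const double *x, double *y, int n) {
        __m256d va = _mm256_set1_pd(a);            /* broadcast a to all 4 lanes */
        for (int i = 0; i < n; i += 4) {
            __m256d vx = _mm256_loadu_pd(x + i);   /* load 4 elements of X */
            __m256d vy = _mm256_loadu_pd(y + i);   /* load 4 elements of Y */
            __m256d r  = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);
            _mm256_storeu_pd(y + i, r);            /* store 4 results to Y */
        }
    }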

Example Vector Machines

  System           Maker  Year  Peak perf.        # vector processors  PE clock (MHz)
  STAR-100         CDC    1970  ??                2                    113
  ASC              TI     –     20 MFLOPS         1, 2, or 4           16
  Cray 1           Cray   1976  80 to 240 MFLOPS  1                    80
  Cray Y-MP        Cray   1988  333 MFLOPS        2, 4, or 8           167
  Earth Simulator  NEC    2002  35.86 TFLOPS      8                    –

Did vector machines die out in the late 1990s?

The PS3 "Cell" Processor Architecture
- A non-SMP architecture: 234M transistors @ 4 GHz
- 1 Power Processing Element (PPE) "control" processor. The PPE is similar to a Xenon core.
  - Slight ISA differences, and fine-grained multithreading instead of real SMT
- 8 "Synergistic" (SIMD) Processing Elements (SPEs). The real compute power and the real differences lie in the SPEs (21M transistors each).
  - An attempt to 'fix' the memory latency problem by giving each SPE complete control over its own 256KB "scratchpad" memory (14M transistors), direct mapped for low latency
  - 4 vector units per SPE, 1 of everything else (7M transistors)
- 512KB L2 cache and a massively high-bandwidth (200GB/s) processor-memory bus

Notes (marketing-related): The PPE is so similar to the Xenon that, other than some specialized SIMD instructions, code is nearly compatible (instruction length also differs, but that's a 'minor' issue). What really matters is that Microsoft has a real leg up in 'mental pull' with developers, because code developed on the Xenon will compile and run, with very few modifications, on the PPE of the Cell. The Xenon has three "PPE-style" processors, allowing the primary development path to be MS-based: once a game works on the much more comfortable Xenon architecture, the team can try to move some rough segments onto the SPEs and hope for some speedup. The trick is that this way most of the development time goes into Xenon-native development rather than Cell-native development, which gives the dev team more time to optimize the Xenon code and, more importantly, tends to increase the amount of code that will eventually run on the PPE. A full Cell development process would start with the SPE sub-programs, but since that isn't a portable development process on either the Xbox or the Revolution, MS is hoping developers won't use it. By short-circuiting the PS3 development process with such a compatible and comfortable platform, MS hopes to reduce utilization of the SPEs, encourage over-reliance on the PPE, and so reduce the Cell's functional utilization.

How to Make Use of the SPEs
- Note that this process requires 8 SPEs, but only 7 are enabled in the PS3's Cell. As a result, some routines must run on the same SPE, resulting in lower performance.
- Also note that the memory subsystem on an average desktop machine delivers around 6.5 GB/s, and the graphics memory on a high-end video card gives maybe 25 GB/s. The bus transmitting all of this data gives 200 GB/s – enough for the PPE and all 7 SPEs to run at 25 GB/s each on the "EIB" (Element Interconnect Bus), which is what allows all of this performance to happen.
- The EIB is a 3-segment, 96 B/cycle bus, and it really is the backbone of the design. Without it, none of this would matter.

What about the Software?
- Uses IBM's special "Hypervisor" – like an OS for OSs
  - Runs both a real-time OS (for sound) and a non-real-time OS (for things like AI)
- Software must be specially coded to run well
  - The single PPE will quickly be bogged down, so the SPEs must be used wherever possible – and this isn't easy, by any standard
- What about Microsoft? Its development suite identifies the 6 threads you're expected to run
  - Four of them are DirectX-based and handled by the OS, so you only need to write two threads, functionally

Graphics Processing Units (GPUs)
- GPUs are accelerators that supplement a CPU, so they do not need to be able to perform all of the tasks of a CPU. They dedicate all of their resources to graphics.
  - The CPU-GPU combination is a form of heterogeneous multiprocessing
- Programming interfaces that are free from backward binary compatibility constraints have resulted in more rapid innovation in GPUs than in CPUs
  - Application programming interfaces (APIs) such as OpenGL and DirectX, coupled with high-level graphics shading languages such as NVIDIA's Cg and CUDA and Microsoft's HLSL
- GPU data types are vertices ((x, y, z, w) coordinates) and pixels ((red, green, blue, alpha) color components)
- GPUs execute many threads (e.g., vertex and pixel shading) in parallel – lots of data-level parallelism

Typical GPU Architecture Features
- Rely on having enough threads to hide the latency to memory (rather than on caches, as CPUs do)
  - Each GPU is highly multithreaded
- Use extensive parallelism to get high performance
  - They have extensive sets of SIMD instructions and are moving towards multicore
- Main memory is bandwidth-driven, not latency-driven
  - GPU DRAMs are wider and have higher bandwidth than CPU memories, but are typically smaller
- Leaders in the marketplace (in 2008):
  - NVIDIA GeForce 8800 GTX (16 multiprocessors, each with 8 multithreaded processing units)
  - AMD's ATI Radeon and ATI FireGL
  - Watch out for Intel's Larrabee GPGPU as well
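To make the latency-hiding point concrete, here is a sketch of SAXPY (the single-precision sibling of the DAXPY example above) as a CUDA kernel; the kernel name and launch configuration are illustrative, not from the slides, and single precision is used because 2008-era GPUs such as the GeForce 8800 did not support double precision:

    /* SAXPY on a GPU: one thread per element. The GPU hides DRAM latency
       by switching among thousands of these threads rather than relying
       on large caches. */
    __global__ void saxpy_gpu(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    /* Host-side launch (illustrative): 256 threads per block, with enough
       blocks to cover all n elements; d_x and d_y are device pointers set
       up with cudaMalloc/cudaMemcpy.
       saxpy_gpu<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);  */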

Multicore Xbox 360 – "Xenon" Processor
- Goal: provide game developers with a balanced and powerful platform
- Three SMT processors, each with 32KB L1 D$ and I$, sharing a 1MB unified L2 cache
  - 165M transistors total, running at 3.2 GHz
  - "Near"-POWER ISA; 2-issue, 21-stage pipeline, with 128 128-bit registers
  - Weak branch prediction, supported by software hinting; in-order instruction execution
  - Narrow cores: 2 INT units, 2 128-bit VMX units, 1 of everything else
- An ATI-designed 500 MHz GPU with 512MB of DDR3 DRAM
  - 337M transistors, 10MB framebuffer
  - 48 pixel shader cores, each with 4 ALUs

Notes: The 32-bit POWER ISA supports 32 registers natively, so moving to 128 registers requires 'cramming' 7-bit register operands in; no one knows how they do it, but it's quirky. The branch predictor is quite simple – probably a 1-bit predictor or a small 2-bit predictor – and Microsoft has presented a number of papers on how software-hinted and compiler-supported branch prediction can help. A "VMX" unit is the colloquial term for the AltiVec-like SIMD units on board, here custom-modified to support Direct3D data-format packing and unpacking. Also of note: the GPU is twice as big as the CPU, and the 10MB framebuffer is an off-chip high-speed memory used explicitly for full-screen anti-aliasing (FSAA). FSAA needs about 5 reads and 1 write per pixel, which quickly floods any memory subsystem; instead, it is built into the framebuffer itself – a very fast little chip that does nothing but hold the image and smooth it out.

Xenon Block Diagram

[Block diagram: three cores, each with L1 I$ and D$, share a 1MB unified L2; a BIU/IO interface and SMC connect the cores to the XMA decoder, the GPU (3D core with 10MB EDRAM and memory controllers MC0/MC1, driving video out through an analog chip), 512MB of DRAM, and the I/O complex (DVD, HDD port, two front USBs, two wireless MU ports, one rear USB, Ethernet, IR, audio out, flash, and system control).]

Notes: It is important to note the way data can be streamed from the L2 cache to the GPU. In particular, L2 banks can be 'locked' away from normal use and made available for direct FIFO access by the GPU. This lets the processor stream data into the GPU very efficiently, without clogging up the cache, while ensuring optimal bandwidth usage. It is especially useful in "procedural synthesis", where a template object (such as a tree) is programmatically modified slightly each time it is drawn to make it look natural: the locked cache allows FIFO streaming of such objects to the GPU without reducing the bandwidth available to the processor and without trashing the cache. Also of note: running two of the three processors at full tilt is just enough to feed the GPU at full rate. The system was meant for 6 threads, four of which are graphics threads doing procedural synthesis and the like.

Next Lecture and Reminders
- Next lecture: multiprocessor network topologies
  - Reading assignment: PH, Sections 9.4-9.7
- Reminders:
  - HW6 out November 13th and due December 11th
  - Check the on-line grade posting (by your midterm exam number) for correctness
  - The second evening midterm exam is scheduled for Tuesday, November 18, 20:15 to 22:15, in 262 Willard. Please let me know ASAP (via email) if you have a conflict.