Compsci 220 / ECE 252 Computer Architecture
Data Parallel: Vectors, SIMD, GPGPU
Slides originally developed by Amir Roth with contributions by Milo Martin at the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood, and University of Eindhoven slides by Prof. dr. Henk Corporaal and Dr. Bart Mesman.

Administrative
Work on projects
Homework #5

Best Way to Compute This Fast?
Sometimes you want to perform the same operations on many data items
• Surprise example: SAXPY
One approach: superscalar (instruction-level parallelism)
• Loop unrolling with static scheduling, or dynamic scheduling (see the unrolling sketch below)
• Problem: wide-issue superscalar scaling issues
  - N² bypassing, N² dependence checks, wide fetch
  - More register file & memory traffic (ports)
Can we do better?

    for (I = 0; I < 1024; I++)
        Z[I] = A*X[I] + Y[I];

    0: ldf  X(r1),f1    // I is in r1
       mulf f0,f1,f2    // A is in f0
       ldf  Y(r1),f3
       addf f2,f3,f4
       stf  f4,Z(r1)
       addi r1,4,r1
       blti r1,4096,0
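
To make the "loop unrolling with static scheduling" option concrete, here is a minimal sketch in plain C (the function name saxpy_unrolled and the unroll factor of 4 are my own choices, not from the slides). Unrolling exposes several independent multiply-adds per iteration for a wide-issue core to schedule, but every element still costs a full scalar instruction sequence.

    /* SAXPY unrolled by 4: exposes four independent multiply-adds per
     * iteration for a wide-issue superscalar core to schedule.
     * Assumes the trip count (1024) is a multiple of 4. */
    void saxpy_unrolled(float A, const float *X, const float *Y, float *Z) {
        for (int i = 0; i < 1024; i += 4) {
            Z[i + 0] = A * X[i + 0] + Y[i + 0];
            Z[i + 1] = A * X[i + 1] + Y[i + 1];
            Z[i + 2] = A * X[i + 2] + Y[i + 2];
            Z[i + 3] = A * X[i + 3] + Y[i + 3];
        }
    }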

Better Alternative: Data-Level Parallelism
Data-level parallelism (DLP)
• Single operation repeated on multiple data elements
• SIMD (Single-Instruction, Multiple-Data)
• Less general than ILP: parallel insns are all the same operation
• Exploit with vectors
Old idea: Cray-1 supercomputer from late 1970s
• Eight 64-entry x 64-bit floating point "vector registers"
  - 4096 bits (0.5KB) in each register! 4KB for the vector register file
• Special vector instructions to perform vector operations
  - Load vector, store vector (wide memory operation)
  - Vector+Vector addition, subtraction, multiply, etc.
  - Vector+Constant addition, subtraction, multiply, etc.
  - In Cray-1, each instruction specifies 64 operations!

Example Vector ISA Extensions
Extend ISA with floating point (FP) vector storage …
• Vector register: fixed-size array of 32- or 64-bit FP elements
• Vector length: for example 4, 8, 16, 64, …
… and example operations for vector length of 4
• Load vector: ldf.v X(r1),v1
    ldf X+0(r1),v1[0]
    ldf X+1(r1),v1[1]
    ldf X+2(r1),v1[2]
    ldf X+3(r1),v1[3]
• Add two vectors: addf.vv v1,v2,v3
    addf v1[i],v2[i],v3[i]  (where i is 0,1,2,3)
• Add vector to scalar: addf.vs v1,f2,v3
    addf v1[i],f2,v3[i]  (where i is 0,1,2,3)

Example Use of Vectors – 4-wide
Operations
• Load vector: ldf.v X(r1),v1
• Multiply vector by scalar: mulf.vs v1,f2,v3
• Add two vectors: addf.vv v1,v2,v3
• Store vector: stf.v v1,X(r1)
Performance?
• If CPI is one, 4x speedup
• But vector instructions don't always have single-cycle throughput
• Execution width (implementation) vs vector width (ISA)

Scalar loop (7 insns per element, 1024 iterations → 7x1024 instructions):
    ldf  X(r1),f1
    mulf f0,f1,f2
    ldf  Y(r1),f3
    addf f2,f3,f4
    stf  f4,Z(r1)
    addi r1,4,r1
    blti r1,4096,0

Vector loop (7 insns per 4 elements, 256 iterations → 7x256 instructions, 4x fewer):
    ldf.v   X(r1),v1
    mulf.vs v1,f0,v2
    ldf.v   Y(r1),v3
    addf.vv v2,v3,v4
    stf.v   v4,Z(r1)
    addi    r1,16,r1
    blti    r1,4096,0

Vector Datapath & Implementation
Vector insns are just like normal insns… only "wider"
• Single instruction fetch (no extra N² checks)
• Wide register read & write (not multiple ports)
• Wide execute: replicate floating point unit (same as superscalar)
• Wide bypass (avoid N² bypass problem)
• Wide cache read & write (single cache tag check)
Execution width (implementation) vs vector width (ISA)
• Example: Pentium 4 and "Core 1" execute vector ops at half width
• "Core 2" executes them at full width
Because they are just instructions…
• …superscalar execution of vector instructions is common
• Multiple n-wide vector instructions per cycle

Intel's SSE2/SSE3/SSE4…
Intel SSE2 (Streaming SIMD Extensions 2)
• 16 128-bit floating point registers (xmm0–xmm15)
• Each can be treated as 2x64b FP or 4x32b FP ("packed FP"; packed views sketched below)
• Or 2x64b or 4x32b or 8x16b or 16x8b ints ("packed integer")
• Or 1x64b or 1x32b FP (just normal scalar floating point)
• Original SSE: only 8 registers, no packed integer support
Other vector extensions
• AMD 3DNow!: 64b (2x32b)
• PowerPC AltiVec/VMX: 128b (2x64b or 4x32b)
Looking forward for x86
• Intel's "Sandy Bridge" will bring 256-bit vectors to x86
• Intel's "Larrabee" graphics chip will bring 512-bit vectors to x86
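
As a rough picture of the packed views listed above, the same 128-bit register can be modeled as a C union; the type and field names here are hypothetical, purely for illustration, and say nothing about how Intel actually names these views.

    /* Illustrative only: packed views of one 128-bit SSE register,
     * modeled as a C union (names are hypothetical). */
    #include <stdint.h>

    typedef union {
        double  f64[2];   /* 2 x 64-bit FP  ("packed double") */
        float   f32[4];   /* 4 x 32-bit FP  ("packed single") */
        int64_t i64[2];   /* 2 x 64-bit integers              */
        int32_t i32[4];   /* 4 x 32-bit integers              */
        int16_t i16[8];   /* 8 x 16-bit integers              */
        int8_t  i8[16];   /* 16 x 8-bit integers              */
    } xmm_view_t;         /* total size: 128 bits (16 bytes)  */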

Other Vector Instructions
These target specific domains: e.g., image processing, crypto
• Vector reduction (sum all elements of a vector)
• Geometry processing: 4x4 translation/rotation matrices
• Saturating (non-overflowing) subword add/sub: image processing
• Byte asymmetric operations: blending and composition in graphics
• Byte shuffle/permute: crypto
• Population (bit) count: crypto
• Max/min/argmax/argmin: video codec
• Absolute differences: video codec
• Multiply-accumulate: digital-signal processing
More advanced (but in Intel's Larrabee)
• Scatter/gather loads: indirect load (or store) from a vector of pointers
• Vector mask: predication (conditional execution) of specific elements (see the sketch below)
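
To show what vector masking buys, here is a scalar sketch of predicated execution over one vector operation; the function name, mask encoding, and vlen parameter are invented for illustration and do not follow any particular ISA.

    /* Sketch of vector masking (predication): element i is updated only
     * if bit i of the mask is set; masked-off elements are untouched. */
    void masked_addf_vv(const float *v1, const float *v2, float *v3,
                        unsigned mask, int vlen) {
        for (int i = 0; i < vlen; i++) {
            if (mask & (1u << i))        /* predicate for element i */
                v3[i] = v1[i] + v2[i];   /* operation executes      */
            /* else: element i is suppressed, v3[i] left unchanged  */
        }
    }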

Using Vectors in Your Code
Write in assembly
• Ugh
Use "intrinsic" functions and data types
• For example: _mm_mul_ps() and the __m128 datatype (example below)
Use a library someone else wrote
• Let them do the hard work
• Matrix and linear algebra packages
Let the compiler do it (automatic vectorization)
• GCC's "-ftree-vectorize" option
• Doesn't yet work well for C/C++ code (old, very hard problem)
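
As a concrete example of the intrinsics route, here is SAXPY written with SSE single-precision intrinsics. _mm_mul_ps() and __m128 are the names from the slide; the surrounding calls (_mm_set1_ps, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps) are standard SSE intrinsics, and the function name plus the assumption that n is a multiple of 4 are mine.

    #include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_mul_ps, ... */

    /* SAXPY with SSE intrinsics, 4 floats per iteration.
     * Assumes n is a multiple of 4; uses unaligned loads/stores. */
    void saxpy_sse(int n, float A, const float *X, const float *Y, float *Z) {
        __m128 a = _mm_set1_ps(A);                       /* broadcast A to 4 lanes */
        for (int i = 0; i < n; i += 4) {
            __m128 x = _mm_loadu_ps(&X[i]);              /* load 4 elements of X   */
            __m128 y = _mm_loadu_ps(&Y[i]);              /* load 4 elements of Y   */
            __m128 z = _mm_add_ps(_mm_mul_ps(a, x), y);  /* A*X + Y                */
            _mm_storeu_ps(&Z[i], z);                     /* store 4 results        */
        }
    }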

SIMD Systems
Massively parallel systems
• Array of very simple processors
• Broadcast one instruction: each processor executes on local data
• Connection Machines: CM-1, CM-2, CM-5 (SPARC processors…?)
• MasPar
• Goodyear
But… isn't this what GPUs have?

GPGPU: Nvidia Tesla

Streaming Multiprocessor (SM)
- Each SM has 8 Scalar Processors (SPs)
- IEEE 754 32-bit floating point support (incomplete support)
- Each SP is a 1.35 GHz processor (32 GFLOPS peak)
- Supports 32- and 64-bit integers
- 8,192 dynamically partitioned 32-bit registers
- Supports 768 threads in hardware (24 SIMT warps of 32 threads)
- Thread scheduling done in hardware
- 16KB of low-latency shared memory
- 2 Special Function Units (reciprocal square root, trig functions, etc.)
Each GPU has 16 SMs…

Nvidia Fermi: 512 PEs

Fine-Grained Interleaved Threading
Pros: reduced cache size, no branch predictor, no OOO scheduler
Cons: register pressure, thread scheduler, requires huge parallelism
(Figure: without and with fine-grained interleaved threading)

Step 2: Single Instruction Multiple Data
SSE has 4 data lanes; a GPU has 8/16/24/... data lanes
• GPU uses wide SIMD: 8/16/24/... processing elements (PEs)
• CPU uses short SIMD: usually a vector width of 4

Hardware Support
Supporting interleaved threading + SIMD execution

Single Instruction Multiple Thread (SIMT)
Hide vector width using scalar threads.

Example of SIMT Execution
Assume 32 threads are grouped into one warp (a minimal CUDA sketch follows).
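
To connect SIMT back to the running SAXPY example, here is a minimal CUDA sketch: each scalar thread computes one element, and the hardware groups 32 consecutive threads into a warp that executes one common instruction at a time. The kernel name and launch configuration are illustrative choices, not taken from the original slides.

    /* Minimal CUDA sketch: SAXPY as scalar threads.  Each thread handles
     * one element; hardware groups 32 threads into a warp (SIMT). */
    __global__ void saxpy_kernel(int n, float A, const float *X,
                                 const float *Y, float *Z) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread id */
        if (i < n)                   /* guard: extra threads do nothing     */
            Z[i] = A * X[i] + Y[i];
    }

    /* Host-side launch (device pointers dX, dY, dZ assumed allocated and
     * copied beforehand):
     *   int threads = 256;                          // 8 warps per block
     *   int blocks  = (n + threads - 1) / threads;
     *   saxpy_kernel<<<blocks, threads>>>(n, A, dX, dY, dZ);
     */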

Step 3: Simple Core
The Streaming Multiprocessor (SM) is a lightweight core compared to an IA (x86) core.
• Lightweight PE: Fused Multiply-Add (FMA)
• SFU: Special Function Unit

NVIDIA's Motivation for the Simple Core
"This [multiple IA-core] approach is analogous to trying to build an airplane by putting wings on a train."
-- Bill Dally, NVIDIA

Review: How Did We Get Here?
NVIDIA Fermi, 512 Processing Elements (PEs)

ATI Radeon 5000 Series Architecture

Throughput-Oriented Architectures
1. Fine-grained interleaved threading (~2x compute density)
2. SIMD/SIMT (>10x compute density)
3. Simple core (~2x compute density)
Key architectural features of a throughput-oriented processor.
Ref: Michael Garland and David B. Kirk, "Understanding Throughput-Oriented Architectures", CACM, 2010.