1 Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Computer Architecture A Quantitative Approach, Fifth Edition
2 Introduction SIMD architectures can exploit significant data-level parallelism for: matrix-oriented scientific computing media-oriented image and sound processors SIMD is more energy efficient than MIMD Needs to fetch one instruction for multiple data operation Makes SIMD attractive for personal mobile devices SIMD allows programmer to continue to think sequentially but achieves parallelism. Introduction
3 SIMD Parallelism Vector architectures became widespread because of hardware and cache improvements SIMD ISA extensions MMX (Multimedia Extensions) in 1996 SSE (Streaming SIMD extensions) till 2006 AVX (advanced Vector Extensions) till now. Graphics Processor Units (GPUs) Separate system and graphics CPU & memory. Introduction
4 Vector Architectures Basic idea: Read sets of data elements into “vector registers” Operate on those registers Disperse the results back into memory Registers are controlled by compiler Used to hide memory latency Leverage memory bandwidth Vector Architectures
5 VMIPS Loosely based on Cray-1 Vector registers Each register holds a 64-element, 64 bits/element vector Can also be bit, bit, or bit vectors Register file has 16 read and 8 write ports Vector functional units Fully pipelined Data and control hazards are detected Vector load-store unit Fully pipelined One word per clock cycle Scalar registers 32 general-purpose registers 32 floating-point registers Vector Architectures
6 VMIPS Instructions ADDVV.D: add two vectors ADDVS.D: add vector to a scalar LV/SV: vector load and vector store from address Example: DAXPY L.DF0,a; load scalar a LVV1,Rx; load vector X MULVS.DV2,V1,F0; vector-scalar multiply LVV3,Ry; load vector Y ADDVVV4,V2,V3; add SVRy,V4; store the result Requires 6 instructions vs. almost 600 for MIPS Vector Architectures
7 VMIPS Loops are vectorizable if there is no loop-carried dependence. Dependent operations within a loop are chained together in a vector instruction. Loop unrolling can reduce MIPS stalls but is not as efficient.
8 Vector Execution Time Execution time depends on three factors: Length of operand vectors Structural hazards Data dependencies VMIPS functional units consume one element per clock cycle Execution time is approximately the vector length Convoy Set of vector instructions that could potentially execute together Vector Architectures
9 Vector execution time Sequences with read-after-write dependency hazards can be in the same convey via chaining. Chaining Allows a vector operation to start as soon as the individual elements of its vector source operand become available. Similar to forwarding in scalar pipeline. Chime Unit of time to execute one convoy m convoys executes in m chimes A vector length of n, requires m x n clock cycles Vector Architectures
10 Example LVV1,Rx;load vector X MULVS.DV2,V1,F0;vector-scalar multiply LVV3,Ry;load vector Y ADDVV.DV4,V2,V3;add two vectors SVRy,V4;store the sum Convoys: 1LVMULVS.D 2LVADDVV.D 3SV(has structural hazard on LV) 3 chimes For 64 element vectors, requires 64 x 3 = 192 clock cycles Vector Architectures
11 Challenges Start up time Latency of vector functional unit, pipeline depth Assume the same as Cray-1 Floating-point add => 6 clock cycles Floating-point multiply => 7 clock cycles Floating-point divide => 20 clock cycles Vector load => 12 clock cycles Improvements : > 1 element per clock cycle Non-64 wide vectors IF statements in vector code Memory system optimizations to support vector processors Multiple dimensional matrices Sparse matrices Vector Architectures
12 Multiple Lanes Element n of vector register A is “hardwired” to element n of vector register B Allows for multiple hardware lanes Vector Architectures
13 Vector Length Register Vector length not known at compile time, may be less than 64. Use Vector Length Register (VLR), especially in LV/SV. Use strip mining for vectors over the maximum length: low = 0; VL = (n % MVL); /*find odd-size piece using modulo op % */ for (j = 0; j <= (n/MVL); j=j+1) { /*outer loop*/ for (i = low; i < (low+VL); i=i+1) /*runs for length VL*/ Y[i] = a * X[i] + Y[i] ; /*main operation*/ low = low + VL; /*start of next vector*/ VL = MVL; /*reset the length to maximum vector length*/ } Vector Architectures
14 Vector Mask Registers Conditional loops affect vectorization, because of control dependences. Consider: for (i = 0; i < 64; i=i+1) if (X[i] != 0) X[i] = X[i] – Y[i]; Use vector mask register to “disable” elements: LVV1,Rx;load vector X into V1 LVV2,Ry;load vector Y L.DF0,#0;load FP zero into F0 SNEVS.DV1,F0;sets VM(i) to 1 if V1(i)!=F0 SUBVV.DV1,V1,V2;subtract under vector mask SVRx,V1;store the result in X The operation is done on elements that their vector mask is 1; leaving others unchanged. Vector Architectures
15 Memory Banks Memory system must be designed to support high bandwidth for vector loads and stores Spread accesses across multiple banks Control bank addresses independently Load or store non sequential words Support multiple vector processors sharing the same memory Example: 32 processors, each generating 4 loads and 2 stores/cycle Processor cycle time is ns, SRAM cycle time is 15 ns How many memory banks needed? Vector Architectures
16 Stride: Handling multidimensional arrays Consider matrix multiplication program: for (i = 0; i < 100; i=i+1) for (j = 0; j < 100; j=j+1) { A[i][j] = 0.0; for (k = 0; k < 100; k=k+1) A[i][j] = A[i][j] + B[i][k] * D[k][j]; } multiplication of rows of B with columns of D are vectorized. If matrices are stored row major, B elements has a unit (1) stride, while D elements has 100 stride. Use non-unit stride memory access (LVWS, SVWS) to fetch the data. Bank conflict (1 cycle stall) occurs when the same bank is hit faster than bank busy time: #banks / LCM(stride,#banks) < bank busy time Vector Architectures
17
18 Scatter-Gather Consider a sparse matrix addition like below: for (i = 0; i < n; i=i+1) A[K[i]] = A[K[i]] + C[M[i]]; K, M are index vectors. Use index vectors to gather nonzero elements from memory and scatter them after manipulation: LVVk, Rk;load K LVIVa, (Ra+Vk);load A[K[]] LVVm, Rm;load M LVIVc, (Rc+Vm);load C[M[]] ADDVV.DVa, Va, Vc;add them SVI(Ra+Vk), Va;store A[K[]] Vector Architectures
19 SIMD ISA Extensions for Multimedia Media applications operate on data types narrower than the native sizes. 8 bit color + 8 bit transparency + 8 bit audio. A 256-bit adder can be “partitioned” to 32 8-bit adders or so on. Use much smaller register files than the vector procesors. Limitations, compared to vector instructions: Number of data operands are encoded into opcode, leading to so many instructions vs. using vector length registers in vector processors. No sophisticated addressing modes (strided, scatter-gather) No mask registers SIMD Instruction Set Extensions for Multimedia
20 SIMD Implementations Intel MMX (1996) Use 64-bit floating point registers for: Eight 8-bit integer ops or four 16-bit integer ops So many DSP and multimedia instructions were added. Streaming SIMD Extensions (SSE 1,2,3,4) ( ) Added separate 128-bit registers. Eight 16-bit integer ops Double precision (64-bit) floating point data type was added to SIMD since SSE2 Four 32-bit integer/fp ops or two 64-bit integer/fp ops Advanced Vector Extensions (2010) Introduced 256-bit floating point register to double the BW. SIMD Instruction Set Extensions for Multimedia
21 SIMD extension benefits Little implementation cost of ALUs. Little state information compared to vector processors. Lower memory BW required. Doesn’t encounter virtual memory issues because of lower memory usage.
22 Copyright © 2012, Elsevier Inc. All rights reserved. Example SIMD Code (4x speedup compared to SISD) Example DXPY: L.DF0,a;load scalar a MOVF1, F0;copy a into F1 for SIMD MUL MOVF2, F0;copy a into F2 for SIMD MUL MOVF3, F0;copy a into F3 for SIMD MUL DADDIUR4,Rx,#512;last address to load Loop: L.4D F4,0[Rx];load X[i], X[i+1], X[i+2], X[i+3] MUL.4DF4,F4,F0;a×X[i],a×X[i+1],a×X[i+2],a×X[i+3] L.4DF8,0[Ry];load Y[i], Y[i+1], Y[i+2], Y[i+3] ADD.4DF8,F8,F4;a×X[i]+Y[i],..., a×X[i+3]+Y[i+3] S.4D0[Ry],F8;store into Y[i], Y[i+1], Y[i+2], Y[i+3] DADDIURx,Rx,#32;increment index to X DADDIURy,Ry,#32;increment index to Y DSUBUR20,R4,Rx;compute bound BNEZR20,Loop;check if done SIMD Instruction Set Extensions for Multimedia
23 Roofline Performance Model Basic idea: Plot peak floating-point throughput as a function of arithmetic intensity Ties together floating-point performance and memory performance for a target machine Arithmetic intensity Floating-point operations per byte read from memory GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating Point Perf.) SIMD Instruction Set Extensions for Multimedia Ridge point
24 Graphical Processing Units Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications? Basic idea: Heterogeneous execution model CPU is the host, GPU is the device Develop a C-like programming language for GPU CUDA: Compute Unified Device Architecture. Developed by NVIDIA OpenCL : similar open language. Unify all forms of GPU parallelism as CUDA thread Programming model is “Single Instruction Multiple Thread” (SIMT) 32 threads form a thread block to be executed. Graphical Processing Units
25 Multithreading Hiding the instruction execution stalls and memory stalls, by doing multiple parallel tasks. e.g. processing parallel queries, separate dimensions of a three dimensional problem, or handling multiple applications in a personal computer. Requires multiple separate register files, PC, and page tables for each thread. Memory is handled and shared by virtualization. Program must contain multiple threads.
26 Hardware approaches to multithreading Fine-grained multithreading: Switches between threads at each clock. Skipping the stalled threads. Threads are interleaved. Increases throughput but also increases per thread latency. Used in Sun Niagara and NVIDIA GPUs. Coarse-grained multithreading: Switches the threads on costly stalls like L2, or L3 cache misses. No processor have used it. Simultaneous multithreading (SMT): Uses multiple issue & dynamic scheduling in addition to fine-grained multithreading Most common implementation, in CPUs.
27 MT comparison
28 A CUDA example for DAXPY Each loop iteration becomes a thread: Graphical Processing Units
29 NVIDIA GPU Architecture Similarities to vector machines: Works well with data-level parallel problems Scatter-gather transfers Mask registers Large register files Differences: No scalar processor Uses multithreading to hide memory latency Has many functional units, as opposed to a few deeply pipelined units like a vector processor Graphical Processing Units
30 GPU terms
31 Example Multiply two vectors of length 8192 Code that works over all elements is the grid Thread blocks break this down into manageable sizes 512 threads per block Thus grid size = 16 blocks SIMD instruction executes 32 elements at a time Thread block is analogous to a vector with length of 32 Each block is assigned to a multithreaded SIMD processor by the thread block scheduler Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors Graphical Processing Units
32
33 Terminology Threads of SIMD instructions Is 32 elements wide. Each has its own PC Thread scheduler uses scoreboard to dispatch instructions No data dependencies between threads. Keeps track of up to 48 threads of SIMD instructions Thread block scheduler assigns blocks to SIMD processors Within each SIMD processor: 16 SIMD lanes in Fermi, hence each thread is completed in 2 cylces. Has 32, bit registers. Each lane contain 2048 = 64 * 32 32bit or 32*32 64bit registers Graphical Processing Units
34 Multithreaded SIMD processor architecture
35 Example NVIDIA GPU has 32,768 registers Divided into lanes Each SIMD thread is limited to 64 registers SIMD thread has up to: 64 vector registers of bit elements 32 vector registers of bit elements Fermi has 16 SIMD lanes, each containing 2048 registers Graphical Processing Units
36 NVIDIA Instruction Set Arch. “Parallel Thread Execution (PTX)” Translation to microinstructions is performed in software PTX format: Opcode.type d,a,b,c; R0, R1, … are used for 32 bit registers, RD0, RD1, … used for 64 bit registers Graphical Processing Units
37 Some instructions
38 Conditional Branching Like vector architectures, GPU branch hardware uses internal masks Per-thread-lane 1-bit predicate register Also uses Branch synchronization stack Instruction markers to manage when a branch diverges into multiple execution paths Push on divergent branch …and when paths converge Pops stack Graphical Processing Units
39 Example if (X[i] != 0) X[i] = X[i] – Y[i]; else X[i] = Z[i]; Compiled to: ld.global.f64RD0, [X+R8]; RD0 = X[i] setp.neq.s32P1, RD0, #0; P1 is predicate register bra ELSE1, *Push; Push old mask, set new mask bits ; if P1 false, go to ELSE1 ld.global.f64RD2, [Y+R8]; RD2 = Y[i] sub.f64RD0, RD0, RD2; Difference in RD0 st.global.f64[X+R8], RD0; X[i] = ENDIF1, *Comp; complement mask bits ; if P1 true, go to ENDIF1 ELSE1:ld.global.f64 RD0, [Z+R8]; RD0 = Z[i] st.global.f64 [X+R8], RD0; X[i] = RD0 ENDIF1:, *Pop; pop to restore old mask Graphical Processing Units
40 NVIDIA GPU Memory Structures Each SIMD Lane has private section of off-chip DRAM or on-chip cache called “Private memory”. Contains stack frame, spilling registers, and variables Each multithreaded SIMD processor also has a dhared local memory Dedicated to each thread block by processor. DRAM Memory shared by all SIMD processors is called GPU Memory Host can read and write GPU memory Graphical Processing Units
41 GPU Memory structure
42 Fermi Architecture Innovations Each SIMD processor has Two SIMD thread schedulers, two instruction dispatch units 2 sets of 16 SIMD lanes (SIMD width=32), 16 load-store units, 4 special function units Thus, two threads of SIMD instructions are scheduled every two clock cycles Fast double precision : peak performance 515 GFLOP/sec Caches for GPU memory: I,D L1 caches for each SIMD processor & a 768 KB shared L2 cache L1 caches are part of 64 KB local memory. 64-bit addressing Error correcting codes for memory and registers. Faster context switching Faster atomic instructions Graphical Processing Units
43 Fermi Multithreaded SIMD Proc. Graphical Processing Units
44 Vectors vs. GPUs GPUs use Multithreading Wider SIMD lanes 8-16 vs. 2-8 in Vectors. With 32 element threads or vectors this leads to 2-4 cycles/chime in GPU vs cycles/chime in Vectors. Scalar processor in Vectors Is faster then using host In GPUs, So GPUs use 1 lane For scalar operations which Is less power efficient.
45 4 lane Vector & SIMD architectures
46 SIMD extensions vs. GPUs
47 GPUs for mobiles and servers
48 Core-i7 vs. Tesla
49 Core-i7 vs. Tesla