Vector Processors Prof. Sivarama Dandamudi School of Computer Science Carleton University
Pipelining
  Vector machines exploit pipelining in all their activities
    Computations
    Movement of data to and from memory
  Pipelining provides overlapped execution
    Increases throughput
    Hides latency …
Pipelining (cont’d)
  Pipeline overlaps execution: 6 versus 18 cycles
Pipelining (cont’d)
  One measure of performance:
    $$\text{Speedup} = \frac{\text{Non-pipelined execution time}}{\text{Pipelined execution time}}$$
  Ideal case: an n-stage pipeline should give a speedup of n
  Two factors affect this:
    Pipeline fill
    Pipeline drain
Pipelining (cont’d)
  N computations, each takes n * T time
  Non-pipelined time = N * n * T
  Pipelined time = n * T + (N – 1) * T = (n + N – 1) * T
  $$\text{Speedup} = \frac{n N}{n + N - 1} = \frac{1}{1/N + 1/n - 1/(n N)}$$
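A minimal sketch of the speedup calculation, built only from the non-pipelined time N * n * T and the pipelined time (n + N – 1) * T given on this slide. The function name `pipeline_speedup` is an illustrative choice, not part of the original material. It shows how fill and drain keep the speedup below n until N becomes large.

```python
def pipeline_speedup(n_stages: int, n_computations: int) -> float:
    """Speedup of an n-stage pipeline over non-pipelined execution.

    Times are expressed in units of T (one stage time), so T cancels out.
    """
    non_pipelined = n_computations * n_stages      # N * n
    pipelined = n_stages + n_computations - 1      # n + N - 1
    return non_pipelined / pipelined


# The speedup approaches the pipeline depth n only as N grows.
for N in (1, 10, 100, 1000):
    print(N, round(pipeline_speedup(6, N), 2))
```

With a single computation the pipeline gives no benefit (speedup 1); the ideal speedup of n is only approached for long streams of computations, which is exactly the situation vector operations create.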
Pipelining (cont’d)
  [Figure: plot with curves for n = 3, n = 6, and n = 9]
Pipelining (cont’d)
  [Figure: plot with horizontal axis "Pipeline depth, n"]
Vector Machines
  Provide high-level operations
    Work on vectors (linear arrays of numbers)
  A typical vector operation
    Add two 64-element floating-point vectors
    Equivalent to an entire loop
  CRAY format
    V3 V2 VOP V1      V3 = V2 VOP V1
Vector Machines (cont’d)
  A vector machine consists of
    Scalar unit
      Works on scalars
      Address arithmetic
    Vector unit
      Responsible for vector operations
      Several vector functional units
        Integer add, FP add, FP multiply …
Vector Machines (cont’d)
  Two types of architecture
    Memory-to-memory architecture
      Vectors are memory resident
      The first vector machines were of this type
      Examples: CDC Star 100, CYBER 205
    Vector-register architecture
      Vectors are stored in registers
      Modern vector machines belong to this type
      Examples: Cray 1/2/X-MP/Y-MP, NEC SX/2, Fujitsu VP200, Hitachi S820
Components
  Primary components of a vector-register machine
    Vector registers
      Each register can hold a small vector
      Example: Cray-1 has 8 vector registers
        Each vector register can hold 64 doublewords (64-bit values)
      Two read ports and one write port
        Allows overlap among the vector operations
Cray-1 Architecture
Components
  Vector functional units
    Each unit is fully pipelined
      Can start a new operation on every clock cycle
    Cray-1 has six functional units
      FP add, FP multiply, FP reciprocal, Integer add, Logical, Shift
  Scalar registers
    Store scalars
    Compute addresses to pass on to the load/store unit
Components
  Vector load/store unit
    Moves vectors between memory and vector registers
    Load and store operations are pipelined
    Some processors have more than one load/store unit
      NEC SX/2 has 8 load/store units
  Memory
    Designed to allow pipelined access
    Typically uses interleaved memories
      Will discuss later
Some Example Vector Machines
  Table columns: Machine, Year, # VR, VR size, # LSUs
  Machines listed: CRAY, Cray Y-MP (# LSUs: loads/1 store), Fujitsu VP, Hitachi S, NEC SX/ (VR size: var.; # LSUs: 8), Convex C
Some Example Vector Machines (cont’d)
  Vector functional units
    Cray X-MP/Y-MP
      8 units
        FP add, FP multiply, FP reciprocal
        Integer add, 2 logical
        Shift
        Population count/parity
Some Example Vector Machines (cont’d)
  Vector functional units (cont’d)
    NEC SX/2
      16 units
        4 FP add, 4 FP multiply/divide
        4 Integer add/logical, 4 Shift
Advantages of Vector Machines
  Flynn’s bottleneck can be reduced
    Vector instructions significantly improve code density
      A single vector instruction specifies a great deal of work
      Reduces the number of instructions needed to execute a program
  Eliminate control overhead of a loop
    A vector instruction represents the entire loop
    Loop overhead can be substantial
Advantages of Vector Machines (cont’d)
  Impact of main memory latency can be reduced
    Vector instructions that access memory have a known pattern
      Pipelined access can be used
      Can exploit interleaved memory
    High latency associated with memory can be amortized over the entire vector
      Latency is not incurred for each data item, as it is when accessing a single floating-point number
Advantages of Vector Machines (cont’d)
  Control hazards can be reduced
    Vector machines organize data operands into regular sequences
      Suitable for pipelined access in hardware
    A vector operation stands in for a loop
  Data hazards can be eliminated
    Due to the structured nature of the data
    Allows planned prefetching of data
Example Problem
  A typical vector problem
    Y = a * X + Y
    X and Y are vectors
  This problem is known as
    SAXPY (single precision A*X Plus Y)
    DAXPY (double precision A*X Plus Y)
  SAXPY/DAXPY represents a small piece of code that takes most of the time in the benchmark
Example Problem (cont’d)
  Non-vector code fragment
          LD    F0,a
          ADDI  R4,Rx,#512   ;last address to load
    loop: LD    F2,0(Rx)     ;F2 := M[0+Rx], i.e., load X[i]
          MULT  F2,F0,F2     ;a*X[i]
Example Problem (cont’d)
          LD    F4,0(Ry)     ;load Y[i]
          ADD   F4,F2,F4     ;a*X[i] + Y[i]
          SD    F4,0(Ry)     ;store into Y[i]
          ADDI  Rx,Rx,#8     ;increment index to X
          ADDI  Ry,Ry,#8     ;increment index to Y
          SUB   R20,R4,Rx    ;R20 := R4-Rx
          JNZ   R20,loop     ;jump if not done
  9 instructions in the loop
Example Problem (cont’d)
  Vector code fragment
          LD      F0,a       ;load scalar a
          LV      V1,Rx      ;load vector X
          MULTSV  V2,F0,V1   ;V2 := F0 * V1
          LV      V3,Ry      ;load vector Y
          ADDV    V4,V2,V3   ;V4 := V2 + V3
          SV      Ry,V4      ;store the result
  Only 6 instructions (one scalar load plus five vector instructions)!
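As a rough illustration of the two code fragments, the following Python sketch mirrors them step by step. It is not how a vector machine executes the code; plain lists stand in for vector registers, and the function names `daxpy_scalar` and `daxpy_vector` are hypothetical. The comments map each line back to the instruction it imitates.

```python
def daxpy_scalar(a, x, y):
    """Element-by-element loop, mirroring the non-vector fragment."""
    for i in range(len(x)):        # loop control: ADDI, ADDI, SUB, JNZ
        f2 = a * x[i]              # LD F2 / MULT F2,F0,F2
        f4 = f2 + y[i]             # LD F4 / ADD F4,F2,F4
        y[i] = f4                  # SD F4,0(Ry)
    return y


def daxpy_vector(a, x, y):
    """Whole-vector steps, mirroring the vector fragment."""
    v1 = list(x)                              # LV V1,Rx
    v2 = [a * e for e in v1]                  # MULTSV V2,F0,V1
    v3 = list(y)                              # LV V3,Ry
    v4 = [p + q for p, q in zip(v2, v3)]      # ADDV V4,V2,V3
    y[:] = v4                                 # SV Ry,V4
    return y


x = [float(i) for i in range(64)]
y = [1.0] * 64
assert daxpy_scalar(2.0, x, list(y)) == daxpy_vector(2.0, x, list(y))
```

Both versions compute the same result; the difference the slides emphasize is that the vector form names each step once for all 64 elements.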
Example Problem (cont’d)
  Two main observations
    Execution efficiency
      Vector code
        Executes 6 instructions
      Non-vector code
        Nearly 600 instructions (9 * 64)
        Lots of control overhead
          4 out of 9 instructions!
          Absent in the vector code
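The instruction counts can be checked with a few lines of arithmetic. This sketch assumes the 64-element vector used throughout the example and additionally counts the two setup instructions that precede the loop (`LD F0,a` and `ADDI R4,Rx,#512`), which the slide's 9 * 64 estimate leaves out. The constant names are illustrative.

```python
VECTOR_LENGTH = 64      # elements in X and Y
LOOP_BODY = 9           # instructions per loop iteration
LOOP_OVERHEAD = 4       # ADDI, ADDI, SUB, JNZ: pure loop control
SETUP = 2               # LD F0,a and ADDI R4,Rx,#512 (before the loop)
VECTOR_INSTRUCTIONS = 6 # total instructions in the vector fragment

scalar_total = SETUP + LOOP_BODY * VECTOR_LENGTH
overhead_total = LOOP_OVERHEAD * VECTOR_LENGTH

print(scalar_total)                                   # 578
print(overhead_total)                                 # 256
print(round(scalar_total / VECTOR_INSTRUCTIONS, 1))   # 96.3
```

Roughly 44% of the dynamically executed scalar instructions (256 of 578) exist only to control the loop.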
Example Problem (cont’d)
  Two main observations
    Frequency of pipeline interlock
      Non-vector code:
        Every ADD must wait for MULT
        Every SD must wait for ADD
        Loop unrolling can eliminate this interlock
      Vector code
        Each instruction is independent
        Pipeline stalls once per vector operation
          Not once per vector element
Vector Length
  A vector register has a natural vector length
    64 elements in CRAY systems
  What if the vector has a different length?
  Three cases
    Vector length < Vector register length
      Use a vector length register to indicate the vector length
    Vector length = Vector register length
    Vector length > Vector register length
Vector Length (cont’d)
  Vector length > Vector register length
    Use strip mining
      The vector is partitioned into strips that are less than or equal to the vector register length
      Odd strip
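Strip mining can be sketched as follows. This is a simplified illustration under two assumptions that the slide does not state: the odd (short) strip is processed first, and the per-strip length `vl` plays the role of the vector length register. The names `strip_mine` and `daxpy_strip_mined` are hypothetical.

```python
MVL = 64  # vector register length (64 elements in CRAY systems)


def strip_mine(n, mvl=MVL):
    """Return (start, length) strips covering n elements, odd strip first."""
    strips = []
    odd = n % mvl
    start = 0
    if odd:
        strips.append((0, odd))     # the short, "odd" strip
        start = odd
    while start < n:
        strips.append((start, mvl)) # full-length strips
        start += mvl
    return strips


def daxpy_strip_mined(a, x, y, mvl=MVL):
    """Y = a*X + Y, processed one strip at a time."""
    for start, vl in strip_mine(len(x), mvl):
        # vl stands in for the value loaded into the vector length register
        end = start + vl
        y[start:end] = [a * xi + yi for xi, yi in zip(x[start:end], y[start:end])]
    return y


print(strip_mine(200))  # [(0, 8), (8, 64), (72, 64), (136, 64)]
```

A 200-element vector therefore needs one 8-element strip followed by three full 64-element strips.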
Vector Stride
  Vector stride
    Distance separating the elements that are to be merged into a single vector
      Measured in elements, not bytes
  Multidimensional matrices typically have non-unit stride access patterns
    Example: matrix multiply
Vector Stride (cont’d)
  Matrix multiplication
    for (i = 1, 100)
      for (j = 1, 100)
        A[i,j] = 0
        for (k = 1, 100)
          A[i,j] = A[i,j] + B[i,k] * C[k,j]
  In the inner loop, one of B[i,k] and C[k,j] is accessed with unit stride and the other with non-unit stride
Vector Stride (cont’d)
  Access pattern of B and C depends on how the matrix is stored
    Row-major
      Matrix is stored row-by-row
      Used by most languages except FORTRAN
    Column-major
      Matrix is stored column-by-column
      Used by FORTRAN
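The dependence of stride on storage order can be worked out directly. The sketch below assumes square n-by-n matrices with zero-based indices and measures stride in elements; the helper names `element_offset` and `inner_loop_strides` are illustrative only.

```python
def element_offset(i, j, n_rows, n_cols, order):
    """Offset (in elements) of M[i, j] from the start of the matrix."""
    if order == "row":
        return i * n_cols + j      # rows are contiguous
    if order == "col":
        return j * n_rows + i      # columns are contiguous
    raise ValueError("order must be 'row' or 'col'")


def inner_loop_strides(n, order):
    """Strides of B[i,k] and C[k,j] as k advances by one in the inner loop."""
    base = element_offset(0, 0, n, n, order)
    b_stride = element_offset(0, 1, n, n, order) - base   # B[i,k] -> B[i,k+1]
    c_stride = element_offset(1, 0, n, n, order) - base   # C[k,j] -> C[k+1,j]
    return b_stride, c_stride


print(inner_loop_strides(100, "row"))  # (1, 100)
print(inner_loop_strides(100, "col"))  # (100, 1)
```

With row-major storage B is read with unit stride and C with stride 100; column-major storage reverses the roles.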
Vector Stride (cont’d)
Cray X-MP Instructions
  Integer addition
    Vi Vj+Vk       Vi = Vj + Vk
    Vi Sj+Vk       Vi = Sj + Vk    (Sj is a scalar)
  Floating-point addition
    Vi Vj+FVk      Vi = Vj + Vk
    Vi Sj+FVk      Vi = Sj + Vk    (Sj is a scalar)
Cray X-MP Instructions (cont’d)
  Load instructions
    Vi ,A0,Ak      Vi = M(A0)+Ak
      Vector load with stride Ak
      Loads VL elements starting at memory address A0
    Vi ,A0,1       Vi = M(A0)+1
      Vector load with stride 1
      Special case
Cray X-MP Instructions (cont’d)
  Store instructions
    ,A0,Ak Vi
      Vector store with stride Ak
      Stores VL elements starting at memory address A0
    ,A0,1 Vi
      Vector store with stride 1
      Special case
Cray X-MP Instructions (cont’d)
  Logical AND instructions
    Vi Vj&Vk       Vi = Vj & Vk
    Vi Sj&Vk       Vi = Sj & Vk    (Sj is a scalar)
  Shift instructions
    Vi Vj>Ak       Vi = Vj >> Ak
    Vi Vj<Ak       Vi = Vj << Ak
    Left/right shift each element of Vj and store the result in Vi
Sample Vector Functional Units
  Vector functional unit    # Stages    Available to chain    Vector results
  Integer ADD (64-bit)      3           8                     VL+8
  64-bit shift              3           8                     VL+8
  128-bit shift             4           9                     VL+9
  Floating ADD              6           11                    VL+11
  Floating MULTIPLY         7           12                    VL+12
X-MP Pipeline Operation
  Three phases
    Setup phase
      Sets functional units to perform the appropriate operation
      Establishes routes to source and destination vector registers
      Requires 3 clock cycles for all functional units
    Execution phase
    Shutdown phase
X-MP Pipeline Operation (cont’d)
  Three phases (cont’d)
    Execution phase
      Source and destination vector registers are reserved
        Cannot be used by another instruction
      Source vector register is reserved for VL+3 clock cycles
        VL = vector length
      One pair of operands enters the first stage per clock cycle
X-MP Pipeline Operation (cont’d)
  Three phases (cont’d)
    Shutdown phase
      Shutdown time = 3 clock cycles
      Shutdown time
        Time difference between when the last result emerges and when the destination vector register becomes available for other instructions
X-MP Pipeline Operation (cont’d)
  Three phases (cont’d)
    Shutdown phase
      Destination register becomes available after
        3 + n + (VL – 1) + 3 = n + VL + 5 clock cycles
      Setup time = shutdown time = 3 clock cycles
      First result comes after n clock cycles
      Remaining (VL – 1) results come out at one per clock cycle
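The timing expression n + VL + 5 can be evaluated for the functional units listed earlier. This is a small sketch of the slide's formula only; the function name `destination_available` and the constant names are illustrative.

```python
SETUP_CYCLES = 3      # setup time for all functional units
SHUTDOWN_CYCLES = 3   # shutdown time


def destination_available(n_stages, vl):
    """Clock cycles until the destination vector register is free again."""
    first_result = n_stages        # first result emerges after n cycles
    remaining = vl - 1             # one further result per clock cycle
    return SETUP_CYCLES + first_result + remaining + SHUTDOWN_CYCLES


# Floating ADD has 6 stages; with VL = 64 the register is free after 75 cycles.
print(destination_available(6, 64))  # 75
# Integer ADD has 3 stages, so the result is VL + 8 for any VL.
print(destination_available(3, 5))   # 13
```

For a 3-stage unit the formula reduces to VL + 8 and for a 6-stage unit to VL + 11, in line with the "Vector results" column of the functional unit table.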
A Simple Vector Add Operation
    A1 5           A1 = 5
    VL A1          VL = A1
    V1 V2+FV3      V1 = V2 + V3
Overlapped Vector Operations
    A1 5           A1 = 5
    VL A1          VL = A1
    V1 V2+FV3      V1 = V2 + V3
    V4 V5*FV6      V4 = V5 * V6
Chaining Example
    A1 5           A1 = 5
    VL A1          VL = A1
    V1 V2+FV3      V1 = V2 + V3
    V4 V5*FV1      V4 = V5 * V1
Vector Processing Performance
Interleaved Memories
  Traditional memory designs
    Provide sequential, non-overlapped access
    Use high-order interleaving
  Interleaved memories
    Facilitate overlapped, pipelined access
    Used by vector and high-performance systems
    Use low-order interleaving
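The difference between the two interleaving schemes comes down to which address bits select the bank. The following sketch assumes word addresses and a power-of-two layout for illustration; the function names and the 4-bank, 4-words-per-bank sizes are hypothetical.

```python
def low_order_bank(addr, banks):
    """Low-order interleaving: the low address bits select the bank."""
    return addr % banks, addr // banks            # (bank, offset within bank)


def high_order_bank(addr, words_per_bank):
    """High-order interleaving: the high address bits select the bank."""
    return addr // words_per_bank, addr % words_per_bank


# Banks touched by 8 consecutive addresses with 4 banks of 4 words each.
print([low_order_bank(a, 4)[0] for a in range(8)])    # [0, 1, 2, 3, 0, 1, 2, 3]
print([high_order_bank(a, 4)[0] for a in range(8)])   # [0, 0, 0, 0, 1, 1, 1, 1]
```

Consecutive addresses rotate through the banks under low-order interleaving, which is what allows their accesses to overlap; under high-order interleaving they keep hitting the same bank.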
Interleaved Memories (cont’d)
Interleaved Memories (cont’d)
  Two types of designs
    Synchronized access organization
      Upper m bits are given to all memory banks simultaneously
      Requires output latches
      Does not efficiently support non-sequential access
    Independent access organization
      Supports pipelined access for arbitrary access patterns
      Requires address registers
Interleaved Memories (cont’d)
  Synchronized access organization
Interleaved Memories (cont’d)
  Pipelined transfer of data in interleaved memories
Interleaved Memories (cont’d)
  Independent access organization
Interleaved Memories (cont’d)
  Number of banks B
    B ≥ M
      M = memory access time in cycles
  Sequential access if stride = B
  Example: B = 8, M = 6 clock cycles, stride = 1
    Time to read 16 words = 6 + 16 = 22 clock cycles
  If stride is 8, it takes 16 * 6 = 96 clock cycles
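The two timings on this slide can be reproduced with a small model. This sketch covers only the two regimes the slide mentions: fully pipelined access (the stride spreads accesses over at least M banks) and fully sequential access (every access hits the same bank). Partial bank conflicts are deliberately left out, and the function names are illustrative.

```python
from math import gcd


def banks_touched(stride, banks):
    """Number of distinct banks visited under low-order interleaving."""
    return banks // gcd(stride, banks)


def read_time(words, stride, banks=8, access=6):
    """Clock cycles to read `words` words, for the two cases on the slide."""
    distinct = banks_touched(stride, banks)
    if distinct >= access:
        # Pipelined: one access latency, then one word per clock cycle.
        return access + words
    if distinct == 1:
        # Every access hits the same bank, so accesses are sequential.
        return words * access
    raise NotImplementedError("partial bank conflicts are outside this sketch")


print(read_time(16, stride=1))  # 22
print(read_time(16, stride=8))  # 96
```

A stride equal to the number of banks defeats interleaving entirely, which is why the stride-8 read takes more than four times as long as the unit-stride read.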