Published by Marianna Shields. Modified over 9 years ago.
1
Vector Processors
Prof. Sivarama Dandamudi
School of Computer Science
Carleton University
2
Pipelining
© S. Dandamudi
Vector machines exploit pipelining in all their activities
  Computations
  Movement of data to and from memory
Pipelining provides overlapped execution
  Increases throughput
  Hides latency
…
3
Pipelining (cont’d)
Pipeline overlaps execution: 6 versus 18 cycles
4
Pipelining (cont’d)
One measure of performance:
  $\text{Speedup} = \dfrac{\text{Non-pipelined execution time}}{\text{Pipelined execution time}}$
Ideal case: an n-stage pipeline should give a speedup of n
Two factors affect this:
  Pipeline fill
  Pipeline drain
5
Pipelining (cont’d)
N computations, each takes $n \cdot T$ time
Non-pipelined time $= N \cdot n \cdot T$
Pipelined time $= n \cdot T + (N - 1)\,T = (n + N - 1)\,T$
$$\text{Speedup} = \frac{n \cdot N}{n + N - 1} = \frac{1}{1/N + 1/n - 1/(n \cdot N)}$$
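The speedup expression on this slide, n * N / (n + N - 1), can be checked numerically. The following is a minimal sketch; the function name `pipeline_speedup` is made up for illustration, and times are measured in units of the stage time T.

```python
def pipeline_speedup(n: int, N: int) -> float:
    """Speedup of an n-stage pipeline over non-pipelined execution
    for N computations, with times in units of the stage time T."""
    non_pipelined = N * n      # each computation takes n stage times
    pipelined = n + N - 1      # n to fill, then one result per stage time
    return non_pipelined / pipelined


if __name__ == "__main__":
    # The speedup approaches the pipeline depth n only as N grows large.
    for N in (1, 10, 100, 1000):
        print(N, round(pipeline_speedup(6, N), 3))
```

With a single computation (N = 1) there is no overlap and the speedup is 1; the ideal speedup of n is approached only when N is much larger than n.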
6
Pipelining (cont’d)
[Figure: curves for n = 3, n = 6, and n = 9]
7
Pipelining (cont’d)
[Figure: plot against pipeline depth, n]
8
Vector Machines
Provide high-level operations
  Work on vectors (linear arrays of numbers)
A typical vector operation
  Add two 64-element floating-point vectors
  Equivalent to an entire loop
CRAY format
  V3 V2 VOP V1    (V3 = V2 VOP V1)
9
Vector Machines (cont’d)
Consists of
  Scalar unit
    Works on scalars
    Address arithmetic
  Vector unit
    Responsible for vector operations
    Several vector functional units
      Integer add, FP add, FP multiply …
10
Vector Machines (cont’d)
Two types of architecture
  Memory-to-memory architecture
    Vectors are memory resident
    The first vector machines were of this type
    Examples: CDC Star 100, CYBER 205
  Vector-register architecture
    Vectors are stored in registers
    Modern vector machines belong to this type
    Examples: Cray 1/2/X-MP/Y-MP, NEC SX/2, Fujitsu VP200, Hitachi S820
11
Components
Primary components of a vector-register machine
  Vector registers
    Each register can hold a small vector
    Example: Cray-1 has 8 vector registers
      Each vector register can hold 64 doublewords (64-bit values)
    Two read ports and one write port
      Allows overlap among the vector operations
12
Cray-1 Architecture
13
Components
Vector functional units
  Each unit is fully pipelined
    Can start a new operation on every clock cycle
  Cray-1 has six functional units
    FP add, FP multiply, FP reciprocal, integer add, logical, shift
Scalar registers
  Store scalars
  Compute addresses to pass on to the load/store unit
14
Components
Vector load/store unit
  Moves vectors between memory and vector registers
  Load and store operations are pipelined
  Some processors have more than one load/store unit
    NEC SX/2 has 8 load/store units
Memory
  Designed to allow pipelined access
  Typically uses interleaved memories
    Discussed later
15
Some Example Vector Machines
Machine          Year   # VR      VR size     # LSUs
CRAY-2           1985   8         64          1
Cray Y-MP        1988   8         64          2 loads/1 store
Fujitsu VP100    1982   8-256     32-1024     2
Hitachi S810     1983   32        256         4
NEC SX/2         1984   8+8192    256+var.    8
Convex C-1       1985   8         128         1
(# VR = number of vector registers; VR size = elements per vector register; # LSUs = number of load/store units)
16
Some Example Vector Machines (cont’d)
Vector functional units
  Cray X-MP/Y-MP: 8 units
    FP add, FP multiply, FP reciprocal
    Integer add, 2 logical
    Shift
    Population count/parity
17
Some Example Vector Machines (cont’d)
Vector functional units (cont’d)
  NEC SX/2: 16 units
    4 FP add, 4 FP multiply/divide
    4 integer add/logical, 4 shift
18
Advantages of Vector Machines
Flynn’s bottleneck can be reduced
  Vector instructions significantly improve code density
    A single vector instruction specifies a great deal of work
    Reduces the number of instructions needed to execute a program
Eliminates the control overhead of a loop
  A vector instruction represents the entire loop
  Loop overhead can be substantial
19
Advantages of Vector Machines (cont’d)
Impact of main memory latency can be reduced
  Vector instructions that access memory have a known pattern
    Pipelined access can be used
    Can exploit interleaved memory
  High latency associated with memory can be amortized over the entire vector
    Latency is not incurred for each data item, as it is when accessing a single floating-point number
20
Advantages of Vector Machines (cont’d)
Control hazards can be reduced
  Vector machines organize data operands into regular sequences
    Suitable for pipelined access in hardware
  A vector operation replaces a loop
Data hazards can be eliminated
  Due to the structured nature of the data
  Allows planned prefetching of data
21
Example Problem
A typical vector problem
  Y = a * X + Y
  X and Y are vectors; a is a scalar
This problem is known as
  SAXPY (single precision A*X Plus Y)
  DAXPY (double precision A*X Plus Y)
SAXPY/DAXPY is a small piece of code that takes most of the time in the Linpack benchmark
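For reference, the operation Y = a * X + Y can be written as a plain element-by-element loop. This is only an illustrative sketch in Python, not code from the slides; the function name `axpy` is an assumption, and because Python floats are double precision it is closer to DAXPY than to SAXPY.

```python
def axpy(a, x, y):
    """Return the vector a*x + y, computed element by element.

    The inputs are left unmodified; a real SAXPY/DAXPY routine would
    normally overwrite y in place.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


if __name__ == "__main__":
    print(axpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```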
22
Example Problem (cont’d)
Non-vector code fragment
        LD    F0,a
        ADDI  R4,Rx,#512   ;last address to load
  loop: LD    F2,0(Rx)     ;F2 := M[0+Rx], i.e., load X[i]
        MULT  F2,F0,F2     ;a*X[i]
23
Example Problem (cont’d)
        LD    F4,0(Ry)     ;load Y[i]
        ADD   F4,F2,F4     ;a*X[i] + Y[i]
        SD    F4,0(Ry)     ;store into Y[i]
        ADDI  Rx,Rx,#8     ;increment index to X
        ADDI  Ry,Ry,#8     ;increment index to Y
        SUB   R20,R4,Rx    ;R20 := R4-Rx
        JNZ   R20,loop     ;jump if not done
9 instructions in the loop
24
Example Problem (cont’d)
Vector code fragment
        LD      F0,a       ;load scalar a
        LV      V1,Rx      ;load vector X
        MULTSV  V2,F0,V1   ;V2 := F0 * V1
        LV      V3,Ry      ;load vector Y
        ADDV    V4,V2,V3   ;V4 := V2 + V3
        SV      Ry,V4      ;store the result
Only 6 instructions (one scalar load and five vector instructions)!
25
Example Problem (cont’d)
Two main observations
  Execution efficiency
    Vector code
      Executes 6 instructions
    Non-vector code
      Nearly 600 instructions (9 * 64 = 576 in the loop)
      Lots of control overhead
        4 out of 9 instructions in the loop!
        Absent in the vector code
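The instruction counts quoted for the two code fragments can be reproduced with a few lines of arithmetic. This is a sketch with assumed names; it counts the two setup instructions before the loop (LD and ADDI) separately from the 9-instruction loop body.

```python
SETUP_INSTRUCTIONS = 2      # LD F0,a and ADDI R4,Rx,#512 before the loop
LOOP_INSTRUCTIONS = 9       # LD, MULT, LD, ADD, SD, ADDI, ADDI, SUB, JNZ
OVERHEAD_INSTRUCTIONS = 4   # ADDI, ADDI, SUB, JNZ: index updates and loop control
VECTOR_INSTRUCTIONS = 6     # LD, LV, MULTSV, LV, ADDV, SV


def scalar_instruction_count(elements: int) -> int:
    """Instructions executed by the non-vector fragment for `elements` items."""
    return SETUP_INSTRUCTIONS + LOOP_INSTRUCTIONS * elements


if __name__ == "__main__":
    n = 64  # one full vector register on a Cray
    print("scalar:", scalar_instruction_count(n))
    print("vector:", VECTOR_INSTRUCTIONS)
    print("loop overhead fraction:", OVERHEAD_INSTRUCTIONS / LOOP_INSTRUCTIONS)
```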
26
Example Problem (cont’d)
Two main observations (cont’d)
  Frequency of pipeline interlock
    Non-vector code
      Every ADD must wait for MULT
      Every SD must wait for ADD
      Loop unrolling can eliminate this interlock
    Vector code
      Each instruction is independent
      Pipeline stalls once per vector operation
        Not once per vector element
27
Vector Length
A vector register has a natural vector length
  64 elements in CRAY systems
What if the vector has a different length?
Three cases
  Vector length < vector register length
    Use a vector length register to indicate the vector length
  Vector length = vector register length
  Vector length > vector register length
28
Vector Length (cont’d)
Vector length > vector register length
  Use strip mining
    Vector is partitioned into strips that are less than or equal to the vector register length
    Odd strip
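Strip mining can be sketched as a generator that splits a long vector into register-sized pieces. The slide only names the odd strip, so processing it first is an assumption made here (processing it last would work equally well); the name `strips` and the default length of 64 are also illustrative.

```python
def strips(vector_length: int, register_length: int = 64):
    """Yield (start, length) pairs covering 0..vector_length-1.

    Every strip has `register_length` elements except possibly the
    first one, the "odd strip", which holds the remainder.
    Assumption: the odd strip is processed first.
    """
    odd = vector_length % register_length
    start = 0
    if odd:
        yield (start, odd)
        start = odd
    while start < vector_length:
        yield (start, register_length)
        start += register_length


if __name__ == "__main__":
    # Each strip length would be written to the vector length register
    # before the vector instructions for that strip are issued.
    for start, length in strips(200):
        print(start, length)
```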
29
Vector Stride
Vector stride
  Distance separating the elements that are to be merged into a single vector
    In elements, not bytes
  Multidimensional matrices typically have non-unit stride access patterns
    Example: matrix multiply
30
Vector Stride (cont’d)
Matrix multiplication
  for (i = 1, 100)
    for (j = 1, 100)
      A[i,j] = 0
      for (k = 1, 100)
        A[i,j] = A[i,j] + B[i,k] * C[k,j]
In the inner loop, one of B[i,k] and C[k,j] is accessed with non-unit stride and the other with unit stride
31
Vector Stride (cont’d)
Access pattern of B and C depends on how the matrix is stored
  Row-major
    Matrix is stored row-by-row
    Used by most languages except FORTRAN
  Column-major
    Matrix is stored column-by-column
    Used by FORTRAN
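How the storage order determines the stride can be made concrete with a small helper. This is a sketch under stated assumptions: the function name and its string arguments are invented for illustration, and strides are measured in elements, as on the earlier slide.

```python
def element_stride(rows: int, cols: int, varying: str, order: str) -> int:
    """Stride, in elements, between consecutive accesses to a
    rows x cols matrix when only one index changes.

    varying: "row" if the row index changes, "col" if the column index does.
    order:   "row-major" or "column-major".
    """
    if varying not in ("row", "col"):
        raise ValueError("varying must be 'row' or 'col'")
    if order == "row-major":
        return cols if varying == "row" else 1
    if order == "column-major":
        return 1 if varying == "row" else rows
    raise ValueError("order must be 'row-major' or 'column-major'")


if __name__ == "__main__":
    # Inner loop over k in the 100 x 100 matrix multiply:
    # B[i,k] varies its column index, C[k,j] varies its row index.
    for order in ("row-major", "column-major"):
        b = element_stride(100, 100, "col", order)
        c = element_stride(100, 100, "row", order)
        print(order, "B stride:", b, "C stride:", c)
```

Whichever order is used, exactly one of the two operands is read with unit stride and the other with a stride of 100 elements.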
32
Vector Stride (cont’d)
[Figure: 4 × 4 matrix, elements labeled by row and column]
  11 12 13 14
  21 22 23 24
  31 32 33 34
  41 42 43 44
33
Cray X-MP Instructions
Integer addition
  Vi Vj+Vk     Vi = Vj + Vk
  Vi Sj+Vk     Vi = Sj + Vk    (Sj is a scalar)
Floating-point addition
  Vi Vj+FVk    Vi = Vj + Vk
  Vi Sj+FVk    Vi = Sj + Vk    (Sj is a scalar)
34
Cray X-MP Instructions (cont’d)
Load instructions
  Vi ,A0,Ak    Vi = M(A0)+Ak
    Vector load with stride Ak
    Loads VL elements starting at memory address A0
  Vi ,A0,1     Vi = M(A0)+1
    Vector load with stride 1
    Special case
35
Cray X-MP Instructions (cont’d)
Store instructions
  ,A0,Ak Vi
    Vector store with stride Ak
    Stores VL elements starting at memory address A0
  ,A0,1 Vi
    Vector store with stride 1
    Special case
36
Cray X-MP Instructions (cont’d)
Logical AND instructions
  Vi Vj&Vk     Vi = Vj & Vk
  Vi Sj&Vk     Vi = Sj & Vk    (Sj is a scalar)
Shift instructions
  Vi Vj>Ak     Vi = Vj >> Ak
  Vi Vj<Ak     Vi = Vj << Ak
  Left/right shift each element of Vj and store the result in Vi
37
Sample Vector Functional Units
Vector functional unit   # Stages   Available to chain   Vector results
Integer ADD (64-bit)     3          8                    VL+8
64-bit shift             3          8                    VL+8
128-bit shift            4          9                    VL+9
Floating ADD             6          11                   VL+11
Floating MULTIPLY        7          12                   VL+12
38
X-MP Pipeline Operation
Three phases
  Setup phase
    Sets functional units to perform the appropriate operation
    Establishes routes to source and destination vector registers
    Requires 3 clock cycles for all functional units
  Execution phase
  Shutdown phase
39
X-MP Pipeline Operation (cont’d)
Three phases (cont’d)
  Execution phase
    Source and destination vector registers are reserved
      Cannot be used by another instruction
    Source vector register is reserved for VL+3 clock cycles
      VL = vector length
    One pair of operands enters the first stage per clock cycle
40
X-MP Pipeline Operation (cont’d)
Three phases (cont’d)
  Shutdown phase
    Shutdown time = 3 clock cycles
    Shutdown time
      Time difference between when the last result emerges and when the destination vector register becomes available for other instructions
41
X-MP Pipeline Operation (cont’d)
Three phases (cont’d)
  Shutdown phase
    Destination register becomes available after
      3 + n + (VL − 1) + 3 = n + VL + 5 clock cycles
      Setup time = shutdown time = 3 clock cycles
      First result comes after n clock cycles
      Remaining (VL − 1) results come out at one per clock cycle
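The timing expression 3 + n + (VL − 1) + 3 can be checked against the "Vector results" column of the functional-unit table (for example VL+8 for the 3-stage integer adder). The sketch below uses assumed names; only the constants and the formula come from the slides.

```python
SETUP_CYCLES = 3      # setup phase, same for all functional units
SHUTDOWN_CYCLES = 3   # shutdown phase


def destination_available(stages: int, vector_length: int) -> int:
    """Clock cycles until the destination vector register is free again.

    stages:        pipeline depth n of the functional unit
    vector_length: VL, the number of elements processed
    """
    first_result = stages                  # first result after n cycles
    remaining_results = vector_length - 1  # then one result per cycle
    return SETUP_CYCLES + first_result + remaining_results + SHUTDOWN_CYCLES


if __name__ == "__main__":
    vl = 64
    for name, n in [("Integer ADD", 3), ("128-bit shift", 4),
                    ("Floating ADD", 6), ("Floating MULTIPLY", 7)]:
        print(name, destination_available(n, vl) - vl)  # offset added to VL
```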
42
A Simple Vector Add Operation
  A1 5          A1 = 5
  VL A1         VL = A1 (vector length is 5)
  V1 V2+FV3     V1 = V2 + V3 (floating-point add)
43
Overlapped Vector Operations
  A1 5          A1 = 5
  VL A1         VL = A1 (vector length is 5)
  V1 V2+FV3     V1 = V2 + V3 (floating-point add)
  V4 V5*FV6     V4 = V5 * V6 (floating-point multiply, independent of the add)
44
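Chaining Example
  A1 5          A1 = 5
  VL A1         VL = A1 (vector length is 5)
  V1 V2+FV3     V1 = V2 + V3 (floating-point add)
  V4 V5*FV1     V4 = V5 * V1 (floating-point multiply, uses the result V1 of the add)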
45
Vector Processing Performance
46
Interleaved Memories
Traditional memory designs
  Provide sequential, non-overlapped access
  Use high-order interleaving
Interleaved memories
  Facilitate overlapped, pipelined access
  Used by vector and high-performance systems
  Use low-order interleaving
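The difference between the two interleaving schemes lies in which address bits select the bank. The sketch below illustrates the usual definitions with word addresses; the function names and the `words_per_bank` parameter are assumptions for illustration.

```python
def low_order_bank(address: int, banks: int) -> tuple[int, int]:
    """Low-order interleaving: the low address bits select the bank,
    so consecutive addresses fall in consecutive banks.
    Returns (bank, offset within bank)."""
    return address % banks, address // banks


def high_order_bank(address: int, banks: int, words_per_bank: int) -> tuple[int, int]:
    """High-order interleaving: the high address bits select the bank,
    so consecutive addresses stay in the same bank.
    Returns (bank, offset within bank)."""
    if not 0 <= address < banks * words_per_bank:
        raise ValueError("address out of range")
    return address // words_per_bank, address % words_per_bank


if __name__ == "__main__":
    print([low_order_bank(a, 8)[0] for a in range(10)])
    print([high_order_bank(a, 8, 1024)[0] for a in range(10)])
```

With low-order interleaving a stride-1 access stream visits every bank in turn, which is what allows the accesses to be overlapped.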
47
Interleaved Memories (cont’d)
48
Interleaved Memories (cont’d)
Two types of designs
  Synchronized access organization
    Upper m bits are given to all memory banks simultaneously
    Requires output latches
    Does not efficiently support non-sequential access
  Independent access organization
    Supports pipelined access for arbitrary access patterns
    Requires address registers
49
Interleaved Memories (cont’d)
Synchronized access organization
50
Interleaved Memories (cont’d)
Pipelined transfer of data in interleaved memories
51
Interleaved Memories (cont’d)
Independent access organization
52
Interleaved Memories (cont’d)
Number of banks B ≥ M
  M = memory access time in cycles
Access becomes sequential if stride = B (every reference goes to the same bank)
Example: B = 8, M = 6 clock cycles
  Stride = 1: time to read 16 words = 6 + 16 = 22 clock cycles
  Stride = 8: it takes 16 * 6 = 96 clock cycles
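The two timings in this example can be reproduced with a small estimate. This is a deliberately limited sketch that assumes low-order interleaving and covers only the two cases the slide describes: fully pipelined access (M + number of words) and fully sequential access (number of words * M). Partial bank conflicts are not modeled, and all names are illustrative.

```python
from math import gcd


def banks_visited(banks: int, stride: int) -> int:
    """Number of distinct banks touched by a strided access stream
    under low-order interleaving (bank = address mod banks)."""
    return banks // gcd(banks, stride)


def read_time(words: int, banks: int, access_time: int, stride: int) -> int:
    """Estimated clock cycles to read `words` words.

    Two-case estimate only:
    - enough distinct banks to hide the access time: access_time + words
    - every access hits the same bank: words * access_time
    """
    visited = banks_visited(banks, stride)
    if visited >= access_time:
        return access_time + words
    if visited == 1:
        return words * access_time
    raise ValueError("partial bank conflicts are outside this two-case sketch")


if __name__ == "__main__":
    print(read_time(16, banks=8, access_time=6, stride=1))
    print(read_time(16, banks=8, access_time=6, stride=8))
```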