Vector Processors Prof. Sivarama Dandamudi School of Computer Science Carleton University.


1 Vector Processors Prof. Sivarama Dandamudi School of Computer Science Carleton University

2 © S. Dandamudi Pipelining
- Vector machines exploit pipelining in all their activities
  - Computations
  - Movement of data from/to memory
- Pipelining provides overlapped execution
  - Increases throughput
  - Hides latency …

3 Pipelining (cont’d)
[Figure: pipeline overlaps execution, 6 versus 18 cycles]

4 Pipelining (cont’d)
- One measure of performance:
  $\text{Speedup} = \dfrac{\text{Non-pipelined execution time}}{\text{Pipelined execution time}}$
- Ideal case:
  - An n-stage pipeline should give a speedup of n
- Two factors affect this:
  - Pipeline fill
  - Pipeline drain

5 Pipelining (cont’d)
- N computations, each takes n * T time
- Non-pipelined time = N * n * T
- Pipelined time = n * T + (N - 1) * T = (n + N - 1) * T
- $\text{Speedup} = \dfrac{n N}{n + N - 1} = \dfrac{1}{1/N + 1/n - 1/(n N)}$
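The speedup formula on this slide can be checked numerically. The following is a minimal sketch, not part of the original slides; the function name `pipeline_speedup` is made up for illustration.

```python
def pipeline_speedup(n_stages: int, n_computations: int) -> float:
    """Speedup of an n-stage pipeline over non-pipelined execution.

    Non-pipelined time is N * n * T; pipelined time is (n + N - 1) * T.
    The stage time T cancels, so it is not a parameter here.
    """
    if n_stages < 1 or n_computations < 1:
        raise ValueError("both arguments must be positive")
    return (n_stages * n_computations) / (n_stages + n_computations - 1)


if __name__ == "__main__":
    # Speedup approaches the pipeline depth as the number of computations grows.
    for n_computations in (1, 8, 64, 1024):
        print(n_computations, round(pipeline_speedup(6, n_computations), 3))
```

For a single computation the speedup is 1 (fill and drain dominate); it approaches n only when N is much larger than n.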

6 Pipelining (cont’d)
[Figure: curves for pipeline depths n = 3, n = 6, and n = 9]

7 Pipelining (cont’d)
[Figure: plot with axis label “Pipeline depth, n”]

8 Vector Machines
- Provide high-level operations
  - Work on vectors (linear arrays of numbers)
- A typical vector operation
  - Add two 64-element floating-point vectors
  - Equivalent to an entire loop
- CRAY format: V3 V2 VOP V1
  - V3 ← V2 VOP V1

9 Vector Machines (cont’d)
- A vector machine consists of
  - Scalar unit
    - Works on scalars
    - Address arithmetic
  - Vector unit
    - Responsible for vector operations
    - Several vector functional units
      - Integer add, FP add, FP multiply …

10 Vector Machines (cont’d)
- Two types of architecture
  - Memory-to-memory architecture
    - Vectors are memory resident
    - The first vector machines were of this type
    - Examples: CDC STAR-100, CYBER 205
  - Vector-register architecture
    - Vectors are stored in registers
    - Modern vector machines belong to this type
    - Examples: Cray-1/2/X-MP/Y-MP, NEC SX/2, Fujitsu VP200, Hitachi S820

11 Components
- Primary components of a vector-register machine
  - Vector registers
    - Each register can hold a small vector
    - Example: Cray-1 has 8 vector registers
      - Each vector register can hold 64 doublewords (64-bit values)
    - Two read ports and one write port
      - Allows overlap among the vector operations

12 Cray-1 Architecture

13 Components
- Vector functional units
  - Each unit is fully pipelined
    - Can start a new operation on every clock cycle
  - Cray-1 has six functional units
    - FP add, FP multiply, FP reciprocal, Integer add, Logical, Shift
- Scalar registers
  - Store scalars
  - Compute addresses to pass on to the load/store unit

14 Components
- Vector load/store unit
  - Moves vectors between memory and vector registers
  - Load and store operations are pipelined
  - Some processors have more than one load/store unit
    - NEC SX/2 has 8 load/store units
- Memory
  - Designed to allow pipelined access
  - Typically uses interleaved memories
    - Will discuss later

15 Some Example Vector Machines
(# VR = number of vector registers; VR size = elements per vector register; # LSUs = number of load/store units)

Machine         Year   # VR     VR size    # LSUs
CRAY-2          1985   8        64         1
Cray Y-MP       1988   8        64         2 loads/1 store
Fujitsu VP100   1982   8-256    32-1024    2
Hitachi S810    1983   32       256        4
NEC SX/2        1984   8+8192   256+var.   8
Convex C-1      1985   8        128        1

16 Some Example Vector Machines (cont’d)
- Vector functional units
  - Cray X-MP/Y-MP
    - 8 units
      - FP add, FP multiply, FP reciprocal
      - Integer add
      - 2 logical
      - Shift
      - Population count/parity

17 Some Example Vector Machines (cont’d)
- Vector functional units (cont’d)
  - NEC SX/2
    - 16 units
      - 4 FP add
      - 4 FP multiply/divide
      - 4 Integer add/logical
      - 4 Shift

18 Advantages of Vector Machines
- Flynn’s bottleneck can be reduced
  - Vector instructions significantly improve code density
    - A single vector instruction specifies a great deal of work
    - Reduces the number of instructions needed to execute a program
- Eliminates the control overhead of a loop
  - A vector instruction represents the entire loop
  - Loop overhead can be substantial

19 Advantages of Vector Machines (cont’d)
- Impact of main memory latency can be reduced
  - Vector instructions that access memory have a known pattern
    - Pipelined access can be used
    - Can exploit interleaved memory
  - High latency associated with memory can be amortized over the entire vector
    - Latency is not paid separately for each data item
      - For example, each time a floating-point number is accessed

20 Advantages of Vector Machines (cont’d)
- Control hazards can be reduced
  - Vector machines organize data operands into regular sequences
    - Suitable for pipelined access in hardware
    - Vector operation = loop
- Data hazards can be eliminated
  - Due to the structured nature of the data
  - Allows planned prefetching of data

21 Example Problem
- A typical vector problem: Y = a * X + Y
  - X and Y are vectors
  - This problem is known as
    - SAXPY (single precision A*X Plus Y)
    - DAXPY (double precision A*X Plus Y)
  - SAXPY/DAXPY represents a small piece of code that takes most of the time in the benchmark

22 Example Problem (cont’d)
- Non-vector code fragment
        LD    F0,a          ;load scalar a
        ADDI  R4,Rx,#512    ;last address to load
  loop: LD    F2,0(Rx)      ;F2 := M[0+Rx], i.e., load X[i]
        MULT  F2,F0,F2      ;a*X[i]

23 Example Problem (cont’d)
        LD    F4,0(Ry)      ;load Y[i]
        ADD   F4,F2,F4      ;a*X[i] + Y[i]
        SD    F4,0(Ry)      ;store into Y[i]
        ADDI  Rx,Rx,#8      ;increment index to X
        ADDI  Ry,Ry,#8      ;increment index to Y
        SUB   R20,R4,Rx     ;R20 := R4-Rx
        JNZ   R20,loop      ;jump if not done
- 9 instructions in the loop

24 Example Problem (cont’d)
- Vector code fragment
        LD      F0,a        ;load scalar a
        LV      V1,Rx       ;load vector X
        MULTSV  V2,F0,V1    ;V2 := F0 * V1
        LV      V3,Ry       ;load vector Y
        ADDV    V4,V2,V3    ;V4 := V2 + V3
        SV      Ry,V4       ;store the result
- Only 6 instructions (one scalar load plus five vector instructions)!

25 Example Problem (cont’d)
- Two main observations
  - Execution efficiency
    - Vector code
      - Executes 6 instructions
    - Non-vector code
      - Nearly 600 instructions (9 * 64)
      - Lots of control overhead
        - 4 out of 9 instructions!
        - Absent in the vector code
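The instruction counts quoted on this slide can be reproduced with a few lines of arithmetic. This is a sketch only; the constant and function names are invented for illustration, and the count of two setup instructions is read off the non-vector fragment (the scalar load of a and the ADDI that computes the end address).

```python
ELEMENTS = 64              # 512 bytes / 8 bytes per element in the example
SETUP_INSTRUCTIONS = 2     # LD F0,a and ADDI R4,Rx,#512 execute once
LOOP_INSTRUCTIONS = 9      # instructions in the scalar loop body
OVERHEAD_INSTRUCTIONS = 4  # two ADDI pointer updates, SUB, and JNZ
VECTOR_INSTRUCTIONS = 6    # LD, LV, MULTSV, LV, ADDV, SV


def scalar_instruction_count(elements: int = ELEMENTS) -> int:
    """Dynamic instruction count of the non-vector fragment."""
    return SETUP_INSTRUCTIONS + LOOP_INSTRUCTIONS * elements


def overhead_fraction() -> float:
    """Fraction of loop instructions that only manage the loop."""
    return OVERHEAD_INSTRUCTIONS / LOOP_INSTRUCTIONS


if __name__ == "__main__":
    print("scalar:", scalar_instruction_count())
    print("vector:", VECTOR_INSTRUCTIONS)
    print("loop overhead:", round(overhead_fraction(), 2))
```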

26 Example Problem (cont’d)
- Two main observations
  - Frequency of pipeline interlock
    - Non-vector code:
      - Every ADD must wait for MULT
      - Every SD must wait for ADD
      - Loop unrolling can eliminate this interlock
    - Vector code
      - Each instruction is independent
      - Pipeline stalls once per vector operation
        - Not once per vector element

27 Vector Length
- A vector register has a natural vector length
  - 64 elements in CRAY systems
- What if the vector has a different length?
- Three cases
  - Vector length < vector register length
    - Use a vector length register to indicate the vector length
  - Vector length = vector register length
  - Vector length > vector register length

28 Vector Length (cont’d)
- Vector length > vector register length
  - Use strip mining
    - The vector is partitioned into strips, each no longer than the vector register length
[Figure label: odd strip]
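Strip mining can be sketched in a few lines. This is an illustration rather than the slides' own code: the names `strips` and `saxpy_strip_mined` are hypothetical, and processing the odd-sized strip first is an assumption (the slide only labels an "odd strip" without saying where it falls).

```python
def strips(n: int, mvl: int = 64):
    """Yield (start, length) pairs covering n elements with strips of at most mvl.

    Assumption: the odd-sized strip (n mod mvl elements) comes first.
    """
    if n < 0 or mvl < 1:
        raise ValueError("n must be non-negative and mvl positive")
    start = 0
    odd = n % mvl
    if odd:
        yield (0, odd)
        start = odd
    while start < n:
        yield (start, mvl)
        start += mvl


def saxpy_strip_mined(a, x, y, mvl: int = 64):
    """Compute a*x + y strip by strip; each strip models one pass with VL = length."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    result = list(y)
    for start, vl in strips(len(x), mvl):
        end = start + vl
        result[start:end] = [a * xi + yi for xi, yi in zip(x[start:end], y[start:end])]
    return result


if __name__ == "__main__":
    print(list(strips(200)))
```

Each `(start, length)` pair corresponds to setting the vector length register to `length` and issuing the vector instructions once.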

29 Vector Stride
- Vector stride
  - Distance separating the elements that are to be merged into a single vector
    - Measured in elements, not bytes
- Accesses to multidimensional matrices typically have non-unit stride patterns
  - Example: matrix multiply

30 Vector Stride (cont’d)
- Matrix multiplication
    for (i = 1, 100)
      for (j = 1, 100)
        A[i,j] = 0
        for (k = 1, 100)
          A[i,j] = A[i,j] + B[i,k] * C[k,j]
[Slide labels on the inner-loop operands: non-unit stride, unit stride]

31 Vector Stride (cont’d)
- Access pattern of B and C depends on how the matrix is stored
  - Row-major
    - Matrix is stored row-by-row
    - Used by most languages except FORTRAN
  - Column-major
    - Matrix is stored column-by-column
    - Used by FORTRAN
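How the storage order determines the stride can be made concrete with a small helper. This is a sketch under stated assumptions: the function `access_stride` and its string arguments are invented for illustration, and the matrices are taken to be dense 100 x 100 arrays as in the matrix-multiply loop.

```python
def access_stride(order: str, varying: str, rows: int, cols: int) -> int:
    """Stride, in elements, between consecutive accesses to a rows x cols matrix.

    order:   "row-major" or "column-major"
    varying: "row" if the row index changes between accesses,
             "column" if the column index changes.
    """
    if varying not in ("row", "column"):
        raise ValueError("varying must be 'row' or 'column'")
    if order == "row-major":
        return cols if varying == "row" else 1
    if order == "column-major":
        return 1 if varying == "row" else rows
    raise ValueError("order must be 'row-major' or 'column-major'")


if __name__ == "__main__":
    # In the inner k loop, B[i,k] varies its column index and C[k,j] its row index.
    for order in ("row-major", "column-major"):
        b = access_stride(order, "column", 100, 100)
        c = access_stride(order, "row", 100, 100)
        print(f"{order}: B stride = {b}, C stride = {c}")
```

Under row-major storage B is the unit-stride operand and C has stride 100; under column-major storage the roles swap.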

32 Vector Stride (cont’d)
    11 12 13 14
    21 22 23 24
    31 32 33 34
    41 42 43 44

33 Cray X-MP Instructions
- Integer addition
    Vi Vj+Vk      Vi = Vj + Vk
    Vi Sj+Vk      Vi = Sj + Vk    (Sj is a scalar)
- Floating-point addition
    Vi Vj+FVk     Vi = Vj + Vk
    Vi Sj+FVk     Vi = Sj + Vk    (Sj is a scalar)

34 Cray X-MP Instructions (cont’d)
- Load instructions
    Vi ,A0,Ak     Vi = M(A0)+Ak
  - Vector load with stride Ak
  - Loads VL elements starting at memory address A0
    Vi ,A0,1      Vi = M(A0)+1
  - Vector load with stride 1
  - Special case

35 Cray X-MP Instructions (cont’d)
- Store instructions
    ,A0,Ak Vi
  - Vector store with stride Ak
  - Stores VL elements starting at memory address A0
    ,A0,1 Vi
  - Vector store with stride 1
  - Special case
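The effect of the strided load and store instructions can be modelled on a flat list. This is only a sketch of the semantics described on the slides: `vector_load` and `vector_store` are hypothetical helper names, and memory is assumed to be word-addressed so that the stride counts elements.

```python
def vector_load(memory, a0: int, ak: int, vl: int):
    """Model of a strided vector load: VL elements starting at A0, stride Ak."""
    return [memory[a0 + i * ak] for i in range(vl)]


def vector_store(memory, a0: int, ak: int, values) -> None:
    """Model of a strided vector store: writes len(values) elements starting
    at A0, stride Ak."""
    for i, value in enumerate(values):
        memory[a0 + i * ak] = value


if __name__ == "__main__":
    memory = list(range(100))              # word-addressed memory (assumption)
    v1 = vector_load(memory, 10, 4, 5)     # elements at addresses 10, 14, 18, 22, 26
    vector_store(memory, 0, 1, v1)         # unit-stride store, the special case
    print(v1, memory[:5])
```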

36 Cray X-MP Instructions (cont’d)
- Logical AND instructions
    Vi Vj&Vk      Vi = Vj & Vk
    Vi Sj&Vk      Vi = Sj & Vk    (Sj is a scalar)
- Shift instructions
    Vi Vj>Ak      Vi = Vj >> Ak
    Vi Vj<Ak      Vi = Vj << Ak
  - Left/right shift each element of Vj and store the result in Vi

37 Sample Vector Functional Units
(Times are in clock cycles)

Vector functional unit   # Stages   Available to chain   Vector results
Integer ADD (64-bit)     3          8                    VL+8
64-bit shift             3          8                    VL+8
128-bit shift            4          9                    VL+9
Floating ADD             6          11                   VL+11
Floating MULTIPLY        7          12                   VL+12

38 X-MP Pipeline Operation
- Three phases
  - Setup phase
    - Sets functional units to perform the appropriate operation
    - Establishes routes to source and destination vector registers
    - Requires 3 clock cycles for all functional units
  - Execution phase
  - Shutdown phase

39 X-MP Pipeline Operation (cont’d)
- Three phases (cont’d)
  - Execution phase
    - Source and destination vector registers are reserved
      - Cannot be used by another instruction
    - Source vector register is reserved for VL+3 clock cycles
      - VL = vector length
    - One pair of operands enters the first stage per clock cycle

40 X-MP Pipeline Operation (cont’d)
- Three phases (cont’d)
  - Shutdown phase
    - Shutdown time = 3 clock cycles
    - Shutdown time is the time difference between
      - when the last result emerges and
      - when the destination vector register becomes available for other instructions

41 X-MP Pipeline Operation (cont’d)
- Three phases (cont’d)
  - Shutdown phase
    - Destination register becomes available after
      $3 + n + (\mathrm{VL} - 1) + 3 = n + \mathrm{VL} + 5$ clock cycles
      - Setup time = shutdown time = 3 clock cycles
      - First result comes after n clock cycles
      - Remaining (VL - 1) results come out at one per clock cycle
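The timing expression above can be checked against the "Vector results" column of the functional-unit table on slide 37. The sketch below is illustrative; the function name `destination_available` and the constants are not from the slides.

```python
SETUP_CYCLES = 3     # setup time, per the slides
SHUTDOWN_CYCLES = 3  # shutdown time, per the slides


def destination_available(n_stages: int, vl: int) -> int:
    """Clock cycles until the destination vector register is free again.

    setup + n cycles for the first result + (VL - 1) remaining results + shutdown.
    """
    if n_stages < 1 or vl < 1:
        raise ValueError("n_stages and vl must be positive")
    return SETUP_CYCLES + n_stages + (vl - 1) + SHUTDOWN_CYCLES


if __name__ == "__main__":
    vl = 64
    for unit, n in (("Integer ADD", 3), ("128-bit shift", 4),
                    ("Floating ADD", 6), ("Floating MULTIPLY", 7)):
        print(f"{unit}: VL + {destination_available(n, vl) - vl} cycles")
```

With n = 3, 4, 6, and 7 stages the result is VL + 8, VL + 9, VL + 11, and VL + 12, which agrees with the table.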

42 A Simple Vector Add Operation
    A1 5
    VL A1
    V1 V2+FV3

43 Overlapped Vector Operations
    A1 5
    VL A1
    V1 V2+FV3
    V4 V5*FV6

44 Chaining Example
    A1 5
    VL A1
    V1 V2+FV3
    V4 V5*FV1

45 Vector Processing Performance

46 Interleaved Memories
- Traditional memory designs
  - Provide sequential, non-overlapped access
  - Use high-order interleaving
- Interleaved memories
  - Facilitate overlapped, pipelined access
  - Used by vector and high-performance systems
  - Use low-order interleaving
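The difference between the two interleaving schemes comes down to which address bits select the bank. The following sketch uses the standard definitions; the function names and the bank sizes in the example are assumptions chosen for illustration.

```python
def low_order_interleave(address: int, n_banks: int):
    """Low-order interleaving: consecutive addresses fall in consecutive banks.

    Returns (bank, offset within bank).
    """
    return address % n_banks, address // n_banks


def high_order_interleave(address: int, n_banks: int, words_per_bank: int):
    """High-order interleaving: each bank holds one contiguous block of addresses.

    Returns (bank, offset within bank).
    """
    if not 0 <= address < n_banks * words_per_bank:
        raise ValueError("address out of range")
    return address // words_per_bank, address % words_per_bank


if __name__ == "__main__":
    # Banks touched by 10 consecutive addresses under each scheme.
    print([low_order_interleave(a, 8)[0] for a in range(10)])
    print([high_order_interleave(a, 8, 1024)[0] for a in range(10)])
```

A unit-stride vector access spreads across all banks under low-order interleaving, which is what makes overlapped, pipelined access possible; under high-order interleaving the same accesses all land in one bank.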

47 Interleaved Memories (cont’d)

48 Interleaved Memories (cont’d)
- Two types of designs
  - Synchronized access organization
    - Upper m bits are given to all memory banks simultaneously
    - Requires output latches
    - Does not efficiently support non-sequential access
  - Independent access organization
    - Supports pipelined access for arbitrary access patterns
    - Requires address registers

49 Interleaved Memories (cont’d)
[Figure: synchronized access organization]

50 Interleaved Memories (cont’d)
[Figure: pipelined transfer of data in interleaved memories]

51 Interleaved Memories (cont’d)
[Figure: independent access organization]

52 Interleaved Memories (cont’d)
- Number of banks B
  - B ≥ M, where M = memory access time in cycles
- Access becomes sequential if stride = B
- Example: B = 8, M = 6 clock cycles, stride = 1
  - Time to read 16 words = 6 + 16 = 22 clock cycles
  - If stride is 8, it takes 16 * 6 = 96 clock cycles
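The two timings in the example can be reproduced with a simple model. This is a sketch under stated assumptions, not the slides' own method: `read_time` is a hypothetical name, only the two extremes from the slide are modelled (fully pipelined access and every access hitting the same bank), and partial bank conflicts are deliberately left out.

```python
from math import gcd


def read_time(words: int, access_time: int, n_banks: int, stride: int) -> int:
    """Clock cycles to read `words` elements from low-order interleaved memory.

    Covers only the two cases worked on the slide:
    - enough distinct banks are visited to hide the access time: M + words
    - every access hits the same bank: words * M
    """
    if min(words, access_time, n_banks, stride) < 1:
        raise ValueError("all arguments must be positive")
    banks_touched = n_banks // gcd(stride, n_banks)
    if banks_touched >= access_time:
        return access_time + words   # one word per cycle after the initial latency
    if banks_touched == 1:
        return words * access_time   # fully serialized on a single bank
    raise NotImplementedError("partial bank conflicts are not modelled in this sketch")


if __name__ == "__main__":
    print(read_time(16, 6, 8, 1))  # unit stride
    print(read_time(16, 6, 8, 8))  # stride equal to the number of banks
```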

