1
Vector Architectures
Sima, Fountain and Kacsuk, Chapter 14
CSE462
2
David Abramson, 2004. Material from Sima, Fountain and Kacsuk, Addison Wesley 1997.

A Generic Vector Machine
• The basic idea in a vector processor is to combine two vectors and produce an output vector.
• If A, B and C are vectors, each of N elements, then a vector processor can perform the following operation:
  – C := B + A
• which is interpreted as:
  – c(i) := b(i) + a(i), 0 ≤ i ≤ N-1
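As a minimal sketch of what C := B + A means element by element, the following Python function spells out the per-element interpretation. The name `vector_add` is illustrative only and does not come from the slides.

```python
def vector_add(a, b):
    """Element-wise sum: c(i) = b(i) + a(i) for 0 <= i <= N-1."""
    if len(a) != len(b):
        raise ValueError("both vectors must have the same length N")
    return [b_i + a_i for a_i, b_i in zip(a, b)]


c = vector_add([1, 2, 3], [10, 20, 30])
print(c)  # [11, 22, 33]
```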
3
A Generic Vector Processor
• The memory subsystem needs to support:
  – 2 reads per cycle
  – 1 write per cycle
[Figure: a multiport memory subsystem feeds Stream A and Stream B into a pipelined adder, which returns Stream C = B + A.]
4
Un-vectorized computation
[Figure: timeline of the steps needed for one result]
• Precompute time t1: compute first address, fetch first data, compute second address, fetch second data
• Compute time t2: compute result
• Postcompute time t3: fetch destination address, store result
Compute time for one result = t1 + t2 + t3
Compute time for N results = N(t1 + t2 + t3)
5
Vectorized computation
[Figure: the same timeline, with only the compute step repeated for each result]
• Precompute time t1: compute first address, fetch first data, compute second address, fetch second data
• Compute time t2 per result: compute result
• Postcompute time t3: fetch destination address, store result
Compute time for N results = t1 + N·t2 + t3
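The two timing formulas can be compared directly. The sketch below evaluates N(t1 + t2 + t3) against t1 + N·t2 + t3. The function names and the sample values for t1, t2 and t3 are assumptions chosen for illustration, not measurements from the slides.

```python
def scalar_time(n, t1, t2, t3):
    """Un-vectorized: every result pays precompute, compute and postcompute."""
    return n * (t1 + t2 + t3)


def vector_time(n, t1, t2, t3):
    """Vectorized: precompute and postcompute are paid once, compute per result."""
    return t1 + n * t2 + t3


def speedup(n, t1, t2, t3):
    return scalar_time(n, t1, t2, t3) / vector_time(n, t1, t2, t3)


# Hypothetical costs in arbitrary time units.
T1, T2, T3 = 4, 1, 2
for n in (1, 8, 64):
    print(n, scalar_time(n, T1, T2, T3), vector_time(n, T1, T2, T3))
```

For a single result the two costs are equal. As N grows the ratio approaches (t1 + t2 + t3) / t2.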
6
Non-pipelined computation
[Figure: timeline of the seven steps (compute first address, fetch first data, compute second address, fetch second data, compute result, fetch destination address, store result), marking the time to first result and the repetition time.]
7
Pipelined computation
[Figure: timeline of the same seven steps, overlapped in a pipeline, marking the time to first result and the repetition time.]
8
Pipelined repetition governed by slowest component
[Figure: pipelined timeline of the same seven steps, marking the time to first result and the repetition time.]
9
Pipelined granularity increased to improve repetition rate
[Figure: pipelined timeline in which the computation step has increased granularity, marking the time to first result and the repetition time.]
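The pipelining slides can be summarised with a small timing model. This is a sketch under stated assumptions: the time to first result is taken as the sum of the stage times, the repetition time as the slowest stage, and "increased granularity" as splitting one stage into equal sub-stages. The stage durations are hypothetical.

```python
def time_to_first_result(stage_times):
    """The first result has to pass through every stage once."""
    return sum(stage_times)


def repetition_time(stage_times):
    """In a pipeline, a new result emerges at the pace of the slowest stage."""
    return max(stage_times)


def total_time(stage_times, n):
    """Time for n results: fill the pipeline, then one result per repetition."""
    return time_to_first_result(stage_times) + (n - 1) * repetition_time(stage_times)


def split_stage(stage_times, index, pieces):
    """Model finer granularity: replace one stage by equal sub-stages."""
    part = stage_times[index] / pieces
    return stage_times[:index] + [part] * pieces + stage_times[index + 1:]


# Seven steps as on the slides; the compute step (index 4) is assumed slowest.
coarse = [1, 1, 1, 1, 4, 1, 1]
fine = split_stage(coarse, 4, 4)

print(total_time(coarse, 100))  # 406
print(total_time(fine, 100))    # 109.0
```

The time to first result is unchanged by the split. Only the repetition time improves.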
10
Vectorizing speeds up computation
[Figure: execution time (ns, 0 to 1500) plotted against number of instructions (0 to 100), comparing scalar performance with vector performance.]
11
Interleaving
• If vector pipelining is to work then it must be possible to fetch instructions from memory quickly.
• There are two main ways of achieving this:
  – Cache memories
  – Interleaving
• A conventional memory consists of a set of storage locations accessed via some sort of address decoder.
12
Interleaving
• The problem with such a scheme is that the memory is busy during the memory access and no other access can proceed.
13
Interleaving
• In an interleaved memory system there are a number of banks.
• Each bank corresponds to a certain range of addresses.
14
Interleaving
• A pipelined machine can be kept fed with instructions even though the main memory may be quite slow.
• An interleaved memory system slows down when subsequent accesses are for the same bank of memory.
• This is rare when prefetching instructions, because they tend to be sequential.
• It is possible to access two locations at the same time if they reside in different banks.
• Banks are usually selected using some of the low order bits of the address, so that sequential accesses go to different banks.
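Bank selection from the low order address bits can be sketched as follows. The bank count of 8 and the helper names are assumptions for illustration. A power-of-two bank count is assumed so that a bit mask selects the bank.

```python
NUM_BANKS = 8  # assumed; must be a power of two for the mask below


def bank_of(address, num_banks=NUM_BANKS):
    """Select the bank from the low order bits of the address."""
    assert num_banks & (num_banks - 1) == 0, "bank count must be a power of two"
    return address & (num_banks - 1)


def offset_in_bank(address, num_banks=NUM_BANKS):
    """The remaining high order bits address a word inside the bank."""
    return address // num_banks


# Sequential addresses rotate through all banks before any bank is reused.
print([bank_of(a) for a in range(10)])  # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
```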
15
Memory Layout
[Figure: eight memory modules (M) connected to a pipelined adder.]
16
Memory Layout of Arrays
[Figure: placement of A[0] to A[7], B[0] to B[7] and C[0] to C[7] across memory modules 0 to 7.]
17
Pipeline Utilisation
[Figure: timing chart over clock periods 0 to 13 with one row per memory module (Memory 0 to Memory 7) and per pipeline stage (Pipeline 0 to Pipeline 3), showing the reads RA0 to RA7 and RB0 to RB7, elements 0 to 7 moving through the four pipeline stages, and the writes W0 to W6.]
18
Memory Contention
[Figure: placement of A[0] to A[7], B[0] to B[7] and C[0] to C[7] across memory modules 0 to 7.]
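One way contention can arise is when both operands of the same element live in the same module. The sketch below is an illustration under assumptions that are not stated on the slide: 8 modules selected by address modulo 8, arrays stored contiguously from a base address, and A[i] and B[i] read in the same clock period. The function name and base addresses are hypothetical.

```python
def operand_conflicts(base_a, base_b, length, num_modules=8):
    """Return the indices i for which A[i] and B[i] map to the same module."""
    return [
        i
        for i in range(length)
        if (base_a + i) % num_modules == (base_b + i) % num_modules
    ]


# B starts a whole number of module rotations after A: every pair collides.
print(operand_conflicts(base_a=0, base_b=8, length=8))   # [0, 1, 2, 3, 4, 5, 6, 7]

# B is offset by two modules: no pair collides.
print(operand_conflicts(base_a=0, base_b=10, length=8))  # []
```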
19
Adding Delay Paths
[Figure: a pipelined adder with streams A, B and C and variable delay elements.]
20
Pipeline with delay
[Figure: timing chart over clock periods 0 to 13 with one row per memory module (Memory 0 to Memory 7) and per pipeline stage (Pipeline 0 to Pipeline 3), showing the reads RA0 to RA7 and RB0 to RB7, the elements moving through the four pipeline stages, and the writes W0 to W4.]
21
CRAY 1 Vector Operations
• Vector facility
• The CRAY compilers are vectorising, and do not require vector notation

        do 10 i = 1,64
     10 x(i) = y(i)

• Scalar arithmetic
  – 64*n instructions, where n is the number of instructions per loop iteration.
• Vector registers can also be sent to the floating point functional units.
22
CRAY Vector Section
23
Increasing complexity in Cray systems
[Figure: number of active circuit elements per processor (axis from 250k to 1.25M) for the CRAY1, XMP4, YMP8 and YMP16.]
24
Chaining
• The CRAY-1 can achieve even faster vector operations by using chaining.
• The result vector is not only sent to the destination vector register, but also directly to another functional unit.
• Data is seen to chain from one functional unit to another, possibly without any intermediate storage.
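The benefit of chaining can be sketched with a simple timing model. This is an illustration under assumptions, not a description of the CRAY-1 timing: each functional unit is taken to have a fixed startup delay and then to deliver one result per clock period, and the startup values used are hypothetical.

```python
def unchained_time(n, startup1, startup2):
    """The second operation starts only after the first has produced all n results."""
    return (startup1 + n) + (startup2 + n)


def chained_time(n, startup1, startup2):
    """Each result is forwarded to the second unit as soon as it is produced."""
    return startup1 + startup2 + n


# Hypothetical startup delays, in clock periods, for a 64-element vector.
print(unchained_time(64, startup1=7, startup2=6))  # 141
print(chained_time(64, startup1=7, startup2=6))    # 77
```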
25
Vector Startup
• Vector instructions may be issued at the rate of 1 instruction per clock period.
  – Providing there is no contention they will be issued at this rate.
• The first result appears after some delay, and then each word of the vector will arrive at the rate of one word per clock period.
• Vectors longer than 64 words are broken into 64 word chunks.

        do 10 i = 1,n
     10 A(i) = B(i)
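Breaking a long vector into 64 word chunks can be sketched as below. The helper names are illustrative, and the order in which a real compiler emits the chunks (for example, where the short remainder chunk goes) is not specified by the slide. This sketch simply places the remainder last.

```python
VECTOR_LENGTH = 64  # maximum words per vector operation, from the slide


def strip_mine(n, max_len=VECTOR_LENGTH):
    """Yield (start, length) pairs that cover indices 0..n-1 in chunks."""
    for start in range(0, n, max_len):
        yield start, min(max_len, n - start)


def vector_copy(b, max_len=VECTOR_LENGTH):
    """A(i) = B(i), performed as one slice copy per chunk."""
    a = [None] * len(b)
    for start, length in strip_mine(len(b), max_len):
        a[start:start + length] = b[start:start + length]
    return a


print(list(strip_mine(150)))  # [(0, 64), (64, 64), (128, 22)]
```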
26
Vector Startup times
• Note the second loop uses data chaining.
• Note the effect of startup time.

Loop Body                       1      10     100    1000   Scalar 1000
A(i) = B(i)                     44     5.8    2.7    2.5    31
A(i) = B(i)*C(i) + D(i)*E(i)    110    16     7.7    7.1    57
27
Effect of Stride on Interleaving
• Most interleaving schemes simply take the bottom bits of the address and use these to select the memory bank.
• This is very good for sequential address patterns (stride 1), and not too bad for random address patterns.
• But for stride n, where n is the number of memory banks, the performance can be extremely bad.

      DO 10 I = 1,128
   10 A(I,1) = 0

      DO 20 J = 1,128
   20 A(1,J) = 0
28
Effect of Stride
• These two code fragments will have quite different performance.
• If we assume the array is arranged in memory row by row,
  – the first loop will access every 128th word in sequential order,
  – the second loop will access 128 contiguous words in sequential order.
• Thus, in loop 1 interleaving will fail if the number of memory banks is a factor of the stride.
[Figure: memory holding A(1,1-128), followed by A(2,1-128), followed by A(3,1-128).]
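The effect of stride on bank usage can be checked numerically. The sketch assumes the bank is the address modulo the number of banks (the low order bits scheme) and uses 8 banks as an example. The function names are illustrative.

```python
from math import gcd


def banks_touched(stride, num_banks):
    """Number of distinct banks visited by a long constant-stride access."""
    return num_banks // gcd(stride, num_banks)


def distinct_banks(stride, num_banks, count):
    """The same quantity by direct simulation of `count` accesses."""
    return len({(i * stride) % num_banks for i in range(count)})


print(banks_touched(1, 8))    # 8: stride 1 uses every bank
print(banks_touched(128, 8))  # 1: stride 128 hits the same bank every time
print(banks_touched(6, 8))    # 4: a stride sharing a factor with 8 uses half the banks
```

When the bank count divides the stride, every access lands in one bank, so interleaving gives no benefit.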
29
Effect of Stride
• Many research papers address how to improve the performance of stride m access on an n way interleaved memory system.
• Two main approaches:
  – Arrange the data to match the stride (S/W)
  – Make the hardware insensitive to the stride (H/W)
30
Memory Layout for stride free access
• Consider the layout of an 8 x 8 matrix.
• It can be placed in memory in two possible ways:
  – by row order or column order.
• If we know that a particular program requires only row or column order,
  – it is possible to arrange the matrix so that conflict free access can always be guaranteed.
31
Memory layout for stride free access
• Skew the matrix:
  – each row starts in a different memory unit.
  – It is then possible to access the matrix by row order or column order without memory contention.
• This requires a different address calculation.
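A skewed layout can be sketched for the 8 x 8 case. The slide says only that each row starts in a different memory unit. The specific rule used here, module = (row + column) mod 8, is one common choice and is an assumption, as are the function names.

```python
M = 8  # number of memory modules, matching the 8 x 8 matrix


def plain_module(row, col, m=M):
    """Row-order storage with low-order-bit interleaving: module = column."""
    return (row * m + col) % m


def skewed_module(row, col, m=M):
    """Skewed storage: each row is rotated one module further than the last."""
    return (row + col) % m


# Without skewing, a whole column sits in a single module.
print(len({plain_module(r, 3) for r in range(M)}))   # 1

# With skewing, both a row and a column spread over all eight modules.
print(len({skewed_module(2, c) for c in range(M)}))  # 8
print(len({skewed_module(r, 3) for r in range(M)}))  # 8
```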
32
Address Modification in Hardware
• It is possible to use a different function to compute the module number.
• If the address is passed to an arbitrary address computation function which emits a module number,
  – it can produce stride free access for many different strides.
• There are schemes which give optimal packing and do not waste any space.
33
Other Typical Access Patterns
• Unfortunately, row and column access order is not the only requirement.
• Other common patterns include:
  – Matrix diagonals
  – Square subarrays
34
Diagonal Access
• To access a diagonal,
  – the stride is equal to column stride + 1.
• If M, the number of modules, is equal to a power of 2,
  – column stride and column stride + 1 cannot both be efficient,
  – because they cannot both be relatively prime to M: one of two consecutive integers is always even.
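The claim about power-of-two module counts can be checked exhaustively for small cases. In this sketch a stride is treated as efficient when it is relatively prime to the number of modules. The function names are illustrative, and the 7 module case is included only for contrast and does not come from the slides.

```python
from math import gcd


def conflict_free(stride, num_modules):
    """A stride visits every module in turn when gcd(stride, modules) == 1."""
    return gcd(stride, num_modules) == 1


def both_conflict_free(column_stride, num_modules):
    """True if column access and diagonal access are both conflict free."""
    return conflict_free(column_stride, num_modules) and conflict_free(
        column_stride + 1, num_modules
    )


# With 8 modules, no column stride makes both patterns conflict free.
print(any(both_conflict_free(s, 8) for s in range(1, 65)))  # False

# With a hypothetical odd module count the restriction disappears.
print(both_conflict_free(8, 7))  # True
```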
35
Vector Algorithms
• Consider the solution of the linear equations given by:
  – Ax = b, where A is an N x N matrix and x and b are N x 1 column vectors.
• Gaussian Elimination is an efficient algorithm for producing lower and upper triangular matrices L and U, such that
  – A = LU
36
Gaussian Elimination
• Given L and U it is possible to write: Ly = b and Ux = y
• Using forward substitution on Ly = b and then back substitution on Ux = y, it is possible to solve for x.
[Figure: the lower triangular system L y = b and the upper triangular system U x = y, each drawn with its triangle of zeros.]
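Solving the two triangular systems can be sketched in plain Python. Ly = b is solved from the top down (forward substitution) and Ux = y from the bottom up (back substitution). The function names and the 2 x 2 example are illustrative and are not taken from the slides.

```python
def forward_substitution(L, b):
    """Solve L y = b for a lower triangular L, working from the first row down."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / L[i][i]
    return y


def back_substitution(U, y):
    """Solve U x = y for an upper triangular U, working from the last row up."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x


# A = L U = [[4, 2], [2, 4]]; solve A x = b with b = [10, 14].
L = [[1.0, 0.0], [0.5, 1.0]]
U = [[4.0, 2.0], [0.0, 3.0]]
y = forward_substitution(L, [10.0, 14.0])
x = back_substitution(U, y)
print(x)  # [1.0, 3.0]
```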
37
A Vector Gaussian elimination

for i := 1 to N do
begin
  imax := index of Max(abs(A[i..N,i]));
  Swap(A[i,i..N], A[imax,i..N]);
  if A[i,i] = 0 then Singular Matrix;
  A[i+1..N,i] := A[i+1..N,i]/A[i,i];
  for k := i+1 to N do
    A[k,i+1..N] := A[k,i+1..N] - A[k,i]*A[i,i+1..N];
end;
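A runnable rendering of the vector algorithm, as a sketch in plain Python with 0-based indices. Each list slice stands in for one vector operation. Two details are assumptions rather than part of the slide: whole rows are swapped (the slide swaps only columns i..N), and the row permutation is returned so the result can be checked against the original matrix.

```python
def lu_decompose(a):
    """LU factorisation with partial pivoting.

    Returns (lu, perm): lu holds U on and above the diagonal and the
    multipliers of L below it; perm[i] is the original index of row i.
    """
    n = len(a)
    lu = [row[:] for row in a]  # work on a copy
    perm = list(range(n))
    for i in range(n):
        # Index of the largest |A[k][i]| for k = i..n-1 (the MAX operation).
        imax = max(range(i, n), key=lambda k: abs(lu[k][i]))
        if lu[imax][i] == 0:
            raise ValueError("singular matrix")
        lu[i], lu[imax] = lu[imax], lu[i]
        perm[i], perm[imax] = perm[imax], perm[i]
        pivot = lu[i][i]
        for k in range(i + 1, n):
            lu[k][i] /= pivot  # new column of L
            factor = lu[k][i]
            # Vector operation: row k minus a scalar times row i.
            lu[k][i + 1:] = [
                x - factor * y for x, y in zip(lu[k][i + 1:], lu[i][i + 1:])
            ]
    return lu, perm


lu, perm = lu_decompose([[2, 1], [4, 3]])
print(lu)    # [[4, 3], [0.5, -0.5]]
print(perm)  # [1, 0]
```

The vectors handled in the inner loop shrink by one element on each pass of the outer loop.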
38
Picture of algorithm
[Figure: the matrix partitioned into regions labelled U, L, P, U', L' and A.]
• The algorithm produces a new row of U and column of L each iteration.
39
Aspects of the algorithm
• The algorithm as expressed accesses both rows and columns.
• The majority of the vector operations have either two vector operands, or a scalar and a vector operand, and they produce a vector result.
• The MAX operation on a vector returns the index of the maximum element, not the value of the maximum element.
• The length of the vector items accessed decreases by 1 for each successive iteration.
40
Comments on Aspects
• Row and column access should be equally efficient (see the discussion on vector stride).
• The vector pipeline should handle a scalar on one input.
• MIN, MAX and SUM operators are required which accept one or more vectors and return a scalar.
• Vectors may start large, but get smaller. This may affect code generation due to vector startup cost.
41
Sparse Matrices
• In many engineering computations, there may be very large matrices.
  – They may also be sparsely occupied,
  – containing many zero elements.
• Problems with the normal method of storing the matrix:
  – Occupies far too much memory
  – Very slow to process
42
Sparse Matrices...
Consider the following example: x 230...
43
Sparse Matrices...
• Many packages solve the problem by using a software data representation.
  – The trick is not to store the elements which have zero values.
• At least two sorts of access mechanism:
  – Random
  – Row/Column sequential.
44
Sparse Matrices...
• Matrix multiplication then consists of moving down the row/column pointers and multiplying elements if they have the same index values.
• Hardware can implement this type of data representation.
[Figure: row pointers and column pointers.]
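Moving down the row and column pointers can be sketched as a merge over two sorted index lists. The representation used here, a list of (index, value) pairs holding only the nonzero elements, is an assumption chosen for illustration. The slides do not specify the exact data structure.

```python
def to_sparse(dense):
    """Keep only the nonzero elements, as (index, value) pairs in index order."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]


def sparse_dot(row, col):
    """Inner product of a sparse row and a sparse column.

    Advance two pointers and multiply only where the indices match.
    """
    i = j = 0
    total = 0
    while i < len(row) and j < len(col):
        row_index, row_value = row[i]
        col_index, col_value = col[j]
        if row_index == col_index:
            total += row_value * col_value
            i += 1
            j += 1
        elif row_index < col_index:
            i += 1
        else:
            j += 1
    return total


row = to_sparse([0, 2, 0, 3, 0])  # [(1, 2), (3, 3)]
col = to_sparse([5, 0, 0, 4, 1])  # [(0, 5), (3, 4), (4, 1)]
print(sparse_dot(row, col))       # 12
```

One such inner product per (row, column) pair gives one element of the matrix product.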
45
Figure 14.2 The design space for floating-point precision
[Figure: tree rooted at floating-point representation, split into 64-bit and 32-bit branches; node labels include big-endian, little-endian, Cray, IBM, IEEE and VAX.]
46
Figure 14.3 The design space for integer precision
[Figure: tree rooted at integer precision, split into 128 bit, 64 bit and 32 bit branches, each of which is either segmented or fixed.]
47
Figure 14.8 Parallel computation of floating-point and integer results
[Figure: a program is divided into a floating-point segment and an integer segment; each segment has its own data buffer and instruction buffer feeding a floating-point unit or an integer unit, and both units deliver to a result unit.]
48
Figure 14.9 Mixed function and data parallelism
[Figure: a program is divided into a floating-point segment and an integer segment; data buffers 1 to 4 and control buffers 1 to 4 feed F-P units 1 to 4, which deliver to a result unit.]
49
Figure 14.10 The design space for parallel computational functionality
[Figure: tree rooted at computational complexity, split into single unit and multiple unit branches, with integer and floating-point sub-branches.]
50
Figure 14.11 Communication between CPUs and memory
[Figure: the CPUs (labels run from CPU0 to CPU15) connect to shared registers and a real-time clock, to the I/O subsystem, and to central memory, which comprises 8 sections, 64 sub-sections, 1024 banks and 1024 Mwords.]
51
Figure 14.13 The overall architecture of the Convex C4/XA system
[Figure: a processing subsystem (CPUs and communication registers), a memory subsystem and an I/O subsystem (I/O processor) joined by a crossbar switch. Each port is 1.1 Gb/s; a single crossbar port is 1.1 Gb/s.]
52
Figure 14.14 The configuration of the crossbar switch
[Figure: CPU0 to CPU3, the SIA and the SCU connect through a non-blocking switch to memories 0 to 3. Maximum bandwidth is 4.4 Gb/s; all data paths are 64-bit.]
53
Figure 14.15 The processor configuration
[Figure: block diagram of the processor.]
• Scalar unit: 32 address registers, 28 scalar registers, 256 kb instruction cache, 16 kb data cache, 32 KE AT cache
• Vector unit: four Vreg files, vector crossbar, two adder/multiply units, two logical processors, divider, square root
• Memory interface