Topic 2: Vector Processing and Vector Architectures 4/18/2019 \course\652-04F\Topic2-652.ppt
Reading List
- Slides: Topic2x
- Hennessy & Patterson: Appendix G
- Other assigned readings from homework and classes

[Figure: The basic structure of a vector-register architecture. Labeled blocks: main memory, vector load-store, FP add/subtract, FP multiply, FP divide, integer, logical, registers.]

Main Advantages of Vector Architectures
- The computation of each result is independent of the computation of previous results, allowing a very deep pipeline without generating any data hazards.
- A single vector instruction specifies a great deal of work: it is equivalent to executing an entire loop.
- Vector instructions that access memory have a known access pattern.
- Because an entire loop is replaced by a vector instruction, control hazards that would normally arise from the loop branch are nonexistent.

An Example of Vector Code for DAXPY
Y := a * X + Y
Here is the DLX code:

          LD    F0, a
          ADDI  R4, Rx, #512   ; last address to load
    Loop: LD    F2, 0(Rx)      ; load X(i)
          MULTD F2, F0, F2     ; a x X(i)
          LD    F4, 0(Ry)      ; load Y(i)
          ADDD  F4, F2, F4     ; a x X(i) + Y(i)
          SD    F4, 0(Ry)      ; store into Y(i)
          ADDI  Rx, Rx, #8     ; increment index to X
          ADDI  Ry, Ry, #8     ; increment index to Y
          SUB   R20, R4, Rx    ; compute bound
          BNZ   R20, Loop      ; check if done

Here is the code for DLXV for DAXPY:

    LD     F0, a        ; load scalar a
    LV     V1, Rx       ; load vector X
    MULTSV V2, F0, V1   ; vector-scalar multiply
    LV     V3, Ry       ; load vector Y
    ADDV   V4, V2, V3   ; add
    SV     Ry, V4       ; store the result

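To make the "one vector instruction replaces an entire loop" point concrete, the dynamic instruction counts of the two listings can be compared. This is a minimal sketch: the function names are illustrative, and it assumes 64 elements (implied by the `#512` bound with 8-byte elements) and a hardware vector length of at least 64, so that one DLXV sequence covers the whole vector.

```python
# Hypothetical helpers for counting executed instructions in the two DAXPY listings.

def dlx_scalar_instruction_count(n, setup=2, loop_body=9):
    """Setup (LD, ADDI) plus 9 instructions per loop iteration."""
    return setup + loop_body * n

def dlxv_vector_instruction_count():
    """LD, LV, MULTSV, LV, ADDV, SV: one pass, assuming n <= max vector length."""
    return 6

n = 64  # 512 bytes / 8 bytes per double
scalar_count = dlx_scalar_instruction_count(n)
vector_count = dlxv_vector_instruction_count()
print(scalar_count, vector_count)  # 578 6
```

The count says nothing about execution time by itself; it only shows how much instruction fetch, decode, and loop-branch work disappears in the vector version.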
Two Issues
- Vector-length control
- Vector stride

Issue 1: Vector Length Control
An example:

          do 10 i = 1, n
    10    Y(i) = a * X(i) + Y(i)

Question: assume the maximum hardware vector length is MVL, which may be less than n. How should we do the above computation?

Issue 1: Vector Length Control (Strip-Mining)
An example:

          do 10 i = 1, n
    10    Y(i) = a * X(i) + Y(i)

Strip-mining:

          low = 1
          VL = (n mod MVL)              /*find the odd size piece*/
          do 1 j = 0, (n / MVL)         /*outer loop*/
             do 10 i = low, low+VL-1    /*runs for length VL*/
                Y(i) = a*X(i) + Y(i)    /*main operation*/
    10       continue
             low = low+VL               /*start of next vector*/
             VL = MVL                   /*reset the length to max*/
    1     continue

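The strip-mined loop can be sketched in Python to check that the strips cover every element exactly once. This is an illustrative model only: `strip_mined_daxpy` is a hypothetical name, indices are 0-based rather than 1-based, and each slice update stands in for one vector instruction of length VL.

```python
def strip_mined_daxpy(a, x, y, mvl):
    """Update y in place as y = a*x + y, one strip of at most mvl elements at a time.

    Returns the (start, length) of each strip, mirroring the low/VL variables.
    """
    n = len(x)
    low = 0
    vl = n % mvl                      # the odd-size piece comes first
    strips = []
    for _ in range(n // mvl + 1):     # outer loop: j = 0 .. n/MVL
        # One "vector operation" of length vl (may be empty when n is a multiple of mvl).
        y[low:low + vl] = [a * xi + yi
                           for xi, yi in zip(x[low:low + vl], y[low:low + vl])]
        strips.append((low, vl))
        low += vl                     # start of next vector
        vl = mvl                      # reset the length to max
    return strips

x = [float(i) for i in range(10)]
y = [1.0] * 10
strips = strip_mined_daxpy(2.0, x, y, mvl=4)
print(strips)  # [(0, 2), (2, 4), (6, 4)]
print(y)
```

Note that when n is an exact multiple of MVL the first strip has length zero, which is also what the loop on the slide does.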
Example cont'd
A vector of arbitrary length processed with strip mining:

    Value of j    Range of i
    0             1 .. m
    1             (m+1) .. m+MVL
    2             (m+MVL+1) .. m+2*MVL
    3             (m+2*MVL+1) .. m+3*MVL
    ...           ...
    n/MVL         (n-MVL+1) .. n

where m = n mod MVL

Issue 2: Vector Stride

          do 10 i = 1,100
             do 10 j = 1,100
                C(i, j) = 0.0
                do 10 k = 1,100
    10             C(i, j) = C(i, j) + A(i, k) * B(k, j)

Question 1: How should we vectorize the code?
Question 2: How does the "stride" work here?

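One way to approach the stride question is to compute the distance in memory between consecutive elements touched by the inner k loop. The sketch below is an illustration, not part of the slides: the helper names are hypothetical, elements are assumed to be 8-byte doubles, and both storage orders are shown because the answer depends on the layout (Fortran stores arrays column by column, C row by row).

```python
def element_offset(i, j, rows, cols, order):
    """Offset in elements of the 1-based entry (i, j) from the start of the array."""
    if order == "column":                 # Fortran: consecutive rows are adjacent
        return (i - 1) + (j - 1) * rows
    return (i - 1) * cols + (j - 1)       # row-major (C): consecutive columns are adjacent

def stride_over_k(array, order, n=100, elem_bytes=8):
    """Stride (elements, bytes) between successive inner-loop accesses."""
    if array == "A":                      # A(i, k): k is the column index
        step = element_offset(1, 2, n, n, order) - element_offset(1, 1, n, n, order)
    else:                                 # B(k, j): k is the row index
        step = element_offset(2, 1, n, n, order) - element_offset(1, 1, n, n, order)
    return step, step * elem_bytes

for order in ("column", "row"):
    print(order, "A:", stride_over_k("A", order), "B:", stride_over_k("B", order))
# column A: (100, 800) B: (1, 8)
# row A: (1, 8) B: (100, 800)
```

In either layout one of the two operands is accessed with a non-unit stride of 100 elements, which is why vector loads and stores need a stride operand.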
The Cray Supercomputer - A Case Study
Cray-1: “The World's Most Expensive Love Seat...”
[Figure: The Cray-1 Architecture. Block diagram with these labeled parts: memory; instruction buffers (00 through 77) with NIP, LIP, CIP and P; execution; vector registers V0-V7 with VL and VM control; scalar registers S0-S7 with T00 through T77; address registers A0-A7 with B00 through B77; vector functional units (shift, logical, add); floating-point functional units (add, multiply, reciprocal); scalar functional units; address functional units; RTC; XA and exchange; I/O.]

Architecture Features
- 16 memory banks (64K words/bank)
- 12 I/O subsystem channels
- 12 function units
- 4 Kbytes of high-speed registers

Register-to-Register Architecture (like the CDC 7600/6600)
- ALU operands must be in registers
- Register allocation techniques (as in the 6600/7600) are used
- Transfers between registers and memory are treated separately from ALU operations
- Registers are specialized by function, which avoids conflicts
Question: Can you find the RISC idea here?

Effective use of the Cray-1 requires careful planning to exploit its register resources.
Registers
- Access time: registers 1 ~ 2 cycles; memory 11 cycles
- A - address registers:       8 x 24 bits
- S - scalar registers:        8 x 64 bits
- V - vector registers:        8 x 64 words
- B - address save registers:  64 x 24 bits
- T - scalar save registers:   64 x 64 bits
- VL: vector-length register, 0 <= VL <= 64
- VM: vector-mask register, 64 bits
- (A, S, V: directly available to the function units; B, T: to memory)
- Total: 4,888 bytes of (6 ns) storage

Instruction Format
- 16 ~ 32 bit instructions
- 3 types of vector operation

Cray-1 functional pipeline units

    Functional pipelines                     Register usage   Pipeline delays (clock periods)
    Address functional units
      Address add unit                       A                2
      Address multiply unit                  A                6
    Scalar functional units
      Scalar add unit                        S                3
      Scalar shift unit                      S                2 or 3
      Scalar logical unit                    S                1
      Population/leading zero count unit     S                3
    Vector functional units
      Vector add unit                        V or S           3
      Vector shift unit                      V or S           4
      Vector logical unit                    V or S           2
    Floating-point functional units
      Floating-point add unit                S and V          6
      Floating-point multiply unit           S and V          7
      Reciprocal approximation unit          S and V          14

Vector Unit
- V-add unit
- V-logical unit
  goal: Boolean operation on the corresponding bits of two 64-bit operands (AND, XOR, OR, merge, mask generation)
- V-shift unit

Instruction Set
- 128 instructions
- 10 vector types
- 13 scalar types
- 3-address format

Implementation Philosophy
- Instruction processing
- Memory hierarchy
- Register and function unit reservation
- Vector processing

Instruction Processing: “...Issue one instruction per cycle ...”
Instruction Buffers
- 4 x 64 words
- 16 - 32 bit instructions
- instruction parcel prefetch
- branch in buffer
- 4 inst/cycle fetched to LRU I-buffer

Register and F-unit Reservation
Note: when V = V1 op V2 is issued, V1, V2, and the op-unit are marked busy (like the CDC 6600).

Example 1 (independent)
    a) V1 = V2 * V3
    b) V4 = V5 + V6
    a) and b) are totally independent.

Example 2 (R-conflict)
    a) V1 = V2 * V3
    b) V4 = V5 + V2
    b) will be issued immediately when V2 is released (predictable).

Example 3 (F-unit conflict)
    a) V1 = V2 * V3
    b) V4 = V5 * V6
    Similar to Example 2.

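The three examples can be checked with a small reservation model. This is a minimal sketch under stated assumptions, not the Cray-1 issue logic: the class and function names are hypothetical, only one in-flight instruction is considered, and reserving the destination register alongside the two operand registers is an assumption of the sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VectorInstr:
    dest: str
    src1: str
    src2: str
    unit: str   # functional unit name, e.g. "multiply" or "add"

def issue_conflict(in_flight, nxt):
    """Return why `nxt` must wait behind `in_flight`, or None if it can issue."""
    busy_regs = {in_flight.dest, in_flight.src1, in_flight.src2}
    if {nxt.dest, nxt.src1, nxt.src2} & busy_regs:
        return "register"          # Example 2: a register is still reserved
    if nxt.unit == in_flight.unit:
        return "unit"              # Example 3: the functional unit is still reserved
    return None                    # Example 1: fully independent

a = VectorInstr("V1", "V2", "V3", "multiply")
ex1 = issue_conflict(a, VectorInstr("V4", "V5", "V6", "add"))
ex2 = issue_conflict(a, VectorInstr("V4", "V5", "V2", "add"))
ex3 = issue_conflict(a, VectorInstr("V4", "V5", "V6", "multiply"))
print(ex1, ex2, ex3)  # None register unit
```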
Four Types of Vector Instructions in the Cray-1
[Figure: (a) Type 1 vector instruction: vector registers Vj and Vk, elements 1..n, result in Vi. (b) Type 2 vector instruction: scalar register Sj and vector register Vk, elements 1..n, result in Vi. (c) Type 3 vector instruction and (d) Type 4 vector instruction: transfers involving memory and a vector register (Vi, Vj).]

Vector Loops
- Long vectors with N > 64 are sectioned
- Each time through a vector loop, 64 elements are processed
- Remainder handling is “transparent” to the programmer

Chaining
- Builds on the internal forwarding techniques of the 360/91
- A “linking process” that occurs when results obtained from one pipeline unit are directly fed into the operand registers of another function pipe
- Chaining allows operations to be issued as soon as the first result becomes available
- Registers and F-units must be properly reserved

Example

    V0 <- Mem            (M-fetch)
    V2 <- V0 + V1        (V-add)
    V3 <- V2 < A3        (left shift)
    V5 <- V3 AND V4      (logical product)

[Figure: A pipeline chaining example in Cray-1. The memory fetch pipe, vector add, shift (labeled "Right shift" in the figure), and logical product pipes are linked through vector registers V0-V5.]

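The benefit of chaining in this example can be estimated with a simplified timing model. The sketch below is an assumption-laden illustration rather than a Cray-1 timing analysis: it takes the pipeline delays from the functional-unit table (add 3, shift 4, logical 2), uses the 11-cycle memory figure from the Registers slide as the fetch delay, and ignores register transit time, issue slots, and memory bank conflicts. The function names are hypothetical.

```python
def unchained_cycles(startups, n):
    """Each operation starts only after the previous one has produced all n results."""
    return sum(delay + n for delay in startups)

def chained_cycles(startups, n):
    """Each operation starts as soon as the first result of its predecessor appears."""
    return sum(startups) + n

# Assumed startup delays (clock periods) for the four chained operations.
startups = {
    "memory fetch": 11,
    "vector add": 3,
    "vector shift": 4,
    "vector logical": 2,
}
n = 64  # one full vector register
no_chain = unchained_cycles(startups.values(), n)
with_chain = chained_cycles(startups.values(), n)
print(no_chain, with_chain)  # 276 84
```

Under these assumptions the chained sequence pays the vector length once instead of four times, so the gain grows with both the chain length and the vector length.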
[Figure: Multipipeline chaining on Cray 1 and Cray X-MP for executing the SAXPY code: Y(1:N) = S x X(1:N) + Y(1:N). One panel shows limited chaining using only one memory-access pipe in the Cray 1; the other shows complete chaining using three memory-access pipes (read port 1, read port 2, write port) in the Cray X-MP. Labeled stages: Load X, Load Y, S* (multiply pipe), Vadd (add pipe), Store Y.]

Chaining Length?
- Limited by the number of vector registers / function units
- 2-5

Cray Performance
- 3 ~ 160 MFLOPs (applications, programming skills)
- Scalar performance for matrix *: 12 MFLOPs
- Vector dot-product: 22 MFLOPs
- 153 (peak)
- 20-80 MFLOPs

Cray X-MP
- Multiprocessor/multiprocessing
- 8 times the Cray-1 memory bandwidth
- Clock: 9.5 ns - 8.5 ns
- Scalar speedup: 1.25 ~ 2.5
- Throughput: 2.5 ~ 5 times

Irregular Vector Operations
- scatter: X[A[i]] = B[i]
- gather: X[i] = B[C[i]]
- an example:

          do 10 i = 1, N
    10    A(index(i)) = Y(i)

- before 1984, no special instructions; poor performance: 2.5 MFLOPs
- compress: compress a vector according to VM

Gather Operation
Gather instruction: V1[i] = A[V0[i]]

    VL register: 4
    A0:          100
    V0 register: 4, 2, 7, 0   (last index inferred from the result)

    Memory address:   100  101  102  103  104  105  106  107  108
    Memory contents:  200  300  400  500  600  700  100  250  350

(a-1) Before the gather: V1 register = ?
(a-2) After the gather:  V1 register = 600, 400, 250, 200

Example: V1[1] = A[V0[1]] = A[4] = 600

Scatter Operation
Scatter instruction: A[V1[i]] = V0[i]

    VL register: 4
    A0:          100
    V0 register: 200, 300, 400, 500
    V1 register: 4, 2, 7, 0   (last index inferred from the result)

(b-1) Before the scatter: memory contents at addresses 100-107, 110 are unspecified (?)
(b-2) After the scatter:

    Memory address:   100  101  102  103  104  105  106  107  110
    Memory contents:  500  x    300  x    200  x    x    400  x

Example: A[V1[3]] = V0[3], so A[7] = V0[3] = 400

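The gather and scatter slides can be replayed in a few lines of Python. This is only a sketch: memory is modeled as a dictionary from address to contents, `gather` and `scatter` are hypothetical helper names, addresses are treated as decimal, and the fourth index (0) is an assumption inferred from the results shown on the slides.

```python
def gather(memory, base, index_vec, vl):
    """V1[i] = A[V0[i]]: load vl elements from base + index."""
    return [memory[base + index_vec[i]] for i in range(vl)]

def scatter(memory, base, index_vec, data_vec, vl):
    """A[V1[i]] = V0[i]: store vl elements to base + index."""
    for i in range(vl):
        memory[base + index_vec[i]] = data_vec[i]

A0, VL = 100, 4
indices = [4, 2, 7, 0]   # last entry assumed, see lead-in

# Gather: memory contents from the gather slide.
mem = dict(zip(range(100, 109), [200, 300, 400, 500, 600, 700, 100, 250, 350]))
v1 = gather(mem, A0, indices, VL)
print(v1)  # [600, 400, 250, 200]

# Scatter: data vector from the scatter slide, written into initially empty memory.
out = {}
scatter(out, A0, indices, [200, 300, 400, 500], VL)
print(sorted(out.items()))  # [(100, 500), (102, 300), (104, 200), (107, 400)]
```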
Vector Compression Operation
V1 = Compress(V0, VM, Z)

(c-1) Before compression
    V0 register (tested): - 1 5 - 15 24 - 7 13 - 17 14
    VM register:          010110011101...

(c-2) After compression
    V1 register:          01 03 04 07 10 11

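The V1 entries on the slide are consistent with V1 holding the positions of the 1 bits in VM. The sketch below illustrates that reading and the related idea of packing the selected elements together. It rests on assumptions: positions are counted from 0 at the leftmost mask bit, the V1 entries are read as octal, and the function names and sample data are hypothetical.

```python
def compressed_indices(mask_bits):
    """Positions (0-based, left to right) of the 1 bits in a mask string."""
    return [i for i, bit in enumerate(mask_bits) if bit == "1"]

def compress(values, mask_bits):
    """Pack the elements whose mask bit is 1 into a dense vector."""
    return [v for v, bit in zip(values, mask_bits) if bit == "1"]

vm = "010110011101"                       # mask prefix shown on the slide
idx = compressed_indices(vm)
idx_octal = [format(i, "02o") for i in idx]
print(idx)        # [1, 3, 4, 7, 8, 9, 11]
print(idx_octal)  # ['01', '03', '04', '07', '10', '11', '13']

sample = list(range(100, 112))            # illustrative data, not from the slide
packed = compress(sample, vm)
print(packed)     # [101, 103, 104, 107, 108, 109, 111]
```

The first six octal indices match the six V1 entries listed on the slide; the slide's mask continues past the twelve bits shown, so the full V1 contents cannot be recovered from it.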
The Role of Vectorizing Compilers
- What does a vectorizing compiler do?
- How well does it do?

Characteristics of Several Vector-register Architectures
The VMIPS vector instructions