Presentation is loading. Please wait.

Presentation is loading. Please wait.

Topic 2: Vector Processing and Vector Architectures

Similar presentations


Presentation on theme: "Topic 2: Vector Processing and Vector Architectures"— Presentation transcript:

1 Topic 2: Vector Processing and Vector Architectures
4/18/2019 \course\652-04F\Topic2-652.ppt

2 Reading List Slides: Topic2x Henn&Patt: Appendix G
Other assigned readings from homework and classes 4/18/2019 \course\652-04F\Topic2-652.ppt

3 The basic structure of a vector-register architecture
Main memory Vector load-store FP add/subtract Logical FP multiply FP divide Integer registers The basic structure of a vector-register architecture 4/18/2019 \course\652-04F\Topic2-652.ppt

4 Main Advantage of Vector Architectures
The computation of each result is independent of the computation of previous results, allowing a very deep pipeline without generating any data hazards. A single vector instruction specifies a great deal of work - it is equivalent to executing an entire loop. Vector instructions that access memory have a known access pattern Because an entire loop is replaced by a vector instruction control hazards that would normally arise from the loop branch are nonexistent. 4/18/2019 \course\652-04F\Topic2-652.ppt

5 An Example of Vector Code for DAXPY Y:= a * X + Y
Here is the DLX code: LD F0, a ADDI R4, Rx, #512 ; last address to load Loop: LD F2, 0(Rx) ; load X(i) MULTD F2, F0, F2 ; a x X(i) LD F4, 0 (Ry) ; load Y(i) ADDD F4, F2, F4 ; a x X(i) + Y(i) SD F4, 0 (Ry) ; store into Y(i) ADDI Rx, Rx, #8 ; increment index to X ADDI Ry, Ry, #8 ; increment index to Y SUB R20, R4, Rx ; compute bound BNZ R20, loop ; check if done 4/18/2019 \course\652-04F\Topic2-652.ppt

6 Here is the code for DLXV for DAXPY.
LD F0, a ; load scalar a LV V1, Rx ; load vector X MULTSV V2, F0, V1 ; vector-scalar multiply LV V3, Ry ; load vector Y ADDV V4, V2, V3 ; add SV Ry, V4 ; store the result 4/18/2019 \course\652-04F\Topic2-652.ppt

7 4/18/2019 \course\652-04F\Topic2-652.ppt

8 Two Issues Vector-length control Vector stride 4/18/2019
\course\652-04F\Topic2-652.ppt

9 Issue 1: Vector Length Control
An Example: do 10 i = 1, n 10 Y(i) = a * X(i) + Y(i) Question: assume the maximum hardware vector length is MVL which may be less than n. How should we do the above computation ? 4/18/2019 \course\652-04F\Topic2-652.ppt

10 Issue 1: Vector Length Control ---
Strip-Minning An Example: do 10 i = 1, n 10 Y(i) = a * X(i) + Y(i) Strip-mining: low = 1 VL = (n mod MVL) /*find the odd size piece*/ do 1 j = 0, (n / MVL) /*outer loop*/ do 10 i = low, low+VL /*runs for length VL*/ Y(i) = a*X(i) + Y(i) /*main operation*/ continue low = low+VL /*start of next vector*/ VL = MVL /*reset the length to max*/ 1 continue 4/18/2019 \course\652-04F\Topic2-652.ppt

11 Example con’d A vector of arbitrary length processed with strip mining
Value of j … … n/MVL Range of i 1..m (m+1) (m (m+2 * … … n-MVL m+MVL MVL+1) MVL+1) )..n ..m+2 * m+3 * MVL MVL A vector of arbitrary length processed with strip mining m = n mod MVL 4/18/2019 \course\652-04F\Topic2-652.ppt

12 Issue 2: Vector Stride Question 1: How should we vectorize the code ?
do 10 i = 1,100 do 10 j = 1,100 C(i, j) = 0.0 do 10 k = 1,100 10 C(i, j) = C(i, j) + A(i, k) * B(k, j) Question 1: How should we vectorize the code ? Question 2: How is the ”stride” working here ? 4/18/2019 \course\652-04F\Topic2-652.ppt

13 The Cray Supercomputer - A Case Study
4/18/2019 \course\652-04F\Topic2-652.ppt

14 Cray-1 “The World Most Expensive Love Seat...”
4/18/2019 \course\652-04F\Topic2-652.ppt

15 The Cray-1 Architecture
V7 V6 V5 V4 Vector registers V3 V2 V1 V0 Shift Logical Add Vector functional units Recip. Multiply Floating- point Scalar Pop/17 Addr Function VM RTC S7 S6 S5 S4 S3 S2 S1 S0 A7 A6 A5 A4 A3 T00 through T77 (AO) ((Ahj + jkm) A2 A1 A0 CA CU B00 B77 P 3 2 1 ((AO) + (Ak)) 00 77 Vj Vk Vl Sk Sl Sj VL control XA Exchange Aj Ak Al 318 NIP LIP CIP ((Ah) + jkm) I/O Tjk Ai -17 Instruction buffers Execution Memory Scalar reg Addr. regiter The Cray-1 Architecture 4/18/2019 \course\652-04F\Topic2-652.ppt

16 Architecture Features
16 memory banks (64k word/bank) 12 I/O subsystem channels 12 function units 4 Kbytes of high speed regs 4/18/2019 \course\652-04F\Topic2-652.ppt

17 Register-to-Register Arch. (like 7600/6600)
ALU operands must be in Rs; Reg alloc. (in 6600/7600) are used; transfer between R M is treated separately from ALU ops; Reg are specialized by functions hence avoid conflicts; Question: Can you find the RISC idea here ? 4/18/2019 \course\652-04F\Topic2-652.ppt

18 Effective use of the Cray-1 requires careful planning to exploit its register resources.
4/18/2019 \course\652-04F\Topic2-652.ppt

19 Registers { 11 cycles 1 cycle ~ 2 cycle memory Reg. access time
A - address Reg 8 x 24 bit S - Scalar Reg 8 x 64 bit V - Vector Reg 8 x 64 word B - addr. save Reg 64 x 24 bits T - scalar save Reg 64 x 64 bits VL: Vector-length register 0 <= VL <= 64 VM : Vector-mask register VM = 64 bit directly available into function units { to memory 4,888 byte (6ns) storage 4/18/2019 \course\652-04F\Topic2-652.ppt

20 Instruction format 16 ~ 32 bit insts 3 types of vector operation
4/18/2019 \course\652-04F\Topic2-652.ppt

21 Cray-1 functional pipeline units
Functional pipelines Register Pipeline delays usage (clock periods) Address functional units Address add unit A 2 Address multiply unit A 6 Scalar functional units Scalar add unit S 3 Scalar shift unit S 2 or 3 Scalar logical unit S 1 Population/leading zero count unit S 3 Vector functional units Vector add unit V or S 3 Vector shift unit V or S 4 Vector logical unit V or S 2 Floating-point functional units Floating-point add unit S and V 6 Floating-point multiply unit S and V 7 Reciprocal approximation unit S and V 14 Cray-1 functional pipeline units 4/18/2019 \course\652-04F\Topic2-652.ppt

22 Vector Unit V-add unit V-logical unit:
goal: boolean op on the corresponding bits of two 64-bit operands and XOR or merge generation mask V-Shift 4/18/2019 \course\652-04F\Topic2-652.ppt

23 Instruction Set 128 insts 10 vector types 13 scalar types
3-addr format 4/18/2019 \course\652-04F\Topic2-652.ppt

24 Implementation Philosophy
Inst. Processing Memory hierarchy Reg. and function unit reservation Vector processing 4/18/2019 \course\652-04F\Topic2-652.ppt

25 Instruction Processing
“...Issue one instruction per cycle ...” 4/18/2019 \course\652-04F\Topic2-652.ppt

26 Instruction Buffers 4 x 64 word 16 - 32 bit instructions
instruction parcel prefetch branch in buffer 4 inst/cycle fetched to LRU I-buffer 4/18/2019 \course\652-04F\Topic2-652.ppt

27 Reg. and f-unit Reservation
Note: when V=V1 op V2 is issued, V1, V2, op-unit are marked busy (like CDC 6600) Example 1 (independent) a) V1 = V2 * V3 b) V4 = V5 + V6 Example 2 (R-conflict) a) V1 = V2 * V3 b) V4 = V5 + V2 Example 3 (F-unit-conflict) a) V1 = V2 * V3 b) V4 = V5 * V6 } Total independent b) will be issued immediately at V2 being released (predictable) } } similar to Exp. 2 4/18/2019 \course\652-04F\Topic2-652.ppt

28 Four Types of Vector Instructions in the Cray-1
Vj Sj ~ ~ 1 1 Vk 2 Vk 3 2 3 . . . . . . ~ ~ ~ ~ n n Vi Vi ~ ~ ~ ~ (a) Type 1 vector instruction (b) Type 2 vector instruction 4/18/2019 \course\652-04F\Topic2-652.ppt

29 ~ Vj Vi Memory (c) Type 3 vector instruction
1 2 3 4 5 6 7 (c) Type 3 vector instruction (d) Type 4 vector instruction 4/18/2019 \course\652-04F\Topic2-652.ppt

30 Vector Loops long vectors with N > 64 are sectioned
each time through a vector loop 64 elems. are processed remainder handling “transparent” to the programmer 4/18/2019 \course\652-04F\Topic2-652.ppt

31 Chaining internal forwarding techniques of 360/91
a “linking process” that occurs when results obtained from one pipeline unit are directly fed into the operand registers of another function pipe. chaining allow operations to be issued as soon as the first result becomes available registers/F-units must be properly reserved. 4/18/2019 \course\652-04F\Topic2-652.ppt

32 Example V0 Mem (M-fetch) V2 V0 + V1 (V-add) V3 V2 < A3 (left shift)
V5 V3 V4 (logical product) < 4/18/2019 \course\652-04F\Topic2-652.ppt

33 A pipeline chaining example
o r y ... 1 2 3 4 5 6 7 a Memory fetch pipe VO V1 V2 V3 V4 V5 Right shift Vector add Logical Product Pipe d c g i j l f A pipeline chaining example in Cray-1 4/18/2019 \course\652-04F\Topic2-652.ppt

34 Multipipeline chaining on Cray 1 and Cray X-MP for executing
Multiply Pipe V1 Add .. Access Memory Read/write port V3 V4 Vector register (Load Y) (Load X) V2 (S*) S (Store Y) (Vadd) Read port 2 write port (VAdd) Read port 1 Scalar register Limited chaining using only one memory-access pipe in the Gray 1 Complete chaining using three memory-access pipes in the Cray X-MP Multipipeline chaining on Cray 1 and Cray X-MP for executing the SAXPY code: Y(1:N) = S x X(1:N) + Y(1:N). 4/18/2019 \course\652-04F\Topic2-652.ppt

35 Chaining Length ? limited by # of V-R/F-U 2-5 4/18/2019
\course\652-04F\Topic2-652.ppt

36 Cray Performance 3 ~ 160 MFLOPs Applications programming skills
Scalar performance for matrix * 12 MFLOPs Vector dot-prod: 22 MFLOPs 153 (peak) 20-80 MFLOPs 4/18/2019 \course\652-04F\Topic2-652.ppt

37 Cray X-MP Multiprocessor/multiprocessing
8 times of Cray-1 memory bandwidth Clock 9.5ns ns Scalar speedup 1.25 ~ 2.5 throughput 2.5 ~ 5 times 4/18/2019 \course\652-04F\Topic2-652.ppt

38 Irregular Vector Operations
scatter: X[A[i]] = B[i] gather X[i] = B[C[i]] an example : do 10 i = 1, N 10 A[index(i)] = Y(i) before 1984, no special insts. poor performance : 2.5 MFLOPs Compress compress a vector according to VM 4/18/2019 \course\652-04F\Topic2-652.ppt

39 Gather Operation ? (a-1) Gather instruction V1[i] = A [V0[i]]
4 100 VL Register A0 V1 Register 2 7 V0 Register (a-1) Gather instruction Memory Contents/ Address V1[i] = A [V0[i]] ? 600 400 250 200 V1 Register 4 2 7 V0 Register (a-2) Gather instruction 100 VL Register A0 Memory Contents/ Address V1[i] = A[V0[i]] exp: V1[i] = A[V0[1]] = A[4] = 600 Gather Operation 4/18/2019 \course\652-04F\Topic2-652.ppt

40 Scatter Operation (b-1) Scatter instruction A[V1[I]] = V[I] ?
4 2 7 V1 Register 200 300 400 500 V0 Register (b-1) Scatter instruction 100 VL Register A0 101 102 103 104 105 106 107 110 Memory Contents/ Address A[V1[I]] = V[I] ? Scatter Operation VL Register A[V1[i]] = V[i] exp: A[V0[3]] = V1[3] A[7] = V1[3] =400 4 x x x x x Memory Contents/ Address 4 2 7 V1 Register 200 300 400 500 V0 Register A0 100 (b-2) Scatter instruction 4/18/2019 \course\652-04F\Topic2-652.ppt

41 Vector Compression Operation
V0 Register (Tested) - 1 5 - 15 24 - 7 13 - 17 14 VL Register VM Register V1 Register V1= Compress (V0, Vm Z) 01 03 04 07 10 11 (c-1) before compression (c-2) after compression Vector Compression Operation 4/18/2019 \course\652-04F\Topic2-652.ppt

42 The Role of Vectorizing Compilers
What does a vectorizing compiler do ? How well does it do ? 4/18/2019 \course\652-04F\Topic2-652.ppt

43 Characteristics of Several Vector-register Architectures
4/18/2019 \course\652-04F\Topic2-652.ppt

44 The VMIPS vector instructions
4/18/2019 \course\652-04F\Topic2-652.ppt

45 4/18/2019 \course\652-04F\Topic2-652.ppt

46 4/18/2019 \course\652-04F\Topic2-652.ppt

47 4/18/2019 \course\652-04F\Topic2-652.ppt

48 4/18/2019 \course\652-04F\Topic2-652.ppt


Download ppt "Topic 2: Vector Processing and Vector Architectures"

Similar presentations


Ads by Google