Download presentation
Presentation is loading. Please wait.
Published byJordan Bailey Modified over 5 years ago
1
Topic 2: Vector Processing and Vector Architectures
4/18/2019 \course\652-04F\Topic2-652.ppt
2
Reading List Slides: Topic2x Henn&Patt: Appendix G
Other assigned readings from homework and classes 4/18/2019 \course\652-04F\Topic2-652.ppt
3
The basic structure of a vector-register architecture
Main memory Vector load-store FP add/subtract Logical FP multiply FP divide Integer registers The basic structure of a vector-register architecture 4/18/2019 \course\652-04F\Topic2-652.ppt
4
Main Advantage of Vector Architectures
The computation of each result is independent of the computation of previous results, allowing a very deep pipeline without generating any data hazards. A single vector instruction specifies a great deal of work - it is equivalent to executing an entire loop. Vector instructions that access memory have a known access pattern Because an entire loop is replaced by a vector instruction control hazards that would normally arise from the loop branch are nonexistent. 4/18/2019 \course\652-04F\Topic2-652.ppt
5
An Example of Vector Code for DAXPY Y:= a * X + Y
Here is the DLX code: LD F0, a ADDI R4, Rx, #512 ; last address to load Loop: LD F2, 0(Rx) ; load X(i) MULTD F2, F0, F2 ; a x X(i) LD F4, 0 (Ry) ; load Y(i) ADDD F4, F2, F4 ; a x X(i) + Y(i) SD F4, 0 (Ry) ; store into Y(i) ADDI Rx, Rx, #8 ; increment index to X ADDI Ry, Ry, #8 ; increment index to Y SUB R20, R4, Rx ; compute bound BNZ R20, loop ; check if done 4/18/2019 \course\652-04F\Topic2-652.ppt
6
Here is the code for DLXV for DAXPY.
LD F0, a ; load scalar a LV V1, Rx ; load vector X MULTSV V2, F0, V1 ; vector-scalar multiply LV V3, Ry ; load vector Y ADDV V4, V2, V3 ; add SV Ry, V4 ; store the result 4/18/2019 \course\652-04F\Topic2-652.ppt
7
4/18/2019 \course\652-04F\Topic2-652.ppt
8
Two Issues Vector-length control Vector stride 4/18/2019
\course\652-04F\Topic2-652.ppt
9
Issue 1: Vector Length Control
An Example: do 10 i = 1, n 10 Y(i) = a * X(i) + Y(i) Question: assume the maximum hardware vector length is MVL which may be less than n. How should we do the above computation ? 4/18/2019 \course\652-04F\Topic2-652.ppt
10
Issue 1: Vector Length Control ---
Strip-Minning An Example: do 10 i = 1, n 10 Y(i) = a * X(i) + Y(i) Strip-mining: low = 1 VL = (n mod MVL) /*find the odd size piece*/ do 1 j = 0, (n / MVL) /*outer loop*/ do 10 i = low, low+VL /*runs for length VL*/ Y(i) = a*X(i) + Y(i) /*main operation*/ continue low = low+VL /*start of next vector*/ VL = MVL /*reset the length to max*/ 1 continue 4/18/2019 \course\652-04F\Topic2-652.ppt
11
Example con’d A vector of arbitrary length processed with strip mining
Value of j … … n/MVL Range of i 1..m (m+1) (m (m+2 * … … n-MVL m+MVL MVL+1) MVL+1) )..n ..m+2 * m+3 * MVL MVL A vector of arbitrary length processed with strip mining m = n mod MVL 4/18/2019 \course\652-04F\Topic2-652.ppt
12
Issue 2: Vector Stride Question 1: How should we vectorize the code ?
do 10 i = 1,100 do 10 j = 1,100 C(i, j) = 0.0 do 10 k = 1,100 10 C(i, j) = C(i, j) + A(i, k) * B(k, j) Question 1: How should we vectorize the code ? Question 2: How is the ”stride” working here ? 4/18/2019 \course\652-04F\Topic2-652.ppt
13
The Cray Supercomputer - A Case Study
4/18/2019 \course\652-04F\Topic2-652.ppt
14
Cray-1 “The World Most Expensive Love Seat...”
4/18/2019 \course\652-04F\Topic2-652.ppt
15
The Cray-1 Architecture
V7 V6 V5 V4 Vector registers V3 V2 V1 V0 Shift Logical Add Vector functional units Recip. Multiply Floating- point Scalar Pop/17 Addr Function VM RTC S7 S6 S5 S4 S3 S2 S1 S0 A7 A6 A5 A4 A3 T00 through T77 (AO) ((Ahj + jkm) A2 A1 A0 CA CU B00 B77 P 3 2 1 ((AO) + (Ak)) 00 77 Vj Vk Vl Sk Sl Sj VL control XA Exchange Aj Ak Al 318 NIP LIP CIP ((Ah) + jkm) I/O Tjk Ai -17 Instruction buffers Execution Memory Scalar reg Addr. regiter The Cray-1 Architecture 4/18/2019 \course\652-04F\Topic2-652.ppt
16
Architecture Features
16 memory banks (64k word/bank) 12 I/O subsystem channels 12 function units 4 Kbytes of high speed regs 4/18/2019 \course\652-04F\Topic2-652.ppt
17
Register-to-Register Arch. (like 7600/6600)
ALU operands must be in Rs; Reg alloc. (in 6600/7600) are used; transfer between R M is treated separately from ALU ops; Reg are specialized by functions hence avoid conflicts; Question: Can you find the RISC idea here ? 4/18/2019 \course\652-04F\Topic2-652.ppt
18
Effective use of the Cray-1 requires careful planning to exploit its register resources.
4/18/2019 \course\652-04F\Topic2-652.ppt
19
Registers { 11 cycles 1 cycle ~ 2 cycle memory Reg. access time
A - address Reg 8 x 24 bit S - Scalar Reg 8 x 64 bit V - Vector Reg 8 x 64 word B - addr. save Reg 64 x 24 bits T - scalar save Reg 64 x 64 bits VL: Vector-length register 0 <= VL <= 64 VM : Vector-mask register VM = 64 bit directly available into function units { to memory 4,888 byte (6ns) storage 4/18/2019 \course\652-04F\Topic2-652.ppt
20
Instruction format 16 ~ 32 bit insts 3 types of vector operation
4/18/2019 \course\652-04F\Topic2-652.ppt
21
Cray-1 functional pipeline units
Functional pipelines Register Pipeline delays usage (clock periods) Address functional units Address add unit A 2 Address multiply unit A 6 Scalar functional units Scalar add unit S 3 Scalar shift unit S 2 or 3 Scalar logical unit S 1 Population/leading zero count unit S 3 Vector functional units Vector add unit V or S 3 Vector shift unit V or S 4 Vector logical unit V or S 2 Floating-point functional units Floating-point add unit S and V 6 Floating-point multiply unit S and V 7 Reciprocal approximation unit S and V 14 Cray-1 functional pipeline units 4/18/2019 \course\652-04F\Topic2-652.ppt
22
Vector Unit V-add unit V-logical unit:
goal: boolean op on the corresponding bits of two 64-bit operands and XOR or merge generation mask V-Shift 4/18/2019 \course\652-04F\Topic2-652.ppt
23
Instruction Set 128 insts 10 vector types 13 scalar types
3-addr format 4/18/2019 \course\652-04F\Topic2-652.ppt
24
Implementation Philosophy
Inst. Processing Memory hierarchy Reg. and function unit reservation Vector processing 4/18/2019 \course\652-04F\Topic2-652.ppt
25
Instruction Processing
“...Issue one instruction per cycle ...” 4/18/2019 \course\652-04F\Topic2-652.ppt
26
Instruction Buffers 4 x 64 word 16 - 32 bit instructions
instruction parcel prefetch branch in buffer 4 inst/cycle fetched to LRU I-buffer 4/18/2019 \course\652-04F\Topic2-652.ppt
27
Reg. and f-unit Reservation
Note: when V=V1 op V2 is issued, V1, V2, op-unit are marked busy (like CDC 6600) Example 1 (independent) a) V1 = V2 * V3 b) V4 = V5 + V6 Example 2 (R-conflict) a) V1 = V2 * V3 b) V4 = V5 + V2 Example 3 (F-unit-conflict) a) V1 = V2 * V3 b) V4 = V5 * V6 } Total independent b) will be issued immediately at V2 being released (predictable) } } similar to Exp. 2 4/18/2019 \course\652-04F\Topic2-652.ppt
28
Four Types of Vector Instructions in the Cray-1
Vj Sj ~ ~ 1 1 Vk 2 Vk 3 2 3 . . . . . . ~ ~ ~ ~ n n Vi Vi ~ ~ ~ ~ (a) Type 1 vector instruction (b) Type 2 vector instruction 4/18/2019 \course\652-04F\Topic2-652.ppt
29
~ Vj Vi Memory (c) Type 3 vector instruction
1 2 3 4 5 6 7 (c) Type 3 vector instruction (d) Type 4 vector instruction 4/18/2019 \course\652-04F\Topic2-652.ppt
30
Vector Loops long vectors with N > 64 are sectioned
each time through a vector loop 64 elems. are processed remainder handling “transparent” to the programmer 4/18/2019 \course\652-04F\Topic2-652.ppt
31
Chaining internal forwarding techniques of 360/91
a “linking process” that occurs when results obtained from one pipeline unit are directly fed into the operand registers of another function pipe. chaining allow operations to be issued as soon as the first result becomes available registers/F-units must be properly reserved. 4/18/2019 \course\652-04F\Topic2-652.ppt
32
Example V0 Mem (M-fetch) V2 V0 + V1 (V-add) V3 V2 < A3 (left shift)
V5 V3 V4 (logical product) < 4/18/2019 \course\652-04F\Topic2-652.ppt
33
A pipeline chaining example
o r y ... 1 2 3 4 5 6 7 a Memory fetch pipe VO V1 V2 V3 V4 V5 Right shift Vector add Logical Product Pipe d c g i j l f A pipeline chaining example in Cray-1 4/18/2019 \course\652-04F\Topic2-652.ppt
34
Multipipeline chaining on Cray 1 and Cray X-MP for executing
Multiply Pipe V1 Add .. Access Memory Read/write port V3 V4 Vector register (Load Y) (Load X) V2 (S*) S (Store Y) (Vadd) Read port 2 write port (VAdd) Read port 1 Scalar register Limited chaining using only one memory-access pipe in the Gray 1 Complete chaining using three memory-access pipes in the Cray X-MP Multipipeline chaining on Cray 1 and Cray X-MP for executing the SAXPY code: Y(1:N) = S x X(1:N) + Y(1:N). 4/18/2019 \course\652-04F\Topic2-652.ppt
35
Chaining Length ? limited by # of V-R/F-U 2-5 4/18/2019
\course\652-04F\Topic2-652.ppt
36
Cray Performance 3 ~ 160 MFLOPs Applications programming skills
Scalar performance for matrix * 12 MFLOPs Vector dot-prod: 22 MFLOPs 153 (peak) 20-80 MFLOPs 4/18/2019 \course\652-04F\Topic2-652.ppt
37
Cray X-MP Multiprocessor/multiprocessing
8 times of Cray-1 memory bandwidth Clock 9.5ns ns Scalar speedup 1.25 ~ 2.5 throughput 2.5 ~ 5 times 4/18/2019 \course\652-04F\Topic2-652.ppt
38
Irregular Vector Operations
scatter: X[A[i]] = B[i] gather X[i] = B[C[i]] an example : do 10 i = 1, N 10 A[index(i)] = Y(i) before 1984, no special insts. poor performance : 2.5 MFLOPs Compress compress a vector according to VM 4/18/2019 \course\652-04F\Topic2-652.ppt
39
Gather Operation ? (a-1) Gather instruction V1[i] = A [V0[i]]
4 100 VL Register A0 V1 Register 2 7 V0 Register (a-1) Gather instruction Memory Contents/ Address V1[i] = A [V0[i]] ? 600 400 250 200 V1 Register 4 2 7 V0 Register (a-2) Gather instruction 100 VL Register A0 Memory Contents/ Address V1[i] = A[V0[i]] exp: V1[i] = A[V0[1]] = A[4] = 600 Gather Operation 4/18/2019 \course\652-04F\Topic2-652.ppt
40
Scatter Operation (b-1) Scatter instruction A[V1[I]] = V[I] ?
4 2 7 V1 Register 200 300 400 500 V0 Register (b-1) Scatter instruction 100 VL Register A0 101 102 103 104 105 106 107 110 Memory Contents/ Address A[V1[I]] = V[I] ? Scatter Operation VL Register A[V1[i]] = V[i] exp: A[V0[3]] = V1[3] A[7] = V1[3] =400 4 x x x x x Memory Contents/ Address 4 2 7 V1 Register 200 300 400 500 V0 Register A0 100 (b-2) Scatter instruction 4/18/2019 \course\652-04F\Topic2-652.ppt
41
Vector Compression Operation
V0 Register (Tested) - 1 5 - 15 24 - 7 13 - 17 14 VL Register VM Register V1 Register V1= Compress (V0, Vm Z) 01 03 04 07 10 11 (c-1) before compression (c-2) after compression Vector Compression Operation 4/18/2019 \course\652-04F\Topic2-652.ppt
42
The Role of Vectorizing Compilers
What does a vectorizing compiler do ? How well does it do ? 4/18/2019 \course\652-04F\Topic2-652.ppt
43
Characteristics of Several Vector-register Architectures
4/18/2019 \course\652-04F\Topic2-652.ppt
44
The VMIPS vector instructions
4/18/2019 \course\652-04F\Topic2-652.ppt
45
4/18/2019 \course\652-04F\Topic2-652.ppt
46
4/18/2019 \course\652-04F\Topic2-652.ppt
47
4/18/2019 \course\652-04F\Topic2-652.ppt
48
4/18/2019 \course\652-04F\Topic2-652.ppt
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.