Topic 2: Vector Processing and Vector Architectures

4/18/2019 \course\652-04F\Topic2-652.ppt

Reading List
Slides: Topic2x
Henn&Patt: Appendix G
Other assigned readings from homework and classes

The basic structure of a vector-register architecture
[Figure: the basic structure of a vector-register architecture. Labeled blocks: main memory, vector load-store, FP add/subtract, logical, FP multiply, FP divide, integer, registers.]

Main Advantages of Vector Architectures
The computation of each result is independent of the computation of previous results, allowing a very deep pipeline without generating any data hazards.
A single vector instruction specifies a great deal of work; it is equivalent to executing an entire loop.
Vector instructions that access memory have a known access pattern.
Because an entire loop is replaced by a vector instruction, control hazards that would normally arise from the loop branch are nonexistent.

An Example of Vector Code for DAXPY: Y := a * X + Y
Here is the DLX code:
        LD      F0, a
        ADDI    R4, Rx, #512    ; last address to load
Loop:   LD      F2, 0(Rx)       ; load X(i)
        MULTD   F2, F0, F2      ; a x X(i)
        LD      F4, 0(Ry)       ; load Y(i)
        ADDD    F4, F2, F4      ; a x X(i) + Y(i)
        SD      F4, 0(Ry)       ; store into Y(i)
        ADDI    Rx, Rx, #8      ; increment index to X
        ADDI    Ry, Ry, #8      ; increment index to Y
        SUB     R20, R4, Rx     ; compute bound
        BNZ     R20, Loop       ; check if done

Here is the code for DLXV for DAXPY.
        LD      F0, a           ; load scalar a
        LV      V1, Rx          ; load vector X
        MULTSV  V2, F0, V1      ; vector-scalar multiply
        LV      V3, Ry          ; load vector Y
        ADDV    V4, V2, V3      ; add
        SV      Ry, V4          ; store the result
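To make the correspondence between the two listings concrete, here is a minimal Python sketch. It assumes memory is a flat list of doubles and that each vector holds 64 elements (512 bytes / 8 bytes per element, matching the DLX loop bound). The helper names lv, multsv, addv, and sv are hypothetical stand-ins that mirror the mnemonics; this is not a DLXV simulator.

```python
# Hypothetical helpers: each call stands for ONE vector instruction
# that operates on a whole vector at once.
def lv(memory, base, n):
    """LV: load n consecutive elements starting at base."""
    return memory[base:base + n]

def multsv(scalar, v):
    """MULTSV: multiply every element of v by a scalar."""
    return [scalar * x for x in v]

def addv(v1, v2):
    """ADDV: element-wise add of two vectors."""
    return [x + y for x, y in zip(v1, v2)]

def sv(memory, base, v):
    """SV: store vector v to memory starting at base."""
    memory[base:base + len(v)] = v

def daxpy_vector(memory, a, rx, ry, n=64):
    """Six-instruction DLXV-style sequence (the LD of a is implicit)."""
    v1 = lv(memory, rx, n)
    v2 = multsv(a, v1)
    v3 = lv(memory, ry, n)
    v4 = addv(v2, v3)
    sv(memory, ry, v4)

def daxpy_scalar(memory, a, rx, ry, n=64):
    """Element-at-a-time loop, as in the DLX listing."""
    for i in range(n):
        memory[ry + i] = a * memory[rx + i] + memory[ry + i]
```

Both functions leave the same values in Y; the vector version expresses the whole loop in a handful of operations with no loop branch.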

Two Issues
Vector-length control
Vector stride

Issue 1: Vector Length Control
An Example:
        do 10 i = 1, n
10      Y(i) = a * X(i) + Y(i)
Question: assume the maximum hardware vector length is MVL, which may be less than n. How should we do the above computation?

Issue 1: Vector Length Control --- Strip-Mining
An Example:
        do 10 i = 1, n
10      Y(i) = a * X(i) + Y(i)
Strip-mining:
        low = 1
        VL = (n mod MVL)                /*find the odd size piece*/
        do 1 j = 0, (n / MVL)           /*outer loop*/
          do 10 i = low, low+VL-1       /*runs for length VL*/
            Y(i) = a*X(i) + Y(i)        /*main operation*/
10        continue
          low = low+VL                  /*start of next vector*/
          VL = MVL                      /*reset the length to max*/
1       continue
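The strip-mined loop can be sketched in Python as follows. This is a minimal illustration that assumes 0-based lists in place of the 1-based Fortran arrays; the function name strip_mined_daxpy and the returned list of (start, length) pairs are hypothetical additions for inspection, not part of the original code.

```python
def strip_mined_daxpy(a, x, y, mvl=64):
    """Update y in place with a*x + y, one strip of at most mvl elements at a time.

    Returns the (start, length) of every strip, 0-based.
    """
    n = len(x)
    strips = []
    low = 0               # the slide starts at 1; Python lists start at 0
    vl = n % mvl          # the odd-size piece goes first
    for j in range(n // mvl + 1):
        # One vector operation of length vl (length 0 when n is a multiple of mvl).
        y[low:low + vl] = [a * xi + yi
                           for xi, yi in zip(x[low:low + vl], y[low:low + vl])]
        strips.append((low, vl))
        low += vl         # start of next vector
        vl = mvl          # every later strip is full length
    return strips
```

For n = 200 and MVL = 64 the first strip has 200 mod 64 = 8 elements and the remaining three strips have 64 each.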

Example cont'd: a vector of arbitrary length processed with strip mining
Value of j      Range of i
0               1 .. m
1               (m+1) .. m+MVL
2               (m+MVL+1) .. m+2*MVL
3               (m+2*MVL+1) .. m+3*MVL
...             ...
n/MVL           (n-MVL+1) .. n
m = n mod MVL

Issue 2: Vector Stride
        do 10 i = 1,100
          do 10 j = 1,100
            C(i, j) = 0.0
            do 10 k = 1,100
10            C(i, j) = C(i, j) + A(i, k) * B(k, j)
Question 1: How should we vectorize the code?
Question 2: How is the "stride" working here?
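One way to approach Question 2 is to compute how far apart consecutive inner-loop accesses lie in memory. The sketch below assumes Fortran's column-major array layout and 8-byte (double-precision) elements; the helper names column_major_offset and stride_over_k are hypothetical.

```python
N = 100          # array dimension from the loop nest
ELEM_BYTES = 8   # assumption: double-precision elements

def column_major_offset(i, j, rows):
    """Element offset of the 1-based entry (i, j) in a column-major array."""
    return (i - 1) + (j - 1) * rows

def stride_over_k(offset_of_k):
    """Distance in elements between the accesses for k = 1 and k = 2."""
    return offset_of_k(2) - offset_of_k(1)

# A(i, k): k is the column index, so successive k values are a full column apart.
stride_a = stride_over_k(lambda k: column_major_offset(1, k, N))
# B(k, j): k is the row index, so successive k values are adjacent in memory.
stride_b = stride_over_k(lambda k: column_major_offset(k, 1, N))

stride_a_bytes = stride_a * ELEM_BYTES
stride_b_bytes = stride_b * ELEM_BYTES
```

Under these assumptions, vectorizing the k loop needs a non-unit-stride load for A (100 elements, 800 bytes) and a unit-stride load for B.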

The Cray Supercomputer - A Case Study

Cray-1 “The World's Most Expensive Love Seat...”

The Cray-1 Architecture
[Figure: The Cray-1 architecture. Labeled blocks: vector registers V0 to V7; scalar registers S0 to S7 with T00 through T77; address registers A0 to A7 with B00 through B77; vector functional units (shift, logical, add); floating-point functional units (add, multiply, reciprocal); scalar and address functional units; VL, VM, and RTC registers; instruction buffers with P, NIP, LIP, CIP; exchange, I/O, memory, and execution control.]

Architecture Features
16 memory banks (64k words/bank)
12 I/O subsystem channels
12 function units
4 Kbytes of high speed registers

Register-to-Register Arch. (like 7600/6600)
ALU operands must be in registers;
Register allocation (as in the 6600/7600) is used;
Transfers between registers and memory are treated separately from ALU ops;
Registers are specialized by function, hence avoiding conflicts;
Question: Can you find the RISC idea here?

Effective use of the Cray-1 requires careful planning to exploit its register resources.

Registers
Access time: registers 1 ~ 2 cycles, memory 11 cycles
A - address registers: 8 x 24 bits
S - scalar registers: 8 x 64 bits
V - vector registers: 8 x 64 words
B - address save registers: 64 x 24 bits
T - scalar save registers: 64 x 64 bits
VL: vector-length register, 0 <= VL <= 64
VM: vector-mask register, 64 bits
Brace labels in the diagram: directly available to function units; to memory
4,888 bytes of (6 ns) storage

Instruction format
16 ~ 32 bit instructions
3 types of vector operation

Cray-1 functional pipeline units
Functional pipelines                      Register usage   Pipeline delays (clock periods)
Address functional units
  Address add unit                        A                2
  Address multiply unit                   A                6
Scalar functional units
  Scalar add unit                         S                3
  Scalar shift unit                       S                2 or 3
  Scalar logical unit                     S                1
  Population/leading zero count unit      S                3
Vector functional units
  Vector add unit                         V or S           3
  Vector shift unit                       V or S           4
  Vector logical unit                     V or S           2
Floating-point functional units
  Floating-point add unit                 S and V          6
  Floating-point multiply unit            S and V          7
  Reciprocal approximation unit           S and V          14

Vector Unit
V-add unit
V-logical unit:
  goal: boolean op on the corresponding bits of two 64-bit operands
  AND, XOR, OR, merge, mask generation
V-shift unit

Instruction Set
128 instructions
10 vector types
13 scalar types
3-address format

Implementation Philosophy
Instruction processing
Memory hierarchy
Register and function unit reservation
Vector processing

Instruction Processing
“...Issue one instruction per cycle ...”

Instruction Buffers
4 x 64 words
16 - 32 bit instructions
instruction parcel prefetch
branch in buffer
4 inst/cycle fetched to LRU I-buffer

Reg. and f-unit Reservation
Note: when V = V1 op V2 is issued, V1, V2, and the op-unit are marked busy (like CDC 6600)
Example 1 (independent)
  a) V1 = V2 * V3
  b) V4 = V5 + V6
  Totally independent.
Example 2 (R-conflict)
  a) V1 = V2 * V3
  b) V4 = V5 + V2
  b) will be issued immediately when V2 is released (predictable).
Example 3 (F-unit-conflict)
  a) V1 = V2 * V3
  b) V4 = V5 * V6
  Similar to Example 2: b) waits for the multiply unit to be released.
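The three examples can be checked mechanically with a small Python sketch. It assumes a deliberately simplified reservation rule: any register shared between two instructions is a register conflict, and any shared functional unit is a unit conflict. The tuple layout and the function name classify are hypothetical, and chaining (which relaxes the register rule) is not modeled.

```python
def classify(first, second):
    """Classify how `second` relates to an in-flight `first`.

    Each instruction is a tuple (dest, src1, src2, unit).
    """
    d1, a1, b1, u1 = first
    d2, a2, b2, u2 = second
    # Any register used by both instructions is still reserved by the first.
    if {d1, a1, b1} & {d2, a2, b2}:
        return "register conflict"
    # Otherwise the only shared resource can be the functional unit.
    if u1 == u2:
        return "functional-unit conflict"
    return "independent"

example1 = classify(("V1", "V2", "V3", "mul"), ("V4", "V5", "V6", "add"))
example2 = classify(("V1", "V2", "V3", "mul"), ("V4", "V5", "V2", "add"))
example3 = classify(("V1", "V2", "V3", "mul"), ("V4", "V5", "V6", "mul"))
```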

Four Types of Vector Instructions in the Cray-1
[Figure: (a) Type 1 vector instruction: operand registers Vj and Vk, result register Vi. (b) Type 2 vector instruction: operands Sj and Vk, result register Vi. (c) Type 3 vector instruction and (d) Type 4 vector instruction: transfers between memory and vector registers.]

Vector Loops
long vectors with N > 64 are sectioned
each time through a vector loop, 64 elements are processed
remainder handling is “transparent” to the programmer

Chaining
internal forwarding techniques of 360/91
a “linking process” that occurs when results obtained from one pipeline unit are directly fed into the operand registers of another function pipe
chaining allows operations to be issued as soon as the first result becomes available
registers/F-units must be properly reserved

Example
V0 <- Mem          (M-fetch)
V2 <- V0 + V1      (V-add)
V3 <- V2 < A3      (left shift)
V5 <- V3 AND V4    (logical product)
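A rough timing sketch shows why chaining matters for a dependent sequence like this one. It assumes a simplified model in which a vector operation on n elements takes (pipeline delay + n) cycles, and in which a fully chained sequence overlaps all element streams so that only the delays add up. The vector add, shift, and logical delays (3, 4, 2) come from the functional pipeline units table; the memory-fetch delay of 7 is an assumption for illustration only.

```python
# Pipeline delays in clock periods. MEM_FETCH_DELAY is an assumed value.
MEM_FETCH_DELAY = 7
DELAYS = [MEM_FETCH_DELAY, 3, 4, 2]   # fetch, vector add, vector shift, vector logical

def unchained_cycles(delays, n):
    """Each operation waits for the previous one to finish all n elements."""
    return sum(d + n for d in delays)

def chained_cycles(delays, n):
    """Each operation starts as soon as the first result of its predecessor appears."""
    return sum(delays) + n

n = 64
without_chaining = unchained_cycles(DELAYS, n)
with_chaining = chained_cycles(DELAYS, n)
```

In this model the four-operation chain on 64 elements drops from 272 cycles to 80, because the 64-element streaming time is paid once instead of four times.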

A pipeline chaining example
[Figure: A pipeline chaining example in Cray-1. The memory fetch pipe, right shift pipe, vector add pipe, and logical product pipe are linked through vector registers V0 to V5.]

Multipipeline chaining on Cray 1 and Cray X-MP
[Figure: Multipipeline chaining on Cray 1 and Cray X-MP for executing the SAXPY code: Y(1:N) = S x X(1:N) + Y(1:N). Panels: limited chaining using only one memory-access pipe in the Cray 1; complete chaining using three memory-access pipes (read port 1, read port 2, write port) in the Cray X-MP. Labeled steps: Load X, Load Y, S* (multiply pipe), Vadd (add pipe), Store Y.]

Chaining Length?
limited by # of V-R/F-U
2-5

Cray Performance
3 ~ 160 MFLOPs, depending on applications and programming skills
Scalar performance for matrix multiply: 12 MFLOPs
Vector dot-product: 22 MFLOPs
153 MFLOPs (peak)
20-80 MFLOPs

Cray X-MP
Multiprocessor/multiprocessing
8 times the Cray-1 memory bandwidth
Clock: 9.5 ns
Scalar speedup: 1.25 ~ 2.5
Throughput: 2.5 ~ 5 times

Irregular Vector Operations
scatter: X[A[i]] = B[i]
gather: X[i] = B[C[i]]
an example:
        do 10 i = 1, N
10      A[index(i)] = Y(i)
before 1984, no special instructions; poor performance: 2.5 MFLOPs
compress: compress a vector according to VM
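The three irregular operations can be written out as a minimal Python sketch, assuming plain lists for vectors and 0-based indexing. The function names and the sample values in the usage note are hypothetical and chosen only to show the index patterns.

```python
def gather(b, c):
    """gather: X[i] = B[C[i]]. Collect elements of b at the indices in c."""
    return [b[ci] for ci in c]

def scatter(x, a, b):
    """scatter: X[A[i]] = B[i]. Write each b[i] to position a[i] of x, in place."""
    for ai, bi in zip(a, b):
        x[ai] = bi
    return x

def compress(v, vm):
    """compress: keep the elements of v whose mask bit in vm is set."""
    return [vi for vi, keep in zip(v, vm) if keep]
```

For instance, gather([10, 20, 30, 40], [3, 0, 2]) picks elements 3, 0, and 2 in that order, while scatter with the same index vector would write to those positions instead.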

Gather Operation
[Figure: Gather instruction V1[i] = A[V0[i]], shown (a-1) before and (a-2) after execution. VL register = 4, A0 = 100, index register V0 begins 4, 2, 7; after the gather, V1 holds 600, 400, 250, 200. Example: V1[1] = A[V0[1]] = A[4] = 600.]

Scatter Operation
[Figure: Scatter instruction, shown (b-1) before and (b-2) after execution. VL register = 4, A0 = 100, memory addresses 100 through 110, index values 4, 2, 7, data values 200, 300, 400, 500. Example: the third element is written to A[7], so A[7] = 400.]

Vector Compression Operation
[Figure: V1 = Compress(V0, VM, Z), shown (c-1) before compression and (c-2) after compression. Labeled registers: V0 (tested), VL, VM, and V1; after compression V1 holds 01, 03, 04, 07, 10, 11.]

The Role of Vectorizing Compilers
What does a vectorizing compiler do?
How well does it do?

Characteristics of Several Vector-register Architectures

The VMIPS vector instructions
