Topic 2: Vector Processing and Vector Architectures


Topic 2: Vector Processing and Vector Architectures 4/18/2019 \course\652-04F\Topic2-652.ppt

Reading List Slides: Topic2x Henn&Patt: Appendix G Other assigned readings from homework and classes

The basic structure of a vector-register architecture (figure): main memory feeds a vector load-store unit; the vector registers in turn supply functional units for FP add/subtract, FP multiply, FP divide, integer, and logical operations.

Main Advantages of Vector Architectures: The computation of each result is independent of the computation of previous results, allowing a very deep pipeline without generating any data hazards. A single vector instruction specifies a great deal of work; it is equivalent to executing an entire loop. Vector instructions that access memory have a known access pattern. Because an entire loop is replaced by a vector instruction, control hazards that would normally arise from the loop branch are nonexistent.

An Example of Vector Code for DAXPY: Y := a * X + Y. Here is the DLX code:
        LD    F0, a
        ADDI  R4, Rx, #512    ; last address to load
Loop:   LD    F2, 0(Rx)       ; load X(i)
        MULTD F2, F0, F2      ; a x X(i)
        LD    F4, 0(Ry)       ; load Y(i)
        ADDD  F4, F2, F4      ; a x X(i) + Y(i)
        SD    F4, 0(Ry)       ; store into Y(i)
        ADDI  Rx, Rx, #8      ; increment index to X
        ADDI  Ry, Ry, #8      ; increment index to Y
        SUB   R20, R4, Rx     ; compute bound
        BNZ   R20, Loop       ; check if done
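The scalar loop above computes DAXPY one element per iteration. As a minimal sketch (Python used purely for illustration; `daxpy` is a hypothetical helper, not part of DLX):

```python
# DAXPY kernel: Y := a*X + Y, one result per element.
def daxpy(a, x, y):
    # Each result is independent of all previous results, which is
    # exactly the property that lets a vector unit pipeline the loop.
    return [a * xi + yi for xi, yi in zip(x, y)]

print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```

Note how little of the DLX listing is the actual computation: most instructions are bookkeeping (address increments, the bound check, the branch), which is precisely the overhead a vector instruction eliminates.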

Here is the DLXV code for DAXPY:
        LD     F0, a        ; load scalar a
        LV     V1, Rx       ; load vector X
        MULTSV V2, F0, V1   ; vector-scalar multiply
        LV     V3, Ry       ; load vector Y
        ADDV   V4, V2, V3   ; add
        SV     Ry, V4       ; store the result


Two Issues: vector-length control and vector stride.

Issue 1: Vector Length Control. An Example:
    do 10 i = 1, n
10  Y(i) = a * X(i) + Y(i)
Question: assume the maximum hardware vector length is MVL, which may be less than n. How should we do the above computation?

Issue 1: Vector Length Control --- Strip-Mining. For the same example:
    do 10 i = 1, n
10  Y(i) = a * X(i) + Y(i)
Strip-mining:
    low = 1
    VL = (n mod MVL)              /* find the odd-size piece */
    do 1 j = 0, (n / MVL)         /* outer loop */
        do 10 i = low, low+VL-1   /* runs for length VL */
            Y(i) = a*X(i) + Y(i)  /* main operation */
10      continue
        low = low + VL            /* start of next vector */
        VL = MVL                  /* reset the length to max */
1   continue
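The transformation can be sketched as follows (a Python illustration with 0-based indices; `MVL` and `strip_mine` are illustrative names, not hardware features):

```python
MVL = 64  # assumed maximum hardware vector length

def strip_mine(n):
    """Return (start, length) strips covering indices 0..n-1: the
    odd-size piece (n mod MVL) first, then full-length strips."""
    strips = []
    low = 0
    vl = n % MVL or MVL  # odd-size piece; a full strip if MVL divides n
    while low < n:
        strips.append((low, vl))
        low += vl
        vl = MVL  # reset the length to the maximum for later strips
    return strips

print(strip_mine(150))  # [(0, 22), (22, 64), (86, 64)]
```

Every strip after the first runs at the full hardware length, so only the first iteration of the outer loop pays for the remainder.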

Example cont'd: a vector of arbitrary length processed with strip mining, where m = n mod MVL:
    Value of j:   0       1             2                   ...   n/MVL
    Range of i:   1..m    m+1..m+MVL    m+MVL+1..m+2*MVL    ...   n-MVL+1..n

Issue 2: Vector Stride. Consider matrix multiply:
    do 10 i = 1, 100
        do 10 j = 1, 100
            C(i, j) = 0.0
            do 10 k = 1, 100
10              C(i, j) = C(i, j) + A(i, k) * B(k, j)
Question 1: How should we vectorize the code? Question 2: How does the "stride" work here?
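Why stride matters can be seen with a small sketch assuming Fortran-style column-major storage (illustrative Python; `ROWS`, `row`, and `column` are made-up names): a column of the matrix is unit-stride in memory, while a row must be walked with a stride equal to the number of rows.

```python
ROWS, COLS = 100, 100
# Column-major layout: element (i, k) lives at flat index i + k*ROWS.
flat = list(range(ROWS * COLS))

def column(k):
    # Fixed column: consecutive memory words, stride 1.
    return flat[k * ROWS : (k + 1) * ROWS]

def row(i):
    # Fixed row: every ROWS-th word, a non-unit stride of 100.
    return flat[i::ROWS]

print(column(0)[:3])  # [0, 1, 2]      unit stride
print(row(0)[:3])     # [0, 100, 200]  stride 100
```

Vectorizing the inner product over k therefore makes A(i, k) a strided access for fixed i, so the hardware needs stride support in its vector loads (or the loops must be reordered).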

The Cray Supercomputer - A Case Study

Cray-1: "The World's Most Expensive Love Seat..."

The Cray-1 Architecture (figure): vector registers V0-V7 feed the vector functional units (add, logical, shift) and share the floating-point units (add, multiply, reciprocal approximation) with the scalar side; scalar registers S0-S7 are backed by save registers T00-T77, and address registers A0-A7 by save registers B00-B77; four instruction buffers, the VL (vector length) and VM (vector mask) registers, and a real-time clock (RTC) complete the processor state.

Architecture Features: 16 memory banks (64K words/bank); 12 I/O subsystem channels; 12 functional units; 4 Kbytes of high-speed registers.

Register-to-Register Architecture (like the 7600/6600): ALU operands must be in registers; register allocation (as in the 6600/7600) is used; transfers between registers and memory are treated separately from ALU operations; registers are specialized by function, which helps avoid conflicts. Question: Can you find the RISC idea here?

Effective use of the Cray-1 requires careful planning to exploit its register resources.

Registers (access time: 1 cycle, vs. ~11 cycles to memory):
    A - address registers: 8 x 24 bits
    S - scalar registers: 8 x 64 bits
    V - vector registers: 8 x 64 words
    B - address save registers: 64 x 24 bits
    T - scalar save registers: 64 x 64 bits
    VL - vector-length register: 0 <= VL <= 64
    VM - vector-mask register: 64 bits
All directly available to the functional units; 4,888 bytes of 6 ns register storage in total.

Instruction format: 16- or 32-bit instructions; 3 types of vector operation.

Cray-1 functional pipeline units:
    Functional pipeline                    Register usage   Pipeline delay (clock periods)
    Address functional units
      Address add unit                     A                2
      Address multiply unit                A                6
    Scalar functional units
      Scalar add unit                      S                3
      Scalar shift unit                    S                2 or 3
      Scalar logical unit                  S                1
      Population/leading-zero count unit   S                3
    Vector functional units
      Vector add unit                      V or S           3
      Vector shift unit                    V or S           4
      Vector logical unit                  V or S           2
    Floating-point functional units
      Floating-point add unit              S and V          6
      Floating-point multiply unit         S and V          7
      Reciprocal approximation unit        S and V          14

Vector Unit: V-add unit; V-logical unit (goal: boolean operations such as AND, XOR, and merge on the corresponding bits of two 64-bit operands, plus mask generation); V-shift unit.

Instruction Set: 128 instructions; 10 vector types; 13 scalar types; 3-address format.

Implementation Philosophy: instruction processing; memory hierarchy; register and functional-unit reservation; vector processing.

Instruction Processing: "...Issue one instruction per cycle..."

Instruction Buffers: 4 x 64 words; 16- and 32-bit instructions (parcels); instruction prefetch; a branch whose target is already in a buffer needs no refetch; 4 instructions/cycle fetched into the least-recently-used I-buffer.

Register and Functional-Unit Reservation. Note: when Vi = Vj op Vk is issued, Vi, Vj, Vk, and the op unit are marked busy (like the CDC 6600).
Example 1 (independent): a) V1 = V2 * V3, b) V4 = V5 + V6: totally independent, so b) issues at once.
Example 2 (register conflict): a) V1 = V2 * V3, b) V4 = V5 + V2: b) will be issued immediately once V2 is released (predictable).
Example 3 (functional-unit conflict): a) V1 = V2 * V3, b) V4 = V5 * V6: similar to Example 2.
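The issue rule can be modeled as a simple reservation check (a toy model, not actual Cray-1 control logic; all names here are hypothetical):

```python
def can_issue(dest, srcs, unit, busy_regs, busy_units):
    # An instruction issues only if its destination, all its sources,
    # and its functional unit are free of in-flight reservations.
    return (dest not in busy_regs
            and all(s not in busy_regs for s in srcs)
            and unit not in busy_units)

# In flight: V1 = V2 * V3 on the multiply unit.
busy_regs, busy_units = {"V1", "V2", "V3"}, {"multiply"}

print(can_issue("V4", ["V5", "V6"], "add", busy_regs, busy_units))       # True:  Example 1
print(can_issue("V4", ["V5", "V2"], "add", busy_regs, busy_units))       # False: Example 2
print(can_issue("V4", ["V5", "V6"], "multiply", busy_regs, busy_units))  # False: Example 3
```

In Examples 2 and 3 the blocked instruction waits only for one specific resource to be released, so the stall duration is predictable.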

Four Types of Vector Instructions in the Cray-1 (figure): (a) Type 1, vector-vector: Vi = Vj op Vk, element by element; (b) Type 2, scalar-vector: Vi = Sj op Vk, the scalar applied to every element.

(figure, continued): (c) Type 3, memory-to-vector: Vi loaded from memory; (d) Type 4, vector-to-memory: Vj stored to memory.

Vector Loops: long vectors with N > 64 are sectioned; each time through the vector loop, 64 elements are processed; remainder handling is "transparent" to the programmer.

Chaining: akin to the internal forwarding techniques of the IBM 360/91, chaining is a "linking process" that occurs when results obtained from one pipeline unit are directly fed into the operand registers of another functional pipe. Chaining allows dependent operations to be issued as soon as the first result becomes available; registers and functional units must be properly reserved.

Example (a four-unit chain):
    V0 <- Mem         (memory fetch)
    V2 <- V0 + V1     (vector add)
    V3 <- V2 << A3    (left shift)
    V5 <- V3 & V4     (logical product)
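A back-of-the-envelope timing model shows what chaining buys; the unit latencies below are assumed values for illustration, not actual Cray-1 clock counts:

```python
def chained_time(n, latencies):
    # With chaining, each unit starts as soon as the FIRST element emerges
    # from its predecessor: startup latencies add once, then 1 element/clock.
    return sum(latencies) + (n - 1)

def unchained_time(n, latencies):
    # Without chaining, each unit waits for the ENTIRE vector from its
    # predecessor before starting.
    return sum(lat + (n - 1) for lat in latencies)

lat = [7, 3, 4, 2]  # assumed latencies: memory fetch, add, shift, logical
print(chained_time(64, lat))    # 79 clocks
print(unchained_time(64, lat))  # 268 clocks
```

With a 64-element vector the chained version approaches one result per clock for the whole four-operation chain.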

A pipeline chaining example in the Cray-1 (figure): the memory-fetch pipe loads V0, and the vector-add, shift, and logical-product pipes are chained behind it through registers V1-V5.

Multipipeline chaining on the Cray-1 and Cray X-MP for executing the SAXPY code Y(1:N) = S * X(1:N) + Y(1:N) (figure): the Cray-1 achieves only limited chaining because it has a single memory-access pipe, while the Cray X-MP chains completely using three memory-access pipes (read port 1 for loading X, read port 2 for loading Y, and a write port for storing Y) together with the multiply and add pipes.

Chaining length? Limited by the number of vector registers and functional units; typically 2-5 units.

Cray Performance: 3 ~ 160 MFLOPS, depending on the application and programming skill. Scalar performance for matrix multiply: 12 MFLOPS; vector dot-product: 22 MFLOPS; 153 MFLOPS peak; 20-80 MFLOPS in practice.

Cray X-MP: multiprocessor/multiprocessing; 8 times the memory bandwidth of the Cray-1; clock reduced from 9.5 ns to 8.5 ns; scalar speedup 1.25 ~ 2.5; throughput 2.5 ~ 5 times.

Irregular Vector Operations. Scatter: X[A[i]] = B[i]. Gather: X[i] = B[C[i]]. An example:
    do 10 i = 1, N
10  A(index(i)) = Y(i)
Before 1984 there were no special instructions for these, so performance was poor: 2.5 MFLOPS. Compress: compress a vector according to VM.
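Gather and scatter can be sketched directly, with plain Python lists standing in for memory and vector registers (illustrative helpers, not Cray instructions):

```python
def gather(mem, idx):
    # X[i] = mem[idx[i]]: an indexed (irregular) load.
    return [mem[i] for i in idx]

def scatter(mem, idx, vals):
    # mem[idx[i]] = vals[i]: an indexed (irregular) store; mutates mem.
    for i, v in zip(idx, vals):
        mem[i] = v

mem = [200, 300, 400, 500, 600, 700, 100, 250, 350]
print(gather(mem, [4, 2, 7]))  # [600, 400, 250]
scatter(mem, [0, 5], [-1, -2])
print(mem[:6])                 # [-1, 300, 400, 500, 600, -2]
```

Because the index vector is data, the memory addresses are unknown until run time, which is why these operations were so slow without hardware support.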

Gather Operation (figure): V1[i] = A[V0[i]]. With base address A0 = 100, VL = 4, and index register V0 = (4, 2, 7, ...), each element of V0 is added to A0 to form a memory address; with memory holding 200, 300, 400, 500, 600, 700, 100, 250, 350 at addresses 100 through 108, the result is V1 = (600, 400, 250, 200); e.g. V1[0] = A[V0[0]] = A[4] = 600.

Scatter Operation (figure): A[V1[i]] = V0[i]. With base address A0 = 100, VL = 4, index register V1 = (4, 2, 7, ...), and data register V0 = (200, 300, 400, 500), each V0[i] is stored at address A0 + V1[i]; after the scatter, memory holds 200 at 104, 300 at 102, and 400 at 107; e.g. A[V1[2]] = A[7] = 400.

Vector Compression Operation (figure): V1 = Compress(V0, VM, Z). A test of V0's elements (values such as 1, 5, 15, 24, 7, 13, 17, 14) sets the corresponding bits of the mask register VM (010110011101...); after compression, V1 holds the indices of the passing elements (01, 03, 04, 07, 10, 11, ...) packed densely.
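Compression can be sketched the same way, with a list of 0/1 bits standing in for the VM register (illustrative, not a Cray instruction):

```python
def compress(v0, vm):
    # Keep only the elements of v0 whose mask bit is set, packed densely.
    return [x for x, m in zip(v0, vm) if m]

v0 = [7, 1, 5, 9, 15, 24]
vm = [0, 1, 0, 1, 1, 0]
print(compress(v0, vm))  # [1, 9, 15]
```

Compression is the key primitive for vectorizing loops with conditionals: the IF test builds the mask, and the compress packs the surviving elements for dense processing.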

The Role of Vectorizing Compilers: What does a vectorizing compiler do? How well does it do it?

Characteristics of Several Vector-register Architectures

The VMIPS vector instructions
