Vector Architectures
Sima, Fountain and Kacsuk, Chapter 14 (CSE462)
© David Abramson, 2004. Material from Sima, Fountain and Kacsuk, Addison Wesley.

A Generic Vector Machine
- The basic idea in a vector processor is to combine two vectors and produce an output vector.
- If A, B and C are vectors, each of N elements, then a vector processor can perform the operation C := B + A, which is interpreted as c(i) := b(i) + a(i), 0 ≤ i ≤ N-1.
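As a point of comparison, the same element-by-element operation written as scalar C code is a single loop; this is only an illustrative sketch, and the function name is invented:

    #include <stddef.h>

    /* Element-wise vector addition: c[i] = b[i] + a[i] for 0 <= i < n.
       On a vector machine the whole loop maps onto one vector
       instruction; a scalar machine executes the body N times. */
    void vadd(const double *a, const double *b, double *c, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            c[i] = b[i] + a[i];
    }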

A Generic Vector Processor
- The memory subsystem needs to support:
  - 2 reads per cycle
  - 1 write per cycle
[Figure: a multiport memory subsystem feeds stream A and stream B into a pipelined adder, which writes stream C = B + A back to memory.]

Un-vectorized computation
[Figure: the stages "compute first address, fetch first data, compute second address, fetch second data, compute result, fetch destination address, store result", grouped into precompute time t1, compute time t2 and post-compute time t3.]
Compute time for one result = t1 + t2 + t3
Compute time for N results = N(t1 + t2 + t3)

Vectorized computation
[Figure: the same stages, but the precompute time t1 and post-compute time t3 are paid only once, with compute time t2 per result.]
Compute time for N results = t1 + Nt2 + t3
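A small worked comparison of the two formulas; the timing values are assumptions chosen only to show the shape of the benefit, not measurements of any machine:

    #include <stdio.h>

    /* Timing model from the slides: scalar time N(t1 + t2 + t3)
       versus vectorized time t1 + N*t2 + t3. */
    int main(void)
    {
        double t1 = 100.0, t2 = 10.0, t3 = 100.0;  /* ns, illustrative */
        int n = 64;
        printf("un-vectorized: %.0f ns\n", n * (t1 + t2 + t3)); /* 13440 */
        printf("vectorized:    %.0f ns\n", t1 + n * t2 + t3);   /*   840 */
        return 0;
    }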

Non-pipelined computation
[Figure: the seven stages execute strictly one after another for each element, so the repetition time equals the time to first result.]

Pipelined computation
[Figure: the stages overlap for successive elements; the time to first result is unchanged, but the repetition time shrinks to one stage time.]

Pipelined repetition governed by slowest component
[Figure: when stage times are unequal, the repetition time is set by the slowest stage in the pipeline.]

Pipelined granularity increased to improve repetition rate
[Figure: the slow computation stage is subdivided into finer sub-stages (increased granularity for computation), which shortens the repetition time.]

Vectorizing speeds up computation
[Figure: execution time (ns) against number of instructions, with scalar performance and vector performance plotted; the vector line grows far more slowly.]

Interleaving
- If vector pipelining is to work, then it must be possible to fetch instructions from memory quickly.
- There are two main ways of achieving this:
  - Cache memories
  - Interleaving
- A conventional memory consists of a set of storage locations accessed via some sort of address decoder.

Interleaving
- The problem with such a scheme is that the memory is busy for the whole duration of an access, and no other access can proceed.

Interleaving
- In an interleaved memory system there are a number of banks.
- Each bank corresponds to a certain range of addresses.

Interleaving
- A pipelined machine can be kept fed with instructions even though the main memory may be quite slow.
- An interleaved memory system slows down when successive accesses fall in the same bank of memory.
- This is rare when prefetching instructions, because instruction fetches tend to be sequential.
- It is possible to access two locations at the same time if they reside in different banks.
- Banks are usually selected using some of the low-order bits of the address, because sequential accesses then hit different banks (see the sketch below).
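A minimal sketch of low-order-bit bank selection, assuming an 8-bank system and word addressing; the names and constants are illustrative:

    #include <stdint.h>

    #define NBANKS 8u   /* assumed power of two */

    /* Sequential word addresses cycle through all banks, so a
       stride-1 stream keeps every bank busy. */
    uint32_t bank_of(uint32_t word_addr)
    {
        return word_addr & (NBANKS - 1);  /* low log2(NBANKS) bits */
    }

    uint32_t offset_in_bank(uint32_t word_addr)
    {
        return word_addr >> 3;            /* remaining high-order bits */
    }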

Memory Layout
[Figure: eight memory modules M feeding a pipelined adder.]

Memory Layout of Arrays
[Figure: arrays A, B and C, elements 0 to 7, laid out one element per module across the eight modules.]

Pipeline Utilisation
[Figure: a chart of clock periods against memories 0-7 and pipeline stages 0-3; reads RA0-RA7 and RB0-RB7 stream through the memories, and writes W0-W6 of the results follow after the pipeline delay.]

Memory Contention
[Figure: a layout of A, B and C in which operands needed in the same cycle reside in the same module, so their accesses contend.]

Adding Delay Paths
[Figure: variable delay elements placed on the A and B operand paths into the pipelined adder, and on the C result path, to stagger the memory accesses.]

Pipeline with delay
[Figure: the same utilisation chart with the delays in place; reads RA0-RA7 and RB0-RB7 proceed without conflict, and writes W0-W4 start correspondingly later.]

CRAY 1 Vector Operations
- Vector facility.
- The CRAY compilers are vectorising, and do not require vector notation:
       do 10 i = 1,64
    10 x(i) = y(i)
- Scalar arithmetic would take 64*n instructions, where n is the number of instructions per loop iteration.
- Vector registers can also be sent to the floating point functional units.

CRAY Vector Section
[Figure: block diagram of the CRAY vector section.]

Increasing complexity in Cray systems
[Figure: number of active circuit elements per processor, on a scale from 250k to 1.25M, for the CRAY-1, X-MP/4, Y-MP/8 and Y-MP/16.]

Chaining
- The CRAY-1 can achieve even faster vector operations by using chaining.
- The result vector is not only sent to the destination vector register, but also directly to another functional unit.
- Data is seen to chain from one functional unit to another, possibly without any intermediate storage.
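The kind of loop that benefits from chaining, written as plain C for illustration (the chaining itself happens in the hardware, not in the source code):

    /* For each element, the multiply result feeds the adder
       directly, so the add pipeline can start before the whole
       product vector has been written back. */
    void mul_add(int n, const double *b, const double *c,
                 const double *d, double *a)
    {
        for (int i = 0; i < n; i++)
            a[i] = b[i] * c[i] + d[i];  /* multiply chains into add */
    }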

Vector Startup
- Vector instructions may be issued at the rate of 1 instruction per clock period.
  - Provided there is no contention, they will be issued at this rate.
- The first result appears after some delay, and then each word of the vector arrives at the rate of one word per clock period.
- Vectors longer than 64 words are broken into 64-word chunks (see the strip-mining sketch below):
       do 10 i = 1,n
    10 A(i) = B(i)
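A minimal strip-mining sketch in C, assuming 64-element vector registers; it shows the chunking a vectorising compiler would perform, not actual compiler output:

    /* Process an arbitrary-length loop in chunks of at most 64
       elements; each chunk corresponds to one vector instruction
       and pays one startup delay. */
    void copy_stripmined(const double *b, double *a, int n)
    {
        for (int base = 0; base < n; base += 64) {
            int len = (n - base < 64) ? n - base : 64; /* last chunk may be short */
            for (int i = 0; i < len; i++)              /* one "vector" operation */
                a[base + i] = b[base + i];
        }
    }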

Vector Startup times
- Note that the second loop uses data chaining.
- Note the effect of startup time.
[Table: scalar and vector execution times for the loop bodies A(i) = B(i) and A(i) = B(i)*C(i) + D(i)*E(i).]

Effect of Stride on Interleaving
- Most interleaving schemes simply take the bottom bits of the address and use these to select the memory bank.
- This is very good for sequential address patterns (stride 1), and not too bad for random address patterns.
- But for stride n, where n is the number of memory banks, the performance can be extremely bad.
       DO 10 I = 1,128              DO 20 J = 1,128
    10 A(I,1) = 0                20 A(1,J) = 0

Effect of Stride
- These two code fragments will have quite different performance.
- If we assume the array is arranged in memory row by row, then:
  - the first loop will access every 128th word, in sequential order;
  - the second loop will access 128 contiguous words, in sequential order.
- Thus, for loop 1, interleaving will fail if the number of memory banks is a factor of the stride (see the sketch below).
[Figure: rows A(1,1-128), A(2,1-128), A(3,1-128) stored one after another in memory.]
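A sketch of the two access patterns against an 8-bank memory, using a row-major C array as the slides assume; the bank function and sizes are illustrative:

    #define N      128
    #define NBANKS 8

    int bank(int word_addr) { return word_addr % NBANKS; }

    void stride_demo(double A[N][N])
    {
        /* Loop 1: A[i][0] has word address i*128, and 128 is a
           multiple of 8, so bank(i*128) == 0 for every i: all
           128 accesses pile onto one bank. */
        for (int i = 0; i < N; i++) A[i][0] = 0.0;

        /* Loop 2: A[0][j] has word address j, so the accesses
           sweep banks 0..7 in turn: conflict free. */
        for (int j = 0; j < N; j++) A[0][j] = 0.0;
    }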

Effect of Stride
- There are many research papers on how to improve the performance of stride-m access on an n-way interleaved memory system.
- Two main approaches:
  - Arrange the data to match the stride (software)
  - Make the hardware insensitive to the stride (hardware)

Memory Layout for stride free access
- Consider the layout of an 8 x 8 matrix.
- It can be placed in memory in two possible ways: by row order or by column order.
- If we know that a particular program requires only row or only column order, it is possible to arrange the matrix so that conflict-free access can always be guaranteed.

Memory layout for stride free access
- Skew the matrix, so that each row starts in a different memory unit.
  - It then becomes possible to access the matrix in row order or column order without memory contention.
- This requires a different address calculation (a simple skewing function is sketched below).
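One simple skewing function, as a hedged sketch: for an 8 x 8 matrix on 8 modules, storing element (row, col) in module (row + col) mod 8 lets both a full row and a full column touch each module exactly once. Real machines use a variety of such functions; this is only the common textbook one.

    #define M 8   /* number of memory modules, assumed */

    /* Skewed placement: successive elements of a row occupy
       successive modules, and successive elements of a column
       do too, so neither access order causes contention. */
    int module_of(int row, int col) { return (row + col) % M; }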

Address Modification in Hardware
- It is possible to use a different function to compute the module number.
- If the address is passed to an arbitrary address-computation function which emits a module number, we can produce stride-free access for many different strides.
- There are schemes which give optimal packing and do not waste any space.

Other Typical Access Patterns
- Unfortunately, row and column access order is not the only requirement.
- Other common patterns include:
  - Matrix diagonals
  - Square subarrays

Diagonal Access
- To access a diagonal, the stride is equal to the column stride + 1.
- If M, the number of modules, is a power of 2, then the column stride and the column stride + 1 cannot both be relatively prime to M, so the two access patterns cannot both be efficient (see the check below).
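A conflict-free stride must be relatively prime to the number of modules M; a tiny gcd check makes the trade-off concrete. The values M = 8, column stride 8 and diagonal stride 9 are assumed for illustration.

    /* gcd(stride, M) == 1 means the stride visits every module
       before repeating, i.e. the access is conflict free. With
       M = 8: gcd(8, 8) = 8 (worst case), gcd(9, 8) = 1 (free). */
    int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }
    int conflict_free(int stride, int m) { return gcd(stride, m) == 1; }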

Vector Algorithms
- Consider the solution of the linear equations given by Ax = b, where A is an N x N matrix and x and b are N x 1 column vectors.
- Gaussian Elimination is an efficient algorithm for producing lower and upper triangular matrices L and U such that A = LU.

Gaussian Elimination
- Given L and U it is possible to write Ly = b and Ux = y.
- Using forward and back substitution it is possible to solve for x (a sketch of both substitutions follows).
[Figure: the partitioned triangular systems Ly = b and Ux = y.]
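Hedged C sketches of the two substitution steps, assuming L is unit lower triangular and U is upper triangular, both held in full N x N arrays; the storage layout and names are illustrative:

    /* Forward substitution: solve Ly = b. */
    void forward_subst(int n, double L[n][n], const double *b, double *y)
    {
        for (int i = 0; i < n; i++) {
            y[i] = b[i];
            for (int j = 0; j < i; j++)
                y[i] -= L[i][j] * y[j];   /* L[i][i] == 1 assumed */
        }
    }

    /* Back substitution: solve Ux = y. */
    void back_subst(int n, double U[n][n], const double *y, double *x)
    {
        for (int i = n - 1; i >= 0; i--) {
            x[i] = y[i];
            for (int j = i + 1; j < n; j++)
                x[i] -= U[i][j] * x[j];
            x[i] /= U[i][i];
        }
    }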

A Vector Gaussian Elimination

    for i := 1 to N do
    begin
      imax := index of Max(abs(A[i..N, i]));
      Swap(A[i, i..N], A[imax, i..N]);
      if A[i, i] = 0 then Singular Matrix;
      A[i+1..N, i] := A[i+1..N, i] / A[i, i];
      for k := i+1 to N do
        A[k, i+1..N] := A[k, i+1..N] - A[k, i] * A[i, i+1..N];
    end;
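The pivot search in the first line is the index-of-max reduction discussed below; a 0-indexed C sketch of that one step, with an assumed full-array layout:

    #include <math.h>

    /* Return the row index (in i..n-1) of the largest |A[k][i]|
       in column i: the pivot row for iteration i. */
    int argmax_abs_col(int n, double A[n][n], int i)
    {
        int imax = i;
        for (int k = i + 1; k < n; k++)
            if (fabs(A[k][i]) > fabs(A[imax][i]))
                imax = k;
        return imax;
    }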

Picture of algorithm
[Figure: a partitioned matrix showing the finished parts of U and L, the current pivot row and column, and the remaining active submatrix A.]
- The algorithm produces a new row of U and a new column of L each iteration.

Aspects of the algorithm
- The algorithm as expressed accesses both rows and columns.
- The majority of the vector operations have either two vector operands, or a scalar and a vector operand, and they produce a vector result.
- The MAX operation on a vector returns the index of the maximum element, not the value of the maximum element.
- The length of the vector items accessed decreases by 1 on each successive iteration.

Comments on Aspects
- Row and column access should be equally efficient (see the discussion on vector stride).
- The vector pipeline should handle a scalar on one input.
- MIN, MAX and SUM operators are required which accept one or more vectors and return a scalar.
- Vectors may start large, but get smaller; this may affect code generation, due to vector startup cost.

Sparse Matrices
- In many engineering computations there may be very large matrices.
  - They may also be sparsely occupied, i.e. contain many zero elements.
- Problems with the normal method of storing the matrix:
  - It occupies far too much memory
  - It is very slow to process

Sparse Matrices...
Consider the following example:
[Figure: the product of two sparse matrices, most of whose elements are zero.]

Sparse Matrices...
- Many packages solve the problem by using a software data representation; the trick is not to store the elements which have zero values.
- At least two sorts of access mechanism are needed:
  - Random
  - Row/column sequential

Sparse Matrices...
- Matrix multiplication then consists of moving down the row/column pointers and multiplying elements if they have the same index values (see the sketch below).
- Hardware can be built to implement this type of data representation.
[Figure: a linked representation in which row pointers and column pointers thread through the stored non-zero elements.]
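A software sketch of the index-matching step, assuming each sparse row and column is kept as (index, value) pairs sorted by index; the struct and names are invented for illustration:

    typedef struct { int idx; double val; } elem;

    /* Walk a sparse row and a sparse column in step, multiplying
       only where the index values match: the core operation of
       sparse matrix multiplication. */
    double sparse_dot(const elem *row, int nr, const elem *col, int nc)
    {
        double sum = 0.0;
        int i = 0, j = 0;
        while (i < nr && j < nc) {
            if (row[i].idx < col[j].idx)      i++;
            else if (row[i].idx > col[j].idx) j++;
            else { sum += row[i].val * col[j].val; i++; j++; }
        }
        return sum;
    }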

Figure 14.2: the design space for floating-point precision
- Floating-point representation:
  - 64-bit: big-endian (IEEE), little-endian (VAX), Cray, IBM
  - 32-bit: big-endian (IEEE), little-endian (VAX)

Figure 14.3: the design space for integer precision
- Integer precision:
  - 128 bit: segmented or fixed
  - 64 bit: segmented or fixed
  - 32 bit: segmented or fixed

Figure 14.8: parallel computation of floating-point and integer results
[Figure: a program splits into a floating-point segment and an integer segment, each with its own instruction buffer and data buffer; the floating-point unit and integer unit run in parallel and deliver to a common result unit.]

Figure 14.9: mixed function and data parallelism
[Figure: a program feeds a floating-point segment of four control buffers and four floating-point units with data buffers 1-4, alongside an integer segment; all results merge in a common result unit.]

Figure: the design space for parallel computational functionality
[Figure: computational complexity divides into single-unit and multiple-unit designs, subdivided by integer and floating-point functionality.]

Figure: communication between CPUs and memory
[Figure: CPUs 0-15 share registers and a real-time clock, and connect through an I/O subsystem to a central memory comprising 8 sections, 64 sub-sections, 1024 banks and 1024 Mwords.]

Figure: the overall architecture of the Convex C4/XA system
[Figure: the processing subsystem (CPU plus communication registers), memory subsystem and I/O subsystem (I/O processor) are joined by a crossbar switch; each crossbar port carries 1.1 Gb/s.]

Figure: the configuration of the crossbar switch
[Figure: CPUs 0-3 and the SIA connect through a non-blocking crossbar (ports 0-3 plus the SCU) to the memories; all data paths are 64-bit, and maximum bandwidth is 4.4 Gb/s.]

Figure: the processor configuration
[Figure: the scalar unit contains 32 address registers, 28 scalar registers, a 256 kb instruction cache, a 16 kb data cache and a 32 KE AT cache; the vector unit contains four Vreg files and a vector crossbar feeding two adder/multiply and logical processor pipes, a divider and a square root unit; both attach to the memory interface.]