Vector Processors Prof. Sivarama Dandamudi School of Computer Science Carleton University

© S. Dandamudi

Pipelining
- Vector machines exploit pipelining in all their activities
  - Computations
  - Movement of data to/from memory
- Pipelining provides overlapped execution
  - Increases throughput
  - Hides latency

Pipelining (cont'd)
Pipeline overlaps execution: 6 versus 18 cycles

Pipelining (cont'd)
- One measure of performance:

    Speedup = Non-pipelined execution time / Pipelined execution time

- Ideal case: an n-stage pipeline should give a speedup of n
- Two factors keep it below n:
  - Pipeline fill
  - Pipeline drain

Pipelining (cont'd)
- N computations, each takes n * T time
- Non-pipelined time = N * n * T
- Pipelined time = n * T + (N - 1) * T = (n + N - 1) * T

    Speedup = (n * N) / (n + N - 1) = 1 / (1/N + 1/n - 1/(n * N))
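The speedup formula above can be checked numerically; this is a small sketch (the helper name is my own), showing that speedup approaches the pipeline depth n as the number of computations N grows:

```python
def pipeline_speedup(n, N):
    """Speedup = (N * n * T) / ((n + N - 1) * T); the stage time T cancels."""
    non_pipelined = N * n        # N tasks, each needing n stage-times
    pipelined = n + N - 1        # n cycles to fill, then one result per cycle
    return non_pipelined / pipelined

print(pipeline_speedup(6, 4))     # short vector: 24/9 ~ 2.67
print(pipeline_speedup(6, 1000))  # long vector: ~5.97, close to n = 6
```

The fill and drain terms (the "- 1" and the extra n) matter only for small N, which is why vector machines favor long vectors.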

Pipelining (cont'd)
[Figure: speedup curves for pipeline depths n = 3, 6, and 9]

Pipelining (cont'd)
[Figure: speedup as a function of pipeline depth, n]

Vector Machines
- Provide high-level operations
  - Work on vectors (linear arrays of numbers)
- A typical vector operation
  - Add two 64-element floating-point vectors
  - Equivalent to an entire loop
- CRAY format: V3 V2 VOP V1 (i.e., V3 = V2 VOP V1)

Vector Machines (cont'd)
- Consists of
  - Scalar unit
    - Works on scalars
    - Address arithmetic
  - Vector unit
    - Responsible for vector operations
    - Several vector functional units: integer add, FP add, FP multiply, ...

Vector Machines (cont'd)
- Two types of architecture
  - Memory-to-memory architecture
    - Vectors are memory resident
    - First machines were of this type
    - Examples: CDC Star 100, CYBER 205
  - Vector-register architecture
    - Vectors are stored in registers
    - Modern vector machines belong to this type
    - Examples: Cray 1/2/X-MP/Y-MP, NEC SX/2, Fujitsu VP200, Hitachi S820

Components
- Primary components of a vector-register machine
  - Vector registers
    - Each register can hold a small vector
    - Example: Cray-1 has 8 vector registers
    - Each vector register can hold 64 doublewords (64-bit values)
    - Two read ports and one write port
      - Allows overlap among vector operations

Cray-1 Architecture

Components
- Vector functional units
  - Each unit is fully pipelined
    - Can start a new operation on every clock cycle
  - Cray-1 has six functional units
    - FP add, FP multiply, FP reciprocal, integer add, logical, shift
- Scalar registers
  - Store scalars
  - Compute addresses to pass on to the load/store unit

Components
- Vector load/store unit
  - Moves vectors between memory and vector registers
  - Load and store operations are pipelined
  - Some processors have more than one load/store unit
    - NEC SX/2 has 8 load/store units
- Memory
  - Designed to allow pipelined access
  - Typically uses interleaved memories (discussed later)

Some Example Vector Machines

Machine        Year   # VRs     VR size    # LSUs
CRAY-1         1976   8         64         1
Cray Y-MP      1988   8         64         2 loads/1 store
Fujitsu VP200  1982   8-256     32-1024    2
Hitachi S820   1983   32        256        4
NEC SX/2       1984   8 + 8192  256 var.   8
Convex C-1     1985   8         128        1

Some Example Vector Machines (cont'd)
- Vector functional units
  - Cray X-MP/Y-MP: 8 units
    - FP add, FP multiply, FP reciprocal
    - Integer add
    - 2 logical
    - Shift
    - Population count/parity

Some Example Vector Machines (cont'd)
- Vector functional units (cont'd)
  - NEC SX/2: 16 units
    - 4 FP add
    - 4 FP multiply/divide
    - 4 integer add/logical
    - 4 shift

Advantages of Vector Machines
- Flynn's bottleneck can be reduced
  - Vector instructions significantly improve code density
- A single vector instruction specifies a great deal of work
  - Reduces the number of instructions needed to execute a program
- Control overhead of a loop is eliminated
  - A vector instruction represents the entire loop
  - Loop overhead can be substantial

Advantages of Vector Machines (cont'd)
- Impact of main-memory latency can be reduced
  - Vector instructions that access memory have a known pattern
    - Pipelined access can be used
    - Can exploit interleaved memory
  - High memory latency is amortized over the entire vector
    - Latency is not paid per data item, as it is when accessing a single floating-point number

Advantages of Vector Machines (cont'd)
- Control hazards can be reduced
  - Vector machines organize data operands into regular sequences
    - Suitable for pipelined access in hardware
  - One vector operation replaces an entire loop
- Data hazards can be eliminated
  - Due to the structured nature of the data
  - Allows planned prefetching of data

Example Problem
- A typical vector problem: Y = a * X + Y
  - X and Y are vectors
- This problem is known as
  - SAXPY (single-precision a*X plus Y)
  - DAXPY (double-precision a*X plus Y)
- SAXPY/DAXPY is a small piece of code that accounts for most of the execution time in the Linpack benchmark

Example Problem (cont'd)
- Non-vector code fragment

        LD    F0,a
        ADDI  R4,Rx,#512    ;last address to load
  loop: LD    F2,0(Rx)      ;F2 := M[0+Rx], i.e., load X[i]
        MULT  F2,F0,F2      ;a*X[i]

Example Problem (cont'd)

        LD    F4,0(Ry)      ;load Y[i]
        ADD   F4,F2,F4      ;a*X[i] + Y[i]
        SD    F4,0(Ry)      ;store into Y[i]
        ADDI  Rx,Rx,#8      ;increment index to X
        ADDI  Ry,Ry,#8      ;increment index to Y
        SUB   R20,R4,Rx     ;R20 := R4 - Rx
        JNZ   R20,loop      ;jump if not done

  9 instructions in the loop

Example Problem (cont'd)
- Vector code fragment

        LD     F0,a        ;load scalar a
        LV     V1,Rx       ;load vector X
        MULTSV V2,F0,V1    ;V2 := F0 * V1
        LV     V3,Ry       ;load vector Y
        ADDV   V4,V2,V3    ;V4 := V2 + V3
        SV     Ry,V4       ;store the result

  Only 6 vector instructions!
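The contrast above can be sketched in plain Python, with NumPy standing in for the vector unit (this is my analogy for illustration, not Cray code):

```python
import numpy as np

def daxpy_scalar(a, X, Y):
    """Element-by-element loop: per-iteration control overhead, like the 9-instruction loop."""
    for i in range(len(X)):
        Y[i] = a * X[i] + Y[i]   # load, multiply, add, store for every element

def daxpy_vector(a, X, Y):
    """Whole-vector operation: one statement, no per-element branching."""
    Y[:] = a * X + Y

X = np.arange(64, dtype=float)
Y = np.ones(64)
daxpy_vector(2.0, X, Y)          # Y[i] becomes 2*i + 1
```

The scalar version executes its loop body 64 times; the vector version expresses the same work as a single array operation, mirroring the 6-instruction vector fragment.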

Example Problem (cont'd)
- Two main observations
  - Execution efficiency
    - Vector code: executes 6 instructions
    - Non-vector code: nearly 600 instructions (9 * 64)
  - Lots of control overhead in the non-vector code
    - 4 out of 9 instructions in the loop
    - Absent in the vector code

Example Problem (cont'd)
- Two main observations (cont'd)
  - Frequency of pipeline interlock
    - Non-vector code:
      - Every ADD must wait for MULT
      - Every SD must wait for ADD
      - Loop unrolling can eliminate this interlock
    - Vector code:
      - Each instruction is independent
      - Pipeline stalls once per vector operation, not once per vector element

Vector Length
- A vector register has a natural vector length
  - 64 elements in CRAY systems
- What if the vector has a different length? Three cases:
  - Vector length < vector register length
    - Use a vector length (VL) register to indicate the actual length
  - Vector length = vector register length
  - Vector length > vector register length

Vector Length (cont'd)
- Vector length > vector register length
  - Use strip mining
  - The vector is partitioned into strips that are less than or equal to the vector register length
  - Any leftover elements (vector length mod register length) form a shorter, odd-size strip
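Strip mining can be sketched as follows, assuming a maximum vector register length of 64 as in the CRAY systems above (function and variable names are my own):

```python
MVL = 64  # maximum vector length (elements per vector register)

def strip_lengths(n):
    """Partition an n-element vector into strips of at most MVL elements.
    The odd-size strip (n mod MVL) is processed first; the rest are full strips."""
    strips = []
    first = n % MVL
    if first:
        strips.append(first)           # odd-size strip, if any
    strips += [MVL] * (n // MVL)       # remaining full-length strips
    return strips

print(strip_lengths(200))   # [8, 64, 64, 64]
```

A real strip-mined loop would set the VL register to each strip length in turn before issuing the vector instructions for that strip.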

Vector Stride
- Vector stride
  - Distance separating the elements that are to be merged into a single vector
  - Measured in elements, not bytes
- Multidimensional matrices typically lead to non-unit stride access patterns
  - Example: matrix multiply

Vector Stride (cont'd)
- Matrix multiplication

    for (i = 1, 100)
      for (j = 1, 100)
        A[i,j] = 0
        for (k = 1, 100)
          A[i,j] = A[i,j] + B[i,k] * C[k,j]

  In the inner (k) loop, one of B[i,k] and C[k,j] is accessed with unit stride and the other with non-unit stride, depending on how the matrices are stored
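The stride difference in the inner loop above can be made concrete with a little address arithmetic; this sketch assumes row-major storage of 100-column matrices of 8-byte doubles (the helper name is my own):

```python
def addr_row_major(base, row, col, ncols, size=8):
    """Byte address of element [row, col] in a row-major matrix."""
    return base + (row * ncols + col) * size

NCOLS = 100
# B[i,k] along k (row fixed): consecutive addresses -> unit stride (8 bytes)
b_addrs = [addr_row_major(0, 1, k, NCOLS) for k in range(3)]
# C[k,j] along k (column fixed): NCOLS*8 bytes apart -> non-unit stride
c_addrs = [addr_row_major(0, k, 1, NCOLS) for k in range(3)]

print(b_addrs)   # [800, 808, 816]
print(c_addrs)   # [8, 808, 1608]
```

Under column-major (FORTRAN) storage the roles reverse: C[k,j] becomes the unit-stride access and B[i,k] the strided one.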

Vector Stride (cont'd)
- The access patterns of B and C depend on how the matrix is stored
  - Row-major
    - Matrix is stored row by row
    - Used by most languages except FORTRAN
  - Column-major
    - Matrix is stored column by column
    - Used by FORTRAN

Vector Stride (cont'd)
[Figure: row-major versus column-major storage layout]

Cray X-MP Instructions
- Integer addition
    Vi  Vj+Vk     Vi = Vj + Vk
    Vi  Sj+Vk     Vi = Sj + Vk   (Sj is a scalar)
- Floating-point addition
    Vi  Vj+FVk    Vi = Vj + Vk
    Vi  Sj+FVk    Vi = Sj + Vk   (Sj is a scalar)

Cray X-MP Instructions (cont'd)
- Load instructions
  - Vi ,A0,Ak
    - Vector load with stride Ak
    - Loads VL elements starting at memory address A0
  - Vi ,A0,1
    - Vector load with stride 1 (special case)

Cray X-MP Instructions (cont'd)
- Store instructions
  - ,A0,Ak Vi
    - Vector store with stride Ak
    - Stores VL elements starting at memory address A0
  - ,A0,1 Vi
    - Vector store with stride 1 (special case)

Cray X-MP Instructions (cont'd)
- Logical AND instructions
    Vi  Vj&Vk    Vi = Vj & Vk
    Vi  Sj&Vk    Vi = Sj & Vk   (Sj is a scalar)
- Shift instructions
    Vi  Vj>Ak    Vi = Vj >> Ak
    Vi  Vj<Ak    Vi = Vj << Ak
  - Left/right shift each element of Vj and store the result in Vi

Sample Vector Functional Units

Vector functional unit   # Stages   Available to chain   Vector results
Integer ADD (64-bit)     3          8                    VL+8
64-bit shift             3          8                    VL+8
128-bit shift            4          9                    VL+9
Floating ADD             6          11                   VL+11
Floating MULTIPLY        7          12                   VL+12

X-MP Pipeline Operation
- Three phases
  - Setup phase
    - Sets functional units to perform the appropriate operation
    - Establishes routes to source and destination vector registers
    - Requires 3 clock cycles for all functional units
  - Execution phase
  - Shutdown phase

X-MP Pipeline Operation (cont'd)
- Three phases (cont'd)
  - Execution phase
    - Source and destination vector registers are reserved
      - Cannot be used by another instruction
    - Source vector register is reserved for VL+3 clock cycles
      - VL = vector length
    - One pair of operands enters the first stage per clock cycle

X-MP Pipeline Operation (cont'd)
- Three phases (cont'd)
  - Shutdown phase
    - Shutdown time = 3 clock cycles
    - Shutdown time is the difference between when the last result emerges and when the destination vector register becomes available for other instructions

X-MP Pipeline Operation (cont'd)
- Three phases (cont'd)
  - Shutdown phase
    - Destination register becomes available after

        3 + n + (VL - 1) + 3 = n + VL + 5 clock cycles

      - Setup time = shutdown time = 3 clock cycles
      - First result comes after n clock cycles
      - Remaining (VL - 1) results come out at one per clock cycle
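The cycle count above can be sketched directly from the slide's breakdown (the function name is my own):

```python
def busy_cycles(n, VL):
    """Cycles the destination register stays reserved for an n-stage unit:
    setup (3) + first result (n) + remaining results (VL - 1) + shutdown (3)."""
    return 3 + n + (VL - 1) + 3   # simplifies to n + VL + 5

# Floating ADD (n = 6 stages) on a full 64-element vector:
print(busy_cycles(6, 64))   # 75 clock cycles
```

For long vectors the VL term dominates, so the fixed setup and shutdown costs are amortized over the whole vector.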

A Simple Vector Add Operation

  A1  5           ;A1 = 5
  VL  A1          ;set vector length VL = A1
  V1  V2+FV3      ;V1 = V2 + V3 (floating-point add)

Overlapped Vector Operations

  A1  5           ;A1 = 5
  VL  A1          ;set vector length VL = A1
  V1  V2+FV3      ;floating-point add
  V4  V5*FV6      ;independent floating-point multiply; overlaps the add

Chaining Example

  A1  5           ;A1 = 5
  VL  A1          ;set vector length VL = A1
  V1  V2+FV3      ;floating-point add
  V4  V5*FV1      ;multiply uses V1, the add's result: the units are chained

Vector Processing Performance

Interleaved Memories
- Traditional memory designs
  - Provide sequential, non-overlapped access
  - Use high-order interleaving
- Interleaved memories
  - Facilitate overlapped, pipelined access
  - Used by vector and high-performance systems
  - Use low-order interleaving

Interleaved Memories (cont'd)

Interleaved Memories (cont'd)
- Two types of designs
  - Synchronized access organization
    - Upper m address bits are given to all memory banks simultaneously
    - Requires output latches
    - Does not efficiently support non-sequential access
  - Independent access organization
    - Supports pipelined access for arbitrary access patterns
    - Requires address registers

Interleaved Memories (cont'd)
Synchronized access organization

Interleaved Memories (cont'd)
Pipelined transfer of data in interleaved memories

Interleaved Memories (cont'd)
Independent access organization

Interleaved Memories (cont'd)
- Number of banks B: choose B >= M
  - M = memory access time in cycles
- Access degenerates to sequential if stride = B (every access falls on the same bank)
- Example: B = 8, M = 6 clock cycles, stride = 1
  - Time to read 16 words = 6 + 16 = 22 clock cycles
  - If the stride is 8, it takes 16 * 6 = 96 clock cycles
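The example above can be sketched with a simple cycle-count model (my own simplification: stride-1-style overlapped access when the stride does not map every reference to one bank, fully serialized access when it does):

```python
def read_cycles(words, M, B, stride):
    """Approximate cycles to read `words` elements from B interleaved banks
    with access time M cycles, under the slide's two scenarios."""
    if stride % B == 0:
        return words * M    # every access hits the same bank: serialized
    return M + words        # overlapped: M cycles to fill, then ~1 word/cycle

print(read_cycles(16, 6, 8, 1))   # 22 clock cycles
print(read_cycles(16, 6, 8, 8))   # 96 clock cycles
```

This reproduces the slide's numbers: 22 cycles with unit stride versus 96 cycles when the stride equals the number of banks.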