
Topic 2: Vector Processing & Vector Architectures
"Lasciate Ogne Speranza, Voi Ch'Intrate" (Dante's Inferno)

Reading List
Slides: Topic2x
Henn&Patt: Appendix G
Other assigned readings from homework and classes

Vector Architectures
Types:
–Register-Register Archs
–Memory-Memory Archs
Vector Arch Components:
–Vector Register Banks: capable of holding n vector elements each, plus two extra registers (vector length and vector mask)
–Vector Functional Units: fully pipelined, with hazard detection (structural and data)
–A Vector Load-Store Unit
–A Scalar Unit: a set of registers, FUs and CUs

An Intro to DLXV
A simplified vector architecture
Consists of one lane per functional unit
–Lane: a parallel pipeline within a functional unit; the number of lanes is the number of vector elements the unit can process per cycle
Loosely based on the Cray-1 arch and ISA
Extension of the DLX ISA for vector archs

DLXV Configuration
Vector Registers
–Eight vector regs, 64 elements each
–Two read ports and one write port per register
–Sixteen read ports and eight write ports in total
Vector Functional Units
–Five functional units
Vector Load and Store Unit
–A bandwidth of 1 word per cycle
–Doubles as the scalar load / store unit
A set of scalar registers
–32 general and 32 FP regs

A Vector / Register Arch
[Block diagram: Main Memory connects to a Vector Load & Store unit; the vector and scalar register files feed the FP Add, FP Multiply, FP Divide, Logical and Integer functional units.]

Advantages
A single vector instruction → a lot of work
No data hazards
–No need to check for data hazards inside vector instructions
–Parallelism inside the vector operation
–Deep pipeline or array of processing elements
Known access pattern
–Latency only paid once per vector (pipelined loading)
–Memory addresses can be mapped to memory modules to reduce contention
Reduction in code size and simplification of hazards
–Loop-related control hazards are eliminated

DAXPY: DLX Code
Y = a * X + Y

      LD    F0, a
      ADDI  R4, Rx, #512   ; last address to load
Loop: LD    F2, 0(Rx)      ; load X(i)
      MULTD F2, F0, F2     ; a x X(i)
      LD    F4, 0(Ry)      ; load Y(i)
      ADDD  F4, F2, F4     ; a x X(i) + Y(i)
      SD    F4, 0(Ry)      ; store into Y(i)
      ADDI  Rx, Rx, #8     ; increment index to X
      ADDI  Ry, Ry, #8     ; increment index to Y
      SUB   R20, R4, Rx    ; compute bound
      BNZ   R20, Loop      ; check if done

The ADDI, SUB and BNZ instructions are loop index calculation and branching overhead.

DAXPY: DLXV Code
Y = a * X + Y

LD     F0, a       ; load scalar a
LV     V1, Rx      ; load vector X
MULTSV V2, F0, V1  ; vector-scalar multiply
LV     V3, Ry      ; load vector Y
ADDV   V4, V2, V3  ; add
SV     Ry, V4      ; store the result

Instruction count [bandwidth] for 64 elements:
DLX Code: 578 instructions (2 setup + 9 loop instructions x 64 iterations)
DLXV Code: 6 instructions

Dead Time
The time it takes the pipeline to become ready for the next vector instruction

Issues
Vector Length Control
–Program vector lengths are usually not equal to (or even a multiple of) the hardware vector length
Vector Stride
–Accesses to a vector may not be consecutive in memory
Solutions:
–Two special registers
 One for vector length, up to a maximum vector length (MVL)
 One for vector mask

Vector Length Control
An Example:

for(i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];

Question: Assume the maximum hardware vector length is MVL, which may be less than n. How should we do the above computation?

Vector Length Control
Strip Mining

Original Code
for(i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];

Strip Mined Code
Low = 0;
VL = n % MVL;                        /* odd-sized first strip */
for(j = 0; j <= n / MVL; ++j){
    for(i = Low; i < Low + VL; ++i)  /* one strip of VL elements */
        y[i] = a * x[i] + y[i];
    Low += VL;
    VL = MVL;                        /* remaining strips are full length */
}

Vector Length Control
Strip Mining
For a vector of arbitrary length n, let M = n % MVL. The strips are:

Value of j    Range of i
0             0 .. M-1
1             M .. M+MVL-1
2             M+MVL .. M+2*MVL-1
...           ...
n/MVL         n-MVL .. n-1

The vector length control register takes values similar to the vector length variable (VL) in the C code

Vector Stride

Matrix Multiply Code
for(i = 0; i < n; ++i)
    for(j = 0; j < n; ++j){
        c[i][j] = 0.0;
        for(k = 0; k < n; ++k)
            c[i][j] += a[i][k] * b[k][j];
    }

How do we vectorize this code? How does stride work here? Recall that in C arrays are stored in memory row-wise, so a and c are accessed with unit stride. How about b?

Vector Stride
Stride for a and c → 1
–Also called unit stride
Stride for b → n elements
Use special instructions
–Load and store with stride (LVWS, SVWS)
Memory banks → non-unit strides can complicate the access pattern == contention

Vector Stride & Memory Systems
Example Memory System
Eight memory banks
Busy time: 6 cycles
Memory latency: 12 cycles
Vector: 64 elements
Strides of 1 and 32
Time to complete a load:
→ 76 cycles for stride 1, or about 1.2 cycles per element
→ 391 cycles for stride 32 (every access hits the same bank), or about 6.2 cycles per element
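To check the arithmetic above, here is a hedged C sketch of the bank model (the function name and timing conventions are assumptions, not code from the course); with the slide's parameters it reproduces the 76- and 391-cycle figures.

#include <stdio.h>

/* Model: element i of a stride-S load goes to bank (i*S) mod n_banks;
   a bank stays busy for `busy` cycles after accepting a request, and
   the word arrives `latency` cycles after the request issues.
   Supports up to 64 banks. */
static long strided_load_time(int n, int stride, int n_banks,
                              int busy, int latency)
{
    long ready[64] = {0};        /* cycle when each bank is free again */
    long cycle = 1, last = 0;    /* at most one request issued per cycle */
    for (int i = 0; i < n; ++i) {
        int bank = (int)((long)i * stride % n_banks);
        if (ready[bank] > cycle)
            cycle = ready[bank]; /* stall until the bank is free */
        ready[bank] = cycle + busy;
        if (cycle + latency > last)
            last = cycle + latency;
        ++cycle;
    }
    return last;                 /* cycle when the final element arrives */
}

int main(void)
{
    /* the slide's example: 8 banks, busy time 6, latency 12, 64 elements */
    printf("stride  1: %ld cycles\n", strided_load_time(64, 1, 8, 6, 12));
    printf("stride 32: %ld cycles\n", strided_load_time(64, 32, 8, 6, 12));
    return 0;                    /* prints 76 and 391 */
}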

Cray-1
"The World's Most Expensive Love Seat..."
Picture courtesy of Cray. Original Source: Cray-1 Computer System Hardware Reference Manual

Cray-1 Data Sheet
Designed by: Seymour Cray
Price: 5 to 8.8 million dollars
Units shipped: 85
Technology: SIMD, deeply pipelined functional units
Performance: up to 160 MFLOPS
Date released: 1976
Best known for: the computer that made the term "supercomputer" mainstream

Architectural Components
Computation Section
–Registers, functional units, instruction buffers
Memory Section
–From 0.25 to 1 million 64-bit words
I/O Section
–12 input channels and 12 output channels
–Connects to the MCU, mass storage subsystem, front-end computers, I/O stations and peripheral equipment

The Cray-1 Architecture
[Diagram: the computation section, divided into vector components, scalar components, and address & instruction calculation components.]

Architecture Features
64-bit word
12.5 nanosecond clock period
2's complement arithmetic
Scalar and vector processing modes
Twelve functional units
Eight 24-bit address (A) registers
Sixty-four 24-bit intermediate address (B) registers
Eight 64-bit scalar (S) registers
Sixty-four 64-bit intermediate scalar (T) registers
Eight 64-element vector (V) registers, 64 bits per element
Four instruction buffers of sixty-four 16-bit parcels each
Integer and floating point arithmetic
128 instruction codes

Architecture Features
Up to 1,048,576 words of memory
–64 data bits and 8 error correction bits (SEC-DED)
Eight or sixteen banks of 65,536 words each
Bank busy time: 4 cycles
Transfer rates:
–B, T, V registers → one word per cycle
–A, S registers → one word per two cycles
–Instruction buffers → four words per clock cycle
Twelve input and twelve output channels, with lost-data detection
Channel groups
–Contain either six input or six output channels
–Served equally by memory (scanned every 4 cycles)
–Priority resolved within the groups
–Sixteen data bits, 3 control bits per channel and 4 parity bits
Original Source: Cray-1 Computer System Hardware Reference Manual

Register-Register Architecture
All ALU operands are in registers
Registers are specialized by function (A, B, T, etc.), avoiding conflicts
Transfers between memory and registers are treated differently from ALU operations
RISC-based idea
Effective use of the Cray-1 requires careful planning to exploit its register resources
–4 Kbytes of high-speed registers

Registers
Primary registers
–Directly addressable by the functional units
–Named V (vector), S (scalar) and A (address)
Intermediate registers
–Used as buffers between memory and the primary registers
–Named T (scalar transfer) and B (address buffering)
[Diagram: Memory ↔ Intermediate Registers ↔ Primary Registers ↔ Functional Units]

Registers
Memory access time: 11 cycles
Register access time: 1 ~ 2 cycles
Primary registers:
–Address regs: 8 x 24 bits
–Scalar regs: 8 x 64 bits
–Vector regs: 8 x 64 words (64 bits each)
Intermediate registers:
–B regs: 64 x 24 bits
–T regs: 64 x 64 bits
Special registers:
–Vector Length register: 0 <= VL <= 64
–Vector Mask register: 64 bits
Total size: 4,888 bytes

Instruction Format
A parcel → 16 bits
Instruction words are 16 bits (one parcel) or 32 bits (two parcels), according to type
One-parcel instruction (arithmetic / logical): op code, result reg, operand regs
Two-parcel instruction (memory): op code, address index reg, result reg, address

Functional Unit Pipelines

Functional pipeline                   Registers used   Pipeline delay (clock periods)
Address add unit                      A                2
Address multiply unit                 A                6
Scalar add unit                       S                3
Scalar shift unit                     S                2 or 3
Scalar logical unit                   S                1
Population/leading zero count unit    S                3
Vector add unit                       V or S           3
Vector shift unit                     V or S           4
Vector logical unit                   V or S           2
Floating-point add unit               S and V          6
Floating-point multiply unit          S and V          7
Reciprocal approximation unit         S and V          14

Vector Units
Vector Addition / Subtraction
–Functional unit delay: 3
Vector Logical Unit
–Boolean ops between 64-bit elements of the vectors: AND, XOR, OR, MERGE, MASK GENERATION
–Functional unit delay: 2
Vector Shift
–Shifts 64-bit (or 128-bit) elements of a vector
–Functional unit delay: 4

Instruction Set
128 instructions
Ten vector types
Thirteen scalar types
Three addressing modes

Implementation Philosophy
Instruction Processing
–Instruction buffering: four instruction buffers of sixty-four 16-bit parcels each
Memory Hierarchy
–Memory banks, T and B register banks
Register and Functional Unit Reservation
–Example: for vector ops, the register operands, register result and FU are marked reserved
Vector Processing

Instruction Processing
"Issue one instruction per cycle"
4 buffers x 64 16-bit instruction parcels
Instruction parcel pre-fetch
Branches whose targets are in a buffer avoid a refetch
4 instructions/cycle fetched into the least recently used (LRU) I-buffer
Special resources: P (program counter), NIP (next instruction parcel), CIP (current instruction parcel) and LIP (lower instruction parcel)

Reservations
Vector operands, results and the functional unit are marked reserved
The vector result reservation is lifted when the chain slot time has passed
–Chain slot: functional unit delay plus two clock cycles
Examples:
V1 = V2 * V3 ; V4 = V5 + V6 → independent (different FUs, no common registers)
V1 = V2 * V3 ; V4 = V5 + V2 → second cannot begin until the first finishes (V2 is reserved)
V1 = V2 * V3 ; V4 = V5 * V6 → second cannot begin until the first finishes (the multiply unit is reserved)

9/18/2006ELEG652-06F34 (a) Type 1 vector instruction (b) Type 2 vector instruction VjVj ~ ~ ~~ VkVk ~~ ViVi ~~ VkVk ~~ ViVi n n... SjSj V. Instructions in the Cray-1

9/18/2006ELEG652-06F35 ~~ ~~ ViVi Memory (c) Type 3 vector instruction (d) Type 4 vector instruction VjVj V. Instructions in the Cray-1

Vector Loops
Long vectors with N > 64 are sectioned
Each pass through the vector loop processes 64 elements
Remainder handling is "transparent" to the programmer

V. Chaining
Similar to the internal forwarding techniques of the IBM 360/91
A "linking process" that occurs when results from one pipeline unit are fed directly into the operand registers of another functional pipe
Chaining allows an operation to issue as soon as the first result of its predecessor becomes available
Registers/F-units must be properly reserved
Chain length is limited by the number of vector registers and functional units (from 2 to 5)

9/18/2006ELEG652-06F38 MemoryMemory a Memory fetch pipe VOVO V1V1 V2V2 V3V3 V4V4 V5V5 Right shift pipe Vector add pipe Logical Product Pipe d c d g i j j l f Chaining Example Mem  V0(M-fetch) V0 + V1  V2(V-add) V2 < A3  V3(left shift) V3 ^ V4  V5(logical product) FetchingAdding Fetching Adding Chain Slot

9/18/2006ELEG652-06F39 Multipipeline chaining SAXPY code Y(1:N) = A x X(1:N) + Y(1:N)

Cray-1 Performance
3 to 160 MFLOPS
–Depends on the application and programming skill
Scalar performance: 12 MFLOPS
Vector dot product: 22 MFLOPS
Peak performance: 153 MFLOPS

Cray X-MP Data Sheet
Designed by: Steve Chen
Price: 15 million dollars
Units shipped: N/A
Technology: SIMD, deeply pipelined functional units
Performance: up to 200 MFLOPS (for a single CPU)
Date released: 1982
Best known for: the successor of the Cray-1 and the first parallel vector computer from Cray Research
An NSA CRAY X-MP/24, on exhibit at the National Cryptologic Museum.

Cray X-MP
Multiprocessor / multiprocessing
8 times the Cray-1 memory bandwidth
Clock: 9.5 ns
Scalar speedup*: 1.25 ~ 2.5
Throughput*: 2.5 ~ 5 times
*Speedup with respect to the Cray-1

Irregular Vector Ops
Scatter: use an index vector to scatter another vector's elements across memory
–X[A[i]] = B[i]
Gather: the reverse operation of scatter
–X[i] = B[C[i]]
Compress
–Using a vector mask, compress a vector
No single instruction for these before 1984
–Poor performance: 2.5 MFLOPS

Gather Operation
V1[i] = A[V0[i]]
[Figure: with VL = 4, the index vector V0 selects elements of array A in memory; the selected contents are packed into V1.]
Example: V1[2] = A[V0[2]] = A[7] = 250

Scatter Operation
A[V0[i]] = V1[i]
[Figure: with VL = 4, the elements of V1 are written to the locations of A selected by the index vector V0 (addresses x100 through x300).]
Example: A[V0[0]] = V1[0], i.e. A[4] = 200, so *(0x200) = 200

Vector Compression Operation
V1 = Compress(V0, VM, Z)
[Figure: the vector mask VM selects which elements of V0 are packed into consecutive positions of V1; VL is reduced accordingly.]
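In scalar C, each of the three irregular operations is a single loop; this hedged sketch (array names mirror the slides, the signatures are illustrative assumptions) shows what the hardware instructions do element by element.

#include <stddef.h>

/* Gather: V1[i] = A[V0[i]] -- an indexed load */
void gather(double *V1, const double *A, const int *V0, size_t VL) {
    for (size_t i = 0; i < VL; ++i)
        V1[i] = A[V0[i]];
}

/* Scatter: A[V0[i]] = V1[i] -- an indexed store */
void scatter(double *A, const int *V0, const double *V1, size_t VL) {
    for (size_t i = 0; i < VL; ++i)
        A[V0[i]] = V1[i];
}

/* Compress: pack the elements of V0 whose mask bit is set into V1;
   returns the new (shorter) vector length. */
size_t compress(double *V1, const double *V0, const unsigned char *VM,
                size_t VL) {
    size_t n = 0;
    for (size_t i = 0; i < VL; ++i)
        if (VM[i])
            V1[n++] = V0[i];
    return n;
}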

Characteristics of Several Vector Architectures

The VMIPS Vector Instructions
A MIPS ISA extended to support vector instructions; essentially the same as DLXV

Multiple Lanes

Vectorizing Compilers

Topic 2a: Performance Analysis of Vector Architectures

Serial, Parallel and Pipelines
[Figure: three ways to compute z = x + y over a stream of operands. Serial: one result per 4 cycles. Pipeline (overlap): one result per cycle once full. Array (replicate): N results per cycle.]

Generic Performance Formula (R. Hockney & C. Jesshope 81)

r = (r∞ · n) / (n1/2 + n)

r∞: MFLOPS performance of the architecture with an infinite-length vector
n1/2: the vector length needed to achieve half of the peak performance
n: the vector length
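As a sanity check on the formula, a minimal sketch (the parameter values are illustrative assumptions, not measurements from the slides):

#include <stdio.h>

/* Hockney's model: delivered rate for vector length n, given the
   asymptotic rate r_inf (MFLOPS) and the half-performance length n_half. */
double hockney_rate(double r_inf, double n_half, double n) {
    return r_inf * n / (n_half + n);
}

int main(void) {
    double r_inf = 160.0, n_half = 10.0;   /* illustrative values only */
    for (int n = 1; n <= 64; n *= 2)
        printf("n = %2d -> %.1f MFLOPS\n",
               n, hockney_rate(r_inf, n_half, n));
    /* note: at n == n_half the delivered rate is exactly r_inf / 2 */
    return 0;
}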

Serial Architecture
Generic formula: t = (s + l·n)·τ, so r∞ = 1/(l·τ) and n1/2 = s/l
Parameters:
–l: number of stages
–τ: time per stage
–s: start-up time (in stage times)
Each result must pass through all l stages before the next one starts

Pipeline Architecture
Generic formula: t = (s + l + n)·τ, so r∞ = 1/τ and n1/2 = s + l
Parameters:
–l: number of stages
–τ: time per stage
–s: start-up time
s + l is the initial penalty; n1/2 is the number of elements that come out of the pipeline in the time it takes to pay that initial penalty

Thus …
The Asymptotic Performance Parameter (r∞)
–Memory-limited peak performance factor
–A scale factor applied to the performance of a particular architecture implemented with a particular technology
–Equivalent to a serial processor with a 100% cache miss rate
The N half Parameter (n1/2)
–Measures the amount of parallelism present in a given architecture
–Determined by a combination of vector unit startup and vector unit latency

The N Half Range
n1/2 ranges from 0 (a serial machine) to ∞ (an infinite array of processors)
The relative performance of different algorithms on a computer is determined by the value of n1/2 (matching problem parallelism with architecture parallelism)

Vector Length vs. Vector Performance
[Plot: execution time t versus vector length n; the slope of the line is the time per element (1/r∞), and its intercept on the n axis is -n1/2.]

Pseudo code
Asymptotic and N half Calculation

overhead = -clock();
overhead += clock();
for(N = 0; N < NMAX; ++N){
    time[N] = -clock();
    for(i = 0; i < N; ++i)
        A[i] = B[i] * C[i];
    time[N] += clock();
    time[N] -= overhead;
}
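A hedged, self-contained version of this measurement (NMAX, the arrays and the fitting step are illustrative assumptions): it times the multiply loop for each length, then fits t(N) = a·N + b by least squares, so that r∞ = 1/a and n1/2 = b/a per the Hockney model. In practice each length would be repeated many times, since clock() is far too coarse for a single pass.

#include <stdio.h>
#include <time.h>

#define NMAX 1024

static double A[NMAX], B[NMAX], C[NMAX];

int main(void) {
    double t[NMAX + 1];
    for (int i = 0; i < NMAX; ++i) { B[i] = 1.0; C[i] = 2.0; }

    /* measure and subtract the timing overhead */
    double overhead = -(double)clock();
    overhead += (double)clock();

    for (int N = 1; N <= NMAX; ++N) {
        double tN = -(double)clock();
        for (int i = 0; i < N; ++i)
            A[i] = B[i] * C[i];
        tN += (double)clock();
        t[N] = (tN - overhead) / CLOCKS_PER_SEC;   /* seconds */
    }

    /* least-squares fit of t = a*N + b; then r_inf = 1/a, n_half = b/a */
    double sn = 0, st = 0, snn = 0, snt = 0, m = NMAX;
    for (int N = 1; N <= NMAX; ++N) {
        sn += N; st += t[N];
        snn += (double)N * N; snt += (double)N * t[N];
    }
    double a = (m * snt - sn * st) / (m * snn - sn * sn);
    double b = (st - a * sn) / m;
    printf("r_inf  = %.3g flops/s\n", 1.0 / a);
    printf("n_half = %.3g elements\n", b / a);
    return 0;
}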

Parameters of Several Parallel Architectures
[Table: measured n1/2, r∞ (MFLOPS) and number of vector units Nv for the CRAY-1, BSP, CDC CYBER 205, TI ASC, CDC STAR-100 (64 x 64) and ICL DAP.]

Another Example
The Chaining Effect
Assume all the s's are the same; the same goes for the l's
Assume m vector operations, unchained
Thus t = m·(s + l + n)·τ
So r∞ = 1/τ and n1/2 = s + l

Another Example
The Chaining Effect
Assume m vector operations, chained
Thus t = (m·(s + l) + n)·τ
So r∞ = m/τ and n1/2 = m·(s + l)

Explanation
[Timing diagram: unchained, Vop 1, Vop 2 and Vop 3 execute back to back, each paying the s + l startup; chained, the startups overlap and results stream through all three pipes.]
Usually this means an increase in both r∞ and n1/2 by a factor of m
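A small sketch comparing the two timing formulas just derived; s, l and τ are in clock periods, and all the values are illustrative assumptions.

#include <stdio.h>

/* m vector ops of length n; s = startup, l = pipe depth, tau = clock */
double t_unchained(int m, int n, int s, int l, double tau) {
    return m * (double)(s + l + n) * tau;    /* ops run back to back */
}

double t_chained(int m, int n, int s, int l, double tau) {
    return ((double)m * (s + l) + n) * tau;  /* startups overlap */
}

int main(void) {
    int m = 3, s = 2, l = 6;
    double tau = 1.0;
    for (int n = 8; n <= 512; n *= 4)
        printf("n=%3d  unchained=%6.0f  chained=%6.0f\n",
               n, t_unchained(m, n, s, l, tau), t_chained(m, n, s, l, tau));
    /* as n grows, chaining approaches an m-fold rate advantage */
    return 0;
}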

Topic 2b: Memory System Design in Vector Machines

Multi Port Memory System
[Diagram: streams A and B feed a pipelined adder (or processor array), producing stream C = A + B.]
A vector addition is possible only when the two vectors (data streams) can be delivered to the vector hardware, be it a pipelined adder or a processor array
The Bandwidth Challenge

Memory in Vector Architectures
Increase bandwidth
–Make memory "wider": wider busses
–Create several memory banks
Interleaved memory banks
–Several memory banks sharing resources
–Broadcast of memory accesses
–Interleaving factor
–Number of banks >= bank busy time
Independent memory banks
–The same idea as interleaved memory, but with independent resources
[Diagram: eight memory banks feeding streams A and B into a pipelined adder that produces stream C = A + B.]

However …
Vector machines prefer independent memory banks, because of:
–Multiple loads or stores per clock
–Memory bank cycle time > CPU cycle time
–Striding
–Parallel processors

Bandwidth Requirement
Assume d is the number of cycles to access a module
Thus BW1 equals 1/d words per cycle
–It means that after d cycles you get one word
Restriction: most [vector] memory systems require at least one word per cycle!!!!
Imagine a requirement of 3 words per cycle (two reads and one write) of total bandwidth (TBW)
Thus, total number of modules = TBW / BW1
–For a total of 3d modules (3 / (1/d) → 3d)

Bandwidth Example
The Cray T90
Clock cycle: ~2.2 ns
Largest configuration: 32 processors
Maximum memory requests per cycle: 6 per processor (four loads and two stores)
SRAM time: 15 ns
Busy time → 15 / 2.2 ≈ 6.9 ≈ 7 cycles
Memory requests → 6 * 32 = 192 in-flight requests per cycle
Number of banks → TBW / BW1 = 192 * 7 → 1344
Historical note: early Cray T90 configurations only supported up to 1024 memory banks
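The same arithmetic as a hedged C fragment (the variable names and the ~2.2 ns figure are assumptions consistent with the slide's division):

#include <stdio.h>
#include <math.h>

int main(void) {
    double sram_ns = 15.0, clock_ns = 2.2;    /* T90-era figures */
    int busy = (int)ceil(sram_ns / clock_ns); /* ~7 cycles of bank busy time */
    int reqs_per_cycle = 6, cpus = 32;
    int in_flight = reqs_per_cycle * cpus;    /* 192 words needed per cycle */
    /* each bank delivers 1/busy words per cycle, so: */
    int banks = in_flight * busy;             /* 1344 banks */
    printf("busy = %d cycles, banks needed = %d\n", busy, banks);
    return 0;
}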

Bandwidth vs. Latency
If each element is in a different memory module, does this guarantee maximum bandwidth? NO
For each module there is:
–an initiation rate r
–a latency d to access a word
A single module can be busy at most 100% of the time: r * d <= 100%, so r <= 1/d
With m modules: rm <= m/d → BW <= m/d words per cycle

An Example
How should arrays A, B, and W be laid out in memory such that the memory can sustain one operation per cycle for the following computation: W = A + B
If the elements of W, A and B all start from module 0, conflicts will result!!!

The Memory Accesses
Assume the pipelined adder has 4 stages and the memory cycle time is 2
[Timing diagram: A[0] becomes ready, then B[0]; W[0] has been computed but cannot be written back because its module is still busy; W[0] is finally written back later.]

Another Memory Layout

Module 0: A[0]  B[6]  W[4]
Module 1: A[1]  B[7]  W[5]
Module 2: A[2]  B[0]  W[6]
Module 3: A[3]  B[1]  W[7]
Module 4: A[4]  B[2]  W[0]
Module 5: A[5]  B[3]  W[1]
Module 6: A[6]  B[4]  W[2]
Module 7: A[7]  B[5]  W[3]
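A hedged sketch that reproduces the table above (the function name is illustrative; the offsets 0, 2 and 4 are read off the table): each array starts in a different module, so in steady state the fetch of A[i], the fetch of B[i] and the delayed write of W[i] land in different banks.

#include <stdio.h>

#define M 8   /* number of memory modules */

/* module holding element i of an array whose element 0 sits in module off */
static int module(int off, int i) { return (off + i) % M; }

int main(void) {
    /* offsets from the slide's table: A starts in module 0,
       B in module 2, W in module 4 */
    for (int i = 0; i < M; ++i)
        printf("A[%d]->M%d  B[%d]->M%d  W[%d]->M%d\n",
               i, module(0, i), i, module(2, i), i, module(4, i));
    return 0;
}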

Memory Access
Assume the pipelined adder has 4 stages and the memory cycle time is 2
[Timing diagram: with the skewed layout, A[0] and B[0] are both ready at the same time, and W[0] is written back as soon as it is ready, with no module conflict.]

2-D Array Layout
[Figure: an 8x8 array a(i,j) distributed across memory modules M0–M7, with column j stored entirely in module Mj.]

Row and Column Access
[Figure: two layouts of the 8x8 array. Left: column j stored in module Mj, so the elements of a row are spread across all modules (good for row vectors). Right: row i stored in module Mi, so the elements of a column are spread across all modules (good for column vectors).]

What to do?
Use algorithms that
–Access rows or columns exclusively
–Possible for some (i.e. LU decomposition) but not for others
Skew the matrix

Row and Column Layout
[Figure: the 8x8 array skewed across modules M0–M7 so that rows and columns can both be accessed without conflict.]
Row stride → 1
Column stride → ?
Problem: Diagonals!!!!!

Vector Access Specs
Initial address
–Determines the association between addresses and memory banks (for independent memory banks)
Number of elements
–Needed for vector length control and for possible multiple accesses to memory
Precision of the elements
–Vector or floating point (if shared)
Stride
–Memory access and memory bank detection

Tricks for Strided Access
If array A is stored consecutively across M memory modules
Assume that V is a subset of A accessed with stride S
Then M consecutive accesses to V fall in modules (a0 + i*S) mod M, and they hit M distinct modules exactly when GCD(M, S) = 1

Our Example
Number of memory modules (M): 8
Row stride is 1
Column stride is 9
GCD(M, S) = 1
In general, if M = 2^k, select S = M + 1 (odd)
–Guarantees GCD(M, S) = 1
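A quick hedged check of the claim (the helper names are illustrative): with M banks and stride S, the sequence (i*S) mod M visits M / GCD(M, S) distinct banks, so coprime strides use all of them.

#include <stdio.h>

static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

/* count the distinct banks touched by M consecutive stride-S accesses */
static int banks_touched(int M, int S) {
    int seen[64] = {0}, count = 0;   /* assumes M <= 64 */
    for (int i = 0; i < M; ++i) {
        int bank = i * S % M;
        if (!seen[bank]) { seen[bank] = 1; ++count; }
    }
    return count;
}

int main(void) {
    int M = 8;
    for (int S = 1; S <= 9; ++S)
        printf("S=%d  gcd=%d  banks touched=%d of %d\n",
               S, gcd(M, S), banks_touched(M, S), M);
    /* S = 9 (= M + 1) touches all 8 banks; S = 8 touches only one */
    return 0;
}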

Another Approach
Let M be prime
–Then GCD(M, S) = 1 for every S < M
Example: The Burroughs Scientific Processor (BSP) [Kuck & Budnick 71]
–16 PEs, 17 memory banks, 2 alignment networks

The Burroughs Scientific Processor

Special Note and Other Approaches
If memory time were one cycle (i.e. d = 1), consecutive accesses would cause no problem (which is not the case at all)
Another solution to reduce contention
–RANDOMIZE!!!!
–Generate a (pseudo-)random bank for each word
–Used by the Tera MTA

Questions?? Comments??

Bibliography
Stork, Christian. "Exploring the Tera MTA by Example." May 10, 2000.
Cray-1 Computer System Hardware Reference Manual. Cray Research, Inc.
Koopman, Philip. "17. Vector Performance." Lecture at Carnegie Mellon University, November 9.

Side Note 2
Gordon Bell's Rules for Supercomputers
Pay attention to performance
Orchestrate your resources
Scalars are bottlenecks
–Plan your scalar section very carefully
Use existing (affordable) technology to provide peak vector performance
–Vector performance: bandwidth and vector register capacity
–Rule of thumb: at least two results per cycle
Do not ignore expensive ops
–Divisions are uncommon but they are used!!!!
Design to market
–Create something to brag about

Side Note 2
Gordon Bell's Rules for Supercomputers
10 ~ 7
–Support around two extra bits of addressing every 3 years (10 years ~= 7 bits)
Increase productivity
Stand on the shoulders of giants
Pipeline your design
–Design for one generation, and then the next, and then the next…
Leave enough slack time in your schedule for the unexpected
–Beware of Murphy's Laws!!!!