
1 COSC3330 Computer Architecture Lecture 18. Vector Machine
Instructor: Weidong Shi (Larry), PhD, Computer Science Department, University of Houston

2 Topics Vector Machine

3 VLIW In a classic VLIW, the compiler is responsible for avoiding all hazards -> simple hardware, complex compiler. Static scheduling is difficult in the presence of unpredictable branches and variable-latency memory. VLIWs have been somewhat successful in embedded computing (e.g., TI DSPs), but have had no clear success in general-purpose computing despite several attempts.

4 Supercomputers Definition of a supercomputer:
Fastest machine in the world at a given task; a device to turn a compute-bound problem into an I/O-bound problem. The CDC 6600 (Cray, 1964) is regarded as the first supercomputer. In the 70s-80s, supercomputer effectively meant vector machine.

5 Vector Supercomputers
Epitomized by the Cray-1, 1976:
Scalar unit: load/store architecture.
Vector extension: vector registers, vector instructions.
Implementation: hardwired control, highly pipelined functional units, interleaved memory system, no data caches, no virtual memory.

6 SIMD A SIMD (single instruction, multiple data) architecture performs the same operation on multiple data elements in parallel, e.g., PADDW MM0, MM1 (the MMX packed add of 16-bit words).
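As a rough illustration in plain C (not the actual MMX intrinsic; the function name is made up), PADDW performs four packed 16-bit adds in a single instruction; the scalar equivalent is this short loop:

#include <stdint.h>

/* Scalar equivalent of a packed 16-bit add such as PADDW (illustrative sketch):
   SIMD hardware performs all four adds at once. */
void paddw_equiv(int16_t dst[4], const int16_t a[4], const int16_t b[4]) {
    for (int i = 0; i < 4; i++)
        dst[i] = (int16_t)(a[i] + b[i]);
}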

7 Vector Length Register
Scalar registers r0-r15 sit alongside vector registers v0-v15; each vector register holds elements [0], [1], [2], …, [VLRMAX-1], and the vector length register VLR holds the number of elements the current vector instructions operate on.

8 Vector Arithmetic Instructions
ADDV v3, v1, v2: element [i] of v1 is added to element [i] of v2 and written to element [i] of v3, for i = [0] … [VLR-1].
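A minimal C sketch of the ADDV semantics (the arrays and vlr stand in for the vector registers and the VLR; names are illustrative):

#include <stddef.h>

/* Semantics of ADDV v3, v1, v2: element-wise add of the first vlr elements. */
void addv(double *v3, const double *v1, const double *v2, size_t vlr) {
    for (size_t i = 0; i < vlr; i++)
        v3[i] = v1[i] + v2[i];
}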

9 Vector Load and Store Instructions
LV v1, r1, r2 loads vector register v1 from memory, starting at the base address in r1 and stepping by the stride in r2.
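A minimal C sketch of the strided-load semantics, assuming the stride is counted in elements (names are illustrative):

#include <stddef.h>

/* Semantics of LV v1, r1, r2: v1[i] = Mem[base + i*stride] for i = 0..vlr-1,
   with the stride counted in elements for this sketch. */
void vector_load(double *v1, const double *mem, size_t base, size_t stride, size_t vlr) {
    for (size_t i = 0; i < vlr; i++)
        v1[i] = mem[base + i * stride];
}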

10 Vector Code Example
# C code
for (i=0; i<64; i++)
    C[i] = A[i] + B[i];
# Scalar Code
    LI R4, 64
loop:
    L.D F0, 0(R1)
    L.D F2, 0(R2)
    ADD.D F4, F2, F0
    S.D F4, 0(R3)
    DADDIU R1, 8
    DADDIU R2, 8
    DADDIU R3, 8
    DSUBIU R4, 1
    BNEZ R4, loop
# Vector Code
    LI VLR, 64
    LV V1, R1
    LV V2, R2
    ADDV.D V3, V1, V2
    SV V3, R3

11 Vector Instruction Set Advantages
Compact: one short instruction encodes N operations.
Expressive: tells hardware that these N operations are independent, use the same functional unit, access disjoint registers, access registers in the same pattern as previous instructions, and access a contiguous block of memory (unit-stride load/store).
Scalable: can run the same code on more parallel pipelines (lanes).

12 Vector Arithmetic Execution
Use a deep pipeline (=> fast clock) to execute element operations. Control of the deep pipeline is simple because the elements in a vector are independent (=> no hazards!). Figure: a six-stage multiply pipeline reading V1 and V2 and writing V3.

13 Vector Instruction Execution
ADDV C, A, B: execution using one pipelined functional unit completes one element per cycle; execution using four pipelined functional units completes four elements per cycle.

14 Vector Memory System
Cray-1: 16 memory banks, 4-cycle bank busy time, 12-cycle latency. Bank busy time is the number of cycles between accesses to the same bank. An address generator takes the base and stride and spreads successive element addresses across the banks, returning data to the vector registers.
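A small C sketch (NBANKS matches the 16 Cray-1 banks above; the stride value is just an illustration) showing how a stride that shares a factor with the bank count concentrates accesses on a few banks and runs into the bank busy time:

#include <stdio.h>

#define NBANKS 16   /* Cray-1: 16 interleaved banks */

int main(void) {
    int stride = 8;                       /* try 1 (unit stride) vs 8 */
    for (int i = 0; i < 8; i++) {
        int addr = i * stride;            /* word address of element i */
        printf("element %d -> bank %d\n", i, addr % NBANKS);
    }
    return 0;                             /* stride 8 reuses only banks 0 and 8 -> conflicts */
}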

15 Vector Unit Structure
Figure: the vector unit is organized as lanes, each holding a slice of the vector registers and a pipeline of each functional unit, all connected to the memory subsystem. With four lanes, lane 0 holds elements 0, 4, 8, …; lane 1 holds elements 1, 5, 9, …; lane 2 holds elements 2, 6, 10, …; lane 3 holds elements 3, 7, 11, …
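A one-line C sketch of the element-to-lane mapping implied above (NLANES = 4 is assumed):

/* A 4-lane vector unit processes element i in lane i % NLANES. */
enum { NLANES = 4 };
static inline int lane_of(int element) { return element % NLANES; }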

16 Vector Instruction Parallelism
Can overlap execution of multiple vector instructions. Example machine: 32 elements per vector register, 8 lanes, and separate load, multiply, and add units. With a load, a multiply, and an add in flight simultaneously, it completes 24 operations/cycle (3 units x 8 lanes) while issuing only 1 short instruction/cycle.

17 Vector Chaining Vector version of register bypassing
Introduced with the Cray-1. Example: LV v1; MULV v3, v1, v2; ADDV v5, v3, v4 — the load result arriving in v1 is chained into the multiply unit, and the multiply result arriving in v3 is chained into the add unit.

18 Vector Chaining Advantage
Without chaining, a dependent instruction must wait for the last element of the result to be written before it can start. With chaining, the dependent instruction can start as soon as the first result element appears.

19 Automatic Code Vectorization
for (i=0; i < N; i++) C[i] = A[i] + B[i];
Scalar sequential code executes load, add, store for iteration 1, then for iteration 2, and so on. Vectorized code issues one vector load, one vector add, and one vector store that cover many iterations at once. Vectorization is a massive compile-time reordering of operation sequencing and requires extensive loop dependence analysis.

20 Vector Stripmining Problem: Vector registers have finite length
Solution: Break loops into pieces that fit in the registers, "stripmining".
for (i=0; i<N; i++)
    C[i] = A[i]+B[i];
    ANDI R1, N, 63    # N mod 64
    MTC1 VLR, R1      # Do remainder
loop:
    LV V1, RA
    DSLL R2, R1, 3    # Multiply by 8
    DADDU RA, RA, R2  # Bump pointer
    LV V2, RB
    DADDU RB, RB, R2
    ADDV.D V3, V1, V2
    SV V3, RC
    DADDU RC, RC, R2
    DSUBU N, N, R1    # Subtract elements
    LI R1, 64
    MTC1 VLR, R1      # Reset full length
    BGTZ N, loop      # Any more to do?
(Figure: A + B -> C processed as 64-element strips plus a remainder strip.)
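The same strip-mining structure as plain C, with MVL standing in for the 64-element maximum vector length (an assumed constant):

#include <stddef.h>

#define MVL 64   /* maximum hardware vector length (assumed) */

/* Strip-mined C[i] = A[i] + B[i]: first strip handles N mod MVL, the rest are full strips. */
void stripmine_add(double *C, const double *A, const double *B, size_t N) {
    size_t vl = N % MVL;                  /* length of the remainder strip */
    if (vl == 0) vl = MVL;
    for (size_t i = 0; i < N; ) {
        for (size_t j = 0; j < vl && i + j < N; j++)   /* one vector instruction's worth */
            C[i + j] = A[i + j] + B[i + j];
        i += vl;
        vl = MVL;                         /* all later strips use the full vector length */
    }
}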

21 Vector Scatter/Gather
Want to vectorize loops with indirect accesses:
for (i=0; i<N; i++)
    A[i] = B[i] + C[D[i]];
Indexed load instruction (gather):
    LV vD, rD          # Load indices in D vector
    LVI vC, rC, vD     # Load indirect from rC base
    LV vB, rB          # Load B vector
    ADDV.D vA, vB, vC  # Do add
    SV vA, rA          # Store result
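A minimal C sketch of the indexed-load (gather) semantics; the function and argument names are illustrative:

#include <stddef.h>

/* Semantics of LVI vC, rC, vD: vC[i] = Mem[rC + vD[i]] for each element, indices taken from vD. */
void vector_gather(double *vC, const double *mem_rC, const size_t *vD, size_t vlr) {
    for (size_t i = 0; i < vlr; i++)
        vC[i] = mem_rC[vD[i]];
}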

22 Vector Scatter/Gather
Scatter example:
for (i=0; i<N; i++)
    A[B[i]]++;
Is the following a correct translation?
    LV vB, rB        # Load indices in B vector
    LVI vA, rA, vB   # Gather initial A values
    ADDV vA, vA, 1   # Increment
    SVI vA, rA, vB   # Scatter incremented values
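The translation is only correct if the indices in B are all distinct: with duplicates, every gathered copy reads the same old value, so the scatter writes back a single increment instead of the duplicate count. A small C demonstration with made-up data:

#include <stdio.h>

int main(void) {
    int A[4] = {0, 0, 0, 0};
    int B[4] = {2, 2, 2, 0};              /* index 2 appears three times */
    int vA[4];
    for (int i = 0; i < 4; i++) vA[i] = A[B[i]];   /* gather: all three copies read 0 */
    for (int i = 0; i < 4; i++) vA[i] += 1;        /* increment */
    for (int i = 0; i < 4; i++) A[B[i]] = vA[i];   /* scatter: A[2] ends up 1, not 3 */
    printf("A[2] = %d (the scalar loop would give 3)\n", A[2]);
    return 0;
}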

23 Vector Conditional Execution
Problem: want to vectorize loops with conditional code:
for (i=0; i<N; i++)
    if (A[i] > 0) A[i] = B[i];
Solution: add vector mask (or flag) registers (1 bit per element) and maskable vector instructions; a vector operation becomes a NOP at elements where the mask bit is clear.
Code example:
    CVM             # Turn on all elements
    LV vA, rA       # Load entire A vector
    SGTVS.D vA, F0  # Set bits in mask register where A>0
    LV vA, rB       # Load B vector into A under mask
    SV vA, rA       # Store A back to memory under mask
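In C terms, the masked sequence computes the following (a sketch of the semantics, not of the ISA itself):

#include <stddef.h>

/* Semantics of the masked sequence: A[i] = B[i] only where A[i] > 0. */
void masked_copy(double *A, const double *B, size_t N) {
    for (size_t i = 0; i < N; i++) {
        int mask = (A[i] > 0.0);          /* SGTVS.D sets one mask bit per element */
        if (mask)
            A[i] = B[i];                  /* masked LV/SV touch only elements with mask=1 */
    }
}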

24 Masked Vector Instructions
Simple implementation: execute all N operations, and turn off result writeback according to the mask. Density-time implementation: scan the mask vector and execute only the elements whose mask bits are set.

25 Compress/Expand Operations
Compress packs the non-masked elements from one vector register contiguously at the start of the destination vector register; the population count of the mask vector gives the packed vector length. Expand performs the inverse operation. Used for density-time conditionals and also for general selection operations.
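A minimal C sketch of the compress semantics; the returned length equals the population count of the mask (names are illustrative):

#include <stddef.h>

/* Compress: pack elements of src whose mask bit is set into the front of dst.
   Returns the packed length (== population count of the mask). */
size_t vcompress(double *dst, const double *src, const unsigned char *mask, size_t vlr) {
    size_t k = 0;
    for (size_t i = 0; i < vlr; i++)
        if (mask[i])
            dst[k++] = src[i];
    return k;
}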

26 Vector Reductions
Problem: loop-carried dependence on reduction variables
sum = 0;
for (i=0; i<N; i++)
    sum += A[i];             # Loop-carried dependence on sum
Solution: re-associate operations if possible, use a binary tree to perform the reduction
# Rearrange as:
sum[0:VL-1] = 0              # Vector of VL partial sums
for (i=0; i<N; i+=VL)        # Stripmine VL-sized chunks
    sum[0:VL-1] += A[i:i+VL-1];   # Vector sum
# Now have VL partial sums in one vector register
do {
    VL = VL/2;                     # Halve vector length
    sum[0:VL-1] += sum[VL:2*VL-1]  # Halve no. of partials
} while (VL>1)
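The same strategy as runnable C, with MVL as an assumed maximum vector length: accumulate MVL partial sums across strips, then halve the vector of partials until one remains.

#include <stddef.h>

#define MVL 64   /* assumed maximum vector length */

double vector_sum(const double *A, size_t N) {
    double partial[MVL] = {0};
    for (size_t i = 0; i < N; i += MVL)              /* strip-mined vector adds */
        for (size_t j = 0; j < MVL && i + j < N; j++)
            partial[j] += A[i + j];
    for (size_t vl = MVL / 2; vl >= 1; vl /= 2)      /* binary-tree reduction of the partials */
        for (size_t j = 0; j < vl; j++)
            partial[j] += partial[j + vl];
    return partial[0];
}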

27 Thread Level Parallelism

28 TLP
Extracting ILP from a single program is hard: the parallelism is far-flung across the instruction stream, and we are human after all, programming with a sequential mindset.
Reality: we run multiple threads or programs. Thread-level parallelism covers time multiplexing, throughput computing, multiple program workloads, and multiple concurrent threads.

29 Multithreading It is difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control. Many workloads can instead make use of thread-level parallelism (TLP): TLP from multiprogramming (running independent sequential jobs) and TLP from multithreaded applications (running one job faster using parallel threads). Multithreading uses TLP to improve utilization of a single processor.

30 UNIX Threads A thread is a basic unit of CPU utilization; it consists of a program counter, a register set, and stack space. A thread shares with its peer threads its code segment, data segment, and operating-system resources. An OS may support multiple processes, and a process can have multiple threads.
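A minimal POSIX-threads sketch of the sharing described above (peer threads see the same data segment but have private stacks and registers); error handling is omitted:

#include <pthread.h>
#include <stdio.h>

int shared_counter = 0;                    /* data segment: visible to all threads */

void *worker(void *arg) {
    int local = *(int *)arg;               /* stack/registers: private to this thread */
    __sync_fetch_and_add(&shared_counter, local);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    int a = 1, b = 2;
    pthread_create(&t1, NULL, worker, &a);
    pthread_create(&t2, NULL, worker, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_counter = %d\n", shared_counter);   /* 3: both threads updated shared data */
    return 0;
}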

31 Multi-Tasking Paradigm
Virtual memory makes multi-tasking easy, but a context switch can be expensive or require extra hardware (VIVT caches, VIPT caches, TLBs). Figure: threads 1-5 each run for an execution time quantum on a conventional single-threaded superscalar, leaving many functional-unit (FU1-FU4) issue slots unused.

32 Conventional Multithreading
Zero-overhead context switch through duplicated register contexts for the threads (e.g., registers 0:r0-0:r7, 1:r0-1:r7, 2:r0-2:r7, 3:r0-3:r7 in one register file, selected by a context pointer), with memory shared by all threads.

33 Cycle Interleaving MT Per-cycle, Per-thread instruction fetching
Examples: HEP, Horizon, Tera MTA, MIT M-machine. Interesting questions to consider: Does it need a sophisticated branch predictor, or any speculative execution at all? Can it get rid of branch prediction entirely? Does it need any out-of-order execution capability?

34 Multithreading
How can we guarantee there are no dependencies between instructions in a pipeline? One way is to interleave the execution of instructions from different program threads on the same pipeline.
Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipeline (F D X M W):
T1: LW r1, 0(r2)
T2: ADD r7, r1, r4
T3: XORI r5, r4, #12
T4: SW 0(r7), r5
T1: LW r5, 12(r1)
The prior instruction in a thread always completes write-back before the next instruction in the same thread reads the register file.

35 CDC 6600 Peripheral Processors (Cray, 1964)
First multithreaded hardware: 10 "virtual" I/O processors with fixed interleave on a simple pipeline. The pipeline has a 100 ns cycle time, so each virtual processor executes one instruction every 1000 ns. The objective was to cope with long I/O latencies.

36 Simple Multithreaded Pipeline
Figure: a simple multithreaded pipeline with per-thread PCs and general-purpose register files (GPRs), a thread-select mux, an instruction cache (I$), and a data cache (D$). The thread select has to be carried down the pipeline to ensure the correct state bits are read/written at each pipe stage. The machine appears to software (including the OS) as multiple, albeit slower, CPUs.

37 Thread Scheduling Policies
Fixed interleave (CDC 6600 PPUs, 1964): each of N threads executes one instruction every N cycles; if a thread is not ready to go in its slot, a pipeline bubble is inserted. Hardware-controlled thread scheduling (HEP, 1982): hardware keeps track of which threads are ready to go and picks the next thread to execute based on a hardware priority scheme.

38 Denelcor HEP (Burton Smith, 1982)
First commercial machine to use hardware threading in the main CPU: 120 threads per processor, 10 MHz clock rate, up to 8 processors. A precursor to the Tera MTA (Multithreaded Architecture).

39 Tera MTA (1990-97) Multi-Threaded Architecture
Up to 256 processors Up to 128 active threads per processor Flat, shared main memory No data cache Sustains one main memory access per cycle per processor

40 Key Architecture Details
Each MTA processor has 128 "streams", each of which is a hardware thread (with its own 32 registers and program counter) devoted to running a single thread of control. The processor executes instructions from the streams that are not blocked, in a fair round-robin fashion. A stream can issue an instruction only every 21 cycles (the length of the instruction pipeline), so at least 21 ready threads are required to keep a processor fully busy. The processor makes a context switch on each cycle, choosing the next instruction from one of the streams that is ready to execute.

41 Multithreading on One Processor
(Figure: instruction issue slots over time on one processor; slots belonging to unused streams remain empty.)

42 Coarse-Grain Multithreading
The Tera MTA was designed for supercomputing applications with large data sets and low locality: no data cache, and many parallel threads are needed to hide the large memory latency. Other applications are more cache-friendly: there are few pipeline bubbles when the cache mostly hits, so just a few threads are enough to hide occasional cache-miss latencies. Idea: swap threads on cache misses.

43 Superscalar Machine Efficiency
Figure: instruction issue slots (issue width 4) over time. A completely idle cycle is vertical waste; a partially filled cycle, i.e., IPC < 4, is horizontal waste.

44 Vertical Multithreading
A second thread is interleaved cycle-by-cycle with the first. What is the effect of cycle-by-cycle interleaving? It removes vertical waste, but leaves some horizontal waste (partially filled cycles, i.e., IPC < 4).

45 Chip Multiprocessing (CMP)
What is the effect of splitting into multiple processors? It reduces horizontal waste, leaves some vertical waste, and puts an upper limit on the peak throughput of each thread.

46 Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]
Interleave multiple threads into the multiple issue slots with no restrictions.

47 Simultaneous Multithreading (SMT) for OoO Superscalars
The techniques presented so far have all been "vertical" multithreading, where each pipeline stage works on one thread at a time. SMT uses the fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on the same clock cycle. This gives better utilization of machine resources.

48 Simultaneous Multithreading (SMT)
Intel's Hyper-Threading (2-way SMT); IBM Power7 (4/6/8 cores, 4-way SMT); IBM Power5/6 (2 cores, each 2-way SMT). Basic ideas: conventional MT + simultaneous issue + sharing of common resources. Figure: an SMT pipeline in which per-thread PCs and register renamers/register files feed a shared fetch unit, decode, and RS & ROB plus physical register file, which issue to shared execution units (ALU1, ALU2, FAdd 2 cycles, FMult 4 cycles, unpipelined FDiv 16 cycles, variable-latency load/store) backed by the I-cache and D-cache.

49 IBM Power 4
Single-threaded predecessor to the Power 5; 8 execution units in the out-of-order engine, each of which may issue an instruction each cycle.

50 Power 4 vs. Power 5
Compared with the Power 4, the Power 5 provides 2 fetch (PC), 2 initial decodes, and 2 commits (architected register sets) to support its two SMT threads.

51 Power 5 Data Flow ... Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to becoming a bottleneck.

52 Pentium-4 Hyperthreading (2002)
First commercial SMT design (2-way SMT); Hyper-Threading == SMT. The logical processors share nearly all resources of the physical processor: caches, execution units, branch predictors. The die area overhead of hyperthreading is about 5%. When one logical processor is stalled, the other can make progress. Hyperthreading was dropped on the OoO P6-based follow-ons to the Pentium 4 (Pentium M, Core Duo, Core 2 Duo), until revived with the Nehalem generation of machines in 2008. The Intel Atom (an in-order x86 core) has two-way vertical multithreading.

53 Multi-threading Paradigm
Figure: functional-unit (FU1-FU4) occupancy over execution time for five threads under different paradigms: conventional superscalar (single threaded), fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), chip multiprocessor (CMP or multicore), and simultaneous multithreading (SMT).

