PSU CS 106 Computing Fundamentals II Sample Architectures HM 4/14/2008

© Dr. Herbert G. Mayer
Agenda
- Single Accumulator Architecture
- General-Purpose Register Architecture
- Stack Machine Architecture
- Pipelined Architecture
- Vector Architecture
- Shared-Memory Multiprocessor Architecture
- Distributed-Memory Multiprocessor Architecture
- Systolic Architecture
- Superscalar Architecture
- VLIW Architecture

Single Accumulator Architecture
Single Accumulator (SAA) Architecture:
- Single register to hold operation results, conventionally called the accumulator
- Accumulator used as destination of arithmetic operations, and as (one) source
- Has central processing unit, memory unit, and connecting memory bus
- pc points to the next instruction (in memory) to be executed
- Sample: ENIAC
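
The accumulator's role as both source and destination can be sketched with a tiny interpreter. This is a minimal illustration only: the four opcodes (load, add, mul, store) and the dictionary used as memory are assumptions, not part of the slides.

```python
# Hypothetical single-accumulator machine: every arithmetic instruction
# uses the accumulator as one source and as the destination.
def run_accumulator(program, memory):
    acc = 0   # the single accumulator
    pc = 0    # program counter: index of the next instruction
    while pc < len(program):
        op, addr = program[pc]
        pc += 1
        if op == "load":
            acc = memory[addr]
        elif op == "add":
            acc = acc + memory[addr]
        elif op == "mul":
            acc = acc * memory[addr]
        elif op == "store":
            memory[addr] = acc
        else:
            raise ValueError(f"unknown opcode: {op}")
    return memory

# res = (a + b) * c
memory = {"a": 2, "b": 3, "c": 4, "res": 0}
program = [("load", "a"), ("add", "b"), ("mul", "c"), ("store", "res")]
run_accumulator(program, memory)
print(memory["res"])  # 20
```

Every instruction names at most one memory operand; the other operand and the result are implicitly the accumulator.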

General-Purpose Reg. Architecture
General-Purpose Register (GPR) Architecture
- Accumulates ALU results in more than one register; number of registers n typically 4, 8, 16, ... 64
- Allows register-to-register operations, which are fast
- Essentially a multi-register extension of the single accumulator (SAA) architecture
- Two-address architecture specifies one source operand plus a destination, which also serves as the second source
- Three-address architecture specifies two source operands plus destination
- Variations allow additional index registers, base registers, etc.
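
The difference between two-address and three-address instructions can be sketched as follows. The register names, the tuple encoding, and the two opcodes are illustrative assumptions.

```python
# Hypothetical register machine accepting both instruction forms.
def run_gpr(program, regs):
    ops = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}
    for op, *operands in program:
        if len(operands) == 3:
            # three-address form: destination, source 1, source 2
            dest, src1, src2 = operands
        else:
            # two-address form: destination doubles as the first source
            dest, src2 = operands
            src1 = dest
        regs[dest] = ops[op](regs[src1], regs[src2])
    return regs

regs = {"r0": 2, "r1": 3, "r2": 4, "r3": 0}
run_gpr([("add", "r3", "r0", "r1"),   # r3 = r0 + r1
         ("mul", "r3", "r2")],        # r3 = r3 * r2
        regs)
print(regs["r3"])  # 20
```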

Stack Machine Architecture
Stack Machine Architecture (SA)
- AKA zero-address architecture, as operations need no explicit operands
- Pure stack machine has no registers; hence performance would be poor, as all operations involve memory
- Of course there are no pure SAs; instead, implement the n top-of-stack elements as registers, acting as a cache
- Sample architectures: Burroughs B5000, HP 3000
- Implement impure stack operations that bypass tos operand addressing
- Example code sequence to compute res := a * ( 145 + b ) -- high-level source
    push a
    pushlit 145
    push b
    add
    mult
    pop res
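
The push/pushlit/add/mult/pop sequence on this slide multiplies a by the sum of 145 and b. A minimal interpreter sketch makes that visible; the tuple encoding of instructions and the sample values for a and b are assumptions.

```python
# Hypothetical zero-address stack machine: add and mult take no explicit
# operands; they consume the two top-of-stack elements.
def run_stack(program, memory):
    stack = []
    for instr in program:
        op = instr[0]
        if op == "push":        # push the value of a memory operand
            stack.append(memory[instr[1]])
        elif op == "pushlit":   # push a literal
            stack.append(instr[1])
        elif op == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "mult":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif op == "pop":       # pop top of stack into memory
            memory[instr[1]] = stack.pop()
        else:
            raise ValueError(f"unknown opcode: {op}")
    return memory

memory = {"a": 2, "b": 5, "res": 0}
program = [("push", "a"), ("pushlit", 145), ("push", "b"),
           ("add",), ("mult",), ("pop", "res")]
run_stack(program, memory)
print(memory["res"])  # 300, i.e. 2 * (145 + 5)
```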

Pipelined Architecture
Pipelined Architecture (PA)
- Arithmetic Logic Unit (ALU) split into separate, sequential subunits
- Each subunit can be initiated once per cycle
- Yet each subunit is implemented in HW just once
- Multiple subunits operate in parallel on different sub-ops, at different stages
- Ideally, all subunits require unit time (1 cycle)
- Ideally, all operations (add, fetch, store) take the same number of steps
- Non-unit time and differing numbers of cycles per operation cause different termination moments
- Operation is aborted in case of branch, exception, call, etc.
- Operation must stall in case of operand dependence; the stall is caused by the interlock

Pipelined Architecture, Cont’d
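
The benefit of the ideal pipeline, and the cost of interlock stalls, can be estimated with simple cycle counts. The model below is a sketch under stated assumptions: every stage takes one cycle, operations issue in order, and each stall cycle delays all later operations by one cycle. The function names are hypothetical.

```python
def sequential_cycles(num_ops, num_stages):
    """Cycles without pipelining: each operation passes through all stages alone."""
    return num_ops * num_stages

def pipeline_cycles(num_ops, num_stages, stalls=()):
    """Cycles on an ideal pipeline of unit-time stages.

    The first operation needs num_stages cycles; each further operation
    completes one cycle later. stalls lists extra interlock cycles
    (assumed model: each one delays every following operation).
    """
    if num_ops == 0:
        return 0
    return num_stages + (num_ops - 1) + sum(stalls)

print(sequential_cycles(100, 5))               # 500
print(pipeline_cycles(100, 5))                 # 104
print(pipeline_cycles(100, 5, stalls=[2, 1]))  # 107
```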

Vector Architecture (VA)
- Register implemented as HW array of identical registers, named vr0, vr1, etc.
- VA may also have scalar registers, named r0, r1, etc.
- Vector registers can load/store a block of contiguous data
  – Still in sequence, but overlapped; number of steps to complete load/store of a vector also depends on width of bus
- Vector registers can perform multiple operations of the same kind on blocks of operands
  – Still sequentially, but overlapped, and all operands are readily available
- Otherwise operation of VA is similar to GPR architecture

Vector Architecture, Cont’d
Sample operations:
    ldv    vr1, memi            -- loads e.g. 64 memory locations
    stv    vr2, memj            -- stores vr2 in 64 contiguous locs
    vadd   vr1, vr2, vr3        -- register-register vector addition
    cvaddf r0, vr1, vr2, vr3    -- has special, conditional meaning:
    -- sequential equivalent:
    for i = 0 to 63 do
        if bit i in r0 is 1 then
            vr1[i] = vr2[i] + vr3[i]
        end if
    end for
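
The sequential equivalent of cvaddf translates directly into runnable form. This sketch assumes a vector length of 64 (the slide's "e.g. 64") and assumes that bit 0 of r0 is the least-significant bit; Python lists stand in for vector registers.

```python
VLEN = 64  # assumed vector register length

def vadd(vr2, vr3):
    """Unconditional register-register vector addition."""
    return [x + y for x, y in zip(vr2, vr3)]

def cvaddf(r0, vr1, vr2, vr3):
    """Conditional vector add: element i of vr1 is replaced by
    vr2[i] + vr3[i] only where bit i of the scalar mask r0 is 1."""
    return [vr2[i] + vr3[i] if (r0 >> i) & 1 else vr1[i]
            for i in range(VLEN)]

vr1 = [0] * VLEN
vr2 = list(range(VLEN))
vr3 = [100] * VLEN
result = cvaddf(0b101, vr1, vr2, vr3)   # only elements 0 and 2 are updated
print(result[:4])  # [100, 0, 102, 0]
```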

Shared-Memory Multiprocessor
Shared Memory Architecture (SMA)
- Equal access to memory for all n processors, p_0 to p_(n-1); possible to have additional, local memories
- If multiple processors access shared memory simultaneously, only one will succeed
- Resolution of simultaneous accesses must be deterministic
- Von Neumann bottleneck becomes even tighter than for a conventional uniprocessor (UP) system
- If locality is great, there are about twice as many loads as stores, and there are many arithmetic operations per memory access, then resource utilization is good
- Typically # loads = ( 2 to 3 ) times the # stores
- Else some processors idle due to memory conflict
- Typical number of processors n=4, but n=8 and greater possible, with large 2nd level cache, even 3rd level

Shared-Memory Multiprocessor
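
How memory conflicts leave processors idle can be sketched with a toy arbiter. Everything here is an assumption for illustration: one shared-memory access is granted per cycle, and the lowest-numbered requesting processor always wins, which is one simple way to make conflict resolution deterministic.

```python
def simulate_bus(requests):
    """requests[p] is the number of shared-memory accesses processor p needs.

    Returns (total cycles, idle cycles per processor). Assumed policy:
    one grant per cycle, fixed priority by processor number.
    """
    remaining = list(requests)
    idle = [0] * len(requests)
    cycles = 0
    while any(remaining):
        winner = next(p for p, r in enumerate(remaining) if r > 0)
        for p, r in enumerate(remaining):
            if r > 0 and p != winner:
                idle[p] += 1      # wanted memory but lost arbitration
        remaining[winner] -= 1
        cycles += 1
    return cycles, idle

print(simulate_bus([2, 1, 1]))  # (4, [0, 2, 3])
```

With fixed priority the low-numbered processor never waits, while the others accumulate idle cycles; local memories and caches reduce the number of requests that reach the shared bus at all.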

Distributed-Memory Multiprocessor
Distributed Memory Architecture (DMA)
- All memories are private
- Hence each processor p_i always has access to its own memory Mem_i
- However, the collection of all memories is the program’s logical data space
- Thus, processors must access others’ memories
- Done via Message Passing or Virtual Shared Memory
- Messages must be routed and the route determined; route may be long
- Blocking when: message expected but hasn’t arrived yet
- Blocking when: message to be sent, but destination cannot receive
- Growing message buffer size increases illusion of asynchronicity of sending and receiving operations
- Key parameter: time and package overhead to send empty message
- Message may also be delayed because of network congestion

Distributed-Memory Multiprocessor
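
The two blocking cases can be sketched with threads standing in for processors and a bounded queue standing in for the message buffer. This is an analogy on one machine, not a real distributed system; the None end-of-stream marker and the buffer size of 2 are assumptions.

```python
import queue
import threading

def sender(channel, private_mem):
    """Processor p0: sends each word of its private memory as a message."""
    for value in private_mem:
        channel.put(value)   # blocks while the buffer is full
    channel.put(None)        # hypothetical end-of-stream marker

def receiver(channel, result):
    """Processor p1: sums the messages it receives."""
    total = 0
    while True:
        value = channel.get()  # blocks until a message has arrived
        if value is None:
            break
        total += value
    result.append(total)

channel = queue.Queue(maxsize=2)  # small buffer: sender stalls if receiver lags
result = []
p0 = threading.Thread(target=sender, args=(channel, [1, 2, 3, 4]))
p1 = threading.Thread(target=receiver, args=(channel, result))
p0.start()
p1.start()
p0.join()
p1.join()
print(result[0])  # 10
```

Raising maxsize lets the sender run further ahead of the receiver, which is the "illusion of asynchronicity" mentioned above.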

Systolic Array Multiprocessor
Systolic Array (SA) Architecture
- Each processor has private memory
- Network is defined by the Systolic Pathway (SP)
- Each node is connected via SP to some subset of other processors
- Node connectivity: determined by implemented/selected network topology
- Systolic pathway is a high-performance network; sending and receiving may be synchronized (blocking) or asynchronous (data received are buffered)
- Typical network topologies: ring, torus, hex grid, mesh, etc.
- Sample below is a ring; note that wrap-around along the x and y dimensions is not shown
- Processor can write to x or y gate; sends word off on x or y SP
- Processor can read from x or y gate; consumes word from x or y SP
- Buffered SA can write to gate, even if receiver cannot read
- Reading from gate when no message is available blocks
- Automatic code generation for non-buffered SA is hard; compiler must keep track of interprocessor synchronization
- Can view SP as an extension of memory with infinite capacity, but with sequential access

Systolic Array Multiprocessor

Systolic Array Multiprocessor
- Note that each pathway, x or y, may be bi-directional
- May have any number of pathways; nothing magic about 2, x and y
- Possible to have I/O capability with each node
- Next example shows a torus (without displaying the wrap-around pathways)
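
Writing to and reading from a gate can be sketched for a one-dimensional ring with a single buffered pathway. The lock-step simulation below is an assumption made for simplicity; the function name and the choice of computing a global sum are illustrative.

```python
from collections import deque

def ring_sum(values):
    """Each node starts with one private value. In every step a node reads
    one word from its gate (sent by its left neighbour), adds it to a
    private accumulator, and forwards the word to its right neighbour.
    After n - 1 steps every node holds the sum of all values."""
    n = len(values)
    acc = list(values)                      # private memory per node
    # pathway[i] buffers words travelling from node i to node (i + 1) % n
    pathway = [deque([v]) for v in values]  # each node first sends its own value
    for _ in range(n - 1):
        # every node consumes one word from the pathway of its left neighbour
        received = [pathway[(i - 1) % n].popleft() for i in range(n)]
        for i, word in enumerate(received):
            acc[i] += word
            pathway[i].append(word)         # write the word to the outgoing gate
    return acc

print(ring_sum([1, 2, 3, 4]))  # [10, 10, 10, 10]
```

The deque models a buffered pathway with sequential access: words come out in the order they were written, and nothing but the head can be read.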

Hybrid Multiprocessor - Superscalar
Superscalar (SSA) Architecture
- Is scalar architecture w.r.t. object code
- Is parallel architecture, with multiple copies of some hardware units
- Has multiple ALUs, possibly FP add (FPA), FP multiply (FPM), >1 integer units
- Arithmetic operations simultaneous with load and store operations
- Code sequence looks like sequence of instructions for scalar processor
- Object code can be custom-tailored by compiler
- Fetch enough instruction bytes to support longest possible object sequence
- Decoding is bottleneck for CISC, easy for RISC 32-bit units
- Sample of superscalar: i80860 has FPA, FPM, 2 integer ops, load, store with pre-post increment and decrement

Hybrid Multiprocessor – SSA & PA
Pipelined + Superscalar Architecture
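
How a scalar instruction sequence can be spread over multiple hardware units is sketched below. The greedy, in-order grouping policy is an assumption chosen for brevity and does not describe any particular processor; the unit names, register names, and tuple encoding are hypothetical.

```python
def issue_groups(instrs, units):
    """Group a scalar instruction sequence into sets issued in the same cycle.

    instrs: list of (unit, dest, sources); units: unit name -> number of copies.
    Assumed policy: walk the sequence in order and start a new group when
    the needed unit is fully used, or when an instruction reads or rewrites
    a register written earlier in the current group.
    """
    groups = []
    current, used, written = [], {}, set()
    for unit, dest, sources in instrs:
        unit_busy = used.get(unit, 0) >= units[unit]
        dependent = dest in written or any(s in written for s in sources)
        if current and (unit_busy or dependent):
            groups.append(current)
            current, used, written = [], {}, set()
        current.append((unit, dest, sources))
        used[unit] = used.get(unit, 0) + 1
        written.add(dest)
    if current:
        groups.append(current)
    return groups

units = {"fpa": 1, "fpm": 1, "int": 2}
instrs = [
    ("fpa", "f1", ("f2", "f3")),
    ("fpm", "f4", ("f5", "f6")),
    ("int", "r1", ("r2", "r3")),
    ("fpm", "f7", ("f1", "f4")),   # needs the single FPM again and earlier results
    ("int", "r4", ("r1", "r2")),
]
groups = issue_groups(instrs, units)
print([len(g) for g in groups])  # [3, 2]
```

The input is an ordinary scalar sequence; the parallelism is found at issue time, which is the sense in which the architecture stays scalar with respect to object code.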

Hybrid Multiprocessor - VLIW
VLIW Architecture (VLIW)
- Very Long Instruction Word
  – typically 128 bits or more
  – below 128: LIW
- Object code no longer purely scalar
- Some special-select opcodes designed to support parallel execution
- Compiler/programmer explicitly packs VLIW ops
- Other opcodes are still scalar, can coexist with VLIW instructions
- Scalar operation possible by placing no-ops into some VLIW fields
- Sample: Compute instruction of CMU Warp® and Intel iWarp®
- Data Dependence example: Result of FPA cannot be used as operand for FPM in the same VLIW instruction
- Thus, need to software-pipeline; not discussed in CS 106

Hybrid Multiprocessor
One single VLIW Instruction
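
Explicit packing, no-op filling, and the data dependence restriction can be sketched as a small word builder. The five-field layout, the opcode names, and the decision to reject a dependent word with an error are all assumptions for illustration; they do not describe the actual Warp or iWarp instruction format.

```python
SLOTS = ("fpa", "fpm", "int", "load", "store")  # hypothetical field layout
NOP = ("nop", None, ())                         # (opcode, dest, sources)

def make_word(**ops):
    """Pack operations into one VLIW word, one per named field.

    Unused fields are filled with no-ops. A word is rejected if one field
    reads a register that another field of the same word produces, since
    that result is not yet available within the same instruction.
    """
    unknown = set(ops) - set(SLOTS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    word = {slot: ops.get(slot, NOP) for slot in SLOTS}
    for slot, (_, _, sources) in word.items():
        for other, (_, dest, _) in word.items():
            if other != slot and dest is not None and dest in sources:
                raise ValueError(
                    f"{slot} reads {dest}, produced by {other} in the same word")
    return word

ok = make_word(fpa=("fadd", "f1", ("f2", "f3")),
               fpm=("fmul", "f4", ("f5", "f6")))
print(ok["int"])  # ('nop', None, ())

try:
    make_word(fpa=("fadd", "f1", ("f2", "f3")),
              fpm=("fmul", "f4", ("f1", "f5")))   # FPM uses the FPA result
except ValueError as err:
    print(err)  # fpm reads f1, produced by fpa in the same word
```

A compiler would resolve the rejected case by moving the multiply into a later word, which is the starting point for software pipelining.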

Hybrid Multiprocessor
EPIC Architecture (EA)
- Groups instructions into bundles
- Straighten out the branches by associating predicate with instructions
- Execute instructions in parallel, say the else-clause and the then-clause of an If Statement
- Decide at run time which of the predicates is true, and keep the results of just that part
- Use speculation to straighten branch tree
- Use rotating register file, AKA register windows
- Provides many registers, not just 64
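
Predication can be sketched by comparing a branching function with a branch-free version that evaluates both clauses and keeps only the result whose predicate is true. This is a software analogy only; the function names and the absolute-difference example are assumptions.

```python
def branchy_abs_diff(a, b):
    """Conventional form: a branch selects which clause executes."""
    if a > b:
        return a - b
    else:
        return b - a

def predicated_abs_diff(a, b):
    """Predicated form: both clauses execute, predicates select the result."""
    p_then = a > b          # predicate guarding the then-clause
    p_else = not p_then     # predicate guarding the else-clause
    then_result = a - b     # both clauses are computed, with no branch
    else_result = b - a
    # only the result whose predicate is true contributes
    return p_then * then_result + p_else * else_result

print(branchy_abs_diff(3, 7), predicated_abs_diff(3, 7))  # 4 4
```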