1 CS 201 Computer Systems Programming Chapter 4 “Computer Taxonomy” Herbert G. Mayer, PSU CS Status 6/30/2014
2 Syllabus
- Introduction
- Common Architecture Attributes
- General Limitations
- Data-Stream Instruction-Stream
- Generic Architecture Model
- Instruction Set Architecture (ISA)
- Iron Law of Performance
- Uniprocessor (UP) Architectures
- Multiprocessor (MP) Architectures
- Hybrid Architectures
- References
3 Introduction: Uniprocessors
- Single Accumulator Architectures, earliest in the 1940s; e.g. Atanasoff, Zuse, von Neumann
- General-Purpose Register Architectures (GPR)
- 2-Address Architecture, i.e. GPR with one operand implied, e.g. IBM 360
- 3-Address Architecture, i.e. GPR with all operands of an arithmetic operation explicit, e.g. VAX 11/70
- Stack Machines (e.g. B5000, B6000, HP3000)
- Pipelined Architecture, e.g. CDC 5000, Cyber 6000
- Vector Architecture, e.g. Amdahl 470/6, competing with IBM’s 360 in the 1970s; blurs the line to Multiprocessor
4 Introduction: Multiprocessors
- Shared Memory Architecture; e.g. Illiac IV, BSP
- Distributed Memory Architecture
- Systolic Architecture; see Intel® iWarp and CMU’s Warp architecture
- Data Flow Machine; see Jack Dennis’ work at MIT
5 Introduction: Hybrid Architectures
- Superscalar Architecture; see Intel 80860, AKA i860
- VLIW Architecture; see the Multiflow computer, or a systolic array architecture like Warp at CMU or iWarp at Intel in the 1990s
- Pipelined Architecture; debatable whether it is a hybrid architecture
- EPIC Architecture; see the HP and Intel® Itanium® architecture
6 Common Architecture Attributes
- Main memory (main store), external to the processor
- Program instructions stored in main memory
- Data also stored in main memory; typical for von Neumann architecture
- Data distributed over static memory, stack, heap, reserved OS space, free space, IO space
- Instruction pointer (AKA instruction counter, program counter, pc), plus other special registers
- Von Neumann memory bottleneck: everything travels on the same, single bus
7 Common Architecture Attributes
- Accumulator (register, 1 or many) holds the result of an arithmetic-logical operation
- Memory Controller handles memory access requests from the processor; moves bits to/from memory; is part of the “chipset”
- Current trend is to move some of the memory controller or IO controller onto the CPU chip; caveat: that does not mean the chipset IS part of the CPU!
- Logical processor unit includes: FP unit, integer unit, control unit, register file, load-store unit, pathways
- Physical processor unit includes: heat sensors, frequency control, voltage regulator, and more
8 General Limitations
- Compute-Bound: application in which the vast majority of execution time is spent fetching and executing instructions; the time to load and store data to/from memory is a small percentage of the total
- Memory-Bound: application in which the majority of execution time is spent loading and storing data in memory; the time executing instructions is small compared with the time to access memory
- IO-Bound: application in which the majority of execution time is spent accessing secondary storage; the time executing instructions, and even the time accessing memory, is small compared with the time to access secondary storage
- Backup-Bound (semi-serious only): like IO-Bound, but the backup storage medium can be even slower than typical secondary storage devices
9 Data-Stream Instruction-Stream
Classification developed by Michael J. Flynn:
1. Single-Instruction, Single-Data Stream (SISD) Architecture: PDP
2. Single-Instruction, Multiple-Data Stream (SIMD) Architecture: Array Processors, Solomon, Illiac IV, BSP, TMC
3. Multiple-Instruction, Single-Data Stream (MISD) Architecture: Pipelined architecture
4. Multiple-Instruction, Multiple-Data Stream (MIMD) Architecture: true multiprocessor
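The four classes follow mechanically from two counts: how many instruction streams and how many data streams are active. A minimal sketch of that rule follows; the function name `flynn_class` and its parameters are made up for illustration and do not come from the slides.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn class name for the given stream counts.

    Hypothetical helper: 'S' stands for a single stream, 'M' for multiple.
    """
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("need at least one instruction stream and one data stream")
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"


# One instruction stream driving 64 data lanes, as in an array processor.
array_processor = flynn_class(1, 64)   # "SIMD"
# Several independent processors, each with its own data.
multiprocessor = flynn_class(4, 4)     # "MIMD"
```

The sketch only captures the naming rule; deciding what counts as a separate stream on a real machine (for example, whether pipeline stages form multiple instruction streams) remains a judgment call.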
10 Generic Architecture Model
11 Instruction Set Architecture (ISA)
- ISA is the boundary between Software and Hardware
- Specifies the logical machine visible to programmer and compiler
- Is the functional specification for processor designers
- That boundary is sometimes a very low-level piece of system SW that handles exceptions, interrupts, and HW-specific services that could fall into the domain of the OS
12 Instruction Set Architecture (ISA)
Specified by the ISA are:
- Operations: what to perform and in which order
- Active, temporary operand storage in the CPU: accumulator, stack, registers; note that a stack can be word-sized, even bit-sized (e.g. extreme design of the successor for NCR’s Century architecture of the 1970s)
- Number of operands per instruction; some implicit, others explicit
- Operand location: where and how to locate/specify the operands: register, literal, data in memory
- Type and size of operands: bit, byte, word, double-word, ...
- Instruction encoding in binary
- Data types: int, float, double, decimal, char, bit
14 Iron Law of Performance
- Clock rate doesn’t count! Bus width doesn’t count. Number of registers and operations executed in parallel doesn’t count!
- What counts is how long it takes for my computational task to complete. That time is of the essence in computing!
- If a MIPS-based solution running at 1 GHz completes a program X in 2 minutes, while an Intel Pentium® 4-based solution running at 3 GHz completes that same program X in 2.5 minutes, programmers are more interested in the MIPS solution
- If a solution on an Intel CPU can be expressed in an object program of size Y bytes, but on an IBM architecture it takes 1.1 * Y bytes, the Intel solution is generally more attractive
- Meaning of this:
  - Wall-clock time (Time) is the time I have to wait for completion
  - Program Size is the overall complexity of the computational task
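The iron law is commonly written as Time = (instructions per program) x (cycles per instruction) x (seconds per cycle). The sketch below plugs numbers into that product to show how a 1 GHz machine can beat a 3 GHz machine. The instruction counts and CPI values are invented so that the results match the 2 minute and 2.5 minute figures above; they are not measurements of any real MIPS or Pentium 4 system.

```python
def execution_time(instructions: float, cpi: float, clock_hz: float) -> float:
    """Wall-clock seconds = instruction count * cycles per instruction / clock rate."""
    return instructions * cpi / clock_hz


# Assumed, purely illustrative workload characteristics.
mips_seconds = execution_time(instructions=60e9, cpi=2.0, clock_hz=1e9)
pentium4_seconds = execution_time(instructions=75e9, cpi=6.0, clock_hz=3e9)

# The slower clock wins because it needs fewer total cycles for the task.
faster = "MIPS" if mips_seconds < pentium4_seconds else "Pentium 4"
```

With these assumed inputs the first call yields 120 seconds and the second 150 seconds, which is why clock rate alone says little about completion time.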
16 Different Classes of Architectures
17 Uniprocessor (UP) Architectures
Single Accumulator Architecture (SAA)
- Single register to hold operation results, conventionally called the accumulator
- Accumulator used as destination of arithmetic operations, and as (one) source
- SAA has central processing unit, memory unit, connecting memory bus; typical for von Neumann architecture
- The pc points to the next instruction in memory to be executed
- Sample: ENIAC
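To make the accumulator's double role concrete, here is a small interpreter sketch for an imaginary single-accumulator machine. The opcode names (`load`, `add`, `mul`, `store`) and the dictionary-based memory are assumptions chosen for readability; they do not model ENIAC or any other real instruction set.

```python
def run_accumulator(program, memory):
    """Execute (opcode, address) pairs on a one-register machine.

    The accumulator is the implied destination of every arithmetic
    operation and also its implied first source operand.
    """
    acc = 0
    pc = 0  # program counter: index of the next instruction
    while pc < len(program):
        op, addr = program[pc]
        pc += 1
        if op == "load":
            acc = memory[addr]
        elif op == "add":
            acc = acc + memory[addr]
        elif op == "mul":
            acc = acc * memory[addr]
        elif op == "store":
            memory[addr] = acc
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return memory


# c := a + b, using made-up values for a and b.
saa_memory = run_accumulator(
    [("load", "a"), ("add", "b"), ("store", "c")],
    {"a": 3, "b": 4, "c": 0},
)
```

Every instruction names at most one memory operand, which is why such machines are also called one-address architectures.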
18 Uniprocessor (UP) Architectures
General-Purpose Register (GPR) Architecture
- Accumulates ALU results in n registers, typically 4, 8, 16, 64
- Allows register-to-register operations, fast!
- GPR is essentially a multi-register extension of SAA
- Two-address architecture specifies one source operand explicitly, another implicitly, plus one destination
- Three-address architecture specifies two source operands explicitly, plus an explicit destination
- Variations allow additional index registers, base registers, multiple index registers, etc.
19 Uniprocessor (UP) Architectures
Stack Machine Architecture (SMA)
- AKA zero-address architecture, since arithmetic operations require no explicit operand, hence no operand addresses; all are implied to be on the stack, except for push and pop
- Wake-up call to students: What is the equivalent of push/pop on GPR?
- Pure Stack Machine (SMA) has no registers
- Hence performance is inherently poor, as all operations involve memory on a stack machine
- However, one can design an SMA that implements the n top-of-stack elements as registers, i.e. as a Stack Cache: n = 4, 8, ...
- Sample architectures: Burroughs B5000, HP 3000
- Implement impure stack operations that bypass top-of-stack (tos) operand addressing
- Sample code sequence to compute:
20 Uniprocessor (UP) Architectures
Stack Machine Architecture (SMA)
    res := a * ( <literal> + b )   -- operand sizes are implied!
    push a             -- destination implied: stack
    pushlit <literal>  -- also destination implied
    push b             -- ditto
    add                -- 2 sources, and destination implied
    mult               -- 2 sources, and destination implied
    pop res            -- source implied: stack
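The sequence above can be traced with a tiny stack-machine interpreter. This is a sketch under stated assumptions: the opcode spellings follow the slide, but the tuple encoding, the dictionary memory, and the literal value 10 are chosen here for illustration because the slide's literal did not survive.

```python
def run_stack_machine(program, memory):
    """Execute stack instructions; arithmetic takes both sources from the stack."""
    stack = []
    for op, *args in program:
        if op == "push":        # push the value of a named memory cell
            stack.append(memory[args[0]])
        elif op == "pushlit":   # push a literal value
            stack.append(args[0])
        elif op == "add":       # zero-address: operands and result are implied
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "mult":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif op == "pop":       # store the top of stack into a named memory cell
            memory[args[0]] = stack.pop()
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return memory


# res := a * (10 + b) with made-up values a = 3, b = 5 and assumed literal 10.
sma_memory = run_stack_machine(
    [("push", "a"), ("pushlit", 10), ("push", "b"), ("add",), ("mult",), ("pop", "res")],
    {"a": 3, "b": 5, "res": 0},
)
```

Only `push`, `pushlit`, and `pop` carry an operand; `add` and `mult` carry none, which is the zero-address property described on the previous slide.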
21 Uniprocessor (UP) Architectures
Pipelined Architecture (PA)
- Arithmetic Logic Unit, ALU, is split into separate, sequentially connected units in PA
- Each unit is referred to as a stage; more precisely, the time at which the action is done is the stage
- Each of these stages/units can be initiated once per cycle
- Yet each subunit is implemented in HW just once
- Multiple subunits operate in parallel on different sub-ops, each executing a different stage; each stage is part of one instruction execution, with many stages running in parallel
- Non-unit time, i.e. a differing number of cycles per operation, causes different termination times
- Operations abort in an intermediate stage if some later instruction changes the flow of control; e.g. due to a branch, exception, return, conditional branch, call
23 Uniprocessor (UP) Architectures
Pipelined Architecture (PA)
- Operation must stall in case of data or control dependence: stall, AKA interlock
- Ideally each instruction can be partitioned into the same number of stages, i.e. sub-operations
- Operations to be pipelined can sometimes be evenly partitioned into equal-length sub-operations
- That equal-length time quantum might as well be a single sub-clock
- In practice this is hard or impossible for the architect to achieve; compare for example integer add and floating point divide!
24 Uniprocessor (UP) Architectures
Pipelined Architecture (PA)
- Ideally all operations have independent operands, i.e. an operand being computed is not needed as a source by the next few operations
- If it is needed, and often it is, this causes a dependence, which causes a stall:
  - read after write (RAW)
  - write after read (WAR)
  - write after write, with use in between (WAW)
- Also, ideally, all instructions just happen to be arranged sequentially, one after another
- In reality, there are branches, calls, returns, etc.
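The three dependence kinds can be detected by comparing the registers two instructions read and write. The following sketch assumes a simplified instruction representation, a `(destination, sources)` tuple, which is an illustration device and not any real ISA encoding. It also ignores the "use in between" qualifier for WAW and reports any pair of writes to the same register.

```python
def hazards(first, second):
    """Return the dependence kinds between an earlier and a later instruction.

    Each instruction is a hypothetical (dest, sources) pair, where dest may be
    None for instructions without a destination.
    """
    dest1, sources1 = first
    dest2, sources2 = second
    found = set()
    if dest1 is not None and dest1 in sources2:
        found.add("RAW")   # later instruction reads what the earlier one writes
    if dest2 is not None and dest2 in sources1:
        found.add("WAR")   # later instruction overwrites a source of the earlier one
    if dest1 is not None and dest1 == dest2:
        found.add("WAW")   # both write the same destination
    return found


# r1 := r2 + r3 followed by r4 := r1 * r5: the second must wait for r1.
raw_example = hazards(("r1", ("r2", "r3")), ("r4", ("r1", "r5")))
```

A pipeline with interlocks would stall the later instruction whenever this set is non-empty for instructions that overlap in time; a compiler can use the same test to reorder independent instructions.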
25 Uniprocessor (UP) Architectures
Simplified Pipelined Resource Diagram
- if: fetch an instruction
- de: decode the instruction
- op1: fetch or generate the first operand, if any
- op2: fetch or generate the second operand, if any
- exec: execute that stage of the overall operation
- wb: write result back to destination, if any; e.g. noop has no destination; halt has no destination
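A resource diagram for these six stages can be generated mechanically when the pipeline is ideal. The sketch below assumes one instruction enters per cycle and nothing ever stalls; the function name `pipeline_diagram` and the "." filler for idle cycles are choices made here, not part of the slides.

```python
STAGES = ["if", "de", "op1", "op2", "exec", "wb"]


def pipeline_diagram(num_instructions):
    """Return one row per instruction; column c holds the stage active in cycle c.

    Assumes an ideal pipeline: a new instruction starts every cycle, no stalls.
    """
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = ["."] * total_cycles
        for offset, stage in enumerate(STAGES):
            row[i + offset] = stage   # instruction i is delayed by i cycles
        rows.append(row)
    return rows


if __name__ == "__main__":
    for number, row in enumerate(pipeline_diagram(4), start=1):
        print(f"i{number}: " + " ".join(f"{cell:>4}" for cell in row))
```

Under these assumptions n instructions finish in n + 5 cycles rather than 6n, which is the steady-state benefit of pipelining; any stall or branch inserts extra idle columns.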
26 Uniprocessor (UP) Architectures
Superscalar Architecture; more detail also shown at: “Hybrid Architecture”
- Identical to regular uniprocessor architecture
- But some arithmetic or logical units are replicated
- E.g. may have multiple floating point (FP) multipliers
- Or FP multiplier and FP adder may work at the same time
- The key is: on a superscalar architecture, sometimes more than one instruction can execute at one time! Provided that there is no data dependence!
- First superscalar machines included CDC 6600, Intel i960CA, and AMD series
- Object code can look identical to code for a strict uniprocessor, yet the HW fetches more than just the next instruction, and performs data dependence analysis
27 Uniprocessor (UP) Architectures
Vector Architecture (VA)
- Register implemented as HW array of identical registers, named vr_i
- VA may also have scalar registers, named r0, r1, etc.
- Scalar register can also be the first of the vector registers
- Vector registers can load/store a block of contiguous data
- Still in sequence, but overlapped; the number of steps to complete the load/store of a vector also depends on bus width
- Vector machine can perform multiple operations of the same kind on whole contiguous blocks of operands
- Still in sequence, but overlapped, and all operands are readily available
- Otherwise operates like GPR architecture, but on vector operands; if vector size is 1, then VA is identical to UP
29 Uniprocessor (UP) Architectures
Sample Vector Architecture operation:
    ldv    vr1, mem i          -- loads 64 memory locs from [mem+i=0..63]
    stv    vr2, mem j          -- stores vr2 in 64 contiguous locs
    vadd   vr1, vr2, vr3       -- register-register vector add
    cvaddf r0, vr1, vr2, vr3   -- has conditional meaning:
    -- sequential equivalent:
    for i = 0 to 63 do
        if bit i in r0 is 1 then
            vr1[i] = vr2[i] + vr3[i]   // e.g. cvadd r0, r1, r2, r3
        else
            -- do not move corresponding bits
        end if
    end for
    -- parallel syntax equivalent:
    forall i = 0 to 63 doparallel
        if bit i in r0 is 1 then
            vr1[i] = vr2[i] + vr3[i]
        end if
    end parallel for
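The sequential equivalent of the conditional vector add translates directly into runnable code. This sketch models vector registers as Python lists of length 64 and the scalar mask register r0 as an integer; those representations and the function name `cvadd` are assumptions made for illustration.

```python
VECTOR_LENGTH = 64


def cvadd(mask, vr1, vr2, vr3):
    """Masked vector add: where bit i of mask is 1, result[i] = vr2[i] + vr3[i].

    Elements whose mask bit is 0 keep the old value from vr1.
    """
    result = list(vr1)
    for i in range(VECTOR_LENGTH):
        if (mask >> i) & 1:
            result[i] = vr2[i] + vr3[i]
    return result


# Made-up operands: only lanes 0 and 2 are enabled by the mask 0b101.
old_vr1 = [0] * VECTOR_LENGTH
vr2 = list(range(VECTOR_LENGTH))
vr3 = [100] * VECTOR_LENGTH
new_vr1 = cvadd(0b101, old_vr1, vr2, vr3)
```

On real vector hardware the 64 lane operations would overlap rather than run as a software loop; the loop here only specifies the result.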
30 Multiprocessor (MP) Architectures
Shared Memory Architecture (SMA)
- Equal access to memory for all n processors, p_0 to p_(n-1)
- Only one will succeed in accessing shared memory when there are multiple, quasi-simultaneous accesses
- Simultaneous memory access must be deterministic; needs an arbiter to ensure determinism
- Von Neumann bottleneck is tighter than in a conventional UP system
- Generally there are twice as many loads as there are stores in typical object code
- Occasionally, some processors are idle due to memory conflict
- Typical number of processors n=4, but n=8 and greater possible, with large 2nd-level cache, even larger 3rd-level cache
- Only limited commercial success and acceptance; programming burden frequently on the programmer
- Morphing in the 2000s into multi-core and hyper-threaded architectures, where the programming burden is on the multi-threading OS or the programmer
32 Multiprocessor (MP) Architectures
Distributed Memory Architecture (DMA)
- Processors have private memories, AKA local memories
- Yet the programmer has to see a single, logical memory space, regardless of local distribution
- Each processor p_i always has access to its own memory Mem_i
- The collection of all memories Mem_i, i = 0..n-1, is the logical data space
- Thus, processors must access others’ memories
- Done via Message Passing or Virtual Shared Memory
- Messages must be routed, and the route must be determined
- Route may be long, i.e. require multiple intermediate nodes
- Blocking when: message expected but hasn’t arrived yet
- Blocking when: destination cannot receive
- Growing message buffer size increases the illusion of asynchronicity of sending and receiving operations
- Key parameter: time for 1 hop, and package overhead to send an empty message
- Message may also be delayed because of network congestion
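The two blocking cases and the role of the buffer size can be imitated on one machine with threads and a bounded queue. This is only an analogy: the two threads stand in for two processors with private memories, and Python's standard `queue.Queue` stands in for the message channel. The function name and parameters are invented for the sketch.

```python
import queue
import threading


def message_passing_demo(buffer_size=2, message_count=5):
    """Send message_count integers through a bounded channel and collect them."""
    channel = queue.Queue(maxsize=buffer_size)
    received = []

    def receiver():
        for _ in range(message_count):
            # get() blocks while the expected message has not arrived yet.
            received.append(channel.get())

    worker = threading.Thread(target=receiver)
    worker.start()
    for message in range(message_count):
        # put() blocks while the buffer is full, i.e. the destination cannot receive.
        channel.put(message)
    worker.join()
    return received


demo_messages = message_passing_demo()
```

A larger `buffer_size` lets the sender run further ahead of the receiver before it blocks, which is the "illusion of asynchronicity" mentioned above. Routing, hop latency, and congestion are not modeled.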
34 Multiprocessor (MP) Architectures
Systolic Array Architecture (SAA)
- Very few designed: CMU and Intel for (then) ARPA
- Each processor has private memory
- Network is pre-defined by the Systolic Pathway (SP)
- Each node is pre-connected via SP to some subset of other processors
- Node connectivity: determined by network topology
- Systolic pathway is a high-performance network; sending and receiving may be synchronized (blocking) or asynchronous (data received are buffered)
- Typical network topologies: line, ring, torus, hex grid, mesh, etc.
- Sample below is a ring; wrap-around along x and y dimensions not shown
- Processor can write to x or y gate; sends word off on x or y SP
- Processor can read from x or y gate; consumes word from x or y SP
- Buffered SA can write to gate, even if receiver cannot read
- Reading from gate when no message is available blocks
- Automatic code generation for non-buffered SA is hard; compiler must keep track of interprocessor synchronization
- Can view SP as an extension of memory with infinite capacity, but with sequential access
35 Multiprocessor (MP) Architectures Systolic Array Architecture (SAA)
36 Multiprocessor (MP) Architectures
Systolic Array Architecture (SAA)
- Note that each pathway, x or y, may be bi-directional
- May have any number of pathways, nothing magic about 2, x and y; could be 3 or more
- Possible to have I/O capability with each node
- Typical application: large polynomials of the form: $y = k_0 + k_1 x^1 + k_2 x^2 + \dots + k_{n-1} x^{n-1} = \sum_i k_i x^i$
- Next example shows a torus without displaying the wrap-around pathways across both dimensions
37 Multiprocessor (MP) Architectures Systolic Array Architecture (SAA)
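Polynomial evaluation maps naturally onto a line of systolic cells when it is rewritten in Horner form. The sketch below simulates such a line in software: each cell holds one coefficient, and on every tick it takes the pair (x, partial sum) from its left neighbor, computes partial * x + coefficient, and passes the pair on. The cell layout, the tick loop, and all names are assumptions made for illustration; this is not the Warp or iWarp programming model.

```python
def systolic_polynomial(coeffs, xs):
    """Evaluate k0 + k1*x + ... + k(n-1)*x^(n-1) for a stream of x values.

    coeffs is [k0, ..., k(n-1)]; cell j holds coefficient k(n-1-j).
    """
    if not coeffs:
        raise ValueError("need at least one coefficient")
    n = len(coeffs)
    cell_coeff = list(reversed(coeffs))
    latches = [None] * n          # output latch of each cell: (x, partial) or None
    stream = list(xs)
    results = []
    for _ in range(len(xs) + n):  # enough ticks to drain the pipeline
        if latches[-1] is not None:
            results.append(latches[-1][1])   # value leaving the last cell
        # Update right to left so each cell reads its neighbor's previous output.
        for j in range(n - 1, 0, -1):
            left = latches[j - 1]
            if left is None:
                latches[j] = None
            else:
                x, partial = left
                latches[j] = (x, partial * x + cell_coeff[j])
        # The first cell starts Horner's rule with the highest coefficient.
        latches[0] = (stream.pop(0), cell_coeff[0]) if stream else None
    return results


# 1 + 2x + 3x^2 evaluated at x = 0, 1, 2 (values chosen for the example).
poly_values = systolic_polynomial([1, 2, 3], [0, 1, 2])
```

After the pipeline fills, one result leaves the array per tick even though each individual x needs n steps, which is the throughput argument for systolic designs.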
38 Hybrid Architectures
Superscalar Architecture (SA)
- Replicates (duplicates) some operations in HW
- Seems like a scalar architecture w.r.t. object code; can compute some operations of UP in parallel, e.g. fadd and fmult
- Is almost a parallel architecture, if it has multiple copies of some hardware units, say two fadd units
- Is not an MP architecture: ALU is not replicated
- Has multiple parts of an ALU, possibly multiple FPA units, or FPM units, and/or integer units
- Arithmetic operations simultaneous with load and store operations; note data dependence!
- Instruction fetch is speculative, since the number of parallel operations is unknown; rule: fetch too much! But fetch no more than the longest possible superscalar pattern
39 Hybrid Architectures
Superscalar Architecture (SA)
- Code sequence looks like a sequence of instructions for a scalar processor
- Example: ® code executed on Pentium® processors
- More famous and successful example: ® processor
- Object code can be custom-tailored by the compiler; i.e. the compiler can have a superscalar target processor in mind and bias code emission, knowing that some code sequences are better suited for superscalar execution
- Fetch enough instruction bytes to support the longest possible object sequence
- Decoding is the bottleneck for CISC, way easier for RISC 32-bit units
- Sample of superscalar: i80860 could run in parallel one FPA, one FPM, two integer ops, and a load or store in ++ or --
41 Hybrid Architectures
Very Long Instruction Word Architecture (VLIW)
- Very Long Instruction Word, typically 128 bits or more
- VLIW machine also has scalar operations
- VLIW code is no longer scalar, but explicitly parallel
- Limitations like in superscalar:
  - VLIW is not a general MP architecture
  - subinstructions do not have concurrent memory access
  - dependences must be resolved before code emission
- But the VLIW opcode is designed to execute in parallel
- VLIW suboperations can be defined as no-op, thus just the other suboperations run in parallel
- Compiler/programmer explicitly packs parallelizable operations into a VLIW instruction
- Just like horizontal microcode compaction
42 Hybrid Architectures
VLIW Sample: Compute instruction of CMU Warp® and Intel® iWarp®
- Could be a 1-bit (or few-bit) opcode for the compute instruction, plus sub-opcodes for subinstructions
- Data dependence example: result of FPA cannot be used as operand for FPM in the same VLIW instruction
- But provided proper SW pipelining (not covered in CS 201), both subinstructions may refer to the same FP register
- Result of int1 cannot be used as operand for int2, etc.
- With SW pipelining both subinstructions may refer to the same int register
- Thus, need to software-pipeline
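The packing rule, "no subinstruction may consume a result produced in the same VLIW word", can be sketched as a simple in-order packer. The slot names below echo the FPA, FPM, int1, and int2 units mentioned above, but the four-slot word format, the `(unit, dest, sources)` operation tuples, and the greedy strategy are assumptions for illustration only. Real compilers reorder operations and software-pipeline loops, which this sketch does not attempt.

```python
SLOTS = ("fpa", "fpm", "int1", "int2")   # hypothetical word layout


def free_slot(unit, word):
    """Return the first unoccupied slot in word that can run unit, else None."""
    for slot in SLOTS:
        if slot.startswith(unit) and slot not in word:
            return slot
    return None


def pack_vliw(ops):
    """Greedily pack (unit, dest, sources) operations, in order, into VLIW words.

    unit is one of "fpa", "fpm", "int". A new word is started when the needed
    slot is taken or when the operation reads or rewrites a register that is
    written earlier in the current word. Empty slots are filled with "nop".
    """
    words, current, written = [], {}, set()
    for unit, dest, sources in ops:
        if free_slot(unit, {}) is None:
            raise ValueError(f"no slot can execute unit {unit!r}")
        conflict = dest in written or any(src in written for src in sources)
        slot = free_slot(unit, current)
        if conflict or slot is None:
            words.append(current)
            current, written = {}, set()
            slot = free_slot(unit, current)
        current[slot] = (unit, dest, sources)
        written.add(dest)
    if current:
        words.append(current)
    return [{slot: word.get(slot, "nop") for slot in SLOTS} for word in words]


# The fpm operation needs f1 from the fpa operation, so it must go into a later word.
example_ops = [
    ("fpa", "f1", ("f2", "f3")),
    ("fpm", "f4", ("f1", "f5")),
    ("int", "r1", ("r2",)),
    ("int", "r3", ("r4",)),
]
example_words = pack_vliw(example_ops)
```

For this input the packer emits two words: the first carries only the FPA operation with three no-ops, the second carries the FPM operation and both integer operations. A smarter scheduler could hoist the integer operations into the first word, which is the kind of compaction the slides compare to horizontal microcode.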
43 Hybrid Architectures
Itanium EPIC Architecture
- Explicitly Parallel Instruction Computing
- Group instructions into bundles
- Straighten out the branches by associating a predicate with instructions; avoids the branch and executes speculatively
- Execute instructions in parallel, say the else clause and the then clause of an If Statement
- Decide at run time which of the predicates is true, and (post) complete just that path from multiple choices; discard the others
- Use speculation to straighten the branch tree
- Use rotating register file
- Has many registers, not just 64 GPRs
44 Hybrid Architectures
Itanium
- Groups and bundles lump multiple compute steps into one that can be run in parallel
- Parallel comparisons allow fast decisions
- Predication associates a condition (the predicate) with 2 simultaneously executed instruction sequences, only 1 of which will be posted
- Speculation fetches operands without knowing for sure whether this results in use; a branch may invalidate the early fetch
- Branch elimination straightens out code with jumps
- Branch prediction
- Large register file
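Predication can be illustrated by rewriting an if statement so that both clauses are computed and the predicate only selects which result is posted. The sketch below is a software analogy in Python, not Itanium code; the function names and the absolute-difference example are chosen here for illustration.

```python
def branchy_abs_diff(a, b):
    """Conventional version: control flow decides which clause runs."""
    if a > b:
        return a - b
    else:
        return b - a


def predicated_abs_diff(a, b):
    """If-converted version: both clauses run, the predicate picks one result."""
    p = int(a > b)        # the compare sets the predicate (1 or 0)
    then_value = a - b    # computed regardless of p
    else_value = b - a    # computed regardless of p
    # Post only the result whose predicate is true; the other is discarded.
    return p * then_value + (1 - p) * else_value
```

The predicated version has no jump in its body, which is the "straightening" of branches described above; the cost is that work for the discarded clause is still performed.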
45 Hybrid Architectures
Itanium
- Numerous branch registers; speeds up execution by having some branch destinations in registers; fast to load into the ip register
- Multiple CFM registers, Current Frame Marker regs; avoid slowness due to memory access
- See separate lecture note
46 References VLIW Architecture: ACM reference to Multiflow computer architecture: