CS 201 Computer Systems Programming Chapter 4 “Computer Taxonomy”


CS 201 Computer Systems Programming Chapter 4 “Computer Taxonomy” Herbert G. Mayer, PSU CS Status 10/15/2013

Syllabus
- Introduction
- Common Architecture Attributes
- General Limitations
- Data-Stream Instruction-Stream
- Generic Architecture Model
- Instruction Set Architecture (ISA)
- Iron Law of Performance
- Uniprocessor (UP) Architectures
- Multiprocessor (MP) Architectures
- Hybrid Architectures
- References

Introduction: Uniprocessors
- Single Accumulator Architectures, earliest in the 1940s; e.g. Atanasoff, Zuse, von Neumann
- General-Purpose Register (GPR) Architectures
  - 2-Address Architecture, i.e. GPR with one operand implied, e.g. IBM 360
  - 3-Address Architecture, i.e. GPR with all operands of an arithmetic operation explicit, e.g. VAX-11/780
- Stack Machines (e.g. B5000, B6000, HP3000)
- Pipelined Architecture, e.g. CDC 5000, Cyber 6000
- Vector Architecture, e.g. Amdahl 470/6, competing with IBM’s 360 in the 1970s; blurs the line to Multiprocessor

Introduction: Multiprocessors
- Shared Memory Architecture; e.g. Illiac IV, BSP
- Distributed Memory Architecture
- Systolic Architecture; see Intel® iWarp and CMU’s Warp architecture
- Data Flow Machine; see Jack Dennis’ work at MIT

Introduction: Hybrid Architectures
- Superscalar Architecture; see Intel 80860, AKA i860
- VLIW Architecture; see the Multiflow computer, or a systolic array architecture like Warp at CMU or iWarp at Intel in the 1990s
- Pipelined Architecture; debatable whether it is a hybrid architecture
- EPIC Architecture; see HP and Intel® Itanium® architecture

Common Architecture Attributes
- Main memory (main store), external to the processor
- Program instructions stored in main memory
- Data also stored in memory; typical for von Neumann architecture
- Data distributed over static memory, stack, heap, reserved OS space, free space, IO space
- Instruction pointer (AKA instruction counter, program counter, pc), plus other special registers
- Von Neumann memory bottleneck: everything travels on the same, single bus

Common Architecture Attributes
- Accumulator (register, 1 or many) holds result of arithmetic-logical operation
- Memory Controller handles memory access requests from processor; moves bits to/from memory; is part of “chipset”
- Current trend is to move some of the memory controller or IO controller onto the CPU chip; caveat: that does not mean the chipset IS part of the CPU!
- Logical processor unit includes: FP unit, integer unit, control unit, register file, load-store unit, pathways
- Physical processor unit includes: heat sensors, frequency control, voltage regulator, and more

General Limitations
- Compute-Bound: type of application in which the vast majority of execution time is spent fetching and executing instructions; time to load and store data in/from memory is a small % of the overall time
- Memory-Bound: application in which the majority of execution time is spent loading and storing data in memory; time executing instructions is a small % vs. time to access memory
- IO-Bound: application in which the majority of execution time is spent accessing secondary storage; time executing instructions, even the time accessing memory, is a small % vs. time to access secondary storage
- Backup-Bound: like IO-Bound, but the backup storage medium can be even slower than typical secondary storage devices

Data-Stream Instruction-Stream
Classification developed by Michael J. Flynn, 1966
- Single-Instruction, Single-Data Stream (SISD) Architecture
  - PDP-11
- Single-Instruction, Multiple-Data Stream (SIMD) Architecture
  - Array Processors, Solomon, Illiac IV, BSP, TMC
- Multiple-Instruction, Single-Data Stream (MISD) Architecture
  - Pipelined architecture
- Multiple-Instruction, Multiple-Data Stream (MIMD) Architecture
  - True multiprocessor
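Flynn's four classes follow mechanically from two counts: how many instruction streams and how many data streams a machine handles at once. The sketch below only illustrates that naming rule; the function name `flynn_class` is hypothetical and not part of the lecture.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn class name for the given number of streams.

    Illustrative helper only: "S" means exactly one stream,
    "M" means more than one.
    """
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"


if __name__ == "__main__":
    # One instruction stream driving 64 data lanes, as in an array processor.
    print(flynn_class(1, 64))
    # Several independent processors, each with its own data.
    print(flynn_class(4, 4))
```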

Generic Architecture Model

Instruction Set Architecture (ISA)
- ISA is the boundary between Software and Hardware
- Specifies the logical machine visible to programmer & compiler
- Is the functional specification for processor designers
- That boundary is sometimes a very low-level piece of system SW that handles exceptions, interrupts, and HW-specific services that could fall into the domain of the OS

Instruction Set Architecture (ISA)
Specified by ISA are:
- Operations: what to perform and in which order
- Active, temporary operand storage in CPU: accumulator, stack, registers
  - note that a stack can be word-sized, even bit-sized (e.g. extreme design of successor for NCR’s Century architecture of the 1970s)
- Number of operands per instruction; some implicit, others explicit
- Operand location: where and how to locate/specify the operands: register, literal, data in memory
- Type and size of operands: bit, byte, word, double-word, . . .
- Instruction encoding in binary
- Data types: int, float, double, decimal, char, bit

Instruction Set Architecture (ISA)

Iron Law of Performance
- Clock rate doesn’t count! Bus width doesn’t count. Number of registers and number of operations executed in parallel don’t count!
- What counts is how long it takes for my computational task to complete. That time is of the essence in computing!
- If a MIPS-based solution runs at 1 GHz and completes a program X in 2 minutes, while an Intel Pentium® 4-based solution runs at 3 GHz and completes that same program X in 2.5 minutes, programmers are more interested in the MIPS solution
- If a solution on an Intel CPU can be expressed in an object program of size Y bytes, but on an IBM architecture of size 1.1 * Y bytes, the Intel solution is generally more attractive
- Meaning of this:
  - Wall-clock time (Time) is the time I have to wait for completion
  - Program Size is the overall complexity of the computational task

Iron Law of Performance
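The Iron Law is commonly written as Time = (instructions per program) × (cycles per instruction) × (seconds per cycle). The sketch below assumes that standard form to show how a 1 GHz machine can finish before a 3 GHz one. The instruction counts and CPI values are invented for illustration, chosen only so that the results match the 2-minute and 2.5-minute figures quoted in the lecture.

```python
def execution_time(instructions: float, cpi: float, clock_hz: float) -> float:
    """Wall-clock seconds = instruction count * cycles per instruction / clock rate."""
    return instructions * cpi / clock_hz


if __name__ == "__main__":
    # Hypothetical workloads: the slower clock wins because it needs
    # fewer instructions and fewer cycles per instruction.
    slow_clock = execution_time(instructions=60e9, cpi=2.0, clock_hz=1e9)
    fast_clock = execution_time(instructions=150e9, cpi=3.0, clock_hz=3e9)
    print(f"1 GHz machine: {slow_clock / 60:.1f} minutes")
    print(f"3 GHz machine: {fast_clock / 60:.1f} minutes")
```

The point of the design is that clock rate appears only as one of three factors; it says nothing by itself about completion time.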

Uniprocessor (UP) Architectures: Single Accumulator Architecture (SAA)
- Single register to hold operation results
- Conventionally called accumulator
- Accumulator used as destination of arithmetic operations, and as (one) source
- SAA has a central processing unit, a memory unit, and a connecting memory bus; typical for von Neumann architecture
- The pc points to the next instruction in memory to be executed
- Sample: ENIAC

Uniprocessor (UP) Architectures: General-Purpose Register (GPR) Architecture
- Accumulates ALU results in n registers, typically 4, 8, 16, 64
- Allows register-to-register operations, fast!
- GPR is essentially a multi-register extension of SAA
- Two-address architecture specifies one source operand explicitly, another implicitly, plus one destination
- Three-address architecture specifies two source operands explicitly, plus an explicit destination
- Variations allow additional index registers, base registers, multiple index registers, etc.

Uniprocessor (UP) Architectures: Stack Machine Architecture (SMA)
- AKA zero-address architecture, since arithmetic operations require no explicit operand, hence no operand addresses; all are implied to be on the stack, except for push and pop
- Wake-up call to students: What is the equivalent of push/pop on GPR?
- Pure Stack Machine (SMA) has no registers
- Hence performance is inherently poor, as all operations involve memory on a stack machine
- However, one will design an SMA that implements the n top-of-stack elements as registers, i.e. as a Stack Cache: n = 4, 8, . . .
- Sample architectures: Burroughs B5000, HP 3000
- These implement impure stack operations that bypass top-of-stack (tos) operand addressing
- Sample code sequence to compute:

Uniprocessor (UP) Architectures: Stack Machine Architecture (SMA)
    res := a * ( 145 + b )   -- operand sizes are implied!

    push    a     -- destination implied: stack
    pushlit 145   -- also destination implied
    push    b     -- ditto
    add           -- 2 sources, and destination implied
    mult          -- 2 sources, and destination implied
    pop     res   -- source implied: stack
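The sample sequence can be traced with a tiny interpreter. This is a minimal sketch, assuming memory is a dictionary of named variables and that the six mnemonics mean what the comments on the slide say; the function name `run_stack_program` and the tuple encoding of instructions are hypothetical.

```python
def run_stack_program(program, memory):
    """Execute a list of zero-address instructions against `memory`.

    Each instruction is a tuple: (mnemonic,) or (mnemonic, operand).
    Arithmetic operations take both sources from the stack and push
    the result back, so they carry no operand addresses.
    """
    stack = []
    for op, *args in program:
        if op == "push":        # memory -> stack
            stack.append(memory[args[0]])
        elif op == "pushlit":   # literal -> stack
            stack.append(args[0])
        elif op == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "mult":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif op == "pop":       # stack -> memory
            memory[args[0]] = stack.pop()
        else:
            raise ValueError(f"unknown instruction: {op}")
    return memory


# res := a * (145 + b), exactly the sequence from the slide.
PROGRAM = [
    ("push", "a"),
    ("pushlit", 145),
    ("push", "b"),
    ("add",),
    ("mult",),
    ("pop", "res"),
]

if __name__ == "__main__":
    print(run_stack_program(PROGRAM, {"a": 3, "b": 5})["res"])
```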

Uniprocessor (UP) Architectures: Pipelined Architecture (PA)
- Arithmetic Logic Unit, ALU, is split into separate, sequentially connected units in PA
- Unit is referred to as a stage; more precisely, the time at which the action is done is the stage
- Each of these stages/units can be initiated once per cycle
- Yet each subunit is implemented in HW just once
- Multiple subunits operate in parallel on different sub-ops, each executing a different stage; each stage is part of one instruction execution, with many stages running in parallel
- Non-unit time, i.e. differing # of cycles per operation, causes different termination times
- Operations abort in an intermediate stage if some later instruction changes the flow of control; e.g. due to a branch, exception, return, conditional branch, call

Uniprocessor (UP) Architectures Pipelined Architecture (PA)

Uniprocessor (UP) Architectures: Pipelined Architecture (PA)
- Operation must stall in case of data or control dependence: stall, AKA interlock
- Ideally each instruction can be partitioned into the same number of stages, i.e. sub-operations
- Operations to be pipelined can sometimes be evenly partitioned into equal-length sub-operations
- That equal-length time quantum might as well be a single sub-clock
- In practice this is hard or impossible for the architect to achieve; compare for example integer add and floating point divide!

Uniprocessor (UP) Architectures: Pipelined Architecture (PA)
- Ideally all operations have independent operands
  - i.e. one operand being computed is not needed as source of the next few operations
  - if they were needed, and often they are, then this would cause a dependence, which causes a stall
    - read after write (RAW)
    - write after read (WAR)
    - write after write, with use in between (WAW)
- Also, ideally, all instructions just happen to be arranged sequentially one after another
- In reality, there are branches, calls, returns etc.
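The three dependence kinds can be detected by comparing the registers that two instructions read and write. The following is a simplified sketch: the `Instr` record and the `hazards` function are hypothetical names, only a pair of instructions is examined, and the "use in between" condition for WAW is not modeled.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    """A register instruction reduced to what matters for dependences."""
    dest: Optional[str]              # register written, or None (e.g. noop)
    sources: Tuple[str, ...] = ()    # registers read


def hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Return the dependence kinds that `later` has on `earlier`."""
    found = set()
    if earlier.dest is not None and earlier.dest in later.sources:
        found.add("RAW")   # later reads what earlier writes
    if later.dest is not None and later.dest in earlier.sources:
        found.add("WAR")   # later overwrites what earlier still reads
    if earlier.dest is not None and earlier.dest == later.dest:
        found.add("WAW")   # both write the same register
    return found


if __name__ == "__main__":
    add = Instr(dest="r1", sources=("r2", "r3"))   # r1 = r2 + r3
    sub = Instr(dest="r4", sources=("r1", "r5"))   # r4 = r1 - r5
    print(hazards(add, sub))
```

An empty result means the two instructions could overlap in the pipeline without an interlock, at least as far as register operands are concerned.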

Uniprocessor (UP) Architectures: Simplified Pipelined Resource Diagram
- if: fetch an instruction
- de: decode the instruction
- op1: fetch or generate the first operand, if any
- op2: fetch or generate the second operand, if any
- exec: execute that stage of the overall operation
- wb: write result back to destination, if any
  - e.g. noop has no destination; halt has no destination
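With the six stages named above, a resource diagram for an ideal pipeline (one new instruction per cycle, no stalls) can be generated directly. This sketch assumes that ideal case only; `stage_at`, `total_cycles`, and `resource_diagram` are hypothetical helper names.

```python
STAGES = ["if", "de", "op1", "op2", "exec", "wb"]


def stage_at(instr: int, cycle: int):
    """Stage occupied by instruction `instr` in `cycle` (both 0-based), or None."""
    index = cycle - instr  # instruction i enters the pipeline in cycle i
    return STAGES[index] if 0 <= index < len(STAGES) else None


def total_cycles(num_instructions: int) -> int:
    """Cycles needed to drain the pipeline when nothing ever stalls."""
    if num_instructions <= 0:
        return 0
    return len(STAGES) + num_instructions - 1


def resource_diagram(num_instructions: int) -> str:
    """One text row per instruction, one column per cycle; '.' means idle."""
    rows = []
    for instr in range(num_instructions):
        cells = [
            (stage_at(instr, cycle) or ".").ljust(5)
            for cycle in range(total_cycles(num_instructions))
        ]
        rows.append(f"i{instr}: " + "".join(cells).rstrip())
    return "\n".join(rows)


if __name__ == "__main__":
    print(resource_diagram(4))
```

Reading one column of the printed diagram shows which instructions occupy which stage in the same cycle, which is the overlap that pipelining buys.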

Uniprocessor (UP) Architectures: Superscalar Architecture (covered under Hybrid Architectures)
- Identical to regular uniprocessor architecture
- But some arithmetic or logical units are replicated
- E.g. may have multiple floating point (FP) multipliers
- Or FP multiplier and FP adder may work at the same time
- The key is: on a superscalar architecture, sometimes more than one instruction can execute at one moment!
- Provided that there is no data dependence!
- First superscalar machines included CDC 6600, Intel i960CA, and AMD 29000 series
- Object code can look identical to code for a strict uniprocessor, yet the HW fetches more than just the next instruction, and performs data dependence analysis

Uniprocessor (UP) Architectures: Vector Architecture (VA)
- Vector register implemented as HW array of identical registers, named vr_i
- VA may also have scalar registers, named r0, r1, etc.
- Scalar register can also be the first of the vector registers
- Vector registers can load/store a block of contiguous data
- Still in sequence, but overlapped; number of steps to complete load/store of a vector also depends on bus width
- Vector machine can perform multiple operations of the same kind on whole contiguous blocks of operands
- Still in sequence, but overlapped, and all operands are readily available
- Otherwise operates like GPR architecture, but on vector operands; if vector size is 1, then VA is identical to UP

Uniprocessor (UP) Architectures Vector Architecture (VA)

Uniprocessor (UP) Architectures: Sample Vector Architecture operations
    ldv    vr1, memi           -- loads 64 memory locs from [mem+i], i = 0..63
    stv    vr2, memj           -- stores vr2 in 64 contiguous locs
    vadd   vr1, vr2, vr3       -- register-register vector add
    cvaddf r0, vr1, vr2, vr3   -- has conditional meaning:

    -- sequential equivalent:
    for i = 0 to 63 do
      if bit i in r0 is 1 then
        vr1[i] = vr2[i] + vr3[i]   -- e.g. cvadd r0, r1, r2, r3
      else
        -- do not move corresponding bits
      end if
    end for

    -- parallel syntax equivalent:
    forall i = 0 to 63 doparallel
      if bit i in r0 is 1 then
        vr1[i] = vr2[i] + vr3[i]
      end if
    end parallel for
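The conditional vector add can be mimicked in a few lines. This is a sketch under two assumptions not stated on the slide: bit 0 of the mask register is the least significant bit, and elements whose mask bit is 0 keep their old value. The function name `cvadd` mirrors the mnemonic but is otherwise hypothetical.

```python
VLEN = 64  # elements per vector register, as in the sample operations


def cvadd(mask: int, vr1, vr2, vr3):
    """Return the new contents of vr1 after a masked vector add.

    Element i becomes vr2[i] + vr3[i] when bit i of `mask` is 1;
    otherwise the old vr1[i] is kept.
    """
    if not (len(vr1) == len(vr2) == len(vr3) == VLEN):
        raise ValueError(f"all vector registers must hold {VLEN} elements")
    return [
        vr2[i] + vr3[i] if (mask >> i) & 1 else vr1[i]
        for i in range(VLEN)
    ]


if __name__ == "__main__":
    old = [-1] * VLEN
    a = list(range(VLEN))
    b = [10] * VLEN
    # Mask 0b101 enables only elements 0 and 2.
    print(cvadd(0b101, old, a, b)[:4])
```

Because no element depends on any other, the loop body is what a vector unit may overlap or run in parallel across all 64 positions.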

Multiprocessor (MP) Architectures: Shared Memory Architecture (SMA)
- Equal access to memory for all n processors, p_0 to p_{n-1}
- Only one will succeed in accessing shared memory when there are multiple, quasi-simultaneous accesses
- Simultaneous memory access must be deterministic; needs an arbiter to ensure determinism
- Von Neumann bottleneck tighter than in a conventional UP system
- Generally there are twice as many loads as there are stores in typical object code
- Occasionally, some processors are idle due to memory conflict
- Typical number of processors n=4, but n=8 and greater possible, with large 2nd level cache, even larger 3rd level
- Only limited commercial success and acceptance; programming burden frequently on programmer
- Morphing in the 2000s into multi-core and hyper-threaded architectures, where programming burden is on multi-threading OS or the programmer
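The role of the arbiter can be sketched as a function that picks exactly one winner among simultaneous requests, always by the same rule. The fixed-priority rule used here (lowest processor number wins) is an assumption chosen for simplicity; the lecture does not say which policy a real arbiter uses, and `arbitrate` and `serve_all` are hypothetical names.

```python
def arbitrate(requesting):
    """Pick the one processor granted memory access this cycle.

    Assumed policy: the lowest processor id wins, so the same set of
    requests always yields the same grant (deterministic).
    """
    return min(requesting) if requesting else None


def serve_all(requesting):
    """Order in which simultaneous requests are served, one per cycle.

    Every processor still waiting in a cycle is idle for that cycle.
    """
    pending = set(requesting)
    order = []
    while pending:
        winner = arbitrate(pending)
        order.append(winner)
        pending.remove(winner)
    return order


if __name__ == "__main__":
    # p0, p2 and p3 all access shared memory in the same cycle.
    print(serve_all({2, 0, 3}))
```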

Multiprocessor (MP) Architectures Shared Memory Architecture (SMA)

Multiprocessor (MP) Architectures: Distributed Memory Architecture (DMA)
- Processors have private memories, AKA local memories
- Yet programmer has to see a single, logical memory space, regardless of local distribution
- Each processor p_i always has access to its own memory Mem_i
- Collection of all memories Mem_i, i = 0..n-1, is the logical data space
- Thus, processors must access others’ memories
- Done via Message Passing or Virtual Shared Memory
- Messages must be routed, and the route must be determined
- Route may be long, i.e. require multiple, intermediate nodes
- Blocking when: message expected but hasn’t arrived yet
- Blocking when: destination cannot receive
- Growing message buffer size increases illusion of asynchronicity of sending and receiving operations
- Key parameter: time for 1 hop and package overhead to send an empty message
- Message may also be delayed because of network congestion
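One way to picture the single logical data space built from private memories is an address split into an owner node and a local offset. The block distribution used below (equal-sized memories, consecutive addresses on the same node) is an assumption made for illustration; `locate` and `read` are hypothetical names, and the "message" case merely labels where real message passing would occur.

```python
def locate(address: int, words_per_node: int):
    """Map a logical address to (owner node, offset in that node's memory)."""
    return divmod(address, words_per_node)


def read(memories, requester: int, address: int):
    """Read a logical address on behalf of processor `requester`.

    Returns (value, kind), where kind is "local" if the word lives in
    the requester's own memory and "message" if another node owns it
    and a message exchange would be needed.
    """
    owner, offset = locate(address, len(memories[0]))
    kind = "local" if owner == requester else "message"
    return memories[owner][offset], kind


if __name__ == "__main__":
    # Three nodes with two words each form a 6-word logical space.
    memories = [[10, 11], [20, 21], [30, 31]]
    print(read(memories, requester=1, address=3))
    print(read(memories, requester=0, address=4))
```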

Multiprocessor (MP) Architectures Distributed Memory Architecture (DMA)

Multiprocessor (MP) Architectures: Systolic Array Architecture (SAA)
- Very few designed: CMU and Intel for (then) ARPA
- Each processor has private memory
- Network is pre-defined by the Systolic Pathway (SP)
- Each node is pre-connected via SP to some subset of other processors
- Node connectivity: determined by network topology
- Systolic pathway is a high-performance network; sending and receiving may be synchronized (blocking) or asynchronous (data received are buffered)
- Typical network topologies: line, ring, torus, hex grid, mesh, etc.
- Sample below is a ring; wrap-around along x and y dimensions not shown
- Processor can write to x or y gate; sends word off on x or y SP
- Processor can read from x or y gate; consumes word from x or y SP
- Buffered SA can write to gate, even if receiver cannot read
- Reading from gate when no message is available blocks
- Automatic code generation for non-buffered SA is hard; compiler must keep track of interprocessor synchronization
- Can view SP as an extension of memory with infinite capacity, but with sequential access

Multiprocessor (MP) Architectures Systolic Array Architecture (SAA)

Multiprocessor (MP) Architectures: Systolic Array Architecture (SAA)
- Note that each pathway, x or y, may be bi-directional
- May have any number of pathways, nothing magic about 2, x and y; could be 3 or more
- Possible to have I/O capability with each node
- Typical application: large polynomials of the form: $y = k_0 + k_1 x^1 + k_2 x^2 + \dots + k_{n-1} x^{n-1} = \sum_{i=0}^{n-1} k_i x^i$
- Next example shows a torus without displaying the wrap-around pathways across both dimensions
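To see why polynomial evaluation suits a systolic array, consider a line of cells where each cell holds one coefficient, multiplies the incoming partial result by x, adds its coefficient, and passes both onward (Horner's rule). The sketch reads the form on this slide as a polynomial in x; the mapping of coefficients to cells is an assumption made for illustration, not the Warp or iWarp design, and `systolic_eval` is a hypothetical name.

```python
def systolic_eval(coeffs, xs):
    """Evaluate sum(coeffs[i] * x**i) for a stream of x values.

    The array is simulated in lockstep: in every step each cell takes
    the (x, partial) pair its left neighbour produced in the previous
    step, so several x values are in flight at the same time.
    """
    if not coeffs:
        raise ValueError("need at least one coefficient")
    cells = list(reversed(coeffs))   # cell 0 holds the highest coefficient
    regs = [None] * len(cells)       # output latch of each cell
    inputs = list(xs)
    results = []
    for _ in range(len(inputs) + len(cells)):
        if regs[-1] is not None:     # a finished value leaves the array
            results.append(regs[-1][1])
        # Update right to left so each cell sees its neighbour's old output.
        for c in range(len(cells) - 1, 0, -1):
            prev = regs[c - 1]
            regs[c] = None if prev is None else (prev[0], prev[1] * prev[0] + cells[c])
        regs[0] = (inputs[0], cells[0]) if inputs else None
        inputs = inputs[1:]
    return results


if __name__ == "__main__":
    # 1 + 2x + 3x^2 evaluated for a stream of three x values.
    print(systolic_eval([1, 2, 3], [2, 0, 1]))
```

After the array fills, one result leaves per step, which is the throughput benefit of keeping every cell busy on a different input.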

Multiprocessor (MP) Architectures Systolic Array Architecture (SAA)

Hybrid Architectures: Superscalar Architecture (SA)
- Replicates (duplicates) some operations in HW
- Seems like scalar architecture w.r.t. object code, yet can compute some operations of UP in parallel, e.g. fadd and fmult
- Is almost a parallel architecture, if it has multiple copies of some hardware units, say two fadd units
- Is not an MP architecture: ALU is not replicated
- Has multiple parts of an ALU, possibly multiple FPA units, or FPM units, and/or integer units
- Arithmetic operations simultaneous with load and store operations; note data dependence!
- Instruction fetch is speculative, since number of parallel operations is unknown; rule: fetch too much!
- But fetch no more than the longest possible superscalar pattern

Hybrid Architectures: Superscalar Architecture (SA)
- Code sequence looks like a sequence of instructions for a scalar processor
- Example: 80486® code executed on Pentium® processors
- More famous and successful example: 80860® processor
- Object code can be custom-tailored by compiler; i.e. compiler can have superscalar target processor in mind and bias code emission, knowing that some code sequences are better suited for superscalar execution
- Fetch enough instruction bytes to support longest possible object sequence
- Decoding is the bottleneck for CISC, way easier for RISC: 32-bit units
- Sample of superscalar: i80860 could run in parallel one FPA, one FPM, two integer ops, and a load or store with auto-increment (++) or auto-decrement (--)

Hybrid Architectures Superscalar Architecture (SA)

Hybrid Architectures: Very Long Instruction Word Architecture (VLIW)
- Very Long Instruction Word, typically 128 bits or more
- VLIW machine also has scalar operations
- VLIW code is no longer scalar, but explicitly parallel
- Limitations like in superscalar:
  - VLIW is not a general MP architecture
  - subinstructions do not have concurrent memory access
  - dependences must be resolved before code emission
- But the VLIW opcode is designed to execute in parallel
- VLIW suboperations can be defined as no-op, thus just the other suboperations run in parallel
- Compiler/programmer explicitly packs parallelizable operations into VLIW instruction
- Just like horizontal microcode compaction

Hybrid Architectures: VLIW Sample
- Compute instruction of CMU Warp® and Intel® iWarp®
- Could be 1-bit (or few-bit) opcode for compute instruction, plus sub-opcodes for subinstructions
- Data dependence example: result of FPA cannot be used as operand for FPM in the same VLIW instruction
- But provided proper SW pipelining (not covered in CS 201), both subinstructions may refer to the same FP register
- Result of int1 cannot be used as operand for int2, etc.
- With SW pipelining both subinstructions may refer to the same int register
- Thus, need to software-pipeline
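The dependence rule for one VLIW instruction can be expressed as a check over its subinstructions. This sketch is deliberately conservative: it flags every case where one subinstruction reads a register that another subinstruction of the same word writes, although a software-pipelining compiler may allow that on purpose in order to read the old value. `SubOp` and `bundle_is_legal` are hypothetical names, as are the unit labels.

```python
from typing import NamedTuple, Optional, Tuple


class SubOp(NamedTuple):
    """One subinstruction slot of a VLIW word."""
    unit: str                        # e.g. "FPA", "FPM", "int1", "int2"
    dest: Optional[str] = None       # register written; None for a no-op
    sources: Tuple[str, ...] = ()    # registers read


def bundle_is_legal(bundle) -> bool:
    """True if no subinstruction consumes a result produced in the same word."""
    for op in bundle:
        for other in bundle:
            if other is op or other.dest is None:
                continue
            if other.dest in op.sources:
                return False
    return True


if __name__ == "__main__":
    fadd = SubOp("FPA", dest="f1", sources=("f2", "f3"))
    dependent_fmul = SubOp("FPM", dest="f4", sources=("f1", "f5"))
    independent_fmul = SubOp("FPM", dest="f4", sources=("f5", "f6"))
    print(bundle_is_legal([fadd, dependent_fmul]))
    print(bundle_is_legal([fadd, independent_fmul]))
```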

Hybrid Architectures: Itanium EPIC Architecture
- Explicitly Parallel Instruction Computing
- Group instructions into bundles
- Straighten out the branches by associating a predicate with instructions; avoids branch and executes speculatively
- Execute instructions in parallel, say the else clause and the then clause of an If Statement
- Decide at run time which of the predicates is true, and (post) complete just that path from multiple choices; discard others
- Use speculation to straighten branch tree
- Use rotating register file
- Has many registers, not just 64 GPRs

Hybrid Architectures: Itanium
- Groups and bundles lump multiple compute steps into one that can be run in parallel
- Parallel comparisons allow fast decisions
- Predication associates a condition (the predicate) with 2 simultaneously executed instruction sequences, only 1 of which will be posted
- Speculation fetches operands, not knowing for sure whether this results in use; branch may invalidate early fetch
- Branch elimination straightens out code with jumps
- Branch prediction
- Large register file
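Predication can be imitated in software: execute the instructions of both paths, then post only the results whose predicate is true. The sketch below is a conceptual model rather than Itanium code; `run_predicated`, the predicate names `p1` and `p2`, and the tuple encoding are all hypothetical.

```python
def run_predicated(instructions, predicates, registers):
    """Execute every instruction, but post only results under a true predicate.

    Each instruction is (predicate name, destination register, compute),
    where compute is a function of the current register contents.
    No branch selects between the two paths; the predicate does.
    """
    for pred, dest, compute in instructions:
        value = compute(registers)      # executed regardless of the predicate
        if predicates[pred]:
            registers[dest] = value     # posted
        # otherwise the computed value is simply discarded
    return registers


def abs_difference(a: int, b: int) -> int:
    """if a > b then r = a - b else r = b - a, written without a jump."""
    registers = {"a": a, "b": b, "r": 0}
    greater = a > b                     # one compare sets both predicates
    predicates = {"p1": greater, "p2": not greater}
    instructions = [
        ("p1", "r", lambda regs: regs["a"] - regs["b"]),   # then clause
        ("p2", "r", lambda regs: regs["b"] - regs["a"]),   # else clause
    ]
    return run_predicated(instructions, predicates, registers)["r"]


if __name__ == "__main__":
    print(abs_difference(7, 3))
    print(abs_difference(3, 7))
```

The cost is that both clauses consume execution resources; the benefit is that the instruction stream stays straight, with no branch to mispredict.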

Hybrid Architectures: Itanium
- Numerous branch registers; speeds up execution by having some branch destinations in a register; fast to load into ip reg
- Multiple CFM registers (Current Frame Marker regs); avoid slowness due to memory access
- See separate lecture note

References
- http://cs.illinois.edu/csillinois/history
- http://www.arl.wustl.edu/~pcrowley/cse526/bsp2.pdf
- http://dl.acm.org/citation.cfm?id=102450
- http://csg.csail.mit.edu/Dataflow/talks/DennisTalk.pdf
- http://en.wikipedia.org/wiki/Flynn's_taxonomy
- http://www.ajwm.net/amayer/papers/B5000.html
- http://www.robelle.com/smugbook/classic.html
- http://en.wikipedia.org/wiki/ILLIAC_IV
- http://www.intel.com/design/itanium/manuals.htm
- http://www.csupomona.edu/~hnriley/www/VonN.html
- http://cva.stanford.edu/classes/ee482s/scribed/lect11.pdf
- VLIW Architecture: http://www.nxp.com/acrobat_download2/other/vliw-wp.pdf
- ACM reference to Multiflow computer architecture: http://dl.acm.org/citation.cfm?id=110622&coll=portal&dl=ACM