1 CS 201 Computer Systems Programming Chapter 4 “Computer Taxonomy” Herbert G. Mayer, PSU CS Status 6/30/2014.



2 Syllabus
- Introduction
- Common Architecture Attributes
- General Limitations
- Data-Stream Instruction-Stream
- Generic Architecture Model
- Instruction Set Architecture (ISA)
- Iron Law of Performance
- Uniprocessor (UP) Architectures
- Multiprocessor (MP) Architectures
- Hybrid Architectures
- References

3 Introduction: Uniprocessors
- Single Accumulator Architectures, earliest in the 1940s; e.g. Atanasoff, Zuse, von Neumann
- General-Purpose Register Architectures (GPR)
- 2-Address Architecture, i.e. GPR with one operand implied, e.g. IBM
- 3-Address Architecture, i.e. GPR with all operands of arithmetic operation explicit, e.g. VAX 11/70
- Stack Machines (e.g. B5000, B6000, HP3000)
- Pipelined architecture, e.g. CDC 5000, Cyber 6000
- Vector Architecture, e.g. Amdahl 470/6, competing with IBM’s 360 in the 1970s; blurs line to Multiprocessor

4 Introduction: Multiprocessors
- Shared Memory Architecture; e.g. Illiac IV, BSP
- Distributed Memory Architecture
- Systolic Architecture; see Intel® iWarp and CMU’s Warp architecture
- Data Flow Machine; see Jack Dennis’ work at MIT

5 Introduction: Hybrid Architectures
- Superscalar Architecture; see Intel 80860, AKA i860
- VLIW Architecture; see Multiflow computer or systolic array architecture, like Warp at CMU or iWarp at Intel in the 1990s
- Pipelined Architecture; debatable if it is a hybrid architecture
- EPIC Architecture; see HP and Intel® Itanium® architecture

6 Common Architecture Attributes
- Main memory (main store), external from processor
- Program instructions stored in main memory
- Also, data stored in main memory; typical for von Neumann architecture
- Data available in (distributed over) static memory, stack, heap, reserved OS space, free space, IO space
- Instruction pointer (AKA instruction counter, program counter pc), other special registers
- Von Neumann memory bottleneck: everything travels on the same, single bus

7 Common Architecture Attributes
- Accumulator (register, 1 or many) holds result of arithmetic-logical operation
- Memory Controller handles memory access requests from processor; moves bits to/from memory; is part of “chipset”
- Current trend is to move some of the memory controller or IO controller onto CPU chip; caveat: that does not mean the chipset IS part of the CPU!
- Logical processor unit includes: FP unit, integer unit, control unit, register file, load-store unit, pathways
- Physical processor unit includes: heat sensors, frequency control, voltage regulator, and more

8 General Limitations
- Compute-Bound: type of application in which the vast majority of execution time is spent fetching and executing instructions; time to load and store data in/from memory is a small % of overall time
- Memory-Bound: application in which the majority of execution time is spent loading and storing data in memory; time executing instructions is a small % vs. time to access memory
- IO-Bound: application in which the majority of execution time is spent accessing secondary storage; time executing instructions, even the time accessing memory, is a small % vs. time to access secondary storage
- Backup-Bound (semi-serious only): like IO-Bound, but backup storage medium can be even slower than typical secondary storage devices

9 Data-Stream Instruction-Stream
Classification developed by Michael J. Flynn:
1. Single-Instruction, Single-Data Stream (SISD) Architecture: PDP
2. Single-Instruction, Multiple-Data Stream (SIMD) Architecture: Array Processors, Solomon, Illiac IV, BSP, TMC
3. Multiple-Instruction, Single-Data Stream (MISD) Architecture: Pipelined architecture
4. Multiple-Instruction, Multiple-Data Stream (MIMD) Architecture: true multiprocessor

10 Generic Architecture Model

11 Instruction Set Architecture (ISA)
- ISA is the boundary between Software and Hardware
- Specifies logical machine visible to programmer & compiler
- Is functional specification for processor designers
- That boundary is sometimes a very low-level piece of system SW that handles exceptions, interrupts, and HW-specific services that could fall into the domain of the OS

12 Instruction Set Architecture (ISA)
Specified by ISA are:
- Operations: what to perform and in which order
- Active, temporary operand storage in CPU: accumulator, stack, registers; note that stack can be word-sized, even bit-sized (e.g. extreme design of successor for NCR’s Century architecture of the 1970s)
- Number of operands per instruction; some implicit, others explicit
- Operand location: where and how to locate/specify the operands: register, literal, data in memory
- Type and size of operands: bit, byte, word, double-word, ...
- Instruction encoding in binary
- Data types: int, float, double, decimal, char, bit

13 Instruction Set Architecture (ISA)

14 Iron Law of Performance
- Clock-rate doesn’t count! Bus width doesn’t count. Number of registers and operations executed in parallel doesn’t count!
- What counts is how long it takes for my computational task to complete. That time is of the essence of computing!
- If a MIPS-based solution runs at 1 GHz and completes a program X in 2 minutes, while an Intel Pentium® 4-based solution runs at 3 GHz and completes that same program X in 2.5 minutes, programmers are more interested in the MIPS solution
- If a solution on an Intel CPU can be expressed in an object program of size Y bytes, but on an IBM architecture of size 1.1 * Y bytes, the Intel solution is generally more attractive
- Meaning of this:
  - Wall-clock time (Time) is time I have to wait for completion
  - Program Size is overall complexity of computational task

15 Iron Law of Performance
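
The figure for this slide did not survive in the transcript. The commonly cited form of the Iron Law expresses execution time as instruction count times cycles per instruction times cycle time. The sketch below applies that form; the instruction counts and CPI values are invented purely so that the results line up with the 2-minute and 2.5-minute example on the previous slide, and are not measurements of any real processor.

```python
def cpu_time_seconds(instruction_count, cpi, clock_hz):
    """Iron Law: time = instructions * cycles-per-instruction * seconds-per-cycle."""
    return instruction_count * cpi / clock_hz


# Hypothetical workloads: the 1 GHz machine needs fewer instructions and fewer
# cycles per instruction, so it finishes first despite the slower clock.
time_1ghz = cpu_time_seconds(instruction_count=80e9, cpi=1.5, clock_hz=1e9)
time_3ghz = cpu_time_seconds(instruction_count=150e9, cpi=3.0, clock_hz=3e9)

print(f"1 GHz machine: {time_1ghz / 60:.1f} min")
print(f"3 GHz machine: {time_3ghz / 60:.1f} min")
```

The point of the sketch is only that clock rate is one of three factors; a threefold clock advantage can be cancelled by a larger instruction count or a higher CPI.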

16 Different Classes of Architectures

17 Uniprocessor (UP) Architectures
Single Accumulator Architecture (SAA)
- Single register to hold operation results
- Conventionally called accumulator
- Accumulator used as destination of arithmetic operations, and as (one) source
- SAA has central processing unit, memory unit, connecting memory bus; typical for von Neumann architecture
- The pc points to next instruction in memory to be executed
- Sample: ENIAC
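
To make the role of the accumulator and the pc concrete, here is a minimal interpreter sketch for a single-accumulator machine. The four mnemonics (`load`, `add`, `mul`, `store`) and the memory layout are hypothetical, chosen for illustration only; they are not the instruction set of ENIAC or any other real machine.

```python
def run_accumulator(program, memory):
    """Interpret a tiny single-accumulator instruction set (hypothetical mnemonics)."""
    acc = 0   # the single accumulator: implied source and destination
    pc = 0    # program counter: index of the next instruction to execute
    while pc < len(program):
        op, addr = program[pc]
        pc += 1
        if op == "load":
            acc = memory[addr]
        elif op == "add":
            acc = acc + memory[addr]
        elif op == "mul":
            acc = acc * memory[addr]
        elif op == "store":
            memory[addr] = acc
        else:
            raise ValueError(f"unknown opcode: {op}")
    return memory


# res := a * (b + c); every arithmetic instruction names only one memory operand.
mem = {"a": 3, "b": 4, "c": 5, "res": 0}
prog = [("load", "b"), ("add", "c"), ("mul", "a"), ("store", "res")]
run_accumulator(prog, mem)
print(mem["res"])
```

Each arithmetic instruction carries a single address because the other source and the destination are always the accumulator.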

18 Uniprocessor (UP) Architectures
General-Purpose Register (GPR) Architecture
- Accumulates ALU results in n registers, typically 4, 8, 16, 64
- Allows register-to-register operations, fast!
- GPR is essentially a multi-register extension of SAA
- Two-address architecture specifies one source operand explicitly, another implicitly, plus one destination
- Three-address architecture specifies two source operands explicitly, plus an explicit destination
- Variations allow additional index registers, base registers, multiple index registers, etc.

19 Uniprocessor (UP) Architectures
Stack Machine Architecture (SMA)
- AKA zero-address architecture, since arithmetic operations require no explicit operand, hence no operand addresses; all are implied to be on the stack, except for push and pop
- Wake-up call to students: What is the equivalent of push/pop on GPR?
- Pure Stack Machine (SMA) has no registers
- Hence performance is inherently poor, as all operations involve memory on a stack machine
- However, one can design an SMA that implements the n top-of-stack elements as registers, i.e. as a Stack Cache: n = 4, 8, ...
- Sample architectures: Burroughs B5000, HP 3000
- Implement impure stack operations that bypass TOS operand addressing
- Sample code sequence to compute:

20 Uniprocessor (UP) Architectures
Stack Machine Architecture (SMA)
    res := a * ( lit + b )   -- operand sizes are implied!

    push a        -- destination implied: stack
    pushlit lit   -- also destination implied
    push b        -- ditto
    add           -- 2 sources, and destination implied
    mult          -- 2 sources, and destination implied
    pop res       -- source implied: stack
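
The push / pushlit / add / mult / pop sequence on this slide can be traced with a small interpreter. This is a minimal sketch, not the instruction set of the B5000 or HP 3000; the literal value 10 and the memory contents are assumptions for illustration, since the literal on the slide is not legible in the transcript.

```python
def run_stack_machine(program, memory):
    """Execute zero-address code: arithmetic operands are implied to be on the stack."""
    stack = []
    for op, *args in program:
        if op == "push":        # push the value of a memory variable
            stack.append(memory[args[0]])
        elif op == "pushlit":   # push a literal constant
            stack.append(args[0])
        elif op == "add":       # two implied sources, implied destination
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "mult":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif op == "pop":       # implied source: top of stack
            memory[args[0]] = stack.pop()
        else:
            raise ValueError(f"unknown opcode: {op}")
    return memory


memory = {"a": 2, "b": 5, "res": 0}
program = [
    ("push", "a"),
    ("pushlit", 10),   # assumed literal value
    ("push", "b"),
    ("add",),
    ("mult",),
    ("pop", "res"),
]
run_stack_machine(program, memory)
print(memory["res"])   # a * (10 + b)
```

Only `push`, `pushlit` and `pop` carry an operand; `add` and `mult` carry none, which is why the architecture is called zero-address.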

21 Uniprocessor (UP) Architectures
Pipelined Architecture (PA)
- Arithmetic Logic Unit, ALU, split into separate, sequentially connected units in PA
- Unit is referred to as a stage; more precisely the time at which the action is done is the stage
- Each of these stages/units can be initiated once per cycle
- Yet each subunit is implemented in HW just once
- Multiple subunits operate in parallel on different sub-ops, each executing a different stage; each stage is part of one instruction execution, many stages running in parallel
- Non-unit time, differing # of cycles per operation cause different terminations
- Operations abort in intermediate stage, if some later instruction changes the flow of control; e.g. due to a branch, exception, return, conditional branch, call

22 Uniprocessor (UP) Architectures Pipelined Architecture (PA)

23 Uniprocessor (UP) Architectures
Pipelined Architecture (PA)
- Operation must stall in case of data or control dependence: stall, AKA interlock
- Ideally each instruction can be partitioned into the same number of stages, i.e. sub-operations
- Operations to be pipelined can sometimes be evenly partitioned into equal-length sub-operations
- That equal-length time quantum might as well be a single sub-clock
- In practice it is hard/impossible for architect to achieve; compare for example integer add and floating point divide!

24 Uniprocessor (UP) Architectures
Pipelined Architecture (PA)
- Ideally all operations have independent operands, i.e. one operand being computed is not needed as source of the next few operations
- If they were needed (and often they are), then this would cause dependence, which causes stall:
  - read after write (RAW)
  - write after read (WAR)
  - write after write, with use in between (WAW)
- Also, ideally, all instructions just happen to be arranged sequentially one after another
- In reality, there are branches, calls, returns etc.
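
The three dependence kinds named on this slide can be checked mechanically for a pair of instructions. The sketch below is illustrative only: the dictionary layout (`dest`, `srcs`) and the register names are assumptions, and it compares just two instructions rather than modelling a real interlock unit.

```python
def classify_hazards(first, second):
    """Return the dependences of `second` on `first`, where `first` comes earlier.

    Each instruction is a dict with 'dest' (a register name or None)
    and 'srcs' (a set of register names).
    """
    hazards = set()
    if first["dest"] is not None and first["dest"] in second["srcs"]:
        hazards.add("RAW")   # second reads what first writes
    if second["dest"] is not None and second["dest"] in first["srcs"]:
        hazards.add("WAR")   # second overwrites what first still has to read
    if first["dest"] is not None and first["dest"] == second["dest"]:
        hazards.add("WAW")   # both write the same register
    return hazards


add_r1 = {"dest": "r1", "srcs": {"r2", "r3"}}   # r1 = r2 + r3
sub_r4 = {"dest": "r4", "srcs": {"r1", "r5"}}   # r4 = r1 - r5
mov_r2 = {"dest": "r2", "srcs": {"r6"}}         # r2 = r6
mul_r1 = {"dest": "r1", "srcs": {"r7", "r8"}}   # r1 = r7 * r8

print(classify_hazards(add_r1, sub_r4))
print(classify_hazards(add_r1, mov_r2))
print(classify_hazards(add_r1, mul_r1))
```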

25 Uniprocessor (UP) Architectures
Simplified Pipelined Resource Diagram
- if: fetch an instruction
- de: decode the instruction
- op1: fetch or generate the first operand, if any
- op2: fetch or generate the second operand, if any
- exec: execute that stage of the overall operation
- wb: write result back to destination, if any; e.g. noop has no destination; halt has no destination
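
The resource diagram itself is missing from the transcript. As a rough stand-in, the following sketch generates the timing table for an idealized pipeline with the six stages listed above, assuming one cycle per stage, one new instruction per cycle, and no stalls. The function names are made up for this illustration.

```python
STAGES = ["if", "de", "op1", "op2", "exec", "wb"]


def pipeline_schedule(num_instructions):
    """Map each instruction index to {cycle: stage} for an ideal, stall-free pipeline."""
    return {
        i: {i + offset: stage for offset, stage in enumerate(STAGES)}
        for i in range(num_instructions)
    }


def total_cycles(num_instructions):
    """Cycles until the last instruction leaves the wb stage."""
    return len(STAGES) + num_instructions - 1


def render(schedule):
    """Format the schedule as one text row per instruction, one column per cycle."""
    cycles = total_cycles(len(schedule))
    rows = []
    for i, row in schedule.items():
        cells = "".join(row.get(c, "").ljust(5) for c in range(cycles))
        rows.append(f"i{i}: {cells.rstrip()}")
    return "\n".join(rows)


schedule = pipeline_schedule(4)
print(render(schedule))
print("cycles needed:", total_cycles(4))
```

In this idealized model, 4 instructions finish in 9 cycles instead of the 24 that strictly sequential execution of six stages each would take.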

26 Uniprocessor (UP) Architectures
Superscalar Architecture; more detail also shown at: “Hybrid Architecture”
- Identical to regular uniprocessor architecture
- But some arithmetic or logical units are replicated
- E.g. may have multiple floating point (FP) multipliers
- Or FP multiplier and FP adder may work at the same time
- The key is: on a superscalar architecture, sometimes more than one instruction can execute at one time!
- Provided that there is no data dependence!
- First superscalar machines included CDC 6600, Intel i960CA, and AMD series
- Object code can look identical to code for strict uniprocessor, yet the HW fetches more than just the next instruction, and performs data dependence analysis

27 Uniprocessor (UP) Architectures
Vector Architecture (VA)
- Register implemented as HW array of identical registers, named vr_i
- VA may also have scalar registers, named r0, r1, etc.
- Scalar register can also be the first of the vector registers
- Vector registers can load/store block of contiguous data
- Still in sequence, but overlapped; number of steps to complete load/store of a vector also depends on bus width
- Vector machine can perform multiple operations of the same kind on whole contiguous blocks of operands
- Still in sequence, but overlapped, and all operands are readily available
- Otherwise operates like GPR architecture, but on vector operands; if vector size is 1, then VA is identical to UP

28 Uniprocessor (UP) Architectures Vector Architecture (VA)

29 Uniprocessor (UP) Architectures
Sample Vector Architecture operation:
    ldv    vr1, mem i          -- loads 64 memory locs from [mem+i=0..63]
    stv    vr2, mem j          -- stores vr2 in 64 contiguous locs
    vadd   vr1, vr2, vr3       -- register-register vector add
    cvaddf r0, vr1, vr2, vr3   -- has conditional meaning:

    -- sequential equivalent:
    for i = 0 to 63 do
      if bit i in r0 is 1 then
        vr1[i] = vr2[i] + vr3[i]   // e.g. cvadd r0, r1, r2, r3
      else
        -- do not move corresponding bits
      end if
    end for

    -- parallel syntax equivalent:
    forall i = 0 to 63 doparallel
      if bit i in r0 is 1 then
        vr1[i] = vr2[i] + vr3[i]
      end if
    end parallel for
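
The conditional vector add described by the pseudocode on this slide can be written as a short runnable sketch. It models a 64-element vector register as a Python list and the scalar mask register r0 as an integer; the function name and argument order are choices made for this illustration, not the syntax of a real vector instruction set.

```python
VECTOR_LENGTH = 64


def conditional_vector_add(mask, vr1, vr2, vr3):
    """Return the new contents of vr1 after a masked add.

    Element i becomes vr2[i] + vr3[i] where bit i of `mask` is 1;
    elements whose mask bit is 0 keep their old vr1 value.
    """
    return [
        vr2[i] + vr3[i] if (mask >> i) & 1 else vr1[i]
        for i in range(VECTOR_LENGTH)
    ]


r0 = 0b101                        # only elements 0 and 2 are enabled
vr1 = [-1] * VECTOR_LENGTH
vr2 = list(range(VECTOR_LENGTH))
vr3 = [100] * VECTOR_LENGTH

result = conditional_vector_add(r0, vr1, vr2, vr3)
print(result[:4])
```

The list comprehension evaluates elements one after another; on vector hardware the same element-wise operations would be overlapped.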

30 Multiprocessor (MP) Architectures
Shared Memory Architecture (SMA)
- Equal access to memory for all n processors, p_0 to p_(n-1)
- Only one will succeed in accessing shared memory, when there are multiple, quasi-simultaneous accesses
- Simultaneous memory access must be deterministic; needs an arbiter to ensure determinism
- Von Neumann bottleneck tighter than conventional UP system
- Generally there are twice as many loads as there are stores in typical object code
- Occasionally, some processors are idle due to memory conflict
- Typical number of processors n=4, but n=8 and greater possible, with large 2nd-level cache, even larger 3rd-level cache
- Only limited commercial success and acceptance, programming burden frequently on programmer
- Morphing in the 2000s into multi-core and hyper-threaded architectures, where programming burden is on multi-threading OS or the programmer

31 Multiprocessor (MP) Architectures Shared Memory Architecture (SMA)

32 Multiprocessor (MP) Architectures
Distributed Memory Architecture (DMA)
- Processors have private memories, AKA local memories
- Yet programmer has to see single, logical memory space, regardless of local distribution
- Hence each processor p_i always has access to its own memory Mem_i
- Collection of all memories Mem_i, i = 0..n-1, is logical data space
- Thus, processors must access others’ memories
- Done via Message Passing or Virtual Shared Memory
- Messages must be routed, route be determined
- Route may be long, i.e. require multiple, intermediate nodes
- Blocking when: message expected but hasn’t arrived yet
- Blocking when: destination cannot receive
- Growing message buffer size increases illusion of asynchronicity of sending and receiving operations
- Key parameter: time for 1 hop and package overhead to send empty message
- Message may also be delayed because of network congestion

33 Multiprocessor (MP) Architectures Distributed Memory Architecture (DMA)

34 Multiprocessor (MP) Architectures
Systolic Array Architecture (SAA)
- Very few designed: CMU and Intel for (then) ARPA
- Each processor has private memory
- Network is pre-defined by the Systolic Pathway (SP)
- Each node is pre-connected via SP to some subset of other processors
- Node connectivity: determined by network topology
- Systolic pathway is high-performance network; sending and receiving may be synchronized (blocking) or asynchronous (data received are buffered)
- Typical network topologies: line, ring, torus, hex grid, mesh, etc.
- Sample below is a ring; wrap-around along x and y dimensions not shown
- Processor can write to x or y gate; sends word off on x or y SP
- Processor can read from x or y gate; consumes word from x or y SP
- Buffered SA can write to gate, even if receiver cannot read
- Reading from gate when no message available blocks
- Automatic code generation for non-buffered SA hard, compiler must keep track of interprocessor synchronization
- Can view SP as an extension of memory with infinite capacity, but with sequential access

35 Multiprocessor (MP) Architectures Systolic Array Architecture (SAA)

36 Multiprocessor (MP) Architectures
Systolic Array Architecture (SAA)
- Note that each pathway, x or y, may be bi-directional
- May have any number of pathways, nothing magic about 2, x and y; could be 3 or more
- Possible to have I/O capability with each node
- Typical application: large polynomials of the form: $y = k_0 + k_1 x^1 + k_2 x^2 + \dots + k_{n-1} x^{n-1} = \sum_{i=0}^{n-1} k_i x^i$
- Next example shows a torus without displaying the wrap-around pathways across both dimensions
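
One way to picture polynomial evaluation on a systolic array is a line of cells, each holding one coefficient and performing a single multiply-add per cycle (Horner's rule), with values for x streamed in one per cycle. The sketch below is a software simulation under those assumptions; the cell layout and the `(x, partial)` word format are illustrative choices and do not describe the Warp or iWarp hardware.

```python
def systolic_polynomial(coefficients, xs):
    """Evaluate sum(k_i * x**i) for each x in xs on a simulated linear systolic array.

    coefficients[i] is k_i. Cell 0 holds the highest coefficient and the last
    cell holds k_0; every cycle each cell reads the word its left neighbour
    produced in the previous cycle, computes partial * x + k, and forwards it.
    """
    if not coefficients:
        raise ValueError("need at least one coefficient")
    cells = list(reversed(coefficients))
    n = len(cells)
    latches = [None] * n              # output register of each cell
    results = []
    for cycle in range(len(xs) + n):
        if latches[-1] is not None:   # a finished value leaves the array
            results.append(latches[-1][1])
        new_latches = [None] * n
        for c in range(n):
            if c == 0:
                incoming = (xs[cycle], 0) if cycle < len(xs) else None
            else:
                incoming = latches[c - 1]
            if incoming is not None:
                x, partial = incoming
                new_latches[c] = (x, partial * x + cells[c])
        latches = new_latches
    return results


# y = 1 + 2*x + 3*x^2 for x = 0, 1, 2
values = systolic_polynomial([1, 2, 3], [0, 1, 2])
print(values)
```

After the array has filled, one result emerges per cycle even though each individual evaluation takes n cycles.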

37 Multiprocessor (MP) Architectures Systolic Array Architecture (SAA)

38 Hybrid Architectures
Superscalar Architecture (SA)
- Replicates (duplicates) some operations in HW
- Seems like scalar architecture w.r.t. object code, can compute some operations of UP in parallel, e.g. fadd and fmult
- Is almost a parallel architecture, if it has multiple copies of some hardware units, say two fadd units
- Is not an MP architecture: ALU is not replicated
- Has multiple parts of an ALU, possibly multiple FPA units, or FPM units, and/or integer units
- Arithmetic operations simultaneous with load and store operations; note data dependence!
- Instruction fetch speculative, since number of parallel operations unknown; rule: fetch too much! But fetch no more than longest possible superscalar pattern

39 Hybrid Architectures
Superscalar Architecture (SA)
- Code sequence looks like sequence of instructions for scalar processor
- Example: ® code executed on Pentium® processors
- More famous and successful example: ® processor
- Object code can be custom-tailored by compiler; i.e. compiler can have superscalar target processor in mind, bias code emission, knowing that some code sequences are better suited for superscalar execution
- Fetch enough instruction bytes to support longest possible object sequence
- Decoding is a bottleneck for CISC, way easier for RISC: 32-bit units
- Sample of superscalar: i80860 could run in parallel one FPA, one FPM, two integer ops, and a load or store with ++ or -- (auto-increment or auto-decrement)

40 Hybrid Architectures Superscalar Architecture (SA)

41 Hybrid Architectures
Very Long Instruction Word Architecture (VLIW)
- Very Long Instruction Word, typically 128 bits or more
- VLIW machine also has scalar operations
- VLIW code is no longer scalar, but explicitly parallel
- Limitations like in superscalar:
  - VLIW is not a general MP architecture
  - subinstructions do not have concurrent memory access
  - dependences must be resolved before code emission
- But the VLIW opcode is designed to execute in parallel
- VLIW suboperations can be defined as no-op, thus just the other suboperations run in parallel
- Compiler/programmer explicitly packs parallelizable operations into VLIW instruction
- Just like horizontal microcode compaction
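
The packing step that the compiler performs can be sketched as a simple in-order greedy pass: an operation joins the current instruction word unless its functional-unit slot is already taken or it depends on a result produced in the same word, and unused slots become no-ops. The slot names (`fpa`, `fpm`, `int`, `mem`), the `Op` record, and the choice to check only RAW and WAW dependences are all assumptions for illustration; a real VLIW compiler also reorders operations and applies software pipelining.

```python
from dataclasses import dataclass

SLOTS = ("fpa", "fpm", "int", "mem")   # hypothetical slot layout of one VLIW word


@dataclass
class Op:
    name: str
    unit: str      # which slot the operation needs
    dest: str      # register written
    srcs: tuple    # registers read


def pack_vliw(ops, slots=SLOTS):
    """Pack operations, in program order, into VLIW words padded with 'nop'."""
    words = []
    current = {}
    for op in ops:
        if op.unit not in slots:
            raise ValueError(f"no slot for unit: {op.unit}")
        slot_taken = op.unit in current
        depends = any(
            prev.dest in op.srcs or prev.dest == op.dest
            for prev in current.values()
        )
        if slot_taken or depends:      # close the word, start a new one
            words.append(current)
            current = {}
        current[op.unit] = op
    if current:
        words.append(current)
    return [[w[s].name if s in w else "nop" for s in slots] for w in words]


ops = [
    Op("fadd", "fpa", "f1", ("f2", "f3")),
    Op("fmul", "fpm", "f4", ("f5", "f6")),
    Op("iadd", "int", "r1", ("r2", "r3")),
    Op("fmul2", "fpm", "f7", ("f1", "f4")),   # needs f1 and f4 from the first word
]
for word in pack_vliw(ops):
    print(word)
```

The last operation starts a second word because it consumes results of the first, which matches the rule that a subinstruction cannot use a result computed in the same VLIW instruction.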

42 Hybrid Architectures
VLIW Sample: Compute instruction of CMU Warp® and Intel® iWarp®
- Could be 1-bit (or few-bit) opcode for compute instruction; plus sub-opcodes for subinstructions
- Data dependence example: result of FPA cannot be used as operand for FPM in the same VLIW instruction
- But provided proper SW pipelining (not covered in CS 201) both subinstructions may refer to the same FP register
- Result of int1 cannot be used as operand for int2, etc.
- With SW pipelining both subinstructions may refer to same int register
- Thus, need to software-pipeline

43 Hybrid Architectures
Itanium EPIC Architecture
- Explicitly Parallel Instruction Computing
- Group instructions into bundles
- Straighten out the branches by associating predicate with instructions; avoids branch and executes speculatively
- Execute instructions in parallel, say the else clause and the then clause of an If Statement
- Decide at run time which of the predicates is true, and (post) complete just that path from multiple choices; discard others
- Use speculation to straighten branch tree
- Use rotating register file
- Has many registers, not just 64 GPRs

44 Hybrid Architectures
Itanium
- Groups and bundles lump multiple compute steps into one that can be run in parallel
- Parallel comparisons allow fast decisions
- Predication associates a condition (the predicate) with 2 simultaneously executed instruction sequences, only 1 of which will be posted
- Speculation fetches operands, not knowing for sure whether this results in use; branch may invalidate early fetch
- Branch elimination, straightens out code with jumps
- Branch prediction
- Large register file
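
Predication, as described here, replaces a branch by executing both alternatives and committing only the one whose predicate holds. The sketch below mimics that idea in software; it assumes both paths are free of side effects, and the function name is invented for this illustration. It does not model Itanium's predicate registers or bundle encoding.

```python
def predicated_select(condition, then_path, else_path, x):
    """Run both paths unconditionally, then post only the result the predicate selects."""
    predicate = bool(condition(x))
    then_result = then_path(x)   # executed regardless of the predicate
    else_result = else_path(x)   # executed regardless of the predicate
    return then_result if predicate else else_result


def branch_free_abs(x):
    """Absolute value without a branch around either computation."""
    return predicated_select(lambda v: v < 0, lambda v: -v, lambda v: v, x)


print(branch_free_abs(-7), branch_free_abs(4))
```

The cost is the extra work of the discarded path; the gain on real hardware is that no branch has to be predicted or can be mispredicted.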

45 Hybrid Architectures
Itanium
- Numerous branch registers; speeds up execution by having some branch destinations in register; fast to load into ip reg
- Multiple CFM registers, Current Frame Marker regs; avoid slowness due to memory access
- See separate lecture note
