PSU CS 106 Computing Fundamentals II Sample Architectures HM 4/14/2008.


Slide 2: Agenda (© Dr. Herbert G. Mayer)
- Single Accumulator Architecture
- General-Purpose Register Architecture
- Stack Machine Architecture
- Pipelined Architecture
- Vector Architecture
- Shared-Memory Multiprocessor Architecture
- Distributed-Memory Multiprocessor Architecture
- Systolic Architecture
- Superscalar Architecture
- VLIW Architecture

Slide 3: Single Accumulator Architecture
Single Accumulator Architecture (SAA):
- Single register to hold operation results, conventionally called the accumulator
- Accumulator used as the destination of arithmetic operations, and as (one) source
- Has a central processing unit, a memory unit, and a connecting memory bus
- The pc points to the next instruction (in memory) to be executed
- Sample: ENIAC
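The accumulator-as-source-and-destination idea can be sketched as a tiny interpreter. This is an illustrative model, not ENIAC's actual instruction set; the opcode names (`LOAD`, `ADD`, `STORE`) and the function name are assumptions made for the example.

```python
def run_saa(program, memory):
    """Minimal single-accumulator machine: one accumulator, a pc, and memory."""
    acc, pc = 0, 0
    while pc < len(program):
        op, addr = program[pc]
        pc += 1                    # pc now points to the next instruction
        if op == "LOAD":
            acc = memory[addr]     # accumulator is the implicit destination
        elif op == "ADD":
            acc += memory[addr]    # accumulator is both source and destination
        elif op == "STORE":
            memory[addr] = acc
    return memory
```

For example, `LOAD 0; ADD 1; STORE 2` over memory `[5, 7, 0]` leaves the sum 12 in location 2; every arithmetic result flows through the single accumulator.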

Slide 4: General-Purpose Register Architecture
General-Purpose Register (GPR) Architecture:
- Accumulates ALU results in more than one register; the number n is typically 4, 8, 16, ... 64
- Allows register-to-register operations, which are fast
- Essentially a multi-register extension of the SAA
- A two-address architecture specifies one source operand plus a destination
- A three-address architecture specifies two source operands plus a destination
- Variations allow additional index registers, base registers, etc.
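The two- versus three-address distinction can be made concrete with a sketch over a register file modeled as a list; the function names and the choice of `add` as the sample operation are illustrative assumptions.

```python
def exec_two_address_add(regs, dst, src):
    """Two-address form: dst <- dst + src.
    One explicit source; the destination doubles as the second source."""
    regs[dst] += regs[src]

def exec_three_address_add(regs, dst, src1, src2):
    """Three-address form: dst <- src1 + src2.
    Two explicit sources plus an independent destination."""
    regs[dst] = regs[src1] + regs[src2]
```

The three-address form preserves both inputs; the two-address form saves encoding bits but overwrites one operand.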

Slide 5: Stack Machine Architecture
Stack Machine Architecture (SA):
- AKA zero-address architecture, as operations need no explicit operands
- A pure stack machine has no registers; of course there are no pure SAs
- Hence performance will be slow/poor, as all operations involve memory
- However: implement the n top-of-stack elements as registers: a cache
- Sample architectures: Burroughs B5000, HP 3000
- Implement impure stack operations that bypass tos operand addressing
- Example code sequence to compute res := a * ( 145 + b )  -- high-level source
    push a
    pushlit 145
    push b
    add
    mult
    pop res
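The zero-address code above can be executed by a small interpreter sketch; the tuple-based instruction encoding and the `env` dictionary standing in for memory are assumptions made for this example.

```python
def run_stack(code, env):
    """Evaluate zero-address stack code; env maps variable names to values."""
    stack = []
    for instr in code:
        op = instr[0]
        if op == "push":              # push a variable's value
            stack.append(env[instr[1]])
        elif op == "pushlit":         # push a literal
            stack.append(instr[1])
        elif op == "add":             # operands come implicitly from the stack
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mult":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "pop":             # store top of stack into a variable
            env[instr[1]] = stack.pop()
    return env
```

Running the slide's sequence with a = 2 and b = 5 leaves res = 2 * (145 + 5) = 300; note that no instruction names a register, only the implicit stack.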

Slide 6: Pipelined Architecture
Pipelined Architecture (PA):
- Arithmetic Logic Unit (ALU) split into separate, sequential units
- Each of these can be initiated once per cycle
- Yet each subunit is implemented in HW just once
- Multiple subunits operate in parallel on different sub-ops, at different stages
- Ideally, all subunits require unit time (1 cycle)
- Ideally, all operations (add, fetch, store) take the same number of steps
- Non-unit times, i.e. differing numbers of cycles per operation, cause different termination moments
- Operation aborted in case of branch, exception, call, etc.
- Operation must stall in case of operand dependence; the stall is caused by the interlock
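Under the ideal assumptions above (unit-time stages, one initiation per cycle), pipeline latency follows a simple formula that this sketch computes; the stall parameter models interlock-induced bubbles and is an assumption of the example, not a full hazard model.

```python
def pipeline_cycles(n_instructions, n_stages, stalls=0):
    """Total cycles for an ideal pipeline: n_stages cycles to fill for the
    first instruction, then one completion per cycle, plus stall bubbles
    inserted by the interlock."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + stalls
```

For example, 10 instructions on a 5-stage pipeline ideally finish in 5 + 9 = 14 cycles instead of 50 sequential cycles; each interlock stall adds one cycle.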

Slide 7: Pipelined Architecture, Cont’d

Slide 8: Vector Architecture
Vector Architecture (VA):
- Vector registers implemented as HW arrays of identical registers, named vr0, vr1, etc.
- A VA may also have scalar registers, named r0, r1, etc.
- Vector registers can load/store a block of contiguous data; still in sequence, but overlapped; the number of steps to complete the load/store of a vector also depends on the width of the bus
- Vector registers can perform multiple operations of the same kind on blocks of operands; still sequentially, but overlapped, and all operands are readily available
- Otherwise the operation of a VA is similar to a GPR architecture

Slide 9: Vector Architecture, Cont’d
Sample operations:
    ldv   vr1, memi          -- loads e.g. 64 memory locations
    stv   vr2, memj          -- stores vr2 in 64 contiguous locs
    vadd  vr1, vr2, vr3      -- register-register vector addition
    cvaddf r0, vr1, vr2, vr3 -- has special, conditional meaning
Sequential equivalent of cvaddf:
    for i = 0 to 63 do
      if bit i in r0 is 1 then
        vr1[i] = vr2[i] + vr3[i]
      end if
    end for
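The sequential equivalent of `cvaddf` transcribes directly into Python; the function name and the use of an integer as the mask register are assumptions of this sketch (real vector hardware performs the lanes in overlapped fashion, not in a loop).

```python
def cvaddf(r0, vr1, vr2, vr3):
    """Conditional vector add: vr1[i] = vr2[i] + vr3[i] wherever bit i of
    the scalar mask register r0 is 1; other elements are left unchanged."""
    for i in range(len(vr1)):
        if (r0 >> i) & 1:
            vr1[i] = vr2[i] + vr3[i]
    return vr1
```

With mask 0b101 only lanes 0 and 2 are updated, which is how a VA vectorizes a loop containing an if statement without branching per element.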

Slide 10: Shared-Memory Multiprocessor
Shared Memory Architecture (SMA):
- Equal access to memory for all n processors, p0 to pn-1; possible to have additional, local memories
- If there are multiple, simultaneous accesses, only one will succeed in accessing shared memory
- Simultaneous access must be resolved deterministically
- The von Neumann bottleneck becomes even tighter than for a conventional uniprocessor system
- If locality is good, there are about twice as many loads as stores, and there are many arithmetic operations relative to memory accesses, then resource utilization is good; typically # loads = (2 to 3) times the # stores
- Else some processors idle due to memory conflicts
- Typical number of processors is n = 4, but n = 8 and greater is possible, with a large 2nd-level, even 3rd-level cache
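The rule that only one of several simultaneous accesses succeeds can be modeled with threads sharing one memory object and a lock standing in for the bus arbiter; the thread-per-processor mapping and the function names are assumptions of this sketch, not a claim about any particular SMA's hardware.

```python
from threading import Thread, Lock

def increment_shared(n_procs, n_ops):
    """n_procs 'processors' each add n_ops to one shared memory cell.
    The lock arbitrates simultaneous accesses, so exactly one access
    succeeds at a time and the final value is deterministic."""
    mem = {"x": 0}
    bus = Lock()
    def proc():
        for _ in range(n_ops):
            with bus:          # serialize the shared-memory access
                mem["x"] += 1
    threads = [Thread(target=proc) for _ in range(n_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mem["x"]
```

Without the lock the read-modify-write sequences could interleave and lose updates; with it, 4 processors doing 1000 increments always yield 4000, at the cost of processors idling while waiting, which is the memory-conflict idling the slide mentions.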

Slide 11: Shared-Memory Multiprocessor

Slide 12: Distributed-Memory Multiprocessor
Distributed Memory Architecture (DMA):
- All memories are private
- Hence each processor pi always has access to its own memory memi
- However, the collection of all memories forms the program’s logical data space
- Thus, processors must access other processors’ memories, done via message passing or virtual shared memory
- Messages must be routed and a route determined; a route may be long
- Blocking when a message is expected but hasn’t arrived yet
- Blocking when a message is to be sent, but the destination cannot receive
- Growing the message buffer size increases the illusion of asynchronicity between sending and receiving operations
- Key parameter: the time and packaging overhead to send an empty message
- A message may also be delayed by network congestion

Slide 13: Distributed-Memory Multiprocessor

Slide 14: Systolic Array Multiprocessor
Systolic Array (SA) Architecture:
- Each processor has private memory
- The network is defined by the systolic pathway (SP)
- Each node is connected via the SP to some subset of the other processors
- Node connectivity is determined by the implemented/selected network topology
- The systolic pathway is a high-performance network; sending and receiving may be synchronized (blocking) or asynchronous (received data are buffered)
- Typical network topologies: ring, torus, hex grid, mesh, etc.
- The sample below is a ring; note that the wrap-around pathways along the x and y dimensions are not shown
- A processor can write to an x or y gate; this sends the word off on the x or y SP
- A processor can read from an x or y gate; this consumes the word from the x or y SP
- A buffered SA can write to a gate even if the receiver cannot read
- Reading from a gate when no message is available blocks
- Automatic code generation for a non-buffered SA is hard; the compiler must keep track of interprocessor synchronization
- The SP can be viewed as an extension of memory with infinite capacity, but with sequential access

Slide 15: Systolic Array Multiprocessor

Slide 16: Systolic Array Multiprocessor
- Note that each pathway, x or y, may be bi-directional
- There may be any number of pathways; nothing is magic about 2, x and y
- It is possible to have I/O capability with each node
- The next example shows a torus (without displaying the wrap-around pathways)

Slide 17: Hybrid Multiprocessor - Superscalar
Superscalar (SSA) Architecture:
- Is a scalar architecture w.r.t. the object code
- Is a parallel architecture, with multiple copies of some hardware units
- Has multiple ALUs, possibly an FP adder (FPA), an FP multiplier (FPM), and more than one integer unit
- Arithmetic operations run simultaneously with load and store operations
- The code sequence looks like a sequence of instructions for a scalar processor
- Object code can be custom-tailored by the compiler
- Fetches enough instruction bytes to support the longest possible object sequence
- Decoding is a bottleneck for CISC, easy for RISC with its fixed 32-bit units
- Sample superscalar: the i80860 has FPA, FPM, 2 integer ops, and load/store with pre-/post-increment and -decrement

Slide 18: Hybrid Multiprocessor - SSA & PA: Pipelined + Superscalar Architecture

Slide 19: Hybrid Multiprocessor - VLIW
VLIW Architecture:
- Very Long Instruction Word: typically 128 bits or more; below 128: LIW
- Object code is no longer purely scalar
- Some special-select opcodes are designed to support parallel execution
- The compiler/programmer explicitly packs VLIW ops
- Other opcodes are still scalar and can coexist with VLIW instructions
- Scalar operation is possible by placing no-ops into some VLIW fields
- Sample: the compute instruction of the CMU Warp® and Intel iWarp®
- Data dependence example: the result of the FPA cannot be used as an operand for the FPM in the same VLIW instruction
- Thus one needs to software-pipeline; not discussed in CS 106
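The data-dependence rule and the role of no-ops can be modeled by executing one VLIW word whose slots all read the register file as it was on entry to the word; the slot encoding, slot count, and function name are assumptions of this sketch, not the Warp/iWarp format.

```python
def execute_vliw(word, regs):
    """Execute one VLIW word. Each slot is (dst, fn, srcA, srcB) or None
    (a no-op filling an unused field). All slots read the register values
    from before the word, so a result produced in this word cannot feed
    another slot of the same word."""
    old = dict(regs)               # register file as of word entry
    for slot in word:
        if slot is None:           # no-op: the field does nothing
            continue
        dst, fn, a, b = slot
        regs[dst] = fn(old[a], old[b])
    return regs
```

In the test below, an FPA-style slot writes r3 = r1 + r2 while an FPM-style slot in the same word multiplies using r3; the multiply sees the old r3 (0.0), illustrating why such a dependence must be split across words (or software-pipelined).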

Slide 20: Hybrid Multiprocessor - One Single VLIW Instruction

Slide 21: Hybrid Multiprocessor - EPIC
EPIC Architecture (EA):
- Groups instructions into bundles
- Straightens out branches by associating a predicate with instructions
- Executes instructions in parallel, say the then-clause and the else-clause of an if statement
- Decides at run time which of the predicates is true, and keeps just that part
- Uses speculation to straighten the branch tree
- Uses a rotating register file, AKA register windows
- Provides many registers, not just 64
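The predication idea, computing both clauses and keeping only the one whose predicate is true, can be sketched as follows. This models the effect in plain Python for numeric clauses; real EPIC hardware attaches predicate registers to individual instructions, and the function name is an assumption of this example.

```python
def predicated_select(cond, then_val, else_val):
    """Branch-free selection: both clauses are 'executed'; predicates p and
    1 - p decide at run time which result survives."""
    p = 1 if cond else 0
    then_result = then_val * p        # contributes only under predicate p
    else_result = else_val * (1 - p)  # contributes only under predicate !p
    return then_result + else_result
```

Because no branch is taken, the instruction stream is straight-line code and both clauses can issue in parallel; the mispredicted-branch penalty is traded for executing some instructions whose results are discarded.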