CSE 586 Computer Architecture Review


CSE 586 Computer Architecture Review
Jean-Loup Baer
http://www.cs.washington.edu/education/courses/586/00sp
CSE 586 Spring 00

Performance of Computer Systems
- Performance metrics
  - Use (weighted) arithmetic means for execution times
  - Use (weighted) harmonic means for rates
- CPU exec. time = Instruction count * (sum over i of CPI_i * f_i) * clock cycle time
- We talk about "contributions to the CPI" from, e.g.:
  - Hazards in the pipeline
  - Cache misses
  - Branch mispredictions, etc.
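The execution-time equation above can be sketched as a small calculation. The numbers below (base CPI, stall contributions, clock period) are purely illustrative, not from the course:

```python
def cpu_time(instruction_count, cpi_contributions, cycle_time):
    """cpi_contributions: list of (cpi_i, f_i) pairs; the overall CPI is
    the sum of the per-class contributions CPI_i * f_i."""
    overall_cpi = sum(cpi * f for cpi, f in cpi_contributions)
    return instruction_count * overall_cpi * cycle_time

# Hypothetical contributions to the CPI, each averaged over all instructions:
contribs = [(1.0, 1.0),   # base: every instruction costs at least one cycle
            (0.5, 1.0),   # stall cycles due to cache misses
            (0.2, 1.0)]   # stall cycles due to branch mispredictions
t = cpu_time(1_000_000, contribs, 1e-9)   # 1M instructions, 1 ns clock
```

Each (CPI_i, f_i) pair is one "contribution to the CPI" in the sense used above: hazards, misses, and mispredictions each add their share to the effective CPI.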

ISA's (RISC, CISC, EPIC)
- In RISC, R stands for:
  - Restricted (relatively small number of opcodes)
  - Regular (all instructions have the same length)
  - And also: few instruction formats and addressing modes
- RISC and load-store architectures are synonymous
- CISC
  - Fewer instructions executed, but the CPI per instruction is larger
  - More complex to design
- VLIW-EPIC
  - Compiler-based exploitation of ILP

Basic Pipeline Model
- Basic pipelining (e.g., DLX in the book; MIPS 3000)
  - 5 stages: IF, ID, EX, MEM, WB
  - Pipeline registers between stages keep the data/control info needed in subsequent stages
- Hazards
  - Structural (won't happen in the basic pipeline but will in multiple-pipeline machines)
  - Data dependencies
    - Most can be removed via forwarding
    - Otherwise stall (insert bubbles)
  - Control

[Figure: the 5-stage datapath (IF, ID/RR, EXE, Mem, WB) with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, the instruction memory, register file, ALU, and data memory, plus the forwarding, stall, and control units.]

Branch Prediction
- Modern processors use dynamic branch prediction
  - Becomes increasingly important because of deep pipes and multiple issue of instructions
- BPT (Branch Prediction Table)
  - Prediction occurs during the ID cycle
  - BPT either indexed by some bits of the PC or organized cache-like
  - BPT: either a separate table or part of the "metadata" of the I-cache
  - Uses 2-bit saturating counters for the prediction per se

2-bit Saturating Counter Scheme
- Property: it takes two wrong predictions before the prediction changes from T to NT (and vice versa)
- [Figure: the four-state transition diagram over two "predict taken" and two "predict not taken" states, driven by taken/not-taken outcomes; the initial state (generally strongly taken) is marked.]
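The counter scheme above can be sketched in a few lines. The 0-3 state encoding (0-1 predict not-taken, 2-3 predict taken) is a common convention, assumed here rather than taken from the slides:

```python
class TwoBitCounter:
    """A 2-bit saturating counter: two consecutive mispredictions are
    needed to flip the prediction, matching the property stated above."""

    def __init__(self, state=3):   # 3 = strongly taken, a common initial state
        self.state = state

    def predict(self):
        return self.state >= 2     # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)   # saturate at strongly taken
        else:
            self.state = max(0, self.state - 1)   # saturate at strongly not-taken
```

A single not-taken outcome moves "strongly taken" to "weakly taken" without changing the prediction; only the second one flips it.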

More Elaborate Branch Prediction
- BTB (Branch Target Buffer) = BPT + target address
  - Prediction and target-address "computation" occur during the IF cycle
  - Possibility of decoupling a (large) BPT and a (smaller) BTB
- Correlated -- or 2-level -- branch prediction
  - Relies on the history of outcomes of previous branches to predict the current branch
  - Many variations on the number of (shift) registers recording branch history and the number of Pattern History Tables (PHTs) storing the 2-bit saturating counters
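One common point in that design space is a gshare-style predictor: a single global history shift register XORed with PC bits to index one PHT of 2-bit counters. This is a minimal sketch of that variant (the table size and the use of XOR indexing are assumptions, not details from the slides):

```python
PHT_BITS = 10
PHT_SIZE = 1 << PHT_BITS

class Gshare:
    """A 2-level predictor: global history register + one PHT of
    2-bit saturating counters, indexed by history XOR PC bits."""

    def __init__(self):
        self.history = 0                  # global branch-history shift register
        self.pht = [2] * PHT_SIZE         # counters start weakly taken

    def _index(self, pc):
        return ((pc >> 2) ^ self.history) & (PHT_SIZE - 1)

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)
        else:
            self.pht[i] = max(0, self.pht[i] - 1)
        # Shift the outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & (PHT_SIZE - 1)
```

Varying the number of history registers (global vs. per-branch) and the number of PHTs gives the family of 2-level schemes mentioned above.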

Decoupled BTB
[Figure: (1) the PC accesses the BPT (tag + history bits); (2) if the prediction is Taken, the BTB (tag + next address) is accessed; (3) on a match, the target address is available. Note: the BPT does not require a tag, so it could be much larger than the BTB.]

Extensions to the Single Pipe Model
- Basic pipelining
  - How to handle precise exceptions
- Single-issue processor with multiple pipes
  - How to handle sharing the WB stage
  - How to avoid WAW hazards

[Figure: a single-issue pipeline (IF, ID) feeding several execution pipes: integer EX (latency 0), a pipelined f-p multiplier M1-M7 (latency 7), a pipelined f-p adder A1-A4 (latency 3), and a non-pipelined divider (latency 25), all sharing Mem and WB. Operands are needed at the beginning of a cycle and results are ready at its end.]

Exploiting Instruction Level Parallelism
- ILP: where the compiler can optimize
  - Loop unrolling and software pipelining
  - Speculative execution
- ILP: dynamic scheduling in a single-issue machine
  - Scoreboard -- centralized control unit
  - Tomasulo's algorithm -- decentralized control

Scoreboard -- The Example Machine
[Figure: registers and functional units (pipes) connected by data buses, with the scoreboard driving the control/status lines.]

Scoreboard
- The scoreboard keeps a record of all data dependencies
- The scoreboard keeps a record of all functional-unit occupancies
- The scoreboard decides if an instruction can be issued
- The scoreboard decides if an instruction can store its result
- Implementation-wise, the scoreboard keeps track of which registers are used as sources and destinations and which functional units use them

Example Machine using Tomasulo's Algorithm
[Figure: load buffers (from memory) and reservation stations (from the I-unit) feed the f-p units; results return on the common data bus to the f-p registers, the reservation stations, and the store buffers (to memory).]

Tomasulo's Algorithm
- Decentralized control
- Use of reservation stations to buffer and/or rename registers (hence gets rid of WAW and WAR hazards)
- Results -- and their names -- are broadcast to the reservation stations and the register file
- Instructions are issued in order but can be dispatched, executed, and completed out of order
- Issue, Execute, Write stages

Register Renaming
- Goal: avoid WAW and WAR hazards
- Performed at "decode" time to rename the result register
- Two basic implementation schemes:
  - Have a separate physical register file
  - Use of a reorder buffer (to preserve in-order completion) and reservation stations
- Often a mix of the two
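The first scheme (a separate physical register file) can be sketched with a rename table consulted at decode: each result register gets a fresh physical register, so two writes to the same architectural register no longer conflict. This is a bare-bones model; free-list management and register reclamation are omitted:

```python
class Renamer:
    """Rename-at-decode with a separate physical register file."""

    def __init__(self, arch_regs, num_phys):
        # Initially each architectural register maps to its own physical one.
        self.map = {r: i for i, r in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_phys))   # free physical regs

    def rename(self, dest, srcs):
        # Sources read the current mapping *before* the destination is remapped,
        # which is what removes WAR hazards.
        src_phys = [self.map[s] for s in srcs]
        p = self.free.pop(0)          # fresh physical reg (assumes one is free)
        self.map[dest] = p            # later readers of dest see the new name
        return p, src_phys
```

Two successive writes to the same architectural register receive distinct physical destinations, so the WAW ordering constraint between them disappears.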

Example Machine (Tomasulo-like) Revisited
[Figure: the Tomasulo machine extended with a reorder buffer; reservation stations are fed from the I-unit and from memory & the CDB, f-p unit results go to the CDB, and the reorder buffer sits between the f-p registers and memory.]

The Commit Step (In-order Completion)
- A fourth stage: Commit
- Need a mechanism (reorder buffer) to:
  - "Complete" instructions in order; this commits the instruction
    - Since this is a multiple-issue machine, it should be able to commit (retire) several instructions per cycle
  - Know when an instruction has completed non-speculatively (head of the buffer)
  - Know whether the result of an instruction is correct, i.e., flush the reorder buffer when there are incorrectly predicted branches and exceptions
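The mechanism above can be sketched as a FIFO of in-flight instructions: entries are allocated in program order, marked done out of order, and retired only from the head, several per cycle up to a commit width. This is a minimal model (the 2-wide commit is an illustrative choice):

```python
from collections import deque

class ReorderBuffer:
    """In-order commit over out-of-order completion."""

    def __init__(self):
        self.rob = deque()                      # entries in program order

    def issue(self, name):
        entry = {"name": name, "done": False}
        self.rob.append(entry)                  # allocated at the tail, in order
        return entry

    def complete(self, entry):
        entry["done"] = True                    # may happen out of program order

    def commit(self, width=2):
        # Retire from the head only, at most `width` instructions per cycle.
        retired = []
        while self.rob and self.rob[0]["done"] and len(retired) < width:
            retired.append(self.rob.popleft()["name"])
        return retired
```

An instruction at the head that is done is known to be non-speculative; on a mispredicted branch or exception, everything younger than the faulting entry would simply be dropped from the deque.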

Multiple Issue Implications
- Will increase throughput
- The instruction fetch step requires buffering and can become a critical point in the design
- The commit stage must be able to retire multiple instructions in a given cycle
- Decoding, issuing, and dispatching can encounter more structural hazards

VLIW-EPIC
- The compiler plays a major role in scheduling operations
- Merced/Itanium implementation
  - "Bundles" of predicated instructions
  - Large register files with "rotating" registers to facilitate loop unrolling, software pipelining, and call/return paradigms
  - Predication and sophisticated branch prediction
  - Powerful floating-point units
  - SIMD instructions for 3D graphics and multimedia

Predication
- Partial predication (conditional moves)
- Full predication (predicate definitions); unconditional and OR predicates
- Used extensively in Merced/Itanium

Memory Hierarchy
- Memory hierarchies "work" because of the principle of locality
  - Temporal and spatial locality
- Two main interfaces in the memory hierarchy:
  - Caches -- main memory
  - Main memory -- disk (secondary memory)
- The same questions arise at both interfaces:
  - Size, placement, retrieval, replacement, and timing of the information being transferred

Caches
- Cache organizations
  - Direct-mapped, fully-associative, set-associative
  - Decomposition of the address for hit/miss detection
  - Write-through vs. write-back; write-around and write-allocate
  - The 3 C's
- Cache performance
  - Metrics: CPIc, average memory access time
  - Examples of naïve analysis
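The address decomposition for hit/miss detection can be sketched for a direct-mapped cache: low bits select the byte within the block, the next bits index the set, and the rest form the tag compared on lookup. The geometry below (32-byte blocks, 256 lines) is illustrative:

```python
BLOCK_SIZE = 32      # bytes per block -> 5 offset bits
NUM_SETS   = 256     # lines in a direct-mapped cache -> 8 index bits

OFFSET_BITS = BLOCK_SIZE.bit_length() - 1
INDEX_BITS  = NUM_SETS.bit_length() - 1

def decompose(addr):
    """Split an address into (tag, index, block offset)."""
    offset = addr & (BLOCK_SIZE - 1)
    index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

A hit means the tag stored at line `index` matches `tag`; higher associativity compares the tag against every way of the selected set instead of one line.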

Cache Performance
- Improving performance by giving more "associativity"
  - Victim caches; column-associative caches; skewed-associative caches
- Reducing conflict misses
  - Interaction with the O.S.: page coloring
  - Interaction with the compiler: code placement
- Improving performance by tolerating memory latency
  - Prefetching
  - Write buffers
  - Critical word first
  - Sector caches
  - Lock-up free caches
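The "naïve analysis" mentioned earlier usually centers on average memory access time, AMAT = hit time + miss rate * miss penalty. A minimal sketch, with purely illustrative cycle counts and miss rates:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical L1: 1-cycle hit, 5% miss rate, 100-cycle memory penalty.
base = amat(1, 0.05, 100)

# A latency-tolerating technique that hides half the penalty
# (e.g., critical word first / a write buffer) would look like:
improved = amat(1, 0.05, 50)
```

The techniques listed above attack the three terms separately: associativity and page coloring lower the miss rate, while prefetching, write buffers, and lock-up free caches shrink or hide the penalty.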

Main Memory
- DRAM basics
- Interleaving
  - Low-order bits for reading consecutive words in parallel
  - "Middle" bits for banks, allowing concurrent access by several devices
- Page mode and SDRAMs
- Processor-In-Memory paradigm (IRAM, Active Pages)
- Rambus

Virtual Memory
- Paging and segmentation
- Page tables
- TLBs
- Address translation

From Virtual Address to Memory Location (highly abstracted)
[Figure: the virtual address produced by the ALU goes to the TLB; on a TLB hit the physical address accesses the cache; a cache hit returns the data, and TLB or cache misses go to main memory.]

Hardware-Software Interactions for Paging Systems
- TLBs
  - Misses handled either in hardware or in software
- Page fault: detection and termination
- Context switch (exception)
- I/O interrupt
- Choice of a (or several) page size(s)
- Virtually addressed caches -- synonyms
- Protection
- I/O and caches (software and hardware solutions)
  - Cache coherence
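The TLB-in-front-of-the-page-table interaction can be sketched as follows. A dict stands in for the page table (the walk could equally be done in software on a TLB miss, as noted above); the 4 KB page size is an illustrative choice:

```python
PAGE_SIZE = 4096

def translate(vaddr, tlb, page_table):
    """Translate a virtual address, refilling the TLB on a miss."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in tlb:                       # TLB miss: walk the page table
        if vpn not in page_table:
            raise MemoryError("page fault")  # the OS would handle this
        tlb[vpn] = page_table[vpn]           # refill the TLB with the mapping
    return tlb[vpn] * PAGE_SIZE + offset     # physical frame + page offset
```

Only the virtual page number is translated; the page offset passes through unchanged, which is what makes virtually indexed caches (and their synonym problem) possible.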

I/O
- I/O architecture (CPU-memory and I/O buses)
- Disks (access-time components)
- Buses (arbitration, transactions, split transactions)
- I/O hardware-software interface
  - DMA
- Disk arrays (RAID)

Parallel Processing
- Flynn's taxonomy: {Single Instr., Multiple Instr.} x {Single Data, Multiple Data}
- MIMD machines -- shared-memory multiprocessors
  - UMA
  - NUMA-cc
  - DSM
- MIMD machines -- message-passing systems
  - Multicomputers
  - Synchronous vs. asynchronous message passing

Shared-bus Systems
- SMPs
- Cache coherence using snoopy protocols
  - Write-update protocols (Dragon)
  - Write-invalidate protocols (Illinois)
- Cache coherence misses
  - Impact of capacity and block sizes
- Multilevel inclusion property

NUMA Machines
- Interconnection networks for tightly-coupled systems
  - Centralized vs. decentralized switches
  - Centralized switches
    - Crossbar
    - Perfect shuffle -- Omega and Butterfly networks
  - Decentralized switches
    - Meshes and tori
- Performance metrics
  - Bandwidth; bisection bandwidth; latency
- Routing and flow control

Directory-based Cache Coherence
- Full directory
- Partial directory
  - 2-bit
  - Coarse directories
- Basic protocols
- SCI
  - Directory in the caches
- COMA architecture

Synchronization
- Locking and barriers
- Primitives for implementing locking
  - Test-and-Set
  - Fetch-and-Φ
  - Full/empty bits
  - Load-locked and store-conditional
- Spin locks
  - Test and Test-and-Set
  - Queuing locks
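The test and test-and-set spin lock can be sketched as below. Python has no real atomic test-and-set, so `Lock.acquire(blocking=False)` stands in for it; the point is the spin structure itself: spin on an ordinary read first, and only attempt the (expensive, bus-invalidating) atomic operation when the lock looks free:

```python
import threading

class SpinLock:
    """Test and test-and-set: the emulation of the atomic primitive is a
    stand-in; on real hardware this would be a test-and-set instruction."""

    def __init__(self):
        self._flag = threading.Lock()   # stands in for the atomic flag
        self.held = False               # cheap-to-read copy for the "test" loop

    def acquire(self):
        while True:
            while self.held:            # test: spin on a plain (cached) read
                pass
            if self._flag.acquire(blocking=False):   # test-and-set attempt
                self.held = True
                return                  # lost races simply loop and re-test

    def release(self):
        self.held = False
        self._flag.release()
```

Spinning on the plain read keeps waiting processors hitting in their own caches; a naive test-and-set loop would instead generate coherence traffic on every attempt, which is why the two-phase form (and, further, queuing locks) scales better.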

Models of Memory Consistency
- Sequential consistency
- Relaxed models
  - Weak ordering
  - Release consistency