CSE 586 Computer Architecture Review

1 CSE 586 Computer Architecture Review
Jean-Loup Baer CSE 586 Spring 00

2 Performance of Computer Systems
Performance metrics
  Use (weighted) arithmetic means for execution times
  Use (weighted) harmonic means for rates
CPU exec. time = Instruction count * (Σ_i CPI_i * f_i) * clock cycle time
  where CPI_i is the CPI of instruction class i and f_i is its relative frequency
We talk about “contributions to the CPI” from, e.g.,
  Hazards in the pipeline
  Cache misses
  Branch mispredictions
  etc.
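
A minimal Python sketch of the two rules on this slide: the CPU time equation with a frequency-weighted CPI, and a weighted harmonic mean for rates. The function names and the sample numbers are illustrative assumptions, not taken from the course material.

```python
def cpu_time(instruction_count, classes, cycle_time):
    """CPU time = IC * (sum of CPI_i * f_i) * clock cycle time.

    classes is a list of (cpi_i, f_i) pairs; the f_i are assumed to sum to 1.
    """
    average_cpi = sum(cpi * freq for cpi, freq in classes)
    return instruction_count * average_cpi * cycle_time


def weighted_harmonic_mean(rates, weights):
    """Mean for rates (e.g., MFLOPS), weighted by the work done at each rate."""
    return sum(weights) / sum(w / r for r, w in zip(rates, weights))


# Hypothetical mix: 50% ALU (CPI 1), 30% loads/stores (CPI 2), 20% branches (CPI 3).
mix = [(1, 0.5), (2, 0.3), (3, 0.2)]
seconds = cpu_time(1_000_000_000, mix, 1e-9)   # 1 ns clock cycle

# Equal amounts of work done at 10 and 40 units per second.
mean_rate = weighted_harmonic_mean([10, 40], [1, 1])
```

The harmonic mean gives 16 here rather than the arithmetic 25, because the slower run takes four times longer for the same work.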

3 ISA’s (RISC, CISC, EPIC)
In RISC, R stands for:
  Restricted (relatively small number of opcodes)
  Regular (all instructions have same length)
  And also, few instruction formats and addressing modes
  RISC and load-store architectures are synonymous
CISC
  Fewer instructions executed but CPI/instruction is larger
  More complex to design
VLIW-EPIC
  Compiler-based exploitation of ILP

4 Basic Pipeline Model
Basic pipelining (e.g., DLX in the book; MIPS 3000)
  5 stages: IF, ID, EX, MEM, WB
  Pipeline registers between stages to keep data/control info needed in subsequent stages
Hazards
  Structural (won’t happen in basic pipeline but will in multiple pipeline machines)
  Data dependencies
    Most can be removed via forwarding
    Otherwise stall (insert bubbles)
  Control

5 [Figure: five-stage pipeline datapath. Stages IF, ID/RR, EXE, Mem, WB, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Components shown: PC, instruction memory, registers, ALU, data memory, forwarding unit, stall unit, control unit]

6 Branch Prediction
Modern processors use dynamic branch prediction
  Becomes increasingly important because of deep pipes and multiple issue of instructions
BPT (Branch prediction table)
  Prediction occurs during ID cycle
  BPT either indexed by some bits of the PC or organized cache-like
  BPT: either separate table or part of the “metadata” of the I-cache
  Use of 2-bit saturating counters for the prediction per se

7 2-bit Saturating Counter Scheme
Property: takes two wrong predictions before it changes T to NT (and vice-versa)
[Figure: four-state diagram with two “predict taken” states and two “predict not taken” states, linked by “taken” and “not taken” transitions; one state is marked “Generally, this is the initial state”]
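
A minimal sketch of a 2-bit saturating counter, assuming the common encoding in which states 0 and 1 predict not taken and states 2 and 3 predict taken. The class name, the encoding, and the choice of state 3 as the initial state are assumptions for illustration; the slide's diagram may use a different variant. With this encoding, the "two wrong predictions" property holds when starting from a saturated state (0 or 3).

```python
class TwoBitCounter:
    """Saturating counter: states 0,1 predict not taken; 2,3 predict taken."""

    def __init__(self, state=3):
        self.state = state  # assumed initial state: strongly taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(counter, outcomes):
    wrong = 0
    for taken in outcomes:
        if counter.predict() != taken:
            wrong += 1
        counter.update(taken)
    return wrong


# A loop branch: taken 9 times, then falls through, executed 3 times.
loop_outcomes = ([True] * 9 + [False]) * 3
loop_misses = count_mispredictions(TwoBitCounter(), loop_outcomes)
```

Only the loop exits are mispredicted: the single not-taken outcome moves the counter from state 3 to state 2, which still predicts taken on re-entry.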

8 More Elaborate Branch Prediction
BTB (Branch Target Buffer) = BPT + Target address
  Prediction and target address “computation” occur during IF cycle
  Possibility of decoupling a (large) BPT and a (smaller) BTB
Correlated -- or 2-level -- branch prediction
  Relies on history of outcome of previous branches to predict current branch
  Many variations on the number of (shift) registers recording branch history and the number of Pattern History Tables (PHT) storing the 2-bit saturating counters
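
One of the many possible variations can be sketched as follows: a single global history shift register indexing a single PHT of 2-bit saturating counters. The class name, the 2-bit history length, and the "weakly taken" initial counter value are assumptions chosen for illustration, not a description of any particular processor.

```python
class TwoLevelPredictor:
    """One global history register indexing one Pattern History Table."""

    def __init__(self, history_bits=2):
        self.history_bits = history_bits
        self.history = 0                       # shift register of recent outcomes
        self.pht = [2] * (1 << history_bits)   # 2-bit counters, assumed start: weakly taken

    def predict(self):
        return self.pht[self.history] >= 2

    def update(self, taken):
        counter = self.pht[self.history]
        self.pht[self.history] = min(3, counter + 1) if taken else max(0, counter - 1)
        # Shift the outcome into the history, keeping only history_bits bits.
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def count_mispredictions(predictor, outcomes):
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong


# A branch that strictly alternates taken / not taken.
alternating = [True, False] * 10
misses = count_mispredictions(TwoLevelPredictor(history_bits=2), alternating)
```

After a one-branch warm-up the history pattern identifies which half of the alternation comes next, so the pattern is predicted correctly from then on; a lone 2-bit counter would mispredict every not-taken outcome of this sequence.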

9 Decoupled BTB
[Figure: the PC accesses a BPT (history field) and a BTB (tag and next address)]
(1) Access BPT
(2) If predict T then access BTB
(3) If match then have target address
Note: the BPT does not require a tag, so could be much larger

10 Extensions to Single Pipe Model
Basic pipelining
  How to handle precise exceptions
Single issue processor with multiple pipes
  How to handle sharing the WB stage
  How to avoid WAW hazards

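11 [Figure: single-issue pipeline with multiple functional-unit pipes. IF and ID feed one of the pipes below, followed by Me and WB]
  EX (e.g., integer; latency 0)
  F-p mul, stages M1 to M7 (latency 7)
  F-p add, stages A1 to A4 (latency 3)
  Div (e.g., not pipelined; latency 25)
Both needed at beginning of cycle and ready at end of cycle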

12 Exploiting Instruction Level Parallelism
ILP: where can the compiler optimize
  Loop unrolling and software pipelining
  Speculative execution
ILP: Dynamic scheduling in a single issue machine
  Scoreboard -- Centralized control unit
  Tomasulo’s algorithm -- Decentralized control

13 Scoreboard -- The example machine
[Figure: registers, data buses, functional units (pipes), and the scoreboard, which is connected to the other components by control/status lines]

14 Scoreboard
The scoreboard keeps a record of all data dependencies
The scoreboard keeps a record of all functional unit occupancies
The scoreboard decides if an instruction can be issued
The scoreboard decides if an instruction can store its result
Implementation-wise, scoreboard keeps track of which registers are used as sources and destinations and which functional units use them
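
The issue decision can be sketched in a few lines, assuming the classic rule that an instruction issues only if its functional unit is free (no structural hazard) and no in-flight instruction writes the same destination register (no WAW hazard). This is a deliberately reduced model: operand read (RAW) and result write (WAR) checks are left out, and all names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Scoreboard:
    busy_units: set = field(default_factory=set)       # occupied functional units
    pending_dest: dict = field(default_factory=dict)   # dest register -> producing unit

    def can_issue(self, unit, dest):
        # Issue only if the unit is free and no in-flight instruction writes dest.
        return unit not in self.busy_units and dest not in self.pending_dest

    def issue(self, unit, dest):
        if not self.can_issue(unit, dest):
            return False  # stall: structural or WAW hazard
        self.busy_units.add(unit)
        self.pending_dest[dest] = unit
        return True

    def complete(self, unit):
        # The unit has written its result: free it and clear its destination.
        self.busy_units.discard(unit)
        for reg, producer in list(self.pending_dest.items()):
            if producer == unit:
                del self.pending_dest[reg]


sb = Scoreboard()
issued_mul = sb.issue("fp_mul", "F0")      # issues
stalled_waw = sb.issue("fp_add", "F0")     # stalls: F0 already has a pending writer
stalled_unit = sb.issue("fp_mul", "F2")    # stalls: multiplier is busy
sb.complete("fp_mul")
issued_add = sb.issue("fp_add", "F0")      # issues once the multiply has written F0
```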

15 Example Machine using Tomasulo Algorithm
[Figure: components shown: load buffers (from memory), instruction input (from I-unit), Fp registers, reservation stations, F-p units, common data bus, store buffers (to memory)]

16 Tomasulo’s algorithm
Decentralized control
Use of reservation stations to buffer and/or rename registers (hence gets rid of WAW and WAR hazards)
Results -- and their names -- are broadcast to reservation stations and register file
Instructions are issued in order but can be dispatched, executed and completed out-of-order
Issue, Execute, Write stages

17 Register Renaming
Goal: avoid WAW and WAR hazards
Is performed at “decode” time to rename the result register
Two basic implementation schemes
  Have a separate physical register file
  Use of reorder buffer (to preserve in-order completion) and reservation stations
Often a mix of the two
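
A minimal sketch of the first scheme (a separate physical register file), assuming a map table from architectural to physical registers and an unbounded supply of free physical registers. Real designs recycle physical registers at commit; that step is omitted here, and the function name and tuple layout are hypothetical.

```python
def rename(instructions, num_arch_regs=4):
    """Rename (dest, src1, src2) architectural registers to physical ones.

    Sources are looked up in the current map; each destination gets a fresh
    physical register, which removes WAW and WAR hazards and keeps RAW ones.
    """
    mapping = {r: r for r in range(num_arch_regs)}  # arch reg -> phys reg
    next_free = num_arch_regs                       # assumption: free list never runs out
    renamed = []
    for dest, src1, src2 in instructions:
        phys1, phys2 = mapping[src1], mapping[src2]  # read sources before remapping dest
        mapping[dest] = next_free
        renamed.append((next_free, phys1, phys2))
        next_free += 1
    return renamed


program = [
    (1, 2, 3),  # R1 = R2 op R3
    (2, 1, 3),  # R2 = R1 op R3   (RAW on R1, WAR on R2)
    (1, 2, 2),  # R1 = R2 op R2   (WAW on R1)
]
renamed_program = rename(program)
```

After renaming, the three instructions write three distinct physical registers (4, 5, 6), so only the true dependences through registers 4 and 5 remain.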

18 Example Machine (Tomasulo-like) Revisited
[Figure: components shown: reorder buffer (fed from memory & CDB), instruction input (from I-unit), Fp registers, reservation stations, F-p units (to CDB), path to memory]

19 The Commit Step (in-order completion)
A fourth stage: Commit
Need of a mechanism (reorder buffer) to:
  “Complete” instructions in order. This commits the instruction. Since multiple issue machine, should be able to commit (retire) several instructions per cycle
  Know when an instruction has completed non-speculatively (head of the buffer)
  Know whether the result of an instruction is correct, i.e., flush reorder buffer when there are incorrectly predicted branches and exceptions
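
The three requirements can be sketched with a small queue, assuming entries are kept in program order, instructions may finish out of order, and at most `commit_width` instructions retire per cycle. The class, its method names, and the string tags are hypothetical; results, register updates, and exception details are omitted.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, commit_width=2):
        self.entries = deque()            # program order, oldest entry on the left
        self.commit_width = commit_width  # max instructions retired per cycle

    def issue(self, tag):
        self.entries.append({"tag": tag, "done": False})

    def complete(self, tag):
        # Execution may finish in any order; just mark the entry.
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["done"] = True

    def commit(self):
        # Retire from the head only, so instructions commit in program order.
        retired = []
        while (self.entries and self.entries[0]["done"]
               and len(retired) < self.commit_width):
            retired.append(self.entries.popleft()["tag"])
        return retired

    def flush_after(self, tag):
        # Mispredicted branch or exception at `tag`: drop all younger entries.
        kept = deque()
        for entry in self.entries:
            kept.append(entry)
            if entry["tag"] == tag:
                break
        self.entries = kept


rob = ReorderBuffer(commit_width=2)
for name in ["i0", "i1", "i2", "i3"]:
    rob.issue(name)
rob.complete("i2")
rob.complete("i1")
cycle1 = rob.commit()   # nothing retires: i0 at the head is not done
rob.complete("i0")
cycle2 = rob.commit()   # i0 and i1 retire together
cycle3 = rob.commit()   # i2 retires; i3 is still executing
```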

20 Multiple Issue Implications
Will increase throughput
The Instruction Fetch step requires buffering and can become a critical point in the design
The Commit stage must be able to retire multiple instructions in a given cycle
Decoding, issuing, dispatching can encounter more structural hazards

21 VLIW-EPIC
Compiler plays a major role in scheduling operations
Merced/Itanium implementation
  “Bundles” of predicated instructions
  Large register files with “rotating” registers to facilitate loop unrolling, software pipelining, and call/return paradigms
  Predication and sophisticated branch prediction
  Powerful floating-point units
  SIMD instructions for 3D Graphics and Multimedia

22 Predication
Partial predication (Conditional Moves)
Full predication (predicate definitions); Unconditional and OR predicates
Used extensively in Merced/Itanium

23 Memory Hierarchy
Memory hierarchies “work” because of the principle of locality
  Temporal and spatial locality
Two main interfaces in the memory hierarchy
  Caches -- Main memory
  Main memory -- disk (secondary memory)
Same questions arise at both interfaces:
  Size, placement, retrieval, replacement, and timing of the information being transferred

24 Caches
Cache organizations
  Direct-mapped, fully-associative, set-associative
  Decomposition of the address for hit/miss detection
  Write-through vs. write-back; write-around and write-allocate
The 3 C’s
Cache performance
  Metrics: CPIc, Average memory access time
  Examples of naïve analysis
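
A short sketch of the address decomposition and of the average memory access time metric, assuming power-of-two cache and block sizes. The function names and the example cache parameters are illustrative assumptions.

```python
def split_address(addr, cache_bytes, block_bytes, ways):
    """Split an address into (tag, set index, block offset).

    ways=1 is direct-mapped; ways = cache_bytes // block_bytes is fully
    associative (a single set, so the index is always 0).
    Assumes all sizes are powers of two.
    """
    num_sets = cache_bytes // (block_bytes * ways)
    offset_bits = block_bytes.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    offset = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """AMAT = hit time + miss rate * miss penalty (all times in cycles)."""
    return hit_time + miss_rate * miss_penalty


# 32 KB cache with 32-byte blocks.
direct_mapped = split_address(0x12345678, 32 * 1024, 32, ways=1)
fully_assoc = split_address(0x12345678, 32 * 1024, 32, ways=1024)
amat = average_memory_access_time(hit_time=1, miss_rate=0.05, miss_penalty=20)
```

Raising the associativity shrinks the index field and grows the tag: the direct-mapped split uses 10 index bits, the fully-associative one uses none.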

25 Cache Performance
Improving performance by giving more “associativity”
  Victim caches; column-associative caches; skewed-associative caches
Reducing conflict misses
  Interaction with the O.S.: page coloring
  Interaction with the compiler: code placement
Improving performance by tolerating memory latency
  Prefetching
  Write buffers
  Critical word first
  Sector caches
  Lock-up free caches

26 Main Memory
DRAM basics
Interleaving
  Low order bits for reading consecutive words in parallel
  “Middle” bits for banks of banks allowing concurrent access by several devices
Page-mode and SDRAMs
Processor In Memory paradigm (IRAM, Active Pages)
Rambus
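
Low-order interleaving can be illustrated with a two-line mapping, assuming word addresses and a bank count chosen here only as an example. The function name is hypothetical.

```python
def bank_and_row(word_addr, n_banks):
    """Low-order interleaving: the low bits pick the bank, the rest pick the row."""
    return word_addr % n_banks, word_addr // n_banks


# Eight consecutive words spread over four banks.
banks = [bank_and_row(addr, 4)[0] for addr in range(8)]
```

Consecutive words land in different banks, so a block of four words can be read from the four banks in parallel.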

27 Virtual memory
Paging and segmentation
Page tables
TLB’s
Address translation

28 From Virtual Address to Memory Location (highly abstracted)
[Figure: ALU, TLB, cache, and main memory. The virtual address goes to the TLB; a TLB hit yields the physical address; hit and miss paths connect the TLB and the cache to main memory]
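
The translation path can be sketched as follows, assuming 4 KB pages, a single-level page table held in a dictionary, and a TLB with no capacity limit or replacement policy. All names and sizes are assumptions for illustration; the cache lookup that follows translation is not modeled.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def translate(vaddr, tlb, page_table):
    """Return (physical address, event) for a virtual address.

    tlb and page_table both map virtual page numbers to physical frame numbers.
    """
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:
        return tlb[vpn] * PAGE_SIZE + offset, "tlb hit"
    if vpn not in page_table:
        # Page not resident: the O.S. would have to handle a page fault.
        raise LookupError("page fault")
    tlb[vpn] = page_table[vpn]  # TLB miss: refill from the page table
    return tlb[vpn] * PAGE_SIZE + offset, "tlb miss"


page_table = {0: 5, 1: 9}
tlb = {}
first = translate(4100, tlb, page_table)    # virtual page 1, offset 4
second = translate(4200, tlb, page_table)   # same page, now cached in the TLB
```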

29 Hardware-software interactions for paging systems
TLB’s
  Misses handled either in hardware or software
Page fault: detection and termination
  Context-switch (exception)
  I/O interrupt
Choice of a (or several) page size(s)
Virtually addressed caches - Synonyms
Protection
I/O and caches (software and hardware solutions)
  Cache coherence

30 I/O
I/O architecture (CPU-memory and I/O buses)
Disks (access time components)
Buses (arbitration, transactions, split-transactions)
I/O hardware-software interface
DMA
Disk arrays (RAID)

31 Parallel Processing
Flynn’s taxonomy
  {Single Instr., Multiple Instr.} X {Single Data, Multiple Data}
MIMD machines -- Shared-memory multiprocessors
  UMA
  NUMA-cc
  DSM
MIMD machines -- Message passing systems
  Multicomputers
  Synchronous vs. asynchronous message passing

32 Shared-bus Systems
SMP’s
Cache coherence using snoopy protocols
  Write-update protocols (Dragon)
  Write-invalidate protocols (Illinois)
Cache coherence misses
Impact of capacity and block sizes
Multilevel inclusion property
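
A minimal sketch of the write-invalidate idea for a single memory block, using a simplified three-state (Modified, Shared, Invalid) protocol. This is not the Illinois protocol named on the slide, which also has an Exclusive state; the class and method names are hypothetical, and data transfer and write-back traffic are not modeled.

```python
class SnoopyBus:
    """Tracks the MSI state of one block in each of n caches on a shared bus."""

    def __init__(self, n_caches):
        self.state = ["I"] * n_caches

    def read(self, cache):
        if self.state[cache] == "I":
            # Read miss on the bus: a Modified copy elsewhere is downgraded to Shared.
            for other in range(len(self.state)):
                if other != cache and self.state[other] == "M":
                    self.state[other] = "S"
            self.state[cache] = "S"

    def write(self, cache):
        if self.state[cache] != "M":
            # Invalidate request on the bus: every other copy becomes Invalid.
            for other in range(len(self.state)):
                if other != cache:
                    self.state[other] = "I"
            self.state[cache] = "M"


bus = SnoopyBus(3)
trace = []
for op, cache in [("read", 0), ("read", 1), ("write", 1), ("read", 2), ("write", 0)]:
    getattr(bus, op)(cache)
    trace.append("".join(bus.state))
```

The trace shows sharing after the two reads, a single Modified owner after each write, and the owner's downgrade when another cache reads the block.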

33 NUMA Machines
Interconnection networks for tightly-coupled systems
  Centralized vs. decentralized switches
Centralized switches
  Crossbar
  Perfect shuffle -- Omega and Butterfly networks
Decentralized switches
  Meshes and tori
Performance metrics
  Bandwidth; Bisection bandwidth; latency
Routing and flow control
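
The perfect shuffle connection pattern used between the stages of an Omega network can be sketched as a one-bit left rotation of the port number, assuming the number of ports is a power of two. The function name is hypothetical.

```python
def perfect_shuffle(port, n_ports):
    """Destination of `port` under a perfect shuffle of n_ports (a power of two).

    Equivalent to rotating the log2(n_ports)-bit port number left by one bit.
    """
    bits = n_ports.bit_length() - 1
    return ((port << 1) | (port >> (bits - 1))) & (n_ports - 1)


shuffled = [perfect_shuffle(p, 8) for p in range(8)]


def shuffle_times(port, n_ports, times):
    for _ in range(times):
        port = perfect_shuffle(port, n_ports)
    return port
```

Because each shuffle rotates the port number by one bit, applying it log2(n) times returns every port to its starting position.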

34 Directory-based Cache Coherence
Full directory
Partial directory
  2-bit
  Coarse directories
Basic protocols
SCI
  Directory in the caches
COMA architecture

35 Synchronization
Locking and barriers
Primitives for implementation of locking
  Test-and-Set
  Fetch-and-Φ
  Full/empty bits
  Load locked and Store conditional
Spin locks
  Test and Test-and-Set
  Queuing locks
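
A small simulation of a Test and Test-and-Set spin lock. Python has no hardware Test-and-Set instruction, so the atomic primitive is emulated here with a `threading.Lock` guard; that emulation, the class names, and the thread and iteration counts are assumptions for illustration only.

```python
import threading


class AtomicFlag:
    """Emulates a memory word with an atomic Test-and-Set operation."""

    def __init__(self):
        self._value = False
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def test_and_set(self):
        with self._guard:
            old = self._value
            self._value = True
            return old

    def read(self):
        return self._value

    def clear(self):
        self._value = False


class TTASLock:
    def __init__(self):
        self.flag = AtomicFlag()

    def acquire(self):
        while True:
            while self.flag.read():
                pass  # spin on an ordinary read (served from the local cache on real hardware)
            if not self.flag.test_and_set():
                return  # the flag was free and this thread set it

    def release(self):
        self.flag.clear()


def run_demo(n_threads=4, iterations=200):
    lock = TTASLock()
    counter = [0]

    def work():
        for _ in range(iterations):
            lock.acquire()
            counter[0] += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Spinning on a plain read first means the costly atomic operation is attempted only when the lock looks free, which is the point of Test and Test-and-Set over plain Test-and-Set.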

36 Models of Memory Consistency
Sequential consistency
Relaxed models
  Weak Ordering
  Release consistency

