1 CPRE 585 Term Review Performance evaluation, ISA design, dynamically scheduled pipeline, and memory hierarchy.

2 Exam Schedule
In-class (40 points): Wednesday Nov. 12, class time (75 minutes)
Take-home (60 points): distributed Wednesday Nov. 12, return by 5:00pm Friday
These are not two copies of one exam: different purposes, question types, and difficulty levels

3 Performance Evaluation
Performance metrics: latency, throughput, and others
Speedup
Benchmarks: design considerations, categories, examples (SPEC and TPC)
Summarizing performance
Amdahl's Law: idea and equation
CPU time equation
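The speedup and Amdahl's Law items above can be sketched in a few lines of Python; the 40%/10x numbers below are illustrative, not from the slides.

```python
# Amdahl's Law: overall speedup when only a fraction of execution time
# benefits from an enhancement.

def amdahl_speedup(enhanced_fraction, enhanced_speedup):
    """Speedup_overall = 1 / ((1 - f) + f / s)."""
    return 1.0 / ((1.0 - enhanced_fraction) + enhanced_fraction / enhanced_speedup)

# Example: 40% of runtime sped up 10x -> overall speedup only ~1.56,
# because the unenhanced 60% dominates.
print(amdahl_speedup(0.4, 10))
```

Note how the result is bounded by 1/(1-f) no matter how large the enhancement's speedup gets.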

4 ISA Design
ISA types
GPR ISA variants: number of operands; use of register, immediate, and memory operands
GPR ISA design issues: memory addressing, endianness, and alignment
Comparing RISC and CISC
ISA impact on processor performance

5 Instruction Scheduling Fundamentals
Dependence analysis
Data and name (anti- and output) dependences; equivalently RAW, WAW, and WAR
Dependences through registers and memory
Control dependence
CPU time = #inst × CPI × cycle time, where CPI = CPI_ideal + CPI_data_hazard + CPI_control_hazard
Deep pipelining: reduces cycle time
Multi-issue and dynamic scheduling: reduce CPI_ideal
Branch prediction and speculative execution: reduce CPI_control_hazard
Memory hierarchy: reduces CPI_data_hazard
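The CPU time decomposition above can be worked through numerically; all instruction counts, stall CPIs, and cycle times below are hypothetical.

```python
# CPU time = #inst * CPI * cycle time, with CPI split into an ideal term
# plus data-hazard and control-hazard stall terms.

def cpu_time(inst_count, cpi_ideal, cpi_data_hazard, cpi_control_hazard, cycle_time_ns):
    cpi = cpi_ideal + cpi_data_hazard + cpi_control_hazard
    return inst_count * cpi * cycle_time_ns

base      = cpu_time(1_000_000, 1.0, 0.5, 0.3, 1.0)  # total CPI = 1.8
better_bp = cpu_time(1_000_000, 1.0, 0.5, 0.1, 1.0)  # better branch prediction
print(base / better_bp)  # speedup from cutting control-hazard stalls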

6 Tomasulo's Algorithm
Study focus: data and name dependences through registers
Hardware structures:
Register status table (renaming table): helps remove name dependences and build data dependences
Reservation stations: preserve data dependences, buffer instruction state, wake up dependent instructions
Common data bus: broadcasts tag and data
What are the stages of Tomasulo's algorithm?
Understand the big example!
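A minimal sketch of the renaming and CDB-wakeup mechanics described above; the register values, tags, and field names are made up for illustration and are not the slides' example.

```python
# Tomasulo-style issue and CDB broadcast, reduced to the essentials.

class RS:  # one reservation station entry
    def __init__(self, tag, op):
        self.tag, self.op = tag, op
        self.vj = self.vk = None   # operand values, once available
        self.qj = self.qk = None   # producing RS tags, if values not ready

reg_value = {"R1": 5, "R2": 7, "R3": 0}
reg_status = {}                    # register -> tag of RS that will produce it

def issue(tag, op, dst, src1, src2):
    rs = RS(tag, op)
    for field_v, field_q, src in (("vj", "qj", src1), ("vk", "qk", src2)):
        if src in reg_status:                  # RAW: wait for producer's tag
            setattr(rs, field_q, reg_status[src])
        else:
            setattr(rs, field_v, reg_value[src])
    reg_status[dst] = tag                      # rename dst: removes WAW/WAR
    return rs

def cdb_broadcast(tag, value, rs_list):
    for rs in rs_list:                         # wake up dependent instructions
        if rs.qj == tag: rs.vj, rs.qj = value, None
        if rs.qk == tag: rs.vk, rs.qk = value, None

add1 = issue("RS1", "ADD", "R3", "R1", "R2")   # R3 = R1 + R2
sub1 = issue("RS2", "SUB", "R1", "R3", "R2")   # RAW on R3: carries tag RS1
cdb_broadcast("RS1", 12, [sub1])               # ADD completes, SUB wakes up
print(sub1.vj, sub1.qj)
```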

7 Precise Interrupt and Speculative Execution
What is a precise interrupt, and why is it needed?
In-order commit: a solution for both
Central idea: maintain architectural state
Must buffer instruction output after execution
Commit instruction output to architectural state in program order
Flush the pipeline on exceptions or mis-speculations
Q: What is the ROB, and what is its structure?
Q: How does the pipeline change?
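The in-order commit idea above can be sketched as a queue retired only from the head; entry fields and register names are illustrative assumptions.

```python
# In-order commit with a reorder buffer: results are buffered after
# execution and written to architectural state only in program order.
from collections import deque

arch_reg = {"R1": 0, "R2": 0}
rob = deque()  # head = oldest instruction in program order

def dispatch(dst):
    entry = {"dst": dst, "value": None, "done": False, "exception": False}
    rob.append(entry)
    return entry

def commit():
    """Retire finished instructions from the head; flush on exception."""
    while rob and rob[0]["done"]:
        e = rob.popleft()
        if e["exception"]:
            rob.clear()                      # flush younger instructions
            return "trap"
        arch_reg[e["dst"]] = e["value"]      # update arch state in order
    return "ok"

e1, e2 = dispatch("R1"), dispatch("R2")
e2.update(value=9, done=True)                # completes out of order...
commit()
print(arch_reg["R2"])                        # still 0: cannot commit past e1
e1.update(value=4, done=True)
commit()
print(arch_reg)
```

The key property: at any exception, architectural state reflects exactly the instructions older than the faulting one.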

8 Modern Instruction Scheduling
Major differences: more pipeline stages, data forwarding, decoupled tag broadcasting, may use an issue queue
Issue queue: RS changed to IQ; two pipeline stages switched; significant changes to registers, renaming, and the ROB
Why data forwarding?
How is the IQ different from the RS?
How does the pipeline change?
What is a physical register? What changes at the rename stage?
Understand the generic superscalar processor models
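A sketch of the rename-stage change mentioned above, using a physical register file instead of value-holding reservation stations; the register names and free-list contents are invented for illustration.

```python
# Rename with a physical register file: every destination write gets a
# fresh physical register, so WAW/WAR hazards disappear while RAW
# dependences are kept through the mapping.

free_list = ["P4", "P5", "P6", "P7"]
map_table = {"R1": "P1", "R2": "P2", "R3": "P3"}   # arch reg -> phys reg

def rename(dst, srcs):
    phys_srcs = [map_table[s] for s in srcs]  # read current mappings (RAW kept)
    new_phys = free_list.pop(0)               # fresh dst breaks WAW/WAR
    map_table[dst] = new_phys
    return new_phys, phys_srcs

print(rename("R1", ["R2", "R3"]))   # first write to R1 gets P4
print(rename("R1", ["R1", "R2"]))   # reads P4, writes a fresh P5
```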

9 Branch Prediction
Objective: deliver instructions continuously
Several functions: predict target, direction, and return address
Review BTB and BHT design
Why use a saturating counter? Why use correlating prediction?
How are the BTB and BHT updated?
How to calculate misprediction penalty?
What is the return address stack?
Understand the tournament predictor
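The "why use a saturating counter" question above is easiest to see by running one; this models a single BHT entry with made-up outcomes.

```python
# One 2-bit saturating counter: states 0..3, predict taken when >= 2.
# A single anomalous outcome does not flip the prediction.

class TwoBitCounter:
    def __init__(self):
        self.state = 1                       # start weakly not-taken

    def predict(self):
        return self.state >= 2               # True = predict taken

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

c = TwoBitCounter()
outcomes = [True, True, False, True]         # loop-like branch, one exit
preds = []
for t in outcomes:
    preds.append(c.predict())
    c.update(t)
print(preds)  # the single not-taken outcome causes only one mispredict
```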

10 Memory Data Flow Techniques
Address dependences through memory: store->load dependences
Must buffer store outputs => store queue
Want memory-level parallelism => memory disambiguation
Load bypassing and forwarding; may speculate if a store address is not known
Need to detect mis-speculation => load queue (violations detected on stores)
Q: Where is the performance gain?
Q: What are the structures of the LQ and SQ?
Q: How are the store queue and load queue synchronized with the ROB?
Q: Which portion of the SQ preserves architectural state? How are the SQ and LQ flushed?
Superscalar techniques: instruction flow, register data flow, and memory data flow
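The store queue's forwarding path above can be sketched directly; addresses and values here are invented, and mis-speculation detection is omitted for brevity.

```python
# Store-to-load forwarding: a load searches the store queue, youngest
# first, for a matching address before accessing the cache.

store_queue = []   # program order: older stores first, as (address, value)

def store(addr, value):
    store_queue.append((addr, value))

def load(addr, cache):
    for a, v in reversed(store_queue):   # youngest matching store wins
        if a == addr:
            return v                     # forwarded, no cache access
    return cache.get(addr, 0)            # load bypassing: go to cache

cache = {0x100: 11}
store(0x200, 42)
store(0x100, 99)
print(load(0x100, cache))   # forwarded from store queue, not the stale cache
print(load(0x300, cache))   # misses both SQ and cache
```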

11 Limits of ILP
What may limit ILP in realistic programs?
What is the strategy for evaluating ILP limits?

12 Cache Fundamentals
Cache design
What is a cache, and why use one?
What are the 4 Qs of cache design?
Note that caching happens on memory blocks
Be very familiar with the cache address mapping format
Cache performance
Three factors: miss rate, miss penalty, and hit time
What is AMAT? What is memory stall time?
What is the final measure of cache performance?
How to evaluate set-associative caches?
Know how to analyze memory access patterns
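The address-mapping format and AMAT items above can be sketched as follows; the cache geometry and latencies are example numbers only.

```python
# Cache address breakdown (tag | index | offset) and AMAT.

def split_address(addr, block_size, num_sets):
    offset_bits = block_size.bit_length() - 1    # log2; sizes are powers of two
    index_bits = num_sets.bit_length() - 1
    offset = addr & (block_size - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# Example: 32KB direct-mapped cache, 64B blocks -> 512 sets.
print(split_address(0x1234, 64, 512))

def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 100))   # ~6 cycles
```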

13 Cache Optimization Techniques
What is desired: low miss rate, fast cache hits, and small miss penalty, with minimal complexity (ideal world)
Understand cache misses
What are the three Cs? Which techniques reduce each type?
Optimizations involve tradeoffs, e.g. cache size, block size, set associativity

14 Improving Cache Performance
1. Reducing miss rate
Larger block size
Larger cache size
Higher associativity
Way prediction
Pseudo-associativity
Compiler optimization
2. Reducing miss penalty
Multilevel caches
Critical word first
Read miss first
Merging write buffers
Victim caches
3. Reducing miss penalty or miss rate via parallelism
Non-blocking caches
Hardware prefetching
Compiler prefetching
4. Reducing cache hit time
Small and simple caches
Avoiding address translation
Pipelined cache access
Trace caches
Bold type (in the original slides): know details; others: understand the concepts
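The multilevel-cache entry under "reducing miss penalty" can be quantified with a two-level AMAT; the hit times, miss rates, and memory penalty below are illustrative assumptions.

```python
# With an L2, the L1 miss penalty becomes an L2 access plus (sometimes)
# a memory access, instead of always a full memory access.

def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_local_miss_rate, mem_penalty):
    l1_miss_penalty = l2_hit + l2_local_miss_rate * mem_penalty
    return l1_hit + l1_miss_rate * l1_miss_penalty

with_l2    = amat_two_level(1, 0.05, 10, 0.2, 200)
without_l2 = 1 + 0.05 * 200     # L1 misses go straight to memory
print(with_l2, without_l2)      # ~3.5 vs ~11 cycles
```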

15 Virtual Memory
Why VM?
What are the four Qs of VM design?
How do cache and VM compare?
Be familiar with the VM address mapping format
Understand the flat page table; what is in a PTE?
What is a TLB? How does a TLB work?
Why multi-level page tables?
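The TLB and multi-level page table items above can be sketched together; the page size, index split, table contents, and frame numbers are made up, and permission checks and page faults are omitted.

```python
# Virtual-to-physical translation: TLB hit on the fast path, otherwise a
# two-level page-table walk followed by a TLB refill.

PAGE_BITS = 12                    # 4KB pages

tlb = {}                          # VPN -> PFN (fully associative, no eviction)
l1_page_table = {0: {3: 0x55}}    # L1 index -> {L2 index -> PFN}

def translate(vaddr):
    vpn, offset = vaddr >> PAGE_BITS, vaddr & ((1 << PAGE_BITS) - 1)
    if vpn in tlb:
        pfn = tlb[vpn]                        # TLB hit
    else:                                     # TLB miss: walk the page table
        l1_idx, l2_idx = vpn >> 10, vpn & 0x3FF
        pfn = l1_page_table[l1_idx][l2_idx]   # a real walk would trap on no PTE
        tlb[vpn] = pfn                        # refill the TLB
    return (pfn << PAGE_BITS) | offset

print(hex(translate(0x3ABC)))   # via page-table walk
print(hex(translate(0x3DEF)))   # same page, now a TLB hit
```

The multi-level structure only allocates L2 tables for regions actually in use, which is why flat tables are impractical for large virtual address spaces.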

16 Typical Memory Hierarchy Today
L1 instruction cache: small; combined with prediction (way prediction, trace cache) and prefetching (e.g. stream buffer); virtually indexed and virtually tagged
L1 data cache: small and fast, pipelined, likely set-associative (Intel: 8KB, 4-way set associative); virtually indexed and physically tagged; write-through
TLB: small (128-entry in the 21264), split for instructions and data, tends to be fully associative; the D-TLB runs in parallel with the L1 data cache
L2 unified cache: as large as the transistor budget allows; today highly set-associative (e.g., 512KB 8-way for the P4); write-back to reduce memory traffic
Optional L3 cache: even larger, off-chip
Page table: multi-level; software-managed (21264) or hardware-managed (Intel)
Main memory: large but slow, high-bandwidth