1 Lecture 20: OOO, Memory Hierarchy
Today's topics:
• Out-of-order execution
• Cache basics

2 Slowdowns from Stalls
• Perfect pipelining with no hazards → an instruction completes every cycle (total cycles ~ num instructions) → speedup = increase in clock speed = num pipeline stages
• With hazards and stalls, some cycles (= stall time) go by during which no instruction completes, and then the stalled instruction completes
• Total cycles = number of instructions + stall cycles
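A quick arithmetic sketch of this relationship (the instruction count, stall count, and stage count below are made-up values, not from the slide):

    # Hypothetical workload: 1,000,000 instructions on a 5-stage pipeline
    num_instructions = 1_000_000
    stall_cycles = 250_000                    # assumed total stall time
    num_stages = 5

    total_cycles = num_instructions + stall_cycles
    cpi = total_cycles / num_instructions     # cycles per instruction
    speedup = num_stages / cpi                # ideal speedup is num_stages
    print(cpi, speedup)                       # 1.25 CPI -> 4.0x, not the ideal 5x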

3 Multicycle Instructions
• Multiple parallel pipelines – each pipeline can have a different number of stages
• Instructions can now complete out of order – must make sure that writes to a register happen in the correct order
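A minimal sketch of the write-ordering problem, with assumed latencies (not from the slide): a long-latency multiply writing R1 is followed by a short add that also writes R1, so the add reaches writeback first and the multiply's stale result would land in R1 last.

    # Assumed execution latencies for two parallel pipelines
    latency = {"MUL": 6, "ADD": 2}

    # Program order: both instructions write R1 (a write-after-write hazard)
    program = [("MUL", "R1"), ("ADD", "R1")]

    for i, (op, dest) in enumerate(program):
        issue = i + 1                          # issued in order, one per cycle
        print(op, dest, "completes in cycle", issue + latency[op])
    # MUL R1 completes in cycle 7
    # ADD R1 completes in cycle 3  -> writes happen out of program order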

4 An Out-of-Order Processor Implementation
[Figure: branch prediction and instruction fetch feed an Instr Fetch Queue holding the program
  R1 ← R1+R2
  R2 ← R1+R3
  BEQZ R2
  R3 ← R1+R2
  R1 ← R3+R2
Decode & Rename allocates entries T1-T6 for Instr 1-6 in the Reorder Buffer (ROB) and places the renamed instructions
  T1 ← R1+R2
  T2 ← T1+R3
  BEQZ T2
  T4 ← T1+T2
  T5 ← T4+T2
in the Issue Queue (IQ), from which they execute on the ALUs, reading the Register File (R1-R32). Results are written to the ROB and tags broadcast to the IQ.]
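A minimal sketch of the rename step in the figure, assuming a simple map from architectural registers to the newest tag (the code sequence and the tags T1-T6 come from the slide; the Python itself is illustrative):

    rename_map = {}                        # architectural register -> newest tag
    renamed = []
    next_tag = 1

    def read(reg):
        # a source reads the newest tag if the register was renamed,
        # otherwise the value comes straight from the register file
        return rename_map.get(reg, reg)

    # (destination, sources) for the slide's code; BEQZ has no destination
    program = [("R1", ["R1", "R2"]),
               ("R2", ["R1", "R3"]),
               (None, ["R2"]),             # BEQZ R2
               ("R3", ["R1", "R2"]),
               ("R1", ["R3", "R2"])]

    for dest, srcs in program:
        srcs = [read(r) for r in srcs]
        t = f"T{next_tag}"                 # every instruction gets a ROB entry
        next_tag += 1
        if dest:
            rename_map[dest] = t           # later readers of dest now see the tag
            renamed.append(f"{t} <- {' + '.join(srcs)}")
        else:
            renamed.append(f"BEQZ {srcs[0]}")

    print("\n".join(renamed))
    # T1 <- R1 + R2
    # T2 <- T1 + R3
    # BEQZ T2
    # T4 <- T1 + T2
    # T5 <- T4 + T2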

5 Example Code
Completion times with in-order and with OOO execution:

                         in-order   ooo
    ADD R1, R2, R3           5       5
    ADD R4, R1, R2           6       6
    LW  R5, 8(R4)            7       7
    ADD R7, R6, R5           9       9
    ADD R8, R7, R5          10      10
    LW  R9, 16(R4)          11       7
    ADD R10, R6, R9         13       9
    ADD R11, R10, R9        14      10

The second load and its dependents form a chain that is independent of the first, so the OOO processor overlaps the two chains instead of waiting.

6 Cache Hierarchies
• Data and instructions are stored on DRAM chips – DRAM is a technology that has high bit density, but relatively poor latency – an access to data in memory can take as many as 300 cycles today!
• Hence, some data is stored on the processor in a structure called the cache – caches employ SRAM technology, which is faster, but has lower bit density
• Internet browsers also cache web pages – same concept

7 Memory Hierarchy
As you go further from the processor, capacity and latency increase:

    Registers                        1 KB      1 cycle
    L1 data or instruction cache    32 KB      2 cycles
    L2 cache                         2 MB     15 cycles
    Memory                           1 GB    300 cycles
    Disk                            80 GB    10M cycles

8 Locality
Why do caches work?
• Temporal locality: if you used some data recently, you will likely use it again
• Spatial locality: if you used some data recently, you will likely access its neighbors
• No hierarchy: average access time for data = 300 cycles
• 32KB 1-cycle L1 cache that has a hit rate of 95%:
  average access time = 0.95 x 1 + 0.05 x 301 = 16 cycles
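The same average-access-time arithmetic as a small sketch (the 301-cycle miss path is the 1-cycle L1 probe plus the 300-cycle memory access):

    def amat(hit_time, hit_rate, miss_penalty):
        # hits pay hit_time; misses pay the probe plus the miss penalty
        return hit_rate * hit_time + (1 - hit_rate) * (hit_time + miss_penalty)

    print(amat(hit_time=1, hit_rate=0.95, miss_penalty=300))   # 16.0 cycles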

9 Accessing the Cache
[Figure: a direct-mapped cache with 8-byte words – each address maps to a unique location in the cache. The byte address is split into an offset that selects a byte within the word and an index that selects a set in the data array; with 8 words (sets), 3 index bits are needed.]
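A sketch of how the byte address is carved up for this 8-set cache with 8-byte words (the bit widths come from the slide; the helper name is made up):

    def split_address(addr, offset_bits=3, index_bits=3):
        # low bits: byte within the 8-byte word; middle bits: set index;
        # whatever remains on top: tag
        offset = addr & ((1 << offset_bits) - 1)
        index = (addr >> offset_bits) & ((1 << index_bits) - 1)
        tag = addr >> (offset_bits + index_bits)
        return tag, index, offset

    print(split_address(100))   # 100 = 0b1100100 -> tag 1, index 4, offset 4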

10 The Tag Array
[Figure: the same direct-mapped cache with a tag array beside the data array – the index selects an entry in both arrays, and the stored tag is compared against the tag bits of the byte address to decide hit or miss.]

11 Example Access Pattern
[Figure: the same direct-mapped tag and data arrays with 8-byte words.]
Assume that addresses are 8 bits long.
How many of the following address requests are hits/misses?
4, 7, 10, 13, 16, 68, 73, 78, 83, 88, 4, 7, 10…
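A sketch that answers the question by simulation, assuming the 8-set, 8-byte-block direct-mapped cache of the previous slides (the function and its parameters are illustrative; ways defaults to 1 for direct-mapped):

    def simulate(addresses, num_sets=8, block_size=8, ways=1):
        # each set holds up to `ways` block numbers, kept in LRU order
        sets = [[] for _ in range(num_sets)]
        results = []
        for addr in addresses:
            block = addr // block_size
            s = sets[block % num_sets]
            if block in s:
                s.remove(block)
                s.append(block)        # hit: move to most-recently-used
                results.append("hit")
            else:
                if len(s) == ways:
                    s.pop(0)           # evict the least-recently-used block
                s.append(block)
                results.append("miss")
        return results

    pattern = [4, 7, 10, 13, 16, 68, 73, 78, 83, 88, 4, 7, 10]
    print(simulate(pattern).count("hit"), "hits of", len(pattern))   # 4 hits

Running it gives 4 hits and 9 misses: 68, 73, and 83 map to the same sets as 4/7, 10/13, and 16, so the later re-references to 4 and 10 are conflict misses.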

12 Increasing Line Size
[Figure: the same cache with a 32-byte cache line size (block size) – the offset field of the byte address grows to select a byte within the larger line.]
• A large cache line size → smaller tag array, fewer misses because of spatial locality
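A tiny sketch of the tag-array effect under assumed parameters (a 64-byte direct-mapped cache with 8-bit addresses; none of these numbers come from the slide):

    def tag_array(cache_bytes=64, block=8, addr_bits=8):
        blocks = cache_bytes // block        # direct-mapped: one tag per block
        offset = block.bit_length() - 1      # log2(block size)
        index = blocks.bit_length() - 1      # log2(number of blocks)
        tag = addr_bits - index - offset
        return blocks, blocks * tag          # entries, total tag bits

    print(tag_array(block=8))    # (8, 16)  8 tags of 2 bits each
    print(tag_array(block=32))   # (2, 4)   4x fewer entries -> smaller tag array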

13 Associativity
[Figure: a 2-way set-associative cache – Way-1 and Way-2 each have a tag array and a data array; the index selects a set in both ways and both stored tags are compared with the address tag.]
• Set associativity → fewer conflicts; wasted power because multiple data and tags are read
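Re-running the earlier simulation sketch with the same eight-block capacity but two ways shows the conflict reduction:

    # 4 sets x 2 ways instead of 8 sets x 1 way: the blocks for 4/7 and 68
    # (and likewise 10/13 and 73) can now coexist in the same set
    print(simulate(pattern, num_sets=4, ways=2).count("hit"))   # 6 hits vs 4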

14 Associativity
[Figure: the same 2-way tag and data arrays.]
How many offset/index/tag bits if the cache has 64 sets, each set has 64 bytes, 4 ways?
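A hedged worked answer, assuming "each set has 64 bytes" means the four ways together (so blocks are 64/4 = 16 bytes) and assuming 32-bit addresses (the slide specifies neither):

    from math import log2

    def cache_bits(sets, block_size, addr_bits=32):
        # split an address into offset / index / tag field widths
        offset = int(log2(block_size))
        index = int(log2(sets))
        return offset, index, addr_bits - index - offset

    print(cache_bits(sets=64, block_size=64 // 4))   # (4, 6, 22)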

15 Example
• 32 KB 4-way set-associative data cache array with 32-byte line size
• How many sets?
• How many index bits, offset bits, tag bits?
• How large is the tag array?
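The helper sketched above answers this slide's questions too; the 32-bit address width is still an assumption:

    ways, line, cache_bytes = 4, 32, 32 * 1024
    sets = cache_bytes // (ways * line)            # 256 sets
    offset, index, tag = cache_bits(sets, line)    # (5, 8, 19)
    tag_array_bits = sets * ways * tag             # 19,456 bits ~ 2.4 KB
    print(sets, offset, index, tag, tag_array_bits)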
