Lecture 8: Memory Hierarchy Cache Performance Kai Bu

Lab 2 Demo; Report due April 21; Assignment 2 Submission

Appendix B.1-B.3

Memory Hierarchy

Main memory + virtual memory. Virtual memory: some objects may reside on disk. The address space is split into pages; a page resides in either main memory or virtual memory. Page fault: occurs when a page is not in cache or main memory; the entire page must be moved from disk to main memory.

Outline Cache Basics Cache Performance Cache Optimization

Cache The highest or first level of the memory hierarchy encountered once the address leaves the processor; buffering is employed to reuse commonly occurring items. Cache hit/miss: when the processor can/cannot find a requested data item in the cache.

Cache Locality Block/line: a fixed-size collection of data containing the requested word, retrieved from main memory and placed into the cache. Temporal locality: the requested word is likely to be needed again soon. Spatial locality: other data in the same block is likely to be needed soon.

Cache Miss The time required to handle a cache miss depends on two measures: latency and memory bandwidth. Latency determines the time to retrieve the first word of the block; bandwidth determines the time to retrieve the rest of the block.

Outline Cache Basics Cache Performance Cache Optimization

Cache Performance

Example a computer with CPI = 1 when all accesses are cache hits; 50% of instructions are loads and stores; 2% miss rate and a 25-cc miss penalty. Q: how much faster would the computer be if all instructions were cache hits?

Cache Performance Answer always hit: CPU execution time = (CPU clock cycles + Memory stall cycles) x Clock cycle time = IC x CPI x Clock cycle time = IC x 1.0 x Clock cycle time

Cache Performance Answer with misses: Memory stall cycles = IC x memory accesses per instruction x Miss rate x Miss penalty = IC x (1 + 0.5) x 0.02 x 25 = IC x 0.75 CPU execution time (cache) = (IC x 1.0 + IC x 0.75) x Clock cycle time = IC x 1.75 x Clock cycle time

Cache Performance Answer CPU execution time (cache) / CPU execution time = 1.75, so the computer with no cache misses is 1.75x faster.

Cache Performance Memory stall cycles: the number of cycles during which the processor is stalled waiting for a memory access. Miss rate: the number of misses divided by the number of accesses. Miss penalty: the cost per miss (the number of extra clock cycles to wait).
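The following is a minimal Python sketch (not from the slides; the function and variable names are illustrative) that plugs the example's numbers into these definitions and reproduces the 1.75x answer:

```python
# A minimal sketch of the stall-cycle model, using the numbers from
# the example above.

def cpu_time_factor(cpi_hit, accesses_per_instr, miss_rate, miss_penalty):
    # CPU time = IC x (CPI_hit + stalls per instruction) x clock cycle;
    # IC and clock cycle cancel in the speedup ratio, so return the rest.
    return cpi_hit + accesses_per_instr * miss_rate * miss_penalty

with_misses = cpu_time_factor(1.0, 1.5, 0.02, 25)  # 1 fetch + 0.5 data refs
always_hit  = cpu_time_factor(1.0, 1.5, 0.00, 25)
print(with_misses / always_hit)  # 1.75: the all-hit machine is 1.75x faster
```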

Block Placement Direct mapped: only one place. Fully associative: anywhere. Set associative: anywhere within only one set.

Block Placement

n-way set associative: n blocks in a set. Direct mapped = one-way set associative, i.e., one block per set. Fully associative = m-way set associative, i.e., the entire cache is one set with m blocks.

Block Identification Block address + block offset. Block address = tag + index. Index: selects the set. Tag: checked against all blocks in the set. Block offset: the address of the desired data within the block chosen by index + tag. Fully associative caches have no index field.
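As a hedged illustration of the tag/index/offset split (the block size and set count below are assumed values, not from the slides), one way to decompose an address:

```python
# A sketch of block identification for a set-associative cache.
# 64-byte blocks and 256 sets are illustrative choices.

BLOCK_SIZE = 64    # bytes per block -> 6 offset bits
NUM_SETS   = 256   # sets -> 8 index bits

def split_address(addr):
    block_offset  = addr % BLOCK_SIZE
    block_address = addr // BLOCK_SIZE
    index         = block_address % NUM_SETS   # selects the set
    tag           = block_address // NUM_SETS  # compared within the set
    return tag, index, block_offset

print(split_address(0x12345678))
# A fully associative cache has NUM_SETS == 1, so the index disappears.
```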

Block Replacement On a cache miss, the data must be loaded into a cache block; which block should be replaced? Random: simple to build. LRU (Least Recently Used): replace the block that has been unused for the longest time; relies on temporal locality; complicated/expensive to track. FIFO: first in, first out.
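A minimal LRU sketch for a single set, assuming 4-way associativity (an illustrative choice, not a value from the slides):

```python
# A minimal LRU replacement sketch for one cache set.
from collections import OrderedDict

class LRUSet:
    def __init__(self, ways=4):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> data, oldest first

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)        # now most recently used
            return "hit"
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)     # evict least recently used
        self.blocks[tag] = None
        return "miss"

s = LRUSet()
print([s.access(t) for t in [1, 2, 3, 4, 1, 5, 1]])
# ['miss', 'miss', 'miss', 'miss', 'hit', 'miss', 'hit']
```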

Write Strategy Reads can proceed together with tag checking; writes must wait until after the tag check confirms a hit.

Write Strategy Write-through: information is written to both the block in the cache and the block in the lower-level memory. Write-back: information is written only to the block in the cache, and to main memory only when the modified cache block is replaced.

Write Strategy Options on a write miss Write allocate: the block is allocated in the cache on a write miss. No-write allocate: a write miss does not affect the cache; the block is modified only in the lower-level memory, until the program tries to read the block.

Write Strategy Example (write-through cache; the textbook's access sequence: Write Mem[100]; Write Mem[100]; Read Mem[200]; Write Mem[200]; Write Mem[100])

No-write allocate: 4 misses + 1 hit; the cache is not affected by the writes, so address 100 is never brought into the cache; Read [200] misses and allocates the block, and the following Write [200] then hits. Write allocate: 2 misses + 3 hits.
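A small sketch (assuming the access sequence above and, for simplicity, one address per block) that replays the example and reproduces both miss/hit counts:

```python
# Replay the write-miss example under both policies and count
# (misses, hits). The cache starts empty.

def run(seq, write_allocate):
    cache, hits, misses = set(), 0, 0
    for op, addr in seq:
        if addr in cache:
            hits += 1
        else:
            misses += 1
            if op == "read" or write_allocate:
                cache.add(addr)   # no-write allocate skips this on writes
    return misses, hits

seq = [("write", 100), ("write", 100), ("read", 200),
       ("write", 200), ("write", 100)]
print(run(seq, write_allocate=False))  # (4, 1)
print(run(seq, write_allocate=True))   # (2, 3)
```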

Avg Mem Access Time Average memory access time =Hit time + Miss rate x Miss penalty

Example 16KB instruction cache + 16KB data cache vs. a 32KB unified cache; 36% of instructions are data transfers; a load/store takes 1 extra cc on the unified cache; 1-cc hit time; 200-cc miss penalty. Q1: does the split cache or the unified cache have the lower miss rate? Q2: what is the average memory access time of each?

Example: miss rates

Q1

Q2
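A sketch of Q1/Q2; the misses-per-1000-instructions figures used below are taken from the textbook's Figure B.6 (3.82 instruction, 40.9 data, 43.3 unified) and should be treated as assumptions here:

```python
# Split (16KB+16KB) vs unified (32KB) cache, per the example's givens.
HIT, PENALTY = 1, 200
instr_frac = 1 / 1.36        # 1 instruction fetch per instruction
data_frac  = 0.36 / 1.36     # 0.36 data accesses per instruction

# Q1: miss rate = misses / memory accesses
mr_instr   = (3.82 / 1000) / 1.0
mr_data    = (40.9 / 1000) / 0.36
mr_split   = instr_frac * mr_instr + data_frac * mr_data
mr_unified = (43.3 / 1000) / 1.36
print(mr_split, mr_unified)  # the unified miss rate is slightly lower

# Q2: AMAT; the unified cache adds 1 cc to data accesses (extra cc above)
amat_split = (instr_frac * (HIT + mr_instr * PENALTY)
              + data_frac * (HIT + mr_data * PENALTY))
amat_unified = (instr_frac * (HIT + mr_unified * PENALTY)
                + data_frac * (HIT + 1 + mr_unified * PENALTY))
print(amat_split, amat_unified)  # split wins despite its higher miss rate
```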

Cache vs Processor Performance A lower average memory access time may still correspond to a higher CPU time (see the example on page B.19).

Out-of-Order Execution In out-of-order execution, stalls apply only to instructions that depend on an incomplete result; other instructions can continue, so the average miss penalty is effectively smaller.

Outline Cache Basics Cache Performance Cache Optimization

Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty

Techniques that reduce miss rate: larger block size; larger cache size; higher associativity.

Reducing Miss Rate Three categories of misses / root causes Compulsory: cold-start/first-reference misses. Capacity: the cache cannot hold all the blocks needed, so blocks are discarded and later retrieved. Conflict (collision) misses: with limited associativity, a block may be discarded and later retrieved because too many blocks map to the same set.

Opt #1: Larger Block Size Reduces compulsory misses by leveraging spatial locality. May increase conflict/capacity misses: fewer blocks fit in the cache.

Example given the above miss rates; assume memory takes 80 cc of overhead and then delivers 16 bytes every 2 cc. Q: which block size has the smallest average memory access time for each cache size?

Answer avg mem access time = hit time + miss rate x miss penalty. Assume a 1-cc hit time; for a 256-byte block in a 256KB cache (0.49% miss rate): avg mem access time = 1 + 0.49% x (80 + 2x256/16) ≈ 1.5 cc

Answer average memory access time
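A sketch of the miss-penalty model from this example; the only miss rate shown is the 0.49% value quoted in the worked answer above, since the slide's full table is a figure:

```python
# Block-size trade-off: larger blocks cost more cycles to fetch.

def miss_penalty(block_size):
    # 80-cc overhead, then 16 bytes every 2 cc, per the example.
    return 80 + 2 * (block_size / 16)

def amat(miss_rate, block_size, hit_time=1):
    return hit_time + miss_rate * miss_penalty(block_size)

print(amat(0.0049, 256))  # ~1.55 cc, matching the worked answer
```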

Opt #2: Larger Cache Reduce capacity misses Increase hit time, cost, and power

Opt #3: Higher Associativity Reduce conflict misses Increase hit time

Example assume that higher associativity comes with a higher clock cycle time (for example, the 8-way clock cycle is 1.52x the 1-way clock cycle, as used in the answer below); assume a 1-cc hit time for 1-way, a 25-cc miss penalty, and the miss rates in the following table.

Miss rates

Question: for which cache sizes is each of the statements true?

Answer for a 512 KB, 8-way set associative cache: avg mem access time = hit time + miss rate x miss penalty = 1.52 x 1 + miss rate(512KB, 8-way) x 25 = 1.66

Answer average memory access time
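A sketch of the associativity trade-off; the 8-way 1.52x clock factor is quoted in the answer above, while the 2-way (1.36x) and 4-way (1.44x) factors and the demo miss rates are assumptions in the spirit of the same textbook example:

```python
# Higher associativity lowers the miss rate but stretches the hit time,
# so it only pays off while the miss-rate drop outweighs the slowdown.

SCALE = {1: 1.00, 2: 1.36, 4: 1.44, 8: 1.52}  # clock scaling (assumed)

def amat(ways, miss_rate, miss_penalty=25):
    return SCALE[ways] * 1.0 + miss_rate * miss_penalty

print(amat(1, 0.009), amat(8, 0.006))  # illustrative miss rates
```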

Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty Techniques that reduce miss penalty: multilevel caches; prioritizing reads over writes.

Opt #4: Multilevel Cache Reduces miss penalty. Motivation: should the cache be faster (and smaller) to keep pace with the speed of processors, or larger to overcome the widening gap between processor and main memory? With two levels, it can be both.

Opt #4: Multilevel Cache Two-level cache: add another level of cache between the original cache and memory. L1: small enough to match the clock cycle time of the fast processor. L2: large enough to capture many accesses that would otherwise go to main memory, lessening the effective miss penalty.

Opt #4: Multilevel Cache Average memory access time =Hit time L1 + Miss rate L1 x Miss penalty L1 =Hit time L1 + Miss rate L1 x(Hit time L2 +Miss rate L2 xMiss penalty L2 ) Average mem stalls per instruction =Misses per instruction L1 x Hit time L2 + Misses per instr L2 x Miss penalty L2

Opt #4: Multilevel Cache Local miss rate: the number of misses in a cache divided by the total number of memory accesses to this cache: Miss rate L1, Miss rate L2. Global miss rate: the number of misses in a cache divided by the total number of memory accesses generated by the processor: Miss rate L1 for L1, Miss rate L1 x Miss rate L2 for L2.

Example 1000 mem references -> 40 misses in L1 and 20 misses in L2; miss penalty from L2 is 200 cc; hit time of L2 is 10 cc; hit time of L1 is 1 cc; 1.5 mem references per instruction; Q: 1. various miss rates? 2. avg mem access time? 3. avg stall cycles per instruction?

Answer 1. various miss rates? L1: local = global = 40/1000 = 4% L2: local = 20/40 = 50%; global = 20/1000 = 2%

Answer 2. avg mem access time? average memory access time = Hit time L1 + Miss rate L1 x (Hit time L2 + Miss rate L2 x Miss penalty L2) = 1 + 4% x (10 + 50% x 200) = 5.4

Answer 3. avg stall cycles per instruction? average stall cycles per instruction =Misses per instruction L1 x Hit time L2 + Misses per instr L2 x Miss penalty L2 =(1.5x40/1000)x10+(1.5x20/1000)x200 =6.6
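A short sketch (illustrative Python, not from the slides) that reproduces all three answers from the example's givens:

```python
# Two-level cache example: 1000 references, 40 L1 misses, 20 L2 misses.
refs, l1_misses, l2_misses = 1000, 40, 20
hit_l1, hit_l2, penalty_l2 = 1, 10, 200
refs_per_instr = 1.5

mr_l1_local  = l1_misses / refs        # 4% (local = global for L1)
mr_l2_local  = l2_misses / l1_misses   # 50%
mr_l2_global = l2_misses / refs        # 2%

amat = hit_l1 + mr_l1_local * (hit_l2 + mr_l2_local * penalty_l2)
print(amat)  # 5.4 cc

stalls_per_instr = (refs_per_instr * mr_l1_local * hit_l2
                    + refs_per_instr * mr_l2_global * penalty_l2)
print(stalls_per_instr)  # 6.6 cc
```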

Opt #5: Prioritize read misses over writes Reduces miss penalty: instead of simply stalling a read miss until the write buffer empties, check the contents of the write buffer; if there are no conflicts and the memory system is available, let the read miss continue.

Opt #5: Prioritize read misses over writes Why? Consider the textbook's code sequence SW R3, 512(R0); LW R1, 1024(R0); LW R2, 512(R0), and assume a direct-mapped, write-through cache that maps 512 and 1024 to the same block, with a four-word write buffer that is not checked on a read miss. Is R2 ≡ R3? The SW puts R3's value in the write buffer; the first LW evicts the block holding address 512; if the second LW's miss is serviced from memory before the buffer drains, R2 receives the old, stale value.
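A toy simulation (illustrative Python, not the slide's) of this hazard and of the Opt #5 fix, i.e., checking the write buffer on a read miss:

```python
# Write-through cache with a write buffer that has not yet drained.
memory = {512: 0, 1024: 0}   # stale value 0 still at address 512
cache, write_buffer = {}, []

def write(addr, value):                 # write-through via the buffer
    cache[addr] = value
    write_buffer.append((addr, value))  # drains to memory later

def read(addr, check_buffer):
    if addr in cache:
        return cache[addr]
    if check_buffer:                    # Opt #5: scan buffer on a miss
        for a, v in reversed(write_buffer):
            if a == addr:
                return v
    cache[addr] = memory[addr]          # miss serviced from stale memory
    return cache[addr]

write(512, 99)                          # SW R3, 512(R0): R3 is 99
cache.pop(512)                          # LW R1, 1024(R0) evicts 512's block
print(read(512, check_buffer=False))    # 0: stale, so R2 != R3
cache.pop(512)                          # replay the same miss...
print(read(512, check_buffer=True))     # 99: buffer check restores R2 == R3
```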

Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty Technique that reduces hit time: avoid address translation during indexing of the cache.

Opt #6: Avoid address translation during cache indexing Cache addressing: virtual address (virtual cache) vs. physical address (physical cache). The processor/program works with virtual addresses: Processor -> address translation -> Cache. So: virtual cache or physical cache?

Opt #6: Avoid address translation during cache indexing Virtually indexed, physically tagged: use the page offset (identical in the virtual and physical addresses) to index the cache; use the physical address for the tag match. Limitation: a direct-mapped cache can then be no bigger than the page size. Reference: CPU Cache (Wikipedia).
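A small sketch of the resulting size constraint; the 4KB page size is an assumption, not a value from the slides:

```python
# Virtually indexed, physically tagged: index + offset bits must fit
# within the page offset, so each way is capped at one page.

PAGE_SIZE = 4096  # assumed 4KB pages

def max_vipt_cache_size(associativity, page_size=PAGE_SIZE):
    # Associativity relaxes the cap: total size = ways x page size.
    return associativity * page_size

print(max_vipt_cache_size(1))  # 4096: direct-mapped VIPT caps at 4 KB
print(max_vipt_cache_size(8))  # 32768: 8 ways allow a 32 KB cache
```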

Questions?