Lecture 21: Memory Hierarchy

Slides:



Advertisements
Similar presentations
Lecture 19: Cache Basics Today’s topics: Out-of-order execution
Advertisements

Lecture 8: Memory Hierarchy Cache Performance Kai Bu
Cache Here we focus on cache improvements to support at least 1 instruction fetch and at least 1 data access per cycle – With a superscalar, we might need.
1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 (and Appendix B) Memory Hierarchy Design Computer Architecture A Quantitative Approach,
1 Lecture 20: Cache Hierarchies, Virtual Memory Today’s topics:  Cache hierarchies  Virtual memory Reminder:  Assignment 8 will be posted soon (due.
CSCE 212 Chapter 7 Memory Hierarchy Instructor: Jason D. Bakos.
1 Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1)
1 Lecture 12: Cache Innovations Today: cache access basics and innovations (Sections )
1 Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections )
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon, Oct 31, 2005 Topic: Memory Hierarchy Design (HP3 Ch. 5) (Caches, Main Memory and.
1 Lecture 13: Cache Innovations Today: cache access basics and innovations, DRAM (Sections )
Virtual Memory Topics Virtual Memory Access Page Table, TLB Programming for locality Memory Mountain Revisited.
CSE 378 Cache Performance1 Performance metrics for caches Basic performance metric: hit ratio h h = Number of memory references that hit in the cache /
CS1104 – Computer Organization PART 2: Computer Architecture Lecture 10 Memory Hierarchy.
CSIE30300 Computer Architecture Unit 08: Cache Hsin-Chou Chi [Adapted from material by and
1 Chapter Seven. 2 Users want large and fast memories! SRAM access times are ns at cost of $100 to $250 per Mbyte. DRAM access times are ns.
Outline Cache writes DRAM configurations Performance Associative caches Multi-level caches.
Multilevel Caches Microprocessors are getting faster and including a small high speed cache on the same chip.
1 Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY. 2 SRAM: –value is stored on a pair of inverting gates –very fast but takes up more space than DRAM (4.
Princess Sumaya Univ. Computer Engineering Dept. Chapter 5:
1 Chapter Seven. 2 Users want large and fast memories! SRAM access times are ns at cost of $100 to $250 per Mbyte. DRAM access times are ns.
1 Lecture 20: OOO, Memory Hierarchy Today’s topics:  Out-of-order execution  Cache basics.
CSCI206 - Computer Organization & Programming
CMSC 611: Advanced Computer Architecture
COSC3330 Computer Architecture
CSE 351 Section 9 3/1/12.
Cache Memory and Performance
Multilevel Memories (Improving performance using alittle “cash”)
Virtual Memory Use main memory as a “cache” for secondary (disk) storage Managed jointly by CPU hardware and the operating system (OS) Programs share main.
Lecture: Cache Hierarchies
Cache Memory Presentation I
Consider a Direct Mapped Cache with 4 word blocks
Morgan Kaufmann Publishers Memory & Cache
Morgan Kaufmann Publishers
Lecture: Large Caches, Virtual Memory
Lecture: Cache Hierarchies
Lecture 23: Cache, Memory, Security
Lecture 21: Memory Hierarchy
Lecture 21: Memory Hierarchy
Rose Liu Electrical Engineering and Computer Sciences
Lecture 23: Cache, Memory, Virtual Memory
Lecture 08: Memory Hierarchy Cache Performance
Lecture 22: Cache Hierarchies, Memory
Lecture: Cache Innovations, Virtual Memory
Lecture 22: Cache Hierarchies, Memory
CPE 631 Lecture 05: Cache Design
Memory Hierarchy Memory: hierarchy of components of various speeds and capacities Hierarchy driven by cost and performance In early days Primary memory.
Performance metrics for caches
Performance metrics for caches
Lecture 24: Memory, VM, Multiproc
10/16: Lecture Topics Memory problem Memory Solution: Caches Locality
CDA 5155 Caches.
Adapted from slides by Sally McKee Cornell University
CMSC 611: Advanced Computer Architecture
Performance metrics for caches
Lecture 20: OOO, Memory Hierarchy
CS 704 Advanced Computer Architecture
EE108B Review Session #6 Daxia Ge Friday February 23rd, 2007
Lecture 20: OOO, Memory Hierarchy
Lecture 22: Cache Hierarchies, Memory
Lecture 11: Cache Hierarchies
Memory Hierarchy Memory: hierarchy of components of various speeds and capacities Hierarchy driven by cost and performance In early days Primary memory.
Performance metrics for caches
Cache - Optimization.
Cache Memory Rabi Mahapatra
Principle of Locality: Memory Hierarchies
Lecture 13: Cache Basics Topics: terminology, cache organization (Sections )
Lecture 14: Cache Performance
Performance metrics for caches
10/18: Lecture Topics Using spatial locality
Presentation transcript:

Lecture 21: Memory Hierarchy Today’s topics: Cache organization Cache hits/misses

OoO Wrap-up How large are the structures in an out-of-order processor? What are the pros and cons of having larger/smaller structures?

Cache Hierarchies Data and instructions are stored on DRAM chips – DRAM is a technology that has high bit density, but relatively poor latency – an access to data in memory can take as many as 300 cycles today! Hence, some data is stored on the processor in a structure called the cache – caches employ SRAM technology, which is faster, but has lower bit density Internet browsers also cache web pages – same concept

Memory Hierarchy As you go further, capacity and latency increase Disk 80 GB 10M cycles Memory 1GB 300 cycles L2 cache 2MB 15 cycles L1 data or instruction Cache 32KB 2 cycles Registers 1KB 1 cycle

Locality Why do caches work? Temporal locality: if you used some data recently, you will likely use it again Spatial locality: if you used some data recently, you will likely access its neighbors No hierarchy: average access time for data = 300 cycles 32KB 1-cycle L1 cache that has a hit rate of 95%: average access time = 0.95 x 1 + 0.05 x (301) = 16 cycles

a unique cache location. Accessing the Cache Byte address 101000 Offset 8-byte words 8 words: 3 index bits Direct-mapped cache: each address maps to a unique cache location. Sets Data array

The Tag Array Byte address 101000 Tag 8-byte words Compare Direct-mapped cache: each address maps to a unique address Tag array Data array

Example Access Pattern Byte address Assume that addresses are 8 bits long How many of the following address requests are hits/misses? 4, 7, 10, 13, 16, 68, 73, 78, 83, 88, 4, 7, 10… 101000 Tag 8-byte words Compare Direct-mapped cache: each address maps to a unique address Tag array Data array

Increasing Line Size Byte address A large cache line size  smaller tag array, fewer misses because of spatial locality 10100000 32-byte cache line size or block size Tag Offset Tag array Data array

Associativity Byte address Set associativity  fewer conflicts; wasted power because multiple data and tags are read 10100000 Tag Way-1 Way-2 Tag array Data array Compare

How many offset/index/tag bits if the cache has Associativity How many offset/index/tag bits if the cache has 64 sets, each set has 64 bytes, 4 ways Byte address 10100000 Tag Way-1 Way-2 Tag array Data array Compare

Example 1 32 KB 4-way set-associative data cache array with 32 byte line sizes How many sets? How many index bits, offset bits, tag bits? How large is the tag array?

Example 1 32 KB 4-way set-associative data cache array with 32 byte line sizes cache size = #sets x #ways x block size How many sets? 256 How many index bits, offset bits, tag bits? 8 5 19 How large is the tag array? tag array size = #sets x #ways x tag size = 19 Kb = 2.375 KB

Example 2 A pipeline has CPI 1 if all loads/stores are L1 cache hits 40% of all instructions are loads/stores 85% of all loads/stores hit in 1-cycle L1 50% of all (10-cycle) L2 accesses are misses Memory access takes 100 cycles What is the CPI?

Example 2 A pipeline has CPI 1 if all loads/stores are L1 cache hits 40% of all instructions are loads/stores 85% of all loads/stores hit in 1-cycle L1 50% of all (10-cycle) L2 accesses are misses Memory access takes 100 cycles What is the CPI? Start with 1000 instructions 1000 cycles (includes all 400 L1 accesses) + 400 (l/s) x 15% x 10 cycles (the L2 accesses) + 400 x 15% x 50% x 100 cycles (the mem accesses) = 4,600 cycles CPI = 4.6

Cache Misses On a write miss, you may either choose to bring the block into the cache (write-allocate) or not (write-no-allocate) On a read miss, you always bring the block in (spatial and temporal locality) – but which block do you replace? no choice for a direct-mapped cache randomly pick one of the ways to replace replace the way that was least-recently used (LRU) FIFO replacement (round-robin)

Writes When you write into a block, do you also update the copy in L2? write-through: every write to L1  write to L2 write-back: mark the block as dirty, when the block gets replaced from L1, write it to L2 Writeback coalesces multiple writes to an L1 block into one L2 write Writethrough simplifies coherency protocols in a multiprocessor system as the L2 always has a current copy of data

Types of Cache Misses Compulsory misses: happens the first time a memory word is accessed – the misses for an infinite cache Capacity misses: happens because the program touched many other words before re-touching the same word – the misses for a fully-associative cache Conflict misses: happens because two words map to the same location in the cache – the misses generated while moving from a fully-associative to a direct-mapped cache

Title Bullet