1 COMP 206: Computer Architecture and Implementation Montek Singh Mon, Oct 31, 2005 Topic: Memory Hierarchy Design (HP3 Ch. 5) (Caches, Main Memory and.

Presentation transcript:

1 COMP 206: Computer Architecture and Implementation Montek Singh Mon, Oct 31, 2005 Topic: Memory Hierarchy Design (HP3 Ch. 5) (Caches, Main Memory and Virtual Memory)

2 Outline
- Motivation for Caches
  - Principle of locality
- Levels of Memory Hierarchy
- Cache Organization
- Cache Read/Write Policies
  - Block replacement policies
  - Write-back vs. write-through caches
  - Write buffers
Reading: HP3 Sections

3 The Big Picture: Where are We Now?
- The Five Classic Components of a Computer: Control, Datapath, Memory, Input, Output (Control and Datapath together form the Processor)
- This lecture (and next few): Memory System

4 The Motivation for Caches
- Motivation
  - Large (cheap) memories (DRAM) are slow
  - Small (costly) memories (SRAM) are fast
- Make the average access time small
  - service most accesses from a small, fast memory
  - reduce the bandwidth required of the large memory
[Figure: Processor connected to a Memory System made up of a Cache and DRAM]

5 The Principle of Locality
- The Principle of Locality
  - A program accesses a relatively small portion of the address space at any instant of time
  - Example: 90% of time in 10% of the code
- Two different types of locality
  - Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon
  - Spatial Locality (locality in space): if an item is referenced, items close by tend to be referenced soon
[Figure: frequency of reference plotted over the address space, 0 to 2^n]

6 Levels of the Memory Hierarchy

Level               Capacity        Access Time   Cost/bit
CPU Registers       500 Bytes       0.25 ns       ~$.01
Cache (L1, L2, …)   16K-1M Bytes    1 ns          ~$.0001
Main Memory         64M-2G Bytes    100 ns        ~$
Disk                100 G Bytes     5 ms          cents
Tape/Network        “infinite”      secs          cents

Staging between adjacent levels:

Levels                  Transfer Unit   Managed by            Size
Registers / Cache       Words           programmer/compiler   1-8 bytes
Cache / Memory          Blocks          cache controller      bytes
Memory / Disk           Pages           OS                    4-64K bytes
Disk / Tape/Network     Files           user/operator         Mbytes

Upper levels are faster; lower levels are larger.

7 Memory Hierarchy: Principles of Operation
- At any given time, data is copied between only 2 adjacent levels
  - Upper Level (Cache): the one closer to the processor; smaller, faster, and uses more expensive technology
  - Lower Level (Memory): the one further away from the processor; bigger, slower, and uses less expensive technology
- Block: the smallest unit of information that can either be present or not present in the two-level hierarchy
[Figure: Blk X in the Upper Level (Cache) moves to and from the Processor; Blk Y sits in the Lower Level (Memory)]

8 Memory Hierarchy: Terminology
- Hit: data appears in some block in the upper level (e.g., Block X in previous slide)
  - Hit Rate = fraction of memory accesses found in the upper level
  - Hit Time = time to access the upper level = memory access time + time to determine hit/miss
- Miss: data needs to be retrieved from a block in the lower level (e.g., Block Y in previous slide)
  - Miss Rate = 1 - (Hit Rate)
  - Miss Penalty: includes time to fetch a new block from the lower level = time to replace a block in the upper level from the lower level + time to deliver the block to the processor
- Hit Time is significantly less than Miss Penalty
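The terms above are usually combined into one figure of merit, the average memory access time: hit time + miss rate × miss penalty. The sketch below only illustrates that formula; the function name and the sample numbers are assumptions, not values from the slides.

```python
def average_access_time(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time = hit time + miss rate * miss penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time_ns + miss_rate * miss_penalty_ns


# Illustrative numbers only: 1 ns hit time, 5% miss rate, 100 ns miss penalty.
print(average_access_time(1.0, 0.05, 100.0))
```

Because the miss penalty is so much larger than the hit time, even a small change in miss rate dominates the result.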

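9 Cache Addressing
[Figure: the cache holds Set 0 ... Set j-1; each set holds Block 0 ... Block k-1 plus replacement info; each block holds a Tag and Sector 0 ... Sector m-1; each sector holds Byte 0 ... Byte n-1 plus Valid, Dirty, and Shared bits]
- Block/line is the unit of allocation
- Sector/sub-block is the unit of transfer and coherence
- Cache parameters j, k, m, n are integers, and generally powers of 2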
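Since the cache parameters are powers of 2, an address can be split into fields with shifts and masks. The sketch below is one plausible way to do it, assuming a single sector per block (so sectors are ignored); `split_address` and its parameters are hypothetical names, not taken from the slides.

```python
def split_address(addr, num_sets, block_bytes):
    """Split a byte address into (tag, set index, byte offset).

    Assumes num_sets and block_bytes are powers of 2 and that each
    block has one sector, so no sector field is extracted.
    """
    offset_bits = block_bytes.bit_length() - 1   # log2(block_bytes)
    index_bits = num_sets.bit_length() - 1       # log2(num_sets)
    offset = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# 16 sets of 32-byte blocks: 5 offset bits, 4 index bits, the rest is tag.
print(split_address(0x1234, num_sets=16, block_bytes=32))  # (9, 1, 20)
```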

10 Cache Shapes (A = associativity, S = number of sets; 16 blocks in every case)
- Direct-mapped (A = 1, S = 16)
- 2-way set-associative (A = 2, S = 8)
- 4-way set-associative (A = 4, S = 4)
- 8-way set-associative (A = 8, S = 2)
- Fully associative (A = 16, S = 1)
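Every shape listed satisfies A × S = 16, so the shapes are simply the power-of-2 factorizations of the block count. A small sketch that enumerates them (the function name `cache_shapes` is an assumption for illustration):

```python
def cache_shapes(num_blocks):
    """Return (A, S) pairs with A * S == num_blocks and A a power of 2."""
    shapes = []
    ways = 1
    while ways <= num_blocks:
        if num_blocks % ways == 0:
            shapes.append((ways, num_blocks // ways))
        ways *= 2
    return shapes


print(cache_shapes(16))  # [(1, 16), (2, 8), (4, 4), (8, 2), (16, 1)]
```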

11 Examples of Cache Configurations

12 Storage Overhead of Cache

13 Cache Organization
- Direct Mapped Cache
  - Each memory location can be mapped to only 1 cache location
  - No need to make any decision :-) The current item replaces the previous item in that cache location
- N-way Set Associative Cache
  - Each memory location has a choice of N cache locations
- Fully Associative Cache
  - Each memory location can be placed in ANY cache location
- Cache miss in an N-way Set Associative or Fully Associative Cache
  - Bring in the new block from memory
  - Throw out a cache block to make room for the new block
  - Need to decide which block to throw out!
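The three organizations differ only in how many cache locations a given memory block may occupy. The sketch below lists those candidate locations, assuming the common modulo placement rule and assuming that the frames of one set are numbered contiguously; `candidate_frames` is a hypothetical helper.

```python
def candidate_frames(block_number, num_frames, ways):
    """List the cache frames where a memory block may be placed.

    ways == 1 gives direct mapped; ways == num_frames gives fully
    associative. Frame numbering within a set is an assumption.
    """
    num_sets = num_frames // ways
    set_index = block_number % num_sets      # modulo placement
    first = set_index * ways
    return list(range(first, first + ways))


print(candidate_frames(13, num_frames=8, ways=1))  # [5]
print(candidate_frames(13, num_frames=8, ways=2))  # [2, 3]
print(candidate_frames(13, num_frames=8, ways=8))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With more than one candidate, a miss forces the replacement decision described in the last bullet.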

14 Write Allocate versus Not Allocate
- Assume that a write to a memory location causes a cache miss
  - Do we read in the block?
    - Yes: Write Allocate
    - No: Write No-Allocate
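The effect of the choice shows up on the next access to the same block. The sketch below counts misses for a short trace under both policies; it assumes an unbounded cache (no replacement) and a hypothetical trace format of `(op, block)` pairs with `op` being `"r"` or `"w"`.

```python
def count_misses(accesses, write_allocate):
    """Count misses for a trace of (op, block) pairs.

    Assumes an unbounded cache, so blocks are never replaced.
    Read misses always bring the block in; write misses do so
    only under write allocate.
    """
    resident = set()
    misses = 0
    for op, block in accesses:
        if block in resident:
            continue                      # hit
        misses += 1
        if op == "r" or write_allocate:
            resident.add(block)           # block is read into the cache
    return misses


trace = [("w", 7), ("r", 7)]
print(count_misses(trace, write_allocate=True))   # 1
print(count_misses(trace, write_allocate=False))  # 2
```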

15 Basics of Cache Operation: Overview

16 Details of Simple Blocking Cache
[Figures: Write Through; Write Back]

17 A-way Set-Associative Cache
- A-way set associative: A entries for each cache index
  - A direct-mapped caches operating in parallel
- Example: Two-way set associative cache
  - Cache Index selects a “set” from the cache
  - The two tags in the set are compared in parallel
  - Data is selected based on the tag result
[Figure: two banks, each with Valid, Cache Tag, and Cache Data (Cache Block 0, ...) columns, both selected by the Cache Index; each bank's tag is compared with the address tag (Addr. Tag); the two Compare outputs are ORed to produce Hit and drive SEL0/SEL1 of a Mux that selects the Cache Block]
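The lookup described above can be sketched in software as follows. This is a minimal model, not the hardware: the tag comparisons that hardware performs in parallel become a loop, data storage is omitted, and the class and method names are assumptions.

```python
class SetAssociativeCache:
    """Tag store of an A-way set-associative cache (data omitted)."""

    def __init__(self, num_sets, ways, block_bytes):
        self.num_sets = num_sets
        self.block_bytes = block_bytes
        # Each way of each set holds a (valid, tag) pair.
        self.sets = [[(False, 0)] * ways for _ in range(num_sets)]

    def _index_and_tag(self, addr):
        block = addr // self.block_bytes
        return block % self.num_sets, block // self.num_sets

    def lookup(self, addr):
        """Return True on a hit: some valid way in the set matches the tag."""
        index, tag = self._index_and_tag(addr)
        # Hardware compares all ways at once and ORs the results.
        return any(valid and t == tag for valid, t in self.sets[index])

    def fill(self, addr, way):
        """Install the block containing addr into the given way."""
        index, tag = self._index_and_tag(addr)
        self.sets[index][way] = (True, tag)


cache = SetAssociativeCache(num_sets=4, ways=2, block_bytes=32)
cache.fill(0x100, way=0)
print(cache.lookup(0x104))  # True: same block as 0x100
print(cache.lookup(0x180))  # False: same set, different tag
```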

18 Fully Associative Cache
- Push the set-associative idea to its limit!
  - Forget about the Cache Index
  - Compare the Cache Tags of all cache tag entries in parallel
  - Example: Block Size = 32B, we need N 27-bit comparators
[Figure: the address is split into a Cache Tag (27 bits long) and a Byte Select (Ex: 0x01); each entry holds a Valid Bit, a Cache Tag, and Cache Data (Byte 0 ... Byte 31, Byte 32 ... Byte 63, ...); every entry's tag is compared with the address tag]
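The 27-bit tag in the example follows from the block size if addresses are 32 bits wide, which is an assumption here (the slide does not state the address width): 32-byte blocks need 5 byte-select bits, leaving 32 - 5 = 27 tag bits. A one-function sketch with a hypothetical name:

```python
def fully_associative_tag_bits(address_bits, block_bytes):
    """Tag width when there is no index field: address bits minus offset bits."""
    offset_bits = block_bytes.bit_length() - 1   # log2(block_bytes)
    return address_bits - offset_bits


print(fully_associative_tag_bits(address_bits=32, block_bytes=32))  # 27
```

One comparator of this width is needed per cache entry, which is why full associativity gets expensive as N grows.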

19 Cache Block Replacement Policies
- Random Replacement
  - Hardware randomly selects a cache item and throws it out
- Least Recently Used
  - Hardware keeps track of the access history
  - Replace the entry that has not been used for the longest time
  - For a 2-way set-associative cache, one bit per set is needed for LRU replacement
- Example of a Simple “Pseudo” LRU Implementation
  - Assume 64 Fully Associative entries
  - Hardware replacement pointer points to one cache entry
  - Whenever an access is made to the entry the pointer points to: move the pointer to the next entry
  - Otherwise: do not move the pointer
[Figure: Entry 0, Entry 1, ..., Entry 63 with a Replacement Pointer]
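The pointer-based "pseudo" LRU scheme can be sketched in a few lines. The class below follows the rule as stated on the slide; wrapping from the last entry back to entry 0 is an assumption, as are the class and method names.

```python
class PointerReplacement:
    """Replacement pointer that steps away from any entry it sees accessed."""

    def __init__(self, num_entries=64):
        self.num_entries = num_entries
        self.pointer = 0

    def access(self, entry):
        # Only an access to the pointed-to entry moves the pointer.
        if entry == self.pointer:
            self.pointer = (self.pointer + 1) % self.num_entries  # assumed wraparound

    def victim(self):
        """Entry to replace on the next miss."""
        return self.pointer


policy = PointerReplacement(num_entries=4)
policy.access(2)
print(policy.victim())  # 0: pointer did not move
policy.access(0)
print(policy.victim())  # 1: pointer stepped past the entry just used
```

The pointer never names the entry that was just touched through it, which approximates LRU without storing any per-entry history.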

20 Cache Write Policy
- Cache read is much easier to handle than cache write
  - Instruction cache is much easier to design than data cache
- Cache write
  - How do we keep data in the cache and memory consistent?
- Two options (decision time again :-)
  - Write Back: write to cache only. Write the cache block to memory when that cache block is being replaced on a cache miss
    - Needs a “dirty bit” for each cache block
    - Greatly reduces the memory bandwidth requirement
    - Control can be complex
  - Write Through: write to cache and memory at the same time
    - What!!! How can this be? Isn’t memory too slow for this?
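The bandwidth claim for write back can be made concrete by counting the writes that reach memory. The sketch below is deliberately simplified: it assumes every store hits in the cache and that each dirty block is replaced exactly once after all the stores. The function name and policy strings are hypothetical.

```python
def memory_writes(stores, policy):
    """Count writes reaching memory for a sequence of stored block numbers."""
    if policy == "write-through":
        # Every store is sent to memory as well as to the cache.
        return len(stores)
    if policy == "write-back":
        dirty = set()
        for block in stores:
            dirty.add(block)   # set the block's dirty bit; memory is untouched
        # Assumption: each dirty block is written back once, when replaced.
        return len(dirty)
    raise ValueError("unknown policy: " + policy)


stores = [1, 1, 1, 2]
print(memory_writes(stores, "write-through"))  # 4
print(memory_writes(stores, "write-back"))     # 2
```

Repeated stores to the same block are where write back saves memory traffic.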

21 Write Buffer for Write Through
[Figure: Processor and Cache, with a Write Buffer between the cache and DRAM]
- Write Buffer: needed between cache and main memory
  - Processor: writes data into the cache and the write buffer
  - Memory controller: writes contents of the buffer to memory
- Write buffer is just a FIFO
  - Typical number of entries: 4
  - Works fine if store frequency (w.r.t. time) << 1 / DRAM write cycle
- Memory system designer’s nightmare
  - Store frequency (w.r.t. time) > 1 / DRAM write cycle
  - Write buffer saturation
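A cycle-level sketch of the FIFO makes the two regimes visible. It assumes the processor issues one store every `store_interval` cycles, that DRAM retires one buffered write every `dram_write_cycles` cycles, and that a store finding the buffer full is simply counted as a stall rather than retried. All names and numbers are illustrative.

```python
from collections import deque


def simulate_write_buffer(store_interval, dram_write_cycles,
                          depth=4, total_cycles=1000):
    """Return how many stores found the write buffer full."""
    buffer = deque()
    stalls = 0
    drain_timer = 0
    for cycle in range(total_cycles):
        # Memory controller: retire the oldest entry once its write completes.
        if buffer:
            drain_timer += 1
            if drain_timer == dram_write_cycles:
                buffer.popleft()
                drain_timer = 0
        # Processor: issue a store every store_interval cycles.
        if cycle % store_interval == 0:
            if len(buffer) < depth:
                buffer.append(cycle)
            else:
                stalls += 1   # simplification: the store is counted, not retried
    return stalls


print(simulate_write_buffer(store_interval=20, dram_write_cycles=10))  # 0
print(simulate_write_buffer(store_interval=5, dram_write_cycles=10) > 0)  # True
```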

22 Write Buffer Saturation
[Figures: Processor, Cache, Write Buffer, DRAM; and the same system with an L2 Cache added]
- Store frequency (w.r.t. time) > 1 / DRAM write cycle
  - If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row):
    - Store buffer will overflow no matter how big you make it
    - CPU Cycle Time << DRAM Write Cycle Time
- Solutions for write buffer saturation
  - Use a write back cache
  - Install a second level (L2) cache
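Why a bigger buffer only postpones the overflow: when stores arrive faster than DRAM retires them, occupancy grows at a steady net rate, so the time to fill is proportional to the depth. The sketch below uses that rate argument as an approximation (it ignores cycle-level effects); the function name and parameters are assumptions.

```python
from fractions import Fraction


def cycles_until_full(depth, store_interval, dram_write_cycles):
    """Approximate cycles until a buffer of the given depth fills.

    Net fill rate = stores per cycle - DRAM writes per cycle.
    Returns None when the buffer drains at least as fast as it fills.
    """
    net_rate = Fraction(1, store_interval) - Fraction(1, dram_write_cycles)
    if net_rate <= 0:
        return None
    return depth / net_rate


print(cycles_until_full(4, store_interval=5, dram_write_cycles=10))   # 40
print(cycles_until_full(8, store_interval=5, dram_write_cycles=10))   # 80
print(cycles_until_full(4, store_interval=20, dram_write_cycles=10))  # None
```

Doubling the depth only doubles the time to saturation, which is why the slide's solutions reduce the store traffic reaching DRAM instead of enlarging the buffer.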