Computer Architecture: Memory Hierarchy (Memory/Storage Architecture Lab)

2 Technology Trends

Year  DRAM capacity  $/GB
1980  64 Kbit        $1,500,000
1983  256 Kbit       $500,000
1985  1 Mbit         $200,000
1989  4 Mbit         $50,000
1992  16 Mbit        $15,000
1996  64 Mbit        $10,000
1998  128 Mbit       $4,000
2000  256 Mbit       $1,000
2004  512 Mbit       $250
2007  1 Gbit         $50

3 Memory Hierarchy

"Ideally one would desire an indefinitely large memory capacity such that any particular … word would be immediately available … We are … forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible." (Burks, Goldstine, and von Neumann, 1946)

[Figure: levels in the memory hierarchy, from Level 1 nearest the CPU down to Level n; the size of the memory at each level grows moving down, cost per byte decreases moving down, and speed and bandwidth increase moving up toward the CPU]

4 Memory Technology (Big Picture)

[Figure: processor (control + datapath) backed by a hierarchy of memories; moving away from the processor, speed goes from fastest to slowest, size from smallest to biggest, and cost per byte from highest to lowest]

5 Memory Technology (Real-world Realization)

[Figure: registers and on-chip caches inside the processor, off-chip cache levels (SRAM), main memory (DRAM), and secondary storage (disk)]

            Register   Cache     Main Memory   Disk Memory
Speed       <1 ns      <5 ns     50 ns~70 ns   5 ms~20 ms
Size        100 B      KB→MB     MB→GB         GB→TB
Management  Compiler   Hardware  OS            OS

6 Memory Hierarchy

• An optimization resulting from a perfect match between memory technology and two types of program locality
  − Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
  − Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon.
• Goal: to provide a “virtual” memory technology (an illusion) that has the access time of the highest-level memory with the size and cost of the lowest-level memory

7 Temporal and Spatial Localities

[Figure: memory-reference trace illustrating temporal and spatial localities. Source: Glass & Cao (1997 ACM SIGMETRICS)]

8 Memory Hierarchy Terminology

• Data are transferred between levels in units of blocks
• Hit: accessed data is found in the upper level
  − Hit rate = fraction of accesses found in the upper level
  − Hit time = time to access the upper level
• Miss: accessed data is found only in a lower level
  − The processor waits until the data is fetched from the next level, then restarts/continues the access
  − Miss rate = 1 − hit rate
  − Miss penalty = time to get the block from the lower level + time to replace it in the upper level
• Average memory access time = hit time + miss rate × miss penalty
  − Because hit time << miss penalty, the average memory access time is far below the worst-case access time
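As a worked example (illustrative numbers, not from the slides): with a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty, average memory access time = 1 + 0.05 × 100 = 6 cycles. The average stays close to the 1-cycle hit time rather than the 101-cycle worst case, which is the whole point of the hierarchy, and it holds only as long as the miss rate stays low.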

9 (CPU) Cache

• Upper level: SRAM (small, fast, expensive); lower level: DRAM (large, slow, cheap)
• Goal: to provide a “virtual” memory technology that has the access time of SRAM with the size and cost of DRAM
• Additional benefits
  − Reduces the memory bandwidth consumed by the processor, leaving more memory bandwidth for I/O
  − No need to change the ISA

10 Direct-mapped Cache

• Each memory block is mapped to a single cache block
• The mapped cache block is determined by
    cache block index = (memory block address) mod (number of cache blocks)

11 Direct-Mapped Cache Example

• Consider a direct-mapped cache with 4-byte blocks (1 word per block) and a total capacity of 4 KB (1024 blocks)
  − The 2 lowest address bits specify the byte within a block
  − The next 10 address bits specify the block's index within the cache
  − The 20 highest address bits are the unique tag for this memory block
  − The valid bit specifies whether the block is an accurate copy of memory
• Exploits temporal locality
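The address split can be written directly in C. A minimal sketch for this specific 4 KB, 4-byte-block cache (the variable names are ours; the bit widths come from the slide):

```c
#include <stdio.h>

/* Decompose a 32-bit address for a direct-mapped cache with
 * 4-byte blocks and 1024 blocks (4 KB total capacity). */
int main(void) {
    unsigned addr = 16384;                 /* example address from the slides */
    unsigned offset = addr & 0x3;          /* lowest 2 bits: byte within block */
    unsigned index  = (addr >> 2) & 0x3FF; /* next 10 bits: cache block index */
    unsigned tag    = addr >> 12;          /* highest 20 bits: tag */
    printf("addr %u -> tag %u, index %u, offset %u\n",
           addr, tag, index, offset);      /* addr 16384 -> tag 4, index 0, offset 0 */
    return 0;
}
```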

12 On Cache Read

• On a cache hit, the CPU proceeds normally
• On a cache miss (handled completely by hardware)
  − Stall the CPU pipeline
  − Fetch the missed block from the next level of the hierarchy
  − Instruction cache miss: restart the instruction fetch
  − Data cache miss: complete the data access

13 On Cache Write

• Write-through
  − Always write the data into both the cache and main memory
  − Simple, but slow, and it increases memory traffic (requires a write buffer)
• Write-back
  − Write the data into the cache only, and update main memory when a dirty block is replaced (requires a dirty bit and possibly a write buffer)
  − Fast, but complex to implement, and it causes a consistency problem

14 Write Allocation

• What should happen on a write miss?
• Alternatives for write-through
  − Allocate on miss: fetch the block into the cache
  − Write around: don't fetch the block
    − Useful because programs often write a whole block before reading it (e.g., initialization)
• For write-back
  − Usually fetch the block
• The sketch below shows how these policies combine.
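To make the combinations concrete, here is a minimal, hedged sketch of cache write handling. The `Line` type, the one-word block, and the stub memory functions are our simplifications for illustration, not the slides' design:

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool valid, dirty;   /* dirty is only meaningful for write-back */
    unsigned tag;
    unsigned data;       /* one word per block, for brevity */
} Line;

typedef enum { WRITE_THROUGH, WRITE_BACK } Policy;

static void mem_write(unsigned addr, unsigned v) { printf("mem[%u] <- %u\n", addr, v); }
static unsigned mem_read(unsigned addr) { printf("fetch mem[%u]\n", addr); return 0; }

static void cache_write(Line *l, unsigned addr, unsigned v,
                        Policy policy, bool allocate_on_miss) {
    bool hit = l->valid && l->tag == addr;   /* one-line cache: tag is the address */

    if (!hit && !allocate_on_miss) {         /* write around */
        mem_write(addr, v);                  /* update memory only */
        return;
    }
    if (!hit) {                              /* allocate on miss */
        if (l->valid && l->dirty)
            mem_write(l->tag, l->data);      /* write back the dirty victim first */
        l->data = mem_read(addr);            /* fetch the block */
        l->tag = addr;
        l->valid = true;
        l->dirty = false;
    }
    l->data = v;                             /* update the cached copy */
    if (policy == WRITE_THROUGH)
        mem_write(addr, v);                  /* memory kept up to date every time */
    else
        l->dirty = true;                     /* memory updated only at eviction */
}

int main(void) {
    Line l = { false, false, 0, 0 };
    cache_write(&l, 8, 42, WRITE_BACK, true);  /* miss: fetch, then mark dirty */
    cache_write(&l, 8, 43, WRITE_BACK, true);  /* hit: cache only, still dirty */
    return 0;
}
```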

15 Memory Reference Sequence

• Look at the following sequence of memory references for the previous direct-mapped cache (4 KB, 4-byte blocks): 0, 4, 8188, 0, 16384, 0

Index  Valid  Tag   Data
0      0      XXXX
1      0      XXXX
…
1022   0      XXXX
1023   0      XXXX

Cache initially empty

16 After Reference 1

• Reference sequence: 0, 4, 8188, 0, 16384, 0
• Address 0 → tag 0, index 0: miss; place the block at index 0

Index  Valid  Tag   Data
0      1      0     Memory bytes 0…3 (copy)
1      0      XXXX
…
1022   0      XXXX
1023   0      XXXX

17 After Reference 2

• Reference sequence: 0, 4, 8188, 0, 16384, 0
• Address 4 → tag 0, index 1: miss; place the block at index 1

Index  Valid  Tag   Data
0      1      0     Memory bytes 0…3 (copy)
1      1      0     Memory bytes 4…7 (copy)
…
1022   0      XXXX
1023   0      XXXX

18 After Reference 3

• Reference sequence: 0, 4, 8188, 0, 16384, 0
• Address 8188 → tag 1, index 1023: miss; place the block at index 1023

Index  Valid  Tag   Data
0      1      0     Memory bytes 0…3 (copy)
1      1      0     Memory bytes 4…7 (copy)
…
1022   0      XXXX
1023   1      1     Memory bytes 8188…8191 (copy)

19 After Reference 4

• Reference sequence: 0, 4, 8188, 0, 16384, 0
• Address 0 → tag 0, index 0: hit on the block at index 0 (cache state unchanged)

20 After Reference 5

• Reference sequence: 0, 4, 8188, 0, 16384, 0
• Address 16384 → tag 4, index 0 (the same index as address 0!): miss; replace the block at index 0

Index  Valid  Tag   Data
0      1      4     Memory bytes 16384…16387 (copy)
1      1      0     Memory bytes 4…7 (copy)
…
1022   0      XXXX
1023   1      1     Memory bytes 8188…8191 (copy)

21 After Reference 6

• Reference sequence: 0, 4, 8188, 0, 16384, 0
• Address 0 → tag 0, index 0: misses again; replace the block at index 0
• Total: 1 hit and 5 misses; addresses 0 and 16384 keep evicting each other at index 0 (the simulator sketch below replays this trace)

Index  Valid  Tag   Data
0      1      0     Memory bytes 0…3 (copy)
1      1      0     Memory bytes 4…7 (copy)
…
1022   0      XXXX
1023   1      1     Memory bytes 8188…8191 (copy)
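The whole walkthrough can be checked with a few lines of C. This is a minimal simulator of the slides' 4 KB direct-mapped cache (it tracks tags only, no data array):

```c
#include <stdio.h>
#include <stdbool.h>

#define NBLOCKS 1024   /* 4 KB / 4-byte blocks */

int main(void) {
    bool valid[NBLOCKS] = { false };
    unsigned tag[NBLOCKS] = { 0 };
    unsigned refs[] = { 0, 4, 8188, 0, 16384, 0 };   /* the slides' trace */
    int hits = 0, misses = 0;

    for (int i = 0; i < 6; i++) {
        unsigned addr  = refs[i];
        unsigned index = (addr >> 2) % NBLOCKS;  /* skip the 2-bit byte offset */
        unsigned t     = addr >> 12;             /* remaining 20 bits */
        if (valid[index] && tag[index] == t) {
            hits++;
            printf("addr %5u: hit  (index %u)\n", addr, index);
        } else {
            misses++;
            valid[index] = true;                 /* fetch and (re)place */
            tag[index] = t;
            printf("addr %5u: miss (index %u, tag %u)\n", addr, index, t);
        }
    }
    printf("%d hit(s), %d miss(es)\n", hits, misses);  /* 1 hit, 5 misses */
    return 0;
}
```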

22 Exploiting Spatial Locality: Block Size Larger than One Word

[Figure: a 16 KB direct-mapped cache with 64 B (16-word) blocks]

23 Miss Rate vs. Block Size

[Figure: miss rate vs. block size]

24 Set-Associative Caches

• Allow multiple entries per index to improve hit rates
  − An n-way set-associative cache allows up to n conflicting references to be cached
    − n is the number of cache blocks in each set
    − n comparisons are needed to search all blocks in the set in parallel
    − When there is a conflict, which block should be replaced? (This was easy for direct-mapped caches: there's only one candidate!)
  − Fully-associative caches: a single (very large!) set allows a memory location to be placed in any cache block
  − Direct-mapped caches are essentially 1-way set-associative caches
• For fixed cache capacity, higher associativity leads to higher hit rates
  − Because more combinations of memory blocks can be present in the cache
• Set associativity optimizes cache contents, but at what cost?

25 Cache Organization Spectrum

[Figure: cache organization spectrum]

26 Implementation of Set-Associative Cache

[Figure: implementation of a set-associative cache]

27 Cache Organization Example

[Figure: an eight-block cache organized four ways:
  − one-way set associative (direct mapped): 8 sets, each holding one block (tag + data)
  − two-way set associative: 4 sets, each holding two blocks
  − four-way set associative: 2 sets, each holding four blocks
  − eight-way set associative (fully associative): 1 set holding all eight blocks]

28 Cache Block Replacement Policy

• Direct-mapped caches
  − No replacement policy is needed, since each memory block can be placed in only one cache block
• N-way set-associative caches
  − Each memory block can be placed in any of the n cache blocks of the mapped set
  − A Least Recently Used (LRU) replacement policy is typically used to select the block to be replaced among the blocks in the mapped set
  − LRU replaces the block that has not been used for the longest time (see the sketch below)
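One straightforward way to model LRU in software is a per-block timestamp; real hardware usually uses cheaper approximations. A minimal sketch (the timestamp scheme, the names, and the 4-way width are our choices for illustration):

```c
#include <stdio.h>
#include <stdbool.h>

#define WAYS 4

typedef struct {
    bool valid;
    unsigned tag;
    unsigned long last_used;  /* value of a global access counter at last touch */
} Way;

/* Pick the way to (re)fill: an invalid way if one exists,
 * otherwise the least recently used one. */
static int choose_victim(const Way set[WAYS]) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;                               /* free slot wins */
        if (set[w].last_used < set[victim].last_used)
            victim = w;                             /* older access: better victim */
    }
    return victim;
}

int main(void) {
    /* All ways valid; way 2 was touched longest ago. */
    Way set[WAYS] = {
        { true, 10, 40 }, { true, 11, 37 }, { true, 12, 5 }, { true, 13, 22 }
    };
    printf("victim = way %d\n", choose_victim(set));  /* prints: victim = way 2 */
    return 0;
}
```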

29 Miss Rate vs. Set Associativity

[Figure: miss rate vs. set associativity]

30 Memory Reference Sequence

• Look again at the same sequence of memory references, now for a 2-way set-associative cache of the same capacity (4 KB) with two-word (8-byte) blocks, i.e., 256 sets of two blocks: 0, 4, 8188, 0, 16384, 0
• This sequence had 5 misses and 1 hit for the direct-mapped cache with the same capacity

Set    Way 0: Valid  Tag   Data        Way 1: Valid  Tag   Data
0      0             XXXX              0             XXXX
…
255    0             XXXX              0             XXXX

Cache initially empty

31 After Reference 1

• Reference sequence: 0, 4, 8188, 0, 16384, 0
• Address 0 → tag 0, set 0: miss; place in the first block of set 0

Set    Way 0: Valid  Tag   Data                      Way 1: Valid  Tag   Data
0      1             0     Memory bytes 0…7 (copy)   0             XXXX
…
255    0             XXXX                             0             XXXX

32 After Reference 2

• Reference sequence: 0, 4, 8188, 0, 16384, 0
• Address 4 → tag 0, set 0: hit on the first block of set 0 (byte 4 lies in the same 8-byte block as byte 0, so the larger block turns this former miss into a hit; cache state unchanged)

33 After Reference 3

• Reference sequence: 0, 4, 8188, 0, 16384, 0
• Address 8188 → tag 3, set 255: miss; place in the first block of set 255

Set    Way 0: Valid  Tag   Data                           Way 1: Valid  Tag   Data
0      1             0     Memory bytes 0…7 (copy)        0             XXXX
…
255    1             3     Memory bytes 8184…8191 (copy)  0             XXXX

34 After Reference 4

• Reference sequence: 0, 4, 8188, 0, 16384, 0
• Address 0 → tag 0, set 0: hit on the first block of set 0 (cache state unchanged)

35 After Reference 5

• Reference sequence: 0, 4, 8188, 0, 16384, 0
• Address 16384 → tag 8, set 0: miss; place in the second block of set 0 (this time no eviction is needed)

Set    Way 0: Valid  Tag   Data                           Way 1: Valid  Tag   Data
0      1             0     Memory bytes 0…7 (copy)        1             8     Memory bytes 16384…16391 (copy)
…
255    1             3     Memory bytes 8184…8191 (copy)  0             XXXX

36 After Reference 6

• Reference sequence: 0, 4, 8188, 0, 16384, 0
• Address 0 → tag 0, set 0: hit on the first block of set 0, which associativity kept resident (cache state unchanged)
• Total: 3 hits and 3 misses, versus 1 hit and 5 misses for the direct-mapped cache (the simulator sketch below replays this trace)
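As with the direct-mapped case, a short C simulator can replay the trace. This is a minimal sketch of the slides' 2-way cache with a one-bit-per-set LRU; as a simplification it also uses that bit to pick which way to fill, which happens to be correct for this trace:

```c
#include <stdio.h>
#include <stdbool.h>

#define SETS 256   /* 4 KB / (8-byte blocks * 2 ways) */

int main(void) {
    bool valid[SETS][2] = {{ false }};
    unsigned tag[SETS][2] = {{ 0 }};
    int lru[SETS] = { 0 };                   /* which way to evict/fill next */
    unsigned refs[] = { 0, 4, 8188, 0, 16384, 0 };
    int hits = 0, misses = 0;

    for (int i = 0; i < 6; i++) {
        unsigned addr = refs[i];
        unsigned set  = (addr >> 3) % SETS;  /* 3-bit offset, then 8-bit set index */
        unsigned t    = addr >> 11;
        int hit_way = -1;
        for (int w = 0; w < 2; w++)
            if (valid[set][w] && tag[set][w] == t)
                hit_way = w;
        if (hit_way >= 0) {
            hits++;
            lru[set] = 1 - hit_way;          /* the other way becomes LRU */
            printf("addr %5u: hit  (set %u)\n", addr, set);
        } else {
            misses++;
            int w = lru[set];                /* fill the LRU (or still-empty) way */
            valid[set][w] = true;
            tag[set][w] = t;
            lru[set] = 1 - w;
            printf("addr %5u: miss (set %u, tag %u)\n", addr, set, t);
        }
    }
    printf("%d hit(s), %d miss(es)\n", hits, misses);  /* 3 hits, 3 misses */
    return 0;
}
```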

37 Improving Cache Performance

• Cache performance is determined by
  − Average memory access time = hit time + (miss rate × miss penalty)
• Decrease hit time
  − Make the cache smaller, but the miss rate increases
  − Use direct mapping, but the miss rate increases
• Decrease miss rate
  − Make the cache larger, but this can increase hit time
  − Add associativity, but this can increase hit time
  − Increase the block size, but this increases the miss penalty
• Decrease miss penalty
  − Reduce the transfer-time component of the miss penalty
  − Add another level of cache (see the sketch below)
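A second cache level shrinks the effective miss penalty: the L1 miss penalty becomes the average access time of the L2/memory pair. A small sketch with made-up parameters, continuing the single-level example above (which came to 6 cycles):

```c
#include <stdio.h>

/* Illustrative two-level AMAT calculation; all numbers below are
 * assumed parameters, not measurements from the slides. */
int main(void) {
    double l1_hit       = 1.0;    /* cycles */
    double l1_miss_rate = 0.05;
    double l2_hit       = 10.0;   /* cycles, paid on every L1 miss */
    double l2_miss_rate = 0.20;   /* local miss rate of L2 */
    double mem_penalty  = 100.0;  /* cycles */

    /* The L1 miss penalty is itself an AMAT over L2 and memory. */
    double l1_miss_penalty = l2_hit + l2_miss_rate * mem_penalty;
    double amat = l1_hit + l1_miss_rate * l1_miss_penalty;

    printf("AMAT = %.2f cycles\n", amat);  /* 1 + 0.05*(10 + 0.2*100) = 2.50 */
    return 0;
}
```

With these numbers the two-level hierarchy cuts the average access time from 6 to 2.5 cycles without touching L1's hit time.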

38 Current Cache Organizations

L1 caches (per core)
• Intel Nehalem: I-cache 32 KB, 64-byte blocks, 4-way, approx. LRU replacement, hit time n/a; D-cache 32 KB, 64-byte blocks, 8-way, approx. LRU replacement, write-back/allocate, hit time n/a
• AMD Opteron X4: I-cache 32 KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles; D-cache 32 KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 9 cycles

L2 unified cache (per core)
• Intel Nehalem: 256 KB, 64-byte blocks, 8-way, approx. LRU replacement, write-back/allocate, hit time n/a
• AMD Opteron X4: 512 KB, 64-byte blocks, 16-way, approx. LRU replacement, write-back/allocate, hit time n/a

L3 unified cache (shared)
• Intel Nehalem: 8 MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a
• AMD Opteron X4: 2 MB, 64-byte blocks, 32-way, replaces the block shared by the fewest cores, write-back/allocate, hit time 32 cycles

n/a: data not available

39 Cache Coherence Problem

• Suppose two CPU cores share a physical address space, with write-through caches

Time step  Event                 CPU A's cache  CPU B's cache  Memory
0                                                              0
1          CPU A reads X         0                             0
2          CPU B reads X         0              0              0
3          CPU A writes 1 to X   1              0              1

• After step 3, CPU B's cache still holds the stale value 0 for X

40 Snoopy Protocols

• Write-invalidate protocol
  − On a write to shared data, an invalidate is sent to all caches, which snoop the bus and invalidate any copies
• Write-broadcast (write-update) protocol
  − On a write to shared data, the new value is broadcast on the bus; processors snoop and update their copies
• Write serialization: the bus serializes requests
  − The bus is the single point of arbitration

41 Write-Invalidate Protocol

• A cache gets exclusive access to a block when the block is to be written
  − It broadcasts an invalidate message on the bus
  − A subsequent read in another cache misses
    − The owning cache supplies the updated value

CPU activity         Bus activity       CPU A's cache  CPU B's cache  Memory
                                                                      0
CPU A reads X        Cache miss for X   0                             0
CPU B reads X        Cache miss for X   0              0              0
CPU A writes 1 to X  Invalidate for X   1                             0
CPU B reads X        Cache miss for X   1              1              1
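The table above can be replayed with a toy model. This is a hedged sketch for a single location X with two caches on one bus; a real protocol tracks per-block states (e.g., MSI), and all names here are our own:

```c
#include <stdio.h>

/* Toy two-cache write-invalidate model for one memory location X. */
typedef struct { int valid; int value; } Copy;

Copy cache[2];        /* cache[0] = CPU A, cache[1] = CPU B */
int memory = 0;

int cpu_read(int id) {
    if (!cache[id].valid) {                       /* read miss goes on the bus */
        int other = 1 - id;
        /* If the other cache holds the (possibly newer) value, it supplies it;
         * memory is updated at the same time, as in the slide's last row. */
        int v = cache[other].valid ? cache[other].value : memory;
        memory = v;
        cache[id] = (Copy){ 1, v };
        printf("CPU %c: read miss, X = %d\n", 'A' + id, v);
    } else {
        printf("CPU %c: read hit,  X = %d\n", 'A' + id, cache[id].value);
    }
    return cache[id].value;
}

void cpu_write(int id, int v) {
    cache[1 - id].valid = 0;                      /* invalidate the other copy */
    cache[id] = (Copy){ 1, v };                   /* exclusive, dirty copy */
    printf("CPU %c: write X = %d (invalidate broadcast)\n", 'A' + id, v);
}

int main(void) {                                  /* replay the slide's trace */
    cpu_read(0);      /* A reads X: miss, X = 0 */
    cpu_read(1);      /* B reads X: miss, X = 0 */
    cpu_write(0, 1);  /* A writes 1: B's copy invalidated, memory still 0 */
    cpu_read(1);      /* B reads X: miss, A supplies 1, memory becomes 1 */
    return 0;
}
```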

42 Summary

• Memory hierarchies are an optimization resulting from a perfect match between memory technology and two types of program locality
  − Temporal locality
  − Spatial locality
• The goal is to provide a “virtual” memory technology (an illusion) that has the access time of the highest-level memory with the size and cost of the lowest-level memory
• Cache memory is an instance of a memory hierarchy
  − It exploits both temporal and spatial locality
  − Direct-mapped caches are simple and fast but have higher miss rates
  − Set-associative caches have lower miss rates but are more complex and slower
  − Multilevel caches are becoming increasingly popular
  − Cache coherence protocols ensure consistency among multiple caches