1 Chapter 2 (and Appendix B): Memory Hierarchy Design
Computer Architecture: A Quantitative Approach, Fifth Edition
Copyright © 2012, Elsevier Inc. All rights reserved.

2 In the beginning…
Main memory (i.e., RAM) was faster than processors: memory access times were less than clock cycle times.
Everything was “great” until ~1980.

3 Memory Performance Gap

4 Memory Performance Gap
Programmers want unlimited amounts of memory with low latency.
Fast memory technology is more expensive per bit than slower memory.
Solution: organize the memory system into a hierarchy:
- The entire addressable memory space is available in the largest, slowest memory.
- Incrementally smaller and faster memories, each containing a subset of the memory below it, step upward toward the processor.
- Temporal and spatial locality ensure that nearly all references can be found in the smaller memories.
- Gives the illusion of a large, fast memory being presented to the processor.

5 Memory Hierarchy

6 Memory Hierarchy Design
Memory hierarchy design becomes more crucial with recent multi-core processors:
Aggregate peak bandwidth grows with # cores:
- Intel Core i7 can generate two references per core per clock.
- Four cores and a 3.2 GHz clock => 25.6 billion 64-bit data references/second plus 12.8 billion 128-bit instruction references/second = 409.6 GB/s! (25.6 billion × 8 bytes + 12.8 billion × 16 bytes)
- DRAM bandwidth is only 6% of this (25 GB/s).
Requires:
- Multi-port, pipelined caches
- Two levels of cache per core
- Shared third-level cache on chip

7 Performance and Power
High-end microprocessors have >10 MB of on-chip cache, which consumes a large amount of the area and power budget.

8 Cache Hits
When a word is found in the cache, a hit occurs: Yay!

9 Cache Misses
When a word is not found in the cache, a miss occurs:
- Fetch the word from a lower level in the hierarchy, requiring a higher-latency reference.
- The lower level may be another cache or main memory.
- Also fetch the other words contained within the block (takes advantage of spatial locality).
- Place the block into the cache in a location determined by its address.

10 Causes of Misses
Miss rate: the fraction of cache accesses that result in a miss.
Causes of misses:
- Compulsory: first reference to a block.
- Capacity: blocks discarded because the cache cannot hold them all, then later retrieved.
- Conflict: the program makes repeated references to multiple addresses from different blocks that map to the same location in the cache.

11 Causes of Misses

12 Quantifying Misses
Note that speculative and multithreaded processors may execute other instructions during a miss, which reduces the performance impact of misses.

13 Cache Design
Four questions:
- Where can a block be placed in the cache?
- How is a block found?
- Which block should be replaced?
- How is a write handled?

14 Block Placement
Divide the cache into sets to reduce conflict misses:
- n blocks per set => n-way set associative
- Direct-mapped cache => one block per set
- Fully associative => one set

15 Block Placement

16 Block Placement
For a direct-mapped cache, the high-order address bits determine the block location (uniquely).
For example: 32-bit addresses, 64 blocks, 8-byte blocks:
- The low-order 3 address bits determine the byte location within the block.
- The high-order 32 - 3 = 29 bits determine the block location in the cache.
- With 64 blocks, the location is the value of those high-order 29 bits mod 64, so we can simply use the low-order 6 bits of those 29 bits!
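
A minimal C sketch of this address breakdown, using the slide's parameters (the constant and variable names are illustrative, not from the slides):

```c
#include <stdint.h>
#include <stdio.h>

/* Direct-mapped example from the slide: 32-bit addresses,
   64 blocks of 8 bytes each. */
#define BLOCK_BYTES 8u   /* => 3 offset bits */
#define NUM_BLOCKS 64u   /* => 6 index bits  */

int main(void) {
    uint32_t addr = 0x000001F4u;                   /* arbitrary example address */
    uint32_t offset     = addr % BLOCK_BYTES;      /* low-order 3 bits          */
    uint32_t block_addr = addr / BLOCK_BYTES;      /* high-order 29 bits        */
    uint32_t index      = block_addr % NUM_BLOCKS; /* low 6 of those 29 bits    */
    uint32_t tag        = block_addr / NUM_BLOCKS; /* remaining 23 bits         */
    printf("offset=%u index=%u tag=0x%X\n", offset, index, tag);
    return 0;
}
```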

17 Block Placement
In general, the block address is divided into tag and index based on associativity:
- An n-way associative cache has a number of sets equal to blocks / associativity, so the index is log2(blocks / associativity) bits.
- For example, a 2-way associative cache with 64 blocks has 32 sets, so 5 index bits.
- The tag is used to identify the block during lookup.
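
A generalized version of the same split, as a sketch in C (the struct and helper names are assumptions chosen for illustration; sizes are assumed to be powers of two):

```c
#include <stdint.h>

typedef struct {
    uint32_t block_bytes; /* block size in bytes            */
    uint32_t num_blocks;  /* total blocks in the cache      */
    uint32_t assoc;       /* n for n-way; 1 = direct mapped */
} CacheGeometry;

/* sets = blocks / associativity, so the index is log2(sets) bits */
static uint32_t num_sets(const CacheGeometry *g) {
    return g->num_blocks / g->assoc;
}

static uint32_t set_index(const CacheGeometry *g, uint32_t addr) {
    return (addr / g->block_bytes) % num_sets(g);
}

/* the tag is everything above the index bits of the block address */
static uint32_t tag_of(const CacheGeometry *g, uint32_t addr) {
    return (addr / g->block_bytes) / num_sets(g);
}
```

For the slide's 2-way, 64-block case, num_sets() returns 32, so set_index() yields the 5 index bits.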

18 Causes of Misses

19 Cache Lookup
Lookup is easy for direct mapped: each block has a unique location, so we just need to verify the tag.
For n-way, check the tags at that index in all n banks (ways) simultaneously.
- Higher associativity means more complex hardware (i.e., slower).
Also, each entry needs a “valid” bit, which avoids initialization issues.
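
A software stand-in for the lookup, sketched in C (real hardware compares the tags of all banks in parallel; names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid; /* avoids matching garbage after power-up */
    uint32_t tag;
} Line;

/* Check every bank (way) of the indexed set for a matching, valid tag. */
static bool lookup(const Line *set, int assoc, uint32_t tag) {
    for (int way = 0; way < assoc; way++)
        if (set[way].valid && set[way].tag == tag)
            return true; /* hit  */
    return false;        /* miss */
}
```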

20 Opteron L1 Data Cache (2-way)

21 Replacement
Replacement is easy for direct mapped: each block has a unique location (again).
For higher associativities, common policies are:
- Least Recently Used (LRU)
- First In, First Out (FIFO)
- Random
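
As an illustration, LRU can be modeled in software with per-way timestamps; this is a sketch under assumed names, not how hardware actually tracks recency:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    uint32_t tag;
    uint64_t last_used; /* logical time of the most recent access */
} Way;

/* Choose the way to evict: an empty way if any, else the oldest one. */
static int pick_victim_lru(const Way *set, int assoc) {
    int victim = 0;
    for (int way = 0; way < assoc; way++) {
        if (!set[way].valid)
            return way; /* prefer an empty way */
        if (set[way].last_used < set[victim].last_used)
            victim = way;
    }
    return victim;
}
```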

22 Writes
Writing to the cache: two strategies.
- Write-through: immediately update the lower levels of the hierarchy.
- Write-back: update the lower levels of the hierarchy only when an updated block is replaced (uses a dirty bit).
Both strategies use a write buffer to make writes asynchronous.
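
A side-by-side sketch of the two policies in C (function and field names are assumptions; write_lower_level() is a stub standing in for the next level, which would normally sit behind the write buffer):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid, dirty; /* the dirty bit matters only for write-back */
    uint32_t tag;
    uint8_t  data[64];
} Line;

static void write_lower_level(uint32_t addr, const uint8_t *data) {
    (void)addr; (void)data; /* stub: next cache level or main memory */
}

/* Write-through: update the cache and the lower level on every write. */
static void write_through(Line *line, uint32_t addr, int off, uint8_t b) {
    line->data[off] = b;
    write_lower_level(addr, line->data); /* buffered, so no stall */
}

/* Write-back: mark the line dirty; the lower level sees the data only
   when the dirty line is eventually replaced. */
static void write_back(Line *line, int off, uint8_t b) {
    line->data[off] = b;
    line->dirty = true;
}

static void evict(Line *line, uint32_t addr) {
    if (line->dirty)
        write_lower_level(addr, line->data); /* flush on replacement */
    line->valid = line->dirty = false;
}
```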

23 Example
32-bit addressing; 128 KB cache, 256 B blocks (512 blocks), 2-way set associative (256 sets), LRU replacement policy.
The low-order 8 bits are the offset, the next 8 bits are the index, and the high-order 16 bits are the tag.

24 Example
From empty:
- Read 0xAABB0100: Tag AABB, Index 01, Offset 00; place in set 1, bank 0. Miss (compulsory)
- Read 0xABBA0101: Tag ABBA, Index 01, Offset 01; place in set 1, bank 1. Miss (compulsory)
- Read 0xABBA01A3: Tag ABBA, Index 01, Offset A3; located in set 1, bank 1. Hit!
- Read 0xCBAF01FF: Tag CBAF, Index 01, Offset FF; place in set 1, bank 0 (LRU). Miss (conflict)
- Read 0xAABB01CC: Tag AABB, Index 01, Offset CC; place in set 1, bank 1 (LRU). Miss (conflict)
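
The trace above can be reproduced with a small C simulation of this 2-way, 256-set, LRU cache (a sketch; the names and structure are illustrative):

```c
#include <stdint.h>
#include <stdio.h>

#define SETS 256
#define WAYS 2

typedef struct { int valid; uint32_t tag; uint64_t last_used; } Way;

static Way cache[SETS][WAYS]; /* zero-initialized: all ways invalid */
static uint64_t now;          /* logical clock for LRU */

static void access_cache(uint32_t addr) {
    uint32_t index = (addr >> 8) & 0xFFu; /* bits 8..15  */
    uint32_t tag   = addr >> 16;          /* bits 16..31 */
    now++;
    for (int w = 0; w < WAYS; w++) {
        if (cache[index][w].valid && cache[index][w].tag == tag) {
            cache[index][w].last_used = now;
            printf("0x%08X: hit  (set %u, bank %d)\n", addr, index, w);
            return;
        }
    }
    /* LRU victim: invalid ways have last_used 0, so they are chosen first */
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (cache[index][w].last_used < cache[index][victim].last_used)
            victim = w;
    cache[index][victim] = (Way){ .valid = 1, .tag = tag, .last_used = now };
    printf("0x%08X: miss (set %u, bank %d)\n", addr, index, victim);
}

int main(void) {
    uint32_t trace[] = { 0xAABB0100, 0xABBA0101, 0xABBA01A3,
                         0xCBAF01FF, 0xAABB01CC };
    for (int i = 0; i < 5; i++)
        access_cache(trace[i]);
    return 0;
}
```

Running it prints a miss/miss/hit/miss/miss sequence with the same set and bank assignments as the slide.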

25 Basic Optimizations
Six basic cache optimizations:
- Larger block size: reduces compulsory misses; increases conflict misses and miss penalty.
- Larger total cache capacity: reduces capacity misses; increases hit time and power consumption.
- Higher associativity: reduces conflict misses; increases hit time and power consumption.
- More levels of cache: reduces overall memory access time.
- Giving priority to read misses over writes: reduces miss penalty.
- Avoiding address translation in cache indexing: reduces hit time.

26 Pitfalls
Ignoring the impact of the operating system on the performance of the memory hierarchy.