Lecture slides originally adapted from Prof. Valeriu Beiu (Washington State University, Spring 2005, EE 334)


Temporal Locality (Locality in Time)
- Keep the most recently accessed data items closer to the processor
Spatial Locality (Locality in Space)
- Move blocks consisting of contiguous words to the upper levels of the cache

Hit Time
- Time to find and retrieve data from the current level of cache
Miss Penalty
- Average time to retrieve data on a miss at the current level (includes the possibility of misses at successive levels of the memory hierarchy)
Hit Rate
- Percentage of requests that are found in the current level of cache
Miss Rate
- 1 - Hit Rate

CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
Memory stall clock cycles = Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty
Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty (when reads and writes are combined into a single miss rate and penalty)
Memory hit time is included in the CPU execution cycles
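As a rough illustration of these formulas (the numbers, function name, and clock period below are made up for the example, not taken from the slides):

# Worked example of the CPU-time formula above (illustrative numbers only).
def cpu_time(exec_cycles, mem_accesses, miss_rate, miss_penalty, cycle_time_ns):
    # Memory stall clock cycles = memory accesses x miss rate x miss penalty
    stall_cycles = mem_accesses * miss_rate * miss_penalty
    # CPU time = (execution cycles + stall cycles) x clock cycle time
    return (exec_cycles + stall_cycles) * cycle_time_ns

# Hypothetical program: 1e9 execution cycles, 3e8 memory accesses,
# 5% miss rate, 20-cycle miss penalty, 0.5 ns clock cycle.
print(cpu_time(1e9, 3e8, 0.05, 20, 0.5))  # 650,000,000 ns = 0.65 s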

Average Memory Access Time (AMAT) = Hit Time + (Miss Rate x Miss Penalty)
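A minimal Python helper for this formula, using the same numbers as the single-level worked example later in these slides:

# AMAT = hit time + miss rate x miss penalty
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 20))  # 2.0 cycles, matching the single-level example below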

Benefits of Larger Block Size
- Spatial locality: if we access a given word, we're likely to access other nearby words soon
- Very applicable with the stored-program concept: if we execute a given instruction, it's likely that we'll execute the next few as well
- Works nicely for sequential array accesses too

Drawbacks of Larger Block Size
- Larger block size means larger miss penalty
  - On a miss, it takes longer to load a new block from the next level
- If the block size is too big (relative to cache size), then there are too few blocks
  - Result: miss rate goes up
- In general, minimize Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty

Compulsory Misses
- Occur when a program is first started
- The cache does not contain any of that program's data yet, so misses are bound to occur
- Can't be avoided easily

Conflict Misses
- Occur because two distinct memory addresses map to the same cache location
- Two blocks (which happen to map to the same location) can keep overwriting each other
- A big problem in direct-mapped caches
- How do we reduce the effect of these?

Capacity Misses
- Occur because the cache has a limited size
- Would not occur if we increased the size of the cache
- This is the primary type of miss for fully associative caches

Compulsory (cold start or process migration; first reference): first access to a block
- Not a whole lot you can do about it
- Note: if you are going to run billions of instructions, compulsory misses are insignificant
Capacity:
- The cache cannot contain all the blocks accessed by the program
- Solution: increase cache size
Conflict (collision):
- Multiple memory locations map to the same cache location
- Solution 1: increase cache size
- Solution 2: increase associativity
Coherence (invalidation):
- Another process (e.g., I/O) updates memory

                  Direct Mapped   N-way Set Associative   Fully Associative
Cache Size        Large           Medium                  Small
Capacity Miss     Low             Medium                  High
Conflict Miss     High            Medium                  Zero
Compulsory Miss   Same            Same                    Same
Coherence Miss    Same            Same                    Same

Note: If you run many (millions or billions of) instructions, compulsory misses are insignificant

Direct-Mapped Cache: the index completely specifies which position a block goes in on a miss
N-Way Set Associative (N > 1): the index specifies a set, but the block can occupy any position within that set on a miss
Fully Associative: the block can be written into any position
Question: if we have the choice, where should we write an incoming block?

- If there are any locations with the valid bit off (empty), then usually write the new block into the first one
- If all possible locations already hold a valid block, we must use a replacement policy to determine which block gets "cached out" on a miss

LRU (Least Recently Used)
- Idea: cache out the block that has been accessed (read or write) least recently
- Pro: temporal locality => recent past use implies likely future use; in fact, this is a very effective policy
- Con: with 2-way set associativity it is easy to track (one LRU bit per set); with 4-way or greater, it requires complicated hardware and extra time to track
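A small software sketch of LRU replacement for a single cache set may help; the class name, API, and the 2-way trace below are illustrative assumptions, not the hardware mechanism the slide describes:

from collections import OrderedDict

# Minimal sketch of LRU replacement for one cache set.
class LRUSet:
    def __init__(self, ways=4):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> data, ordered from LRU to MRU

    def access(self, tag):
        if tag in self.blocks:        # hit: move block to most-recently-used position
            self.blocks.move_to_end(tag)
            return "hit"
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)  # evict the least recently used block
        self.blocks[tag] = None       # fill (a real cache would fetch data from the next level)
        return "miss"

s = LRUSet(ways=2)
print([s.access(t) for t in [0x1, 0x2, 0x1, 0x3, 0x2]])
# ['miss', 'miss', 'hit', 'miss', 'miss'] -> tag 0x2 was evicted when 0x3 arrived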

Larger cache
- Limited by cost and technology
- Hit time of the first-level cache should be less than the cycle time
More places in the cache to put each block of memory: associativity
- Fully associative
  - Any block can go in any line
- k-way set associative
  - k places for each block
  - Direct mapped: k = 1

How do we choose between the options of associativity, block size, and replacement policy?
- Design against a performance model
- Minimize: Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
- Influenced by technology and program behavior

 Assume  Hit Time = 1 cycle  Miss rate = 5%  Miss penalty = 20 cycles  Avg mem access time?  AMAT = x 20 = 2 cycles 18

When caches first became popular, the miss penalty was about 10 processor clock cycles
Today, with GHz processors (< 1 ns per clock cycle) and roughly 100 ns to reach DRAM, a miss costs 100 or more processor clock cycles!
Solution: another cache between main memory and the processor cache
- Second Level (L2) Cache

Average Memory Access Time = L1 Hit Time + L1 Miss Rate x L1 Miss Penalty
L1 Miss Penalty = L2 Hit Time + L2 Miss Rate x L2 Miss Penalty
Average Memory Access Time = L1 Hit Time + L1 Miss Rate x (L2 Hit Time + L2 Miss Rate x L2 Miss Penalty)
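A short Python sketch of the two-level formula; the numbers reproduce the worked L2 example two slides below:

# Two-level AMAT: the L1 miss penalty is itself an AMAT into L2.
def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, l2_miss_penalty):
    l1_miss_penalty = l2_hit + l2_miss_rate * l2_miss_penalty
    return l1_hit + l1_miss_rate * l1_miss_penalty

print(amat_two_level(1, 0.05, 5, 0.15, 100))  # 2.0 cycles (L1 miss penalty = 20)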

 Assume  L1 Hit Time = 1 cycle  L1 Miss rate = 5%  L1 Miss Penalty = 100 cycles  Avg mem access time?  AMAT = x 100 = 6 cycles 21

 Assume  L1 Hit Time = 1 cycle  L1 Miss rate = 5%  L2 Hit Time = 5 cycles  L2 Miss rate = 15% (% L1 misses that miss)  L2 Miss Penalty = 100 cycles  L1 miss penalty?  = * 100 = 20  Avg mem access time?  AMAT = x 20 = 2 cycles  3x faster with L2 cache 22

Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)

Example: Block 12 placed in an 8-block cache
- Fully associative: block 12 can go anywhere
- Direct mapped: block 12 can go only into block (12 mod 8) = 4
- Set associative (e.g., 2-way, 4 sets): block 12 can go anywhere in set (12 mod 4) = 0

Direct indexing (using index and block offset), tag comparison, or a combination
Increasing associativity shrinks the index and expands the tag

Address breakdown: Block Address (Tag | Index) | Block Offset
- Index: set select
- Block offset: data select within the block
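To make the breakdown concrete, here is a hypothetical split of an address for a direct-mapped cache; the 32-byte block size and 256 sets are assumptions chosen only for illustration:

# Split an address into tag / index / block offset (illustrative sizes).
BLOCK_SIZE  = 32                      # bytes per block -> 5 offset bits
NUM_SETS    = 256                     # sets (lines)    -> 8 index bits
OFFSET_BITS = BLOCK_SIZE.bit_length() - 1
INDEX_BITS  = NUM_SETS.bit_length() - 1

def split_address(addr):
    offset = addr & (BLOCK_SIZE - 1)                  # data select within the block
    index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1)   # set select
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)       # compared against the stored tag
    return tag, index, offset

print([hex(x) for x in split_address(0x12345678)])  # ['0x91a2', '0xb3', '0x18']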

Easy for direct mapped: there is only one possible location
Set associative or fully associative:
- Random
- LRU (Least Recently Used)

How do we handle:
- Write hits?
- Write misses?

Write-through
- Update the block in the cache and the corresponding block in lower-level memory
Write-back
- Update only the word in the cache block
- Allow the memory word to be "stale"
  - Add a 'dirty' bit to each line indicating that memory needs to be updated when the block is replaced
  - OS flushes the cache before I/O!
Performance trade-offs?
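The toy Python sketch below contrasts the two write-hit policies using a dirty bit; the class and method names are invented for illustration and the sketch ignores block granularity:

# Toy contrast of write-through vs. write-back on a write hit.
class Cache:
    def __init__(self, write_back=True):
        self.data = {}          # cached copy: address -> word
        self.dirty = set()      # addresses whose memory copy is stale (write-back only)
        self.memory = {}        # stand-in for lower-level memory
        self.write_back = write_back

    def write(self, addr, word):
        self.data[addr] = word
        if self.write_back:
            self.dirty.add(addr)        # memory is updated later, on eviction
        else:
            self.memory[addr] = word    # write-through: memory updated immediately

    def evict(self, addr):
        if addr in self.dirty:          # write-back: flush the stale block now
            self.memory[addr] = self.data[addr]
            self.dirty.discard(addr)
        self.data.pop(addr, None)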

Pros and cons of each?
- WT: read misses cannot result in writes (the cache is never dirty)
- WB: repeated writes to the same block do not each go to memory
- WT is always combined with write buffers so that the processor doesn't wait for lower-level memory

A write buffer is needed between the cache and memory
- Processor: writes data into the cache and the write buffer
- Memory controller: writes the contents of the buffer to memory
The write buffer is just a FIFO:
- Typical number of entries: 4
- Must handle bursts of writes
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
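A toy FIFO write buffer in Python may make the behavior concrete; the 4-entry depth matches the "typical" figure above, and the method names are made up:

from collections import deque

# Bounded FIFO between the cache and memory.
class WriteBuffer:
    def __init__(self, entries=4):
        self.fifo = deque()
        self.entries = entries

    def push(self, addr, word):
        if len(self.fifo) >= self.entries:
            return False          # buffer full: the processor must stall
        self.fifo.append((addr, word))
        return True

    def drain_one(self, memory):
        if self.fifo:             # memory controller retires the oldest write
            addr, word = self.fifo.popleft()
            memory[addr] = word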

Store frequency (w.r.t. time) > 1 / DRAM write cycle
- If this condition holds for a long period of time (CPU cycle time too fast and/or too many store instructions in a row):
  - The store buffer will overflow no matter how big you make it
Solutions for write buffer saturation:
- Use a write-back cache
- Install a second-level (L2) cache
  - Does this always work?

Write-buffer issues: could introduce a RAW hazard with memory!
- The write buffer may hold the only copy of valid data => reads to memory may get the wrong result if we ignore the write buffer
Solutions:
- Simply wait for the write buffer to empty before servicing reads
  - Might increase the read miss penalty (by 50% on the old MIPS 1000)
- Check the write buffer contents before the read ("fully associative")
  - If there are no conflicts, let the memory access continue
  - Else grab the data from the buffer
Can the write buffer help with write-back?
- Read miss replacing a dirty block
  - Copy the dirty block to the write buffer while starting the read to memory
- The CPU stalls less since it restarts as soon as the read is done

Assume a 16-bit write to memory location 0x0 causes a miss
- Do we allocate space in the cache and possibly read in the block?
  - Yes: Write Allocate
  - No: Not Write Allocate

The Principle of Locality:
- A program is likely to access a relatively small portion of the address space at any instant of time
  - Temporal locality: locality in time
  - Spatial locality: locality in space
Three (+1) major categories of cache misses:
- Compulsory misses: e.g., cold start misses
- Conflict misses: increase cache size and/or associativity; nightmare scenario: ping-pong effect!
- Capacity misses: increase cache size
- Coherence misses: caused by external processors or I/O devices

- Size of cache: speed vs. capacity
- Block size
- Associativity
- Direct mapped vs. associative
  - Choice of N for N-way set associative
- Block replacement policy
- 2nd-level cache?
- Write-hit policy (write-through, write-back)
- Write-miss policy

The optimal choice is a compromise
- Depends on access characteristics
  - Workload
  - Use (I-cache, D-cache, TLB)
- Depends on technology / cost
Use a performance model to pick between choices, depending on programs, technology, budget, ...
Simplicity often wins

Migration
- Move the data item to a local cache
- Benefits? Concerns?
Replication
- When shared data is read simultaneously, make copies of the data in the local caches
- Benefits? Concerns?

We have to be careful with the memory hierarchy in an architecture/system with parallelism
- A big issue for multicore multiprocessors
- Can also occur with I/O devices

Preserves program order
- CPU A writes value 100 to location X   // assume no other CPU writes here
- CPU A reads from X                     // should read 100
Coherent view of memory
- CPU A writes value 200 to location X   // assume no other CPU writes here
- CPU B reads from X                     // should read 200, if the read and write are sufficiently separated in time

Writes to the same location are serialized
- CPU A writes value 300 to location X
- CPU B writes value 400 to location X
- A CPU can never read 400 from X and then later read 300
- Ensure all writes to the same location are seen in the same order
- Must enforce coherence

The most popular cache coherence approach: snooping
- Problem: there is no centralized record of which caches hold which blocks
- Solution: cache controllers snoop the bus/network to determine which block is being accessed
One example: the Write Invalidate Protocol
- Ensures exclusive access to the block being written
- Beware of false sharing
- Scalable? Also see: directory-based cache coherence protocols
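A toy write-invalidate sketch follows; the names are invented, a write-through policy is assumed for simplicity, and real protocols track per-block states (e.g., MESI) rather than invalidating on every write:

# On a write, the writing core invalidates all other cached copies first.
class Bus:
    def __init__(self):
        self.cores, self.memory = [], {}

class Core:
    def __init__(self, name, bus):
        self.name, self.cache, self.bus = name, {}, bus
        bus.cores.append(self)

    def read(self, addr):
        if addr not in self.cache:                 # miss: fetch from memory
            self.cache[addr] = self.bus.memory.get(addr, 0)
        return self.cache[addr]

    def write(self, addr, word):
        for other in self.bus.cores:               # broadcast invalidation on the bus
            if other is not self:
                other.cache.pop(addr, None)
        self.cache[addr] = word
        self.bus.memory[addr] = word               # write-through, to keep the sketch short

bus = Bus()
a, b = Core("A", bus), Core("B", bus)
a.write(0x10, 300)
print(b.read(0x10))   # 300: any stale copy in B was invalidated, so B re-fetches
b.write(0x10, 400)
print(a.read(0x10))   # 400: A re-fetches after invalidation; writes to X stay serialized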

Caches are NOT mandatory:
- The processor performs arithmetic
- Memory stores data
- Caches simply make data transfers go faster
Each level of the memory hierarchy holds a subset of the next level below it (further from the processor)
Caches speed things up by exploiting temporal locality: store data used recently
Block size > 1 word speeds things up by exploiting spatial locality: store words adjacent to the ones used recently
Cache coherency is important (and tricky) for multicore