CS61CL Machine Structures Lec 11 – Introduction to Cache Design


CS61CL Machine Structures Lecture 11 – Introduction to Cache Design
David Culler, Electrical Engineering and Computer Sciences, University of California, Berkeley

CS61CL Road Map
[diagram: the software/hardware stack. Software side: HLL program (foo.c) → Compiler → assembly language program (foo.s) → Assembler → object file (foo.o) → Linker → machine language program (foo.exe). Below the Instruction Set Architecture, the hardware side: machine organization (I/O system, instruction set processor), datapath & control, digital design, circuit design, layout & fabrication, semiconductor materials.]

Turning “Moore stuff” into performance

Performance Trends
[chart: processor performance over time, with the MIPS R3000 as a reference point]

Recall: Performance
Performance is in units of things per second; bigger is better.
Speedup(E) = Performance(with E) / Performance(without E)
If we are primarily concerned with response time: Performance(X) = 1 / Execution_time(X)
"X is n times faster than Y" means n = Performance(X) / Performance(Y) = Execution_time(Y) / Execution_time(X)

Review: Pipelined Execution
[diagram: pipelined datapath with PC, instruction memory, pipeline registers (IR, IR_ex, IR_mem, IR_wb), ALU inputs A/B, and data memory]
Speedup with N stages is ≤ N, limited by dependences (aka hazards):
Structural hazard: two operations want to use the same resource at the same time
Data hazard: cannot use a value before it is produced
Control hazard: attempt to branch before the condition is determined

The Problem: Memory Gap
[chart, 1980–2000: CPU performance grows ~60%/year while DRAM improves ~7%/year, so the processor-memory performance gap grows ~50%/year]
1985: 80386, cache off-chip
1989: first Intel CPU (80486) with cache on chip
1995: first Intel CPU (Pentium Pro) with two levels of cache on chip

Recall: Where do Objects Live and Work?
[diagram: processor registers operate on words loaded from and stored to memory addresses 000..0 through FFF..F; a read may be a read-hit or a read-miss]

Storage Hierarchy
[diagram: processor at the top, then registers, cache, memory, disk as levels 1, 2, 3, …, n; the size of memory at each level grows, and speed decreases, with increasing distance from the processor]
As we move to deeper levels, the latency goes up and the price per bit goes down.

Why Caches Work
Physics: large memories are slow; fast memories are small.
Statistics: programs exhibit locality.
Temporal locality: recently accessed locations are likely to be accessed again soon.
Spatial locality: if a location is accessed, others nearby are likely to be accessed too.
Use statistics to cheat the laws of physics: keep recently accessed blocks in a small fast memory to create the illusion of a large fast memory; on average, access to the large memory can be fast.
Ave Mem Access Time = Hit Time + P_miss * Miss Penalty
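To make the locality argument concrete, here is a small, purely illustrative C sketch (not from the lecture): both loops sum the same array, but the first walks memory in row-major order (good spatial locality) while the second strides across rows (poor spatial locality), so on a typical machine with multi-word cache blocks it runs noticeably slower.

```c
#include <stdio.h>

#define N 1024

static int a[N][N];   /* stored row-major in C */

int main(void) {
    long sum_row = 0, sum_col = 0;

    /* Good spatial locality: walks memory sequentially, so each
       cache block fetched is fully used before it is evicted. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum_row += a[i][j];

    /* Poor spatial locality: strides by N*sizeof(int) bytes, so for
       large N nearly every access misses in the cache. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum_col += a[i][j];

    printf("%ld %ld\n", sum_row, sum_col);
    return 0;
}
```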

Manual vs. Automatic Management of the Storage Hierarchy
In everyday life: which books go in your backpack? on your desk? in the library? on Amazon? What about your music collection?
In the machine: registers? files? cache?

Cache: Transparent Memory Acceleration
The processor performs reads and writes on memory locations (instruction fetch, load, store); the memory abstraction is unchanged!
The cache holds a copy of a small portion of the memory:
hit: present in cache => respond quickly
miss: absent from cache => obtain it from memory, then respond
Unit of transfer: a block; several words of memory are brought into a cache line.
Design questions: Where can a block be placed? How can we tell if it is there? What happens to memory on a write hit? What happens to the cache on a write miss?

Direct-Mapped Cache
Each memory address is associated with one possible block within the cache, so we only need to look in a single location in the cache for the data (if it exists in the cache).
A block is the unit of transfer between cache and memory.

Direct-Mapped Cache (B=1, S=4)
[diagram: a 4-byte direct-mapped cache (block size = 1 byte, 4 blocks) alongside a 16-byte memory, addresses 0–F]
Cache line 0 can be occupied by data from memory locations 0, 4, 8, …: any memory location whose address is a multiple of 4.

Direct-Mapped Cache (B=2, S=4)
[diagram: an 8-byte direct-mapped cache (block size = 2 bytes, 4 cache indices) alongside memory addresses 0 through 1E]
In general, the cache location a memory address maps to is determined by the low-order bits of the address (the cache index). Of all the memory locations that can map to a given cache line, we keep the one we read or wrote most recently, because by the principle of temporal locality it is the one we are most likely to need again soon.
How is the block located? How is the byte within the block selected? e.g., memory address 11101?

How do you tell if the right block is in the line? Like luggage at the airport…

Tag-Check (B=2, S=4, N=1)
[diagram: the same 8-byte direct-mapped cache, now storing a Tag alongside the Data in each line; memory addresses shown]
Of all the possible memory locations that can be placed in a given cache line, how can we tell which one is actually there? What should go in the tag? The entire address? We don't need the bits we already used in getting there.

Mapping Memory Address to Cache
Address = | tag (ttt…t) | index (iii…i) | offset (oooo) |
tag: checked to see if we have the correct block
index: selects the block (set*) within the cache
offset: selects the byte within the block
* Direct-mapped => 1 block per "set"; more generally, the index selects a set.
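As an illustration only, the address split can be written as a few C helpers; the field widths below are chosen to match the 16 KB direct-mapped, 16-byte-block cache used in the walkthrough later, not a general rule.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative field widths: 16-byte blocks and 1024 index rows. */
#define OFFSET_BITS 4
#define INDEX_BITS  10

static uint32_t get_offset(uint32_t addr) {
    return addr & ((1u << OFFSET_BITS) - 1);
}

static uint32_t get_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}

static uint32_t get_tag(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);
}

int main(void) {
    uint32_t addr = 0x00008014;  /* one of the example addresses below */
    printf("tag=0x%x index=%u offset=%u\n",
           get_tag(addr), get_index(addr), get_offset(addr));
    return 0;
}
```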

Direct-Mapped Cache Example (1/3)
Suppose we have 8 KB of data in a direct-mapped cache with 16-byte blocks.
Determine the size of the tag, index, and offset fields if we're using a 32-bit architecture.
Offset: needs to specify the correct byte within a block; a block contains 16 bytes = 2^4 bytes, so we need 4 bits to specify the correct byte.

Direct-Mapped Cache Example (2/3)
Index: (~index into an "array of blocks") needs to specify the correct block in the cache.
The cache contains 8 KB = 2^13 bytes; a block contains 16 B = 2^4 bytes.
# blocks/cache = (bytes/cache) / (bytes/block) = 2^13 / 2^4 = 2^9 blocks/cache
so we need 9 bits to specify this many blocks.

Direct-Mapped Cache Example (3/3) Tag: use remaining bits as tag tag length = addr length - offset - index = 32 - 4 - 9 bits = 19 bits so tag is leftmost 19 bits of memory address
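The same arithmetic can be done generically; a minimal sketch in C, assuming power-of-two sizes, that prints offset = 4, index = 9, tag = 19 for this example:

```c
#include <stdio.h>

/* Count how many bits are needed to encode n distinct values,
   assuming n is a power of two. */
static int log2_int(unsigned n) {
    int bits = 0;
    while (n > 1) { n >>= 1; bits++; }
    return bits;
}

int main(void) {
    const int addr_bits  = 32;
    const int cache_size = 8 * 1024;   /* 8 KB of data */
    const int block_size = 16;         /* 16-byte blocks */

    int offset_bits = log2_int(block_size);
    int index_bits  = log2_int(cache_size / block_size);
    int tag_bits    = addr_bits - index_bits - offset_bits;

    printf("offset=%d index=%d tag=%d\n", offset_bits, index_bits, tag_bits);
    return 0;
}
```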

Administration
Midterms to be returned in Tu/W lab
HW 8 (the last) out today, due ???
Proj 4 out today, due ??, with a partner
Pick the due dates and plan for RRR Week

16 KB Direct-Mapped Cache, 16 B Blocks
Valid bit: determines whether anything is stored in that row (when the computer is initially turned on, all entries are invalid).
[table: 1024 index rows (0–1023), each with Valid, Tag, and data bytes 0x0-3, 0x4-7, 0x8-b, 0xc-f]

1. Load Byte 0x00000014
Address fields: tag = 000000000000000000, index = 0000000001, offset = 0100

So we read block 1 (index 0000000001).

No valid data in that row (valid bit is 0): a miss.

So load that data into the cache, setting the tag and valid bit. Row 1 now holds valid = 1, tag = 0, data = a b c d.

Read from the cache at the offset and return word b.

2. Read Byte 0x0000001C
Address fields: tag = 000000000000000000, index = 0000000001, offset = 1100

The indexed row is valid.

The index is valid and the tag matches.

Index valid, tag matches, so return d.

3. Load Byte 0x00000034
Address fields: tag = 000000000000000000, index = 0000000011, offset = 0100

So we read block 3.

No valid data: a miss.

Load that cache block (row 3 now holds valid = 1, tag = 0, data = e f g h) and return word f.

4. Load Byte 0x00008014
Address fields: tag = 000000000000000010, index = 0000000001, offset = 0100

So read cache block 1; the data is valid.

But cache block 1's tag does not match (0 != 2).

Miss, so replace block 1 with the new data and tag. Row 1 now holds valid = 1, tag = 2, data = i j k l.

And return byte j.

What to do on a write hit?
Write-through: update the word in the cache block and the corresponding word in memory.
Write-back: update the word in the cache block only, allowing the memory word to be "stale"; add a 'dirty' bit to each block indicating that memory needs to be updated when the block is replaced (the OS flushes the cache before I/O…).
Performance trade-offs?
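A rough sketch of the bookkeeping difference in C; the structure and function names here are hypothetical, just to contrast the two policies:

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_BYTES 16

struct cache_line {
    bool     valid;
    bool     dirty;                 /* only meaningful for write-back */
    uint32_t tag;
    uint8_t  data[BLOCK_BYTES];
};

/* Write-through: on a write hit, update both cache and memory. */
void write_hit_through(struct cache_line *line, int offset,
                       uint8_t value, uint8_t *memory_byte) {
    line->data[offset] = value;
    *memory_byte = value;           /* memory is always up to date */
}

/* Write-back: on a write hit, update only the cache and mark it dirty;
   memory is updated later, when this block is evicted. */
void write_hit_back(struct cache_line *line, int offset, uint8_t value) {
    line->data[offset] = value;
    line->dirty = true;             /* memory copy is now stale */
}
```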

Types of Cache Misses (Three C's)
1st C: Compulsory Misses: occur when a program is first started; the cache does not contain any of that program's data yet, so misses are bound to occur. Reduced with increasing block size.

Types of Cache Misses (Three C's)
2nd C: Conflict Misses: a miss that occurs because two distinct memory addresses map to the same cache line; when both are needed, they keep overwriting each other.
Dealing with conflict misses:
Solution 1: make the cache bigger; more lines, fewer conflicts, though conflicts between addresses far apart in the address space remain.
Solution 2: allow multiple distinct blocks in the same cache index.

Fully Associative Cache (B=32)
Any block can go anywhere in the cache.
Memory address fields: Offset: byte within block; Index: none; Tag: all the rest.
On a lookup, compare all tags in parallel.
[diagram: each line holds a valid bit, a 27-bit cache tag, and a 32-byte data block (bytes B0–B31), with a tag comparator per line]

Types of Cache Misses (Three C's)
3rd C: Capacity Misses: a miss that occurs because the cache has a limited size; a miss that would not occur if we increased the size of the cache.

N-Way Set Associative Cache
Basic idea: direct-map the address to a set, then do an associative lookup of the N blocks within it.
Memory address fields: Tag: same as before; Offset: same as before; Index: points us to the correct "row" (called a set in this case).
Given a memory address:
Find the correct set using the Index value.
Compare the Tag with all Tag values in the determined set.
If a match occurs, hit! Otherwise, a miss.
Finally, use the Offset field as usual to find the desired data within the block.
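A minimal sketch of this lookup procedure in C (the cache geometry and names are illustrative assumptions, not the lecture's):

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS        4
#define SETS        256
#define OFFSET_BITS 4     /* 16-byte blocks (illustrative) */
#define INDEX_BITS  8     /* log2(SETS) */

struct line {
    bool     valid;
    uint32_t tag;
    uint8_t  data[1 << OFFSET_BITS];
};

static struct line cache[SETS][WAYS];

/* Returns true on a hit and writes the requested byte to *out. */
bool cache_lookup(uint32_t addr, uint8_t *out) {
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    /* 1. Find the correct set using the index.
       2. Compare the tag with every way in that set (in hardware,
          these comparisons happen in parallel). */
    for (int way = 0; way < WAYS; way++) {
        struct line *l = &cache[index][way];
        if (l->valid && l->tag == tag) {
            /* 3. Hit: use the offset to select the byte. */
            *out = l->data[offset];
            return true;
        }
    }
    return false;  /* miss: fetch the block from the next level */
}
```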

Associative Cache Example
[diagram: a 2-way set-associative cache with two sets (index 0 and 1) alongside a 16-byte memory, addresses 0–F]

4-Way Set Associative Cache Circuit
[circuit diagram: the address tag and index fields drive the parallel lookup across the four ways]

Block Replacement Policy
Direct-mapped cache: the index completely specifies which position a block can go in on a miss.
N-way set associative: the index specifies a set, but the block can occupy any position within the set on a miss.
Fully associative: the block can be written into any position.
Question: if we have the choice, where should we write an incoming block?
If there are any locations with the valid bit off (empty), then usually write the new block into the first one.
If all possible locations already have a valid block, we must pick a replacement policy: the rule by which we determine which block gets "cached out" on a miss.

Block Replacement Policy: LRU
LRU (Least Recently Used). Idea: cache out the block which has been accessed (read or write) least recently.
Pro: temporal locality; recent past use implies likely future use. In fact, this is a very effective policy.
Con: with 2-way set associativity it is easy to keep track (one LRU bit); with 4-way or greater, it requires complicated hardware and much time to keep track of this.
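For the 2-way case, one LRU bit per set is exact; a small illustrative C sketch (hypothetical structure, word-addressed with one-word blocks as in the example that follows):

```c
#include <stdbool.h>
#include <stdint.h>

#define SETS 2  /* tiny cache, as in the worked example below */

struct set2 {
    bool     valid[2];
    uint32_t tag[2];
    int      lru;      /* which way was used LEAST recently (0 or 1) */
};

static struct set2 cache[SETS];

/* Access one word; returns true on a hit.  One bit of state per set
   tracks LRU exactly when there are only 2 ways. */
bool access_word(uint32_t addr) {
    uint32_t index = addr % SETS;
    uint32_t tag   = addr / SETS;
    struct set2 *s = &cache[index];

    for (int way = 0; way < 2; way++) {
        if (s->valid[way] && s->tag[way] == tag) {
            s->lru = 1 - way;          /* the other way is now LRU */
            return true;               /* hit */
        }
    }

    /* Miss: fill an empty way if possible, otherwise evict the LRU way. */
    int victim = !s->valid[0] ? 0 : (!s->valid[1] ? 1 : s->lru);
    s->valid[victim] = true;
    s->tag[victim]   = tag;
    s->lru = 1 - victim;
    return false;
}
```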

Block Replacement Example We have a 2-way set associative cache with a four word total capacity and one word blocks. We perform the following word accesses (ignore bytes for this problem): 0, 2, 0, 1, 4, 0, 2, 3, 5, 4 How many hits and how many misses will there be for the LRU block replacement policy?

Block Replacement: LRU
Addresses 0, 2, 0, 1, 4, 0, …
0: miss, bring into set 0 (loc 0)
2: miss, bring into set 0 (loc 1)
0: hit
1: miss, bring into set 1 (loc 0)
4: miss, bring into set 0 (loc 1, replacing 2)
0: hit

Big Idea
How to choose between associativity, block size, replacement & write policy? Design against a performance model.
Minimize: Average Memory Access Time = Hit Time + Miss Penalty x Miss Rate (influenced by technology & program behavior).
Create the illusion of a memory that is large, cheap, and fast, on average.
How can we improve miss penalty?

Improving Miss Penalty
When caches first became popular, Miss Penalty ~ 10 processor clock cycles.
Today: a 2400 MHz processor (0.4 ns per clock cycle) and 80 ns to go to DRAM gives 200 processor clock cycles!
[diagram: Proc → $ (L1) → $2 (L2) → DRAM]
Solution: another cache between memory and the processor cache: the Second Level (L2) Cache.

An Actual CPU – Early PowerPC
Cache: 32 KB instruction and 32 KB data L1 caches; external L2 cache interface with integrated controller and cache tags, supporting up to 1 MByte of external L2 cache.
Dual Memory Management Units (MMU) with Translation Lookaside Buffers (TLB).
Pipelining: superscalar (3 inst/cycle), 6 execution units (2 integer and 1 double-precision IEEE floating point).

An Actual CPU – Pentium M 32KB I$ 32KB D$

And in Conclusion…
We would like to have the capacity of disk at the speed of the processor; unfortunately this is not feasible. So we create a memory hierarchy:
each successively lower level contains the "most used" data from the next higher level
exploits temporal & spatial locality
do the common case fast, worry less about the exceptions (design principle of MIPS)
Locality of reference is a Big Idea.

And in Conclusion…
Caches are a mechanism for transparent movement of data among levels of a storage hierarchy:
a set of address/value bindings
address → index to a set of candidates
compare the desired address with the tag
service hit or miss; load a new block and binding on a miss
[diagram: the small Valid/Tag/data table, with address 000000000000000000 | 0000000001 | 1100 split into tag, index, offset]

And in Conclusion…
We've discussed memory caching in detail. Caching in general shows up over and over in computer systems: filesystem cache, web page cache, game databases / tablebases, software memoization, others?
Big idea: if something is expensive but we want to do it repeatedly, do it once and cache the result.
Cache design choices:
size of cache: speed vs. capacity
block size (i.e., cache aspect ratio)
write policy (write-through vs. write-back)
associativity: choice of N (direct-mapped vs. set vs. fully associative)
block replacement policy
2nd level cache? 3rd level cache?
Use a performance model to pick between choices, depending on programs, technology, budget, ...
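As a small software illustration of the same big idea (not from the lecture): a memoized Fibonacci caches results that are expensive to recompute.

```c
#include <stdint.h>
#include <stdio.h>

#define MAXN 94                /* fib(93) is the largest that fits in uint64_t */

static uint64_t memo[MAXN];    /* memo[n] == 0 means "not cached yet" */

/* Expensive-looking recursive computation, made cheap by caching results. */
uint64_t fib(int n) {
    if (n < 2) return (uint64_t)n;
    if (memo[n] != 0) return memo[n];      /* cache hit */
    memo[n] = fib(n - 1) + fib(n - 2);     /* miss: compute and cache */
    return memo[n];
}

int main(void) {
    printf("fib(90) = %llu\n", (unsigned long long)fib(90));
    return 0;
}
```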

Bonus Slides
These are extra slides that used to be included in lecture notes, but have been moved to this "bonus" area to serve as a supplement. The slides appear in the order they would have in the normal presentation.

TIO: The Great Cache Mnemonic
AREA (cache size, B) = HEIGHT (# of blocks) * WIDTH (size of one block, B/block)
2^(H+W) = 2^H * 2^W
[diagram: address split into Tag | Index | Offset; the Index bits give the HEIGHT (# of blocks), the Offset bits give the WIDTH (size of one block), and together they give the AREA (cache size in bytes)]

Accessing Data in a Direct-Mapped Cache
Ex.: 16 KB of data, direct-mapped, 4-word blocks. Can you work out height, width, area?
Read 4 addresses: 0x00000014, 0x0000001C, 0x00000034, 0x00008014
Memory values here:

Address (hex)   Value of Word
00000010        a
00000014        b
00000018        c
0000001C        d
...
00000030        e
00000034        f
00000038        g
0000003C        h
...
00008010        i
00008014        j
00008018        k
0000801C        l

Accessing Data in a Direct-Mapped Cache
4 addresses: 0x00000014, 0x0000001C, 0x00000034, 0x00008014
The 4 addresses divided (for convenience) into Tag, Index, Byte Offset fields:

Tag                 Index       Offset
000000000000000000  0000000001  0100
000000000000000000  0000000001  1100
000000000000000000  0000000011  0100
000000000000000010  0000000001  0100

Do an Example Yourself. What Happens?
Choose from: Cache: hit, miss, miss with replace. Values returned: a, b, c, d, e, ..., k, l
Read address 0x00000030? Fields: tag 000000000000000000, index 0000000011, offset 0000
Read address 0x0000001c? Fields: tag 000000000000000000, index 0000000001, offset 1100
Cache state:

Index  Valid  Tag  0x0-3  0x4-7  0x8-b  0xc-f
0
1      1      2    i      j      k      l
2
3      1      0    e      f      g      h
4
...

Answers
0x00000030 is a hit: Index = 3, Tag matches, Offset = 0, value = e.
0x0000001c is a miss: Index = 1, Tag mismatch, so replace from memory; Offset = 0xc, value = d.
Since these are reads, the values returned must equal the memory values whether or not they were cached: 0x00000030 = e, 0x0000001c = d.
[memory table as shown earlier]

Block Size Tradeoff (1/3)
Benefits of larger block size:
Spatial locality: if we access a given word, we're likely to access other nearby words soon.
Very applicable with the stored-program concept: if we execute a given instruction, it's likely that we'll execute the next few as well.
Works nicely in sequential array accesses too.
(Speaker note: block size is a tradeoff. A larger block size generally reduces the miss rate by exploiting spatial locality, but miss rate is not the only cache performance metric: the miss penalty also grows, because a larger block takes longer to fill. Even the miss rate by itself does not always improve; with the cache size held constant, the miss rate drops rapidly at first and then rises past a certain block size. Average access time therefore falls initially, but eventually rises as both the miss penalty and the miss rate increase.)

Block Size Tradeoff (2/3)
Drawbacks of larger block size:
A larger block size means a larger miss penalty: on a miss, it takes longer to load a new block from the next level.
If the block size is too big relative to the cache size, then there are too few blocks; result: the miss rate goes up.
In general, minimize Average Memory Access Time (AMAT) = Hit Time + Miss Penalty x Miss Rate.

Block Size Tradeoff (3/3)
Hit Time: time to find and retrieve data from the current level cache.
Miss Penalty: average time to retrieve data on a current-level miss (includes the possibility of misses on successive levels of the memory hierarchy).
Hit Rate: % of requests that are found in the current level cache.
Miss Rate: 1 - Hit Rate.

Extreme Example: One Big Block
[diagram: a cache with a single row holding a Valid bit, Tag, and data bytes B0–B3]
Cache size = 4 bytes, block size = 4 bytes: only ONE entry (row) in the cache!
If an item is accessed, it is likely to be accessed again soon, but it is unlikely to be accessed again immediately, so the next access will likely be a miss again.
We continually load data into the cache but discard it (force it out) before we use it again; a nightmare for the cache designer: the Ping-Pong Effect.

Block Size Tradeoff Conclusions
[charts: Miss Rate vs. Block Size first falls (exploits spatial locality) then rises (fewer blocks compromises temporal locality); Miss Penalty vs. Block Size rises; Average Access Time vs. Block Size falls then rises due to the increased miss penalty & miss rate]

Analyzing a Multi-Level Cache Hierarchy
[diagram: Proc → $ (L1) → $2 (L2) → DRAM, each level with its own hit time, miss rate, and miss penalty]
Avg Mem Access Time = L1 Hit Time + L1 Miss Rate * L1 Miss Penalty
L1 Miss Penalty = L2 Hit Time + L2 Miss Rate * L2 Miss Penalty
Avg Mem Access Time = L1 Hit Time + L1 Miss Rate * (L2 Hit Time + L2 Miss Rate * L2 Miss Penalty)
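Plugging numbers into this formula is straightforward; a tiny C sketch using the parameters from the L2 example that follows (so it prints an L1 miss penalty of 35 cycles and an AMAT of 2.75 cycles):

```c
#include <stdio.h>

int main(void) {
    /* Parameters from the worked L2 example below. */
    double l1_hit = 1.0, l1_miss_rate = 0.05;
    double l2_hit = 5.0, l2_miss_rate = 0.15, l2_miss_penalty = 200.0;

    /* The L1 miss penalty is itself the AMAT of the L2 cache. */
    double l1_miss_penalty = l2_hit + l2_miss_rate * l2_miss_penalty;
    double amat = l1_hit + l1_miss_rate * l1_miss_penalty;

    printf("L1 miss penalty = %.2f cycles\n", l1_miss_penalty); /* 35.00 */
    printf("AMAT            = %.2f cycles\n", amat);            /* 2.75  */
    return 0;
}
```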

Example
Assume: Hit Time = 1 cycle, Miss Rate = 5%, Miss Penalty = 20 cycles. Calculate AMAT…
Avg mem access time = 1 + 0.05 x 20 = 1 + 1 cycles = 2 cycles

Ways to Reduce Miss Rate
Larger cache: limited by cost and technology; the hit time of the first-level cache must be less than the cycle time (bigger caches are slower).
More places in the cache to put each block of memory: associativity.
fully associative: any block in any line
N-way set associative: N places for each block
direct-mapped: N = 1

Typical Scale
L1: size: tens of KB; hit time: complete in one clock cycle; miss rates: 1-5%
L2: size: hundreds of KB; hit time: a few clock cycles; miss rates: 10-20%
The L2 miss rate is the fraction of L1 misses that also miss in L2. Why so high?

Example: With L2 Cache
Assume: L1 Hit Time = 1 cycle, L1 Miss Rate = 5%, L2 Hit Time = 5 cycles, L2 Miss Rate = 15% (% of L1 misses that miss), L2 Miss Penalty = 200 cycles
L1 miss penalty = 5 + 0.15 * 200 = 35 cycles
Avg mem access time = 1 + 0.05 x 35 = 2.75 cycles

Example: Without L2 Cache
Assume: L1 Hit Time = 1 cycle, L1 Miss Rate = 5%, L1 Miss Penalty = 200 cycles
Avg mem access time = 1 + 0.05 x 200 = 11 cycles
4x faster with L2 cache! (2.75 vs. 11)