ENG3380 Computer Organization and Architecture “Cache Memory Part II”


ENG3380 Computer Organization and Architecture “Cache Memory Part II” Winter 2017 S. Areibi School of Engineering University of Guelph

Topics: Memory Hierarchy; Locality; Motivation; Cache Memory Principles; Elements of Cache Design (Cache Addresses, Cache Size, Mapping Function, Replacement Algorithms, Write Policy, Line Size); Summary. With thanks to W. Stallings, Hamacher, J. Hennessy, and M. J. Irwin for lecture slide contents. Many slides adapted from the PPT slides accompanying the textbook and the CSE331 course.

References "Computer Organization and Architecture: Designing for Performance", 10th edition, by William Stallings, Pearson. "Computer Organization and Design: The Hardware/Software Interface", 5th edition, by D. Patterson and J. Hennessy, Morgan Kaufmann. "Computer Organization and Architecture: Themes and Variations", 2014, by Alan Clements, CENGAGE Learning.

Memory Hierarchy

Memory Hierarchy The design constraints on a computer's memory can be summed up by three questions: (i) how much? (ii) how fast? (iii) how expensive? There is a tradeoff among these three key characteristics. A variety of technologies are used to implement memory systems. The dilemma facing the designer is clear: large capacity, fast, and low cost! Solution: employ a memory hierarchy. (Figure: flip-flops/registers, static RAM cache, dynamic RAM main memory, disk cache, magnetic disk, removable media.)

Memory Hierarchy As you go further from the processor, capacity and latency increase: registers (1 KB, 1 cycle), L1 data or instruction cache (32 KB, 2 cycles), L2 cache (2 MB, 15 cycles), memory (1 GB, 300 cycles), disk (80 GB, ~10M cycles).

Memory Hierarchy Levels Block (aka line): the unit of copying; may be multiple words. If accessed data is present in the upper level, it is a hit: the access is satisfied by the upper level (hit ratio = hits/accesses). If accessed data is absent, it is a miss: the block is copied from the lower level, the time taken is the miss penalty, and miss ratio = misses/accesses = 1 - hit ratio; the accessed data is then supplied from the upper level.

Main Memory vs. Cache (figure: registers, static RAM cache, dynamic RAM main memory).

CPU + Bus + Memory (figure: CPU with cache controller and static RAM cache on the local CPU/memory bus, dynamic RAM main memory, a Peripheral Component Interconnect (PCI) bus with a video adaptor and an EISA/PCI bridge, and an EISA PC bus with a hard drive, SCSI, co-processor, and PC cards).

How is the Hierarchy Managed? registers &lt;-&gt; memory: by the compiler (programmer?); cache &lt;-&gt; main memory: by the cache controller hardware; main memory &lt;-&gt; disks: by the operating system (virtual memory), with virtual-to-physical address mapping assisted by the hardware (TLB), and by the programmer (files).

Locality

Principle of Locality (Temporal) Programs access a small proportion of their address space at any time (locality in time). Temporal locality: items accessed recently are likely to be accessed again soon, so keep the most recently accessed data items closer to the processor; e.g., instructions in a loop, induction variables. for (i=0; i<1000; i++) x[i] = x[i] + s;

Principle of Locality (Spatial) Programs access a small proportion of their address space at any time (locality in space). Spatial locality: items near those accessed recently are likely to be accessed soon, so move blocks consisting of contiguous words to the upper levels; e.g., sequential instruction access, scanning array data. for (i=0; i<1000; i++) x[i] = x[i] + s;

Taking Advantage of Locality Memory hierarchy: store everything on disk; copy recently accessed (and nearby) items from disk to smaller DRAM memory (main memory); copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory (cache memory attached to the CPU).

Cache and Locality Why do caches work? Temporal locality: if you used some data recently, you will likely use it again Spatial locality: if you used some data recently, you will likely access its neighbors No hierarchy: average access time for data = 300 cycles 32KB 1-cycle L1 cache that has a hit rate of 95%: average access time = 0.95 x 1 + 0.05 x (301) = 16 cycles

Motivation

A Typical Memory Hierarchy By taking advantage of the principle of locality: present the user with as much memory as is available in the cheapest technology, and provide access at the speed offered by the fastest technology. (Figure: on-chip components (register file, instruction and data caches, ITLB/DTLB), a second-level SRAM cache, DRAM main memory, and secondary memory on disk; speeds range from about 0.1 ns to 1,000s of ns, sizes from 100s of bytes to terabytes, and cost per byte from highest to lowest.) The memory system of a modern computer consists of a series of black boxes ranging from the fastest to the slowest. Besides varying in speed, these boxes also vary in size (smallest to biggest) and cost. What makes this kind of arrangement work is one of the most important principles in computer design: the principle of locality. The design goal is to present the user with as much memory as is available in the cheapest technology (the disk), while, by taking advantage of the principle of locality, providing the user an average access speed that is very close to the speed offered by the fastest technology.

The Memory Hierarchy Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology. (Figure: processor, L1$, L2$, main memory, secondary memory, with increasing distance from the processor in access time and increasing relative size at each level; transfer units grow from 4-8 bytes (a word) between the processor and L1$, to 8-32 bytes (a block) between L1$ and L2$, to 1 to 4 blocks between L2$ and main memory, to 1,024+ bytes (a disk sector = page) between main memory and secondary memory.) Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM.

Processor-memory Performance Gap The Processor vs DRAM speed disparity continues to grow Good memory hierarchy (cache) design is increasingly important to overall performance

Why Pipeline? For Throughput! To avoid a structural hazard we need two caches on-chip: one for instructions (I$) and one for data (D$). (Figure: pipeline diagram showing instructions 0-4 flowing through I$, Reg, ALU, and D$ stages over time, in instruction order.) To keep the pipeline running at its maximum rate, both I$ and D$ need to satisfy a request from the datapath every cycle. What happens when they can't do that?

Cache Memory

Terminology Hit: data is in some block in the upper level (Blk X). Hit rate: fraction of memory accesses found in the upper level. Hit time: time to access the upper level, which consists of the SRAM access time + the time to determine hit/miss. Miss: data is not in the upper level, so it needs to be retrieved from a block in the lower level (Blk Y). Miss rate = 1 - (hit rate). Miss penalty: time to bring in a block from the lower level and replace a block in the upper level with it + time to deliver the block to the processor. Hit time << miss penalty. A hit is when the data the processor wants to access is found in the upper level (Blk X). The fraction of memory accesses that hit is defined as the hit rate. Hit time is the time to access the upper level where the data is found (X); it consists of (a) the time to access this level and (b) the time to determine if this is a hit or a miss. If the data the processor wants cannot be found in the upper level, then we have a miss and we need to retrieve the data (Blk Y) from the lower level. By definition, the miss rate is just 1 minus the hit rate. The miss penalty also consists of two parts: (a) the time it takes to replace a block (Blk Y into Blk X) in the upper level, and (b) the time it takes to deliver this new block to the processor. It is very important that your hit time be much, much smaller than your miss penalty; otherwise, there will be no reason to build a memory hierarchy.

Four Questions for Cache Design Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement strategy) Q4: What happens on a write? (Write strategy)

Cache: Block Placement Determined by associativity. Direct mapped (1-way set associative): one choice for placement. n-way set associative: n choices within a set. Fully associative: any location. Higher associativity reduces the miss rate but increases complexity, cost, and access time.

Cache: Block Placement Q1: Where can a block be placed in the upper level? Block 12 placed in an 8-block cache: direct mapped (12 mod 8 = block 4), 2-way set associative (12 mod 4 = set 0), or fully associative (any block). Set-associative mapping = block number modulo the number of sets. (Figure: 8-block cache and 32-block memory.)

Cache: Block Identification Cache memory: the level of the memory hierarchy closest to the CPU. Given accesses X1, ..., Xn-1, Xn: how do we know if the data is present? Where do we look?

Block Identification: Finding a Block Associativity determines the location method and the number of tag comparisons: direct mapped: index, 1 comparison; n-way set associative: set index, then search entries within the set, n comparisons; fully associative: search all entries, #entries comparisons (or use a full lookup table). Hardware caches reduce comparisons to reduce cost.

Block Identification Use the lower part of the address as an index into the cache. How do we know whether a requested block (say block 9 or block 13) is the one actually in the cache? Store the block address as well as the data. Actually, only the high-order bits of the block address are needed; these are called the tag.

Block Identification: Tags Every cache block has a tag in addition to its data. Tag: the upper part of the address, the part not used to index the cache.

Valid Bits What if there is no data in a location? Valid bit: 1 = present, 0 = not present; initially 0. (Figure: cache entries with valid bit, tag, and data fields.)

Cache Hit/Miss: Example Consider the main memory word reference string 0 1 2 3 4 3 4 15 on a 4-block direct-mapped cache with one word per block. Start with an empty cache: all blocks initially marked as not valid. Access 0 (tag 00, index 00): miss. Access 1 (tag 00, index 01): miss. Access 2 (tag 00, index 10): miss. Access 3 (tag 00, index 11): miss. Access 4 (tag 01, index 00): miss, replacing Mem(0). Access 3 (tag 00, index 11): hit. Access 4 (tag 01, index 00): hit. Access 15 (tag 11, index 11): miss, replacing Mem(3). Result: 8 requests, 6 misses.

Another Reference String Mapping Consider the main memory word reference string 0 4 0 4 0 4 0 4 on the same 4-block direct-mapped cache. Start with an empty cache: all blocks initially marked as not valid. Words 0 (tag 00) and 4 (tag 01) both map to index 00, so every access evicts the other: all eight accesses miss. Result: 8 requests, 8 misses. This is the ping-pong effect due to conflict misses: two memory locations that map into the same cache block.
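The two traces above are easy to reproduce in a few lines of C. The sketch below is not from the course material; it models the same cache assumed by the slide (4 blocks, direct mapped, one word per block), with the index taken as the word address modulo 4 and the tag as the word address divided by 4.

#include <stdio.h>

#define NUM_BLOCKS 4                       /* direct mapped, one word per block */

/* Run a word-address trace through the cache and return the miss count. */
static int count_misses(const int *trace, int n) {
    int valid[NUM_BLOCKS] = {0};
    int tag[NUM_BLOCKS];
    int misses = 0;
    for (int i = 0; i < n; i++) {
        int index = trace[i] % NUM_BLOCKS;     /* low-order address bits  */
        int t     = trace[i] / NUM_BLOCKS;     /* high-order address bits */
        if (!valid[index] || tag[index] != t) {    /* miss: fill the block */
            valid[index] = 1;
            tag[index]   = t;
            misses++;
        }
    }
    return misses;
}

int main(void) {
    int trace1[] = {0, 1, 2, 3, 4, 3, 4, 15};
    int trace2[] = {0, 4, 0, 4, 0, 4, 0, 4};   /* 0 and 4 both map to index 0 */
    printf("trace 1: %d misses out of 8\n", count_misses(trace1, 8));  /* 6 */
    printf("trace 2: %d misses out of 8\n", count_misses(trace2, 8));  /* 8 */
    return 0;
}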

Direct Mapped Cache

Direct Mapped Cache Location determined by address. Direct mapped: only one choice. CacheIndex = (BlockAddress) modulo (#Blocks in cache); e.g., index = 9 mod 4 = 1. If #Blocks (i.e., the number of entries in the cache) is a power of 2, then the modulo (i.e., CacheIndex) can be computed simply by using the low-order log2(cache size in blocks) bits of the address (log2(4) = 2).

The Direct Mapped Cache For each item of data at the lower level (main memory), there is exactly one location in the upper level (cache) where it might be, so many items at the lower level must share locations in the upper level. Address mapping: (block address) modulo (# of blocks in the cache). Direct-mapped cache: each address maps to a unique cache location.

Accessing the Cache Cache equations for a DM cache: BlockAddress = ByteAddress / BytesPerBlock; CacheIndex = BlockAddress % #CacheBlocks. (Figure: byte address ...101000 split into a 3-bit offset for 8-byte words and a 3-bit index selecting one of 8 sets (blocks) in the data array.) Direct-mapped cache: each address maps to a unique location.

The Tag Array Because each cache location can contain the contents of a number of different memory locations, a tag is added to every block to further identify the requested item. (Figure: the tag bits of the byte address are compared against the tag array entry selected by the 3 index bits; the data array has 8 sets of 8-byte words.)

Caching: Example (Figure: 16-word main memory with addresses 0000xx-1111xx and a 4-entry cache with valid, tag, and data fields; the two low-order bits define the byte in the word for 32-bit words.) Q1: How do we find it? Use the next 2 low-order memory address bits, the index, to determine which cache block to look in (i.e., the block address modulo the number of blocks in the cache). Q2: Is it there? Compare the cache tag to the high-order 2 memory address bits to tell if the memory block is in the cache. The valid bit indicates whether an entry contains valid information; if the bit is not set, there cannot be a match for this block.

Larger Block Size Consider a 64-block cache with 16 bytes/block. To what block number does byte address 1200 map? BlockAddress = ByteAddress / BlockSize, so block address = 1200 / 16 = 75. CacheIndex = BlockAddress % #CacheBlocks, so cache index (block number) = 75 modulo 64 = 11. Tag = BlockAddress / #CacheBlocks = 75 / 64 = 1. Address layout: 22-bit tag (bits 31-10), 6-bit index (bits 9-4), 4-bit offset (bits 3-0).
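The arithmetic on this slide is easy to check in code. The sketch below was written for this text rather than taken from the slides; it applies the divide/modulo equations to byte address 1200 with 16-byte blocks and 64 cache blocks. Because the sizes are powers of two, the same values also fall out of shifting and masking the address bits.

#include <stdio.h>

int main(void) {
    unsigned byte_addr  = 1200;
    unsigned block_size = 16;    /* bytes per block      */
    unsigned num_blocks = 64;    /* blocks in the cache  */

    unsigned block_addr = byte_addr / block_size;    /* 1200 / 16 = 75 */
    unsigned index      = block_addr % num_blocks;   /* 75 mod 64 = 11 */
    unsigned tag        = block_addr / num_blocks;   /* 75 / 64   = 1  */

    /* Equivalent bit-field view: index = (byte_addr >> 4) & 0x3F; tag = byte_addr >> 10; */
    printf("block address = %u, index = %u, tag = %u\n", block_addr, index, tag);
    return 0;
}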

Direct Mapped Cache Example A processor generates byte addresses 88, 104, 88, 104, 64, 12, 64, 72. It has a direct mapped cache (1-way set associative) with 4 sets (blocks), and the set (block) size is 4 bytes. For each access, is it a hit or a miss? Solution: compute BlockAddress = ByteAddress / BlockSize, CacheIndex = BlockAddress % #CacheBlocks, and Tag = BlockAddress / #CacheBlocks. Stepping through the accesses with an initially empty cache: 88 (block 22, index 2, tag 5): miss; 104 (block 26, index 2, tag 6): miss, replacing MemBlock[22]; 88 (block 22, index 2, tag 5): miss, replacing MemBlock[26]; 104 (block 26, index 2, tag 6): miss, replacing MemBlock[22]; 64 (block 16, index 0, tag 4): miss; 12 (block 3, index 3, tag 0): miss; 64 (block 16, index 0, tag 4): hit; 72 (block 18, index 2, tag 4): miss, replacing MemBlock[26]. Final cache contents: index 0 holds MemBlock[16], index 2 holds MemBlock[18], index 3 holds MemBlock[3]. Result: 8 requests, 7 misses.
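The whole walkthrough can be regenerated with a short simulation. The sketch below is an illustration added here, not part of the course notes; it keeps a valid bit and tag per block, applies the three equations above to each byte address, and prints hit or miss. With NUM_BLOCKS set to 4 it reports 7 misses; changing it to 8 reproduces the 5-miss result of the 8-block example later in the deck, since these byte addresses correspond to words 22, 26, 16, 3, and 18.

#include <stdio.h>

#define NUM_BLOCKS 4    /* try 8 to reproduce the larger-cache example */
#define BLOCK_SIZE 4    /* bytes per block (one word)                  */

int main(void) {
    unsigned trace[] = {88, 104, 88, 104, 64, 12, 64, 72};
    int n = (int)(sizeof trace / sizeof trace[0]);
    unsigned tag[NUM_BLOCKS];
    int valid[NUM_BLOCKS] = {0};
    int misses = 0;

    for (int i = 0; i < n; i++) {
        unsigned block = trace[i] / BLOCK_SIZE;   /* block address */
        unsigned index = block % NUM_BLOCKS;      /* cache index   */
        unsigned t     = block / NUM_BLOCKS;      /* tag           */
        int hit = valid[index] && tag[index] == t;
        if (!hit) { valid[index] = 1; tag[index] = t; misses++; }
        printf("addr %3u -> block %2u, index %u, tag %u : %s\n",
               trace[i], block, index, t, hit ? "hit" : "miss");
    }
    printf("%d requests, %d misses\n", n, misses);
    return 0;
}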

Block Size Considerations Larger blocks should reduce the miss rate, due to spatial locality. But in a fixed-size cache, larger blocks mean fewer of them, so more competition and an increased miss rate; larger blocks can also cause pollution. Larger blocks also carry a larger miss penalty (i.e., the cost in time of the transfer), which can override the benefit of the reduced miss rate. Early restart and critical-word-first can help: early restart simply resumes execution as soon as the requested word of the block is returned, rather than waiting for the entire block.

Reduce Misses via Larger Block Size Increasing the cache size decreases the miss rate. Increasing the block size also lowers the miss rate; however, the miss rate may eventually go up if the block size becomes a significant fraction of the cache size. Why? Because the number of blocks that can be held in the cache becomes small, and there will be a great deal of competition for those blocks.

DM Cache Size

Direct-Mapped Cache Size The total number of bits needed for a cache is a function of (a) the cache size and (b) the address size, because the cache includes storage for both the data and the tags. For the following situation: 32-bit addresses; a direct-mapped cache; a cache size of 2^n blocks, so n bits are used for the index; a block size of 2^m words (2^(m+2) bytes), so m bits are used for the word within the block and two bits for the byte part of the address. The size of the tag field is 32 - (n + m + 2). The total number of bits in a direct-mapped cache is: 2^n x (data (block) size + tag size + valid field size). Since the block size is 2^m words of 32 (2^5) bits each, i.e., 2^(m+5) bits, and we need 1 bit for the valid field, the number of bits is: 2^n x (2^m x 2^5 + (32 - n - m - 2) + 1) = 2^n x (2^m x 32 + 31 - n - m). Address layout: 32-bit address = tag | n-bit index | block offset | byte offset.

DM Cache Size: Example How many total bits are required for a DM cache with 16 KiB of data and 4-word blocks, assuming a 32-bit address? Solution: 16 KiB is 4096 (2^12) words. With a block size of 4 words (2^2), there are 1024 (2^10) blocks. Each block has data: 4 x 32 = 128 bits, plus a tag of (32 - 10 - 2 - 2) = 18 bits, plus a valid bit: 1 bit. Thus, the total cache size is (128 + 18 + 1) x 1024 blocks = 150,528 bits (147 Kibit): 2^10 x (4 x 32 + (32 - 10 - 2 - 2) + 1) = 2^10 x 147 = 147 Kibibits, or 18.4 KiB for a 16 KiB cache. The total number of bits is about 1.15 times as many as needed for the data storage alone.
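Here is a quick check of the bit-count formula, written for this rewrite rather than taken from the slides. It plugs n = 10 index bits and m = 2 block-offset (word) bits into 2^n x (2^m x 32 + 31 - n - m) and also reports the result in KiB.

#include <stdio.h>

int main(void) {
    int n = 10;    /* index bits: 1024 blocks           */
    int m = 2;     /* block-offset bits: 4 words/block  */

    long blocks    = 1L << n;
    long data_bits = (1L << m) * 32;             /* 128 bits of data per block */
    long tag_bits  = 32 - n - m - 2;             /* 18-bit tag                 */
    long per_block = data_bits + tag_bits + 1;   /* + 1 valid bit = 147        */
    long total     = blocks * per_block;         /* 150,528 bits               */

    printf("total = %ld bits = %ld Kibit = %.1f KiB\n",
           total, total / 1024, total / 8.0 / 1024.0);
    return 0;
}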

Cache Access and Size This cache has 1024 entries. Each entry (block) is one word, and each word is 32 bits (4 bytes). Therefore: 2 bits are used as the byte offset, 10 bits are used as the index, and the tag is 32 - (10 + 2) = 20 bits. The cache size in this case is 4 KiB.

Direct-Mapped Cache Size The total number of bits in a direct-mapped cache is #blocks x (block size + tag size + valid field size). Although this is the actual size in bits, the naming convention is to exclude the size of the tag and valid field and to count only the size of the data. (Figure: cache entries with valid bit, tag, and data fields.)

Cache Performance

Cache Performance Metrics HitRate = #CacheHits / #CacheAccesses. MissRate = #CacheMisses / #CacheAccesses = 1 - HitRate. HitTime = time for a hit. MissPenalty = cost of a miss. Average Memory Access Time (AMAT) = HitTime + MissRate x MissPenalty.
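The AMAT formula turns directly into a one-line function. This sketch is illustrative only; the usage numbers are the ones from the earlier "Cache and Locality" slide (1-cycle hit time, 95% hit rate, 300-cycle memory behind the cache), which give the same 16 cycles as the weighted form 0.95 x 1 + 0.05 x 301 shown there.

#include <stdio.h>

/* Average Memory Access Time = HitTime + MissRate * MissPenalty */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* 32KB, 1-cycle L1 with a 95% hit rate in front of 300-cycle memory */
    printf("AMAT = %.1f cycles\n", amat(1.0, 0.05, 300.0));   /* 16.0 */
    return 0;
}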

Cache Miss Categories – 3 Cs Model Compulsory: the first access to a block is always a miss; also called cold-start misses; these are the misses you would have even in an infinite-size cache. Conflict: multiple memory locations map to the same cache location; also called collision misses; all other misses (those that are neither compulsory nor capacity). Capacity: the cache cannot contain all the blocks needed; capacity misses occur due to blocks being discarded and later retrieved. Capacity misses arise because the cache is simply not large enough to contain all the blocks accessed by the program; the solution is to increase the cache size. Compulsory misses cannot be avoided; they are incurred when we first start the program. Conflict misses are caused by multiple memory locations being mapped to the same cache location; there are two solutions: increase the cache size, or increase the associativity (for example, use a 2-way set associative cache instead of a direct mapped cache). Keep in mind that the cache miss rate is only one part of the equation; you also have to worry about cache access time and miss penalty, so do NOT optimize miss rate alone. Finally, there is another source of cache misses not covered here: invalidation misses, caused when another process, such as I/O, updates main memory, so the cache must be flushed to avoid inconsistency between memory and cache.

Cache Example

Cache Example (more blocks) Recall the earlier cache with 4 blocks (8 requests, 7 misses). This cache has 8 blocks (instead of 4), 1 word/block, direct mapped, and starts empty. The same reference stream, now as word addresses: 22 (binary 10 110): miss, placed in block 110; 26 (11 010): miss, placed in block 010; 22 (10 110): hit; 26 (11 010): hit; 16 (10 000): miss, placed in block 000; 3 (00 011): miss, placed in block 011; 16 (10 000): hit; 18 (10 010): miss, placed in block 010, replacing Mem[11010]. Final cache contents: block 000 holds Mem[10000], block 010 holds Mem[10010], block 011 holds Mem[00011], block 110 holds Mem[10110]. Result: 8 requests, 5 misses, vs. 8 requests with 7 misses before.

Address Subdivision & Architecture

Address Subdivision A cache memory can hold 32 Kbytes. Data is transferred between main memory and the cache in blocks of 16 bytes each. The main memory consists of 512 Kbytes. Show the format of main memory addresses in a DM cache organization, assuming that addressing is done at the byte level. Solution: total address lines needed for 512 Kbytes: 19 bits. Number of blocks: 32 Kbytes / 16 bytes = 2048 = 2^11, so 11 bits for the index. Byte offset within a block: 16 bytes = 2^4, so 4 bits for the word (byte offset). Tag = 19 - 11 - 4 = 4 bits. Address format: 4-bit tag | 11-bit index | 4-bit byte offset (19 bits total).
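The field widths in this solution can be derived mechanically from the three sizes. The following sketch was added for this text (it is not from the slides); it takes the integer log2 of the power-of-two sizes and assumes byte addressing, as stated above.

#include <stdio.h>

/* integer log2 for power-of-two values */
static int log2i(unsigned long x) {
    int n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    unsigned long mem_bytes   = 512UL * 1024;   /* 512 Kbyte main memory */
    unsigned long cache_bytes = 32UL * 1024;    /* 32 Kbyte cache        */
    unsigned long block_bytes = 16;             /* 16-byte blocks        */

    int addr_bits   = log2i(mem_bytes);                      /* 19 */
    int offset_bits = log2i(block_bytes);                    /*  4 */
    int index_bits  = log2i(cache_bytes / block_bytes);      /* 11 */
    int tag_bits    = addr_bits - index_bits - offset_bits;  /*  4 */

    printf("tag %d | index %d | offset %d (of %d address bits)\n",
           tag_bits, index_bits, offset_bits, addr_bits);
    return 0;
}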

Mapping Functions: Direct Mapping Use a small cache with 128 blocks of 16 words, and a main memory with 64K words (4K blocks). The memory is word-addressable, so addresses are 16 bits.

Direct Mapped Cache Example Cache size = 1K words, one word/block. (Figure: 32-bit byte address split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset; the index selects one of 1024 entries, each holding a valid bit, tag, and data word; a comparator checks the stored tag against the address tag to produce the hit signal and the 32-bit data.) Let's use a specific example with realistic numbers: assume we have a 1K-word (4-Kbyte) direct mapped cache with a block size of 4 bytes (1 word). In other words, each block associated with the cache tag has 4 bytes in it. With a block size of 4 bytes, the 2 least significant bits of the address are used as the byte select within the cache block. Since the cache size is 1K words, the upper 32 minus (10 + 2) bits, i.e., 20 bits of the address, are stored as the cache tag. The remaining 10 address bits in the middle, bits 2 through 11, are used as the cache index to select the proper cache entry. This organization exploits temporal locality.

Multiword Block Direct Mapped Cache Cache size = 1K words, four words/block. (Figure: 32-bit byte address split into a 20-bit tag (bits 31-12), an 8-bit index (bits 11-4), a 2-bit block offset, and a 2-bit byte offset; the index selects one of 256 entries, each holding a valid bit, tag, and a four-word data block; the block offset selects the word and a comparator produces the hit signal.) To take advantage of spatial locality, we want a cache block that is larger than one word in size. What kind of locality are we taking advantage of?

Multiword Block Direct Mapped Cache Cache size = 16K words, four words/block. (Figure: address bit positions; with 4K blocks this gives a 2-bit byte offset, a 2-bit block offset, a 12-bit index, and a 16-bit tag.)

Cache Misses & Hits

Cache Misses On a cache hit, the CPU proceeds normally. On a cache miss: stall the CPU pipeline and fetch the block from the next level of the hierarchy. On an instruction cache miss, restart the instruction fetch; on a data cache miss, complete the data access.

Cache Misses/Improving Performance Compulsory: first access to a block; a "cold" fact of life. Conflict: multiple memory locations mapped to the same cache location; solution 1: increase cache size; solution 2: increase associativity. Capacity: the cache cannot contain all blocks accessed by the program; solution: increase cache size. AMAT = HitTime + MissRate x MissPenalty. Reduce HitTime: small and simple cache. Reduce MissRate: larger block size, higher associativity. Reduce MissPenalty: multilevel caches, give priority to read misses.

Example: Intrinsity FastMATH Embedded MIPS processor with a 12-stage pipeline; instruction and data access on each cycle. Split cache: separate I-cache and D-cache, each 16KB: 256 blocks x 16 words/block. The D-cache can be write-through or write-back. SPEC2000 miss rates: I-cache 0.4%, D-cache 11.4%, weighted average 3.2%.
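The 3.2% figure is a weighted average of the two miss rates, weighted by how often instruction fetches versus data accesses occur. The slide does not give that mix, so the sketch below assumes roughly 75% of memory accesses are instruction fetches (a typical figure for integer code); that assumption is mine, not the slide's, and it lands close to the quoted 3.2%.

#include <stdio.h>

int main(void) {
    double i_miss  = 0.004;   /* I-cache miss rate: 0.4%                        */
    double d_miss  = 0.114;   /* D-cache miss rate: 11.4%                       */
    double f_instr = 0.75;    /* ASSUMED fraction of accesses that are fetches  */

    double combined = f_instr * i_miss + (1.0 - f_instr) * d_miss;
    printf("combined miss rate = %.2f%%\n", combined * 100.0);   /* close to the slide's 3.2% */
    return 0;
}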

Example: Intrinsity FastMATH (figure: cache organization).

Summary

Cache Summary The Principle of Locality: a program is likely to access a relatively small portion of the address space at any instant of time. Temporal locality: locality in time. Spatial locality: locality in space. Three major categories of cache misses: compulsory misses (sad facts of life, e.g., cold start misses); conflict misses (increase cache size and/or associativity; nightmare scenario: the ping-pong effect!); capacity misses (increase cache size). What's next? Set associative caches; write and replacement policies. To summarize: the memory hierarchy works because of the principle of locality, which says a program will access a relatively small portion of the address space at any instant of time. There are two types of locality: temporal locality, or locality in time, and spatial locality, or locality in space. So far, we have covered three major categories of cache misses. Compulsory misses are cache misses due to cold start; you cannot avoid them, but if you are going to run billions of instructions anyway, compulsory misses usually don't bother you. Conflict misses are misses caused by multiple memory locations being mapped to the same cache location. The nightmare scenario is the ping-pong effect, where a block is read into the cache but, before we have a chance to use it, it is immediately forced out by another conflict miss. You can reduce conflict misses by increasing the cache size, increasing the associativity, or both. Finally, capacity misses occur when the cache is not big enough to contain all the cache blocks required by the program; you can reduce this miss rate by making the cache larger. There are two write policies as far as cache writes are concerned. Write-through requires a write buffer, and the nightmare scenario is when stores occur so frequently that they saturate the write buffer. The second write policy is write-back: you write only to the cache, and only when the cache block is being replaced do you write it back to memory. No fancy replacement policy is needed for a direct mapped cache; that is what causes the direct mapped cache trouble to begin with: there is only one place a block can go in the cache, which causes conflict misses.

End Slides

Example Access Pattern Assume that addresses are 8 bits long. How many of the following address requests are hits/misses? 4, 7, 10, 13, 16, 68, 73, 78, 83, 88, 4, 7, 10... (Figure: direct-mapped cache with 8-byte words; the tag portion of the byte address is compared against the tag array, and each address maps to a unique location in the data array.)

Memory Systems that Support Caches The off-chip interconnect and memory architecture affect overall system performance dramatically. Assume: 1 clock cycle (1 ns) to send the address from the cache to main memory; 50 ns (50 processor clock cycles) for the DRAM first-word access time, with a 10 ns (10 clock cycle) cycle time for the remaining words in a burst (SDRAM); 1 clock cycle (1 ns) to return a word of data from main memory to the cache. Memory-bus-to-cache bandwidth = the number of bytes accessed from main memory and transferred to the cache/CPU per clock cycle. (Figure: on-chip CPU and cache connected over a bus, 32-bit data and 32-bit address per cycle, to main memory.)

One Word Wide Memory Organization If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory: 1 cycle to send the address, 50 cycles to read the DRAM, and 1 cycle to return the data, for a total of 52 clock cycles of miss penalty. The number of bytes transferred per clock cycle (bandwidth) for a miss is 4/52 = 0.077 bytes per clock.

Burst Memory Organization What if the block size is four words and a (DDR) SDRAM is used? 1 cycle to send the first address, 50 + 3*10 = 80 cycles to read the DRAM (50 for the first word, 10 for each of the remaining three), and 1 cycle to return the last data word, for a total of 82 clock cycles of miss penalty. The number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/82 = 0.195 bytes per clock.
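Both miss penalties and bandwidth figures follow from the latency assumptions on the "Memory Systems that Support Caches" slide. The sketch below was added for this text and simply redoes the arithmetic; note that 16 bytes over 82 cycles comes out to about 0.195 bytes per clock.

#include <stdio.h>

int main(void) {
    int addr_cycles = 1;    /* send the address to main memory        */
    int first_word  = 50;   /* DRAM first-word access time            */
    int burst_word  = 10;   /* each remaining word in an SDRAM burst  */
    int return_word = 1;    /* return one word over the bus           */
    int word_bytes  = 4;

    int penalty1 = addr_cycles + first_word + return_word;                   /* 52: one-word block  */
    int penalty4 = addr_cycles + first_word + 3 * burst_word + return_word;  /* 82: four-word burst */

    printf("1-word block : %d cycles, %.3f bytes/cycle\n",
           penalty1, (double)word_bytes / penalty1);     /* 0.077 */
    printf("4-word burst : %d cycles, %.3f bytes/cycle\n",
           penalty4, 4.0 * word_bytes / penalty4);       /* 0.195 */
    return 0;
}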