CS224 Spring 2011 Computer Organization CS224 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Spring 2011 With thanks to M.J. Irwin, D. Patterson,

Slides:



Advertisements
Similar presentations
CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 2 Instructors: Krste Asanovic & Vladimir Stojanovic
Advertisements

CS.305 Computer Architecture Memory: Structures Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly made.
331 Week13.1Spring :332:331 Computer Architecture and Assembly Language Spring 2006 Week 13 Basics of Cache [Adapted from Dave Patterson’s UCB CS152.
1  1998 Morgan Kaufmann Publishers Chapter Seven Large and Fast: Exploiting Memory Hierarchy.
Computer Systems Organization: Lecture 11
1 Chapter Seven Large and Fast: Exploiting Memory Hierarchy.
Chapter 7 Large and Fast: Exploiting Memory Hierarchy Bo Cheng.
Memory Chapter 7 Cache Memories.
Memory Hierarchy.1 Review: Major Components of a Computer Processor Control Datapath Memory Devices Input Output.
331 Lec20.1Fall :332:331 Computer Architecture and Assembly Language Fall 2003 Week 13 Basics of Cache [Adapted from Dave Patterson’s UCB CS152.
ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )
1  1998 Morgan Kaufmann Publishers Chapter Seven Large and Fast: Exploiting Memory Hierarchy.
ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )
331 Lec20.1Spring :332:331 Computer Architecture and Assembly Language Spring 2005 Week 13 Basics of Cache [Adapted from Dave Patterson’s UCB CS152.
Review: The Memory Hierarchy
1 SRAM: –value is stored on a pair of inverting gates –very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: –value is stored as a charge.
CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( ) [Adapted from Computer.
1 CSE SUNY New Paltz Chapter Seven Exploiting Memory Hierarchy.
Memory Hierarchy and Cache Design The following sources are used for preparing these slides: Lecture 14 from the course Computer architecture ECE 201 by.
CS3350B Computer Architecture Winter 2015 Lecture 3.2: Exploiting Memory Hierarchy: How? Marc Moreno Maza [Adapted from.
CPE432 Chapter 5A.1Dr. W. Abu-Sufah, UJ Chapter 5A: Exploiting the Memory Hierarchy, Part 3 Adapted from Slides by Prof. Mary Jane Irwin, Penn State University.
Computer Architecture and Design – ECEN 350 Part 9 [Some slides adapted from M. Irwin, D. Paterson. D. Garcia and others]
CPE432 Chapter 5A.1Dr. W. Abu-Sufah, UJ Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Adapted from Slides by Prof. Mary Jane Irwin, Penn State University.
CPE232 Memory Hierarchy1 CPE 232 Computer Organization Spring 2006 Memory Hierarchy Dr. Gheith Abandah [Adapted from the slides of Professor Mary Irwin.
CSE331 W15.1Irwin&Li Fall 2006 PSU CSE 331 Computer Organization and Design Fall 2006 Week 15 Section 1: Mary Jane Irwin (
CSIE30300 Computer Architecture Unit 07: Main Memory Hsin-Chou Chi [Adapted from material by and
Caching & Virtual Memory Systems Chapter 7  Caching l To address bottleneck between CPU and Memory l Direct l Associative l Set Associate  Virtual Memory.
Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy.
EEE-445 Review: Major Components of a Computer Processor Control Datapath Memory Devices Input Output Cache Main Memory Secondary Memory (Disk)
10/18: Lecture topics Memory Hierarchy –Why it works: Locality –Levels in the hierarchy Cache access –Mapping strategies Cache performance Replacement.
CSIE30300 Computer Architecture Unit 9: Improving Cache Performance Hsin-Chou Chi [Adapted from material by and
CS.305 Computer Architecture Improving Cache Performance Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly.
CS1104 – Computer Organization PART 2: Computer Architecture Lecture 10 Memory Hierarchy.
CSIE30300 Computer Architecture Unit 08: Cache Hsin-Chou Chi [Adapted from material by and
CPE432 Chapter 5A.1Dr. W. Abu-Sufah, UJ Chapter 5A: Exploiting the Memory Hierarchy, Part 2 Adapted from Slides by Prof. Mary Jane Irwin, Penn State University.
1 CMPE 421 Parallel Computer Architecture PART4 Caching with Associativity.
1Chapter 7 Memory Hierarchies (Part 2) Outline of Lectures on Memory Systems 1. Memory Hierarchies 2. Cache Memory 3. Virtual Memory 4. The future.
Additional Slides By Professor Mary Jane Irwin Pennsylvania State University Group 3.
1010 Caching ENGR 3410 – Computer Architecture Mark L. Chang Fall 2006.
The Goal: illusion of large, fast, cheap memory Fact: Large memories are slow, fast memories are small How do we create a memory that is large, cheap and.
Lecture 08: Memory Hierarchy Cache Performance Kai Bu
1 Chapter Seven. 2 Users want large and fast memories! SRAM access times are ns at cost of $100 to $250 per Mbyte. DRAM access times are ns.
Nov. 15, 2000Systems Architecture II1 Machine Organization (CS 570) Lecture 8: Memory Hierarchy Design * Jeremy R. Johnson Wed. Nov. 15, 2000 *This lecture.
CS.305 Computer Architecture Memory: Caches Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly made available.
CPE232 Cache Introduction1 CPE 232 Computer Organization Spring 2006 Cache Introduction Dr. Gheith Abandah [Adapted from the slides of Professor Mary Irwin.
1 Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY. 2 SRAM: –value is stored on a pair of inverting gates –very fast but takes up more space than DRAM (4.
1 Chapter Seven. 2 Users want large and fast memories! SRAM access times are ns at cost of $100 to $250 per Mbyte. DRAM access times are ns.
1  1998 Morgan Kaufmann Publishers Chapter Seven.
Improving Memory Access 2/3 The Cache and Virtual Memory
CPE232 Improving Cache Performance1 CPE 232 Computer Organization Spring 2006 Improving Cache Performance Dr. Gheith Abandah [Adapted from the slides of.
1 Chapter Seven. 2 SRAM: –value is stored on a pair of inverting gates –very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: –value.
CS35101 Computer Architecture Spring 2006 Lecture 18: Memory Hierarchy Paul Durand ( ) [Adapted from M Irwin (
CSE431 L20&21 Improving Cache Performance.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 20&21. Improving Cache Performance Mary Jane.
CSE431 L18 Memory Hierarchy.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 18: Memory Hierarchy Review Mary Jane Irwin (
CPEG3231 Integration of cache and MIPS Pipeline  Data-path control unit design  Pipeline stalls on cache misses.
1Chapter 7 Memory Hierarchies Outline of Lectures on Memory Systems 1. Memory Hierarchies 2. Cache Memory 3. Virtual Memory 4. The future.
CS 61C: Great Ideas in Computer Architecture (Machine Structures) Set-Associative Caches Instructors: Randy H. Katz David A. Patterson
CS 61C: Great Ideas in Computer Architecture Caches Part 2 Instructors: Nicholas Weaver & Vladimir Stojanovic
CS 110 Computer Architecture Lecture 15: Caches Part 2 Instructor: Sören Schwertfeger School of Information Science and Technology.
CMSC 611: Advanced Computer Architecture
COSC3330 Computer Architecture
Yu-Lun Kuo Computer Sciences and Information Engineering
The Goal: illusion of large, fast, cheap memory
Morgan Kaufmann Publishers Memory & Cache
Morgan Kaufmann Publishers
Morgan Kaufmann Publishers Memory Hierarchy: Cache Basics
CS 314 Computer Organization Fall 2017 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Haojin Zhu Professor [Adapted from Computer Organization.
Morgan Kaufmann Publishers Memory Hierarchy: Introduction
Chapter Five Large and Fast: Exploiting Memory Hierarchy
Cache Memory Rabi Mahapatra
Presentation transcript:

CS224 Spring 2011 Computer Organization CS224 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Spring 2011 With thanks to M.J. Irwin, D. Patterson, and J. Hennessy for some lecture slide contents

CS224 Spring 2011 Review: Major Components of a Computer Processor Control Datapath Memory Devices Input Output Cache Mai n Me mor y Secon dary Memo ry (Disk)

CS224 Spring 2011 Processor-Memory Performance Gap “Moore’s Law” µProc 55%/year (2X/1.5yr) DRAM 7%/year (2X/10yrs) Processor-Memory Performance Gap (grows 50%/year)

CS224 Spring 2011 The “Memory Wall”  Processor vs DRAM speed disparity continues to grow Clocks per instruction Clocks per DRAM access  Good memory hierarchy (cache) design is increasingly important to overall performance

CS224 Spring 2011 Memory Technology  Static RAM (SRAM) 0.5ns – 2.5ns, $2000 – $5000 per GB  Dynamic RAM (DRAM) 50ns – 70ns, $20 – $75 per GB  Magnetic disk 5ms – 20ms, $0.20 – $2 per GB  Ideal memory Access time of SRAM Capacity and cost/GB of disk §5.1 Introduction

CS224 Spring 2011 The Memory Hierarchy Goal  Fact: Large memories are slow and fast memories are small  How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)? With hierarchy With parallelism

CS224 Spring 2011 Second Level Cache (SRAM) A Typical Memory Hierarchy Control Datapath Secondary Memory (Disk) On-Chip Components R e g F il e Main Memory (DRAM) Data Cache Instr Cache ITLB DTLB Speed (%cycles): ½’s 1’s 10’s 100’s 10,000’s Size (bytes): 100’s 10K’s M’s G’s T’s Cost: highest lowest  Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology at the speed offered by the fastest technology

CS224 Spring 2011 Memory Hierarchy Technologies  Caches use SRAM for speed and technology compatibility Fast (typical access times of 0.5 to 2.5 nsec) Low density (6 transistor cells), higher power, expensive ($2000 to $5000 per GB in 2008) Static: content will last “forever” (as long as power is left on)  Main memory uses DRAM for size (density) Slower (typical access times of 50 to 70 nsec) High density (1 transistor cells), lower power, cheaper ($20 to $75 per GB in 2008) Dynamic: needs to be “refreshed” regularly (~ every 8 ms) - consumes1% to 2% of the active cycles of the DRAM Addresses divided into 2 halves (row and column) -RAS or Row Access Strobe triggering the row decoder -CAS or Column Access Strobe triggering the column selector

CS224 Spring 2011 The Memory Hierarchy: Why Does it Work?  Temporal Locality (locality in time) If a memory location is referenced then it will tend to be referenced again soon  Keep most recently accessed data items closer to the processor  Spatial Locality (locality in space) If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon  Move blocks consisting of contiguous words closer to the processor

CS224 Spring 2011 The Memory Hierarchy: Terminology  Block (or line): the minimum unit of information that is present (or not) in a cache  Hit Rate: the fraction of memory accesses found in a level of the memory hierarchy Hit Time: Time to access that level which consists of Time to access the block + Time to determine hit/miss  Miss Rate: the fraction of memory accesses not found in a level of the memory hierarchy  1 - (Hit Rate) Miss Penalty: Time to replace a block in that level with the corresponding block from a lower level which consists of Time to access the block in the lower level + Time to transmit that block to the level that experienced the miss + Time to insert the block in that level + Time to pass the block to the requestor Hit Time << Miss Penalty

CS224 Spring 2011 Characteristics of the Memory Hierarchy Increasing distance from the processor in access time L1$ L2$ Main Memory Secondary Memory Processor (Relative) size of the memory at each level Inclusive– what is in L1$ is a subset of what is in L2$ is a subset of what is in MM that is a subset of is in SM 4-8 bytes (word) 1 to 4 blocks 1,024+ bytes (disk sector = page) 8-32 bytes (block)

CS224 Spring 2011 Cache Memory  Cache memory The level of the memory hierarchy closest to the CPU  Given accesses X 1, …, X n–1, X n §5.2 The Basics of Caches  How do we know if the data is present?  Where do we look?

CS224 Spring 2011 How is the Hierarchy Managed?  registers  memory by compiler (programmer?)  cache  main memory by the cache controller hardware  main memory  disks by the operating system (virtual memory) virtual to physical address mapping assisted by the hardware (TLB) by the programmer (files)

CS224 Spring 2011  Two questions to answer (in hardware): Q1: How do we know if a data item is in the cache? Q2: If it is, how do we find it?  Direct mapped Each memory block is mapped to exactly one block in the cache -lots of lower level blocks must share blocks in the cache Address mapping (to answer Q2): (block address) modulo (# of blocks in the cache) Have a tag associated with each cache block that contains the address information (the upper portion of the address) required to identify the block (to answer Q1) Cache Basics

CS224 Spring 2011 Caching: A Simple First Example Cache 0000xx 0001xx 0010xx 0011xx 0100xx 0101xx 0110xx 0111xx 1000xx 1001xx 1010xx 1011xx 1100xx 1101xx 1110xx 1111xx Main Memory TagData Q1: Is it there? Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache Valid One word blocks Two low order bits define the byte in the word (32b words) Q2: How do we find it? Use next 2 low order memory address bits – the index – to determine which cache block (i.e., modulo the number of blocks in the cache) (block address) modulo (# of blocks in the cache) Index

CS224 Spring 2011 Caching: A Simple First Example Cache Main Memory Q2: How do we find it? Use next 2 low order memory address bits – the index – to determine which cache block (i.e., modulo the number of blocks in the cache) TagData Q1: Is it there? Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache Valid 0000xx 0001xx 0010xx 0011xx 0100xx 0101xx 0110xx 0111xx 1000xx 1001xx 1010xx 1011xx 1100xx 1101xx 1110xx 1111xx One word blocks Two low order bits define the byte in the word (32b words) (block address) modulo (# of blocks in the cache) Index

CS224 Spring 2011 Direct Mapped Cache  Consider the main memory word reference string Start with an empty cache - all blocks initially marked as not valid

CS224 Spring 2011 Direct Mapped Cache  Consider the main memory word reference string Mem(0) 00 Mem(1) 00 Mem(0) 00 Mem(1) 00 Mem(2) miss hit 00 Mem(0) 00 Mem(1) 00 Mem(2) 00 Mem(3) 01 Mem(4) 00 Mem(1) 00 Mem(2) 00 Mem(3) 01 Mem(4) 00 Mem(1) 00 Mem(2) 00 Mem(3) 01 Mem(4) 00 Mem(1) 00 Mem(2) 00 Mem(3) Mem(1) 00 Mem(2) 00 Mem(3) Start with an empty cache - all blocks initially marked as not valid 8 requests, 6 misses

CS224 Spring 2011  One word blocks, cache size = 1K words (or 4KB) MIPS Direct Mapped Cache Example 20 Tag 10 Index Data IndexTagValid Byte offset What kind of locality are we taking advantage of? 20 Data 32 Hit

CS224 Spring 2011 Multiword Block Direct Mapped Cache 8 Index Data IndexTagValid Byte offset 20 Tag HitData 32 Block offset  Four words/block, cache size = 1K words What kind of locality are we taking advantage of?

CS224 Spring 2011 Taking Advantage of Spatial Locality 0  Let cache block hold more than one word Start with an empty cache - all blocks initially marked as not valid

CS224 Spring 2011 Taking Advantage of Spatial Locality 0  Let cache block hold more than one word Mem(1) Mem(0) miss 00 Mem(1) Mem(0) hit 00 Mem(3) Mem(2) 00 Mem(1) Mem(0) miss hit 00 Mem(3) Mem(2) 00 Mem(1) Mem(0) miss 00 Mem(3) Mem(2) 00 Mem(1) Mem(0) hit 00 Mem(3) Mem(2) 01 Mem(5) Mem(4) hit 00 Mem(3) Mem(2) 01 Mem(5) Mem(4) 00 Mem(3) Mem(2) 01 Mem(5) Mem(4) miss Start with an empty cache - all blocks initially marked as not valid 8 requests, 4 misses

CS224 Spring 2011 Miss Rate vs Block Size vs Cache Size  Miss rate goes up if the block size becomes a significant fraction of the cache size because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses)

CS224 Spring 2011 Cache Field Sizes  The number of bits in a cache includes both the storage for data and for the tags 32-bit byte address For a direct mapped cache with 2 n blocks, n bits are used for the index For a block size of 2 m words (2 m+2 bytes), m bits are used to address the word within the block and 2 bits are used to address the byte within the word  What is the size of the tag field?  The total number of bits in a direct-mapped cache is then 2 n x (block size + tag field size + valid field size)  How many total bits are required for a direct mapped cache with 16KB of data and 4-word blocks assuming a 32-bit address?

CS224 Spring 2011  Read hits (I$ and D$) this is what we want!  Write hits (D$ only) require the cache and memory to be consistent -always write the data into both the cache block and the next level in the memory hierarchy (write-through) -writes run at the speed of the next level in the memory hierarchy – so slow! – or can use a write buffer and stall only if the write buffer is full allow cache and memory to be inconsistent -write the data only into the cache block (write-back the cache block to the next level in the memory hierarchy when that cache block is “evicted”) -need a dirty bit for each data cache block to tell if it needs to be written back to memory when it is evicted – can use a write buffer to help “buffer” write-backs of dirty blocks Handling Cache Hits

CS224 Spring 2011 Sources of Cache Misses  Compulsory (cold start or process migration, first reference): First access to a block, “cold” fact of life, not a whole lot you can do about it. If you are going to run “millions” of instruction, compulsory misses are insignificant Solution: increase block size (increases miss penalty; very large blocks could increase miss rate)  Capacity: Cache cannot contain all blocks accessed by the program Solution: increase cache size (may increase access time)  Conflict (collision): Multiple memory locations mapped to the same cache location Solution 1: increase cache size Solution 2: increase associativity (stay tuned) (may increase access time)

CS224 Spring 2011 Handling Cache Misses (Single Word Blocks)  Read misses (I$ and D$) stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache and send the requested word to the processor, then let the pipeline resume  Write misses (D$ only) 1. stall the pipeline, fetch the block from next level in the memory hierarchy, install it in the cache (which may involve having to evict a dirty block if using a write-back cache), write the word from the processor to the cache, then let the pipeline resume or 1. Write allocate – just write the word into the cache updating both the tag and data, no need to check for cache hit, no need to stall or 1. No-write allocate – skip the cache write (but must invalidate that cache block since it will now hold stale data) and just write the word to the write buffer (and eventually to the next memory level), no need to stall if the write buffer isn’t full

CS224 Spring 2011 Multiword Block Considerations  Read misses (I$ and D$) Processed the same as for single word blocks – a miss returns the entire block from memory Miss penalty grows as block size grows -Early restart – processor resumes execution as soon as the requested word of the block is returned -Requested word first – requested word is transferred from the memory to the cache (and processor) first Nonblocking cache – allows the processor to continue to access the cache while the cache is handling an earlier miss  Write misses (D$) If using write allocate must first fetch the block from memory and then write the word to the block (or could end up with a “garbled” block in the cache (e.g., for 4 word blocks, a new tag, one word of data from the new block, and three words of data from the old block)

CS224 Spring 2011  The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways Memory Systems that Support Caches CPU Cache DRAM Memory bus One word wide organization (one word wide bus and one word wide memory)  Assume 1. 1 memory bus clock cycle to send the addr memory bus clock cycles to get the 1 st word in the block from DRAM (row cycle time), 5 memory bus clock cycles for 2 nd, 3 rd, 4 th words (column access time) 3. 1 memory bus clock cycle to return a word of data  Memory-Bus to Cache bandwidth number of bytes accessed from memory and transferred to cache/CPU per memory bus clock cycle 32-bit data & 32-bit addr per cycle on-chip

CS224 Spring 2011 Review: (DDR) SDRAM Operation N rows N cols DRAM Column Address M-bit Output M bit planes N x M SRAM Row Address  After a row is read into the SRAM register Input CAS as the starting “burst” address along with a burst length Transfers a burst of data (ideally a cache block) from a series of sequential addr’s within that row - The memory bus clock controls transfer of successive words in the burst +1 Row Address CAS RAS Col Address 1 st M-bit Access2 nd M-bit3 rd M-bit4 th M-bit Cycle Time Row Add

CS224 Spring 2011 DRAM Size Increase  Add a table like figure 5.12 to show DRAM growth since 1980

CS224 Spring 2011 One Word Wide Bus, One Word Blocks CPU Cache DRAM Memory bus on-chip  If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory cycle to send address cycles to read DRAM cycle to return data total clock cycles miss penalty  Number of bytes transferred per clock cycle (bandwidth) for a single miss is bytes per memory bus clock cycle

CS224 Spring 2011 One Word Wide Bus, One Word Blocks CPU Cache DRAM Memory bus on-chip  If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory 1 memory bus clock cycle to send address 16 memory bus clock cycles to read DRAM 1 memory bus clock cycle to return data 17 total clock cycles miss penalty  Number of bytes transferred per clock cycle (bandwidth) for a single miss is: 4/17 = bytes per memory bus clock cycle

CS224 Spring 2011 One Word Wide Bus, Four Word Blocks CPU Cache DRAM Memory bus on-chip  What if the block size is four words and each word is in a different DRAM row? cycle to send 1 st address cycles to read DRAM cycles to return last data word total clock cycles miss penalty  Number of bytes transferred per clock cycle (bandwidth) for a single miss is bytes per clock

CS224 Spring 2011 One Word Wide Bus, Four Word Blocks CPU Cache DRAM Memory bus on-chip  What if the block size is four words and each word is in a different DRAM row? cycle to send 1 st address cycles to read DRAM cycles to return last data word total clock cycles miss penalty Number of bytes transferred per clock cycle (bandwidth) for a single miss is: (4 x 4)/62 = bytes per clock 15 cycles 1 4 x 15 =

CS224 Spring 2011 One Word Wide Bus, Four Word Blocks CPU Cache DRAM Memory bus on-chip  What if the block size is four words and all words are in the same DRAM row? cycle to send 1 st address cycles to read DRAM cycles to return last data word total clock cycles miss penalty  Number of bytes transferred per clock cycle (bandwidth) for a single miss is bytes per clock

CS224 Spring 2011 One Word Wide Bus, Four Word Blocks CPU Cache DRAM Memory bus on-chip  What if the block size is four words and all words are in the same DRAM row? cycle to send 1 st address cycles to read DRAM cycles to return last data word total clock cycles miss penalty Number of bytes transferred per clock cycle (bandwidth) for a single miss is: (4 x 4)/32 = 0.5 bytes per clock cycle 15 cycles 5 cycles *5 = (4 x 4)/32 = 0.5 bytes perclock cycle(4 x 4)/32 = 0.5 bytes perclock cycle

CS224 Spring 2011 Interleaved Memory, One Word Wide Bus  For a block size of four words cycle to send 1 st address cycles to read DRAM banks cycles to return last data word total clock cycles miss penalty CPU Cache DRAM Memory bank 1 bus on-chip DRAM Memory bank 0 DRAM Memory bank 2 DRAM Memory bank 3  Number of bytes transferred per clock cycle (bandwidth) for a single miss is bytes per clock

CS224 Spring 2011 Interleaved Memory, One Word Wide Bus  For a block size of four words cycle to send 1 st address cycles to read DRAM banks cycles to return last data word total clock cycles miss penalty CPU Cache bus on-chip  Number of bytes transferred per memory bus clock cycle (bandwidth) for a single miss is: (4 x 4)/20 = 0.8 bytes per clock 15 cycles (4 x 4)/2 0 = *1 = 4 20 DRAM Memory bank 1 DRAM Memory bank 0 DRAM Memory bank 2 DRAM Memory bank 3

CS224 Spring 2011 DRAM Memory System Summary  Its important to match the cache characteristics caches access one block at a time (usually more than one word)  with the DRAM characteristics use DRAMs that support fast multiple word accesses, preferably ones that match the block size of the cache  with the memory-bus characteristics make sure the memory-bus can support the DRAM access rates and patterns with the goal of increasing the Memory-Bus to Cache bandwidth

CS224 Spring 2011 Measuring Cache Performance  Components of CPU time Program execution cycles -Includes cache hit time Memory stall cycles -Mainly from cache misses  With simplifying assumptions: §5.3 Measuring and Improving Cache Performance

CS224 Spring 2011 Measuring Cache Performance  Assuming cache hit costs are included as part of the normal CPU execution cycle, then CPU time = IC × CPI × CC = IC × (CPI ideal + Memory-stall cycles) × CC CPI stall  Memory-stall cycles come from cache misses (a sum of read-stalls and write-stalls) Read-stall cycles = reads/program × read miss rate × read miss penalty Write-stall cycles = (writes/program × write miss rate × write miss penalty) + write buffer stalls  For write-through caches, we can simplify this to Memory-stall cycles = accesses/program × miss rate × miss penalty

CS224 Spring 2011 Impacts of Cache Performance  Relative cache penalty increases as processor performance improves (faster clock rate and/or lower CPI) The memory speed is unlikely to improve as fast as processor cycle time. When calculating CPI stall, the cache miss penalty is measured in processor clock cycles needed to handle a miss The lower the CPI ideal, the more pronounced the impact of stalls  A processor with a CPI ideal of 2, a 100 cycle miss penalty, 36% load/store instr’s, and 2% I$ and 4% D$ miss rates Memory-stall cycles = 2% × % × 4% × 100 = 3.44 So CPI stalls = = 5.44 more than twice the CPI ideal !  What if the CPI ideal is reduced to 1? 0.5? 0.25?  What if the D$ miss rate went up 1%? 2%?  What if the processor clock rate is doubled (doubling the miss penalty)?

CS224 Spring 2011 Average Memory Access Time (AMAT)  A larger cache will have a longer access time. An increase in hit time will likely add another stage to the pipeline. At some point the increase in hit time for a larger cache will overcome the improvement in hit rate leading to a decrease in performance.  Average Memory Access Time (AMAT) is the average to access memory considering both hits and misses AMAT = Time for a hit + Miss rate x Miss penalty  What is the AMAT for a processor with a 20 psec clock, a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction and a cache access time of 1 clock cycle?

CS224 Spring 2011 Reducing Cache Miss Rates #1 Allow more flexible block placement  In a direct mapped cache a memory block maps to exactly one cache block At the other extreme, could allow a memory block to be mapped to any cache block – fully associative cache  A compromise is to divide the cache into sets each of which consists of n “ways” (n-way set associative). A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices) (block address) modulo (# sets in the cache)

CS224 Spring 2011 Another Reference String Mapping  Consider the main memory word reference string Start with an empty cache - all blocks initially marked as not valid

CS224 Spring 2011 Another Reference String Mapping  Consider the main memory word reference string miss 00 Mem(0) Mem(4) Mem(0) Mem(0) Mem(0) Mem(4) Mem(4) 0 00 Start with an empty cache - all blocks initially marked as not valid  Ping pong effect due to conflict misses - two memory locations that map into the same cache block 8 requests, 8 misses

CS224 Spring 2011 Set Associative Cache Example 0 Cache Main Memory Q2: How do we find it? Use next 1 low order memory address bit to determine which cache set (i.e., modulo the number of sets in the cache) TagData Q1: Is it there? Compare all the cache tags in the set to the high order 3 memory address bits to tell if the memory block is in the cache V 0000xx 0001xx 0010xx 0011xx 0100xx 0101xx 0110xx 0111xx 1000xx 1001xx 1010xx 1011xx 1100xx 1101xx 1110xx 1111xx Set Way 0 1 One word blocks Two low order bits define the byte in the word (32b words)

CS224 Spring 2011 Another Reference String Mapping 0404  Consider the main memory word reference string Start with an empty cache - all blocks initially marked as not valid

CS224 Spring 2011 Another Reference String Mapping 0404  Consider the main memory word reference string miss hit 000 Mem(0) Start with an empty cache - all blocks initially marked as not valid 010 Mem(4) 000 Mem(0) 010 Mem(4)  Solves the ping pong effect in a direct mapped cache due to conflict misses since now two memory locations that map into the same cache set can co-exist! 8 requests, 2 misses

CS224 Spring 2011 Four-Way Set Associative Cache  2 8 = 256 sets each with four ways (each with one block) Byte offset Data TagV Data TagV Data TagV Index Data TagV Index 22 Tag HitData 32 4x1 select Way 0Way 1Way 2Way 3

CS224 Spring 2011 Range of Set Associative Caches  For a fixed size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number or ways) and halves the number of sets – decreases the size of the index by 1 bit and increases the size of the tag by 1 bit Block offsetByte offsetIndexTag

CS224 Spring 2011 Range of Set Associative Caches  For a fixed size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number or ways) and halves the number of sets – decreases the size of the index by 1 bit and increases the size of the tag by 1 bit Block offsetByte offsetIndexTag Decreasing associativity Fully associative (only one set) Tag is all the bits except block and byte offset Direct mapped (only one way) Smaller tags, only a single comparator Increasing associativity Selects the setUsed for tag compareSelects the word in the block

CS224 Spring 2011 Costs of Set Associative Caches  When a miss occurs, which way’s block do we pick for replacement? Least Recently Used (LRU): the block replaced is the one that has been unused for the longest time -Must have hardware to keep track of when each way’s block was used relative to the other blocks in the set -For 2-way set associative, takes one bit per set → set the bit when a block is referenced (and reset the other way’s bit)  N-way set associative cache costs N comparators (delay and area) MUX delay (set selection) before data is available Data available after set selection (and Hit/Miss decision). In a direct mapped cache, the cache block is available before the Hit/Miss decision -So its not possible to just assume a hit and continue and recover later if it was a miss

CS224 Spring 2011 Benefits of Set Associative Caches  The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation Data from Hennessy & Patterson, Computer Architecture, 2003  Largest gains are in going from direct mapped to 2-way (20%+ reduction in miss rate)

CS224 Spring 2011 Reducing Cache Miss Rates #2 1. Use multiple levels of caches  With advancing technology have more than enough room on the die for bigger L1 caches or for a second level of caches – normally a unified L2 cache (i.e., it holds both instructions and data) and in some cases even a unified L3 cache  For our example, CPI ideal of 2, 100 cycle miss penalty (to main memory) and a 25 cycle miss penalty (to UL2$), 36% load/stores, a 2% (4%) L1 I$ (D$) miss rate, add a 0.5% UL2$ miss rate CPI stalls = × ×.04× × ×.005×100 = 3.54 (as compared to 5.44 with no L2$)!!

CS224 Spring 2011 Multilevel Cache Design Considerations  Design considerations for L1 and L2 caches are very different Primary cache should focus on minimizing hit time in support of a shorter clock cycle -Smaller with smaller block sizes Secondary cache(s) should focus on reducing miss rate to reduce the penalty of long main memory access times -Larger with larger block sizes -Higher levels of associativity  The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache – so it can be smaller (i.e., faster) but have a higher miss rate  For the L2 cache, hit time is less important than miss rate The L2$ hit time determines L1$’s miss penalty L2$ local miss rate >> than the global miss rate

CS224 Spring 2011 Two Machines’ Cache Parameters Intel NehalemAMD Barcelona L1 cache organization & size Split I$ and D$; 32KB for each per core; 64B blocks Split I$ and D$; 64KB for each per core; 64B blocks L1 associativity4-way (I), 8-way (D) set assoc.; ~LRU replacement 2-way set assoc.; LRU replacement L1 write policywrite-back, write-allocate L2 cache organization & size Unified; 256MB (0.25MB) per core; 64B blocks Unified; 512KB (0.5MB) per core; 64B blocks L2 associativity8-way set assoc.; ~LRU16-way set assoc.; ~LRU L2 write policywrite-back L2 write policywrite-back, write-allocate L3 cache organization & size Unified; 8192KB (8MB) shared by cores; 64B blocks Unified; 2048KB (2MB) shared by cores; 64B blocks L3 associativity16-way set assoc.32-way set assoc.; evict block shared by fewest cores L3 write policywrite-back, write-allocatewrite-back; write-allocate

CS224 Spring 2011 Two Machines’ Cache Parameters Intel P4AMD Opteron L1 organizationSplit I$ and D$ L1 cache size8KB for D$, 96KB for trace cache (~I$) 64KB for each of I$ and D$ L1 block size64 bytes L1 associativity4-way set assoc.2-way set assoc. L1 replacement~ LRULRU L1 write policywrite-throughwrite-back L2 organizationUnified L2 cache size512KB1024KB (1MB) L2 block size128 bytes64 bytes L2 associativity8-way set assoc.16-way set assoc. L2 replacement~LRU L2 write policywrite-back

CS224 Spring 2011 Summary: Improving Cache Performance 1. Reduce the time to hit in the cache smaller cache direct mapped cache smaller blocks for writes -no write allocate – no “hit” on cache, just write to write buffer -write allocate – to avoid two cycles (first check for hit, then write) pipeline writes via a delayed write buffer to cache 2. Reduce the miss rate bigger cache more flexible placement (increase associativity) larger blocks (16 to 64 bytes typical) victim cache – small buffer holding most recently discarded blocks

CS224 Spring 2011 Summary: Improving Cache Performance 3. Reduce the miss penalty smaller blocks use a write buffer to hold dirty blocks being replaced so don’t have to wait for the write to complete before reading check write buffer (and/or victim cache) on read miss – may get lucky for large blocks fetch critical word first use multiple cache levels – L2 cache not tied to CPU clock rate faster backing store/improved memory bandwidth -wider buses -memory interleaving, DDR SDRAMs

CS224 Spring 2011 Summary: The Cache Design Space  Several interacting dimensions cache size block size associativity replacement policy write-through vs write-back write allocation  The optimal choice is a compromise depends on access characteristics -workload -use (I-cache, D-cache, TLB) depends on technology / cost  Simplicity often wins Associativity Cache Size Block Size Bad Good LessMore Factor AFactor B