Lecture Objectives:
1) Explain the relationship between miss rate and block size in a cache.
2) Construct a flowchart explaining how a cache miss is handled.
3) Compare and contrast write-through and write-back caching schemes.
4) Define the term write buffer.
5) Perform cache-related calculations to analyze the impact of having a cache on a computer system.

Consider the lw instruction that loads a word from memory to a register:

lw $t0, ($t1)    # $t1 = 0x10010048

After loading, $t0 will contain the value 0x2e02676e. A direct load from main memory would consume many CPU cycles, resulting in a long stall of every lw instruction.

[Figure: main memory contents, one byte per address, in the vicinity of 0x10010048; the four bytes starting at 0x10010048 hold the word that lw loads.]

Example: direct-mapped cache, block size of 16 bytes, #blocks = 256 (4KB total)

lw $t0, ($t1)    # $t1 = 0x10010048 (4-byte address)

The Cache Index is computed from the Memory Block Address 0x1001004:

Cache Index = (Memory Block Address) mod (#Blocks in Cache) = 0x1001004 mod 256 = 0x04

The cache block at that index is initially empty (Valid bit = 0), resulting in a miss. The cache manager loads a block of 16 bytes into cache index 04, from main memory addresses 0x10010040 – 0x1001004F. This initial load from main to cache memory consumes many CPU cycles (~100), resulting in a long stall of this particular lw instruction.

Index | Dirty (1 bit) | Valid (1 bit) | Tag (20 bits) | Data (16 bytes per block, individually addressed by the 4 LSB)
00    | 0             | 0             | -             | -
...   |               |               |               |
04    | 0             | 0             | -             | -
...   |               |               |               |
FF    | 0             | 0             | -             | -
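Because #Blocks is a power of two, the mod in the index formula is just a bit-field extraction: the low 4 bits of the address select the byte within the block, the next 8 bits are the cache index, and the remaining 20 bits are the tag. A minimal C sketch of that address split for this geometry (the variable names are illustrative, not from the slides):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t addr   = 0x10010048;  /* address in $t1                     */
    uint32_t block  = addr / 16;   /* Memory Block Address = 0x1001004   */
    uint32_t index  = block % 256; /* Cache Index = 0x04                 */
    uint32_t offset = addr % 16;   /* byte offset within the block = 0x8 */
    uint32_t tag    = addr >> 12;  /* remaining upper 20 bits = 0x10010  */

    printf("block=0x%X index=0x%02X offset=0x%X tag=0x%X\n",
           (unsigned)block, (unsigned)index, (unsigned)offset, (unsigned)tag);
    return 0;
}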

Once the cache block is loaded, the 4 bytes at address 0x10010048 are loaded into $t0.

lw $t0, ($t1)    # $t1 = 0x10010048, $t0 = 0x2e02676e

The Valid Bit for the block is set to 1 and the Tag Bits are set to 0x10010. Note that the original address of the memory location of the data can be reconstructed from the Tag Bits (0x10010) + Cache Index (04) + Byte Offset within the Block (8).

Index | Dirty | Valid | Tag     | Data (16 bytes per block)
04    | 0     | 1     | 0x10010 | ... word at offset 8 = 0x2e02676e ...
(all other entries are still invalid)

Suppose the next instruction attempts to load the subsequent word into $t0:

lw $t0, ($t1)     # $t1 = 0x10010048, $t0 = 0x2e02676e
lw $t0, 4($t1)    # load next word from 0x1001004c

The Cache Index is (again) computed from Memory Block Address 0x1001004, giving index 04. The cache block at that index is valid (Valid bit = 1), and the tags match, resulting this time in a hit. This subsequent load from cache memory consumes only 1 CPU cycle, avoiding a stall of the lw instruction. (The cache contents are unchanged from the previous slide.)

The 3rd instruction loads a word from a different location in memory:

lw $t0, ($t1)     # $t1 = 0x10010048, $t0 = 0x2e02676e
lw $t0, 4($t1)    # load next word from 0x1001004c
lw $t0, ($sp)     # $sp = 0x7FFFEFFC, $t0 = 0x1928ef3c

The Cache Index is computed from Memory Block Address 0x7FFFEFF, giving index FF. The cache block at that index is initially empty (Valid bit = 0), resulting in another miss. The cache manager loads a block of 16 bytes into cache index FF, from main memory addresses 0x7FFFEFF0 – 0x7FFFEFFF. Then the word at 0x7FFFEFFC is loaded into $t0.

Index | Dirty | Valid | Tag     | Data (16 bytes per block)
04    | 0     | 1     | 0x10010 | ... word at offset 8 = 0x2e02676e ...
FF    | 0     | 1     | 0x7FFFE | ... ad f7 92 c3 ... word at offset C = 0x1928ef3c

The 4th instruction loads a word from yet another location:

lw $t0, ($t1)     # $t1 = 0x10010048, $t0 = 0x2e02676e
lw $t0, 4($t1)    # load next word from 0x1001004c
lw $t0, ($sp)     # $sp = 0x7FFFEFFC, $t0 = 0x1928ef3c
lw $t0, ($t3)     # $t3 = 0x24371040

The Cache Index is computed from Memory Block Address 0x2437104, giving index 04. The cache block at that index is currently valid (Valid bit = 1), but the tags don't match (0x24371 vs. 0x10010), resulting in a miss. The cache manager reloads a block of 16 bytes into cache index 04, from main memory addresses 0x24371040 – 0x2437104F. The value 0xaabbccdd is loaded into $t0.

Index | Dirty | Valid | Tag     | Data (16 bytes per block)
04    | 0     | 1     | 0x24371 | aa bb cc dd 1a 2b 3c 4d 5e 6f 7a 8b ...
FF    | 0     | 1     | 0x7FFFE | ... ad f7 92 c3 ... word at offset C = 0x1928ef3c
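All four scenarios above (cold miss, hit, another cold miss, conflict miss) follow from one lookup rule: use the index bits to select a line, then check the Valid bit and compare tags. A minimal C sketch of that lookup for the 16-byte-block, 256-entry geometry used in these slides (the structure and function names are illustrative, not the actual hardware interface):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16   /* bytes per block -> 4 offset bits */
#define NUM_BLOCKS 256  /* blocks in cache -> 8 index bits  */

typedef struct {
    bool     valid;
    bool     dirty;
    uint32_t tag;               /* upper 20 bits of the address */
    uint8_t  data[BLOCK_SIZE];
} CacheLine;

static CacheLine cache[NUM_BLOCKS];

/* Returns true on a hit; on a miss the caller must fetch the block. */
bool lookup(uint32_t addr, uint32_t *word_out) {
    uint32_t offset = addr & 0xF;          /* 4 LSB: byte within block */
    uint32_t index  = (addr >> 4) & 0xFF;  /* next 8 bits: cache index */
    uint32_t tag    = addr >> 12;          /* remaining 20 bits: tag   */

    CacheLine *line = &cache[index];
    if (line->valid && line->tag == tag) {              /* hit */
        memcpy(word_out, &line->data[offset & ~3u], 4);
        return true;
    }
    return false;                                       /* miss */
}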

Discussion
– An engineer is designing a computer system and must choose between 64KB of cache and 256KB of cache. Which will perform better?
– Another engineer is designing a system with 16KB of cache, and is trying to decide whether to use a block size of 16 bytes or 64 bytes. Which will perform better?

Experimental Results
[Figure: measured miss rate vs. block size for several cache sizes.]

Block size considerations
– Larger blocks exploit spatial locality to lower miss rates.
– But miss rates increase again when the block size becomes a significant portion of the cache size, because the cache then holds fewer blocks and they compete for the same entries.
– Larger miss penalty: the larger the block, the longer it takes to load it (see the sketch below).
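One way to quantify the last point: if main memory has a fixed startup latency plus a per-byte transfer time, the miss penalty grows linearly with block size. A small C sketch under assumed timing numbers (the 84-cycle latency and 1 cycle/byte split is illustrative, chosen so a 16-byte block costs the ~100 cycles quoted earlier):

#include <stdio.h>

int main(void) {
    /* Assumed timing: fixed latency to start a memory access,
       plus a per-byte transfer cost (illustrative numbers). */
    const int latency_cycles  = 84;
    const int cycles_per_byte = 1;

    for (int block = 16; block <= 128; block *= 2) {
        int penalty = latency_cycles + cycles_per_byte * block;
        printf("block=%3d bytes  miss penalty=%3d cycles\n", block, penalty);
    }
    return 0;
}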

Handling Misses
[Flowchart: the sequence of steps the cache controller follows to handle a miss.]
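The usual sequence is: stall the pipeline, fetch the missing block from the next level of the hierarchy, install it in the selected line (data, tag, Valid bit), and then retry the access. Building on the CacheLine structure from the lookup sketch above, here is a hedged sketch of that miss handler; handle_miss, fetch_block_from_memory, and write_block_to_memory are illustrative names, not a real API. (The dirty-line write-back step is explained on the next slides.)

/* On a miss: (1) stall, (2) if the victim line is dirty, write it back,
   (3) fetch the new block (~100-cycle stall), (4) update tag/valid/dirty. */
extern void write_block_to_memory(uint32_t block_addr, const uint8_t *data);
extern void fetch_block_from_memory(uint32_t block_addr, uint8_t *data);

void handle_miss(uint32_t addr) {
    uint32_t index = (addr >> 4) & 0xFF;
    uint32_t tag   = addr >> 12;
    CacheLine *line = &cache[index];

    if (line->valid && line->dirty) {
        /* Write back the modified victim block before overwriting it. */
        uint32_t old_block = (line->tag << 8) | index;     /* block address */
        write_block_to_memory(old_block << 4, line->data); /* byte address  */
    }
    fetch_block_from_memory(addr & ~0xFu, line->data);
    line->tag   = tag;
    line->valid = true;
    line->dirty = false;
}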

Suppose the next instruction stores a word to the address just past the one in $t1:

lw $t0, ($t1)     # $t1 = 0x10010048, $t0 = 0x2e02676e
sw $t0, 4($t1)    # now store $t0 to 0x1001004c

The Cache Index is computed from Memory Block Address 0x1001004, giving index 04. The cache block at that index is valid (Valid bit = 1), and the tags match, resulting in a hit. This subsequent store to cache memory changes the contents of the cache, but main data memory is still (at this point) unchanged and out of sync with the cache. The change to the cache is noted by setting the Dirty Bit to 1.

Index | Dirty | Valid | Tag     | Data (16 bytes per block)
04    | 1     | 1     | 0x10010 | ... words at offsets 8 and C are both 0x2e02676e
(other entries unchanged)

Another instruction then loads a word from a conflicting location:

lw $t0, ($t1)     # $t1 = 0x10010048, $t0 = 0x2e02676e
sw $t0, 4($t1)    # now store $t0 to 0x1001004c
lw $t0, ($t3)     # $t3 = 0x24371040

The Cache Index is computed from Memory Block Address 0x2437104, giving index 04. The cache block at that index is currently valid (Valid bit = 1), but the tags don't match, resulting in a miss. The Dirty Bit is also set, indicating that the cached block has to be written back to main memory before the reload can take place. This results in a double-length stall: first the changed block has to be written to memory, then the new block has to be loaded.

Index | Dirty | Valid | Tag     | Data (16 bytes per block)
04    | 0     | 1     | 0x24371 | aa bb cc dd 1a 2b 3c 4d 5e 6f 7a 8b ...
FF    | 0     | 1     | 0x7FFFE | ... ad f7 92 c3 ... word at offset C = 0x1928ef3c
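This deferred write is the write-back policy: a store hit only updates the cached copy and marks the line dirty; main memory is brought up to date when the line is evicted (the handle_miss sketch above performs that flush). A minimal store routine in the same illustrative style, reusing the CacheLine structure from earlier:

/* Write-back store: on a hit, update only the cache and set Dirty. */
bool store_word(uint32_t addr, uint32_t value) {
    uint32_t offset = addr & 0xF;
    uint32_t index  = (addr >> 4) & 0xFF;
    uint32_t tag    = addr >> 12;

    CacheLine *line = &cache[index];
    if (line->valid && line->tag == tag) {        /* write hit */
        memcpy(&line->data[offset & ~3u], &value, 4);
        line->dirty = true;   /* cache and main memory now differ */
        return true;
    }
    return false;             /* write miss: handle_miss, then retry */
}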

Approaches to Writing Data to Memory
Write-through
– Update the cache plus the block in main memory on every store
– A separate Write Buffer is used, so the processor can continue without waiting for the memory write to complete
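For contrast with the write-back sketch above, here is a write-through store with a simple write buffer: every store is also queued for main memory, so no Dirty bit is needed, and the processor stalls only if the buffer is full. The four-entry ring buffer and the names are illustrative simplifications:

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

typedef struct { uint32_t addr, value; } WriteBufEntry;
static WriteBufEntry write_buf[WB_ENTRIES];
static int wb_head, wb_count;   /* hardware drains entries from wb_head */

/* Queue a store for main memory; returns false when the CPU must stall. */
bool buffer_write(uint32_t addr, uint32_t value) {
    if (wb_count == WB_ENTRIES)
        return false;                       /* buffer full: stall */
    int tail = (wb_head + wb_count) % WB_ENTRIES;
    write_buf[tail] = (WriteBufEntry){ addr, value };
    wb_count++;
    return true;                            /* CPU proceeds immediately */
}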

Cache Performance
Two different configurations for a cache system are being developed. In both cases, the miss penalty is 100 clock cycles, and the processor has an average CPI of 2 if no memory stalls occur.
– The first system (processor A) has a miss rate of 4% in the data cache.
– The second system (processor B) has a miss rate of 2% in the data cache.
– 36% of instructions are loads and stores.
Determine the performance relationship between processor A and processor B.

Problem solution
Memory stall cycles per instruction = fraction of loads/stores × miss rate × miss penalty, so:
CPI_A = 2 + 0.36 × 0.04 × 100 = 2 + 1.44 = 3.44
CPI_B = 2 + 0.36 × 0.02 × 100 = 2 + 0.72 = 2.72
CPI_A / CPI_B = 3.44 / 2.72 = 1.26
Processor A is 26% slower than Processor B.
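The same arithmetic as a tiny C check (all numbers come from the problem statement):

#include <stdio.h>

int main(void) {
    const double base_cpi     = 2.0;   /* CPI with no memory stalls */
    const double mem_per_inst = 0.36;  /* fraction of loads/stores  */
    const double penalty      = 100.0; /* miss penalty in cycles    */

    double cpi_a = base_cpi + mem_per_inst * 0.04 * penalty;  /* 3.44 */
    double cpi_b = base_cpi + mem_per_inst * 0.02 * penalty;  /* 2.72 */
    printf("CPI_A=%.2f  CPI_B=%.2f  ratio=%.2f\n", cpi_a, cpi_b, cpi_a / cpi_b);
    return 0;
}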

Average Memory Access Time
Average Memory Access Time (AMAT) – the average time (per instruction) it takes to access memory, considering both hits and misses:
AMAT = Time for a hit + Miss rate × Miss penalty

AMAT Problem
Solve the following:
– A processor has a 1 ns clock cycle time, a miss penalty of 50 clock cycles, a miss rate of 0.1 misses per instruction, and a cache access time of 1 clock cycle.
– Determine the average memory access time per instruction.

Solution
AMAT = Time for a hit + Miss rate × Miss penalty = 1 + 0.1 × 50 = 6 clock cycles, i.e. 6 ns per instruction at a 1 ns clock cycle time.
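The same computation in the style of the earlier check:

#include <stdio.h>

int main(void) {
    const double hit_time  = 1.0;  /* cache access time, cycles     */
    const double miss_rate = 0.1;  /* misses per instruction        */
    const double penalty   = 50.0; /* miss penalty, cycles          */
    const double cycle_ns  = 1.0;  /* clock cycle time, nanoseconds */

    double amat = hit_time + miss_rate * penalty;       /* 6 cycles */
    printf("AMAT = %.1f cycles = %.1f ns\n", amat, amat * cycle_ns);
    return 0;
}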