CS2100 Computer Organisation Cache I (AY2010/2011) Semester 2
CS2100 Cache I 2 WHERE ARE WE NOW? Preparation: number systems and codes. Logic design: Boolean algebra, logic gates and circuits, simplification, combinational circuits, sequential circuits. Computer organisation: performance, assembly language, the processor (datapath and control), pipelining, memory hierarchy (cache), input/output.
CS2100 Cache I 3 CACHE I Memory Hierarchy Locality The Cache Principle Direct-Mapped Cache Cache Structure and Circuitry Write Policy
CS2100 Cache I 4 TODAY’S FOCUS Computer: Processor (Control, Datapath), Memory, Devices (Input, Output). Today: Memory. Ack: Some slides here are taken from Dr Tulika Mitra’s CS1104 notes.
CS2100 Cache I 5 DATA TRANSFER: THE BIG PICTURE Registers are in the datapath of the processor. If operands are in memory, we have to load them into the processor (registers), operate on them, and store them back to memory. Computer: Processor (Control, Datapath + Registers), Memory, Devices (Input, Output); load from memory, store to memory.
CS2100 Cache I 6 MEMORY TECHNOLOGY: 1950s 1948: Maurice Wilkes examining EDSAC’s delay-line memory tubes, 16 tubes each storing bit-serial words. Later: IBM KB-scale magnetic core memory.
CS2100 Cache I 7 MEMORY TECHNOLOGY 2005: DRAM Infineon Technologies: stores data equivalent to 640 books, 32,000 standard newspaper pages, and 1,600 still pictures, or 64 hours of sound. Is this the latest? Try to find out from Google.
CS2100 Cache I 8 DRAM CAPACITY GROWTH 4X capacity increase almost every 3 years, i.e., 60% increase per year for 20 years Unprecedented growth in density, but we still have a problem
CS2100 Cache I 9 PROCESSOR-DRAM PERFORMANCE GAP Memory Wall: a 1 GHz processor has a 1 ns clock cycle, but it takes 50 ns to go to DRAM: 50 processor clock cycles per memory access!!
CS2100 Cache I 10 FASTER MEMORY TECHNOLOGIES: SRAM SRAM: 6 transistors per memory cell; low density; fast access latency of 0.5–5 ns. DRAM: 1 transistor per memory cell; high density; slow access latency of 50–70 ns.
CS2100 Cache I 11 SLOW MEMORY TECHNOLOGIES: MAGNETIC DISK Typical high-end hard disk: Average latency: 5-20 ms Capacity: 250GB
CS2100 Cache I 12 QUALITY VERSUS QUANTITY
            Capacity     Latency     Cost/GB
Register    100s Bytes   20 ps       $$$$
SRAM        100s KB      0.5–5 ns    $$$
DRAM        100s MB      50–70 ns    $
Hard Disk   100s GB      5–20 ms     Cents
Ideal       1 GB         1 ns        Cheap
Processor (Control, Datapath, Registers) — Memory (DRAM) — Devices (Input, Output) — Hard Disk.
CS2100 Cache I 13 BEST OF BOTH WORLDS What we want: A BIG and FAST memory Memory system should perform like 1GB of SRAM (1ns access time) but cost like 1GB of slow memory Key concept: Use a hierarchy of memory technologies Small but fast memory near CPU Large but slow memory farther away from CPU
CS2100 Cache I 14 MEMORY HIERARCHY Level 1 Level 2 Level n Size Speed CPU Registers SRAM DRAM Disk
CS2100 Cache I 15 THE BASIC IDEA Keep the frequently and recently used data in smaller but faster memory Refer to bigger and slower memory only when you cannot find data/instruction in the faster memory Why does it work? Principle of Locality Program accesses only a small portion of the memory address space within a small time interval
CS2100 Cache I 16 LOCALITY Temporal locality If an item is referenced, it will tend to be referenced again soon Spatial locality If an item is referenced, nearby items will tend to be referenced soon Locality for instruction? Locality for data?
CS2100 Cache I 17 WORKING SET The set of locations accessed during a time interval t. Different phases of execution may use different working sets. We want to capture the working set and keep it in the memory closest to the CPU.
CS2100 Cache I 18 EXPLOITING MEMORY HIERARCHY (1/2) Visible to Programmer (Cray etc.) Various storage alternatives, e.g., register, main memory, hard disk Tell programmer to use them cleverly Transparent to Programmer (except for performance) Single address space for programmer Processor automatically assigns locations to fast or slow memory depending on usage patterns
CS2100 Cache I 19 EXPLOITING MEMORY HIERARCHY (2/2) How to make SLOW main memory appear faster? Cache – a small but fast SRAM near the CPU, often on the same chip as the CPU. Introduced in the 1960s; almost every processor uses it today. Hardware managed: transparent to the programmer. How to make SMALL main memory appear bigger than it is? Virtual memory [covered in CS2106]. OS managed: again transparent to the programmer. Processor (Control, Datapath, Registers), Cache (SRAM), Memory (DRAM), Devices (Input, Output), Hard Disk.
CS2100 Cache I 20 THE CACHE PRINCIPLE Find LEE Nicholas. Look for the document in your desk first (10 sec access time). If not on the desk, then look in the cabinet (10 minute access time).
CS2100 Cache I 21 CACHE BLOCK/LINE (1/2) Unit of transfer between memory and cache. Recall: MIPS word size is 4 bytes. Block size is typically one or more words. Example: a 16-byte block is a 4-word block; a 32-byte block is an 8-word block. Why is the block size bigger than the word size? What is the unit of transfer for virtual memory (between hard disk and memory)? Should it be bigger or smaller than cache blocks?
CS2100 Cache I 22 CACHE BLOCK/LINE (2/2) Memory layout: addresses grouped into bytes, words (4 bytes) and blocks (here 8-byte blocks: Block0 = Word0–Word1, Block1 = Word2–Word3). Observations: 1. 2^N-byte blocks are aligned at 2^N-byte boundaries. 2. The addresses of words within a 2^N-byte block have identical (32-N) most significant bits (MSB). Example: for a 2^3 = 8-byte block, the (32-3) = 29 MSB of the memory address are identical. 3. The 29 MSB signify the block number. 4. The N LSB specify the byte offset within a block. Memory address: bits 31..N = block number, bits N-1..0 = offset.
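The block-number/offset split in the observations above can be checked with a few lines of Python (a sketch; the address 29 and N = 3 are arbitrary examples chosen to match the slide's 8-byte blocks):

```python
# 2**N-byte blocks: the N least-significant bits of an address are the
# byte offset within the block; the remaining bits are the block number.
N = 3                    # 2**3 = 8-byte blocks, as in the slide's example
address = 29             # arbitrary example byte address

offset = address & ((1 << N) - 1)   # N LSB: byte offset within the block
block_number = address >> N         # (32 - N) MSB: block number

print(block_number, offset)         # 3 5: byte 29 is byte 5 of block 3
```

Note that masking and shifting are exact complements here: every address belongs to exactly one (block, offset) pair.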
CS2100 Cache I 23 CACHE TERMINOLOGY Hit: Block appears in cache (e.g., X) Hit rate: Fraction of memory accesses that hit Hit time: Time to access cache Miss: Block is not in cache (e.g., Y) Miss rate = 1 – Hit rate Miss penalty: Time to replace cache block + deliver data Hit time < Miss penalty CPU Cache Main memory X Y
CS2100 Cache I 24 MEMORY ACCESS TIME Average access time = Hit rate x Hit time + (1 – Hit rate) x Miss penalty. Suppose our on-chip SRAM (cache) has 0.8 ns access time, but the fastest DRAM we can get has an access time of 10 ns. How high a hit rate do we need to sustain an average access time of 1 ns? 1 = h x 0.8 + (1 – h) x 10, so h = 97.8%. Wow! Can we really achieve this high hit rate with cache?
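The slide's arithmetic can be reproduced directly (all values taken from the slide itself):

```python
# Solve  target = h*hit_time + (1 - h)*miss_penalty  for the hit rate h.
hit_time, miss_penalty, target = 0.8, 10.0, 1.0   # all in ns, from the slide

h = (miss_penalty - target) / (miss_penalty - hit_time)
print(f"required hit rate = {h:.1%}")             # required hit rate = 97.8%
```

Even a small miss rate is expensive because the miss penalty dominates: dropping from 97.8% to 90% hits would more than double the average access time.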
CS2100 Cache I 25 DIRECT-MAPPED CACHE (1/2) Look for the document in only one place on the desk. The place is determined from the request, e.g., L for LEE Find LEE Nicholas A B Z L
CS2100 Cache I 26 DIRECT-MAPPED CACHE (2/2) Block Number (Not Address). Mapping function: Cache Index = (Block Number) modulo (Number of Cache Blocks). Observation: let the number of cache blocks = 2^M; then the last M bits of the block number determine the cache index, i.e., where the memory block will be placed in cache. Example: cache has 2^2 = 4 blocks; the last 2 bits of the block number determine the cache index.
CS2100 Cache I 27 CLOSER LOOK AT MAPPING Block size = 2^N bytes. Number of cache blocks = 2^M. Offset: N bits (bits N-1..0). Index: M bits (bits N+M-1..N). Tag: 32 – (N + M) bits (bits 31..N+M). Memory address: [ Tag | Index | Offset ].
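As a sketch of the field layout above (the helper name split_address is my own, not from the slides):

```python
def split_address(addr, N, M):
    """Split a 32-bit byte address into (tag, index, offset) for a
    direct-mapped cache with 2**M blocks of 2**N bytes each."""
    offset = addr & ((1 << N) - 1)          # bits N-1..0
    index = (addr >> N) & ((1 << M) - 1)    # bits N+M-1..N
    tag = addr >> (N + M)                   # bits 31..N+M
    return tag, index, offset

# 16-byte blocks (N=4) and 1024 cache blocks (M=10), as in the next example
print(split_address(0x1234, 4, 10))         # (0, 291, 4)
```

The same function covers any direct-mapped geometry; only N and M change.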
CS2100 Cache I 28 WHY DO WE NEED TAG? Block Number (Not Address). Mapping function: Cache Index = (Block Number) modulo (Number of Cache Blocks). Observations: 1. Multiple memory blocks can map to the same cache block. 2. How do we differentiate the memory blocks that map to the same cache block? Use the Tag. 3. Tag = (Block Number) / (Number of Cache Blocks).
CS2100 Cache I 29 CACHE STRUCTURE Each cache entry: Index | Valid | Tag | Data. Along with a data block (line), the cache contains: 1. the Tag of the memory block; 2. a Valid bit indicating whether the cache line contains valid data (why?). Cache hit: (Tag[index] == Tag[memory address]) AND (Valid[index] == TRUE).
CS2100 Cache I 30 CACHE: EXAMPLE (1/2) We have 4GB main memory and the block size is 2^4 = 16 bytes (N = 4). How many memory blocks are there? 4GB/16 bytes = 256M memory blocks. How many bits are required to uniquely identify the memory blocks? 256M = 2^28 memory blocks will require 28 bits. We can also calculate it as (32-N) = 28. We have a 16KB cache. How many cache blocks are there? 16KB/16 bytes = 1024 cache blocks.
CS2100 Cache I 31 CACHE: EXAMPLE (2/2) How many bits do we need for the cache index? 1024 cache blocks require a 10-bit index (M = 10). How many memory blocks can map to the same cache block? 256M memory blocks / 1K cache blocks = 256K. How many Tag bits do we need? 256K = 2^18 memory blocks can be uniquely distinguished using 18 bits. We can also calculate it as 32 – (M+N) = 18. Address fields: [ tag | index | offset ].
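The example's numbers can be verified mechanically (a quick check, not part of the original slides):

```python
mem_bytes = 4 * 2**30          # 4 GB main memory
block_bytes = 2**4             # 16-byte blocks (N = 4)
cache_bytes = 16 * 2**10       # 16 KB cache

n_mem_blocks = mem_bytes // block_bytes      # 2**28 = 256M memory blocks
n_cache_blocks = cache_bytes // block_bytes  # 1024 cache blocks, so M = 10
tag_bits = 32 - (10 + 4)                     # 18 tag bits

print(n_mem_blocks == 2**28, n_cache_blocks, tag_bits)  # True 1024 18
```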
CS2100 Cache I 32 CACHE CIRCUITRY A 16-KB cache using 4-word (16-byte) blocks: 18-bit Tag, 10-bit Index, 2-bit Block offset and 2-bit Byte offset (1 word = 4 bytes). Each entry holds Valid, Tag and Data (4 x 32 bits = 16 bytes); Hit is asserted on a tag match with Valid set, and the Block offset selects the 32-bit Data word.
CS2100 Cache I 33 ACCESSING DATA (1/17) Let’s go through accessing some data in a direct-mapped, 16KB cache: 1024 cache blocks of 16 bytes each. Examples: 4 addresses divided (for convenience) into Tag, Index, Byte Offset fields.
CS2100 Cache I 34 ACCESSING DATA (2/17) Initially the cache is empty; all Valid bits are 0. Cache: Index | Valid | Tag | Data (Word0: Byte 0-3, Word1: Byte 4-7, Word2: Byte 8-11, Word3: Byte 12-15).
CS2100 Cache I 35 ACCESSING DATA (3/17) Load from Address: Tag | Index | Offset. 1. Look at cache block 1.
CS2100 Cache I 36 ACCESSING DATA (4/17) Load from Address: Tag | Index | Offset. 2. Data in block 1 is not valid: Cache Miss [Cold/Compulsory Miss].
CS2100 Cache I 37 ACCESSING DATA (5/17) Load from Address: Tag | Index | Offset. 3. Load the 16-byte data from memory to cache; set the tag and valid bit. Block 1 now holds A B C D.
CS2100 Cache I 38 ACCESSING DATA (6/17) Load from Address: Tag | Index | Offset. 4. Load Word1 (byte offset = 4) from cache to register.
CS2100 Cache I 39 ACCESSING DATA (7/17) Load from Address: Tag | Index | Offset. 1. Look at cache block 1.
CS2100 Cache I 40 ACCESSING DATA (8/17) Load from Address: Tag | Index | Offset. 2. Data in block 1 is valid + Tag match: Cache Hit.
CS2100 Cache I 41 ACCESSING DATA (9/17) Load from Address: Tag | Index | Offset. 3. Load Word3 (byte offset = 12) from cache to register [Spatial Locality].
CS2100 Cache I 42 ACCESSING DATA (10/17) Load from Address: Tag | Index | Offset. 1. Look at cache block 3.
CS2100 Cache I 43 ACCESSING DATA (11/17) Load from Address: Tag | Index | Offset. 2. Data at cache block 3 is not valid: Cache Miss [Cold/Compulsory Miss].
CS2100 Cache I 44 ACCESSING DATA (12/17) Load from Address: Tag | Index | Offset. 3. Load the 16-byte data from main memory to cache; set the tag and valid bit. Block 3 now holds I J K L.
CS2100 Cache I 45 ACCESSING DATA (13/17) Load from Address: Tag | Index | Offset. 4. Return Word1 (byte offset = 4) from cache to register.
CS2100 Cache I 46 ACCESSING DATA (14/17) Load from Address: Tag | Index | Offset. 1. Look at cache block 1.
CS2100 Cache I 47 ACCESSING DATA (15/17) Load from Address: Tag | Index | Offset. 2. Data at cache block 1 is valid but the tag mismatches: Cache Miss [Cold Miss].
CS2100 Cache I 48 ACCESSING DATA (16/17) Load from Address: Tag | Index | Offset. 3. Replace cache block 1 with the new data and tag. Block 1 now holds E F G H.
CS2100 Cache I 49 ACCESSING DATA (17/17) Load from Address: Tag | Index | Offset. 4. Load Word1 (byte offset = 4) from cache to register.
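The whole walkthrough can be condensed into a small simulator (my own sketch, not the lecture's circuitry; it assumes the example's 16 KB / 16-byte-block geometry and tracks only valid bits and tags, which is all that hit/miss detection needs):

```python
class DirectMappedCache:
    """Hit/miss bookkeeping for a direct-mapped cache with 2**m blocks
    of 2**n bytes each (the data itself is not modelled)."""

    def __init__(self, n_offset_bits=4, n_index_bits=10):
        self.n, self.m = n_offset_bits, n_index_bits
        self.valid = [False] * (1 << n_index_bits)
        self.tags = [0] * (1 << n_index_bits)

    def load(self, addr):
        index = (addr >> self.n) & ((1 << self.m) - 1)
        tag = addr >> (self.n + self.m)
        if self.valid[index] and self.tags[index] == tag:
            return "hit"
        # miss: fetch the block from memory, set tag and valid bit
        self.valid[index], self.tags[index] = True, tag
        return "miss"
```

A first access to a block returns "miss" (cold miss), a repeat access to the same block returns "hit" (spatial/temporal locality), and an access to a different memory block with the same index evicts the old one, mirroring steps 1–4 above.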
CS2100 Cache I 50 TRY IT YOURSELF #1 (1/2) We have 4GB main memory and the block size is 2^3 = 8 bytes (N = 3). How many memory blocks are there? How many bits are required to uniquely identify the memory blocks? We have a 32-byte cache. How many cache blocks (lines) are there?
CS2100 Cache I 51 TRY IT YOURSELF #1 (2/2) How many bits do we need for cache index? How many memory blocks can map to the same cache block (line)? How many Tag bits do we need?
CS2100 Cache I 52 TRY IT YOURSELF #2 (1/10) Here is a series of memory address references: 4, 0, 8, 12, 36, 0, 4 for a direct-mapped cache with four 8-byte blocks. Indicate hit/miss for each reference. Assume that 1 word contains 4 bytes. Number of blocks in cache = 4 = 2^2, so Index = ? bits. Size of block = 8 bytes = 2^3 bytes, so Offset = ? bits. Memory size = unknown, so Tag = ? bits. Address fields: Tag | Index | Offset.
CS2100 Cache I 53 TRY IT YOURSELF #2 (2/10) Addresses: 4, 0, 8, 12, 36, 0, 4. Cache: Index | Valid | Tag | Word0 | Word1.
CS2100 Cache I 54 TRY IT YOURSELF #2 (3/10)
Memory Address   Data
0                A
4                B
8                C
12               D
…                …
32               I
36               J
…                …
CS2100 Cache I 55 TRY IT YOURSELF #2 (4/10) Addresses: 4, 0, 8, 12, 36, 0, 4. Address 4: … Cache: Index | Valid | Tag | Word0 | Word1.
CS2100 Cache I 56 TRY IT YOURSELF #2 (5/10) Addresses: 4, 0, 8, 12, 36, 0, 4. Address 0: … Cache: Index | Valid | Tag | Word0 | Word1.
CS2100 Cache I 57 TRY IT YOURSELF #2 (6/10) Addresses: 4, 0, 8, 12, 36, 0, 4. Address 8: … Cache: Index | Valid | Tag | Word0 | Word1.
CS2100 Cache I 58 TRY IT YOURSELF #2 (7/10) Addresses: 4, 0, 8, 12, 36, 0, 4. Address 12: … Cache: Index | Valid | Tag | Word0 | Word1.
CS2100 Cache I 59 TRY IT YOURSELF #2 (8/10) Addresses: 4, 0, 8, 12, 36, 0, 4. Address 36: … Cache: Index | Valid | Tag | Word0 | Word1.
CS2100 Cache I 60 TRY IT YOURSELF #2 (9/10) Addresses: 4, 0, 8, 12, 36, 0, 4. Address 0: … Cache: Index | Valid | Tag | Word0 | Word1.
CS2100 Cache I 61 TRY IT YOURSELF #2 (10/10) Addresses: 4, 0, 8, 12, 36, 0, 4. Address 4: … Cache: Index | Valid | Tag | Word0 | Word1.
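The reference string can also be checked with a short simulation (my sketch, not the official solution; four 8-byte blocks give M = 2 index bits and N = 3 offset bits):

```python
N, M = 3, 2                      # 8-byte blocks, 4 cache blocks
valid, tags = [False] * 4, [0] * 4
results = []

for addr in [4, 0, 8, 12, 36, 0, 4]:
    index = (addr >> N) & ((1 << M) - 1)
    tag = addr >> (N + M)
    hit = valid[index] and tags[index] == tag
    valid[index], tags[index] = True, tag   # fill/replace on a miss
    results.append("hit" if hit else "miss")

print(results)   # ['miss', 'hit', 'miss', 'hit', 'miss', 'miss', 'hit']
```

Note how address 36 evicts the block holding addresses 0–7 (both map to index 0), so the later reference to address 0 misses again even though it was cached earlier.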
CS2100 Cache I 62 WRITE TO MEMORY? (1/2) Store X at Address: Tag | Index | Offset. Cache: Index | Valid | Tag | Data (Word0: Byte 0-3, Word1: Byte 4-7, Word2: Byte 8-11, Word3: Byte 12-15); block 1 holds E F G H, block 3 holds I J K L.
CS2100 Cache I 63 WRITE TO MEMORY? (2/2) Store X at Address: Tag | Index | Offset. 1. Valid bit set + Tag match at cache block 1. 2. Replace F with X in Word1 (byte offset = 4); block 1 now holds E X G H. Do you see any problem here?
CS2100 Cache I 64 WRITE POLICY Cache and main memory are now inconsistent: the data is modified in cache, but not in main memory. So what to do on a data-write hit? Solution 1: Write-through cache. Write data both to cache and to main memory. Problem: writes operate at the speed of main memory, which is very, very slow. Solution: put a write buffer between cache and main memory. Solution 2: Write-back cache. Only write to cache; delay the write to main memory as much as possible.
CS2100 Cache I 65 WRITE BUFFER FOR WRITE THROUGH Write buffer between cache and memory. Processor: writes data to cache + write buffer. Memory controller: writes the contents of the buffer to memory. The write buffer is just a FIFO; typical number of entries: 4. Works fine if store frequency (w.r.t. time) << 1 / DRAM write cycle; otherwise the buffer overflows. Data path: Processor -> Cache / Write Buffer -> DRAM.
CS2100 Cache I 66 WRITE-BACK CACHE Add an additional bit (Dirty bit) to each cache block. Write data to cache and set the Dirty bit of that cache block to 1; do not write to main memory. When you replace a cache block, check its Dirty bit: if Dirty = 1, then write that cache block back to main memory. Memory is written only on replacement.
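A sketch of the dirty-bit bookkeeping for one cache line (the names WriteBackLine and store are my own, and write-allocate is assumed on a miss; the slides' design may differ):

```python
class WriteBackLine:
    def __init__(self):
        self.valid = self.dirty = False
        self.tag, self.data = 0, None

memory_writes = []               # stands in for traffic to main memory

def store(line, tag, data):
    if line.valid and line.tag == tag:        # write hit
        line.data, line.dirty = data, True    # cache only; memory deferred
        return
    if line.valid and line.dirty:             # replacing a dirty block:
        memory_writes.append((line.tag, line.data))  # write it back now
    line.valid, line.dirty = True, True       # write-allocate assumed
    line.tag, line.data = tag, data

line = WriteBackLine()
store(line, tag=1, data="A")     # miss; nothing dirty to write back
store(line, tag=1, data="B")     # hit; memory still untouched
store(line, tag=2, data="C")     # dirty block evicted: one memory write
print(memory_writes)             # [(1, 'B')]
```

Three stores cost only one memory write, which is exactly the point of write-back: repeated writes to the same block are absorbed by the cache.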
CS2100 Cache I 67 CACHE MISS ON WRITE On a read miss, you need to bring the data into cache and then load it from there to a register; there is no choice here. What happens on a write miss? Option 1: Write allocate. Load the complete block (16 bytes in our example) from main memory to cache; change only the required word in cache; the write to main memory depends on the write policy. Option 2: Write no-allocate. Do not load the block into cache; write the required word in main memory only. Which one is best for a write-back cache?
CS2100 Cache I 68 SUMMARY The memory hierarchy gives the illusion of a fast but big memory. A hardware-managed cache is an integral component of today’s processors. Next lecture: how to improve cache performance.
CS2100 Cache I 69 READING ASSIGNMENT Large and Fast: Exploiting Memory Hierarchy. Chapter 7, sections 7.1–7.2 (3rd edition); Chapter 5, sections 5.1–5.2 (4th edition).
CS2100 Cache I 70 END