Memory – Caching: Writes


Memory – Caching: Writes
CS/COE 1541 (term 2174)
Jarrett Billingsley

Class Announcements
- I'm improving my time management and work environment!
- Project 1 writeup tomorrow, GUARANTEED. Due Sunday, March 5th...? Or during spring break? (Your choice!)
- Class chat – contact me if you need a link.
- Are you getting my emails? Sometimes I just don't know. I'd like a more permanent/centralized way to notify and communicate with you all... not sure what.

Some clarifications from HW and exam...
- You can't fetch an instruction on every cycle when stalling. Fetches must stall too.
- Single-issue pipelines cannot have multiple instructions in the same phase at once, and later instructions cannot finish before earlier instructions.
- Out-of-order is inherently superscalar and pipelined (overlapping!).
- The sum of instruction latencies divided by the number of instructions is not CPI, but it is a useful metric: average instruction latency.
- sw reads the first register, confusingly. sw t0, 0(s0) reads t0.
- la and li do not touch memory! They just put constants in registers.

Better terminology
Some of the confusion on CPI is on the book, some on me... From now on I will try to use the following terms:
- CPI (and IPC) will refer to throughput. This is an average across a program: "how many cycles it takes to complete the entire program" divided by "how many instructions it contains."
- Instruction latency will refer to "how many cycles it takes to complete a given kind of instruction." I called this "intrinsic CPI" before.
- Amortized latency will be "instruction latency × percentage of the program that consists of that instruction." I called this "average CPI" before.
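To see how these terms fit together, here's a tiny sketch in C. The instruction mix and latencies are made up for illustration; the point is that summing each instruction class's amortized latency gives the program's overall CPI.

```c
#include <stdio.h>

int main(void) {
    // Hypothetical program: 20% loads at 3 cycles each, 80% ALU ops
    // at 1 cycle each. (These numbers are invented for the example.)
    double frac[]    = {0.20, 0.80};
    double latency[] = {3.0,  1.0};

    double cpi = 0.0;
    for (int i = 0; i < 2; i++) {
        double amortized = frac[i] * latency[i];  // amortized latency
        printf("class %d: amortized latency = %.2f cycles\n", i, amortized);
        cpi += amortized;  // CPI (throughput) is the sum of these
    }
    printf("overall CPI = %.2f\n", cpi);  // 0.60 + 0.80 = 1.40
    return 0;
}
```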

Handling Writes

Let's start simple: cache write HITS. A very common pattern: x++
    lw   t0, &x
    addi t0, t0, 1
    sw   t0, &x
Assuming &x is 111010₂, how will the lw change the cache? Now how will the sw change the cache...?
Uh oh, now the cache is inconsistent: the contents of the cache differ from memory. How can we solve this?
[Slide figure: a direct-mapped cache with indices 000-111. After the lw, index 010 holds V=1, tag 111, data 24; after the sw it holds 25, while memory still holds 24.]
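As a rough sketch of what the hardware does on that sw, here's the hit check for a direct-mapped cache like the one on the slide. The sizes, names, and the word-addressed simplification are my assumptions, not from the slides.

```c
#include <stdint.h>
#include <stdio.h>

// 8 one-word entries, 3-bit index, word-addressed for simplicity.
#define INDEX_BITS 3
#define NUM_SETS   (1 << INDEX_BITS)

struct line { int valid; uint32_t tag; uint32_t data; };
struct line cache[NUM_SETS];

int is_write_hit(uint32_t addr) {
    uint32_t index = addr & (NUM_SETS - 1);  // low 3 bits: 111010 -> 010
    uint32_t tag   = addr >> INDEX_BITS;     // remaining bits:   -> 111
    return cache[index].valid && cache[index].tag == tag;
}

int main(void) {
    // The lw fills index 010 with tag 111 and x's value, 24.
    cache[2] = (struct line){1, 7, 24};         // 2 = 010, 7 = 111
    printf("sw hits? %d\n", is_write_hit(58));  // 58 = 111010 in binary
    cache[2].data = 25;  // the sw: cache holds 25, memory still holds 24
    return 0;
}
```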

Technique 1: Write-through
When you write to the cache, write the same data to memory simultaneously.
What if we wrote to address 000010₂? What happens to the data already in the cache? Eh, whatever. The cache is always consistent, so just overwrite it and change the tag. Consistency is solved!
But what's the problem with write-through? Memory is slow, and we have to stall. How could we fix this... We could CACHE THE CACHE AAA. Or use a tiny buffer!
[Slide figure: the earlier sw already updated memory at 111010₂ to 25. Writing 94 to 000010₂ maps to the same index 010, so the cache entry becomes tag 000, data 94, and memory at 000010₂ goes from 17 to 94; cache and memory stay consistent.]
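Continuing the sketch from the previous slide, a write-through store might look like this (memory[] is a made-up stand-in for main memory):

```c
// 64 words of "main memory", enough for the 6-bit addresses here.
uint32_t memory[1 << 6];

void store_write_through(uint32_t addr, uint32_t value) {
    uint32_t index = addr & (NUM_SETS - 1);
    uint32_t tag   = addr >> INDEX_BITS;

    // Overwrite the entry and change the tag unconditionally; since
    // memory is updated too, nothing is lost by evicting the old data.
    cache[index] = (struct line){1, tag, value};

    // ...and write the same data to memory. In hardware, this is
    // where the CPU stalls (or hands the write to a buffer instead).
    memory[addr] = value;
}
```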

Write buffers
Instead of immediately writing the data to memory, buffer it. This lets the CPU "fire and forget" – it doesn't have to stall for memory; the buffer will write the data while the CPU keeps executing. Eventually the data ends up in memory.
What if another write happens while the buffer is full? Stall. :(
The speed at which the buffer empties into memory (words/cycle) must exceed the speed at which writes happen, or it's not worth it. Multi-entry buffers are common. Wide buses to memory are useful!
[Slide figure: the sw leaves 25 in the cache at index 010 and puts (address 111010₂, data 25) into the write buffer; the buffer later updates memory at 111010₂ from 24 to 25.]
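Here's a minimal sketch of such a buffer, continuing the C example from before. The 4-entry size and all the names are hypothetical.

```c
#define BUF_ENTRIES 4  // real write buffers are similarly small

struct pending { uint32_t addr, data; };
struct pending wbuf[BUF_ENTRIES];
int wbuf_head = 0, wbuf_count = 0;

// Called on a store: returns 0 if the CPU must stall (buffer full).
int wbuf_push(uint32_t addr, uint32_t data) {
    if (wbuf_count == BUF_ENTRIES) return 0;  // full: stall :(
    wbuf[(wbuf_head + wbuf_count) % BUF_ENTRIES] =
        (struct pending){addr, data};
    wbuf_count++;
    return 1;                                 // fire and forget
}

// Called by the memory side whenever it's free: drain the oldest entry.
void wbuf_drain_one(void) {
    if (wbuf_count == 0) return;
    struct pending p = wbuf[wbuf_head];
    memory[p.addr] = p.data;        // the data finally reaches memory
    wbuf_head = (wbuf_head + 1) % BUF_ENTRIES;
    wbuf_count--;
}
```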

Uh oh
There are complications when you add a write buffer:
1. Write to block A.
2. Read block B. It collides with A, so B kicks A out of the cache.
3. Read block A...? Uh-oh: if the buffer hasn't emptied yet, memory still holds Aold, not Anew.
[Slide figure: after these steps the cache holds B, the write buffer holds Anew, and memory still holds Aold.]

Check that buffer
To ensure consistency, we have to check the buffer too. Write buffers are essentially fully-associative FIFO caches. Since they're fully-associative, a large write buffer would require a LOT of comparators. This is one reason why they're usually 4-8 blocks in practice.
Adding a write buffer can help amortize the cost of writes, which reduces miss penalty, but… It adds another step to checking for a hit, increasing hit time.
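Continuing the sketch, checking the buffer on a read might look like this; the linear search below is exactly the fully-associative comparison the slide is talking about:

```c
// On a read, search the buffer (newest entry first) before falling
// back to memory, so a pending write is never missed.
uint32_t read_after_buffer_check(uint32_t addr) {
    // Fully associative search: one comparator per entry in hardware,
    // which is one reason buffers stay small (4-8 entries).
    for (int i = wbuf_count - 1; i >= 0; i--) {
        struct pending *p = &wbuf[(wbuf_head + i) % BUF_ENTRIES];
        if (p->addr == addr)
            return p->data;   // found Anew still waiting in the buffer
    }
    return memory[addr];      // not buffered: memory is up to date
}
```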

Writing into the void
What if we write to an address that is not cached? (There's an empty or wrong block in the cache.)
    sw t0, 000000₂
Should we put an entry in the cache? This is called write-allocate. Temporal locality says we might need to read it again soon!
Or we could write to memory (or the write buffer...) and leave the cache unchanged. This is called write-no-allocate.
A common term for write-through paired with no-allocate is write-around, since you're writing "around" the cache – skipping it.
[Slide figure: with write-allocate, writing 94 to 000000₂ fills the entry at index 000 with tag 000, data 94, and memory at 000000₂ goes from 24 to 94.]
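As a sketch, the two policies differ only in what happens on a miss. This reuses the made-up structures from the earlier examples and assumes write-through for the memory update.

```c
// Both policies on a write miss; a real cache picks one at design time.
void store(uint32_t addr, uint32_t value, int write_allocate) {
    uint32_t index = addr & (NUM_SETS - 1);
    uint32_t tag   = addr >> INDEX_BITS;

    if (is_write_hit(addr)) {
        cache[index].data = value;  // a hit updates the cache either way
    } else if (write_allocate) {
        // write-allocate: bring the block in, betting on temporal
        // locality -- we may touch it again soon.
        cache[index] = (struct line){1, tag, value};
    }
    // write-no-allocate: leave the cache untouched on a miss.

    // Assuming write-through, the data goes to memory (or the write
    // buffer) regardless; no-allocate + write-through is "write-around".
    memory[addr] = value;
}
```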

Something more intuitive...
When you get your notebook out... you write things in it. And when you're done... you put it away.
This is the intuition behind write-back: we only write changed cache data back to memory when we're "done with it." In other words, when it gets kicked out of the cache!
But this scheme is more complex...
- We need to keep a dirty bit on each word which says whether the word has changed since it was brought in. Write HITs turn it on.
- We need to check the dirty bit when we want to overwrite a cache word. And if it's dirty, we need to write it back to memory. This incurs an extra stall! Unless we PUT ANOTHER BUFFER ON THE CACHE, SO WE BUFFER WRITES TO THE CACHE AS WELL AAAAAAAHHHHH
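A minimal write-back sketch, again with made-up names and a word-sized "block": each line carries a dirty bit, and a dirty victim is written back only when it's evicted.

```c
// Each line carries a dirty bit; a dirty victim is written back to
// memory only when it's kicked out.
struct wb_line { int valid, dirty; uint32_t tag, data; };
struct wb_line wb_cache[NUM_SETS];

void store_write_back(uint32_t addr, uint32_t value) {
    uint32_t index = addr & (NUM_SETS - 1);
    uint32_t tag   = addr >> INDEX_BITS;
    struct wb_line *l = &wb_cache[index];

    int hit = l->valid && l->tag == tag;
    if (!hit && l->valid && l->dirty) {
        // Overwriting a dirty word: write the old data back first.
        // This is the extra stall (unless a buffer absorbs it).
        uint32_t old_addr = (l->tag << INDEX_BITS) | index;
        memory[old_addr] = l->data;
    }
    // A write hit (or the new fill) turns the dirty bit on.
    *l = (struct wb_line){1, 1, tag, value};
}
```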