Lecture 13: Memory Hierarchy—Ways to Reduce Misses

Review: Who Cares About the Memory Hierarchy?
(Figure: processor vs. DRAM performance over time. CPU performance ("Moore's Law") improves ~60%/yr while DRAM improves ~7%/yr, so the processor-memory performance gap grows about 50% per year.)
Processor-only focus thus far in course:
–CPU cost/performance, ISA, pipelined execution
The CPU-DRAM gap: in 1980 microprocessors had no cache; by 1995 they had a 2-level cache on chip (1989: first Intel microprocessor with an on-chip cache)

The Goal: Illusion of Large, Fast, Cheap Memory
Fact: large memories are slow; fast memories are small.
How do we create a memory that is large, cheap, and fast (most of the time)? A hierarchy of levels:
–Uses smaller and faster memory technologies close to the processor
–Fast access time in the highest level of the hierarchy
–Cheap, slow memory furthest from the processor
The aim of memory hierarchy design is an average access time close to that of the highest (fastest) level, with a size equal to that of the lowest (largest) level.

Recap: Memory Hierarchy Pyramid
(Figure: pyramid with the processor (CPU) at the top and Levels 1, 2, 3, …, n below it, connected by a transfer datapath (bus). Moving down the pyramid, distance from the CPU increases, the size of memory at each level increases, and cost per MB decreases; moving up, access time (memory latency) decreases.)

Memory Hierarchy: Terminology
Hit: data appears in some block in the upper level (e.g., Block X)
–Hit Rate: the fraction of memory accesses found in the upper level
Miss: data must be retrieved from a block in the lower level (e.g., Block Y)
–Miss Rate = 1 - (Hit Rate)
Hit Time: time to access the upper level, which consists of the time to determine hit/miss plus the memory access time
Miss Penalty: time to replace a block in the upper level plus the time to deliver the block to the processor
Note: Hit Time << Miss Penalty
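These quantities combine into the average memory access time (AMAT) relation that reappears later in the lecture:

\[
\text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}
\]

As a worked instance (the numbers are my own illustration, not from the slide): with a 1-cycle hit time, a 5% miss rate, and a 50-cycle miss penalty, \( \text{AMAT} = 1 + 0.05 \times 50 = 3.5 \) cycles.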

Current Memory Hierarchy
(Figure: processor containing control, datapath, and registers, backed by an L1 cache, an L2 cache, main memory, and secondary memory.)

              Regs   L1 cache   L2 cache   Main memory   Secondary memory
Speed (ns):   0.5    2          6          100           10,000,000
Size (MB):    …      …          …          …             …,000
Cost ($/MB):  --     $100       $30        $1            $0.05
Technology:   Regs   SRAM       SRAM       DRAM          Disk

Memory Hierarchy: Why Does it Work? Locality!
Temporal Locality (locality in time):
=> Keep the most recently accessed data items closer to the processor
Spatial Locality (locality in space):
=> Move blocks consisting of contiguous words to the upper levels
(Figure: blocks Blk X and Blk Y being transferred between upper-level and lower-level memory, to and from the processor; probability of reference plotted over the address space 0 to 2^n - 1.)
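To make the two kinds of locality concrete, here is a minimal sketch (my own example, not from the slides): the loop variables exhibit temporal locality, while the array walk exhibits spatial locality.

```c
#include <stdio.h>

#define N 1024

/* Summing an array demonstrates both kinds of locality:
 * - sum and i are reused on every iteration (temporal locality),
 *   so they stay in registers or the nearest cache level;
 * - a[i] walks through contiguous words (spatial locality),
 *   so each cache block fetched supplies several iterations. */
int main(void) {
    static int a[N];
    long sum = 0;
    for (int i = 0; i < N; i++)
        sum += a[i];
    printf("%ld\n", sum);
    return 0;
}
```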

Memory Hierarchy Technology
Random Access:
–"Random" is good: access time is the same for all locations
–DRAM: Dynamic Random Access Memory
»High density, low power, cheap, slow
»Dynamic: needs to be "refreshed" regularly
–SRAM: Static Random Access Memory
»Low density, high power, expensive, fast
»Static: content lasts "forever" (until power is lost)
"Not-so-random" Access Technology:
–Access time varies from location to location and from time to time
–Examples: disk, CD-ROM
Sequential Access Technology: access time linear in location (e.g., tape)
We will concentrate on random access technology:
–Main memory: DRAM; caches: SRAM

Introduction to Caches
A cache:
–is a small, very fast memory (SRAM, expensive)
–contains copies of the most recently accessed memory locations (data and instructions): temporal locality
–is fully managed by hardware (unlike virtual memory)
–organizes its storage in blocks of contiguous memory locations: spatial locality
–transfers to/from main memory (or L2) in units of one cache block
General structure:
–n blocks per cache, organized in s sets
–b bytes per block
–total cache size: n x b bytes
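A small sketch of that general structure (my illustration; the struct and the parameter values are assumptions): the associativity falls out as n/s and the capacity as n x b.

```c
#include <stdio.h>

/* Cache geometry in the slide's terms: n blocks of b bytes
 * arranged in s sets, so each set holds n/s blocks (the
 * associativity) and the total capacity is n * b bytes. */
struct cache_geometry {
    unsigned n_blocks;      /* n: total blocks in the cache */
    unsigned n_sets;        /* s: number of sets            */
    unsigned block_bytes;   /* b: bytes per block           */
};

int main(void) {
    struct cache_geometry c = { .n_blocks = 128, .n_sets = 64,
                                .block_bytes = 32 };
    unsigned assoc = c.n_blocks / c.n_sets;        /* blocks per set */
    unsigned total = c.n_blocks * c.block_bytes;   /* n * b bytes    */
    printf("associativity = %u-way, capacity = %u bytes\n", assoc, total);
    return 0;
}
```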

Cache Organization
(1) How do you know if something is in the cache?
(2) If it is in the cache, how do you find it?
The answers to (1) and (2) depend on the type, or organization, of the cache.
In a direct-mapped cache, each memory address is associated with exactly one possible block within the cache:
–Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache

Simplest Cache: Direct Mapped
(Figure: main memory and a 4-block direct-mapped cache; each memory block address maps to one cache index.)
The index determines the block's location in the cache:
index = (block address) mod (# cache blocks)
If the number of cache blocks is a power of 2, the cache index is just the lower n bits of the memory block address, where n = log2(# blocks); the remaining upper bits of the block address form the tag.

Issues with Direct-Mapped
If the block size is greater than 1, the rightmost bits of the index are really the offset within the indexed block:
ttttttttttttttttt iiiiiiiiii oooo
–tag: to check that the cache holds the correct block
–index: to select the block
–offset: byte offset within the block
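That t/i/o split is easy to express in code; a sketch under assumed field widths (4 offset bits for 16-byte blocks, 10 index bits for 1024 blocks):

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 16-byte blocks (4 offset bits) and 1024
 * blocks (10 index bits); the rest of a 32-bit address is the
 * tag, exactly as in the t/i/o picture above. */
#define OFFSET_BITS 4
#define INDEX_BITS  10

int main(void) {
    uint32_t addr   = 0x12345678;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=0x%x index=0x%x offset=0x%x\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}
```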

Cache with 4-Word (16-Byte) Blocks
(Figure, showing address bit positions: the address divides into a 16-bit tag, an index selecting one cache entry, a block offset, and a byte offset. Each entry holds a valid bit (V), a tag, and 128 bits of data, i.e. four words; a mux uses the block offset to select the requested word, and the tag comparison produces the hit signal.)

Direct-Mapped Cache, Contd.
The direct-mapped cache is simple to design and its access time is fast (why? only one location and one tag comparison are needed)
–Good for an L1 (on-chip) cache
Problem: conflict misses, which lower the hit ratio
–Conflict misses are misses caused by accessing different memory locations that are mapped to the same cache index
–In a direct-mapped cache there is no flexibility in where a memory block can be placed in the cache, contributing to conflict misses
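As a concrete illustration of a conflict miss (my example, using the same assumed geometry as above): two addresses that differ by exactly the cache size map to the same index, so alternating between them misses every time in a direct-mapped cache.

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 4   /* 16-byte blocks (assumed) */
#define INDEX_BITS  10  /* 1024 blocks (assumed)    */
#define INDEX_OF(a) (((a) >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1))

int main(void) {
    uint32_t a = 0x00010000;
    uint32_t b = a + (1u << (OFFSET_BITS + INDEX_BITS)); /* + cache size */
    /* Same index, different tags: the two blocks fight over one
     * slot, and the access pattern a, b, a, b, ... always misses. */
    printf("index(a)=%u index(b)=%u\n",
           (unsigned)INDEX_OF(a), (unsigned)INDEX_OF(b));
    return 0;
}
```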

Another Extreme: Fully Associative
Fully associative cache (8-word blocks):
–Omit the cache index; place an item in any block!
–Compare all cache tags in parallel
By definition: conflict misses = 0 for a fully associative cache
(Figure: each entry holds a valid bit, a 27-bit cache tag, and data bytes B0–B31; the incoming tag is compared against every stored tag simultaneously, with the byte offset selecting within the block.)

Fully Associative Cache
Must search all tags in the cache, as an item can be in any cache block
The tag search must be done by hardware in parallel (other searches are too slow)
But the necessary parallel comparator hardware is very expensive
Therefore, fully associative placement is practical only for a very small cache

Compromise: N-way Set-Associative Cache
N-way set associative: N cache blocks for each cache index
–Like having N direct-mapped caches operating in parallel
–Select the one that gets a hit
Example: 2-way set-associative cache
–The cache index selects a "set" of 2 blocks from the cache
–The 2 tags in the set are compared in parallel
–Data is selected based on the tag result (whichever matched the address)

Example: 2-way Set-Associative Cache
(Figure: the address splits into tag, index, and offset; the index selects one entry in each of the two ways, each entry holding a valid bit, a cache tag, and a data block. Two comparators check the tags in parallel, and a mux delivers the data from whichever cache block hit.)

Set-Associative Cache, Contd.
Direct mapped and fully associative can be seen as just variations of the set-associative block placement strategy:
–Direct mapped = 1-way set-associative cache
–Fully associative = n-way set associativity for a cache with exactly n blocks
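Since all three placements are points on one spectrum, a single lookup sketch covers them (my illustration, with assumed geometry; WAYS = 1 gives direct mapped, and one set holding every block gives fully associative):

```c
#include <stdbool.h>
#include <stdint.h>

#define SETS 64          /* assumed geometry */
#define WAYS 2
#define OFFSET_BITS 4

struct line { bool valid; uint32_t tag; };
static struct line cache[SETS][WAYS];

/* Probe every way of the selected set; hardware compares the
 * tags in parallel (sequentially here, for illustration). */
bool lookup(uint32_t addr) {
    uint32_t set = (addr >> OFFSET_BITS) % SETS;
    uint32_t tag = addr >> OFFSET_BITS;  /* store the whole block
                                            address as the tag, for
                                            simplicity */
    for (int w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return true;                 /* hit  */
    return false;                        /* miss */
}
```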

Alpha Cache Organization

Block Replacement Policy
N-way set-associative and fully associative caches have a choice of where to place a block (and of which block to replace)
–Of course, if there is an invalid block, use it
Whenever there is a cache hit, record which cache block was touched
When a cache block must be evicted, choose one that hasn't been touched recently: "Least Recently Used" (LRU)
–Past is prologue: history suggests it is the least likely of the choices to be used soon
–The flip side of temporal locality
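For 2-way set associativity, LRU state is especially cheap: one bit per set, naming the way that was not touched most recently. A minimal sketch (mine, matching the lookup sketch above):

```c
#include <stdint.h>

#define SETS 64

/* One LRU bit per set suffices for 2-way associativity:
 * it names the way that was NOT most recently touched. */
static uint8_t lru_way[SETS];

/* Call on every hit or fill of (set, way). */
void touch(uint32_t set, uint8_t way) {
    lru_way[set] = way ^ 1u;  /* the other way is now least recent */
}

/* Call on a miss to pick the block to evict. */
uint8_t victim(uint32_t set) {
    return lru_way[set];      /* evict the least recently used way */
}
```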

Review: Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level? (Block placement)
–Fully associative, set associative, direct mapped
Q2: How is a block found if it is in the upper level? (Block identification)
–Tag/block
Q3: Which block should be replaced on a miss? (Block replacement)
–Random, LRU
Q4: What happens on a write? (Write strategy)
–Write back or write through (with a write buffer)

Write Policy: Write-Through vs. Write-Back
Write-through: all writes update the cache and the underlying memory/cache
–Can always discard cached data; the most up-to-date data is in memory
–Cache control bit: only a valid bit
Write-back: all writes simply update the cache
–Can't just discard cached data; it may have to be written back to memory
–Flagged (dirty) write-back
–Cache control bits: both valid and dirty bits
Other advantages:
–Write-through:
»Memory (or other processors) always has the latest data
»Simpler cache management
–Write-back:
»Needs much lower bus bandwidth, since memory is accessed infrequently
»Better tolerance to long-latency memory?
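The two policies differ only in what a store does at the cache; a side-by-side sketch of the hit path (my illustration; the line structure and the mem_write stub are assumptions standing in for the next level):

```c
#include <stdbool.h>
#include <stdint.h>

struct line { bool valid, dirty; uint32_t tag, data; };

/* Stub standing in for a write to L2/memory. */
static void mem_write(uint32_t addr, uint32_t data) {
    (void)addr; (void)data;
}

/* Write-through: update the line AND the next level on every store;
 * no dirty bit is needed, and eviction can silently drop the line. */
void store_write_through(struct line *l, uint32_t addr, uint32_t data) {
    l->data = data;
    mem_write(addr, data);
}

/* Write-back: update only the line and mark it dirty; the next level
 * is written once, when the dirty line is eventually evicted. */
void store_write_back(struct line *l, uint32_t addr, uint32_t data) {
    (void)addr;
    l->data  = data;
    l->dirty = true;
}
```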

Write Through: Write Allocate vs. Non-Allocate
Write allocate: allocate a new cache line in the cache
–Usually means you have to do a "read miss" to fill in the rest of the cache line!
–Alternative: per-word valid bits
Write non-allocate (or "write-around"):
–Simply send the write data through to the underlying memory/cache; don't allocate a new cache line!

Write Buffers
Write buffers (for write-through):
–Buffer words to be written to the L2 cache/memory, along with their addresses
–2 to 4 entries deep
–All read misses are checked against pending writes for dependencies (associatively)
–Allow reads to proceed ahead of writes
–Can coalesce writes to the same address
Write-back buffers:
–Sit between a write-back cache and L2 or main memory
–Algorithm:
»Move the dirty block to the write-back buffer
»Read the new block
»Write the dirty block to L2 or main memory
–Can be associated with a victim cache (later)
(Figure: L1 backed by a write buffer in front of L2, with data returning to the CPU.)
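A sketch of such a coalescing write buffer (my illustration; four entries, merging on a matching address, which is also the idea behind the write merging on the next slide):

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

struct wb_entry { bool valid; uint32_t addr, data; };
static struct wb_entry wbuf[WB_ENTRIES];

/* Insert a write, merging with a pending write to the same address;
 * returns false if the buffer is full and the CPU must stall until
 * an entry drains to L2/memory. */
bool wbuf_insert(uint32_t addr, uint32_t data) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (wbuf[i].valid && wbuf[i].addr == addr) {
            wbuf[i].data = data;          /* coalesce in place */
            return true;
        }
    for (int i = 0; i < WB_ENTRIES; i++)
        if (!wbuf[i].valid) {
            wbuf[i] = (struct wb_entry){ true, addr, data };
            return true;
        }
    return false;                          /* full: stall */
}

/* Read misses must snoop the buffer for a pending write (the
 * associative dependence check described above). */
bool wbuf_match(uint32_t addr, uint32_t *data) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (wbuf[i].valid && wbuf[i].addr == addr) {
            *data = wbuf[i].data;
            return true;
        }
    return false;
}
```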

Write Merge

Review: Cache Performance
Miss-oriented approach to memory access:

\[
\text{CPUtime} = \text{IC} \times \left( \text{CPI}_{\text{Execution}} + \frac{\text{MemAccess}}{\text{Inst}} \times \text{MissRate} \times \text{MissPenalty} \right) \times \text{CycleTime}
\]

–CPI_Execution includes ALU and memory instructions
Separating out the memory component entirely:

\[
\text{CPUtime} = \text{IC} \times \left( \frac{\text{AluOps}}{\text{Inst}} \times \text{CPI}_{\text{AluOps}} + \frac{\text{MemAccess}}{\text{Inst}} \times \text{AMAT} \right) \times \text{CycleTime}
\]
\[
\text{AMAT} = \text{HitTime} + \text{MissRate} \times \text{MissPenalty}
\]

–AMAT = Average Memory Access Time
–CPI_AluOps does not include memory instructions

Impact on Performance
Suppose a processor executes at:
–Clock rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
–50% arith/logic, 30% ld/st, 20% control
Suppose that 10% of memory operations (data) incur a 50-cycle miss penalty, and 1% of instructions incur the same miss penalty.
CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/ins)
      + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycles/miss)]
      + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycles/miss)]
    = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
Data-memory stalls alone raise the base CPI of 1.1 to 2.6, so 58% (1.5/2.6) of that time the processor is stalled waiting for data memory!
Total no. of memory accesses per instruction = 1 (instruction fetch) + 0.3 (data) = 1.3
Thus, AMAT = (1/1.3) x [1 + 0.01 x 50] + (0.3/1.3) x [1 + 0.1 x 50] = 2.54 cycles, instead of one cycle.
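The same arithmetic as an executable check (a sketch mirroring the slide's assumptions):

```c
#include <stdio.h>

int main(void) {
    double ideal_cpi  = 1.1;
    double data_ops   = 0.30, data_miss = 0.10;  /* per instruction */
    double inst_fetch = 1.0,  inst_miss = 0.01;
    double penalty    = 50.0;                    /* cycles per miss */

    /* Stall cycles per instruction from data and instruction misses. */
    double data_stall = data_ops * data_miss * penalty;    /* 1.5 */
    double inst_stall = inst_fetch * inst_miss * penalty;  /* 0.5 */
    double cpi = ideal_cpi + data_stall + inst_stall;      /* 3.1 */

    /* AMAT averaged over the 1.3 memory accesses per instruction. */
    double acc  = inst_fetch + data_ops;                   /* 1.3 */
    double amat = (inst_fetch / acc) * (1 + inst_miss * penalty)
                + (data_ops   / acc) * (1 + data_miss * penalty);

    printf("CPI = %.2f, AMAT = %.2f cycles\n", cpi, amat); /* 3.10, 2.54 */
    return 0;
}
```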

Impact of a Change in Clock Cycle (cc)
Suppose a processor has the following parameters:
–CPI = 2 (without memory stalls)
–Memory accesses per instruction = 1.5
Compare AMAT and CPU time for a direct-mapped cache and a 2-way set-associative cache, assuming:

                     cc               Hit cycles   Miss penalty   Miss rate
Direct mapped        1 ns             1            75 ns          1.4%
2-way associative    1.25 ns (why?)   1            75 ns          1.0%

–AMAT_d = hit time + miss rate x miss penalty = 1 + 0.014 x 75 = 2.05 ns
–AMAT_2 = 1.25 + 0.01 x 75 = 2.0 ns < 2.05 ns
–CPU time_d = (CPI x cc + mem stall time) x IC = (2 x 1 + 1.5 x 0.014 x 75) x IC = 3.575 x IC ns
–CPU time_2 = (2 x 1.25 + 1.5 x 0.01 x 75) x IC = 3.625 x IC ns > CPU time_d!
A change in cc affects all instructions, while a reduction in miss rate benefits only memory instructions.

Miss Penalty for Out-of-Order (OOO) Execution Processors
In OOO processors, memory stall cycles are overlapped with the execution of other instructions, so the miss penalty should not include this overlapped part:
mem stall cycles per instruction = mem misses per instruction x (total miss penalty - overlapped miss penalty)
For the previous example, suppose 30% of the 75 ns miss penalty can be overlapped. What are the AMAT and CPU time?
–Assume the direct-mapped cache, with cc = 1.25 ns to handle out-of-order execution.
AMAT_d = 1.25 + 0.014 x (75 x 0.7) = 1.985 ns
With 1.5 memory accesses per instruction,
CPU time = (2 x 1.25 + 1.5 x 0.014 x (75 x 0.7)) x IC = 3.60 x IC ns < CPU time_2

Lock-Up Free Cache Using MSHR (Miss Status Holding Register)

Avg. Memory Access Time vs. Miss Rate
Associativity reduces the miss rate, but increases the hit time due to the increase in hardware complexity!
Example: for an on-chip cache, assume CCT = 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, relative to the CCT of a direct-mapped cache.
(Table: A.M.A.T. by cache size (KB) for 1-way, 2-way, 4-way, and 8-way associativity; red entries mean A.M.A.T. is not improved by more associativity.)

Unified vs. Split Caches
(Figure: a processor with split I-Cache-1 and D-Cache-1 over a Unified Cache-2, vs. a processor with Unified Cache-1 over Unified Cache-2.)
Example:
–16 KB I & 16 KB D: instruction miss rate = 0.64%, data miss rate = 6.47%
–32 KB unified: aggregate miss rate = 1.99%
Which is better (ignoring the L2 cache)?
–Assume 33% data ops => 75% of accesses are instruction fetches (1.0/1.33)
–Hit time = 1, miss time = 50
–Note that a data hit incurs 1 extra stall in the unified cache (it has only one port)
AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24