ECE/CS 552: Cache Performance
Instructor: Mikko H. Lipasti
Fall 2010, University of Wisconsin-Madison
Lecture notes based on notes by Mark Hill; updated by Mikko Lipasti

© Hill, Lipasti 2 Memory Hierarchy
CPU → I & D L1 Cache → Shared L2 Cache → Main Memory → Disk
Temporal locality
– Keep recently referenced items at higher levels
– Future references satisfied quickly
Spatial locality
– Bring neighbors of recently referenced items to higher levels
– Future references satisfied quickly
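As a concrete (if simplified) illustration of the two kinds of locality, a C sketch not taken from the slides:

```c
/* Summing a row-major matrix: a[i][j] and a[i][j+1] are adjacent in
 * memory, so the inner loop walks sequential addresses (spatial
 * locality), while the accumulator is touched on every iteration
 * (temporal locality). */
double sum_matrix(int n, double a[n][n]) {
    double sum = 0.0;              /* reused every iteration: temporal */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            sum += a[i][j];        /* sequential addresses: spatial */
    return sum;
}
```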

© Hill, Lipasti 3 Caches and Performance
Caches enable design for the common case: the cache hit
– Cycle time, pipeline organization
– Recovery policy
Uncommon case: cache miss
– Fetch from next level
– Apply recursively if there are multiple levels
– What to do in the meantime? What is the performance impact?
– Various optimizations are possible

© Hill, Lipasti 4 Performance Impact
Cache hit latency
– Included in the "pipeline" portion of CPI
  E.g., an IBM study measured 1.15 CPI with 100% cache hits
– Typically 1-3 cycles for an L1 cache
  Intel/HP McKinley: 1 cycle
  – Heroic array design
  – No address generation: load r1, (r2)
  IBM Power4: 3 cycles
  – Address generation
  – Array access
  – Word select and align

© Hill, Lipasti 5 Cache Hit (continued)
Cycle stealing is common
– Address generation takes less than a cycle
– Array access takes more than a cycle
– Clean FSD cycle boundaries are violated
Speculation is rampant
– "Predict" a cache hit
– Don't wait for the tag check
– Consume the fetched word in the pipeline
– Recover/flush when a miss is detected
  Reportedly 7 (!) cycles later in the Pentium 4
[Diagram: overlapped AGEN → CACHE stage pairs stealing across cycle boundaries]

© Hill, Lipasti 6 Cache Hits and Performance
Cache hit latency is determined by:
– Cache organization
  Associativity
  – Parallel tag checks are expensive and slow
  – Way select is slow (fan-in, wires)
  Block size
  – Word select may be slow (fan-in, wires)
  Number of blocks (sets × associativity)
  – Wire delay across the array
  – "Manhattan distance" = width + height
  – Word line delay: width; bit line delay: height
– Array design is an art form
  Detailed analog circuit/wire delay modeling
[Diagram: word lines run across the array width, bit lines down its height]

© Hill, Lipasti 7 Cache Misses and Performance
Miss penalty components:
– Detect miss: 1 or more cycles
– Find victim (replace line): 1 or more cycles
  Write back if dirty
– Request line from next level: several cycles
– Transfer line from next level: several cycles
  (block size) / (bus width)
– Fill line into data array, update tag array: 1+ cycles
– Resume execution
In practice: 6 cycles to 100s of cycles
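A back-of-the-envelope model of these components, following the breakdown above; a sketch, with hypothetical constants rather than numbers from the slides:

```c
#include <stdio.h>

/* Rough miss-penalty model: sums the component costs listed on the
 * slide. The transfer term is (block size / bus width) beats. */
int miss_penalty(int detect, int victim, int request,
                 int block_bytes, int bus_bytes, int cycles_per_beat,
                 int fill) {
    int transfer = (block_bytes / bus_bytes) * cycles_per_beat;
    return detect + victim + request + transfer + fill;
}

int main(void) {
    /* Hypothetical: 64B block over a 16B bus at 2 cycles per beat */
    printf("%d cycles\n", miss_penalty(1, 1, 10, 64, 16, 2, 1));
    return 0;
}
```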

© Hill, Lipasti 8 Cache Miss Rate
Determined by:
– Program characteristics
  Temporal locality
  Spatial locality
– Cache organization
  Block size, associativity, number of sets

© Hill, Lipasti 9 Improving Locality
Instruction text placement
– Profile the program; place unreferenced or rarely referenced paths "elsewhere"
  Maximizes temporal locality
– Eliminate taken branches
  The fall-through path has spatial locality

© Hill, Lipasti 10 Improving Locality
Data placement, access order
– Arrays: "block" loops to access a subarray that fits into the cache
  Maximizes temporal locality (see the blocking sketch after this slide)
– Structures: pack commonly accessed fields together
  Maximizes spatial and temporal locality
– Trees, linked lists: allocate in the usual reference order
  The heap manager usually allocates sequential addresses
  Maximizes spatial locality
A hard problem, not easy to automate:
– C/C++ disallow rearranging structure fields
– OK in Java
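A minimal sketch of the loop-blocking idea for an array traversal with a bad stride; the tile size B is a hypothetical tuning parameter, not a value from the slides:

```c
#define N 1024
#define B 64   /* tile chosen so a B x B subarray fits in cache (hypothetical) */

/* Naive transpose-copy: strides through b column-wise, so each access
 * to b touches a new cache block with little reuse. */
void copy_transpose_naive(double a[N][N], double b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = b[j][i];
}

/* Blocked version: works on B x B tiles so each tile of b stays
 * cache-resident while all of its elements are consumed (temporal
 * locality within the tile). */
void copy_transpose_blocked(double a[N][N], double b[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    a[i][j] = b[j][i];
}
```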

© Hill, Lipasti 11 Cache Miss Rates: 3 C's [Hill]
Compulsory miss
– First-ever reference to a given block of memory
Capacity
– Working set exceeds cache capacity
– Useful blocks (with future references) are displaced
Conflict
– Placement restrictions (not fully associative) cause useful blocks to be displaced
– Think of it as capacity within a set

© Hill, Lipasti 12 Cache Miss Rate Effects
Number of blocks (sets × associativity)
– Bigger is better: fewer conflicts, greater capacity
Associativity
– Higher associativity reduces conflicts
– Very little benefit beyond 8-way set-associative
Block size
– Larger blocks exploit spatial locality
– Usually, miss rates improve as blocks grow to 64B-256B
– At 512B or more, miss rates get worse
  Larger blocks are less efficient: more capacity misses
  Fewer placement choices: more conflict misses

© Hill, Lipasti 13 Cache Miss Rate
Subtle tradeoffs between cache organization parameters
– Large blocks reduce compulsory misses but increase miss penalty
  #compulsory = (working set) / (block size)
  #transfers = (block size) / (bus width)
– Large blocks increase conflict misses
  #blocks = (cache size) / (block size)
– Associativity reduces conflict misses
– Associativity increases access time
Can an associative cache ever have a higher miss rate than a direct-mapped cache of the same size?
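For concreteness, a worked instance of the two ratios above, with hypothetical sizes (not from the slides):

\[
\#\text{compulsory} = \frac{64\,\text{KB working set}}{64\,\text{B block}} = 1024 \text{ misses}, \qquad
\#\text{transfers} = \frac{64\,\text{B block}}{16\,\text{B bus}} = 4 \text{ per miss}
\]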

© Hill, Lipasti 14 Cache Miss Rates: 3 C's
Vary size and associativity
– Compulsory misses are constant
– Capacity and conflict misses are reduced

© Hill, Lipasti 15 Cache Miss Rates: 3 C's
Vary size and block size
– Compulsory misses drop with increased block size
– Capacity and conflict misses can increase with larger blocks

© Hill, Lipasti 16 Cache Misses and Performance
How does this affect performance?
Performance = Time / Program
Cache organization affects cycle time
– Hit latency
Cache misses affect CPI

Time / Program = (Instructions / Program) × (Cycles / Instruction) × (Time / Cycle)
               = (code size) × (CPI) × (cycle time)

© Hill, Lipasti 17 Cache Misses and CPI
Cycles spent handling misses are strictly additive.
Miss_penalty is recursively defined at the next level of the cache hierarchy as the weighted sum of hit latency and miss latency.
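The recursion can be written out explicitly; the slide's equation did not survive transcription, so the notation below is an assumed standard form consistent with its wording:

\[
\text{Miss\_penalty}_{l} = \text{Hit\_latency}_{l+1} + \text{Miss\_rate}_{l+1} \times \text{Miss\_penalty}_{l+1}
\]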

© Hill, Lipasti 18 Cache Misses and CPI
P_l is the miss penalty at each of n levels of cache
MPI_l is the miss rate per instruction at each of n levels of cache
Miss rate specification:
– Per instruction: easy to incorporate in CPI
– Per reference: must convert to per instruction
  Local: misses per local reference
  Global: misses per ifetch or load or store
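Combining these definitions with the additive-cycles observation gives a summation form; this is a standard rendering consistent with the symbols above (the base-CPI term is an assumption, since the slide's own equation image was not captured):

\[
\text{CPI} = \text{CPI}_{\text{base}} + \sum_{l=1}^{n} P_l \times \text{MPI}_l
\]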

© Hill, Lipasti 19 Cache Performance Example
Assume the following:
– L1 instruction cache with a 98% per-instruction hit rate
– L1 data cache with a 96% per-instruction hit rate
– Shared L2 cache with a 40% local miss rate
– L1 miss penalty of 8 cycles
– L2 miss penalty of:
  10 cycles latency to request the word from memory
  2 cycles per 16B bus transfer; 4 × 16B = 64B block transferred, hence 8 cycles of transfer
  Plus 1 cycle to fill the L2
  Total penalty = 10 + 8 + 1 = 19 cycles

© Hill, Lipasti 20 Cache Performance Example
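The worked arithmetic on this slide did not survive transcription; the following reconstruction uses the assumptions from the previous slide, with the 1.15 base CPI taken from the earlier IBM-study slide (an assumption here, not shown on this slide):

\[
\text{MPI}_{L1} = (1 - 0.98) + (1 - 0.96) = 0.06 \text{ misses/instr}
\]
\[
\text{MPI}_{L2} = 0.06 \times 0.40 = 0.024 \text{ misses/instr}
\]
\[
\text{CPI} = 1.15 + 0.06 \times 8 + 0.024 \times 19 = 1.15 + 0.48 + 0.456 \approx 2.09
\]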

© Hill, Lipasti 21 Cache Misses and Performance
The CPI equation:
– Only holds for misses that cannot be overlapped with other activity
– Store misses are often overlapped:
  Place the store in a store queue
  Wait for the miss to complete
  Perform the store
  Allow subsequent instructions to continue in parallel
– Modern out-of-order processors also do this for loads
Cache performance modeling therefore requires detailed modeling of the entire processor core.

© Hill, Lipasti 22 Caches Summary
Four questions:
– Placement: direct-mapped, set-associative, fully associative
– Identification: tag array used for tag check
– Replacement: LRU, FIFO, random
– Write policy: write-through, write-back

© Hill, Lipasti 23 Caches: Set-Associative SRAM Cache
[Diagram: address split into tag, index, and offset; the index selects a set, reading out a tags and a data blocks in parallel; a tag comparison (?=) picks the matching way's data out]

© Hill, Lipasti 24 Caches: Direct-Mapped
[Diagram: address split into tag, index, and offset; the index selects a single tag and data entry; a tag comparison (?=) validates the data out]
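To make the tag/index/offset split concrete, a minimal direct-mapped lookup in C; a sketch, where the geometry constants and struct layout are illustrative rather than from the slides:

```c
#include <stdbool.h>
#include <stdint.h>

#define OFFSET_BITS 6                  /* 64B blocks (illustrative) */
#define INDEX_BITS  8                  /* 256 sets  (illustrative) */
#define NUM_SETS    (1u << INDEX_BITS)

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[1 << OFFSET_BITS];
} Line;

static Line cache[NUM_SETS];

/* Direct-mapped lookup: the index selects exactly one line, then the
 * stored tag is compared against the address tag (the ?= in the
 * diagram). Returns true on a hit. */
bool lookup(uint32_t addr, uint8_t *byte_out) {
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    Line *line = &cache[index];
    if (line->valid && line->tag == tag) {   /* hit */
        *byte_out = line->data[offset];
        return true;
    }
    return false;                            /* miss */
}
```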

© Hill, Lipasti 25 Caches: Fully-Associative SRAM Cache
[Diagram: address split into tag and offset only (no index); all a tags are compared (?=) in parallel and the matching data block drives data out]

© Hill, Lipasti 26 Caches Summary
Hit latency
– Block size, associativity, number of blocks
Miss penalty
– Overhead, fetch latency, transfer, fill
Miss rate
– 3 C's: compulsory, capacity, conflict
– Determined by locality and cache organization