
ENGS 116 Lecture 12, Slide 1: Caches
Vincent H. Berk
Wednesday, October 29th, 2008
Reading for Friday: Sections C.1 – C.3
Article for Friday: Jouppi
Reading for Monday: Sections C.4 – C.7

ENGS 116 Lecture 12, Slide 2: Who Cares about the Memory Hierarchy?
So far we have discussed only the processor:
– CPU cost/performance, ISA, pipelined execution, ILP
1980: no cache in microprocessors
1995: 2-level caches; 60% of the transistors on the Alpha 21164 are cache
Today: IBM experimenting with main memory on the die
(Figure: the widening CPU-DRAM performance gap)

ENGS 116 Lecture 12, Slide 3: The Motivation for Caches
Motivation:
– Large memories (DRAM) are slow
– Small memories (SRAM) are fast
Make the average access time small by servicing most accesses from a small, fast memory
Reduce the bandwidth required of the large memory
(Diagram: Processor – Cache – Main Memory; the cache and main memory together form the memory system)

ENGS 116 Lecture 12, Slide 4: Principle of Locality of Reference
Programs do not access their data or code all at once or with equal probability
– Rule of thumb: a program spends 90% of its execution time in only 10% of its code
Programs access a small portion of the address space at any one time
Programs tend to reuse data and instructions that they have recently used
Implication of locality: we can predict with reasonable accuracy which instructions and data a program will use in the near future, based on its accesses in the recent past

ENGS 116 Lecture 12, Slide 5: Memory System
(Diagram: the illusion the processor sees, a single large, fast memory, versus the reality, a processor backed by a hierarchy of memories)

ENGS 116 Lecture 12, Slide 6: General Principles
Locality
– Temporal locality: a referenced item will tend to be referenced again soon
– Spatial locality: items near a referenced item will tend to be referenced soon
Locality + "smaller hardware is faster" → memory hierarchy
– Levels: each smaller, faster, and more expensive per byte than the level below
– Inclusive: data found in the top level is also found in the levels below
Definitions
– Upper level is closer to the processor
– Block: the minimum, address-aligned unit that fits in the cache
– Address = block frame address + block offset address
– Hit time: time to access the upper level, including the hit/miss determination

ENGS 116 Lecture 12, Slide 7: Cache Measures
Hit rate: fraction of accesses found in that level
– Usually so high that we talk about the miss rate instead
– Miss rate fallacy: the miss rate alone does not determine average memory performance; the miss penalty matters just as much
Average memory-access time (AMAT) = Hit time + Miss rate × Miss penalty (in ns or clock cycles)
Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU and restart execution
– Access time: time to reach the lower level = f(lower-level latency)
– Transfer time: time to transfer the block = f(bandwidth between upper and lower levels, block size)
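To make the AMAT formula concrete, here is a small worked example with invented numbers (not from the lecture): with a 1-cycle hit time, a 5% miss rate, and a 40-cycle miss penalty,

AMAT = 1 + 0.05 × 40 = 3 clock cycles,

so even a seemingly small miss rate triples the effective memory access time.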

ENGS 116 Lecture 12, Slide 8: Block Size vs. Cache Measures
Increasing block size generally increases the miss penalty
(Graphs vs. block size: miss penalty grows, with transfer time added on top of access time; miss rate first falls, then rises again; average memory access time is therefore minimized at an intermediate block size)

ENGS 116 Lecture 12, Slide 9: Key Points of Memory Hierarchy
Need methods to give the illusion of a large, fast memory
Programs exhibit both temporal locality and spatial locality
– Keep more recently accessed data closer to the processor
– Keep multiple contiguous words together in memory blocks
Use smaller, faster memory close to the processor: hits are processed quickly; misses require access to larger, slower memory
If the hit rate is high, the memory hierarchy has an access time close to that of the highest (fastest) level and a size equal to that of the lowest (largest) level

ENGS 116 Lecture 12, Slide 10: Implications for CPU
Fast hit check, since every memory access needs this check
– The hit is the common case
Unpredictable memory access time
– 10s of clock cycles: wait
– 1000s of clock cycles: (Operating System)
  » Interrupt, switch, and do something else
  » Lightweight alternative: multithreaded execution

ENGS 116 Lecture 12, Slide 11: Four Memory Hierarchy Questions
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)

ENGS 116 Lecture 12, Slide 12: Q1: Where can a block be placed in the cache?
Placing block 12 in an 8-block cache:
– Fully associative, direct mapped, or 2-way set associative
– Set-associative mapping: set = block number modulo number of sets
Fully associative: block 12 can go anywhere
Direct mapped: block 12 can go only into block 4 (12 mod 8)
2-way set associative: block 12 can go anywhere in set 0 (12 mod 4)
(Diagram: the three cache organizations with their block numbers, sets 0 – 3 marked for the set-associative case, and the memory block frame addresses)

ENGS 116 Lecture 12, Slide 13: Direct Mapped Cache
Each memory location is mapped to exactly one location in the cache
The cache location is assigned based on the address of the word in memory
Mapping: (address of block) mod (# of blocks in cache)
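A minimal C sketch of this mapping (the 32-byte block size, 256-block cache, and names are illustrative assumptions, not from the slide):

```c
#include <stdint.h>

#define BLOCK_SIZE 32u   /* bytes per block (assumed)                   */
#define NUM_BLOCKS 256u  /* blocks in the direct-mapped cache (assumed) */

/* The unique cache slot that can hold the block containing addr. */
static uint32_t cache_index(uint32_t addr)
{
    uint32_t block_addr = addr / BLOCK_SIZE;  /* address of the block           */
    return block_addr % NUM_BLOCKS;           /* (block address) mod (# blocks) */
}
```

Because the block count is a power of two, hardware implements the modulo by simply taking the low-order bits of the block address.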

ENGS 116 Lecture 12, Slide 14: Associative Caches
Fully associative: a block can go anywhere in the cache
N-way set associative: a block can go in one of N locations in its set

ENGS 116 Lecture 12, Slide 15: Q2: How is a block found if it is in the cache?
Tag on each block
– No need to check the index or block offset bits
Increasing associativity shrinks the index and expands the tag
– Fully associative: no index
– Direct mapped: large index
Block address layout: | Tag | Index | Block offset |
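A sketch of the tag/index/offset split in C, assuming 32-bit byte addresses and power-of-two sizes (the bit widths and names are illustrative, not from the slide):

```c
#include <stdint.h>

#define OFFSET_BITS 5   /* log2(block size) = log2(32)  (assumed) */
#define INDEX_BITS  7   /* log2(# of sets)  = log2(128) (assumed) */

struct cache_addr {
    uint32_t tag;     /* compared against the stored tag   */
    uint32_t index;   /* selects the set                   */
    uint32_t offset;  /* selects the byte within the block */
};

static struct cache_addr split_address(uint32_t addr)
{
    struct cache_addr a;
    a.offset = addr & ((1u << OFFSET_BITS) - 1);
    a.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    a.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return a;
}
```

A fully associative cache corresponds to INDEX_BITS = 0 (no index, every tag compared); a direct-mapped cache of the same size has the largest index and the smallest tag.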

ENGS 116 Lecture 12, Slide 16: Examples
512-byte cache, 4-way set associative, 16-byte blocks, byte addressable
8-KB cache, 2-way set associative, 32-byte blocks, byte addressable
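Working these out (assuming 32-bit byte addresses, which the slide does not state): the first cache has 512 / 16 = 32 blocks and 32 / 4 = 8 sets, so 4 offset bits, 3 index bits, and 32 − 3 − 4 = 25 tag bits. The second has 8192 / 32 = 256 blocks and 256 / 2 = 128 sets, so 5 offset bits, 7 index bits, and 32 − 7 − 5 = 20 tag bits.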

ENGS 116 Lecture 12, Slide 17: Q3: Which block should be replaced on a miss?
Easy for direct mapped: there is only one candidate
Set associative or fully associative:
– Random (used at large associativities)
– LRU (used at smaller associativities)
– FIFO (used at large associativities)
(Table: data-cache miss rates for 2-way and 4-way associativity under LRU, Random, and FIFO, at cache sizes of 16 KB and larger; the numeric values did not survive extraction)
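A minimal sketch of LRU bookkeeping for one set, using a per-way timestamp (illustrative only; real hardware approximates LRU at higher associativities):

```c
#include <stdint.h>

#define WAYS 4

struct way {
    uint32_t tag;
    int      valid;
    uint64_t last_used;   /* time of most recent access */
};

/* Pick the victim: an invalid way if one exists, else the least recently used. */
static int choose_victim(const struct way set[WAYS])
{
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!set[i].valid)
            return i;                                 /* free slot: use it      */
        if (set[i].last_used < set[victim].last_used)
            victim = i;                               /* older access: evict it */
    }
    return victim;
}
```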

ENGS 116 Lecture 12, Slide 18: Q4: What Happens on a Write?
Write through: the information is written both to the block in the cache and to the block in the lower-level memory.
Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
– Requires tracking whether each block is clean or dirty
Pros and cons of each:
– WT: read misses cannot result in writes to the lower level (replaced blocks are never dirty)
– WB: repeated writes to the same block cost only one write to the lower level
WT is always combined with write buffers so that the CPU does not wait for the lower-level memory
WB can also use a write buffer, giving read misses precedence over writing back dirty data
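A sketch of how the two policies differ on a store hit, in C (the struct layout and helper functions are hypothetical, stubbed here so the sketch is self-contained):

```c
#include <stdint.h>

struct block {
    uint32_t data[8];   /* a 32-byte block as eight words */
    int      dirty;     /* needed only for write back     */
};

/* Hypothetical helpers. */
static void write_word(struct block *b, uint32_t addr, uint32_t val)
{
    b->data[(addr >> 2) & 7] = val;   /* update the word within the block */
}
static void write_buffer_push(uint32_t addr, uint32_t val)
{
    (void)addr; (void)val;            /* would queue the write downstream */
}

/* Write through: update the cache block AND send the write onward;
 * the write buffer lets the CPU continue without waiting.           */
static void store_write_through(struct block *b, uint32_t addr, uint32_t val)
{
    write_word(b, addr, val);
    write_buffer_push(addr, val);
}

/* Write back: update only the cache block and mark it dirty; the
 * lower level sees the new data only when the block is replaced.   */
static void store_write_back(struct block *b, uint32_t addr, uint32_t val)
{
    write_word(b, addr, val);
    b->dirty = 1;
}
```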

ENGS 116 Lecture 12, Slide 19: Example: Data Cache
8-KB direct-mapped data cache with 32-byte blocks
Index = 8 bits: 256 blocks = 8192 / (32 × 1)
(Diagram: the CPU address is split into tag, index, and block offset; the index selects one of 256 entries of valid bit, tag, and data; a comparator (=?) checks the stored tag against the address tag, a mux selects the requested word, and a write buffer sits between the cache and the lower-level memory)
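The hit check on this slide, sketched in C (the array layout and names are illustrative assumptions; 32-bit byte addresses assumed):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 256   /* 8192 bytes / 32-byte blocks */

struct line {
    bool     valid;
    uint32_t tag;
    uint8_t  data[32];
};

static struct line cache[NUM_BLOCKS];

/* Hit if the indexed line is valid and its stored tag matches. */
static bool lookup(uint32_t addr)
{
    uint32_t index = (addr >> 5) & 0xFF;  /* 8 index bits after 5 offset bits */
    uint32_t tag   = addr >> 13;          /* remaining high-order bits        */
    return cache[index].valid && cache[index].tag == tag;
}
```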

ENGS 116 Lecture 12, Slide 20: 2-way Set Associative, Address to Select Word
(Diagram: two sets of address tags and data RAM; the index bits of the CPU address select an entry in each RAM; two comparators (=?) check the valid bit and tag of each way, a 2:1 mux selects the data from the matching way, and a write buffer connects the cache to the lower-level memory)

ENGS 116 Lecture 12, Slide 21: Structural Hazard: Instruction and Data?
(Table: misses per 1000 instructions for a split instruction cache, a split data cache, and a unified cache, at cache sizes from 8 KB up; the numeric values did not survive extraction)
Mix: instructions 74%, data 26%

ENGS 116 Lecture 12, Slide 22: Cache Performance
CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time
Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
(Hit time is counted as part of the CPU execution clock cycles.)

ENGS 116 Lecture 12, Slide 23: Cache Performance
CPU time = IC × (CPI_execution + Memory accesses per instruction × Miss rate × Miss penalty) × Clock cycle time
Misses per instruction = Memory accesses per instruction × Miss rate
CPU time = IC × (CPI_execution + Misses per instruction × Miss penalty) × Clock cycle time
These formulas are conceptual only: modern out-of-order processors hide much of the memory latency through parallelism.
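A worked example with invented numbers (not from the lecture): with CPI_execution = 1.0, 1.5 memory accesses per instruction, a 2% miss rate, and a 100-cycle miss penalty,

Effective CPI = 1.0 + 1.5 × 0.02 × 100 = 4.0,

so three quarters of all cycles are memory stalls, even though only one access in fifty misses.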

ENGS 116 Lecture 12, Slide 24: Summary of Cache Basics
Associativity
Block size (cache line size)
Write back / write through, write buffers, dirty bits
AMAT as a basic performance measure
Larger block size decreases miss rate but can increase miss penalty
The bandwidth of main memory can be increased to transfer cache blocks more efficiently
The memory system can have a significant impact on program execution time; memory stalls can exceed 100 cycles
Faster processors => memory stalls are relatively more costly