
CS152 – Computer Architecture and Engineering
Lecture 13 – Fastest Cache Ever!
14 October 2003, Kurt Meinz, www-inst.eecs.berkeley.edu/~cs152/

Review

SDRAM/SRAM:
–Clocks are good; handshaking is bad! (From a latency perspective.)

4 types of cache misses:
–Compulsory
–Capacity
–Conflict
–(Coherence)

4 questions of cache design:
–Placement
–Replacement
–Identification (Sorta determined by placement…)
–Write strategy

Recap: Measuring Cache Performance

CPU time = Clock cycle time x (CPU execution clock cycles + Memory stall clock cycles)
–Memory stall clock cycles = Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty
–More simply: Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty

AMAT = Hit Time + (Miss Rate x Miss Penalty)

Note: memory hit time is included in execution cycles.
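These formulas are easy to sanity-check numerically. A minimal Python sketch, with hypothetical example numbers (1-cycle hit, 5% miss rate, 50-cycle penalty), not figures from the lecture:

```python
# Minimal sketch of the formulas above; the example numbers are made up.

def memory_stall_cycles(accesses, miss_rate, miss_penalty):
    # Memory stall clock cycles = memory accesses x miss rate x miss penalty
    return accesses * miss_rate * miss_penalty

def amat(hit_time, miss_rate, miss_penalty):
    # AMAT = hit time + miss rate x miss penalty
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 50))                         # 3.5 cycles per access
print(memory_stall_cycles(1_000_000, 0.05, 50))  # 2,500,000 stall cycles
```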

How Do You Design a Memory System?

Set of operations that must be supported:
–read: Data <= Mem[Physical Address]
–write: Mem[Physical Address] <= Data

Then: determine the internal register transfers, design the datapath, and design the cache controller.

[Figure: the memory "black box" takes a physical address, read/write, and data, and asserts a wait signal; inside are a cache datapath (tag-data storage, muxes, comparators, …) and a cache controller driving its control points and reading its signals.]

Improving Cache Performance: 3 General Options

Time = IC x CT x (ideal CPI + memory stalls)
Average Memory Access Time (AMAT) = Hit Time + (Miss Rate x Miss Penalty) = (Hit Rate x Hit Time) + (Miss Rate x Miss Time)

Options to reduce AMAT:
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Improving Cache Performance

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

1. Reduce Misses via Larger Block Size (61c)

2. Reduce Misses via Higher Associativity (61c)

2:1 Cache Rule:
–Miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way set-associative cache of size N/2

Beware: execution time is the only final measure!
–Will clock cycle time increase?
–Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%
–Example …

Example: Avg. Memory Access Time vs. Miss Rate

Assume CCT = 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, relative to the CCT of a direct-mapped cache.

[Table: AMAT for each cache size (KB) and associativity (1-way, 2-way, 4-way, 8-way); the numeric entries did not survive transcription. Red entries marked configurations where AMAT is not improved by more associativity.]
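Even with the table's numbers lost, the shape of the trade-off can be sketched: higher associativity cuts the miss rate but stretches the clock, so AMAT does not always improve. A Python sketch using the CCT factors above; the miss rates and timings are illustrative stand-ins, not the table's values:

```python
# Clock-cycle-time factors from the slide; everything else is made up.
CCT = {1: 1.00, 2: 1.10, 4: 1.12, 8: 1.14}

def amat_ns(assoc, hit_time_ns, miss_rate, miss_penalty_ns):
    # The hit time scales with the slower clock; the memory-side
    # miss penalty (in ns) does not.
    return CCT[assoc] * hit_time_ns + miss_rate * miss_penalty_ns

for assoc, miss_rate in [(1, 0.098), (2, 0.076), (4, 0.071), (8, 0.068)]:
    print(f"{assoc}-way: AMAT = {amat_ns(assoc, 1.0, miss_rate, 25.0):.2f} ns")
```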

3) Reduce Misses: Unified Cache

Unified I&D cache miss rates:
–16KB I&D: I = 0.64%, D = 6.47%
–32KB unified: miss rate = 1.99%

Does this mean unified is better?

[Figure: split organization (Proc with I-Cache-1 and D-Cache-1 over Unified Cache-2) vs. unified organization (Proc with Unified Cache-1 over Unified Cache-2).]

Unified Cache

Which is faster?
–Assume 33% of instructions are data ops, so 75% of memory accesses are instruction fetches
–Hit time = 1 cycle; miss penalty = 50 cycles
–A data hit stalls one cycle in the unified cache (it has only 1 port)

In terms of {Miss rate, AMAT}:
1) {U<S, U<S}
2) {U<S, S<U}
3) {S<U, U<S}
4) {S<U, S<U}

Unified Cache

Miss rate:
–Unified: 1.99%
–Separate: 0.64% x 0.75 + 6.47% x 0.25 = 2.1%

AMAT:
–Separate = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
–Unified = 75% x (1 + 1.99% x 50) + 25% x (2 + 1.99% x 50) = 2.24

So the answer is 2) {U<S, S<U}: the unified cache has the lower miss rate, but the split cache has the lower AMAT.
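A quick Python check of this arithmetic, under the same assumptions:

```python
# 75% instruction / 25% data accesses, 1-cycle hit, 50-cycle miss
# penalty; the unified cache's single port adds one stall cycle to
# every data access.
PENALTY, I, D = 50, 0.75, 0.25

miss_split = 0.0064 * I + 0.0647 * D                       # ~0.021 (2.1%)
amat_split = I * (1 + 0.0064 * PENALTY) + D * (1 + 0.0647 * PENALTY)   # ~2.05

miss_unified = 0.0199                                      # 1.99%
amat_unified = (I * (1 + miss_unified * PENALTY)
                + D * (2 + miss_unified * PENALTY))        # ~2.24

print(f"split:   miss {miss_split:.4f}, AMAT {amat_split:.2f}")
print(f"unified: miss {miss_unified:.4f}, AMAT {amat_unified:.2f}")
```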

3. Reducing Misses via a "Victim Cache" (New!)

How to combine the fast hit time of direct mapped yet still avoid conflict misses? Add a small buffer that holds data discarded from the cache.

Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache. Used in Alpha and HP machines.

[Figure: a direct-mapped cache (TAGS and DATA arrays) next to a small fully associative victim cache, one tag and comparator per cache line, connected to the next lower level in the hierarchy.]
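As a rough behavioral model (not Jouppi's actual hardware), a direct-mapped cache backed by a small fully associative victim buffer might look like this in Python:

```python
from collections import OrderedDict

class DirectMappedWithVictim:
    def __init__(self, n_sets, victim_entries=4):
        self.n_sets = n_sets
        self.main = {}                       # index -> tag (direct mapped)
        self.victim = OrderedDict()          # (index, tag) keys, LRU order
        self.victim_entries = victim_entries

    def access(self, block_addr):
        index, tag = block_addr % self.n_sets, block_addr // self.n_sets
        if self.main.get(index) == tag:
            return "hit"
        self._evict_to_victim(index)         # make room in the main cache
        self.main[index] = tag
        if self.victim.pop((index, tag), None):
            return "victim hit"              # conflict miss caught by buffer
        return "miss"                        # true miss: fetch from memory

    def _evict_to_victim(self, index):
        if index in self.main:
            self.victim[(index, self.main[index])] = True
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)  # drop the LRU victim entry

c = DirectMappedWithVictim(n_sets=4)
# Two blocks that collide in the same set ping-pong without a victim
# cache; here the repeats become victim hits instead of misses.
print([c.access(a) for a in (0, 4, 0, 4)])  # miss, miss, victim hit, victim hit
```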

4. Reducing Misses by Hardware Prefetching

E.g., instruction prefetching:
–The Alpha fetches 2 blocks on a miss
–The extra block is placed in a "stream buffer"
–On a miss, check the stream buffer

Works with data blocks too:
–Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4KB cache; 4 streams caught 43%
–Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64KB, 4-way set-associative caches

Prefetching relies on having extra memory bandwidth that can be used without penalty; it could reduce performance if done indiscriminately!
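A behavioral sketch of a single sequential stream buffer, assuming it is checked only after a cache miss; the depth and interface are invented for illustration:

```python
from collections import deque

class StreamBuffer:
    def __init__(self, depth=4):
        self.depth = depth
        self.buf = deque()                   # prefetched block addresses

    def check_on_miss(self, block):
        if self.buf and self.buf[0] == block:
            self.buf.popleft()               # supply the block to the cache
            self.buf.append((self.buf[-1] if self.buf else block) + 1)
            return True                      # miss serviced by the buffer
        # Buffer missed too: restart the stream past the missed block.
        self.buf = deque(range(block + 1, block + 1 + self.depth))
        return False                         # fetch 'block' from memory

sb = StreamBuffer()
# Sequential misses hit in the buffer; a jump restarts the stream.
print([sb.check_on_miss(b) for b in (10, 11, 12, 40)])  # False, True, True, False
```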

Improving Cache Performance (Continued)

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

0. Reducing Penalty: Faster DRAM / Interface

New DRAM technologies:
–Synchronous DRAM
–Double Data Rate (DDR) SDRAM
–RAMBUS: same initial latency, but much higher bandwidth

Better bus interfaces.

CRAY technique: only use SRAM!

1. Add a (Lower) Level in the Hierarchy

Before: Processor → Cache → DRAM
After: Processor → Cache → Cache → DRAM

2. Early Restart and Critical Word First

Don't wait for the full block to be loaded before restarting the CPU:
–Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
–Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch or requested word first.
–The DRAM for Lab 5 can do this in burst mode! (Check out the sequential timing.)

Generally useful only for large blocks. Spatial locality is the catch: the program tends to want the next sequential word anyway, so it is not clear how much early restart helps.
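A tiny helper makes the wrapped-fetch order concrete, assuming a hypothetical 8-word block:

```python
# Critical word first: the missed word returns first, then the burst
# wraps around the rest of the block.

def wrapped_fetch_order(block_words, requested_word):
    return [(requested_word + i) % block_words for i in range(block_words)]

print(wrapped_fetch_order(8, 5))   # [5, 6, 7, 0, 1, 2, 3, 4]
```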

3. Reduce Penalty: Non-blocking Caches

A non-blocking (lockup-free) cache allows the data cache to continue supplying hits during a miss:
–Requires Full/Empty bits on registers or out-of-order execution
–Requires multi-bank memories

"Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests.

"Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses:
–Significantly increases the complexity of the cache controller, since there can be multiple outstanding memory accesses
–Requires multiple memory banks (otherwise it cannot be supported)
–The Pentium Pro allows 4 outstanding memory misses

What Happens on a Cache Miss?

For an in-order pipeline, 2 options:

–Freeze the pipeline in the Mem stage (popular early on: SPARC, R4000):
  IF ID EX Mem stall stall stall … stall Mem Wr
     IF ID EX stall stall stall … stall stall Ex Wr

–Use Full/Empty bits in the registers plus an MSHR queue:
  –MSHR = "Miss Status/Handler Registers" (Kroft). Each entry in this queue tracks the status of one outstanding memory request to one complete memory line:
    –Per cache line: info about the memory address.
    –Per word: the register (if any) that is waiting for the result.
    –Used to "merge" multiple requests to one memory line.
  –A new load allocates an MSHR entry and sets its destination register to "Empty"; the load is then "released" from stalling the pipeline. An attempt to use the register before the result returns causes the instruction to block in the decode stage.
  –This gives limited "out-of-order" execution with respect to loads; popular with in-order superscalar architectures.

–Out-of-order pipelines already have this functionality built in (load queues, etc.).
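A minimal sketch of an MSHR file along these lines; the structure, sizes, and method names are invented for illustration:

```python
class MSHRFile:
    def __init__(self, max_entries=4):
        self.max_entries = max_entries
        self.entries = {}        # line address -> {word offset: [dest regs]}

    def allocate(self, line_addr, word, dest_reg):
        if line_addr not in self.entries and len(self.entries) == self.max_entries:
            return "stall"       # no free MSHR: the pipeline must stall
        is_new = line_addr not in self.entries
        self.entries.setdefault(line_addr, {}).setdefault(word, []).append(dest_reg)
        # Only a brand-new entry issues a memory request; later loads to
        # the same line are merged into the existing entry.
        return "issue" if is_new else "merge"

    def fill(self, line_addr, line_data):
        # The line came back from memory: write every waiting register
        # and mark it Full so blocked instructions can proceed.
        for word, regs in self.entries.pop(line_addr, {}).items():
            for reg in regs:
                print(f"r{reg} <- {line_data[word]} (mark Full)")

m = MSHRFile()
print(m.allocate(0x100, 0, 5))   # issue  (new line: go to memory)
print(m.allocate(0x100, 2, 7))   # merge  (same line: no new request)
m.fill(0x100, [11, 22, 33, 44])  # r5 <- 11, r7 <- 33
```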

Value of Hit Under Miss for SPEC

8 KB data cache, direct mapped, 32B blocks, 16-cycle miss penalty. "Hit under n misses" for n = 0 → 1 → 2 → 64:
–FP programs on average: AMAT = 0.68 → 0.52 → 0.34 → 0.26
–Int programs on average: AMAT = 0.24 → 0.20 → 0.19 → 0.19

[Figure: per-benchmark bar chart of AMAT for integer and floating-point SPEC programs under hit-under-0/1/2/64 misses.]

Improving Cache Performance (Continued)

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

1. Add a (Higher) Level in the Hierarchy (61c)

Before: Processor → Cache → DRAM
After: Processor → Cache → Cache → DRAM (a new, smaller cache sits closest to the processor)

2: Pipelining the Cache! (new!)

Cache accesses now take multiple clocks:
–1 to start the access, X (> 0) to finish
–The Pentium III uses 2 stages; the Pentium 4 takes 4
–This increases hit bandwidth, not latency!

[Figure: four overlapped fetch stages, IF1 IF2 IF3 IF4.]

3: Way Prediction (new!)

Remember: associativity negatively impacts hit time. We can recover some of that time by pre-selecting one of the ways.

Every block in the cache has a field that says which way of the set to try on the next access; the output mux is pre-selected from that field.
–Guess right: avoid the mux propagation time
–Guess wrong: recover and choose the other way; costs you a cycle or two.

3: Way Prediction (new!)

Does it work?
–Blind guessing is right 50% of the time (2-way); intelligent algorithms can be right ~85%
–Must be able to recover quickly!
–On the Alpha 21264:
  Guess right: I-cache latency 1 cycle
  Guess wrong: I-cache latency 3 cycles
(Presumably, without way prediction the clock period or the number of cycles per hit would have to go up.)
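A sketch of the mechanism with the 21264-style latencies plugged in; the structure itself is illustrative, not the 21264's:

```python
class WayPredictedCache:
    def __init__(self, n_sets, n_ways=2):
        self.tags = [[None] * n_ways for _ in range(n_sets)]
        self.pred = [0] * n_sets                 # predicted way per set

    def hit_latency(self, index, tag):
        guess = self.pred[index]
        if self.tags[index][guess] == tag:
            return 1                             # right guess: fast hit
        for way, t in enumerate(self.tags[index]):
            if t == tag:
                self.pred[index] = way           # retrain the predictor
                return 3                         # wrong guess: slow hit
        return None                              # real miss: next level

c = WayPredictedCache(n_sets=2)
c.tags[0] = [10, 20]
# First access to way 1 mispredicts (3 cycles); the retrained
# predictor then hits in 1 cycle.
print(c.hit_latency(0, 10), c.hit_latency(0, 20), c.hit_latency(0, 20))  # 1 3 1
```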

PRS: Load Prediction (new!)

Load-value prediction:
–Keep a small table of recent load instruction addresses, the data values they returned, and confidence indicators.
–On a load, look in the table. If a value exists and the confidence is high enough, use that value. Meanwhile, do the cache access …
  –If the guess was correct: increase the confidence and keep going.
  –If the guess was incorrect: quash the pipe and restart with the correct value.
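A minimal sketch of such a prediction table; the threshold, counter width, and indexing-by-PC scheme are arbitrary illustrative choices:

```python
class LoadValuePredictor:
    THRESHOLD, MAX_CONF = 2, 3        # arbitrary saturating-counter params

    def __init__(self):
        self.table = {}               # load PC -> (last value, confidence)

    def predict(self, pc):
        value, conf = self.table.get(pc, (None, 0))
        return value if conf >= self.THRESHOLD else None   # None = no guess

    def update(self, pc, actual):     # called when the real load completes
        value, conf = self.table.get(pc, (None, 0))
        if value == actual:
            self.table[pc] = (actual, min(conf + 1, self.MAX_CONF))
        else:
            self.table[pc] = (actual, 0)   # mispredict/new: reset confidence

p = LoadValuePredictor()
for v in (7, 7, 7):                   # same value three times in a row
    p.update(0x40, v)
print(p.predict(0x40))                # 7, once confidence reaches threshold
```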

PRS: Load Prediction

So, will it work? If so, what factor will it improve? If not, why not?
1. No way! – There is no such thing as data locality!
2. No way! – Load-value mispredictions are too expensive!
3. Oh yeah! – Load prediction will decrease hit time
4. Oh yeah! – Load prediction will decrease the miss penalty
5. Oh yeah! – Load prediction will decrease miss rates
6) 1 and 2
7) 3 and 4
8) 4 and 5
9) 3 and 5
10) None!

Load Prediction

In integer programs, a load has about a 50% chance of returning the same value it did the last time it executed! [Lipasti, Wilkerson and Shen; 1996]

Quashing the pipe is a (relatively) cheap operation: you'd have to wait for the load anyway!

Memory Summary (1/3)

Two different types of locality:
–Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
–Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.

SRAM is fast but expensive and not very dense:
–6-transistor cell (no static current) or 4-transistor cell (static current)
–Does not need to be refreshed
–Good choice for providing the user with FAST access time
–Typically used for CACHE

DRAM is slow but cheap and dense:
–1-transistor cell (plus trench capacitor)
–Must be refreshed
–Good choice for presenting the user with a BIG memory system
–Both asynchronous and synchronous versions
–Limited signal requires "sense amplifiers" to recover the stored value

Memory Summary (2/3)

The Principle of Locality: a program is likely to access a relatively small portion of the address space at any instant of time.
–Temporal locality: locality in time
–Spatial locality: locality in space

Three (+1) major categories of cache misses:
–Compulsory misses: sad facts of life. Example: cold-start misses.
–Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping-pong effect!
–Capacity misses: increase cache size.
–Coherence misses: caused by external processors or I/O devices.

Cache design space:
–Total size, block size, associativity
–Replacement policy
–Write-hit policy (write-through, write-back)
–Write-miss policy

Memory Summary (3/3): The Cache Design Space

Several interacting dimensions:
–Cache size
–Block size
–Associativity
–Replacement policy
–Write-through vs. write-back
–Write allocation

The optimal choice is a compromise:
–Depends on access characteristics: workload, use (I-cache, D-cache, TLB)
–Depends on technology / cost

Simplicity often wins.

[Figure: the design space sketched as "Good"/"Bad" trade-off curves (Factor A vs. Factor B, "Less" to "More") over the axes cache size, block size, and associativity.]