Multi-Level Caches Vittorio Zaccaria

Preview
What you have seen:
- Data organization, associativity, cache size
- Policies -- how to manage the data once it's been arranged
What you will see today:
- Multi-level caches to improve performance
- Another layer in the memory hierarchy
  - Permits employing diverse data organizations
  - Permits exploiting diverse policies

Preview (2)
Understanding multi-level caches:
- Why they are used
- Organization tradeoffs
- Policy tradeoffs
Optimizing multi-level cache performance -- L1 vs. L2 diversity:
- Organization
- Policy
Making bandwidth vs. latency tradeoffs:
- Cache pipelining techniques

Multilevel Caches
- Small, fast Level 1 (L1) cache: often on-chip for speed and bandwidth
- Larger, slower Level 2 (L2) cache: closely coupled to the CPU; may be on-chip, or "nearby" on the module

Multilevel cache size examples (Intel)

Processor            L1                  L2
80386                none                sometimes off-chip
80486                8K                  none, or 64K+ off-chip
Pentium              (split) 8K each     256K - 512K off-chip
Pentium Pro          (split) 8K each     256K - 512K on-module
Pentium II           (split) 16K each    512K on-module
Pentium III          (split) 16K each    256K - 512K on-module
Pentium III Celeron  (split) 16K each    128K on-chip
Pentium IV           12K I$, 4K D$       256K on-chip

Why an L2 is slower than L1
- Increased size implies a longer critical path:
  - Line length (capacitive delay) grows as the square root of the memory array size
  - Addressing & data multiplexing grow as n log n with array size
- Off-chip access is slower than on-chip access
- Off-chip access is narrower than on-chip access (less bandwidth): pins cost money
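A quick way to see the size effect is to plug the two growth rates above into a comparison of an L1-sized and an L2-sized array. This is only a sketch of the scaling trends named on the slide (the 16 KB and 1 MB sizes are assumed, and only growth factors are shown, not real delays):

```c
/* Sketch: growth of the two delay terms named above when going from a
 * 16 KB (L1-sized) array to a 1 MB (L2-sized) array. Only the growth
 * factors are shown; absolute delays depend on technology constants. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double small = 16.0 * 1024.0;    /* assumed L1-sized array, bytes */
    double large = 1024.0 * 1024.0;  /* assumed L2-sized array, bytes */

    double wire_growth  = sqrt(large) / sqrt(small);   /* bit/word line length */
    double logic_growth = (large * log2(large))
                        / (small * log2(small));       /* decode + data mux */

    printf("wire-delay term grows  ~%.1fx\n", wire_growth);   /* ~8x  */
    printf("decode/mux term grows  ~%.1fx\n", logic_growth);  /* ~91x */
    return 0;
}
```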

Two-Level Miss Rates
Local miss rate: misses in this cache / accesses to this cache
- L1 cache => P_miss1
- L2 cache => P_miss2
- Useful for optimizing a particular level of cache given an otherwise fixed design
Global miss rate: misses in this cache / accesses from the CPU
- L1 cache => P_miss1
- L2 cache => P_miss1 * P_miss2 (only L1 misses are seen by L2, which has a local miss ratio of P_miss2)
- Good for measuring the traffic that will be seen by the next level down in the memory hierarchy
(A small numeric sketch follows below.)
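The following sketch computes both kinds of miss rate from raw event counts; the access and miss counts are hypothetical, chosen only to make the arithmetic concrete:

```c
/* Sketch (hypothetical counts): local vs. global miss rates for a
 * two-level cache, following the definitions on the slide above. */
#include <stdio.h>

int main(void) {
    double cpu_accesses = 1000000.0;  /* accesses issued by the CPU (assumed) */
    double l1_misses    = 40000.0;    /* assumed: 4% local L1 miss rate */
    double l2_misses    = 10000.0;    /* assumed: misses among L1-miss traffic */

    double p_miss1_local  = l1_misses / cpu_accesses;  /* L1 local = L1 global */
    double p_miss2_local  = l2_misses / l1_misses;     /* L2 local miss rate */
    double p_miss2_global = l2_misses / cpu_accesses;  /* = P_miss1 * P_miss2 */

    printf("L1 local/global miss rate: %.3f\n", p_miss1_local);
    printf("L2 local miss rate:        %.3f\n", p_miss2_local);
    printf("L2 global miss rate:       %.3f\n", p_miss2_global);
    return 0;
}
```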

Evaluating Multi-Level Miss Rates
- Use global miss rates when evaluating the traffic filtering of 2-level caches
- Effectiveness of the L2 strategy depends on which L1 strategy is used
  - Changing the L1 strategy may require changing the L2 strategy as well
- The global L2 miss rate equals the effective composite cache miss rate
Sequential forward model (using local miss rates):
  t_ea = t_hit(L1) + P_miss1 * t_hit(L2) + P_miss1 * P_miss2 * t_transport
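Plugging some assumed timings into the sequential forward model makes the formula concrete (all cycle counts below are illustrative, not taken from the slides):

```c
/* Sketch (hypothetical timings): effective access time from the
 * sequential forward model above. */
#include <stdio.h>

int main(void) {
    double t_hit_l1    = 1.0;    /* cycles, assumed */
    double t_hit_l2    = 10.0;   /* cycles, assumed */
    double t_transport = 100.0;  /* cycles to main memory, assumed */
    double p_miss1     = 0.04;   /* local L1 miss rate, assumed */
    double p_miss2     = 0.25;   /* local L2 miss rate, assumed */

    double t_ea = t_hit_l1
                + p_miss1 * t_hit_l2
                + p_miss1 * p_miss2 * t_transport;

    printf("effective access time: %.2f cycles\n", t_ea);  /* 1 + 0.4 + 1.0 = 2.40 */
    return 0;
}
```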

Diversity Motivation
L1 and L2 should have differences to improve overall performance:
- Small, fast, relatively inflexible L1
- Larger, slower, relatively flexible L2
Issues:
- Split vs. unified & bandwidth vs. flexibility
- Write-through vs. write-back & write allocation
- Block size & latency vs. bandwidth
- Associativity vs. cycle time

Split vs. Unified
Split caches give bandwidth; unified caches give flexibility. Use a split L1 combined with a unified L2 for good aggregate performance.
Split L1 cache advantages:
- Can provide simultaneous data & instruction access -- high bandwidth
- Reduces hit rate, but not catastrophically if an L2 cache is available to keep miss penalties low
Unified L2 cache advantages:
- Reduces pin & package count -- only one path needed to an off-chip L2
- Can be used for I-cache/D-cache coherence (invalidate the I-cache line on modification)
- Reduces the brittleness of assuming half of the memory used is instructions
  - Some working sets are mostly data, some are mostly instructions

Write Policies
Write through? Write allocation? L1: write-through + no-write-allocate; L2: write-back + write-allocate (sketched in code below).
L1 cache: advantages of write-through + no-write-allocate:
- Simpler control
- No stalls for evicting dirty data on an L1 miss with an L2 hit
- Avoids L1 cache pollution with results that aren't read for a long time
- Avoids problems with coherence (L2 always has the modified L1 contents)
L2 cache: advantages of write-back + write-allocate:
- Typically reduces overall bus traffic by "catching" all the L1 write-through traffic
- Able to capture the temporal locality of infrequently written memory locations
- Provides a safety net for programs where write-allocate helps a lot
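The sketch below traces the policy decisions for a single CPU store under this combination. It is not any specific processor's implementation -- cache state is omitted and the hit/miss outcomes are simply passed in as assumptions:

```c
/* Sketch (no real cache state): policy decisions for one CPU store with
 * L1 write-through + no-write-allocate and L2 write-back + write-allocate. */
#include <stdbool.h>
#include <stdio.h>

static void cpu_store(bool l1_hit, bool l2_hit) {
    /* L1: write-through + no-write-allocate */
    if (l1_hit)
        printf("update the L1 line\n");
    else
        printf("L1 store miss: do NOT allocate a line in L1\n");
    printf("write-through: forward the store to L2\n");

    /* L2: write-back + write-allocate */
    if (!l2_hit)
        printf("L2 miss: allocate (fill) the line in L2\n");
    printf("update the L2 line and mark it dirty\n");
    printf("main memory is updated only when the dirty L2 line is evicted\n");
}

int main(void) {
    cpu_store(false, false);  /* e.g., a store that misses in both levels */
    return 0;
}
```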

Block Size
Balancing miss ratio vs. traffic ratio, and latency vs. bandwidth (a refill-time sketch follows below).
Smaller L1 cache sectors & blocks:
- Smaller blocks reduce conflict/capacity misses
- Smaller blocks reduce the time to refill a cache block (which may reduce CPU stalls due to the cache being busy during a refill)
- But blocks should still be wider than 32 bits, to allow direct access to long floats
Larger L2 cache blocks:
- Larger blocks create fewer conflict problems when the cache is large
- Main memory has a large latency on an L2 miss, so there is a proportionally lower cost to refill a larger cache block once the memory transfer has started
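A simple refill-time model (fixed memory latency plus block size divided by bus bandwidth) shows why larger blocks are proportionally cheaper to fetch; the latency and bandwidth numbers below are assumed for illustration:

```c
/* Sketch (hypothetical memory parameters): block refill time modeled as
 * fixed latency + block_size / bandwidth. */
#include <stdio.h>

int main(void) {
    double latency_cycles  = 100.0;  /* assumed latency to main memory */
    double bytes_per_cycle = 8.0;    /* assumed memory bus bandwidth */
    int    block_sizes[]   = {32, 64, 128};

    for (int i = 0; i < 3; i++) {
        double refill = latency_cycles + block_sizes[i] / bytes_per_cycle;
        printf("%3d-byte block: refill = %.0f cycles (%.2f cycles/byte)\n",
               block_sizes[i], refill, refill / block_sizes[i]);
    }
    return 0;
}
```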

Associativity
Balance complexity, speed, and efficiency (see the address-split sketch below).
L1 -- no clear winner:
- A direct-mapped L1 gives a faster cycle time, but a lower hit rate on an already small cache
- A set-associative L1 gives a slower cycle time, but a better hit rate
L2 -- no clear winner:
- A direct-mapped L2 minimizes pin & package count for the cache:
  - Only one tag need be fetched
  - No problem with multiplexing multiple data words based on the tag match
  - Set associativity is less advantageous for really large caches
- A set-associative L2 gives flexibility:
  - Less brittle to degenerate cases with data structures mapped to the same location
  - The associativity time penalty is less of an issue for an L2 cache than for an L1 cache (a smaller percentage of the total miss delay)
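One way to see what associativity changes structurally is to look at how an address splits into tag, index, and offset for the same capacity at different associativities. The cache geometry and 32-bit address width below are assumptions for illustration:

```c
/* Sketch (assumed 256 KB cache, 64-byte blocks, 32-bit addresses):
 * more ways means fewer sets, so fewer index bits and more tag bits. */
#include <stdio.h>

static void split_address(unsigned cache_bytes, unsigned block_bytes,
                          unsigned ways) {
    unsigned sets        = cache_bytes / (block_bytes * ways);
    unsigned offset_bits = 0, index_bits = 0;
    while ((1u << offset_bits) < block_bytes) offset_bits++;
    while ((1u << index_bits) < sets) index_bits++;
    printf("%u-way: %4u sets, %u index bits, %u offset bits, %u tag bits\n",
           ways, sets, index_bits, offset_bits, 32 - index_bits - offset_bits);
}

int main(void) {
    split_address(256 * 1024, 64, 1);  /* direct mapped */
    split_address(256 * 1024, 64, 4);  /* 4-way set associative */
    return 0;
}
```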

Multi-Level Inclusion
- Complete inclusion means all elements in the highest level of the memory hierarchy are present in the lower levels (also called the "subset property")
  - For example, everything in L1 is also in the L2 cache
- Useful for I/O-cache and multiprocessor-cache coherence: only the lowest cache level has to be checked
- Inclusion requires (see the check sketched below):
  - Number of L2 sets >= number of L1 sets
  - L2 associativity >= L1 associativity
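The two structural conditions are easy to check mechanically once the cache geometries are known. The L1/L2 geometries below are assumed examples (equal block sizes are also assumed):

```c
/* Sketch: check the two inclusion conditions listed above for assumed
 * L1/L2 geometries. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    unsigned size_bytes;
    unsigned block_bytes;
    unsigned ways;
} cache_geom_t;

static unsigned num_sets(cache_geom_t c) {
    return c.size_bytes / (c.block_bytes * c.ways);
}

static bool inclusion_possible(cache_geom_t l1, cache_geom_t l2) {
    return num_sets(l2) >= num_sets(l1) && l2.ways >= l1.ways;
}

int main(void) {
    cache_geom_t l1 = {32 * 1024, 32, 2};   /* assumed: 32 KB, 2-way, 32 B blocks */
    cache_geom_t l2 = {512 * 1024, 32, 2};  /* assumed: 512 KB, 2-way, 32 B blocks */
    printf("inclusion conditions hold: %s\n",
           inclusion_possible(l1, l2) ? "yes" : "no");
    return 0;
}
```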

L1 vs. L2 Tradeoff Examples

Pentium Pro     L1                        L2
Size            16KB                      256KB - 512KB
Organization    split (8KB + 8KB)         unified
Write policies  programmable              programmable
Block size      32 bytes                  32 bytes
Associativity   D: 2-way; I: 4-way        4-way

MIPS R10000     L1                        L2
Size            64KB                      512KB - 16MB
Organization    split (32KB + 32KB)       unified
Write policies  write back                write back
Block size      D: 32 bytes; I: 64 bytes  64 or 128 bytes
Associativity   2-way                     2-way

Cache Bandwidth vs. Latency Tradeoffs
Pipelined caches:
- Multiple concurrent access operations give increased throughput
- But: can slightly increase latency for read accesses
- Risk of stalls for reads following writes
Block size vs. transfer size:
- Large blocks increase the required memory bandwidth & refill latency
- But large blocks decrease the miss ratio

Generalized Cache Access Pipelining
Example: possible stages for a 3-stage direct-mapped pipelined cache:
- Stage 1: (1) decode address; (2) drive word lines
- Stage 2: (3) read & latch memory row; (4) select data word; precharge bit lines for the next access
- Stage 3: (5) check tag for a match while speculatively using the data; (6) abort use of the data on a tag mismatch

MIPS R4000 Pipelined Cache
- Three-clock cache pipeline: address decode, fetch, tag compare
- A new cache access can be started each clock cycle
- The store buffer uses idle cache cycles to deposit written data
- Blocking cache -- a tag mismatch causes a stall for the cache miss
- Split caches for bandwidth; both direct mapped

Pipelined Cache Tradeoffs
Increases latency:
- It takes 3 clocks until an L1 cache miss is declared!
- 2-clock latency from the Load instruction until data is available at the ALU:
  - 1 clock for the ALU to do the address arithmetic
  - 2 clocks for the D-cache read (assuming the result is forwarded to the ALU before the register write)
Increases throughput (see the sketch below):
- Up to 3x improvement in clock speed if cache + tag check was the critical path
- Increasingly useful as larger, slower L1 caches are used
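The numbers below illustrate the same tradeoff with an assumed 3 ns cache access split into three stages and an assumed 0.2 ns of latch overhead per stage; they are not taken from any real design:

```c
/* Sketch (illustrative numbers only): latency vs. throughput when a
 * 3 ns cache access is pipelined into three stages. */
#include <stdio.h>

int main(void) {
    double access_ns = 3.0;   /* assumed: cache access limits the clock */
    int    stages    = 3;
    double latch_ns  = 0.2;   /* assumed per-stage latch overhead */
    double new_cycle = access_ns / stages + latch_ns;

    printf("clock speedup:    %.2fx\n", access_ns / new_cycle);  /* 2.50x */
    printf("access latency:   1 cycle -> %d cycles\n", stages);
    printf("absolute latency: %.1f ns -> %.1f ns\n",
           access_ns, stages * new_cycle);                       /* 3.0 -> 3.6 */
    return 0;
}
```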