Cache - Optimization.


Performance

CPU time = (CPU execution cycles + memory-stall cycles) x cycle time
Memory-stall cycles = read-stall cycles + write-stall cycles
- Read-stall cycles = #reads x read miss rate x read miss penalty
- Write-stall cycles = (#writes x write miss rate x write miss penalty) + write-buffer stalls (assuming a write-through cache)

To simplify the calculation, assume:
- In most write-through caches, read and write miss penalties are the same
- Write-buffer stalls are negligible

Then:
Memory-stall cycles = #memory accesses x miss rate x miss penalty = #misses x miss penalty
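The simplified stall formula above can be evaluated directly; a minimal sketch (the access count and miss rate below are hypothetical, not from the slides):

```python
def memory_stall_cycles(accesses, miss_rate, miss_penalty):
    """Memory-stall cycles = #memory accesses x miss rate x miss penalty."""
    return accesses * miss_rate * miss_penalty

# Hypothetical example: 1,000,000 accesses, 3% miss rate, 100-cycle penalty.
print(memory_stall_cycles(1_000_000, 0.03, 100))  # 3000000.0
```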

Cache Performance Example (SPECInt2000)

- Miss rates = 2% (I-cache), 4% (D-cache)
- CPI = 2 (without memory stalls)
- Miss penalty = 100 cycles
- Memory accesses = 36% of instructions

Speedup of a perfect cache over the above (let I be the instruction count):
- CPU time (cycles) = I x 2 CPI + memory-stall cycles
- Memory-stall cycles = I x 2% miss rate x 100 cycles + I x 36% x 4% miss rate x 100 cycles = 3.44 x I
- Thus, CPU time = 2 x I + 3.44 x I = 5.44 x I

Perfect cache:
- CPU time (cycles) = I x 2 CPI + 0 cycles = 2 x I

Speedup = 5.44 / 2 = 2.72
A perfect cache would make the system almost three times faster!
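The slide's arithmetic can be reproduced in a few lines, using the per-instruction cycle counts (so the instruction count I cancels out):

```python
cpi_base = 2            # CPI without memory stalls
i_miss, d_miss = 0.02, 0.04   # I-cache and D-cache miss rates
mem_frac = 0.36         # fraction of instructions that access data memory
penalty = 100           # miss penalty in cycles

# Stall cycles per instruction: instruction fetches + data accesses.
stall_per_inst = i_miss * penalty + mem_frac * d_miss * penalty  # 2 + 1.44 = 3.44

cycles_real = cpi_base + stall_per_inst   # 5.44 cycles per instruction
cycles_perfect = cpi_base                 # 2 cycles per instruction

print(round(cycles_real / cycles_perfect, 2))  # 2.72
```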

Performance on an Increased Clock Rate

Doubling the clock rate in the previous example:
- Cycle time decreases to half
- But the miss penalty increases from 100 to 200 cycles

Speedup of the fast-clock system:
- CPU time of the previous (slow) system, with C the cycle time of the slow clock:
  CPU time (slow clock) = 5.44 x I x C (seconds)
- CPU time of the fast-clock system:
  CPU time (cycles) = I x 2 CPI + memory-stall cycles
  Memory-stall cycles = I x 2% miss rate x 200 cycles + I x 36% x 4% miss rate x 200 cycles = 6.88 x I
  CPU time (fast clock) = (2 x I + 6.88 x I) x (C x 1/2) = 4.44 x I x C (seconds)
- Speedup = 5.44 / 4.44 = 1.23

Not twice as fast as the slow-clock system!
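The comparison above can be checked by expressing both execution times in units of the slow cycle time C:

```python
cpi_base, i_miss, d_miss, mem_frac = 2, 0.02, 0.04, 0.36

def cycles_per_inst(penalty):
    """CPI including memory stalls for a given miss penalty."""
    return cpi_base + i_miss * penalty + mem_frac * d_miss * penalty

# Time per instruction, in units of the slow cycle time C.
time_slow = cycles_per_inst(100) * 1.0   # 5.44 x C
time_fast = cycles_per_inst(200) * 0.5   # 8.88 x C/2 = 4.44 x C

print(round(time_slow / time_fast, 2))  # 1.23
```

Doubling the clock doubles the miss penalty in cycles, so the speedup is only 1.23x, not 2x.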

Improving Cache Performance

- To increase the performance of a cache, decrease the average access time:
  Average access time = hit time + miss rate x miss penalty
- Reduce the components of the average access time:
  - Hit time
  - Miss rate
  - Miss penalty
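The average-access-time formula is a one-liner; a minimal sketch with hypothetical numbers (1-cycle hit, 5% miss rate, 100-cycle penalty):

```python
def avg_access_time(hit_time, miss_rate, miss_penalty):
    """Average access time = hit time + miss rate x miss penalty (in cycles)."""
    return hit_time + miss_rate * miss_penalty

print(avg_access_time(1, 0.05, 100))  # 6.0
```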

Associativity

Flexible block placement: where can memory block 12 go in an 8-block cache?
- Fully associative: block 12 can be placed in any cache block
- Direct mapped: block 12 maps to cache block (12 mod 8) = 4
- 2-way set associative: block 12 maps to set (12 mod 4) = 0
[Figure: cache blocks 0-7 and sets 0-3 alongside main memory, with memory address 12 highlighted; the cache searches the stored tags to find the block]

Set-Associative Cache

- Multiple blocks in a set
- Each memory block first maps to a unique set; within that set, it can be placed in any cache block
- N-way set associative cache: each set consists of N blocks
- Mapping of a block to the cache: select a set with
  set index = (block number) modulo (#sets in the cache)
- Parallel tag matching: the tags of all blocks in the selected set are compared in parallel
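The set-selection rule is easy to sketch in code; the example reuses memory block 12 and an 8-block cache from the placement discussion above:

```python
def set_index(block_number, num_blocks, ways):
    """Set index = block number mod #sets, where #sets = #blocks / associativity."""
    num_sets = num_blocks // ways
    return block_number % num_sets

# Memory block 12 in an 8-block cache:
print(set_index(12, 8, 1))  # direct mapped (8 sets): 12 mod 8 = 4
print(set_index(12, 8, 2))  # 2-way (4 sets):         12 mod 4 = 0
print(set_index(12, 8, 8))  # fully associative:      single set 0
```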

Four-way Set-Associative Cache

- More hardware increases cost: parallel tag comparison and a multiplexor
- The multiplexor increases hit time
- Increased associativity reduces conflicts → reduces miss rates
- But it requires a more complex MUX → increases hit time
[Figure: the set index selects a set; the four tags in the set are compared in parallel and a MUX selects the matching block's data]

Multiple-Word Blocks in a Set

Example: a two-way set-associative cache with 32-byte cache blocks.
[Figure: 32 sets (set 0 ... set 31), each holding two cache blocks of (valid bit, tag, 32-byte data); a 32-bit address is split into a 22-bit tag, a 5-bit set index, and a 5-bit block offset, which together select a set and a byte within the block]

Accessing a Word

Tag matching and word selection:
- (a) The valid bit must be set for the matching block
- (b) The tag bits in one of the cache blocks in the selected set must match the tag bits in the address
- (c) If (a) and (b) hold, it is a cache hit, and the block offset selects a word via a multiplexer
[Figure: the selected set is compared against the address fields (22-bit tag, 5-bit set index, 5-bit block offset); on a hit, the block offset steers one of the words w0-w7 through a MUX]
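The address split used above (5 offset bits for 32-byte blocks, 5 index bits for 32 sets, 22 tag bits) can be sketched with shifts and masks; the concrete address below is a hypothetical example:

```python
OFFSET_BITS = 5   # 32-byte blocks -> 5 byte-offset bits
INDEX_BITS = 5    # 32 sets -> 5 set-index bits

def split_address(addr):
    """Split a 32-bit address into (tag, set index, block offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# Hypothetical address built from tag 677, set 16, byte offset 12.
addr = (677 << 10) | (16 << 5) | 12
print(split_address(addr))  # (677, 16, 12)
```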

Block Replacement

- Easy for direct mapped: a single candidate to replace
- Set associative or fully associative: a replacement policy is needed
  - Random is good enough with large caches
  - LRU is better with small, low-associativity caches

Miss rates, LRU vs. Random:

Size      2-way LRU / Random   4-way LRU / Random   8-way LRU / Random
16 KB     5.18% / 5.69%        4.67% / 5.29%        4.39% / 4.96%
64 KB     1.88% / 2.01%        1.54% / 1.66%        1.39% / 1.53%
256 KB    1.15% / 1.17%        1.13% / 1.13%        1.12% / 1.12%

Types of Cache Misses

- Compulsory (or cold start, first reference) misses
  - First access to an address; these occur even with an infinitely sized cache
- Capacity misses
  - The cache is not large enough to fit the working set, e.g. a block is replaced from the cache and later accessed again
  - These are the misses a fully associative cache of the same size would incur
- Conflict (or collision, interference) misses
  - Caused by insufficient associativity, e.g. two addresses map to the same block in a direct-mapped cache
- Coherence misses (only in multiprocessor caches)
  - Caused by the cache coherence protocol

Example: Cache Associativity

- Four one-word blocks in each cache; the cache size is 16 bytes (= 4 x 4 bytes)
- Compare three caches:
  - A direct-mapped cache
  - A 2-way set-associative cache
  - A fully associative cache
- Find the number of misses for the reference sequence: 0, 8, 0, 6, 8

Example: Direct-Mapped Cache

Block placement (block# mod 4):
- 0 mod 4 = 0, 6 mod 4 = 2, 8 mod 4 = 0

Cache simulation of 0, 8, 0, 6, 8:
- 0: miss, block 0 holds M[0]
- 8: miss, block 0 replaces M[0] with M[8]
- 0: miss, block 0 replaces M[8] with M[0]
- 6: miss, block 2 holds M[6]
- 8: miss, block 0 replaces M[0] with M[8]

Number of cache misses = 5 (3 compulsory and 2 conflict misses)

Example: 2-way Set-Associative Cache

Two sets with two cache blocks each:
- 0 mod 2 = set 0, 6 mod 2 = set 0, 8 mod 2 = set 0
- Replacement policy within a set: Least Recently Used (LRU)

Cache simulation of 0, 8, 0, 6, 8:
- 0: miss, set 0 holds {M[0]}
- 8: miss, set 0 holds {M[0], M[8]}
- 0: hit
- 6: miss, evicts LRU M[8]; set 0 holds {M[0], M[6]}
- 8: miss, evicts LRU M[0]; set 0 holds {M[6], M[8]}

Number of cache misses = 4 (3 compulsory and 1 conflict miss)

Example: Fully Associative Cache

Four cache blocks in a single set. Cache simulation of 0, 8, 0, 6, 8:
- 0: miss, cache holds {M[0]}
- 8: miss, cache holds {M[0], M[8]}
- 0: hit
- 6: miss, cache holds {M[0], M[8], M[6]}
- 8: hit

Number of cache misses = 3 (only compulsory misses in this case)
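All three worked examples follow from one generic LRU set-associative simulator; a minimal sketch (direct mapped is 1-way, fully associative is 4-way here since the cache has four blocks):

```python
from collections import OrderedDict

def count_misses(refs, num_blocks, ways):
    """Count misses for one-word-block references in an LRU set-associative cache."""
    num_sets = num_blocks // ways
    # Each set is an OrderedDict of resident block numbers, in LRU order.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in refs:
        s = sets[block % num_sets]
        if block in s:
            s.move_to_end(block)        # hit: mark as most recently used
        else:
            misses += 1
            if len(s) == ways:
                s.popitem(last=False)   # evict the least recently used block
            s[block] = None
    return misses

refs = [0, 8, 0, 6, 8]
print(count_misses(refs, 4, 1))  # direct mapped:     5
print(count_misses(refs, 4, 2))  # 2-way (LRU):       4
print(count_misses(refs, 4, 4))  # fully associative: 3
```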

Cache Parameters

- Cache size, block size, and associativity all affect cache performance (hit time and miss rate); appropriate values must be selected
- Larger cache
  - The obvious way to reduce capacity misses
  - But higher cost and longer hit time
- Larger block size
  - Spatial locality reduces compulsory misses (prefetch effect)
  - But a block size that is too large increases conflict misses
- Higher associativity
  - Reduces conflict misses
  - But larger tag-matching logic increases hit time

Block Size vs. Miss Rate

An appropriate block size is needed.
[Figure: miss rate vs. block size for fixed cache size and associativity — larger blocks reduce compulsory misses, but past the popular block-size choices the miss rate rises again as conflict misses increase]

Miss Rates from the 3 Types of Misses

Associativity matters: higher associativity reduces miss rates for caches of the same size.
[Figure: SPEC2000 miss rate broken down per type (conflict, capacity, compulsory) for 1-way, 2-way, 4-way, and 8-way caches]

Multilevel Caches

- Reduce miss penalty: a second-level cache reduces the miss penalty of the first-level cache
- Optimize each cache for different constraints; exploit cost/capacity trade-offs at the different levels
- L1 cache (first-level cache, primary cache)
  - Provides fast access time (1~3 CPU cycles)
  - Built on the same chip as the processor
  - 8~64 KB, direct-mapped to 4-way set-associative
- L2 cache (second-level cache, secondary cache)
  - Provides a low miss penalty (20~30 CPU cycles)
  - Can be on the same chip, or off-chip in a separate set of SRAMs
  - 256 KB~4 MB, 4~16-way set-associative
- Example: the Pentium IV has its L1 and L2 caches on the same chip
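With two levels, the L2 cache replaces the raw memory penalty in the average-access-time formula; a minimal sketch with hypothetical latencies and miss rates (not from the slides):

```python
def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_penalty):
    """Average access time with an L2 cache:
    the penalty seen by L1 is the L2 hit time plus the L2 misses' trips to memory."""
    l1_penalty = l2_hit + l2_miss_rate * mem_penalty
    return l1_hit + l1_miss_rate * l1_penalty

# Hypothetical: L1 = 2-cycle hit, 5% miss; L2 = 20-cycle hit, 20% local miss;
# memory penalty = 200 cycles.
print(amat_two_level(2, 0.05, 20, 0.20, 200))  # 5.0
```

Without the L2 (penalty 200 straight to memory), the same L1 would average 2 + 0.05 x 200 = 12 cycles, which is why the second level pays off.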

Summary

- Performance with a cache: reduce the average access time (= hit time + miss rate x miss penalty)
- Reduce hit time
  - Use a small cache with low associativity (direct-mapped cache)
- Reduce miss rate
  - Capacity misses → use a larger cache (could increase hit time)
  - Compulsory misses → use larger blocks (could increase conflict misses)
  - Conflict misses → use higher associativity (could increase hit time)
  - Select an appropriate cache size, block size, and associativity
- Reduce miss penalty
  - A multilevel cache reduces the miss penalty of the primary cache