COSC3330 Computer Architecture


COSC3330 Computer Architecture
Lecture 19: Cache
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston

Topics
- Cache

Cache Hardware Cost
How many total bits are required for a direct mapped data cache with 16KB of data and 4-word blocks, assuming a 32-bit address?
The address from the CPU splits into: tag (bits 31-14, 18 bits), index (bits 13-4, 10 bits), block offset (bits 3-2, 2 bits), and byte offset (bits 1-0, 2 bits).
[Figure: a 1024-entry direct mapped cache; each entry holds a valid bit (V), a dirty bit (D), an 18-bit tag, and a 16-byte data block.]
#bits = #blocks x (block size + tag size + valid size + dirty size)
      = 1024 x (128 bits + 18 bits + 1 bit + 1 bit)
      = 148 Kbits = 18.5 KB
That is 15.6% larger than the storage needed for the data alone.

Cache Hardware Cost
The number of bits in a cache includes both the storage for data and for the tags.
For a direct mapped cache with 2^n blocks, n bits are used for the index.
For a block size of 2^m words (2^(m+2) bytes), m bits are used to address the word within the block and 2 bits are used to address the byte within the word, so the tag is 32 - (n + m + 2) bits.
The total number of bits in a direct mapped cache is:
I$: 2^n x (block size + tag size + valid size)
D$: 2^n x (block size + tag size + valid size + dirty size)
(The data cache needs a dirty bit per block; the instruction cache does not, since it is never written.)
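
As a sanity check, here is a minimal C sketch (function and variable names are my own, not from the slides) that evaluates the D$ formula for the 16KB, 4-word-block cache from the previous slide:

    #include <stdio.h>

    /* Total bits in a direct mapped data cache.
     * n: log2(number of blocks), m: log2(words per block),
     * addr_bits: address width in bits. */
    static long dcache_total_bits(int n, int m, int addr_bits) {
        long block_bits = (1L << m) * 32;      /* data bits per block (32-bit words) */
        int tag_bits = addr_bits - n - m - 2;  /* 2 bits for the byte offset */
        int valid_bits = 1, dirty_bits = 1;
        return (1L << n) * (block_bits + tag_bits + valid_bits + dirty_bits);
    }

    int main(void) {
        /* 16KB of data, 4-word blocks -> 1024 blocks (n=10), m=2 */
        long bits = dcache_total_bits(10, 2, 32);
        printf("total = %ld Kbits (%.1f KB)\n", bits / 1024, bits / 8.0 / 1024);
        /* prints: total = 148 Kbits (18.5 KB) */
        return 0;
    }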

Measuring Cache Performance
Execution time is composed of CPU execution cycles and memory stall cycles.
Assuming that a cache hit does not require any extra CPU cycle (that is, the MA (memory access) stage takes 1 cycle upon a hit):
CPU time (cycles) = #insts x CPI
                  = CPU execution clock cycles + Memory-stall clock cycles

Measuring Cache Performance
Memory stall cycles come from cache misses:
Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
Read-stall cycles = (reads/program) x read miss rate x read miss penalty
Write-stall cycles = (writes/program) x write miss rate x write miss penalty
For simplicity, assume that a read miss and a write miss incur the same miss penalty:
Memory-stall clock cycles = (memory accesses/program) x miss rate x miss penalty

Impacts of Cache Performance
A processor with:
- I-cache miss rate = 2%
- D-cache miss rate = 4%
- Loads and stores are 36% of instructions
- Miss penalty = 100 cycles
- Base CPI with an ideal cache = 2
What is the execution time (in cycles) on this processor?
CPU clock cycles = CPU execution clock cycles + Memory-stall clock cycles
Assume the number of instructions executed is I.
CPU execution clock cycles = 2I
Memory-stall clock cycles:
I$: I x 0.02 x 100 = 2I
D$: (I x 0.36) x 0.04 x 100 = 1.44I
Total memory-stall cycles = 2I + 1.44I = 3.44I
Total execution cycles = 2I + 3.44I = 5.44I
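
The same computation as a minimal C sketch (variable names are my own); it reproduces the 5.44 effective CPI and previews the speedup on the next slide:

    #include <stdio.h>

    int main(void) {
        double base_cpi = 2.0;
        double icache_miss = 0.02, dcache_miss = 0.04;
        double mem_frac = 0.36;   /* loads + stores per instruction */
        double penalty = 100.0;   /* cycles per miss */

        /* Stall cycles per instruction, from
           stalls = accesses x miss rate x miss penalty */
        double i_stalls = 1.0 * icache_miss * penalty;       /* 2.00 */
        double d_stalls = mem_frac * dcache_miss * penalty;  /* 1.44 */
        double cpi = base_cpi + i_stalls + d_stalls;         /* 5.44 */

        printf("effective CPI = %.2f\n", cpi);
        printf("speedup with a perfect cache = %.2fx\n", cpi / base_cpi); /* 2.72x */
        return 0;
    }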

Impacts of Cache Performance (Cont)
How much faster would a processor with a perfect cache be?
Execution cycles with a perfect cache = 2I
Thus, the CPU with the perfect cache is 2.72x (= 5.44I / 2I) faster!
If you design a new computer with CPI = 1, how much faster is the system with the perfect cache?
Total execution cycles = 1I + 3.44I = 4.44I
Therefore, the CPU with the perfect cache is 4.44x faster!
The fraction of execution time spent on memory stalls has risen from 63% (= 3.44/5.44) to 77% (= 3.44/4.44).

Impacts of Cache Performance (Cont)
Thoughts:
- A 2% instruction cache miss rate with a 100-cycle miss penalty increases CPI by 2!
- Data cache misses also increase CPI dramatically.
- As the CPU is made faster, the time spent on memory stalls will take up an increasing fraction of the execution time.

Average Memory Access Time (AMAT)
Hit time is also important for performance, and it is not always less than 1 CPU cycle. For example, the hit latency of the Core 2 Duo's L1 cache is 3 cycles!
To capture the fact that the time to access data for both hits and misses affects performance, designers sometimes use average memory access time (AMAT), the average time to access memory considering both hits and misses:
AMAT = Hit time + Miss rate x Miss penalty

Average Memory Access Time (AMAT)
AMAT = Hit time + Miss rate x Miss penalty
Example:
- CPU with a 1ns clock (1GHz)
- Hit time = 1 cycle
- Miss penalty = 20 cycles
- Cache miss rate = 5%
AMAT = 1 + 0.05 x 20 = 2 cycles = 2ns per memory access
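
Expressed as a tiny C helper (a sketch; the function name is my own):

    #include <stdio.h>

    /* Average memory access time, in cycles. */
    static double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        printf("AMAT = %.1f cycles\n", amat(1, 0.05, 20)); /* 2.0 cycles = 2ns at 1GHz */
        return 0;
    }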

Improving Cache Performance
AMAT = Hit time + Miss rate x Miss penalty
How can we increase cache performance? Simply increase the cache size?
- A larger cache will have a longer access time.
- An increase in hit time will likely add another stage to the pipeline.
- At some point, the increase in hit time for a larger cache will overcome the improvement in hit rate, leading to a decrease in performance.
Instead, we'll explore two different techniques (next slide).

Improving Cache Performance
AMAT = Hit time + Miss rate x Miss penalty
We'll explore two different techniques:
- The first technique reduces the miss rate by reducing the probability that different memory blocks will contend for the same cache location.
- The second technique reduces the miss penalty by adding additional cache levels to the memory hierarchy.

Reducing Cache Miss Rates (#1)
Alternatives for more flexible block placement:
- A direct mapped cache maps a memory block to exactly one cache block.
- A fully associative cache allows a memory block to be mapped to any cache block.
- An n-way set associative cache divides the cache into sets, each of which consists of n "ways". A memory block is mapped to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices).

Associative Cache Example
[Figure: the same cache organized as 2-way set associative (4 sets) versus fully associative (only 1 set).]

Spectrum of Associativity
For a cache with 8 entries:
- direct mapped (1-way): 8 sets of 1 block each
- 2-way set associative: 4 sets of 2 blocks
- 4-way set associative: 2 sets of 4 blocks
- 8-way set associative (fully associative): 1 set of 8 blocks

4-Way Set Associative Cache
16KB cache, 4-way set associative, 4B (one 32-bit word) per block. How many sets are there?
16KB / 4B per block = 4096 blocks; 4096 blocks / 4 ways = 1024 sets, so the index is 10 bits.
[Figure: the 32-bit address splits into a 20-bit tag, a 10-bit index, and a 2-bit byte offset. The index selects one of 1024 sets; each of the 4 ways holds V, Tag, and Data fields. The four tags are compared in parallel, and a 4x1 mux selects the data from the matching way to produce Hit and Data.]

Associative Caches
Fully associative:
- Allows a given block to go in any cache entry
- Requires all entries to be searched at once
- One comparator per entry (expensive)
n-way set associative:
- Each set contains n entries
- The block address (block number) determines the set: (block number) modulo (#sets in cache)
- Search all entries in a given set at once
- n comparators (less expensive)
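
A minimal C sketch of the address-to-set mapping just described (struct and field names are my own):

    #include <stdint.h>
    #include <stdio.h>

    /* Decompose a byte address for an n-way set associative cache.
     * block_bytes and num_sets must be powers of two. */
    typedef struct { uint32_t tag, set, offset; } cache_addr;

    static cache_addr decompose(uint32_t addr, uint32_t block_bytes, uint32_t num_sets) {
        cache_addr a;
        uint32_t block_number = addr / block_bytes;
        a.offset = addr % block_bytes;
        a.set    = block_number % num_sets;  /* (block number) modulo (#sets) */
        a.tag    = block_number / num_sets;  /* remaining upper address bits */
        return a;
    }

    int main(void) {
        /* 4-way, 1024 sets, 4-byte blocks: the cache from two slides back */
        cache_addr a = decompose(0x12345678, 4, 1024);
        printf("tag=0x%x set=%u offset=%u\n",
               (unsigned)a.tag, (unsigned)a.set, (unsigned)a.offset);
        return 0;
    }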

Range of Set Associative Caches
For a fixed-size cache, each factor-of-2 increase in associativity doubles the number of blocks per set (the number of ways) and halves the number of sets.
Example: increasing associativity from 4 ways to 8 ways decreases the size of the index by 1 bit and increases the size of the tag by 1 bit.
The block address splits into: Tag (used for tag comparison) | Index (selects the set) | Block offset (selects the word in the block) | Byte offset.
As associativity increases, the index shrinks and the tag grows:
- Fully associative (only one set): the tag is all the bits except the block offset and byte offset.
- Direct mapped (only one way): smaller tags, and only a single comparator is needed.
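
A small C sketch (helper and variable names are my own) that prints how the index/tag split of a 32-bit address shifts as associativity doubles, for the 16KB, 4-word-block cache used earlier:

    #include <stdio.h>

    /* Integer log base 2, for powers of two. */
    static int ilog2(unsigned x) { int n = 0; while (x >>= 1) n++; return n; }

    int main(void) {
        unsigned cache_bytes = 16 * 1024, block_bytes = 16;
        unsigned blocks = cache_bytes / block_bytes;   /* 1024 blocks */
        for (unsigned ways = 1; ways <= blocks; ways *= 2) {
            unsigned sets = blocks / ways;
            int index_bits = ilog2(sets);
            int offset_bits = ilog2(block_bytes);      /* block + byte offset */
            int tag_bits = 32 - index_bits - offset_bits;
            printf("%4u-way: %4u sets, index = %2d bits, tag = %2d bits\n",
                   ways, sets, index_bits, tag_bits);
        }
        return 0;
    }

Each doubling of the ways moves exactly one bit from the index to the tag, matching the 4-way to 8-way example above.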

Associativity Example
Assume three different small caches, each with 4 one-word blocks, and a CPU that generates 8-bit addresses:
- direct mapped
- 2-way set associative
- fully associative
Find the number of misses for each cache organization given the following sequence of block addresses: 0, 8, 0, 6, 8.
Assume the sequence consists entirely of memory reads.

Associativity Example
Direct mapped cache: index = (block address) mod 4
Address fields (8-bit address): tag = bits 7-4, index = bits 3-2, byte offset = bits 1-0
Access sequence (block addresses): 0, 8, 0, 6, 8

Byte address         Block address      Cache index   Hit or miss?
b'0000 0000 (0x00)   b'00 0000 (0)      0             miss
b'0010 0000 (0x20)   b'00 1000 (8)      0             miss
b'0000 0000 (0x00)   b'00 0000 (0)      0             miss
b'0001 1000 (0x18)   b'00 0110 (6)      2             miss
b'0010 0000 (0x20)   b'00 1000 (8)      0             miss

5 misses: blocks 0 and 8 both map to index 0, so they keep evicting each other.
Final contents: index 0 holds Mem[8] (tag 0010); index 2 holds Mem[6] (tag 0001).

Associativity Example (Cont)
2-way set associative cache: set = (block address) mod 2
Address fields (8-bit address): tag = bits 7-3, set index = bit 2, byte offset = bits 1-0
Access sequence (block addresses): 0, 8, 0, 6, 8

Byte address         Block address      Set   Hit or miss?
b'0000 0000 (0x00)   b'00 0000 (0)      0     miss
b'0010 0000 (0x20)   b'00 1000 (8)      0     miss
b'0000 0000 (0x00)   b'00 0000 (0)      0     hit
b'0001 1000 (0x18)   b'00 0110 (6)      0     miss (replaces Mem[8], the LRU block)
b'0010 0000 (0x20)   b'00 1000 (8)      0     miss (replaces Mem[0])

4 misses, 1 hit.
Final contents of set 0: Mem[8] (tag 00100) and Mem[6] (tag 00011).

Associativity Example (Cont)
Fully associative cache: only 1 set; the tag is the entire 6-bit block address
Access sequence (block addresses): 0, 8, 0, 6, 8

Block address      Hit or miss?
b'00 0000 (0)      miss
b'00 1000 (8)      miss
b'00 0000 (0)      hit
b'00 0110 (6)      miss
b'00 1000 (8)      hit

3 misses, 2 hits.
Final contents: Mem[0] (tag 000000), Mem[8] (tag 001000), Mem[6] (tag 000110), and one way still empty.
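
All three cases can be checked with a small trace simulator. This is a minimal C sketch (my own code, not from the slides) of an LRU set associative cache with 4 one-word blocks; invalid ways are preferred over eviction because their age stays at the minimum:

    #include <stdio.h>

    #define BLOCKS 4

    /* Count misses for a block-address trace on a `ways`-way set
     * associative cache of BLOCKS blocks, with LRU replacement. */
    static int count_misses(int ways, const int *trace, int n) {
        int sets = BLOCKS / ways;
        int tag[BLOCKS], age[BLOCKS], valid[BLOCKS];  /* slot = set*ways + way */
        int misses = 0;
        for (int i = 0; i < BLOCKS; i++) valid[i] = 0, age[i] = 0;
        for (int t = 0; t < n; t++) {
            int set = trace[t] % sets;
            int base = set * ways, hit = -1, victim = base;
            for (int w = base; w < base + ways; w++) {
                if (valid[w] && tag[w] == trace[t]) hit = w;
                if (age[w] < age[victim]) victim = w;  /* least recently used */
            }
            int use = (hit >= 0) ? hit : victim;
            if (hit < 0) { misses++; valid[use] = 1; tag[use] = trace[t]; }
            age[use] = t + 1;                          /* mark most recently used */
        }
        return misses;
    }

    int main(void) {
        int trace[] = {0, 8, 0, 6, 8};
        printf("direct mapped: %d misses\n", count_misses(1, trace, 5)); /* 5 */
        printf("2-way:         %d misses\n", count_misses(2, trace, 5)); /* 4 */
        printf("fully assoc.:  %d misses\n", count_misses(4, trace, 5)); /* 3 */
        return 0;
    }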

Benefits of Set Associative Caches
The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation.
[Figure: miss rate versus associativity for 1-way through 8-way caches; data from Hennessy & Patterson, Computer Architecture, 2003.]
The largest gains are achieved when going from direct mapped to 2-way (more than a 20% reduction in miss rate).

Replacement Policy
Upon a miss, which way's block should be picked for replacement?
- A direct mapped cache has no choice.
- A set associative cache prefers a non-valid (empty) entry if there is one; otherwise, it chooses among the entries in the set.
Least recently used (LRU):
- Choose the block unused for the longest time.
- Hardware must keep track of when each way's block was used relative to the other blocks in the set.
- Simple for 2-way (takes one bit per set), manageable for 4-way, too hard beyond that.
Random:
- Gives approximately the same performance as LRU for high-associativity caches.

Example Cache Structure with LRU
[Figure: a 2-way set associative data cache with sets 0 through n-1; each set holds one LRU bit plus, per way, valid (V), dirty (D), tag, and data fields.]

Example Cache Structure with LRU
2 ways:
- 1 bit per set marks the latest way accessed in the set.
- Evict the way not pointed to by the bit.
K-way set associative LRU:
- Requires a full ordering of way accesses.
- Hold a log2(k)-bit counter per line.
- When way j is accessed:
    x = counter[j];
    counter[j] = k - 1;
    for (i = 0; i < k; i++)
        if (i != j && counter[i] > x)
            counter[i]--;
- When replacement is needed, evict the way whose counter == 0.
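
A runnable C sketch of the counter scheme above for one set (array and function names are my own); the counters always hold a permutation of 0..K-1, with K-1 marking the most recently used way:

    #include <stdio.h>

    #define K 4  /* associativity */

    static int counter[K] = {0, 1, 2, 3};  /* full LRU ordering; way 3 is MRU */

    /* Update the LRU counters when way j is accessed (the algorithm above). */
    static void access_way(int j) {
        int x = counter[j];
        counter[j] = K - 1;                 /* j becomes most recently used */
        for (int i = 0; i < K; i++)
            if (i != j && counter[i] > x)
                counter[i]--;               /* shift the ways above j down by one */
    }

    /* The victim on a miss is the way whose counter is 0 (least recently used). */
    static int victim(void) {
        for (int i = 0; i < K; i++)
            if (counter[i] == 0) return i;
        return -1;                          /* unreachable: counters stay a permutation */
    }

    int main(void) {
        access_way(0);                      /* touch way 0 */
        access_way(2);                      /* touch way 2 */
        printf("evict way %d\n", victim()); /* way 1: least recently used */
        return 0;
    }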