CPE 631 Lecture 04: Review of the ABC of Caches


CPE 631 Lecture 04: Review of the ABC of Caches
Electrical and Computer Engineering, University of Alabama in Huntsville

Outline
- Memory Hierarchy
- Four Questions for Memory Hierarchy
In today's lecture we first review why a memory hierarchy works, and then walk through the four classic questions about caches: block placement, block identification, block replacement, and write strategy. 10/05/2019 UAH-CPE631

Processor-DRAM Latency Gap
Processor performance improves about 2x every 1.5 years, while DRAM performance improves only about 2x every 10 years, so the processor-memory performance gap grows roughly 50% per year.

Solution: The Memory Hierarchy (MH)
The user sees as much memory as is available in the cheapest technology, and accesses it at the speed offered by the fastest technology.
Moving from the upper levels of the hierarchy (closest to the processor) to the lower levels: speed goes from fastest to slowest, capacity from smallest to biggest, and cost per bit from highest to lowest.

Why does the hierarchy work? The principle of locality:
- Temporal locality: recently accessed items are likely to be accessed again in the near future, so keep them close to the processor.
- Spatial locality: items whose addresses are near one another tend to be referenced close together in time, so move blocks consisting of contiguous words to the upper level.
Rule of thumb: programs spend 90% of their execution time in only 10% of their code.

Cache Measures
- Hit: the data appears in some block in the upper level (block X).
  - Hit rate: the fraction of memory accesses found in the upper level.
  - Hit time: time to access the upper level (RAM access time + time to determine hit/miss).
- Miss: the data must be retrieved from the lower level (block Y).
  - Miss rate: 1 - (hit rate).
  - Miss penalty: time to replace a block in the upper level + time to retrieve the block from the lower level.
- Average memory access time = Hit time + Miss rate x Miss penalty (in ns or clocks).
Note that hit time << miss penalty.
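The average memory access time formula above can be sketched directly in code. This is a minimal illustration; the hit time, miss rate, and miss penalty values below are made up for the example, not taken from the lecture.

```python
# Average memory access time (AMAT) from the slide's formula:
# AMAT = Hit time + Miss rate x Miss penalty
def amat(hit_time, miss_rate, miss_penalty):
    """All times in the same unit (ns or clock cycles)."""
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers: 1-cycle hit, 5% miss rate, 100-cycle miss penalty
print(amat(1, 0.05, 100))  # -> 6.0 cycles
```

Note how a small miss rate still dominates AMAT when the miss penalty is large, which is why hit time << miss penalty matters.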

Levels of the Memory Hierarchy (upper level to lower level; transfer gets larger and slower going down)

Level       | Capacity       | Access time            | Cost        | Staging/transfer unit   | Managed by
Registers   | 100s of bytes  | < 1 ns                 |             | 1-8 bytes (operands)    | program/compiler
Cache       | 10s-100s of KB | 1-10 ns                | $10/MByte   | 8-128 bytes (blocks)    | cache controller
Main memory | MBytes         | 100-300 ns             | $1/MByte    | 512-4K bytes (pages)    | OS
Disk        | 10s of GBytes  | 10 ms (10,000,000 ns)  | $0.0031/MByte | MBytes (files)        | user/operator
Tape        | "infinite"     | sec-min                | $0.0014/MByte | larger                |

Four Questions for the Memory Hierarchy
- Q#1: Where can a block be placed in the upper level? (block placement: direct-mapped, fully associative, set-associative)
- Q#2: How is a block found if it is in the upper level? (block identification)
- Q#3: Which block should be replaced on a miss? (block replacement: random, LRU (least recently used))
- Q#4: What happens on a write? (write strategy: write-through vs. write-back; write-allocate vs. no-write-allocate)

Direct-Mapped Cache
- In a direct-mapped cache, each memory address maps to exactly one possible block within the cache.
- Therefore, we only need to look in a single location in the cache to see whether the data is there.
- The block is the unit of transfer between cache and memory.

Direct-Mapped Cache (cont'd)
[Figure: a 4-byte direct-mapped cache; each of the 16 memory locations (0-F) maps to one cache index, with multiple addresses sharing each index.]

Direct-Mapped Cache (cont'd)
- Since multiple memory addresses map to the same cache index, how do we tell which one is in there?
- What if we have a block size > 1 byte?
- Result: divide the memory address into three fields:
  tttttttttttttttttt iiiiiiiiii oooo
  - TAG: to check whether we have the correct block
  - INDEX: to select the block
  - OFFSET: to select the byte within the block
  (tag + index together form the block address)

Direct-Mapped Cache Terminology
- INDEX: specifies the cache index (which "row" of the cache we should look in).
- OFFSET: once we have found the correct block, specifies which byte within the block we want.
- TAG: the remaining bits after offset and index are determined; these are used to distinguish among all the memory addresses that map to the same location.
- BLOCK ADDRESS = TAG + INDEX.

Direct-Mapped Cache Example
Conditions: 32-bit architecture (word = 32 bits), byte-addressed memory, 8 KB direct-mapped cache with 4-word blocks. Determine the size of the tag, index, and offset fields.
- OFFSET (specifies the correct byte within a block): a cache block contains 4 words = 16 (2^4) bytes => 4 bits.
- INDEX (specifies the correct row in the cache): cache size is 8 KB = 2^13 bytes, a cache block is 2^4 bytes; the number of rows (1 block = 1 row) is 2^13 / 2^4 = 2^9 => 9 bits.
- TAG: memory address length - offset - index = 32 - 4 - 9 = 19 => the tag is the leftmost 19 bits.
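The field-size arithmetic in this worked example can be checked mechanically. The sketch below just encodes the slide's parameters (8 KB cache, 16-byte blocks, 32-bit addresses) and derives the three field widths.

```python
import math

# Parameters from the slide's example
addr_bits   = 32            # 32-bit byte addresses
cache_bytes = 8 * 1024      # 8 KB = 2^13 bytes
block_bytes = 4 * 4         # 4 words x 4 bytes/word = 16 = 2^4 bytes

offset_bits = int(math.log2(block_bytes))                  # selects byte in block
index_bits  = int(math.log2(cache_bytes // block_bytes))   # selects the row
tag_bits    = addr_bits - index_bits - offset_bits         # the rest

print(offset_bits, index_bits, tag_bits)  # -> 4 9 19
```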

1 KB Direct-Mapped Cache, 32 B Blocks
For a 2^N-byte cache with a block size of 2^M bytes:
- The uppermost (32 - N) bits are always the cache tag.
- The lowest M bits are the byte select.
A valid bit and the tag are stored as part of the cache "state" alongside each data block.
[Figure: a 1 KB cache of 32-byte blocks; example access with cache tag 0x50, cache index 0x01, byte select 0x00.]
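The lookup this slide illustrates can be modeled in a few lines. The sketch below tracks only valid bits and tags (no data) for the slide's 1 KB cache with 32-byte blocks: 5 offset bits, 5 index bits (32 rows), and the remaining bits as the tag.

```python
# Minimal direct-mapped cache model: 1 KB cache, 32-byte blocks
# -> 5 offset bits, 5 index bits (32 rows), tag = remaining upper bits.
OFFSET_BITS, INDEX_BITS = 5, 5
ROWS = 1 << INDEX_BITS

valid = [False] * ROWS
tags  = [0] * ROWS

def access(addr):
    """Return True on a hit; on a miss, install the block and return False."""
    index = (addr >> OFFSET_BITS) & (ROWS - 1)
    tag   = addr >> (OFFSET_BITS + INDEX_BITS)
    if valid[index] and tags[index] == tag:
        return True                      # tag matches: hit
    valid[index], tags[index] = True, tag  # miss: replace this row's block
    return False

print(access(0x1234))  # cold miss -> False
print(access(0x1234))  # same block -> True
```

Two addresses that share an index but differ in tag evict each other, which is the conflict-miss behavior discussed later.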

Two-Way Set-Associative Cache
- N-way set associative: N entries for each cache index, i.e., N direct-mapped caches operating in parallel (N is typically 2 to 4).
- Example: two-way set-associative cache
  - The cache index selects a "set" from the cache.
  - The two tags in the set are compared in parallel with the upper bits of the memory address.
  - If neither tag matches, we have a cache miss; otherwise, we have a hit and the data is selected from the way where the tag matched.
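The set selection and parallel tag compare can be sketched as below. The parameters (5 offset bits, 4 index bits, 2 ways) are illustrative choices, not the lecture's; in hardware the two comparisons happen in parallel, while this model simply checks both ways.

```python
# Sketch of a two-way set-associative cache (tags only, no data).
OFFSET_BITS, INDEX_BITS, WAYS = 5, 4, 2   # illustrative parameters
SETS = 1 << INDEX_BITS

cache = [[{"valid": False, "tag": 0} for _ in range(WAYS)] for _ in range(SETS)]

def access(addr):
    """Return True on a hit; on a miss, fill an invalid way (or way 0)."""
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag   = addr >> (OFFSET_BITS + INDEX_BITS)
    ways  = cache[index]
    # Both ways of the selected set are compared (in hardware, in parallel)
    if any(e["valid"] and e["tag"] == tag for e in ways):
        return True
    victim = next((e for e in ways if not e["valid"]), ways[0])
    victim["valid"], victim["tag"] = True, tag
    return False

# Two addresses with the same index but different tags can coexist in one set:
a, b = 0x000, 0x200   # same index, tags differ at bit 9 and up
access(a); access(b)
print(access(a), access(b))  # -> True True
```

The same pair of addresses would evict each other in a direct-mapped cache, which is exactly the conflict-miss reduction associativity buys.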

Disadvantage of Set-Associative Cache
- N-way set-associative cache vs. direct-mapped cache:
  - N comparators vs. 1.
  - Extra MUX delay for the data.
  - Data comes AFTER the hit/miss decision.
- In a direct-mapped cache, the cache block is available BEFORE hit/miss is known: it is possible to assume a hit and continue, recovering later on a miss.

Q1: Where can a block be placed in the upper level?
Example: where can memory block 12 be placed in an 8-block cache? Set-associative mapping = block number modulo number of sets.
- Direct mapped: (12 mod 8) = 4, so only block 4.
- 2-way set associative: (12 mod 4) = 0, so either way of set 0.
- Fully associative: anywhere in the cache.
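The placement arithmetic above reduces to a modulo per policy, which the following sketch spells out for the slide's block-12 example.

```python
# Where memory block `block` may be placed under each policy.
def direct_mapped_slot(block, cache_blocks):
    """Direct mapped: exactly one slot, block number mod cache size."""
    return block % cache_blocks

def set_assoc_set(block, num_sets):
    """Set associative: any way within set (block number mod number of sets)."""
    return block % num_sets

print(direct_mapped_slot(12, 8))  # -> 4
print(set_assoc_set(12, 4))       # 2-way, 8-block cache: 8/2 = 4 sets -> set 0
# Fully associative: block 12 may go in any of the 8 slots.
```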

Q2: How is a block found if it is in the upper level?
- The address is divided into the block address (tag + index) and the block offset.
- A tag is kept on each block; there is no need to check the index or block offset.
- Increasing associativity shrinks the index and expands the tag.

Q3: Which block should be replaced on a miss?
- Easy for direct mapped: there is only one candidate.
- Set associative or fully associative: random or LRU (least recently used).
Miss rates, LRU vs. random replacement:

Cache size | 2-way LRU | 2-way Ran | 4-way LRU | 4-way Ran | 8-way LRU | 8-way Ran
16 KB      | 5.2%      | 5.7%      | 4.7%      | 5.3%      | 4.4%      | 5.0%
64 KB      | 1.9%      | 2.0%      | 1.5%      | 1.7%      | 1.4%      | 1.5%
256 KB     | 1.15%     | 1.17%     | 1.13%     | 1.13%     | 1.12%     | 1.12%
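LRU bookkeeping for a single set can be sketched with an ordered map: hits move an entry to the most-recently-used position, and a miss in a full set evicts the oldest entry. The tags below are arbitrary illustrative values.

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with LRU replacement; capacity = associativity (ways)."""
    def __init__(self, ways):
        self.ways = ways
        self.entries = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag):
        """Return True on a hit; on a miss, evict the LRU tag if the set is full."""
        if tag in self.entries:
            self.entries.move_to_end(tag)      # mark as most recently used
            return True
        if len(self.entries) >= self.ways:
            self.entries.popitem(last=False)   # evict least recently used
        self.entries[tag] = None
        return False

s = LRUSet(ways=2)
s.access("A"); s.access("B"); s.access("A")  # A is now most recently used
s.access("C")                                # evicts B, the LRU entry
print(s.access("A"), s.access("B"))  # -> True False
```

Random replacement would instead pick a victim with `random.choice`, which is cheaper to implement in hardware and, per the table above, only slightly worse.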

Q4: What happens on a write?
- Write through: the information is written to both the block in the cache and the block in the lower-level memory.
- Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced. (Is the block clean or dirty?)
- Pros and cons of each:
  - WT: read misses cannot result in writes.
  - WB: no repeated writes to the same location.
- WT is always combined with write buffers so that the processor does not wait for the lower-level memory.

Write Stall in Write-Through Caches
- When the CPU must wait for writes to complete during write through, the CPU is said to write stall.
- A common optimization is a write buffer, which allows the processor to continue as soon as the data is written to the buffer, thereby overlapping processor execution with memory updating.
- However, write stalls can occur even with a write buffer (when the buffer is full).

Write Buffer for Write Through
Processor -> Cache -> Write Buffer -> DRAM
- A write buffer is needed between the cache and memory:
  - The processor writes data into the cache and the write buffer.
  - The memory controller writes the contents of the buffer to memory.
- The write buffer is just a FIFO; the typical number of entries is 4.
- It works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle.
- The memory system designer's nightmare: store frequency (w.r.t. time) -> 1 / DRAM write cycle, i.e., write buffer saturation.
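The FIFO behavior and the saturation case can be modeled in a short sketch. The four-entry capacity matches the slide; the `store`/`drain_one` interface is an illustrative invention, not hardware-accurate.

```python
from collections import deque

class WriteBuffer:
    """Four-entry FIFO between cache and DRAM; a full buffer stalls the store."""
    def __init__(self, entries=4):
        self.fifo = deque()
        self.entries = entries

    def store(self, addr, data):
        """Return True if the store is accepted; False means the CPU must stall."""
        if len(self.fifo) >= self.entries:
            return False              # write buffer saturation
        self.fifo.append((addr, data))
        return True

    def drain_one(self):
        """Memory controller writes the oldest entry to DRAM."""
        return self.fifo.popleft() if self.fifo else None

wb = WriteBuffer()
for i in range(4):
    wb.store(i, i * 10)      # four stores fill the buffer
print(wb.store(4, 40))       # -> False: buffer full, CPU write-stalls
wb.drain_one()               # controller retires one entry to DRAM
print(wb.store(4, 40))       # -> True: room again
```

As long as the controller drains faster than stores arrive, the processor never sees `False`; saturation occurs exactly when the store rate approaches the DRAM write rate.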

What to do on a write miss?
- Write allocate (or fetch on write): the block is loaded on a write miss, followed by the write-hit actions.
- No-write allocate (or write around): the block is modified in memory and not loaded into the cache.
- Although either write-miss policy can be used with write through or write back, write-back caches generally use write allocate and write-through caches often use no-write allocate.

Summary: Pipelining & Performance
- Pipelining just overlaps tasks; easy if the tasks are independent.
- Speedup <= pipeline depth; if the ideal CPI is 1, then:
  Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
- Hazards limit performance on computers:
  - Structural: need more HW resources.
  - Data (RAW, WAR, WAW): need forwarding, compiler scheduling.
  - Control: delayed branch, prediction.
- Time is the measure of performance: latency or throughput.
- CPI law: CPU time = Instruction count x CPI x Clock cycle time.
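The CPU-time law in this summary is simple enough to work through numerically. The sketch below plugs in illustrative numbers (not from the lecture) to show the units working out.

```python
# CPU time = Instruction count x CPI x Clock cycle time
#          = Instruction count x CPI / Clock rate
def cpu_time(instr_count, cpi, clock_hz):
    """CPU time in seconds for a program."""
    return instr_count * cpi / clock_hz

# Illustrative: 1e9 instructions, average CPI of 1.5, 2 GHz clock
print(cpu_time(1e9, 1.5, 2e9))  # -> 0.75 seconds
```

Cache misses raise the effective CPI, which is how memory-system behavior feeds directly into this equation.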

Summary: Caches
- The principle of locality: programs access a relatively small portion of the address space at any instant of time.
  - Temporal locality: locality in time.
  - Spatial locality: locality in space.
- Three major categories of cache misses:
  - Compulsory misses: sad facts of life (example: cold-start misses).
  - Capacity misses: reduced by increasing cache size.
  - Conflict misses: reduced by increasing cache size and/or associativity.
- Write policy:
  - Write through: needs a write buffer.
  - Write back: control can be complex.
- Today CPU time is a function of (ops, cache misses) rather than just f(ops): what does this mean for compilers, data structures, and algorithms?

Summary: The Cache Design Space
- Several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs. write-back.
- The optimal choice is a compromise:
  - It depends on access characteristics: workload and use (I-cache, D-cache, TLB).
  - It depends on technology / cost.
- Simplicity often wins. 10/05/2019 UAH-CPE631