Slide 1 Hitting the Memory Wall
Memory density and capacity have grown along with CPU power and complexity, but memory speed has not kept pace.

Slide 2 The Need for a Memory Hierarchy
The widening speed gap between CPU and main memory:
– Processor operations take on the order of 1 ns
– Memory access requires 10s or even 100s of ns
Memory bandwidth limits the instruction execution rate:
– Each instruction executed involves at least one memory access
– Hence, a few to 100s of MIPS is the best that can be achieved
A fast buffer memory can help bridge the CPU-memory gap:
– The fastest memories are expensive and thus not very large
– A second (third?) intermediate cache level is thus often used

Slide 3 Typical Levels in a Hierarchical Memory Names and key characteristics of levels in a memory hierarchy.

Slide 4 Memory Hierarchy
Data movement in a memory hierarchy.
– Cache memory: provides illusion of very high speed
– Virtual memory: provides illusion of very large size
– Main memory: reasonable cost, but slow & small

Slide 5 The Need for a Cache
Cache memories act as intermediaries between the superfast processor and the much slower main memory.
One level of cache with hit rate h:
C_eff = h·C_fast + (1 – h)(C_slow + C_fast) = C_fast + (1 – h)·C_slow
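A quick numeric check of this formula, as a minimal Python sketch (the 1 ns / 100 ns latencies are illustrative values, not from the slide):

```python
def effective_access_time(h, c_fast, c_slow):
    """One cache level with hit rate h: a miss pays the slow access on
    top of the fast one, so C_eff = h*C_fast + (1-h)*(C_slow + C_fast),
    which simplifies to C_fast + (1-h)*C_slow."""
    return c_fast + (1 - h) * c_slow

# Illustrative numbers: 1 ns cache, 100 ns main memory, 95% hit rate.
print(round(effective_access_time(0.95, 1.0, 100.0), 2))  # 6.0 (ns)
```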

Slide 6 Performance of a Two-Level Cache System
Example: CPU with CPIexecution = 1.1 running at clock rate = 500 MHz, with 1.3 memory accesses per instruction. L1 cache operates at 500 MHz with a miss rate of 5%. L2 cache operates at 250 MHz with local miss rate 40% (T2 = 2 cycles). Memory access penalty, M = 100 cycles. Find CPI.
CPI = CPIexecution + Mem stall cycles per instruction
With no cache, CPI = 1.1 + 1.3 × 100 = 131.1
With single L1, CPI = 1.1 + 1.3 × 0.05 × 100 = 7.6
Mem stall cycles/instruction = Mem accesses/instruction × Stall cycles/access
Stall cycles per memory access = (1 – H1) × H2 × T2 + (1 – H1)(1 – H2) × M = 0.05 × 0.6 × 2 + 0.05 × 0.4 × 100 = 0.06 + 2 = 2.06
Mem stall cycles/instruction = 1.3 × 2.06 = 2.678
CPI = 1.1 + 2.678 = 3.778
Speedup over L1-only = 7.6/3.778 = 2
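The same arithmetic as a short Python check (all numbers taken from the example above):

```python
cpi_exec = 1.1        # base CPI
apmi = 1.3            # memory accesses per instruction
h1, h2 = 0.95, 0.60   # L1 hit rate; L2 local hit rate
t2, m = 2, 100        # L2 access time; memory penalty (cycles)

# L1 misses either hit in L2 (cost T2) or go to memory (cost M).
stall_per_access = (1 - h1) * h2 * t2 + (1 - h1) * (1 - h2) * m
cpi = cpi_exec + apmi * stall_per_access
print(f"{stall_per_access:.2f} {cpi:.3f}")  # 2.06 3.778
```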

Slide 7 Cache Memory Design Parameters (assuming a single cache level)
– Cache size (in bytes or words). A larger cache can hold more of the program's useful data but is more costly and likely to be slower.
– Block or cache-line size (unit of data transfer between cache and main memory). With a larger cache line, more data is brought into the cache with each miss. This can improve the hit rate but may also bring in low-utility data.
– Placement policy. Determines where an incoming cache line is stored. More flexible policies imply higher hardware cost and may or may not have performance benefits (due to more complex data location).
– Replacement policy. Determines which of several existing cache blocks (into which a new cache line can be mapped) should be overwritten. Typical policies: choosing a random or the least recently used block.
– Write policy. Determines whether updates to cache words are immediately forwarded to main memory (write-through) or modified blocks are copied back to main memory if and when they must be replaced (write-back or copy-back).

Slide 8 What Makes a Cache Work?
Assuming no conflict in address mapping, the cache will hold a small program loop in its entirety, leading to fast execution.
– Temporal locality
– Spatial locality

Slide 9 Temporal and Spatial Localities
[Figure: memory accesses plotted as address vs. time, from Peter Denning's CACM paper, July 2005 (Vol. 48, No. 7)]
– Temporal: Accesses to the same address are typically clustered in time
– Spatial: When a location is accessed, nearby locations tend to be accessed also

Slide 10 Desktop, Drawer, and File Cabinet Analogy Items on a desktop (register) or in a drawer (cache) are more readily accessible than those in a file cabinet (main memory). Once the “working set” is in the drawer, very few trips to the file cabinet are needed.

Slide 11 Caching Benefits Related to Amdahl's Law
Example: In the drawer & file cabinet analogy, assume a hit rate h in the drawer. Formulate the situation shown in the previous figure in terms of Amdahl's law.
Solution: Without the drawer, a document is accessed in 30 s. So, fetching 1000 documents, say, would take 30,000 s. The drawer causes a fraction h of the cases to be done 6 times as fast, with access time unchanged for the remaining 1 – h. Speedup is thus 1/(1 – h + h/6) = 6/(6 – 5h). Improving the drawer access time can increase the speedup factor, but as long as the miss rate remains at 1 – h, the speedup can never exceed 1/(1 – h). Given h = 0.9, for instance, the speedup is 4, with the upper bound being 10 for an extremely short drawer access time.
Note: Some would place everything on their desktop, thinking that this yields even greater speedup. This strategy is not recommended!
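A quick check of the speedup formula in Python (the factor 6 is the drawer-vs-cabinet speed ratio from the analogy):

```python
def speedup(h, ratio=6):
    """Amdahl-style speedup when a fraction h of accesses is
    served `ratio` times faster; equals 6/(6 - 5h) for ratio = 6."""
    return 1 / (1 - h + h / ratio)

print(round(speedup(0.9), 3))       # 4.0
print(round(1 / (1 - 0.9), 3))      # 10.0, the bound for an instant drawer
```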

Slide 12 Compulsory, Capacity, and Conflict Misses
– Compulsory misses: With on-demand fetching, the first access to any item is a miss. Some "compulsory" misses can be avoided by prefetching.
– Capacity misses: We have to oust some items to make room for others. This leads to misses that are not incurred with an infinitely large cache.
– Conflict misses: Occasionally, there is free room, or space occupied by useless data, but the mapping/placement scheme forces us to displace useful items to bring in other items. This may lead to misses in the future.
Given a fixed-size cache, dictated, e.g., by cost factors or availability of space on the processor chip, compulsory and capacity misses are pretty much fixed. Conflict misses, on the other hand, are influenced by the data mapping scheme, which is under our control. We study two popular mapping schemes: direct-mapped and set-associative.

Slide 13 Direct-Mapped Cache Direct-mapped cache holding 32 words within eight 4-word lines. Each line is associated with a tag and a valid bit.

Slide 14 Accessing a Direct-Mapped Cache (Example 1)
Components of the 32-bit address in an example direct-mapped cache with byte addressing.
Show the cache addressing for a byte-addressable memory with 32-bit addresses. Cache line width 2^W = 16 B. Cache size 2^L = 4096 lines (64 KB).
Solution: Byte offset in line is log2 16 = 4 bits. Cache line index is log2 4096 = 12 bits. This leaves 32 – 12 – 4 = 16 bits for the tag.
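A minimal sketch of this field extraction in Python (the helper name and the sample address are made up for illustration):

```python
OFFSET_BITS = 4   # log2(16-byte line)
INDEX_BITS = 12   # log2(4096 lines)

def split_address(addr):
    """Split a 32-bit byte address into (tag, index, byte offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), hex(offset))  # 0x1234 0x567 0x8
```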

Slide 15 1 KB Direct-Mapped Cache, 32 B Blocks (Example 2)
For a 2^N-byte cache:
– The uppermost (32 – N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)
[Figure: cache array of valid bit, cache tag (stored as part of the cache "state"), and 32-byte data lines (Byte 0 … Byte 1023); example address fields: Cache Tag 0x50, Cache Index 0x01, Byte Select 0x00]

Slide 16 Set-Associative Cache Two-way set-associative cache holding 32 words of data within 4-word lines and 2-line sets.

Slide 17 Accessing a Set-Associative Cache (Example 1)
Components of the 32-bit address in an example two-way set-associative cache.
Show the cache addressing scheme for a byte-addressable memory with 32-bit addresses. Cache line width 2^W = 16 B. Set size 2^S = 2 lines. Cache size 2^L = 4096 lines (64 KB).
Solution: Byte offset in line is log2 16 = 4 bits. Cache set index is log2(4096/2) = 11 bits. This leaves 32 – 11 – 4 = 17 bits for the tag.

Slide 18 Two-Way Set-Associative Cache (Example 2)
N-way set associative: N entries for each cache index
– N direct-mapped caches operate in parallel (N typically 2 to 4)
Example: two-way set-associative cache
– The cache index selects a "set" from the cache
– The two tags in the set are compared in parallel
– Data is selected based on the tag comparison result
[Figure: two ways of (valid, cache tag, cache data) indexed in parallel; the address tag is compared against both stored tags, the compare results are ORed to form Hit, and a mux (Sel1/Sel0) picks the matching way's cache block]
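As a toy model of this lookup (not any specific hardware; random replacement is chosen here just to keep the sketch short):

```python
import random

class TwoWaySetCache:
    """Toy two-way set-associative cache tracking only tags."""

    def __init__(self, num_sets=2048, offset_bits=4):
        self.offset_bits = offset_bits
        self.index_bits = num_sets.bit_length() - 1  # num_sets is a power of 2
        self.index_mask = num_sets - 1
        self.sets = [[None, None] for _ in range(num_sets)]

    def access(self, addr):
        index = (addr >> self.offset_bits) & self.index_mask
        tag = addr >> (self.offset_bits + self.index_bits)
        ways = self.sets[index]
        if tag in ways:            # the two tags compared "in parallel"
            return True            # hit
        victim = ways.index(None) if None in ways else random.randrange(2)
        ways[victim] = tag         # fill on miss
        return False

c = TwoWaySetCache()
print(c.access(0x1000), c.access(0x1000))  # False True
```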

Slide 19 Disadvantage of Set-Associative Cache
N-way set-associative cache vs. direct-mapped cache:
– N comparators vs. 1
– Extra MUX delay for the data
– Data comes AFTER Hit/Miss
In a direct-mapped cache, the cache block is available BEFORE Hit/Miss:
– Possible to assume a hit and continue; recover later if miss.
Advantage of set-associative cache: improves cache performance by reducing conflict misses. In practice, the degree of associativity is often kept at 4 or 8.

Slide 20 Effect of Associativity on Cache Performance Performance improvement of caches with increased associativity.

Slide 21 Cache Write Strategies
Write through: Data is written to both the cache block and to a block of main memory.
– The lower level always has the most updated data; an important feature for I/O and multiprocessing.
– Easier to implement than write back.
– A write buffer is often used to reduce CPU write stalls while data is written to memory.
[Figure: Processor → Cache → Write Buffer → DRAM]

Slide 22 Cache Write Strategies cont.
Write back: Data is written or updated only to the cache block. The modified or dirty cache block is written to main memory when it is replaced from the cache.
– Writes occur at the speed of the cache.
– A status bit, called the dirty bit, is used to indicate whether the block was modified while in the cache; if not, the block is not written back to main memory.
– Uses less memory bandwidth than write through.
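A minimal sketch of the dirty-bit bookkeeping (a toy model with made-up names, just to show when memory traffic happens):

```python
class Line:
    """Metadata for one write-back cache line."""
    def __init__(self, tag):
        self.tag, self.dirty = tag, False

memory_writes = []          # stands in for main-memory write traffic

def write(line):
    line.dirty = True       # a write hit updates only the cache

def evict(line):
    if line.dirty:          # only modified lines are written back
        memory_writes.append(line.tag)

line = Line(tag=0x12)
write(line)                 # no memory traffic yet
evict(line)                 # the dirty line is written back now
print(memory_writes)        # [18], i.e., tag 0x12
```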

Slide 23 Cache and Main Memory
– Harvard architecture: separate instruction and data memories
– von Neumann architecture: one memory for instructions and data
– Split cache: separate instruction and data caches (L1)
– Unified cache: holds instructions and data (L1, L2, L3)

Slide 24 Cache and Main Memory cont.
(16 KB instruction cache + 16 KB data cache) vs. 32 KB unified cache. Hit time: 1 cycle; miss penalty: 50 cycles; 75% of accesses are instruction accesses.
16 KB I&D: instruction miss rate = 0.64%, data miss rate = 6.47%. 32 KB unified: miss rate = 1.99%.
Average memory access time = % instructions × (hit time + instruction miss rate × miss penalty) + % data × (hit time + data miss rate × miss penalty)
Split = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 2.05
Unified = 75% × (1 + 1.99% × 50) + 25% × (1 + 1* + 1.99% × 50) = 2.24
*: 1 extra clock cycle, since there is only one cache port to satisfy two simultaneous requests
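The same computation as a quick Python check (miss rates and access fractions straight from the slide):

```python
hit, penalty = 1, 50          # cycles
f_instr, f_data = 0.75, 0.25  # fraction of instruction vs. data accesses

split = (f_instr * (hit + 0.0064 * penalty)
         + f_data * (hit + 0.0647 * penalty))
# Unified: one port, so a data access behind an instruction fetch
# pays 1 extra cycle.
unified = (f_instr * (hit + 0.0199 * penalty)
           + f_data * (hit + 1 + 0.0199 * penalty))
print(f"{split:.3f} {unified:.3f}")  # 2.049 2.245
```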

Slide 25 Improving Cache Performance
For a given cache size, the following design issues and tradeoffs exist:
– Line width (2^W). Too small a value for W causes a lot of main memory accesses; too large a value increases the miss penalty and may tie up cache space with low-utility items that are replaced before being used.
– Set size or associativity (2^S). Direct mapping (S = 0) is simple and fast; greater associativity leads to more complexity, and thus slower access, but tends to reduce conflict misses.
– Line replacement policy. Usually the LRU (least recently used) algorithm or some approximation thereof; not an issue for direct-mapped caches. Somewhat surprisingly, random selection works quite well in practice.
– Write policy. Write through or write back.
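To make the LRU option concrete, a minimal sketch of LRU bookkeeping within one set, using Python's OrderedDict (class and method names are illustrative):

```python
from collections import OrderedDict

class LRUSet:
    """LRU replacement within one cache set of a given associativity."""
    def __init__(self, ways):
        self.ways, self.tags = ways, OrderedDict()  # oldest tag first

    def access(self, tag):
        if tag in self.tags:
            self.tags.move_to_end(tag)       # now most recently used
            return True                      # hit
        if len(self.tags) == self.ways:
            self.tags.popitem(last=False)    # evict least recently used
        self.tags[tag] = None
        return False

s = LRUSet(ways=2)
print([s.access(t) for t in (1, 2, 1, 3, 2)])
# [False, False, True, False, False]: accessing 3 evicts tag 2
```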

Slide 26 2:1 Cache Rule of Thumb
A direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.
E.g. (Ref. p. 424, fig. 5.14), miss rates:
– 8 KB direct-mapped (0.068) ≈ 4 KB 2-way set-associative (0.076)
– 16 KB (0.049) ≈ 8 KB (0.049)
– 32 KB (0.042) ≈ 16 KB (0.041)
– 64 KB (0.037) ≈ 32 KB (0.038)
Caches larger than 128 KB do not follow the rule.

Slide 27 90/10 Locality Rule
A program executes about 90% of its instructions in 10% of its code.

Slide 28 Four Classic Memory Hierarchy Questions
1. Where can a block be placed in the upper level? (block placement): direct-mapped, set-associative, …
2. How is a block found if it is in the upper level? (block identification): tag, index, offset
3. Which block should be replaced on a miss? (block replacement): random, LRU, FIFO, …
4. What happens on a write? (write strategy): write through, write back

Slide 29 Reducing Cache Miss Penalty (literature review 1)
– Multilevel caches
– Critical word first and early restart
– Giving priority to read misses over writes
– Merging write buffer
– Victim caches, …

Slide 30 Reducing Cache Miss Rate (literature review 2)
– Larger block size
– Larger caches
– Higher associativity
– Way prediction and pseudoassociative caches
– Compiler optimizations, …

Slide 31 Reducing Cache Miss Penalty or Miss Rate via Parallelism (literature review 3)
– Nonblocking caches
– H/W prefetching of instructions and data
– Compiler-controlled prefetching, …

Slide 32 Reducing Hit Time (literature review 4)
– Small and simple caches
– Avoiding address translation during indexing of the cache
– Pipelined cache access
– Trace caches, …