CS151B Computer Systems Architecture, Winter 2002
TuTh 2-4pm, 2444 BH
Instructor: Prof. Jason Cong

Lecture 15: More Caches and Virtual Memories
Recap: Set Associative Cache
° N-way set associative: N entries for each cache index; N direct-mapped caches operating in parallel
° Example: two-way set associative cache
  - Cache Index selects a "set" from the cache
  - The two tags in the set are compared to the input tag in parallel
  - Data is selected based on the tag comparison result
[Figure: two-way set associative cache; each way has valid/tag/data storage and an address-tag comparator, the per-way select signals (Sel0/Sel1) drive a mux that picks the hitting cache block, and the ORed compare outputs form Hit]
Recap: Cache Performance
° Execution_Time = Instruction_Count x Cycle_Time x (ideal CPI + Memory_Stalls/Inst + Other_Stalls/Inst)
° Memory_Stalls/Inst = Instruction Miss Rate x Instruction Miss Penalty
                     + Loads/Inst x Load Miss Rate x Load Miss Penalty
                     + Stores/Inst x Store Miss Rate x Store Miss Penalty
° Average Memory Access Time (AMAT) = Hit Time_L1 + (Miss Rate_L1 x Miss Penalty_L1)
                                    = (Hit Rate_L1 x Hit Time_L1) + (Miss Rate_L1 x Miss Time_L1)
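A quick numeric sketch of these recap formulas; every miss rate, penalty, and instruction-mix number below is a made-up illustration value, not data from the lecture.

#include <stdio.h>

/* Illustrates the recap formulas with hypothetical parameters. */
int main(void) {
    double ideal_cpi       = 1.0;
    double inst_miss_rate  = 0.01;                            /* assumed */
    double loads_per_inst  = 0.25, load_miss_rate  = 0.04;    /* assumed */
    double stores_per_inst = 0.10, store_miss_rate = 0.06;    /* assumed */
    double miss_penalty    = 50.0;    /* cycles, same for all misses */
    double hit_time        = 1.0;     /* cycles */

    double mem_stalls_per_inst =
          inst_miss_rate  * miss_penalty
        + loads_per_inst  * load_miss_rate  * miss_penalty
        + stores_per_inst * store_miss_rate * miss_penalty;

    double cpi  = ideal_cpi + mem_stalls_per_inst;
    double amat = hit_time + inst_miss_rate * miss_penalty;   /* L1 AMAT form */

    printf("memory stalls per instruction = %.2f cycles\n", mem_stalls_per_inst); /* 1.30 */
    printf("effective CPI                 = %.2f\n", cpi);                        /* 2.30 */
    printf("AMAT (instruction stream)     = %.2f cycles\n", amat);                /* 1.50 */
    return 0;
}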
Recap: A Summary on Sources of Cache Misses
° Compulsory (cold start, process migration, first reference): the first access to a block
  - "Cold" fact of life: not a whole lot you can do about it
  - Note: if you are going to run billions of instructions, compulsory misses are insignificant
° Conflict (collision): multiple memory locations mapped to the same cache location
  - Solution 1: increase cache size
  - Solution 2: increase associativity
° Capacity: the cache cannot contain all blocks accessed by the program
  - Solution: increase cache size
° Coherence (invalidation): another process (e.g., I/O) updates memory
The Big Picture: Where are We Now?
° The Five Classic Components of a Computer: Processor (Control + Datapath), Memory, Input, Output
° Today's topics:
  - Recap of last lecture
  - Virtual memory
  - Protection
  - TLB
  - Buses
How Do You Design a Memory System?
° Set of operations that must be supported
  - read: Data <= Mem[Physical Address]
  - write: Mem[Physical Address] <= Data
° Determine the internal register transfers
° Design the datapath
° Design the cache controller
[Figure: the memory "black box" takes a physical address, data in, and read/write control, and returns data out plus a wait signal; inside are the cache datapath (tag/data storage, muxes, comparators) and the cache controller driving its control points]
Impact on Cycle Time
° Cache hit time:
  - directly tied to clock rate
  - increases with cache size
  - increases with associativity
° Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
° Time = IC x CT x (ideal CPI + memory stalls)
[Figure: pipeline sketch with the I-Cache in the fetch stage (PC/IR) and the D-Cache in the memory stage; a cache miss or invalid access stalls the pipeline]
Improving Cache Performance: 3 General Options
° Average Memory Access Time = Hit Time + (Miss Rate x Miss Penalty)
                             = (Hit Rate x Hit Time) + (Miss Rate x Miss Time)
° Time = IC x CT x (ideal CPI + memory stalls)
° Options to reduce AMAT:
  1. Reduce the miss rate,
  2. Reduce the miss penalty, or
  3. Reduce the time to hit in the cache.
Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
3Cs Absolute Miss Rate (SPEC92)
[Figure: absolute miss rate vs. cache size, broken down into compulsory, capacity, and conflict components]
2:1 Cache Rule
° Miss rate of a 1-way (direct-mapped) cache of size X = miss rate of a 2-way set-associative cache of size X/2
[Figure: miss rate vs. cache size, with the conflict-miss component illustrating the rule]
3Cs Relative Miss Rate
[Figure: miss rate relative to a direct-mapped cache, broken down into compulsory, capacity, and conflict components]
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
° 2:1 Cache Rule: miss rate of a direct-mapped cache of size N = miss rate of a 2-way cache of size N/2
° Beware: execution time is the only final measure!
  - Will clock cycle time increase?
  - Hill [1988] suggested the hit time for 2-way vs. 1-way is +10% for an external cache, +2% for an internal cache
Example: Avg. Memory Access Time vs. Miss Rate
° Assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the CCT of direct mapped

  Cache Size (KB)   1-way   2-way   4-way   8-way
        1           2.33    2.15    2.07    2.01
        2           1.98    1.86    1.76    1.68
        4           1.72    1.67    1.61    1.53
        8           1.46    1.48    1.47    1.43
       16           1.29    1.32    1.32    1.32
       32           1.20    1.24    1.25    1.27
       64           1.14    1.20    1.21    1.23
      128           1.10    1.17    1.18    1.20

(Red means A.M.A.T. not improved by more associativity)
3. Reducing Misses via a "Victim Cache"
° How to combine the fast hit time of direct mapped yet still avoid conflict misses?
° Add a buffer to hold data discarded from the cache
° Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache
° Used in Alpha, HP machines
[Figure: four victim-cache entries, each one cache line of data with a tag and comparator, sitting between the cache (tags/data) and the next lower level in the hierarchy]
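A minimal software sketch of the victim-cache lookup path, assuming a direct-mapped main cache and a 4-entry fully associative victim cache; the sizes, names, and swap policy here are illustrative, not Jouppi's actual hardware.

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES   32
#define DM_LINES     128            /* 4 KB direct-mapped cache (assumed)    */
#define VICTIM_LINES 4              /* small fully associative victim cache  */

typedef struct { bool valid; uint32_t line; } line_t;   /* line = full line address */

static line_t dm[DM_LINES];
static line_t victim[VICTIM_LINES];

/* Returns true if the access hits in the main cache or the victim cache. */
bool lookup(uint32_t addr) {
    uint32_t line = addr / LINE_BYTES;
    uint32_t idx  = line % DM_LINES;

    if (dm[idx].valid && dm[idx].line == line)
        return true;                                    /* fast direct-mapped hit */

    for (int i = 0; i < VICTIM_LINES; i++) {
        if (victim[i].valid && victim[i].line == line) {
            line_t evicted = dm[idx];                   /* swap: the victim entry      */
            dm[idx]   = victim[i];                      /* moves into the main cache,  */
            victim[i] = evicted;                        /* the evicted line takes its  */
            return true;                                /* place (conflict miss avoided) */
        }
    }
    return false;                                       /* true miss: go to the next level */
}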
4. Reducing Misses by Hardware Prefetching
° E.g., instruction prefetching
  - Alpha 21064 fetches 2 blocks on a miss
  - The extra block is placed in a "stream buffer"
  - On a miss, check the stream buffer
° Works with data blocks too:
  - Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4KB cache; 4 streams caught 43%
  - Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64KB, 4-way set-associative caches
° Prefetching relies on having extra memory bandwidth that can be used without penalty
  - Could reduce performance if done indiscriminately!
5. Reducing Misses by Software Prefetching Data
° Data prefetch
  - Load data into a register (HP PA-RISC loads)
  - Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9)
  - Special prefetching instructions cannot cause faults; a form of speculative execution
° Issuing prefetch instructions takes time
  - Is the cost of issuing prefetches < the savings from reduced misses?
  - Wider superscalar issue reduces the difficulty of finding issue bandwidth for prefetches
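As a concrete illustration of a cache-prefetch instruction, GCC and Clang expose one through the __builtin_prefetch intrinsic; the prefetch distance of 16 elements below is an arbitrary example value.

#include <stddef.h>

/* Sum an array while hinting upcoming elements into the cache.
 * __builtin_prefetch(addr, rw, locality) is a GCC/Clang builtin that
 * compiles to the target's prefetch instruction (or to nothing at all). */
double sum(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], /* rw = */ 0, /* locality = */ 1);
        s += a[i];
    }
    return s;
}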
6. Reducing Misses by Compiler Optimizations
° McFarling [1989] reduced cache misses by 75% (8KB direct-mapped cache, 4-byte blocks) in software
° Instructions
  - Reorder procedures in memory so as to reduce conflict misses
  - Profiling to look at conflicts (using tools they developed)
° Data (two of these are sketched in code below)
  - Merging arrays: improve spatial locality with a single array of compound elements vs. 2 separate arrays
  - Loop interchange: change the nesting of loops to access data in the order it is stored in memory
  - Loop fusion: combine 2 independent loops that have the same looping structure and some overlapping variables
  - Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
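Sketches of two of these data transformations in C; the array size and tile size are placeholders.

#define N 512
#define B 32                         /* tile size chosen to fit in the cache (assumed) */
static double a[N][N], b[N][N], c[N][N];

/* Loop interchange: C stores rows contiguously, so making j the inner
 * loop walks memory in the order it is stored (stride-1 accesses). */
void scale_interchanged(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)      /* was the outer loop before interchange */
            a[i][j] = 2.0 * a[i][j];
}

/* Blocking (tiling) for matrix multiply: reuse a B x B tile of b and c
 * while it is still resident in the cache instead of streaming whole
 * rows and columns for every result element. */
void matmul_blocked(void) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += b[i][k] * c[k][j];
                    a[i][j] += r;        /* a[][] must start zeroed; statics do */
                }
}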
Improving Cache Performance (Continued)
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
0. Reducing Penalty: Faster DRAM / Interface
° New DRAM technologies
  - RAMBUS: same initial latency, but much higher bandwidth
  - Synchronous DRAM
° Better bus interfaces
° CRAY technique: only use SRAM
1. Reducing Penalty: Read Priority over Write on Miss
° A write buffer is needed between the cache and memory
  - Processor: writes data into the cache and the write buffer
  - Memory controller: writes the contents of the buffer to memory
° The write buffer is just a FIFO
  - Typical number of entries: 4
  - Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
  - Must handle burst behavior as well!
[Figure: Processor -> Cache and Write Buffer -> DRAM]
RAW Hazards from Write Buffer!
° Write-buffer issues: could introduce a RAW hazard with memory!
  - The write buffer may contain the only copy of valid data
  - Reads from memory may get the wrong result if we ignore the write buffer
° Solutions:
  - Simply wait for the write buffer to empty before servicing reads
    - Might increase the read miss penalty (by 50% on the old MIPS 1000)
  - Check the write buffer contents before the read ("fully associative"); see the sketch below
    - If there is no conflict, let the memory access continue
    - Else grab the data from the buffer
° Can the write buffer help with write back?
  - Read miss replacing a dirty block
    - Copy the dirty block to the write buffer while starting the read to memory
  - The CPU stalls less since it can restart as soon as the read is done
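A sketch of the "check the write buffer before the read" solution with a tiny 4-entry buffer; the structure and names are invented for illustration.

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

typedef struct { bool valid; uint32_t addr; uint32_t data; } wb_entry_t;
static wb_entry_t wb[WB_ENTRIES];     /* entry [WB_ENTRIES-1] assumed to be the newest */

/* On a read miss, scan the small write buffer associatively.  If the
 * address matches a buffered store, forward that data (resolving the
 * RAW hazard) instead of waiting for the buffer to drain. */
uint32_t read_on_miss(uint32_t addr, uint32_t (*read_mem)(uint32_t)) {
    for (int i = WB_ENTRIES - 1; i >= 0; i--) {      /* newest match wins */
        if (wb[i].valid && wb[i].addr == addr)
            return wb[i].data;                       /* forwarded from the buffer */
    }
    return read_mem(addr);                           /* no conflict: read memory */
}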
2. Reduce Penalty: Early Restart and Critical Word First
° Don't wait for the full block to be loaded before restarting the CPU
  - Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  - Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first.
° Generally useful only with large blocks
° Spatial locality is a problem: the program tends to want the next sequential word, so it is not clear whether early restart actually helps
3. Reduce Penalty: Non-blocking Caches
° A non-blocking (lockup-free) cache allows the data cache to continue to supply hits during a miss
  - requires full/empty (F/E) bits on registers or out-of-order execution
  - requires multi-bank memories
° "Hit under miss" reduces the effective miss penalty by working during a miss instead of ignoring CPU requests
° "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  - Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  - Requires multiple memory banks (otherwise multiple misses cannot be serviced)
  - The Pentium Pro allows 4 outstanding memory misses
What happens on a Cache miss?
° For an in-order pipeline, 2 options:
  - Freeze the pipeline in the Mem stage (popular early on: Sparc, R4000)
      IF ID EX Mem stall stall stall ... stall Mem Wr
         IF ID EX  stall stall stall ... stall stall Ex Wr
  - Use Full/Empty bits in registers + an MSHR queue
    - MSHR = "Miss Status/Handler Registers" (Kroft). Each entry in this queue keeps track of the status of outstanding memory requests to one complete memory line.
      - Per cache line: keep info about the memory address.
      - For each word: the register (if any) that is waiting for the result.
      - Used to "merge" multiple requests to one memory line.
    - A new load creates an MSHR entry and sets the destination register to "Empty"; the load is "released" from the pipeline.
    - An attempt to use the register before the result returns causes the instruction to block in the decode stage.
    - Limited "out-of-order" execution with respect to loads. Popular with in-order superscalar architectures.
° Out-of-order pipelines already have this functionality built in (load queues, etc.)
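A rough data-structure sketch of an MSHR file that merges loads to the same memory line; the sizes and field names are invented, and Kroft's original design differs in detail.

#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS   8
#define MAX_TARGETS 4                /* loads that can merge onto one line */

typedef struct {
    bool     valid;                  /* entry tracks an outstanding miss   */
    uint32_t line_addr;              /* which memory line is in flight     */
    struct { bool used; uint8_t word, dest_reg; } target[MAX_TARGETS];
} mshr_t;

static mshr_t mshr[NUM_MSHRS];

/* On a load miss: merge with an existing miss to the same line if one
 * exists, else allocate a new MSHR.  Returns false if the pipeline must
 * stall (no free MSHR or no free target slot). */
bool note_load_miss(uint32_t line_addr, uint8_t word, uint8_t dest_reg) {
    mshr_t *m = 0;
    for (int i = 0; i < NUM_MSHRS && !m; i++)
        if (mshr[i].valid && mshr[i].line_addr == line_addr) m = &mshr[i];
    for (int i = 0; i < NUM_MSHRS && !m; i++)
        if (!mshr[i].valid) { m = &mshr[i]; m->valid = true; m->line_addr = line_addr; }
    if (!m) return false;                              /* structural stall */

    for (int t = 0; t < MAX_TARGETS; t++) {
        if (!m->target[t].used) {
            m->target[t].used = true;                  /* remember which word and       */
            m->target[t].word = word;                  /* which register is waiting;    */
            m->target[t].dest_reg = dest_reg;          /* the register itself is marked */
            return true;                               /* "Empty" elsewhere             */
        }
    }
    return false;                                      /* cannot merge: stall */
}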
Value of Hit Under Miss for SPEC
° FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
° Integer programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
° 8 KB data cache, direct mapped, 32B blocks, 16-cycle miss penalty
[Figure: AMAT for the integer and floating-point SPEC benchmarks under "hit under n misses" for the base case and n = 1, 2, and 64]
4. Reduce Penalty: Second-Level Cache
° L2 equations:
  AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
  Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
  AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
° Definitions:
  - Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
  - Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
  - The global miss rate is what matters
[Figure: Proc -> L1 Cache -> L2 Cache]
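Plugging made-up numbers into the L2 equations to show local vs. global miss rate; none of these rates are from the lecture.

#include <stdio.h>

int main(void) {
    /* Hypothetical parameters, in CPU cycles. */
    double hit_l1 = 1.0, hit_l2 = 10.0, penalty_l2 = 100.0;
    double miss_l1 = 0.05;            /* L1 miss rate (per CPU access)      */
    double miss_l2 = 0.20;            /* L2 *local* miss rate (per L1 miss) */

    double penalty_l1 = hit_l2 + miss_l2 * penalty_l2;
    double amat       = hit_l1 + miss_l1 * penalty_l1;
    double global_l2  = miss_l1 * miss_l2;   /* L2 misses per CPU access */

    printf("Miss Penalty L1     = %.1f cycles\n", penalty_l1);   /* 30.0  */
    printf("AMAT                = %.2f cycles\n", amat);         /* 2.50  */
    printf("Global L2 miss rate = %.3f\n", global_l2);           /* 0.010 */
    return 0;
}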
Reducing Misses: Which Apply to the L2 Cache?
° Reducing the miss rate:
  1. Reduce misses via larger block size
  2. Reduce conflict misses via higher associativity
  3. Reduce conflict misses via a victim cache
  4. Reduce misses by HW prefetching of instructions and data
  5. Reduce misses by SW prefetching of data
  6. Reduce capacity/conflict misses by compiler optimizations
L2 Cache Block Size & A.M.A.T.
° 32KB L1, 8-byte path to memory
[Figure: A.M.A.T. vs. L2 cache block size]
Improving Cache Performance (Continued)
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache:
   - Lower associativity (+ victim caching or a 2nd-level cache)?
   - Multiple-cycle cache access (e.g., R4000)
   - Harvard architecture
   - Careful virtual memory design (rest of lecture!)
Example: Harvard Architecture
° Sample statistics:
  - 16KB I & 16KB D: instruction miss rate = 0.64%, data miss rate = 6.47%
  - 32KB unified: aggregate miss rate = 1.99%
° Which is better (ignoring the L2 cache)?
  - Assume 33% loads/stores, hit time = 1, miss time = 50
  - Note: a data hit has 1 extra stall in the unified cache (it has only one port)
  - AMAT_Harvard = (1/1.33) x (1 + 0.64% x 50) + (0.33/1.33) x (1 + 6.47% x 50) = 2.05
  - AMAT_Unified = (1/1.33) x (1 + 1.99% x 50) + (0.33/1.33) x (1 + 1 + 1.99% x 50) = 2.24
[Figure: unified organization (Proc -> Unified Cache-1 -> Unified Cache-2) vs. Harvard organization (Proc -> separate I-Cache-1 and D-Cache-1 -> Unified Cache-2)]
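The slide's arithmetic, spelled out in code with the slide's own numbers (the Harvard value prints as 2.04 before rounding to the slide's 2.05).

#include <stdio.h>

int main(void) {
    double hit = 1.0, miss_time = 50.0;
    double frac_inst = 1.0 / 1.33, frac_data = 0.33 / 1.33;

    /* Split (Harvard) 16KB I-cache + 16KB D-cache. */
    double amat_harvard = frac_inst * (hit + 0.0064 * miss_time)
                        + frac_data * (hit + 0.0647 * miss_time);

    /* 32KB unified cache: data accesses pay one extra stall cycle
     * because the single port is busy with the instruction fetch. */
    double amat_unified = frac_inst * (hit + 0.0199 * miss_time)
                        + frac_data * (hit + 1.0 + 0.0199 * miss_time);

    printf("AMAT Harvard = %.2f\n", amat_harvard);   /* 2.04 (slide rounds to 2.05) */
    printf("AMAT Unified = %.2f\n", amat_unified);   /* 2.24 */
    return 0;
}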
Recall: Levels of the Memory Hierarchy

  Level          Capacity        Access Time     Cost
  Registers      100s of bytes   < 10s ns
  Cache          KBytes          10-100 ns       $.01-.001/bit
  Main Memory    MBytes          100 ns - 1 us   $.01-.001
  Disk           GBytes          ms              10^-3 - 10^-4 cents
  Tape           infinite        sec-min         10^-6 cents

  Staging transfer units between levels:
  - Registers <-> Cache: instruction operands, managed by the program/compiler, 1-8 bytes
  - Cache <-> Memory: blocks, managed by the cache controller, 8-128 bytes
  - Memory <-> Disk: pages, managed by the OS, 512 bytes - 4 KB
  - Disk <-> Tape: files, managed by the user/operator, MBytes

Upper levels are faster; lower levels are larger.
What is virtual memory?
° Virtual memory => treat main memory as a cache for the disk
° Terminology: blocks in this cache are called "pages"
  - Typical size of a page: 1K - 8K
° The page table maps virtual page numbers to physical frames
  - "PTE" = Page Table Entry
[Figure: a virtual address (virtual page number, offset) indexes the page table, located in physical memory via the page table base register; each PTE holds a valid bit, access rights, and a physical frame number, which is concatenated with the offset to form the physical address, mapping the virtual address space onto the physical address space]
Three Advantages of Virtual Memory
° Translation:
  - A program can be given a consistent view of memory, even though physical memory is scrambled
  - Makes multithreading reasonable (now used a lot!)
  - Only the most important part of the program (the "working set") must be in physical memory
  - Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later
° Protection:
  - Different threads (or processes) are protected from each other
  - Different pages can be given special behavior (read only, invisible to user programs, etc.)
  - Kernel data is protected from user programs
  - Very important for protection from malicious programs => far more "viruses" under Microsoft Windows
° Sharing:
  - Can map the same physical page to multiple users ("shared memory")
Issues in Virtual Memory System Design
° What is the size of the information blocks transferred from secondary to main storage (M)?
  - page size (contrast with the physical block size on disk, i.e., the sector size)
° Which region of M is to hold the new block?
  - placement policy
° How do we find a page when we look for it?
  - block identification
° If a block of information is brought into M and M is full, some region of M must be released to make room for the new block
  - replacement policy
° What do we do on a write?
  - write policy
° Missing items are fetched from secondary memory only on the occurrence of a fault
  - demand load policy
[Figure: reg <-> cache <-> mem <-> disk, with pages moving between memory frames and the disk]
How big is the translation (page) table?
° The simplest way to implement a "fully associative" lookup policy is with a large lookup table
° Each entry in the table is some number of bytes, say 4
° With 4K pages and a 32-bit address space, we need: 2^32 / 4K = 2^20 = 1M entries x 4 bytes = 4 MB
° With 4K pages and a 64-bit address space, we need: 2^64 / 4K = 2^52 entries = BIG!
° Can't keep the whole page table in memory!
(Virtual address = virtual page number | page offset)
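The same size arithmetic in code, with 4 KB pages and 4-byte PTEs as on the slide.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t page = 4096;                     /* 4 KB pages  */
    uint64_t pte  = 4;                        /* 4-byte PTEs */

    uint64_t entries32 = (1ULL << 32) / page;                 /* 2^20 = 1M entries */
    printf("32-bit VA: %llu entries, %llu MB of page table\n",
           (unsigned long long)entries32,
           (unsigned long long)((entries32 * pte) >> 20));    /* 4 MB */

    /* 2^64 / 2^12 = 2^52 entries; computed as (2^63 / 2^12) * 2 to avoid a 64-bit shift. */
    uint64_t entries64 = ((1ULL << 63) / page) * 2;
    printf("64-bit VA: %llu entries (2^52) -> far too big for one flat table\n",
           (unsigned long long)entries64);
    return 0;
}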
Large Address Spaces: Two-level Page Tables
° 32-bit address split as: P1 index (10 bits) | P2 index (10 bits) | page offset (12 bits)
  - 4 KB pages, 4-byte PTEs, 1K PTEs per table
° 2 GB virtual address space
° 4 MB of PTE2 tables, paged, with holes
° 4 KB of PTE1 (the root table)
° What about a 48-64 bit address space?
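A sketch of the two-level walk for that 10/10/12 split; the table layout, the valid bit in bit 0 of the PTE, and the variable names are all assumptions for illustration.

#include <stdint.h>

/* Hypothetical in-memory tables: a 1K-entry root table whose entries point
 * at second-level tables of 1K PTEs (or are NULL where the 4 MB region is a hole). */
extern uint32_t *pte1[1024];                  /* root table: 4 KB */

/* Translate a 32-bit virtual address; returns 0 if the mapping is absent. */
uint32_t translate(uint32_t va) {
    uint32_t p1     = (va >> 22) & 0x3FF;     /* top 10 bits: root index        */
    uint32_t p2     = (va >> 12) & 0x3FF;     /* next 10 bits: 2nd-level index  */
    uint32_t offset =  va        & 0xFFF;     /* low 12 bits: page offset       */

    uint32_t *l2 = pte1[p1];
    if (l2 == 0) return 0;                    /* hole: no 2nd-level table       */

    uint32_t pte = l2[p2];
    if ((pte & 1) == 0) return 0;             /* assumed valid bit              */

    return (pte & 0xFFFFF000u) | offset;      /* physical frame number + offset */
}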
Inverted Page Tables
° The IBM System 38 (AS400) implements 64-bit addresses
  - 48 bits are translated
  - the start of an object contains a 12-bit tag
° => TLBs or virtually addressed caches are critical
[Figure: the virtual page number is hashed into a table of (V.Page, P.Frame) entries; the stored virtual page is compared against the lookup key to confirm the match]
Virtual Address and a Cache: Step backward???
° Virtual memory seems to be really slow:
  - We must access memory (the page table) on every load/store, even on cache hits!
  - Worse, if the translation is not completely in memory, we may need to go to disk before hitting in the cache!
° Solution: caching! (surprise!)
  - Keep track of the most common translations and place them in a "Translation Lookaside Buffer" (TLB)
[Figure: CPU -> translation -> cache (VA -> PA); a hit returns data from the cache, a miss goes to main memory]
Making address translation practical: TLB
° Virtual memory => memory acts like a cache for the disk
° The page table maps virtual page numbers to physical frames
° The Translation Lookaside Buffer (TLB) is a cache of translations
[Figure: a virtual address (page, offset) is looked up in the TLB; on a hit the physical frame number replaces the page number to form the physical address, otherwise the page table provides the mapping from the virtual address space to the physical memory space]
TLB organization: include protection
° A TLB is usually organized as a fully associative cache
  - Lookup is by virtual address
  - Returns the physical address + other info
° Entry fields:
  - Dirty => page modified (Y/N)?
  - Ref => page touched (Y/N)?
  - Valid => TLB entry valid (Y/N)?
  - Access => read? write?
  - ASID => which user?

  Virtual Address   Physical Address   Dirty   Ref   Valid   Access   ASID
  0xFA00            0x0003             Y       N     Y       R/W      34
  0x0040            0x0010             N       Y     Y       R        0
  0x0041            0x0011             N       Y     Y       R        0
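A software model of a fully associative TLB with the fields above; the entry count matches the R3000 slide that follows, but the field widths and encoding are invented.

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64

typedef struct {
    bool     valid, dirty, ref;
    uint8_t  access;             /* bit 0 = read allowed, bit 1 = write allowed (assumed) */
    uint8_t  asid;               /* which address space owns this entry */
    uint32_t vpn, pfn;           /* virtual page number -> physical frame number */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Fully associative lookup: compare the VPN and ASID against every entry.
 * A real design would distinguish a protection fault from a TLB miss. */
bool tlb_lookup(uint32_t vpn, uint8_t asid, bool is_write, uint32_t *pfn_out) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        tlb_entry_t *e = &tlb[i];
        if (e->valid && e->vpn == vpn && e->asid == asid) {
            if (is_write && !(e->access & 2))
                return false;                    /* protection violation -> fault */
            e->ref = true;                       /* page touched                  */
            if (is_write) e->dirty = true;       /* page modified                 */
            *pfn_out = e->pfn;
            return true;                         /* TLB hit                       */
        }
    }
    return false;                                /* TLB miss: walk the page table */
}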
Example: R3000 pipeline includes TLB stages
° MIPS R3000 pipeline: Inst Fetch | Dcd/Reg | ALU / E.A. | Memory | Write Reg
  - The TLB and I-Cache are accessed during instruction fetch; the effective address is translated by the TLB before the D-Cache access; register write-back comes last
° Virtual address: ASID (6 bits) | V. Page Number (20 bits) | Offset (12 bits)
° Virtual address space:
  - 0xx: user segment (caching based on PT/TLB entry)
  - 100: kernel physical space, cached
  - 101: kernel physical space, uncached
  - 11x: kernel virtual space
° The ASID allows context switching among 64 user processes without a TLB flush
° TLB: 64 entries, on-chip, fully associative, software TLB fault handler
What is the replacement policy for TLBs?
° On a TLB miss, we check the page table for an entry. Two architectural possibilities:
  - Hardware "table walk" (Sparc, among others)
    - The structure of the page table must be known to the hardware
  - Software "table walk" (MIPS was one of the first)
    - Lots of flexibility
    - Can be expensive with modern operating systems
° What if the missing entry is not in the page table?
  - This is called a "page fault": the requested virtual page is not in memory
  - The operating system must take over:
    - pick a page to discard (possibly writing it to disk)
    - start loading the page in from disk
    - schedule some other process to run
° Note: it is possible that parts of the page table are not even in memory (i.e., paged out!)
  - The root of the page table is always "pegged" in memory
Page Replacement: Not Recently Used (1-bit LRU, Clock)
° The set of all pages in memory is swept by two pointers:
  - Tail pointer: marks pages as "not used recently"
  - Head pointer: places pages on the free list if they are still marked "not used"; schedules dirty pages for writing to disk
° Freed frames go on the free list of free pages
[Figure: clock diagram of the page set with the head and tail pointers and the free list]
Page Replacement: Not Recently Used (1-bit LRU, Clock)
° Associated with each page is a "used" flag:
  - used flag = 1 if the page has been referenced in the recent past, 0 otherwise
  - If replacement is necessary, choose any page frame whose reference bit is 0; this is a page that has not been referenced in the recent past
° A "last replaced pointer" (lrp) sweeps the page table entries (a sketch follows below):
  - If replacement is to take place, advance the lrp to the next entry (mod table size) until one with a 0 bit is found; this is the target for replacement
  - As a side effect, all examined PTEs have their reference bits set to zero
  - Or search for a page that is both not recently referenced AND not dirty (using the used and dirty bits)
° Architecture part: support dirty and used bits in the page table
  - => may need to update the PTE on any instruction fetch, load, or store
  - How does the TLB affect this design problem? Software TLB miss? Page fault handler?
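A compact sketch of that clock sweep; the frame count is arbitrary and the disk write-back is left as a stub.

#include <stdbool.h>

#define NUM_FRAMES 256

static bool used[NUM_FRAMES];       /* reference ("used") bit per page frame  */
static bool dirty[NUM_FRAMES];      /* set on stores to the page              */
static int  lrp = 0;                /* last-replaced pointer: the clock hand  */

/* Advance the hand until a frame with used == 0 is found, clearing the
 * used bit of every frame passed over.  Returns the victim frame number. */
int clock_replace(void) {
    for (;;) {
        if (!used[lrp]) {
            int victim = lrp;
            lrp = (lrp + 1) % NUM_FRAMES;
            if (dirty[victim]) {
                /* schedule the page for write-back to disk before reuse */
            }
            return victim;
        }
        used[lrp] = false;                    /* second chance */
        lrp = (lrp + 1) % NUM_FRAMES;
    }
}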
Reducing translation time further
° As described, the TLB lookup is in series with the cache lookup:
  - virtual address -> TLB lookup (valid bit, access rights, physical page number) -> physical address -> cache
° Modern machines with TLBs go one step further: they overlap the TLB lookup with the cache access
  - This works because the lower bits of the address (the page offset) are not translated and are available early
Overlapped TLB & Cache Access
° If we do the TLB lookup and the cache access in parallel, we have to be careful, however:
[Figure: the 20-bit virtual page number goes to the TLB (associative lookup) while the 12-bit displacement indexes a 4K cache (10-bit index, 2-bit byte offset, 4-byte blocks, 1K entries); the frame number (FN) from the TLB is compared with the cache tag to produce hit/miss]
° What if the cache size is increased to 8KB?
Problems With Overlapped TLB Access
° Overlapped access only works as long as the address bits used to index into the cache do not change as a result of VA translation
° This usually limits things to small caches, large page sizes, or highly set-associative caches if you want a large cache
° Example: suppose everything is the same except that the cache is increased to 8 KB instead of 4 KB:
  - The cache index now needs 11 bits, which together with the 2-bit byte offset spans 13 bits, one more than the 12-bit page offset (virtual page # = 20 bits, displacement = 12 bits)
  - That extra bit comes from the virtual page number: it is changed by VA translation, but it is needed for the cache lookup
° Solutions: go to 8KB page sizes; go to a 2-way set-associative cache (two 4KB ways, 10-bit index, 4-byte blocks); or have SW guarantee VA[13] = PA[13]
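The constraint from this example as a small check: the cache's untranslated index + offset bits must fit within the page-offset bits (numbers match the slide's 4 KB pages and direct-mapped caches).

#include <stdio.h>

static unsigned log2u(unsigned x) { unsigned b = 0; while (x > 1) { x >>= 1; b++; } return b; }

int main(void) {
    unsigned page_bytes    = 4096;                    /* 12 untranslated offset bits */
    unsigned cache_sizes[] = { 4096, 8192 };          /* 4 KB and 8 KB caches        */
    unsigned ways          = 1;                       /* direct mapped               */

    for (int i = 0; i < 2; i++) {
        unsigned bits_needed    = log2u(cache_sizes[i] / ways);  /* index + block offset */
        unsigned bits_available = log2u(page_bytes);
        printf("%u KB, %u-way: needs %u bits, page offset gives %u -> %s\n",
               cache_sizes[i] / 1024, ways, bits_needed, bits_available,
               bits_needed <= bits_available ? "overlap OK" : "overlap breaks");
    }
    return 0;
}

Doubling the associativity to 2-way halves the per-way size back to 4 KB, which is why that is one of the solutions listed above.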
Another option: Virtually Addressed Cache
° Only require address translation on a cache miss!
° Synonym problem: two different virtual addresses map to the same physical address
  - => two different cache entries holding data for the same physical address!
  - A nightmare for updates: all cache entries with the same physical address must be updated, or memory becomes inconsistent
  - Detecting this requires significant hardware, essentially an associative lookup on the physical address tags to see if there are multiple hits (usually disallowed by fiat)
[Figure: CPU -> cache accessed with the VA; on a hit, data returns directly; on a miss, translation produces the PA used to access main memory]
Cache Optimization: Alpha 21064
° TLBs: fully associative
° TLB updates in SW ("Priv Arch Libr")
° Separate instruction & data TLBs and caches
° Caches: 8KB, direct mapped, write through
° Critical 8 bytes first
° Prefetch: instruction stream buffer
° 4-entry write buffer between the D$ and the L2$
° 2 MB L2 cache, direct mapped (off-chip)
° 256-bit path to main memory, 4 x 64-bit modules
° Victim buffer: to give reads priority over writes
[Figure: instruction and data caches surrounded by the stream buffer, write buffer, and victim buffer]
Summary #1/2: TLB, Virtual Memory
° Caches, TLBs, and virtual memory can all be understood by examining how they deal with 4 questions:
  1) Where can a block be placed?
  2) How is a block found?
  3) What block is replaced on a miss?
  4) How are writes handled?
° More cynical version of this: everything in computer architecture is a cache!
° Techniques people use to improve the miss rate of caches (MR = miss rate, MP = miss penalty, HT = hit time):

  Technique                         MR   MP   HT   Complexity
  Larger Block Size                 +    -         0
  Higher Associativity              +         -    1
  Victim Caches                     +              2
  Pseudo-Associative Caches         +              2
  HW Prefetching of Instr/Data      +              2
  Compiler Controlled Prefetching   +              3
  Compiler Reduce Misses            +              0
Summary #2/2: Virtual Memory
° VM allows many processes to share a single memory without having to swap all processes to disk
° Translation, protection, and sharing are more important than the memory hierarchy
° Page tables map virtual addresses to physical addresses
  - TLBs are a cache on translations and are extremely important for good performance
  - Special tricks are necessary to keep the TLB out of the critical cache-access path
  - TLB misses are significant in processor performance:
    - These are funny times: most systems can't access all of the 2nd-level cache without TLB misses!
Acknowledgements
° The majority of the slides in this lecture are from UC Berkeley's CS152 course (David Patterson, John Kubiatowicz, ...)