
1 CS2100 Computer Organisation http://www.comp.nus.edu.sg/~cs2100/ Cache II (AY2010/2011) Semester 2

2 CACHE II
Type of Cache Misses
Direct-Mapped Cache: Conflict
Set Associative Cache
Fully Associative Cache
Block Replacement Policy
Cache Framework
Improving Miss Penalty
Multi-Level Caches
Virtual Memory (Non-examinable)

3 RECAP: QUESTION 1
A 128-byte direct-mapped cache has 32-byte cache blocks.
How many cache blocks are there?
How many bits for the cache index?
How many bits for the block offset?
How many bits for the tag?
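
As a quick check, here is a minimal sketch in C (assuming 32-bit addresses, as elsewhere in this lecture) that derives the answers; it prints 4 blocks, a 5-bit offset, a 2-bit index, and a 25-bit tag.

    #include <stdio.h>

    /* Sketch: derive direct-mapped cache field widths.
       Assumes 32-bit addresses, as in this lecture. */
    static int log2i(unsigned x) {    /* x must be a power of two */
        int n = 0;
        while (x > 1) { x >>= 1; n++; }
        return n;
    }

    int main(void) {
        unsigned cache_size = 128, block_size = 32;      /* bytes */
        unsigned num_blocks = cache_size / block_size;   /* 4 blocks */
        int offset_bits = log2i(block_size);             /* 5 bits */
        int index_bits  = log2i(num_blocks);             /* 2 bits */
        int tag_bits    = 32 - offset_bits - index_bits; /* 25 bits */
        printf("blocks=%u offset=%d index=%d tag=%d\n",
               num_blocks, offset_bits, index_bits, tag_bits);
        return 0;
    }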

4 RECAP: QUESTION 2
[Figure: memory addresses grouped into 8-byte memory blocks, mapping into a direct-mapped cache with cache indices 0-3.]
Take address 16.
What is the memory block number?
What is the block offset?
What is the cache index?
What is the tag?

5 RECAP: QUESTION 3
[Figure: the same direct-mapped cache as in Question 2, with 8-byte blocks and cache indices 0-3.]
Take address 36.
What is the memory block number?
What is the block offset?
What is the cache index?
What is the tag?
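
A small companion sketch (same assumptions: 4 cache blocks of 8 bytes each) that decomposes both recap addresses. It prints block=2, offset=0, index=2, tag=0 for address 16, and block=4, offset=4, index=0, tag=1 for address 36.

    #include <stdio.h>

    /* Decompose byte addresses for a direct-mapped cache with
       four 8-byte blocks (Recap Questions 2 and 3). */
    int main(void) {
        unsigned addrs[] = {16, 36};
        for (int i = 0; i < 2; i++) {
            unsigned a      = addrs[i];
            unsigned block  = a / 8;      /* memory block number   */
            unsigned offset = a % 8;      /* block offset (3 bits) */
            unsigned index  = block % 4;  /* cache index (2 bits)  */
            unsigned tag    = block / 4;  /* remaining high bits   */
            printf("addr %2u: block=%u offset=%u index=%u tag=%u\n",
                   a, block, offset, index, tag);
        }
        return 0;
    }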

6 TYPES OF CACHE MISSES
Cold/Compulsory Miss: first time a memory address is accessed
- Cold fact of life; not much we can do about it
- Solution: increase the cache block size
Conflict Miss: two or more distinct memory blocks map to the same cache block
- Big problem in direct-mapped caches
- Solution 1: increase the cache size (there is an inherent restriction on cache size due to SRAM technology)
- Solution 2: set-associative caches (coming next)
Capacity Miss: due to limited cache size
- Goes away if the cache size can be increased to fully accommodate the working set
- Vague definition for now; will become clear when we discuss fully associative caches

7 BLOCK SIZE TRADEOFF (1/2)
In general, a larger block size takes advantage of spatial locality, BUT:
- Larger block size means larger miss penalty: it takes longer to fill up the block
- If the block size is too big relative to the cache size, the miss rate will go up: too few cache blocks
Average access time = Hit rate × Hit time + (1 − Hit rate) × Miss penalty
[Figures: miss penalty grows with block size; miss rate first falls (exploiting spatial locality), then rises once there are too few blocks (compromising temporal locality); average access time is therefore U-shaped in block size.]
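
To make the formula concrete with illustrative numbers (assumed here, not from the slides): a 95% hit rate, a 1-cycle hit time, and a 20-cycle miss penalty give an average access time of 0.95 × 1 + 0.05 × 20 = 1.95 cycles.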

8 BLOCK SIZE TRADEOFF (2/2)

9 DIRECT-MAPPED CACHE: CONFLICT
Lots of people's last names start with L.
We cannot cache both Jin Wei and Nicholas.
What if you need both Jin Wei and Nicholas repeatedly? Frequent cache misses.
Solution: add more slots for each letter.
[Figure: a directory cache with one slot per letter A-Z; a lookup for LEE Nicholas finds slot L occupied by LEE Jin Wei.]

10 SET ASSOCIATIVE CACHE (1/2)
Two slots for each letter.
We can cache both Jin Wei and Nicholas.
Search both slots in parallel.
[Figure: slot L now holds both LEE Jin Wei and LEE Nicholas; a lookup for LEE Nicholas searches both entries in parallel.]

11 SET ASSOCIATIVE CACHE (2/2)
A memory block can be placed in a fixed number of locations (n > 1) in the cache → n-way set associative cache.
An n-way set associative cache consists of a number of sets, where each set contains n cache blocks.
Each memory block maps to a unique cache set.
Within the set, a memory block can be placed in any element of the set.

12 SET ASSOCIATIVE CACHE STRUCTURE
[Figure: a 2-way set associative cache with set indices 00-11; each set holds two cache blocks, each with Valid, Tag, and Data fields.]
Each cache set contains two cache blocks.
A memory block maps to a unique cache set.
Within that cache set, the memory block can be placed in either block.
Both cache blocks within a set are searched in parallel.

13 SET ASSOCIATIVE MAPPING
[Figure: a 32-bit memory address split into Tag (bits 31 to N+M), Set Index (bits N+M−1 to N), and Offset (bits N−1 to 0); the Tag and Set Index together form the block number.]
Cache Set Index = (Block number) modulo (Number of cache sets)
Block size = 2^N bytes → Offset: N bits
Number of cache sets = 2^M → Set Index: M bits
Tag: 32 − (N + M) bits
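
A minimal sketch of this mapping in C; the constants N and M match the 4KB, 4-way example on the next two slides, and the address is an arbitrary illustration.

    #include <stdio.h>

    /* Set associative mapping: offset = low N bits,
       set index = next M bits, tag = remaining high bits. */
    #define N 2   /* block size = 2^N = 4 bytes      */
    #define M 8   /* number of sets = 2^M = 256 sets */

    int main(void) {
        unsigned addr   = 0x12345678;                    /* example address */
        unsigned offset = addr & ((1u << N) - 1);        /* bits N-1..0   */
        unsigned set    = (addr >> N) & ((1u << M) - 1); /* bits N+M-1..N */
        unsigned tag    = addr >> (N + M);               /* bits 31..N+M  */
        printf("offset=%u set=%u tag=0x%x\n", offset, set, tag);
        return 0;
    }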

14 SET ASSOCIATIVE CACHE: EXAMPLE (1/2)
We have 4GB main memory and the block size is 2^2 = 4 bytes (N = 2).
How many memory blocks are there? 4GB / 4 bytes = 1G memory blocks
How many bits are required to uniquely identify the memory blocks? 1G = 2^30 memory blocks will require 30 bits; we can also calculate it as 32 − N = 30
We have a 4KB cache. How many cache blocks are there? 4KB / 4 bytes = 1024 cache blocks

15 SET ASSOCIATIVE CACHE: EXAMPLE (2/2)
It is a 4-way set associative cache. How many cache sets are there? 1024 cache blocks / 4 = 256 cache sets
How many bits for the set index? 256 = 2^8 cache sets → 8-bit set index (M = 8)
How many memory blocks can map to the same cache set? 1G memory blocks / 256 cache sets = 4M
How many tag bits do we need? 4M = 2^22 memory blocks can be uniquely distinguished using 22 bits; we can also calculate it as 32 − (M + N) = 22
[Address layout: tag = bits 31-10, index = bits 9-2, offset = bits 1-0]

16 SET ASSOCIATIVE CACHE CIRCUITRY
[Figure: 4-way set associative cache datapath with 4-byte blocks and 256 cache sets. Address bits 9-2 (8-bit index) select one of 256 sets; the 22-bit tag (bits 31-10) is compared in parallel against the tags of all four blocks in that set, each qualified by its valid bit; bits 1-0 are the byte offset. A 4-to-1 multiplexer selects the 32-bit data of the matching way, and any match raises Hit.]

17 ADVANTAGE OF ASSOCIATIVITY (1/3)
[Figure: memory blocks 0-15; blocks 0 and 4 both map to cache index 00 of a direct-mapped cache with 4 cache blocks (indices 00-11).]
Given the following memory access sequence (in memory block numbers): 0 4 0 4 0 4 0 4
All the accesses will be cache misses in a direct-mapped cache with 4 cache blocks.
Cold miss: first 2 accesses. Conflict miss: next 6 accesses.

18 ADVANTAGE OF ASSOCIATIVITY (2/3)
[Figure: the same memory blocks mapped into a 2-way set associative cache with set indices 00-01; blocks 0 and 4 both map to set 00 and can reside there together.]
Given the same memory access sequence (in memory block numbers): 0 4 0 4 0 4 0 4
Only the first 2 accesses will be cache misses in a 2-way set associative cache with 4 blocks; all the rest are hits.
Cold miss: first 2 accesses. Conflict miss: none.

19 ADVANTAGE OF ASSOCIATIVITY (3/3)
Rule of Thumb: a direct-mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2.

20 EXAMPLE (1/9)
Here is a series of memory address references: 4, 0, 8, 36, 0, for a 2-way set-associative cache with four 8-byte blocks. Indicate hit/miss for each reference.
Take address = 36:
Memory block number = ⌊36/8⌋ = 4
Block offset = 36 mod 8 = 4
Cache set index = 4 mod 2 = 0
Tag = ⌊4/2⌋ = 2
Binary: ...010 0 100 (tag | set index | offset)

21 EXAMPLE (2/9)
Addresses: 4, 0, 8, 36, 0
4 = ...000 0 100 → Set Index = 0, Tag = 0, Offset = 4
Both blocks in set 0 are invalid → Cold/Compulsory Miss
Set | Valid Tag Word0 Word1 | Valid Tag Word0 Word1
 0  |   0    -    -     -   |   0    -    -     -
 1  |   0    -    -     -   |   0    -    -     -

22 EXAMPLE (3/9)
Addresses: 4, 0, 8, 36, 0
4 = ...000 0 100 → Set Index = 0, Tag = 0, Offset = 4
Put the memory block in set 0
Set | Valid Tag Word0 Word1 | Valid Tag Word0 Word1
 0  |   1    0   M[0]  M[4] |   0    -    -     -
 1  |   0    -    -     -   |   0    -    -     -

23 EXAMPLE (4/9)
Addresses: 4, 0, 8, 36, 0
0 = ...000 0 000 → Set Index = 0, Tag = 0, Offset = 0
Valid bit set + tag match for the first block in set 0 → Cache Hit due to spatial locality
Set | Valid Tag Word0 Word1 | Valid Tag Word0 Word1
 0  |   1    0   M[0]  M[4] |   0    -    -     -
 1  |   0    -    -     -   |   0    -    -     -

24 EXAMPLE (5/9)
Addresses: 4, 0, 8, 36, 0
8 = ...000 1 000 → Set Index = 1, Tag = 0, Offset = 0
Both blocks in set 1 are invalid → Cold/Compulsory Miss
Set | Valid Tag Word0 Word1 | Valid Tag Word0 Word1
 0  |   1    0   M[0]  M[4] |   0    -    -     -
 1  |   0    -    -     -   |   0    -    -     -

25 EXAMPLE (6/9)
Addresses: 4, 0, 8, 36, 0
8 = ...000 1 000 → Set Index = 1, Tag = 0, Offset = 0
Put the memory block in set 1
Set | Valid Tag Word0 Word1  | Valid Tag Word0 Word1
 0  |   1    0   M[0]  M[4]  |   0    -    -     -
 1  |   1    0   M[8]  M[12] |   0    -    -     -

26 EXAMPLE (7/9)
Addresses: 4, 0, 8, 36, 0
36 = ...010 0 100 → Set Index = 0, Tag = 2, Offset = 4
First block in set 0: tag mismatch. Second block in set 0: invalid → Cold/Compulsory Miss
Set | Valid Tag Word0 Word1  | Valid Tag Word0 Word1
 0  |   1    0   M[0]  M[4]  |   0    -    -     -
 1  |   1    0   M[8]  M[12] |   0    -    -     -

27 EXAMPLE (8/9)
Addresses: 4, 0, 8, 36, 0
36 = ...010 0 100 → Set Index = 0, Tag = 2, Offset = 4
Put the memory block in the second block of set 0
Set | Valid Tag Word0 Word1  | Valid Tag Word0  Word1
 0  |   1    0   M[0]  M[4]  |   1    2   M[32] M[36]
 1  |   1    0   M[8]  M[12] |   0    -    -     -

28 EXAMPLE (9/9)
Addresses: 4, 0, 8, 36, 0
0 = ...000 0 000 → Set Index = 0, Tag = 0, Offset = 0
First block in set 0: match. Second block in set 0: mismatch → Cache Hit due to temporal locality
Set | Valid Tag Word0 Word1  | Valid Tag Word0  Word1
 0  |   1    0   M[0]  M[4]  |   1    2   M[32] M[36]
 1  |   1    0   M[8]  M[12] |   0    -    -     -
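
The whole walkthrough can be replayed with a short simulation. This is a sketch under the same geometry (2 sets, 2 ways, 8-byte blocks); on a miss it simply fills an invalid way, since this particular trace never needs a real eviction. It prints miss, hit, miss, miss, hit, matching slides 21-28.

    #include <stdio.h>

    struct line { int valid; unsigned tag; };

    int main(void) {
        struct line cache[2][2] = {0};    /* [set][way], all invalid */
        unsigned addrs[] = {4, 0, 8, 36, 0};

        for (int i = 0; i < 5; i++) {
            unsigned block = addrs[i] / 8;    /* memory block number */
            unsigned set   = block % 2;       /* 2 cache sets        */
            unsigned tag   = block / 2;
            int hit = 0;
            for (int w = 0; w < 2; w++)       /* both ways searched in parallel */
                if (cache[set][w].valid && cache[set][w].tag == tag)
                    hit = 1;
            if (!hit) {                       /* fill an invalid way */
                int w = cache[set][0].valid ? 1 : 0;
                cache[set][w].valid = 1;
                cache[set][w].tag   = tag;
            }
            printf("addr %2u: set=%u tag=%u -> %s\n",
                   addrs[i], set, tag, hit ? "hit" : "miss");
        }
        return 0;
    }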

29 FULLY ASSOCIATIVE (FA) CACHE
Extreme case of the set-associative cache.
A memory block can be placed in any cache block.
Need to search all the cache blocks in parallel.
Advantage: zero conflict misses.
Disadvantage: high hardware cost due to the comparators.
Not used except for very small caches.
[Figure: a lookup for LEE Nicholas is compared against every entry simultaneously; each comparison answers Yes or No.]

30 DIRECT-MAPPED → FULLY ASSOCIATIVE

31 FULLY ASSOCIATIVE MAPPING
[Figure: a 32-bit memory address split into Tag (bits 31 to N) and Offset (bits N−1 to 0); there is no index, only the tag.]
Block size = 2^N bytes → Offset: N bits
No index; only a tag: 32 − N bits

32 FULLY ASSOCIATIVE CACHE CIRCUITRY
[Figure: fully associative cache with a 32-byte block size; the 27-bit cache tag is compared in parallel against every entry's tag (one comparator per cache block, qualified by the valid bit), while byte offset bits 4-0 select among bytes B0-B31 of the cache data. No conflict misses, since data can go anywhere.]

33 CACHE PERFORMANCE
Observations (for identical block sizes):
1. Cold/compulsory misses remain the same irrespective of cache size/associativity
2. For the same cache size, conflict misses go down with increasing associativity
3. Conflict misses are zero for fully associative caches
4. For the same cache size, capacity misses remain the same irrespective of associativity
5. Capacity misses decrease with increasing cache size
Total misses = Cold misses + Conflict misses + Capacity misses
Capacity misses (FA) = Total misses (FA) − Cold misses (FA)

34 BLOCK REPLACEMENT POLICY (1/3)
n-way set associative and fully associative caches have choices regarding block placement, i.e. which block to replace.
- Of course, if there is an invalid block, use it.
Whenever we get a cache hit, record the cache block that was touched.
When we need to evict a cache block, choose one which hasn't been touched recently: Least Recently Used (LRU).
- Past is prologue: history suggests it is the least likely of the choices to be used soon.
- Why? Because of temporal locality.

35 BLOCK REPLACEMENT POLICY (2/3)
LRU policy in action. Consider the following memory block accesses for a 4-way cache set: 0 4 8 12 4 16 12 0 4
After the first four (cold-miss) accesses the set holds blocks 0 4 8 12. Then:
Access 4: Hit (set contents: 0 8 12 4)
Access 16: Miss, evict LRU block 0 (set contents: 8 12 4 16)
Access 12: Hit (set contents: 8 4 16 12)
Access 0: Miss, evict LRU block 8 (set contents: 4 16 12 0)
Access 4: Hit
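
A sketch of how LRU bookkeeping might be implemented for this one 4-way set, replaying the same access sequence: each way records the time of its last use, and a miss evicts the way with the oldest timestamp. It reproduces the hit/miss/eviction pattern above.

    #include <stdio.h>

    int main(void) {
        int block[4], last_use[4];
        int filled = 0;
        int seq[] = {0, 4, 8, 12, 4, 16, 12, 0, 4};

        for (int t = 0; t < 9; t++) {
            int b = seq[t], way = -1;
            for (int w = 0; w < filled; w++)
                if (block[w] == b) way = w;        /* hit */
            if (way >= 0) {
                printf("access %2d: hit\n", b);
            } else if (filled < 4) {               /* invalid way available */
                way = filled++;
                block[way] = b;
                printf("access %2d: miss (cold)\n", b);
            } else {                               /* evict the LRU way */
                way = 0;
                for (int w = 1; w < 4; w++)
                    if (last_use[w] < last_use[way]) way = w;
                printf("access %2d: miss, evict %d\n", b, block[way]);
                block[way] = b;
            }
            last_use[way] = t;                     /* record the touch */
        }
        return 0;
    }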

36 BLOCK REPLACEMENT POLICY (3/3)
It is sometimes hard to keep track of LRU if there are lots of choices.
Second choice, the Random policy: pick a block at random and replace it.
Advantage: very simple to implement.
Disadvantage: may not match temporal locality.

37 ADDITIONAL EXAMPLE #1
Consider a cache containing four 8-byte blocks and the following memory address references: 4, 8, 36, 48, 68, 0, 32. Assume a direct-mapped cache.
Index | Valid Tag Word0 Word1
  0   |   0    -    -     -
  1   |   0    -    -     -
  2   |   0    -    -     -
  3   |   0    -    -     -

38 ADDITIONAL EXAMPLE #2
Consider a cache containing four 8-byte blocks and the following memory address references: 4, 8, 36, 48, 68, 0, 32. Assume a fully-associative cache with the LRU replacement policy.
Valid Tag Word0 Word1
  0    -    -     -
  0    -    -     -
  0    -    -     -
  0    -    -     -

39 ADDITIONAL EXAMPLE #3
Consider a cache containing four 8-byte blocks and the following memory address references: 4, 8, 36, 48, 68, 0, 32. Assume a 2-way set-associative cache with the LRU replacement policy.
Set | V Tag Word0 Word1 | V Tag Word0 Word1
 0  | 0  -    -     -   | 0  -    -     -
 1  | 0  -    -     -   | 0  -    -     -

40 CACHE FRAMEWORK (1/2)
Block Placement: Where can a block be placed in the cache?
- Direct mapped: one block, defined by the index
- n-way set-associative: any one of the n blocks within the set defined by the index
- Fully associative: any cache block
Block Identification: How is a block found if it is in the cache?
- Direct mapped: tag match with only one block
- n-way set associative: tag match for all the blocks within the set
- Fully associative: tag match for all the blocks within the cache

41 CACHE FRAMEWORK (2/2)
Block Replacement: Which block should be replaced on a cache miss?
- Direct mapped: no choice
- n-way set associative: LRU or random
- Fully associative: LRU or random
Write Strategy: What happens on a write?
- Write-through vs. write-back
- Write allocate vs. write no allocate

42 IMPROVING MISS PENALTY (1/2)
So far, we tried to improve the miss rate:
- Larger block size
- Larger cache
- Higher associativity
What about the miss penalty?
Average access time = Hit rate × Hit time + (1 − Hit rate) × Miss penalty

43 IMPROVING MISS PENALTY (2/2)
When caches started becoming popular, the miss penalty was about 10 processor clock cycles.
Today: a 1 GHz processor (1 ns per clock cycle) and 50 ns to go to DRAM → 50 processor clock cycles!
Solution: place another cache between memory and the processor cache: the Second Level (L2) Cache.
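
For illustration with assumed numbers (not from the slides): an L1 hit time of 1 cycle, a 5% L1 miss rate, a 10-cycle L2 hit time, and a 10% L2 miss rate over a 50-cycle DRAM access give an average access time of 1 + 0.05 × (10 + 0.10 × 50) = 1.75 cycles, versus 1 + 0.05 × 50 = 3.5 cycles with no L2.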

44 MULTI-LEVEL CACHES
Options: separate data and instruction caches, or a unified cache.
Sample sizes:
- L1: 32KB, 32-byte blocks, 4-way set associative
- L2: 256KB, 128-byte blocks, 8-way set associative
- L3: 4MB, 256-byte blocks, direct mapped
[Figure: Processor (with registers, L1 I$ and L1 D$) → unified L2 $ → Memory → Disk]

45 APPLE iMAC G5 (1.6 GHz)
Level            | Reg | L1 I$ | L1 D$ | L2   | DRAM | Disk
Size             | 1K  | 64K   | 32K   | 512K | 256M | 80G
Latency (cycles) | 1   | 3     | 3     | 11   | 88   | 1E+7

46 APPLE iMAC PowerPC
[Die photo: the iMac's PowerPC 970FX, with all caches on chip.]

47 INTEL
Itanium 2 McKinley: L1: 16KB I$ + 16KB D$; L2: 256KB; L3: 1.5MB-9MB
Pentium 4 Extreme Edition: L1: 12KB I$ + 8KB D$; L2: 256KB; L3: 2MB

48 WHERE ARE WE GOING?
Intel Itanium Montecito, 1.72 billion transistors per die:
- core logic: 57M
- core caches: 106.5M
- 24 MB L3 cache: 1550M
- bus logic & I/O: 6.7M

49 Virtual Memory (Textbook §5.4)
Use main memory as a "cache" for secondary (disk) storage
- Managed jointly by CPU hardware and the operating system (OS)
Programs share main memory
- Each gets a private virtual address space holding its frequently used code and data
- Protected from other programs
CPU and OS translate virtual addresses to physical addresses
- A VM "block" is called a page
- A VM translation "miss" is called a page fault

50 Address Translation
[Figure: translation of a virtual address to a physical address, using fixed-size pages (e.g., 4K).]

51 Page Fault Penalty
On a page fault, the page must be fetched from disk
- Takes millions of clock cycles
- Handled by OS code
Try to minimize the page fault rate
- Fully associative placement
- Smart replacement algorithms

52 Page Tables
The page table stores placement information
- An array of page table entries (PTEs), indexed by virtual page number
- The page table register in the CPU points to the page table in physical memory
If the page is present in memory
- The PTE stores the physical page number
- Plus other status bits (referenced, dirty, ...)
If the page is not present
- The PTE can refer to a location in swap space on disk
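
As a sketch of how this fits together (single-level table, 4KB pages, 32-bit virtual addresses; the PTE layout here is illustrative, not any particular machine's format):

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_BITS 12                        /* 4KB pages */
    #define NUM_PAGES (1u << (32 - PAGE_BITS))  /* 2^20 virtual pages */

    struct pte { unsigned valid : 1; unsigned ppn : 20; };
    static struct pte page_table[NUM_PAGES];    /* indexed by virtual page number */

    /* Returns 1 and fills *pa on success; 0 means page fault (the OS's job). */
    static int translate(uint32_t va, uint32_t *pa) {
        uint32_t vpn    = va >> PAGE_BITS;             /* virtual page number */
        uint32_t offset = va & ((1u << PAGE_BITS) - 1);
        if (!page_table[vpn].valid)
            return 0;                                  /* page fault */
        *pa = ((uint32_t)page_table[vpn].ppn << PAGE_BITS) | offset;
        return 1;
    }

    int main(void) {
        page_table[2].valid = 1;                   /* map VPN 2 -> PPN 7 */
        page_table[2].ppn   = 7;
        uint32_t pa;
        if (translate(0x2ABC, &pa))                /* VPN 2, offset 0xABC */
            printf("VA 0x2ABC -> PA 0x%X\n", pa);  /* prints PA 0x7ABC */
        return 0;
    }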

53 Translation Using a Page Table
[Figure: the virtual page number indexes the page table to obtain the physical page number, which is concatenated with the page offset to form the physical address.]

54 Mapping Pages to Storage
[Figure: PTEs map virtual pages either to physical pages in main memory or to locations in swap space on disk.]

55 Replacement and Writes
To reduce the page fault rate, prefer least-recently used (LRU) replacement
- Reference bit (aka use bit) in the PTE is set to 1 on access to the page
- Periodically cleared to 0 by the OS
- A page with reference bit = 0 has not been used recently
Disk writes take millions of cycles
- Write a block at once, not individual locations
- Write-through is impractical, so use write-back
- Dirty bit in the PTE is set when the page is written

56 Fast Translation Using a TLB
Address translation would appear to require extra memory references
- One to access the PTE
- Then the actual memory access
But access to page tables has good locality
- So use a fast cache of PTEs within the CPU, called a Translation Look-aside Buffer (TLB)
- Typical: 16-512 PTEs, 0.5-1 cycle for a hit, 10-100 cycles for a miss, 0.01%-1% miss rate
- Misses can be handled by hardware or software
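
A sketch of the TLB fast path (fully associative, with FIFO replacement for simplicity; the sizes and the page-walk stub are assumptions, not a real design):

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_BITS 12
    #define TLB_SIZE  8

    struct tlb_entry { int valid; uint32_t vpn, ppn; };
    static struct tlb_entry tlb[TLB_SIZE];
    static int next_victim = 0;               /* FIFO replacement pointer */

    /* Stand-in for a real page-table walk (identity mapping here). */
    static uint32_t page_table_lookup(uint32_t vpn) { return vpn; }

    static uint32_t translate(uint32_t va) {
        uint32_t vpn = va >> PAGE_BITS;
        uint32_t off = va & ((1u << PAGE_BITS) - 1);
        for (int i = 0; i < TLB_SIZE; i++)    /* hardware checks all entries at once */
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                printf("TLB hit for VPN %u\n", vpn);
                return (tlb[i].ppn << PAGE_BITS) | off;
            }
        /* TLB miss: fetch the PTE (slow), install it, then translate */
        uint32_t ppn = page_table_lookup(vpn);
        tlb[next_victim] = (struct tlb_entry){1, vpn, ppn};
        next_victim = (next_victim + 1) % TLB_SIZE;
        printf("TLB miss for VPN %u\n", vpn);
        return (ppn << PAGE_BITS) | off;
    }

    int main(void) {
        translate(0x2ABC);    /* miss: translation for VPN 2 installed */
        translate(0x2DEF);    /* hit: same page                        */
        return 0;
    }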

57 Fast Translation Using a TLB
[Figure: the TLB caches recent virtual-to-physical translations; a TLB hit avoids the page table access entirely.]

58 TLB Misses
If the page is in memory
- Load the PTE from memory and retry
- Could be handled in hardware, though this can get complex for more complicated page table structures
- Or in software: raise a special exception, with an optimized handler
If the page is not in memory (page fault)
- The OS handles fetching the page and updating the page table
- Then the faulting instruction is restarted

59 TLB Miss Handler
A TLB miss indicates either
- Page present, but PTE not in the TLB, or
- Page not present
Must recognize the TLB miss before the destination register is overwritten
- Raise an exception
The handler copies the PTE from memory to the TLB
- Then restarts the instruction
- If the page is not present, a page fault will occur

60 Page Fault Handler
Use the faulting virtual address to find the PTE
Locate the page on disk
Choose a page to replace
- If dirty, write it to disk first
Read the page into memory and update the page table
Make the process runnable again
- Restart from the faulting instruction

61 TLB and Cache Interaction
If the cache tag uses physical addresses
- Need to translate before the cache lookup
Alternative: use virtual address tags
- Complications due to aliasing: different virtual addresses for the same shared physical address

62 Memory Protection
Different tasks can share parts of their virtual address spaces
- But need to protect against errant access
- Requires OS assistance
Hardware support for OS protection
- Privileged supervisor mode (aka kernel mode)
- Privileged instructions
- Page tables and other state information accessible only in supervisor mode
- System call exception (e.g., syscall in MIPS)

63 READING ASSIGNMENT
Large and Fast: Exploiting Memory Hierarchy
- Chapter 7, section 7.3 (3rd edition)
- Chapter 5, section 5.3 (4th edition)

64 END

