The Memory System (Chapter 5)
Agenda
1. Basic Concepts
   1.1. Organization
   1.2. Pinning
2. Performance Considerations: Interleaving, Hit ratio/rate, etc.
3. Caches
4. Virtual Memory
TU-Delft TI1400/11-PDS
Organization
[Figure: memory organization, contrasting word addresses with byte addresses]
Connection Memory-CPU
[Figure: CPU connected to memory via Address and Data lines and the Read/Write and MFC control signals; the CPU holds the MAR (memory address register) and MDR (memory data register)]
Memory: contents
- Addressable number of bits
- Different orderings
- Speed-up techniques: memory interleaving, cache memories
- Enlargement: virtual memory
Organization (1)
[Figure: 16x8 memory array — a 4-bit address (A0-A3) feeds an address decoder that drives word lines W0-W15; each cell is a flip-flop (FF) with sense/write circuits connecting to input/output lines b7..b0, controlled by R/W and CS]
Pinning
Total number of pins required for a 16x8 memory:
- 4 address lines
- 8 data lines
- 2 control lines (R/W, CS)
- 2 power lines
Total: 16 pins
[Figure: a 1K by 1 memory built from a 32 by 32 memory array — a 10-bit address is split between a 5-bit decoder selecting word lines W0-W31 and two 32-to-1 multiplexors selecting the bit within a row; a single in/out data line]
Pinning
Total number of pins required for a 1024x1 memory:
- 10 address lines
- 1 data line (in/out)
- 2 control lines
- 2 power lines
Total: 15 pins
For a 128x8 memory: 19 pins (7 + 8 + 2 + 2)
Conclusion: the smaller the addressable unit, the fewer pins needed
Agenda
1. Basic Concepts
2. Performance Considerations
   2.1. Interleaving
   2.2. Performance Gap Processor-Memory
   2.3. Caching
   2.4. A Performance Model: Hit ratio, Performance Penalty, etc.
3. Caches
4. Virtual Memory
Interleaving: Multiple Modules (1)
[Figure: the main-memory (MM) address is split into k module bits and m address-in-module bits; the module bits drive the Chip Select (CS) of one of the n modules]
Block-wise organization: consecutive words reside in a single module.
Interleaving: Multiple Modules (2)
[Figure: the MM address is split into m address-in-module bits (high order) and k module bits (low order); the module bits drive the Chip Select (CS) of one of the 2^k modules]
Interleaved organization: consecutive words reside in consecutive modules.
Questions
Q: What is the advantage of the interleaved organization? What is the disadvantage?
A: Higher bandwidth CPU-memory: data can be transferred to/from multiple modules simultaneously. But when a module breaks down, memory has many small holes.
Problem: The Performance Gap Processor-Memory
- Processor: CPU speeds 2x every 2 years (~Moore's Law; limit ~2010)
- Memory: DRAM speeds 2x every 7 years
- Gap: 2x every 2 years
- Gap still growing?
Idea: Memory Hierarchy
[Figure: hierarchy from CPU through primary cache (L1), secondary cache (L2), and main memory down to disks; size increases toward the disks, while speed and cost per bit increase toward the CPU]
Caches (1)
Problem: main memory is slower than CPU registers (factor of 5-10)
Solution: a fast and small memory between CPU and main memory
Contains: recent references to memory
[Figure: CPU - Cache - Main memory]
Caches (2) / 2.4. A Performance Model
Works because of the locality principle
- cache hit ratio (rate): h
- cache miss ratio (rate): 1-h
- access time cache: c
- access time main memory: m
- mean access time: h·c + (1-h)·m
The cache is transparent to the programmer
Caches (3)
READ operation:
- if not in cache, copy the block into the cache and read out of the cache (possibly read-through)
- if in cache, read out of the cache
WRITE operation:
- if not in cache, write in main memory
- if in cache, write in the cache, and either:
  - write in main memory as well (store through), or
  - set the modified (dirty) bit, and write back later
Caches (4): The Library Analogy
Real-world analogue:
- borrow books from a library
- store these books, according to the first letter of the first author's name, in 26 locations (A-Z)
Direct mapped: a separate location for a single book for each letter of the alphabet
Associative: any book can go to any of the 26 locations
Set-associative: two locations for letters A-B, two for C-D, etc.
Caches (5)
Suppose:
- size of main memory in bytes: N = 2^n
- block size in bytes: b = 2^k
- number of blocks in cache: 128
- e.g., n = 16, k = 4, b = 16
Every block in the cache has a valid bit (reset when memory is modified)
At a context switch: invalidate the cache
Agenda
1. Basic Concepts
2. Performance Considerations
3. Caches
   3.1. Mapping Function
   3.2. Replacement Algorithm
   3.3. Examples of Mapping
   3.4. Examples of Caches in Commercial Processors
   3.5. Write Policy
   3.6. Number of Blocks/Caches/...
4. Virtual Memory
Mapping Function 1. Direct Mapped Cache (1)
A block in main memory can be at only one place in the cache
This place is determined by its block number j: place = j modulo number of cache blocks
Main memory address (16 bits): tag (5) | block (7) | word (4)
Direct Mapped Cache (2)
[Figure: main-memory blocks 0, 128, 256, ... all map to cache block 0, blocks 1, 129, ... to cache block 1, and so on up to block 127; each cache block stores a 5-bit tag identifying which memory block it holds]
Direct Mapped Cache (3)
[Figure: the same mapping viewed from the cache side — cache blocks 0..127 with their tags, and the memory blocks 0..127, 128..255, ... that compete for each position]
Mapping Function 2. Associative Cache (1)
Each block can be at any place in the cache
Cache access: parallel (associative) match of the tag in the address against the tags in all cache entries
Associative: slower, more expensive, higher hit ratio
Main memory address (16 bits): tag (12) | word (4)
Associative Cache (2)
[Figure: any main-memory block (0..255, ...) can be placed in any of the 128 cache blocks; each cache block stores a 12-bit tag]
Mapping Function 3. Set-Associative Cache (1)
Combination of direct mapped and associative
The cache consists of sets
Mapping of a block to a set is direct, determined by the set number
Each set is associative
Main memory address (16 bits): tag (6) | set (6) | word (4)
Set-Associative Cache (2)
[Figure: 128 cache blocks organized as 64 two-way sets (set 0: blocks 0-1, set 1: blocks 2-3, ...); each block stores a 6-bit tag; main-memory blocks 0..255 shown mapping to the sets]
Q: What is wrong in this picture?
A: There are 64 sets, so memory block 64 also goes to set 0
Set-Associative Cache (3)
[Figure: the corrected mapping — memory blocks 0, 64, 128, 192, ... all map to set 0; each of the 64 two-way sets holds two blocks with their 6-bit tags]
Question
- Main memory: 4 GByte
- Cache: 512 blocks of 64 byte
- Cache: 8-way set-associative (set size is 8)
- All memories are byte addressable
Q: How many bits is the:
- byte address within a block
- set number
- tag
Answer
Main memory is 4 GByte, so 32-bit addresses
A block is 64 byte, so a 6-bit byte address within a block
8-way set-associative cache with 512 blocks, so 512/8 = 64 sets, so a 6-bit set number
So, 32 - 6 - 6 = 20-bit tag
Address layout: tag (20) | set (6) | word (6)
Replacement Algorithm: Replacement (1)
(Set-)associative caches need a replacement algorithm, e.g. Least Recently Used (LRU):
- if 2^k blocks per set, implement with a k-bit counter per block
- hit: increase by 1 the counters lower than the one referenced; set the referenced block's counter to 0
- miss and set not full: place the new block, set its counter to 0, increase the rest
- miss and set full: replace the block with the highest counter value (2^k - 1), set the new block's counter to 0, increase the rest
LRU: Example (k=2, 4 blocks per set)
[Figure: HIT — counters lower than the referenced block's are increased, the others are unchanged; the referenced block is now at the top (counter 0)]
LRU: Example (k=2)
[Figure: miss and set not full — the new block fills an EMPTY entry with counter 0; the other counters are increased]
LRU: Example (k=2)
[Figure: miss and set full — the block with the highest counter is replaced; the new block gets counter 0 and the other counters are increased]
Replacement Algorithm: Replacement (2)
Alternatives to LRU:
- replace the oldest block: First-In-First-Out (FIFO)
- Least-Frequently Used (LFU)
- random replacement
Example (1): program

int SUM = 0;
for (j = 0; j < 10; j++) {
    SUM = SUM + A[0,j];
}
AVE = SUM / 10;
for (i = 9; i > -1; i--) {
    A[0,i] = A[0,i] / AVE;
}

Normalize the elements of row 0 of array A
First pass: from start to end
Second pass: from end to start
Example (2): cache
[Figure: cache of 8 blocks (block 0..7), each with a tag, organized as Set 0 and Set 1 for the set-associative case]
Cache: 8 blocks in 2 sets, each block 1 word, LRU replacement
Address layouts (16-bit word address):
- direct: tag (13) | block (3)
- associative: tag (16)
- set-associative: tag (15) | set (1)
Example (3): array
[Figure: memory layout starting at address 7A00: a(0,0), a(1,0), a(2,0), a(3,0), a(0,1), ..., a(0,9), a(1,9), a(2,9), a(3,9), annotated with the tags for the direct, set-associative, and associative cases]
4x10 array, column-major ordering; elements of row 0 are four locations apart
Example (4): direct mapped
[Table: contents of the cache after passes j=1, 3, 5, 7, 9 and i=6, 4, 2, 0 — the block positions keep being overwritten, alternating between a[0,0], a[0,2], ..., a[0,8] and a[0,1], a[0,3], ..., a[0,9]; every access is a miss]
Elements of row 0 are also 4 locations apart in the cache
Conclusion: of the 20 accesses, none hit in the cache
Example (5): associative
[Table: contents of the cache after passes j=7, j=8, j=9 and i=1, i=0 — a[0,0]..a[0,7] fill the 8 blocks; a[0,8] and a[0,9] then evict the least recently used entries a[0,0] and a[0,1]]
From i=9 down to i=2, all accesses hit in the cache
Conclusion: of the 20 accesses, 8 hit in the cache
Example (6): set-associative
[Table: contents of set 0 after passes j=3, j=7, j=9 and i=4, i=2, i=0 — the four blocks of set 0 cycle through a[0,0]..a[0,3], a[0,4]..a[0,7], and so on]
All elements of row 0 map to set 0
From i=9 down to i=6, all accesses hit in the cache
Conclusion: of the 20 accesses, 4 hit in the cache
Example: PowerPC (1)
PowerPC 604:
- separate data and instruction caches
- caches are 16 KByte
- four-way set-associative
- 128 sets
- each block has 8 words of 32 bits
Example: PowerPC (2)
[Figure: lookup of address ...F408 — the address is split into tag, set number, and word address in block; the set number selects a set (here set 0), whose four stored tags (e.g., 00BA2, 003F4) are compared in parallel with the address tag (=? yes/no)]
Agenda
1. Basic Concepts
2. Performance Considerations
3. Caches
4. Virtual Memory
   4.1. Basic Concepts
   4.2. Address Translation
Virtual Memory (1)
Problem: a compiled program does not fit into memory
Solution: virtual memory, where the logical address space is larger than the physical address space
- Logical address space: addresses referable by instructions
- Physical address space: addresses referable in the real machine
Virtual Memory (2)
For realizing virtual memory, we need an address conversion: a_m = f(a_v)
- a_m is the physical address (machine address)
- a_v is the virtual address
This is generally done by hardware
Organization
[Figure: the processor issues virtual address a_v to the MMU, which translates it to physical address a_m for the cache and main memory; data flows back to the processor; DMA transfers move pages between main memory and disk storage]
Address Translation
Basic approach: partition both the physical and the virtual address space into equally sized blocks called pages
A virtual address is composed of:
- a page number
- a word number within the page (the offset)
Page tables (1)
[Figure: the virtual address from the processor is split into a virtual page number and an offset; the page table base register plus the virtual page number selects a page table entry, which holds control bits and a page frame number; page frame | offset forms the physical address. The page table resides in main memory]
Page tables (2)
Having page tables only in main memory is much too slow: it adds a memory access for every instruction and operand
Solution: keep a cache with recent address translations: a Translation Look-aside Buffer (TLB)
Operation of TLB
Idea: keep the most recent address translations
[Figure: the virtual page number of the virtual address from the processor is compared (= ?) with the virtual page numbers stored in the TLB; on a hit, the stored real page number plus the offset forms the physical address; on a miss, the page table is consulted. Each TLB entry holds control bits, a virtual page number, and a real page number]
Policies
- The pages of a process in main memory: the resident set
- The mechanism works because of the principle of locality
- Page replacement algorithms are needed
- Protection is possible through the page table register
- Sharing is possible through the page table
- Hardware support: the Memory Management Unit (MMU)
Question
- Main memory: 256 MByte
- Maximal virtual-address space: 4 GByte
- Page size: 4 KByte
- All memories are byte addressable
Q: How many bits is the:
- offset within a page
- virtual page frame number
- (physical) page frame number
Answer
- Virtual address: 32 bits (2^32 = 4 GByte)
- Physical address: 28 bits (2^28 = 256 MByte)
- Offset in a page: 12 bits (2^12 = 4 KByte)
- Virtual page frame number: 32 - 12 = 20 bits
- Physical page frame number: 28 - 12 = 16 bits