The Memory System (Chapter 5)

Presentation on theme: "The Memory System (Chapter 5)" — Presentation transcript:

1 The Memory System (Chapter 5)
Course website:

2 Agenda
Basic Concepts: 1.1. Organization, 1.2. Pinning
Performance Considerations: interleaving, hit ratio/rate, etc.
Caches
Virtual Memory

3 1.1. Organization
[Figure: byte-addressable memory grouped into 4-byte words: byte addresses 0-3 form word 0, bytes 4-7 form word 1, bytes 8-11 form word 2, and so on.]

4 1.1. Connection Memory-CPU
[Figure: the CPU and memory are connected via the MAR (address), the MDR (data), a Read/Write control line, and an MFC (Memory Function Completed) line.]

5 1.1. Memory: Contents
Addressable number of bits: different orderings
Speed-up techniques: memory interleaving, cache memories
Enlargement: virtual memory

6 1.1. Organisation (1)
[Figure: internal organization of a memory chip. An address decoder (inputs A0-A3) drives word lines W0, W1, ...; each word line selects a row of flip-flop cells holding bits b7 ... b1 b0; sense/write circuits connect the cells to the input/output lines under R/W and CS control.]

7 1.2. Pinning
Total number of pins required for a 16x8 memory: 16
(4 address lines + 8 data lines + 2 control lines + 2 power lines)

8 1.2. A 1K by 1 Memory
[Figure: a 1024x1 memory built from a 32x32 cell array. Of the 10 address lines, 5 go to a decoder selecting word lines W0..W31, and 5 drive two 32-to-1 multiplexors serving the separate in and out data lines.]

9 1.2. Pinning
Total number of pins required for a 1024x1 memory: 16
(10 address lines + 2 data lines (in/out) + 2 control lines + 2 power lines)
For a 128x8 memory: 19 pins (7 + 8 + 2 + 2)
Conclusion: the smaller the addressable unit, the fewer pins needed.
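The pin counts above all follow the same arithmetic. A minimal sketch, assuming (as on the slides) 2 control pins, 2 power pins, and separate in/out lines for a 1-bit-wide chip:

```python
import math

def pin_count(words, bits_per_word):
    """Pins = address lines + data lines + 2 control + 2 power.
    Assumes a 1-bit-wide chip uses separate in and out lines (2 data pins)."""
    address_lines = int(math.log2(words))
    data_lines = 2 if bits_per_word == 1 else bits_per_word
    return address_lines + data_lines + 2 + 2

print(pin_count(16, 8))    # 16x8 memory   -> 16 pins
print(pin_count(1024, 1))  # 1K x 1 memory -> 16 pins
print(pin_count(128, 8))   # 128x8 memory  -> 19 pins
```

The 1Kx1 chip ties with the much smaller 16x8 chip only because ten address pins replace most of the data pins.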

10 Agenda
Basic Concepts
Performance Considerations: 2.1. Interleaving, 2.2. The Processor-Memory Performance Gap, 2.3. Caching, 2.4. A Performance Model (hit ratio, performance penalty, etc.)
Caches
Virtual Memory

11 2.1. Interleaving: Multiple Modules (1)
[Figure: block-wise organization. The k high-order bits of the MM address select one of the n modules (via its CS, Chip Select) and the remaining m bits give the address within the module. Consecutive words lie in a single module.]

12 2.1. Interleaving: Multiple Modules (2)
[Figure: interleaved organization. The k low-order bits of the MM address select one of the 2^k modules (via its CS, Chip Select) and the remaining m bits give the address within the module. Consecutive words lie in consecutive modules.]

13 Questions
What is the advantage of the interleaved organization? What is the disadvantage?
Advantage: higher CPU-memory bandwidth, since data can be transferred to/from multiple modules simultaneously.
Disadvantage: when a module breaks down, the memory has many small holes.
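The two organizations differ only in which address bits select the module. A small sketch, with module count and size chosen purely for illustration:

```python
def block_wise(addr, m):
    """High-order bits select the module: consecutive words stay in one module."""
    return addr >> m, addr & ((1 << m) - 1)   # (module, address in module)

def interleaved(addr, k):
    """Low-order k bits select the module: consecutive words alternate modules."""
    return addr & ((1 << k) - 1), addr >> k   # (module, address in module)

# 4 modules (k=2) of 16 words each (m=4); four consecutive addresses:
print([block_wise(a, 4) for a in range(4)])   # all land in module 0
print([interleaved(a, 2) for a in range(4)])  # one word in each module
```

The second print shows why interleaving raises bandwidth: a run of consecutive words can be fetched from all four modules at once.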

14 2.2. Problem: The Performance Gap Processor-Memory
Processor (CPU) speeds: 2x every 2 years (~Moore's Law; limit ~2010)
Memory (DRAM) speeds: 2x every 7 years
Gap: 2x every 2 years. Still growing?

15 2.2. Idea: Memory Hierarchy
[Figure: CPU, primary cache (L1), secondary cache (L2), main memory, disks. Moving toward the CPU, speed and cost increase; moving away from it, size increases.]

16 2.3. Caches (1)
Problem: main memory is slower than the CPU registers (by a factor of 5-10).
Solution: a fast and small memory between the CPU and main memory.
Contents: recent references to memory.
[Figure: CPU - Cache - Main memory]

17 2.3. Caches (2) / 2.4. A Performance Model
Caching works because of the locality principle.
cache hit ratio (rate): h; cache miss ratio (rate): 1-h
access time cache: c; access time main memory: m
mean access time: h*c + (1-h)*m
The cache is transparent to the programmer.
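The mean access time formula is easy to experiment with; the numbers below are illustrative, not from the slides:

```python
def mean_access_time(h, c, m):
    """h = cache hit ratio, c = cache access time, m = main memory access time."""
    return h * c + (1 - h) * m

# e.g. a 95% hit ratio with a 1-cycle cache and a 10-cycle main memory:
print(round(mean_access_time(0.95, 1, 10), 2))  # -> 1.45 cycles
```

Even a small drop in h is costly: at h = 0.90 the mean rises to 1.9 cycles, because every extra miss pays the full main-memory time m.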

18 2.3. Caches (3)
READ operation:
if not in cache: copy the block into the cache and read from the cache (possibly read-through)
if in cache: read from the cache
WRITE operation:
if not in cache: write to main memory
if in cache: write to the cache, and either write to main memory as well (store-through), or set the modified (dirty) bit and write back later

19 2.3. Caches (4): The Library Analogy
Real-world analogue: borrow books from a library and store them in 26 locations (A-Z) according to the first letter of the first author's name.
Direct mapped: a separate location for a single book for each letter of the alphabet
Associative: any book can go to any of the 26 locations
Set-associative: two locations for letters A-B, two for C-D, etc.

20 2.3. Caches (5)
Suppose:
size of main memory in bytes: N = 2^n
block size in bytes: b = 2^k
number of blocks in cache: 128
e.g., n=16, k=4, b=16
Every block in the cache has a valid bit (reset when memory is modified).
At a context switch: invalidate the cache.

21 Agenda
Basic Concepts
Performance Considerations
Caches: 3.1. Mapping Function, 3.2. Replacement Algorithm, 3.3. Examples of Mapping, 3.4. Examples of Caches in Commercial Processors, 3.5. Write Policy, 3.6. Number of Blocks/Caches/...
Virtual Memory

22 3.1. Mapping Function: 1. Direct Mapped Cache (1)
A block in main memory can be at only one place in the cache.
This place is determined by its block number j: place = j modulo size of cache.
Main memory address: tag (5 bits) | block (7 bits) | word (4 bits)
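With the 5/7/4 split above, the cache position can be read straight out of the address. A sketch for the slide's 16-bit addresses (the function name is mine):

```python
def direct_mapped_fields(addr):
    """Split a 16-bit address into tag (5) | block (7) | word (4)."""
    word = addr & 0xF            # 4-bit word within the block
    block = (addr >> 4) & 0x7F   # 7-bit cache block position (128 blocks)
    tag = addr >> 11             # 5-bit tag
    return tag, block, word

# Memory blocks 0, 128, and 256 all land in cache block 0, with distinct tags:
for mem_block in (0, 128, 256):
    print(direct_mapped_fields(mem_block << 4))  # tags 0, 1, 2
```

Taking the block field modulo 128 is exactly the "place = j modulo size of cache" rule: it is free, which is why direct mapping is the cheapest scheme.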

23 3.1. Direct Mapped Cache (2)
[Figure: main memory blocks 0, 128, 256, ... all map to cache block 0; blocks 1, 129, ... map to cache block 1; and so on up to cache block 127. A 5-bit tag distinguishes the memory blocks sharing a cache block.]

24 3.1. Direct Mapped Cache (3)
[Figure: the same mapping, highlighting that memory blocks 1 and 129 compete for cache block 1.]

25 3.1. Mapping Function: 2. Associative Cache (1)
Each block can be at any place in the cache.
Cache access: parallel (associative) match of the tag in the address with the tags in all cache entries.
Associative: slower, more expensive, higher hit ratio.
Main memory address: tag (12 bits) | word (4 bits)

26 3.1.2. Associative Cache (2)
[Figure: any main memory block (0, 1, 2, ..., 255, 256, ...) can go into any of the 128 cache blocks; each cache entry stores a 12-bit tag.]

27 3.1. Mapping Function: 3. Set-Associative Cache (1)
Combination of direct mapped and associative.
The cache consists of sets. Mapping of a block to a set is direct, determined by the set number; within a set, placement is associative.
Main memory address: tag (6 bits) | set (6 bits) | word (4 bits)

28 3.1.3. Set-Associative Cache (2)
[Figure: 128 cache blocks organized as 64 two-way sets; each block has its own 6-bit tag.]
Q: What is wrong in this picture?
Answer: there are 64 sets, so block 64 also goes to set 0.

29 3.1.3. Set-Associative Cache (3)
[Figure: the corrected picture: 128 cache blocks as 64 two-way sets, with block 64 mapping to set 0.]

30 Question
Given:
Main memory: 4 GByte
Cache: 512 blocks of 64 byte, 8-way set-associative (set size is 8)
All memories are byte-addressable.
Q: How many bits are the:
byte address within a block
set number
tag

31 Answer
Main memory is 4 GByte, so the address has 32 bits.
A block is 64 byte, so the byte address within a block has 6 bits.
An 8-way set-associative cache with 512 blocks has 512/8 = 64 sets, so the set number has 6 bits.
So the tag has 32 - 6 - 6 = 20 bits.
Address: tag (20 bits) | set (6 bits) | word (6 bits)
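The same arithmetic can be packaged as a small helper (the function and parameter names are mine, not from the slides):

```python
import math

def field_widths(mem_bytes, block_bytes, num_blocks, ways):
    """Return (tag, set, word) bit widths for a byte-addressable memory."""
    address_bits = int(math.log2(mem_bytes))
    word_bits = int(math.log2(block_bytes))
    set_bits = int(math.log2(num_blocks // ways))   # sets = blocks / ways
    return address_bits - set_bits - word_bits, set_bits, word_bits

# 4 GByte memory, 512 blocks of 64 byte, 8-way set-associative:
print(field_widths(4 * 2**30, 64, 512, 8))  # -> (20, 6, 6)
```

Setting ways=1 gives the direct-mapped split, and ways=num_blocks the fully associative one (set field of 0 bits).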

32 3.2. Replacement Algorithm: Replacement (1)
(Set-)associative caches need a replacement algorithm, for example Least Recently Used (LRU).
With 2^k blocks per set, implement LRU with a k-bit counter per block:
hit: increase all counters lower than the referenced one by 1; set the referenced counter to 0
miss and set not full: place the new block with counter 0; increase the rest
miss and set full: replace the block with the highest counter value (2^k - 1); set the new block's counter to 0; increase the rest
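The three cases above can be sketched directly. The representation (a list of [block, counter] pairs per set, with 0 meaning most recently used) and the returned labels are my own choices:

```python
def lru_touch(set_lines, block):
    """One access to a set using per-line age counters (0 = most recent)."""
    # hit: bump counters lower than the referenced one, zero the referenced one
    for line in set_lines:
        if line[0] == block:
            for other in set_lines:
                if other[0] is not None and other[1] < line[1]:
                    other[1] += 1
            line[1] = 0
            return "hit"
    # miss, set not full: take an empty line with counter 0, age the rest
    for line in set_lines:
        if line[0] is None:
            for other in set_lines:
                if other[0] is not None:
                    other[1] += 1
            line[0], line[1] = block, 0
            return "fill"
    # miss, set full: the victim is the line with the highest counter (2**k - 1)
    victim = max(set_lines, key=lambda l: l[1])
    for other in set_lines:
        if other is not victim:
            other[1] += 1
    victim[0], victim[1] = block, 0
    return "replace"

lines = [[None, 0] for _ in range(4)]          # k=2: four lines per set
print([lru_touch(lines, b) for b in "ABCDA"])  # four fills, then a hit on A
print(lru_touch(lines, "E"))                   # full set: replaces LRU block B
```

After the hit on A, block B holds the highest counter, so the miss on E evicts B; the counters always remain a permutation of 0..3 over the valid lines.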

33 3.2.1. LRU: Example 1 (k=2, 4 blocks per set)
[Figure: a hit. The referenced block's counter is set to 0 (now at the top); counters that were lower are increased by 1; higher counters are unchanged.]

34 3.2.2. LRU: Example 2 (k=2)
[Figure: a miss while the set is not full. The new block takes an empty entry with counter 0 (now at the top); the other counters are increased.]

35 3.2.3. LRU: Example 3 (k=2)
[Figure: a miss while the set is full. The block with the highest counter is replaced; the new block gets counter 0 (now at the top); the other counters are increased.]

36 3.2. Replacement Algorithm: Replacement (2)
Alternatives to LRU:
replace the oldest block: First-In-First-Out (FIFO)
Least-Frequently Used (LFU)
random replacement

37 3.3. Example (1): Program
Normalize the elements of row 0 of a 4x10 array A:

int SUM = 0;
for (j = 0; j < 10; j++) {
    SUM = SUM + A[0][j];
}
AVE = SUM / 10;
for (i = 9; i > -1; i--) {
    A[0][i] = A[0][i] / AVE;
}

First pass: from start to end. Second pass: from end to start.

38 3.3. Example (2): Cache
Cache: 8 blocks in 2 sets; each block is 1 word; LRU replacement.
Address splits (16-bit word address):
direct mapped: tag (13 bits) | block (3 bits)
associative: tag (16 bits)
set-associative: tag (15 bits) | set (1 bit)
[Figure: cache blocks 0-7; for the set-associative case, blocks 0-3 form set 0 and blocks 4-7 form set 1.]

39 3.3. Example (3): Array
The 4x10 array A is stored in column-major ordering: a(0,0), a(1,0), a(2,0), a(3,0), a(0,1), ..., a(0,9), a(1,9), a(2,9), a(3,9). The elements of row 0 are therefore four locations apart; a(0,0) is at address 7A00 (0111101000000000).
[Table: each element's memory address and its tag under direct, set-associative, and associative mapping.]

40 3.3. Example (4): Direct Mapped
[Table: cache contents after passes j=1, 3, 5, 7, 9 and i=6, 4, 2, 0. Because the elements of row 0 are also 4 locations apart in the cache, only block positions 0 and 4 are used: position 0 holds a[0,0], a[0,2], ..., a[0,8] and position 4 holds a[0,1], a[0,3], ..., a[0,9].]
Conclusion: of the 20 accesses, none are in the cache.

41 3.3. Example (5): Associative
[Table: cache contents after passes j=7, 8, 9 and i=1, 0. After the first pass the cache holds a[0,2] through a[0,9]; from i=9 down to i=2 all accesses are in the cache; a[0,1] and a[0,0] miss.]
Conclusion: of the 20 accesses, 8 are in the cache.

42 3.3. Example (6): Set-Associative
[Table: cache contents after passes j=3, 7, 9 and i=4, 2, 0. All elements of row 0 map to set 0; from i=9 down to i=6 all accesses are in the cache.]
Conclusion: of the 20 accesses, 4 are in the cache.
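The tallies can be checked with a toy LRU simulation. The addressing below is my assumption (word addresses, row-0 elements 4 words apart starting at 7A00, sets selected by the low address bits), so treat it as a model of the example, not a transcript of the slides:

```python
from collections import OrderedDict

BASE = 0x7A00                                         # word address of a[0,0]
accesses = [BASE + 4 * j for j in range(10)]          # first pass, j = 0..9
accesses += [BASE + 4 * i for i in range(9, -1, -1)]  # second pass, i = 9..0

def hits(num_sets, ways, addresses):
    """Count hits in an LRU cache of num_sets sets, each `ways` blocks wide."""
    sets = [OrderedDict() for _ in range(num_sets)]
    count = 0
    for addr in addresses:
        s = sets[addr % num_sets]      # low-order address bits pick the set
        if addr in s:
            count += 1
            s.move_to_end(addr)        # refresh recency on a hit
        else:
            if len(s) == ways:
                s.popitem(last=False)  # evict the least recently used block
            s[addr] = True
    return count

print(hits(1, 8, accesses))  # fully associative: 8 hits, as on the slide
print(hits(2, 4, accesses))  # two 4-way sets:    4 hits, as on the slide
print(hits(8, 1, accesses))  # direct mapped: 2 in this model (see note)
```

Note: this model scores the back-to-back re-reads of a[0,9] and a[0,8] at the turn of the two loops as direct-mapped hits, so it reports 2 rather than the slide's 0; the associative and set-associative counts match the slides.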

43 3.4. Example: PowerPC (1)
PowerPC 604:
separate data and instruction caches
each cache is 16 KBytes
four-way set-associative
128 sets
each block has 8 words of 32 bits

44 3.4. Example: PowerPC (2)
[Figure: a cache lookup in the PowerPC 604. The address splits into a tag, a set number, and a word address within the block; the set number selects one of the 128 sets, and the tag is compared (=?) in parallel against the four blocks of the set; a match means a hit.]

45 Agenda
Basic Concepts
Performance Considerations
Caches
Virtual Memory: 4.1. Virtual Memory, 4.2. Address Translation

46 4.1. Virtual Memory (1)
Problem: a compiled program does not fit into memory.
Solution: virtual memory, where the logical address space is larger than the physical address space.
Logical address space: the addresses referable by instructions.
Physical address space: the addresses referable in the real machine.

47 4.1. Virtual Memory (2)
Realizing virtual memory requires an address conversion: am = f(av)
where am is the physical (machine) address and av is the virtual address.
This is generally done by hardware.

48 4.1. Organization
[Figure: the processor issues a virtual address av; the MMU translates it into a physical address am, which is used to access the cache and main memory; data moves between main memory and disk storage by DMA transfer.]

49 4.2. Address Translation
Basic approach: partition both the physical and the virtual address space into equally sized blocks called pages.
A virtual address is composed of:
a page number
a word number within the page (the offset)

50 4.2. Page Tables (1)
[Figure: the virtual address from the processor is split into a virtual page number and an offset. The page table base register plus the virtual page number addresses an entry of the page table in main memory; the entry holds control bits and the page frame number, which is combined with the offset to form the physical address.]
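A minimal sketch of this translation path, with a made-up single-level page table and 4 KByte pages (the page-table contents and error handling are illustrative):

```python
PAGE_SIZE = 4096                 # 4 KByte pages
page_table = {0: 5, 1: 2, 7: 9}  # virtual page number -> page frame number

def translate(virtual_addr):
    """am = f(av): split off the offset, look up the frame, rebuild the address."""
    vpn, offset = divmod(virtual_addr, PAGE_SIZE)
    if vpn not in page_table:
        raise RuntimeError("page fault: page %d not resident" % vpn)
    return page_table[vpn] * PAGE_SIZE + offset

print(hex(translate(0x1ABC)))  # page 1 -> frame 2, so -> 0x2abc
```

Because the page size is a power of two, the offset bits pass through unchanged; only the page-number bits are replaced by the frame number.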

51 4.2. Page Tables (2)
Having page tables only in main memory is much too slow: it costs an additional memory access for every instruction and operand.
Solution: keep a cache with recent address translations, a Translation Look-aside Buffer (TLB).

52 4.2. Operation of the TLB
Idea: keep the most recent address translations.
[Figure: the virtual page number is compared (=?) against the virtual page numbers stored in the TLB. On a hit, the corresponding real page number is combined with the offset to form the physical address; on a miss, the page table in main memory is consulted.]
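The TLB itself is just a tiny, fully associative cache of translations. A sketch, where the four-entry size and the LRU policy are my assumptions for illustration:

```python
from collections import OrderedDict

class TLB:
    """A small fully associative TLB with LRU replacement (illustrative)."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # virtual page number -> page frame number

    def lookup(self, vpn):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)  # TLB hit: refresh recency
            return self.entries[vpn]
        return None                        # TLB miss: walk the page table

    def insert(self, vpn, frame):
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)  # evict the LRU translation
        self.entries[vpn] = frame
```

On a miss the MMU would consult the page table in main memory and then call insert with the translation it found, so the next access to that page hits in the TLB.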

53 4.2. Policies
The pages of a process that are in main memory form its resident set.
The mechanism works because of the principle of locality.
Page replacement algorithms are needed.
Protection is possible through the page table register; sharing is possible through the page table.
Hardware support: the Memory Management Unit (MMU).

54 Question
Given:
Main memory: 256 MByte
Maximal virtual address space: 4 GByte
Page size: 4 KByte
All memories are byte-addressable.
Q: How many bits are the:
offset within a page
virtual page number
(physical) page frame number

55 Answer
Virtual address: 32 bits (2^32 = 4 GByte)
Physical address: 28 bits (2^28 = 256 MByte)
Offset within a page: 12 bits (2^12 = 4 KByte)
Virtual page number: 32 - 12 = 20 bits
Physical page frame number: 28 - 12 = 16 bits

