The Memory System (Chapter 5)


Slide 1: The Memory System (Chapter 5), TU-Delft TI1400/11-PDS
http://www.pds.ewi.tudelft.nl/~iosup/Courses/2011_ti1400_9.ppt

Slide 2: Agenda
1. Basic Concepts
   1.1. Organization
   1.2. Pinning
2. Performance Considerations: interleaving, hit ratio/rate, etc.
3. Caches
4. Virtual Memory

Slide 3: 1.1. Organization
[Figure: byte-addressable memory. Byte addresses 0-3 form the word at word address 0, bytes 4-7 the next word, bytes 8-11 the next, and so on.]

Slide 4: 1.1. Connection Memory-CPU
[Figure: the CPU holds the address in MAR and the data in MDR; it is connected to main memory by address and data lines plus the Read/Write and MFC (Memory Function Completed) control signals.]

Slide 5: 1.1. Memory: Contents
- Addressable number of bits
- Different orderings
- Speed-up techniques: memory interleaving, cache memories
- Enlargement: virtual memory

Slide 6: 1.1. Organisation (1)
[Figure: a 16x8 memory cell array. A 4-bit address (A0-A3) feeds an address decoder that selects one of the word lines W0-W15; sense/write circuits connect the cells to the input/output data lines b0-b7, under control of the R/W and CS signals.]

Slide 7: 1.2. Pinning
Total number of pins required for a 16x8 memory: 16
- 4 address lines
- 8 data lines
- 2 control lines
- 2 power lines

Slide 8: 1.2. A 1K by 1 Memory
[Figure: a 1024x1 memory built from a 32 by 32 cell array. Of the 10 address lines, 5 go to a 5-bit decoder that selects a row (W0-W31); the other 5 drive two 32-to-1 multiplexors for data in and data out.]

Slide 9: 1.2. Pinning
Total number of pins required for a 1024x1 memory: 16
- 10 address lines
- 2 data lines (in/out)
- 2 control lines
- 2 power lines
For a 128x8 memory: 19 pins (7+8+2+2).
Conclusion: the smaller the addressable unit, the fewer pins needed.
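The pin arithmetic on these two slides can be checked with a small sketch (Python; the function name and the fixed counts of two control and two power lines are taken from the slides, everything else is illustrative):

```python
def pin_count(addr_lines, data_lines, control_lines=2, power_lines=2):
    """Total chip pins: address + data + control (R/W, CS) + power."""
    return addr_lines + data_lines + control_lines + power_lines

# 16x8 memory: 4 address lines (2^4 = 16 words), 8 data lines -> 16 pins
print(pin_count(4, 8))    # 16
# 1024x1 memory: 10 address lines, 2 data lines (in/out)      -> 16 pins
print(pin_count(10, 2))   # 16
# 128x8 memory: 7 address lines, 8 data lines                 -> 19 pins
print(pin_count(7, 8))    # 19
```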

Slide 10: Agenda
1. Basic Concepts
2. Performance Considerations
   2.1. Interleaving
   2.2. Performance Gap Processor-Memory
   2.3. Caching
   2.4. A Performance Model: hit ratio, performance penalty, etc.
3. Caches
4. Virtual Memory

Slide 11: 2.1. Interleaving: Multiple Modules (1)
Block-wise organization: consecutive words reside in a single module. The memory address is split into k high-order bits that select one of the n modules (via its CS, Chip Select, input) and m low-order bits that give the address within the module.

Slide 12: 2.1. Interleaving: Multiple Modules (2)
Interleaved organization: consecutive words reside in consecutive modules. Here the k low-order bits of the memory address select one of the 2^k modules (via CS, Chip Select) and the m high-order bits give the address within the module.

Slide 13: Questions
Q: What is the advantage of the interleaved organization? What is the disadvantage?
A: Higher bandwidth between CPU and memory, since data can be transferred to/from multiple modules simultaneously. But when a module breaks down, memory has many small holes.
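The two address splits from slides 11 and 12 can be sketched as follows (Python; the function names are mine, not from the slides):

```python
def blockwise(addr, k, m):
    """Block-wise: the k high-order bits select the module,
    the m low-order bits address a word within it."""
    return (addr >> m) & ((1 << k) - 1), addr & ((1 << m) - 1)

def interleaved(addr, k, m):
    """Interleaved: the k low-order bits select the module, so
    consecutive words land in consecutive modules."""
    return addr & ((1 << k) - 1), (addr >> k) & ((1 << m) - 1)

# With k=2 (4 modules) and m=8, four consecutive addresses hit four
# different modules under interleaving, but a single module block-wise:
print([interleaved(a, 2, 8)[0] for a in range(4)])  # [0, 1, 2, 3]
print([blockwise(a, 2, 8)[0] for a in range(4)])    # [0, 0, 0, 0]
```

This is exactly why interleaving gives the bandwidth advantage mentioned above: a burst of consecutive accesses can proceed in parallel.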

Slide 14: 2.2. Problem: The Performance Gap Processor-Memory
- Processor: CPU speeds 2x every 2 years (roughly Moore's Law; limit around 2010)
- Memory: DRAM speeds 2x every 7 years
- Gap: 2x every 2 years. Is the gap still growing?

Slide 15: 2.2. Idea: Memory Hierarchy
CPU, primary cache (L1), secondary cache (L2), main memory, disks: size increases going down the hierarchy, while speed and cost per bit increase going up toward the CPU.

Slide 16: 2.3. Caches (1)
Problem: main memory is slower than the CPU registers (by a factor of 5-10).
Solution: a fast and small memory between the CPU and main memory (CPU, cache, main memory). It contains recent references to memory.

Slide 17: 2.3. Caches (2) / 2.4. A Performance Model
Caches work because of the locality principle. The profit:
- cache hit ratio (rate): h
- cache miss ratio (rate): 1-h
- access time cache: c
- access time main memory: m
- mean access time: h·c + (1-h)·m
The cache is transparent to the programmer.
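The performance model above is a one-liner; a quick sketch with illustrative numbers (the 95% / 1 ns / 10 ns figures are my own example, not from the slides):

```python
def mean_access_time(h, c, m):
    """Mean access time h*c + (1-h)*m for hit ratio h,
    cache access time c, and main memory access time m."""
    return h * c + (1 - h) * m

# A 95% hit ratio with a 1 ns cache and 10 ns main memory
# averages about 1.45 ns per access.
print(mean_access_time(0.95, 1.0, 10.0))
```

Note how strongly the result depends on h: even a small drop in the hit ratio pushes the mean toward the slow memory's access time.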

Slide 18: 2.3. Caches (3)
READ operation:
- if not in cache: copy the block into the cache and read out of the cache (possibly read-through)
- if in cache: read out of the cache
WRITE operation:
- if not in cache: write in main memory
- if in cache: write in the cache, and either write in main memory as well (store-through), or set the modified (dirty) bit and write later

Slide 19: 2.3. Caches (4): The Library Analogy
Real-world analogue: borrow books from a library and store them according to the first letter of the first author's name, in 26 locations (A-Z).
- Direct mapped: a separate location holding a single book for each letter of the alphabet
- Associative: any book can go to any of the 26 locations
- Set-associative: two locations for letters A-B, two for C-D, etc.

Slide 20: 2.3. Caches (5)
Suppose:
- size of main memory in bytes: N = 2^n
- block size in bytes: b = 2^k
- number of blocks in cache: 128
- e.g., n=16, k=4, b=16
Every block in the cache has a valid bit (reset when memory is modified). At a context switch the cache is invalidated.

Slide 21: Agenda
1. Basic Concepts
2. Performance Considerations
3. Caches
   3.1. Mapping Function
   3.2. Replacement Algorithm
   3.3. Examples of Mapping
   3.4. Examples of Caches in Commercial Processors
   3.5. Write Policy
   3.6. Number of Blocks/Caches/…
4. Virtual Memory

Slide 22: 3.1. Mapping Function 1: Direct Mapped Cache (1)
A block in main memory can be at only one place in the cache. This place is determined by its block number j: place = j modulo size of cache (in blocks).
Main memory address: tag (5 bits) | block (7 bits) | word (4 bits).
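With the example parameters from slide 20 (16-bit addresses, 16-byte blocks, 128 cache blocks), the split can be sketched like this (Python; the function name is mine):

```python
def direct_mapped_split(addr, word_bits=4, block_bits=7):
    """Split a 16-bit address into (tag, block, word) for the slides'
    example: 16-byte blocks (k=4) and a 128-block cache."""
    word = addr & ((1 << word_bits) - 1)
    block = (addr >> word_bits) & ((1 << block_bits) - 1)
    tag = addr >> (word_bits + block_bits)
    return tag, block, word

# Main memory blocks j and j + 128 share cache block j % 128,
# so they differ only in the tag:
print(direct_mapped_split(0 << 4))    # (0, 0, 0)  memory block 0
print(direct_mapped_split(128 << 4))  # (1, 0, 0)  memory block 128
```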

Slide 23: 3.1. Direct Mapped Cache (2)
[Figure: a 128-block cache with a 5-bit tag per block; main memory blocks 0, 128, 256, ... all map to cache block 0, blocks 1, 129, ... to cache block 1, and so on.]

Slide 24: 3.1. Direct Mapped Cache (3)
[Figure: the same mapping, highlighting another main memory block and the single cache block it may occupy.]

Slide 25: 3.1. Mapping Function 2: Associative Cache (1)
Each block can be at any place in the cache. Cache access is a parallel (associative) match of the tag in the address against the tags in all cache entries. Associative caches are slower and more expensive, but achieve a higher hit ratio.
Main memory address: tag (12 bits) | word (4 bits).

Slide 26: 3.1.2. Associative Cache (2)
[Figure: a 128-block cache with a 12-bit tag per block; any main memory block can be placed in any of the 128 cache blocks.]

Slide 27: 3.1. Mapping Function 3: Set-Associative Cache (1)
A combination of direct mapped and associative: the cache consists of sets; the mapping of a block to a set is direct, determined by the set number, and each set is associative.
Main memory address: tag (6 bits) | set (6 bits) | word (4 bits).

Slide 28: 3.1.3. Set-Associative Cache (2)
[Figure: 128 blocks organized as 64 two-way sets, each block with a 6-bit tag.]
Q: What is wrong in this picture? A: There are 64 sets, so block 64 also goes to set 0.

Slide 29: 3.1.3. Set-Associative Cache (3)
[Figure: the corrected picture: with 64 sets, main memory blocks 0, 64, 128, ... all map to set 0.]

Slide 30: Question
Main memory: 4 GByte. Cache: 512 blocks of 64 bytes, 8-way set-associative (set size is 8). All memories are byte addressable.
Q: How many bits are the:
- byte address within a block
- set number
- tag

Slide 31: Answer
Main memory is 4 GByte, so addresses are 32 bits. A block is 64 bytes, so the byte address within a block is 6 bits. An 8-way set-associative cache with 512 blocks has 512/8 = 64 sets, so the set number is 6 bits. That leaves a 32-6-6 = 20-bit tag.
Address layout: tag (20 bits) | set (6 bits) | word (6 bits).
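The answer's field layout can be sketched directly in code (Python; the function name and sample addresses are mine):

```python
def split_address(addr, tag_bits=20, set_bits=6, word_bits=6):
    """(tag, set, byte-in-block) split for the question's 4 GByte memory
    with a 512-block, 8-way set-associative cache of 64-byte blocks."""
    word = addr & ((1 << word_bits) - 1)
    set_no = (addr >> word_bits) & ((1 << set_bits) - 1)
    tag = addr >> (word_bits + set_bits)
    return tag, set_no, word

# Blocks that are 64 block-numbers apart map to the same set
# and are distinguished only by their tags:
print(split_address(0x00000040))  # (0, 1, 0)
print(split_address(0x00001040))  # (1, 1, 0)
```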

Slide 32: 3.2. Replacement Algorithm: Replacement (1)
(Set-)associative replacement algorithms: Least Recently Used (LRU).
- with 2^k blocks per set, implement with a k-bit counter per block
- hit: increase by 1 all counters lower than the referenced block's, and set the referenced block's counter to 0
- miss and set not full: place the new block, set its counter to 0, increase the rest
- miss and set full: replace the block with the highest counter value (2^k - 1), set the new block's counter to 0, increase the rest
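The three cases above translate into a short sketch (Python; `None` marks an empty block, and the function name and list representation are my own choices, not hardware):

```python
def lru_access(blocks, counters, tag):
    """One access to a cache set under counter-based LRU; blocks and
    counters are parallel lists of length 2**k."""
    n = len(blocks)
    if tag in blocks:                          # hit
        i = blocks.index(tag)
        for j in range(n):
            if blocks[j] is not None and counters[j] < counters[i]:
                counters[j] += 1
        counters[i] = 0
    elif None in blocks:                       # miss, set not full
        i = blocks.index(None)
        for j in range(n):
            if blocks[j] is not None:
                counters[j] += 1
        blocks[i], counters[i] = tag, 0
    else:                                      # miss, set full
        i = counters.index(n - 1)              # victim has counter 2**k - 1
        for j in range(n):
            counters[j] += 1
        blocks[i], counters[i] = tag, 0

blocks, counters = [None] * 4, [0] * 4        # one set, k=2
for t in "ABCDE":                              # five distinct tags
    lru_access(blocks, counters, t)
print(blocks)  # ['E', 'B', 'C', 'D']: A, least recently used, was evicted
```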

Slide 33: 3.2.1. LRU: Example 1
[Figure: a set of four blocks with 2-bit counters (k=2), before: 01, 00, 10, 11. A HIT on the block with counter 10: the lower counters are increased (01 to 10, 00 to 01), the referenced block's counter becomes 00 (now at the top of the LRU order), and 11 is unchanged. After: 10, 01, 00, 11.]

Slide 34: 3.2.2. LRU: Example 2
[Figure: a set with an empty block (k=2). A miss while the set is not full: the new block goes into the empty slot with counter 00 (now at the top of the LRU order), and the counters of the occupied blocks are increased.]

Slide 35: 3.2.3. LRU: Example 3
[Figure: a full set (k=2), before: 01, 00, 10, 11. A miss while the set is full: the block with counter 11 is replaced, the new block's counter is set to 00 (now at the top of the LRU order), and the rest are increased. After: 10, 01, 11, 00.]

Slide 36: 3.2. Replacement Algorithm: Replacement (2)
Alternatives to LRU:
- replace the oldest block: First-In-First-Out (FIFO)
- Least-Frequently Used (LFU)
- random replacement

Slide 37: 3.3. Example (1): Program

int SUM = 0;
for (j = 0; j < 10; j++) {
    SUM = SUM + A[0][j];
}
AVE = SUM / 10;
for (i = 9; i > -1; i--) {
    A[0][i] = A[0][i] / AVE;
}

The program normalizes the elements of row 0 of array A. The first pass runs from start to end, the second pass from end to start.

Slide 38: 3.3. Example (2): Cache
Cache: 8 blocks, 2 sets, each block 1 word, LRU replacement.
Address layouts (16-bit addresses):
- direct mapped: tag (13 bits) | block (3 bits)
- associative: tag (16 bits)
- set-associative: tag (15 bits) | set (1 bit)

Slide 39: 3.3. Example (3): Array
[Figure: memory addresses of the 4x10 array A, stored in column-major ordering starting at address 7A00 (hex): a(0,0), a(1,0), a(2,0), a(3,0), a(0,1), ..., a(0,9), ..., a(3,9). The elements of row 0 are four locations apart. The figure also shows how each address splits into the direct-mapped, set-associative, and associative tags.]

Slide 40: 3.3. Example (4): Direct Mapped
[Figure: contents of the cache after each pass. The elements of row 0 are also 4 locations apart in the cache, so a[0,0], a[0,2], a[0,4], ... keep evicting each other from the same blocks in both passes; every access is a miss.]
Conclusion: of 20 accesses, none are in the cache.

Slide 41: 3.3. Example (5): Associative
[Figure: contents of the cache after each pass. With 8 blocks and LRU, after the first pass the cache holds a[0,2] through a[0,9], so in the second pass all of i=9 down to i=2 are in the cache; i=1 and i=0 miss.]
Conclusion: of 20 accesses, 8 are in the cache.

Slide 42: 3.3. Example (6): Set-Associative
[Figure: contents of the cache after each pass. All elements of row 0 map to set 0 (4 blocks); after the first pass set 0 holds a[0,6] through a[0,9], so in the second pass i=9 down to i=6 are in the cache.]
Conclusion: of 20 accesses, 4 are in the cache.

Slide 43: 3.4. Example: PowerPC (1)
The PowerPC 604:
- separate data and instruction caches
- caches are 16 KBytes
- four-way set-associative
- 128 sets
- each block has 8 words of 32 bits

Slide 44: 3.4. Example: PowerPC (2)
[Figure: lookup of address 003F408 (hex): the address splits into a 20-bit tag (003F4), a 7-bit set number (here set 0), and a 5-bit word address in the block. The set number selects one of the 128 sets; the tags of the four blocks in the set are compared in parallel with the address tag, and on a match (=? yes) the word is read from the matching block.]

Slide 45: Agenda
1. Basic Concepts
2. Performance Considerations
3. Caches
4. Virtual Memory
   4.1. Basic Concepts
   4.2. Address Translation

Slide 46: 4.1. Virtual Memory (1)
Problem: a compiled program does not fit into memory.
Solution: virtual memory, where the logical address space is larger than the physical address space.
- Logical address space: the addresses referable by instructions
- Physical address space: the addresses referable in the real machine

Slide 47: 4.1. Virtual Memory (2)
Realizing virtual memory requires an address conversion a_m = f(a_v), where a_m is the physical address (machine address) and a_v is the virtual address. This is generally done by hardware.

Slide 48: 4.1. Organization
[Figure: the processor issues virtual addresses a_v to the MMU, which translates them into physical addresses a_m for the cache and main memory; data moves between main memory and disk storage by DMA transfer.]

Slide 49: 4.2. Address Translation
The basic approach is to partition both the physical address space and the virtual address space into equally sized blocks called pages. A virtual address is composed of:
- a page number
- a word number within the page (the offset)

Slide 50: 4.2. Page Tables (1)
[Figure: the virtual address from the processor splits into a virtual page number and an offset. The page table base register plus the virtual page number locate the entry in the page table, which resides in main memory; the entry holds control bits and the page frame number, which is combined with the offset to form the physical address.]

Slide 51: 4.2. Page Tables (2)
Having page tables only in main memory is much too slow: it costs an additional memory access for every instruction and operand. Solution: keep a cache with recent address translations, a Translation Look-aside Buffer (TLB).

Slide 52: 4.2. Operation of TLB
Idea: keep the most recent address translations. [Figure: the virtual page number of the address from the processor is matched against the TLB entries; on a hit, the TLB supplies the real page number, which is combined with the offset to form the physical address; on a miss, the page table is consulted.]
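A software sketch of the hardware on this slide (Python; the dictionary-based TLB and page table, the function name, and the sample addresses are all illustrative assumptions, and a real TLB has limited capacity and its own replacement policy):

```python
def translate(vaddr, page_table, tlb, page_bits=12):
    """Virtual-to-physical translation with a TLB in front of the
    page table, for 4 KByte pages (12-bit offset)."""
    vpn = vaddr >> page_bits                 # virtual page number
    offset = vaddr & ((1 << page_bits) - 1)  # offset within the page
    if vpn in tlb:                           # TLB hit: no page-table access
        frame = tlb[vpn]
    else:                                    # TLB miss: consult page table
        frame = page_table[vpn]
        tlb[vpn] = frame                     # keep the translation cached
    return (frame << page_bits) | offset

page_table = {0x12345: 0xAB}                 # hypothetical single entry
tlb = {}
print(hex(translate(0x12345678, page_table, tlb)))  # 0xab678
print(0x12345 in tlb)                        # True: translation now cached
```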

Slide 53: 4.2. Policies
- The pages of a process in main memory form its resident set
- The mechanism works because of the principle of locality
- Page replacement algorithms are needed
- Protection is possible through the page table register
- Sharing is possible through the page table
- Hardware support: the Memory Management Unit (MMU)

Slide 54: Question
Main memory: 256 MByte. Maximal virtual address space: 4 GByte. Page size: 4 KByte. All memories are byte addressable.
Q: How many bits are the:
- offset within a page
- virtual page frame number
- (physical) page frame number

Slide 55: Answer
Virtual address: 32 bits (2^32 = 4 GByte). Physical address: 28 bits (2^28 = 256 MByte). Offset within a page: 12 bits (2^12 = 4 KByte). Virtual page frame number: 32-12 = 20 bits. Physical page frame number: 28-12 = 16 bits.
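The same arithmetic, checked in Python (the sizes are exact powers of two, so `int.bit_length() - 1` gives their base-2 logarithms):

```python
# Bit widths derived from the sizes given on the slide:
virtual_bits = (4 * 2**30).bit_length() - 1    # 4 GByte   -> 32
physical_bits = (256 * 2**20).bit_length() - 1  # 256 MByte -> 28
offset_bits = (4 * 2**10).bit_length() - 1      # 4 KByte   -> 12

print(virtual_bits - offset_bits)   # 20: virtual page frame number
print(physical_bits - offset_bits)  # 16: physical page frame number
```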

