The Memory System (Chapter 5)
Agenda
1. Basic Concepts
   1.1. Organization
   1.2. Pinning
2. Performance Considerations: Interleaving, Hit ratio/rate, etc.
3. Caches
4. Virtual Memory
TU-Delft TI1400/11-PDS
Organization
[Figure: memory organization, contrasting word addresses with byte addresses]
Connection Memory-CPU
[Figure: CPU connected to memory via Address and Data lines and the Read/Write and MFC control signals; the CPU holds the MAR (memory address register) and MDR (memory data register)]
Memory: contents
- Addressable number of bits
- Different orderings
- Speed-up techniques: memory interleaving, cache memories
- Enlargement: virtual memory
Organization (1)
[Figure: 16x8 memory array — a 4-bit address (A0-A3) feeds an address decoder that drives word lines W0-W15; each cell is a flip-flop (FF) with sense/write circuits connecting to input/output lines b7..b0, controlled by R/W and CS]
Pinning
Total number of pins required for a 16x8 memory:
- 4 address lines
- 8 data lines
- 2 control lines (R/W, CS)
- 2 power lines
Total: 16 pins
[Figure: a 1K by 1 memory built from a 32 by 32 memory array — a 10-bit address is split between a 5-bit decoder selecting word lines W0-W31 and two 32-to-1 multiplexors selecting the bit within a row; a single in/out data line]
Pinning
Total number of pins required for a 1024x1 memory:
- 10 address lines
- 1 data line (in/out)
- 2 control lines
- 2 power lines
Total: 15 pins
For a 128x8 memory: 19 pins (7 + 8 + 2 + 2)
Conclusion: the smaller the addressable unit, the fewer pins needed
Agenda
1. Basic Concepts
2. Performance Considerations
   2.1. Interleaving
   2.2. Performance Gap Processor-Memory
   2.3. Caching
   2.4. A Performance Model: Hit ratio, Performance Penalty, etc.
3. Caches
4. Virtual Memory
Interleaving: Multiple Modules (1)
[Figure: the main-memory (MM) address is split into k module bits and m address-in-module bits; the module bits drive the Chip Select (CS) of one of the n modules]
Block-wise organization: consecutive words reside in a single module.
Interleaving: Multiple Modules (2)
[Figure: the MM address is split into m address-in-module bits (high order) and k module bits (low order); the module bits drive the Chip Select (CS) of one of the 2^k modules]
Interleaved organization: consecutive words reside in consecutive modules.
Questions
Q: What is the advantage of the interleaved organization? What is the disadvantage?
A: Higher bandwidth CPU-memory: data can be transferred to/from multiple modules simultaneously. But when a module breaks down, memory has many small holes.
Problem: The Performance Gap Processor-Memory
- Processor: CPU speeds 2x every 2 years (~Moore's Law; limit ~2010)
- Memory: DRAM speeds 2x every 7 years
- Gap: 2x every 2 years
- Gap still growing?
Idea: Memory Hierarchy
[Figure: hierarchy from CPU through primary cache (L1), secondary cache (L2), and main memory down to disks; size increases toward the disks, while speed and cost per bit increase toward the CPU]
Caches (1)
Problem: main memory is slower than CPU registers (factor of 5-10)
Solution: a fast and small memory between CPU and main memory
Contains: recent references to memory
[Figure: CPU - Cache - Main memory]
Caches (2) / 2.4. A Performance Model
Works because of the locality principle
- cache hit ratio (rate): h
- cache miss ratio (rate): 1-h
- access time cache: c
- access time main memory: m
- mean access time: h·c + (1-h)·m
The cache is transparent to the programmer
Caches (3)
READ operation:
- if not in cache, copy the block into the cache and read out of the cache (possibly read-through)
- if in cache, read out of the cache
WRITE operation:
- if not in cache, write in main memory
- if in cache, write in the cache, and either:
  - write in main memory as well (store through), or
  - set the modified (dirty) bit, and write back later
Caches (4): The Library Analogy
Real-world analogue:
- borrow books from a library
- store these books, according to the first letter of the first author's name, in 26 locations (A-Z)
Direct mapped: a separate location for a single book for each letter of the alphabet
Associative: any book can go to any of the 26 locations
Set-associative: two locations for letters A-B, two for C-D, etc.
Caches (5)
Suppose:
- size of main memory in bytes: N = 2^n
- block size in bytes: b = 2^k
- number of blocks in cache: 128
- e.g., n = 16, k = 4, b = 16
Every block in the cache has a valid bit (reset when memory is modified)
At a context switch: invalidate the cache
Agenda
1. Basic Concepts
2. Performance Considerations
3. Caches
   3.1. Mapping Function
   3.2. Replacement Algorithm
   3.3. Examples of Mapping
   3.4. Examples of Caches in Commercial Processors
   3.5. Write Policy
   3.6. Number of Blocks/Caches/...
4. Virtual Memory
Mapping Function 1. Direct Mapped Cache (1)
A block in main memory can be at only one place in the cache
This place is determined by its block number j: place = j modulo number of cache blocks
Main memory address (16 bits): tag (5) | block (7) | word (4)
Direct Mapped Cache (2)
[Figure: main-memory blocks 0, 128, 256, ... all map to cache block 0, blocks 1, 129, ... to cache block 1, and so on up to block 127; each cache block stores a 5-bit tag identifying which memory block it holds]
Direct Mapped Cache (3)
[Figure: the same mapping viewed from the cache side — cache blocks 0..127 with their tags, and the memory blocks 0..127, 128..255, ... that compete for each position]
Mapping Function 2. Associative Cache (1)
Each block can be at any place in the cache
Cache access: parallel (associative) match of the tag in the address against the tags in all cache entries
Associative: slower, more expensive, higher hit ratio
Main memory address (16 bits): tag (12) | word (4)
Associative Cache (2)
[Figure: any main-memory block (0..255, ...) can be placed in any of the 128 cache blocks; each cache block stores a 12-bit tag]
Mapping Function 3. Set-Associative Cache (1)
Combination of direct mapped and associative
The cache consists of sets
Mapping of a block to a set is direct, determined by the set number
Each set is associative
Main memory address (16 bits): tag (6) | set (6) | word (4)
Set-Associative Cache (2)
[Figure: 128 cache blocks organized as 64 two-way sets (set 0: blocks 0-1, set 1: blocks 2-3, ...); each block stores a 6-bit tag; main-memory blocks 0..255 shown mapping to the sets]
Q: What is wrong in this picture?
A: There are 64 sets, so memory block 64 also goes to set 0
Set-Associative Cache (3)
[Figure: the corrected mapping — memory blocks 0, 64, 128, 192, ... all map to set 0; each of the 64 two-way sets holds two blocks with their 6-bit tags]
Question
- Main memory: 4 GByte
- Cache: 512 blocks of 64 byte
- Cache: 8-way set-associative (set size is 8)
- All memories are byte addressable
Q: How many bits is the:
- byte address within a block
- set number
- tag
Answer
Main memory is 4 GByte, so 32-bit addresses
A block is 64 byte, so a 6-bit byte address within a block
8-way set-associative cache with 512 blocks, so 512/8 = 64 sets, so a 6-bit set number
So, 32 - 6 - 6 = 20-bit tag
Address layout: tag (20) | set (6) | word (6)
Replacement Algorithm: Replacement (1)
(Set-)associative caches need a replacement algorithm, e.g. Least Recently Used (LRU):
- if 2^k blocks per set, implement with a k-bit counter per block
- hit: increase by 1 the counters lower than the one referenced; set the referenced block's counter to 0
- miss and set not full: place the new block, set its counter to 0, increase the rest
- miss and set full: replace the block with the highest counter value (2^k - 1), set the new block's counter to 0, increase the rest
LRU: Example (k=2, 4 blocks per set)
[Figure: HIT — counters lower than the referenced block's are increased, the others are unchanged; the referenced block is now at the top (counter 0)]
LRU: Example (k=2)
[Figure: miss and set not full — the new block fills an EMPTY entry with counter 0; the other counters are increased]
LRU: Example (k=2)
[Figure: miss and set full — the block with the highest counter is replaced; the new block gets counter 0 and the other counters are increased]
Replacement Algorithm: Replacement (2)
Alternatives to LRU:
- replace the oldest block: First-In-First-Out (FIFO)
- Least-Frequently Used (LFU)
- random replacement
Example (1): program

int SUM = 0;
for (j = 0; j < 10; j++) {
    SUM = SUM + A[0,j];
}
AVE = SUM / 10;
for (i = 9; i > -1; i--) {
    A[0,i] = A[0,i] / AVE;
}

Normalize the elements of row 0 of array A
First pass: from start to end
Second pass: from end to start
Example (2): cache
[Figure: cache of 8 blocks (block 0..7), each with a tag, organized as Set 0 and Set 1 for the set-associative case]
Cache: 8 blocks in 2 sets, each block 1 word, LRU replacement
Address layouts (16-bit word address):
- direct: tag (13) | block (3)
- associative: tag (16)
- set-associative: tag (15) | set (1)
Example (3): array
[Figure: memory layout starting at address 7A00: a(0,0), a(1,0), a(2,0), a(3,0), a(0,1), ..., a(0,9), a(1,9), a(2,9), a(3,9), annotated with the tags for the direct, set-associative, and associative cases]
4x10 array, column-major ordering; elements of row 0 are four locations apart
Example (4): direct mapped
[Table: contents of the cache after passes j=1, 3, 5, 7, 9 and i=6, 4, 2, 0 — the block positions keep being overwritten, alternating between a[0,0], a[0,2], ..., a[0,8] and a[0,1], a[0,3], ..., a[0,9]; every access is a miss]
Elements of row 0 are also 4 locations apart in the cache
Conclusion: of the 20 accesses, none hit in the cache
Example (5): associative
[Table: contents of the cache after passes j=7, j=8, j=9 and i=1, i=0 — a[0,0]..a[0,7] fill the 8 blocks; a[0,8] and a[0,9] then evict the least recently used entries a[0,0] and a[0,1]]
From i=9 down to i=2, all accesses hit in the cache
Conclusion: of the 20 accesses, 8 hit in the cache
Example (6): set-associative
[Table: contents of set 0 after passes j=3, j=7, j=9 and i=4, i=2, i=0 — the four blocks of set 0 cycle through a[0,0]..a[0,3], a[0,4]..a[0,7], and so on]
All elements of row 0 map to set 0
From i=9 down to i=6, all accesses hit in the cache
Conclusion: of the 20 accesses, 4 hit in the cache
Example: PowerPC (1)
PowerPC 604:
- separate data and instruction caches
- caches are 16 KByte
- four-way set-associative
- 128 sets
- each block has 8 words of 32 bits
Example: PowerPC (2)
[Figure: lookup of address ...F408 — the address is split into tag, set number, and word address in block; the set number selects a set (here set 0), whose four stored tags (e.g., 00BA2, 003F4) are compared in parallel with the address tag (=? yes/no)]
Agenda
1. Basic Concepts
2. Performance Considerations
3. Caches
4. Virtual Memory
   4.1. Basic Concepts
   4.2. Address Translation
Virtual Memory (1)
Problem: a compiled program does not fit into memory
Solution: virtual memory, where the logical address space is larger than the physical address space
- Logical address space: addresses referable by instructions
- Physical address space: addresses referable in the real machine
Virtual Memory (2)
For realizing virtual memory, we need an address conversion: a_m = f(a_v)
- a_m is the physical address (machine address)
- a_v is the virtual address
This is generally done by hardware
Organization
[Figure: the processor issues virtual address a_v to the MMU, which translates it to physical address a_m for the cache and main memory; data flows back to the processor; DMA transfers move pages between main memory and disk storage]
Address Translation
Basic approach: partition both the physical and the virtual address space into equally sized blocks called pages
A virtual address is composed of:
- a page number
- a word number within the page (the offset)
Page tables (1)
[Figure: the virtual address from the processor is split into a virtual page number and an offset; the page table base register plus the virtual page number selects a page table entry, which holds control bits and a page frame number; page frame | offset forms the physical address. The page table resides in main memory]
Page tables (2)
Having page tables only in main memory is much too slow: it adds a memory access for every instruction and operand
Solution: keep a cache with recent address translations: a Translation Look-aside Buffer (TLB)
Operation of TLB
Idea: keep the most recent address translations
[Figure: the virtual page number of the virtual address from the processor is compared (= ?) with the virtual page numbers stored in the TLB; on a hit, the stored real page number plus the offset forms the physical address; on a miss, the page table is consulted. Each TLB entry holds control bits, a virtual page number, and a real page number]
Policies
- The pages of a process in main memory: the resident set
- The mechanism works because of the principle of locality
- Page replacement algorithms are needed
- Protection is possible through the page table register
- Sharing is possible through the page table
- Hardware support: the Memory Management Unit (MMU)
Question
- Main memory: 256 MByte
- Maximal virtual-address space: 4 GByte
- Page size: 4 KByte
- All memories are byte addressable
Q: How many bits is the:
- offset within a page
- virtual page frame number
- (physical) page frame number
Answer
- Virtual address: 32 bits (2^32 = 4 GByte)
- Physical address: 28 bits (2^28 = 256 MByte)
- Offset in a page: 12 bits (2^12 = 4 KByte)
- Virtual page frame number: 32 - 12 = 20 bits
- Physical page frame number: 28 - 12 = 16 bits