Memory Architecture Chapter 5 in Hennessy & Patterson.

ENGR 6861, Fall 2005. R. Venkatesan, Memorial University.

Slide 2: Memory lags in speed (figure)

Slide 3: Basic Ideas – Locality
Instructions and data exhibit both spatial and temporal locality.
– Temporal locality: if a particular instruction or data item is used now, there is a good chance that it will be used again in the near future.
– Spatial locality: if a particular instruction or data item is used now, there is a good chance that the instructions or data items located in memory immediately before or after it will soon be used.
Therefore, it is a good idea to move instructions and data items that are expected to be used soon from slow memory to fast memory (cache). This is prediction, and therefore will not always be correct; its success depends on the extent of locality. The sketch below illustrates the effect.
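
Spatial locality can be observed directly from software: sweeping a row-major array row by row touches adjacent memory, while sweeping it column by column strides across it. A minimal sketch, assuming NumPy is available (the 4096 x 4096 array size is an arbitrary choice for illustration):

    import time
    import numpy as np

    a = np.random.rand(4096, 4096)   # row-major: entries of a row are adjacent in memory

    t0 = time.perf_counter()
    s_rows = sum(a[i, :].sum() for i in range(4096))   # row sweep: good spatial locality
    t1 = time.perf_counter()
    s_cols = sum(a[:, j].sum() for j in range(4096))   # column sweep: strided, poor locality
    t2 = time.perf_counter()

    print(f"row sweep:    {t1 - t0:.3f} s")
    print(f"column sweep: {t2 - t1:.3f} s  (typically noticeably slower)")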

Slide 4: Memory Hierarchy – Why and How
Memory is much slower than the processor, and faster memories are more expensive. Because of decoding time and other factors, larger memories are always slower. Therefore, a small but very fast memory (SRAM: the L1 cache) is placed very close to the processor; the L2 cache is larger and slower. Main memory is usually gigabytes in size, is built from DRAM, and takes several nanoseconds to access. Secondary memory is on disc (hard disk, CD, DVD), holds hundreds of gigabytes, and takes milliseconds to access; it plays the role of a warehouse. CPU registers are closest to the CPU, but they are not part of the memory address space – they have separate identifiers.

Slide 5: Speedup due to cache – Example
Assume a 1 GHz processor using 10 ns memory, where 35% of all executed instructions are loads or stores. The application runs 1 billion instructions.
Execution time without a cache = (1 + 10 + 0.35*10) = 14.5 ns per instruction, i.e. 14.5 s in total for 10^9 instructions.
Assume all instructions and data required are held in a perfect cache that operates within the clock period. Execution time with a perfect cache = 1 ns per instruction.
Now assume that the cache has a hit rate of 90%. Execution time with this cache = (1 + 0.1*10 + 0.35*0.1*10) = 2.35 ns per instruction.
Caches are 95-99% successful in holding the required instructions and 75-90% successful for data. How do we design a better cache (or caches)?
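
These figures are easy to check mechanically. A minimal sketch of the arithmetic, with all parameters taken from the slide:

    # Parameters from the slide
    clk = 1.0          # 1 GHz processor: 1 ns per clock
    mem = 10.0         # memory access time, ns
    f_ldst = 0.35      # fraction of instructions that are loads/stores

    def time_per_instr(miss_rate):
        """Average time per instruction in ns: execute + fetch misses + data misses."""
        return clk + miss_rate * mem + f_ldst * miss_rate * mem

    print(time_per_instr(1.0))   # no cache (every access goes to memory): 14.5 ns
    print(time_per_instr(0.0))   # perfect cache: 1.0 ns
    print(time_per_instr(0.1))   # 90% hit rate: 2.35 ns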

Slide 6: von Neumann Architecture
Memory holds instructions (in sequence) and data, and memory items are not distinguished by their contents: any memory item is a string of bits.
Most modern processors have instruction pipelines, and instruction storage exhibits the stronger locality. Because of these two points, the L1 cache is usually split into an instruction cache (IC) and a data cache (DC) – the Harvard architecture.
Nowadays an IC or DC usually holds 8, 16, 32, 64 or 128 kB of information, plus overheads.
Tape uses sequential access; RAM/ROM uses random access; disc uses random/sequential access; caches use associative access (i.e. part of the information is used to find the remainder).

Slide 7: Cache – mapping options (figure)

Slide 8: Cache – how is a block found?
The index bits are decoded to select one set, and the tag bits stored with each block in this set are compared with the tag bits of the address.
Cache miss: no comparison succeeds. Cache hit: exactly one comparison succeeds, and the block is identified.
The block offset bits select the byte or word needed.
Fully associative: no index field. Direct mapped: largest index field, but only one comparison. The sketch below shows the address decomposition.
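
A minimal sketch of the address decomposition (the field widths are borrowed from the first example on slide 12 – 64 B blocks, 512 sets – but any other widths work the same way):

    OFFSET_BITS = 6    # 64 B blocks
    INDEX_BITS = 9     # 512 sets

    def split_address(addr):
        """Split an address into (tag, index, offset) fields."""
        offset = addr & ((1 << OFFSET_BITS) - 1)
        index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
        tag = addr >> (OFFSET_BITS + INDEX_BITS)
        return tag, index, offset

    # On a lookup: decode `index` to select a set, then compare `tag` against
    # every block in that set; a hit occurs when exactly one stored tag matches
    # and that block's valid bit is set.
    print(split_address(0x2A5F3C))   # example address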

Slide 9: 2-way set-associative cache (figure)

Slide 10: Cache performance
Average memory access time = cache hit time + cache miss rate * miss penalty.
To improve performance, i.e. reduce average memory access time, we need to reduce hit time, miss rate and miss penalty. As L1 caches are on the critical path of instruction execution, hit time is the most important parameter. When one parameter is improved, the others may suffer.
Misses come in three classes: compulsory, capacity and conflict. A compulsory miss always occurs on the first access to a block. Capacity misses decrease with cache size. Conflict misses decrease with the degree of associativity.
Typical choices: IC/DC 2-way set-associative; DC write through and write no-allocate.
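
Because an L1 miss penalty is itself the average access time of the next level, the formula composes across cache levels. A minimal sketch; the latencies and miss rates below are assumed for illustration, not taken from the slides:

    def amat(hit_time, miss_rate, miss_penalty):
        """Average memory access time = hit time + miss rate * miss penalty."""
        return hit_time + miss_rate * miss_penalty

    # Assumed numbers, in clock cycles: L1 hits in 1, L2 hits in 10,
    # main memory costs 100; the miss rates are illustrative.
    l2_amat = amat(hit_time=10, miss_rate=0.12, miss_penalty=100)   # 22.0 cycles
    l1_amat = amat(hit_time=1, miss_rate=0.05, miss_penalty=l2_amat)
    print(f"L2 AMAT: {l2_amat} cycles, L1 AMAT: {l1_amat} cycles")  # L1 AMAT: 2.1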

Slide 11: Cache – write hit & write miss
Write: lower levels of cache and main memory must be updated whenever a store instruction modifies the DC.
Write hit: the item to be modified is in the DC.
– Write through: write to L2 as well, as if there were no DC.
– Write back: set a dirty bit, and update L2 only when this block is replaced.
– Although write through is the less efficient strategy, most DCs and some lower-level caches use it so that read hit time is not lengthened by the logic needed to maintain a dirty bit.
Write miss: the item to be modified is not in the DC.
– Write allocate: exploit locality, and bring the block into the DC.
– Write no-allocate: do not fetch the missing block.
Usually write through and write no-allocate are used together, as are write back and write allocate; the sketch below contrasts the two pairings.
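
A minimal sketch contrasting the two pairings by counting the writes that reach the next level; the single-set cache, FIFO eviction and the trace are hypothetical simplifications:

    def simulate(writes, policy, capacity=4):
        """Return next-level write count for 'through' (+ no-allocate) or 'back' (+ allocate)."""
        cache, dirty, next_level_writes = [], set(), 0
        for block in writes:
            if block in cache:                      # write hit
                if policy == "through":
                    next_level_writes += 1          # every hit also writes to the next level
                else:
                    dirty.add(block)                # write back: just mark the block dirty
            else:                                   # write miss
                if policy == "through":
                    next_level_writes += 1          # no-allocate: write around the cache
                    continue
                if len(cache) == capacity:          # allocate: may evict a dirty block
                    victim = cache.pop(0)
                    if victim in dirty:
                        next_level_writes += 1      # dirty eviction updates the next level
                        dirty.discard(victim)
                cache.append(block)
                dirty.add(block)
        return next_level_writes

    trace = [1, 2, 3, 4, 5, 1]
    print(simulate(trace, "through"))   # 6: one next-level write per store
    print(simulate(trace, "back"))      # 2: only the two dirty evictions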

Slide 12: Cache size examples
Overhead bits: tag bits, a valid bit, and (for write back) a dirty bit.
Example 1: 46 address bits; 64 B blocks; 64 kB 2-way set-associative IC/DC; the DC uses write through and write no-allocate. Block offset bits: 6. Index bits: 9, since there are 1024 blocks in 512 sets. Tag bits: 46 - (6 + 9) = 31. IC/DC size: 64 kB of data plus (31 + 1) bits * 1024 blocks = 64 kB + 4 kB (no dirty bit is needed with write through).
Example 2: 40 address bits; 32 B blocks; 1 MB L2 cache, 4-way set-associative, using write back and write allocate. Block offset bits: 5. Index bits: 13, since there are 2^15 blocks in 8192 sets. Tag bits: 40 - (5 + 13) = 22. L2 cache size: 1 MB + (22 + 1 + 1)/8 * 2^15 B = 1 MB + 96 kB.
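
Both calculations follow the same recipe, so they can be checked with one helper. A minimal sketch (the function and parameter names are made up for illustration):

    def cache_overhead(addr_bits, block_bytes, data_bytes, ways, dirty_bit):
        """Return (tag_bits, overhead_bytes) for a set-associative cache."""
        blocks = data_bytes // block_bytes
        sets = blocks // ways
        offset_bits = block_bytes.bit_length() - 1      # log2(block size)
        index_bits = sets.bit_length() - 1              # log2(number of sets)
        tag_bits = addr_bits - offset_bits - index_bits
        per_block = tag_bits + 1 + (1 if dirty_bit else 0)   # tag + valid (+ dirty)
        return tag_bits, per_block * blocks // 8

    # Example 1: write through, no dirty bit -> (31, 4096), i.e. 4 kB overhead
    print(cache_overhead(46, 64, 64 * 1024, ways=2, dirty_bit=False))
    # Example 2: write back -> (22, 98304), i.e. 96 kB overhead
    print(cache_overhead(40, 32, 1024 * 1024, ways=4, dirty_bit=True))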

Slide 13: Cache performance examples
A 5 GHz pipelined processor; the IC hits 98% of the time; the L1 read miss rate is 10% and the write miss rate 5%; 25% of all instructions are loads and 10% are writes. Fetching a block from the L2 cache takes 40 clocks. L2 misses 12% of the time, with a penalty of 25 ns (125 clocks at 5 GHz).
Instruction execution time = 1 + (0.02*(40+(0.12*125))) + (0.25*0.1*(40+(0.12*125))) + (0.1*0.05*(40+(0.12*125))) clk = 3.75 clk = 0.75 ns.
An L3 cache might improve the performance. Most general-purpose computers today use CPU chips that have on-chip IC and DC, an on-package L2 cache, and an L3 cache on the motherboard. Caches are also used on sound cards, video cards and other special-purpose hardware.
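
A minimal sketch that reproduces the computation, with all parameters taken from the slide:

    clk_ns = 0.2                     # 5 GHz -> 0.2 ns per clock
    l2_fetch = 40                    # clocks to fetch a block from L2
    l2_penalty = 25 / clk_ns         # 25 ns L2 miss penalty = 125 clocks
    l1_miss_cost = l2_fetch + 0.12 * l2_penalty   # average L1 miss cost: 55 clocks

    cpi = (1                                  # base pipelined execution
           + 0.02 * l1_miss_cost              # IC misses (2% of fetches)
           + 0.25 * 0.10 * l1_miss_cost       # load misses (25% loads, 10% miss)
           + 0.10 * 0.05 * l1_miss_cost)      # write misses (10% writes, 5% miss)

    print(f"{cpi:.2f} clk = {cpi * clk_ns:.2f} ns per instruction")   # 3.75 clk = 0.75 ns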

Slide 14: Main memory organizations (figure)

Slide 15: Memory interleaving
Access time = address transmit + decode + amplify + data transmit; decode is the largest component.
In an interleaved memory system, the address transmission and decoding happen once, in common for all banks; only the data transmissions from the individual banks occur one after another, in pipelined fashion (the slide's example arrives at 9.2 ns with 0.6 ns per data transfer).
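
A minimal sketch of the timing argument. Only the 0.6 ns per data transfer and the 9.2 ns total come from the slide; the 6.8 ns combined one-time cost and the four banks are assumptions chosen to be consistent with those figures:

    # 0.6 ns per data transfer and the 9.2 ns total are the slide's figures;
    # the 6.8 ns one-time cost (address transmit + decode + amplify) and the
    # 4 banks are assumptions consistent with them.
    one_time, per_transfer, banks = 6.8, 0.6, 4

    interleaved = one_time + banks * per_transfer    # 6.8 + 4 * 0.6 = 9.2 ns
    sequential = banks * (one_time + per_transfer)   # 4 * 7.4 = 29.6 ns
    print(f"interleaved: {interleaved:.1f} ns, sequential: {sequential:.1f} ns")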

Slide 16: Virtual Memory
Relocatability: the programmer and the compiler do not have to worry about the exact location of information. Any addresses can be used, with placement left to the loader at run time.
Security: using extra bits, access to any piece of information can be regulated, and controlled sharing becomes possible.
Different size: virtual memory can be larger than real memory, although nowadays this is not the major reason.
Virtual memory can be implemented using:
– Paging: most common; discussed in detail here.
– Segmentation: used in early Intel processors.
– Hybrid schemes: used by some processors.
Virtual addresses are translated into real addresses.

Slide 17: Disk & Real memories
The operating system (OS) uses process identifier bits (PID or ASN) and thus permits the same virtual addresses to be used by different processes.

Slide 18: Address Translation
The page table is located in main memory; each of its entries, indexed by a virtual page address, contains a real page address. Page size: 16 kB or 64 kB.
Every instruction execution refers to memory roughly 1.35 times, so a cache for the page table is needed: the TLB. A translation sketch follows below.
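
A minimal sketch of the translation step, using a plain dict as the page table; the 16 kB page size matches the slide, while the mappings and addresses are hypothetical:

    PAGE_BITS = 14                       # 16 kB pages
    PAGE_SIZE = 1 << PAGE_BITS

    page_table = {0: 5, 1: 9, 7: 2}      # virtual page number -> real page number

    def translate(vaddr):
        """Translate a virtual address to a real address via the page table."""
        vpn, offset = vaddr >> PAGE_BITS, vaddr & (PAGE_SIZE - 1)
        if vpn not in page_table:
            raise KeyError(f"page fault: virtual page {vpn} not resident")
        return (page_table[vpn] << PAGE_BITS) | offset

    print(hex(translate(0x4042)))        # VPN 1 -> real page 9: prints 0x24042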

Slide 19: Address Translation Cache: TLB
The TLB is usually split into an ITLB and a DTLB. There is only one level; when the TLB misses, an exception occurs, and the OS consults the page table and updates the TLB. The TLB also takes care of protection.
ITLB: 12-40 entries (roughly 200 B to 1 kB); DTLB: 32-64 entries. TLBs are read-only and fully associative.
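
Extending the page-table sketch above with a TLB consulted first; the dict-based TLB and the inline miss handling are hypothetical simplifications of what the hardware and OS actually do:

    PAGE_BITS = 14                            # 16 kB pages, as on slide 18
    page_table = {0: 5, 1: 9, 7: 2}           # virtual page -> real page (hypothetical)
    tlb = {}                                  # small fully-associative cache of PT entries

    def translate_with_tlb(vaddr):
        vpn, offset = vaddr >> PAGE_BITS, vaddr & ((1 << PAGE_BITS) - 1)
        if vpn not in tlb:                    # TLB miss: exception; OS walks the page table
            tlb[vpn] = page_table[vpn]        # ...and refills the TLB (eviction omitted)
        return (tlb[vpn] << PAGE_BITS) | offset

    print(hex(translate_with_tlb(0x4042)))    # first access: TLB miss, then refill
    print(hex(translate_with_tlb(0x4100)))    # same virtual page: TLB hit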

Slide 20: Memory Access: TLB & caches
The TLB is consulted first, the L1 cache next, and then, if necessary, the L2 cache, and so on.

Slide 21: Page Table access (figure)

Slide 23: Summary
Memory design is extremely important, and memory is organized hierarchically.
Caches improve performance because of the locality present in instruction and data accesses.
L1 cache design is important because it is on the critical path; it is split into IC and DC if the processor is pipelined.
The (unified) L2 cache is larger and slower than L1, the L3 cache is larger and slower than L2, and main memory is larger and slower than L3.
Cache performance suffers when stores are performed.
Virtual memory is used for relocatability and protection. Page tables perform virtual-to-real page address translation, and TLBs are caches for the page table.