Computer Architecture Memory Hierarchy Lynn Choi Korea University.

Memory Hierarchy
- Motivated by the principle of locality and the speed vs. size vs. cost tradeoff
- Temporal locality: a reference to the same location is likely to recur soon
  - Examples: loops, reuse of variables
  - Exploited by keeping recently accessed data/instructions closer to the processor
- Spatial locality: references to nearby locations are likely
  - Examples: arrays, program code
  - Exploited by accessing a block of contiguous bytes at a time
- Speed vs. size tradeoff
  - Bigger memory is slower but cheaper per byte: SRAM - DRAM - Disk - Tape
  - Faster memory is more expensive and smaller
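The benefit of spatial locality can be made concrete with a toy model. This sketch (my own illustration, not from the slides) assumes a cache that holds a single 4-word block and counts hits for a sequential walk versus a block-sized stride:

```python
# Toy model: count cache hits for two access patterns, assuming a cache
# that holds only the single most recently touched 4-word block.
BLOCK_WORDS = 4  # one cache block holds 4 contiguous words (assumption)

def hits(addresses):
    """Count hits, caching only the last-touched block."""
    cached_block = None
    count = 0
    for a in addresses:
        block = a // BLOCK_WORDS
        if block == cached_block:
            count += 1
        cached_block = block
    return count

sequential = list(range(16))                    # walk an array: spatial locality
strided = [i * BLOCK_WORDS for i in range(16)]  # jump one block each access

print(hits(sequential))  # 12 of 16 accesses hit (3 per 4-word block)
print(hits(strided))     # 0 hits: every access lands in a new block
```

The sequential walk reuses each fetched block three more times, which is exactly why caches transfer whole blocks rather than single words.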

Memory Wall
- CPU performance doubles roughly every 18 months
- DRAM performance improves only about 7% per year, so the processor-memory speed gap keeps widening

Levels of Memory Hierarchy
(faster/smaller at the top, slower/larger at the bottom)
- Registers (capacity ~100s of bytes): hold instruction operands (1-16 B), moved by the program/compiler
- Cache (KBs-MBs): holds cache lines, moved by hardware
- Main memory (GBs): holds pages (512 B - 64 MB), moved by the OS
- Disk (100s of GBs): holds files (any size), moved by the user
- Network (effectively infinite capacity)

Cache
- A small but fast memory located between the processor and main memory
- Benefits
  - Reduce load latency
  - Reduce store latency
  - Reduce bus traffic (for on-chip caches)
- Cache block placement (where to place a block)
  - Fully-associative cache
  - Direct-mapped cache
  - Set-associative cache

Fully Associative Cache
[Figure: a 32 KB cache (SRAM) against a 32-bit physical address space = 4 GB (DRAM); cache blocks (cache lines) are 4 words. A memory block can be placed into any cache block location!]

Fully Associative Cache
[Figure: 32 KB data RAM and tag RAM; every tag is compared (=) in parallel against the address tag, the valid bit is checked, and word & byte select drives data out to the CPU on a hit.]
- Advantages: 1. high hit rate, 2. fast
- Disadvantages: 1. very expensive (one comparator per cache block)

Direct Mapped Cache
[Figure: a 32 KB cache (SRAM) against a 32-bit physical address space = 4 GB (DRAM); memory blocks whose block addresses differ by a multiple of 2^11 map to the same cache block. A memory block can be placed into only a single cache block!]

Direct Mapped Cache
[Figure: 32 KB data RAM and tag RAM; the index field drives a decoder to select one block, a single tag comparison (=) plus valid-bit check signals a hit, and word & byte select drives data out to the CPU.]
- Advantages: 1. simple HW implementation, 2. fast
- Disadvantages: 1. low hit rate
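The tag/index/offset decomposition can be sketched for the slide's parameters. This assumes a 32 KB direct-mapped cache with 16-byte (4-word) blocks and 32-bit physical addresses; the exact field widths on the original slide are illustrative assumptions:

```python
# Address decomposition for a direct-mapped cache (assumed parameters:
# 32 KB cache, 16-byte blocks, 32-bit physical addresses).
CACHE_BYTES = 32 * 1024
BLOCK_BYTES = 16
NUM_BLOCKS = CACHE_BYTES // BLOCK_BYTES       # 2048 blocks = 2^11

OFFSET_BITS = BLOCK_BYTES.bit_length() - 1    # 4 (byte within block)
INDEX_BITS = NUM_BLOCKS.bit_length() - 1      # 11 (selects the block)
TAG_BITS = 32 - INDEX_BITS - OFFSET_BITS      # 17 (stored in tag RAM)

def split(addr):
    """Split a physical address into (tag, index, offset)."""
    offset = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(TAG_BITS, INDEX_BITS, OFFSET_BITS)  # 17 11 4
print(split(0x12345678))
```

Note how addresses with the same index but different tags collide in the same block, which is the source of the direct-mapped cache's conflict misses.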

Set Associative Cache
[Figure: a 32 KB 2-way set associative cache (SRAM) with 2^10 sets (Way 0, Way 1); memory blocks whose block addresses differ by a multiple of 2^10 map to the same set. In an M-way set associative cache, a memory block can be placed into any of M cache blocks!]

Set Associative Cache
[Figure: 32 KB data RAM and tag RAM; the index selects a set, the tags of all ways are compared (=) in parallel, and a way multiplexer (Wmux) plus word & byte select drives the hitting way's data out to the CPU.]
Most caches are implemented as set-associative caches!

Block Allocation and Replacement
- Block allocation (when to place)
  - On a read miss, always allocate
  - On a write miss: write-allocate (allocate a cache block on the write miss) or no-write-allocate
- Replacement policy
  - LRU (least recently used): needs to keep timestamps; expensive due to the global compare
    - Pseudo-LRU: approximate LRU using bit tags
  - Random: just pick a block and replace it
    - Pseudo-random: use a simple hash of the address
  - The replacement policy is critical for small caches
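True LRU for one cache set can be sketched as follows (a software illustration only; as noted above, real hardware approximates LRU with bit tags because keeping full recency order is expensive):

```python
from collections import OrderedDict

# LRU replacement sketch for a single cache set (illustrative names).
class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> data, least recently used first

    def access(self, tag):
        """Return True on a hit; on a miss, allocate, evicting the LRU block."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)      # mark as most recently used
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)   # evict the least recently used
        self.blocks[tag] = None               # allocate the missing block
        return False

s = LRUSet(ways=2)
print([s.access(t) for t in ["A", "B", "A", "C", "B"]])
# A miss, B miss, A hit, C miss (evicts B, the LRU block), B miss
```

The final access to B misses because C's allocation evicted B rather than the more recently used A, which is exactly the recency ordering LRU maintains.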

Write Policy
- Write-through: write to both the cache and the next level of the memory hierarchy
  - Simple to design; keeps memory consistent
  - Generates more write traffic
  - Usually paired with a no-write-allocate policy
- Write-back: write only to the cache (not to the lower level); update memory when a dirty block is replaced
  - Less write traffic; writes proceed independently of main memory
  - More complex to design; memory is inconsistent until write-back
  - Usually paired with a write-allocate policy

Review: 4 Questions for Cache Design
- Q1: Where can a block be placed? (Block Placement)
  - Fully-associative, direct-mapped, set-associative
- Q2: How is a block found in the cache? (Block Identification)
  - Tag/Index/Offset
- Q3: Which block should be replaced on a miss? (Block Replacement)
  - Random, LRU
- Q4: What happens on a write? (Write Policy)
  - Write-through, write-back

3+1 Types of Cache Misses
- Cold-start misses (or compulsory misses)
  - The first access to a block is never in the cache
  - Occur even in an infinite cache
- Capacity misses
  - If the memory blocks needed by a program exceed the cache size, capacity misses occur due to cache block replacement
  - Occur even in a fully associative cache
- Conflict misses (or collision misses)
  - In a direct-mapped or set-associative cache, too many blocks can map to the same set
- Invalidation misses (or sharing misses)
  - Cache blocks can be invalidated by coherence traffic

Miss Rates (SPEC92)

Cache Performance
- Average access time = hit time + miss rate * miss penalty
- Improving cache performance: reduce hit time, reduce miss rate, reduce miss penalty
- For an L1-only organization:
  AMAT = Hit_Time + Miss_Rate * Miss_Penalty
- For an L1/L2 organization:
  AMAT = Hit_Time_L1 + Miss_Rate_L1 * (Hit_Time_L2 + Miss_Rate_L2 * Miss_Penalty_L2)
- Design issues
  - Size(L2) >> Size(L1)
  - Usually, Block_size(L2) > Block_size(L1)
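The AMAT formulas translate directly into code. The example numbers (1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 20% L2 local miss rate, 100-cycle memory penalty) are my own assumptions for illustration:

```python
# AMAT formulas for L1-only and L1/L2 cache organizations.
def amat_l1(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

def amat_l1_l2(hit_l1, mr_l1, hit_l2, mr_l2, penalty_l2):
    # L1 misses pay the L2 hit time, plus memory penalty on an L2 miss
    return hit_l1 + mr_l1 * (hit_l2 + mr_l2 * penalty_l2)

# Assumed numbers: 1-cycle L1 hit, 5% L1 misses, 10-cycle L2 hit,
# 20% L2 misses, 100-cycle memory access.
print(amat_l1(1, 0.05, 100))              # 6.0 cycles
print(amat_l1_l2(1, 0.05, 10, 0.2, 100))  # 1 + 0.05*(10 + 20) = 2.5 cycles
```

With these numbers the L2 cache cuts the average access time from 6.0 to 2.5 cycles, which illustrates why the L1 miss penalty is effectively the L2 access time rather than the full memory latency.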

Cache Performance vs. Block Size
[Figure: as block size grows, the miss rate first falls (spatial locality) and then rises, while the miss penalty (access time + transfer time) grows steadily; the average access time therefore has a "sweet spot" at an intermediate block size.]

Random Access Memory
- Static RAM (SRAM, at least 6 transistors per cell)
  - State is retained while power is supplied; uses latched storage
  - Speed: access time 8-16x faster than DRAM
  - Used for registers, buffers, and on-chip and off-chip caches
- Dynamic RAM (DRAM, usually 1 transistor per cell)
  - State is stored as charge on a capacitor and discharges over time
  - Requires a refresh of each cell every few milliseconds
  - Density: 16x SRAM at the same feature size
  - Multiplexed address lines (RAS, CAS)
  - Complex interface logic due to refresh and precharge
  - Used for main memory

SRAM Cell versus DRAM Cell
[Figure: SRAM cell and DRAM cell schematics]

DRAM Refresh
- Typical devices require each cell to be refreshed once every 4 to 64 ms
- During "suspended" operation, notebook computers use power mainly for DRAM refresh

RAM Structure
[Figure: the address is split into a row address (N-K bits, fed to the row decoder to select a row of the memory array) and a column address (K bits, fed to the column decoder + multiplexer to select among 2^K columns), producing M data bits.]

DRAM Chip Internal Organization
[Figure: a 64K x 1-bit DRAM]

RAS/CAS Operation
- Row Address Strobe / Column Address Strobe
- n address bits are provided in two steps using n/2 pins, referenced to the falling edges of RAS_L and CAS_L
- This was the traditional method of DRAM operation for 20 years
[Figure: DRAM read timing]

DRAM Packaging
- Typically, 8 or 16 memory chips are mounted on a small printed circuit board, for compatibility and easier upgrades
- SIMM (Single Inline Memory Module): connectors on one side
  - 32 pins for an 8-bit data bus
  - 72 pins for a 32-bit data bus
- DIMM (Dual Inline Memory Module): for a 64-bit (64, 72, or 80) data bus
  - 84 pins on each side, 168 pins in total
  - Example: sixteen 16M x 4-bit DRAM chips constitute a 128 MB DRAM module with a 64-bit data bus
- SO-DIMM (Small Outline DIMM) for notebooks: 72 pins for a 32-bit data bus, 144 pins for a 64-bit data bus
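The DIMM example above is simple capacity arithmetic, worked out here as a check:

```python
# Capacity and bus width of the slide's DIMM example: sixteen 16M x 4-bit
# DRAM chips on one module.
chips = 16
words_per_chip = 16 * 2**20   # 16M addressable locations per chip
bits_per_chip_word = 4        # each chip outputs 4 bits

total_bits = chips * words_per_chip * bits_per_chip_word
total_bytes = total_bits // 8
print(total_bytes // 2**20, "MB")                  # 128 MB module
print(chips * bits_per_chip_word, "bit data bus")  # 16 chips x 4 bits = 64
```

All sixteen chips are read in parallel, so the chip count times the per-chip width gives the module's 64-bit data bus.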

Memory Performance Parameters
- Access time: the time elapsed from asserting an address to when the data is available on the output
  - Row access time: from asserting RAS to when the row is available in the row buffer
  - Column access time: from asserting CAS to when valid data is present on the output pins
- Cycle time: the minimum time between two different requests to memory
- Latency: the time to access the first word of a block
- Bandwidth: the transmission rate (bytes per second)

Memory Organization
Assuming 1 cycle to send the address, 15 cycles for each DRAM access, and 1 cycle to return a word of data, the miss penalty for a 4-word block is:
- One-word-wide memory: 1 + 4 x (15 + 1) = 65 cycles
- Two-word-wide memory: 1 + 2 x (15 + 1) = 33 cycles
- Four-word-wide memory: 1 + (15 + 1) = 17 cycles
- Interleaved (4 one-word-wide banks): 1 + 15 + 4 x 1 = 20 cycles
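The four cycle counts above follow from the same three parameters, as this sketch shows:

```python
# Miss-penalty arithmetic from the slide: 1 cycle to send the address,
# 15 cycles per DRAM access, 1 cycle per word transferred; 4-word block.
ADDR, ACCESS, XFER, WORDS = 1, 15, 1, 4

one_word_wide  = ADDR + WORDS * (ACCESS + XFER)         # 4 full accesses
two_word_wide  = ADDR + (WORDS // 2) * (ACCESS + XFER)  # 2 double-width accesses
four_word_wide = ADDR + (ACCESS + XFER)                 # 1 block-wide access
interleaved    = ADDR + ACCESS + WORDS * XFER           # banks overlap accesses

print(one_word_wide, two_word_wide, four_word_wide, interleaved)  # 65 33 17 20
```

Interleaving gets most of the benefit of a wide memory (20 vs. 17 cycles) while keeping the narrow, cheap one-word bus, because the four banks perform their 15-cycle accesses in parallel and only the word transfers are serialized.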

Pentium III Example
[Figure: Pentium III system. The 800 MHz Pentium III processor contains the core pipeline, a 16 KB I-cache and 16 KB D-cache, and a 256 KB 8-way second-level cache on a 256-bit data path. The front side bus (FSB, 133 MHz, 64-bit data and 32-bit address) connects to the host-to-PCI bridge, which provides AGP for graphics and a multiplexed (RAS/CAS) memory bus to main memory. DIMMs of sixteen 16M x 4-bit 133 MHz SDRAM chips constitute a 128 MB DRAM module with a 64-bit data bus.]

Intel i7 System Architecture
- Integrated memory controller
  - 3 channels, 3.2 GHz clock, 25.6 GB/s memory bandwidth (memory up to 24 GB of DDR3 SDRAM), 36-bit physical address
- QuickPath Interconnect (QPI)
  - Point-to-point processor interconnect, replacing the front side bus (FSB)
  - 64 bits of data every two clock cycles, up to 25.6 GB/s, which doubles the theoretical bandwidth of a 1600 MHz FSB
- Direct Media Interface (DMI)
  - The link between the Intel Northbridge (IOH) and Southbridge (ICH), sharing many characteristics with PCI Express

Virtual Memory
- Virtual memory: the programmer's view of memory (the virtual address space)
- Physical memory (main memory): the machine's physical memory (the physical address space)
- Objectives
  - Large address spaces -> easy programming
    - Provide the illusion of an infinite amount of memory; program code/data can exceed the main memory size
    - Processes are only partially resident in memory
    - Improve software portability
  - Increased CPU utilization: more programs can run at the same time
  - Support protection of code and data
    - Privilege levels
    - Access rights: read/modify/execute permissions
  - Support sharing of code and data

Virtual Memory
- Requires the following functions
  - Memory allocation (placement)
  - Memory deallocation (replacement)
  - Memory mapping (translation)
- Memory management
  - Automatic movement of data between main memory and secondary storage
  - Done by the operating system with the help of processor HW (the exception handling mechanism)
  - Main memory contains only the most frequently used portions of a process's address space
  - Illusion of infinite memory (the size of secondary storage) with access time close to that of main memory
- Usually implemented by demand paging
  - Bring in a page on demand, when a page miss occurs
  - Exploits spatial locality

Paging
- Divide the address space into fixed-size pages
  - A VA consists of (VPN, offset); a PA consists of (PPN, offset)
- Map a virtual page to a physical page at runtime
- The page table contains the VA-to-PA mapping information; a page table entry (PTE) contains
  - The PPN for the VPN that indexes the entry
  - Presence bit: 1 if this page is in main memory
  - Reference bits: reference statistics used for page replacement
  - Dirty bit: 1 if this page has been modified
  - Access control: read/write/execute permissions
  - Privilege level: user-level page versus system-level page
  - Disk address
- Fixed-size pages can suffer internal fragmentation
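The (VPN, offset) -> (PPN, offset) translation can be sketched as follows; the 4 KB page size, the tiny page table, and the names are illustrative assumptions, and a real OS would handle the fault by paging the data in rather than raising an error:

```python
# Sketch of VA -> PA translation through a linear page table.
PAGE_SIZE = 4096     # assumed 4 KB pages
OFFSET_BITS = 12

# page table: VPN -> (presence bit, PPN); VPN 2 is not resident
page_table = {0: (True, 7), 1: (True, 3), 2: (False, None)}

def translate(va):
    vpn, offset = va >> OFFSET_BITS, va & (PAGE_SIZE - 1)
    present, ppn = page_table.get(vpn, (False, None))
    if not present:
        # the OS page-fault handler would bring the page in from disk
        raise RuntimeError("page fault at VPN %d" % vpn)
    return (ppn << OFFSET_BITS) | offset

print(hex(translate(0x1234)))   # VPN 1 -> PPN 3, offset kept: 0x3234
```

Note that the offset passes through untranslated; only the page number is remapped.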

Process
- Definition: a process is an instance of a program in execution
  - One of the most profound ideas in computer science
  - Not the same as a "program" or a "processor"
- A process provides each program with two key abstractions:
  - Logical control flow: each program seems to have exclusive use of the CPU
  - Private address space: each program seems to have exclusive use of main memory
- How are these illusions maintained?
  - Multitasking: process executions are interleaved
    - In reality, many other programs are running on the system; processes take turns using the processor
    - Each time period in which a process executes a portion of its flow is called a time slice
  - Virtual memory: a private address space for each process
    - The private space is also called the virtual address space: a linear array of bytes addressed by an n-bit virtual address (0, 1, 2, ..., 2^n - 1)

Paging
- Page table organization
  - Linear: one PTE per virtual page
  - Hierarchical: a tree-structured page table
    - The page table itself can be paged due to its size; for example, a 32-bit VA with 4 KB pages and 16-byte PTEs requires a 16 MB page table
    - Page directory tables: each PTE contains a descriptor (i.e., an index) for a page table page
    - Page tables (the leaf nodes only): each PTE contains a descriptor for a page
    - Page table entries are allocated dynamically as needed
- Different virtual memory faults
  - TLB miss: the PTE is not in the TLB
  - PTE miss: the PTE is not in main memory
  - Page miss: the page is not in main memory
  - Access violation; privilege violation

Multi-Level Page Tables
- Given: 4 KB (2^12) page size, 32-bit address space, 4-byte PTEs
- Problem: a linear table would need 2^20 x 4 bytes = 4 MB per process!
- Common solution: multi-level page tables, e.g., a 2-level table (P6)
  - Level 1 table: 1024 entries, each of which points to a Level 2 page table; this is called the page directory
  - Level 2 tables: 1024 entries each, each of which points to a page
[Figure: a Level 1 table pointing to Level 2 tables]
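For the 2-level scheme above, the 32-bit VA splits into a 10-bit page-directory index, a 10-bit page-table index, and a 12-bit offset (1024-entry tables, 4 KB pages). A sketch of that split:

```python
# Splitting a 32-bit VA for a 2-level page table: 10 + 10 + 12 bits.
def split_va(va):
    dir_index   = (va >> 22) & 0x3FF   # top 10 bits: page directory entry
    table_index = (va >> 12) & 0x3FF   # middle 10 bits: page table entry
    offset      = va & 0xFFF           # low 12 bits: byte within the page
    return dir_index, table_index, offset

print(split_va(0xDEADBEEF))
```

Because Level 2 tables are allocated only for regions actually in use, a small program touches just the directory plus a few 4 KB tables instead of the full 4 MB linear table.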

TLB
- TLB (Translation Lookaside Buffer): a cache of page table entries (PTEs)
  - On a TLB hit, virtual-to-physical translation is done without accessing the page table
  - On a TLB miss, the page table must be searched for the missing entry
- TLB configuration
  - ~100 entries, usually fully associative
  - Sometimes multi-level TLBs; TLB shootdown issue
  - Usually separate I-TLB and D-TLB, accessed every cycle
- Miss handling
  - On a TLB miss, an exception handler (with the help of the operating system) searches the page table for the missed entry and inserts it into the TLB
  - Software-managed TLBs provide TLB insert/delete instructions: flexible but slow (a TLB miss handler is ~100 instructions)
  - Sometimes handled by HW: a hardware page walker
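The TLB's role as a small cache in front of the page table can be sketched as follows (sizes, names, and the simple FIFO-style eviction are assumptions for illustration; real TLBs use HW replacement, and a real miss would trap to the handler described above):

```python
# TLB sketch: a tiny cache of VPN -> PPN translations in front of the
# page table, refilled on each miss.
page_table = {0: 7, 1: 3, 4: 9}   # VPN -> PPN, all pages resident
TLB_ENTRIES = 2                   # assumed tiny TLB

tlb = {}                          # cached VPN -> PPN entries
stats = {"hit": 0, "miss": 0}

def lookup(vpn):
    if vpn in tlb:
        stats["hit"] += 1
        return tlb[vpn]
    stats["miss"] += 1                 # would trap to the TLB miss handler
    if len(tlb) >= TLB_ENTRIES:
        tlb.pop(next(iter(tlb)))       # evict the oldest entry (not true LRU)
    tlb[vpn] = page_table[vpn]         # refill from the page table
    return tlb[vpn]

for vpn in [0, 0, 1, 0, 4, 4]:
    lookup(vpn)
print(stats)   # {'hit': 3, 'miss': 3}
```

Even this toy TLB turns half the lookups into hits, and real workloads hit far more often because they touch the same few pages repeatedly.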

TLB and Cache Implementation of DECStation 3100

Virtually-Indexed Physically-Tagged Cache
- A commonly used scheme to hide the translation latency
- Use the low-order bits (the page offset) of the VA to index the L1 cache
  - With an 8 KB page size, use the 13 low-order bits to access 8 KB direct-mapped, 16 KB 2-way, or 32 KB 4-way set-associative caches
- Access the TLB and L1 cache in parallel using the VA, and do the tag comparison after fetching the PPN from the TLB
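The cache sizes on this slide follow from one constraint: the index and block-offset bits must lie entirely within the page offset, so they are identical in the VA and PA. A quick check of that arithmetic (the constraint is standard; the helper name is mine):

```python
# Max virtually-indexed physically-tagged (VIPT) cache size: the index +
# block-offset bits must fit in the page offset, so size <= page_size * ways.
PAGE_SIZE = 8 * 1024   # 8 KB pages -> 13 untranslated low-order bits

def max_vipt_cache(ways):
    return PAGE_SIZE * ways

print(max_vipt_cache(1) // 1024, "KB direct-mapped")  # 8 KB
print(max_vipt_cache(2) // 1024, "KB 2-way")          # 16 KB
print(max_vipt_cache(4) // 1024, "KB 4-way")          # 32 KB
```

Growing the cache beyond this limit requires more associativity (not more sets), because extra index bits would come from the translated VPN and could differ between VA and PA.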

Exercises and Discussion
- Which of the 3 cache organizations is the fastest? Which is the slowest?
- Which of the 3 cache organizations is the largest? Which is the smallest?
- What will happen in terms of cache/TLB/page misses right after a context switch?

Homework 6
- Read Chapter 9 from the Computer Systems textbook
- Exercise