
1 Computer Architecture Memory Hierarchy Lynn Choi Korea University

2 Memory Hierarchy
Motivated by the principle of locality and the speed vs. size vs. cost tradeoff.
Locality principle
- Spatial locality: nearby references are likely. Example: arrays, program code. Access a block of contiguous words.
- Temporal locality: references to the same location are likely to recur soon. Example: loops, reuse of variables. Keep recently accessed data closer to the processor.
Speed vs. size tradeoff
- Bigger memory is slower: SRAM - DRAM - Disk - Tape
- Faster memory is more expensive.
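
As a small illustration of the two kinds of locality (a sketch added here, not from the slides), consider summing an array in C:

```c
#include <stddef.h>

/* Spatial locality: a[i] and a[i+1] usually fall in the same or an adjacent
 * cache block. Temporal locality: sum and i are reused on every iteration. */
long sum_array(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```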

3 Memory Wall
[Figure: CPU vs. DRAM performance over time, 1980-2000, log scale.]
CPU performance improves 2x every 18 months; DRAM performance improves about 7% per year.

4 Levels of Memory Hierarchy (faster/smaller at the top, slower/larger at the bottom)
Level         Typical capacity   Unit moved to next level   Moved by            Transfer size
Registers     100s of bytes      Instruction operands       Program/Compiler    1-16 B
Cache         KBs-MBs            Cache line                 Hardware            16-512 B
Main Memory   GBs                Page                       OS                  512 B - 64 MB
Disk          100s of GBs        File                       User                any size
Tape          "infinite"         -                          -                   -

5 Cache
A small but fast memory located between the processor and main memory.
Benefits
- Reduce load latency
- Reduce store latency
- Reduce bus traffic (on-chip caches)
Cache block allocation (when to place)
- On a read miss
- On a write miss: write-allocate vs. no-write-allocate
Cache block placement (where to place)
- Fully-associative cache
- Direct-mapped cache
- Set-associative cache

6 Fully Associative Cache
[Figure: a 32 KB cache (SRAM) with 2^11 cache blocks (cache lines), and a 32-bit physical address space of 4 GB (DRAM) with 2^28 memory blocks.]
32-bit words, 4-word (16 B) cache blocks.
A memory block can be placed into any cache block location!

7 Fully Associative Cache
[Figure: 32 KB data RAM (2^11 entries) and tag RAM; the tag field of the PA (bits 31-4) is compared against every tag entry in parallel, the offset (bits 3-0) drives word and byte select, and a match with the valid bit asserts Cache Hit and returns data to the CPU.]
Advantages: 1. High hit rate 2. Fast
Disadvantages: 1. Very expensive

8 Direct Mapped Cache
[Figure: a 32 KB cache (SRAM, 2^11 blocks) and a 32-bit physical address space of 4 GB (DRAM, 2^28 memory blocks).]
A memory block can be placed into only a single cache block!
Memory blocks 0, 2^11, 2*2^11, ..., (2^17 - 1)*2^11 all map to the same cache block.

9 Direct Mapped Cache
[Figure: 32 KB data RAM (2^11 entries) and tag RAM; the index field (PA[14:4]) drives a decoder to select one entry, the stored tag is compared against the PA's tag field (PA[31:15]), the offset (PA[3:0]) drives word and byte select, and a match with the valid bit asserts Cache Hit and returns data to the CPU.]
Advantages: 1. Simple HW 2. Fast implementation
Disadvantages: 1. Low hit rate
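
A minimal sketch of the address split implied by this slide (assuming the 32 KB cache with 4-word, 16 B blocks from slides 6-9, i.e. 2^11 blocks):

```c
#include <stdint.h>

/* Tag/index/offset split for a 32 KB direct-mapped cache with 16 B blocks:
 * 4 offset bits, 11 index bits, 17 tag bits. */
#define OFFSET_BITS 4
#define INDEX_BITS  11

static inline uint32_t blk_offset(uint32_t pa) { return pa & ((1u << OFFSET_BITS) - 1); }
static inline uint32_t blk_index(uint32_t pa)  { return (pa >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
static inline uint32_t blk_tag(uint32_t pa)    { return pa >> (OFFSET_BITS + INDEX_BITS); }
```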

10 Set Associative Cache
[Figure: a 32 KB cache organized as 2 ways (Way 0, Way 1) of 2^10 sets, against a 4 GB physical address space with 2^28 memory blocks.]
In an M-way set-associative cache, a memory block can be placed into any of the M cache blocks of its set!
Memory blocks 0, 2^10, 2*2^10, ..., (2^18 - 1)*2^10 all map to the same set.

11 Set Associative Cache
[Figure: 32 KB data RAM and tag RAM organized as 2^10 sets; the index field (PA[13:4]) selects a set, the tags of both ways are compared against the PA's tag field, a way multiplexer (Wmux) selects the hitting way, and the offset (PA[3:0]) drives word and byte select before data goes to the CPU.]
Most caches are implemented as set-associative caches!

12 Cache Block Replacement
Random
- Just pick one and replace it
- Pseudo-random: use a simple hash algorithm on the address
LRU (least recently used)
- Needs to keep a timestamp per block
- Expensive due to the global compare
- Pseudo-LRU: use LFU with bit tags
The replacement policy is critical for small caches. (A sketch of true LRU follows below.)
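
A minimal sketch of timestamp-based LRU for one set (the per-set structure and 4-way configuration are assumptions for illustration):

```c
#include <stdint.h>

#define WAYS 4

/* Hypothetical per-set state for a WAYS-way set-associative cache. */
struct cache_set {
    uint32_t tag[WAYS];
    uint64_t last_used[WAYS];   /* timestamp of the last access */
};

/* True LRU: the victim is the way with the oldest timestamp -
 * the "global compare" the slide calls expensive in hardware. */
static int lru_victim(const struct cache_set *s)
{
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (s->last_used[w] < s->last_used[victim])
            victim = w;
    return victim;
}
```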

13 Write Policy
Write-through
- Write to the cache and to the next level of the memory hierarchy
- Simple to design, memory stays consistent
- Generates more write traffic
Write-back
- Write only to the cache (not to the lower level)
- Update memory when a dirty block is replaced
- Less write traffic, writes are independent of main memory
- More complex to design, memory is inconsistent
Write-allocate policy
- Write-allocate: allocate a cache block on a write miss (for write-back caches)
- No-write-allocate (for write-through caches)
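
A toy sketch contrasting the two policies on a store that hits; the cache line structure and the 64 KB backing array are hypothetical, and addresses are assumed to fit in it:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct line { bool dirty; uint8_t data[16]; };   /* one 16 B cache block */

static uint8_t memory[1 << 16];                  /* toy backing store */

/* Write-through: every store updates both the cache and memory. */
static void store_write_through(struct line *l, uint32_t pa, uint8_t v)
{
    l->data[pa & 0xF] = v;
    memory[pa] = v;
}

/* Write-back: only the cache is updated and the line is marked dirty;
 * memory is updated later, when the dirty line is evicted. */
static void store_write_back(struct line *l, uint32_t pa, uint8_t v)
{
    l->data[pa & 0xF] = v;
    l->dirty = true;
}

static void evict_write_back(struct line *l, uint32_t block_pa)
{
    if (l->dirty)
        memcpy(&memory[block_pa & ~0xFu], l->data, 16);
    l->dirty = false;
}
```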

14 Review: 4 Questions for Cache Design
Q1: Where can a block be placed? (Block Placement) - Fully-associative, direct-mapped, set-associative
Q2: How is a block found in the cache? (Block Identification) - Tag/Index/Offset
Q3: Which block should be replaced on a miss? (Block Replacement) - Random, LRU
Q4: What happens on a write? (Write Policy) - Write-through vs. write-back, write-allocate vs. no-write-allocate

15 Cache Performance
Execution time can be divided into two factors:
  Execution time = (busy cycles + stall cycles) * T_cycle
If all stalls are due to cache misses, then
  Stall cycles = memory stall cycles
               = (reads/program * read miss rate * read miss penalty)
               + (writes/program * write miss rate * write miss penalty)
               = memory accesses/program * miss rate * miss penalty
Factoring in the instruction count NI:
  Execution time = NI * (CPI_execution + memory accesses/instruction * miss rate * miss penalty) * T_cycle
Average access time = hit time + miss rate * miss penalty
Improving cache performance
- Reduce miss rate
- Reduce miss penalty
- Reduce hit time
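
The formulas above in executable form; the numbers plugged in are illustrative assumptions, not figures from the lecture:

```c
#include <stdio.h>

static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

static double exec_time(double ni, double cpi_exec, double accesses_per_inst,
                        double miss_rate, double miss_penalty, double t_cycle)
{
    return ni * (cpi_exec + accesses_per_inst * miss_rate * miss_penalty) * t_cycle;
}

int main(void)
{
    /* e.g. 1-cycle hit, 2% miss rate, 100-cycle miss penalty -> AMAT = 3 cycles */
    printf("AMAT = %.1f cycles\n", amat(1.0, 0.02, 100.0));
    /* 10^6 instructions, CPI_exec = 1.2, 1.3 accesses/inst, 1 ns cycle -> 3.8 ms */
    printf("T    = %.3f ms\n",
           exec_time(1e6, 1.2, 1.3, 0.02, 100.0, 1e-9) * 1e3);
    return 0;
}
```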

16 3+1 Types of Cache Misses
Cold-start misses (or compulsory misses): the first access to a block always misses. These occur even in an infinite cache.
Capacity misses: if the set of memory blocks needed by a program is bigger than the cache, capacity misses occur due to cache block replacement. These occur even in a fully associative cache.
Conflict misses (or collision misses): in a direct-mapped or set-associative cache, too many blocks can map to the same set.
Invalidation misses (or sharing misses): cache blocks can be invalidated by coherence traffic.

17 Miss Rates (SPEC92)

18 Cache Performance vs. Block Size
[Figure: as block size grows, miss rate first falls and then rises, while miss penalty (access time + transfer time) keeps growing; the average access time curve has a "sweet spot" at an intermediate block size.]

19 Reduce Miss Penalty - Multi-level Cache
For an L1-only organization:
  AMAT = Hit_Time + Miss_Rate * Miss_Penalty
For an L1/L2 organization:
  AMAT = Hit_Time_L1 + Miss_Rate_L1 * (Hit_Time_L2 + Miss_Rate_L2 * Miss_Penalty_L2)
Advantages
- For capacity misses and conflict misses in L1, a significant penalty reduction
Disadvantages
- For L1/L2 misses, the miss penalty increases slightly
- L2 does not help compulsory misses
Design issues
- Size(L2) >> Size(L1)
- Usually, Block_size(L2) > Block_size(L1)
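
A worked two-level AMAT example with assumed numbers (not from the slide):

```c
#include <stdio.h>

int main(void)
{
    /* Assumed parameters: L1 hits in 1 cycle with a 5% miss rate;
     * L2 hits in 10 cycles with a 20% local miss rate;
     * a miss to main memory costs 100 cycles. */
    double l1_hit = 1.0,  l1_miss = 0.05;
    double l2_hit = 10.0, l2_miss = 0.20;
    double mem_penalty = 100.0;

    double amat = l1_hit + l1_miss * (l2_hit + l2_miss * mem_penalty);
    printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.05 * (10 + 20) = 2.5 */
    return 0;
}
```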

20 Random Access Memory
Static vs. dynamic memory
Static RAM (at least 6 transistors per cell)
- State is retained while power is supplied
- Uses latched storage
- Speed: access time 8-16x faster than DRAM, 2-10 ns
- Used for registers, buffers, on-chip and off-chip caches
Dynamic RAM (usually 1 transistor per cell)
- State discharges as time goes by
- Uses dynamic storage of charge on a capacitor
- Requires refresh of each cell every few milliseconds
- Density: 16x SRAM at the same feature size
- Multiplexed address lines - RAS, CAS
- Complex interface logic due to refresh and precharge
- Used for main memory

21 SRAM Cell versus DRAM Cell
[Figure: circuit diagrams of an SRAM cell and a DRAM cell.]

22 RAM Structure
[Figure: a memory array addressed by a row decoder (driven by the row address) and a column decoder + multiplexer (driven by the column address), which together select the data output.]

23 Memory Performance Parameters
Access time
- The time elapsed from asserting an address to when the data is available on the output
- Row access time: the time elapsed from asserting RAS to when the row is available in the row buffer
- Column access time: the time elapsed from asserting CAS to when valid data is present on the output pins
Cycle time
- The minimum time between two different requests to memory
Latency
- The time to access the first word of a block
Bandwidth
- Transmission rate (bytes per second)

24 Memory Organization
Assume 1 cycle to send the address, 15 cycles for each DRAM access, and 1 cycle to return a word of data. For a 4-word block:
- One-word-wide memory: 1 + 4 * (15 + 1) = 65 cycles
- Two-word-wide memory: 1 + 2 * (15 + 1) = 33 cycles
- Four-word-wide memory: 1 + 1 * (15 + 1) = 17 cycles
- Four-way interleaved memory: 1 + 15 + 4 * 1 = 20 cycles
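
The same miss-penalty arithmetic as a small sketch (the variable names are illustrative; the cycle counts come from the slide's assumptions):

```c
#include <stdio.h>

int main(void)
{
    int addr = 1, dram = 15, xfer = 1, words = 4;   /* cycles and block size */

    int one_word_wide  = addr + words * (dram + xfer);         /* 65 */
    int two_word_wide  = addr + (words / 2) * (dram + xfer);   /* 33 */
    int four_word_wide = addr + (words / 4) * (dram + xfer);   /* 17 */
    int interleaved    = addr + dram + words * xfer;           /* 20 */

    printf("%d %d %d %d cycles\n", one_word_wide, two_word_wide,
           four_word_wide, interleaved);
    return 0;
}
```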

25 Pentium III Example
[Figure: an 800 MHz Pentium III processor (16 KB I-cache, 16 KB D-cache, 256 KB 8-way second-level cache with a 256-bit data path) connected over a 133 MHz FSB (64-bit data, 32-bit address) to a host-to-PCI bridge, which also drives AGP graphics and a multiplexed (RAS/CAS) memory bus; DIMMs of sixteen 16M x 4b 133 MHz SDRAM chips constitute a 128 MB DRAM module with a 64-bit data bus.]

26 Virtual Memory
Virtual memory
- The programmer's view of memory (virtual address space)
- A linear array of bytes addressed by the virtual address
Physical memory
- The machine's physical memory (physical address space)
- Also called main memory
Objectives
- Large address spaces -> easy programming: provide the illusion of an infinite amount of memory; program code/data can exceed the main memory size; processes are only partially resident in memory
- Protection of code and data: privilege levels; access rights (read/modify/execute permission)
- Sharing of code and data
- Software portability
- Increased CPU utilization: more programs can run at the same time

27 Virtual Memory
Requires the following functions:
- Memory allocation (placement)
- Memory deallocation (replacement)
- Memory mapping (translation)
Memory management
- Automatic movement of data between main memory and secondary storage
- Done by the operating system with the help of processor HW (the exception handling mechanism)
- Main memory contains only the most frequently used portions of a process's address space
- Illusion of infinite memory (the size of secondary storage) but access time equal to main memory
- Usually implemented by demand paging

28 Paging
Divide the address space into fixed-size pages (page frames in physical memory).
- A VA consists of (VPN, offset); a PA consists of (PPN, offset)
- A virtual page is mapped to a physical page at runtime
- Demand paging: bring in a page on a page miss
A page table entry (PTE) contains
- The VPN -> PPN mapping
- Presence bit: 1 if this page is in main memory
- Reference bits: reference statistics used for page replacement
- Dirty bit: 1 if this page has been modified
- Access control: read/write/execute permissions
- Privilege level: user-level page versus system-level page
- Disk address
Internal fragmentation: the last page of a region is typically only partially used.
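
A minimal sketch of the (VPN, offset) split and translation; the 4 KB page size and 32-bit VA are assumptions for illustration, since the slide does not fix them:

```c
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 KB pages: 12 offset bits, 20 VPN bits */

static inline uint32_t vpn(uint32_t va)      { return va >> PAGE_SHIFT; }
static inline uint32_t page_off(uint32_t va) { return va & ((1u << PAGE_SHIFT) - 1); }

/* Translation: PA = (PPN << PAGE_SHIFT) | offset, where the PPN comes from
 * the PTE (or the TLB) for this VPN. */
static inline uint32_t make_pa(uint32_t ppn, uint32_t off)
{
    return (ppn << PAGE_SHIFT) | off;
}
```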

29 Process
Definition: a process is an instance of a program in execution.
- One of the most profound ideas in computer science
- Not the same as "program" or "processor"
A process provides each program with two key abstractions:
- Logical control flow: each program seems to have exclusive use of the CPU
- Private address space: each program seems to have exclusive use of main memory
How are these illusions maintained?
- Multitasking: process executions are interleaved. In reality, many other programs are running on the system; processes take turns using the processor. Each period in which a process executes a portion of its flow is called a time slice.
- Virtual memory: a private space for each process. The private space is also called the virtual address space: a linear array of bytes addressed by an n-bit virtual address (0, 1, 2, 3, ..., 2^n - 1).

30 Paging
Page table organization
- Linear: one PTE per virtual page
- Hierarchical: tree-structured page table
  - The page table itself can be paged due to its size; for example, a 32b VA, 4KB pages, and 16B PTEs require a 16MB page table
  - Page directory tables: each PTE contains a descriptor (i.e., index) for a page-table page
  - Page tables (the leaf nodes only): each PTE contains a descriptor for a page
  - Page table entries are dynamically allocated as needed
Different virtual memory faults
- TLB miss: PTE not in the TLB
- PTE miss: PTE not in main memory
- Page miss: page not in main memory
- Access violation
- Privilege violation

31 Multi-Level Page Tables
Given: 4KB (2^12) page size, a 32-bit address space, and 4-byte PTEs.
Problem: a linear page table would need 4 MB (2^20 entries * 4 bytes)!
Common solution: multi-level page tables, e.g., a 2-level table (P6)
- Level 1 table: 1024 entries, each of which points to a Level 2 page table. This is called the page directory.
- Level 2 tables: 1024 entries each, each of which points to a page.
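
A minimal sketch of the two-level split described above (10 bits per level for the 1024-entry tables, 12 offset bits for 4 KB pages):

```c
#include <stdint.h>

/* Bit fields of a 32-bit VA under the 2-level scheme above. */
static inline uint32_t pde_index(uint32_t va) { return (va >> 22) & 0x3FF; } /* page directory */
static inline uint32_t pte_index(uint32_t va) { return (va >> 12) & 0x3FF; } /* page table     */
static inline uint32_t pg_offset(uint32_t va) { return va & 0xFFF; }         /* within page    */
```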

32 TLB
TLB (Translation Lookaside Buffer)
- A cache of page table entries (PTEs)
- On a TLB hit, virtual-to-physical translation is done without accessing the page table
- On a TLB miss, the page table must be searched for the missing entry
TLB configuration
- ~100 entries, usually a fully associative cache
- Sometimes multi-level TLBs; TLB shootdown issue
- Usually separate I-TLB and D-TLB, accessed every cycle
Miss handling
- On a TLB miss, an exception handler (with the help of the operating system) searches the page table for the missed entry and inserts it into the TLB
- Software-managed TLBs: TLB insert/delete instructions; flexible but slow (a TLB miss handler is ~100 instructions)
- Sometimes handled by HW: a hardware page walker
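
A sketch of a fully associative TLB lookup; the entry layout and the 64-entry size are assumptions (the slide only says "~100 entries"):

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64

struct tlb_entry { bool valid; uint32_t vpn; uint32_t ppn; };

/* Fully associative lookup: compare the VPN against every entry.
 * On a miss, the page table must be walked (by SW handler or HW walker). */
static bool tlb_lookup(const struct tlb_entry tlb[TLB_ENTRIES],
                       uint32_t vpn_in, uint32_t *ppn_out)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn_in) {
            *ppn_out = tlb[i].ppn;
            return true;    /* TLB hit: no page-table access needed */
        }
    }
    return false;           /* TLB miss */
}
```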

33 TLB and Cache Implementation of DECStation 3100

34 Virtually-Indexed, Physically-Tagged Cache
A commonly used scheme to bypass translation on the cache index path:
- Use the lower bits (page offset) of the VA to index the L1 cache
- With an 8K page size, the 13 low-order bits can index 8KB, 16KB 2-way, and 32KB 4-way set-associative caches
- Access the TLB and L1 in parallel using the VA, and do the tag comparison after fetching the PPN from the TLB
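A sketch of the constraint behind this scheme: index bits + block offset bits must fit within the page offset so indexing can start before translation. The 16 B block size below is an assumption carried over from the earlier cache slides; this slide does not state it:

```c
#include <assert.h>

#define PAGE_OFFSET_BITS 13   /* 8 KB pages */

/* Number of cache index + block offset bits for a given organization. */
static int index_plus_offset_bits(unsigned cache_bytes, unsigned ways, unsigned block_bytes)
{
    unsigned sets = cache_bytes / (ways * block_bytes);
    int bits = 0, off = 0;
    while ((1u << bits) < sets) bits++;          /* log2(sets)  */
    while ((1u << off) < block_bytes) off++;     /* log2(block) */
    return bits + off;
}

int main(void)
{
    /* All three organizations from the slide fit in the 13-bit page offset. */
    assert(index_plus_offset_bits( 8 * 1024, 1, 16) <= PAGE_OFFSET_BITS);
    assert(index_plus_offset_bits(16 * 1024, 2, 16) <= PAGE_OFFSET_BITS);
    assert(index_plus_offset_bits(32 * 1024, 4, 16) <= PAGE_OFFSET_BITS);
    return 0;
}
```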

35 Exercises and Discussion
- Which one is the fastest among the 3 cache organizations?
- Which one is the slowest among the 3 cache organizations?
- Which one is the largest among the 3 cache organizations?
- Which one is the smallest among the 3 cache organizations?
- What will happen in terms of cache/TLB/page misses right after a context switch?

36 Homework 6
Exercises 5.2, 5.3, 5.4, 5.7, and 5.10.

