Cache Memory Organization
Today's Agenda
- Memory Hierarchy
- Locality of Reference
- Cache Memory Organization
- Virtual Memory
Bus Structure for a Computer System
[Figure: the CPU chip (register file, ALU, bus interface) connects to main memory and to the I/O bus, which hosts the USB controller (mouse, keyboard), the graphics adapter (monitor), and the disk controller (disk).]
The CPU-Memory Gap
- The gap between DRAM, disk, and CPU speeds keeps widening.
[Figure legend: SSD, DRAM, CPU]
Locality to the Rescue! The key to bridging this CPU-Memory gap is a fundamental property of computer programs known as locality
Locality of Reference
- Principle of locality: programs tend to use data and instructions with addresses near or equal to those they have used recently.
- Temporal locality: recently referenced items are likely to be referenced again in the near future.
- Spatial locality: items with nearby addresses tend to be referenced close together in time.
Locality Example
    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
- Data references
  - Array elements are referenced in succession (stride-1 reference pattern): spatial locality.
  - Variable sum is referenced each iteration: temporal locality.
- Instruction references
  - Instructions are referenced in sequence: spatial locality.
  - The loop is cycled through repeatedly: temporal locality.
Memory Hierarchies
Some fundamental and enduring properties of hardware and software:
- Fast storage technologies cost more per byte, have less capacity, and require more power (heat!).
- The gap between CPU and main memory speed is widening.
- Well-written programs tend to exhibit good locality.
Together these properties suggest an approach for organizing memory and storage systems known as a memory hierarchy.
An Example Memory Hierarchy
(Smaller, faster, and costlier per byte toward the top; larger, slower, and cheaper per byte toward the bottom.)
- L0: Registers. CPU registers hold words retrieved from the L1 cache.
- L1: L1 cache (SRAM). Holds cache lines retrieved from the L2 cache.
- L2: L2 cache (SRAM). Holds cache lines retrieved from main memory.
- L3: Main memory (DRAM). Holds disk blocks retrieved from local disks.
- L4: Local secondary storage (local disks). Holds files retrieved from disks on remote network servers.
- L5: Remote secondary storage (tapes, distributed file systems, Web servers).
Caches
- Cache: a smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
- Fundamental idea of a memory hierarchy: for each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
- Why do memory hierarchies work? Because of locality, programs tend to access the data at level k more often than they access the data at level k+1. Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
- Big idea: the memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
General Cache Concepts
- Cache: smaller, faster, more expensive memory that caches a subset of the blocks.
- Memory: larger, slower, cheaper memory viewed as partitioned into "blocks".
- Data is copied between the two in block-sized transfer units.
[Figure: memory with blocks numbered up to 15; the cache holds blocks 8, 9, 14, and 3; blocks 4 and 10 are shown being copied into the cache.]
General Cache Concepts: Hit
- Request: 14. Data in block b is needed.
- Block b is in the cache: hit!
[Figure: the cache holds blocks 8, 9, 14, and 3, so the request for block 14 is served from the cache.]
General Cache Concepts: Miss
- Request: 12. Data in block b is needed.
- Block b is not in the cache: miss!
- Block b is fetched from memory.
- Block b is stored in the cache.
  - Placement policy: determines where b goes.
  - Replacement policy: determines which block gets evicted (the victim).
[Figure: block 12 is fetched from memory and placed in the cache alongside blocks 8, 9, 14, and 3.]
General Caching Concepts: Types of Cache Misses
- Cold (compulsory) miss
  - Cold misses occur because the cache is empty.
- Conflict miss
  - Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k. E.g. block i at level k+1 must be placed in block (i mod 4) at level k.
  - Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block. E.g. referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
- Capacity miss
  - Occurs when the set of active cache blocks (the working set) is larger than the cache.
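The conflict-miss example can be checked with a small simulation. The sketch below is an illustration only, assuming a four-line cache with the (i mod 4) placement rule from the slide; the name count_misses is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_LINES 4   /* block positions at level k, as in the (i mod 4) example */

/* Counts the misses for a sequence of block references when block i of
   level k+1 may only be placed in line (i mod NUM_LINES) of level k. */
int count_misses(const int *blocks, size_t n)
{
    int  line[NUM_LINES];                 /* block number held by each line */
    bool valid[NUM_LINES] = { false };    /* all lines start empty */
    int  misses = 0;

    for (size_t i = 0; i < n; i++) {
        int slot = blocks[i] % NUM_LINES;
        if (!valid[slot] || line[slot] != blocks[i]) {
            misses++;                     /* cold miss or conflict miss */
            line[slot]  = blocks[i];      /* the only legal placement */
            valid[slot] = true;
        }
    }
    return misses;
}
```

Blocks 0 and 8 both map to line 0, so the sequence 0, 8, 0, 8, 0, 8 misses on all six references even though three lines stay empty. The sequence 0, 1, 0, 1, 0, 1 takes only the two cold misses.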
Examples of Caching in the Hierarchy
Cache type            What is cached?       Where is it cached?   Latency (cycles)  Managed by
Registers             4-8 byte words        CPU core              0                 Compiler
TLB                   Address translations  On-chip TLB           0                 Hardware
L1 cache              64-byte blocks        On-chip L1            1                 Hardware
L2 cache              64-byte blocks        On/off-chip L2        10                Hardware
Virtual memory        4-KB pages            Main memory           100               Hardware + OS
Buffer cache          Parts of files        Main memory           100               OS
Disk cache            Disk sectors          Disk controller       100,000           Disk firmware
Network buffer cache  Parts of files        Local disk            10,000,000        AFS/NFS client
Browser cache         Web pages             Local disk            10,000,000        Web browser
Web cache             Web pages             Remote server disks   1,000,000,000     Web proxy server
Put it together The speed gap between CPU, memory and mass storage continues to widen. Well-written programs exhibit a property called locality. Memory hierarchies based on caching close the gap by exploiting locality.
Cache/Main Memory Structure
- The CPU requests data from a memory location
- The cache is checked for this data
- If present, the data is taken from the cache (fast)
- If not present, the required block is read from main memory into the cache, then delivered from the cache to the CPU
- The cache includes tags to identify which block of main memory is in each cache slot
Cache Design Issues
- Size
- Mapping Function
- Replacement Algorithm
- Write Policy
- Block Size
- Number of Caches
Size does matter
- Cost: more cache is expensive
- Speed: more cache is faster (up to a point), but checking the cache for data takes time
Mapping Function
Running example:
- Cache of 64 KBytes
- Cache block of 4 bytes, i.e. the cache is 16K (2^14) lines of 4 bytes each
- 16 MBytes of main memory
- 24-bit address (2^24 = 16M)
Direct Mapping
- Each block of main memory maps to only one cache line, i.e. if a block is in the cache, it must be in one specific place
- The address is in two parts: the least significant w bits identify a unique word; the most significant s bits specify one memory block
- The MSBs are split into a cache line field of r bits and a tag of s-r bits (most significant)
Direct Mapping: Address Structure
    | Tag (s-r): 8 bits | Line or slot (r): 14 bits | Word (w): 2 bits |
- 24-bit address
- 2-bit word identifier (4-byte block)
- 22-bit block identifier: 8-bit tag (= 22 - 14) and 14-bit slot or line
- No two blocks that map to the same line have the same tag field
- Check the contents of the cache by finding the line and checking the tag
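As a rough sketch of how the three fields come out of a 24-bit address with shifts and masks, using the 8/14/2 split from the slide (the type dm_fields and the function dm_split are hypothetical names for illustration):

```c
#include <stdint.h>

#define WORD_BITS 2    /* w: selects a byte within the 4-byte block */
#define LINE_BITS 14   /* r: selects one of the 2^14 cache lines */
#define TAG_BITS  8    /* s - r: remaining most significant bits */

typedef struct {
    uint32_t tag;
    uint32_t line;
    uint32_t word;
} dm_fields;

/* Splits a 24-bit address into tag, line, and word fields. */
dm_fields dm_split(uint32_t addr)
{
    dm_fields f;
    f.word = addr & ((1u << WORD_BITS) - 1);
    f.line = (addr >> WORD_BITS) & ((1u << LINE_BITS) - 1);
    f.tag  = (addr >> (WORD_BITS + LINE_BITS)) & ((1u << TAG_BITS) - 1);
    return f;
}
```

For example, address 0x16339C splits into tag 0x16, line 0x0CE7, and word 0. A lookup would index line 0x0CE7 and compare its stored tag against 0x16.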
Direct Mapping Cache Organization
Cache line    Main memory blocks held
0             0, m, 2m, 3m, ..., 2^s - m
1             1, m+1, 2m+1, ..., 2^s - m + 1
...           ...
m-1           m-1, 2m-1, 3m-1, ..., 2^s - 1
Direct Mapped Cache
- Simple and inexpensive
- Fixed location for a given block: if a program repeatedly accesses 2 blocks that map to the same line, the miss rate is very high
- Address length = (s + w) bits
- Number of addressable units = 2^(s+w) words or bytes
- Block size = line size = 2^w words or bytes
- Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
- Number of lines in cache = m = 2^r
- Size of tag = (s - r) bits
Fully Associative Mapping
- A main memory block can load into any line of the cache
- The memory address is interpreted as a tag and a word
- The tag uniquely identifies a block of memory
- Every line's tag is examined for a match
- Cache searching gets expensive
Fully Associative Mapping
Fully Associative Mapping: Address Structure
    | Tag: 22 bits | Word: 2 bits |
- A 22-bit tag is stored with each 32-bit block of data
- Compare the tag field with each tag entry in the cache to check for a hit
- The least significant 2 bits of the address identify which of the four 8-bit bytes is required from the 32-bit data block
Set Associative Mapping
- The cache is divided into a number of sets; each set contains a number of lines
- A given block maps to any line in one given set, e.g. block B can be in any line of set i
- With 2 lines per set (2-way set associative mapping), a given block can be in one of 2 lines in only one set
Set Associative Mapping
Two-Way Set Associative Mapping: Address Structure
    | Tag: 9 bits | Set: 13 bits | Word: 2 bits |
- Use the set field to determine which cache set to look in
- Compare the tag field to see if we have a hit
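The two-step lookup (pick the set, then compare tags within it) might look like the following sketch. It assumes the 9/13/2 split from the slide and leaves out the data and the replacement decision; all names (cache_line, sa_lookup, sa_fill) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define WORD_BITS 2
#define SET_BITS  13
#define NUM_SETS  (1u << SET_BITS)
#define WAYS      2

typedef struct {
    bool     valid;
    uint32_t tag;
} cache_line;

/* WAYS lines per set; static storage starts zeroed, so every line is invalid. */
static cache_line cache[NUM_SETS][WAYS];

uint32_t set_index(uint32_t addr) { return (addr >> WORD_BITS) & (NUM_SETS - 1); }
uint32_t tag_of(uint32_t addr)    { return (addr >> (WORD_BITS + SET_BITS)) & 0x1FFu; }

/* Returns true on a hit. Only the lines of the selected set are compared. */
bool sa_lookup(uint32_t addr)
{
    const cache_line *set = cache[set_index(addr)];
    for (int way = 0; way < WAYS; way++)
        if (set[way].valid && set[way].tag == tag_of(addr))
            return true;
    return false;
}

/* Records the block in the given way of its set; choosing the way is the
   job of the replacement algorithm and is left to the caller here. */
void sa_fill(uint32_t addr, int way)
{
    cache_line *l = &cache[set_index(addr)][way];
    l->valid = true;
    l->tag   = tag_of(addr);
}
```

Addresses 0x16339C and 0x96339C share set 0x0CE7 but have different tags (0x02C and 0x12C), so a 2-way cache can hold both at once, whereas a direct-mapped cache would have them evict each other.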
Set Associative Mapping Summary
- Address length = (s + w) bits
- Number of addressable units = 2^(s+w) words or bytes
- Block size = line size = 2^w words or bytes
- Number of blocks in main memory = 2^s
- Number of lines in set = k
- Number of sets = v = 2^d
- Number of lines in cache = kv = k * 2^d
- Size of tag = (s - d) bits
Replacement Algorithms
- Direct mapping leaves no choice, because each block maps to only one line; replacement algorithms apply only to the associative and set associative mapping functions
- Implemented in hardware (for speed)
- Least recently used (LRU): e.g. in a 2-way set associative cache, which of the 2 blocks is LRU?
- First in first out (FIFO): replace the block that has been in the cache longest
- Least frequently used (LFU): replace the block that has had the fewest hits
- Random
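For the 2-way case, answering "which of the 2 blocks is LRU?" takes only one bit of state per set. The sketch below is one possible way to model that, not a description of any particular hardware; set2 and set2_access are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid[2];
    uint32_t tag[2];
    int      lru;        /* index of the least recently used way (0 or 1) */
} set2;

/* Accesses the block with the given tag in one 2-way set.
   Returns true on a hit; on a miss the LRU way becomes the victim. */
bool set2_access(set2 *s, uint32_t tag)
{
    for (int way = 0; way < 2; way++) {
        if (s->valid[way] && s->tag[way] == tag) {
            s->lru = 1 - way;          /* the other way is now the LRU one */
            return true;
        }
    }
    int victim = s->lru;               /* replace the least recently used line */
    s->valid[victim] = true;
    s->tag[victim]   = tag;
    s->lru = 1 - victim;
    return false;
}
```

Starting from an empty set, the tag sequence 1, 2, 1, 3 ends with tags 1 and 3 resident: the hit on tag 1 makes tag 2 the LRU block, so tag 3 replaces it.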
Write Policy
Why is a write policy needed?
- A cache block must not be overwritten unless main memory is up to date
- Multiple CPUs may have individual caches
- I/O may address main memory directly
Write through
- All writes go to main memory as well as to the cache
- Multiple CPUs can monitor main memory traffic to keep their local cache up to date
- Generates lots of traffic and slows down writes
Write back
- Updates are initially made in the cache only
- The update bit for the cache slot is set when an update occurs
- If a block is to be replaced, it is written to main memory only if its update bit is set
- Other caches can get out of sync
- I/O must access main memory through the cache
What about writes?
- Multiple copies of data exist: L1, L2, main memory, disk
- What to do on a write-hit?
  - Write-through (write immediately to memory)
  - Write-back (defer the write to memory until the line is replaced); needs a dirty bit (is the line different from memory or not?)
- What to do on a write-miss?
  - Write-allocate (load into the cache, update the line in the cache); good if more writes to the location follow
  - No-write-allocate (write immediately to memory)
- Typical combinations
  - Write-through + no-write-allocate
  - Write-back + write-allocate
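The write-back + write-allocate combination can be sketched on a toy cache with a single line. Everything here is an assumption made for illustration: one byte per block, a 16-block memory, and the names wb_line, wb_write, and mem_writes.

```c
#include <stdbool.h>
#include <stdint.h>

#define MEM_BLOCKS 16                 /* toy memory: one byte per block */

typedef struct {
    bool     valid;
    bool     dirty;                   /* set when the line differs from memory */
    uint32_t block;                   /* which memory block the line holds */
    uint8_t  data;
} wb_line;

static uint8_t memory[MEM_BLOCKS];
static int     mem_writes;            /* counts writes that actually reach memory */

/* Write-back + write-allocate; block must be below MEM_BLOCKS. */
void wb_write(wb_line *l, uint32_t block, uint8_t value)
{
    if (!l->valid || l->block != block) {      /* write miss */
        if (l->valid && l->dirty) {            /* flush the victim only if dirty */
            memory[l->block] = l->data;
            mem_writes++;
        }
        l->valid = true;                       /* write-allocate: load the block */
        l->block = block;
        l->data  = memory[block];
    }
    l->data  = value;                          /* update the cache only */
    l->dirty = true;                           /* memory is now stale */
}
```

Three consecutive writes to the same block cause no memory traffic at all in this sketch. Memory is updated once, when a write to a different block evicts the dirty line.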
Intel Core i7 Cache Hierarchy
[Figure: processor package with cores 0 to 3; each core has its own registers, L1 d-cache, L1 i-cache, and unified L2 cache; a unified L3 cache is shared by all cores and sits in front of main memory.]
- L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
- L2 unified cache: 256 KB, 8-way, access: 11 cycles
- L3 unified cache: 8 MB, 16-way, access: 30-40 cycles
- Block size: 64 bytes for all caches
Cache Performance Metrics
- Miss rate
  - Fraction of memory references not found in the cache (misses / accesses) = 1 - hit rate
  - Typical numbers (in percentages): 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
- Hit time
  - Time to deliver a line in the cache to the processor; includes the time to determine whether the line is in the cache
  - Typical numbers: 1-2 clock cycles for L1; 5-20 clock cycles for L2
- Miss penalty
  - Additional time required because of a miss
  - Typically 50-200 cycles for main memory (trend: increasing!)
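The three metrics are commonly combined into an average access time, hit time + miss rate * miss penalty. The slide does not state this formula; it is the standard way of putting the numbers together, and the function name below is hypothetical.

```c
/* Average time per memory access, in cycles:
   every access pays the hit time, and the fraction that misses
   additionally pays the miss penalty. */
double avg_access_time(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
```

Assuming a 1-cycle hit time and a 100-cycle miss penalty, a 3% miss rate gives about 4 cycles per access and a 1% miss rate about 2 cycles. A seemingly small change in miss rate therefore halves the average access time, which is why the miss rate is the number to watch.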
Writing Cache Friendly Code
- Make the common case go fast: focus on the inner loops of the core functions
- Minimize the misses in the inner loops
  - Repeated references to variables are good (temporal locality)
  - Stride-1 reference patterns are good (spatial locality)
- Key idea: our qualitative notion of locality is quantified through our understanding of cache memories
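A classic illustration of the stride-1 advice is summing a 2-D array in two loop orders. The sketch below uses a deliberately tiny array so it is easy to check by hand; the dimensions and function names are arbitrary choices for illustration.

```c
#define ROWS 2
#define COLS 3

/* Stride-1: the inner loop walks along a row, which matches C's
   row-major layout, so consecutive accesses fall in the same cache block. */
int sum_rows(int a[ROWS][COLS])
{
    int sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Stride-COLS: the inner loop jumps a whole row on every step, so for
   large arrays each access may touch a different cache block. */
int sum_cols(int a[ROWS][COLS])
{
    int sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}
```

Both functions return the same sum, and both keep sum in a local variable (temporal locality). Only the order of memory accesses differs, and that order decides the miss rate once the array is much larger than the cache.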
Lecture on Friday (extra class): 10 to 11 am in room no. 1220
Memory Management
- In a multi-tasking environment, the CPU is shared among multiple processes, and CPU scheduling takes care of assigning the CPU to each ready process.
- To realize multi-tasking, multiple processes must be kept in memory at the same time, i.e. memory must be shared between them.
- This sharing of memory is handled by memory management techniques.
- These techniques are usually implemented with hardware support.
Main Memory
- Main memory is divided into two parts: one for the OS, the other for multiple user processes.
- User processes must be restricted from accessing the OS memory area and the memory of other user processes.
- This protection is provided by the hardware in the form of a base register and a limit register (Figure 8.1, p. 316).
- The base register holds the starting physical address and the limit register holds the size of the process.
- Both registers are loaded by the OS using privileged instructions.
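The check the hardware performs on every user-mode memory reference can be sketched as a single comparison pair. The function name is hypothetical, and the numbers in the note below are illustrative values, not taken from the slides.

```c
#include <stdbool.h>
#include <stdint.h>

/* An address is legal when base <= addr < base + limit.
   On real hardware a failed check traps to the operating system. */
bool address_is_legal(uint32_t addr, uint32_t base, uint32_t limit)
{
    /* addr - base is only evaluated when addr >= base, so it cannot wrap. */
    return addr >= base && addr - base < limit;
}
```

With base = 300040 and limit = 120900, the legal range runs from 300040 to 420939 inclusive; 300039 and 420940 are both rejected.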
[Figure 8.1, p. 316]
[Figure 8.2, p. 317]
Address Binding
- An address in a user program is different from a physical address, so address binding is needed.
- The compiler usually binds symbolic addresses to relocatable addresses.
- The loader in turn binds relocatable addresses to absolute physical addresses.
- Binding can be done at:
  - Compile time: the compiler generates absolute code.
  - Load time: the compiler generates relocatable code, and binding is delayed until load time.
  - Execution time: binding is delayed until run time, which allows the process to be moved in memory while it executes.
Logical versus Physical Address Space
- The CPU generates logical addresses
- The address seen by memory is the physical address
- Logical address space: the set of all logical addresses generated by a program
- Physical address space: the set of physical addresses corresponding to those logical addresses
- Memory Management Unit (MMU): hardware that performs the run-time mapping from logical to physical addresses
Simple MMU Scheme
- Uses a base register (a.k.a. relocation register)
- The user program deals with logical addresses
- The MMU converts each logical address to a physical address
- The physical address is used to access main (physical) memory
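In this scheme the conversion is a single addition. A minimal sketch, with a hypothetical function name and illustrative numbers:

```c
#include <stdint.h>

/* The MMU adds the relocation (base) register to every logical address.
   The user program never sees the resulting physical address. */
uint32_t mmu_relocate(uint32_t logical, uint32_t relocation_reg)
{
    return logical + relocation_reg;
}
```

With the relocation register set to 14000, logical address 346 maps to physical address 14346, and logical address 0 maps to 14000.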
Dynamic Loading
- Do not load the entire program into memory before execution; doing so limits the size of a program to the size of physical memory
- Programs are divided into modules or routines
- A routine is loaded dynamically when it is called
- Advantage: an unused routine is never loaded
Static Linking and Dynamic Linking
- Static linking: libraries are combined into the binary program image before execution
- Dynamic linking: linking of shared libraries is postponed until execution time
Memory Allocation
Memory can be allocated in two ways:
- Contiguous memory allocation: a process resides in a single contiguous section of memory
  - Fixed-sized partitions: the degree of multiprogramming is bounded by the number of partitions; suffers from internal fragmentation
  - Variable-sized partitions: the OS maintains a table to keep track of which memory is available and which is occupied; suffers from external fragmentation
- Non-contiguous memory allocation: a process is not stored in one contiguous section of memory
  - Paging
Paging
- A memory management scheme that allows memory allocation to be non-contiguous
- Paging avoids external fragmentation and the need for compaction
- Traditional systems provide hardware support for paging
- Modern systems implement paging by closely integrating the hardware and the OS (especially on 64-bit microprocessors)
Basic Method of Paging
- Logical memory is divided into fixed-sized blocks called pages
- Physical memory is divided into fixed-sized blocks called frames
- The backing store (disk) is also divided into fixed-sized blocks that are the same size as the memory frames
- The CPU generates a logical address, which is divided into two parts: a page number and a page offset
- The page size is defined by the hardware and is usually a power of 2, varying between 512 bytes and 16 MB
- Page sizes are growing over time as processes, data sets, and main memory have become larger
- Internal fragmentation may exist
Page Table
- It is used to map a virtual page to a physical frame
- For each page number, the page table stores the corresponding frame number
- The page table is implemented in hardware
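Putting the address split and the page table together, translation is a shift, a table lookup, and a recombination. The sketch below assumes 4 KB pages (12 offset bits) and a flat array as the page table; these choices and all names are illustrative assumptions.

```c
#include <stdint.h>

#define OFFSET_BITS 12u                    /* assumed 4 KB pages */
#define PAGE_SIZE   (1u << OFFSET_BITS)

uint32_t page_number(uint32_t logical) { return logical >> OFFSET_BITS; }
uint32_t page_offset(uint32_t logical) { return logical & (PAGE_SIZE - 1); }

/* page_table[p] holds the frame number for page p.
   The offset is carried over unchanged; only the page number is replaced. */
uint32_t translate(const uint32_t *page_table, uint32_t logical)
{
    uint32_t frame = page_table[page_number(logical)];
    return (frame << OFFSET_BITS) | page_offset(logical);
}
```

With a page table of {5, 6, 1, 2}, logical address 0x1ABC lies in page 1 at offset 0xABC. Page 1 maps to frame 6, so the physical address is 0x6ABC. Because the page size is a power of 2, the split needs no division.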
Paging Hardware
Paging, memory mapping using page table
Page Table: Hardware Implementation, Method 1
- The page table is held in a set of dedicated registers
- These registers are built from high-speed logic to make paging address translation efficient
- The CPU dispatcher reloads these registers just as it reloads the other registers
- This method is acceptable if the page table is small, but it is not feasible for large processes
Page Table: Hardware Implementation, Method 2
- Store the page table in main memory
- The Page Table Base Register (PTBR) points to the page table in memory
- With this method, every memory reference requires two memory accesses: one for the page table entry and one for the actual data
- This is very slow, so another technique is needed
Translation Lookaside Buffer (TLB)
- Page tables vary in size; larger processes require larger page tables
- A large page table cannot be stored in registers, so it must be stored in main memory
- Every virtual memory reference then causes two physical memory accesses: one to fetch the page table entry and one to fetch the data
- This slows down the system
- To overcome this problem, a special cache memory, the TLB, is used to store page table entries
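A software model of the idea: look in a small table of recent translations first, and touch the in-memory page table only on a miss. The four-entry size, the FIFO refill order, and all names below are assumptions made for this sketch; a real TLB compares all entries in parallel in hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 4

typedef struct {
    bool     valid;
    uint32_t page;
    uint32_t frame;
} tlb_entry;

typedef struct {
    tlb_entry entry[TLB_ENTRIES];
    unsigned  next;                 /* next slot to overwrite (FIFO, an assumption) */
    unsigned  hits;
    unsigned  misses;
} tlb;

/* Returns the frame number for a page. A TLB hit avoids the page table
   access; a miss reads the page table and caches the translation. */
uint32_t tlb_lookup(tlb *t, const uint32_t *page_table, uint32_t page)
{
    for (unsigned i = 0; i < TLB_ENTRIES; i++) {
        if (t->entry[i].valid && t->entry[i].page == page) {
            t->hits++;
            return t->entry[i].frame;
        }
    }
    t->misses++;
    uint32_t frame = page_table[page];              /* the extra memory access */
    t->entry[t->next] = (tlb_entry){ true, page, frame };
    t->next = (t->next + 1) % TLB_ENTRIES;
    return frame;
}
```

Because of locality, most references go to a handful of pages, so after the first miss on a page the following lookups for it are served from the TLB.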
TLB
TLB and Cache Operation
Structure of the Page Table
Hierarchical paging
- The logical address is divided into three parts: page number 1 (p1), page number 2 (p2), and the page offset
- See Figures 8.14 and 8.15 (p. 338)
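The three-way split can be sketched as follows. The slide does not give field widths; the sketch assumes a 32-bit logical address with 10 bits for p1, 10 bits for p2, and a 12-bit offset (4 KB pages). The names two_level and split_two_level are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t p1;   /* index into the outer page table */
    uint32_t p2;   /* index into the inner page table selected by p1 */
    uint32_t d;    /* offset within the page */
} two_level;

/* Splits a 32-bit logical address into p1 (10 bits), p2 (10 bits),
   and d (12 bits). The widths are assumptions for this sketch. */
two_level split_two_level(uint32_t logical)
{
    two_level a;
    a.d  = logical & 0xFFFu;
    a.p2 = (logical >> 12) & 0x3FFu;
    a.p1 = logical >> 22;
    return a;
}
```

For example, address 0x00403004 yields p1 = 1, p2 = 3, and d = 4. Translation would use p1 to find an inner page table, p2 to find the frame in it, and d as the offset within that frame.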
Virtual Memory
Demand paging
- Not all pages of a process are required in memory
- Pages are brought in as required
Page fault
- The required page is not in memory
- The operating system must swap in the required page
- It may need to swap out a page to make space
- The page to throw out is selected based on recent history
Page Fault
Thrashing
- Too many processes in too little memory
- The operating system spends all its time swapping
- Little or no real work is done
- The disk light is on all the time
Solutions
- Good page replacement algorithms
- Reduce the number of processes running
- Fit more memory
Advantage
- We do not need all of a process in memory for it to run
- We can swap in pages as required
- So we can now run processes that are bigger than the total memory available!
- Main memory is called real memory
- The user/programmer sees a much bigger memory: virtual memory