
1 Memory Organization [Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and Irwin, PSU 2005]

2 Review: Major Components of a Computer
Processor (Control, Datapath), Memory, and Devices (Input, Output). Workstation design target: 25% of cost on the processor, 25% of cost on memory (minimum memory size), and the rest on I/O devices, power supplies, and the box.

3 Pipeline review Increases throughput What about latency?
Instruction set design helps the pipeline: fixed-length instructions and simple addressing modes. Hazards require detection and forwarding. Increasing the number of stages means more hardware, more stalls (especially branch hazards), and the need for balanced stages, in exchange for the potential for higher speedup. PIPELINING IS NOT FREE.

4 Implementation review
Comparison of the single-cycle (CPI = 1), multicycle, and pipelined implementations in terms of clock rate, throughput (instructions per cycle and instructions per second), and latency (cycles per instruction).

5 Memories: Review
SRAM: value is stored on a pair of inverting gates; very fast, but takes up more space than DRAM (4 to 6 transistors per cell). DRAM: value is stored as charge on a capacitor (must be refreshed); very small, but slower than SRAM (by a factor of 5 to 10). [© 1998 Morgan Kaufmann Publishers]

6 Processor-Memory Performance Gap
Processor performance improves about 55%/year (2X/1.5yr, "Moore's Law"), while DRAM improves about 7%/year (2X/10yrs), so the processor-memory performance gap grows about 50%/year. The memory baseline is a 64KB DRAM in 1980, with three years to the next generation until 1996 and two years thereafter, and a 7% per year performance improvement in latency. The processor curve assumes a 35% improvement per year until 1986, then 55% until 2003, then 5%. The processor needs an instruction and a data word every clock cycle. In 1980 there were no caches (and no need for them); by 1995 most systems had two-level caches (e.g., 60% of the transistors on the Alpha were in the cache).

7 The “Memory Wall” Logic vs DRAM speed gap continues to grow
[Figure: clocks per DRAM access and clocks per instruction over time, with the gap growing.]

8 Memory Performance Impact on Performance
Suppose a processor executes with an ideal CPI of 1.1 (50% arith/logic, 30% ld/st, 20% control) and that 10% of data memory operations miss with a 50 cycle miss penalty. CPI = ideal CPI + average stalls per instruction = 1.1 (cycles) + (0.30 data memory ops/instr x 0.10 misses/data memory op x 50 cycles/miss) = 1.1 cycles + 1.5 cycles = 2.6, so 58% of the time the processor is stalled waiting for memory! A 1% instruction miss rate would add an additional 0.5 to the CPI!
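A quick sanity check on these numbers, as a minimal Python sketch that simply replays the slide's arithmetic (all values are taken from the slide itself):

```python
# Recompute the slide's CPI-with-memory-stalls numbers.
ideal_cpi = 1.1
ld_st_fraction = 0.30    # fraction of instructions that are loads/stores
data_miss_rate = 0.10    # misses per data memory operation
miss_penalty = 50        # cycles per miss

data_stalls = ld_st_fraction * data_miss_rate * miss_penalty   # 1.5 cycles/instruction
cpi = ideal_cpi + data_stalls                                   # 2.6
print(f"CPI = {cpi:.1f}, fraction of time stalled = {data_stalls / cpi:.0%}")

# A 1% instruction miss rate adds 0.01 * 50 = 0.5 more cycles per instruction.
instr_stalls = 0.01 * miss_penalty
print(f"CPI including instruction misses = {cpi + instr_stalls:.1f}")
```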

9 The Memory Hierarchy Goal
Fact: Large memories are slow and fast memories are small How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)? With hierarchy With parallelism 11/11/2018 Irwin, PSU, 2005

10 Memory Technology Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk
(§5.1 Introduction) Static RAM (SRAM): 0.5ns – 2.5ns, $2000 – $5000 per GB. Dynamic RAM (DRAM): 50ns – 70ns, $20 – $75 per GB. Magnetic disk: 5ms – 20ms, $0.20 – $2 per GB. Ideal memory: the access time of SRAM with the capacity and cost/GB of disk.

11 Programs access a small proportion of their address space at any time
Principle of Locality: programs access a small proportion of their address space at any time. Temporal locality: items accessed recently are likely to be accessed again soon (e.g., instructions in a loop, induction variables). Spatial locality: items near those accessed recently are likely to be accessed soon (e.g., sequential instruction access, array data).
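To make the two kinds of locality concrete, here is a small illustrative sketch that prints the address trace of a simple array-sum loop; the base addresses 0x1000 and 0x2000 and the 4-byte element size are hypothetical values chosen only for the illustration:

```python
# Illustrative address trace for: for i in range(4): total += a[i]
# Hypothetical layout: a[] starts at byte address 0x1000 (4 bytes per element),
# and the accumulator 'total' lives at 0x2000.
trace = []
for i in range(4):
    trace.append(0x1000 + 4 * i)   # spatial locality: sequential array elements
    trace.append(0x2000)           # temporal locality: the same variable every iteration

print([hex(a) for a in trace])
```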

12 Taking Advantage of Locality
Memory hierarchy: store everything on disk; copy recently accessed (and nearby) items from disk to a smaller DRAM memory (main memory); copy more recently accessed (and nearby) items from DRAM to a smaller SRAM memory (cache memory attached to the CPU).

13 A Typical Memory Hierarchy
By taking advantage of the principle of locality, we can present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology. On-chip components: Control, Datapath (RegFile), first-level instruction and data caches, ITLB and DTLB, and possibly eDRAM; below them sit a second-level cache (SRAM), main memory (DRAM), and secondary memory (disk). The memory system of a modern computer thus consists of a series of levels ranging from the fastest to the slowest; besides speed, these levels also vary in size (smallest to biggest) and cost. What makes this arrangement work is one of the most important principles in computer design, the principle of locality: programs access a relatively small portion of the address space at any instant of time. The design goal is to present the user with as much memory as is available in the cheapest technology (the disk) while, by exploiting locality, providing an average access speed close to that offered by the fastest technology. (We will go over this in detail in the next lectures on caches.) Speed ranges from fractions of a cycle on-chip to thousands of cycles for disk; size ranges from hundreds of bytes on-chip to gigabytes or terabytes on disk; cost per byte is highest at the fastest level and lowest at the slowest.

14 Memory Hierarchy Levels
Block (aka line): the unit of copying; may be multiple words. If the accessed data is present in the upper level: hit, the access is satisfied by the upper level; hit ratio = hits/accesses. If the accessed data is absent: miss, the block is copied from the lower level; the time taken is the miss penalty; miss ratio = misses/accesses = 1 – hit ratio; the accessed data is then supplied from the upper level.

15 Characteristics of the Memory Hierarchy
Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory. The unit of transfer grows with distance from the processor: 4-8 bytes (word) between the processor and L1$, 8-32 bytes (block) between L1$ and L2$, 1 to 4 blocks between L2$ and main memory, and 1,024+ bytes (disk sector = page) between main memory and secondary memory. Access time and the (relative) size of the memory increase with distance from the processor: L1$, L2$, Main Memory, Secondary Memory.

16 Memory Hierarchy Technologies
Caches use SRAM for speed and technology compatibility: low density (6-transistor cells), high power, expensive, fast; static: content lasts "forever" (until power is turned off). [Figure: a 2M x 16 SRAM chip with a 21-bit address, chip select, output enable, write enable, and 16-bit Din[15-0] and Dout[15-0].] Main memory uses DRAM for size (density): high density (1-transistor cells), low power, cheap, slow; dynamic: needs to be "refreshed" regularly (about every 8 ms), which takes 1% to 2% of the active cycles of the DRAM. Addresses are divided into 2 halves (row and column): RAS, or Row Access Strobe, triggers the row decoder; CAS, or Column Access Strobe, triggers the column selector. Size comparison: DRAM/SRAM is 4 to 8 times. Cost/cycle time comparison: SRAM/DRAM is 8 to 16 times. The output enable is needed on the SRAM because its outputs are tri-stated (0, 1, high impedance).
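To illustrate the row/column split, here is a minimal sketch of dividing a DRAM array address into its RAS and CAS halves; the 12-bit row and 10-bit column widths are assumptions for the example, not parameters from the slide:

```python
def split_dram_address(addr, row_bits=12, col_bits=10):
    """Split a DRAM array address into row and column parts (illustrative widths)."""
    col = addr & ((1 << col_bits) - 1)                 # low-order bits: column (CAS)
    row = (addr >> col_bits) & ((1 << row_bits) - 1)   # next bits: row (RAS)
    return row, col

row, col = split_dram_address(0x2A5F37)
print(f"row = {row:#x}, column = {col:#x}")
```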

17 Memory Performance Metrics
Latency: time to access one word. Access time: time between the request and when the data is available (or written). Cycle time: time between requests; usually cycle time > access time. Typical read access times for SRAMs in 2004 are 2 to 4 ns for the fastest parts and 8 to 20 ns for the typical largest parts. Bandwidth: how much data the memory can supply to the processor per unit time = width of the data channel x the rate at which it can be used. Size: DRAM to SRAM is 4 to 8 times. Cost/cycle time: SRAM to DRAM is 8 to 16 times.

18 Memory Systems that Support Caches
The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways. One-word-wide organization (one-word-wide bus and one-word-wide memory): the on-chip CPU and cache connect over a bus (32-bit data and 32-bit address per cycle) to the memory. Assume 1 clock cycle to send the address, 25 clock cycles for the DRAM cycle time (8 clock cycles access time), and 1 clock cycle to return a word of data. Memory-bus-to-cache bandwidth = number of bytes accessed from memory and transferred to the cache/CPU per clock cycle.

19 One Word Wide Memory Organization
If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall the number of cycles required to return one data word from memory cycle to send address cycles to read DRAM cycle to return data total clock cycles miss penalty Number of bytes transferred per clock cycle (bandwidth) for a single miss is bytes per clock on-chip CPU Cache bus Memory For class handout 11/11/2018 Irwin, PSU, 2005

20 One Word Wide Memory Organization
If the block size is one word, then for a memory access due to a cache miss the pipeline will have to stall for the number of cycles required to return one data word from memory: 1 cycle to send the address, 25 cycles to read the DRAM, and 1 cycle to return the data, for a total of 27 clock cycles of miss penalty. The number of bytes transferred per clock cycle (bandwidth) for a single miss is 4/27 = 0.148 bytes per clock.
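The 27-cycle miss penalty and 0.148 bytes/clock can be recomputed directly from the three latency components; a minimal sketch using the slide's timing assumptions:

```python
# One-word-wide bus and memory, one-word blocks (slide assumptions).
send_addr = 1        # cycles to send the address
dram_read = 25       # DRAM cycle time in clock cycles
return_word = 1      # cycles to return one word over the bus

miss_penalty = send_addr + dram_read + return_word   # 27 cycles
bandwidth = 4 / miss_penalty                         # 4 bytes per block / 27 cycles
print(miss_penalty, round(bandwidth, 3))             # 27 0.148
```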

21 One Word Wide Memory Organization, con’t
What if the block size is four words? cycle to send 1st address cycles to read DRAM cycles to return last data word total clock cycles miss penalty Number of bytes transferred per clock cycle (bandwidth) for a single miss is bytes per clock on-chip CPU Cache bus Memory For class handout 11/11/2018 Irwin, PSU, 2005

22 One Word Wide Memory Organization, con’t
What if the block size is four words (still with a one-word-wide bus and memory)? 1 cycle to send the 1st address, 4 x 25 = 100 cycles to read the DRAM (the four reads happen one after another), and 4 x 1 = 4 cycles to return the data words, for a total of 105 clock cycles of miss penalty. The number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/105 = 0.152 bytes per clock.

23 FOUR Word Wide Memory Organization, con’t
What if the block size is four words? cycle to send 1st address cycles to read DRAM cycles to return last data word total clock cycles miss penalty Number of bytes transferred per clock cycle (bandwidth) for a single miss is bytes per clock on-chip CPU Cache bus Memory For class handout Revised 09 11/11/2018 Irwin, PSU, 2005

24 FOUR Word Wide Memory Organization, con’t
What if the block size is four words and the memory and bus are four words wide? 1 cycle to send the 1st address, 25 cycles to read the DRAM (all four words in one access), and 1 cycle to return the data, for a total of 27 clock cycles of miss penalty. The number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/27 = 0.593 bytes per clock.

25 Interleaved Memory Organization
For a block size of four words cycle to send 1st address cycles to read DRAM cycles to return last data word total clock cycles miss penalty on-chip CPU Cache bus Memory bank 0 Memory bank 1 Memory bank 2 Memory bank 3 For class handout Number of bytes transferred per clock cycle (bandwidth) for a single miss is bytes per clock 11/11/2018

26 Interleaved Memory Organization
For a block size of four words with four interleaved memory banks: 1 cycle to send the 1st address, 25 cycles to read the DRAM banks (their accesses overlap), and 4 cycles to return the four data words one per cycle, for a total of 30 clock cycles of miss penalty. The number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/30 = 0.533 bytes per clock.
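Putting the last few organizations side by side with the same timing assumptions (1 cycle to send the address, 25-cycle DRAM cycle time, 1 cycle per bus transfer) reproduces the miss penalties and bandwidths from the preceding slides; a sketch:

```python
SEND, DRAM, BUS = 1, 25, 1   # cycles: address send, DRAM cycle time, per-transfer bus time
WORDS, BYTES = 4, 16         # four-word block, 16 bytes

def penalty(serial_dram_accesses, bus_transfers):
    """Miss penalty = address send + serialized DRAM accesses + bus transfers."""
    return SEND + serial_dram_accesses * DRAM + bus_transfers * BUS

cases = {
    "1-word-wide bus/memory, 4-word block": penalty(4, 4),   # 1 + 4*25 + 4 = 105
    "4-word-wide bus and memory":           penalty(1, 1),   # 1 + 25 + 1   = 27
    "4-way interleaved banks, 1-word bus":  penalty(1, 4),   # 1 + 25 + 4   = 30
}
for name, cycles in cases.items():
    print(f"{name}: {cycles} cycles, {BYTES / cycles:.3f} bytes/clock")
```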

27 DRAM Memory System Summary
It's important to match the cache characteristics (caches access one block at a time, usually more than one word) with the DRAM characteristics (use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache) and with the memory-bus characteristics (make sure the memory bus can support the DRAM access rates and patterns), with the goal of increasing the memory-bus-to-cache bandwidth.

28 Review: The Memory Hierarchy
Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology at the speed offered by the fastest technology. Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory. The unit of transfer grows with distance from the processor: 4-8 bytes (word), 8-32 bytes (block), 1 to 4 blocks, 1,024+ bytes (disk sector = page). Access time and the (relative) size of the memory increase with distance from the processor: L1$, L2$, Main Memory, Secondary Memory.

29 The Memory Hierarchy: Why Does it Work?
Temporal Locality (Locality in Time): keep most recently accessed data items closer to the processor. Spatial Locality (Locality in Space): move blocks consisting of contiguous words to the upper levels. How does the memory hierarchy work? It is rather simple, at least in principle. To take advantage of temporal locality, the hierarchy keeps the more recently accessed data items closer to the processor, because chances are the processor will access them again soon. To take advantage of spatial locality, we move not only the item that has just been accessed (Blk X) to the upper level, but also the data items adjacent to it (Blk Y).

30 The Memory Hierarchy: Terminology
Hit: the data is in some block in the upper level (Blk X). Hit Rate: the fraction of memory accesses found in the upper level. Hit Time: the time to access the upper level, which consists of the RAM access time + the time to determine hit/miss. Miss: the data is not in the upper level, so it needs to be retrieved from a block in the lower level (Blk Y). Miss Rate = 1 - (Hit Rate). Miss Penalty: the time to replace a block in the upper level + the time to deliver the block to the processor. Hit Time << Miss Penalty. A hit is when the data the processor wants is found in the upper level (Blk X); the fraction of memory accesses that hit is the hit rate. The hit time consists of (a) the time to access that level and (b) the time to determine whether it is a hit or miss. If the data cannot be found in the upper level, we have a miss and must retrieve the block (Blk Y) from the lower level; by definition, the miss rate is 1 minus the hit rate. The miss penalty also has two parts: (a) the time to replace a block in the upper level and (b) the time to deliver the new block to the processor. It is very important that the hit time be much smaller than the miss penalty; otherwise there would be no reason to build a memory hierarchy.

31 How is the Hierarchy Managed?
registers  memory by compiler (programmer?) cache  main memory by the cache controller hardware main memory  disks by the operating system (virtual memory) virtual to physical address mapping assisted by the hardware (TLB) by the programmer (files) 11/11/2018

32 (block address) modulo (# of blocks in the cache)
Two questions to answer (in hardware): Q1: How do we know if a data item is in the cache? Q2: If it is, how do we find it? Direct mapped For each item of data at the lower level, there is exactly one location in the cache where it might be - so lots of items at the lower level must share locations in the upper level Address mapping: (block address) modulo (# of blocks in the cache) First consider block sizes of one word 11/11/2018
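The mapping rule is just a modulo on the block address (equivalently, its low-order bits when the number of blocks is a power of two); a minimal sketch:

```python
def cache_index(block_address, num_blocks):
    """Direct-mapped placement: (block address) modulo (# of blocks in the cache)."""
    return block_address % num_blocks   # == block_address & (num_blocks - 1) for powers of 2

# Example: with 4 one-word blocks, memory blocks 0, 4, 8, 12 all share cache index 0.
print([cache_index(b, 4) for b in (0, 4, 8, 12)])   # [0, 0, 0, 0]
```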

33 Caching: A Simple First Example
Main Memory 0000xx 0001xx 0010xx 0011xx 0100xx 0101xx 0110xx 0111xx 1000xx 1001xx 1010xx 1011xx 1100xx 1101xx 1110xx 1111xx Two low order bits define the byte in the word (32-b words) Cache Index Valid Tag Data 00 01 10 11 Q2: How do we find it? Use next 2 low order memory address bits – the index – to determine which cache block (i.e., modulo the number of blocks in the cache) Q1: Is it there? Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache For class handout (block address) modulo (# of blocks in the cache) 11/11/2018

34 Caching: A Simple First Example
Main Memory 0000xx 0001xx 0010xx 0011xx 0100xx 0101xx 0110xx 0111xx 1000xx 1001xx 1010xx 1011xx 1100xx 1101xx 1110xx 1111xx Two low order bits define the byte in the word (32b words) Cache Index Valid Tag Data 00 01 10 11 Q2: How do we find it? Use next 2 low order memory address bits – the index – to determine which cache block (i.e., modulo the number of blocks in the cache) Q1: Is it there? Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache For lecture Valid bit indicates whether an entry contains valid information – if the bit is not set, there cannot be a match for this block (block address) modulo (# of blocks in the cache) 11/11/2018

35 Tags and Valid Bits How do we know which particular block is stored in a cache location? Store block address as well as the data Actually, only need the high-order bits Called the tag What if there is no data in a location? Valid bit: 1 = present, 0 = not present Initially 0 11/11/2018

36 Direct Mapped Cache Consider the main memory word reference string
Start with an empty cache - all blocks initially marked as not valid. Reference string: 0 1 2 3 4 3 4 15. (For class handout)

37 Direct Mapped Cache Consider the main memory word reference string
Start with an empty cache - all blocks initially marked as not valid. Reference string 0 1 2 3 4 3 4 15: 0 miss, 1 miss, 2 miss, 3 miss (the cache now holds Mem(0), Mem(1), Mem(2), Mem(3)); 4 miss (Mem(4), tag 01, replaces Mem(0) at index 00); 3 hit; 4 hit; 15 miss (Mem(15), tag 11, replaces Mem(3) at index 11). 8 requests, 6 misses.
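The hit/miss pattern above can be reproduced with a few lines of simulation; a sketch of a direct-mapped cache with four one-word blocks, fed the reference string 0 1 2 3 4 3 4 15:

```python
def simulate(refs, num_blocks=4):
    """Direct-mapped cache with one-word blocks: track the block address held at each index."""
    cache = [None] * num_blocks
    misses = 0
    for addr in refs:
        idx = addr % num_blocks
        if cache[idx] == addr:
            print(f"{addr:2d}: hit")
        else:
            print(f"{addr:2d}: miss")
            cache[idx] = addr
            misses += 1
    return misses

print(simulate([0, 1, 2, 3, 4, 3, 4, 15]), "misses")   # 6 misses out of 8 requests
```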

38 MIPS Direct Mapped Cache Example
One word/block, cache size = 1K words. The address is divided into a 20-bit tag, a 10-bit index, and a 2-bit byte offset; the index selects one of 1024 entries (0 to 1023), each holding a valid bit, a 20-bit tag, and 32 bits of data, and a hit is signaled when the entry is valid and its tag matches. Let's use a specific example with realistic numbers: assume a 1K word (4 KB) direct mapped cache with a block size of 4 bytes (1 word). With a 4-byte block, the 2 least significant bits of the address are used as the byte select within the block. Since the cache size is 1K words, the upper 32 - (10 + 2) = 20 bits of the address are stored as the cache tag, and the remaining 10 address bits in the middle, bits 2 through 11, are used as the cache index to select the proper cache entry. What kind of locality are we taking advantage of? Temporal!

39 Handling Cache Hits Read hits (I$ and D$) Write hits (D$ only)
This is what we want for read hits! For write hits (D$ only) there are two options. Write-back: allow the cache and memory to be inconsistent; write the data only into the cache block, and write the cache contents back to the next level in the memory hierarchy when that cache block is "evicted"; this needs a dirty bit for each data cache block to tell whether it must be written back to memory when it is evicted. Write-through: require the cache and memory to be consistent; always write the data into both the cache block and the next level in the memory hierarchy, so no dirty bit is needed; writes run at the speed of the next level in the memory hierarchy (slow!) unless a write buffer is used, in which case we only have to stall when the write buffer is full.

40 Write Buffer for Write-Through Caching
A write buffer sits between the cache and main memory (DRAM). The processor writes data into the cache and the write buffer; the memory controller writes the contents of the write buffer to memory. The write buffer is just a FIFO, typically with about 4 entries. It works fine as long as the store frequency (with respect to time) << 1 / DRAM write cycle time. The memory system designer's nightmare is when the store frequency (with respect to time) approaches 1 / DRAM write cycle time, leading to write buffer saturation. One solution is to use a write-back cache; another is to use an L2 cache (next lecture). We don't really write to memory directly: we write to the write buffer, and once the data is in the write buffer (assuming a cache hit) the CPU is done with the write; the memory controller then moves the write buffer's contents to the real memory behind the scenes. This works as long as stores are not too frequent with respect to time (not with respect to the number of instructions): the DRAM cycle time sets an upper limit on how frequently you can write to main memory. If the stores come too close together, or the CPU is much faster than the DRAM cycle time, the write buffer overflows and the CPU must stop and wait. Once saturation occurs, it does not matter how big you make the write buffer; you are feeding it faster than it can be emptied, and the processor ends up running at the DRAM cycle time, which is very slow. The first solution is to replace the write-through cache with a write-back cache; another is to install a second-level (write-back) cache between the write buffer and memory.

41 Review: Why Pipeline? For Throughput!
To avoid a structural hazard we need two caches on-chip: one for instructions (I$) and one for data (D$). [Figure: pipeline diagram, instructions 0-4 in program order over time, each passing through I$, Reg, ALU, D$, Reg.] To keep the pipeline running at its maximum rate both I$ and D$ need to satisfy a request from the datapath every cycle. What happens when they can't do that?

42 Another Reference String Mapping
Consider the main memory word reference string 0 4 0 4 0 4 0 4. Start with an empty cache - all blocks initially marked as not valid. (For class handout)

43 Another Reference String Mapping
Consider the main memory word reference string 0 4 0 4 0 4 0 4. Start with an empty cache - all blocks initially marked as not valid: 0 miss, 4 miss (Mem(4) replaces Mem(0) at index 00), 0 miss, 4 miss, 0 miss, 4 miss, 0 miss, 4 miss. 8 requests, 8 misses. Ping-pong effect due to conflict misses: two memory locations that map into the same cache block.
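The ping-pong effect can be checked with the simulate() sketch shown after the earlier direct-mapped trace; this short usage example assumes that function is already defined:

```python
# Reusing the simulate() sketch from the earlier direct-mapped example:
refs = [0, 4, 0, 4, 0, 4, 0, 4]      # 0 and 4 both map to index 0 (mod 4)
print(simulate(refs), "misses")      # 8 misses out of 8 requests: ping-pong
```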

44 Sources of Cache Misses
Compulsory (cold start or process migration, first reference): the first access to a block; a "cold" fact of life, not a whole lot you can do about it. If you are going to run millions of instructions, compulsory misses are insignificant. Conflict (collision): multiple memory locations mapped to the same cache location. Solution 1: increase cache size. Solution 2: increase associativity (next lecture). Capacity: the cache cannot contain all the blocks accessed by the program. Solution: increase cache size. Capacity misses occur because the cache is simply not large enough to contain all the blocks accessed by the program, and the fix is to increase the cache size. Compulsory misses cannot be avoided; they are incurred when the program first starts. Conflict misses are caused by multiple memory locations being mapped to the same cache location; the two ways to reduce them are to increase the cache size or to increase the associativity (for example, a 2-way set associative cache instead of a direct mapped cache). But keep in mind that the miss rate is only one part of the equation: you also have to worry about cache access time and miss penalty, so do not optimize the miss rate alone. Finally, there is another source of cache misses not covered today: invalidation misses, caused when another process (such as I/O) updates main memory, so the cache must be flushed to avoid inconsistency between memory and cache.

45 Handling Cache Misses Read misses (I$ and D$) Write misses (D$ only)
For read misses: stall the entire pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache, send the requested word to the processor, then let the pipeline resume. For write misses (D$ only) there are three options: (1) stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve evicting a dirty block if using a write-back cache), write the word from the processor to the cache, then let the pipeline resume; or (2) (normally used in write-back caches) write allocate: just write the word into the cache, updating both the tag and data, with no need to check for a cache hit and no need to stall; or (3) (normally used in write-through caches with a write buffer) no-write allocate: skip the cache write and just write the word to the write buffer (and eventually to the next memory level), with no need to stall if the write buffer isn't full; the cache block must be invalidated since it would otherwise be inconsistent (holding stale data). Consider the 1KB direct mapped cache again, and suppose a 16-bit write to memory location 0x causes a cache miss in a cache with 32-byte blocks. After we write the cache tag and write the 16-bit data into Byte 0 and Byte 1, do we have to read the rest of the block (Byte 2, 3, ..., Byte 31) from memory? If we do read in the rest of the block, it is called write allocate. But is it really necessary to bring in the rest of the block on a write miss? True, the principle of spatial locality implies that we are likely to access it soon, but the access we are likely to do next is another write, so even if we read in the data we may end up overwriting it anyway. It is therefore common practice NOT to read in the rest of the block on a write miss (write no-allocate); in that case there must be some way to tell the processor that the rest of the block is no longer valid. This brings us to the topic of sub-blocking.

46 Multiword Block Direct Mapped Cache
Four words/block, cache size = 1K words. The address is divided into a 20-bit tag, an 8-bit index, a block offset, and a byte offset; the index selects one of 256 entries (0 to 255), each holding a valid bit, a 20-bit tag, and four 32-bit data words, and the block offset selects the word within the block. To take advantage of spatial locality we want a cache block that is larger than one word. What kind of locality are we taking advantage of?

47 Example: Larger Block Size
64 blocks, 16 bytes/block. To what block number does address 1200 map? Block address = 1200/16 = 75; block number = 75 modulo 64 = 11. Address fields: offset = bits 3-0 (4 bits), index = bits 9-4 (6 bits), tag = bits 31-10 (22 bits).
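The same arithmetic in code, using the slide's 64-block, 16-bytes-per-block cache:

```python
cache_blocks, block_bytes = 64, 16
addr = 1200
block_address = addr // block_bytes          # 1200 / 16 = 75
index = block_address % cache_blocks         # 75 modulo 64 = 11
tag = block_address // cache_blocks          # remaining high-order bits
offset = addr % block_bytes                  # byte within the block
print(block_address, index, tag, offset)     # 75 11 1 0
```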

48 Taking Advantage of Spatial Locality
Let the cache block hold more than one word. Start with an empty cache - all blocks initially marked as not valid. Reference string: 0 1 2 3 4 3 4 15. (For class handout)

49 Taking Advantage of Spatial Locality
Let the cache block hold more than one word (here, two blocks of two words each). Start with an empty cache - all blocks initially marked as not valid. Reference string 0 1 2 3 4 3 4 15: 0 miss (block Mem(1)-Mem(0) loaded), 1 hit, 2 miss (block Mem(3)-Mem(2) loaded), 3 hit, 4 miss (Mem(5)-Mem(4) replaces Mem(1)-Mem(0)), 3 hit, 4 hit, 15 miss (Mem(15)-Mem(14) replaces Mem(3)-Mem(2)). 8 requests, 4 misses.
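Extending the earlier simulator to multiword blocks reproduces the 4-miss result; a sketch assuming, as above, a direct-mapped cache with two blocks of two words each:

```python
def simulate_blocks(refs, num_blocks=2, words_per_block=2):
    """Direct-mapped cache with multiword blocks: store the block address held at each index."""
    cache = [None] * num_blocks
    misses = 0
    for word_addr in refs:
        block_addr = word_addr // words_per_block
        idx = block_addr % num_blocks
        hit = cache[idx] == block_addr
        print(f"{word_addr:2d}: {'hit' if hit else 'miss'}")
        if not hit:
            cache[idx] = block_addr
            misses += 1
    return misses

print(simulate_blocks([0, 1, 2, 3, 4, 3, 4, 15]), "misses")   # 4 misses out of 8 requests
```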

50 Miss Rate vs Block Size vs Cache Size
Miss rate goes up if the block size becomes a significant fraction of the cache size because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses) 11/11/2018

51 Block Size Tradeoff
Larger block sizes take advantage of spatial locality, but if the block size is too big relative to the cache size, the miss rate goes up, and a larger block size also means a larger miss penalty (latency to the first word in the block + transfer time for the remaining words). [Figures: miss rate vs. block size (exploits spatial locality, but fewer blocks compromises temporal locality), miss penalty vs. block size (increasing), and average access time vs. block size (rising at large block sizes due to increased miss penalty and miss rate).] In general, a larger block size reduces the miss rate because it takes advantage of spatial locality, but the miss rate is not the only cache performance metric: you also have to worry about the miss penalty. As you increase the block size, the miss penalty goes up because it takes longer to fill the block. Even looking at miss rate by itself, a bigger block size does not always win: keeping the cache size constant, the miss rate drops rapidly at first due to spatial locality, but past a certain point it actually goes up. As a result, the average access time (the more important metric) goes down initially, because the miss rate drops much faster than the miss penalty rises, but eventually it can go up rapidly because both the miss penalty and the miss rate are increasing. In general, Average Memory Access Time = Hit Time + Miss Penalty x Miss Rate.

52 Multiword Block Considerations
Read misses (I$ and D$): processed the same as for single-word blocks; a miss returns the entire block from memory, and the miss penalty grows as the block size grows. Early restart: the datapath resumes execution as soon as the requested word of the block is returned. Requested word first: the requested word is transferred from memory to the cache (and datapath) first. Nonblocking cache: allows the datapath to continue to access the cache while the cache is handling an earlier miss. Write misses (D$): can't use write allocate or we end up with a "garbled" block in the cache (e.g., for 4-word blocks: a new tag, one word of data from the new block, and three words of data from the old block), so we must fetch the block from memory first and pay the stall time. Early restart works best for instruction caches (since it works best for sequential accesses): if the memory system can deliver a word every clock cycle, it can return words just in time; but if the processor needs another word from a different block before the previous transfer is complete, it will have to stall until the memory is no longer busy, unless you have a nonblocking cache. Nonblocking caches come in two flavors: hit under miss (allow additional cache hits during a miss, with the goal of hiding some of the miss latency) and miss under miss (allow multiple outstanding cache misses; needs a high-bandwidth memory system to support it).

53 Cache Summary The Principle of Locality:
A program is likely to access a relatively small portion of the address space at any instant of time. Temporal Locality: locality in time. Spatial Locality: locality in space. Three major categories of cache misses: compulsory misses (sad facts of life; example: cold start misses), conflict misses (increase cache size and/or associativity; nightmare scenario: the ping-pong effect!), and capacity misses (increase cache size). Cache design space: total size, block size, associativity (and replacement policy); write-hit policy (write-through, write-back); write-miss policy (write allocate, write buffers). To summarize: the memory hierarchy works because of the principle of locality, which says a program accesses a relatively small portion of the address space at any instant of time; there are two types, temporal (in time) and spatial (in space). Compulsory misses are cache misses due to cold start; you cannot avoid them, but if you are going to run billions of instructions anyway they usually don't matter. Conflict misses are caused by multiple memory locations being mapped to the same cache location; the nightmare scenario is the ping-pong effect, when a block is read into the cache but is forced out by another conflict miss before we get a chance to use it; you can reduce conflict misses by increasing the cache size, the associativity, or both. Capacity misses occur when the cache is not big enough to contain all the blocks required by the program; reduce them by making the cache larger. There are two write-hit policies: write-through, which requires a write buffer (the nightmare scenario being stores so frequent that they saturate the buffer), and write-back, where you write only to the cache and write the block back to memory when it is replaced. No fancy replacement policy is needed for a direct mapped cache; that is what causes the direct mapped cache trouble in the first place: there is only one place to go in the cache, causing conflict misses.

54 Performance Summary
When CPU performance increases, the miss penalty becomes more significant. Decreasing the base CPI means a greater proportion of time is spent on memory stalls. Increasing the clock rate means memory stalls account for more CPU cycles. We can't neglect cache behavior when evaluating system performance.

55 Multilevel Caches Primary cache attached to CPU
Small, but fast Level-2 cache services misses from primary cache Larger, slower, but still faster than main memory Main memory services L-2 cache misses Some high-end systems include L-3 cache Fourth Edition 11/11/2018

56 Multilevel Cache Example
Given: CPU base CPI = 1, clock rate = 4 GHz, miss rate/instruction = 2%, main memory access time = 100 ns. With just the primary cache: miss penalty = 100 ns / 0.25 ns = 400 cycles; effective CPI = 1 + 0.02 × 400 = 9.

57 Example (cont.) Now add L-2 cache Primary miss with L-2 hit
Access time = 5 ns; global miss rate to main memory = 0.5%. Primary miss with L-2 hit: penalty = 5 ns / 0.25 ns = 20 cycles. Primary miss with L-2 miss: extra penalty = 400 cycles (the main memory access). CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4. Performance ratio = 9/3.4 = 2.6.
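Both CPI figures follow from the miss rates and penalties; a minimal sketch using the slide's numbers (0.25 ns cycle time at 4 GHz):

```python
cycle_ns = 0.25                  # 4 GHz clock
l1_miss_rate = 0.02              # primary misses per instruction
mem_penalty = 100 / cycle_ns     # 100 ns main memory access = 400 cycles
l2_penalty = 5 / cycle_ns        # 5 ns L2 access = 20 cycles
global_miss_rate = 0.005         # misses that go all the way to main memory, per instruction

cpi_l1_only = 1 + l1_miss_rate * mem_penalty                                   # 1 + 0.02*400 = 9
cpi_with_l2 = 1 + l1_miss_rate * l2_penalty + global_miss_rate * mem_penalty   # 3.4
print(cpi_l1_only, round(cpi_with_l2, 2), round(cpi_l1_only / cpi_with_l2, 1)) # 9.0 3.4 2.6
```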

58 Multilevel Cache Considerations
Primary cache Focus on minimal hit time L-2 cache Focus on low miss rate to avoid main memory access Hit time has less overall impact Results L-1 cache usually smaller than a single cache L-1 block size smaller than L-2 block size Fourth Edition 11/11/2018

59 Virtual Memory Use main memory as a “cache” for secondary (disk) storage Managed jointly by CPU hardware and the operating system (OS) Programs share main memory Each gets a private virtual address space holding its frequently used code and data Protected from other programs CPU and OS translate virtual addresses to physical addresses VM “block” is called a page VM translation “miss” is called a page fault §5.4 Virtual Memory Fourth Edition 11/11/2018

60 Address Translation What is the page size?
Fixed-size pages (4K) What is the size of page table entry? 18 bits + 1 bit for valid bit + 1 reference bit = 20 bits; typically 32 bits for indexing convenience What is the size of the page table? Fourth Edition 11/11/2018

61 Page Table Size Typical Parameters: 32-bit virtual address; 4 KB pages; 4 bytes per page table entry Number of PTEs = virtual address space / page size = 232/212 = 220 Size of page table = 220 PTEs X 4 bytes/PTE = 222 bytes = 4MB Each program has its own page table. So if 1000 programs are running then all page tables will use 4GB of main memory. Approaches to limit page table Most programs do not use entire virtual space. Use a limit register to restrict size – if a user goes above limit then increase page table size. Store pages on disk – sort of like virtual memory. 11/11/2018

62 Page Fault Penalty On page fault, the page must be fetched from disk
Takes millions of clock cycles Handled by OS code Try to minimize page fault rate Fully associative placement Smart replacement algorithms Fourth Edition 11/11/2018

63 Page Tables Stores placement information If page is present in memory
Array of page table entries, indexed by virtual page number Page table register in CPU points to page table in physical memory If page is present in memory PTE stores the physical page number Plus other status bits (referenced, dirty, …) If page is not present PTE can refer to location in swap space on disk Fourth Edition 11/11/2018

64 Translation Using a Page Table
[Figure: translating a virtual address to a physical address using the page table.]

65 Mapping Pages to Storage
[Figure: page table entries mapping virtual pages to physical memory or to disk storage.]

66 Replacement and Writes
To reduce page fault rate, prefer least-recently used (LRU) replacement Reference bit (aka use bit) in PTE set to 1 on access to page Periodically cleared to 0 by OS A page with reference bit = 0 has not been used recently Disk writes take millions of cycles Block at once, not individual locations Write through is impractical Use write-back Dirty bit in PTE set when page is written Fourth Edition 11/11/2018

67 Fast Translation Using a TLB
Address translation would appear to require extra memory references One to access the PTE Then the actual memory access But access to page tables has good locality So use a fast cache of PTEs within the CPU Called a Translation Look-aside Buffer (TLB) Typical: 16–512 PTEs, 0.5–1 cycle for hit, 10–100 cycles for miss, 0.01%–1% miss rate Misses could be handled by hardware or software Fourth Edition 11/11/2018
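Conceptually, a TLB is a small cache of PTEs keyed by virtual page number; the sketch below is a minimal, fully associative model with LRU replacement. The 16-entry size, the dict-based page table, and the example mapping are illustrative assumptions, not details from the slide:

```python
from collections import OrderedDict

class TLB:
    def __init__(self, entries=16):
        self.entries = entries
        self.map = OrderedDict()            # virtual page number -> physical page number

    def translate(self, vpn, page_table):
        if vpn in self.map:                 # TLB hit: no page-table access needed
            self.map.move_to_end(vpn)
            return self.map[vpn]
        ppn = page_table[vpn]               # TLB miss: load the PTE from memory (or page fault)
        if len(self.map) >= self.entries:
            self.map.popitem(last=False)    # evict the least recently used entry
        self.map[vpn] = ppn
        return ppn

tlb = TLB()
page_table = {0x12345: 0x00abc}             # hypothetical single mapping
print(hex(tlb.translate(0x12345, page_table)))   # miss the first time, cached afterwards
```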

68 Fast Translation Using a TLB
[Figure: the TLB acting as a cache on the page table during address translation.]

69 TLB Misses If page is in memory If page is not in memory (page fault)
Load the PTE from memory and retry Could be handled in hardware Can get complex for more complicated page table structures Or in software Raise a special exception, with optimized handler If page is not in memory (page fault) OS handles fetching the page and updating the page table Then restart the faulting instruction Fourth Edition 11/11/2018

70 TLB Miss Handler TLB miss indicates
A TLB miss indicates either that the page is present but its PTE is not in the TLB, or that the page is not present. We must recognize the TLB miss before the destination register is overwritten, and raise an exception. The handler copies the PTE from memory to the TLB and then restarts the instruction; if the page is not present, a page fault will occur.

71 Page Fault Handler Use faulting virtual address to find PTE
Locate page on disk Choose page to replace If dirty, write to disk first Read page into memory and update page table Make process runnable again Restart from faulting instruction Fourth Edition 11/11/2018

72 TLB and Cache Interaction
If cache tag uses physical address Need to translate before cache lookup Alternative: use virtual address tag Complications due to aliasing Different virtual addresses for shared physical address 11/11/2018 Fourth Edition

73 The Memory Hierarchy The BIG Picture
Common principles apply at all levels of the memory hierarchy Based on notions of caching At each level in the hierarchy Block placement Finding a block Replacement on a miss Write policy §5.5 A Common Framework for Memory Hierarchies 11/11/2018 Fourth Edition

74 Concluding Remarks Fast memories are small, large memories are slow
We really want fast, large memories; caching gives this illusion. Principle of locality: programs use a small part of their memory space frequently. Memory hierarchy: L1 cache ↔ L2 cache ↔ … ↔ DRAM memory ↔ disk. Memory system design is critical for multiprocessors. (§5.12 Concluding Remarks)

