Presentation on theme: "CSE 331 Computer Organization and Design Fall 2007 Week 15"— Presentation transcript:

1 CSE 331 Computer Organization and Design Fall 2007 Week 15
Section 1: Mary Jane Irwin ( Section 2: Krishna Narayanan Course material on ANGEL: cms.psu.edu [adapted from D. Patterson slides]

2 Head’s Up
Last week’s material: intro to pipelined datapath design; memory design.
This week’s material: memory hierarchies, intro to cache design. Reading assignment – PH:
Final exam week: the Final Exam is Tues., Dec. 18th, 10:10 to noon, 112 Walker.
Reminders: HW 8 (last) will be due Dec. 12 (by 11:55pm); Quiz 8 (last) will close Dec. 13 (by 11:55pm); Dec. 12th is the deadline for filing grade corrections/updates.

3 Review: Major Components of a Computer
[Diagram: the five classic components – Input and Output devices, Memory (cache, main memory, secondary memory/disk), and a processor made up of the Datapath and Control.] That is, any computer, no matter how primitive or advanced, can be divided into five parts: 1. The input devices bring the data from the outside world into the computer. 2. These data are kept in the computer's memory until ... 3. the datapath requests and processes them. 4. The operation of the datapath is controlled by the computer's controller. All the work done by the computer will NOT do us any good unless we can get the data back to the outside world. 5. Getting the data back to the outside world is the job of the output devices. The most COMMON way to connect these five components together is to use a network of busses. Workstation design target: 25% of cost on the processor, 25% of cost on memory (minimum memory size), the rest on I/O devices, power supplies, and the box.

4 P6’s clock rate was much higher, and if the product road map were to unfold over the next several years the way we had envisioned (Footnote: The roadmap did unfold exactly that way. More magic from Moore’s Law.) it would go much higher still. Main memory comprised dynamic RAMs which were improving with time, but mostly in capacity, not in speed. This meant that the disparity between fast CPUs and relatively slow main memory was bad and about to get much worse. The Pentium Chronicles, Colwell, pg. 76

5 The “Memory Wall” The processor vs. DRAM speed disparity continues to grow. [Chart: clocks per DRAM access and clocks per instruction (CPI) diverging over time.] Good memory hierarchy (cache) design is increasingly important to overall performance.

6 The Memory Hierarchy
Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.
[Diagram: Processor – L1$ – L2$ – Main Memory – Secondary Memory, with transfer units of 4-8 bytes (word), 8-32 bytes (block), 1 to 4 blocks, and 1,024+ bytes (disk sector = page); access time and the (relative) size of the memory at each level increase with distance from the processor.]
Inclusive – what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory.

7 The Memory Hierarchy: Why Does it Work?
Temporal Locality (Locality in Time): keep most recently accessed data items closer to the processor. Spatial Locality (Locality in Space): move blocks consisting of contiguous words to the upper levels. [Diagram: an upper level memory close to the processor and a lower level memory farther away; block X sits in the upper level, block Y in the lower level.] How does the memory hierarchy work? Well, it is rather simple, at least in principle. In order to take advantage of temporal locality, that is, locality in time, the memory hierarchy will keep the more recently accessed data items closer to the processor because chances are (points to the principle) the processor will access them again soon. In order to take advantage of spatial locality, not ONLY do we move the item that has just been accessed to the upper level, but we ALSO move the data items that are adjacent to it.

8 The Memory Hierarchy: Terminology
Hit: data is in some block in the upper level (Blk X). Hit Rate: fraction of memory accesses found in the upper level. Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss. Miss: data is not in the upper level, so it needs to be retrieved from a block in the lower level (Blk Y). Miss Rate = 1 - (Hit Rate). Miss Penalty: time to bring in a block from the lower level and replace a block in the upper level with it + time to deliver the block to the processor. Hit Time << Miss Penalty. [Diagram: blocks X and Y in the upper and lower level memories, with data moving to and from the processor.] A HIT is when the data the processor wants to access is found in the upper level (Blk X). The fraction of memory accesses that are hits is defined as the hit rate. Hit time is the time to access the upper level where the data is found (X). It consists of: (a) the time to access this level, (b) AND the time to determine if this is a hit or miss. If the data the processor wants cannot be found in the upper level, then we have a miss and we need to retrieve the data (Blk Y) from the lower level. By definition (definition of hit rate), the miss rate is just 1 minus the hit rate. The miss penalty also consists of two parts: (a) the time it takes to replace a block (Blk Y for Blk X) in the upper level, and (b) the time it takes to deliver this new block to the processor. It is very important that your hit time be much, much smaller than your miss penalty. Otherwise, there will be no reason to build a memory hierarchy.

9 How is the Hierarchy Managed?
registers ↔ memory: by the compiler (programmer?)
cache ↔ main memory: by the cache controller hardware
main memory ↔ disks: by the operating system (virtual memory), with virtual to physical address mapping assisted by the hardware (TLB), and by the programmer (files)

10 Two questions to answer (in hardware): Q1: How do we know if a data item is in the cache? Q2: If it is, how do we find it?
Direct mapped: for each item of data at the lower level, there is exactly one location in the cache where it might be – so lots of items at the lower level must share locations in the upper level.
Address mapping: (block address) modulo (# of blocks in the cache).
First consider block sizes of one word.
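As a minimal illustrative sketch of this placement rule (not course code; the four-block cache size is just the toy configuration used on the following slides), the index computation is a single modulo:

```c
#include <stdio.h>

/* Direct-mapped placement: each memory block can live in exactly one
   cache block, chosen as (block address) modulo (number of cache blocks).
   NUM_BLOCKS = 4 is an illustrative value matching the small example cache. */
#define NUM_BLOCKS 4

unsigned cache_index(unsigned block_addr) {
    return block_addr % NUM_BLOCKS;   /* same as block_addr & (NUM_BLOCKS - 1) */
}

int main(void) {
    /* Memory blocks 0, 4, 8, 12, ... all share cache block 0. */
    for (unsigned b = 0; b < 16; b++)
        printf("memory block %2u -> cache block %u\n", b, cache_index(b));
    return 0;
}
```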

11 Caching: A Simple First Example
[Figure: main memory words 0000xx through 1111xx – the two low order bits define the byte in the (32b) word – and a four-entry cache with Valid, Tag, and Data fields at indices 00, 01, 10, and 11.]
Q2: How do we find it? Use the next 2 low order memory address bits – the index – to determine which cache block (i.e., modulo the number of blocks in the cache).
Q1: Is it there? Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache.
For lecture: put the picture on the board for the worksheets that come later – a 32 bit address box with the two low order bits as the byte in the word, the next "chunk" of bits as the INDEX, and the most significant "chunk" of bits as the TAG. Then give a specific example for the worksheets to follow of a (word) address box with 2 INDEX bits and 2 TAG bits, and fill them in with specific addresses during the following examples.
The valid bit indicates whether an entry contains valid information – if the bit is not set, there cannot be a match for this block.
(block address) modulo (# of blocks in the cache)

12 Direct Mapped Cache
Consider the main memory word reference string 0 1 2 3 4 3 4 15. Start with an empty cache – all blocks initially marked as not valid.
0 miss (00 Mem(0)), 1 miss (00 Mem(1)), 2 miss (00 Mem(2)), 3 miss (00 Mem(3)), 4 miss (01 Mem(4) replaces 00 Mem(0) in cache block 00), 3 hit, 4 hit, 15 miss (11 Mem(15) replaces 00 Mem(3)).
8 requests, 6 misses
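The trace above can be checked with a small simulation. This is an illustrative sketch, assuming word addresses, one-word blocks, and the four-block direct-mapped cache from the previous slides; it reproduces the 6 misses:

```c
#include <stdio.h>
#include <stdbool.h>

#define NUM_BLOCKS 4   /* four one-word blocks, as in the example cache */

struct line { bool valid; unsigned tag; };

int main(void) {
    struct line cache[NUM_BLOCKS] = {0};             /* all blocks start not valid */
    unsigned trace[] = {0, 1, 2, 3, 4, 3, 4, 15};    /* word reference string */
    int misses = 0;

    for (int i = 0; i < 8; i++) {
        unsigned addr  = trace[i];
        unsigned index = addr % NUM_BLOCKS;          /* which cache block */
        unsigned tag   = addr / NUM_BLOCKS;          /* remaining high-order bits */
        if (cache[index].valid && cache[index].tag == tag) {
            printf("%2u hit\n", addr);
        } else {
            printf("%2u miss\n", addr);
            cache[index].valid = true;               /* fetch block, replace old one */
            cache[index].tag   = tag;
            misses++;
        }
    }
    printf("8 requests, %d misses\n", misses);       /* prints 6 misses */
    return 0;
}
```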

13 MIPS Direct Mapped Cache Example
One word/block, cache size = 1K words. [Diagram: a 1024-entry direct-mapped cache with Valid, 20-bit Tag, and 32-bit Data fields; the 32-bit address is split into a 20-bit tag, a 10-bit index, and a 2-bit byte offset, and a hit is signaled when the indexed entry is valid and its tag matches.]
Let's use a specific example with realistic numbers: assume we have a 1K word (4 Kbyte) direct mapped cache with a block size equal to 4 bytes (1 word). In other words, each block associated with the cache tag will have 4 bytes in it (Row 1). With a block size of 4 bytes, the 2 least significant bits of the address will be used as the byte select within the cache block. Since the cache size is 1K words, the upper 32 minus (10+2) bits, or 20 bits, of the address will be stored as the cache tag. The rest of the (10) address bits in the middle, that is, bits 2 through 11, will be used as the cache index to select the proper cache entry.
What kind of locality are we taking advantage of? Temporal!
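As a hedged illustration of the field breakdown in the notes (the address value is made up), the tag, index, and byte offset for this 1K-word cache can be extracted like this:

```c
#include <stdio.h>
#include <stdint.h>

/* Field extraction for the 1K-word (4 KB), one-word-per-block, direct-mapped
   cache described above: 2-bit byte offset, 10-bit index, 20-bit tag.
   The address value is only an example. */
int main(void) {
    uint32_t addr = 0x1234ABCD;

    uint32_t byte_offset = addr & 0x3;            /* bits 1:0   */
    uint32_t index       = (addr >> 2) & 0x3FF;   /* bits 11:2  */
    uint32_t tag         = addr >> 12;            /* bits 31:12 */

    printf("address     = 0x%08X\n", addr);
    printf("byte offset = %u\n", byte_offset);
    printf("index       = %u\n", index);
    printf("tag         = 0x%05X\n", tag);
    return 0;
}
```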

14 Handling Cache Hits
Read hits (I$ and D$): this is what we want!
Write hits (D$ only):
allow the cache and memory to be inconsistent – write the data only into the cache (then write-back the cache contents to memory when that cache block is "evicted"); needs a dirty bit for each cache block to tell if it must be written back to memory when it is evicted
require the cache and memory to be consistent – always write the data into both the cache and the memory (write-through); no dirty bit is needed, but writes run at the speed of the main memory – slow! – or can use a write buffer, so the pipeline only has to stall if the write buffer is full
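The two write-hit policies can be sketched in code. This is only a toy model – the line structure, the memory[] array standing in for the next level, and the function names are invented for the example – not the actual cache controller logic:

```c
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define WORDS 16
static uint32_t memory[WORDS];    /* stands in for the next level of the hierarchy */

struct line { bool valid, dirty; unsigned tag; uint32_t data; };

/* Write-through: update the cache and the memory on every write hit. */
void write_through_hit(struct line *l, unsigned addr, uint32_t value) {
    l->data = value;
    memory[addr] = value;         /* cache and memory stay consistent */
}

/* Write-back: update only the cache and mark the line dirty. */
void write_back_hit(struct line *l, uint32_t value) {
    l->data  = value;
    l->dirty = true;
}

/* On eviction, a dirty write-back line must be written to memory first. */
void evict(struct line *l, unsigned addr) {
    if (l->dirty) memory[addr] = l->data;
    l->valid = l->dirty = false;
}

int main(void) {
    struct line a = { true, false, 0, 0 };
    write_through_hit(&a, 5, 42);
    printf("write-through: memory[5] = %u\n", (unsigned)memory[5]);

    struct line b = { true, false, 0, 0 };
    write_back_hit(&b, 99);
    printf("write-back before evict: memory[7] = %u\n", (unsigned)memory[7]);
    evict(&b, 7);
    printf("write-back after  evict: memory[7] = %u\n", (unsigned)memory[7]);
    return 0;
}
```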

15 Review: Why Pipeline? For Throughput!
To avoid a structural hazard we need two caches on-chip: one for instructions (I$) and one for data (D$). [Pipeline diagram: instructions 0 through 4 flowing through the I$, Reg, ALU, and D$ stages in successive clock cycles, so I$ and D$ are both busy every cycle.] To keep the pipeline running at its maximum rate, both I$ and D$ need to satisfy a request from the datapath every cycle. What happens when they can't do that?

16 Another Reference String Mapping
Consider the main memory word reference string 0 4 0 4 0 4 0 4. Start with an empty cache – all blocks initially marked as not valid.
0 miss (00 Mem(0) in cache block 00), 4 miss (01 Mem(4) replaces 00 Mem(0)), 0 miss (00 Mem(0) replaces 01 Mem(4)), 4 miss, 0 miss, 4 miss, 0 miss, 4 miss.
8 requests, 8 misses
Ping pong effect due to conflict misses – two memory locations that map into the same cache block.

17 Sources of Cache Misses
Compulsory (cold start or process migration, first reference): first access to a block; a "cold" fact of life, not a whole lot you can do about it. If you are going to run "millions" of instructions, compulsory misses are insignificant.
Conflict (collision): multiple memory locations mapped to the same cache location. Solution 1: increase cache size. Solution 2: increase associativity.
Capacity: the cache cannot contain all blocks accessed by the program. Solution: increase cache size.
(Capacity miss) That is, these cache misses are due to the fact that the cache is simply not large enough to contain all the blocks that are accessed by the program. The solution to reduce the capacity miss rate is simple: increase the cache size. Here is a summary of the other types of cache miss we talked about. First are the compulsory misses. These are the misses that we cannot avoid. They are caused when we first start the program. Then we talked about the conflict misses. They are the misses caused by multiple memory locations being mapped to the same cache location. There are two solutions to reduce conflict misses. The first one is, once again, to increase the cache size. The second one is to increase the associativity, for example, by using a 2-way set associative cache instead of a direct mapped cache. But keep in mind that the cache miss rate is only one part of the equation. You also have to worry about cache access time and miss penalty. Do NOT optimize miss rate alone. Finally, there is another source of cache miss we will not cover today. Those are referred to as invalidation misses, caused when another process, such as I/O, updates main memory so you have to flush the cache to avoid inconsistency between memory and cache.

18 Handling Cache Misses Read misses (I$ and D$) Write misses (D$ only)
stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache, send the requested word to the processor, and let the pipeline resume
Write misses (D$ only):
stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve having to evict a dirty block if using a write-back cache), write the word to the cache, and let the pipeline resume
or (normally used in write-back caches) Write allocate – just write the word into the cache (updating the tag too), no need to check for a cache hit, no need to stall
or (normally used in write-through caches with a write buffer) No-write allocate – skip the cache write (but must invalidate that cache block since it will now hold stale data) and just write the word to the write buffer (and eventually to the next memory level); no need to stall if the write buffer isn't full
Let's look at our 1KB direct mapped cache again. Assume we do a 16-bit write to memory location 0x that causes a cache miss in our 1KB direct mapped cache with 32-byte blocks. After we write the cache tag into the cache and write the 16-bit data into Byte 0 and Byte 1, do we have to read the rest of the block (Byte 2, 3, ... Byte 31) from memory? If we do read the rest of the block in, it is called write allocate. But stop and think for a second. Is it really necessary to bring in the rest of the block on a write miss? True, the principle of spatial locality implies that we are likely to access them soon. But the type of access we are going to do is likely to be another write. So even if we do read in the data, we may end up overwriting it anyway, so it is a common practice NOT to read in the rest of the block on a write miss. If you don't bring in the rest of the block, or to use the more technical term, write no-allocate, you had better have some way to tell the processor the rest of the block is no longer valid. This brings us to the topic of sub-blocking.

19 Taking Advantage of Spatial Locality
Let the cache block hold more than one word. Consider the same main memory word reference string 0 1 2 3 4 3 4 15, now with two-word blocks. Start with an empty cache – all blocks initially marked as not valid.
0 miss (00 Mem(1) Mem(0)), 1 hit, 2 miss (00 Mem(3) Mem(2)), 3 hit, 4 miss (01 Mem(5) Mem(4) replaces 00 Mem(1) Mem(0)), 3 hit, 4 hit, 15 miss (11 Mem(15) Mem(14) replaces 00 Mem(3) Mem(2)).
8 requests, 4 misses
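The same kind of simulation, extended to two-word blocks and the two-block cache shown above, reproduces the miss count (again an illustrative sketch, not course-provided code):

```c
#include <stdio.h>
#include <stdbool.h>

#define NUM_BLOCKS     2   /* two cache blocks, as in the table above */
#define WORDS_PER_BLK  2   /* two words per block */

struct line { bool valid; unsigned tag; };

int main(void) {
    struct line cache[NUM_BLOCKS] = {0};
    unsigned trace[] = {0, 1, 2, 3, 4, 3, 4, 15};    /* word reference string */
    int misses = 0;

    for (int i = 0; i < 8; i++) {
        unsigned block = trace[i] / WORDS_PER_BLK;   /* which memory block */
        unsigned index = block % NUM_BLOCKS;
        unsigned tag   = block / NUM_BLOCKS;
        if (cache[index].valid && cache[index].tag == tag) {
            printf("%2u hit\n", trace[i]);
        } else {
            printf("%2u miss\n", trace[i]);
            cache[index].valid = true;               /* fetch the whole two-word block */
            cache[index].tag   = tag;
            misses++;
        }
    }
    printf("8 requests, %d misses\n", misses);       /* prints 4 misses */
    return 0;
}
```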

20 Multiword Block Direct Mapped Cache
Four words/block, cache size = 1K words. [Diagram: a 256-entry direct-mapped cache with Valid, 20-bit Tag, and four 32-bit Data words per entry; the 32-bit address is split into a 20-bit tag, an 8-bit index, a 2-bit block offset, and a 2-bit byte offset, with the block offset selecting the word within the block.]
To take advantage of spatial locality we want a cache block that is larger than one word in size.
What kind of locality are we taking advantage of?
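As with the one-word-per-block cache, the address split can be illustrated in code; the 8-bit index and 2-bit block offset follow from the 256 four-word blocks described above, and the address value is only an example:

```c
#include <stdio.h>
#include <stdint.h>

/* Field extraction for the cache above: 1K words with four words per block,
   so 256 entries: 2-bit byte offset, 2-bit block (word) offset,
   8-bit index, 20-bit tag. */
int main(void) {
    uint32_t addr = 0x1234ABCD;

    uint32_t byte_offset  = addr & 0x3;           /* bits 1:0   */
    uint32_t block_offset = (addr >> 2) & 0x3;    /* bits 3:2   */
    uint32_t index        = (addr >> 4) & 0xFF;   /* bits 11:4  */
    uint32_t tag          = addr >> 12;           /* bits 31:12 */

    printf("byte offset  = %u\n", byte_offset);
    printf("block offset = %u\n", block_offset);
    printf("index        = %u\n", index);
    printf("tag          = 0x%05X\n", tag);
    return 0;
}
```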

21 Without thinking (enough, anyway) I said, “We learned many things about the public and its relationship to technology, and we realized that as bad as FDIV turned out to be, it would be far worse if Intel made the same mistake twice. So we pounded the crap out of the floating-point units in the P6.” Naturally, that is the quote they ended up using. … Some friends bought a new hammer, wrote “craphammer” on its handle, and left it on our front doorstep. My mother … threatened to wash my mouth out with soap. The Pentium Chronicles, Colwell, pg. 134

22 Miss Rate vs Block Size vs Cache Size
Miss rate goes up if the block size becomes a significant fraction of the cache size because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses)

23 Block Size Tradeoff
Larger block sizes take advantage of spatial locality, but: if the block size is too big relative to the cache size, the miss rate will go up; and a larger block size means a larger miss penalty (latency to the first word in the block + transfer time for the remaining words).
[Charts: miss rate vs. block size – exploits spatial locality at first, but fewer blocks compromises temporal locality; miss penalty vs. block size – increasing; average access time vs. block size – eventually increased by both miss penalty and miss rate.]
As I said earlier, block size is a tradeoff. In general, a larger block size will reduce the miss rate because it takes advantage of spatial locality. But remember, miss rate is NOT the only cache performance metric. You also have to worry about miss penalty. As you increase the block size, your miss penalty will go up because as the block gets larger, it will take you longer to fill up the block. Even if you look at miss rate by itself, which you should NOT, a bigger block size does not always win. As you increase the block size, keeping the cache size constant, your miss rate will drop off rapidly at the beginning due to spatial locality. However, once you pass a certain point, your miss rate actually goes up. As a result of these two curves, the average access time (point to equation), which is really the more important performance metric than the miss rate, will go down initially because the miss rate is dropping much faster than the increase in miss penalty. But eventually, as you keep on increasing the block size, the average access time can go up rapidly because not only is the miss penalty increasing, the miss rate is increasing as well. In general, Average Memory Access Time = Hit Time + Miss Penalty x Miss Rate.
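The average access time curve discussed in the notes follows directly from AMAT = Hit Time + Miss Rate × Miss Penalty. The miss rates and miss penalties below are invented purely to illustrate the U-shaped behavior as block size grows; they are not measurements:

```c
#include <stdio.h>

/* Average Memory Access Time = hit time + miss rate * miss penalty.
   The values are illustrative: penalties grow with block size while the
   miss rate first drops (spatial locality) and then rises (fewer blocks). */
int main(void) {
    double hit_time        = 1.0;                        /* cycles */
    double miss_penalty[]  = { 17, 26, 44, 80 };         /* grows with block size */
    double miss_rate[]     = { 0.10, 0.06, 0.04, 0.05 }; /* drops, then rises */

    for (int i = 0; i < 4; i++) {
        double amat = hit_time + miss_rate[i] * miss_penalty[i];
        printf("miss rate %.2f, miss penalty %4.0f -> AMAT = %.2f cycles\n",
               miss_rate[i], miss_penalty[i], amat);
    }
    return 0;
}
```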

24 Multiword Block Considerations
Read misses (I$ and D$): processed the same as for single word blocks – a miss returns the entire block from memory. The miss penalty grows as the block size grows: Early restart – the datapath resumes execution as soon as the requested word of the block is returned; Requested word first – the requested word is transferred from the memory to the cache (and datapath) first; Nonblocking cache – allows the datapath to continue to access the cache while the cache is handling an earlier miss.
Write misses (D$): can't use write allocate or we will end up with a "garbled" block in the cache (e.g., for 4 word blocks, a new tag, one word of data from the new block, and three words of data from the old block), so we must fetch the block from memory first and pay the stall time.
Early restart works best for instruction caches (since they are accessed largely sequentially) – if the memory system can deliver a word every clock cycle, it can return words just in time. But if the processor needs another word from a different block before the previous transfer is complete, then the processor will have to stall until the memory is no longer busy. Unless you have a nonblocking cache, which comes in two flavors: Hit under miss – allow additional cache hits during a miss, with the goal of hiding some of the miss latency; Miss under miss – allow multiple outstanding cache misses (need a high bandwidth memory system to support it).

25 Other Ways to Reduce Cache Miss Rates
Allow more flexible block placement: in a direct mapped cache a memory block maps to exactly one cache block; at the other extreme, we could allow a memory block to be mapped to any cache block – a fully associative cache; a compromise is to divide the cache into sets, each of which consists of n "ways" (n-way set associative).
Use multiple levels of caches: add a second level of cache on chip – normally a unified L2 cache (i.e., it holds both instructions and data). The L1 cache focuses on minimizing hit time in support of a shorter clock cycle (smaller, with smaller block sizes); the L2 cache focuses on reducing miss rate to reduce the penalty of long main memory access times (larger, with larger block sizes).
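A set-associative lookup differs from the direct-mapped one only in that the index selects a set and the tag is compared against every way in that set. The sketch below uses made-up sizes (8 sets, 2 ways) to show how two blocks that would conflict in a direct-mapped cache can coexist:

```c
#include <stdio.h>
#include <stdbool.h>

#define NUM_SETS 8
#define NUM_WAYS 2

struct line { bool valid; unsigned tag; };
struct line cache[NUM_SETS][NUM_WAYS];   /* all lines start not valid */

bool lookup(unsigned block_addr) {
    unsigned set = block_addr % NUM_SETS;
    unsigned tag = block_addr / NUM_SETS;
    for (int way = 0; way < NUM_WAYS; way++)          /* compare all ways in the set */
        if (cache[set][way].valid && cache[set][way].tag == tag)
            return true;                               /* hit */
    return false;                                      /* miss */
}

int main(void) {
    /* Blocks 0 and 8 map to the same set but can now coexist in two ways. */
    cache[0][0] = (struct line){ true, 0 };   /* block 0 */
    cache[0][1] = (struct line){ true, 1 };   /* block 8 */
    printf("block  0: %s\n", lookup(0)  ? "hit" : "miss");
    printf("block  8: %s\n", lookup(8)  ? "hit" : "miss");
    printf("block 16: %s\n", lookup(16) ? "hit" : "miss");
    return 0;
}
```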

26 Improving Cache Performance
Reduce the hit time: smaller cache; direct mapped cache; smaller blocks; for writes – no write allocate (just write to the write buffer) or write allocate (write to a delayed write buffer that then writes to the cache).
Reduce the miss penalty: smaller blocks; for large blocks, fetch the critical word first; use a write buffer and check the write buffer (and/or victim cache) on a read miss – may get lucky; use multiple cache levels – the L2 cache is not tied to the CPU clock rate; faster backing store / improved memory bandwidth (wider buses, SDRAMs).
Reduce the miss rate: bigger cache; associative cache; larger blocks (16 to 64 bytes); use a victim cache – a small buffer that holds the most recently discarded blocks.

27 Cache Summary The Principle of Locality:
Program likely to access a relatively small portion of the address space at any instant of time. Temporal Locality: locality in time. Spatial Locality: locality in space.
Three major categories of cache misses: Compulsory misses: sad facts of life, e.g., cold start misses. Conflict misses: increase cache size and/or associativity. Nightmare scenario: ping pong effect! Capacity misses: increase cache size.
Cache design space: total size, block size, associativity (replacement policy); write-hit policy (write-through, write-back); write-miss policy (write allocate, write buffers).
Let's summarize today's lecture. I know you have heard this many times and many ways, but it is still worth repeating. The memory hierarchy works because of the Principle of Locality, which says a program will access a relatively small portion of the address space at any instant of time. There are two types of locality: temporal locality, or locality in time, and spatial locality, or locality in space. So far, we have covered three major categories of cache misses. Compulsory misses are cache misses due to cold start. You cannot avoid them, but if you are going to run billions of instructions anyway, compulsory misses usually don't bother you. Conflict misses are misses caused by multiple memory locations being mapped to the same cache location. The nightmare scenario is the ping pong effect, when a block is read into the cache but, before we have a chance to use it, it is immediately forced out by another conflict miss. You can reduce conflict misses by either increasing the cache size or increasing the associativity, or both. Finally, capacity misses occur when the cache is not big enough to contain all the cache blocks required by the program. You can reduce this miss rate by making the cache larger. There are two write policies as far as cache writes are concerned. Write-through requires a write buffer, and a nightmare scenario is when stores occur so frequently that you saturate your write buffer. The second write policy is write-back. In this case, you only write to the cache, and only when the cache block is being replaced do you write the cache block back to memory. No fancy replacement policy is needed for a direct mapped cache. That is what caused the direct mapped cache trouble to begin with – only one place to go in the cache, causing conflict misses.

28 Review: (DDR) SDRAM Operation
Column address +1: after a row is read into the SRAM register, input CAS as the starting "burst" address along with a burst length; this transfers a burst of data (ideally a cache block) from a series of sequential addresses within that row, and a clock controls the transfer of successive words in the burst.
[Diagram: an N-row × N-column DRAM array with M bit planes feeding an N × M SRAM row register and an M-bit output; the timing diagram shows RAS with the row address, then CAS with the column address, then the 1st, 2nd, 3rd, and 4th M-bit accesses within one cycle time.]
(DDR) SDRAMs have become the RAM architecture of choice for building memories.

29 Memory Systems that Support Caches
The off-chip interconnect and memory architecture affect overall system performance dramatically.
[Diagram: an on-chip CPU and cache connected over a bus (32-bit data & 32-bit addr per cycle) to the main memory.]
Assume 1 clock cycle (1 ns) to send the address from the cache to the main memory; 50 ns (50 processor clock cycles) for the DRAM first word access time, with a 10 ns (10 clock cycle) cycle time (for the remaining words in a burst for SDRAM); and 1 clock cycle (1 ns) to return a word of data from the main memory to the cache.
Memory-bus to cache bandwidth = number of bytes accessed from main memory and transferred to the cache/CPU per clock cycle.

30 One Word Wide Memory Organization
If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory: 1 cycle to send the address, 50 cycles to read the DRAM, and 1 cycle to return the data – a 52 clock cycle miss penalty in total. The number of bytes transferred per clock cycle (bandwidth) for a miss is 4/52 = 0.077 bytes per clock.

31 Burst Memory Organization
What if the block size is four words and a (DDR) SDRAM is used? 1 cycle to send the 1st address, 50 + 3×10 = 80 cycles to read the DRAM, and 1 cycle to return the last data word – an 82 clock cycle miss penalty in total. The number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 × 4)/82 = 0.195 bytes per clock.
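Both miss penalties and both bandwidth figures can be checked by plugging the stated timings into a short calculation (an illustrative sketch of the arithmetic on the last two slides):

```c
#include <stdio.h>

/* Miss penalty and bus bandwidth for the two organizations above, using the
   stated timings: 1 cycle to send the address, 50 cycles for the first DRAM
   access, 10 cycles per additional word in a burst, 1 cycle to return a word. */
int main(void) {
    int addr = 1, first = 50, burst = 10, ret = 1;
    int word_bytes = 4;

    /* One-word-wide memory, one-word blocks */
    int penalty1 = addr + first + ret;                /* 52 cycles */
    printf("one-word block : %2d cycles, %.3f bytes/cycle\n",
           penalty1, (double)word_bytes / penalty1);

    /* Four-word block filled by a (DDR) SDRAM burst */
    int penalty4 = addr + first + 3 * burst + ret;    /* 82 cycles */
    printf("four-word block: %2d cycles, %.3f bytes/cycle\n",
           penalty4, (double)(4 * word_bytes) / penalty4);
    return 0;
}
```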

32 DRAM Memory System Summary
It's important to match the cache characteristics – caches access one block at a time (usually more than one word) – with the DRAM characteristics – use DRAMs that support fast multiple word accesses, preferably ones that match the block size of the cache – and with the memory-bus characteristics – make sure the memory bus can support the DRAM access rates and patterns – with the goal of increasing the memory-bus to cache bandwidth.

33 Final Exam: brief review of material covered on the final. Questions?

34 You can spend your entire life as a design engineer and never have the good fortune to find yourself on a team of incredible people like I did on the P6 project. It is an experience that will elevate your own abilities, give you … the incredible sensation of having your own intellect amplified beyond reason by proximity to so many brilliant people. Engineering is fun, but engineering in a project like p6 is sublime. … Hold every project you find yourself in to the highest standard. Expect your team to attain greatness and never settle for less. The Pentium Chronicles, Colwell, pg. 168

