CSE 331 Computer Organization and Design, Fall 2006, Week 15
Section 1: Mary Jane Irwin (www.cse.psu.edu/~mji)
Section 2: Feihui Li (www.cse.psu.edu/~feli)
Course material on ANGEL: cms.psu.edu
[adapted from D. Patterson slides]
Head's Up
Last week's material: intro to pipelined datapath design; memory design
This week's material: memory hierarchies
- Reading assignment – PH: 7.1-7.2
Final exam week
- Final Exam is Tuesday, Dec 19, 4:40 to 6:30, 110 Business
Reminders
- HW 8 (last) is due Dec. 14 (by 11:55pm)
- Quiz 7 (last) is due Dec. 15 (by 11:55pm)
- Dec. 13th is the deadline for filing grade corrections/updates
"To say that the P6 project had its share of people issues would be an understatement. Any effort that combines extreme creativity, mind-bending stress, and colorful personalities has to account for the 'people factor,' ... The people factor is probably the least understood aspect of hardware design, and nothing you learn in graduate school compares with what a day-to-day, long-term, high-visibility, pressure-cooker project can teach you about your coworkers, or yourself."
– The Pentium Chronicles, Colwell, pg. 137
Review: Major Components of a Computer
[Diagram: Processor (Control, Datapath); Memory hierarchy (Cache, Main Memory, Secondary Memory (Disk)); Input and Output devices]
The Memory Hierarchy
Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.
[Diagram: Processor – L1$ (4-8 bytes per word) – L2$ (8-32 bytes per block) – Main Memory (1 to 4 blocks) – Secondary Memory (1,024+ bytes; disk sector = page), with access time and (relative) size increasing with distance from the processor]
Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in Main Memory, which is a subset of what is in Secondary Memory.
The Memory Hierarchy: Why Does it Work?
Temporal Locality (locality in time): keep the most recently accessed instructions/data closer to the processor.
Spatial Locality (locality in space): move blocks consisting of contiguous words to the upper levels.
[Diagram: Blk X in the upper level memory, Blk Y in the lower level memory; data flows to/from the processor through the upper level]
The Memory Hierarchy: Terminology
Hit: data is in some block in the upper level (Blk X)
- Hit Rate: the fraction of memory accesses found in the upper level
- Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
Miss: data is not in the upper level, so it must be retrieved from a block in the lower level (Blk Y)
- Miss Rate = 1 - Hit Rate
- Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
- Hit Time << Miss Penalty
How is the Hierarchy Managed?
registers <-> memory: by the compiler (programmer?)
cache <-> main memory: by the cache controller hardware
main memory <-> disks: by the operating system (virtual memory), with virtual-to-physical address mapping assisted by the hardware (TLB), and by the programmer (files)
Why? The "Memory Wall"
The processor vs. DRAM speed "gap" continues to grow.
[Plot: clocks per instruction vs. clocks per DRAM access, diverging over time]
Good cache design is increasingly important to overall performance.
The Cache
Two questions to answer (in hardware):
- Q1: How do we know if a data item is in the cache?
- Q2: If it is, how do we find it?
Direct mapped
- For each item of data at the lower level, there is exactly one location in the cache where it might be, so lots of items at the lower level must share locations in the upper level.
- Address mapping: (block address) modulo (# of blocks in the cache)
- First consider block sizes of one word.
Caching: A Simple First Example
[Diagram: a 4-block cache (indexes 00, 01, 10, 11, each block holding Valid, Tag, and Data fields) mapped onto a 16-word main memory (addresses 0000xx through 1111xx, where the two low-order bits select the byte within the 32-bit word)]
Q1: Is it there? Compare the cache tag to the high-order 2 memory address bits to tell if the memory block is in the cache.
Q2: How do we find it? Use the next 2 low-order memory address bits – the index – to determine which cache block, i.e., (block address) modulo (# of blocks in the cache).
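As a concrete illustration of the mapping (a minimal sketch; the constants and names are ours, not from the slides), the field split for this 4-block, one-word-per-block cache can be computed directly:

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_BLOCKS 4   /* one-word blocks, as in the example above */

int main(void) {
    uint32_t addr  = 0x34;                 /* an example byte address */
    uint32_t block = addr >> 2;            /* drop the 2-bit byte offset */
    uint32_t index = block % NUM_BLOCKS;   /* next 2 low-order bits */
    uint32_t tag   = block / NUM_BLOCKS;   /* remaining high-order bits */
    printf("addr 0x%02x -> block %u, index %u, tag %u\n",
           addr, block, index, tag);       /* 0x34 -> block 13, index 1, tag 3 */
    return 0;
}
```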
Direct Mapped Cache
Consider the main memory word reference string 0 1 2 3 4 3 4 15 against a 4-block, one-word-per-block cache. Start with an empty cache – all blocks initially marked as not valid.
  0: miss   1: miss   2: miss   3: miss
  4: miss (maps to index 0, evicting word 0)   3: hit   4: hit   15: miss (maps to index 3, evicting word 3)
8 requests, 6 misses
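The trace above is easy to check mechanically. Below is a minimal direct-mapped cache simulator (a sketch; the program and its parameter names are ours, not from the slides); with 4 one-word blocks it reports the same 6 misses:

```c
#include <stdio.h>

#define NUM_BLOCKS      4   /* number of cache blocks */
#define WORDS_PER_BLOCK 1   /* words per block */

int main(void) {
    int trace[] = {0, 1, 2, 3, 4, 3, 4, 15};    /* word reference string */
    int n = (int)(sizeof trace / sizeof trace[0]);
    int valid[NUM_BLOCKS] = {0};
    int tag[NUM_BLOCKS] = {0};
    int misses = 0;

    for (int i = 0; i < n; i++) {
        int block = trace[i] / WORDS_PER_BLOCK; /* block address */
        int index = block % NUM_BLOCKS;         /* which cache block */
        int t     = block / NUM_BLOCKS;         /* high-order address bits */
        if (valid[index] && tag[index] == t) {
            printf("word %2d: hit\n", trace[i]);
        } else {
            printf("word %2d: miss\n", trace[i]);
            valid[index] = 1;                   /* install the new block */
            tag[index]   = t;
            misses++;
        }
    }
    printf("%d requests, %d misses\n", n, misses);
    return 0;
}
```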
MIPS Direct Mapped Cache Example
One word/block, cache size = 1K words.
[Diagram: the 32-bit address (bits 31...0) splits into a 20-bit Tag (bits 31-12), a 10-bit Index (bits 11-2), and a 2-bit Byte offset (bits 1-0); the index selects one of 1024 entries, each holding Valid, Tag, and 32-bit Data fields; Hit is asserted when the entry is valid and its tag matches, and the 32-bit data word is returned]
What kind of locality are we taking advantage of?
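As a check on the field widths (standard arithmetic, not stated on the slide): 1K words = 2^10 blocks, so the index needs 10 bits; with a 2-bit byte offset, the tag is 32 - 10 - 2 = 20 bits. The total storage is 2^10 x (32 data + 20 tag + 1 valid) = 53 Kbits for 32 Kbits of actual data.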
Handling Cache Hits
Read hits (I$ and D$)
- this is what we want!
Write hits (D$ only) – two policies (sketched in code below)
- allow cache and memory to be inconsistent (write-back)
  - write the data only into the cache (then write-back the cache contents to the memory when that cache block is "evicted")
  - need a dirty bit for each cache block to tell if it needs to be written back to memory when it is evicted
- require the cache and memory to be consistent (write-through)
  - always write the data into both the cache and the memory
  - don't need a dirty bit
  - writes run at the speed of the main memory – slow! – or can use a write buffer, so we only have to stall if the write buffer is full
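A minimal sketch of the two write-hit policies (the structure and stub names are hypothetical, ours rather than the slides'):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 1024

struct line { bool valid; bool dirty; uint32_t tag; uint32_t data; };
static struct line cache[NUM_BLOCKS];

/* Stub standing in for the memory (or write-buffer) traffic. */
static void memory_write(uint32_t addr, uint32_t word) { (void)addr; (void)word; }

/* Write-back: update only the cache and mark the block dirty;
   memory is brought up to date later, when the block is evicted. */
void write_hit_write_back(uint32_t index, uint32_t word) {
    cache[index].data  = word;
    cache[index].dirty = true;
}

/* Write-through: update the cache and the memory together, so they
   never disagree; no dirty bit is needed, and a write buffer can hide
   the memory latency unless it fills up. */
void write_hit_write_through(uint32_t index, uint32_t addr, uint32_t word) {
    cache[index].data = word;
    memory_write(addr, word);
}
```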
Review: Why Pipeline? For Throughput!
[Pipeline diagram: instructions 0 through 4, each passing through I$, Reg, ALU, D$, Reg stages in successive clock cycles]
To keep the pipeline running at its maximum rate, both I$ and D$ need to satisfy a request from the datapath every cycle. What happens when they can't do that?
To avoid a structural hazard we need two caches on-chip: one for instructions (I$) and one for data (D$).
Another Reference String Mapping
Consider the main memory word reference string 0 4 0 4 0 4 0 4. Start with an empty cache – all blocks initially marked as not valid.
Words 0 and 4 both map to cache block 0 (0 mod 4 = 4 mod 4 = 0), so each reference evicts the other and every access misses.
8 requests, 8 misses
Ping-pong effect due to conflict misses – two memory locations that map into the same cache block.
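Under the same assumptions, swapping the trace in the simulator sketch above to {0, 4, 0, 4, 0, 4, 0, 4} reproduces the 8 misses: the tag at index 0 alternates between 0 and 1 on every access.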
Sources of Cache Misses
Compulsory (cold start or process migration; first reference)
- First access to a block; a "cold" fact of life, not a whole lot you can do about it.
- If you are going to run "millions" of instructions, compulsory misses are insignificant.
Conflict (collision)
- Multiple memory locations mapped to the same cache location.
- Solution 1: increase cache size. Solution 2: increase associativity.
Capacity
- Cache cannot contain all blocks accessed by the program.
- Solution: increase cache size.
Handling Cache Misses
Read misses (I$ and D$)
- Stall the pipeline, fetch the block from the next level in the memory hierarchy, write the word+tag into the cache, send the requested word to the processor, and let the pipeline resume.
Write misses (D$ only) – three options (options 2 and 3 are sketched in code below)
1. Stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve evicting a dirty block if using a write-back cache), write the word+tag into the cache, and let the pipeline resume.
2. Write allocate (normally used in write-back caches) – just write the word+tag into the cache (which may involve evicting a dirty block); no need to check for a cache hit, no need to stall.
3. No-write allocate (normally used in write-through caches with a write buffer) – skip the cache write (but invalidate that cache block, since it would otherwise hold stale data) and just write the word to the write buffer (and eventually to the next memory level); no need to stall unless the write buffer is full.
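A sketch of options 2 and 3, under the simplifying assumption of one-word blocks (the structures and stub names are hypothetical, not from the slides):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 1024

struct line { bool valid; bool dirty; uint32_t tag; uint32_t data; };
static struct line cache[NUM_BLOCKS];

/* Stubs standing in for the memory-side machinery. */
static void write_back_victim(uint32_t index) { (void)index; }
static void write_buffer_push(uint32_t block_addr, uint32_t word) { (void)block_addr; (void)word; }

/* Option 2: write allocate (write-back cache). Write the word+tag
   without checking for a hit; a dirty victim must be written back first. */
void write_miss_allocate(uint32_t block_addr, uint32_t word) {
    uint32_t index = block_addr % NUM_BLOCKS;
    uint32_t tag   = block_addr / NUM_BLOCKS;
    if (cache[index].valid && cache[index].dirty && cache[index].tag != tag)
        write_back_victim(index);               /* evict the dirty block */
    cache[index] = (struct line){ true, true, tag, word };
}

/* Option 3: no-write allocate (write-through cache + write buffer).
   Skip the cache; invalidate a matching block so it cannot go stale. */
void write_miss_no_allocate(uint32_t block_addr, uint32_t word) {
    uint32_t index = block_addr % NUM_BLOCKS;
    if (cache[index].valid && cache[index].tag == block_addr / NUM_BLOCKS)
        cache[index].valid = false;
    write_buffer_push(block_addr, word);        /* memory updated later */
}
```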
Multiword Block Direct Mapped Cache
Four words/block, cache size = 1K words.
[Diagram: the 32-bit address splits into a 20-bit Tag (bits 31-12), an 8-bit Index (bits 11-4), a 2-bit Block offset (bits 3-2), and a 2-bit Byte offset (bits 1-0); the index selects one of 256 entries, each holding Valid, Tag, and four 32-bit data words; on a hit, the block offset selects the word within the block]
What kind of locality are we taking advantage of?
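Checking the field widths again (our arithmetic, not the slide's): 1K words / 4 words per block = 256 = 2^8 blocks, so the index is 8 bits; the 2-bit block offset picks one of the 4 words, the 2-bit byte offset picks the byte within the word, and the tag is the remaining 32 - 8 - 2 - 2 = 20 bits.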
Taking Advantage of Spatial Locality
Let the cache block hold more than one word. Consider the main memory word reference string 0 1 2 3 4 3 4 15 against a cache of two two-word blocks. Start with an empty cache – all blocks initially marked as not valid.
  0: miss (loads words 0-1)   1: hit   2: miss (loads words 2-3)   3: hit
  4: miss (loads words 4-5, evicting words 0-1)   3: hit   4: hit   15: miss (loads words 14-15)
8 requests, 4 misses
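Again this is easy to verify with the simulator sketch above: setting NUM_BLOCKS to 2 and WORDS_PER_BLOCK to 2 (leaving the trace as {0, 1, 2, 3, 4, 3, 4, 15}) reports the same 4 misses.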
End of First Lecture of the Week
"We concluded that no single quantifiable metric could reliably predict career success. However, we also felt that fast-trackers have identifiable qualities and show definite trends. One quality is a whatever-it-takes attitude. All high-output engineers have a willingness to do whatever it takes to make a project succeed, especially their particular corner of it. If that means working weekends, doing extracurricular research, or rewriting tools, they do it."
– The Pentium Chronicles, Colwell, pg. 140
Miss Rate vs Block Size vs Cache Size
Miss rate goes up if the block size becomes a significant fraction of the cache size, because the number of blocks that a cache of a given size can hold is smaller (increasing capacity misses).
Block Size Tradeoff
Larger block sizes take advantage of spatial locality, but:
- Larger block size means larger miss penalty: latency to the first word in the block + transfer time for the remaining words.
- If the block size is too big relative to the cache size, the miss rate will go up: fewer blocks compromises temporal locality.
[Plots vs. block size: miss penalty rises steadily; miss rate first falls (exploits spatial locality), then rises (fewer blocks compromises temporal locality); average access time has a minimum, then grows with the increased miss penalty and miss rate]
In general, Average Memory Access Time = Hit Time + Miss Penalty x Miss Rate
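For example, with a 1-cycle hit time, a 4% miss rate, and a 50-cycle miss penalty (illustrative numbers, not from the slides): AMAT = 1 + 50 x 0.04 = 3 cycles per access.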
Multiword Block Considerations
Read misses (I$ and D$)
- Processed the same as for single-word blocks – a miss returns the entire block from memory.
- Miss penalty grows as block size grows. Two mitigations:
  - Early restart – the datapath resumes execution as soon as the requested word of the block is returned.
  - Requested word first – the requested word is transferred from the memory to the cache (and datapath) first.
- Nonblocking cache – allows the datapath to continue to access the cache while the cache is handling an earlier miss.
Write misses (D$)
- Can't use write allocate, or we will end up with a "garbled" block in the cache (e.g., for 4-word blocks: a new tag, one word of data from the new block, and three words of data from the old block), so we must fetch the block from memory first and pay the stall time.
Other Ways to Reduce Cache Miss Rates
1. Allow more flexible block placement (a lookup sketch follows below)
- In a direct mapped cache a memory block maps to exactly one cache block.
- At the other extreme, a memory block could be allowed to map to any cache block – a fully associative cache.
- A compromise is to divide the cache into sets, each of which consists of n "ways" (n-way set associative).
2. Use multiple levels of caches
- Add a second level of cache on chip – normally a unified L2 cache (i.e., it holds both instructions and data).
  - The L1 caches focus on minimizing hit time in support of a shorter clock cycle (smaller, with smaller block sizes).
  - The L2 cache focuses on reducing miss rate to reduce the penalty of long main memory access times (larger, with larger block sizes).
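A minimal sketch of the n-way set-associative lookup (hypothetical names and sizes, ours rather than the slides'): the index now selects a set, and the tag must be compared against all n ways in that set.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 128
#define NUM_WAYS 4   /* n-way set associative */

struct line { bool valid; uint32_t tag; /* data omitted */ };
static struct line cache[NUM_SETS][NUM_WAYS];

/* Returns true on a hit: the block may live in any of the set's n ways. */
bool lookup(uint32_t block_addr) {
    uint32_t set = block_addr % NUM_SETS;   /* index selects the set */
    uint32_t tag = block_addr / NUM_SETS;   /* remaining high-order bits */
    for (int way = 0; way < NUM_WAYS; way++)
        if (cache[set][way].valid && cache[set][way].tag == tag)
            return true;
    return false;
}
```

With NUM_WAYS = 1 this degenerates to the direct-mapped case; with NUM_SETS = 1 and NUM_WAYS equal to the number of blocks, it is fully associative.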
Cache Summary
The Principle of Locality: a program is likely to access a relatively small portion of the address space at any instant of time.
- Temporal Locality: locality in time
- Spatial Locality: locality in space
Three major categories of cache misses:
- Compulsory misses: sad facts of life, e.g., cold start misses
- Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping-pong effect!
- Capacity misses: increase cache size
Cache design space:
- total size, block size, associativity (replacement policy)
- write-hit policy (write-through, write-back)
- write-miss policy (write allocate, write buffers)
Improving Cache Performance
Reduce the hit time
- smaller cache
- direct mapped cache
- smaller blocks
- for writes:
  - no write allocate – just write to the write buffer
  - write allocate – write to a delayed write buffer that then writes to the cache
Reduce the miss rate
- bigger cache
- associative cache
- larger blocks (16 to 64 bytes)
- use a victim cache – a small buffer that holds the most recently discarded blocks
Reduce the miss penalty
- smaller blocks; for large blocks, fetch the critical word first
- use a write buffer – check the write buffer (and/or victim cache) on a read miss; we may get lucky
- use multiple cache levels – the L2 cache is not tied to the CPU clock rate
- faster backing store / improved memory bandwidth (wider buses, SDRAMs)
Review: (DDR) SDRAM Operation
[Diagram: an N-row x N-column DRAM array with M bit planes; a Row Address (RAS) reads an entire row into an N x M SRAM row register, from which a Column Address (CAS) selects the M-bit output]
After a row is read into the SRAM register:
- Input CAS as the starting "burst" address along with a burst length.
- A burst of data (ideally a cache block) is transferred from a series of sequential addresses within that row.
- A clock controls the transfer of successive words in the burst.
[Timing: RAS with the row address, then CAS with the column address, then the 1st, 2nd, 3rd, and 4th M-bit accesses in consecutive clocks, within one cycle time]
Memory Systems that Support Caches
The off-chip interconnect and memory architecture affect overall system performance dramatically.
[Diagram: on-chip CPU and cache, connected by a bus (32-bit data & 32-bit address per cycle) to an off-chip Main Memory]
Assume:
1. 1 clock cycle (1 ns) to send the address from the cache to the Main Memory
2. 50 ns (50 processor clock cycles) for the DRAM first-word access time, and a 10 ns (10 clock cycles) cycle time for the remaining words in a burst (for SDRAM)
3. 1 clock cycle (1 ns) to return a word of data from the Main Memory to the cache
Memory-bus-to-cache bandwidth = the number of bytes accessed from Main Memory and transferred to the cache/CPU per clock cycle.
One-Word-Wide Memory Organization
If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:
  1 cycle to send the address
  50 cycles to read the DRAM
  1 cycle to return the data
  52 total clock cycles of miss penalty
The number of bytes transferred per clock cycle (bandwidth) for a miss is 4/52 = 0.077 bytes per clock.
Burst Memory Organization
What if the block size is four words and a (DDR) SDRAM is used?
  1 cycle to send the 1st address
  50 + 3x10 = 80 cycles to read the DRAM (50 for the first word, 10 for each of the remaining three)
  1 cycle to return the last data word
  82 total clock cycles of miss penalty
The number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/82 = 0.195 bytes per clock.
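The two miss penalties are simple enough to confirm in a few lines (our helper, using the timings assumed on the "Memory Systems that Support Caches" slide):

```c
#include <stdio.h>

int main(void) {
    /* One-word block: send address + DRAM access + return word */
    int one_word = 1 + 50 + 1;
    printf("one-word block: %d cycles, %.3f bytes/clock\n",
           one_word, 4.0 / one_word);   /* 52 cycles, 0.077 */

    /* Four-word burst: first word costs 50, each later beat costs 10 */
    int burst = 1 + (50 + 3 * 10) + 1;
    printf("four-word burst: %d cycles, %.3f bytes/clock\n",
           burst, 16.0 / burst);        /* 82 cycles, 0.195 */
    return 0;
}
```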
DRAM Memory System Summary
It's important to match:
- the cache characteristics – caches access one block at a time (usually more than one word)
- with the DRAM characteristics – use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache
- and with the memory-bus characteristics – make sure the memory bus can support the DRAM access rates and patterns
all with the goal of increasing the memory-bus-to-cache bandwidth.