CS252/Kubiatowicz Lec 15.1 10/22/03 — CS252 Graduate Computer Architecture, Lecture 15: Prediction (Finished); Caches I: 3 Cs and 7 ways to reduce misses. October 22, 2003.


CS252 Graduate Computer Architecture
Lecture 15: Prediction (Finished); Caches I: 3 Cs and 7 ways to reduce misses
October 22nd, 2003
Prof. John Kubiatowicz

Review: Yeh and Patt classification
–GAg: Global History Register, Global History Table
–PAg: Per-Address History Register, Global History Table
–PAp: Per-Address History Register, Per-Address History Table
[Diagram: a GBHR indexing a GPHT (GAg); a PABHR indexing a GPHT (PAg); a PABHR indexing a PAPHT (PAp)]
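The three schemes differ only in which history register and which pattern table a branch consults. A minimal sketch, with illustrative 2-bit counters and a fixed history width; dicts stand in for fixed-size hardware tables:

```python
HIST_BITS = 8   # history register width (illustrative)

class TwoLevelPredictor:
    """scheme: 'GAg' (global history, global table), 'PAg' (per-address
    history, global table), or 'PAp' (per-address history, per-address tables)."""
    def __init__(self, scheme):
        self.scheme = scheme
        self.ghr = 0        # global branch history register (GBHR)
        self.bhrs = {}      # per-address branch history registers (PABHR)
        self.pht = {}       # pattern history table(s) of 2-bit counters

    def _index(self, pc):
        hist = self.ghr if self.scheme == 'GAg' else self.bhrs.get(pc, 0)
        # PAp keys its tables by (pc, history); GAg/PAg share one table
        return (pc, hist) if self.scheme == 'PAp' else hist

    def predict(self, pc):
        return self.pht.get(self._index(pc), 1) >= 2   # counter >= 2: taken

    def update(self, pc, taken):
        idx = self._index(pc)
        ctr = self.pht.get(idx, 1)
        self.pht[idx] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        bit = 1 if taken else 0
        mask = (1 << HIST_BITS) - 1
        self.ghr = ((self.ghr << 1) | bit) & mask
        self.bhrs[pc] = ((self.bhrs.get(pc, 0) << 1) | bit) & mask
```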

Review: Other Global Variants: Try to Avoid Aliasing
–GAs: Global History Register, Per-Address (Set Associative) History Table
–Gshare: Global History Register, Global History Table with a simple attempt at anti-aliasing (index = global history XOR branch address bits)
[Diagram: the GBHR indexing a PAPHT (GAs); the GBHR (n bits) XORed with n address bits to index the GPHT (Gshare)]
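Gshare's anti-aliasing trick is just its index computation; a small sketch (the `>> 2` word-offset shift and the table width `n` are assumptions, not from the slide):

```python
def gshare_index(pc, ghr, n):
    """Gshare indexing: XOR the global history with low branch-address
    bits so different branches sharing a history pattern tend to land in
    different entries of the single global table."""
    mask = (1 << n) - 1
    return ((pc >> 2) & mask) ^ (ghr & mask)   # >> 2 drops word-offset bits
```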

An anti-aliasing predictor: Bi-Mode [Chih-Chieh Lee, I-Cheng K. Chen, and Trevor N. Mudge]
Two separate Gshare predictors + chooser
–One for each bias
–Only one is used/updated per branch!
Sort branches by bias
–A meta (choice) predictor chooses which direction predictor to use
–Constructive aliasing helps rather than hinders
[Diagram: history (n bits) XORed with the address indexes the two bias ("True"/"False") direction predictors; s address bits index the choice predictor, which selects the result]

Review: What are Important Metrics?
Clearly, hit rate matters
–Even 1% can be important when above a 90% hit rate
Speed: does this affect cycle time?
Space: clearly, total space matters!
–Papers that do not try to normalize across different options are playing fast and loose with data
–Try to get the best performance for the cost
How many different predictors are there?
–MANY, MANY, MANY!

An alternative: Genetic Programming for Design
Genetic programming has two key aspects:
–An encoding of the design space
»This is a symbolic representation of the result space (genome)
»Much of the domain-specific knowledge and “art” is involved here
–A reproduction strategy
»Includes a method for generating offspring from parents:
  Mutation: changing random portions of an individual
  Crossover: merging aspects of two individuals
»Includes a method for evaluating the effectiveness (“fitness”) of individual solutions
Generation of new branch predictors via genetic programming:
–Everything derived from a “basic” predictor (table) + simple operators
–Expressions arranged in a tree
–Mutation: random modification of a node / replacement of a subtree
–Crossover: swapping the subtrees of two parents
Paper by Joel Emer and Nikolas Gloy, “A Language for Describing Predictors and its Application to Automatic Synthesis”
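A toy illustration of mutation and crossover on expression trees, with expressions encoded as nested tuples; the operator and leaf names are invented for illustration and are not the Emer/Gloy language:

```python
import random

LEAVES = [('pc',), ('hist',), ('table',)]   # hypothetical basic terms

def random_tree(depth=2):
    """Generate a random expression (part of encoding the design space)."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(LEAVES)
    return ('xor', random_tree(depth - 1), random_tree(depth - 1))

def pick(tree):
    """Select a random subtree (possibly the whole tree)."""
    if len(tree) > 1 and random.random() < 0.5:
        return pick(random.choice(tree[1:]))
    return tree

def graft(tree, donor):
    """Copy of tree with one randomly chosen subtree replaced by donor."""
    if len(tree) == 1 or random.random() < 0.5:
        return donor
    i = random.randint(1, len(tree) - 1)
    return tree[:i] + (graft(tree[i], donor),) + tree[i + 1:]

def mutate(tree):
    """Mutation: replace a random subtree with a fresh random one."""
    return graft(tree, random_tree())

def crossover(a, b):
    """Crossover: graft a random subtree of parent b into parent a."""
    return graft(a, pick(b))
```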

Review: Memory Disambiguation
The memory disambiguation buffer contains the set of active stores and loads in program order
–Loads and stores are entered at issue time
–May not have addresses yet
Optimistic dependence speculation: assume that loads and stores don’t depend on each other
Need the disambiguation buffer to catch errors. All checks occur at address-resolution time:
–When a store address is ready, check for loads that are (1) later in time and (2) have the same address
»These have been incorrectly speculated: flush and restart
–When a load address is ready, check for stores that are (1) earlier in time and (2) have the same address:
  if (match) then
    if (store value ready) then return value
    else return pointer to reservation station
  else optimistically start load access
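The two address-resolution checks above can be sketched as follows; the buffer layout (a program-ordered list of entries with sequence numbers) and the names are illustrative, not from any real design:

```python
def resolve_store_address(buffer, store_seq, addr):
    """Store address ready: return loads LATER in program order with the
    same address; these were mis-speculated and must flush/restart."""
    return [e for e in buffer
            if e['kind'] == 'load' and e['seq'] > store_seq and e['addr'] == addr]

def resolve_load_address(buffer, load_seq, addr):
    """Load address ready: check EARLIER stores to the same address."""
    stores = [e for e in buffer
              if e['kind'] == 'store' and e['seq'] < load_seq and e['addr'] == addr]
    if not stores:
        return ('issue', None)               # optimistically start load access
    st = max(stores, key=lambda e: e['seq'])  # youngest earlier store wins
    if st['value'] is not None:
        return ('forward', st['value'])      # store value ready: return it
    return ('wait', st['seq'])               # else point at its reservation station
```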

Review: STORE sets
Naïve speculation can cause problems for certain load-store pairs.
“Counter-Speculation”: for each load, keep track of the set of stores that have forwarded information in the past:
  if (prior store in store-set has unresolved address) then
    wait for store address to be completed
  else if (match) then
    if (store value ready) then return value
    else return pointer to reservation station
  else optimistically start load access
[Diagram: the Load/Store PC indexes the Store Set ID Table (SSIT) to obtain an SSID, which indexes the Last Fetched Store Table (LFST) holding the store Inum]

Review: Who Cares About the Memory Hierarchy?
Processor-only view thus far in course:
–CPU cost/performance, ISA, pipelined execution
CPU-DRAM gap:
–1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip)
[Figure: performance since 1980, µProc improving 60%/yr (“Moore’s Law”) vs. DRAM 7%/yr (“Less’ Law?”); the processor-memory performance gap grows 50%/year]

Generations of Microprocessors
Time of a full cache miss in instructions executed:
–1st Alpha: 340 ns / 5.0 ns = 68 clks × 2, or 136 instructions
–2nd Alpha: 266 ns / 3.3 ns = 80 clks × 4, or 320 instructions
–3rd Alpha: 180 ns / 1.7 ns = 108 clks × 6, or 648 instructions
Why not recompute the value rather than taking the time to fetch it from memory?

Processor-Memory Performance Gap “Tax”
Processor         % Area (~cost)   % Transistors (~power)
Alpha                  …%                77%
StrongArm SA110        61%               94%
Pentium Pro            64%               88%
–Pentium Pro: 2 dies per package: Proc/I$/D$ + L2$
Caches have no inherent value; they only try to close the performance gap

What is a cache?
Small, fast storage used to improve the average access time to slow memory.
Exploits spatial and temporal locality.
In computer architecture, almost everything is a cache!
–Registers: a cache on variables
–First-level cache: a cache on second-level cache
–Second-level cache: a cache on memory
–Memory: a cache on disk (virtual memory)
–TLB: a cache on the page table
–Branch prediction: a cache on prediction information?
[Diagram: Proc/Regs → L1-Cache → L2-Cache → Memory → Disk, Tape, etc.; lower levels are bigger, upper levels are faster]

Example: 1 KB Direct Mapped Cache
For a 2**N byte cache:
–The uppermost (32 − N) bits are always the Cache Tag
–The lowest M bits are the Byte Select (Block Size = 2**M)
The tag is stored as part of the cache “state,” along with a Valid Bit.
[Diagram: a 32-bit address split into Cache Tag (ex: 0x50), Cache Index (ex: 0x01), and Byte Select (ex: 0x00); each of the 32 lines holds Bytes 0–31 of its block, for Bytes 0–1023 overall]
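The field split above can be sketched for the drawn configuration; the 32-byte blocks and 32 lines are read off the diagram, so treat them as assumptions:

```python
BLOCK_BITS = 5   # block size = 2**5 = 32 bytes (byte select)
INDEX_BITS = 5   # 1 KB / 32 B = 32 cache lines (cache index)

def split_address(addr):
    """Split a 32-bit address into (tag, index, byte_select) fields."""
    byte_select = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, byte_select
```

Under these assumptions, the slide's example values (tag 0x50, index 0x01, byte select 0x00) correspond to address 0x14020.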

Set Associative Cache
N-way set associative: N entries for each Cache Index
–N direct-mapped caches operating in parallel
Example: two-way set associative cache
–Cache Index selects a “set” from the cache
–The two tags in the set are compared to the input in parallel
–Data is selected based on the tag result
[Diagram: the Cache Index selects one block from each of two Valid/Tag/Data arrays; two comparators match Adr Tag against both tags, their OR produces Hit, and Sel1/Sel0 drive a mux that picks the Cache Block]
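The compare-and-select above can be sketched as follows (the data structure is an assumption; in hardware both compares happen simultaneously, not in a loop):

```python
def lookup_2way(cache_sets, index, adr_tag):
    """Two-way set-associative lookup: the index picks a set, both ways'
    tags are compared, and a valid match selects that way's data."""
    for way in cache_sets[index]:          # hardware compares both ways at once
        if way['valid'] and way['tag'] == adr_tag:
            return True, way['data']       # Hit: the mux selects this way
    return False, None                     # Miss
```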

Disadvantage of Set Associative Cache
N-way set-associative cache versus direct-mapped cache:
–N comparators vs. 1
–Extra MUX delay for the data
–Data comes AFTER the Hit/Miss decision and set selection
In a direct-mapped cache, the Cache Block is available BEFORE Hit/Miss:
–Possible to assume a hit and continue; recover later if it was a miss.

What happens on a Cache miss?
For an in-order pipeline, 2 options:
–Freeze pipeline in Mem stage (popular early on: Sparc, R4000)
   IF ID EX Mem stall stall stall … stall Mem Wr
      IF ID EX  stall stall stall … stall stall Ex Wr
–Use Full/Empty bits in registers + an MSHR queue
»MSHR = “Miss Status/Handler Registers” (Kroft)
  Each entry in this queue keeps track of the status of outstanding memory requests to one complete memory line:
  –Per cache line: keep info about the memory address
  –For each word: the register (if any) that is waiting for the result
  –Used to “merge” multiple requests to one memory line
»A new load creates an MSHR entry and sets its destination register to “Empty”; the load is “released” from the pipeline
»An attempt to use the register before the result returns causes the instruction to block in the decode stage
»Limited “out-of-order” execution with respect to loads; popular with in-order superscalar architectures
Out-of-order pipelines already have this functionality built in (load queues, etc.).
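The MSHR merging described above might be sketched like this; the interface names are made up, and the structure is only a software stand-in for the hardware queue:

```python
class MSHRFile:
    """One entry per outstanding memory line; a second miss to the same
    line merges into the entry instead of issuing a new memory request."""
    def __init__(self, line_bytes=32):
        self.line_bytes = line_bytes
        self.entries = {}                  # line addr -> [(word addr, dest reg)]

    def miss(self, addr, dest_reg):
        """Record a miss; return True if merged into an in-flight request."""
        line = addr - (addr % self.line_bytes)
        merged = line in self.entries
        self.entries.setdefault(line, []).append((addr, dest_reg))
        return merged

    def fill(self, line):
        """Line returns from memory: free the entry and report which
        destination registers to wake (their Full/Empty bits flip to Full)."""
        return self.entries.pop(line, [])
```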

Review: Cache Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) × Clock cycle time
Memory stall clock cycles =
  (Reads × Read miss rate × Read miss penalty) + (Writes × Write miss rate × Write miss penalty)
or, combining reads and writes:
Memory stall clock cycles = Memory accesses × Miss rate × Miss penalty
Average Memory Access Time (AMAT) = Hit Time + (Miss Rate × Miss Penalty)
Note: memory hit time is included in the execution cycles (memory time from the view of the Mem stage).

Impact on Performance
Suppose a processor executes at:
–Clock rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
–50% arith/logic, 30% ld/st, 20% control
Suppose that 10% of memory operations get a 50-cycle miss penalty, and 1% of instructions get the same miss penalty.
CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/ins)
      + [0.30 (DataMops/ins) × 0.10 (miss/DataMop) × 50 (cycle/miss)]
      + [1 (InstMop/ins) × 0.01 (miss/InstMop) × 50 (cycle/miss)]
    = (1.1 + 1.5 + 0.5) cycle/ins = 3.1
So roughly 65% of the time the processor is stalled waiting for memory!
AMAT = (1/1.3) × [1 + 0.01 × 50] + (0.3/1.3) × [1 + 0.1 × 50] = 2.54
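Re-deriving the arithmetic above, using exactly the inputs the slide states:

```python
ideal_cpi = 1.1
data_stalls = 0.30 * 0.10 * 50   # ld/st fraction x miss rate x penalty = 1.5
inst_stalls = 1.00 * 0.01 * 50   # one fetch per instruction            = 0.5
cpi = ideal_cpi + data_stalls + inst_stalls          # 3.1 cycles/ins

# AMAT weights the 1.3 memory accesses per instruction (1 fetch + 0.3 data)
amat = (1/1.3) * (1 + 0.01 * 50) + (0.3/1.3) * (1 + 0.10 * 50)

stall_fraction = (data_stalls + inst_stalls) / cpi   # time spent stalled
```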

Review: Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level? (Block placement)
–Fully associative, set associative, direct mapped
Q2: How is a block found if it is in the upper level? (Block identification)
–Tag/block
Q3: Which block should be replaced on a miss? (Block replacement)
–Random, LRU
Q4: What happens on a write? (Write strategy)
–Write back or write through (with a write buffer)
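Q3's LRU policy for a single set can be sketched with an ordered map as a software stand-in for hardware LRU bits (structure assumed):

```python
from collections import OrderedDict

class LRUSet:
    """One cache set: a hit moves the tag to most-recently-used; a fill
    into a full set evicts the least recently used tag."""
    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()        # oldest (LRU) first

    def access(self, tag):
        """Touch a tag; return the evicted tag, or None."""
        if tag in self.tags:
            self.tags.move_to_end(tag)   # hit: now most recently used
            return None
        victim = None
        if len(self.tags) == self.ways:
            victim, _ = self.tags.popitem(last=False)   # evict LRU way
        self.tags[tag] = True
        return victim
```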

CS 252 Administrivia
I don’t have the test graded yet (really sorry!)
–It may be a bit before this happens.
Homework: send it to me by email
–By (say) midnight tonight
Proposals:
–Send me your tentative proposals via email today
»Use my normal email address
–I will try to comment on them very soon
–The final version will be due next Wednesday