1 COMP 206: Computer Architecture and Implementation Montek Singh Wed., Nov. 12, 2003 Topics: 1. Cache Performance (concl.) 2. Cache Coherence.


2 Review: Improving Cache Performance
 1. Reduce the miss rate,
 2. Reduce the miss penalty, or
 3. Reduce the time to hit in the cache.
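The three levers above all act on the standard average-memory-access-time formula. A minimal sketch, with purely illustrative cycle counts (not measurements from any real machine):

```python
# AMAT = hit time + miss rate * miss penalty (all in cycles).
# The numbers below are assumptions chosen only to show the three levers.

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles."""
    return hit_time + miss_rate * miss_penalty

base = amat(hit_time=1, miss_rate=0.05, miss_penalty=40)   # ≈ 3.0 cycles

# Each improvement attacks one term of the sum:
lower_mr = amat(1, 0.03, 40)    # 1. reduce the miss rate
lower_mp = amat(1, 0.05, 20)    # 2. reduce the miss penalty
lower_ht = amat(0.8, 0.05, 40)  # 3. reduce the hit time
```

Any of the three changes lowers AMAT; which one is cheapest depends on the design, which is what the following slides explore.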

3 1. Fast Hit Times via Small, Simple Caches
 - Simple caches can be faster
   - cache hit time is increasingly a bottleneck to CPU performance
   - set associativity requires complex tag matching → slower
   - direct-mapped caches are simpler → faster, with shorter CPU cycle times
     - tag check can be overlapped with transmission of the data
 - Smaller caches can be faster
   - can fit on the same chip as the CPU → avoid the penalty of going off-chip
   - for L2 caches, a compromise: keep tags on chip, and data off chip
     - fast tag check, yet greater cache capacity
   - e.g., the L1 data cache was reduced from 16KB in the Pentium III to 8KB in the Pentium 4
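Why a direct-mapped lookup is simple can be sketched in a few lines: one index computation and a single tag compare. The cache geometry below (8 KB, 32-byte blocks) is an assumption for illustration only:

```python
# Toy direct-mapped cache lookup: split an address into tag/index/offset
# and do exactly one tag comparison. An N-way set-associative cache
# would need N comparisons (and a multiplexer) on the critical path.

BLOCK_SIZE = 32          # bytes per block (assumed)
NUM_BLOCKS = 256         # 256 * 32 B = 8 KB cache (assumed)
OFFSET_BITS = 5          # log2(BLOCK_SIZE)
INDEX_BITS = 8           # log2(NUM_BLOCKS)

def split_address(addr):
    offset = addr & (BLOCK_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tags = [None] * NUM_BLOCKS   # the tag array

def is_hit(addr):
    tag, index, _ = split_address(addr)
    return tags[index] == tag   # a single comparison decides hit/miss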

4 2. Fast Hits by Avoiding Addr. Translation
 - For virtual memory: can we send the virtual address to the cache? Called a Virtually Addressed Cache (or just Virtual Cache), vs. a Physical Cache
   - Benefit: avoids translation from virtual to real address, which saves time
   - Problems:
     - Every time a process is switched, the cache logically must be flushed; otherwise we get false hits
       - Cost is the time to flush + "compulsory" misses from an empty cache
     - Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
     - I/O must interact with the cache, so it needs a mapping to virtual addresses
 - Some solutions partially address these issues
   - HW guarantee: each cache frame holds a unique physical address
   - SW guarantee: the lower n bits must have the same address; as long as they cover the index field and the cache is direct mapped, blocks must be unique; called page coloring
 - Solution to the cache flush
   - Add a process-identifier tag that identifies the process as well as the address within the process: you can't get a hit if the process is wrong
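The page-coloring guarantee can be expressed as a small predicate: the OS only allows mappings where the virtual and physical address agree in the index bits that lie above the page offset (the page "color"). The bit widths below are illustrative assumptions (4 KB pages, a 16 KB direct-mapped cache):

```python
# Sketch of the page-coloring check from the slide. With these assumed
# sizes, bits [13:12] are the "color": if VA and PA agree there, two
# synonyms always land on the same cache index, so no false duplicates.

PAGE_BITS = 12           # 4 KB pages (assumed)
CACHE_INDEX_BITS = 14    # index + offset bits of a 16 KB DM cache (assumed)

def coloring_ok(virt_addr, phys_addr):
    """OS mapping rule: VA and PA must agree in the color bits."""
    color_mask = (1 << CACHE_INDEX_BITS) - (1 << PAGE_BITS)
    return (virt_addr & color_mask) == (phys_addr & color_mask)
```

If every mapping satisfies this rule, any two virtual synonyms of the same physical page index into the same cache frame, so the hardware never holds two stale copies.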

5 Virtually Addressed Caches
 [Figure: Conventional organization: CPU → TLB (VA to PA) → Cache → MEM.
  Virtually addressed cache: CPU → Cache (VA tags) → TLB → MEM; translation happens only on a miss.]

6 3. Pipeline Write Hits
 - Write hits take slightly longer than read hits:
   - cannot parallelize tag matching with the data transfer
   - must match tags before the data is written!
 - Key idea: pipeline the writes
   - check the tag first; if it matches, let the CPU resume
   - let the actual write take its time
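A minimal sketch of the pipelined-write idea: check the tag now, buffer the data, and perform the actual array write during the next cache access. This is a toy model of the technique, not any specific machine's pipeline:

```python
# Toy pipelined-write-hit cache. The tag check happens immediately so
# the CPU can resume; the data write is deferred and drained at the
# start of the next access (read or write).

class PipelinedWriteCache:
    def __init__(self):
        self.tags = {}        # index -> tag
        self.data = {}        # index -> value
        self.pending = None   # (index, value) awaiting its data cycle

    def _drain(self):
        # Complete the previous write's data phase, if any.
        if self.pending is not None:
            idx, val = self.pending
            self.data[idx] = val
            self.pending = None

    def write(self, index, tag, value):
        self._drain()
        if self.tags.get(index) == tag:    # tag check now...
            self.pending = (index, value)  # ...data write later
            return "hit"                   # CPU resumes immediately
        return "miss"

    def read(self, index, tag):
        self._drain()                      # pending write must land first
        if self.tags.get(index) == tag:
            return self.data.get(index)
        return None
```

Draining the pending write before every access is what keeps the deferred write invisible to the program, which is the correctness condition the pipelining relies on.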

7 Cache Optimization Summary

  Technique                           MR  MP  HT  Complexity
  Larger Block Size                   +   –       0
  Higher Associativity                +       –   1
  Victim Caches                       +           2
  Pseudo-Associative Caches           +           2
  HW Prefetching of Instr/Data        +           2
  Compiler-Controlled Prefetching     +           3
  Compiler Reduce Misses              +           0
  Priority to Read Misses                 +       1
  Subblock Placement                      +   +   1
  Early Restart & Critical Word 1st       +       2
  Non-Blocking Caches                     +       3
  Second-Level Caches                     +       2
  Small & Simple Caches               –       +   0
  Avoiding Address Translation                +   2

  (MR = miss rate, MP = miss penalty, HT = hit time; + helps, – hurts)

8 Impact of Caches
 - …: Speed = ƒ(no. operations)
 - 1997:
   - Pipelined execution & fast clock rate
   - Out-of-order completion
   - Superscalar instruction issue
 - 1999: Speed = ƒ(non-cached memory accesses)
 - Has an impact on: compilers, architects, algorithms, data structures?

9 Cache Coherence
 (Section 6.3 & Appendix I)

10 Cache Coherence
 - A common problem with multiple copies of mutable information (in both hardware and software):
   - "If a datum is copied and the copy is to match the original at all times, then all changes to the original must cause the copy to be immediately updated or invalidated." (Richard L. Sites, co-architect of the DEC Alpha)
 [Figure: once the original is written, a copy becomes stale and the copies diverge, which is hard to recover from. Write update propagates the new value to every copy; write invalidate discards the stale copies.]
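The two repair policies in the figure can be sketched side by side. Caches are modeled as plain dicts from address to value; this is purely illustrative, not any real protocol's implementation:

```python
# Write update vs. write invalidate, as described on the slide.
# 'caches' is a list of dicts; an address absent from a dict means
# that cache holds no (valid) copy.

def write_update(caches, writer, addr, value):
    """The writer's store is broadcast; every existing copy is refreshed."""
    for c in caches:
        if addr in c:
            c[addr] = value
    caches[writer][addr] = value

def write_invalidate(caches, writer, addr, value):
    """The writer keeps the only valid copy; all other copies are discarded."""
    for i, c in enumerate(caches):
        if i != writer and addr in c:
            del c[addr]
    caches[writer][addr] = value
```

Update keeps all copies live at the cost of bus traffic on every write; invalidate pays once per write but forces other readers to miss later, which is the trade-off the snoopy protocols below navigate.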

11 Example of Cache Coherence
 - I/O in a uniprocessor with a primary unified cache
   - The MM copy and the cache copy of a memory block are not always coherent
   - WT cache: the MM copy is stale while a write update to MM is in transit
   - WB cache: the MM copy is stale while the cache copy is Dirty
   - The inconsistency is of no concern if no one reads/writes the MM copy
   - If I/O is directed to main memory, coherence must be maintained
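The write-back staleness window above can be made concrete with a toy single-cache model: after a write, main memory stays stale (the block is Dirty) until the block is evicted, which is exactly when a DMA read of MM would see old data. A sketch under those assumptions:

```python
# Toy write-back cache over a dict-based main memory, showing the
# window during which the MM copy is stale.

class WBCache:
    def __init__(self, mm):
        self.mm = mm            # main memory: addr -> value
        self.block = {}         # cached copies: addr -> value
        self.dirty = set()      # addresses whose MM copy is stale

    def write(self, addr, value):
        self.block[addr] = value
        self.dirty.add(addr)    # MM is now stale; I/O reading MM
                                # directly would see the old value

    def evict(self, addr):
        if addr in self.dirty:
            self.mm[addr] = self.block[addr]   # writeback closes the window
            self.dirty.discard(addr)
        self.block.pop(addr, None)
```

A write-through cache would update `mm` inside `write`, shrinking the stale window to just the in-flight update, as the slide notes.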

12 Example of Cache Coherence (contd.)
 - Uniprocessor with a split primary cache
   - The I-cache contains instructions, the D-cache contains data; often their contents are disjoint
   - If self-modifying code is allowed, then the same cache block may appear in both caches, and consistency must be enforced
   - MS-DOS allows self-modifying code
     - A strong motivation for the unified caches in the Intel i386 and i486
     - The Pentium has a split primary cache, and supports self-modifying code by enforcing coherence between the I- and D-caches
 - Coordinating primary and secondary caches in a uniprocessor
 - Shared-memory multiprocessors

13 Two "Snoopy" Protocols
 - We will discuss two protocols
   - A simple three-state protocol (Section 6.3 & Appendix I of HP3)
   - The MESI protocol
     - An IEEE standard
     - Used by many machines, including the Pentium and PowerPC 601
 - Snooping:
   - individual caches monitor memory-bus activity
   - and take actions based on this activity
   - introduces a fourth category of miss to the 3C model: coherence misses
 - First, we need some notation to discuss the protocols

14 Notation: Write-Through Cache

15 Notation: Write-Back Cache

16 Three-State Write-Invalidate Protocol
 - A minor modification of the WB cache
 - Assumptions
   - A single bus and MM
   - Two or more CPUs, each with a WB cache
   - Every cache block is in one of three states: Invalid, Clean, or Dirty (called Invalid, Shared, and Exclusive in Figure 6.10 of HP3)
   - MM copies of blocks have no state
   - At any moment, a single cache owns the bus (is the bus master)
   - The bus master does not obey bus commands
   - All misses (reads or writes) are serviced by
     - MM, if all cache copies are Clean
     - otherwise, the only Dirty cache copy (which is then no longer Dirty); the MM copy is written instead of being read
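The rules above can be sketched as a toy model: misses go on the bus, a Dirty owner writes back and supplies the block, and a write broadcasts an invalidate to the other caches. This is a sketch of the protocol's logic under the stated assumptions, not HP3's exact transition tables:

```python
# Toy three-state (Invalid / Clean / Dirty) write-invalidate protocol
# for one memory block shared over a single bus.

INVALID, CLEAN, DIRTY = "I", "C", "D"

class CacheCopy:
    def __init__(self):
        self.state = INVALID
        self.value = None

class Bus:
    def __init__(self, n_caches, mm_value=0):
        self.caches = [CacheCopy() for _ in range(n_caches)]
        self.mm = mm_value

    def _service_miss(self):
        # If a unique Dirty copy exists, it writes back and becomes
        # Clean; afterwards MM holds the up-to-date value.
        for c in self.caches:
            if c.state == DIRTY:
                self.mm = c.value
                c.state = CLEAN
        return self.mm

    def read(self, i):
        c = self.caches[i]
        if c.state == INVALID:           # read miss -> visible on bus
            c.value = self._service_miss()
            c.state = CLEAN
        return c.value                   # read hit: invisible on bus

    def write(self, i, value):
        c = self.caches[i]
        if c.state != DIRTY:             # write miss, or write hit on Clean
            self._service_miss()         # obtain the up-to-date block
            for j, other in enumerate(self.caches):
                if j != i:
                    other.state = INVALID   # Bus Write Miss invalidates
        c.value, c.state = value, DIRTY
```

Note that after a write, MM deliberately stays stale until another cache's miss (or an eviction) forces the Dirty owner to write back, matching the "only two global states" observation on the next slide.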

17 Understanding the Protocol
 [Figure: four ownership scenarios for a block cached by C1 and C2, showing for each whether the bus owner's copy is Clean or Dirty, whether another Clean copy exists, and whether the owner can read or write without notifying the other caches.]
 - Only two global states:
   - The most up-to-date copy is the MM copy, and all cache copies are Clean
   - The most up-to-date copy is a single unique cache copy in state Dirty

18 State Diagram of Cache Block (Part 1)

19 State Diagram of Cache Block (Part 2)

20 Comparison with Single WB Cache
 - Similarities
   - A read hit is invisible on the bus
   - All misses are visible on the bus
 - Differences
   - In a single WB cache, all misses are serviced by MM; in the three-state protocol, misses are serviced either by MM or by the unique cache holding the only Dirty copy
   - In a single WB cache, a write hit is invisible on the bus; in the three-state protocol, a write hit on a Clean block invalidates all other Clean copies via a Bus Write Miss (a necessary action)

21 Correctness of Three-State Protocol
 - Problem: FSM state transitions are supposed to be atomic, but in this protocol they are not, because of the bus
 - Example: a CPU read miss in the Dirty state
   1. CPU access to the cache detects a miss
   2. Request the bus
   3. Acquire the bus, and change the state of the cache block
   4. Evict the dirty block to MM
   5. Put a Bus Read Miss on the bus
   6. Receive the requested block from MM or another cache
   7. Release the bus, and read from the cache block just received
 - Bus arbitration may cause a gap between steps 2 and 3
   - The whole sequence of operations is no longer atomic
   - App. I.1 argues that the protocol works correctly if steps 3-7 are atomic, i.e., if the bus is not a split-transaction bus
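The atomicity requirement for steps 3-7 can be sketched with a lock standing in for bus ownership: everything between acquiring and releasing the bus runs under mutual exclusion, while the gap between steps 2 and 3 (waiting for the lock) is harmless. A toy model, not a real bus implementation:

```python
import threading

# bus_lock models exclusive bus ownership: holding it makes steps 3-7
# atomic, i.e. the bus is not split-transaction.
bus_lock = threading.Lock()

def service_read_miss(cache, mm, addr):
    """Handle a CPU read miss when the cached block (a different
    address) is Dirty. 'cache' is a one-block dict with keys
    'addr', 'value', 'dirty'; 'mm' is main memory (addr -> value)."""
    # Steps 1-2: miss detected, bus requested (waiting here is safe).
    with bus_lock:                               # step 3: acquire bus
        if cache.get("dirty"):
            mm[cache["addr"]] = cache["value"]   # step 4: evict dirty block
        cache.update(addr=addr,                  # steps 5-6: Bus Read Miss,
                     value=mm[addr],             #   receive requested block
                     dirty=False)
    # Step 7: bus released on leaving the with-block; CPU now reads.
    return cache["value"]
```

If another cache could slip a transaction in between the eviction (step 4) and the block fetch (steps 5-6), the two copies could diverge; the lock is what rules that interleaving out.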

22 Adding More Bits to Protocols
 - Add a third bit, called Shared, to the Valid and Dirty bits
   - This gives five states (M, O, E, S, I)
   - Developed in the context of Futurebus+, with the intention of explaining all snoopy protocols, all of which use 3, 4, or 5 states

23 MESI Protocol
 - Four-state, write-invalidate
 - An improved version of the three-state protocol
   - The Clean state is split into Exclusive and Shared states
   - The Dirty state is equivalent to the Modified state
 - Several slightly different versions of the MESI protocol exist
   - We will describe the version implemented by Futurebus+
   - The PowerPC 601 MESI protocol does not support cache-to-cache transfer of blocks
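A compact sketch of MESI transitions, split (as the state-diagram slides are) into the requesting cache's moves and the snoopers' reactions. This follows the common textbook formulation; specific implementations such as Futurebus+ and the PowerPC 601 differ in details like cache-to-cache transfer:

```python
# Toy MESI next-state functions for a single block.

M, E, S, I = "M", "E", "S", "I"

def on_cpu_request(state, op, others_have_copy):
    """Next state of the requesting cache for a CPU read or write."""
    if op == "read":
        if state == I:                            # read miss -> bus read
            return S if others_have_copy else E
        return state                              # read hit: no change
    # op == "write"
    if state == M:
        return M                                  # write hit: silent
    if state == E:
        return M                                  # E->M upgrade: no bus traffic
    return M          # S: invalidate others (control signal only);
                      # I: bus read-exclusive; either way end Modified

def on_snoop(state, bus_op):
    """Next state of a snooping cache that holds the block."""
    if bus_op == "bus_read":
        # An M holder must also supply/write back the block.
        return S if state in (M, E, S) else I
    if bus_op == "bus_write":                     # read-exclusive / invalidate
        return I
    return state
```

The payoff the next slide highlights is visible here: a write hit in E upgrades to M with no bus transaction at all, and a write hit in S needs only the invalidate signal, never a block transfer.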

24 State Diag. of MESI Cache Block (Part 1)

25 State Diag. of MESI Cache Block (Part 2)

26 Comparison with Three-State Protocol
 - Similarities
   - A read hit is invisible on the bus
   - All misses are handled the same way
 - Differences
   - Big improvement in handling write hits
     - A write hit in the Exclusive state is invisible on the bus
     - A write hit in the Shared state involves no block transfer, only a control signal
 [Figure: Exclusive state: the block can be read or written. Shared state: the block can be read only. Modified state: the block can be read and written.]

27 Comments on Write-Invalidate Protocols
 - Performance
   - A processor can lose a cache block through invalidation by another processor
   - Average memory access time goes up, since writes to shared blocks take more time (the other copies have to be invalidated)
 - Implementation
   - The bus and the CPU may want to access the same cache simultaneously
     - Either the same block or different blocks, but a conflict nonetheless
   - Three possible solutions:
     - Use a single tag array, and accept structural hazards
     - Use two separate tag arrays for the bus and the CPU, which must now be kept coherent at all times
     - Use a multiported tag array (both the Intel Pentium and the PowerPC 601 use this solution)
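The performance points above can be sketched numerically: shared writes pay an invalidation cost, and invalidated blocks return later as coherence misses, a fourth "C" on top of the 3C miss rate. All cycle counts are illustrative assumptions:

```python
# Toy cost model for the two performance effects on this slide.

def shared_write_time(base_write, n_other_copies, invalidate_cost):
    """A write hit on a shared block must also invalidate every other
    copy; a write to an unshared block pays only the base cost."""
    extra = invalidate_cost if n_other_copies > 0 else 0
    return base_write + extra

def amat_with_coherence(hit_time, miss_rate_3c, miss_rate_coherence,
                        miss_penalty):
    """Coherence misses simply add to the effective miss rate."""
    return hit_time + (miss_rate_3c + miss_rate_coherence) * miss_penalty
```

Even a small coherence miss rate raises AMAT noticeably when the miss penalty is large, which is why the MESI refinements that avoid unnecessary bus traffic matter.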