Lecture 7: Directory-Based Cache Coherence


Topics: scalable multiprocessor organizations, directory protocol design issues

Scalable Multiprocessors
[Figure: n nodes, each with a processor (P1 ... Pn), a cache (C1 ... Cn), a memory (Mem 1 ... Mem n), and a CA (CA1 ... CAn), connected by a scalable interconnection network]
CC-NUMA: cache-coherent non-uniform memory access

Directory-Based Protocol
- For each block, there is a centralized “directory” that maintains the state of the block in different caches
- The directory is co-located with the corresponding memory
- Requests and replies on the interconnect are no longer seen by everyone; the directory serializes writes
[Figure: a node containing P, C, Dir, Mem, and CA]

Hierarchical Protocol
- Each “node” can comprise multiple processors with some form of cache coherence; the nodes can employ a different form of cache coherence among themselves
- Especially attractive if programs exhibit locality
[Figure: eight processors (P) and two memories (Mem) connected by a scalable interconnect]

Definitions
- Home node: the node that stores memory and directory state for the cache block in question
- Dirty node: the node that has a cache copy in modified state
- Owner node: the node responsible for supplying data (usually either the home or dirty node)
- Also: exclusive node, local node, requesting node, etc.

Protocol Steps
[Figure: n nodes, each with P, C, Dir, Mem, and CA, connected by a scalable interconnection network]
- What happens on a read miss and a write miss?
- How is information stored in a directory?
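One possible answer to the read-miss and write-miss question can be sketched in a few lines. This is a minimal illustration, not the protocol of any particular machine: the three state names, the `HomeDirectory` class, and the message names in `log` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DirEntry:
    state: str = "uncached"                      # "uncached", "shared", or "modified" (assumed names)
    sharers: set = field(default_factory=set)    # ids of nodes holding a copy


class HomeDirectory:
    """Hypothetical directory kept at the home node, one entry per block."""

    def __init__(self):
        self.entries = {}
        self.log = []    # messages the home would send, recorded for illustration

    def _entry(self, block):
        return self.entries.setdefault(block, DirEntry())

    def read_miss(self, block, requester):
        e = self._entry(block)
        if e.state == "modified":
            # The dirty node holds the only valid copy: fetch it and downgrade.
            owner = next(iter(e.sharers))
            self.log.append(("fetch_and_downgrade", owner, block))
        e.state = "shared"
        self.log.append(("data_reply", requester, block))
        e.sharers.add(requester)

    def write_miss(self, block, requester):
        e = self._entry(block)
        # A dirty copy must be fetched as well as invalidated; clean copies
        # only need an invalidation.
        kind = "fetch_and_invalidate" if e.state == "modified" else "invalidate"
        for node in sorted(e.sharers - {requester}):
            self.log.append((kind, node, block))
        self.log.append(("data_reply_exclusive", requester, block))
        e.state = "modified"
        e.sharers = {requester}
```

Because every request for a block passes through its single directory entry, the home node sees all writes to that block in one order, which is how the directory serializes them.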

Directory Organizations
- Centralized Directory: one fixed location; a bottleneck!
- Flat Directories: directory info is in a fixed place, determined by examining the address; can be further categorized as memory-based or cache-based
- Hierarchical Directories: the processors are organized as a logical tree structure and each parent keeps track of which of its immediate children has a copy of the block; less storage (?), more searching, can exploit locality
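For a flat directory, "determined by examining the address" can be as simple as a fixed function of the block number. The sketch below assumes block-interleaved placement with a 64-byte block and 16 nodes; both constants and the `home_node` name are illustrative, and real machines may instead use high-order address bits or page-level placement.

```python
BLOCK_BYTES = 64   # assumed cache-block size
NUM_NODES = 16     # assumed number of nodes


def home_node(addr: int) -> int:
    """Return the node holding memory and directory state for this address."""
    block = addr // BLOCK_BYTES      # drop the offset within the block
    return block % NUM_NODES         # consecutive blocks go to consecutive nodes
```

Any requester can compute the home locally, so no lookup traffic is needed to find the directory.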

Flat Memory-Based Directories
- Directory is associated with memory and stores info for all cache copies
- A presence vector stores a bit for every processor, for every memory block; the overhead is a function of memory/block size and #processors
- Reducing directory overhead:
  - Width: pointers (keep track of processor ids of sharers; need an overflow strategy), 2-level protocol to combine info for multiple processors
  - Height: increase block size, track info only for blocks that are cached (note: cache size << memory size)
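The storage overhead can be worked out directly. The sketch below compares a full presence vector with a limited-pointer scheme; it counts only presence bits or pointer bits (state bits are ignored), and the function names and the example parameters are assumptions for illustration.

```python
import math


def full_bit_vector_overhead(num_procs: int, block_bytes: int) -> float:
    """Directory bits per data bit with one presence bit per processor per block."""
    return num_procs / (block_bytes * 8)


def limited_pointer_overhead(num_pointers: int, num_procs: int, block_bytes: int) -> float:
    """Directory bits per data bit when each entry stores a few sharer ids."""
    bits_per_pointer = math.ceil(math.log2(num_procs))
    return num_pointers * bits_per_pointer / (block_bytes * 8)


# 64-byte blocks: 64 processors cost 12.5% extra storage, 256 processors cost 50%.
print(full_bit_vector_overhead(64, 64))       # 0.125
print(full_bit_vector_overhead(256, 64))      # 0.5
# Four 8-bit pointers for 256 processors cost 6.25%.
print(limited_pointer_overhead(4, 256, 64))   # 0.0625
```

The presence vector grows linearly with the processor count, while each pointer grows only logarithmically, which is why pointers reduce the width at the price of an overflow strategy.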

Flat Cache-Based Directories
- The directory at the memory home node only stores a pointer to the first cached copy; the caches store pointers to the next and previous sharers (a doubly linked list)
[Figure: a sharing list running from main memory through Cache 7, Cache 3, and Cache 26]
- Potentially lower storage
- No bottleneck for network traffic
- Invalidates are now serialized (takes longer to acquire exclusive access)
- Replacements must update the linked list
- Must handle race conditions while updating the list
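The sharing list can be sketched as an ordinary doubly linked list whose head pointer lives at the home node. The class and method names are hypothetical, and inserting new sharers at the head is an assumption of this sketch; race handling is left out entirely.

```python
class CacheNode:
    def __init__(self, cache_id):
        self.cache_id = cache_id
        self.prev = None    # pointer kept in this cache
        self.next = None    # pointer kept in this cache


class SharingList:
    def __init__(self):
        self.head = None    # the only pointer stored at the home node

    def add_sharer(self, node):
        """A new sharer becomes the head of the list (assumed policy)."""
        node.prev = None
        node.next = self.head
        if self.head is not None:
            self.head.prev = node
        self.head = node

    def remove_sharer(self, node):
        """A replacement must unlink the cache from its neighbours."""
        if node.prev is not None:
            node.prev.next = node.next
        else:
            self.head = node.next
        if node.next is not None:
            node.next.prev = node.prev
        node.prev = node.next = None

    def invalidate_all(self):
        """Walk the list one sharer at a time; returns the visiting order."""
        order = []
        node = self.head
        while node is not None:
            order.append(node.cache_id)
            nxt = node.next
            node.prev = node.next = None
            node = nxt
        self.head = None
        return order
```

The loop in `invalidate_all` shows why invalidations are serialized: each sharer is reachable only through the one before it.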

Data Sharing Patterns
- Two important metrics that guide our design choices: invalidation frequency and invalidation size; it turns out that invalidation size is rarely greater than four
- Read-only data: constantly read, never updated (raytrace)
- Producer-consumer: flag-based synchronization, updates from neighbors (Ocean)
- Migratory: reads and writes from a single processor for a period of time (global sum)
- Irregular: unpredictable accesses (distributed task queue)

Protocol Optimizations
[Figure: three message-flow diagrams among caches C1 and C2 and the home memory (Mem), with numbered messages: Request-Response, Intervention Forwarding, and Reply Forwarding]
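The three schemes differ in how a miss to a block held dirty elsewhere is routed. The sketch below encodes one common textbook description of each flow as a list of messages; the role names (requester, home, owner), the message descriptions, and the ordering of the last two messages in each flow are assumptions, since the original diagrams did not survive extraction.

```python
# Each message: (source, destination, description, delivers_data_to_requester)
FLOWS = {
    "request-response": [
        ("requester", "home", "request", False),
        ("home", "requester", "reply naming the owner", False),
        ("requester", "owner", "intervention", False),
        ("owner", "requester", "data response", True),
        ("owner", "home", "revise directory", False),
    ],
    "intervention forwarding": [
        ("requester", "home", "request", False),
        ("home", "owner", "intervention", False),
        ("owner", "home", "data response", False),
        ("home", "requester", "data reply", True),
    ],
    "reply forwarding": [
        ("requester", "home", "request", False),
        ("home", "owner", "intervention", False),
        ("owner", "requester", "data response", True),
        ("owner", "home", "revise directory", False),
    ],
}


def total_messages(name: str) -> int:
    return len(FLOWS[name])


def critical_path(name: str) -> int:
    """Number of sequential messages before the requester receives data."""
    for position, (_, _, _, delivers) in enumerate(FLOWS[name], start=1):
        if delivers:
            return position
    raise ValueError(f"flow {name!r} never delivers data")
```

Under these assumptions, reply forwarding puts three messages on the critical path, against four for the other two schemes.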

Serializing Writes for Coherence
- Potential problems: updates may be re-ordered by the network
- General solution: do not start the next write until the previous one has completed
- Strategies for buffering writes:
  - Buffer at home: requires more storage at the home node
  - Buffer at requestors: the request is forwarded to the previous requestor and a linked list is formed
  - NACK and retry: the home node nacks all requests until the outstanding request has completed
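Two of the strategies above can be contrasted in a short sketch. Both classes, their method names, and the return strings are hypothetical; the point is only where a second write request waits while the first is outstanding.

```python
from collections import deque


class NackingHome:
    """NACK and retry: the home keeps one busy flag per block and no queue."""

    def __init__(self):
        self.busy = set()

    def request_write(self, block, requester):
        if block in self.busy:
            return "NACK"            # requester must retry later
        self.busy.add(block)
        return "GRANT"

    def write_completed(self, block):
        self.busy.discard(block)


class BufferingHome:
    """Buffer at home: waiting requests are queued, costing storage at the home."""

    def __init__(self):
        self.active = {}             # block -> requester currently writing
        self.pending = {}            # block -> deque of waiting requesters

    def request_write(self, block, requester):
        if block in self.active:
            self.pending.setdefault(block, deque()).append(requester)
            return "QUEUED"
        self.active[block] = requester
        return "GRANT"

    def write_completed(self, block):
        """Grant the block to the next waiter, if any, and return its id."""
        queue = self.pending.get(block)
        if queue:
            nxt = queue.popleft()
            self.active[block] = nxt
            return nxt
        self.active.pop(block, None)
        return None
```

In both cases the next write starts only after `write_completed` is called, which is the general solution stated above; the difference is whether the waiting cost is retry traffic or queue storage.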
