CS 213 Lecture 11: Multiprocessor 3: Directory Organization

Example Directory Protocol (contd.)

Write miss: the block has a new owner. A message is sent to the old owner, causing its cache to send the value of the block to the directory, from which it is forwarded to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.

Cache-to-cache transfer: can occur on a remote read or write miss. Idea: transfer the block directly from the cache holding the exclusive copy to the requesting cache. Why go through the directory at all? Instead, inform the directory after the block has been transferred => 3 transfers over the interconnect instead of 4.
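A minimal C sketch of how a home directory might handle such a write miss under the strict request-reply protocol; the names (dir_entry, send_msg, the message types) are illustrative assumptions, not code from the lecture:

#include <stdint.h>
#include <stdio.h>

enum dir_state { UNCACHED, SHARED, EXCLUSIVE };
enum msg_type  { FETCH_INVALIDATE, DATA_REPLY };

struct dir_entry {
    enum dir_state state;
    uint64_t sharers;                 /* bit i set => node i holds a copy */
};

/* Stub: a real machine would inject the message into the interconnect. */
static void send_msg(int node, enum msg_type m)
{
    printf("msg %d -> node %d\n", (int)m, node);
}

/* Strict request-reply handling of a write miss on an Exclusive block:
 * requester -> home -> old owner -> home -> requester (4 transfers). */
static void handle_write_miss(struct dir_entry *e, int requester)
{
    if (e->state == EXCLUSIVE) {
        int old_owner = __builtin_ctzll(e->sharers);   /* single owner bit */
        send_msg(old_owner, FETCH_INVALIDATE);  /* owner sends block to home */
        /* ... home waits for the data, then forwards it to the requester ... */
    }
    e->sharers = 1ull << requester;   /* Sharers = identity of new owner */
    e->state   = EXCLUSIVE;
    send_msg(requester, DATA_REPLY);
}

int main(void)
{
    struct dir_entry e = { EXCLUSIVE, 1ull << 3 };  /* node 3 owns the block */
    handle_write_miss(&e, 5);                       /* node 5 write-misses */
    return 0;
}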

Basic Directory Transactions

Protocol Enhancements for Latency

Forwarding messages (memory-based protocols): an intervention is like a request, but it is issued in reaction to a request and is sent to a cache, rather than to memory.
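To make the savings concrete, a small C sketch (the message names are illustrative, not from the lecture) contrasting the strict request-reply path with intervention forwarding:

#include <stdio.h>

/* Hypothetical message printer standing in for the interconnect. */
static void hop(const char *from, const char *to, const char *what)
{
    printf("%s -> %s : %s\n", from, to, what);
}

int main(void)
{
    /* Strict request-reply: 4 network transactions. */
    hop("requester", "home",  "write miss");
    hop("home",      "owner", "fetch/invalidate");
    hop("owner",     "home",  "data writeback");
    hop("home",      "requester", "data reply");

    /* Intervention forwarding: 3 transactions on the critical path. */
    hop("requester", "home",  "write miss");
    hop("home",      "owner", "intervention (carries requester id)");
    hop("owner",     "requester", "data reply (home updated off the critical path)");
    return 0;
}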

Example: Processors P1 and P2, an interconnect, a directory, and memory; A1 and A2 map to the same cache block. The slides step through P2 writing 20 to A1, including the write-back of the old copy. (The step-by-step state tables from the animated slides are not reproduced here.)

Latency example: assume network latency = 25 cycles.

Reducing Storage Overhead

Optimizations for full-bit-vector schemes:
- increase the cache block size (reduces storage overhead proportionally)
- use multiprocessor nodes (one bit per node, not per processor)
- storage still scales as P*M, but is reasonable for all but very large machines: 256 processors, 4 per node, 128-byte lines => 6.25% overhead

Reducing "width": addressing the P term. Reducing "height": addressing the M term. (Think of the directory as a P-by-M bit matrix.)
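The 6.25% figure follows from one directory bit per node per block of 8 * block_bytes bits; a quick check in C:

#include <stdio.h>

int main(void)
{
    int procs = 256, procs_per_node = 4, block_bytes = 128;
    int nodes = procs / procs_per_node;              /* 64 bits per block */
    double overhead = (double)nodes / (8.0 * block_bytes);
    printf("overhead = %.2f%%\n", overhead * 100.0); /* prints 6.25% */
    return 0;
}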

Storage Reductions

Width observation: most blocks are cached by only a few nodes
- don't keep a bit per node; instead, each entry holds a few pointers to sharing nodes
- P = 1024 => 10-bit pointers; could use 100 pointers and still save space
- sharing patterns indicate a few pointers should suffice (five or so)
- need an overflow strategy for when there are more sharers

Height observation: number of memory blocks >> number of cache blocks
- most directory entries are useless at any given time
- organize the directory as a cache, rather than keeping one entry per memory block
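A struct-level sketch of both reductions; the sizes (NPTRS, the directory-cache capacity) are assumptions chosen for illustration:

#include <stdint.h>
#include <stdio.h>

#define NPTRS 5                  /* slides: about five pointers suffice */

struct limited_ptr_entry {       /* "width" reduction */
    uint16_t ptrs[NPTRS];        /* 10-bit node ids fit for P = 1024 */
    uint8_t  nptrs;              /* pointers in use; overflow handled elsewhere */
    uint8_t  state;
};

struct sparse_directory {        /* "height" reduction: directory as a cache */
    uint64_t tags[4096];         /* block-address tags; hypothetical capacity */
    struct limited_ptr_entry entries[4096];
};

int main(void)
{
    /* Compare against 1024 bits = 128 bytes per block for a full vector. */
    printf("limited-pointer entry: %zu bytes\n", sizeof(struct limited_ptr_entry));
    return 0;
}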

Insight into Directory Requirements

If most misses involve O(P) transactions, we might as well broadcast! => Study inherent program characteristics:
- frequency of write misses?
- how many sharers are there on a write miss?
- how do these scale?

This also provides insight into how to organize and store directory information.

Cache Invalidation Patterns (per-application distributions of invalidation sizes; the charts from these two slides are not reproduced here)

Sharing Patterns Summary

Generally, there are few sharers at a write, and the number scales slowly with P.
- Code and read-only objects (e.g., scene data in Raytrace): no problems, as they are rarely written
- Migratory objects (e.g., cost array cells in LocusRoute): even as the number of PEs scales, only 1-2 invalidations
- Mostly-read objects (e.g., the root of the tree in Barnes): invalidations are large but infrequent, so little impact on performance
- Frequently read/written objects (e.g., task queues): invalidations usually remain small, though frequent
- Synchronization objects: low-contention locks result in small invalidations; high-contention locks need special support (SW trees, queueing locks)

This implies directories are very useful in containing traffic: if organized properly, traffic and latency shouldn't scale too badly. It also suggests techniques to reduce storage overhead.

Overflow Schemes for Limited Pointers

Broadcast (Dir_i B): each entry holds i pointers. If more copies are needed (overflow), set a broadcast bit so that on a write the invalidation is broadcast to all processors.
- bad for widely shared, frequently read data

No-broadcast (Dir_i NB): don't allow more than i copies to be present at any time; on overflow, a new sharer replaces (invalidates) one of the old ones.
- bad for widely read data

Coarse vector (Dir_i CV): change the representation to a coarse vector, 1 bit per k nodes; on a write, invalidate all nodes that a set bit corresponds to.

Ref: Chaiken et al., "Directory-Based Cache Coherence in Large-Scale Multiprocessors," IEEE Computer, June 1990.
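A sketch of the Dir_i CV write-invalidation loop; the parameters (NODES, K) are assumptions for illustration:

#include <stdint.h>
#include <stdio.h>

#define NODES 64
#define K      8                           /* nodes per coarse-vector bit */

static void invalidate(int node) { printf("invalidate node %d\n", node); }

static void write_inval_coarse(uint8_t coarse_bits, int requester)
{
    for (int r = 0; r < NODES / K; r++)
        if (coarse_bits & (1u << r))
            for (int n = r * K; n < (r + 1) * K; n++)
                if (n != requester)
                    invalidate(n);         /* may hit non-sharers: imprecise */
}

int main(void)
{
    write_inval_coarse(0x05, 3);  /* regions 0 and 2 marked as sharing */
    return 0;
}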

Overflow Schemes (contd.)

Software (Dir_i SW): trap to software and use any number of pointers (no precision loss)
- MIT Alewife: 5 pointers, plus one bit for the local node
- but extra cost of interrupt processing in software: processor overhead and occupancy
- latency: 40 to 425 cycles for a remote read in Alewife; 84 cycles for 5 invalidations, 707 for 6

Dynamic pointers (Dir_i DP): use pointers from a hardware free list kept in a portion of memory
- manipulation done by a hardware assist, not software
- e.g., Stanford FLASH
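A minimal sketch of the Dir_i DP idea: overflow sharers kept as a linked list of pointer cells drawn from a free list. The cell count and layout are assumptions, and the sketch omits exhaustion handling:

#include <stdio.h>

#define CELLS 1024

struct ptr_cell { int node; int next; };   /* next: cell index, -1 ends the list */

static struct ptr_cell pool[CELLS];
static int free_head = 0;

static void init_pool(void)
{
    for (int i = 0; i < CELLS; i++) { pool[i].node = -1; pool[i].next = i + 1; }
    pool[CELLS - 1].next = -1;
}

/* Prepend a sharer to an entry's overflow list (no exhaustion check here). */
static int add_sharer(int list_head, int node)
{
    int cell = free_head;
    free_head = pool[cell].next;
    pool[cell].node = node;
    pool[cell].next = list_head;
    return cell;                            /* new head of the list */
}

int main(void)
{
    init_pool();
    int head = -1;
    head = add_sharer(head, 7);
    head = add_sharer(head, 12);
    for (int c = head; c != -1; c = pool[c].next)
        printf("sharer node %d\n", pool[c].node);
    return 0;
}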

Some Data

(Measured invalidation traffic: 64 processors, 4 pointers, normalized to the full-bit-vector scheme. The coarse vector is quite robust; the chart is not reproduced here.)

General conclusions:
- the full bit vector is simple and good for moderate scale
- several schemes should be fine for large scale

Summary of Directory Organizations

Flat schemes:
- Issue (a): finding the source of the directory data: go to the home node, based on the address
- Issue (b): finding out where the copies are:
  - memory-based: all the info is in the directory at the home
  - cache-based: the home has a pointer to the first element of a distributed linked list
- Issue (c): communicating with those copies:
  - memory-based: point-to-point messages (perhaps coarser on overflow); can be multicast or overlapped
  - cache-based: part of the point-to-point linked-list traversal used to find them; serialized

Hierarchical schemes:
- all three issues are handled by sending messages up and down a tree
- no single explicit list of sharers
- the only direct communication is between parents and children

Summary of Directory Approaches

- Directories offer scalable coherence on general networks: no need for a broadcast medium
- Many possibilities for organizing the directory and managing protocols
- Hierarchical directories are not used much: high latency, many network transactions, and a bandwidth bottleneck at the root
- Both memory-based and cache-based flat schemes are alive
  - for memory-based, a full bit vector suffices for moderate scale (measured in nodes visible to the directory protocol, not processors)
  - we will examine case studies of each

Summary

- Caches contain all the information on the state of cached memory blocks
- Snooping and directory protocols are similar; a bus makes snooping easier because of broadcast (snooping => uniform memory access)
- A directory adds an extra data structure to keep track of the state of all cached blocks
- Distributing the directory => a scalable shared-address multiprocessor => cache coherent, non-uniform memory access (NUMA)