LimitLESS Directories: A Scalable Cache Coherence Scheme. By: David Chaiken, John Kubiatowicz, Anant Agarwal.

Presentation transcript:


Cache Coherence
The gap between the computing power of microprocessors and that of the largest supercomputers is shrinking, while the price/performance advantage of microprocessors is increasing. Caches enhance the performance of multiprocessors by reducing network traffic and average memory access time. Cache coherence problems arise because multiple processors may be reading and modifying the same memory block within their own caches. Common solutions: snoopy coherence, directory-based coherence, and compiler-directed coherence.

Directory (Full-Map)
The message-based protocols allocate a section of the system's memory to a directory. Each block of memory has an associated directory entry, which contains a bit for each cache in the system. Each bit indicates whether or not the associated cache contains a copy of the memory block.

Directory-Based Coherence
The basic concept is that a processor must ask for permission to load an entry from primary memory into its cache. When an entry is changed, the directory must be notified, either before the change is initiated or when it is complete. The directory then either updates or invalidates the other caches holding that entry.

Protocol messages for hardware coherence (Type | Symbol | Name | Data?; ** marks messages that carry data):

Cache to Memory:
  RREQ: Read Request
  WREQ: Write Request
  REPM: Replace Modified **
  UPDATE: Update **
  ACKC: Invalidate Ack.
Memory to Cache:
  RDATA: Read Data **
  WDATA: Write Data **
  INV: Invalidate
  BUSY: Busy Signal

Directory states (Component | Name | Meaning):

Memory:
  Read-Only: some number of caches have read-only copies of the data.
  Read-Write: exactly one cache has a read-write copy of the data.
  Read-Transaction: holding a read request; an update is in progress.
  Write-Transaction: holding a write request; invalidation is in progress.
Cache:
  Invalid: the cache block may not be read or written.
  Read-Only: the cache block may be read, but not written.
  Read-Write: the cache block may be read or written.

Annotation of the state transition diagram (Transition label: input message; precondition; directory entry change; output message(s)):

  1: i->RREQ; --; P = P ∪ {i}; RDATA -> i
  2: i->WREQ; P = {i} or P = {}; P = {i}; WDATA -> i
  3: i->WREQ; P = {k1,...,kn} ∧ i ∉ P; P = {i}, AckCtr = n; INV -> kj, ∀kj
     i->WREQ; P = {k1,...,kn} ∧ i ∈ P; P = {i}, AckCtr = n-1; INV -> kj, ∀kj ≠ i
  4: j->WREQ; P = {i}; P = {j}, AckCtr = 1; INV -> i
  5: j->RREQ; P = {i}; P = {j}, AckCtr = 1; INV -> i
  6: i->REPM; P = {i}; P = {}; --
  7: j->RREQ or j->WREQ; --; --; BUSY -> j
     j->ACKC; AckCtr ≠ 1; AckCtr = AckCtr - 1; --
     j->REPM; --; AckCtr = AckCtr - 1; --
  8: j->ACKC or j->UPDATE; AckCtr = 1, P = {i}; P = {i}, AckCtr = 0; WDATA -> i
  9: j->RREQ or j->WREQ; --; --; BUSY -> j
     j->REPM; --; --; --
     j->UPDATE or j->ACKC; P = {i}; AckCtr = 0; RDATA -> i
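
As a concrete illustration, here is a minimal C sketch of how a home-node directory might implement transitions 1 through 3 above. The message names follow the table; the struct layout, helper functions, and cache count are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

#define N_CACHES 64

typedef enum { READ_ONLY, READ_WRITE, READ_TRANS, WRITE_TRANS } dir_state_t;

typedef struct {
    dir_state_t state;     /* directory state from the table above    */
    uint64_t    sharers;   /* P: one presence bit per cache           */
    int         ack_ctr;   /* AckCtr: invalidations still outstanding */
} dir_entry_t;

/* Stand-ins for the controller's packet-send hardware. */
static void send_rdata(int dst) { printf("RDATA -> %d\n", dst); }
static void send_wdata(int dst) { printf("WDATA -> %d\n", dst); }
static void send_inv(int dst)   { printf("INV   -> %d\n", dst); }

/* Transition 1: cache i asks for a read-only copy. */
static void on_rreq(dir_entry_t *e, int i) {
    if (e->state == READ_ONLY) {
        e->sharers |= 1ULL << i;          /* P = P U {i} */
        send_rdata(i);                    /* RDATA -> i  */
    } /* transitions 5, 7, 9 for the other states are omitted */
}

/* Transitions 2 and 3: cache i asks for a writable copy. */
static void on_wreq(dir_entry_t *e, int i) {
    uint64_t others = e->sharers & ~(1ULL << i);
    if (others == 0) {                    /* transition 2: P = {} or {i} */
        e->sharers = 1ULL << i;
        e->state = READ_WRITE;
        send_wdata(i);                    /* WDATA -> i */
    } else {                              /* transition 3 */
        e->ack_ctr = 0;
        for (int k = 0; k < N_CACHES; k++)
            if (others & (1ULL << k)) { send_inv(k); e->ack_ctr++; }
        e->sharers = 1ULL << i;           /* P = {i}, AckCtr = n or n-1  */
        e->state = WRITE_TRANS;           /* ACKCs finish it (transition 8) */
    }
}

int main(void) {
    dir_entry_t e = { READ_ONLY, 0, 0 };
    on_rreq(&e, 2);   /* cache 2 reads  */
    on_rreq(&e, 5);   /* cache 5 reads  */
    on_wreq(&e, 5);   /* cache 5 writes: INV -> 2, AckCtr = 1 */
    return 0;
}
```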

Directory-Based Coherence: Full-Map Directory Entry
(Diagram: a full-map directory entry, shown in the Read-Only state, holds a state field plus one presence bit for each of the N caches.)
Advantages? No broadcast is necessary. Disadvantages? Coherence traffic is high, since all requests go through the directory, and the per-cache presence bits create a great need for memory.

Directory-Based Coherence: Limited Directory Entry
(Diagram: a limited directory entry, shown in the Read-Only state, holds a state field plus a small fixed number of node-ID pointers, four in the figure.)
Advantages? Its performance is comparable to that of a full-map scheme in cases where there is limited sharing of data between processors, and it is cheaper to implement. Disadvantages? The protocol is susceptible to thrashing when the number of processors sharing data exceeds the number of pointers in the directory entry.
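
To make the memory trade-off concrete, the following C sketch contrasts the two entry layouts; the field names, cache count, and pointer count are illustrative assumptions, not figures from the paper.

```c
#include <stdio.h>
#include <stdint.h>

#define N_CACHES 256  /* caches in the system (illustrative)        */
#define N_PTRS   4    /* pointers per limited entry (illustrative)  */

/* Full-map: one presence bit per cache, so the entry grows linearly
 * with the number of caches in the machine. */
typedef struct {
    uint8_t state;
    uint8_t presence[N_CACHES / 8];
} full_map_entry_t;

/* Limited: a fixed number of node-ID pointers, so the entry grows only
 * with log2(N_CACHES); at most N_PTRS sharers can be tracked, which is
 * what makes the protocol thrash under wide sharing. */
typedef struct {
    uint8_t  state;
    uint8_t  n_valid;
    uint16_t node_id[N_PTRS];
} limited_entry_t;

int main(void) {
    printf("full-map entry: %zu bytes, limited entry: %zu bytes\n",
           sizeof(full_map_entry_t), sizeof(limited_entry_t));
    return 0;
}
```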

LimitLESS (Limited directory, Locally Extended through Software Support)
The LimitLESS scheme combines the full-map and limited-directory ideas in order to achieve a robust yet affordable and scalable cache coherence solution. The main idea behind this method is to handle the common case in hardware and the exceptional case in software: a limited directory implemented in hardware keeps track of a fixed number of cached copies of each memory block, and when the capacity of a directory entry is exceeded, the directory interrupts the local processor, which emulates a full-map directory in software.
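
A minimal sketch of this hardware/software split in C, assuming a hypothetical trap interface; the real Alewife controller queues the offending packet and traps the local processor, which is simplified here to a function call.

```c
#include <stdio.h>

#define N_PTRS 4   /* hardware pointers per entry (illustrative) */

typedef struct {
    int node_id[N_PTRS];
    int n_valid;
    int overflowed;   /* set once the software bit-vector takes over */
} limitless_entry_t;

/* Stand-in for the controller raising a trap on the local processor. */
static void trap_to_software(limitless_entry_t *e, int requester) {
    printf("overflow: software records reader %d\n", requester);
    e->overflowed = 1;
}

/* Common case in hardware, exceptional case in software. */
static void on_read_request(limitless_entry_t *e, int requester) {
    if (!e->overflowed && e->n_valid < N_PTRS)
        e->node_id[e->n_valid++] = requester;  /* pure-hardware fast path      */
    else
        trap_to_software(e, requester);        /* emulate full map in software */
}

int main(void) {
    limitless_entry_t e = { {0}, 0, 0 };
    for (int r = 0; r < 6; r++)   /* readers 4 and 5 overflow the entry */
        on_read_request(&e, r);
    return 0;
}
```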

Architectural Features of LimitLESS
Alewife is a large-scale multiprocessor with distributed shared memory and a cost-effective mesh network for communication. An Alewife node consists of a 33MHz SPARCLE processor, 64K bytes of direct-mapped cache, 4M bytes of globally-shared main memory, and a floating-point coprocessor.

Architectural Features of LimitLESS
The processor must be capable of rapid trap handling (five to ten cycles), which calls for a rapid context-switching processor and a finely-tuned software trap architecture. The processor needs complete access to coherence-related controller state, and the directory controller must be able to invoke processor trap handlers when necessary. Finally, an interface to the network must allow the processor to launch and to intercept coherence protocol packets: the IPI (Interprocessor-Interrupt) interface.
(Diagram: the processor and the controller communicate over condition bits, trap lines, a data bus, and an address bus.)

Architectural Features of LimitLESS
The IPI provides a superset of the network functionality. It is used to send and receive cache protocol packets, and to send preemptive messages to remote processors. In the network packet structure, the opcode field holds either a protocol opcode, for cache coherence traffic, or an interrupt opcode, marked by setting the most significant bit, for interprocessor messages. To transmit an IPI packet, the processor enqueues the request on the IPI output queue; on reception, the controller places the packet in the IPI input queue. IPI input traps are synchronous.
(Packet format: source processor, packet length, opcode, operands 1 through m-1, data words 1 through n-1.)
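
The packet format above might be modeled as the following C struct; the field widths and the operand/data counts are assumptions, with only the field order and the opcode convention taken from the slide.

```c
#include <stdio.h>
#include <stdint.h>

#define IPI_INTERRUPT_BIT 0x8000u  /* MSB of the opcode marks an
                                      interprocessor message */

/* Hedged layout of an IPI packet; widths and counts (m = 4, n = 5)
 * are illustrative, not Alewife's actual sizes. */
typedef struct {
    uint16_t source;      /* source processor                     */
    uint16_t length;      /* packet length                        */
    uint16_t opcode;      /* protocol opcode or interrupt opcode  */
    uint32_t operand[3];  /* operands 1 .. m-1                    */
    uint32_t data[4];     /* data words 1 .. n-1                  */
} ipi_packet_t;

int main(void) {
    ipi_packet_t p = { .source = 7, .length = 9, .opcode = 0x8001 };
    printf("packet from %u is %s\n", p.source,
           (p.opcode & IPI_INTERRUPT_BIT) ? "an interprocessor message"
                                          : "cache coherence traffic");
    return 0;
}
```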

Meta States & Trap Handler

Meta states:
Normal: the directory is handled by hardware; the worker-sets of such blocks are no larger than the number of hardware pointers.
Trans-In-Progress: entered when a packet is passed to software (by placing it in the IPI input queue); the controller blocks all future packets for the associated memory block; cleared after the packet has been processed.
Trap-On-Write: WREQ, UPDATE, and REPM packets trap. Read requests are handled as usual, while write requests are forwarded to the IPI input queue, after which the directory mode is changed to Trans-In-Progress.
Trap-Always: all incoming packets are passed to the processor, and the directory mode is changed to Trans-In-Progress.

Trap handler:
First-time overflow: the trap code allocates a full-map bit-vector in local memory, empties all hardware pointers and sets the corresponding bits in the vector, and sets the directory mode to Trap-On-Write before the trap returns.
Additional overflow: empty all hardware pointers and set the corresponding bits in the vector.
Termination (on WREQ or a local write fault): empty all hardware pointers, record the identity of the requester in the directory, set AckCtr to the number of bits in the vector that are set, place the directory in Normal mode, Write-Transaction state, and invalidate all caches whose bit is set in the vector.
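
The first-time-overflow and termination paths can be sketched in C as follows; the bit-vector representation and the controller interface are hypothetical simplifications of Alewife's actual trap code.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define N_CACHES 64
#define N_PTRS   4

typedef enum { MODE_NORMAL, MODE_TRAP_ON_WRITE, MODE_TRANS_IN_PROGRESS } dir_mode_t;

typedef struct {
    dir_mode_t mode;
    int        ptr[N_PTRS];           /* hardware pointers            */
    int        n_ptr;
    uint8_t    vector[N_CACHES / 8];  /* software full-map bit-vector */
    int        ack_ctr;
} sw_dir_t;

/* First-time overflow: move the hardware pointers into a freshly
 * initialized software bit-vector, then let hardware absorb reads. */
static void overflow_trap(sw_dir_t *d, int new_reader) {
    memset(d->vector, 0, sizeof d->vector);        /* "allocate" the vector   */
    for (int k = 0; k < d->n_ptr; k++)
        d->vector[d->ptr[k] / 8] |= 1 << (d->ptr[k] % 8);
    d->vector[new_reader / 8] |= 1 << (new_reader % 8);
    d->n_ptr = 0;                                  /* empty hardware pointers */
    d->mode = MODE_TRAP_ON_WRITE;                  /* trap again on writes    */
}

/* Termination on a WREQ: count the sharers, hand an AckCtr back to the
 * hardware, and let the Write-Transaction machinery finish. */
static void write_trap(sw_dir_t *d, int writer) {
    int n = 0;
    for (int c = 0; c < N_CACHES; c++)
        if (d->vector[c / 8] & (1 << (c % 8))) { n++; /* send INV -> c */ }
    d->n_ptr = 1;
    d->ptr[0] = writer;       /* record the requester's identity      */
    d->ack_ctr = n;           /* AckCtr = number of bits set          */
    d->mode = MODE_NORMAL;    /* Normal mode, Write-Transaction state */
}

int main(void) {
    sw_dir_t d = { MODE_NORMAL, {1, 2, 3, 4}, N_PTRS, {0}, 0 };
    overflow_trap(&d, 9);   /* a 5th reader overflows the 4 pointers */
    write_trap(&d, 9);      /* reader 9 now writes: AckCtr = 5       */
    printf("mode=%d ack_ctr=%d owner=%d\n", d.mode, d.ack_ctr, d.ptr[0]);
    return 0;
}
```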

Conclusion
This paper proposed a new scheme for cache coherence, called LimitLESS, which is being implemented in the Alewife machine. The hardware requirements include rapid trap handling and a flexible processor interface to the network. Preliminary simulation results indicate that the LimitLESS scheme approaches the performance of a full-map directory protocol with the memory efficiency of a limited directory protocol. Furthermore, the LimitLESS scheme provides a migration path toward a future in which cache coherence is handled entirely in software.