LimitLESS Directories: A Scalable Cache Coherence Scheme. By: David Chaiken, John Kubiatowicz, Anant Agarwal.


1 LimitLESS Directories: A Scalable Cache Coherence Scheme
By: David Chaiken, John Kubiatowicz, Anant Agarwal

2 Cache Coherence
- The gap between the computing power of microprocessors and that of the largest supercomputers is shrinking, while the price/performance advantage of microprocessors is increasing.
- Caches enhance the performance of multiprocessors by reducing network traffic and average memory access time.
- Cache coherence problems arise because multiple processors may be reading and modifying the same memory block within their own caches.
- Common solutions:
  - Snoopy coherence
  - Directory-based coherence
  - Compiler-directed coherence

3 Directory (Full-map)
- Message-based protocols allocate a section of the system's memory, called a directory.
- Each block of memory has an associated directory entry, which contains one bit for each cache in the system.
- That bit indicates whether or not the associated cache contains a copy of the memory block.
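As a rough illustration of the entry described on this slide, the following sketch models one full-map directory entry as a state tag plus a bit vector with one presence bit per cache. The class and method names (`FullMapEntry`, `add_sharer`, and so on) are invented for this example and do not come from the paper.

```python
class FullMapEntry:
    """One directory entry: a state tag plus one presence bit per cache."""

    def __init__(self, num_caches: int) -> None:
        self.num_caches = num_caches
        self.state = "Read-Only"
        self.presence = 0  # bit i is set when cache i holds a copy

    def _check(self, cache_id: int) -> None:
        if not 0 <= cache_id < self.num_caches:
            raise ValueError(f"cache id {cache_id} out of range")

    def add_sharer(self, cache_id: int) -> None:
        self._check(cache_id)
        self.presence |= 1 << cache_id

    def remove_sharer(self, cache_id: int) -> None:
        self._check(cache_id)
        self.presence &= ~(1 << cache_id)

    def has_copy(self, cache_id: int) -> bool:
        self._check(cache_id)
        return bool((self.presence >> cache_id) & 1)

    def sharers(self) -> list:
        """Return the ids of all caches whose presence bit is set."""
        return [i for i in range(self.num_caches) if (self.presence >> i) & 1]


entry = FullMapEntry(num_caches=8)
entry.add_sharer(1)
entry.add_sharer(5)
print(entry.sharers())
```

Because the entry stores exactly one bit per cache, looking up or updating a sharer is a single bit operation, but the entry width grows with the number of caches in the machine.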

4 Directory-based Coherence
- The basic concept is that a processor must ask for permission to load an entry from primary memory into its cache.
- When an entry is changed, the directory must be notified either before the change is initiated or when it is complete.
- When an entry is changed, the directory either updates or invalidates the other caches that hold that entry.

5 Protocol messages for hardware coherence

Type            | Symbol | Name            | Data?
Cache to Memory | RREQ   | Read Request    |
                | WREQ   | Write Request   |
                | REPM   | Replace Modified | yes
                | UPDATE | Update          | yes
                | ACKC   | Invalidate Ack. |
Memory to Cache | RDATA  | Read Data       | yes
                | WDATA  | Write Data      | yes
                | INV    | Invalidate      |
                | BUSY   | Busy Signal     |

Directory states

Component | Name              | Meaning
Memory    | Read-Only         | Some number of caches have read-only copies of the data
          | Read-Write        | Exactly one cache has a read-write copy of the data
          | Read-Transaction  | Holding read request, update is in progress
          | Write-Transaction | Holding write request, invalidation is in progress
Cache     | Invalid           | Cache block may not be read or written
          | Read-Only         | Cache block may be read, but not written
          | Read-Write        | Cache block may be read or written

Annotation of the state transition diagram

Transition Label | Input Message | Precondition              | Directory Entry Change   | Output Message(s)
1                | i -> RREQ     | --                        | P = P ∪ {i}              | RDATA -> i
2                | i -> WREQ     | P = {i}                   | --                       | WDATA -> i
                 | i -> WREQ     | P = {}                    | P = {i}                  | WDATA -> i
3                | i -> WREQ     | P = {k1,...,kn} ∧ i ∉ P   | P = {i}, AckCtr = n      | ∀kj: INV -> kj
                 | i -> WREQ     | P = {k1,...,kn} ∧ i ∈ P   | P = {i}, AckCtr = n-1    | ∀kj ≠ i: INV -> kj
4                | j -> WREQ     | P = {i}                   | P = {j}, AckCtr = 1      | INV -> i
5                | j -> RREQ     | P = {i}                   | P = {j}, AckCtr = 1      | INV -> i
6                | i -> REPM     | P = {i}                   | P = {}                   | --
7                | j -> RREQ     | --                        | --                       | BUSY -> j
                 | j -> WREQ     | --                        | --                       | BUSY -> j
                 | j -> ACKC     | AckCtr ≠ 1                | AckCtr = AckCtr - 1      | --
                 | j -> REPM     | --                        | --                       | --
8                | j -> ACKC     | AckCtr = 1, P = {i}       | AckCtr = 0               | WDATA -> i
                 | j -> UPDATE   | P = {i}                   | AckCtr = 0               | WDATA -> i
9                | j -> RREQ     | --                        | --                       | BUSY -> j
                 | j -> WREQ     | --                        | --                       | BUSY -> j
                 | j -> REPM     | --                        | --                       | --
10               | j -> UPDATE   | P = {i}                   | AckCtr = 0               | RDATA -> i
                 | j -> ACKC     | P = {i}                   | AckCtr = 0               | RDATA -> i
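To make the transition annotations above more concrete, here is a minimal sketch of the memory-side directory for transitions 1, 2, 3, 7, and 8 only. The table lists directory entry changes but not the destination states, so the state assignments in the code (for example, that transition 3 enters Write-Transaction and transition 8 enters Read-Write) are an assumption based on the state names. `DirectoryEntry` and `handle` are illustrative names, not part of the protocol.

```python
class DirectoryEntry:
    """Memory-side directory entry covering a subset of the transitions."""

    def __init__(self) -> None:
        self.state = "Read-Only"
        self.pointers = set()  # the pointer set P in the table
        self.ack_ctr = 0       # AckCtr in the table

    def handle(self, msg: str, src: int) -> list:
        """Process one incoming message; return (message, destination) pairs."""
        if self.state == "Read-Only":
            if msg == "RREQ":                        # transition 1
                self.pointers.add(src)
                return [("RDATA", src)]
            if msg == "WREQ":
                others = sorted(self.pointers - {src})
                self.pointers = {src}
                if not others:                       # transition 2
                    self.state = "Read-Write"
                    return [("WDATA", src)]
                self.ack_ctr = len(others)           # transition 3: n or n-1
                self.state = "Write-Transaction"
                return [("INV", k) for k in others]
        elif self.state == "Write-Transaction":
            if msg in ("RREQ", "WREQ"):              # transition 7
                return [("BUSY", src)]
            if msg == "ACKC":
                self.ack_ctr -= 1
                if self.ack_ctr == 0:                # transition 8
                    self.state = "Read-Write"
                    (owner,) = self.pointers
                    return [("WDATA", owner)]
                return []                            # transition 7
        raise NotImplementedError(f"{msg} in state {self.state} is not modeled")


d = DirectoryEntry()
d.handle("RREQ", 1)
d.handle("RREQ", 2)
print(d.handle("WREQ", 1))   # node 1 already shares the block, so only node 2 is invalidated
print(d.handle("ACKC", 2))   # last acknowledgment releases the write data
```

Computing the invalidation targets as the pointer set minus the requester yields AckCtr = n when the requester is not a sharer and AckCtr = n-1 when it is, which matches both rows of transition 3.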

6 Directory-based Coherence: Full-Map Directory Entry
- Advantages?
  - No broadcast is necessary.
- Disadvantages?
  - Coherence traffic is high because every request goes through the directory.
  - The memory requirement is large: each entry needs one bit per cache.
[Figure: a full-map directory entry, made up of a State field (here Read-Only) followed by one presence bit for each of the nodes 1, 2, 3, ..., N; set bits are marked with an x.]
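The memory cost of a full-map entry can be estimated with a little arithmetic. The sketch below computes the ratio of directory bits to data bits for a single memory block. The 16-byte block size and the 2 state bits are assumptions chosen for illustration; the slides do not give either value.

```python
def full_map_overhead(num_nodes: int, block_bytes: int, state_bits: int = 2) -> float:
    """Directory bits stored per bit of data, for one memory block."""
    entry_bits = num_nodes + state_bits   # one presence bit per node, plus state
    return entry_bits / (block_bytes * 8)


# Overhead for a few machine sizes, with an assumed 16-byte block.
for n in (16, 64, 256, 1024):
    print(f"{n:5d} nodes: {full_map_overhead(n, block_bytes=16):.2f} directory bits per data bit")
```

Since the entry width grows linearly with the number of nodes, and the number of entries also grows with total memory, the total directory storage grows roughly quadratically with machine size.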

7 Directory-based Coherence: Limited Directory Entry
- Advantages?
  - Its performance is comparable to that of a full-map scheme when there is limited sharing of data between processors.
  - It is cheaper to implement.
- Disadvantages?
  - The protocol is susceptible to thrashing when the number of processors sharing data exceeds the number of pointers in the directory entry.
[Figure: a limited directory entry, made up of a State field (here Read-Only) followed by four Node ID pointers holding 12, 10, 13, and 23.]
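The thrashing behavior mentioned above can be seen in a small simulation. This sketch assumes a four-pointer entry and a first-in, first-out eviction policy; the slides do not say which pointer a limited directory evicts, so the FIFO choice and the names `LimitedEntry` and `count_invalidations` are assumptions made for the example.

```python
class LimitedEntry:
    """Directory entry with a fixed number of node-ID pointers."""

    def __init__(self, num_pointers: int) -> None:
        self.num_pointers = num_pointers
        self.pointers = []  # node ids, oldest first

    def add_sharer(self, node_id: int):
        """Record a reader; return the node that must be invalidated, if any."""
        if node_id in self.pointers:
            return None
        victim = None
        if len(self.pointers) == self.num_pointers:
            victim = self.pointers.pop(0)   # assumed FIFO eviction
        self.pointers.append(node_id)
        return victim


def count_invalidations(num_pointers: int, readers: list, rounds: int) -> int:
    """Let each reader read the block once per round; count forced invalidations."""
    entry = LimitedEntry(num_pointers)
    invalidations = 0
    for _ in range(rounds):
        for node in readers:
            if entry.add_sharer(node) is not None:
                invalidations += 1
    return invalidations


print(count_invalidations(4, [12, 10, 13, 23], rounds=10))      # worker-set fits
print(count_invalidations(4, [12, 10, 13, 23, 7], rounds=10))   # one reader too many
```

With four readers the entry never overflows. Adding a fifth reader makes nearly every read evict another reader's copy, even though no processor ever writes the block.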

8 LimitLESS (Limited directory Locally Extended through Software Support)
- The LimitLESS scheme attempts to combine the full-map and limited directory ideas in order to achieve a robust yet affordable and scalable cache coherence solution.
- The main idea behind this method is to handle the common case in hardware and the exceptional case in software.
- Limited directories implemented in hardware keep track of a fixed number of cached copies of each memory block. When the capacity of a directory entry is exceeded, the directory interrupts the local processor and a full-map directory is emulated in software.

9 Architectural Features: LimitLESS
- Alewife is a large-scale multiprocessor with distributed shared memory and a cost-effective mesh network for communication.
- An Alewife node consists of a 33 MHz SPARCLE processor, 64 Kbytes of direct-mapped cache, 4 Mbytes of globally shared main memory, and a floating-point coprocessor.

10 Architectural Features: LimitLESS
- The processor must be capable of rapid trap handling (five to ten cycles):
  - A rapid context-switching processor
  - A finely tuned software trap architecture
- The processor needs complete access to coherence-related controller state.
- The directory controller must be able to invoke processor trap handlers when necessary.
- An interface to the network must allow the processor to launch and to intercept coherence protocol packets:
  - IPI (Interprocessor-Interrupt)
[Figure: the Processor and the Controller, connected by Condition Bits, Trap Lines, a Data Bus, and an Address Bus.]

11 Architectural Features: LimitLESS
- IPI provides a superset of the network functionality:
  - Used to send and receive cache protocol packets
  - Used to send preemptive messages to remote processors
- Network packet structure:
  - Protocol opcode: for cache coherence traffic
  - Interrupt opcode: the most significant bit is set; for interprocessor messages
- Transmission of IPI packets: enqueue the request on the IPI output queue.
- Reception of IPI packets: place the packet in the IPI input queue.
- IPI input traps are synchronous.
[Figure: packet layout. A header holding Source Processor, Packet Length, and Opcode, followed by Operand 1, Operand 2, ..., Operand m-1, and then Data Word 1, Data Word 2, ..., Data Word n-1.]
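The packet layout and opcode rule on this slide can be sketched as a small data structure. The 8-bit opcode width, the field types, and the way `length` counts words are all assumptions made for illustration, since the slide gives only the field names. `Packet` and `receive` are hypothetical names, and `receive` models only the opcode-based routing, not the case where a directory meta state forces a protocol packet into software.

```python
from collections import deque
from dataclasses import dataclass, field

OPCODE_BITS = 8                           # assumed opcode width
INTERRUPT_BIT = 1 << (OPCODE_BITS - 1)    # most significant bit of the opcode


@dataclass
class Packet:
    source: int                           # source processor
    opcode: int
    operands: list = field(default_factory=list)
    data: list = field(default_factory=list)

    @property
    def length(self) -> int:
        # Assumption: one header word, then operands, then data words.
        return 1 + len(self.operands) + len(self.data)

    def is_interrupt(self) -> bool:
        """Interrupt opcodes have the most significant bit set."""
        return bool(self.opcode & INTERRUPT_BIT)


def receive(packet: Packet, ipi_input_queue: deque, handle_in_hardware) -> str:
    """Route an incoming packet by opcode type."""
    if packet.is_interrupt():
        ipi_input_queue.append(packet)    # processor reads it in a trap handler
        return "trap"
    handle_in_hardware(packet)            # ordinary coherence traffic
    return "hardware"


queue = deque()
handled = []
print(receive(Packet(source=3, opcode=INTERRUPT_BIT | 0x05, operands=[1, 2], data=[9]), queue, handled.append))
print(receive(Packet(source=3, opcode=0x05), queue, handled.append))
```

Using one bit of the opcode to separate the two packet classes lets the controller decide where a packet goes without inspecting its operands.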

12 Meta States & Trap Handler

Meta states

Meta State        | Description
Normal            | Directory handled by hardware. The worker-sets of such blocks are no larger than the number of hardware pointers.
Trans-In-Progress | Entered when a packet is passed to software (by placing it in the IPI input queue). The controller blocks all future packets for the associated memory block. Cleared after the packet has been processed.
Trap-On-Write     | Traps on WREQ, UPDATE, and REPM. Read requests are handled as usual. Write requests are forwarded to the IPI input queue. After packets are forwarded, the directory mode is changed to Trans-In-Progress.
Trap-Always       | All incoming packets are passed to the processor, then the directory mode is changed to Trans-In-Progress.

Trap handler
- First-time overflow:
  - The trap code allocates a full-map bit vector in local memory.
  - Empty all hardware pointers and set the corresponding bits in the vector.
  - The directory mode is set to Trap-On-Write before the trap returns.
- Additional overflow:
  - Empty all hardware pointers and set the corresponding bits in the vector.
- Termination (on WREQ or local write fault):
  - Empty all hardware pointers.
  - Record the identity of the requester in the directory.
  - Set AckCtr to the number of bits in the vector that are set.
  - Place the directory in Normal mode, Write-Transaction state.
  - Invalidate all caches whose bit is set in the vector.
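The overflow and termination steps listed above can be sketched as follows. This is a sequential model under several assumptions: the Trans-In-Progress and Trap-Always meta states are not modeled because no packets arrive concurrently, the requester is excluded from the invalidation set, the bit vector is released after termination, and reads are assumed to arrive only while the block is in the Read-Only state. `LimitLessEntry` and its methods are illustrative names, not the authors' trap code.

```python
class LimitLessEntry:
    """Hardware pointers plus a software bit vector for overflow."""

    def __init__(self, num_pointers: int, num_nodes: int) -> None:
        self.num_pointers = num_pointers
        self.num_nodes = num_nodes
        self.meta = "Normal"      # meta state (directory mode)
        self.state = "Read-Only"
        self.pointers = set()     # hardware pointers
        self.vector = None        # full-map bit vector kept in software
        self.ack_ctr = 0
        self.traps = 0            # number of times software was invoked

    def _empty_pointers(self) -> None:
        """Move every hardware pointer into the software bit vector."""
        if self.vector is None:                   # first-time overflow
            self.vector = [False] * self.num_nodes
        for node in self.pointers:
            self.vector[node] = True
        self.pointers.clear()

    def read(self, node: int) -> list:
        """Handle a read request (RREQ) from `node`."""
        overflow = (node not in self.pointers
                    and len(self.pointers) == self.num_pointers)
        if overflow:                              # trap to software
            self.traps += 1
            self._empty_pointers()
            self.meta = "Trap-On-Write"
        self.pointers.add(node)
        return [("RDATA", node)]

    def write(self, node: int) -> list:
        """Handle a write request (WREQ) from `node`."""
        if self.meta == "Trap-On-Write":          # termination, in software
            self.traps += 1
            self._empty_pointers()
            sharers = {n for n, bit in enumerate(self.vector) if bit}
            self.vector = None                    # sketch choice: release the vector
            self.meta = "Normal"
        else:                                     # common case, in hardware
            sharers = set(self.pointers)
        targets = sorted(sharers - {node})        # assumption: requester is not invalidated
        self.pointers = {node}                    # record the requester
        self.ack_ctr = len(targets)
        if not targets:
            self.state = "Read-Write"
            return [("WDATA", node)]
        self.state = "Write-Transaction"
        return [("INV", n) for n in targets]


entry = LimitLessEntry(num_pointers=4, num_nodes=32)
for reader in (12, 10, 13, 23, 7):
    entry.read(reader)
print(entry.meta, entry.traps)
print(entry.write(10))
```

In this model the first four reads never leave the hardware path. Only the fifth read and the terminating write invoke software, which reflects the idea of handling the common case in hardware and the exceptional case in software.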

13 Conclusion
- This paper proposed a new scheme for cache coherence, called LimitLESS, which is being implemented in the Alewife machine.
- Hardware requirements include rapid trap handling and a flexible processor interface to the network.
- Preliminary simulation results indicate that the LimitLESS scheme approaches the performance of a full-map directory protocol with the memory efficiency of a limited directory protocol.
- Furthermore, the LimitLESS scheme provides a migration path toward a future in which cache coherence is handled entirely in software.

