LimitLESS Directories: A Scalable Cache Coherence Scheme


LimitLESS Directories: A Scalable Cache Coherence Scheme. By: David Chaiken, John Kubiatowicz, Anant Agarwal. Presented by: Sampath Rudravaram

Cache Coherence
The gap between the computing power of microprocessors and that of the largest supercomputers is shrinking, while the price/performance advantage of microprocessors is increasing.
Caches enhance the performance of multiprocessors by reducing network traffic and average memory access time.
The cache coherence problem arises because multiple processors may be reading and modifying copies of the same memory block in their own caches.
Common solutions: snoopy coherence, directory-based coherence, compiler-directed coherence.

Directory (Full-Map)
Message-based protocols allocate a section of the system's memory as the directory.
Each block of memory has an associated directory entry containing one bit per cache in the system; that bit indicates whether or not the corresponding cache holds a copy of the memory block.

Directory-based Coherence
The basic concept is that a processor must ask the directory for permission to load an entry from primary memory into its cache.
When an entry is changed, the directory must be notified either before the change is initiated or when it is complete.
The directory then either updates or invalidates the other caches holding that entry.
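To make the flow concrete, here is a minimal cache-side sketch in C of the permission protocol just described. The type and function names (cache_line_t, request_read_copy, request_ownership) are illustrative assumptions, not the paper's implementation.

```c
/* Hypothetical cache-side view of a directory protocol: a processor asks the
 * block's home directory before caching a block, and must obtain ownership
 * (which invalidates or updates other copies) before writing it. */

typedef enum { INVALID, READ_ONLY, READ_WRITE } cstate_t;

typedef struct { unsigned long tag; cstate_t state; } cache_line_t;

/* Assumed primitives: send a request to the home directory and block until
 * the reply (data or write permission) arrives. Stubbed here. */
static void request_read_copy(int home, unsigned long addr) { (void)home; (void)addr; }
static void request_ownership(int home, unsigned long addr) { (void)home; (void)addr; }

void cache_read(cache_line_t *l, int home, unsigned long addr) {
    if (l->state == INVALID) {
        request_read_copy(home, addr);   /* RREQ; reply carries the data      */
        l->state = READ_ONLY;
    }
    /* read from the cached line ... */
}

void cache_write(cache_line_t *l, int home, unsigned long addr) {
    if (l->state != READ_WRITE) {
        request_ownership(home, addr);   /* WREQ; directory invalidates or     */
        l->state = READ_WRITE;           /* updates the other cached copies    */
    }
    /* write into the cached line ... */
}
```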

Directory-based Coherence: Full-Map Directory Entry
Entry format: a State field (e.g. Read-Only) plus one presence bit per cache (N bits).
Advantages? -> No broadcast is necessary.
Disadvantages? -> Coherence traffic is high because all requests go through the directory. -> Large memory requirement (directory size grows as Θ(N^2)).
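As a rough storage model (not the paper's hardware layout), a full-map entry is a state field plus one presence bit per cache; with N caches and total memory proportional to N, directory storage grows as Θ(N^2). A minimal C sketch, assuming at most 64 caches:

```c
#include <stdint.h>

#define N_CACHES 64                      /* machine size assumed for the sketch */

typedef enum { DIR_READ_ONLY, DIR_READ_WRITE,
               DIR_READ_TRANS, DIR_WRITE_TRANS } dir_state_t;

/* Full-map entry: one presence bit per cache, kept for every memory block.
 * Per-block storage is O(N); with O(N) blocks in the machine the directory
 * as a whole grows as Theta(N^2). */
typedef struct {
    dir_state_t state;
    uint64_t    presence;                /* bit i set => cache i holds a copy  */
} fullmap_entry_t;

static inline void add_sharer(fullmap_entry_t *e, int cache_id) {
    e->presence |= (uint64_t)1 << cache_id;
}

static inline int is_sharer(const fullmap_entry_t *e, int cache_id) {
    return (int)((e->presence >> cache_id) & 1);
}
```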

Directory-based Coherence: Limited Directory Entry
Entry format: a State field (e.g. Read-Only) plus a small, fixed number of node-ID pointers.
Advantages? -> Performance is comparable to that of a full-map scheme in cases where data is shared among a limited number of processors. -> Cheaper to implement.
Disadvantages? -> The protocol is susceptible to thrashing when the number of processors sharing data exceeds the number of pointers in the directory entry.
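A limited directory replaces the presence bit vector with a small, fixed set of node-ID pointers. The sketch below (field and function names are assumptions) shows where pointer overflow occurs once more than DIR_PTRS caches share a block:

```c
#include <stdint.h>

#define DIR_PTRS 4                        /* hardware pointers per entry (e.g. Dir4) */

typedef struct {
    uint8_t  state;                       /* Read-Only, Read-Write, ...              */
    uint8_t  n_ptrs;                      /* pointers currently in use               */
    uint16_t ptrs[DIR_PTRS];              /* node IDs of the sharing caches          */
} limited_entry_t;

/* Returns 0 on success, -1 if the entry has no free pointer (overflow).
 * A pure limited protocol must then evict one of the existing copies,
 * which causes thrashing for widely shared data. */
static int record_sharer(limited_entry_t *e, uint16_t node) {
    for (int i = 0; i < e->n_ptrs; i++)
        if (e->ptrs[i] == node) return 0; /* already recorded */
    if (e->n_ptrs == DIR_PTRS) return -1; /* pointer overflow */
    e->ptrs[e->n_ptrs++] = node;
    return 0;
}
```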

LimitLESS (Limited directory Locally Extended through Software Support)
The LimitLESS scheme combines the full-map and limited-directory ideas to achieve a robust, affordable, and scalable cache coherence solution.
The main idea is to handle the common case in hardware and the exceptional case in software.
A limited directory implemented in hardware keeps track of a fixed number of cached copies of each memory block. When the capacity of a directory entry is exceeded, the directory interrupts the local processor and a full-map directory is emulated in software.
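A hedged sketch of that hardware/software split, with hypothetical names (handle_rreq, raise_directory_trap): while a free hardware pointer exists the request is satisfied entirely in hardware; when the pointers are exhausted the controller interrupts the local processor and software takes over.

```c
#include <stdint.h>
#include <stddef.h>

#define DIR_PTRS 4

typedef struct {
    uint16_t  ptrs[DIR_PTRS];     /* hardware pointers                                  */
    uint8_t   n_ptrs;
    uint64_t *sw_bitvector;       /* full map in local memory, NULL until first overflow */
} limitless_entry_t;

/* Assumed controller primitive: interrupt the local processor so the software
 * trap handler (see the trap-handler slide) can extend the entry. */
static void raise_directory_trap(limitless_entry_t *e, uint16_t requester) {
    (void)e; (void)requester;
}

void handle_rreq(limitless_entry_t *e, uint16_t requester) {
    if (e->n_ptrs < DIR_PTRS) {
        e->ptrs[e->n_ptrs++] = requester;    /* common case: handled in hardware  */
        /* controller replies with the data (RDATA) */
    } else {
        raise_directory_trap(e, requester);  /* exceptional case: software assist */
    }
}
```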

Protocol messages for hardware coherence:

Cache to Memory:
  RREQ   - Read Request
  WREQ   - Write Request
  REPM   - Replace Modified (carries data)
  UPDATE - Update (carries data)
  ACKC   - Invalidate Acknowledge

Memory to Cache:
  RDATA - Read Data (carries data)
  WDATA - Write Data (carries data)
  INV   - Invalidate
  BUSY  - Busy Signal

Directory states:

Memory:
  Read-Only         - some number of caches have read-only copies of the data
  Read-Write        - exactly one cache has a read-write copy of the data
  Read-Transaction  - holding a read request; update is in progress
  Write-Transaction - holding a write request; invalidation is in progress

Cache:
  Invalid    - cache block may not be read or written
  Read-Only  - cache block may be read, but not written
  Read-Write - cache block may be read or written

Annotation of the state transition diagram (label: input message | precondition | directory entry change | output message(s)):

  1: i->RREQ   | --                             | P = P U {i}           | RDATA -> i
  2: i->WREQ   | P = {i}                        | --                    | WDATA -> i
     i->WREQ   | P = {}                         | P = {i}               | WDATA -> i
  3: i->WREQ   | P = {k1,...,kn} and i not in P | P = {i}, AckCtr = n   | INV -> kj, for all kj
     i->WREQ   | P = {k1,...,kn} and i in P     | P = {i}, AckCtr = n-1 | INV -> kj, for all kj != i
  4: j->WREQ   | P = {i}                        | P = {j}, AckCtr = 1   | INV -> i
  5: j->RREQ   | P = {i}                        | P = {j}, AckCtr = 1   | INV -> i
  6: i->REPM   | P = {i}                        | P = {}                | --
  7: j->RREQ   | --                             | --                    | BUSY -> j
     j->WREQ   | --                             | --                    | BUSY -> j
     j->ACKC   | AckCtr != 1                    | AckCtr = AckCtr - 1   | --
     j->REPM   | --                             | --                    | --
  8: j->ACKC   | AckCtr = 1, P = {i}            | AckCtr = 0            | WDATA -> i
     j->UPDATE | P = {i}                        | AckCtr = 0            | WDATA -> i
  9: j->RREQ   | --                             | --                    | BUSY -> j
     j->WREQ   | --                             | --                    | BUSY -> j
     j->REPM   | --                             | --                    | BUSY -> j
     j->UPDATE | P = {i}                        | AckCtr = 0            | RDATA -> i
     j->ACKC   | P = {i}                        | AckCtr = 0            | RDATA -> i
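To make the table concrete, here is an illustrative C sketch (names and structure are assumptions, not the Alewife controller) of transition 3, where a write request invalidates the current read-only copies and initialises AckCtr, and of transitions 7/8, where acknowledgements drain AckCtr until the write data can be sent.

```c
#include <stdint.h>

typedef enum { RREQ, WREQ, REPM, UPDATE, ACKC,    /* cache -> memory  */
               RDATA, WDATA, INV, BUSY } msg_t;   /* memory -> cache  */

typedef enum { S_READ_ONLY, S_READ_WRITE,
               S_READ_TRANS, S_WRITE_TRANS } dstate_t;

typedef struct {
    dstate_t state;
    uint64_t presence;   /* the pointer set P, as a bit vector */
    int      ack_ctr;    /* outstanding invalidation acks      */
    int      writer;     /* the pending writer i               */
} dir_entry_t;

static void send(int node, msg_t m) { (void)node; (void)m; }  /* network stub */

/* Transition 3: i -> WREQ while P = {k1,...,kn}: invalidate every other
 * sharer, remember how many acks to expect, move to Write-Transaction. */
void on_wreq(dir_entry_t *e, int i, int n_nodes) {
    e->ack_ctr = 0;
    for (int k = 0; k < n_nodes; k++)
        if (((e->presence >> k) & 1) && k != i) {
            send(k, INV);
            e->ack_ctr++;
        }
    e->presence = (uint64_t)1 << i;   /* P = {i} */
    e->writer   = i;
    e->state    = S_WRITE_TRANS;
}

/* Transitions 7 and 8: acknowledgements drain AckCtr; when the last one
 * arrives, release the write data to the waiting writer. */
void on_ackc(dir_entry_t *e) {
    if (--e->ack_ctr == 0) {
        send(e->writer, WDATA);
        e->state = S_READ_WRITE;
    }
}
```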

LimitLESS Architectural Features
Alewife is a large-scale multiprocessor with distributed shared memory and a cost-effective mesh network for communication.
An Alewife node consists of a 33 MHz SPARCLE processor, 64K bytes of direct-mapped cache, 4M bytes of globally-shared main memory, and a floating-point coprocessor.

Figures: a 16-node Alewife machine; a 128-node Alewife chassis.

LimitLESS Architectural Features
The processor must be capable of rapid trap handling (five to ten cycles): a rapid context-switching processor and a finely-tuned software trap architecture.
The processor needs complete access to coherence-related controller state, and the directory controller must be able to invoke processor trap handlers when necessary (condition bits and trap lines between processor and controller, alongside the data and address buses).
An interface to the network that allows the processor to launch and to intercept coherence protocol packets: the IPI (Interprocessor-Interrupt) interface.

LimitLESS Architectural Features: the IPI interface
IPI provides a superset of the network functionality:
-> used to send and receive cache protocol packets
-> used to send preemptive messages to remote processors
Opcodes: protocol opcodes for cache coherence traffic; interrupt opcodes for interprocessor messages.
Transmission of IPI packets -> enqueue the request on the IPI output queue.
Reception of IPI packets -> place the packet in the IPI input queue; IPI input traps are synchronous.
Network packet structure: source processor, packet length, opcode, Operand 1 ... Operand m-1, Data word 1 ... Data word n-1.
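A hedged sketch of the packet layout above as a C struct; the field names and the fixed operand/data bounds are assumptions for illustration, not the actual Alewife header format.

```c
#include <stdint.h>

/* Illustrative layout of an IPI packet as listed on this slide: a header
 * identifying the source, length and opcode, followed by a variable number
 * of operands and data words (bounded here for the sketch). */
typedef struct {
    uint16_t source;        /* source processor                                 */
    uint16_t length;        /* packet length                                    */
    uint16_t opcode;        /* protocol opcode (coherence traffic) or interrupt */
                            /* opcode (interprocessor messages)                 */
    uint32_t operands[8];   /* Operand 1 .. Operand m-1                         */
    uint32_t data[16];      /* Data word 1 .. Data word n-1                     */
} ipi_packet_t;
```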

Queue-based diagram of the Alewife controller

Meta States & Trap Handler
The trap handler manages directory entries whose hardware pointers have overflowed.
First-time overflow:
- The trap code allocates a full-map bit vector in local memory.
- Empty all hardware pointers and set the corresponding bits in the vector.
- Set the directory mode to Trap-On-Write before the trap returns.
Additional overflow:
- Empty all hardware pointers and set the corresponding bits in the vector.
Termination (on a WREQ or a local write fault):
- Empty all hardware pointers.
- Record the identity of the requester in the directory.
- Set the AckCtr to the number of bits set in the vector.
- Place the directory in Normal mode, Write-Transaction state.
- Invalidate all caches whose bit is set in the vector.
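The trap handler steps above can be sketched in C roughly as follows; the structure and helper names (dir_entry_sw_t, send_invalidate) are assumptions, not Alewife's actual trap code.

```c
#include <stdint.h>
#include <stdlib.h>

#define DIR_PTRS 4
#define N_NODES  64

enum { MODE_NORMAL, MODE_TRAP_ON_WRITE };

typedef struct {
    uint16_t hw_ptrs[DIR_PTRS];   /* hardware pointers in the controller  */
    int      n_ptrs;
    int      mode;                /* Normal or Trap-On-Write              */
    int      ack_ctr;
    uint8_t *bitvec;              /* software full map, one bit per node  */
} dir_entry_sw_t;

static void send_invalidate(int node) { (void)node; }   /* assumed primitive */

/* Overflow trap (first-time and additional): allocate the full-map vector if
 * needed, then empty the hardware pointers into it. */
void overflow_trap(dir_entry_sw_t *e, uint16_t requester) {
    if (e->bitvec == NULL) {                       /* first-time overflow     */
        e->bitvec = calloc(N_NODES / 8, 1);
        e->mode   = MODE_TRAP_ON_WRITE;
    }
    for (int i = 0; i < e->n_ptrs; i++)            /* empty hardware pointers */
        e->bitvec[e->hw_ptrs[i] / 8] |= (uint8_t)(1 << (e->hw_ptrs[i] % 8));
    e->bitvec[requester / 8] |= (uint8_t)(1 << (requester % 8));
    e->n_ptrs = 0;
}

/* Termination trap (on a WREQ or a local write fault): record the requester,
 * count the sharers into AckCtr, return the entry to the hardware protocol,
 * and invalidate every cache whose bit is set in the vector. */
void write_trap(dir_entry_sw_t *e, uint16_t writer) {
    e->n_ptrs = 0;                                 /* empty hardware pointers  */
    e->hw_ptrs[e->n_ptrs++] = writer;              /* record the requester     */
    e->ack_ctr = 0;
    for (int node = 0; node < N_NODES; node++)
        if (e->bitvec[node / 8] & (1 << (node % 8))) {
            send_invalidate(node);
            e->ack_ctr++;
        }
    e->mode = MODE_NORMAL;                         /* Normal mode; the entry is */
                                                   /* now in Write-Transaction  */
}
```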

Performance Measurement
Comparison of the performance of limited, LimitLESS, and full-map directories.
Evaluated in terms of the total number of cycles needed to execute an application on a 64-processor Alewife machine.

Measurement Technique
ASIM, the Alewife System Simulator.

Performance Results
Applications: Multigrid, SIMPLE, Matexpr, Weather (execution cycle counts appear in the original table).
-> Protocols compared: four-pointer limited protocol (Dir4NB), LimitLESS scheme with Ts = 50 (LimitLESS4), and full-map protocol.
-> Measured on a 64-node Alewife machine with 64K-byte caches and a 2D mesh network.

Performance Results (contd.)
-> Results when the variable in Weather is not optimised.

Performance Results (contd.)
-> Results when the variable in Weather is optimised.

Performance Results (contd.)
-> Results when the emulation latency Ts = 50 cycles for the LimitLESS protocol.

Conclusion
This paper proposed a new scheme for cache coherence, called LimitLESS, which is being implemented in the Alewife machine.
Hardware requirements include rapid trap handling and a flexible processor interface to the network.
Preliminary simulation results indicate that the LimitLESS scheme approaches the performance of a full-map directory protocol with the memory efficiency of a limited directory protocol.
Furthermore, the LimitLESS scheme provides a migration path toward a future in which cache coherence is handled entirely in software.