A Novel Directory-Based Non-Busy, Non-Blocking Cache Coherence
Huang Yomgqin, Yuan Aidong, Li Jun, Hu Xiangdong
2009 International Forum on Computer Science-Technology and Application
ABSTRACT The implementation of multiprocessor cache coherence and memory consistency can help homemade CPUs support a wide range of system designs. A lot of research has been done on various cache coherence protocols, such as those of the Piranha prototype system, GS320, and AMD64, but this work uses the NB2CC protocol. It divides serial processing into two steps: conflict detection and conflict solution. Conflict detection is completed at the home node, while conflict solution is distributed to the owners.
INTRODUCTION Directory-based cache coherence protocols (DBP) are widely used in many systems for their good scalability. Generally, there are two ways to solve the cache coherence problem: direction and indirection. Some researchers hold that directory-based protocols incur a performance penalty on sharing misses due to indirection: DBP introduce a level of indirection to obtain scalability at the cost of increased sharing-miss latency. They believe that token counting is a new and good alternative.
Indirection protocols, also called traditional 3-hop protocols, have been widely used in many systems. When the home node does not hold the newest data, it forwards the request from the local node to the owner, and the owner services the forwarded request. Figure 1(a): Basic directory-based protocols.
SGI Origin solves the conflict at the home node, shown as position A in Figure 1(b). When a request arrives at the home node, the node sets the related block’s state to active (we call it “busy”). All subsequent requests for that block are queued (at the home node or in the network) until the active request is deactivated. A directory busy state is therefore necessary in SGI Origin. Figure 1(b): Different conflict solution positions.
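The busy-state serialization described above can be sketched as a small model. This is an illustration only, not SGI Origin's actual implementation: the class `BusyDirectory` and its method names are hypothetical, and the queue is kept at the home node for simplicity.

```python
from collections import defaultdict, deque


class BusyDirectory:
    """Toy model of a home directory that serializes requests with a busy state.

    Hypothetical illustration: names and structure are assumptions,
    not taken from SGI Origin or from the paper.
    """

    def __init__(self):
        self.active = {}                   # block address -> request being serviced
        self.pending = defaultdict(deque)  # block address -> queued requesters

    def request(self, block, requester):
        """Return True if the request starts at once, False if it is queued."""
        if block in self.active:
            # The block is busy: every later request for it must wait.
            self.pending[block].append(requester)
            return False
        self.active[block] = requester
        return True

    def complete(self, block):
        """Deactivate the active request and start the next queued one, if any.

        Returns the requester that becomes active, or None.
        """
        del self.active[block]
        if self.pending[block]:
            nxt = self.pending[block].popleft()
            self.active[block] = nxt
            return nxt
        return None
```

In this model a second request for block X waits until the first completes, while a request for a different block Y proceeds immediately. The waiting on X is the cost that a non-busy protocol aims to remove.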
Comparison In GS320, the global switch (network) is where conflicts are solved. In the Piranha prototype, the solution position is moved to the end of the system (the owners). The Piranha protocol also introduces other innovative techniques that make great contributions to directory-based protocols, such as clean-exclusive optimization, reply forwarding from remote owners, eager exclusive replies, and avoiding the use of negative acknowledgment (NAK) messages. This makes the protocol more specific to the current design and limits its applicability.
MATHEMATICS MODEL
- Serial processing in DBP can be divided into two steps: conflict detection and conflict solution.
- We define several sets for convenience of description.
- Request = {R1, R2, …, Rn-1} is the set of requests for a shared address block (e.g., address X) at a logical time.
- Home = {H1, H2, …, Hn-1} is the ordering of the requests in Request as processed by the home directory. If Hi > Hj, then Ri is processed by the home directory before Rj.
- Local = {L1, L2, …, Ln-1} is the ordering of the requests in Request as satisfied by the owners. If Li > Lj, then Ri is satisfied by the owner before Rj.
Premise 1: For any requests Ri and Ri+1, Hi > Hi+1.
Premise 2: If Hi > Hi+1, then Li > Li+1.
Deduction 1: For any requests Ri and Rj, if Hi > Hj, then Li > Lj.
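Deduction 1 follows from the two premises by chaining adjacent pairs. The sketch below spells out that step. It assumes that both orderings are strict and transitive, which the slide does not state explicitly. Under Premise 1, H_i > H_j can only hold when i < j, so it is enough to consider that case.

```latex
\begin{align*}
&\text{Premise 1 for } k = i, \dots, j-1: && H_i > H_{i+1} > \dots > H_j \\
&\text{Premise 2 for each adjacent pair:} && L_k > L_{k+1} \quad (k = i, \dots, j-1) \\
&\text{Transitivity of the owner ordering:} && L_i > L_{i+1} > \dots > L_j \;\Longrightarrow\; L_i > L_j
\end{align*}
```

In words: the order in which the home directory detects conflicting requests is the order in which the owners resolve them.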
NB2CC To achieve high efficiency and concurrency at small cost, NB2CC inherits many characteristics of traditional protocols, such as a relaxed memory model, the method of avoiding protocol deadlock, and the basic handling of request races.
A. Avoiding Deadlock NB2CC uses three virtual channels (VC0, VC1, and VC2) to eliminate the possibility of protocol deadlocks without resorting to NAKs/retries. The first channel (VC0) carries all requests (RQ1) from a processor to the home node. Messages from the home directory/memory (replies (ACK1) or forwarded messages (RQ2) to third party nodes or processors) are always carried on the second channel (VC1). The third channel (VC2) carries replies (ACK2) from a third-party node or processor to the requestor.
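The channel assignment above can be written down as a lookup table. The following is a minimal sketch: the message names come from the text, while the `Channel` enum, the `CHANNEL_OF` table, and the `may_trigger` rule are illustrative assumptions. The rule reflects the usual argument for this kind of design, which is that handling a message only ever produces messages on a higher-numbered channel, so channel dependencies cannot form a cycle.

```python
from enum import Enum


class Channel(Enum):
    VC0 = 0  # requests from a processor to the home node
    VC1 = 1  # replies or forwarded requests from the home node
    VC2 = 2  # replies from a third-party node to the requestor


# Message names follow the text; the table itself is an illustrative assumption.
CHANNEL_OF = {
    "RQ1": Channel.VC0,
    "ACK1": Channel.VC1,
    "RQ2": Channel.VC1,
    "ACK2": Channel.VC2,
}


def may_trigger(cause, effect):
    """True if handling `cause` may emit `effect` under the assumed rule.

    Assumed rule: a message may only generate messages on a strictly
    higher-numbered channel, so the chain VC0 -> VC1 -> VC2 never loops back.
    """
    return CHANNEL_OF[cause].value < CHANNEL_OF[effect].value
```

For example, `may_trigger("RQ1", "RQ2")` and `may_trigger("RQ2", "ACK2")` hold, while `may_trigger("ACK2", "RQ1")` does not.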
B. Non-Negative ACK
- The lack of NAKs/retries leads to a more efficient protocol and provides several important and desirable characteristics.
- Since an owner node is guaranteed to service a forwarded request, the directory state can be changed immediately (non-busy).
- We inherently eliminate the live-lock and starvation problems that arise due to the presence of NAKs.
C. P2P Order in VC1 ONLY
A requests:
- A (RQ1(A)) → Home (RQ2(A)) → Owner
- At the same time, the transaction updates the directory state immediately.
- A coherence acknowledgement (Coh_ack1(A)) is sent to the local node A.
B requests:
- Home (RQ2(B)) → Owner A
* Two messages travel through the VC1 channel between the home node and node A, Coh_ack1(A) and RQ2(B), in a known order.
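The ordering requirement in this scenario can be illustrated with a FIFO link. This is a minimal sketch, not the hardware mechanism: `OrderedChannel` and `home_to_node_a_trace` are hypothetical names, and the link is modelled as a simple queue between the home node and node A.

```python
from collections import deque


class OrderedChannel:
    """Point-to-point FIFO link between one sender and one receiver."""

    def __init__(self):
        self._queue = deque()

    def send(self, message):
        self._queue.append(message)

    def receive(self):
        # Messages leave in exactly the order they were sent.
        return self._queue.popleft()


def home_to_node_a_trace():
    """Replay the scenario from the text on the Home -> A link of VC1."""
    vc1_home_to_a = OrderedChannel()
    # The home node handles A's request first and acknowledges it ...
    vc1_home_to_a.send("Coh_ack1(A)")
    # ... then forwards B's later request to A, which is now the owner.
    vc1_home_to_a.send("RQ2(B)")
    return [vc1_home_to_a.receive(), vc1_home_to_a.receive()]
```

Because the link is FIFO, node A always sees Coh_ack1(A) before RQ2(B), so it knows it has become the owner by the time the forwarded request arrives.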
D. Ownership Migrated Machine (OMM)
- Because of this technique, NB2CC guarantees that no more than two forwarded requests are sent to each owner.
- A cache line in the owned state holds the most recent, correct copy of the data, and other processors can hold a copy of the most recent, correct data too.
E. Illegible Invalidates Acknowledge (IIA) NB2CC supports an aggressive relaxed memory model, such as the Alpha memory model, which requires the use of explicit memory barrier instructions to impose memory ordering. It supports eager exclusive replies, so a request generated at the home node can be locally satisfied while remote invalidations caused by a previous operation have still not been committed at the owners. It injects invalidation messages from the home node and gathers the corresponding acknowledgments at the requesting node.
A receiving counter and a received counter are needed for IIA (48 or 64 bits are enough). The receiving counter (sent from the home node to the local node along with ACK1) records the number of acknowledgements that the local node needs to receive. The received counter is incremented by 1 as soon as an invalidation acknowledgement is received (from VC2). A memory barrier (MB) can complete when the two counters are equal. This greatly improves performance because there is no need to receive all invalidation acknowledgements to complete an RQ1.
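The two-counter scheme can be sketched as follows. The class and method names are hypothetical, and accumulating the expected count across several ACK1 messages is an assumption about how successive requests would be tracked. The text itself only states that the count arrives with ACK1 and that the barrier completes when the counters match.

```python
class InvalidationTracker:
    """Per-node counters that decide when a memory barrier may complete.

    Hypothetical sketch: field and method names are assumptions.
    """

    def __init__(self):
        self.receiving = 0  # acknowledgements this node has been told to expect
        self.received = 0   # acknowledgements that have arrived on VC2

    def on_ack1(self, expected_acks):
        """ACK1 from the home node carries the number of acks to expect.

        Assumption: counts from successive requests accumulate.
        """
        self.receiving += expected_acks

    def on_invalidate_ack(self):
        """One invalidation acknowledgement arrived on VC2."""
        self.received += 1

    def barrier_may_complete(self):
        """A memory barrier (MB) can finish once both counters agree."""
        return self.receiving == self.received
```

The requesting node can continue past ACK1 without waiting. Only a memory barrier has to check `barrier_may_complete()`, so invalidation latency is hidden between barriers.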
According to research, network delay plays an increasingly important role in the system as chip frequency increases. Non-blocking invalidation processing at the owner is a good method to hide the network delay. The combination of IIA, an aggressive relaxed memory model, and eager exclusive replies increases the concurrency of the system and brings good performance to NB2CC.
F. Putting it all together NB2CC incorporates many techniques that have already been used in other systems such as GS320 and Piranha, including Avoiding Deadlock, P2P Order in VC1, and Non-Negative ACK.
ANALYSIS
- Only the requests for the same block need to be delayed in this system, rather than blocking the head request of the queue.
- NB2CC is balanced and has no hot point, providing a flexible interface for the programmer and compiler to reach high system performance.
- NB2CC is designed for a high-concurrency, pipelined system and solves the multi-cache coherence problem in a software-transparent way.
- The low overhead and little dependence on the hardware implementation lead to an implementation-independent protocol.
NB2CC vs. GS320
- NB2CC does not support early commit, and invalidation acknowledgements are needed for the purpose of regularity.
- Pipelining and non-blocking processing bring high efficiency when processing invalidations.
- The complex optimizations used in GS320 are avoided here because we need a simple, regular, and efficient protocol.
- With respect to conflict solution, the owners are responsible for this job in NB2CC, while in GS320 the global switch takes the responsibility.
- Hot points are decentralized in our protocol.
Future work If needed, other techniques could be incorporated in this protocol, as long as there is no conflict with the basic rules that are provided here.
Thanks