
Xiaofang Chen1 Yu Yang1 Ganesh Gopalakrishnan1 Ching-Tsun Chou2


1 Reducing Verification Complexity of a Multicore Coherence Protocol Using Assume/Guarantee
Xiaofang Chen (1), Yu Yang (1), Ganesh Gopalakrishnan (1), Ching-Tsun Chou (2). (1) University of Utah, (2) Intel Corporation. FMCAD 2006. Supported by Intel SRC Customization Award 2005-TJ-1318 and NSF CNS. Good morning, everyone. I'm here to present our work on reducing the verification complexity of hierarchical cache coherence protocols using assume/guarantee reasoning. This is joint work with Yu Yang, my advisor Ganesh Gopalakrishnan, and Ching-Tsun Chou from Intel.

2 Hierarchical Cache Coherence Protocols
Chip-level protocols. Intra-cluster protocols. Inter-cluster protocols. (Figure labels: mem, dir.) Multicore has become the architecture of choice for future processors. With a chip multiprocessor (CMP), there is naturally a memory hierarchy, and cache coherence protocols are usually used across it. This figure shows a general infrastructure that uses a 3-level hierarchical coherence protocol. The first level is the chip-level protocol, which maintains coherence among the caches used by the multiple cores. The second level is the intra-cluster protocol, which maintains coherence among the different chips in one cluster. The third level is the inter-cluster protocol. Together, these three levels maintain the coherence of the whole system.

3 Verification Challenges
No public domain benchmarks. More complicated, with more corner cases and a larger state space. In our study of verifying hierarchical coherence protocols, there are several challenges. First, there are no publicly available hierarchical protocols of reasonable complexity that can serve as a benchmark. Second, hierarchical protocols are usually more complicated, with more corner cases and a bigger state space than non-hierarchical protocols.

4 A Multicore Coherence Protocol
(Figure: Remote Cluster 1, Home Cluster, and Remote Cluster 2, each with two L1 caches, an L2 Cache + Local Dir, and a RAC, plus the Global Dir and Main Memory.) In response to these problems, we first developed a 2-level hierarchical protocol, shown here in a multicore system, as a public domain benchmark. As is typical in model-checking-based coherence verification, one address is modeled, and the home cluster is associated with the main memory for this address. There are also two identical remote clusters, remote cluster 1 and remote cluster 2. Inside a cluster there are two cores, each with an L1 cache. The L2 cache is shared by the L1 caches and is inclusive of them. "RAC" stands for remote access controller. It is used to communicate with other clusters and the global dir. The level-1 protocol involves the L1 caches, the L2 cache, the local dir and the RAC of a cluster, and keeps coherence inside a cluster. The level-2 protocol maintains coherence among the clusters. It only has the high-level information of which cluster holds the cache line, and in which state.

5 Protocol Features
Both levels use MESI protocols. Level-1: FLASH. Level-2: DASH. Silent drop on non-Modified cache lines. Network channels are non-FIFO. In detail, both levels of this protocol use MESI, which has more features and more complexity than MSI. MESI protocols can support silent drops on non-Modified cache lines, which also adds complexity. The first level uses a protocol similar to the Stanford FLASH protocol, and the second level uses one similar to the DASH protocol. Both are invalidation-based directory protocols that have been studied for years. By adapting these two protocols with MESI features, this protocol reaches a complexity similar to that of real protocols used in industry, so it is a nice benchmark. Finally, the network channels modeled in the protocol are non-FIFO. That is, a message sent later can arrive earlier. This is the preferred modeling style at the high level, as it can support different hardware implementations of the network. Of course, it also makes the protocol more complex.
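The non-FIFO channel assumption can be illustrated with a small sketch. This is not the Murphi model from the talk; the class name `NonFifoChannel` and its methods are hypothetical, and only the message names `Req_Ex` and `NACK` come from the slides. The channel is modeled as a multiset, so any in-flight message may be delivered next.

```python
from collections import Counter


class NonFifoChannel:
    """Hypothetical unordered channel: a multiset of in-flight messages."""

    def __init__(self):
        self.in_flight = Counter()

    def send(self, msg):
        # Sending only adds to the multiset; no order is recorded.
        self.in_flight[msg] += 1

    def deliverable(self):
        # Every in-flight message is a legal next delivery.
        return sorted(self.in_flight)

    def deliver(self, msg):
        if self.in_flight[msg] == 0:
            raise ValueError(f"{msg!r} is not in flight")
        self.in_flight[msg] -= 1
        if self.in_flight[msg] == 0:
            del self.in_flight[msg]
        return msg


channel = NonFifoChannel()
channel.send("Req_Ex")  # sent first
channel.send("NACK")    # sent second
first_choices = channel.deliverable()
delivered = channel.deliver("NACK")  # the later message arrives earlier
```

A model checker has to explore every delivery choice at every step, which is one reason non-FIFO channels enlarge the state space.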

6 An Example Scenario
(Figure: Remote Cluster 1, Home Cluster, and Remote Cluster 2, with numbered message arrows and cache states Excl and Invld.) Messages: 1 Req_Ex, 2 Req_Ex, 3 Fwd_ReqEx, 4.1 Fwd_ReqEx, 4.2 Silent-drop, 5 NACK. Now, let's look at a simple scenario of how this protocol works. At the beginning, an L1 cache in remote cluster 1 has an exclusive copy. In step 1, an L1 cache in remote cluster 2 requests an exclusive copy. The local dir cannot satisfy this request, so it forwards the request to the global dir. The global dir shows that remote cluster 1 is the owner, so it forwards the request again. In step 4, the request is forwarded to the real owner, and at the same time the data in that L1 cache is silently dropped. So in step 5, a NACK is sent back. Finally, the cache line data in the L2 cache of remote cluster 1 is supplied to the requesting L1 cache in remote cluster 2.

7 Complexity of the Protocol
Multiplicative effect of four protocols running concurrently. Model checking failed after 161,876,000 states. (IA-64 machine, 1.4GHz Itanium-2 processor.) This protocol is in fact very complex: as we can see, it has 4 protocol instances running concurrently. Model checking failed on this protocol due to state explosion after more than 161 million states, which is roughly what an explicit-state model checker can currently explore. Here we used 64-bit Murphi with 20GB of memory, without hash compaction.
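The multiplicative effect can be seen in a toy explicit-state search. The sketch below is an assumption-laden illustration, not Murphi: `explore` and `interleaved` are hypothetical helpers, and each "protocol instance" is reduced to an independent counter. With `n_procs` instances of `n_local` local states each, breadth-first search reaches `n_local ** n_procs` global states.

```python
from collections import deque


def explore(initial, successors, max_states=None):
    """Breadth-first reachability; returns (states seen, finished?)."""
    seen = {initial}
    frontier = deque([initial])
    while frontier:
        state = frontier.popleft()
        for nxt in successors(state):
            if nxt in seen:
                continue
            if max_states is not None and len(seen) >= max_states:
                return len(seen), False  # gave up: state budget exhausted
            seen.add(nxt)
            frontier.append(nxt)
    return len(seen), True


def interleaved(n_procs, n_local):
    """Toy system: n_procs independent counters, each stepping 0..n_local-1."""
    def successors(state):
        for i, value in enumerate(state):
            if value + 1 < n_local:
                yield state[:i] + (value + 1,) + state[i + 1:]
    return (0,) * n_procs, successors


full = explore(*interleaved(4, 5))
capped = explore(*interleaved(4, 5), max_states=100)
```

Four concurrent instances with five local states each already give 5**4 = 625 global states; real protocol instances have far more local states, so the product quickly exceeds any memory budget.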

8 Intuitively, We Want to …
Split a hierarchical protocol into several smaller ones. Verify the smaller protocols. A/G proof. In model-checking-based verification, especially for hierarchical coherence protocols, state explosion prevents verification of even 1 address and 3 nodes for real protocols. In those cases, A/G allows you to abstract some nodes by simply projecting out states, and to patch up the missing information using constraints. The constraints model the "missing modules" adequately without incurring the interleaving state explosion those modules would cause. This is why we win. We justify the constraints using A/G. So it seems clear that such hierarchical protocols cannot be model checked by brute force using an explicit-state model checker. Intuitively, we want to split such a large protocol into several smaller ones. If, by verifying the smaller ones, we can claim the correctness of the original protocol, that will be very nice. There are several ways to do the splitting. Here I'll present the approach used in our paper.

9 A/G Approach
Abstraction. Constraining. (Figure: original protocol.) Suppose this is the original protocol. We first abstract it, which results in an overapproximation. Then we constrain it back to an appropriate space. The constraining is done in a counter-example guided manner: when we model check the abstracted protocols, errors can show up, and we either fix the real bugs or refine the abstraction.

10 For Our 2-Level Protocol
Verification by building two smaller protocols, M1 and M2. For our 2-level protocol, we are indeed able to construct two simpler protocols M1 and M2 from M. In fact, there are three protocols, M1, M2 and M3, but because the two remote clusters are the same, M3 is the same as M2. So we consider two abstracted protocols here. If all three clusters were the same, we would only need to build and model check one simpler protocol, M1. And if the three clusters were all different, we would need to consider 3 abstracted protocols. We'll see how each abstracted protocol is built in the following slides.

11 Abstracted Protocol #1
(Figure: the Home Cluster keeps its two L1 caches, L2 Cache + Local Dir, and RAC; Remote Cluster 1 and Remote Cluster 2 keep only L2 Cache + Local Dir' and RAC; Global Dir and Main Memory remain.) This figure shows the first abstracted protocol M1, built from our original protocol M. The home cluster still keeps all the details of the original protocol, while the two remote clusters are abstracted: the L1 caches and part of the local dir are removed. In terms of the 2-level hierarchy, the level-2 protocol is kept, while the level-1 protocol remains in only one cluster.

12 Abstracted Protocol #2
(Figure: Remote Cluster 1 keeps its two L1 caches, L2 Cache + Local Dir, and RAC; the Home Cluster and Remote Cluster 2 keep only L2 Cache + Local Dir' and RAC; Global Dir and Main Memory remain.) Similarly, in the second abstracted protocol M2, one remote cluster is kept in detail, while the home cluster and remote cluster 2 are abstracted.

13 Verification Methodology
Abstraction. Fixing real bugs in M. Refinement: counter-example guided refinement, adding new verification obligations. After building these two abstracted protocols, the good news is that although we cannot fully model check the original hierarchical protocol, we can detect and fix bugs in it simply by model checking the two abstracted protocols. Now we'll see how this is achieved. We will describe the abstraction that generates the abstracted protocols, how to fix bugs, and how to refine the abstracted protocols in a sound way.
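The overall loop (model check the abstracted protocols, then either fix a real bug in M or refine the abstraction) can be sketched as follows. All function and parameter names here are hypothetical placeholders for manual or tool-assisted steps; the talk does not describe an automated driver.

```python
def refine_until_verified(check, is_real_bug, fix_in_original,
                          refine_abstraction, max_iters=50):
    """Counter-example guided loop.

    check() model checks the abstracted protocols and returns an error
    trace, or None when every abstracted protocol passes.
    """
    history = []
    for _ in range(max_iters):
        trace = check()
        if trace is None:
            return history  # all abstracted protocols verified
        if is_real_bug(trace):
            fix_in_original(trace)      # fix M, then regenerate each Mi
            history.append("fix")
        else:
            refine_abstraction(trace)   # strengthen a guard, add a VO
            history.append("refine")
    raise RuntimeError("no convergence within max_iters")


# Toy stand-in: a queue of error traces that the checker would report.
pending = ["bogus", "real", "bogus"]
history = refine_until_verified(
    check=lambda: pending[0] if pending else None,
    is_real_bug=lambda trace: trace == "real",
    fix_in_original=lambda trace: pending.pop(0),
    refine_abstraction=lambda trace: pending.pop(0),
)
```

In the toy run, each fix or refinement removes exactly one reported trace, so the loop ends once the queue is empty.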

14 Abstraction
States: projection. Transitions: overapproximation. Abstraction of the protocol has two aspects: the state representation, since we are using an explicit-state model checker, and the transition relations. First, let's look at how the state representation is abstracted in our approach.

15 Abstraction on States
Intra-cluster details. Inter-cluster details. The colored diagram here shows the structure of a detailed cluster in the original protocol. It is not necessary to read the text in the diagram. Roughly, a cluster consists of 5 blocks: the L1 caches, the network channels used inside a cluster, the L2 cache, the local dir, and the remote access controller. The level-1 protocol uses the first 4 blocks, i.e. the intra-cluster details. The level-2 protocol uses the last 2 blocks, i.e. the inter-cluster details. A detailed cluster uses all 5 blocks in its state representation, and an abstracted cluster uses only the last 2 blocks. So the abstraction is simply a projection on the state representation.
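Projection on states can be sketched as dropping fields from a record. The field names and values below are hypothetical stand-ins for the five blocks named in the notes; the actual Murphi state layout is not shown in the talk.

```python
# Hypothetical block names, in the order given in the speaker notes.
ALL_BLOCKS = ("l1_caches", "local_net", "l2_cache", "local_dir", "rac")
INTER_CLUSTER_BLOCKS = ALL_BLOCKS[-2:]  # what an abstracted cluster keeps


def project(cluster, kept=INTER_CLUSTER_BLOCKS):
    """Abstract a detailed cluster by keeping only the listed blocks."""
    return {name: cluster[name] for name in kept}


detailed_a = {
    "l1_caches": ("Excl", "Invld"),
    "local_net": (),
    "l2_cache": "Excl",
    "local_dir": "owner=L1[0]",
    "rac": "idle",
}
# Same inter-cluster view, different intra-cluster details.
detailed_b = dict(detailed_a, l1_caches=("Invld", "Invld"), l2_cache="Invld")

abstract_a = project(detailed_a)
abstract_b = project(detailed_b)
```

Two detailed states that differ only in intra-cluster details collapse to the same abstract state, which is where the reduction in state count comes from.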

16 Abstracting Transitions
Rule-based system: guard → action. Relaxing guards. Relaxing expression values. Removing statements. Original rule: Procs[p].WbMsg.Cmd = WB_Wb → Procs[p].L2.Data := Procs[p].WbMsg.Data; Procs[p].L2.HeadPtr := L2; … Abstracted rule: true → Procs[p].L2.Data := d; … For the transition relations, we consider rule-based systems in which a transition is written as "guard → action": when the guard condition is satisfied, the updates in the action part are executed. The transition relations in our abstracted protocols are simply an over-approximation of the original ones. The over-approximation is introduced by 3 operations: relaxing the guard conditions, relaxing the expression values, and removing certain statements. For example, here we show a sample transition in the original protocol, in which the local dir of a cluster receives a WB message from an L1 cache. On receiving this message, the L2 cache stores the most recent data, and the local dir sets the real owner inside the cluster to be the L2 cache. After abstraction, the guard condition becomes "true", because "WbMsg" is abstracted away, and the action part becomes "L2.Data := d", where "d" is an arbitrary value in the range of the WB data. The second update is removed. The whole process over-approximates each transition relation of the original protocol.
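The write-back rule and its abstraction can be mimicked in a few lines. This is a Python sketch under stated assumptions, not Murphi: states are dictionaries, the field names (`wb_cmd`, `wb_data`, `l2_data`, `head_ptr`) and the two-value `DATA_RANGE` are hypothetical, and nondeterminism is modeled by returning a list of successor states.

```python
DATA_RANGE = (0, 1)  # assumed range of the WB data


def concrete_rule(state):
    """Original rule: WbMsg.Cmd = WB_Wb -> L2.Data := WbMsg.Data; HeadPtr := L2."""
    if state["wb_cmd"] != "WB_Wb":
        return []  # guard is false, rule not enabled
    return [dict(state, l2_data=state["wb_data"], head_ptr="L2")]


def abstract_rule(abs_state):
    """Abstracted rule: true -> L2.Data := d, for any d in the data range."""
    return [dict(abs_state, l2_data=d) for d in DATA_RANGE]


def project(state):
    """Keep only the variable that survives abstraction in this toy."""
    return {"l2_data": state["l2_data"]}


s = {"wb_cmd": "WB_Wb", "wb_data": 1, "l2_data": 0, "head_ptr": "L1"}
concrete_next = concrete_rule(s)
abstract_next = abstract_rule(project(s))
# Over-approximation: every concrete step has a matching abstract step.
covered = all(project(t) in abstract_next for t in concrete_next)
```

The abstract rule also fires when no write-back is pending, which is exactly the extra behavior that can produce bogus error traces.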

17 Detecting Bugs in M
When a real error is found in Mi: fix the bug in M, regenerate the Mi's, and iterate the process. Let's first see what to do if a real error trace is found in Mi. We need to fix the bug in M, then regenerate each abstracted protocol Mi, and redo the model checking on the Mi's.

18 Refinement
When a bogus error is found in Mi: analyze and find the problematic rule g → a. Locate the original rule G → A in M. Add a new VO, G => P, in one abstracted protocol. Strengthen the rule into g ∧ P → a. What happens if a bogus error is found in Mi? We first find the problematic transition in the abstracted protocol, then locate the original transition relation in the real protocol M. Remember that each transition in an abstracted protocol is generated from a transition in the original protocol. We then add a new verification obligation (VO) in one abstracted protocol, which is not necessarily Mi itself. I'll show a detailed example of the refinement in the following slides. Here, the implied expression "P" only involves variables of the level-2 protocol, so it is present in every abstracted protocol. We use this expression to strengthen the problematic rule in Mi.

19 Details of Refinement (I)
1. False alarm found in M1: remote cluster-1 can modify its L2 line arbitrarily. Rule: true → … Here is an example of how the refinement is done. Suppose we find a false alarm in M1. The error is introduced because there is a transition in remote cluster-1 that says the L2 cache can modify its cache line arbitrarily at any time. That is, the guard of the transition is "true", as shown here.

20 Details of Refinement (II)
2. Locate the original rule in M before abstraction. Guard: when the local dir receives a WB from an L1 cache. Rule: Procs[p].WbMsg.Cmd = WB → … We then locate the original transition in M. The guard of the original rule is: the local dir of remote cluster-1 receives a WB from an L1 cache. It is our abstraction that turned this condition into "true" in M1.

21 Details of Refinement (III)
3. Strengthen the problematic rule from step 1 in M1: only when the L2 line is exclusive can L2 modify its line. Rule: true & Procs[p].L2.State = Excl → … We then strengthen the problematic rule from the first step by adding the condition that the updates can happen only when the L2 cache line state is exclusive.

22 Details of Refinement (IV)
4. Why is the strengthening sound? Now the question is: why is this strengthening sound, and how can we prove it?

23 Details of Refinement (V)
4. We can add a new VO in M2: Procs[p].WbMsg.Cmd = WB => Procs[p].L2.State = Excl. The nice thing is that we can prove the soundness of this strengthening in M2. We add a new VO in M2, saying that whenever the local dir receives a WB from an L1 cache, its L2 cache state must be exclusive. Which abstracted protocol the VO is added to depends on which abstracted protocol has the detailed information. Here, the details of the remote clusters only exist in M2, so we add the VO in M2. Suppose instead that M2 has a problematic rule about the home cluster. Then we need to add a VO in M1, which has the detailed information of the home cluster. Now we come to the soundness proof of this approach.
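Steps 1 to 4 of this refinement example can be tied together in a small sketch. As before, this is illustrative Python rather than Murphi: the dictionary fields, `DATA_RANGE`, and the hand-written list `m2_states` (standing in for the reachable states a model checker would enumerate in M2) are all assumptions.

```python
DATA_RANGE = (0, 1)  # assumed range of the WB data


def vo_holds(detailed_states):
    """VO checked where the details exist: WbMsg.Cmd = WB => L2.State = Excl."""
    return all(s["l2_state"] == "Excl"
               for s in detailed_states if s["wb_cmd"] == "WB")


def strengthened_rule(abs_state):
    """Refined abstract rule: true & L2.State = Excl -> L2.Data := d."""
    if abs_state["l2_state"] != "Excl":
        return []  # the added conjunct blocks the bogus behavior
    return [dict(abs_state, l2_data=d) for d in DATA_RANGE]


# Stand-in for states reachable in M2, where the cluster is detailed.
m2_states = [
    {"wb_cmd": "WB", "l2_state": "Excl"},
    {"wb_cmd": None, "l2_state": "Invld"},
]

vo_ok = vo_holds(m2_states)
blocked = strengthened_rule({"l2_state": "Invld", "l2_data": 0})
allowed = strengthened_rule({"l2_state": "Excl", "l2_data": 0})
```

The added conjunct only mentions `l2_state`, a variable visible in every abstracted protocol, so the strengthened rule can be written in M1 even though the write-back message itself was projected away there.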

24 Soundness of the Approach
Goal: if M1 and M2 can be model checked correct w.r.t. the coherence property Φ of M, then M must also be correct w.r.t. Φ. The goal here is to prove that, for our 2-level protocol M, if M1 and M2 can be model checked correct with respect to the coherence property, then M itself must be coherent, without the need to model check M.

25 Soundness Proof
Temporal induction. Initial states: each common variable has the same value in M, M1 and M2; each newly added VO is checked in M1 and M2; each coherence property is checked. Suppose soundness holds in state s. Intuitively, we can prove this using temporal induction. In the initial states, each common variable of M, M1 and M2 has the same value, as all cache lines are invalid and there are no network messages in flight. Each newly added VO passes model checking in M1 and M2 for the initial states, so the guard strengthening is sound for the initial states. Suppose soundness holds in state s. We now consider every next state s' of s.

26 Soundness Proof (II)
M: g → a, from (h1, h2, r11, r12, r21, r22). M1: g1 & p1 → a1, from (h1, h2, r12, r22) to (h1', h2', r12', r22'). M2: g2 & p2 → a2, from (h2, r11, r12, r22) to (h2', r11', r12', r22'). Here we represent a state s of M as h1, h2 (the two levels of information of the home cluster), r11, r12 for remote cluster-1, and r21, r22 for remote cluster-2. Suppose the next state s' of s, shown on the right-hand side, is reached via the transition g → a. Now consider the corresponding state s1 in M1. We know that h1, h2, r12, r22 are the same as in s, and that there is a transition corresponding to g → a. First, g1 is an overapproximation of g, so g1 is enabled at s1. Second, if guard strengthening was applied to this transition, there is a verification obligation somewhere that says "g => p1", so p1 also holds. We also know that each abstracted transition in M1 and M2 is built by allowing the same or more values in the action part, so there must exist a choice that makes the same updates to each common component. As a result, s1' is as shown. The same holds for s2' of M2 and s3' of M3. As we can see, the reasoning here is not circular. The assume/guarantee is meta-circular. We provide a formal proof in the paper, which uses a simulation relation.
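The key step of this argument can be written out as a short implication chain. This is only an informal restatement of the notes above in their notation (g is the original guard evaluated at s, and g1, p1 are the relaxed guard and the added conjunct evaluated at the corresponding state s1); the formal proof is the simulation argument in the paper.

```latex
\begin{align*}
g(s) &\Rightarrow g_1(s_1)
  && \text{(abstraction only relaxes guards)} \\
g(s) &\Rightarrow p_1(s_1)
  && \text{(verification obligation, checked in some abstracted protocol)} \\
\therefore\quad g(s) &\Rightarrow g_1(s_1) \wedge p_1(s_1)
  && \text{(the strengthened rule is enabled whenever the original is)}
\end{align*}
```

Because p1 mentions only level-2 variables, which are common to s and s1, its truth value is the same in both states.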

27 Experiment Results
A real bug found. 10 iterations of refinement. The size of each error trace is < 12. One person-day of work. Now I'll show the experimental results of applying this approach to our 2-level hierarchical protocol. After many shallow bugs had been removed from M, and model checking of M had failed due to state explosion, we found a real bug in M by using M1 and M2. Ten iterations of refinement were performed on M1 and M2 before all errors went away. It is also worth mentioning that each error trace is less than 12 steps long, so people can easily read the traces and fix the protocol. All this required about one person-day of work, which is modest.

28 Reduction
64-bit Murphi, IA-64, with 20GB of memory.
Protocol   Number of states
M          > 161,876,000
M1         31,919,219
M2         78,689,678
Now let's see how much reduction is achieved using this approach. This table shows the result of running our 2-level protocol with 64-bit Murphi, which was ported by our group, with 20GB of memory. The original protocol failed model checking after more than one hundred million states, while the two abstracted protocols finish model checking with fewer than 80 million states each. The time required is about 3 and 7 hours, respectively. I need to mention that no hash compaction was used, because when we wrote the paper, hash compaction had not been fully implemented in the 64-bit Murphi. (Slide note: IA-64 machine with a 1.4GHz Itanium-2 processor, 40-bit hash compaction, altogether 24GB of memory. Final BFS depth: 74 and 80.)

29 More Results
Another 2-level hierarchical cache coherence protocol.
Protocol   Number of states
M          > 1,521,900,000
M1         234,478,105
M2         283,124,383
Recently, we developed another 2-level hierarchical protocol. In this protocol, the caching hierarchy is non-inclusive. That is, the L2 cache does not necessarily hold a cache line that exists in the L1 caches, so the L2 cache and local dir cannot precisely track the L1 cache line states. This protocol is more complex than the one shown in this paper. The verification result is listed here: the large protocol failed model checking after more than 1.5 billion states, and each abstracted protocol finishes model checking in fewer than 0.3 billion states. We used hash compaction in the 64-bit Murphi here, also with 20GB of memory. This protocol is also publicly available now. (Slide note: final BFS depth 88 and 94.)

30 Conclusion
Developed a 2-level hierarchical protocol. Proposed a compositional approach: abstraction, bug fixing, refinement. Proved the soundness. Here is the conclusion of our work. In studying the verification of hierarchical cache coherence protocols, we first developed a 2-level protocol of reasonable complexity as a benchmark. Based on this protocol, we developed a compositional approach using counter-example guided refinement, and we proved the soundness of the approach. We believe this approach can be applied to verify general hierarchical protocols as well.

31 Related Work
FMCAD'04: Chou et al., A simple method for parameterized verification of cache coherence protocols. CHARME'99: McMillan, Verification of infinite state systems by compositional model checking. Finally, this work borrows many ideas from Chou's FMCAD work on parameterized verification of cache coherence protocols, and McMillan's CHARME work on compositional model checking. The abstractions, the counter-example guided refinement and the compositional reasoning are all influenced by their work.

32 For Details
http://www.cs.utah.edu/formal_verification/ Thanks. All the details, including the protocol models, can be found at this URL.

33 A Multicore Coherence Protocol
(Figure: Remote Cluster 1, Home Cluster, and Remote Cluster 2, each with two L1 caches, an L2 Cache + Local Dir, and a RAC, plus the Global Dir and Main Memory.)

34 Another Decomposing Approach
Split protocols hierarchically: intra-cluster protocol, inter-cluster protocol. Another way to decompose a hierarchical protocol is to think of it hierarchically. For example, for our 2-level protocol, one can try to split it into an intra-cluster protocol and an inter-cluster protocol, and use a compositional proof to verify the correctness of the large protocol.

35 Intra-cluster Protocol
(Figure: one cluster with its L1 caches, L2 Cache + Local Dir, and RAC, connected to an Environment.) This figure shows the intra-cluster protocol, where only one detailed cluster is modeled and all the others are abstracted as an environment.

36 Inter-cluster Protocol
(Figure: Remote Cluster 1, Home Cluster, and Remote Cluster 2, each reduced to L2 Cache + Local Dir' and RAC, plus the Global Dir and Main Memory.) And this figure shows the inter-cluster protocol, where each cluster has been abstracted so that no L1 cache details remain. If this decomposition approach works, it can give more reduction than our current approach, because our current approach always keeps the level-2 protocol plus one level-1 instance. In theory, this approach can also be scaled easily to verify an arbitrary number of hierarchy levels.

37 About the Bug IACK

