Transparent Dynamic Binding with Fault-Tolerant Cache Coherence Protocol for Chip Multiprocessors
Shuchang Shan †‡, Yu Hu †, Xiaowei Li †
† Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences
‡ Graduate University of Chinese Academy of Sciences (GUCAS)
2 Outline
– Introduction
– TDB execution model
– Experimental results
– Conclusion
3 Architectural-level Dual Modular Redundancy (DMR)
– Instruction-level DMR, thread-level DMR and core-level DMR; representative designs include AR-SMT [FTCS'99], SRT [ISCA'00], DIVA [MICRO'99], SHREC [MICRO'04], EDDI [TR'02], CRTR [ISCA'03], Reunion [MICRO'06] and DCC [DSN'07]
[Figure: leading and trailing threads executing duplicated instructions (A/A', B/B') through the memory system and L1, with execute (EX') and check (CHK) stages]
– For CMP systems, core-level DMR makes use of the abundant hardware resources
4 Core-level Dual Modular Redundancy (DMR)
– Uses coupled cores to verify each other's execution
– Static binding: lacks flexibility, e.g., Reunion [MICRO'06], CRT [ISCA'02], CRTR [ISCA'03]
– Dynamic binding: lacks scalability for parallel processing, e.g., DCC [DSN'07, WDDD'08]
5 Key issue in core-level DMR: maintaining master-slave memory consistency
– Master-slave memory consistency: the coupled cores must read the same memory values; external writes cause consistency violations
– Reunion [Smolens, MICRO'06]: rollback and recovery on a consistency violation
– Dynamic Core Coupling (DCC) [LaFrieda, DSN'07]: a consistency window that stalls external writes
– Both approaches face a scalability problem as the system grows (next slide)
6 Scalability problem
– External writes occur earlier and more frequently as the system scales
– Reunion: unacceptable recovery overhead for consistency violations
– DCC: unacceptable stall latency caused by the consistency window
– A scalable solution is needed: reduce the consistency-maintenance overhead
[Chart: probability of an external write occurring within a given slack — 4-core CMP: 28% within 100 cycles, 37% within 500 cycles; 16-core CMP: 43% within 100 cycles, 55% within 500 cycles. External writes within 1K cycles: 0.3 for the 4-core CMP vs. 3.3 for the 16-core CMP]
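As a quick sanity check on the chart data (the measurements are from the slide; the ratio is my own arithmetic), the external-write rate grows far faster than the core count:

```latex
% External writes within a 1K-cycle window, from the slide data:
\frac{3.3\ \text{writes/Kcycle (16-core)}}{0.3\ \text{writes/Kcycle (4-core)}} = 11
\qquad\text{vs.}\qquad
\frac{16\ \text{cores}}{4\ \text{cores}} = 4
```

An 11x increase in consistency-maintenance pressure for a 4x increase in cores is what makes per-pair stalling (DCC) or rollback (Reunion) increasingly expensive.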
7 Basic idea: reduce the scope of master-slave memory consistency maintenance
– Sphere of Consistency (SoC): shrink it from the whole memory hierarchy to the private caches
– Transparent Dynamic Binding (TDB): reduce the SoC to the scale of the private caches and provide a scalable and flexible core-level DMR solution
[Figure: master and slave L1 caches over the global memory, before and after the SoC is reduced]
8 Outline
– Introduction
– TDB execution model
– Experimental results
– Conclusion
9 TDB principle
– The logical pair runs on the same program input and therefore shows similar memory access behavior
– Transparent binding: the master issues the L1 miss requests for the logical pair; the slave is prevented from accessing the global memory
– Dynamic binding: the pair uses the regular system network for data communication and result comparison
[Figure: a program feeding the A and A' L1 caches above the global memory]
10 Transparent dynamic binding
– Sphere of Consistency: the private caches
– Transparency of slaves: the slave passively waits for the data forwarded by the master
– Consumer-consumer data access pattern: the master consumes data from the global memory, and the slave in turn consumes the data forwarded by the master (a sketch follows below)
[Figure: a producer core writing to the global memory; the master and slave of the logical pair consuming the data]
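A minimal sketch of how this consumer-consumer fill path could be organized (my own illustration, not the authors' implementation; names such as requestFromGlobalMemory() and forwardFillToPartner() are hypothetical, and memory/network interactions are reduced to print statements):

```cpp
#include <cstdint>
#include <cstdio>
#include <queue>

using Addr = std::uint64_t;

struct CacheBlock { Addr addr; };                // data payload elided

struct Core {
    bool  isMaster = false;
    Core* partner  = nullptr;                    // the other core of the logical pair
    std::queue<Addr> pendingSlaveMisses;         // slave misses waiting for forwarded fills

    // Stub: in a real system this is an L2/directory request over the network.
    void requestFromGlobalMemory(Addr a) {
        std::printf("master miss -> global memory: 0x%llx\n", (unsigned long long)a);
    }
    // Stub: in a real system this uses the regular on-chip network.
    void forwardFillToPartner(const CacheBlock& b) { partner->installIntoL1(b); }
    void installIntoL1(const CacheBlock& b) {
        std::printf("%s installs 0x%llx\n", isMaster ? "master" : "slave",
                    (unsigned long long)b.addr);
    }
};

// Only the master consumes data from the global memory; the slave consumes the
// master's fills, so the memory system sees one requester per logical pair.
void handleL1Miss(Core& c, Addr a) {
    if (c.isMaster) c.requestFromGlobalMemory(a); // issues the miss for both copies
    else            c.pendingSlaveMisses.push(a); // slave passively waits, never accesses memory
}

// Invoked when the master's miss is serviced by the memory system.
void onMasterFill(Core& master, const CacheBlock& blk) {
    master.installIntoL1(blk);
    master.forwardFillToPartner(blk);             // the slave's copy comes only from the master
}

int main() {
    Core master, slave;
    master.isMaster = true;
    master.partner = &slave;
    slave.partner  = &master;
    handleL1Miss(master, 0x1000);                 // master fetches from the global memory
    handleL1Miss(slave,  0x1000);                 // slave only records the pending miss
    onMasterFill(master, CacheBlock{0x1000});     // both private copies end up identical
}
```

The point of the structure is that the memory system sees exactly one requester per logical pair, so the slave's private cache is populated only through the master.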
11-15 Maintaining consistency under out-of-order execution
– Out-of-order execution brings in wrong-path effects [1]: memory references that are later squashed by a pipeline refresh still fill blocks into the private cache and reorder its LRU/MRU state
– Because wrong-path behavior is not part of the retired instruction stream, it can leave the master's and slave's private caches in different states: a master-slave private-cache consistency violation
– Invariant that TDB builds on: the in-order memory instruction retirement sequence is identical on the master and the slave
[Animation over slides 11-15: the master and slave private caches process accesses MA1-MA6; a pipeline refresh squashes wrong-path accesses, leaving the two LRU/MRU stacks in different states]
[1] R. Sendag et al., "The impact of wrong-path memory references in cache-coherent multiprocessor systems," JPDC, 2007
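To make the hazard concrete, here is a tiny stand-alone illustration (entirely my own, not the paper's example) of how a single wrong-path touch can make two otherwise identical LRU caches evict different blocks:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Tiny fully-associative LRU "set": front = MRU, back = LRU victim.
struct LruSet {
    std::vector<int> blocks;
    std::size_t ways;
    explicit LruSet(std::size_t w) : ways(w) {}

    void access(int blk) {                                  // hit or fill, MRU update on access
        auto it = std::find(blocks.begin(), blocks.end(), blk);
        if (it != blocks.end()) blocks.erase(it);
        else if (blocks.size() == ways) blocks.pop_back();  // evict the LRU block
        blocks.insert(blocks.begin(), blk);
    }
};

int main() {
    LruSet master(2), slave(2);
    // Retired (architectural) reference stream, identical on both cores: A=1, B=2.
    master.access(1); slave.access(1);
    master.access(2); slave.access(2);

    master.access(1);   // wrong-path re-touch of A on the master only (later squashed)

    // The next retired reference C=3 now evicts different victims:
    master.access(3);   // evicts B on the master
    slave.access(3);    // evicts A on the slave
    std::printf("master keeps block %d, slave keeps block %d -> private caches diverge\n",
                master.blocks[1], slave.blocks[1]);
}
```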
16-19 Victim buffer assisted conservative private cache ingress rule
– Conservative private cache ingress rule: only data blocks from the correct (retired) path are accepted into the private caches (a sketch of one possible organization follows below)
– The victim buffer filters out the wrong-path (WP) data blocks so they never pollute the private cache
[Animation over slides 16-19: accesses MA1-MA6 on the master and slave; wrong-path blocks are held outside the private cache by the victim buffer]
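The slide names a victim buffer as the filtering structure; the sketch below is a simplified reading (structure names and the sequence-number keying are mine) in which speculative fills are staged outside the cache and only installed once the triggering instruction retires:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

using Addr = std::uint64_t;

class ConservativeL1 {
    std::vector<Addr> l1;                      // blocks accepted into the private cache
    std::unordered_map<int, Addr> staging;     // speculative fills, keyed by sequence number
public:
    // A fill arrives while the triggering memory instruction is still speculative:
    // stage it outside the cache instead of installing it.
    void onFill(int seqNum, Addr addr) { staging[seqNum] = addr; }

    // Correct-path retirement: the block is now allowed into the private cache.
    void onRetire(int seqNum) {
        auto it = staging.find(seqNum);
        if (it == staging.end()) return;
        l1.push_back(it->second);              // replacement handling elided
        staging.erase(it);
    }

    // Wrong-path squash: drop the staged block; the L1 contents never change.
    void onSquash(int seqNum) { staging.erase(seqNum); }

    std::size_t blocksInL1() const { return l1.size(); }
};

int main() {
    ConservativeL1 cache;
    cache.onFill(1, 0x80); cache.onSquash(1);  // wrong-path fill never enters the L1
    cache.onFill(2, 0x80); cache.onRetire(2);  // correct-path fill is installed
    std::printf("blocks in L1: %zu\n", cache.blocksInL1());   // prints 1
}
```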
20-22 update-after-retirement LRU replacement policy (uar-LRU)
– Even with the conservative ingress rule, wrong-path references can still reorder the LRU/MRU stack, leaving a potential master-slave consistency violation
– uar-LRU: update the MRU position only after the memory instruction retires, so wrong-path references cannot perturb the replacement state (see the sketch below)
[Animation over slides 20-22: the MRU updates for MA1 and MA5 are deferred until retirement on both the master and the slave]
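A minimal sketch of the uar-LRU idea as described on the slide (my own code; a real implementation would live inside the cache controller and track per-set state):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

using Addr = std::uint64_t;

struct UarLruSet {
    std::vector<Addr> order;    // front = MRU, back = LRU victim
    std::size_t ways;
    explicit UarLruSet(std::size_t w) : ways(w) {}

    // Speculative access: the data can be read, but the LRU order is NOT touched.
    bool probe(Addr a) const {
        return std::find(order.begin(), order.end(), a) != order.end();
    }

    // Called only when the memory instruction retires: promote the block to MRU.
    void updateOnRetire(Addr a) {
        auto it = std::find(order.begin(), order.end(), a);
        if (it != order.end()) order.erase(it);
        else if (order.size() == ways) order.pop_back();   // evict the LRU block
        order.insert(order.begin(), a);
    }
};

int main() {
    UarLruSet set(2);
    set.updateOnRetire(1); set.updateOnRetire(2);  // retired accesses to blocks A and B
    (void)set.probe(1);                            // wrong-path probe of A: no LRU change
    set.updateOnRetire(3);                         // retired access to C evicts the true LRU block (A)
    std::printf("kept blocks: 0x%llx and 0x%llx\n",
                (unsigned long long)set.order[0], (unsigned long long)set.order[1]);
}
```

Because both cores apply LRU updates in the (identical) retirement order, their replacement decisions stay in lock-step.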
23 Master-slave memory consistency violation
– External writes violate the master-slave memory consistency
– Prior schemes rely on the atomicity of master-slave data access behavior, which lacks scalability as external writes become more frequent
[Figure: master-slave input coherence — (a) an external write violates the consistency; (b) the master-slave consistency window in DCC]
24 Transparent input coherence strategy
– Take advantage of transparent dynamic binding to break the atomicity of master-slave data access behavior (a hedged illustration follows below)
[Figure: the logical pair and its checker under the transparent input coherence strategy]
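The sketch below is my own illustration, under the assumption (consistent with the transparent binding mechanism earlier) that the slave's input value is fixed at the moment the master's fill is forwarded; the slide does not spell out the protocol actions, so treat the names and structure as hypothetical:

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

using Addr = std::uint64_t;

struct GlobalMemory { std::unordered_map<Addr, int> mem; };

struct LogicalPair {
    std::unordered_map<Addr, int> masterL1, slaveL1;   // mirrored private caches

    // Master miss: the value read from memory is captured once and forwarded,
    // so the slave's input is decided here, not at the slave's own access time.
    void masterMiss(GlobalMemory& gm, Addr a) {
        int v = gm.mem[a];
        masterL1[a] = v;
        slaveL1[a]  = v;        // forwarded fill fixes the slave's input
    }
};

int main() {
    GlobalMemory gm; gm.mem[0x40] = 7;
    LogicalPair p;
    p.masterMiss(gm, 0x40);     // master reads 7; the slave receives the same 7
    gm.mem[0x40] = 9;           // external write from a third core, never stalled
    std::printf("master=%d slave=%d\n", p.masterL1[0x40], p.slaveL1[0x40]);   // both print 7
}
```

Because the pair's inputs can no longer diverge, there is no need for DCC-style consistency windows or Reunion-style rollback when an external write arrives.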
25 Outline
– Introduction
– TDB execution model
– Experimental results
– Conclusion
26 Experimental setup
– Full-system simulator: Simics + GEMS
– Parallel workloads: SPLASH-2
– Baseline dual modular redundancy system: N active cores plus another N disabled cores, simulating a DMR system in which the slaves run without interfering with the masters
27 Performance of the TDB proposal
– 97.2%, 99.8%, 101.2% and 105.4% relative to the baseline for 4, 8, 16 and 32 cores, respectively
– The conservative private cache ingress rule helps filter the wrong-path (WP) effects
28 Network traffic of the TDB proposal
– Total traffic increases by 5.2%, 3.6%, 1.3% and 2.5% for the 4-, 8-, 16- and 32-core CMP systems, respectively
29 Comparison against DCC [DSN'07]
[Chart: comparison values of 9.2%, 10.4%, 18% and 37.1%]
– Transparent Dynamic Binding (TDB): a scalable and flexible core-level DMR solution
30 Conclusion
– Transparent Dynamic Binding: reduce the SoC to the scale of the private caches
– Techniques to maintain the consistency: the consumer-consumer data access pattern, the victim buffer assisted conservative ingress rule, the uar-LRU replacement policy, and the transparent input coherence policy
– Together they form a scalable and flexible core-level DMR solution
31 Q&A