1
Predictable Cache Coherence for Multi-Core Real-Time Systems
Mohamed Hassan, Anirudh M. Kaushik and Hiren Patel. RTAS 2017. Thank you for the introduction. Good afternoon everyone. It is my pleasure to present "Predictable Cache Coherence for Multi-Core Real-Time Systems". This work was done jointly with Mohamed and Hiren at UW.
2
Motivation: Data sharing in multi-core real-time systems
[Figure: RT tasks run on cores with private L1 D/I caches, connected through an interconnect to the shared LLC/main memory; private data (P1) stays core-local while shared data (S1) lives in the shared levels.]
We are currently witnessing a tremendous push from single-core to multi-core real-time systems due to their performance advantages. However, it is well known that moving to multi-core RT systems creates challenges in guaranteeing the temporal requirements of applications. Typical multi-core RT systems share components such as the LLC and main memory, which are points of contention and interference between tasks running simultaneously on different cores. One important challenge is managing data shared between tasks so that accesses are handled correctly, i.e., coherently, while still providing timing guarantees for shared data accesses. Anirudh M. Kaushik, RTAS'17
3
Motivation: Data sharing in multi-core real-time systems
No caching of shared data [RTSS'09, RTNS'10]
[Figure: tasks T1–T3 cache private data (P1, P2) in their private caches, but shared data (S1–S4) is accessed uncached.]
One extreme way to achieve this is to disallow private caching of shared data. This avoids shared-data interference at the expense of long execution times.
• Simpler timing analysis • Hardware support • Long execution time
4
Motivation: Data sharing in multi-core real-time systems
No caching of shared data [RTSS'09, RTNS'10]
Task scheduling on shared data [ECRTS'10, RTSS'16]
[Figure: the scheduler co-locates tasks T1–T3 that access the same shared data (S1–S4) on one core.]
Another approach, which improves on the previous one from a performance standpoint, is task scheduling on shared data: tasks that access the same shared data are co-located on the same core to exploit private cache hits on shared data. The flip side is that this approach requires changes to the OS scheduler. Moreover, for multi-threaded applications whose threads work on the same shared data, it forces all threads onto the same core, resulting in limited parallelism.
• Simpler timing analysis • Hardware support • Long execution time
• Private cache hits on shared data • No hardware support • Limited multi-core parallelism • Changes to OS scheduler
5
Motivation: Data sharing in multi-core real-time systems
No caching of shared data [RTSS'09, RTNS'10]
Task scheduling on shared data [ECRTS'10, RTSS'16]
Software cache coherence [SAMOS'14]
[Figure: the application source is annotated to distinguish accesses to private data from accesses to shared data.]
Finally, a solution that allows private cache hits on shared data without changes to the OS or hardware is software cache coherence. In this approach, the application is modified to protect accesses to shared data. Modifying the application is tedious and may not be viable for closed-source applications.
• Simpler timing analysis • Hardware support • Long execution time
• Private cache hits on shared data • No hardware support • Limited multi-core parallelism • Changes to OS scheduler
• Private cache hits on shared data • Hardware support • Source code changes
6
Motivation: Data sharing in multi-core real-time systems
No caching of shared data [RTSS'09, RTNS'10]
Task scheduling on shared data [ECRTS'10, RTSS'16]
Software cache coherence [SAMOS'14]
Common theme: these approaches disable simultaneously cached shared data to enable timing analysis, at the cost of lower performance and OS and application changes.
7
Motivation: Data sharing in multi-core real-time systems
No caching of shared data [RTSS'09, RTNS'10]
Task scheduling on shared data [ECRTS'10, RTSS'16]
Software cache coherence [SAMOS'14]
This work: allow predictable simultaneous accesses to cached shared data through hardware cache coherence, providing significant performance improvements with no changes to the OS or applications.
8
Outline Overview of hardware cache coherence
Challenges of applying conventional hardware cache coherence to RT multi-core systems
Our solution: Predictable Modified-Shared-Invalid (PMSI) cache coherence protocol
Latency analysis of PMSI
Results
Conclusion
This is the outline for the remainder of my talk. I will first give a brief overview of hardware cache coherence. I will then show, using examples, that naively applying a well-studied hardware cache coherence protocol to an RT multi-core system does not necessarily provide timing guarantees for shared data accesses. To this end, I will introduce PMSI, our solution for predictable hardware cache coherence.
9
Overview of hardware cache coherence
Modified-Shared-Invalid (MSI) protocol
[State diagram: I → S on OwnGETS; I → M on OwnGETM; S → M on OwnUPG or OwnGETM; S stays in S on OwnGETS or OtherGETS; S → I on OtherGETM or OtherUPG; M → S on OtherGETS; M → I on OtherGETM or OwnPUTM.]
GETS: memory load. GETM/UPG: memory store. PUTM: replacement.
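The MSI transitions summarized above can be captured as a small lookup table. The following is a minimal illustrative sketch, not the authors' implementation; the state and event names follow the slide, everything else is ours:

```python
# Minimal sketch of the MSI state machine from the diagram above.
# States: I (Invalid), S (Shared), M (Modified).
# Own* events are this core's requests; Other* events are snooped requests.
MSI_TRANSITIONS = {
    ("I", "OwnGETS"): "S",    # load miss: fetch a shared copy
    ("I", "OwnGETM"): "M",    # store miss: fetch an exclusive copy
    ("S", "OwnUPG"): "M",     # store hit on a shared line: upgrade
    ("S", "OwnGETM"): "M",
    ("S", "OwnGETS"): "S",    # load hit: no state change
    ("S", "OtherGETS"): "S",  # another core reads: line stays shared
    ("S", "OtherGETM"): "I",  # another core writes: invalidate
    ("S", "OtherUPG"): "I",
    ("M", "OtherGETS"): "S",  # another core reads: downgrade, supply data
    ("M", "OtherGETM"): "I",  # another core writes: invalidate, supply data
    ("M", "OwnPUTM"): "I",    # replacement: write back and invalidate
}

def next_state(state, event):
    """Return the next MSI state; unlisted pairs leave the state unchanged."""
    return MSI_TRANSITIONS.get((state, event), state)
```

For example, a core in I that issues a load (OwnGETS) moves to S, and a core in M that snoops another core's read (OtherGETS) downgrades to S.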
10
Applying hardware cache coherence to multi-core RT systems
Modified-Shared-Invalid protocol
[Figure: two cores (Core0, Core1) with private L1 D/I caches running the MSI protocol, connected by a real-time interconnect to the LLC/main memory.]
11
Applying hardware cache coherence to multi-core RT systems
Modified-Shared-Invalid protocol
Multi-core RT system with predictable arbitration to shared components + well-studied cache coherence protocol = predictable latencies for shared data accesses?
[Figure: the MSI state diagram attached to each core's private cache in the real-time system.]
13
Sources of unpredictable memory access latencies due to hardware cache coherence
Inter-core coherence interference on the same cache line
Inter-core coherence interference on different cache lines
Inter-core coherence interference due to write hits
Intra-core coherence interference
14
Real-time interconnect TDM arbitration
System model
[Figure: four cores (Core0–Core3) with private caches connected by a real-time interconnect using TDM arbitration (repeating schedule: Core0, Core1, Core2, Core3). Core0 holds A:100 in M; the other cores hold A in I; memory's copy of A is stale (A:50); other cached lines: B:42 and C:31 in S.]
PR: pending requests. PWB: pending write-backs.
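Under TDM arbitration each core owns a fixed bus slot in a repeating schedule, so the wait for bus access is bounded by the schedule length. A minimal sketch of this arbitration, using the four-core schedule and the 50-cycle slot width from the evaluation setup (the function names are ours):

```python
SLOT_WIDTH = 50  # cycles per TDM slot (the evaluation uses 50-cycle slots)
SCHEDULE = ["Core0", "Core1", "Core2", "Core3"]  # repeating TDM schedule

def slot_owner(cycle):
    """Return which core may use the interconnect at a given cycle."""
    slot = (cycle // SLOT_WIDTH) % len(SCHEDULE)
    return SCHEDULE[slot]

def wait_for_slot(core, cycle):
    """Cycles until `core`'s next slot begins (0 if it owns the current slot)."""
    idx = SCHEDULE.index(core)
    cur = (cycle // SLOT_WIDTH) % len(SCHEDULE)
    if cur == idx:
        return 0
    slots_away = (idx - cur) % len(SCHEDULE)
    return slots_away * SLOT_WIDTH - (cycle % SLOT_WIDTH)
```

The worst-case wait is one full TDM round minus one cycle, which is what makes the arbitration latency grow linearly with the number of cores.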
15
Intra-core coherence interference
TDM schedule: Core2 | Core3 | Core0 | Core1 | Core2 | Core3 | Core0 | …
Core event: Core2 issues Read A (Core2: GetS A) in its slot. Core0 holds A:100 in M, so Core2 marks A pending (*) and waits for the data.
* denotes waiting for data.
16
Intra-core coherence interference
Core event: Core0 issues Read B (Core0: GetS B) in its own slot and caches B:42 in S, while Core2 is still waiting for A.
17
Intra-core coherence interference
Core event: Core2 is still waiting for the data of A.
18
Intra-core coherence interference
Core event: Core0 issues Read C in its next slot, again servicing its own requests instead of writing back A.
19
Intra-core coherence interference
Core0 keeps using its slots for its own reads while the write-back of A stays pending. ✖ Unbounded latency for Core2.
20
Intra-core coherence interference
Core0 may issue GetS B / GetS C indefinitely; nothing forces its pending write-back of A to be scheduled.
21
Intra-core coherence interference
Solution 1: architectural modification. Predictable arbitration of write-backs: in Core0's slot, the pending write-back of A (WB A) is serviced before Core0's own Read B.
23
Intra-core coherence interference
Solution 2: coherence protocol modifications. The protocol must maintain state indicating Core0's pending write-back of A so that the write-back is performed predictably.
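Solution 1 can be thought of as changing what a core does with its own TDM slot: any write-back demanded by another core is serviced before the core's own new requests. A hypothetical sketch of this priority rule (the class and queue names are ours, not the paper's):

```python
from collections import deque

class CoreArbiter:
    """Sketch of Solution 1: in its TDM slot, a core drains demanded
    write-backs before its own requests, so another core's read of a
    modified line cannot be starved indefinitely."""

    def __init__(self):
        self.pending_writebacks = deque()  # lines other cores are waiting on
        self.own_requests = deque()        # this core's own loads/stores

    def action_in_slot(self):
        """Pick what this core does with its current TDM slot."""
        if self.pending_writebacks:
            return ("writeback", self.pending_writebacks.popleft())
        if self.own_requests:
            return ("request", self.own_requests.popleft())
        return ("idle", None)
```

In the running example, Core0 with a pending write-back of A and its own Read B would spend its first slot on the write-back of A and only then issue Read B, bounding Core2's wait.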
24
Architectural modification + Coherence modifications =
Sources of unpredictable memory access latencies due to cache coherence:
Inter-core coherence interference on the same cache line. Solution: architectural modification to the shared LLC/memory.
Inter-core coherence interference on different cache lines. Solution: architectural modification to the core + coherence modifications.
Inter-core coherence interference due to write hits.
Intra-core coherence interference.
Architectural modifications + coherence modifications = Predictable Modified-Shared-Invalid (PMSI) cache coherence protocol.
Predictable latencies for simultaneous shared data accesses. No application/OS changes. Significant performance benefits.
25
PMSI cache coherence protocol: Protocol modifications
[Table: PMSI transition table. Stable states I, S, M plus transient states ISd, IMd, ISdI, IMdI, IMdS, SMw, MIwb, MSwb; columns for core events (Load, Store, Replacement) and bus events (OwnData, OwnUpg, OwnPutM, OtherGetS, OtherGetM, OtherUpg, OtherPutM). Selected transitions: I + Load → issue GetS/ISd; I + Store → issue GetM/IMd; S + Load → hit; S + Store → issue Upg/SMw; M + OtherGetS → issue PutM/MSwb; M + OtherGetM → issue PutM/MIwb; MIwb + OwnPutM → send data/I; MSwb + OwnPutM → send data/S.]
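The clearly legible entries of the transition table can be captured as a lookup of (state, event) pairs. This is a partial sketch covering only the transitions recoverable from the slide, not the complete PMSI protocol; the function name is ours:

```python
# Partial PMSI transition table: (state, event) -> (action, next_state).
# Only the entries clearly recoverable from the slide; the remaining
# transient-state transitions are omitted.
PMSI = {
    ("I", "Load"): ("issue GetS", "ISd"),
    ("I", "Store"): ("issue GetM", "IMd"),
    ("S", "Load"): ("hit", "S"),
    ("S", "Store"): ("issue Upg", "SMw"),
    ("M", "OtherGetS"): ("issue PutM", "MSwb"),   # demanded downgrade
    ("M", "OtherGetM"): ("issue PutM", "MIwb"),   # demanded invalidation
    ("MIwb", "OwnPutM"): ("send data", "I"),      # write back, then invalid
    ("MSwb", "OwnPutM"): ("send data", "S"),      # write back, stay shared
}

def step(state, event):
    """Return (action, next_state) for a tabulated PMSI transition."""
    return PMSI[(state, event)]
```

The key difference from plain MSI is visible in the M row: a snooped request does not change the state immediately but moves the line to a transient write-back state (MSwb or MIwb), and the data is only sent in the owner's own write-back slot (OwnPutM).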
27
Intra-core coherence interference
Core2 issues Read A (Core2: GetS A) in its TDM slot. Core0 holds A:100 in M; per the modified protocol (M + OtherGetS → issue PutM/MSwb), Core0 moves to the transient write-back state MSwb. The pending-request lookup table (PR LUT) records: address A, core ID 2, request GetS.
28
Intra-core coherence interference
In Core0's dedicated write-back (WB) slot, Core0 writes back A (OwnPutM in MSwb → send data, go to S). Memory's copy of A is updated from 50 to 100 while the PR LUT entry (A, 2, GetS) remains pending.
29
Intra-core coherence interference
Core2 receives the data response (A:100) and completes Read A; both Core0 and Core2 now hold A:100 in S.
30
Intra-core coherence interference
In its next request slot, Core0 issues Read B (Core0: GetS B) and caches B:42 in S; its own requests no longer delay the demanded write-back.
31
Latency analysis of PMSI
N: number of cores.
[Figure: the worst-case latency of a request decomposes along the TDM schedule into the access latency Lacc, the arbitration latency WCLarb, the intra-core coherence latency WCLintraCoh (the core's own pending write-backs), and the inter-core coherence latency WCLinterCoh (other cores' coherence activity).]
32
Latency analysis of PMSI
WCL of a memory request = Lacc + WCLarb (∝ N) + WCLintraCoh (∝ N) + WCLinterCoh (∝ N²)
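To make the decomposition concrete, here is an illustrative calculation. The per-term expressions below are hypothetical placeholders that merely have the stated growth rates (linear, linear, quadratic in N); they are not the paper's derived bounds:

```python
# Illustrative only: hypothetical per-term expressions with the growth
# rates from the formula above, not the paper's derived bounds.
SLOT = 50    # TDM slot width in cycles (matches the evaluation setup)
L_ACC = 50   # assumed fixed access latency, hypothetical

def wcl(n):
    """Worst-case latency of one memory request for n cores (sketch)."""
    wcl_arb = (n - 1) * SLOT     # wait for own TDM slot: grows with N
    wcl_intra = n * SLOT         # own pending write-backs: grows with N
    wcl_inter = n * n * SLOT     # other cores' coherence activity: N^2
    return L_ACC + wcl_arb + wcl_intra + wcl_inter
```

Even with these placeholder terms, the quadratic inter-core component dominates as the core count grows, which matches the conclusion that the WCL is proportional to N².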
33
Results: Experimental setup
Cycle-accurate gem5 simulator
2, 4, 8 cores; 16 KB direct-mapped private L1 D/I caches
In-order cores, 2 GHz operating frequency
TDM bus arbitration with slot width = 50 cycles
Benchmarks: SPLASH-2, plus synthetic benchmarks that stress worst-case scenarios
Simulation framework available at
34
Results: Observed worst case latency per request
[Chart: observed worst-case latency per request for PMSI and three unpredictable coherence configurations (Unpredictable 2, 3, 4).]
35
Results: Observed worst case latency per request
[Chart: observed worst-case latency per request with the analytical latency bound overlaid, for PMSI and Unpredictable 2–4.]
37
Results: Slowdown relative to MESI cache coherence
[Chart: slowdown relative to the MESI cache coherence protocol for uncache-all, single-core, uncache-shared, PMSI, MSI, and MESI configurations.]
39
Conclusion
Enumerated the sources of unpredictability when applying conventional hardware cache coherence to multi-core real-time systems.
Proposed the PMSI cache coherence protocol, which allows predictable simultaneous shared data accesses with no changes to the application or OS.
Hardware overhead of PMSI is ~128 bytes.
WCL of a memory access with PMSI is proportional to the square of the number of cores (∝ N²).
PMSI outperforms prior techniques for handling shared data accesses (up to 4x improvement over uncache-shared) and incurs on average a 46% slowdown compared to the conventional MESI cache coherence protocol.
40
Thank you & Questions?