Predictable Cache Coherence for Multi-Core Real-Time Systems


Predictable Cache Coherence for Multi-Core Real-Time Systems. Mohamed Hassan, Anirudh M. Kaushik and Hiren Patel, RTAS 2017. Thank you for the introduction. Good afternoon everyone. It is my pleasure to present “Predictable Cache Coherence for Multi-Core Real-Time Systems”. This work was done jointly with Mohamed and Hiren at UW.

Motivation: Data sharing in multi-core real-time systems
[Figure: RT tasks on cores with private L1 D/I caches, connected through an interconnect to a shared LLC/main memory holding private and shared data.]
We are currently witnessing a tremendous push from single-core RT systems to multi-core RT systems due to performance advantages. However, it is well known that moving to multi-core RT systems presents challenges in guaranteeing the temporal requirements of applications. This is because typical multi-core RT systems share components such as the LLC and main memory, which are points of contention and interference for tasks running simultaneously on different cores. One important challenge is managing data shared between tasks such that accesses are performed correctly, i.e., coherently, while still providing timing guarantees. Anirudh M. Kaushik, RTAS'17

Motivation: Data sharing in multi-core real-time systems
No caching of shared data [RTSS’09, RTNS’10]
One extreme way to achieve this is to disallow private caching of shared data. This avoids shared-data interference at the expense of long execution times.
Trade-offs: Simpler timing analysis | Hardware support | Long execution time

Motivation: Data sharing in multi-core real-time systems
No caching of shared data [RTSS’09, RTNS’10]
Task scheduling on shared data [ECRTS’10, RTSS’16]
Another approach, which improves on the previous one from a performance standpoint, is task scheduling on shared data. Tasks that access the same shared data are co-located on the same core to exploit private cache hits on shared data. The flip side is that this approach requires changes to the OS scheduler. Moreover, for multi-threaded applications in which threads work on the same shared data, it forces all threads onto the same core, limiting parallelism.
Trade-offs (no caching): Simpler timing analysis | Hardware support | Long execution time
Trade-offs (task scheduling): Private cache hits on shared data | No hardware support | Limited multi-core parallelism | Changes to OS scheduler

Motivation: Data sharing in multi-core real-time systems
No caching of shared data [RTSS’09, RTNS’10]
Task scheduling on shared data [ECRTS’10, RTSS’16]
Software cache coherence [SAMOS’14]
Finally, a solution that allows private cache hits on shared data without changes to the OS or hardware is software cache coherence. In this approach, the application is modified to protect accesses to shared data. Changing the application is a tedious effort and may not be viable for closed-source applications.
Trade-offs (no caching): Simpler timing analysis | Hardware support | Long execution time
Trade-offs (task scheduling): Private cache hits on shared data | No hardware support | Limited multi-core parallelism | Changes to OS scheduler
Trade-offs (software coherence): Private cache hits on shared data | Hardware support | Source code changes

Motivation: Data sharing in multi-core real-time systems
Prior approaches (no caching of shared data [RTSS’09, RTNS’10], task scheduling on shared data [ECRTS’10, RTSS’16], software cache coherence [SAMOS’14]) all disable simultaneous caching of shared data to enable timing analysis, at the cost of lower performance and OS and application changes.

Motivation: Data sharing in multi-core real-time systems
This work: allow predictable simultaneous accesses to cached shared data through hardware cache coherence, providing significant performance improvements with no changes to the OS or applications.

Outline
Overview of hardware cache coherence
Challenges of applying conventional hardware cache coherence to RT multi-core systems
Our solution: the Predictable Modified-Shared-Invalid (PMSI) cache coherence protocol
Latency analysis of PMSI
Results
Conclusion
This is the outline for the remainder of my talk. I will first give a brief overview of hardware cache coherence. I will then show, using examples, that naively applying a well-studied hardware cache coherence protocol to an RT multi-core system does not necessarily provide timing guarantees for shared data accesses. I will then introduce PMSI, our solution for predictable hardware cache coherence.

Overview of hardware cache coherence: the Modified-Shared-Invalid (MSI) protocol
Bus messages: GETS (memory load), GETM/UPG (memory store), PUTM (replacement).
State transitions (Own = issued by this cache; Other = observed from another core):
I → S on OwnGETS; I → M on OwnGETM
S → M on OwnUPG; S → I on OtherGETM or OtherUPG
M → S on OtherGETS; M → I on OtherGETM or OwnPUTM
A line remains in S on OwnGETS or OtherGETS.
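The MSI transitions above can be written down as a lookup table. The following is a minimal illustrative sketch (the `next_state` helper and the table encoding are my own, not from the talk):

```python
# Minimal sketch of the MSI states and the transitions named on the slide.
# Events: GetS (load), GetM/Upg (store), PutM (replacement); "Own" events are
# issued by this cache, "Other" events are observed on the shared bus.

MSI_TRANSITIONS = {
    # (state, event) -> next state
    ("I", "OwnGetS"): "S",
    ("I", "OwnGetM"): "M",
    ("S", "OwnUpg"): "M",        # upgrade S -> M on a store
    ("S", "OtherGetM"): "I",     # another core writes: invalidate
    ("S", "OtherUpg"): "I",
    ("M", "OtherGetS"): "S",     # another core reads: downgrade and share
    ("M", "OtherGetM"): "I",
    ("M", "OwnPutM"): "I",       # replacement writes the line back
}

def next_state(state: str, event: str) -> str:
    """Return the next MSI state; unlisted (state, event) pairs keep the state."""
    return MSI_TRANSITIONS.get((state, event), state)
```

For example, a line held Modified that is read by another core is downgraded to Shared, and invalidated if that core then writes it.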

Applying hardware cache coherence to multi-core RT systems
The Modified-Shared-Invalid protocol on top of a multi-core RT platform: each core has a private L1 D/I cache, connected by a real-time interconnect to the shared LLC/main memory.

Applying hardware cache coherence to multi-core RT systems
Multi-core RT system with predictable arbitration to shared components + well-studied cache coherence protocol = predictable latencies for shared data accesses?


Sources of unpredictable memory access latencies due to hardware cache coherence:
1. Inter-core coherence interference on the same cache line
2. Inter-core coherence interference on different cache lines
3. Inter-core coherence interference due to write hits
4. Intra-core coherence interference

System model
Cores arbitrate for the real-time interconnect under a TDM schedule (slots for Core0, Core1, Core2, Core3 in turn). Each core has pending-request (PR) and pending-write-back (PWB) buffers. [Figure: example cache state; Core0 holds A:100 in M; Core1, Core2 and Core3 hold A in I; the LLC holds a stale A:50 plus B:42 and C:31 in S.]
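The TDM arbitration in this system model can be sketched in a few lines. The schedule below and the helper names (`slot_owner`, `worst_case_wait`) are illustrative assumptions, not taken from the paper:

```python
# Sketch of TDM bus arbitration: each core owns a fixed slot in a repeating
# schedule, so the wait for bus access is bounded by construction.

TDM_SCHEDULE = ["Core0", "Core1", "Core2", "Core3"]  # one slot per core

def slot_owner(cycle: int, slot_width: int = 50) -> str:
    """Core that owns the bus at a given cycle."""
    slot = (cycle // slot_width) % len(TDM_SCHEDULE)
    return TDM_SCHEDULE[slot]

def worst_case_wait(slot_width: int = 50) -> int:
    """A core that just missed its slot waits through all other slots plus the
    remainder of the current one: at most N * slot_width - 1 cycles."""
    return len(TDM_SCHEDULE) * slot_width - 1
```

This bounded slot wait is the WCLarb component that appears in the latency analysis later in the talk.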

Intra-core coherence interference
Event: Core2 reads A. Core0 holds A (value 100) in M; Core2 issues GetS A and waits for the data (* denotes waiting for data). The LLC copy of A (50) is stale, so the request must be served by Core0's write back. (PR: pending requests; PWB: pending write backs.)

Intra-core coherence interference
Event: Core0 reads B. In its own TDM slot, Core0 issues GetS B while Core2 is still waiting for the data of A.

Intra-core coherence interference
Core2 is still waiting for the data of A; Core0 now also holds B:42 in S.

Intra-core coherence interference
Event: Core0 reads C. Core0 again spends its TDM slot on its own request rather than on the pending write back of A.

Intra-core coherence interference
Because Core0's own requests keep taking priority over the write back of A, Core2's read of A never completes: ✖ unbounded latency for Core2.

Intra-core coherence interference
Core0: GetS B / GetS C. Core0's stream of pending requests can indefinitely bypass its pending write back of A.

Intra-core coherence interference
Solution 1: architectural modification. Predictable arbitration within each core between its pending requests and pending write backs, so that the write back of A is serviced before the read of B.


Intra-core coherence interference
Solution 1: architectural modification. Predictable arbitration between pending requests and pending write backs (WB A before Read B).
Solution 2: coherence protocol modifications. The protocol needs to maintain state indicating Core0's pending write back of A.
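Solution 1 amounts to a per-core arbitration policy between the two buffers. The sketch below illustrates one such policy, draining write backs before new requests; the `CoreArbiter` class and its method names are my own illustration, not the paper's exact design:

```python
from collections import deque
from typing import Optional

class CoreArbiter:
    """Per-core arbitration between pending requests (PR) and pending
    write backs (PWB). Servicing write backs first keeps a remote reader
    (Core2 waiting on A) from being starved by this core's own requests."""

    def __init__(self) -> None:
        self.pr: deque = deque()   # pending requests, e.g. "GetS B"
        self.pwb: deque = deque()  # pending write backs, e.g. "WB A"

    def next_action(self) -> Optional[str]:
        """In this core's TDM slot, drain write backs before new requests."""
        if self.pwb:
            return self.pwb.popleft()
        if self.pr:
            return self.pr.popleft()
        return None
```

With this policy, the scenario above resolves: the write back of A is issued ahead of the reads of B and C, so Core2's wait is bounded.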

Sources of unpredictable memory access latencies due to cache coherence, and their solutions:
1. Inter-core coherence interference on the same cache line. Solution: architectural modification to the shared LLC/memory.
2. Inter-core coherence interference on different cache lines. Solution: architectural modification to the core + coherence protocol modifications.
3. Inter-core coherence interference due to write hits.
4. Intra-core coherence interference.
Architectural modifications + coherence modifications = the Predictable Modified-Shared-Invalid cache coherence protocol (PMSI): predictable latencies for simultaneous shared data accesses, no application/OS changes, and significant performance benefits.

PMSI cache coherence protocol: protocol modifications
Transition table (stable states I, S, M plus transient states; core events: Load, Store, Replacement; bus events: OwnData, OwnUpg, OwnPutM, OtherGetS, OtherGetM, OtherUpg, OtherPutM; entries are action/next-state):
I: Load → Issue GetS/ISd; Store → Issue GetM/IMd; Replacement → X; bus events → N/A
S: Load → Hit; Store → Issue Upg/SMw
M: Replacement → Issue PutM/MIwb; OtherGetS → Issue PutM/MSwb
ISd: OwnData → Read/S; OtherGetM → ISdI
ISdI: OwnData → Read/I
IMd: OwnData → Write/M; OtherGetS → IMdS; OtherGetM → IMdI
IMdI: OwnData → Write/MIwb
IMdS: OwnData → Write/MSwb
SMw: OwnUpg → Store/M
MIwb: OwnPutM → Send data/I
MSwb: OwnPutM → Send data/S
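A subset of these PMSI transitions can be encoded directly as a lookup table. The sketch below covers only the entries legible on the slide and is not the complete protocol; the `step` helper is my own illustration:

```python
# Illustrative subset of PMSI transitions, including the transient states that
# remember a pending write back (MSwb, MIwb). Entries map
# (state, event) -> (action, next_state). Not the complete protocol.

PMSI_SUBSET = {
    ("I",    "Load"):        ("Issue GetS", "ISd"),
    ("I",    "Store"):       ("Issue GetM", "IMd"),
    ("S",    "Store"):       ("Issue Upg",  "SMw"),
    ("M",    "Replacement"): ("Issue PutM", "MIwb"),
    ("M",    "OtherGetS"):   ("Issue PutM", "MSwb"),  # schedule WB, stay coherent
    ("ISd",  "OwnData"):     ("Read",       "S"),
    ("SMw",  "OwnUpg"):      ("Store",      "M"),
    ("MIwb", "OwnPutM"):     ("Send data",  "I"),
    ("MSwb", "OwnPutM"):     ("Send data",  "S"),
}

def step(state: str, event: str):
    """Return (action, next_state) for a known PMSI transition."""
    action, nxt = PMSI_SUBSET[(state, event)]
    return action, nxt
```

Note how an observed OtherGetS in M moves the line to MSwb rather than servicing the remote read immediately: the write back is deferred to the core's own predictable slot.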


Intra-core coherence interference under PMSI
Event: Core2 reads A. Core0 holds A:100 in M. Per the PMSI table (M, OtherGetS → Issue PutM/MSwb), Core0 schedules a write back of A and moves to MSwb. Core2's request is recorded in the PR lookup table: address A, core ID 2, request GetS.

Intra-core coherence interference under PMSI
Event: Core0 writes back A in its dedicated WB slot (MSwb, OwnPutM → Send data/S): Core0 sends the data of A, the LLC copy is updated from 50 to 100, and Core0's state for A changes from M to S.

Intra-core coherence interference under PMSI
Event: Core2 completes its read of A. The data response delivers A:100 to Core2, which caches A in S; the PR LUT entry (A, core 2, GetS) is satisfied.

Intra-core coherence interference under PMSI
Event: Core0 reads B in its request slot: Core0 issues GetS B and B moves from I to S. With separate request and write-back slots, Core0's own requests can no longer starve Core2's pending read of A.

Latency analysis of PMSI
N: number of cores. The worst-case latency (WCL) of a memory request is composed of: Lacc, the access latency of the LLC/main memory; WCLarb, the worst-case wait for a TDM slot; WCLintraCoh, interference from the same core's pending write backs; and WCLinterCoh, interference from other cores' pending requests to the same line (e.g., pending GetM requests from Core1 and Core2 to line A).

Latency analysis of PMSI
WCL of a memory request = Lacc + WCLarb (∝ N) + WCLintraCoh (∝ N) + WCLinterCoh (∝ N²)
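The shape of this bound can be illustrated with a worked sketch. The per-term constants below are placeholders chosen for illustration, not the values derived in the paper; only the growth rates (linear, linear, quadratic in N) follow the slide:

```python
# Worked sketch of the PMSI worst-case latency bound:
#   WCL = Lacc + WCLarb(prop. to N) + WCLintraCoh(prop. to N)
#              + WCLinterCoh(prop. to N^2)
# Constants are illustrative placeholders.

def wcl(n_cores: int, l_acc: int = 50, slot: int = 50) -> int:
    wcl_arb = n_cores * slot            # wait for a TDM slot: linear in N
    wcl_intra = n_cores * slot          # own pending write backs: linear in N
    wcl_inter = (n_cores ** 2) * slot   # other cores' coherence activity: N^2
    return l_acc + wcl_arb + wcl_intra + wcl_inter
```

The quadratic term dominates as the core count grows, which matches the conclusion that the WCL with PMSI is proportional to N².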

Results: experimental setup
Cycle-accurate gem5 simulator; 2, 4, and 8 cores; 16 KB direct-mapped private L1 D/I caches; in-order cores at 2 GHz; TDM bus arbitration with a slot width of 50 cycles. Benchmarks: SPLASH-2, plus synthetic benchmarks that stress worst-case scenarios. Simulation framework available at https://git.uwaterloo.ca/caesr-pub/pmsi

Results: observed worst-case latency per request
[Chart: per-request worst-case latency for PMSI versus three unpredictable variants (Unpredictable 2, 3 and 4).]

Results: observed worst-case latency per request
[Chart: the same comparison with the analytical latency bound overlaid.]


Results: slowdown relative to MESI cache coherence
[Chart: slowdown of uncache-all, single-core, uncache-shared, PMSI and MSI configurations relative to MESI.]


Conclusion
Enumerated the sources of unpredictability when applying conventional hardware cache coherence to multi-core real-time systems.
Proposed the PMSI cache coherence protocol, which allows predictable simultaneous shared data accesses with no changes to applications or the OS.
The hardware overhead of PMSI is ~128 bytes.
The WCL of a memory access with PMSI is proportional to the square of the number of cores (∝ N²).
PMSI outperforms prior techniques for handling shared data accesses (up to 4x improvement over uncache-shared) and incurs an average slowdown of 46% relative to the conventional MESI cache coherence protocol.

Thank you & Questions?