Timing Isolation for Memory Systems


1 Timing Isolation for Memory Systems
Rodolfo Pellizzoni

2 Publications
Single Core Equivalent Virtual Machines for Hard Real-Time Computing on Multicore Processors, IEEE Computer 2016
Memory Bandwidth Management for Efficient Performance Isolation in Multi-core Platforms, IEEE Transactions on Computers 2016
Schedulability Analysis for Memory Bandwidth Regulated Multicore Real-Time Systems, IEEE Transactions on Computers 2016
A Real-Time Scratchpad-centric OS for Multi-core Embedded Systems, RTAS 2016
WCET(m) Estimation in Multi-Core Systems using Single Core Equivalence, ECRTS 2015
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems, ECRTS 2015
PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms, RTAS 2014
Worst Case Analysis of DRAM Latency in Multi-Requestor Systems, RTSS 2013
Real-Time Cache Management Framework for Multi-core Architectures, RTAS 2013

3 Thanks
Ahmed Alhammad, Saud Wasly, Zheng Pei Wu, Heechul Yun, Gang Yao, Renato Mancuso, Marco Caccamo, Lui Sha

4 Multicore Systems
Multicores already dominate servers, desktops, and mobile devices; soon more RT/embedded systems will use multicores as well.

5 Exploiting Multicores
Idea 1: combine multiple software partitions / applications. Industrial standards (ARINC 653, AUTOSAR) define systems in terms of partitions / software components, so we can put more partitions on the same system. Problem: independent development / certification. This is the focus of this talk.
Idea 2: parallel applications, i.e. computationally-intensive applications. Problem: task model and optimization.

6 Challenge: Shared Memory Resources
On a unicore, all partitions (P1-P8) run on a single CPU above the memory hierarchy; on a multicore, partitions run concurrently on Cores 1-4 and share the memory hierarchy, with a direct performance impact. We must bound the interference.

7 How Bad Can it Be?
Slowdown ratio = Solo IPC / Corun IPC. Setup: Core0 runs the benchmark on the X-axis; Cores 1-3 run 470.lbm x 3 (interference). (*) Measured on an Intel Xeon 3530 (4 cores), 8GB 1-channel DDR3 DRAM.

8 DO-178C Certification
Certification Authorities Software Team (CAST) Position Paper 32, May 2014: all effects due to shared resources in multicore processors must be identified, analyzed, and mitigated.

9 DO-178C Certification
CAST identifies: main memory (DRAM), shared cache space, the interconnection, and Miss Status Holding Registers (MSHRs) in caches. Other resources might create interference as well (not in this talk).

10 Fully-Compositional Analysis
No assumption is made on the other cores (black boxes) contending for the shared physical resource. This allows independent verification / certification.

12 Overview
Our fully-compositional approach: Single-Core Equivalence (cache isolation, main memory / DRAM isolation). DRAM realities: COTS arbitration is complex, so realistic analysis bounds are needed. Improving DRAM access latencies: HW solutions, and our SW solution (PREM).

13 Single Core Equivalence (SCE)
End goal: Single Core Equivalent (SCE) execution. Prove that each core in an M-core platform behaves as a (possibly slower) single-processor system; then certify each ARINC-653 partition on the (slower) single-processor system.

14 Overview
1. Colored Lockdown: the shared cache is split so that each of Cores 1..M gets its own assigned cache partition (Assigned Cache #1..#M) in front of the interconnection, memory controller, and DRAM.

15 Overview
2. PALLOC: in addition to Colored Lockdown, DRAM is partitioned so that each core effectively sees its own DRAM (DRAM #1..#M) behind the interconnection and memory controller.

16 Overview
3. MemGuard: in addition to Colored Lockdown and PALLOC, the memory controller bandwidth is split into per-core reservations (Memory Contr. #1..#M), completing the per-core view of the memory hierarchy.

17 1. Colored Lockdown
Converts the shared cache into a deterministic object. Management model: consider the cache as a 2D array of blocks and assign arbitrary sets of blocks to each task; profiling is used to determine the most beneficial set of blocks for each task.

18 1. Colored Lockdown 1. Coloring
Manipulate physical memory addresses to enforce vertical positioning

19 1. Colored Lockdown 1. Coloring 2. Lockdown
Manipulate physical memory addresses to enforce vertical positioning 2. Lockdown Rely on hardware lockdown features to prevent data eviction and enforce horizontal positioning

20 1. Implementation
Coloring is implemented at the OS level: it leverages the virtual-to-physical mapping, has the granularity of a memory page (4KB), and can be performed off-line.
Lockdown relies on primitives provided in commercial hardware: it protects data from any source of interference, allocated lines can be freed at any moment, and the OS is responsible for page allocation in the cache.
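As a rough illustration of the coloring step, here is a minimal sketch of how an OS-level allocator could derive a page's cache color from its physical frame number. The cache geometry (64-byte lines, 2048 sets) is a hypothetical example, not the parameters of any particular platform, and this is not the actual Colored Lockdown kernel code.

```python
# Minimal sketch of OS-level page coloring (hypothetical cache geometry).
# A page's "color" is given by the physical-address bits that index the
# cache set but lie above the 4 KB page offset, so the OS can choose it
# by picking which physical page frame backs a virtual page.

PAGE_SIZE  = 4096                     # coloring granularity (one memory page)
LINE_SIZE  = 64                       # assumed cache line size
NUM_SETS   = 2048                     # assumed number of cache sets
WAY_SIZE   = LINE_SIZE * NUM_SETS     # bytes covered by one cache way
NUM_COLORS = WAY_SIZE // PAGE_SIZE    # distinct page colors

def page_color(phys_addr: int) -> int:
    """Return the cache color of the page containing phys_addr."""
    page_frame = phys_addr // PAGE_SIZE
    return page_frame % NUM_COLORS

# Example: two frames that differ by NUM_COLORS map to the same color,
# i.e. the same vertical slice of the cache.
assert page_color(0 * PAGE_SIZE) == page_color(NUM_COLORS * PAGE_SIZE)
```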

21 1. Profiling and Lockdown Curve
Assumption: all non-locked accesses are misses in cache
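A minimal sketch of how the lockdown curve could be derived from a per-page access profile, under the slide's assumption that every access to a non-locked page misses in the cache. The page names and access counts are invented for illustration.

```python
# Sketch of building a lockdown curve from a profile, assuming every access
# to a non-locked page is a miss. `profile` maps each page of the task to
# its access count (hypothetical data).

def lockdown_curve(profile: dict) -> list:
    """Residual main-memory accesses when locking the k hottest pages."""
    counts = sorted(profile.values(), reverse=True)   # most beneficial first
    total = sum(counts)
    curve = []
    locked = 0
    for k in range(len(counts) + 1):
        curve.append(total - locked)   # accesses still missing with k pages locked
        if k < len(counts):
            locked += counts[k]
    return curve

# Example: locking the hottest page removes most residual misses.
print(lockdown_curve({"pageA": 900, "pageB": 80, "pageC": 20}))
# -> [1000, 100, 20, 0]
```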

22 1. Other Cache Partitioning Options…
M. Chisholm, N. Kim, B. Ward, N. Otterness, J. Anderson, and F. Smith. “Reconciling the Tension Between Hardware Isolation and Data Sharing in Mixed-Criticality, Multicore Systems”, RTSS’16

23 DRAM Background
The storage array contains the data; reads and writes can only go through the row buffer.

24 DRAM Background
The front end generates the needed commands; the back end issues them on the command bus.

25 DRAM Background
A READ arrives targeting data in a given row, but the row buffer contains data from a different row.

26 DRAM Background
For such a request the controller must issue the command sequence P, A, R (Precharge, Activate, Read).

27 DRAM Background
Pre-Charge (P): store the data from the row buffer back into the array. ACT (A): load the data from the array into the row buffer. Commands are issued on the command bus and separated by timing constraints.

28 DRAM Background
CAS (R/W): reads or writes data using the row buffer. A close request therefore needs the full P, A, R sequence before data is transferred.

29 DRAM Background
A READ targeting data already in the row buffer only needs the Read (R) command, which can be issued immediately.

30 DRAM Background
The latency of a close request (P, A, R) is much longer than the latency of an open request (R only).
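To make the difference concrete, the sketch below turns the two command sequences into rough per-request latencies. The timing values (tRP, tRCD, CL, burst length) are illustrative DDR3-style numbers, not the parameters of a specific device or of the analysis in this talk.

```python
# Rough sketch of per-request DRAM latency in command-bus cycles, using the
# command sequences from the slides (open request: R only; close request:
# P, A, R). All timing values are illustrative.

tRP    = 9   # Precharge: write the row buffer back to the array
tRCD   = 9   # Activate: load the row from the array into the buffer
CL     = 9   # CAS latency: cycles from Read command to first data
tBURST = 4   # cycles to transfer one 64-byte cache line on the data bus

def open_request_latency() -> int:
    # Row already in the buffer: only the Read command is needed.
    return CL + tBURST

def close_request_latency() -> int:
    # Buffer holds a different row: Precharge, Activate, then Read.
    return tRP + tRCD + CL + tBURST

print(open_request_latency(), close_request_latency())   # e.g. 13 vs 31
```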

31 Bank Parallelism
Accessing data in multiple banks: the data transfers can be pipelined.

32 Bank Parallelism
Requests should be spread over multiple banks. Accessing data in the same bank but in different rows allows no pipelining and causes a much larger delay.

33 Write-to-Read Penalty
Transactions of the same type can be pipelined, but a read after a write cannot: the Write-to-Read (WtR) timing constraint imposes a huge latency penalty.
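A small sketch of the effect, assuming illustrative timing values: back-to-back transfers of the same type only pay the burst time, while every write-to-read switch additionally pays the WtR turnaround.

```python
# Sketch of why a read after a write is expensive. The timing numbers are
# illustrative, not taken from a specific datasheet.

tBURST = 4    # data-bus cycles per transfer
tWTR   = 12   # write-to-read turnaround constraint (illustrative)

def bus_time(ops: str) -> int:
    """ops is a string like 'WWRR'; returns total data-bus cycles."""
    cycles = 0
    for prev, cur in zip(" " + ops, ops):
        if prev == "W" and cur == "R":
            cycles += tWTR          # pay the write-to-read penalty
        cycles += tBURST
    return cycles

print(bus_time("RRRR"))   # 16: reads pipeline back to back
print(bus_time("WRWR"))   # 40: every read after a write pays tWTR
```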

34 2. OS & DRAM Banks
The OS does NOT know about DRAM banks: OS memory pages are spread all over the multiple banks, giving unpredictable memory performance. (Figure: Cores 1-4, shared cache, memory controller (MC), DRAM DIMM with Banks 1-4.)

35 2. Best Case
Each core's requests fall into a different bank: fast, peak bandwidth.

36 2. Worst Case
All cores hit a single bank and always a new row (sequential requests to Rows 1-4 of the same bank): roughly 1/10 of peak bandwidth.

37 2. Actual Case
Pages end up spread arbitrarily over the banks: performance = ??

38 2. PALLOC
The OS is made aware of the DRAM mapping: each page can be allocated to a desired DRAM bank, forcing the best case (bank partitioning).
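The sketch below captures the idea behind bank-aware allocation: if the OS knows which physical-address bits select the DRAM bank, it can give each core only page frames that fall into that core's banks. The bank-bit positions are hypothetical (real mappings are controller-specific), and this is only a toy model of what the PALLOC kernel allocator does.

```python
# Toy model of bank-aware page allocation. The bit positions below are
# assumptions; real controllers document (or require reverse-engineering of)
# their own bank mapping.

PAGE_SHIFT = 12
BANK_BITS  = [13, 14, 15]   # assumed address bits that select the bank

def bank_of(phys_addr: int) -> int:
    bank = 0
    for i, bit in enumerate(BANK_BITS):
        bank |= ((phys_addr >> bit) & 1) << i
    return bank

def allocate_page(free_frames, allowed_banks):
    """Pick a free physical frame whose bank is reserved for this core."""
    for frame in free_frames:
        if bank_of(frame << PAGE_SHIFT) in allowed_banks:
            free_frames.remove(frame)
            return frame
    return None   # no frame available in the core's banks

# Example: core 1 is only given pages in banks {2, 3}.
frames = list(range(1024))
print(allocate_page(frames, {2, 3}))
```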

39 2. Performance Isolation on 4 Cores
Slowdown ratio. Setup: Core0 runs the benchmark on the X-axis; Cores 1-3 run 470.lbm x 3 (interference). PB: DRAM bank partitioning only; PB+PC: DRAM bank and cache partitioning.

40 3. Bus Bandwidth Management
Why is PB+PC not ideal? Cores still contend for access to the memory data bus. Solution: bandwidth reservation, similar to a real-time server, but guaranteeing memory bandwidth rather than CPU time.

41 3. MemGuard
Determine the guaranteed system bandwidth based on PALLOC, then split the available bandwidth between cores by assigning reserved bandwidth budgets (e.g., 1.2GB/s of guaranteed DRAM bandwidth split into per-core regulators of 0.6, 0.2, 0.2 and 0.2 GB/s, each monitored through a per-core PMC by the MemGuard reservation manager in the OS).

42 3. MemGuard Reservation
Reserve per-core memory bandwidth via the OS scheduler: use a hardware PMC to monitor the memory request rate and suspend the RT task once the core's budget for the current regulation period (e.g., 1ms) is exhausted.
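A minimal sketch of the regulation logic, assuming a per-core budget counter replenished every regulation period and a PMC event that reports consumed requests; the callback names and numbers are placeholders rather than the actual MemGuard kernel implementation.

```python
# Sketch of MemGuard-style bandwidth regulation: each core gets a memory
# budget per regulation period; a performance counter tracks consumption and
# the core's real-time task is throttled when the budget runs out, then
# released at the next period boundary.

class BandwidthRegulator:
    def __init__(self, budget_per_period, suspend, resume):
        self.budget_per_period = budget_per_period
        self.remaining = budget_per_period
        self.suspend = suspend     # e.g. dequeue the core's RT task
        self.resume = resume       # e.g. re-enqueue it

    def on_pmc_event(self, requests: int):
        """Called when the PMC reports memory requests by this core."""
        self.remaining -= requests
        if self.remaining <= 0:
            self.suspend()         # budget exhausted: throttle until next period

    def on_period_boundary(self):
        """Called every regulation period (e.g. 1 ms) by a timer."""
        self.remaining = self.budget_per_period
        self.resume()

reg = BandwidthRegulator(budget_per_period=1000,
                         suspend=lambda: print("throttle core"),
                         resume=lambda: print("release core"))
reg.on_pmc_event(1200)     # overshoot within one period -> throttle
reg.on_period_boundary()   # budget replenished -> release
```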

43 SCE Analysis: How it Works
All cores have an equal memory budget every P time units; assume Round-Robin arbitration. Under SCE, banks are isolated with PALLOC, so interference occurs on the data bus: every memory request is delayed by M - 1 requests from the other cores (e.g., M = 4).

44 SCE Analysis
Parameters: computation time after locking, remaining main memory accesses after locking, size of a memory access, and min/max memory bandwidths.
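Combining these parameters with the round-robin argument of the previous slide gives a simplified bound of the following form; this is a didactic sketch, not the exact published WCET(m) formula.

```python
# Simplified sketch of an SCE-style bound: every residual memory access can
# be stalled behind up to M-1 accesses from other cores, and each access is
# served at the core's guaranteed (minimum) bandwidth.

def sce_wcet_bound(comp_time_after_locking: float,
                   residual_accesses: int,
                   access_size_bytes: int,
                   min_bandwidth_bytes_per_s: float,
                   num_cores: int) -> float:
    per_access = access_size_bytes / min_bandwidth_bytes_per_s
    # The task's own accesses, each possibly behind (M-1) interfering ones.
    memory_time = residual_accesses * per_access * num_cores
    return comp_time_after_locking + memory_time

# Example: 10k residual 64-byte accesses on a 4-core platform at 1.2 GB/s
# guaranteed bandwidth (all numbers illustrative).
print(sce_wcet_bound(0.005, 10_000, 64, 1.2e9, 4))
```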

45 SCE Analysis

46 SCE Results

47 Problem: COTS Arbiters are Unfair
The main issue: unfair arbitration among cores, so a request can be delayed by many other requests. Many examples: DRAM (FR-FCFS, write buffering), non-blocking caches, wormhole NoC arbitration.

48 DRAM FR-FCFS Scheduling
COTS memory controllers use First-Ready First-Come-First-Serve (FR-FCFS) arbitration: each bank has a separate request queue, requests targeting an open row are prioritized over requests targeting a close row, and arbitration among banks is FCFS. This maximizes throughput but has very bad worst-case delay: a close request under analysis can be delayed by open requests of other cores to the same bank (solved with bank partitioning), and an out-of-order core can insert multiple competing requests before the one under analysis.
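A toy sketch of the FR-FCFS decision described above: requests hitting the currently open row of their bank are "ready" and win; among requests with the same status, the oldest wins. Command-level timing is ignored, and the data structures are invented for illustration.

```python
# Toy FR-FCFS arbitration: prefer row hits over row misses, break ties by
# arrival time. Requests are (arrival_time, row) tuples per bank queue.

def frfcfs_pick(queues, open_rows):
    """queues: {bank: [(arrival, row), ...]}, open_rows: {bank: row}."""
    best = None
    for bank, reqs in queues.items():
        for arrival, row in reqs:
            hit = (row == open_rows.get(bank))
            key = (not hit, arrival)       # row hits first, then oldest
            if best is None or key < best[0]:
                best = (key, bank, (arrival, row))
    return None if best is None else (best[1], best[2])

queues = {0: [(1, 7), (3, 7)], 1: [(0, 2)]}
open_rows = {0: 7, 1: 5}
# The older request in bank 1 is a row miss, so the row hit in bank 0 wins.
print(frfcfs_pick(queues, open_rows))   # -> (0, (1, 7))
```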

49 DRAM Write Buffering Reads are prioritized over writes; writes are buffered and transmitted in non-preemptive batches (ARM A15 – 18 req.) If a write batch begins right before a read under analysis arrives, the request is delayed by 18 write requests…

50 Solutions
Analyze the system and accept high latency bounds.
Force fair arbitration at low level (HW): design new predictable arbiters; many proposals for DRAM memory controllers, interconnections, scratchpads, etc.
Force fair arbitration at high level (SW): control memory accesses in software; schedule both CPU execution and memory accesses.

52 Delay Analysis
Goal: compute the worst-case memory interference delay of a task under analysis. Request-driven analysis: based on the task's own memory demand H, compute the worst-case per-request delay RD; memory interference delay = RD x H. RD is bounded for FR-FCFS only in [Kim'14] and for FR-FCFS + write bundling in [Ours].
[Kim'14] H. Kim, D. de Niz, B. Andersson, M. Klein, O. Mutlu, and R. Rajkumar. "Bounding Memory Interference Delay in COTS-based Multi-Core Systems", RTAS'14
[Ours] H. Yun, R. Pellizzoni and P. K. Valsan. "Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems", ECRTS'15

53 Key Intuition #1 The # of competing requests Nrq is bounded
Because the # of per-core parallel requests is bounded. Example: the Cortex-A15's per-core bound is 6, so Nrq = 6 x 3 (cores) = 18.

54 Key Intuition #2 DRAM sub-commands of the competing memory requests are overlapped Much less pessimistic than [Kim’14], which simply sums up each sub-command’s maximum delay

55 Key Intuition #3
The worst-case delay happens when the read buffer has Nrq requests, the write request buffer has just become full (starting a write batch), and then the read request under analysis arrives. RD = read batch delay + write batch delay.
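Putting the three intuitions together, a simplified request-driven bound can be sketched as below; the service-time constants are placeholders, and RD here is a crude sum rather than the overlap-aware bound of the actual analysis.

```python
# Sketch of the request-driven bound from slides 52-55: the per-request delay
# RD covers Nrq buffered reads plus one full non-preemptive write batch, and
# the task's total interference is RD times its own number of requests H.

def request_driven_bound(H: int,
                         per_core_parallel_reqs: int,
                         num_other_cores: int,
                         write_batch_size: int,
                         read_service_time: float,
                         write_service_time: float) -> float:
    nrq = per_core_parallel_reqs * num_other_cores       # e.g. 6 x 3 = 18
    rd = nrq * read_service_time + write_batch_size * write_service_time
    return H * rd                                        # total interference delay

# Example with the Cortex-A15-style bound of slide 53 and an 18-request write
# batch (slide 49); the per-request service times are made up for illustration.
print(request_driven_bound(H=5_000, per_core_parallel_reqs=6,
                           num_other_cores=3, write_batch_size=18,
                           read_service_time=50e-9, write_service_time=50e-9))
```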

56 Results: Synthetic Benchmarks
Ours(ideal): read-only delay analysis (ignores writes, can underestimate). Ours(opt): assumes writes are balanced over multiple banks. Ours(worst): all writes target one bank and all are row misses (overestimates).

57 Results: SPEC2006 Main source of pessimism:
The pathological case of write (LLC write-backs) processing

58 Solutions
Analyze the system and accept high latency bounds.
Force fair arbitration at low level (HW): design new predictable arbiters; many proposals for DRAM memory controllers, interconnections, scratchpads, etc.
Force fair arbitration at high level (SW): control memory accesses in software; schedule both CPU execution and memory accesses.

59 Real-Time DRAM Controller
Goal: minimize worst-case request latency for real-time requestors (cores) by forcing RR between them, while still providing good bandwidth for non-real-time requestors. Existing proposals differ in bank mapping & command arbitration; problem: how to compare them? Worst-case latency: we derive a common analytical formulation for all controllers. Average case: MCsim, a memory controller simulator.
D. Guo, M. Hassan, R. Pellizzoni and H. Patel. "A Comparative Study of Predictable DRAM Controllers", Submitted to ACM TECS

60 DRAM Controller Structure
Address Mapping: map requests to banks.
Command Generator: generate commands for each request.
Request Scheduler: order requests.
Command Scheduler: arbitrate commands.
Write-to-Read technique: reordering or rank-switching.

61 Predictable DRAM Controllers
(Comparison table of predictable controllers: PMC, RTMem, ORP, MEDUSA, MAG, DCmc, ROC, ReOrder, MCMC; compared by request size, mixed-criticality support, rank support, address mapping (interleaved vs. private), request scheduler (RR, WC, Dir, Fix. Pri., Dir+RR, TDM), page policy (Close / Hybrid / Open), command scheduler (static vs. dynamic), analytical vs. run-time computation, and available simulator (C++, Python, gem5, VHDL, SystemC, or none).) Because the DRAM is operated by commands, a DRAM controller is needed to issue commands that respect the timing constraints.

62 MCsim Architecture
Subclass components illustrate the simulator's extensibility.

63 MCsim
Cycle-accurate simulator. Implements 10 state-of-the-art predictable MCs; each controller requires at most 200 lines of code. Device simulation is based on Ramulator; input is either memory traces or a full-system simulator (gem5).
Yoongu Kim, Weikun Yang, and Onur Mutlu. "Ramulator: A Fast and Extensible DRAM Simulator", CAL 2015.

64 Real-Time MC Design
Two basic strategies. The first: close page policy with interleaved mapping; treat all requests as close requests and interleave each one over multiple banks for parallelism.

65 Real-Time MC Design
Two basic strategies. Close page policy with interleaved mapping: treat all requests as close requests and interleave over multiple banks for parallelism. Open page policy with private banks: requests can be either open or close, and bank partitioning is forced to avoid row interference; the problem is that the controller must deal with write-read switching.

66 Write-Read Switch Strategies
Command/request reordering: break RR in a "controlled" way (e.g., reads and writes of different requests spread over Banks 0-3). Rank-switching: use different physical chips (ranks), with reads and writes directed to different ranks.

67 Requestors (Interleave vs. Private)
Close and open request latency (64B requests, 64-bit data bus) as the number of requestors grows. Interleaved controllers increase proportionally to the number of requestors; private-bank controllers (e.g., ORP) increase non-linearly because of bank parallelism, while their open requests grow more linearly because of the switching delay.

68 Row Hit Ratio (Switching Technique)
Average latency per request (8 requestors, 64B requests, 64-bit data bus) as a function of the row hit ratio.

69 Data Bus Width (Interleave vs. Private)
Close and open request latency (8 requestors, 64B requests, row hit ratio = 0.45) as a function of the data bus width.

70 Solutions
Analyze the system and accept high latency bounds.
Force fair arbitration at low level (HW): design new predictable arbiters; many proposals for DRAM memory controllers, interconnections, scratchpads, etc.
Force fair arbitration at high level (SW): control memory accesses in software; schedule both CPU execution and memory accesses.

71 Software Solution: PRedictable Execution Model (PREM)
Divide each task into scheduling intervals, separating memory accesses from CPU computation. Load phase: load data/instructions from main memory into local memory. Execution phase: execute from local memory. Unload phase: write back modified data from local memory to main memory.
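A minimal sketch of one PREM scheduling interval; the Task fields and the dma_load / run_from_local / dma_unload callbacks are placeholders for the platform-specific mechanisms (e.g., the DMA engine and scratchpad of the MPC5777M example that follows).

```python
# Sketch of a PREM scheduling interval: load, execute, unload.

class Task:
    def __init__(self, working_set, dirty_data):
        self.working_set = working_set
        self.dirty_data = dirty_data

def run_prem_interval(task, dma_load, run_from_local, dma_unload):
    dma_load(task.working_set)        # load phase: main memory -> local memory
    result = run_from_local(task)     # execution phase: no main-memory traffic
    dma_unload(task.dirty_data)       # unload phase: local memory -> main memory
    return result

# Example with dummy callbacks standing in for DMA and task code.
run_prem_interval(Task(["a", "b"], ["b"]),
                  dma_load=lambda ws: print("load", ws),
                  run_from_local=lambda t: print("execute"),
                  dma_unload=lambda d: print("unload", d))
```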

72 Example: MPC5777M Implementation
Automotive platform 2 application cores run PREM tasks I/O core used for management

73 Example: MPC5777M Implementation
Automotive platform 2 application cores run PREM tasks I/O core used for management DMA used to load / unload memory to scratchpad memory

74 Example: MPC5777M Implementation
Automotive platform 2 application cores run PREM tasks I/O core used for management DMA used to load / unload memory to scratchpad memory Task 1 executes

76 Scheduling
Task execution is pipelined: divide the scratchpad into two partitions, execute a task from one partition while loading/unloading another task in the other partition.

77 Scheduling
Task execution is pipelined. Inter-core isolation is obtained through (high-level) TDMA arbitration; the TDMA slot size is the time to reload one partition.
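A small sketch of this high-level TDMA arbitration, assuming two cores and an illustrative slot length equal to the time needed to reload one scratchpad partition; the numbers are not from the actual implementation.

```python
# Sketch of high-level TDMA over the memory (DMA) resource: slots rotate
# round-robin over the cores, and each slot is long enough to reload one
# scratchpad partition while the other partition executes.

NUM_CORES = 2
SLOT_LEN  = 250e-6        # time to reload one scratchpad partition (assumed)

def dma_owner(t: float) -> int:
    """Core that may use the DMA / memory at time t."""
    slot_index = int(t // SLOT_LEN)
    return slot_index % NUM_CORES

def worst_case_reload_latency() -> float:
    """Wait for the other cores' slots, then spend one slot reloading."""
    return NUM_CORES * SLOT_LEN

print(dma_owner(0.0), dma_owner(300e-6))      # 0, then 1
print(worst_case_reload_latency())            # 5e-04
```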

78 Schedulability Results

79 PREM Compilation
Three main problems. (1) The task memory footprint is too large to fit in the SPM/cache: break the task into multiple chunks. (2) Data usage depends on inputs: decide what to load based on control flow at run-time. (3) Number of cores > number of tasks/threads: pipeline memory phases with execution of the same task. Solution: compiler-driven predictable data prefetching.

80 Predictable Prefetching
Construct the program CFG and extract information on object/pointer usage. Start the DMA load phase before the data is first used, overlapping the memory phase with program execution; unload data after its use. (Figure: CFG with basic blocks BB1-BB8 annotated with load x / load y and unload x / unload y.)

81 Predictable Prefetching
Preliminary implementation based on LLVM. A FIFO queue allows issuing multiple DMA operations; either special hardware or a manager core is needed for DMA management / pointer resolution. The allocation algorithm is optimized for worst-case execution time, and TDMA arbitration between cores is used for DMA operations.

82 Conclusions
Resource contention is a major problem for the deployment of real-time multicore systems. Timing isolation is important for independent system verification / certification. COTS arbiters are not designed for worst-case latency bounds... but we can use a mix of OS / compiler / HW solutions.

83 Questions?

86 Evaluation Setup
Gem5 simulator: 4 out-of-order cores (based on Cortex-A15); L2 MSHR size is increased to eliminate MSHR contention; DRAM controller model [Hansson'14], 533MHz. Linux 3.14: uses PALLOC [Yun'14] to partition DRAM banks and LLC. Workload: subject = Latency and SPEC2006; co-runner(s) = Bandwidth (write).

87 DRAM FR-FCFS Scheduling [Rixner'00]
Goal: maximize memory throughput. Each bank scheduler serves open requests first; ready requests between banks are served FCFS by the channel scheduler. Unfairness: open vs. close (prevented with PALLOC), and FCFS instead of RR (not preventable; out-of-order cores get more bandwidth).
[Rixner'00] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. Owens. "Memory Access Scheduling", ACM SIGARCH Computer Architecture News, 2000

88 DRAM Controller
Address Mapping: private bank or interleaved bank.
Request Arbiter: FR-FCFS (COTS policy to improve performance); RR and work-conserving TDM (fair arbitration schemes); TDM (composable policy, regardless of other factors).
Command Generator: Open-Page (keep the row buffer open until precharged); Close-Page (auto-precharge the activated row); Hybrid-Page (use both policies to generate a pattern of commands).
Command Scheduler: Static (a sequence of commands with pre-defined timing); Dynamic (scheduling decisions made at run-time depending on the timing constraints and banks); General delay (delay between CAS commands regardless of the bank).
Request and Command Queues: connect the arbiter, generator, and scheduler; constructed based on the scheduling policy, at the requestor level and the memory hierarchy level.

89 The Core Model: PRedictable Execution Model (PREM)
Enforce TDMA access to memory.

90 Memory Performance

