1
Rebound: Scalable Checkpointing for Coherent Shared Memory
Rishi Agarwal, Pranav Garg, and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign
2
Checkpointing in Shared-Memory MPs
HW-based schemes for small CMPs use global checkpointing: all processors participate in system-wide checkpoints.
Global checkpointing is not scalable: synchronization, bursty movement of data, loss in rollback…
[Figure: P1–P4 periodically save a global checkpoint; on a fault, all processors roll back to it.]
3
Alternative: Coordinated Local Checkpointing
Idea: threads coordinate their checkpointing in groups.
Rationale: faults propagate only through communication; interleaving between non-communicating threads is irrelevant.
[Figure: P1–P5 taking local checkpoints in small groups vs. a single global checkpoint.]
+ Scalable: checkpoint and rollback in processor groups.
− Complexity: record inter-thread dependences dynamically.
4
Contributions
Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory.
Leverages the directory protocol to track inter-thread dependences.
Optimizations to boost checkpointing efficiency:
Delaying the write-back of data to safe memory at checkpoints
Supporting multiple checkpoints
Optimizing checkpointing at barrier synchronization
Avg. performance overhead for 64 procs: 2%, compared to 15% for global checkpointing.
5
Background: In-Memory Checkpt with ReVive
[Prvulovic-02]
[Figure: during execution, registers are dumped at each checkpoint (CHK); on a displacement, a dirty cache line is written back and its old memory value is logged. At a checkpoint, the application stalls while all dirty lines are written back and the old values are logged to the in-memory log.]
6
Background: In-Memory Checkpt with ReVive
[Prvulovic-02]
[Figure: on a fault, caches are invalidated, memory lines are reverted from the log, and the old registers are restored.]
ReVive: global, broadcast protocol. Rebound: local, coordinated, scalable protocol.
7
Coordinated Local Checkpointing Rules
Rules:
When P checkpoints, P's producers checkpoint.
When P rolls back, P's consumers roll back.
[Figure: P1 writes x and P2 reads it; if consumer P2 checkpoints, producer P1 must also checkpoint; if producer P1 rolls back, consumer P2 must also roll back.]
Banatre et al. used coordinated local checkpointing for bus-based machines [Banatre96]
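The two rules propagate transitively along the dependence edges. A minimal sketch of that propagation (helper names are mine, not from the talk):

```python
# Sketch of the coordinated-checkpointing rules (hypothetical names).
# producers[p] = set of processors that produced data p consumed.
# consumers[p] = set of processors that consumed data p produced.

def closure(start, edges):
    """All processors reachable from `start` via `edges` (transitive closure)."""
    group, frontier = {start}, [start]
    while frontier:
        p = frontier.pop()
        for q in edges.get(p, set()):
            if q not in group:
                group.add(q)
                frontier.append(q)
    return group

def must_checkpoint(p, producers):
    # Rule 1: if P checkpoints, P's producers must checkpoint (transitively).
    return closure(p, producers)

def must_rollback(p, consumers):
    # Rule 2: if P rolls back, P's consumers must roll back (transitively).
    return closure(p, consumers)
```

For example, if P1 consumed from P2 and P2 consumed from P3, then `must_checkpoint(1, ...)` returns `{1, 2, 3}`.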
8
Rebound Fault Model
[Figure: chip multiprocessor with off-chip main memory and a software-managed log.]
Any part of the chip can suffer transient or permanent faults; a fault can occur even during checkpointing.
Off-chip memory and logs suffer no faults on their own (e.g., NVM).
Fault detection is outside our scope; fault-detection latency has an upper bound of L cycles.
9
Rebound Architecture
[Figure: chip multiprocessor with off-chip main memory; each node has a processor + L1, an L2 with Dependence (Dep) registers (MyProducers, MyConsumers), and a directory cache whose entries carry an LW-ID field.]
10
Rebound Architecture
[Figure: Rebound architecture, highlighting the Dep registers in the L2 cache controller.]
Dependence (Dep) registers in the L2 cache controller:
MyProducers: bitmap of processors that produced data consumed by the local processor.
MyConsumers: bitmap of processors that consumed data produced by the local processor.
11
Rebound Architecture
[Figure: Rebound architecture, highlighting the Dep registers in the L2 and the per-entry LW-ID in the directory cache.]
Dependence (Dep) registers in the L2 cache controller:
MyProducers: bitmap of processors that produced data consumed by the local processor.
MyConsumers: bitmap of processors that consumed data produced by the local processor.
Processor ID in each directory entry:
LW-ID: last writer to the line in the current checkpoint interval.
12
Recording Inter-Thread Dependences
[Figure: P1 writes a line; the directory sets LW-ID = P1 and the line becomes Dirty in P1's cache. No Dep registers are updated yet.]
Assume a MESI protocol.
13
Recording Inter-Thread Dependences
[Figure: P2 reads the line. The directory sees LW-ID = P1, so P1 writes the line back (the old value is logged) and the line becomes Shared. The hardware adds P1 to P2's MyProducers and P2 to P1's MyConsumers.]
Assume a MESI protocol.
14
Recording Inter-Thread Dependences
[Figure: P1 writes the line again; it goes from Shared back to Dirty in P1's cache, LW-ID remains P1, and the Dep registers are unchanged.]
Assume a MESI protocol.
15
Recording Inter-Thread Dependences
[Figure: P1 checkpoints: its dirty lines are written back and the old values are logged; P1's Dep registers are cleared. Its LW-IDs should also be cleared, but an LW-ID should remain set until the line is checkpointed.]
Assume a MESI protocol.
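The bookkeeping walked through in these slides can be sketched in a few lines. This is a software simplification with hypothetical names; in Rebound the same updates happen in hardware in the L2 controller and the directory:

```python
# Sketch of Rebound's dependence recording (hypothetical names).
class Node:
    def __init__(self):
        self.my_producers = set()   # Dep register: who produced data I consumed
        self.my_consumers = set()   # Dep register: who consumed data I produced

class Directory:
    def __init__(self, nodes):
        self.nodes = nodes          # processor id -> Node
        self.lw_id = {}             # line address -> last writer this interval

    def on_write(self, p, addr):
        # The writer becomes the line's last writer for this interval.
        self.lw_id[addr] = p

    def on_read(self, p, addr):
        # If another processor wrote the line this interval, record the
        # producer -> consumer dependence in both Dep registers.
        w = self.lw_id.get(addr)
        if w is not None and w != p:
            self.nodes[p].my_producers.add(w)
            self.nodes[w].my_consumers.add(p)

    def on_checkpoint(self, p):
        # Clear P's Dep registers; LW-IDs are cleared lazily (next slides).
        self.nodes[p].my_producers.clear()
        self.nodes[p].my_consumers.clear()
```

Replaying the slides' example: P1 writes a line, P2 reads it, and the dependence P1 → P2 appears in both processors' Dep registers.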
16
Lazily clearing Last Writers
Clearing all of a processor's LW-IDs in the directory at every checkpoint would be an expensive process!
Instead, a Write Signature encodes all line addresses that the processor has written (or read exclusively) in the current interval.
At a checkpoint, the processor simply clears its Write Signature, leaving potentially stale LW-IDs in the directory.
17
Lazily clearing Last Writers
[Figure: P2 reads a line whose directory entry still holds a stale LW-ID = P1. The address misses in P1's Write Signature, so no dependence is recorded and the stale LW-ID is cleared.]
18
Distributed Checkpointing Protocol in SW
Interaction Set [Pi]: the set of producer processors (transitive) for Pi, built using MyProducers.
[Figure: P1 reaches its checkpoint and initiates the protocol; InteractionSet = {P1}.]
19
Distributed Checkpointing Protocol in SW
[Figure: P1 sends Ck? requests to its producers P2 and P3; InteractionSet = {P1}.]
20
Distributed Checkpointing Protocol in SW
[Figure: P2 and P3 join; InteractionSet = {P1, P2, P3}.]
21
Distributed Checkpointing Protocol in SW
[Figure: P2 and P3 Accept; P3 forwards Ck? to its own producer P4; InteractionSet = {P1, P2, P3}.]
22
Distributed Checkpointing Protocol in SW
[Figure: P4 Declines, and Acks flow back toward P1; InteractionSet = {P1, P2, P3}.]
23
Distributed Checkpointing Protocol in SW
[Figure: P4 Declines, Acks flow back, and the interaction set commits its checkpoints.]
Checkpointing is a 2-phase commit protocol.
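The exchange in these slides can be rendered as a toy sequential two-phase commit: phase 1 grows the interaction set by querying producers transitively (Ck? / Accept / Decline), and phase 2 commits once the set is closed. Function names are mine; the real protocol is distributed and message-based:

```python
# Toy sketch of the 2-phase checkpoint protocol (sequential rendering).
def build_interaction_set(initiator, my_producers):
    """Phase 1: transitively collect producers via Ck? requests.
    A processor not yet in the set Accepts and is added; one already
    contacted (or with no pending dependences) effectively Declines."""
    iset, pending = {initiator}, [initiator]
    while pending:
        p = pending.pop()
        for q in my_producers.get(p, set()):   # send Ck? to each producer
            if q not in iset:                  # q Accepts and joins
                iset.add(q)
                pending.append(q)
    return iset

def checkpoint(initiator, my_producers):
    iset = build_interaction_set(initiator, my_producers)
    # Phase 2: after Acks, every member of the interaction set commits.
    return iset
```

With `my_producers = {'P1': {'P2', 'P3'}, 'P3': {'P2'}}`, initiating at P1 yields the interaction set {P1, P2, P3}, matching the slides.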
24
Distributed Rollback Protocol in SW
Rollback is handled similarly to the checkpointing protocol: the interaction set is built transitively using MyConsumers.
Rollback involves:
Clearing the Dep registers and Write Signature
Invalidating the processor caches
Restoring the data and register context from the logs up to the latest checkpoint
No domino effect.
25
Optimization1 : Delayed Writebacks
[Figure: in the baseline, processors stall at each sync point while dirty lines are written back; with delayed writebacks, the writeback of interval I1's dirty lines overlaps execution of interval I2.]
Checkpointing overhead is dominated by data writebacks.
Delayed Writeback optimization:
Processors synchronize and resume execution
Hardware automatically writes back dirty lines in the background
The checkpoint only completes when all delayed data has been written back
Still need to record inter-thread dependences on delayed data
26
Delayed Writeback Pros/Cons
+ Significant reduction in checkpoint overhead
− Additional support: each processor has two sets of Dep registers and Write Signatures; each cache line has a Delayed bit
− Increased vulnerability: a rollback event forces both intervals to roll back
27
Delayed Writeback protocol
[Figure: P2 reads a line last written by P1 in the previous (delayed) interval. The address hits in P1's WSig0 but misses in WSig1, so the dependence is recorded in the previous interval's Dep registers (MyProducers0 / MyConsumers0) rather than the current ones (MyProducers1 / MyConsumers1). P1 writes the line back and the old value is logged as before.]
28
Optimization2 : Multiple Checkpoints
Problem: fault detection is not instantaneous; a checkpoint is safe only after the maximum fault-detection latency (L).
[Figure: a fault at time tf, detected within latency L, forces rollback past Ckpt 2 to the safe Ckpt 1; each checkpoint keeps its own set of Dep registers.]
Solution: keep multiple checkpoints.
On a fault, roll back interacting processors to safe checkpoints.
No domino effect.
29
Multiple Checkpoints: Pros/Cons
+ Realistic system: supports non-instantaneous fault detection - Additional support: Each checkpoint has Dep registers Dep registers can be recycled only after fault detection latency - Need to track communication across checkpoints - Combination with Delayed Writebacks: one more Dep register set
30
Optimization3 : Hiding Chkpt behind Global Barrier
Global barriers require that all processors communicate Leads to global checkpoints Optimization: Proactively trigger a global checkpoint at a global barrier Hide checkpoint overhead behind barrier imbalance spins
31
Hiding Checkpoint behind Global Barrier
Lock
  count++
  if (count == numProc) I_am_last = TRUE  /* local var */
Unlock
if (I_am_last) {
  count = 0
  flag = TRUE
  …
} else
  while (!flag) {}
32
Hiding Checkpoint behind Global Barrier
[Figure: the barrier code annotated across processors P1, P2, P3: the first arrival sends BarCK? to initiate the checkpoint; spinning processors join, growing ICHK; the last arrival sets flag = TRUE and notifies.]
First arriving processor initiates the checkpoint.
Others: hardware writes back data as execution proceeds to the barrier.
Commit the checkpoint as the last processor arrives.
After the barrier: few interacting processors.
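The slide's flag barrier can be made runnable; the checkpoint hooks (`on_first` standing in for BarCK?, `on_last` for the commit) are illustrative placements, not Rebound's actual interface:

```python
import threading

class CheckpointBarrier:
    """Flag barrier from the slide, single-use. The first arrival can
    initiate a checkpoint and the last arrival commit it (hypothetical hooks)."""
    def __init__(self, num_procs, on_first=None, on_last=None):
        self.num_procs = num_procs
        self.lock = threading.Lock()
        self.count = 0
        self.flag = threading.Event()  # stands in for spinning on `flag`
        self.on_first = on_first       # hypothetical: send BarCK?
        self.on_last = on_last         # hypothetical: commit the checkpoint

    def wait(self):
        with self.lock:
            self.count += 1
            i_am_first = self.count == 1
            i_am_last = self.count == self.num_procs  # local var, as in the slide
        if i_am_first and self.on_first:
            self.on_first()            # initiate the checkpoint early
        if i_am_last:
            if self.on_last:
                self.on_last()         # commit as the last processor arrives
            with self.lock:
                self.count = 0
            self.flag.set()            # release the spinners (flag = TRUE)
        else:
            self.flag.wait()           # instead of while(!flag) {}
```

Using an `Event` instead of a spin loop keeps the sketch portable; the hooks overlap checkpoint work with the barrier's imbalance time, which is the point of the optimization.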
33
Evaluation Setup
Analysis tool using Pin + SESC cycle-accurate simulator + DRAMsim
Applications: SPLASH-2, some PARSEC, Apache
Simulated CMP architecture with up to 64 threads
Checkpoint interval: 5–8 ms
Modeled several environments:
Global: baseline global checkpointing
Rebound: local checkpointing scheme with delayed writebacks
Rebound_NoDWB: Rebound without the delayed writebacks
34
Avg. Interaction Set: Set of Producer Processors
[Chart: average interaction-set size per application, out of 64 processors.]
Most apps: the interaction set is small, which justifies coordinated local checkpointing.
Averages are brought up by global barriers.
35
Checkpoint Execution Overhead
[Chart: checkpoint execution overhead per application: about 15% for Global vs. 2% for Rebound.]
Rebound's avg. checkpoint execution overhead is 2%, compared to 15% for Global.
36
Checkpoint Execution Overhead
Rebound’s avg checkpoint execution overhead is 2% Compared to 15% for Global Delayed Writebacks complement local checkpointing
37
Rebound Scalability Constant problem size
Rebound is scalable in checkpoint overhead Delayed Writebacks help scalability
38
Also in the Paper
Delayed writebacks are also useful in Global
Barrier optimization is effective but not universally applicable Power increase due to hardware additions < 2% Rebound leads to only 4% increase in coherence traffic
39
Conclusions Rebound: First HW-based scheme for scalable, coordinated local checkpointing in coherent shared-memory Leverages directory protocol Boosts checkpointing efficiency: Delayed write-backs Multiple checkpoints Barrier optimization Avg. execution overhead for 64 procs: 2% Future work: Apply Rebound to non-hardware coherent machines Scalability to hierarchical directories