Published by Clyde Tate. Modified over 9 years ago.
1
Checksum Strategies for Data in Volatile Memory
Authors: Humayun Arafat (Ohio State), Sriram Krishnamoorthy (PNNL), P. Sadayappan (Ohio State)
2
Motivation
– In exascale systems, failures will become more frequent as the number of processors grows
– The typical current approach to fault tolerance is checkpointing to stable storage
– Soft errors can affect individual data blocks
– Multiple data blocks might be corrupted before the errors can be detected efficiently
– We focus on developing an approach that can tolerate multiple hard errors and soft errors
3
Fault Tolerant Data in Volatile Memory
– Efficient checksum-based approach to fault tolerance for data in volatile memory systems
– The developed scheme is applicable in multiple scenarios:
  – Online recovery of large read-only data structures with low storage overhead
  – Online recovery from soft errors in blocked data
  – Online recovery of read/write data via in-memory checkpointing
– The approach uses a logical multi-dimensional view of the data to be protected
4
Design
– Recover exact data
– Inspiration from Algorithm-Based Fault Tolerance (ABFT)
– Low overhead
5
Checksum Design
– Checksum operator: XOR
– Multi-dimensional checksums: increase tolerance
– Checksum co-located with data: reduce space overhead
– Distributed checksum: reduce overhead and increase tolerance
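The link between an XOR operator and the goal of recovering exact data can be illustrated with a minimal sketch. It is not the authors' implementation: the values, the one-value-per-block layout, and the helper `xor_bytes` are assumptions made for illustration. XOR acts on the raw bits, so a lost block comes back bit-for-bit, whereas a sum-based checksum over floating-point values is subject to rounding.

```python
import struct

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings (hypothetical helper)."""
    return bytes(x ^ y for x, y in zip(a, b))

values = [0.1, 0.2, 0.3]                          # illustrative data, one value per block
blocks = [struct.pack("<d", v) for v in values]   # raw 8-byte IEEE-754 encodings

# XOR checksum over the raw bits of every block.
checksum = bytes(8)
for block in blocks:
    checksum = xor_bytes(checksum, block)

# Pretend block 1 is lost: XOR the checksum with the surviving blocks.
recovered = xor_bytes(xor_bytes(checksum, blocks[0]), blocks[2])
xor_value = struct.unpack("<d", recovered)[0]

# Sum-based alternative, shown only for comparison.
sum_value = sum(values) - values[0] - values[2]

print(xor_value == values[1])   # True: XOR recovery is bit-exact
print(sum_value == values[1])   # may be False because of floating-point rounding
```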
6
One Dimensional Checksum
7
[Figure: diagram with blocks labeled C and c]
8
One Dimensional Checksum
– Recover checksum
– Recover data
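The two recovery cases named here can be sketched for a single group of processes. This is a single-process simulation under stated assumptions: the group of four 4-byte blocks and the function names `xor_blocks`, `recover_data`, and `recover_checksum` are hypothetical, and the transcript does not show how the original system distributes or communicates the blocks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# Hypothetical group of four processes, each holding one 4-byte data block.
data = [
    b"\x01\x02\x03\x04",
    b"\x10\x20\x30\x40",
    b"\xaa\xbb\xcc\xdd",
    b"\x0f\x0e\x0d\x0c",
]
checksum = xor_blocks(data)

def recover_data(data, checksum, lost):
    """Rebuild the block at index `lost` from the checksum and the survivors."""
    survivors = [blk for i, blk in enumerate(data) if i != lost]
    return xor_blocks([checksum] + survivors)

def recover_checksum(data):
    """Rebuild a lost checksum by recomputing it from the intact data blocks."""
    return xor_blocks(data)

rebuilt_block = recover_data(data, checksum, lost=2)
rebuilt_checksum = recover_checksum(data)
```

With a single checksum per group, this sketch tolerates the loss of one block (data or checksum) per group; losing two blocks of the same group leaves too little information to rebuild either.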
9
Two Dimensional Checksum
10
Checksum and Data Distribution
11
Two Dimensional Checksum
– Recovery
– Checksum calculation
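A rough sketch of why a second checksum dimension raises the number of tolerable failures: the processes are viewed as a logical grid with one XOR checksum per row and one per column. The 3x3 grid, the integer stand-ins for data blocks, and the `recover` routine are assumptions for illustration, and the sketch assumes the checksums themselves survive.

```python
from functools import reduce
from operator import xor

ROWS, COLS = 3, 3
# Hypothetical 3x3 process grid; each block is modelled as one integer.
grid = [[11, 22, 33], [44, 55, 66], [77, 88, 99]]

# Checksum calculation: one XOR per row and one per column.
row_sums = [reduce(xor, row) for row in grid]
col_sums = [reduce(xor, (grid[r][c] for r in range(ROWS))) for c in range(COLS)]

def recover(grid, row_sums, col_sums):
    """Repeatedly repair any row or column that has exactly one missing block."""
    progress = True
    while progress:
        progress = False
        for r in range(ROWS):
            missing = [c for c in range(COLS) if grid[r][c] is None]
            if len(missing) == 1:
                known = [v for v in grid[r] if v is not None]
                grid[r][missing[0]] = reduce(xor, known, row_sums[r])
                progress = True
        for c in range(COLS):
            missing = [r for r in range(ROWS) if grid[r][c] is None]
            if len(missing) == 1:
                known = [grid[r][c] for r in range(ROWS) if grid[r][c] is not None]
                grid[missing[0]][c] = reduce(xor, known, col_sums[c])
                progress = True
    return all(v is not None for row in grid for v in row)

original = [row[:] for row in grid]
for r, c in [(0, 0), (0, 1), (1, 1)]:   # three simultaneous failures
    grid[r][c] = None
ok = recover(grid, row_sums, col_sums)
```

Row 0 loses two blocks, which a row checksum alone could not repair; the column checksums rebuild them. In this sketch, four failures at the corners of a rectangle remain unrecoverable, because every affected row and column then has two missing blocks.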
12
Three Dimensional Checksum
13
Three Dimensional Checksum Distribution
14
Checksum Overhead
– One dimension
– Two dimensions
– Three dimensions
– d dimensions
15
Experiments
– Cray XE6 system (NERSC Hopper)
– 6384 nodes with Gemini interconnect
– Peak bandwidth of 8.3 GB/s per direction
– Twelve-core 2.1 GHz AMD 'Magny-Cours' processors, with 24 cores and 32 GB of DDR3 memory per node
– Intel C++ compiler 13 and Cray MPI 6.0.1
16
Checksum Calculation Time: 1D, 2D and 3D
[Figure: three panels labeled 1D, 3D and 2D]
17
Fault Recovery
18
Soft Error
– A soft error can change data in memory
– The unit of failure is a block of data inside the process, not the entire process
– Low overhead compared to handling the failure of an entire process
– Fewer tolerable failures
19
Soft Error
20
Soft Error Equations
– 1D block
– 2D block
21
2D Soft Error Checksum
22
2D Soft Error Recovery
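The slide's own equations did not survive in this transcript, so what follows is only a sketch of one common way a 2D checksum can both locate and repair a soft error inside a process. The 3x4 layout, the integer stand-ins for data blocks, and the `checksums` helper are assumptions. Row and column XOR checksums are saved while the data is known to be good; after a corruption, the single mismatching row and column intersect at the damaged block.

```python
from functools import reduce
from operator import xor

# Hypothetical 3x4 arrangement of data blocks in one process; each entry is an integer stand-in.
block = [[5, 9, 14, 3], [7, 1, 12, 8], [6, 13, 2, 10]]
n_rows, n_cols = len(block), len(block[0])

def checksums(blk):
    """Return (row XORs, column XORs) for a 2D arrangement of blocks."""
    rows = [reduce(xor, row) for row in blk]
    cols = [reduce(xor, (row[c] for row in blk)) for c in range(len(blk[0]))]
    return rows, cols

saved_rows, saved_cols = checksums(block)   # taken while the data is known good

block[1][2] ^= 0b100                        # simulate a single-bit soft error

# Detection: recompute and compare against the saved checksums.
now_rows, now_cols = checksums(block)
bad_rows = [r for r in range(n_rows) if now_rows[r] != saved_rows[r]]
bad_cols = [c for c in range(n_cols) if now_cols[c] != saved_cols[c]]

# Recovery: one bad row and one bad column pinpoint the corrupted entry.
if len(bad_rows) == 1 and len(bad_cols) == 1:
    r, c = bad_rows[0], bad_cols[0]
    # The row syndrome equals the flipped bits, so XOR-ing it back restores the entry.
    block[r][c] ^= now_rows[r] ^ saved_rows[r]
```

If more than one row or column mismatches, this sketch does nothing; handling several corrupted blocks at once would need additional logic that is not shown here.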
23
Summary
– In-memory checkpointing, low-overhead protection for read-only data, and recovery from soft errors
– XOR-based checksum to recover exact data
– Multidimensional checksum calculation to increase fault tolerance
– Co-location of the checksums with the data
– Scalable design to ensure low space overhead
24
THANK YOU
Questions?