Checksum Strategies for Data in Volatile Memory. Authors: Humayun Arafat (Ohio State), Sriram Krishnamoorthy (PNNL), P. Sadayappan (Ohio State)


1 Checksum Strategies for Data in Volatile Memory. Authors: Humayun Arafat (Ohio State), Sriram Krishnamoorthy (PNNL), P. Sadayappan (Ohio State)

2 Motivation
– In exascale systems, failure rates will increase further as the number of processors grows
– The typical current approach to fault tolerance is checkpointing to stable storage
– Soft errors can affect individual data blocks
– Multiple data blocks might be corrupted before the errors can be efficiently detected
– We focus on developing an approach that can tolerate multiple hard errors and soft errors

3 Fault Tolerant Data in Volatile Memory
– Efficient checksum-based approach to fault tolerance for data in volatile memory systems
– The developed scheme is applicable in multiple scenarios:
  – Online recovery of large read-only data structures with low storage overhead
  – Online recovery from soft errors in blocked data
  – Online recovery of read/write data via in-memory checkpointing
– The approach uses a logical multi-dimensional view of the data to be protected

4 Design
– Recover exact data
– Inspiration from Algorithm-Based Fault Tolerance (ABFT)
– Low overhead

5 Checksum Design
– Checksum operator: XOR
– Multi-dimensional checksums: increase tolerance
– Checksums co-located with data: reduce space overhead
– Distributed checksums: reduce overhead and increase tolerance
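
The XOR operator named on this slide can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names `xor_blocks` and `recover_block` are hypothetical, and the sketch assumes equal-length blocks, a single lost block, and an intact checksum. Because XOR is its own inverse, XOR-ing the checksum with the surviving blocks reproduces the lost block bit for bit, which is what "recover exact data" requires.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_block(surviving_blocks, checksum):
    """Rebuild the single missing block from the survivors and the checksum."""
    return xor_blocks(list(surviving_blocks) + [checksum])


# Three data blocks protected by one XOR checksum block (illustrative data).
data = [b"abcd", b"efgh", b"ijkl"]
checksum = xor_blocks(data)

# Pretend block 1 was lost, then rebuild it from the other blocks.
lost = 1
survivors = [blk for i, blk in enumerate(data) if i != lost]
recovered = recover_block(survivors, checksum)
assert recovered == data[lost]
```

The same identity works in the other direction: if the checksum block is the one that is lost, it can simply be recomputed from the data blocks.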

6 One Dimensional Checksum


8 One Dimensional Checksum: Recover checksum, Recover data

9 Two Dimensional Checksum

10 Checksum and Data Distribution

11 Two Dimensional Checksum: Recovery, Checksum calculation
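
As a rough illustration of why a second checksum dimension increases tolerance, the following Python sketch keeps one XOR checksum per row and per column of a logical 2D grid. The function names `checksums_2d` and `recover_2d` are hypothetical, blocks are modeled as plain integers, and the checksums themselves are assumed to survive; the slides' actual distribution of data and checksums across processes is not reproduced here. Recovery repeatedly repairs any row or column that has exactly one missing block, so two losses in the same row can still be repaired through their columns.

```python
from functools import reduce
from operator import xor


def checksums_2d(grid):
    """Compute one XOR checksum per row and one per column."""
    row_cs = [reduce(xor, row) for row in grid]
    col_cs = [reduce(xor, col) for col in zip(*grid)]
    return row_cs, col_cs


def recover_2d(grid, row_cs, col_cs):
    """Fill in lost blocks (marked None) in place; return True if all were rebuilt."""
    rows, cols = len(grid), len(grid[0])
    progress = True
    while progress:
        progress = False
        for r in range(rows):
            missing = [c for c in range(cols) if grid[r][c] is None]
            if len(missing) == 1:
                present = [v for v in grid[r] if v is not None]
                grid[r][missing[0]] = reduce(xor, present, row_cs[r])
                progress = True
        for c in range(cols):
            missing = [r for r in range(rows) if grid[r][c] is None]
            if len(missing) == 1:
                present = [grid[i][c] for i in range(rows) if grid[i][c] is not None]
                grid[missing[0]][c] = reduce(xor, present, col_cs[c])
                progress = True
    return all(v is not None for row in grid for v in row)


original = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
row_cs, col_cs = checksums_2d(original)

# Two losses share row 0, so the row checksum alone cannot repair them.
damaged = [[None, None, 3], [None, 5, 6], [7, 8, 9]]
assert recover_2d(damaged, row_cs, col_cs)
assert damaged == original
```

In this sketch, a loss pattern in which every affected row and column contains at least two missing blocks (for example the four corners of a rectangle) is not recoverable, and `recover_2d` returns `False`.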

12 Three Dimensional Checksum

13 Three Dimensional Checksum Distribution

14 Checksum Overhead
– One dimension
– Two dimensions
– Three dimensions
– d dimensions

15 Experiments
– Cray XE6 system (NERSC Hopper)
– 6384 nodes with Gemini interconnect
– Peak bandwidth 8.3 GB/s per direction
– Two twelve-core 2.1 GHz AMD "Magny-Cours" processors per node (24 cores per node) and 32 GB DDR3 memory
– Intel C++ compiler 13 and Cray MPI 6.0.1

16 Checksum Calculation Time: 1D, 2D and 3D [plots for the 1D, 2D and 3D cases]

17 Fault Recovery

18 Soft Error
– A soft error can change the data in memory
– The unit of failure is a block of data inside the process, not the entire process
– Lower overhead than handling an entire process failure
– Fewer tolerable failures

19 Soft Error

20 Soft Error Equations: 1D block, 2D block

21 2D Soft Error Checksum

22 2D Soft Error Recovery
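
Unlike a process failure, a soft error does not announce which block it hit, so the corrupted block has to be located before it can be repaired. The Python sketch below shows one way row and column XOR checksums over blocks inside a single process could do both. It is an assumption-laden illustration rather than the scheme from the slides: `block_checksums` and `locate_and_fix` are hypothetical names, blocks are plain integers, at most one block is corrupted, and the stored checksums are trusted.

```python
from functools import reduce
from operator import xor


def block_checksums(grid):
    """Compute XOR checksums over each row and each column of blocks."""
    rows = [reduce(xor, row) for row in grid]
    cols = [reduce(xor, col) for col in zip(*grid)]
    return rows, cols


def locate_and_fix(grid, stored_rows, stored_cols):
    """Find and repair a single silently corrupted block in place.

    Returns the (row, column) position that was repaired, or None if the
    recomputed checksums all match the stored ones.
    """
    new_rows, new_cols = block_checksums(grid)
    bad_rows = [r for r, v in enumerate(new_rows) if v != stored_rows[r]]
    bad_cols = [c for c, v in enumerate(new_cols) if v != stored_cols[c]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        r, c = bad_rows[0], bad_cols[0]
        # stored ^ recomputed equals original ^ corrupted for that block,
        # so XOR-ing it into the corrupted value restores the original.
        grid[r][c] ^= stored_rows[r] ^ new_rows[r]
        return (r, c)
    raise ValueError("more than one block appears corrupted")


grid = [[10, 20, 30], [40, 50, 60]]
stored_rows, stored_cols = block_checksums(grid)

grid[1][2] ^= 0b1000  # simulate a single bit flip in one block
fixed_at = locate_and_fix(grid, stored_rows, stored_cols)
assert fixed_at == (1, 2)
assert grid[1][2] == 60
```

The mismatching row and column intersect at the corrupted block, which is why a 2D layout can localize an error that a single 1D checksum could only detect.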

23 Summary
– In-memory checkpointing, low-overhead protection for read-only data, recovery from soft errors
– XOR-based checksum to recover exact data
– Multidimensional checksum calculation to increase fault tolerance
– Co-location of the checksums with the data
– Scalable design to ensure low space overhead

24 Thank you. Questions?

