Published by Karen Bruce. Modified over 9 years ago.
1
Understanding and Addressing the Resiliency Issues for Future Exascale Computing with the Mont-Blanc Prototype
Ferad Zyulkyarov, Barcelona Supercomputing Center
CSW & BR, Oslo, May 7, 2015
http://www.montblanc-project.eu
This project has received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement n. 610402.
2
Acknowledgement
Javier Arias, Jesus Labarta, Filippo Mantovani, Dani Ruiz, Omer Subasi, Osman Unsal, Oriol Vilarrubi, Gulay Yalcin
3
About This Presentation
- Focus on memory resiliency
- First ever attempt to characterize the memory reliability of a large system which has no memory ECC
- Relate our numbers to the state of the art
- Our software-based proposals to complement HW ECC
4
Comparing with the Related Work
State of the art (Cielo, Hopper) versus Mont-Blanc
[1] Sridharan et al., "Memory Errors in Modern Systems: The Good, The Bad, and The Ugly", ASPLOS 2015
[2] Sridharan and Liberty, "A Study of DRAM Failures in the Field", SC 2012
5
Technical Details
Mont-Blanc is much smaller than Cielo and Hopper.
6
This Study
                        Related Work    Mont-Blanc
Node hours (million)    157             0.65
GB hours (million)      11,250          1.90
This is a preliminary study, and the results are not statistically strong enough to draw conclusions.
The Mont-Blanc study covers 917 nodes, and only 3 GB of memory per node were scanned.
7
Memory FIT per 1 GB
The LPDDR3 memory in Mont-Blanc has a very high FIT rate. We are not sure why, but we attribute it to the memory being a low-end product.
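As a rough illustration of how a per-GB FIT figure is derived: FIT counts failures per billion (10^9) hours of operation, so dividing an observed fault count by the accumulated GB-hours and scaling by 10^9 gives the FIT per 1 GB. The sketch below uses the 1.90 million GB-hours reported for Mont-Blanc on the "This Study" slide; the fault count and the function name are hypothetical placeholders, not figures from the study.

```python
def fit_per_gb(num_faults: int, gb_hours: float) -> float:
    """Failures in time (faults per 10^9 hours), normalized to 1 GB of memory."""
    if gb_hours <= 0:
        raise ValueError("gb_hours must be positive")
    return num_faults / gb_hours * 1e9


MONT_BLANC_GB_HOURS = 1.90e6  # from the "This Study" slide
EXAMPLE_FAULTS = 10           # hypothetical count, for illustration only

print(f"FIT per GB: {fit_per_gb(EXAMPLE_FAULTS, MONT_BLANC_GB_HOURS):.0f}")
```

With so few GB-hours, each additional observed fault moves the estimate by several hundred FIT, which is one way to see why the slides call the results preliminary.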
8
MTBF (all faults)
[Figure panels: What if MB had the same amount of memory as Cielo? What if MB had the same amount of memory as Hopper?]
We definitely need ECC in hardware; otherwise a Mont-Blanc system at the scale of Cielo may fail every 20 minutes.
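The "what if" scaling on this slide can be sketched as a simple calculation. Assuming faults are independent and the fault rate grows linearly with memory capacity, the system MTBF in hours is 10^9 divided by (FIT per GB times total GB). The function name and the numbers below are hypothetical round values chosen for illustration; they are not the measured figures for Mont-Blanc, Cielo, or Hopper.

```python
def mtbf_hours(fit_per_gb: float, memory_gb: float) -> float:
    """System MTBF in hours, assuming the fault rate scales linearly with memory size."""
    system_fit = fit_per_gb * memory_gb  # faults per 10^9 hours for the whole system
    if system_fit <= 0:
        raise ValueError("fit_per_gb and memory_gb must be positive")
    return 1e9 / system_fit


# Hypothetical inputs: 1000 FIT per GB and 100,000 GB of memory.
print(mtbf_hours(1000, 100_000))  # hours between faults
```

Doubling either the per-GB FIT or the installed memory halves the MTBF under this model, which is why a high per-GB rate becomes critical at large scale.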
9
ECC in Hardware
SECDED will not be strong enough to keep a large system operable.
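To make the SECDED limitation concrete, here is a toy single-error-correct, double-error-detect code: an extended Hamming (8,4) code over 4 data bits. It is only a minimal sketch of the principle; real memory ECC uses wider codewords (for example 64 data bits plus 8 check bits), and the function names are made up for this example.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    code = [0] * 8  # index 0: overall parity; indices 1..7: Hamming(7,4) positions
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions with bit 2 set
    code[0] = sum(code[1:]) % 2            # overall parity enables double-error detection
    return code


def decode(received: list) -> tuple:
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(received)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                # XOR of the positions of all set bits
    parity_error = sum(code) % 2 == 1      # True for an odd number of bit flips
    if parity_error:
        code[syndrome] ^= 1                # single error; syndrome 0 means bit 0 itself
        status = "corrected"
    elif syndrome != 0:
        return "uncorrectable", None       # even number of flips: detected, not fixable
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
word[6] ^= 1                               # one flipped bit is corrected
print(decode(word))
word[2] ^= 1                               # a second flipped bit is only detected
print(decode(word))
```

Three or more flipped bits in one codeword can be miscorrected silently, which is relevant given that the error log in this deck contains events affecting 4 and 9 cells.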
10
Projections for Exascale
[Figure: MTBF, Mont-Blanc DRAM, chipkill uncorrectable errors]
The MTBF for uncorrectable errors with chipkill in a system with commodity memory like that in Mont-Blanc will be between 0.3 and 5.1 days.
11
Memory Reliability
Date        Time      Host      Address     Expected/Actual  Type          Num Cells
26/02/2015  10:21:25  mb-30-13  0x2b1f1cb4  08000            Permanent     1
06/03/2015  08:39:51  mb-38-15  0xb93e1608  1F281FA          Transient     1
12/03/2014  22:39:05  mb-22-3   0x38d6ea10  10000532532      Transient     1
14/03/2015  11:43:12  mb-21-10  0x775a4dbc  37237723         Intermittent  1
14/03/2015  11:43:22  mb-21-10  0x775a4dbc  37277727         Intermittent  1
14/03/2015  11:43:24  mb-21-10  0x775a4dbc  37287728         Intermittent  1
14/03/2015  11:43:27  mb-21-10  0x775a4dbc  37297729         Intermittent  1
14/03/2015  11:43:29  mb-21-10  0x775a4dbc  372A772A         Intermittent  1
14/03/2015  11:43:31  mb-21-10  0x775a4dbc  372B772B         Intermittent  1
14/03/2015  11:43:34  mb-21-10  0x775a4dbc  372C772C         Intermittent  1
14/03/2015  11:43:36  mb-21-10  0x775a4dbc  372D772D         Intermittent  1
14/03/2015  11:43:38  mb-21-10  0x775a4dbc  372E772E         Intermittent  1
14/03/2015  11:43:41  mb-21-10  0x775a4dbc  372F772F         Intermittent  1
14/03/2015  11:43:43  mb-21-10  0x775a4dbc  37307730         Intermittent  1
14/03/2015  11:43:46  mb-21-10  0x775a4dbc  37317731         Intermittent  1
14/03/2015  11:43:48  mb-21-10  0x775a4dbc  37327732         Intermittent  1
14/03/2015  11:43:51  mb-21-10  0x775a4dbc  37337733         Intermittent  1
14/03/2015  11:43:53  mb-21-10  0x775a4dbc  37347734         Intermittent  1
14/03/2015  11:43:55  mb-21-10  0x775a4dbc  37357735         Intermittent  1
14/03/2015  11:43:58  mb-21-10  0x775a4dbc  37367736         Intermittent  1
14/03/2015  11:44:41  mb-21-10  0x775a4dbc  37487748         Intermittent  1
14/03/2015  11:44:43  mb-21-10  0x775a4dbc  37497749         Intermittent  1
14/03/2015  11:44:46  mb-21-10  0x775a4dbc  374A774A         Intermittent  1
18/03/2015  06:25:01  mb-24-4   0xb6c069e0  453DA53DA        Transient     1
18/03/2015  16:00:58  mb-62-6   0xa83cb640  02               Transient     1
26/03/2015  12:17:52  mb-26-12  0x405dfc00  6E61461          Transient     4
26/03/2015  13:32:17  mb-24-11  0xb4f33000  E600635858       Transient     9
29/03/2015  09:40:54  mb-35-4   0x3f6eb824  143884338        Transient     1
Permanent error: the most significant bit cannot be set to 1.
12
Memory Reliability (same error log as on slide 11)
Transient errors
13
Memory Reliability (same error log as on slide 11)
Intermittent errors at the same address and bit (the 14/03/2015 entries on host mb-21-10). Intervals between consecutive errors: 10 sec, 3 sec, 2 sec.
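The "same address and bit" observation can be checked directly from the log. The sketch below takes the first six intermittent entries for host mb-21-10 (address 0x775a4dbc on 14/03/2015), XORs the expected and actual values to find the flipped bit, and computes the gaps between consecutive timestamps. It assumes the Expected and Actual values are four hex digits each (for example 3723 and 7723); the helper names are made up for this example.

```python
from datetime import datetime

# (time, expected, actual) for the first six intermittent entries in the log
LOG = [
    ("11:43:12", 0x3723, 0x7723),
    ("11:43:22", 0x3727, 0x7727),
    ("11:43:24", 0x3728, 0x7728),
    ("11:43:27", 0x3729, 0x7729),
    ("11:43:29", 0x372A, 0x772A),
    ("11:43:31", 0x372B, 0x772B),
]


def flipped_bits(expected: int, actual: int) -> tuple:
    """Bit positions (0 = least significant) where the two values differ."""
    diff = expected ^ actual
    return tuple(i for i in range(diff.bit_length()) if (diff >> i) & 1)


def gaps_seconds(times: list) -> list:
    """Seconds between consecutive HH:MM:SS timestamps on the same day."""
    parsed = [datetime.strptime(t, "%H:%M:%S") for t in times]
    return [int((b - a).total_seconds()) for a, b in zip(parsed, parsed[1:])]


bit_patterns = {flipped_bits(e, a) for _, e, a in LOG}
gaps = gaps_seconds([t for t, _, _ in LOG])
print("distinct flipped-bit patterns:", bit_patterns)
print("gaps between errors (s):", gaps)
```

A single pattern in the set means every entry flipped the same bit, and the gaps reproduce the 10, 3, and 2 second intervals called out on the slide.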
14
Memory Reliability (same error log as on slide 11)
Multi-bit errors (the two 26/03/2015 entries, affecting 4 and 9 cells)
15
Memory Reliability (same error log as on slide 11)
These coincide with major solar flares
16
Reliability Techniques in Software
- Task checkpoint and restart
- Task replication
- Other ongoing activities
17
Advantages of Tasks for Fault Tolerance
- Task boundaries explicitly delimit the scope of the checkpoints
- The explicit task input/output declarations reduce the amount of state to checkpoint
- Unlike pthread-style parallel programs, checkpointing does not require complex coordination and synchronization between threads
- The recovery is asynchronous
18
Checkpoint and Recovery for Tasks
- Recover task execution from detected faults.
- Isolate the fault propagation within the task boundaries.
- Inputs are known at runtime through explicit declaration.
- Overheads of checkpointing are minimal.
- Recovery is asynchronous.
[Diagram labels: Task start, T1, Task end, Input, Checkpoint, Fault detected, Recover, Recover execution]
Limitations: does not cover the execution outside tasks.
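The checkpoint-and-recover flow for a single task can be sketched as follows. This is a minimal, sequential illustration under stated assumptions, not the runtime described in the talk: the declared inputs are deep-copied at task start, a detected fault is modelled as a `FaultDetected` exception, and the task is re-executed from the checkpointed inputs. All names here (`FaultDetected`, `run_task_with_checkpoint`, `flaky_scale`) are hypothetical.

```python
import copy


class FaultDetected(Exception):
    """Stands in for whatever mechanism reports a detected fault."""


def run_task_with_checkpoint(task, inputs, max_retries=3):
    """Checkpoint the declared inputs, run the task, and re-execute on a fault."""
    checkpoint = copy.deepcopy(inputs)         # only the declared inputs are saved
    for _ in range(max_retries + 1):
        working = copy.deepcopy(checkpoint)    # restore a clean copy of the inputs
        try:
            return task(working)
        except FaultDetected:
            continue                           # the fault stays inside the task boundary
    raise RuntimeError("task kept failing after all retries")


calls = {"n": 0}


def flaky_scale(data):
    """Doubles data['x'] in place; the first call simulates a detected fault."""
    calls["n"] += 1
    data["x"] = [v * 2 for v in data["x"]]
    if calls["n"] == 1:
        data["x"][0] = -1                      # simulated corruption of task state
        raise FaultDetected("simulated fault")
    return data["x"]


original = {"x": [1, 2, 3]}
result = run_task_with_checkpoint(flaky_scale, original)
print(result, calls["n"], original)
```

Because the task only ever touches a copy restored from the checkpoint, the corrupted state from the failed attempt never reaches the caller or the retry.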
19
Results: Checkpoint and Recovery for Tasks
[Figure panels: Multi-Node Scalability; Single-Node Scalability]
20
Task Replication
Detect and recover from silent data corruption.
[Diagram labels: T1, T1’, T1”, Input, Checkpoint, Output, Fault, No fault]
Re-execute the task one more time and use the two outputs that match as the correct result.
1. The task and its replica execute asynchronously.
2. No synchronization between the task and its replica (only at the end of the task execution).
3. Faults do not limit parallelism; re-execution is also asynchronous.
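The replicate, compare, and re-execute scheme can be sketched as below. This is a simplified illustration: it runs the task and its replica one after the other, whereas the slide describes asynchronous execution, and it assumes outputs can be compared with `==`. The names `run_replicated` and `sum_task` are hypothetical.

```python
import copy


def run_replicated(task, inputs):
    """Run a task twice; on mismatch run it a third time and keep the matching pair."""
    checkpoint = copy.deepcopy(inputs)         # each execution gets a clean input copy
    first = task(copy.deepcopy(checkpoint))
    second = task(copy.deepcopy(checkpoint))
    if first == second:
        return first                           # no fault: outputs agree
    third = task(copy.deepcopy(checkpoint))    # outputs disagree: re-execute once more
    if third == first:
        return first
    if third == second:
        return second
    raise RuntimeError("no two outputs match")


calls = {"n": 0}


def sum_task(values):
    """Sums its input; the second call simulates silent data corruption."""
    calls["n"] += 1
    total = sum(values)
    return total + 1 if calls["n"] == 2 else total


result = run_replicated(sum_task, [1, 2, 3])
print(result, calls["n"])
```

In the fault-free case the cost is one extra execution; the third execution is paid only when the first two outputs disagree.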
21
Results: Task Replication
[Figure panels: Multi-Node Scalability; Single-Node Scalability]
22
Other activities
- Software-based ECC: to compensate for missing or weak ECC in hardware
- Selective replication: to reduce resource cost by replicating only the reliability-critical code
- Checkpoint and restart for tasks with MPI calls: to provide multi-node checkpoint and restart within the task-based programming model
- Hierarchical checkpoint and restart with task checkpointing: to decrease the checkpoint overheads and recovery time
23
Summary
- Preliminary memory reliability characterization
- Low-end commodity DRAM devices might be more susceptible to transient faults
- Even strong memory ECC alone may not be sufficient to mitigate transient faults in exascale computing
- SW-based fault tolerance coupled to a specific programming model might be a lightweight solution to complement HW-based ECC
24
Thank you! Questions?