Checksum Strategies for Data in Volatile Memory. Authors: Humayun Arafat (Ohio State), Sriram Krishnamoorthy (PNNL), P. Sadayappan (Ohio State)

Motivation: In exascale systems, failures will become more frequent as the number of processors grows. The typical current approach to fault tolerance is to checkpoint to stable storage. Soft errors can affect individual data blocks, and multiple data blocks might be corrupted before the corruption can be detected efficiently. We focus on developing an approach that can tolerate multiple hard errors and soft errors.

Fault Tolerant Data in Volatile Memory: An efficient checksum-based approach to fault tolerance for data in volatile memory systems. The scheme applies in multiple scenarios: online recovery of large read-only data structures with low storage overhead; online recovery from soft errors in blocked data; and online recovery of read/write data via in-memory checkpointing. The approach uses a logical multi-dimensional view of the data to be protected.

Design: Recover exact data. Inspired by Algorithm-Based Fault Tolerance (ABFT). Low overhead.

Checksum Design: The checksum operator is XOR. Multi-dimensional checksums increase tolerance. Checksums are co-located with the data to reduce space overhead. Checksums are distributed to reduce overhead and increase tolerance.

One-Dimensional Checksum (figures)

One-Dimensional Checksum: recover the checksum; recover the data.
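The XOR-based recovery named on this slide can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`xor_blocks`, `checksum`, `recover_block`), the byte-string blocks, and the assumption that the checksum itself survives are all hypothetical choices made for illustration. It shows only the underlying property that XOR-ing the checksum with the surviving blocks yields the missing block exactly.

```python
from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))

def checksum(blocks):
    """One-dimensional checksum: the XOR of all data blocks."""
    return reduce(xor_blocks, blocks)

def recover_block(blocks, lost, chk):
    """Rebuild blocks[lost] from the surviving blocks and the checksum."""
    survivors = [b for i, b in enumerate(blocks) if i != lost]
    return reduce(xor_blocks, survivors, chk)

# Hypothetical data: four equal-sized blocks, one per process.
data = [b"alpha---", b"beta----", b"gamma---", b"delta---"]
chk = checksum(data)

# Simulate losing block 2 (for example, a failed process).
damaged = list(data)
damaged[2] = None
restored = recover_block(damaged, 2, chk)
```

Recovering a lost checksum is the simpler case: it is recomputed by calling `checksum` on the intact data blocks. A single XOR checksum over a group can rebuild only one missing block of that group, which is the limitation the multi-dimensional schemes address.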

Two-Dimensional Checksum (figure)

Checksum and Data Distribution (figure)

Two-Dimensional Checksum: recovery; checksum calculation.
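As a rough illustration of why a second dimension increases tolerance, the sketch below arranges blocks in a logical 2D grid and keeps one XOR checksum per row and one per column. The grid layout, the function names, and the assumption that all checksums remain intact are hypothetical and are not taken from the slides. Two blocks lost in the same row cannot be rebuilt from the row checksum alone, but each can be rebuilt from its column checksum.

```python
from functools import reduce

def xor_blocks(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def xor_all(blocks):
    return reduce(xor_blocks, blocks)

def checksums_2d(grid):
    """Return (row checksums, column checksums) for a grid of blocks."""
    rows = [xor_all(row) for row in grid]
    cols = [xor_all(col) for col in zip(*grid)]
    return rows, cols

def recover_2d(grid, rows, cols):
    """Repair lost blocks (marked None) in a copy of the grid.

    Any row or column with exactly one missing block is repaired from
    its checksum; passes repeat until no further repair is possible.
    Blocks that cannot be recovered stay None.
    """
    grid = [list(r) for r in grid]
    progress = True
    while progress:
        progress = False
        for i, row in enumerate(grid):
            missing = [j for j, b in enumerate(row) if b is None]
            if len(missing) == 1:
                present = [b for b in row if b is not None]
                grid[i][missing[0]] = reduce(xor_blocks, present, rows[i])
                progress = True
        for j in range(len(grid[0])):
            col = [grid[i][j] for i in range(len(grid))]
            missing = [i for i, b in enumerate(col) if b is None]
            if len(missing) == 1:
                present = [b for b in col if b is not None]
                grid[missing[0]][j] = reduce(xor_blocks, present, cols[j])
                progress = True
    return grid

# Hypothetical 3x3 grid of 4-byte blocks.
original = [[bytes([3 * i + j + 1] * 4) for j in range(3)] for i in range(3)]
rows, cols = checksums_2d(original)

# Lose two blocks in the same row.
damaged = [list(r) for r in original]
damaged[0][0] = None
damaged[0][1] = None
recovered = recover_2d(damaged, rows, cols)
```

The iterative passes matter: repairing one block along a column can leave a row with a single missing block, which then becomes repairable in turn.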

Three-Dimensional Checksum (figure)

Three-Dimensional Checksum Distribution (figure)

Checksum Overhead: one dimension; two dimensions; three dimensions; d dimensions.

Experiments: Cray XE6 system (NERSC Hopper). 6,384 nodes with Gemini interconnect; peak bandwidth 8.3 GB/s per direction. Two twelve-core 2.1 GHz AMD ‘MagnyCours’ processors per node, for 24 cores per node, with 32 GB of DDR3 memory. Intel C++ compiler 13 and Cray MPI 6.0.1.

Checksum Calculation Time: 1D, 2D, and 3D (figure)

Fault Recovery (figure)

Soft Error: A soft error can change data in memory. The unit of failure is a block of data inside a process, not the entire process. Overhead is lower than for an entire process failure. Fewer failures can be tolerated.

Soft Error (figure)

Soft Error Equations: 1D block; 2D block.

2D Soft Error Checksum (figure)

2D Soft Error Recovery (figure)
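One way a 2D checksum can both locate and repair a silently corrupted block is sketched below. This is an illustrative reading, not the scheme from the slides: the grid layout, the names `locate_and_repair`, `row_chk`, and `col_chk`, and the assumption that exactly one block is corrupted while the checksums stay intact are all hypothetical. The corrupted block sits at the intersection of the single mismatching row and the single mismatching column.

```python
from functools import reduce

def xor_blocks(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def xor_all(blocks):
    return reduce(xor_blocks, blocks)

def locate_and_repair(grid, row_chk, col_chk):
    """Find and fix a single corrupted block in place.

    Returns the (row, column) of the repaired block, or None if every
    checksum matches. Raises ValueError if the mismatch pattern does
    not point to exactly one block.
    """
    bad_rows = [i for i, row in enumerate(grid) if xor_all(row) != row_chk[i]]
    bad_cols = [j for j, col in enumerate(zip(*grid)) if xor_all(col) != col_chk[j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("mismatch pattern is not a single corrupted block")
    i, j = bad_rows[0], bad_cols[0]
    others = [b for k, b in enumerate(grid[i]) if k != j]
    grid[i][j] = reduce(xor_blocks, others, row_chk[i])
    return (i, j)

# Hypothetical 3x4 grid of 4-byte blocks with stored checksums.
original = [[bytes([4 * i + j + 1] * 4) for j in range(4)] for i in range(3)]
row_chk = [xor_all(r) for r in original]
col_chk = [xor_all(c) for c in zip(*original)]

# Simulate a soft error: flip one bit in block (1, 2).
grid = [list(r) for r in original]
grid[1][2] = bytes([grid[1][2][0] ^ 0x10]) + grid[1][2][1:]
where = locate_and_repair(grid, row_chk, col_chk)
```

Unlike a process failure, a soft error gives no signal about which block is bad, so the sketch must recompute the checksums to find it before it can repair anything.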

Summary: In-memory checkpointing, low-overhead protection for read-only data, and recovery from soft errors. XOR-based checksums recover exact data. Multi-dimensional checksum calculation increases fault tolerance. Checksums are co-located with the data. The design is scalable and keeps space overhead low.

Thank you. Questions?