Adding Algorithm-Based Fault-Tolerance to BLIS
Tyler Smith, Robert van de Geijn, Mikhail Smelyanskiy, Enrique Quintana-Ortí

Introduction

Soft errors:
– Transient hardware failures
– Caused by high-energy particle strikes
– Cause crashes and numerical errors

Present-day supercomputers:
– Mean time between failures (MTBF) is already quite high

Motivation

Future supercomputers:
– Exascale systems will have 3 orders of magnitude more components
– MTBF will deteriorate
– Resilience will be a fundamental problem [3]

Some Solutions

Checkpoint and restart of the entire application:
– Allows recovery from hard errors, but not from soft errors

Redundancy:
– Double redundancy to detect errors
– Triple redundancy to correct them

Both solutions may cost too much in terms of the power budget.

Algorithm-Based Fault Tolerance (ABFT)

ABFT [11]:
– Low overhead
– Needs to be integrated into the application

FLARE [10]:
– Fault-tolerant ITXGEMM [9]

Our work:
– Fault-tolerant BLIS

Outline

– Detecting and Correcting Errors
– Integrating ABFT into BLIS
– Performance Results

Detecting Errors

Our GEMM operation: C := A B + C
(figure: the shapes of A, B, and C)

Detecting Errors: Right Checksum

(figure sequence, built up over three slides: apply a checksum vector w on the right; C w computed from the output must equal A (B w))
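A minimal sketch of the right-checksum test, assuming the figures show the standard ABFT check: after computing C := A B, the product C w must equal A (B w) for any vector w, so a nonzero residual both detects an error and names the corrupted rows. The function name, the column-major storage convention, and the tolerance handling are illustrative assumptions, not BLIS code; the simplification C = A B (the beta = 0 case) is also an assumption.

```c
#include <math.h>
#include <stdlib.h>

/* Returns nonzero if any row of C fails the right-checksum test
 * d := C*w - A*(B*w); row_flag[i] marks the suspect rows.
 * Assumes column-major storage and C = A*B (beta = 0). */
int right_checksum(int m, int n, int k,
                   const double *A, const double *B, const double *C,
                   const double *w,   /* checksum vector, length n */
                   int *row_flag, double tol)
{
    double *Bw = malloc((size_t)k * sizeof(double));
    int any_error = 0;

    for (int p = 0; p < k; p++) {          /* Bw := B * w */
        double s = 0.0;
        for (int j = 0; j < n; j++) s += B[p + (size_t)j * k] * w[j];
        Bw[p] = s;
    }
    for (int i = 0; i < m; i++) {          /* d_i := (C*w - A*(B*w))_i */
        double Cw = 0.0, ABw = 0.0;
        for (int j = 0; j < n; j++) Cw  += C[i + (size_t)j * m] * w[j];
        for (int p = 0; p < k; p++) ABw += A[i + (size_t)p * m] * Bw[p];
        row_flag[i] = fabs(Cw - ABw) > tol;  /* roundoff needs a tolerance */
        any_error |= row_flag[i];
    }
    free(Bw);
    return any_error;
}
```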

Detecting Errors: Left Checksum

(figure sequence, built up over three slides: apply a checksum vector v on the left; vᵀC computed from the output must equal (vᵀA) B)

Detecting Errors: Error Location

(figure: a nonzero entry of the right-checksum residual gives the row of an error; a nonzero entry of the left-checksum residual gives the column)
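If the figure follows the usual ABFT scheme, intersecting the two residuals pinpoints individual elements: row i is suspect from the right checksum, column j from the left. A short sketch, where mark_for_recompute is a hypothetical helper:

```c
/* Intersect the row flags (right checksum) with the column flags
 * (left checksum) to locate corrupted elements of C. */
for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
        if (row_flag[i] && col_flag[j])
            mark_for_recompute(i, j);   /* hypothetical helper */
```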

Detecting Errors: Multiple Errors

(figure: multiple nonzero checksum entries flag multiple corrupted rows and columns of C)

Errors in A and B

A single error in A or B can corrupt multiple elements of C:
– One corrupted element of A can corrupt a whole row of C
– One corrupted element of B can corrupt a whole column of C

Our approach handles this case.

Correcting Errors

Traditional ABFT approach:
– Calculate what the error is and subtract it away
– Raises questions about numerical stability

We use checkpoint-and-rollback instead:
– Checkpoint C to main memory
– If an error is detected, roll back and recompute
– Only the corrupted elements are rolled back and recomputed
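A minimal checkpoint-and-rollback sketch around one update C := C + A B. Every helper here (gemm_update, update_right_checksum, update_left_checksum) is a hypothetical stand-in, not the BLIS implementation; for the accumulate case the checksums must subtract the checkpointed C w term, which is why the checkpoint C0 appears in both the verification and the rollback.

```c
#include <string.h>

/* Checkpoint, update, verify, and selectively roll back one block.
 * All helpers are hypothetical sketches, not BLIS API. */
void ft_gemm(int m, int n, int k,
             const double *A, const double *B, double *C,
             double *C0,                        /* checkpoint buffer */
             const double *w, const double *v,  /* checksum vectors  */
             int *row_flag, int *col_flag, double tol)
{
    memcpy(C0, C, (size_t)m * n * sizeof(double));  /* checkpoint C */
    gemm_update(m, n, k, A, B, C);                  /* C := C + A*B */

    /* verify the update: (C - C0)*w must equal A*(B*w) up to roundoff */
    if (update_right_checksum(m, n, k, A, B, C, C0, w, row_flag, tol)) {
        update_left_checksum(m, n, k, A, B, C, C0, v, col_flag, tol);
        for (int j = 0; j < n; j++)
            for (int i = 0; i < m; i++)
                if (row_flag[i] && col_flag[j]) {
                    double cij = C0[i + (size_t)j * m];   /* roll back */
                    for (int p = 0; p < k; p++)           /* recompute */
                        cij += A[i + (size_t)p * m] * B[p + (size_t)j * k];
                    C[i + (size_t)j * m] = cij;
                }
    }
}
```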

Outline

– Detecting and Correcting Errors
– Integrating ABFT into BLIS
– Performance Results

Integrating ABFT into BLIS

Each loop in the figure represents a different layer within BLIS, and ABFT can be implemented at the layer of your choice.

The tradeoff:
– Higher levels: cheaper ABFT, but errors are detected later
– Lower levels: more expensive ABFT, but errors are caught quickly

We implement ABFT at the macro-kernel level.

Integrating ABFT into BLIS

(figure: the BLIS loop structure, with the macro-kernel level highlighted; a schematic sketch follows)
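A schematic of the BLIS/Goto loop structure with the ABFT hooks placed around the macro-kernel, as the slides describe. The block sizes (NC, KC, MC) and all routine names are illustrative, not the actual BLIS identifiers.

```c
/* Schematic only: shows where the ABFT steps sit relative to the
 * five loops of the BLIS GEMM algorithm. */
for (int jc = 0; jc < n; jc += NC)           /* 5th loop: panels of B, C  */
    for (int pc = 0; pc < k; pc += KC) {     /* 4th loop: rank-KC updates */
        pack_B();                            /* pack KC x NC panel of B   */
        for (int ic = 0; ic < m; ic += MC) { /* 3rd loop: panels of A, C  */
            pack_A();            /* pack MC x KC panel; vT*A can fold in  */
            checkpoint_C();      /* ABFT: save the MC x NC block of C     */
            macro_kernel();      /* 2nd/1st loops + micro-kernel: C+=A*B  */
            verify_and_recover();/* ABFT: right checksum; lazy left on failure */
        }
    }
```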

Fault Tolerance at the Macro-kernel Level

Things to add to BLIS:
– Right checksum
– Left checksum
– Checkpointing C
– Rollback and recovery

Right Checksum

Must compute:
– Bw
– A(Bw)
– Cw

Goal: reduce extra memory movements.

(figure sequence, built up over several slides: where Bw, A(Bw), and Cw are computed within the blocked algorithm to avoid extra passes over memory)

Left Checksum

Must compute:
– vᵀA
– (vᵀA)B
– vᵀC

(figure sequence, built up over several slides: where vᵀA, (vᵀA)B, and vᵀC are computed within the blocked algorithm)

Left Checksum

– vᵀA can be computed while packing A
– Problem: (vᵀA)B must be performed once per macro-kernel call, so the left checksum has a higher overhead than the right checksum
– Solution: perform the left checksum lazily, only when the right checksum detects an error (see the sketch below)

Lazy Left Checksum

(figure: control flow of the lazy scheme)
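The lazy scheme in a few lines: the cheap right checksum runs after every macro-kernel call, and the costlier left checksum runs only when an error is actually flagged. All names are the hypothetical helpers from the earlier sketches.

```c
/* Lazy left checksum: pay for column location only on failure. */
macro_kernel(m, n, k, Ap, Bp, C);
if (right_checksum(m, n, k, Ap, Bp, C, w, row_flag, tol)) {
    left_checksum(m, n, k, Ap, Bp, C, v, col_flag, tol);
    recover(C, C0, row_flag, col_flag);   /* roll back flagged elements */
}
```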

Checkpointing

(figure sequence over two slides: the current block of C is copied to a checkpoint buffer in main memory before the macro-kernel updates it)

Multithreading Issues

Fewer loops have independent iterations:
– The checksum vector computation is a reduction
– Solved by giving each thread its own checksum vector, then reducing across threads (a sketch follows)

Load imbalance:
– While one thread is busy doing recovery, the other threads wait
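A sketch of the per-thread checksum vectors with a final reduction, so threads that split the columns of C never contend while accumulating. This assumes OpenMP for illustration; BLIS uses its own threading infrastructure.

```c
#include <omp.h>
#include <stdlib.h>
#include <string.h>

/* Compute Cw := C * w in parallel: each thread accumulates into a
 * private vector over its share of the columns, then the partials
 * are summed under a critical section. */
void parallel_right_checksum_Cw(int m, int n, const double *C,
                                const double *w, double *Cw)
{
    memset(Cw, 0, (size_t)m * sizeof(double));
    #pragma omp parallel
    {
        double *local = calloc((size_t)m, sizeof(double)); /* private */
        #pragma omp for nowait
        for (int j = 0; j < n; j++)            /* threads split columns */
            for (int i = 0; i < m; i++)
                local[i] += C[i + (size_t)j * m] * w[j];
        #pragma omp critical                   /* reduce the partials */
        for (int i = 0; i < m; i++)
            Cw[i] += local[i];
        free(local);
    }
}
```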

Load Imbalance

Solutions:
– Dynamic parallelism: waiting threads can steal work from slow threads
– Lazy recomputation: mark the corrupted elements of C, then all threads cooperatively perform recovery
  – Easy to implement in BLIS
  – Downside: the data is cold in the cache by recovery time

Final Implementation

(figure: the complete fault-tolerant algorithm, combining checksums, checkpointing, and lazy recovery)

Outline

– Detecting and Correcting Errors
– Integrating ABFT into BLIS
– Performance Results

Performance Results

Cost of detecting errors:
– No errors introduced
– Both 1 and 16 cores
– K = 256
(figure: performance graph)

Performance Results

Breakdown of the costs of detecting errors:
– No errors introduced
– Single core
– K = 256
– The checksums and checkpointing exhibit similar costs
(figure: performance graph)

Performance Results

Breakdown of the costs of detecting errors:
– No errors introduced
– 16 cores
– K = 256
– The checksums and checkpointing exhibit similar costs
(figure: performance graph)

Performance Results

Detecting and correcting errors in C:
– Single core
– Square matrices
– Quantifies the cost of correcting errors for small matrices
(figure: performance graph)

Performance Results

Detecting and correcting errors in A and B:
– Single core
– K = 256
– 1 error in A corrupts Nc elements of C
– 1 error in B corrupts M elements of C
(figure: performance graph)

Performance Results

Detecting and correcting errors in A and B:
– Multiple cores
– K = 256
– 1 error in A corrupts Nc elements of C
– 1 error in B corrupts M elements of C
(figure: performance graph)

Thank You! Questions?