VOCL-FT: Introducing Techniques for Efficient Soft Error Coprocessor Recovery
Antonio J. Peña, Wesley Bland, Pavan Balaji

Background
●Coprocessors are a clear trend in HPC
○About 34% of the performance on the November Top500 list (18% of the systems)
●Fault tolerance is widely accepted to be a growing issue
(Chart courtesy of top500.org/statistics/overtime/; legend: No Accelerator / All Others)

Fault Tolerance Research Areas
Hard Errors
●Focus on full-system restart/recovery
●"Upgrades" errors to fail-stop failures
●Often uses checkpoint/restart to roll back to the last good state
Soft Errors
●Detects errors in CPU memory
●Possibly corrects errors via redundancy/ABFT methods
●Sometimes looks at I/O errors
Little focus on coprocessors in either area

Fault Tolerance on Coprocessors
●5% of errors on Tsubame 2.5 are double-bit ECC errors in GPU memory
○An average of one every 2 weeks
○Other machines see rates as high as 0.8 per day
○These rates will probably continue to rise
●Some resilience work has started in this area:
○DS-CUDA performs redundant computation
○Hauberk inserts SDC detectors into code via source-to-source translation and uses checkpoint/restart
○Snapify provides checkpoint/restart with migration on Intel® Xeon Phi™
○CheCUDA and CheCL use checkpoint/restart to handle full-system failures
●None focus on automatic, in-place recovery for coprocessors

What we propose
●Generic techniques to provide efficient error recovery for offload-like APIs
●Automatic fault detection and recovery for soft errors
●Automatic checkpointing at synchronization points
○Start with something that looks like checkpoint/restart
○Take advantage of optimizations to avoid unnecessary checkpoints and minimize the footprint
●As an example, we extend the existing VOCL library to support fault tolerance (VOCL-FT) based on uncorrected ECC error detection

What is VOCL?
Virtual OpenCL:
●Transparent utilization of remote GPUs
●Efficient GPU resource management:
○Migration (GPU / server)
○Power management: pVOCL
(Diagram: in the traditional model, the application on a compute node calls the native OpenCL library directly to use the local physical GPU; in the VOCL model, the application calls the VOCL library through the standard OpenCL API and sees virtual GPUs, while VOCL proxies on remote compute nodes drive the physical GPUs through the native OpenCL library.)
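To make the VOCL model concrete, here is a minimal sketch of the interception idea: the client library exports the standard OpenCL entry points and forwards each call to a proxy on the node that owns the physical GPU. forward_to_proxy() and VOCL_OP_WRITE_BUFFER are placeholders for illustration, not VOCL's actual internals.

    #include <CL/cl.h>

    /* Placeholder for VOCL's real transport: it ships the arguments and data  */
    /* to the remote proxy, which replays the call on the native library.      */
    extern cl_int forward_to_proxy(int op, ...);
    enum { VOCL_OP_WRITE_BUFFER = 1 };          /* illustrative opcode only    */

    /* The client library exports the standard OpenCL symbol, so applications  */
    /* run unmodified while actually talking to a remote (virtual) GPU.        */
    cl_int clEnqueueWriteBuffer(cl_command_queue q, cl_mem buf,
                                cl_bool blocking, size_t offset, size_t size,
                                const void *ptr, cl_uint nwait,
                                const cl_event *wait_list, cl_event *event)
    {
        return forward_to_proxy(VOCL_OP_WRITE_BUFFER, q, buf, blocking,
                                offset, size, ptr, nwait, wait_list, event);
    }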

The Basic Idea
●Check for ECC errors once the code reaches a synchronization point
●If an uncorrected error has occurred, replay the commands issued since the last synchronization point
●Applies specifically to accelerator memory
○Host memory can be protected by other, existing tools
(Diagram: an epoch of Host2Dev, Kernel, and Dev2Host commands ends at a synchronization point; if the error check says yes, the epoch is replayed, otherwise execution continues into the next epoch.)
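A minimal sketch of this check-and-replay loop, assuming hypothetical helpers (uncorrected_ecc_count, command_log_replay, command_log_clear) in place of VOCL-FT's internal machinery; error handling is omitted.

    #include <CL/cl.h>

    /* Placeholders standing in for the logging/recovery machinery. */
    extern unsigned long long uncorrected_ecc_count(void);   /* e.g., via NVML  */
    extern void command_log_replay(cl_command_queue q);      /* re-issue epoch  */
    extern void command_log_clear(void);                     /* start new epoch */

    static unsigned long long last_ecc_count;   /* seeded from the counter at startup */

    /* Called at every synchronization point (epoch boundary). */
    void voclft_sync_point(cl_command_queue q)
    {
        clFinish(q);                                   /* reach the sync point  */
        while (uncorrected_ecc_count() != last_ecc_count) {
            /* Device memory may be corrupted: re-issue every command logged    */
            /* since the last good synchronization point, then check again in   */
            /* case another error struck during the replay itself.              */
            last_ecc_count = uncorrected_ecc_count();
            command_log_replay(q);
            clFinish(q);
        }
        command_log_clear();                           /* epoch completed OK    */
    }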

A More Complete Picture

Execution Logging
●Transferred data and commands are logged to temporary storage
○Uses asynchronous transfers and pinned host buffers to reduce the performance impact
○Only backs up data that can't be restored from user memory
●Data is logged as it is written, not only at synchronization points
○Prevents the backups of in/out buffers from being corrupted
○Saves an extra copy to and from the backups
Example from the slide (why logging at write time matters):
    writeToDevice(&a);
    kernel();                 /* <-- uncorrected ECC error strikes here     */
    readFromDevice(&a);       /* host copy of a is already corrupted        */
    checkForECCErrors();
    createCheckpoint(a);      /* too late: would checkpoint corrupted data  */
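The point of that example, as a sketch of the write path: the user's data is captured at the moment of the write, before any later device-to-host transfer can overwrite the host copy. append_to_log() is a placeholder; the real logger uses asynchronous transfers and pinned buffers as noted above.

    #include <CL/cl.h>
    #include <stdlib.h>
    #include <string.h>

    /* Placeholder: records (buffer, offset, size, data) so the epoch can be   */
    /* replayed after an uncorrected ECC error.                                */
    extern void append_to_log(cl_mem buf, size_t off, size_t size, void *copy);

    cl_int voclft_write_buffer(cl_command_queue q, cl_mem buf,
                               size_t off, size_t size, const void *host_ptr)
    {
        /* Log the data as it is written, not at the next synchronization      */
        /* point; otherwise an in/out buffer would be backed up only after a   */
        /* (possibly corrupted) read-back has overwritten the host copy.       */
        void *copy = malloc(size);
        memcpy(copy, host_ptr, size);
        append_to_log(buf, off, size, copy);

        /* Forward the actual transfer (blocking here for simplicity; the MOD  */
        /* optimization later turns this into a non-blocking write from a      */
        /* staging buffer).                                                    */
        return clEnqueueWriteBuffer(q, buf, CL_TRUE, off, size, host_ptr,
                                    0, NULL, NULL);
    }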

Failure Detection / Recovery
●Use ECC memory protection to detect errors
●Use the NVIDIA® Management Library (NVML) to query for memory errors
○Provides a counter of total ECC errors (no locality information)
●Recovery happens by replaying the logs
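A sketch of the detection query using real NVML calls; return-code checking and one-time NVML initialization are omitted for brevity, and GPU 0 is assumed.

    #include <nvml.h>

    /* Volatile (since last reboot) count of uncorrected ECC errors on GPU 0. */
    unsigned long long uncorrected_ecc_count(void)
    {
        nvmlDevice_t dev;
        unsigned long long count = 0;

        nvmlInit();
        nvmlDeviceGetHandleByIndex(0, &dev);
        /* NVML exposes only a counter: we learn that some device memory was   */
        /* corrupted, but not where, hence the replay-based recovery above.    */
        nvmlDeviceGetTotalEccErrors(dev, NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                                    NVML_VOLATILE_ECC, &count);
        nvmlShutdown();
        return count;
    }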

That’s the easy part… The interesting work comes in the optimizations

Don't Make Extra Checkpoints
●Only checkpoint data that has actually changed
○If data hasn't changed, don't checkpoint it (MOD)
○If data is read-only, don't checkpoint it (RO)
●Transform blocking writes into non-blocking writes and use a temporary buffer (MOD)
●Use temporary files for non-blocking writes (MOD)
●Delay copying data for non-blocking writes until the host data is overwritten with device data (HB)
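A minimal sketch of the MOD/RO filter as a per-buffer flag check at checkpoint time; the struct and field names are illustrative, not VOCL-FT's actual data structures.

    #include <CL/cl.h>

    /* Illustrative per-buffer bookkeeping kept by the runtime. */
    struct tracked_buffer {
        cl_mem mem;
        int read_only;            /* RO: the device never writes this buffer    */
        int modified_this_epoch;  /* MOD: set when a kernel or write touches it */
    };

    /* Checkpoint only buffers whose device copy could actually have changed. */
    int needs_checkpoint(const struct tracked_buffer *b)
    {
        if (b->read_only)             return 0;  /* restorable from user memory */
        if (!b->modified_this_epoch)  return 0;  /* unchanged since last epoch  */
        return 1;
    }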

Get More Information from the User
●Read-Only Buffers (EAPI)
○Allow the user to specify buffers as read-only per epoch
○More fine-grained than specifying read-only for the lifetime of the object
●Scratch Buffers (SB)
○Only make logs, not full checkpoints
○Don't keep the checkpoint after the epoch
○If we replay, use the same buffer without reloading it
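A usage sketch of the kind of per-epoch hints described above. The voclft_hint_* names are entirely hypothetical stand-ins; VOCL-FT's real extended-API entry points are not reproduced here.

    #include <CL/cl.h>

    /* Hypothetical hint calls, for illustration only. */
    extern void voclft_hint_read_only(cl_mem buf);  /* EAPI: read-only this epoch */
    extern void voclft_hint_scratch(cl_mem buf);    /* SB: log for replay only    */

    void run_epoch(cl_command_queue q, cl_kernel k, cl_mem in, cl_mem tmp)
    {
        size_t gws = 1024;               /* global work size for the example   */

        voclft_hint_read_only(in);   /* never checkpointed; restored from user */
                                     /* memory if the epoch has to be replayed */
        voclft_hint_scratch(tmp);    /* logged for replay, dropped once the    */
                                     /* epoch completes successfully           */

        clEnqueueNDRangeKernel(q, k, 1, NULL, &gws, NULL, 0, NULL, NULL);
        clFinish(q);                 /* synchronization point: ECC check here  */
    }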

Reduce Synchronization
●Host Dirty (HD)
○Only write a checkpoint when host memory has been overwritten
○Reduces synchronization for unnecessary checkpoints
●Checkpoint Skip (CS)
○Skip reading data from the device when a checkpoint is loaded
○Don't write a checkpoint for every epoch
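A sketch of the Checkpoint Skip policy (CS-N in the evaluation): only every Nth epoch writes a checkpoint, trading longer replay on failure for lower failure-free overhead. checkpoint_modified_buffers() is a placeholder.

    #include <CL/cl.h>

    extern void checkpoint_modified_buffers(cl_command_queue q);  /* placeholder */

    static unsigned epoch_count;

    /* End-of-epoch bookkeeping with Checkpoint Skip (period cs_period). */
    void end_of_epoch(cl_command_queue q, unsigned cs_period)
    {
        clFinish(q);                               /* synchronization point     */
        if (++epoch_count % cs_period == 0)
            checkpoint_modified_buffers(q);        /* full checkpoint this time */
        /* Otherwise keep only the replay log: a fault now costs replaying up   */
        /* to cs_period epochs, which is the tradeoff the CS-X results show.    */
    }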

Single Node Evaluations
●Simple microbenchmark test: repeated matrix multiplication
●NO = No Optimization, MOD = Unmodified Buffers, RO = Read-only Buffers, HB = Host Buffers, SB = Scratch Buffers, HD = Host Dirty, CS = Checkpoint Skip
(Charts: Failure-free Test; Single Failure Recovery Time)

Single Node Evaluations: MiniMD
●Checkpoint sizes decrease dramatically with the optimizations
●Relative recovery time is roughly static
●Execution overhead is under 10% with full optimizations
○Other apps see overheads as low as 1% (Hydro)
●Tradeoff between error rate and check period (CS-X)
HD = Host Dirty, MOD = Unmodified Buffers, RO = Read-only Buffers, HB = Host-only Buffers, SB = Scratch Buffers, CS = Checkpoint Skip
(Charts: Checkpoint Size and Pace; Recovery Time)

Single Node Evaluations: MiniMD (continued)
(Charts: Error-Free Timestep Execution Time; Recovery for Multiple Faults, 10K Timesteps)

Multi Node Evaluations: Hydro
●Native vs. full optimization without (x0E) and with (x1E and x5E) injected errors
●Strong scaling gets much worse because of lower CPU involvement
●Weak scaling gets slightly worse because more communication creates more synchronization points
(Charts: Strong Scaling; Weak Scaling)

Conclusion
●We created an efficient, automatic protection scheme for GPU memory
●The techniques are generic and can be adopted by other, similar offloading APIs (e.g., CUDA) and error detectors (ECC is just one example)
●With minimal input from the user, the performance impact can be very low
●With no involvement from the user, performance is still good
●Recovery times can also be relatively low, depending on the trade-off in the number of iterations between checkpoints

Test Systems
Single Node
●SuperMicro® SuperServer® 7047GR-TPRF
●Intel® Xeon® E5-2687W v2 octo-core, 3.4 GHz
●64 GB DDR3 RAM
●NVIDIA® Tesla® K40m
●Intel® S3700 SSD disks
Multi Node (Virginia Tech)
●InfiniBand® QDR fabric
●2x hexa-core Intel® Xeon® E, GB RAM
●2x NVIDIA® Tesla® M2050
●Intel® MPI 4.1

Questions?