stdchk: A Checkpoint Storage System for Desktop Grid Computing

Presentation transcript:

1 stdchk: A Checkpoint Storage System for Desktop Grid Computing
Matei Ripeanu (UBC), Sudharshan S. Vazhkudai (ORNL), Abdullah Gharaibeh (UBC), Samer Al-Kiswany (UBC)
The University of British Columbia and Oak Ridge National Laboratory. ICDCS '08.

2 Checkpointing Introduction
Checkpointing is used for fault tolerance, debugging, or migration. Typically, an application running for days on hundreds of nodes (e.g., on a desktop grid) saves checkpoint images periodically.

3 Deployment Scenario [figure]

4 The Challenge
Although checkpointing is necessary:
- It is pure overhead from the performance point of view.
- It generates a high load on the storage system: most of the time is spent writing to the storage system.
Requirement: a high-performance, scalable, and reliable storage system optimized for checkpointing applications.
Challenge: low-cost, transparent support for checkpointing at the file-system level.

5 Checkpointing Workload Characteristics
- Write-intensive (bursty) applications: e.g., a job running on hundreds of nodes periodically checkpoints 100s of GB of data.
- Write once, rarely read during application execution.
- Potentially high similarity between consecutive checkpoints.
- Application-specific checkpoint image life span: when is it safe to delete an image?

6 Why a Checkpointing-Optimized Storage System?
Optimizing for the checkpointing workload can bring valuable benefits:
- High throughput through specialization.
- Considerable storage space and network effort savings through transparent support for incremental checkpointing.
- Simplified data management by exploiting the particularities of checkpoint usage scenarios.
- Reduced load on a shared file system.
- Low cost: can be built atop scavenged resources.

7 stdchk: a checkpointing-optimized storage system built using scavenged resources.

8 Outline
- stdchk architecture
- stdchk features
- stdchk system evaluation

9 stdchk Architecture
[figure: three components]
- Manager (metadata management)
- Benefactors (storage nodes contributing scavenged space)
- Client (file system interface on the compute node)
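A minimal sketch of how these three components might interact on the write path, assuming invented Python class and method names (an illustration, not stdchk's actual interfaces): the client splits a checkpoint into chunks, stripes them round-robin across benefactor nodes, and registers each chunk's location with the manager.

    import hashlib
    import itertools

    CHUNK_SIZE = 1 << 20  # 1 MB chunks (illustrative value)

    class Manager:
        """Metadata service: remembers which benefactor holds each chunk of a file."""
        def __init__(self):
            self.files = {}  # file name -> ordered list of (chunk_id, benefactor_id)

        def register(self, name, chunk_id, benefactor_id):
            self.files.setdefault(name, []).append((chunk_id, benefactor_id))

    class Benefactor:
        """A scavenged storage node: stores chunks keyed by their content hash."""
        def __init__(self, node_id):
            self.node_id = node_id
            self.chunks = {}  # chunk_id -> bytes

        def put(self, chunk_id, data):
            self.chunks[chunk_id] = data

    class Client:
        """File-system-side logic: stripes a checkpoint file across benefactors."""
        def __init__(self, manager, benefactors):
            self.manager = manager
            self.targets = itertools.cycle(benefactors)  # round-robin striping

        def write_file(self, name, data):
            for off in range(0, len(data), CHUNK_SIZE):
                chunk = data[off:off + CHUNK_SIZE]
                chunk_id = hashlib.sha1(chunk).hexdigest()
                node = next(self.targets)
                node.put(chunk_id, chunk)
                self.manager.register(name, chunk_id, node.node_id)

    # Example: stripe a 5 MB "checkpoint" across three benefactors.
    manager = Manager()
    nodes = [Benefactor(i) for i in range(3)]
    data = b"".join(bytes([i]) * CHUNK_SIZE for i in range(5))
    Client(manager, nodes).write_file("app.ckpt", data)
    print(manager.files["app.ckpt"])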

10 stdchk Features
- Simplified data management.
- POSIX file system API: as a result, using stdchk does not require modifications to the application.
- High throughput for write operations.
- Support for transparent incremental checkpointing.
- High reliability through replication.
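Because the interface is a POSIX file system, checkpointing through stdchk looks like ordinary file I/O. The snippet below illustrates that transparency; the mount point /mnt/stdchk is invented for this example.

    import os

    # Hypothetical mount point where the stdchk client file system would be mounted.
    CHECKPOINT_DIR = "/mnt/stdchk/myapp"

    def save_checkpoint(step: int, state: bytes) -> str:
        """Plain POSIX file I/O: the application needs no stdchk-specific calls."""
        os.makedirs(CHECKPOINT_DIR, exist_ok=True)
        path = os.path.join(CHECKPOINT_DIR, f"checkpoint_{step:06d}.img")
        with open(path, "wb") as f:
            f.write(state)  # served underneath by the stdchk FS client
        return path

    print(save_checkpoint(42, b"\x00" * 1024))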

11 Optimized Write Operation Alternatives
Write procedure alternatives:
- Complete local write
- Incremental write
- Sliding window write

12 Optimized Write Operation Alternatives
[figure: data path on the compute node, from the application through the stdchk FS interface and the local disk, out to the stdchk pool]

13 Optimized Write Operation Alternatives
[figure: another step of the same data-path diagram: application, stdchk FS interface, and local disk on the compute node, feeding the stdchk pool]

14 Optimized Write Operation Alternatives
[figure: this step adds a memory buffer on the compute node between the application/stdchk FS interface and the stdchk pool]
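The builds above show where data is staged on the compute node before it reaches the stdchk pool: the first two alternatives involve the local disk, while the sliding window write keeps a bounded in-memory buffer and streams full buffers to remote storage while the application keeps writing. A rough Python sketch of that sliding-window idea follows; the class and the send_to_benefactor callback are invented names, not stdchk code.

    import queue
    import threading

    WINDOW_SIZE = 4 << 20  # 4 MB in-memory window (illustrative value)

    class SlidingWindowWriter:
        """Stage writes in memory and ship full windows from a background thread,
        so the application is not blocked on the network for every write."""
        def __init__(self, send_to_benefactor):
            self.send = send_to_benefactor        # callable(bytes) -> None
            self.buffer = bytearray()
            self.outbox = queue.Queue(maxsize=8)  # bounds the memory used for staging
            self.worker = threading.Thread(target=self._drain, daemon=True)
            self.worker.start()

        def write(self, data: bytes):
            self.buffer.extend(data)
            while len(self.buffer) >= WINDOW_SIZE:
                self.outbox.put(bytes(self.buffer[:WINDOW_SIZE]))
                del self.buffer[:WINDOW_SIZE]

        def close(self):
            if self.buffer:
                self.outbox.put(bytes(self.buffer))
            self.outbox.put(None)                 # sentinel: no more data
            self.worker.join()

        def _drain(self):
            while True:
                window = self.outbox.get()
                if window is None:
                    break
                self.send(window)                 # e.g., push the window to a benefactor

    # Usage: the FS client would call write() as the application writes its checkpoint.
    w = SlidingWindowWriter(send_to_benefactor=lambda chunk: None)
    w.write(b"checkpoint data " * 500000)         # ~8 MB of data
    w.close()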

15 Write Operation Evaluation
Testbed: 28 machines. Each machine has two 3.0 GHz Xeon processors, 1 GB RAM, and two 36.5 GB SCSI disks.

16 Achieved Storage Bandwidth
The average achieved storage bandwidth (ASB) over a 1 Gbps testbed. Sliding window write achieves high bandwidth (110 MBps) and saturates the 1 Gbps link (110 MBps is close to the link's ~125 MBps raw capacity).

17 stdchk Features
- Checkpointing-optimized data management.
- POSIX file system interface: no modification to the application required.
- High-throughput write operation.
- Transparent incremental checkpointing.

18 Transparent Incremental Checkpointing
Incremental checkpointing may bring valuable benefits:
- Lower network effort.
- Less storage space used.
But:
- How much similarity is there between consecutive checkpoints?
- How can we detect similarities between checkpoints? Is this fast enough?

19 Similarity Detection Mechanism: Compare-by-Hash
[figure: checkpoint T0 is divided into blocks X, Y, Z; each block is hashed, and the hashes are recorded for T0]

20 Similarity Detection Mechanism: Compare-by-Hash
[figure: checkpoint T1 divides into blocks W, Y, Z; comparing its hashes against T0's (X, Y, Z) shows that only block W is new, so storing T1 requires shipping only W]
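A minimal sketch of the compare-by-hash step shown in the two figures: both checkpoints are cut into blocks, every block is hashed, and only blocks whose hashes were not seen in the previous checkpoint need to be stored. Fixed 1 MB blocks and SHA-1 are assumptions made purely for illustration.

    import hashlib

    BLOCK = 1 << 20  # 1 MB fixed-size blocks (illustrative)

    def blocks(data: bytes, size: int = BLOCK):
        return [data[i:i + size] for i in range(0, len(data), size)]

    def block_hashes(data: bytes):
        return [hashlib.sha1(b).hexdigest() for b in blocks(data)]

    def new_blocks(prev_ckpt: bytes, curr_ckpt: bytes):
        """Return only the blocks of the current checkpoint that the previous one lacks."""
        seen = set(block_hashes(prev_ckpt))
        return [b for b in blocks(curr_ckpt) if hashlib.sha1(b).hexdigest() not in seen]

    t0 = b"X" * BLOCK + b"Y" * BLOCK + b"Z" * BLOCK  # checkpoint T0: blocks X, Y, Z
    t1 = b"W" * BLOCK + b"Y" * BLOCK + b"Z" * BLOCK  # checkpoint T1: W is new, Y and Z repeat
    print(len(new_blocks(t0, t1)))                   # -> 1, only block W must be stored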

21 Similarity Detection Mechanism
How to divide the file into blocks?
- Fixed-size blocks + compare-by-hash (FsCH)
- Content-based blocks + compare-by-hash (CbCH)

22 FsCH Insertion Problem
[figure: checkpoint i divides into fixed-size blocks B1..B5; an insertion in checkpoint i+1 shifts the data, which now divides into B1..B6, so the fixed-size blocks after the insertion point no longer match]
Result: lower similarity detection ratio.

23 Content-based Compare-by-Hash (CbCH)
[figure: a window of m bytes is slid over checkpoint i; at each offset the window is hashed, and a block boundary is declared wherever k bits of the hash value equal 0, yielding content-defined blocks B1..B4]

24 Content-based Compare-by-Hash (CbCH)
[figure: because boundaries are content-defined, an insertion in checkpoint i+1 changes only the affected block (B2 becomes BX), while B1, B3, and B4 still match checkpoint i]
Result: higher similarity detection ratio. But: computationally intensive.
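A sketch of the content-based boundary detection drawn on the previous slide: hash a small m-byte window and cut a block wherever the low k bits of the hash are zero, so boundaries depend on content rather than on offsets. The version below slides the window byte by byte and uses MD5 with small toy parameters so the demo runs quickly; the evaluation on the next slide uses m = 20 B, k = 14 b with a non-overlapping window, and none of the names here are stdchk's own.

    import hashlib
    import os

    def cbch_blocks(data: bytes, m: int = 20, k: int = 14):
        """Cut a block wherever the low k bits of the m-byte window's hash are zero."""
        mask = (1 << k) - 1
        blocks, start, off = [], 0, 0
        while off + m <= len(data):
            h = int.from_bytes(hashlib.md5(data[off:off + m]).digest()[-8:], "big")
            if h & mask == 0:            # boundary condition: low k bits are zero
                blocks.append(data[start:off + m])
                start = off + m
                off = start              # resume scanning after the boundary
            else:
                off += 1                 # slide the window by one byte
        if start < len(data):
            blocks.append(data[start:])  # trailing block
        return blocks

    def digests(block_list):
        return {hashlib.md5(b).hexdigest() for b in block_list}

    # Because boundaries are content-defined, an insertion typically changes only the
    # block that contains it; boundaries re-synchronize at the next cut point, so the
    # later blocks keep their content and are still detected as duplicates.
    base = os.urandom(256 * 1024)
    modified = base[:1000] + b"INSERTED" + base[1000:]
    shared = digests(cbch_blocks(base, k=12)) & digests(cbch_blocks(modified, k=12))
    print(len(shared), "blocks still match after the insertion")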

25 Evaluating Similarity Between Consecutive Checkpoints
The applications: BMS* and BLAST. Checkpointing interval: 1, 5, and 15 minutes.

  Type                          Avg. checkpoint size   Number of checkpoints
  Application level             2.4 MB                 100
  System level (BLCR)           450 MB                 ~1200
  Virtual machine level (Xen)   1 GB                   ~400

* Checkpoints by Pratul Agarwal (ORNL)

26 Similarity Ratio and Detection Throughput
The table presents the average rate of detected similarity and, in brackets, the detection throughput in MB/s for each heuristic.

  Technique                          BMS, app level (1 min)   BMS, app level (5 min)   BLAST, BLCR (15 min)   BLAST, Xen (5 or 15 min)
  FsCH (1 MB blocks)                 0.0% [108]               23.4% [109]              6.3% [113]             0.0% [110]
  CbCH (no overlap, m=20 B, k=14 b)  0.0% [28.4]              82% [26.6]               70% [26.4]             0.0% [28.4]

But: using a GPU, CbCH achieves over 190 MBps detection throughput (StoreGPU: Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, M. Ripeanu, HPDC).

27 Compare-by-Hash Results: Achieved Storage Bandwidth
FsCH slightly degrades the achieved storage bandwidth, but reduces the storage space used and the network effort by 24%.

28 Outline
- stdchk architecture
- stdchk features
- stdchk overall system evaluation

29 stdchk Scalability
Setup: 7 clients, each writing 100 files of 100 MB (70 GB in total), against an stdchk pool of 20 benefactor nodes.
[figure: three experiment phases: steady state, nodes join, nodes leave]
stdchk sustains high loads as both the number of nodes and the workload vary.

30 Experiment with a Real Application
Application: BLAST. Execution time: > 5 days. Checkpointing interval: 30 s. Stripe width: 4 benefactors. Client machine: two 3.0 GHz Xeon processors, SCSI disks.

  Metric                      Local disk   stdchk    Improvement
  Checkpointing time (s)      22,733       16,497    27.0%
  Data size (TB)              -            -         69.0%
  Total execution time (s)    462,141      455,894   1.3%

31 Summary
stdchk: a checkpointing-optimized storage system built using scavenged resources.
stdchk features:
- High-throughput write operation.
- Saves considerable disk space and network effort.
- Checkpointing-optimized data management.
- Easy to adopt: implements a POSIX file system interface.
- Inexpensive: built atop scavenged resources.
Consequently, stdchk:
- Offloads the checkpointing workload from the shared file system.
- Speeds up checkpointing operations (reduces checkpointing overhead).

32 Thank you netsyslab.ece.ubc.ca ICDCS ‘08