Robust Storage Management in the Machine Room and Beyond
Presented by Sudharshan Vazhkudai, Computer Science Research Group, Computer Science and Mathematics Division

Presentation transcript:

Slide 1: Robust Storage Management in the Machine Room and Beyond

Presented by Sudharshan Vazhkudai
Computer Science Research Group, Computer Science and Mathematics Division

In collaboration with:
- ORNL: John Cobb, Greg Pike
- North Carolina State University: Xiaosong Ma, Zhe Zhang, Chao Wang, Frank Mueller
- The University of British Columbia: Matei Ripeanu, Samer Al-Kiswany
- Virginia Tech: Ali Butt

Slide 2: Problem space: Petascale storage crisis

- Data staging, offloading, and checkpointing are all affected by data unavailability and the I/O bandwidth bottleneck:
  - Compute time is wasted on staging at the beginning of the job.
  - Early staging and late offloading waste scratch space.
  - Delayed offloading leaves result data vulnerable to purging.
  - Checkpointing terabytes of data to a traditional file system creates an I/O bottleneck.
- Storage failure:
  - A significant contributor to system downtime and CPU underutilization (during RAID reconstruction).
  - Failures per year: 3–7% of disks, 3–16% of controllers, and up to 12% of SAN switches; roughly 10x the rate expected from vendor specification sheets (J. Gray and C. van Ingen, "Empirical Measurements of Disk Failure Rates and Error Rates," Technical Report MSR-TR, Microsoft, December 2005).
- Upshot:
  - Low uptime:
    - Due to job resubmissions.
    - Because checkpoints and restarts are expensive.
  - Increased job wait times due to staging/offloading and storage errors.
  - Poor end-user data delivery options.

Slide 3: Approach

- If you cannot afford a balanced system, develop management strategies to compensate.
- Exploit opportunities throughout the HEC I/O stack:
  - The parallel file system.
  - Many unused resources: memory, cluster node-local storage, and idle desktop storage (both in the machine room and on the client side).
  - Disparate storage entities, including archives and remote sources.
- Concerted use of the above can be brought to bear on urgent supercomputing issues such as staging, offloading, prefetching, checkpointing, data recovery, the I/O bandwidth bottleneck, and end-user data delivery.

Slide 4: Approach (cont'd)

- View the entire HPC center as a system.
- Find new ways to optimize this system's performance and availability.

[Architecture diagram: a global coordination layer (staging, offloading, checkpointing, prefetching; scheduler; center-wide storage/processing; service agents) sits above a storage abstractions layer spanning node-local disks, workstation/storage servers, memory resources, parallel file systems, tape archives, and novel aggregate storage, all connected by a data communication pathway that supports existing storage elements and provides scalable I/O across the center, collective downloads, data sessions, data shuffling, data sieving, and IBP.]

Slide 5: Global coordination

- Motivation: lack of global coordination between the storage hierarchy and system software.
- As a start, staging, offloading, and computation need to be coordinated:
  - Problems with manual and scripted staging: human operational cost, wasted compute time and storage, and increased wait time due to resubmissions.
- How?
  - Explicit specification of I/O activities alongside computation in a job script.
  - A zero-charge data transfer queue.
  - Planning and orchestration.

Slide 6: Coordinating data and computation

- Specification of I/O activities in a PBS job script:

    BEGIN STAGEIN
      retry=3; interval=20
      hsi -A keytab -k MyKeytab -l user "get /scratch/user/Destination: Input"
    END STAGEIN
    BEGIN COMPUTATION
      #PBS ...
    END COMPUTATION
    BEGIN STAGEOUT
      ...
    END STAGEOUT

- A separate, zero-charge data transfer queue:
  - Queues up and schedules data transfers.
  - Treats data transfers as "data jobs."
  - Data transfers can now be charged, if need be.
- Planning and orchestration (see the sketch below):
  - Parse the script into individual stage-in, compute, and stage-out jobs.
  - Set up and manage dependencies using resource-manager primitives.
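As a rough illustration of the planning and orchestration step, the sketch below splits one annotated script into three chained PBS jobs, with the stage-in and stage-out submitted to a data queue and ordered through standard dependency primitives. The "dataxfer" queue name and the *.pbs file names are assumptions for illustration, not the actual planner interface.

    #!/bin/bash
    # Hedged sketch of the planning step: chain stage-in, compute, and stage-out
    # as separate PBS jobs. Queue and script names are hypothetical.
    STAGEIN_ID=$(qsub -q dataxfer stagein.pbs)                      # zero-charge data job
    COMPUTE_ID=$(qsub -q batch -W depend=afterok:${STAGEIN_ID} compute.pbs)
    qsub -q dataxfer -W depend=afterok:${COMPUTE_ID} stageout.pbs   # offload after compute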

Slide 7: Seamless data path

- Motivation: standard data availability techniques are designed with persistent data in mind:
  - RAID reconstruction can be time-consuming; rebuilding a 160-GB disk takes on the order of dozens of minutes.
  - Multiple disk failures within a RAID group can be crippling.
  - I/O node failover is not always possible (thousands of nodes).
  - Novel mechanisms that address "transient data availability" are needed to complement existing approaches.
- What makes it feasible?
  - Natural data redundancy in staged job data.
  - Job input data is usually immutable.
  - Network costs are dropping sharply each year.
  - Better bulk transfer tools with support for partial data fetches.
- How? (a stand-in sketch follows)
  - Augment file system metadata with "recovery hints" from the job script.
  - Reconstruct data on the fly on another object storage target (OST).
  - Patch the missing data from the data source.
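The actual design extends the Lustre metadata server with recovery hints; purely as a loose stand-in for experimentation, the sketch below records the same "source"/"sink" information from the job script as user extended attributes on the staged file. The attribute names, URIs, and paths are illustrative assumptions, not the paper's mechanism.

    #!/bin/bash
    # Hedged stand-in for "recovery hints": attach the job script's source and
    # sink URIs to a staged file as user extended attributes. Names and URIs
    # here are made up; the real design extends Lustre MDS metadata instead.
    setfattr -n user.recovery.source -v "hsi:///home/user/Input" /scratch/user/Destination
    setfattr -n user.recovery.sink   -v "scp://end-host/data/results" /scratch/user/Destination
    getfattr -d -m 'user.recovery' /scratch/user/Destination   # inspect the stored hints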

Slide 8: Data recovery

- Embed recovery metadata about transient job data into the Lustre parallel file system:
  - Extend Lustre metadata to include recovery hints.
  - The metadata is extracted from the job script.
  - "Source" and "sink" information becomes an integral part of the transient data.
- Failure detection checks for unavailable OSTs.
- Reconstruction (see the range sketch below):
  - Locate substitute OSTs and allocate objects.
  - Mark the data dirty in the metadata directory service (MDS).
  - Recover the source URI from the MDS.
  - Compute the missing data range.
- Remote patching:
  - Reduce repeated authentication costs per dataset.
  - Automate the interactive session with "Expect" for single sign-on.
  - Protocols: hsi, GridFTP, NFS.
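To make the "compute the missing data range" step concrete, the sketch below enumerates the byte ranges of a file that lived on a failed OST, assuming simple round-robin striping. The stripe size, stripe count, failed OST index, and file size are illustrative inputs; the real implementation would obtain the layout and failure information from the MDS.

    #!/bin/bash
    # Hedged sketch: byte ranges lost with one failed OST under round-robin
    # striping. All four inputs below are illustrative placeholders.
    STRIPE_SIZE=$((1024 * 1024))     # 1 MB stripes
    STRIPE_COUNT=4                   # objects per stripe row
    FAILED_IDX=2                     # index of the unavailable OST in the layout
    FILE_SIZE=$((10 * 1024 * 1024))

    offset=$((FAILED_IDX * STRIPE_SIZE))
    while [ "$offset" -lt "$FILE_SIZE" ]; do
      end=$((offset + STRIPE_SIZE))
      [ "$end" -gt "$FILE_SIZE" ] && end=$FILE_SIZE
      echo "patch bytes [$offset, $end) from the remote source copy"
      offset=$((offset + STRIPE_COUNT * STRIPE_SIZE))
    done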

Slide 9: Results

- Better use of users' compute-time allocations and decreased job turnaround time.
- Optimal use of the center's scratch space, avoiding too-early stage-in and delayed offloading.
- Fewer resubmissions due to result-data loss.
- Reduced wait time: trace-driven simulation of LANL operational data combined with LCF scratch data.

[Chart: mean wait time (s) versus stripe count, with and without reconstruction.]

[Diagram: the planner parses a job script into (1) stage data, (2) checkpoint setup, (3) compute job, and (4) offload data, submitting them across the job queue and the data queue with dependencies (3 after 1, 4 after 3). The machine-room data path connects the data source (accessed via ftp, scp, or GridFTP), the /scratch parallel file system, /home, the archive (hsi access), the head node, compute nodes, I/O nodes, and the end user (NFS access), providing seamless I/O access via the "data path" alongside regular I/O access to staged data during staging/offloading.]

Slide 10: New storage abstractions

- Checkpointing terabytes of data is cumbersome; better tools are needed to address the storage bandwidth bottleneck.
- Options:
  - Machines can potentially be provisioned with solid-state memory.
  - ~200 TB of aggregate memory in the petaflop machine; potential "residual memory":
    - Tier-1 applications (GTC, S3D, POP, CHIMERA) seldom use all memory.
    - Checkpoint size is ~10% of the memory footprint.
  - Many applications oversubscribe processors in an attempt to plan for failure:
    - Use the oversubscribed processors!
  - Checkpointing to memory can expedite these operations.
- Previous solutions are not concerted in their memory usage, creating artificial load imbalance. Examples:
  - J.S. Plank, K. Li, and M.A. Puening, "Diskless Checkpointing," IEEE Transactions on Parallel and Distributed Systems, (10).
  - L.M. Silva and J.G. Silva, "Using Two-Level Stable Storage for Efficient Checkpointing," IEE Proceedings - Software, (6).

Slide 11: stdchk: A checkpoint-friendly storage

- An aggregate memory-based storage abstraction:
  - Split checkpoint images into chunks and stripe them.
  - Parallel I/O across distributed memory.
  - Redundantly mounted on PEs for file-system-like access.
  - Optimized, relaxed POSIX I/O interfaces to the storage.
- Lazy migration of images to local disk/archives, creating a seamless data pathway:
  - Unused or underutilized processors can perform this operation.
- Incremental checkpointing and pruning of checkpoint files (see the sketch below):
  - Compare chunk hashes from two successive intervals.
  - Initial experiments suggest a 10–25% reduction in size for BLCR checkpoints.
  - Purge images from the previous interval once the current image is safely stored.
  - A file system is unable to perform such optimizations.
- Specification of checkpoint preparation in a job script.
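The hash comparison can be pictured with the rough sketch below, which splits two successive checkpoint images into fixed-size chunks and keeps only the chunks whose hashes changed. The file names, the 1 MB chunk size, and the "store/" destination are illustrative assumptions, not stdchk's actual interface.

    #!/bin/bash
    # Hedged sketch of incremental checkpointing: store only chunks whose hashes
    # differ from the previous interval. Names and chunk size are illustrative.
    mkdir -p store
    split -b 1M ckpt_prev.img prev_    # previous interval's image, chunked
    split -b 1M ckpt_curr.img curr_    # current interval's image, chunked
    for c in curr_*; do
      p="prev_${c#curr_}"
      if [ ! -f "$p" ] || [ "$(md5sum < "$c")" != "$(md5sum < "$p")" ]; then
        cp "$c" store/                 # new or changed chunk: write it out
      fi                               # unchanged chunks are reused from the prior interval
    done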

Slide 12: End-user data delivery: An architecture for eager offloading of result data

- Offloading result data is equally important, for local visualization and interpretation.
- Storage system failures and purging of scratch space can cause loss of result data, and the end-user resource may not be available for offload.
- Eager offloading (a rough sketch follows):
  - The offload-side equivalent of data reconstruction.
  - Transparent data migration using "sink"/destination metadata supplied as part of job submission (done as part of global coordination).
  - Data offloading can be overlapped with computation.
  - Can fail over to intermediate storage or archives for planned transfers in the future:
    - Intermediate nodes are specified by the user in the job script.
    - A one-to-many distribution followed by a many-to-one download.
    - A combination of Bullet-style p2p dissemination and staged transfers via landmarks.
    - Erasure-coded chunks for redundancy and fault tolerance.
    - Offloads are monitored for bandwidth degradation, and alternate paths are chosen accordingly.
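The sketch below illustrates only the overlap-and-failover idea: finished result files are pushed toward the user's sink in the background while the job keeps running, falling back to a user-named intermediate host when the direct transfer fails. The host names, paths, and .h5 file pattern are hypothetical, and a real implementation would monitor bandwidth degradation rather than react only to outright failures.

    #!/bin/bash
    # Hedged sketch of eager offloading: push result files toward the sink while
    # computation continues, with an intermediate host as fallback. Hosts and
    # paths are hypothetical placeholders.
    SINK="user@end-host:/data/results"       # taken from the job script's sink metadata
    FALLBACK="user@intermediate:/staging"    # intermediate node named by the user
    for f in results/*.h5; do
      ( scp -q "$f" "$SINK" || scp -q "$f" "$FALLBACK" ) &   # overlapped with the run
    done
    wait   # ensure all offloads finish before the job script exits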

Slide 13: FreeLoader: Improving end-user data delivery with client-side collaborative caching

- Enabling trends:
  - Unused storage: more than 50% of desktop storage is unused.
  - Immutable data: data are usually write-once, read-many, with remote source copies.
  - Connectivity: well-connected, secure LAN settings.
- The FreeLoader aggregate storage cache:
  - Scavenges O(GB) contributions from desktops.
  - Provides a parallel I/O environment across loosely connected workstations, aggregating both I/O and network bandwidth (see the sketch below).
  - Is NOT a file system, but a low-cost, local storage solution enabling client-side caching and locality.

[Chart: throughput (MB/s) versus dataset size (up to 64 GB), comparing FreeLoader, PVFS, HPSS-Hot, HPSS-Cold, and remote NFS.]
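As a very loose illustration of aggregating I/O and network bandwidth across donors (not FreeLoader's actual protocol), the sketch below pulls different chunks of one dataset from several hypothetical workstation donors concurrently and reassembles them locally; the host names, paths, chunk layout, and scp transport are all assumptions.

    #!/bin/bash
    # Hedged sketch: fetch chunks of one dataset from several donor workstations
    # in parallel, then reassemble. Hosts, paths, and chunk layout are made up.
    DONORS=(ws01 ws02 ws03 ws04)
    NCHUNKS=16
    mkdir -p chunks
    for i in $(seq 0 $((NCHUNKS - 1))); do
      host=${DONORS[$((i % ${#DONORS[@]}))]}
      scp -q "$host:/donated/dataset.chunk$i" chunks/ &   # chunks move concurrently
    done
    wait
    for i in $(seq 0 $((NCHUNKS - 1))); do                # reassemble in chunk order
      cat "chunks/dataset.chunk$i"
    done > dataset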

Slide 14: Virtual cache: Impedance matching on steroids!

- Can we host partial copies of datasets and yet serve client accesses to the entire dataset? (A sketch of transparent patching follows.)
- Analogy: FreeLoader is to the remote data source as a file system's buffer cache is to the disk.
- Benefits:
  - Bootstrapping the download process.
  - Storing more datasets.
  - Efficient cache management.
  - Support for persistent-storage donors as well as bandwidth-only donors.

[Diagram: a scientific data cache built from persistent-storage-plus-bandwidth donors and bandwidth-only donors, with stripe width (w) and patch width (p). Parallel get() operations serve dataset-1 and dataset-2; downloads are bootstrapped using "seeds" across w, and transparent patching fetches missing pieces from the source copies (e.g., gsiftp://source-N/dataset-2).]
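A crude way to picture transparent patching is the sketch below: a client read for a chunk is served from the local cache when present, and otherwise satisfied by a partial (byte-range) fetch from the remote source copy, which is then cached. The URL, chunk size, cache layout, and the use of curl instead of GridFTP are illustrative assumptions only.

    #!/bin/bash
    # Hedged sketch of transparent patching: serve a chunk from the local cache
    # if present, otherwise fetch just that byte range from the remote source
    # copy, cache it, and serve it. URL, chunk size, and layout are made up.
    SOURCE_URL="https://source-N.example.org/dataset-2"
    CHUNK_SIZE=$((4 * 1024 * 1024))   # 4 MB chunks
    OFFSET=$1                         # byte offset requested by the client
    chunk=$((OFFSET / CHUNK_SIZE))
    if [ ! -f "cache/chunk.$chunk" ]; then
      start=$((chunk * CHUNK_SIZE))
      end=$((start + CHUNK_SIZE - 1))
      mkdir -p cache
      curl -s -r "$start-$end" "$SOURCE_URL" -o "cache/chunk.$chunk"   # partial fetch
    fi
    cat "cache/chunk.$chunk"          # serve the (now cached) chunk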

Slide 15: Contact

Sudharshan Vazhkudai
Computer Science Research Group
Computer Science and Mathematics Division
(865)