
Operated by Los Alamos National Security, LLC for NNSA. UNCLASSIFIED.

Integration Experiences and Performance Studies of a COTS Parallel Archive System: A New Parallel Archive Storage System Concept and Implementation

Hsing-bung (HB) Chen, Gary Grider, Cody Scott, Milton Turley, Aaron Torres, Kathy Sanchez, John Bremer
Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA

September 22nd, 2010
IEEE International Conference on Cluster Computing 2010, Heraklion, Crete, Greece

The University of California operates Los Alamos National Laboratory for the National Nuclear Security Administration of the United States Department of Energy. LANL Document Number LA-UR

Abstract

Present and future archive storage systems are challenged to (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management, (d) scale to the changing needs of very large data sets, (e) support standard interfaces, and (f) utilize commercial-off-the-shelf (COTS) hardware. Parallel file systems face the same demands, but at one or more orders of magnitude higher performance. Archive system designs continue to converge with file system designs, driven by the need for speed and bandwidth, especially in metadata searching, adopting techniques such as more caching and less robust semantics. Currently, the number of highly scalable parallel archive solutions is very limited, especially for moving a single large striped parallel disk file onto many tapes in parallel. We believe that a hybrid storage approach combining COTS components with innovative software can bring new capabilities into a production environment for the HPC community, and that this approach is much faster than creating and maintaining a complete end-to-end, purpose-built parallel archive software solution. We relay our experience of integrating a global parallel file system and a standard backup/archive product with innovative parallel software to construct a scalable, parallel archive storage system. Our solution has a high degree of overlap with current parallel archive products, including (a) parallel movement to/from tape for a single large parallel file, (b) hierarchical storage management, (c) ILM features, (d) high-volume (non-single-parallel-file) archives for backup/archive/content management, and (e) leveraging the free file movement tools in Linux such as copy, move, ls, and tar.

We have successfully applied our working COTS Parallel Archive System to LANL's Roadrunner machine, the world's first petaflop/s computing system, and demonstrated its capability to address the requirements of future archival storage systems. This Parallel Archive System is now also used on LANL's Turquoise Network.

Agenda

1. Background
2. Issues, Motivation, and Leverage of Using a COTS Parallel Archive System
3. Proposed COTS Parallel Archive System
4. Performance Studies on LANL's Roadrunner Open Science Projects
5. Experience and Observed Issues of Our COTS Parallel Archive System
6. Summary and Future Work

The DOE Advanced Strategic Computing Initiative program published this Kiviat diagram, which shows parallel file system performance scaling an order of magnitude faster than parallel archive performance.

Background

- Parallel file systems and parallel I/O
- Hierarchical Storage Management (HSM)
- ILM: Information Lifecycle Management
- Non-parallel vs. parallel archive systems
- Parallel archives that do not leverage parallel file systems as their first tier of storage
- Parallel archives that do leverage parallel file systems as their first tier of storage

Archives That Do Not Leverage Parallel File Systems

Parallel Archives That Leverage Parallel File Systems

[Diagram: Cluster A and Cluster B mount a global parallel file system (the scratch file system) over a scalable storage area network. The NFS archive path reads from the PFS and writes over NFS; the migration path moves data from the global parallel file system into a parallel tape archive system (disks + tapes, HSM) through a file transfer agent.]

Motivation - 1

More leverage of parallel file systems to provide a parallel archive is possible and makes sense. Can we combine a parallel file system with a non-parallel, highly leveragable COTS archive solution, adding only the creative and unique code needed to provide the parallel archive service? If this can be realized, a huge cost savings in providing this kind of parallel data movement service could be realized.

Motivation - 2

- Disk is becoming more competitive with tape over time for a larger portion of archival data
- Moderate and growing market for global parallel file systems, with scalable bandwidth and metadata
- Growing use of global parallel file systems for moderate-scale HPC
- HSM and ILM features in file systems and archives
- High-volume (non-single-parallel-file) archives for backup/archive/content management
- Leverage of all the free file movement/management tools in Linux (copy, move, ls, tar, etc.): a well-known file management environment, with scp, sftp, and web/GUI file management for free

Challenges for Parallel Archive Systems

(a) scale to very high bandwidths,
(b) scale in metadata performance,
(c) support policy-based hierarchical storage management,
(d) scale in supporting the changing needs of very large data sets,
(e) support standard interfaces, and
(f) utilize commercial-off-the-shelf (COTS) hardware.

Proposed COTS Parallel Archive System

- Build a parallel tree walker and copy utility in user space
- Add storage pool (stgpool) support (using the file system API)
- Create an efficient ordered file retrieval utility (using the DMAPI interface and back-end tape system queries)
- Add support for ILM stgpool features
- Add support for ILM stgpool and co-location features in the archive back end
- Use FUSE to break up enormous files into pieces that can be migrated and recalled in parallel to/from the back-end tape system
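The FUSE-based chunking idea in the last bullet can be sketched in plain Python: split a large file into fixed-size pieces and copy each piece concurrently. This is an illustrative stand-in for the n-to-1 to n-to-n conversion, not PFTOOL code; `CHUNK_SIZE`, `chunk_ranges`, and `migrate_chunk` are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB per piece (tunable, like the FUSE ChunkSize)

def chunk_ranges(file_size, chunk_size=CHUNK_SIZE):
    """Yield (offset, length) pairs covering the whole file."""
    for offset in range(0, file_size, chunk_size):
        yield offset, min(chunk_size, file_size - offset)

def migrate_chunk(src_path, dst_path, offset, length):
    """Copy one byte range; each call could target a separate tape stream."""
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        src.seek(offset)
        dst.seek(offset)
        dst.write(src.read(length))

def parallel_migrate(src_path, dst_path, workers=4):
    """Copy one large file as many independent chunk copies."""
    size = os.path.getsize(src_path)
    with open(dst_path, "wb") as dst:  # pre-size the destination
        dst.truncate(size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for off, length in chunk_ranges(size):
            pool.submit(migrate_chunk, src_path, dst_path, off, length)
```

In the real system each chunk would go to a different mover process or tape stream; here threads writing disjoint offsets of one destination file illustrate the same decomposition.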

Proposed Parallel Archive System - PFTOOL

[Diagram: Cluster A and Cluster B connect through a parallel and scalable I/O network to a scalable File Transfer Agent (FTA) cluster running PFTOOL, backed by a storage area network. The FTA cluster sits between the scratch global parallel file system and a global parallel file system with ILM feeding a parallel tape archive system.]

Scalable FTA (File Transfer Agent) cluster:
- Mounts the site global file system and other site shared file systems
- Runs a commercial ILM-enabled parallel file system
- Runs one or multiple copies of a commercial backup archive
- Runs HSM
- Receives submitted jobs for optimized data movement to/from the archive

PFTOOL's Software Architecture

- MPI message-passing Manager (the conductor):
  - Coordinates the parallel tree walk
  - Balances tree walking against parallel data moving
  - Manages operations on the various queues
  - Assigns copy jobs to workers
  - Issues output/display requests
  - Generates the final statistics report
- Supporting processes: OutPutProc, WatchDog, TapeProcs, ReadDirProcs
- Workers: file stat, file copy, tape file restore
- Message queues: DirQ, NameQ, TapeQ, CopyQ, TapeCQ

PFTOOL's MPI Processes

- Manager process: the conductor
- OutPutProc process: display process
- WatchDog process: system status monitor
- ReadDir process: explores directories and subdirectories
- TapeProc process: tape data mover
- Worker process: parallel data mover to and from file systems
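The ReadDir/Worker division above can be sketched on a single node: several reader threads share a directory queue (a stand-in for DirQ), each pops a directory, pushes subdirectories back, and records the files found. This is a toy analogue using threads where PFTOOL uses MPI ranks; `parallel_tree_walk` and its internals are illustrative names.

```python
import os
import queue
import threading

def parallel_tree_walk(root, n_readers=4):
    """Walk a directory tree with several reader threads sharing a queue,
    collecting every file path found (symlinks are treated as files)."""
    dir_q = queue.Queue()          # DirQ: directories waiting to be read
    files, lock = [], threading.Lock()
    outstanding = [1]              # directories queued or in progress
    done = threading.Event()
    dir_q.put(root)

    def reader():
        while not done.is_set():
            try:
                d = dir_q.get(timeout=0.1)
            except queue.Empty:
                continue
            subdirs, found = [], []
            for entry in os.scandir(d):
                if entry.is_dir(follow_symlinks=False):
                    subdirs.append(entry.path)
                else:
                    found.append(entry.path)
            with lock:
                files.extend(found)
                outstanding[0] += len(subdirs) - 1
                if outstanding[0] == 0:   # nothing left anywhere: stop
                    done.set()
            for s in subdirs:
                dir_q.put(s)

    threads = [threading.Thread(target=reader) for _ in range(n_readers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return files
```

The `outstanding` counter is the key design point: a shared queue alone cannot tell an idle reader whether the walk is finished or another reader is about to enqueue more directories.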

Parallel File System Tree Walker

Parallel File System Tree Walker (continued)

PFTOOL's Runtime Environment

- LoadManager: generates the runtime MPI machine list periodically
- PFTOOL utilities: pfls, pfcp, pfcm
- MPI processes:
  1. One Manager MPI process
  2. One OutPutProc MPI process
  3. One or more ReadDirProc MPI processes
  4. One or more Worker MPI processes
  5. Zero or more TapeProc MPI processes
  6. One WatchDog MPI process
  NumProcs (MPI machine list) = sum of all MPI processes.
  Note: the number of TapeProcs is set to 0 during the archive process, giving more workers for copying data.
- Runtime tuning parameters: NumProcs, NumTapeProcs, ChunkSize, storage pool info, FUSE ChunkSize, CopySize
- FTA cluster runtime status: on/off, upgrade, testing
- GPFS/HSM/ILM/MySQL query service: runtime data migration and restore status
- ArchiveFUSE file system: converts a very large file's n-to-1 copy into n-to-n copies for scaling and performance improvement
- File Transfer Agent cluster: GPFS client / FUSE client

PFTOOL's Runtime Activities

- LoadManager: selects available processes to run on machines based on each machine's current CPU workload
- Tape optimization: reduces tape-thrashing overhead (mounting and unmounting tape drives) by lining up data for tape-optimized sequential archiving
- Single large file parallel copy: parallel I/O data movement on a single large file
- Very large file parallel copies: FUSE-enhanced implementation (conversion of an n-to-1 copy into an n-to-n copy)
- Runtime-tunable parameters for adjusting PFTOOL command performance: size of data chunks for copying, number of MPI processes, size threshold for FUSE file selection, number of tape drives used
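The tape-optimization step can be illustrated with a small scheduling sketch: group requested files by the tape they live on and sort each group by on-tape position, so every tape is mounted once and read sequentially. The `(path, tape_id, position)` records and `order_recalls` name are assumptions; PFTOOL obtains this information from a back-end tape system query.

```python
from collections import defaultdict

def order_recalls(requests):
    """requests: iterable of (path, tape_id, position) tuples.
    Returns a list of (tape_id, [paths in on-tape order]) so each tape
    is mounted once and streamed front to back, avoiding tape thrashing."""
    by_tape = defaultdict(list)
    for path, tape_id, position in requests:
        by_tape[tape_id].append((position, path))
    plan = []
    for tape_id in sorted(by_tape):                 # one mount per tape
        ordered = [p for _, p in sorted(by_tape[tape_id])]
        plan.append((tape_id, ordered))
    return plan
```

Handing each `(tape_id, paths)` group to a single mover process also matches the smart-recall rule mentioned later: all files on one tape are handled by the same machine.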

PFTOOL Software System

- pftool: C/MPI code, GPFS DMAPI code, and a MySQL database
- pftool commands: Perl scripts and Python scripts
- pftool loadmanager: Perl scripts
- Trashcan: open-source Python scripts, with modifications
- Reused/modified GNU Coreutils code: rm, copy, ...

Less Aggressive MPI Polling Implementation in PFTOOL

Figure 8-1: a typical aggressive-polling (AP) MPI main receiving loop. It blocks in MPI_Recv, and most MPI implementations busy-poll internally, occupying a full CPU core:

```c
while (1) {                          /* main receiving loop */
    MPI_Recv(/* message from proc */);
    /* ... process message ... */
}
```

Figure 8-2: the less-aggressive-polling (LAP) enhancement, which probes with MPI_Iprobe and sleeps between probes, yielding the CPU until a message arrives (note that msgready must be reset before waiting for the next message):

```c
int msgready = 0;
while (1) {                          /* main receiving loop */
    /* polling control enhancement */
    while (msgready == 0) {          /* message is not ready yet */
        MPI_Iprobe(fromProc, tag, comm, &msgready, &mpistatus);
        usleep(n_microseconds);      /* yield the CPU between probes */
    }
    msgready = 0;                    /* reset for the next message */
    MPI_Recv(/* message from proc */);
    /* ... process message ... */
}
```
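The same probe-and-sleep pattern can be shown as a small runnable analogue outside MPI, with a plain Python queue standing in for the MPI message layer; `lap_receive` and its parameters are illustrative names, not PFTOOL code.

```python
import queue
import time

def lap_receive(q, poll_interval=0.001, timeout=1.0):
    """Probe a queue with short sleeps in between, like MPI_Iprobe +
    usleep(); returns the first message, or None if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            return q.get_nowait()      # the non-blocking "probe"
        except queue.Empty:
            time.sleep(poll_interval)  # yield the CPU instead of spinning
    return None
```

The trade-off is the same as in the slides: a longer sleep lowers CPU occupancy but adds up to one poll interval of latency per message.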

Commands Supported in PFTOOL

- pfls: uses the parallel file tree walker to list files in parallel
- pfcp: uses the parallel file tree walker to copy files in parallel
- pfcm: uses the parallel file tree walker to compare source and destination files byte by byte; users run it to verify data integrity after a copy
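What a pfcm-style verification pass does per file pair can be sketched as a chunked byte comparison, so arbitrarily large files are checked without reading either fully into memory. The chunk size and `files_identical` name are assumptions, not PFTOOL internals.

```python
def files_identical(path_a, path_b, chunk_size=1 << 20):
    """Compare two files chunk by chunk; True iff contents match exactly."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(chunk_size)
            block_b = b.read(chunk_size)
            if block_a != block_b:
                return False          # differing bytes or differing lengths
            if not block_a:           # both files hit EOF together
                return True
```

In the parallel tool, each worker would run this check on the file pairs handed to it by the tree walker, so verification parallelizes across files.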

Top Level view of PFTOOL's System

Parallel Archive Setup for Roadrunner's Open Science Projects

- Roadrunner cluster: one petaflop/s
- Scratch file system (/panfs): 4 PB capacity, multiple 10 GigE switches
- 10 GPFS nodes (parallel data movers) running PFTOOL, mounting /panfs and /gpfs, behind one 10 GigE switch
- One TSM server with two 10 GigE links
- FC switch (FC-4)
- Six DS4800 fast-disk pools: 200 TB
- Five NSD nodes with a slow-disk pool: 200 TB
- LTO4 x 24 tape archive: over 4 PB

Number of files per archive copy job

Number of megabytes copied per job

Data bandwidth (MB/sec) per copy job

Average file size per copy job

MPI polling comparison studies - CPU occupancy

MPI polling comparison studies - data rate

Experience and Observed Issues of Our COTS Parallel Archive System

- Small-file tape performance: aggregation of small files, bundling them into larger aggregates better suited to getting the tape drive up to full speed, then writing each aggregate to tape
- Tape optimization / smart recall: ensure that all files in a tape-recall request are handled by the same machine (the tape-thrashing problem)
- Limitations of the synchronous deleter: the built-in synchronous delete function between GPFS and TSM
- Single TSM server: considering fail-over using multiple TSM servers
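The small-file aggregation idea in the first bullet can be sketched with the standard tarfile module: greedily pack files into tar aggregates of roughly a target size, so the tape drive can stream each aggregate at full speed. The target size, naming scheme, and `make_aggregates` function are illustrative assumptions, not the production mechanism.

```python
import os
import tarfile

def make_aggregates(paths, out_dir, target_bytes=256 * 1024 * 1024):
    """Greedily pack files into tar aggregates of roughly target_bytes.
    A file larger than the target still gets its own aggregate."""
    bundles, current, current_size = [], [], 0
    for p in paths:
        size = os.path.getsize(p)
        if current and current_size + size > target_bytes:
            bundles.append(current)       # close the current aggregate
            current, current_size = [], 0
        current.append(p)
        current_size += size
    if current:
        bundles.append(current)
    written = []
    for i, bundle in enumerate(bundles):
        out = os.path.join(out_dir, f"aggregate-{i:04d}.tar")
        with tarfile.open(out, "w") as tar:
            for p in bundle:
                tar.add(p, arcname=os.path.basename(p))
        written.append(out)
    return written
```

A real archive would also record a member index per aggregate so individual small files can be located and recalled without scanning every tar.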

Summary and Future Work

- Parallel movement to/from tape for a single large parallel file
- Hierarchical storage management
- ILM features
- High-volume (non-single-parallel-file) archives for backup/archive/content management
- Leveraging all the free file movement and management tools in Linux, such as copy, move, compare, and ls

Summary and Future Work (continued)

- We are currently generalizing the PFTOOL software to accommodate most parallel file systems, such as PVFSv2, GFS, Ceph, Lustre, and pNFS.
- We plan to add further parallel data movement commands to PFTOOL, such as parallel versions of chown, chmod, chgrp, find, touch, and grep.

Thanks. Q & A
