Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D Integration Experiences and Performance Studies of A COTS Parallel Archive.


1 Integration Experiences and Performance Studies of A COTS Parallel Archive System - A New Parallel Archive Storage System Concept and Implementation
Hsing-bung (HB) Chen, Gary Grider, Cody Scott, Milton Turley, Aaron Torres, Kathy Sanchez, John Bremer
Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA
September 22nd, 2010
IEEE International Conference on Cluster Computing 2010, Heraklion, Crete, Greece
The University of California operates Los Alamos National Laboratory for the National Nuclear Security Administration of the United States Department of Energy.
LANL Document Number LA-UR-10-06115

2 Abstract
Present and future archive storage systems are challenged to (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management, (d) scale to meet the changing needs of very large data sets, (e) support standard interfaces, and (f) use commercial-off-the-shelf (COTS) hardware. Parallel file systems are expected to meet the same requirements at performance one or more orders of magnitude higher. Archive systems continue to converge on file systems in their design because of the need for speed and bandwidth, especially metadata search speed, which favors more caching and less robust semantics. Currently, very few extremely scalable parallel archive solutions exist, especially for moving a single large striped parallel disk file onto many tapes in parallel.
We believe that a hybrid storage approach that combines COTS components with innovative software can bring new capabilities into a production environment for the HPC community. This approach is much faster than creating and maintaining a complete, unique end-to-end parallel archive software solution. We report our experience of integrating a global parallel file system and a standard backup/archive product with an innovative parallel software code to construct a scalable, parallel archive storage system. Our solution has a high degree of overlap with current parallel archive products, including (a) parallel movement to/from tape for a single large parallel file, (b) hierarchical storage management, (c) ILM features, (d) high-volume (non-single parallel file) archives for backup/archive/content management, and (e) leveraging all free file movement tools in Linux such as copy, move, ls, tar, etc.
We have successfully applied our working COTS Parallel Archive System to the world's first petaflop/s computing system, LANL's Roadrunner machine, and demonstrated its capability to address the requirements of future archival storage systems. This new Parallel Archive System is now used on LANL's Turquoise Network.

3 Agenda
1. Background
2. Issues, Motivation, and Leverage of Using a COTS Parallel Archive System
3. Proposed COTS Parallel Archive System
4. Performance Studies on LANL's Roadrunner Open Science Projects
5. Experience and Observed Issues of Our COTS Parallel Archive System
6. Summary and Future Works

4 The DOE Advanced Strategic Computing Initiative Program published this Kiviat diagram, which shows parallel file systems scaling in performance an order of magnitude faster than parallel archives.

5 Background
- Parallel File Systems & Parallel I/O
- Hierarchical Storage Management (HSM)
- Information Lifecycle Management (ILM)
- Non-Parallel vs. Parallel Archive Systems
- Parallel Archives That Do Not Leverage Parallel File Systems as Their First Tier of Storage
- Parallel Archives That Do Leverage Parallel File Systems as Their First Tier of Storage

6 Archives That Do Not Leverage Parallel File Systems

7 Parallel Archives That Leverage Parallel File Systems
Diagram labels:
- Cluster A + Cluster B
- PFS: Global Parallel File System (scratch file system)
- Scalable Storage Area Network
- NFS archive path: read PFS, write NFS
- Migration path
- Global Parallel File System + Parallel Tape Archive System: Disks + Tapes -> HSM
- File Transfer Agent

8 Motivation - 1
- More leverage of parallel file systems to provide a parallel archive is possible and makes sense.
- Can we combine COTS parallel file systems and COTS non-parallel archive products with creative, unique code to build a highly leveraged parallel archive service?
- If this can be realized, it could yield a large cost saving in providing this kind of parallel data movement service.

9 Motivation - 2
- Disk is becoming more competitive with tape over time for a larger portion of archival data
- Moderate and growing volume in the Global Parallel File Systems market
- Scalable bandwidth and metadata
- Growing use of Global Parallel File Systems for moderate-scale HPC
- HSM and ILM features in file systems and archives
- High-volume (non-single parallel file) archives for backup/archive/content management
- Leverage all free file movement/management tools in Linux (copy, move, ls, tar, etc.): a well-known file management environment, with scp, sftp, and web/GUI file management for free

10 Challenges for a Parallel Archive System
(a) scale to very high bandwidths,
(b) scale in metadata performance,
(c) support policy-based hierarchical storage management capability,
(d) scale in supporting the changing needs of very large data sets,
(e) support standard interfaces, and
(f) utilize commercial-off-the-shelf (COTS) hardware.

11 Proposed COTS Parallel Archive System
- Build a parallel tree walker and copy user-space utility
- Add storage pool (stgpool) support (using the file system API)
- Create an efficient ordered file retrieval utility (using DMAPI and back-end tape system queries)
- Add support for ILM stgpool features
- Add support for ILM stgpool and co-location features in the archive back end
- Use FUSE to break up enormous files into pieces that can be migrated and recalled in parallel to/from the back-end tape system
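The last item on this slide, breaking an enormous file into pieces that can move in parallel, can be illustrated with a small sketch. This is not PFTOOL or ArchiveFUSE code (the slides describe those as C/MPI and FUSE); the function name `split_into_chunks` and the byte-range representation are assumptions made only for illustration.

```python
def split_into_chunks(file_size, chunk_size):
    """Return (offset, length) pairs that together cover file_size bytes.

    Hypothetical helper: each pair could be handed to a separate data
    mover so that one large file is migrated or recalled as N pieces.
    """
    if file_size < 0 or chunk_size <= 0:
        raise ValueError("file_size must be >= 0 and chunk_size must be > 0")
    chunks = []
    offset = 0
    while offset < file_size:
        # The final chunk may be shorter than chunk_size.
        length = min(chunk_size, file_size - offset)
        chunks.append((offset, length))
        offset += length
    return chunks
```

The chunk size plays the role of a tunable parameter: smaller chunks give more parallelism, at the cost of more pieces to track.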

12 Proposed Parallel Archive System - PFTOOL
Diagram labels: Cluster A and Cluster B; PFS (Parallel File System) I/O; parallel and scalable I/O networking system; PFS scratch Global Parallel File System; Storage Area Network; scalable FTA cluster running PFTOOL as parallel data movers; Global Parallel File System/ILM; Parallel Tape Archive System.
Scalable FTA (File Transfer Agent) cluster:
- Mounts the site Global File System and other site shared file systems
- Runs a commercial ILM-enabled Parallel File System
- Runs one or multiple copies of a commercial backup/archive product
- Runs HSM
- Submits jobs to the FTA cluster for optimized data movement to/from the archive

13 PFTOOL's Software Architecture
Manager (MPI message passing manager, the conductor):
- Coordinates the parallel tree walk
- Balances the file tree walk against parallel data moving
- Manages the various queue operations
- Assigns copy jobs to workers
- Issues output/display requests
- Generates the final statistics report
Other processes: OutPutProc, WatchDog, TapeProcs, ReadDirProcs, Workers (file stat, file copy, tape file restore)
Message queues: DirQ, NameQ, TapeQ, CopyQ, TapeCQ

14 PFTOOL's MPI Processes
- Manager process: conductor
- OutPutProc process: display process
- WatchDog process: system status monitor
- ReadDir process: explores directories and sub-directories
- TapeProc process: tape data mover
- Worker process: parallel data mover to and from file systems
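The flow between the directory queue, the name queue, and the workers can be sketched in a few lines. The sketch below is an assumption-laden stand-in, not PFTOOL: it runs in one process, uses Python's standard library instead of MPI, and the function names `walk_tree` and `stat_files` are hypothetical. Only the queue names `DirQ` and `NameQ` come from the slides.

```python
import os
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def walk_tree(root):
    """Breadth-first tree walk driven by two queues (serial stand-in)."""
    dir_q = deque([root])   # plays the role of DirQ: directories to read
    name_q = deque()        # plays the role of NameQ: files found so far
    while dir_q:
        current = dir_q.popleft()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    dir_q.append(entry.path)
                else:
                    name_q.append(entry.path)
    return sorted(name_q)


def stat_files(paths, num_workers=4):
    """Stat the discovered files with a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        sizes = list(pool.map(os.path.getsize, paths))
    return dict(zip(paths, sizes))
```

In the architecture on the slides, the queues are served by separate MPI processes (ReadDirProcs fill the name queue while Workers drain it), so the walk and the data movement overlap instead of running one after the other as they do here.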

15 Parallel File System Tree Walker

16 Parallel File System Tree Walker (continued)

17 PFTOOL's Runtime Environment
- LoadManager: periodically generates the runtime MPI machine list
- PFTOOL utilities: pfls, pfcp, pfcm
- PFTOOL MPI processes:
  1. One Manager MPI process
  2. One OutPutProc MPI process
  3. One or more ReadDirProc MPI processes
  4. One or more Worker MPI processes
  5. Zero or more TapeProc MPI processes
  6. One WatchDog MPI process
  NumProc (MPI machine list) = sum of all MPI processes
  Note: the number of TapeProcs is set to 0 during archiving, which leaves more Workers for copying data.
- Runtime tuning parameters: NumProcs, NumTapeProcs, ChunkSize, StoragePool info, FUSE ChunkSize, CopySize
- FTA cluster runtime status: On/Off, Upgrade, Testing
- GPFS/HSM/ILM/MySQL query service: runtime data migration and restore status
- ArchiveFUSE file system: converts an "N-to-1" copy of a very large file into N-to-N copies for scaling and performance improvement
- File Transfer Agent cluster: GPFS client / FUSE client
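The process-count rule on this slide is simple arithmetic, shown here as a short sketch. The function names are hypothetical; the fixed roles (one Manager, one OutPutProc, one WatchDog) and the minimums for the other roles are taken from the list on the slide.

```python
FIXED_PROCESSES = 3  # one Manager, one OutPutProc, one WatchDog


def total_mpi_processes(num_readdir, num_workers, num_tape):
    """NumProc = sum of all MPI processes started for one PFTOOL run."""
    if num_readdir < 1 or num_workers < 1 or num_tape < 0:
        raise ValueError("need >= 1 ReadDirProc, >= 1 Worker, >= 0 TapeProc")
    return FIXED_PROCESSES + num_readdir + num_workers + num_tape


def workers_available(num_procs, num_readdir, num_tape):
    """Workers left over once the other roles are taken out of NumProcs."""
    workers = num_procs - FIXED_PROCESSES - num_readdir - num_tape
    if workers < 1:
        raise ValueError("NumProcs is too small to leave at least one Worker")
    return workers
```

With `NumProcs` fixed at 16 and one ReadDirProc, setting the TapeProc count to 0 leaves 12 Workers, compared with 8 when four TapeProcs are reserved. This is the effect described in the slide's note about archiving.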

18 PFTOOL's Runtime Activities
- LoadManager: selects the machines on which to run MPI processes, based on each machine's current CPU workload
- Tape optimization: reduces tape-thrashing overhead (mounting and unmounting tape drives) and lines up data for tape-optimized sequential archiving
- Single large file parallel copy: parallel I/O data movement on a single large file
- Very large file parallel copies: FUSE-enhanced implementation (conversion of an N-to-1 copy into an N-to-N copy)
- Runtime tunable parameters for adjusting the performance of PFTOOL commands: size of the data chunk for copying, number of MPI processes, size of FUSE file selection, number of tape drives used
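A load-based machine selection of the kind described for the LoadManager might look like the following sketch. The slides say only that the real LoadManager is written in Perl and works from CPU workload; the function name, the host names, the load threshold, and the use of a load-average number are all assumptions.

```python
def select_machines(loads, max_load, limit=None):
    """Pick hosts whose load is at or below max_load, least loaded first.

    loads maps a host name to its current load figure (hypothetical input,
    for example a one-minute load average collected elsewhere).
    """
    candidates = sorted(
        (load, host) for host, load in loads.items() if load <= max_load
    )
    hosts = [host for _, host in candidates]
    return hosts if limit is None else hosts[:limit]
```

For example, `select_machines({"fta1": 0.2, "fta2": 3.5, "fta3": 0.9}, max_load=1.0)` returns `["fta1", "fta3"]`, which could then be written out as the MPI machine list for the next job.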

19 PFTOOL Software System
- Pftool: 7000+ lines of C/MPI code
- GPFS dsm API code + MySQL database
- Pftool commands: Perl scripts, Python scripts
- Pftool LoadManager: Perl scripts
- Trashcan: open source Python scripts with modifications
- Reused/modified GNU Coreutils code: rm, copy, ...

20 Less Aggressive MPI Polling Implementation in PFTOOL
Figure 8-1: A typical aggressive polling (AP) MPI main receiving loop
    while (1) {   /* main receiving loop */
        /* buf and count describe the receive buffer (names assumed) */
        MPI_Recv(buf, count, MPI_BYTE, fromProc, tag, comm, &mpistatus);
        /* ... process message ... */
    }
Figure 8-2: An enhanced less aggressive polling (LAP) loop with MPI_Iprobe checking
    int msgready = 0;
    while (1) {   /* main receiving loop */
        msgready = 0;   /* reset so that every iteration probes again */
        while (msgready == 0) {   /* message is not ready yet */
            MPI_Iprobe(fromProc, tag, comm, &msgready, &mpistatus);
            if (msgready == 0)
                usleep(n);   /* back off for n microseconds */
        }
        MPI_Recv(buf, count, MPI_BYTE, fromProc, tag, comm, &mpistatus);
        /* ... process message ... */
    }
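The probe-then-sleep pattern of Figure 8-2 can be tried without an MPI installation. The sketch below mimics it with Python's standard `queue` module: `q.empty()` stands in for `MPI_Iprobe` and `q.get()` for `MPI_Recv`. The function name, the sleep interval, and the poll counter are illustrative assumptions, and the code assumes a single consumer per queue.

```python
import queue
import threading
import time


def receive_with_sleep(q, sleep_seconds):
    """Wait for one message, sleeping between non-blocking probes."""
    polls = 0
    while True:
        polls += 1
        if not q.empty():          # non-blocking probe, like MPI_Iprobe
            return q.get(), polls  # a message is ready, so this returns at once
        time.sleep(sleep_seconds)  # yield the CPU instead of spinning


def demo():
    """Deliver one message after a short delay and receive it."""
    q = queue.Queue()
    sender = threading.Timer(0.05, q.put, args=("copy-job",))
    sender.start()
    message, polls = receive_with_sleep(q, sleep_seconds=0.01)
    sender.join()
    return message, polls
```

The sleep interval is the trade-off knob: a longer sleep lowers CPU occupancy while idle but adds up to one interval of latency before a message is noticed.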

21 Commands Supported in PFTOOL
- pfls: uses the parallel file tree walker to list files in parallel
- pfcp: uses the parallel file tree walker to copy files in parallel
- pfcm: uses the parallel file tree walker to compare source and destination files byte by byte. Users run it to verify the data integrity of files after a copy.
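The byte-content comparison that pfcm performs on each source/destination pair can be sketched for a single pair as follows. This is a serial illustration with a hypothetical function name and block size, not the pfcm implementation, which distributes the work across MPI processes.

```python
import os


def files_identical(src, dst, block_size=1 << 20):
    """Return True if src and dst hold exactly the same bytes."""
    # Different sizes can never match, so skip reading the data.
    if os.path.getsize(src) != os.path.getsize(dst):
        return False
    with open(src, "rb") as a, open(dst, "rb") as b:
        while True:
            block_a = a.read(block_size)
            block_b = b.read(block_size)
            if block_a != block_b:
                return False
            if not block_a:  # both files reached end of file together
                return True
```

Reading in fixed-size blocks keeps memory use constant regardless of file size, which matters for the very large files this archive targets.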

22 Top-Level View of PFTOOL's System

23 Parallel Archive Setup for Roadrunner's Open Science Project
Diagram labels:
- Roadrunner cluster: one petaflop/s
- Scratch file system /panfs: 4 petabytes capacity
- Multiple 10GigE switches
- One 10GigE switch
- Two 10GigE links
- 10 GPFS nodes (parallel data movers) running PFTOOL, mounting /panfs and /gpfs
- One TSM server
- FC switch (FC-4)
- Six DS4800: fast disk pool, 200 TB
- Five NSD nodes with slow disk pool, 200 TB
- LTO4 x 24: tape archive, over 4 petabytes

24 Number of files per archive copy job

25 Number of megabytes copied per job

26 Data bandwidth (MB/sec) per copy job

27 Average file size per copy job

28 MPI polling comparison studies: CPU occupancy

29 MPI polling comparison studies: data rate

30 Experience and Observed Issues of Our COTS Parallel Archive System
- Small file tape performance: aggregate small files by bundling them into larger aggregates that are better suited to getting the tape drive up to full speed, then write the aggregate to tape
- Tape optimization / smart recall: ensure that all files in a tape-recall request are handled by the same machine (the tape thrashing problem)
- Limitations of the synchronous deleter: the built-in synchronous delete function between GPFS and TSM
- Single TSM server: considering fail-over using multiple TSM servers
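The small-file aggregation idea in the first item can be sketched as a greedy bundler. The function name, the `(name, size)` input format, and the target size are assumptions; the slides do not say how the production system forms its aggregates.

```python
def bundle_small_files(files, target_size):
    """Group (name, size) pairs, in order, into bundles near target_size.

    A bundle is closed as soon as adding the next file would push it past
    target_size. A single file larger than target_size gets its own bundle.
    """
    bundles = []
    current = []
    current_size = 0
    for name, size in files:
        if current and current_size + size > target_size:
            bundles.append(current)
            current = []
            current_size = 0
        current.append(name)
        current_size += size
    if current:
        bundles.append(current)
    return bundles
```

Each bundle would then be written to tape as one object (for example a tar file), so the drive streams one large write instead of stopping and starting for every small file.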

31 Summary & Future Works
- Parallel movement to/from tape for a single large parallel file
- Hierarchical storage management
- ILM features
- High-volume (non-single parallel file) archives for backup / archive / content management
- Leveraging all free file movement and management tools in Linux, such as copy, move, compare, ls, etc.

32 Summary & Future Works (continued)
- We are currently generalizing the PFTOOL software so that it accommodates most parallel file systems, such as PVFSv2, GFS, Ceph, Lustre, and pNFS.
- We plan to add further parallel commands to PFTOOL, such as parallel versions of chown, chmod, chgrp, find, touch, and grep.

33 Thanks. Q & A


