IO Best Practices for Franklin
Katie Antypas, User Services Group
NERSC User Group Meeting, September 19, 2007

Outline
Goals and scope of the tutorial
IO formats
Parallel IO strategies
Striping
Recommendations
Thanks to Julian Borrill, Hongzhang Shan, John Shalf and Harvey Wasserman for slides and data, to Nick Cardo for Franklin/Lustre tutorials, and to the NERSC IO group for feedback.

Goals
At a very high level, answer the question "How should I do my IO on Franklin?"
With X GB of data to output running on Y processors -- do this.

Axes of IO
Striping, Total Output Size, Number of Processors, IO Library, File Size per Processor, Chunking, Number of Files per Output Dump, Blocksize, Transfer Size, File System Hints, Strided or Contiguous Access, Collective vs. Independent, Weak vs. Strong Scaling
This is why IO is complicated...

Axes of IO covered in this tutorial
Striping, Total File Size, Number of Processors, IO Library, File Size per Processor, Number of Writers, Blocksize, Transfer Size
Primarily large-block IO, with transfer size the same as blocksize
Used HDF5
Strong scaling
Some basic tips

Parallel I/O: A User Perspective
Wish list:
–Write data from multiple processors into a single file
–The file can be read in the same manner regardless of the number of CPUs that read from or write to it (i.e. we want to see the logical data layout, not the physical layout)
–Do so with the same performance as writing one file per processor (users only write one file per processor because of performance problems)
–And make all of the above portable from one machine to the next

I/O Formats

Common Storage Formats
ASCII:
–Slow
–Takes more space!
–Inaccurate
Binary:
–Non-portable (e.g. byte ordering and type sizes)
–Not future-proof
–Parallel I/O using MPI-IO
Self-describing formats:
–NetCDF/HDF4, HDF5, Parallel NetCDF
–Example in HDF5: the API implements an object DB model in a portable file
–Parallel I/O using pHDF5/pNetCDF (hides MPI-IO)
Community file formats:
–FITS, HDF-EOS, SAF, PDB, Plot3D
–Modern implementations are built on top of HDF, NetCDF, or other self-describing object-model APIs
Many NERSC users are at the raw binary level; we would like to encourage users to transition to a higher-level IO library.

HDF5 Library
Can store data structures, arrays, vectors, grids, complex data types, text
Can use basic HDF5 types (integers, floats, reals) or user-defined types such as multi-dimensional arrays, objects and strings
Stores the metadata necessary for portability: endian type, size, architecture
HDF5 is a general-purpose library and file format for storing scientific data

HDF5 Data Model
Groups
–Arranged in a directory hierarchy
–The root group is always "/"
Datasets
–Dataspace
–Datatype
Attributes
–Bind to groups and datasets
References
–Similar to softlinks
–Can also be subsets of data
(Figure: example file layout; the root group "/" holds datasets "Dataset0" and "Dataset1" (each with a type and space), a subgroup "subgrp" containing "Dataset0.1" and "Dataset0.2", and attributes such as "time", "validity"=None, "author"=Jane Doe, "date"=10/24/2006.)
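To make the data model concrete, here is a minimal serial C sketch (assuming the HDF5 1.8+ API; the file, group, dataset and attribute names are made up for illustration) that creates a group, a dataset and an attribute:

    #include "hdf5.h"

    int main(void)
    {
        hsize_t dims[2] = {4, 6};
        int     data[4][6] = {{0}};

        /* Create the file; the root group "/" comes with it */
        hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        /* A subgroup under root */
        hid_t group = H5Gcreate2(file, "/subgrp", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Dataset = dataspace (shape) + datatype (here native int) */
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dset  = H5Dcreate2(file, "/Dataset0", H5T_NATIVE_INT, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        /* Attribute bound to the dataset */
        int   version = 1;
        hid_t ascal   = H5Screate(H5S_SCALAR);
        hid_t attr    = H5Acreate2(dset, "version", H5T_NATIVE_INT, ascal,
                                   H5P_DEFAULT, H5P_DEFAULT);
        H5Awrite(attr, H5T_NATIVE_INT, &version);

        H5Aclose(attr); H5Sclose(ascal);
        H5Dclose(dset); H5Sclose(space);
        H5Gclose(group); H5Fclose(file);
        return 0;
    }

A file written this way can later be inspected with h5dump, which is part of what makes the format self-describing.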

A Plug for Self-Describing Formats...
Application developers shouldn't have to care about the physical layout of data
Using your own binary file format forces you to understand the layers below the application to get optimal IO performance
Every time the code is ported to a new machine, or the underlying file system is changed or upgraded, the user must make changes to maintain IO performance
Let other people do the work
–HDF5 can be optimized for given platforms and file systems by the HDF5 developers
–The user can stay at the high level
But what about performance?

IO Library Overhead
Data from Hongzhang Shan
Very little, if any, overhead from HDF5 for one-file-per-processor IO compared to POSIX and MPI-IO

Ways to do Parallel IO

Serial I/O
Each processor sends its data to a master process, which then writes all the data to a single file
Advantages:
–Simple
–May perform OK for very small IO sizes
Disadvantages:
–Not scalable
–Not efficient; slow for any large number of processors or data sizes
–May not be possible if the master is memory constrained
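A minimal sketch of this pattern in C with MPI follows (the buffer size and file name are hypothetical):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N_PER_PROC 1024            /* doubles per processor (made up) */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double local[N_PER_PROC];
        for (int i = 0; i < N_PER_PROC; i++) local[i] = rank;

        /* Master collects everything, then writes alone: simple but not scalable */
        double *all = NULL;
        if (rank == 0) all = malloc((size_t)nprocs * N_PER_PROC * sizeof(double));

        MPI_Gather(local, N_PER_PROC, MPI_DOUBLE,
                   all,   N_PER_PROC, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            FILE *fp = fopen("output.dat", "wb");
            fwrite(all, sizeof(double), (size_t)nprocs * N_PER_PROC, fp);
            fclose(fp);
            free(all);
        }
        MPI_Finalize();
        return 0;
    }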

Parallel I/O: Multi-file
Each processor writes its own data to a separate file
Advantages:
–Simple to program
–Can be fast (up to a point)
Disadvantages:
–Can quickly accumulate many files
–With Lustre, can hit metadata server limits
–Hard to manage
–Requires post-processing
–Many small files are difficult for storage systems such as HPSS to handle
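A sketch of the one-file-per-processor pattern (the per-rank file naming is illustrative):

    #include <mpi.h>
    #include <stdio.h>

    #define N_PER_PROC 1024                 /* doubles per processor (made up) */

    int main(int argc, char **argv)
    {
        int rank;
        double local[N_PER_PROC];
        char fname[256];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < N_PER_PROC; i++) local[i] = rank;

        /* Every rank opens and writes its own file: fast, but produces as many
           files as processors, which must be stitched back together later */
        snprintf(fname, sizeof(fname), "output.%06d.dat", rank);
        FILE *fp = fopen(fname, "wb");
        fwrite(local, sizeof(double), N_PER_PROC, fp);
        fclose(fp);

        MPI_Finalize();
        return 0;
    }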

Flash Center IO Nightmare...
Large 32,000-processor run on the LLNL BG/L
Parallel IO libraries not yet available
IO-intensive application:
–Checkpoint files: 0.7 TB each, dumped every 4 hours, 200 dumps; full-resolution snapshots of the entire grid, used for restarting the run
–Plotfiles: 20 GB each, 700 dumps; grid coarsened by factor-of-two averaging, single precision, subset of grid variables
–Particle files: 1400 particle files, 470 MB each
154 TB of disk capacity, 74 million files!
Unix tool problems
Two years later, still trying to sift through the data and sew files together

Parallel I/O: Single-file
Each processor writes its own data to the same file using an MPI-IO mapping
Advantages:
–Single file
–Manageable data
Disadvantages:
–Lower performance than one file per processor at some concurrencies

Parallel IO: Single file
Each processor writes to a section of a data array. Each must know its offset from the beginning of the array and the number of elements to write (see the MPI-IO sketch below).
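A minimal MPI-IO sketch of this offset calculation (the array size and file name are made up; a real code would also check return codes):

    #include <mpi.h>

    #define N_PER_PROC 1024                 /* doubles per processor (made up) */

    int main(int argc, char **argv)
    {
        int rank;
        double local[N_PER_PROC];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < N_PER_PROC; i++) local[i] = rank;

        /* Offset into the shared file = rank * (bytes written per rank) */
        MPI_Offset offset = (MPI_Offset)rank * N_PER_PROC * sizeof(double);

        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at_all(fh, offset, local, N_PER_PROC, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }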

Trade-offs
It isn't hard to have speed, portability or usability; it is hard to have speed, portability and usability in the same implementation
Ideally users want all three:
–Speed: one file per processor
–Portability: a high-level IO library
–Usability: a single shared file, with your own file format or a community file format layered on top of a high-level IO library

Benchmarking Methodology and Results

Disclaimer
IO runs were done during production time
Rates are dependent on other jobs running on the system
Focus on trends rather than on one or two outliers
Some tests ran twice, others only once

Peak IO Performance on Franklin
Expectation that IO rates will continue to rise linearly
Back end saturates around ~250 processors
Weak-scaling IO, ~300 MB/proc
Peak performance ~11 GB/sec (5 DDNs * ~2 GB/sec)
(Image from Julian Borrill)

Description of IOR
Developed by LLNL, used for the Purple procurement
Focuses on parallel/sequential read/write operations that are typical in scientific applications
Can exercise one-file-per-processor or shared-file access for a common set of testing parameters
Exercises an array of modern file APIs such as MPI-IO, POSIX (shared or unshared), HDF5 and Parallel-NetCDF
Parameterized parallel file access patterns to mimic different application situations
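For reference, an IOR run on a Cray XT might be launched as sketched below. The flags shown are standard IOR options (-a API, -b block size per task, -t transfer size, -F file per process, -o output path), but the processor count, sizes and paths are purely illustrative:

    % aprun -n 1024 ./IOR -a HDF5 -b 24m -t 24m -o $SCRATCH/ior_testfile       # single shared HDF5 file
    % aprun -n 1024 ./IOR -a POSIX -F -b 24m -t 24m -o $SCRATCH/ior_testfile   # one file per processor (-F)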

Benchmark Methodology
Focus on the performance difference between a single shared file and one file per processor

Benchmark Methodology
Using the IOR HDF5 interface, contiguous IO
Not intended to be a scaling study
Blocksize and transfer size are always the same, but vary from run to run
Goal is to fill out a chart of the best IO strategy, with aggregate output size (100 MB, 1 GB, 10 GB, 100 GB, 1 TB) on one axis and number of processors on the other

Small Aggregate Output Sizes, 100 MB - 1 GB
One file per processor vs. shared file, rate in GB/sec
(Charts for aggregate file sizes of 100 MB and 1 GB; a peak-performance line is marked - anything greater than this is due to caching effects or timer granularity)
Clearly the "one file per processor" strategy wins in the low-concurrency cases, correct?

Small Aggregate Output Sizes, 100 MB - 1 GB
One file per processor vs. shared file, absolute time
(Charts for aggregate file sizes of 100 MB and 1 GB)
But when looking at absolute time, the difference doesn't seem so big...

Aggregate Output Size 100 GB
One file per processor vs. shared file; rate in GB/sec and time in seconds
(Chart annotations: peak-performance line; 390 MB/proc; 24 MB/proc; ~2.5 mins)
Is there anything we can do to improve the performance of the 4096-processor shared-file case?

Hybrid Model
Examine the 4096-processor case more closely
Group subsets of processors to write to separate shared files
Try grouping 64, 256, 512, 1024, and 2048 processors per file to see the performance difference from the file-per-processor case vs. the single-shared-file case

Effect of Grouping Processors into Separate Smaller Shared Files
100 GB aggregate output size on 4096 procs; each processor writes out 24 MB
The only difference between runs is the number of files into which processors are grouped (1 file per proc; 64, 512 or 2048 procs writing to a single file; one single shared file)
A new MPI communicator was created in IOR to support multiple shared files (see the communicator-split sketch below)
The user gains some performance from grouping files
Since very little data is written per processor, overhead for synchronization dominates
(Chart: rate vs. number of files)
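The grouping itself can be expressed with a communicator split. The following C sketch mirrors the approach described above, though not the exact IOR modification; the group size, file names and data sizes are made up:

    #include <mpi.h>
    #include <stdio.h>

    #define N_PER_PROC 1024        /* doubles per processor (made up)   */
    #define PROCS_PER_FILE 512     /* e.g. 4096 procs -> 8 shared files */

    int main(int argc, char **argv)
    {
        int rank, frank;
        double local[N_PER_PROC];
        char fname[256];
        MPI_Comm filecomm;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < N_PER_PROC; i++) local[i] = rank;

        /* Ranks with the same color share one file */
        int color = rank / PROCS_PER_FILE;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &filecomm);
        MPI_Comm_rank(filecomm, &frank);

        /* Each group of PROCS_PER_FILE ranks writes collectively to its own file */
        snprintf(fname, sizeof(fname), "shared.%04d.dat", color);
        MPI_Offset offset = (MPI_Offset)frank * N_PER_PROC * sizeof(double);

        MPI_File_open(filecomm, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_write_at_all(fh, offset, local, N_PER_PROC, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Comm_free(&filecomm);
        MPI_Finalize();
        return 0;
    }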

Aggregate Output Size 1 TB
One file per processor vs. shared file; rate in GB/sec and time in seconds
(Chart annotations: 976 MB/proc; 244 MB/proc; ~3 mins)
Is there anything we can do to improve the performance of the 4096-processor shared-file case?

Effect of Grouping Processors into Separate Smaller Shared Files (1 TB case)
Each processor writes out 244 MB
The only difference between runs is the number of files into which processors are grouped (1 file per proc; 64, 512 or 2048 procs writing to a single file; one single shared file)
A new MPI communicator was created in IOR to support multiple shared files
The effect from grouping files is fairly substantial
But do users want to do this? It is important to show the HDF5 developers, to make splitting files easier in the API.

Effect of Grouping Processors into Separate Smaller Shared Files
Each processor writes out 488 MB
The only difference between runs is the number of files into which processors are grouped (1 file per proc; 64 or 512 procs writing to a single file; one single shared file)
A new MPI communicator was created in IOR to support multiple shared files

What is Striping?
The Lustre file system on Franklin is made up of an underlying set of file systems called Object Storage Targets (OSTs), essentially a set of parallel IO servers
A file is said to be striped when read and write operations access multiple OSTs concurrently
Striping can be a way to increase IO performance, since writing to or reading from multiple OSTs simultaneously increases the available IO bandwidth

What is Striping? (continued)
File striping will most likely improve performance for applications which read or write to a single (or multiple) large shared files
Striping will likely have little effect for the following types of IO patterns:
–Serial IO, where a single processor performs all the IO
–Multiple nodes perform IO, but access files at different times
–Multiple nodes perform IO simultaneously to different files that are small (each < 100 MB)
–One file per processor

Striping Commands
Striping can be set at the file or directory level
If you set striping on a directory, all files created in that directory will inherit the striping level of the directory
Moving a file into a directory with a set striping will NOT change the striping of that file
Striping is set with lfs setstripe (an example follows below); its parameters are:
stripe-size
–Number of bytes in each stripe (a multiple of the 64 KB block)
OST offset
–Always keep this -1
–Chooses the starting OST in round-robin fashion
stripe-count
–Number of OSTs to stripe over
–-1 stripes over all OSTs
–1 stripes over one OST
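For example, using the positional lfs setstripe syntax of Lustre releases from this era (arguments in the order stripe size, OST offset, stripe count; later releases use named options instead, and the directory name here is illustrative):

    % lfs setstripe mydirectory 0 -1 8    # default stripe size, round-robin start OST, stripe over 8 OSTs
    % lfs getstripe mydirectory           # verify the striping settings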

Stripe-Count Suggestions
Franklin default striping:
–1 MB stripe size
–Round-robin starting OST (OST offset -1)
–Stripe over 4 OSTs (stripe count 4)
Many small files, one file per proc:
–Use the default striping
–Or 0 -1 1
Large shared files:
–Stripe over all available OSTs (0 -1 -1)
–Or some number larger than 4 (0 -1 X)
Stripe over odd numbers? Prime numbers?

Recommendations (summary chart)
Chart axes: aggregate file size (100 MB, 1 GB, 10 GB, 100 GB, 1 TB) vs. number of processors
Legend of strategies:
–Single shared file, default or no striping
–Single shared file, stripe over some OSTs (~10)
–Single shared file, stripe over many OSTs
–Single shared file, stripe over many OSTs, OR file per processor with default striping
–Benefits to mod-n shared files
–N/A

Recommendations
Think about the big picture:
–Run-time vs. post-processing trade-off
–Decide how much IO overhead you can afford
–Data analysis
–Portability
–Longevity: h5dump works on all platforms, and you can view an old file with h5dump; if you use your own binary format, you must keep track of not only your file-format version but the version of your file reader as well
–Storability

Recommendations
Use a standard IO format, even if you are following a one-file-per-processor model
The one-file-per-processor model really only makes sense when writing out very large files at high concurrencies; for small files, the overhead is low
If you must do one-file-per-processor IO, then at least put it in a standard IO format so the pieces can be put back together more easily
Splitting large shared files into a few files appears promising
–An option for some users, but requires code changes and output format changes
–Could be implemented better in IO library APIs
Follow the striping recommendations
Ask the consultants, we are here to help!

Questions?