1
Using IOR to Analyze the I/O Performance
Hongzhang Shan, John Shalf (NERSC)
2
Motivation
The HPC community has started to build petaflop platforms.
System:
- I/O scalability: can I/O handle exponentially increasing concurrency and scale proportionally to flops?
- Programming interface: how do we make programming of increasingly complex file systems easily accessible to users?
Application:
- Workload survey/characterization (what applications dominate our workload?)
- Understanding the I/O requirements of key applications
- Develop or adopt microbenchmarks that reflect those requirements
- Set performance expectations (now) and targets (future)
3
Outline
- Analyzing the NERSC workload
- Selecting a benchmark that reflects workload requirements (e.g., why IOR?)
- Using IOR to assess system performance
- Using IOR to predict I/O performance for full applications
4
Identify Application Requirements
Identify users with demanding I/O requirements:
- Study NERSC allocations (ERCAP)
- Study NERSC user surveys
- Approached a sampling of top I/O users: astrophysics (Cactus, FLASH, CMB/MADCAP), materials science, the AMR framework (Chombo), etc.
5
Survey Results
Access pattern:
- Sequential I/O patterns dominate
- Writes dominate (exception: out-of-core CMB analysis)
Size of I/O transaction:
- Broad range: 1 KB to tens of MB
Typical strategies for I/O:
- Run all I/O through one processor (serial); still common
- One file per processor (multi-file parallel I/O); also common, but a file-management nightmare (millions of files) and bad for archival storage systems
- MPI-IO to a single file (single-file parallel I/O); slowly emerging, motivated by the need for fewer files
- pHDF5 and parallelNetCDF (advanced self-describing, platform-neutral file formats); motivated by portability and provenance concerns, though there are concerns about the overhead of these advanced formats
6
Potential Problems
Run all I/O through one processor:
- Potential performance bottleneck
- Does not fit the distributed-memory model
One file per processor:
- High overhead for metadata management; a recent FLASH run on BG/L generated 75 million files
- Bad for archival storage (lots of small files)
- Bad for metadata servers (lots of file creates, at roughly 100 file creates/s)
- Need to use shared files or a new interface
7
Migration to Parallel I/O
Parallel I/O to a single file is slowly emerging:
- Previously this implied MPI-IO for correctness, but concurrent POSIX access now works as well
- Motivated by the need for fewer files; simplifies data analysis, visualization, and archival storage
Modest migration to high-level file formats (pHDF5, parallelNetCDF):
- Motivated by portability and provenance concerns
- Concerns remain about the overhead of these advanced file formats
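To make the "concurrent POSIX to a single file" point concrete, here is a minimal sketch, assuming a parallel file system where each rank writes a disjoint byte range; the file name and sizes are illustrative and not taken from the slides:

```c
#include <mpi.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>

/* Each MPI rank writes its own disjoint region of one shared file using
 * plain POSIX pwrite(); no MPI-IO calls are involved. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t chunk = 16 << 20;                 /* 16 MB per rank (illustrative) */
    char *buf = malloc(chunk);
    for (size_t i = 0; i < chunk; i++) buf[i] = (char)rank;

    /* All ranks open the same file; O_CREAT is harmless if it already exists */
    int fd = open("shared.dat", O_CREAT | O_WRONLY, 0644);
    pwrite(fd, buf, chunk, (off_t)rank * chunk);   /* disjoint offsets, no coordination needed */
    close(fd);

    free(buf);
    MPI_Finalize();
    return 0;
}
```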
8
Benchmark Requirements
Need to develop or adopt a benchmark that reflects application requirements:
- Access pattern
- File type
- Programming interface
- File size
- Transaction size
- Concurrency
9
Synthetic Benchmarks
Most synthetic benchmarks cannot be related to observed application I/O patterns.
Benchmarks studied: IOzone, Bonnie, Self-Scaling benchmark, SDSC I/O benchmark, Effective I/O Bandwidth, IOR, etc.
Deficiencies of most of these:
- Access pattern not realistic for HPC: serial only, just sequential and random read/write, not very reflective of the application I/O patterns we studied
- Limited programming interface: no coverage of modern parallel I/O interfaces (MPI-IO, NetCDF, HDF5)
10
LLNL IOR Benchmark
- Developed by LLNL; used for the Purple procurement
- Focuses on the parallel/sequential read/write operations that are typical in scientific applications
- Can exercise one-file-per-processor or shared-file access for a common set of testing parameters (differential study)
- Exercises an array of modern file APIs: MPI-IO, POSIX (shared or unshared), pHDF5, parallelNetCDF
- Parameterized parallel file access patterns mimic different application situations
11
IOR Design (shared file)
[Figure: shared-file layout. The file is a sequence of segments; within each segment every process (P0 ... Pn) owns one blockSize region, which it writes in transferSize chunks. A segment corresponds to a time step, a field, or a variable name, i.e., a data set.]
Important parameters: blockSize, transferSize, API, concurrency, fileType.
12
IOR Design (shared file)
Same shared-file layout as on the previous slide; here a segment corresponds to a dataset in HDF5 and parallelNetCDF nomenclature (e.g., one time step or one variable). A minimal MPI-IO sketch of this layout follows.
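To make the shared-file layout concrete, here is a minimal MPI-IO sketch (not IOR itself; the file name and parameter values are illustrative) in which every rank writes its blockSize region of each segment in transferSize chunks. blockSize, transferSize, segmentCount, the API, and the concurrency are the parameters varied in the IOR results on the following slides:

```c
#include <mpi.h>
#include <stdlib.h>

/* Minimal sketch of the shared-file layout described above: the file is a
 * series of segments; inside each segment every rank owns one blockSize
 * region, which it writes in transferSize chunks. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const MPI_Offset blockSize    = 64 << 20;    /* 64 MB per rank per segment */
    const MPI_Offset transferSize =  1 << 20;    /*  1 MB per write call       */
    const int        segmentCount = 4;

    char *buf = malloc((size_t)transferSize);
    for (MPI_Offset i = 0; i < transferSize; i++) buf[i] = (char)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "ior_style.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    for (int seg = 0; seg < segmentCount; seg++) {
        /* Start of this rank's block inside the current segment */
        MPI_Offset base = (MPI_Offset)seg * nprocs * blockSize
                        + (MPI_Offset)rank * blockSize;
        for (MPI_Offset off = 0; off < blockSize; off += transferSize)
            MPI_File_write_at(fh, base + off, buf, (int)transferSize,
                              MPI_BYTE, MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```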
13
IOR Design (One file per processor)
[Figure: one-file-per-processor layout. Each process (P0 ... Pn) writes its own file; each file is a sequence of segments of blockSize, written in transferSize chunks.]
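For comparison, a minimal POSIX sketch of the one-file-per-processor mode (again not IOR itself; file names and sizes are illustrative):

```c
#include <mpi.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

/* Each rank writes its own file as a series of segments of blockSize,
 * issued sequentially in transferSize chunks. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t blockSize    = 64 << 20;   /* 64 MB per segment */
    const size_t transferSize =  1 << 20;   /*  1 MB per write() */
    const int    segmentCount = 4;

    char path[64];
    snprintf(path, sizeof(path), "ior_style.%05d", rank);  /* one file per rank */
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);

    char *buf = malloc(transferSize);
    for (size_t i = 0; i < transferSize; i++) buf[i] = (char)rank;

    for (int seg = 0; seg < segmentCount; seg++)
        for (size_t off = 0; off < blockSize; off += transferSize)
            if (write(fd, buf, transferSize) < 0) { perror("write"); break; }

    close(fd);
    free(buf);
    MPI_Finalize();
    return 0;
}
```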
14
Outline
- Why IOR?
- Using IOR to study system performance
- Using IOR to predict I/O performance for applications
15
Platforms
Machine | Parallel File System | Proc Arch | Interconnect | Peak I/O BW | Max Node BW to I/O
Jaguar  | Lustre  | Opteron | SeaStar    | 18 × 2.3 GB/s ≈ 42 GB/s | 3.2 GB/s (1.2 GB/s)
Bassi   | GPFS    | Power5  | Federation | 6 × 1 GB/s ≈ 6.0 GB/s   | 4.0 GB/s (1.6 GB/s)
Notes: Jaguar has 18 DDN 9550 couplets, each delivering 2.3-3 GB/s. Bassi has 6 VSDs, each with 8 non-redundant FC2 channels to achieve ~1 GB/s per VSD (2x redundancy of FC). Effective unidirectional bandwidth is shown in parentheses.
16
Caching Effects
Machine | Mem per Node | Node Size | Mem per Proc
Jaguar  | 8 GB         | 2         | 4 GB
Bassi   | 32 GB        | 8         | 4 GB
- Choose a large enough I/O file size to eliminate caching effects.
- On Bassi, the file size should be at least 256 MB per process to avoid caching effects.
- On Jaguar, we have not observed a caching effect; 2 GB is used to get stable output.
17
Transfer Size (P = 8)
[Figure: bandwidth vs. transfer size on Jaguar and Bassi; annotations mark "HPC speed" and "DSL speed".]
- A large transfer size is critical on Jaguar to achieve good performance.
- The effect on Bassi is not as significant.
18
Scaling (No. of Processors)
Parameters: one file per processor; Lustre stripe count (lstripe) = 144 on Jaguar.
- I/O performance peaks at P = 256 on Jaguar (lstripe = 144) and is close to its peak at P = 64 on Bassi.
- Peak I/O performance can often be achieved at relatively low concurrency.
19
Shared vs. One file Per Proc
Parameters: Lustre striping (lfs) left at the default settings.
- The performance of using a shared file is very close to that of one file per processor.
- Using a shared file performs even better on Jaguar, due to lower metadata overhead.
20
Programming Interface
- MPI-IO performance is close to POSIX performance.
- Concurrent POSIX access to a single file works correctly; MPI-IO used to be required for correctness, but is no longer.
- HDF5 (v1.6.5) falls a little behind but tracks MPI-IO performance.
- parallelNetCDF (v1.0.2pre) performs worst and still has a 4 GB dataset size limitation (due to limits on per-dimension sizes in the latest version).
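As an illustration of what the pHDF5 path involves, here is a minimal collective-write sketch written against the HDF5 1.6 API cited on this slide (with HDF5 1.8 or later, use H5Dcreate2 or build with -DH5_USE_16_API); it assumes an HDF5 library built with parallel support, and the file name and sizes are illustrative:

```c
#include <hdf5.h>
#include <mpi.h>
#include <stdlib.h>

/* Collective parallel write of one shared 1-D dataset through pHDF5.
 * Written against the HDF5 1.6 API (5-argument H5Dcreate). */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const hsize_t count = 1 << 20;                 /* elements per process (8 MB of doubles) */

    /* File access property list: drive HDF5 with MPI-IO underneath */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One dataset shared by all processes */
    hsize_t dims = count * (hsize_t)nprocs;
    hid_t filespace = H5Screate_simple(1, &dims, NULL);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, filespace, H5P_DEFAULT);

    /* Each process selects its own contiguous region of the dataset */
    hsize_t start = count * (hsize_t)rank;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &count, NULL);
    hid_t memspace = H5Screate_simple(1, &count, NULL);

    /* Collective data transfer */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    double *buf = malloc(count * sizeof(double));
    for (hsize_t i = 0; i < count; i++) buf[i] = (double)rank;
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    free(buf);
    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```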
21
Programming Interface
- POSIX, MPI-IO, and HDF5 (v1.6.5) offer very similar, scalable performance.
- parallelNetCDF (v1.0.2pre) shows flat performance.
22
Outline
- Why IOR?
- Using IOR to study system performance
- Using IOR to predict I/O performance for applications
23
MADbench
Astrophysics application used to analyze massive Cosmic Microwave Background (CMB) datasets.
Important parameters related to I/O:
- Pixels: matrix size = pixels × pixels
- Bins: number of matrices
I/O behavior:
- Out-of-core application; matrix write/read
- Weak scaling problem; Pixels/Proc = 25K/16
CMB background: the data sizes quoted here are applicable to the ESA/NASA Planck satellite mission, due to launch in 2007. The angular power spectrum captures the strength of the correlations on various angular scales. With the advent of E- and B-mode polarization measurements, we are moving from a single temperature spectrum to 3 auto (TT, EE, BB) and 3 cross (TE, TB=0, EB=0) spectra. The spectral plot shows the 4 non-zero spectra and notes some of the key physics we hope to measure: the reionization history of the Universe shows up as a TE bump on large scales (low spherical harmonic multipoles), while the holy-grail BB spectrum has the potential to show us signals from gravity waves generated during inflation (10^-35 seconds after the Big Bang) as well as from gravitational lensing by intervening masses (e.g., clusters of galaxies).
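The out-of-core behavior described above can be sketched as follows. This is a schematic of the pattern (per-process scratch files, one matrix block written per bin and later read back), not MADbench's actual code; all names and sizes are illustrative:

```c
#include <mpi.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

/* Out-of-core pattern: spill one local matrix block per bin to scratch,
 * then read the blocks back one at a time in a later phase. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int    bins      = 8;            /* number of matrices          */
    const size_t localSize = 64 << 20;     /* local matrix block, 64 MB   */

    char path[64];
    snprintf(path, sizeof(path), "scratch.%05d", rank);
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0644);

    double *block = malloc(localSize);

    /* Phase 1: compute each matrix block and write it out */
    for (int b = 0; b < bins; b++) {
        /* ... fill block for bin b ... */
        pwrite(fd, block, localSize, (off_t)b * localSize);
    }

    /* Phase 2: read the matrix blocks back and combine them */
    for (int b = 0; b < bins; b++) {
        pread(fd, block, localSize, (off_t)b * localSize);
        /* ... use block ... */
    }

    close(fd);
    free(block);
    MPI_Finalize();
    return 0;
}
```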
24
I/O Performance Prediction for MADbench
[Figure: measured MADbench I/O performance vs. the IOR prediction (absolute performance); annotations mark regions of underprediction and overprediction. Jaguar lstripe = 144.]
IOR parameters: transferSize = 16 MB, blockSize = 64 MB, segmentCount = 1, P = 64.
25
Summary
- Surveyed the I/O requirements of NERSC applications and selected IOR as the synthetic benchmark for studying I/O performance.
- I/O performance is highly affected by file size, I/O transaction size, and concurrency, and peaks at relatively low concurrency.
- The overhead of using HDF5 and MPI-IO is low, but that of pNetCDF is high.
- IOR can be used effectively to predict I/O performance for some applications.
26
Extra Material
27
Chombo
- Chombo is a tool package for solving PDE problems on block-structured, adaptively refined regular grids.
- I/O is used to read/write the hierarchical grid structure at the end of each time step.
- Test problem: grid size = 400 × 400, 1 time step.
28
Chombo I/O Behavior
[Figure: numbered grid blocks distributed across processes P0, P1, and P2 in memory, and their layout in a single file written through the HDF5 interface.]
Block size varies substantially, from 1 KB to 10 MB.
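A rough sketch of the underlying pattern (variable-size blocks from each process landing at computed offsets in one shared file) is shown below. Chombo itself does this through the HDF5 interface; plain MPI-IO with an MPI_Exscan prefix sum is used here only to keep the sketch short, and all sizes and names are illustrative:

```c
#include <mpi.h>
#include <stdlib.h>

/* Each rank contributes a block of a different size to one shared file.
 * A prefix sum of the block sizes gives each rank its file offset. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Variable block size per rank, roughly in the 1 KB - 10 MB range */
    long long mySize = (long long)(1 << 10) << (rank % 14);
    char *buf = malloc(mySize);
    for (long long i = 0; i < mySize; i++) buf[i] = (char)rank;

    long long myOffset = 0;
    MPI_Exscan(&mySize, &myOffset, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) myOffset = 0;   /* MPI_Exscan leaves rank 0's result undefined */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "chombo_style.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, (MPI_Offset)myOffset, buf, (int)mySize,
                          MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(buf);
    MPI_Finalize();
    return 0;
}
```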
29
I/O Performance Prediction for Chombo