National Energy Research Scientific Computing Center (NERSC)
I/O Patterns from NERSC Applications
NERSC Center Division, LBNL

User Requirements
- Write data from multiple processors into a single file
- Undo the “domain decomposition” required to implement parallelism
- File can be read in the same manner regardless of the number of CPUs that read from or write to the file
  – we want to see the logical data layout… not the physical layout
- Do so with the same performance as writing one file per processor
  – users write one file per processor today only because of performance problems (they would prefer one file per application)

Usage Model (focus here is on append-only I/O)
- Checkpoint/Restart
  – Most users don’t do hero applications: they tolerate failure by submitting more jobs (and that includes apps that are targeting hero-scale applications)
  – Most people doing “hero applications” have written their own restart systems and file formats
  – Dump size is typically close to the memory footprint of the code per dump
    - Must dump the memory image ASAP!
    - Not as much need to remove the domain decomposition (recombiners for the MxN problem)
    - Not very sophisticated about recalculating derived quantities (stores all large arrays)
  – Might go back more than one checkpoint, but only need 1-2 of them online (staging)
  – Typically throw the data away if the checkpoint/restart data is not required
- Data Analysis Dumps
  – Time-series data is the most demanding
    - Typically run with coarse-grained time dumps
    - If something interesting happens, resubmit the job with a higher output rate (and take a huge penalty in I/O rates)
    - Async I/O would make the 50% I/O load go away, but nobody uses it (because it rarely works)
  – Optimization or boundary-value problems typically have flexible output requirements (typically diagnostic)

Common Storage Formats
- ASCII (pitiful… this is still common… even for 3D I/O… and you want an exaflop??)
  – Slow
  – Takes more space!
  – Inaccurate
- Binary
  – Nonportable (e.g., byte ordering and type sizes)
  – Not future proof
  – Parallel I/O using MPI-IO
- Self-describing formats
  – NetCDF/HDF4, HDF5, Silo
  – Example in HDF5: the API implements an object DB model in a portable file
  – Parallel I/O using pHDF5/pNetCDF (hides MPI-IO; see the sketch after this slide)
- Community file formats
  – FITS, HDF-EOS, SAF, PDB, Plot3D
  – Modern implementations are built on top of HDF, NetCDF, or other self-describing object-model APIs
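For concreteness, a minimal sketch of the pHDF5 path mentioned above: every rank writes its slab of one shared dataset through the MPI-IO driver, so the file presents the logical layout rather than one file per process. The file name, dataset name, and per-rank size below are illustrative assumptions, not values from the slides.

```c
/* Minimal pHDF5 sketch: all ranks write one shared 1D dataset collectively.
 * Requires an MPI-enabled HDF5 build (e.g., cc phdf5_write.c -lhdf5). */
#include <mpi.h>
#include <hdf5.h>

#define LOCAL_N 1024                    /* elements per rank (illustrative) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Open one shared file through the MPI-IO virtual file driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("checkpoint.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Global dataspace: the logical layout, independent of the decomposition. */
    hsize_t gdim = (hsize_t)nprocs * LOCAL_N;
    hid_t filespace = H5Screate_simple(1, &gdim, NULL);
    hid_t dset = H5Dcreate2(file, "x", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank selects its contiguous slab of the global array. */
    hsize_t offset = (hsize_t)rank * LOCAL_N, count = LOCAL_N;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &count, NULL);
    hid_t memspace = H5Screate_simple(1, &count, NULL);

    double data[LOCAL_N];
    for (int i = 0; i < LOCAL_N; i++) data[i] = rank + i * 1e-6;

    /* Collective transfer lets MPI-IO aggregate the interleaved writes. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, data);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```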

Common Data Models/Schemas
- Structured grids
  – 1D-6D domain-decomposed mesh data
  – Reversing the domain decomposition results in a strided disk access pattern (see the MPI-IO sketch after this slide)
  – Multiblock grids are often stored in chunked fashion
- Particle data
  – 1D lists of particle data (x,y,z location + physical properties of each particle)
  – Often a non-uniform number of particles per processor
  – PIC often requires storage of the structured grid together with the cells
- Unstructured cell data
  – 1D array of cell types
  – 1D array of vertices (x,y,z locations)
  – 1D array of cell connectivity
  – Domain decomposition has similarity with particles, but must handle ghost cells
- AMR data
  – Chombo: each 3D AMR grid occupies a distinct section of a 1D array on disk (one array per AMR level)
  – Enzo (Mike Norman, UCSD): one file per processor (each file contains multiple grids)
  – BoxLib: one file per grid (each grid in the AMR hierarchy is stored in a separate, cleverly named file)
- Increased need for processing data from terrestrial sensors (read-oriented)
  – NERSC is now a net importer of data
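A hedged sketch of how the reverse domain decomposition is commonly expressed with raw MPI-IO: a subarray file type maps each rank's 3D block back into its place in the global array, which is what produces the strided disk access pattern noted above for structured grids. The grid dimensions, decomposition, and file name are assumptions for illustration.

```c
/* Sketch: write a block-decomposed 3D double array into its global (logical)
 * order in one shared file using an MPI subarray file view. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Illustrative decomposition along the fastest-varying axis, so each
     * rank's block ends up strided on disk once the global order is restored. */
    int gsizes[3] = {64, 64, 64 * nprocs};   /* global grid */
    int lsizes[3] = {64, 64, 64};            /* this rank's block */
    int starts[3] = {0, 0, 64 * rank};       /* block origin in the global grid */

    MPI_Datatype filetype;
    MPI_Type_create_subarray(3, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    size_t nlocal = (size_t)lsizes[0] * lsizes[1] * lsizes[2];
    double *buf = malloc(nlocal * sizeof(double));
    for (size_t i = 0; i < nlocal; i++) buf[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "grid3d.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* The file view reverses the domain decomposition: writes land in
     * global order, i.e. many small strided pieces per rank. */
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, (int)nlocal, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

The collective write gives the MPI-IO layer a chance to aggregate the small strided pieces into larger, better-aligned requests.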

Physical Layout Tends to Result in a Handful of I/O Patterns
- 2D-3D I/O patterns (small-block strided I/O)
  – 1 file per processor (raw binary and HDF5)
    - Raw binary assesses peak performance
    - HDF5 determines the overhead of metadata, data encoding, and the small accesses associated with storage of indices and metadata
  – 1 file, reverse domain decomposition (raw MPI-IO and pHDF5)
    - MPI-IO is the baseline (peak performance)
    - Assess pHDF5 or pNetCDF implementation overhead
  – 1 file, chunked (looks like a 1D I/O pattern)
- 1D I/O patterns (large-block strided I/O)
  – Same as above, but for 1D data layouts
  – 1 file per processor is the same in both cases
  – Often difficult to ensure alignment to OST boundaries

Common Themes for Storage Patterns (alternative to the previous slide)
- The diversity of I/O data schemas boils down to a handful of I/O patterns at the disk level
- 1D I/O
  – Examples: GTC, VORPAL particle I/O, H5Part, ChomboHDF5, FLASH-AMR
  – Interleaved I/O operations with large transaction sizes (hundreds of kilobytes to megabytes)
  – Three categories:
    - Equal-sized transactions per processor
    - Slightly unequal-sized transactions per processor (not load-imbalanced, but difficult to align)
    - Unequal-sized transactions with load imbalance (not the focus of our attention)
- 2D, 3D, >3D I/O patterns
  – Examples: Cactus, Flash (unchunked), VORPAL 3D-IO
  – Reverse domain decomposition
    - Use chunking to increase transaction sizes
    - Chunking looks like the 1D I/O case
  – Otherwise results in interleaved output with very small transaction sizes (kilobyte-sized)
- Out-of-core I/O
  – Examples: MadCAP, MadBench, OOCore
  – Very large-block transactions (multi-megabyte or multi-gigabyte)
  – Intense for both read and write operations

Common Physical Layouts for Parallel I/O
- One file per process
  – Terrible for HPSS!
  – Difficult to manage
- Parallel I/O into a single file
  – Raw MPI-IO
  – pHDF5
  – pNetCDF
- Chunking into a single file (see the HDF5 sketch after this slide)
  – Saves the cost of reorganizing data
  – Depends on the API to hide the physical layout (e.g., expose the user to a logically contiguous array even though it is stored physically as domain-decomposed chunks)
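A small sketch of the chunking option above: declaring the dataset with a chunk shape that matches each writer's block lets the API present one logically contiguous array while storing it physically as domain-decomposed chunks. The array and chunk sizes, file name, and dataset name are illustrative.

```c
/* Sketch: create a chunked 2D HDF5 dataset so each writer's block maps to
 * whole chunks on disk while readers still see one logical array. */
#include <hdf5.h>

int main(void)
{
    hsize_t gdims[2] = {4096, 4096};   /* logical (global) array */
    hsize_t cdims[2] = {512, 512};     /* chunk = one writer's block (illustrative) */

    hid_t file  = H5Fcreate("chunked.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, gdims, NULL);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, cdims);      /* physical layout: 512x512 chunks */

    hid_t dset = H5Dcreate2(file, "density", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Writers would now select 512x512 hyperslabs that coincide with chunk
     * boundaries; no data reorganization is needed on disk, and readers
     * still address the dataset as one 4096x4096 array. */

    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
    return 0;
}
```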

3D (reversing the domain decomp)

3D (reversing the decomp) [figure: logical layout vs. physical layout]

3D (block alignment issues) [figure: 720-byte logical writes against 8192-byte physical file system blocks]
- Writes are not aligned to block boundaries
- Block updates require mutual exclusion
- Block thrashing on a distributed file system
- Poor I/O efficiency for sparse updates (an 8 KB block is required for a 720-byte I/O operation)
- Unaligned block accesses can kill performance, but are necessary in practical I/O solutions (a striping-hint sketch follows this slide)
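One common mitigation, sketched below under the assumption of a ROMIO-based MPI-IO on a Lustre-like file system: pass striping hints when the shared file is created so the stripe size matches the intended aligned transaction size, and let collective buffering aggregate the unaligned pieces. The hint values and file name are illustrative.

```c
/* Sketch: request a 1 MiB stripe unit and a 16-OST stripe count at file
 * creation, so aligned 1 MiB transactions hit whole stripes instead of
 * straddling block boundaries. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    /* ROMIO hints for Lustre-like file systems (values are illustrative). */
    MPI_Info_set(info, "striping_factor", "16");       /* number of OSTs */
    MPI_Info_set(info, "striping_unit",   "1048576");  /* 1 MiB stripe size */
    MPI_Info_set(info, "romio_cb_write",  "enable");   /* collective buffering */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "aligned.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... set a file view and issue collective writes sized and aligned to
     *     multiples of the 1 MiB stripe unit ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```

On Lustre, striping is fixed when a file is created, so these hints only matter on the initial create.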

Interleaved Data Issues (accelerator modeling data)
- Point data
  – Electrons or protons
  – Millions or billions in a simulation
  – Distribution is non-uniform
    - Fixed distribution at the start of the simulation
    - Distribution changes (load balancing) each iteration
- Attributes of a point
  – Location: (double) x, y, z
  – Phase: (double) mx, my, mz
  – ID: (int64) id
  – Other attributes
- Interleaving and transaction sizes are similar to AMR and chunked 3D data

Nonuniform Interleaved Data (accelerator modeling)
- Storage format: the X, Y, Z, … arrays are laid out sequentially on disk (X at offsets 0 through NX-1, Y at NX through NX+NY-1, and so on), but each array is viewed as interleaved on a per-processor basis (X1…Xn, Y1…Yn, …).

Nonuniform Interleaved Data (accelerator modeling)
- Each array (X, Y, Z, …) is assembled from per-processor contributions of slightly different sizes (e.g., P1: 1.2 MB, P2: 0.9 MB, P3: 1.1 MB).
- The load imbalance is slight, but not substantial; the nonuniform alignment, however, has a huge penalty!

Nonuniform Interleaved Data: Calculate Offsets Using a Collective (Allgather)
- Gather every processor’s contribution size (e.g., P1: 1.2 MB, P2: 0.9 MB, P3: 1.2 MB) and compute each offset as the sum of the preceding sizes (P1 writes at offset 0, P2 after P1, P3 after P1+P2, and so on); see the sketch after this slide.
- Then write to mutually exclusive sections of the array, one array at a time.
- Still suffers from alignment issues…
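A sketch of the offset calculation described above, assuming each rank holds a different number of particles: gather all the local counts, take an exclusive prefix sum to get this rank's offset, then write the mutually exclusive regions with a collective call. The counts and file name are made up, and the alignment problem remains because the offsets land wherever the sizes dictate.

```c
/* Sketch: nonuniform interleaved write of one particle array ("x") to a
 * shared file; offsets come from an Allgather of per-rank element counts. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Nonuniform local particle count (illustrative). */
    long long nlocal = 100000 + 1000 * rank;
    double *x = malloc((size_t)nlocal * sizeof(double));
    for (long long i = 0; i < nlocal; i++) x[i] = (double)i;

    /* Every rank learns every other rank's count... */
    long long *counts = malloc((size_t)nprocs * sizeof(long long));
    MPI_Allgather(&nlocal, 1, MPI_LONG_LONG, counts, 1, MPI_LONG_LONG,
                  MPI_COMM_WORLD);

    /* ...and computes its own offset as an exclusive prefix sum. */
    long long elem_offset = 0;
    for (int r = 0; r < rank; r++) elem_offset += counts[r];

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "particles_x.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Mutually exclusive regions, one array at a time, in global order.
     * The byte offsets fall wherever the sizes dictate: the alignment issue. */
    MPI_Offset byte_offset = (MPI_Offset)elem_offset * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, byte_offset, x, (int)nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(x); free(counts);
    MPI_Finalize();
    return 0;
}
```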

Performance Experiences (navigating a seemingly impossible minefield of constraints)

Good performance if the transaction size is an even multiple of the stripe size

Even better if you make the number of stripes equal to the number of compute processes. The performance islands become more pronounced; typical (non-OST-sized) cases get worse.

Performance falls dramatically if you offset the start of the file by a small increment (64 KB)

Impractical to aim for such small “performance islands”
- The transfer size for interleaved I/O must always match the OST stripe width
  – Difficult to constrain the domain decomposition to the granularity of I/O
  – Compromises load balancing for particle codes
  – Not practical for AMR codes (load-balanced, but not practical to have exactly identical domain sizes)
- Every compute node must write exactly aligned to an OST boundary
  – How is this feasible if users write metadata or headers to their files?
  – Difficult for high-level self-describing file formats (a partial HDF5-level mitigation is sketched after this slide)
  – Not practical when domain sizes are slightly non-uniform (such as AMR, particle load balancing, or outer boundary conditions for 3D grids)
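When the high-level format is HDF5, one partial workaround (a sketch of a generic HDF5 facility, not something prescribed by these slides) is to ask the library to align every large object to the stripe boundary; headers and small metadata still land unaligned, which is exactly the difficulty noted above. The threshold and alignment values assume a 1 MiB stripe width and are illustrative.

```c
/* Sketch: place every HDF5 object of 64 KiB or more at a 1 MiB boundary,
 * so bulk dataset I/O starts on an (assumed) OST stripe boundary even
 * though the file also contains headers and metadata. */
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    /* Objects >= 64 KiB are aligned to multiples of 1 MiB (illustrative). */
    H5Pset_alignment(fapl, 65536, 1048576);

    hid_t file = H5Fcreate("aligned.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* ... create chunked datasets and issue collective writes as before ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```

The trade-off is wasted space between aligned objects, which is why the threshold keeps small objects packed.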