Enabling High Performance Application I/O
Project 4, SciDAC All Hands Meeting, March 26-27, 2002
PIs: Alok Choudhary, Wei-keng Liao
Grad Students: Avery Ching, Kenin Coloma, Jianwei Li
ANL Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur
Wei-keng Liao, Northwestern University

Outline
1. Design of parallel netCDF APIs
   - Implemented on top of MPI-IO (student: Jianwei Li)
   - Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur (ANL)
2. Non-contiguous data access on PVFS
   - Design of non-contiguous access APIs (student: Avery Ching)
   - Interface to MPI-IO (student: Kenin Coloma)
   - Applications: FLASH, tiled visualization
   - Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur (ANL)
3. High-level data access patterns
   - ENZO astrophysics application
   - Access patterns of an AMR application

NetCDF Overview
NetCDF (network Common Data Form) is an interface for array-oriented data access. It defines a machine-independent file format for representing multi-dimensional arrays with ancillary data, and provides an I/O library for the creation, access, and sharing of array-oriented data. Each netCDF file is a dataset, which contains a set of named arrays.
Dataset components:
- Dimensions: name, length
  - Fixed dimensions
  - UNLIMITED dimension
- Variables (named arrays): name, type, shape, attributes, array data
  - Fixed-size variables: arrays of fixed dimensions
  - Record variables: arrays whose most-significant dimension is UNLIMITED
  - Coordinate variables: 1-D arrays with the same name as their dimension
- Attributes: name, type, values, length
  - Variable attributes
  - Global attributes
NetCDF example (CDL notation for a netCDF dataset):
    {
    dimensions:  // dimension names and lengths
      lat = 5, lon = 10, level = 4, time = unlimited;
    variables:   // var types, names, shapes, attributes
      float temp(time,level,lat,lon);
        temp:long_name = "temperature";
        temp:units = "celsius";
      float rh(time,lat,lon);
        rh:long_name = "relative humidity";
        rh:valid_range = 0.0, 1.0;  // min and max
      int lat(lat), lon(lon), level(level), time(time);
        lat:units = "degrees_north";
        lon:units = "degrees_east";
        level:units = "millibars";
        time:units = "hours since ";
    // global attributes:
      :source = "Fictional Model Output";
    data:  // optional data assignments
      level = 1000, 850, 700, 500;
      lat = 20, 30, 40, 50, 60;
      lon = -160,-140,-118,-96,-84,-52,-45,-35,-25,-15;
      time = 12;
      rh = .5,.2,.4,.2,.3,.2,.4,.5,.6,.7,
           .1,.3,.1,.1,.1,.1,.5,.7,.8,.8,
           .1,.2,.2,.2,.2,.5,.7,.8,.9,.9,
           .1,.2,.3,.3,.3,.3,.7,.8,.9,.9,
            0,.1,.2,.4,.4,.4,.4,.7,.9,.9;  // 1 record allocated
    }

Design of Parallel netCDF APIs
Goals:
- Maintain exactly the original netCDF file format
- Provide parallel I/O functionality on top of MPI-IO
High-level parallel APIs:
- Minimize changes to the netCDF API argument lists
- For legacy codes requiring minimal modification
Low-level parallel APIs:
- Expose MPI-IO components, e.g., derived datatypes
- For experienced MPI-IO users

NetCDF File Structure
- Header (dataset definitions, extendable)
  - Number of records allocated
  - Dimension list
  - Global attribute list
  - Variable list
- Data (row-major, big-endian, 4-byte aligned)
  - Fixed-size (non-record) data: the data for each variable is stored contiguously, in the defined order
  - Record data (non-contiguous between records of a variable): a variable number of fixed-size records, each of which contains one record for each record variable, in the defined order

NetCDF APIs
- Dataset APIs: create/open/close a dataset, switch the dataset between define and data mode, and synchronize dataset changes to disk
  - Input: path and mode for create/open; dataset ID for an opened dataset
  - Output: dataset ID for create/open
- Define mode APIs: define the dataset by adding dimensions and variables
  - Input: opened dataset ID; dimension name and length to define a dimension; or variable name, number of dimensions, and shape to define a variable
  - Output: dimension ID or variable ID
- Attribute APIs: add, change, and read attributes of datasets
  - Input: opened dataset ID; attribute number or attribute name to access an attribute; or attribute name, type, and value to add/change an attribute
  - Output: attribute value for reads
- Inquiry APIs: inquire dataset metadata (in memory): dim (ID, name, length), var (name, ndims, shape, ID)
  - Input: opened dataset ID; dimension name or ID, or variable name or ID
  - Output: dimension info or variable info
- Data mode APIs: read/write variables (access methods: single value, whole array, subarray, strided subarray, sampled subarray)
  - Input: opened dataset ID; variable ID; element start index, count, stride, index map

Design of Parallel APIs
- Two file descriptors
  - NetCDF file descriptor: for header I/O (reuses existing code); performed only by process 0
  - MPI_File handle: for data array I/O; performed by all processes
- Implicit MPI file handle and communicator
  - Added to the internal data structure
  - The MPI communicator is passed as an argument to create/open
- I/O implementation using MPI-IO
  - File views and offsets are computed from the metadata in the header and the user-provided arguments (start, count, stride)
  - Users choose either collective or non-collective I/O calls
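To make the last point concrete, here is a minimal sketch (not the actual library internals) of how a file view for a subarray access could be derived from a variable's shape and the user's start/count arguments with standard MPI-IO calls; the variable's starting byte offset (var_begin) and its dimension sizes are assumed to come from the parsed header.

    /* Sketch only: building a subarray file view from header metadata
     * plus the user's (start, count) arguments.  var_begin and shape[]
     * are assumed to have been read from the netCDF header. */
    #include <mpi.h>

    int set_subarray_view(MPI_File fh, MPI_Offset var_begin, int ndims,
                          int shape[], int start[], int count[])
    {
        MPI_Datatype filetype;

        /* Describe the requested subarray within the variable's full
         * array (row-major order, matching the netCDF file layout). */
        MPI_Type_create_subarray(ndims, shape, count, start,
                                 MPI_ORDER_C, MPI_INT, &filetype);
        MPI_Type_commit(&filetype);

        /* Anchor the view at the variable's starting byte offset, so
         * subsequent reads/writes see only this process's subarray. */
        MPI_File_set_view(fh, var_begin, MPI_INT, filetype, "native",
                          MPI_INFO_NULL);

        MPI_Type_free(&filetype);
        return 0;
    }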

Collective/Non-collective APIs
- Dataset APIs
  - Collective calls over the communicator passed into the create or open call
  - All processes collectively switch between define and data mode
- Define mode, attribute, and inquiry APIs
  - Collective or non-collective calls
  - Operate on local memory (all processes hold identical header structures)
- Data mode APIs
  - Collective or non-collective calls
  - Access methods: single value, whole array, subarray, strided subarray

Changes in High-level Parallel APIs
- Dataset APIs: nc_create, nc_open
  - Parallel API: same names; argument change: add MPI_Comm; needs MPI-IO: yes
- Dataset APIs: nc_enddef, nc_redef, nc_close, nc_sync
  - Parallel API: same names; no argument change; needs MPI-IO: yes
- Define mode, attribute, and inquiry APIs: all
  - Parallel API: same names; no argument change; needs MPI-IO: no
- Data mode APIs: nc_put_var_<type>, nc_get_var_<type>
  - Parallel APIs: nc_put_var_<type>, nc_put_var_<type>_all, nc_get_var_<type>, nc_get_var_<type>_all; no argument change; needs MPI-IO: yes
- <type> = text | uchar | schar | short | int | long | float | double
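For dataset creation, the table amounts to a single source change; a minimal before/after sketch (variable names as in the example code that follows):

    /* Serial netCDF (original API) */
    status = nc_create("test.nc", NC_CLOBBER, &ncid);

    /* Parallel netCDF (high-level API): the only argument change is the
     * added MPI communicator; the other calls keep their signatures. */
    status = nc_create(comm, "test.nc", NC_CLOBBER, &ncid);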

Example Code - Write
- Create a dataset
  - Collective
  - The input arguments should be the same on all processes
  - The returned ncid differs among processes (but refers to the same dataset)
  - All processes enter define mode
- Define dimensions
  - Non-collective
  - All processes should make the same definitions
- Define variables
  - Non-collective
  - All processes should make the same definitions
- Add attributes
  - Non-collective
  - All processes should add the same attributes
- End define
  - Collective
  - All processes switch from define mode to data mode
- Write variable data
  - All processes perform a number of collective writes, one per variable
  - Independent writes are also possible
  - Each process provides different argument values, set locally
- Close the dataset
  - Collective

    /* create (the only change from serial netCDF: the MPI_Comm argument) */
    status = nc_create(comm, "test.nc", NC_CLOBBER, &ncid);

    /* dimensions */
    status = nc_def_dim(ncid, "x", 100L, &dimid1);
    status = nc_def_dim(ncid, "y", 100L, &dimid2);
    status = nc_def_dim(ncid, "z", 100L, &dimid3);
    status = nc_def_dim(ncid, "time", NC_UNLIMITED, &udimid);
    square_dim[0] = cube_dim[0] = xytime_dim[1] = dimid1;
    square_dim[1] = cube_dim[1] = xytime_dim[2] = dimid2;
    cube_dim[2] = dimid3;
    xytime_dim[0] = udimid;
    time_dim[0] = udimid;

    /* variables */
    status = nc_def_var(ncid, "square", NC_INT, 2, square_dim, &square_id);
    status = nc_def_var(ncid, "cube",   NC_INT, 3, cube_dim,   &cube_id);
    status = nc_def_var(ncid, "time",   NC_INT, 1, time_dim,   &time_id);
    status = nc_def_var(ncid, "xytime", NC_INT, 3, xytime_dim, &xytime_id);

    /* attributes */
    status = nc_put_att_text(ncid, NC_GLOBAL, "title", strlen(title), title);
    status = nc_put_att_text(ncid, square_id, "description", strlen(desc), desc);

    status = nc_enddef(ncid);

    /* variable data (collective writes) */
    nc_put_vara_int_all(ncid, square_id, square_start, square_count, buf1);
    nc_put_vara_int_all(ncid, cube_id,   cube_start,   cube_count,   buf2);
    nc_put_vara_int_all(ncid, time_id,   time_start,   time_count,   buf3);
    nc_put_vara_int_all(ncid, xytime_id, xytime_start, xytime_count, buf4);

    status = nc_close(ncid);

Example Code - Read
- Open the dataset
  - Collective
  - The input arguments should be the same on all processes
  - The returned ncid differs among processes (but refers to the same dataset)
  - All processes enter data mode
- Dataset inquiries
  - Non-collective
  - Count, name, length, datatype
- Read variable data
  - All processes perform a number of collective reads, one per variable, in a (B, *, *) manner
  - Independent reads are also possible
  - Each process provides different argument values, set locally
- Close the dataset
  - Collective

    /* open (the only change from serial netCDF: the MPI_Comm argument) */
    status = nc_open(comm, filename, 0, &ncid);
    status = nc_inq(ncid, &ndims, &nvars, &ngatts, &unlimdimid);

    /* global attributes */
    for (i = 0; i < ngatts; i++) {
        status = nc_inq_attname(ncid, NC_GLOBAL, i, name);
        status = nc_inq_att(ncid, NC_GLOBAL, name, &type, &len);
        status = nc_get_att_text(ncid, NC_GLOBAL, name, valuep);
    }

    /* variables */
    for (i = 0; i < nvars; i++) {
        status = nc_inq_var(ncid, i, name, vartypes+i, varndims+i,
                            vardims[i], varnatts+i);
        /* variable attributes */
        for (j = 0; j < varnatts[i]; j++) {
            status = nc_inq_attname(ncid, varids[i], j, name);
            status = nc_inq_att(ncid, varids[i], name, &type, &len);
            status = nc_get_att_text(ncid, varids[i], name, valuep);
        }
    }

    /* variable data */
    for (i = 0; i < NC_MAX_VAR_DIMS; i++)
        start[i] = 0;
    for (i = 0; i < nvars; i++) {
        varsize = 1;
        /* dimensions: partition the first dimension among processes */
        for (j = 0; j < varndims[i]; j++) {
            status = nc_inq_dim(ncid, vardims[i][j], name, shape + j);
            if (j == 0) {
                shape[j] /= nprocs;
                start[j] = shape[j] * rank;
            }
            varsize *= shape[j];
        }
        status = nc_get_vara_int_all(ncid, i, start, shape, (int *)valuep);
    }

    status = nc_close(ncid);

Non-contiguous Data Access on PVFS
- Problem definition
- Design approaches
  - Multiple I/O
  - Data sieving
  - PVFS list_io
- Integration into MPI-IO
- Experimental results
  - Artificial benchmark
  - FLASH application I/O
  - Tile visualization

Non-contiguous Data Access
Data access that is not adjacent in memory or in the file:
- Non-contiguous in memory, contiguous in file
- Non-contiguous in file, contiguous in memory
- Non-contiguous in file, non-contiguous in memory
Two applications:
- FLASH astrophysics application
- Tile visualization
[Figure: memory and file layouts for the three non-contiguous access cases]
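In MPI-IO terms, such patterns are normally expressed with derived datatypes. A minimal illustrative sketch (not from the slides) of a noncontiguous-in-file, contiguous-in-memory read:

    /* Minimal sketch: contiguous buffer in memory, strided (non-contiguous)
     * regions in the file, expressed with an MPI derived datatype. */
    #include <mpi.h>

    void strided_read(MPI_File fh, int *buf, int nblocks, int blocklen, int stride)
    {
        MPI_Datatype filetype;
        MPI_Status   status;

        /* nblocks blocks of blocklen ints whose starts are stride ints
         * apart in the file */
        MPI_Type_vector(nblocks, blocklen, stride, MPI_INT, &filetype);
        MPI_Type_commit(&filetype);

        /* The file view hides the gaps; the read fills a contiguous buffer. */
        MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);
        MPI_File_read_all(fh, buf, nblocks * blocklen, MPI_INT, &status);

        MPI_Type_free(&filetype);
    }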

Multiple I/O Requests
- Intuitive strategy: one I/O request per contiguous data segment
- Results in a large number of I/O requests to the file system
- Communication costs between the application and the I/O servers become significant and can dominate the I/O time
[Figure: each contiguous data region generates a separate request from the application to the I/O servers]

Data Sieving I/O
- Read a contiguous chunk from the file into a temporary buffer
- Extract/update the requested portions
  - The number of requests is reduced
  - The I/O amount is increased
  - The number of I/O requests depends on the size of the sieving buffer
- Write the buffer back to the file (for write operations)
[Figure: a single large request covers several contiguous data regions, which are then extracted from the sieving buffer]
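A simplified sketch of the read-side idea (schematic only, not the ROMIO implementation): read one large contiguous extent, then copy out just the requested pieces.

    /* Simplified sketch of read-side data sieving.  offsets[]/lengths[]
     * describe the requested non-contiguous file regions, assumed sorted
     * by offset; dest[] receives the concatenated data.  A real
     * implementation bounds the sieving buffer size and loops. */
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    void sieve_read(int fd, const off_t *offsets, const size_t *lengths,
                    int nreq, char *dest)
    {
        off_t  start = offsets[0];
        off_t  end   = offsets[nreq - 1] + (off_t)lengths[nreq - 1];
        size_t span  = (size_t)(end - start);

        /* One large contiguous read covering every requested region
         * (holes included). */
        char *sieve_buf = malloc(span);
        pread(fd, sieve_buf, span, start);

        /* Extract only the requested portions from the sieving buffer. */
        size_t out = 0;
        for (int i = 0; i < nreq; i++) {
            memcpy(dest + out, sieve_buf + (offsets[i] - start), lengths[i]);
            out += lengths[i];
        }
        free(sieve_buf);
    }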

PVFS List_io
- Combines non-contiguous I/O requests into a single request
- Client support
  - APIs: pvfs_read_list, pvfs_write_list
  - An I/O request is a list of file offsets and file lengths
- I/O server support
  - The server waits for the trailing list of file offsets and lengths that follows the I/O request
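To make the request format concrete, the sketch below builds the offset/length list for a strided file access; list_io_read() is a hypothetical stand-in for the pvfs_read_list client call, whose exact argument list is not given on the slide.

    /* Hedged sketch: describing a strided, non-contiguous file access as
     * a list of (offset, length) pairs, the request format used by PVFS
     * list I/O.  list_io_read() is a hypothetical placeholder for the
     * pvfs_read_list client call. */
    #include <stdint.h>

    int list_io_read(int fd, char *mem_buf, int count,
                     const int64_t *file_offsets, const int32_t *file_lengths);

    void read_strided(int fd, char *buf, int nblocks, int blocklen, int stride)
    {
        int64_t offsets[nblocks];
        int32_t lengths[nblocks];

        /* One (offset, length) pair per contiguous block in the file. */
        for (int i = 0; i < nblocks; i++) {
            offsets[i] = (int64_t)i * stride;
            lengths[i] = blocklen;
        }

        /* A single request carries the whole list to the I/O servers. */
        list_io_read(fd, buf, nblocks, offsets, lengths);
    }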

Artificial Benchmark
- Contiguous in memory, non-contiguous in file
- Parameters:
  - Number of accesses
  - Number of processors
  - Stride size = file size / number of accesses
  - Block size = stride size / number of processors
[Figure: each process accesses one block within every stride; example with 3 processes and 4 accesses]
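A small illustrative calculation (not benchmark code) of the file offsets a given process touches under these definitions; the small access count is only for readability.

    /* Illustrative only: file offsets touched by process `rank` with
     * stride = file_size / num_accesses and block = stride / nprocs. */
    #include <stdio.h>

    int main(void)
    {
        long file_size    = 1L << 30;   /* 1 GB, as in the experiments */
        int  num_accesses = 4;          /* small value, for illustration */
        int  nprocs       = 3, rank = 1;

        long stride = file_size / num_accesses;
        long block  = stride / nprocs;

        for (int a = 0; a < num_accesses; a++)
            printf("access %d: offset %ld, length %ld\n",
                   a, a * stride + rank * block, block);
        return 0;
    }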

Benchmark Results
- Parameter configurations
  - 8 clients
  - 8 I/O servers
  - 1 GB file size
- To avoid caching effects at the I/O servers, 4 files are read/written alternately, since each I/O server has 512 MB of memory
[Charts: read and write time (seconds) vs. number of accesses for multiple I/O, data sieving, and list_io]

FLASH Application
- An astrophysics application developed at the University of Chicago
  - Simulates the accretion of matter onto a compact star and the subsequent stellar evolution, including nuclear burning either on the surface of the compact star or in its interior
- The I/O benchmark measures the performance of FLASH output: it produces checkpoint files and plot files
  - A typical large production run generates ~0.5 TB (100 checkpoint files and 1,000 plot files)
[Image: the interior of an exploding star, depicting the distribution of pressure during a star explosion]

FLASH I/O Access Pattern
- Each processor has 80 cubes
  - Each cube has guard cells and a sub-cube that holds the data to be output
- Each element in the cube contains 24 variables, each of type double (8 bytes)
  - Each variable is partitioned among all processors
- Output pattern
  - All variables are saved into a single file, one after another

FLASH I/O Results
Access patterns:
- In memory
  - Each contiguous segment is small: 8 bytes
  - The stride between two segments is also small: 192 bytes (24 variables x 8 bytes)
- From memory to file
  - Multiple I/O: 8 x 8 x 8 x 80 x 24 = 983,040 requests per processor
  - Data sieving: 24 requests per processor
  - List_io: 8 x 8 x 8 x 80 x 24 / 64 = 15,360 requests per processor (64 is the maximum number of offset-length pairs per request)
- In file
  - Each contiguous segment written by a processor is 8 x 8 x 8 x 8 = 4,096 bytes
  - The output file size is 8 MB x number of processors

Tile Visualization
- Preprocess "frames" into streams of tiles by staging tile data on visualization nodes
- Read operations only
- Each node reads one sub-tile
- Each sub-tile has ghost regions overlapping with other sub-tiles
- The non-contiguous nature of this file access becomes apparent in its logical file representation
Example layout:
- 3x2 display (tiles 1-6)
- Frame size of 2532x1408 pixels
- Tile size of 1024x768 with overlap
- 3-byte RGB pixels
- Each frame is stored as a file of size 10 MB
[Figure: a single node's file view covers a strided set of rows within the frame file]
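A tile read of this kind maps naturally onto an MPI-IO subarray file view. The sketch below is illustrative only, using the frame and tile sizes from the slide and an assumed per-node tile origin (tile_x, tile_y), which is not given here.

    /* Illustrative sketch: one visualization node reading its overlapping
     * sub-tile out of a frame file of 2532x1408 RGB pixels.  tile_x and
     * tile_y give this node's tile origin in pixels (assumed). */
    #include <mpi.h>

    void read_tile(MPI_File fh, unsigned char *tile_buf,
                   int tile_x, int tile_y)
    {
        int frame[2] = {1408, 2532 * 3};      /* rows, bytes per row */
        int tile[2]  = {768, 1024 * 3};       /* tile rows, bytes per tile row */
        int start[2] = {tile_y, tile_x * 3};  /* tile origin in bytes */
        MPI_Datatype filetype;
        MPI_Status   status;

        /* Each tile row is contiguous in the file; successive rows are not. */
        MPI_Type_create_subarray(2, frame, tile, start, MPI_ORDER_C,
                                 MPI_BYTE, &filetype);
        MPI_Type_commit(&filetype);

        MPI_File_set_view(fh, 0, MPI_BYTE, filetype, "native", MPI_INFO_NULL);
        MPI_File_read_all(fh, tile_buf, tile[0] * tile[1], MPI_BYTE, &status);

        MPI_Type_free(&filetype);
    }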

Integrating List_io into ROMIO
- ROMIO uses the internal ADIO "flatten" function to break both the filetype and the datatype down into lists of offset/length pairs
- Using these lists, ROMIO steps through both file and memory addresses
- ROMIO generates the memory and file offsets and lengths to pass to pvfs_read_list(memory offsets/lengths, file offsets/lengths)
- ROMIO issues the pvfs_list_io call once all data has been described, or once the set maximum array size has been reached, in which case a new list is generated
[Figure: filetype and datatype offset/length lists feeding a pvfs_read_list call]
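A rough sketch of that batching loop (schematic only, not the ROMIO source); issue_list_io() is a hypothetical stand-in for the pvfs_read_list/pvfs_write_list call, and the maximum of 64 pairs comes from the FLASH results slide.

    /* Schematic of ROMIO-style list batching: walk the flattened file and
     * memory (offset, length) pairs and flush a list I/O call whenever
     * the maximum list size is reached. */
    #define MAX_PAIRS 64   /* maximum offset-length pairs per request */

    void issue_list_io(const long *file_off, const long *file_len,
                       const long *mem_off,  const long *mem_len, int n);

    void batched_list_io(const long *foff, const long *flen,
                         const long *moff, const long *mlen, int npairs)
    {
        long f_off[MAX_PAIRS], f_len[MAX_PAIRS];
        long m_off[MAX_PAIRS], m_len[MAX_PAIRS];
        int  filled = 0;

        for (int i = 0; i < npairs; i++) {
            f_off[filled] = foff[i];  f_len[filled] = flen[i];
            m_off[filled] = moff[i];  m_len[filled] = mlen[i];
            filled++;

            /* Flush when the list is full, then start a new one. */
            if (filled == MAX_PAIRS) {
                issue_list_io(f_off, f_len, m_off, m_len, filled);
                filled = 0;
            }
        }
        if (filled > 0)   /* final partial list */
            issue_list_io(f_off, f_len, m_off, m_len, filled);
    }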

Tile I/O Results Collective data sieving Collective read_listNon-collective data sieving Non-collective read_list accumulated time 4 compute nodes 108 MB compute nodes 40 MB compute nodes 108 MB compute nodes 40 MB compute nodes 108 MB io nodes 16 compute nodes 40 MB io nodes

Analysis of Tile I/O Results
- Collective operations should in theory be faster, but...
- Hardware problem
  - Fast Ethernet: the overhead of the collective I/O takes too long to catch back up with the independent I/O requests
- Software problem
  - A lot of extra data movement in the ROMIO collectives; the aggregation is not as smart as it could be
- Plans
  - Use the MPE logging facilities to identify the problem
  - Study the ROMIO implementation, find bottlenecks in the collectives, and try to weed them out

High-Level Data Access Patterns
- Study of the file access patterns of astrophysics applications
  - FLASH from the University of Chicago
  - ENZO from NCSA
- Design of a data management framework using XML and a database
  - Essential metadata collection
  - Trigger rules for automatic I/O optimization

ENZO Application
- Simulates the formation of a cluster of galaxies, starting near the big bang and continuing to the present day
- Used to test theories of how galaxies form by comparing the results with what is actually observed in the sky today
- File I/O using HDF-4
- Dynamic load balancing using MPI
- Data partitioning: Adaptive Mesh Refinement (AMR)

AMR Data Access Pattern
- Adaptive Mesh Refinement partitions the problem domain into sub-domains recursively and dynamically
- A grid is owned by exactly one processor, but one processor can own many grids
- Checkpointing
  - Each grid is written to a separate file (independent writes)
- During restart
  - The sub-domain hierarchy need not be reconstructed
  - Grids at the same time stamp are read all together
- During visualization
  - All grids are combined into a top grid

AMR Hierarchy Represented in XML
- The AMR hierarchy maps naturally onto an XML hierarchy
- The XML is embedded in a relational database
- Metadata queries and updates go through the database
- The database can handle multiple queries simultaneously, which is ideal for parallel applications
[Figure: a grid.xml document (Grid, DataSet, GridRank, Producer elements and attributes) and its node/parent/key representation in a database table]

File System Based XML
- The file system is used to support the decomposition of XML documents into files and directories
- This representation consists of an arbitrary hierarchy of directories and files; it preserves the XML philosophy of being textual, yet requires no further use of an XML parser to process the document
- Metadata is located near the scientific data

Summary
- List_io API incorporated into PVFS for non-contiguous data access
  - Read operation is completed
  - Write operation is in progress
- Parallel netCDF APIs
  - High-level APIs: to be completed soon
  - Low-level APIs: interfaces already defined
  - Validator
- High-level data access patterns
  - Access patterns of AMR applications
  - Other types of applications