1
Enabling High Performance Application I/O
Project 4, SciDAC All Hands Meeting, March 26-27, 2002
PIs: Alok Choudhary, Wei-keng Liao
Grad Students: Avery Ching, Kenin Coloma, Jianwei Li
ANL Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur
Wei-keng Liao, Northwestern University
2
Outline
1. Design of parallel netCDF APIs
   – Built on top of MPI-IO (student: Jianwei Li)
   – Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur (ANL)
2. Non-contiguous data access on PVFS
   – Design of non-contiguous access APIs (student: Avery Ching)
   – Interfaces to MPI-IO (student: Kenin Coloma)
   – Applications: FLASH, tiled visualization
   – Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur (ANL)
3. High-level data access patterns
   – ENZO astrophysics application
   – Access patterns of an AMR application
3
NetCDF Overview
NetCDF (network Common Data Form) is an interface for array-oriented data access. It defines a machine-independent file format for representing multi-dimensional arrays with ancillary data, and provides an I/O library for the creation, access, and sharing of array-oriented data. Each netCDF file is a dataset, which contains a set of named arrays.

Dataset components
– Dimensions: name, length
  ● Fixed dimensions
  ● UNLIMITED dimension
– Variables (named arrays): name, type, shape, attributes, array data
  ● Fixed-size variables: arrays of fixed dimensions
  ● Record variables: arrays whose most-significant dimension is UNLIMITED
  ● Coordinate variables: 1-D arrays with the same name as their dimension
– Attributes: name, type, values, length
  ● Variable attributes
  ● Global attributes

netCDF example:
{ // CDL notation for netCDF dataset
dimensions: // dimension names and lengths
    lat = 5, lon = 10, level = 4, time = unlimited;
variables: // var types, names, shapes, attributes
    float temp(time,level,lat,lon);
        temp:long_name = "temperature";
        temp:units = "celsius";
    float rh(time,lat,lon);
        rh:long_name = "relative humidity";
        rh:valid_range = 0.0, 1.0; // min and max
    int lat(lat), lon(lon), level(level), time(time);
        lat:units = "degrees_north";
        lon:units = "degrees_east";
        level:units = "millibars";
        time:units = "hours since 1996-1-1";
    // global attributes:
    :source = "Fictional Model Output";
data: // optional data assignments
    level = 1000, 850, 700, 500;
    lat = 20, 30, 40, 50, 60;
    lon = -160,-140,-118,-96,-84,-52,-45,-35,-25,-15;
    time = 12;
    rh = .5,.2,.4,.2,.3,.2,.4,.5,.6,.7,
         .1,.3,.1,.1,.1,.1,.5,.7,.8,.8,
         .1,.2,.2,.2,.2,.5,.7,.8,.9,.9,
         .1,.2,.3,.3,.3,.3,.7,.8,.9,.9,
          0,.1,.2,.4,.4,.4,.4,.7,.9,.9; // 1 record allocated
}
4
Design of Parallel netCDF APIs
Goal
– Maintain exactly the same original netCDF file format
– Provide parallel I/O functionality on top of MPI-IO
High-level parallel APIs
– Minimize changes to the argument lists of the netCDF APIs
– For legacy codes needing minimal changes
Low-level parallel APIs
– Use MPI-IO components directly, e.g. derived datatypes
– For users experienced with MPI-IO
5
NetCDF File Structure
● Header (dataset definition, extendable)
  – Number of records allocated
  – Dimension list
  – Global attribute list
  – Variable list
● Data (row-major, big-endian, 4-byte aligned)
  – Fixed-size (non-record) data: the data for each variable is stored contiguously in defined order (offset computation sketched below)
  – Record data (non-contiguous between the records of a variable): a variable number of fixed-size records, each of which contains one record for each of the record variables in defined order
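Because fixed-size variables are laid out contiguously in definition order right after the header, a variable's starting offset can be computed from the header metadata alone. The following is only a rough sketch of that calculation, using a simplified, hypothetical metadata struct rather than the actual netCDF library internals:

#include <stddef.h>

/* Hypothetical, simplified metadata -- not the real netCDF internals. */
struct var_meta {
    size_t nelems;      /* product of the variable's dimension lengths */
    size_t type_size;   /* bytes per element (e.g. 4 for int/float)    */
};

/* Start offset of fixed-size variable `idx`, given the header size and
 * the variables in definition order; each variable is stored
 * contiguously and padded to a 4-byte boundary. */
static size_t fixed_var_offset(size_t header_size,
                               const struct var_meta *vars, int idx)
{
    size_t off = header_size;
    for (int i = 0; i < idx; i++) {
        size_t bytes = vars[i].nelems * vars[i].type_size;
        off += (bytes + 3) & ~(size_t)3;   /* round up to 4-byte alignment */
    }
    return off;
}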
6
NetCDF APIs
Dataset APIs -- create/open/close a dataset, switch the dataset between define and data mode, and synchronize dataset changes to disk
  Input: path and mode for create/open; dataset ID for an opened dataset
  Output: dataset ID for create/open
Define mode APIs -- define the dataset: add dimensions and variables
  Input: opened dataset ID; dimension name and length to define a dimension; or variable name, number of dimensions, and shape to define a variable
  Output: dimension ID; or variable ID
Attribute APIs -- add, change, and read attributes of datasets
  Input: opened dataset ID; attribute number or attribute name to access an attribute; or attribute name, type, and value to add/change an attribute
  Output: attribute value for a read attribute
Inquiry APIs -- inquire dataset metadata (in memory): dim (id, name, len), var (name, ndims, shape, id)
  Input: opened dataset ID; dimension name or ID, or variable name or ID
  Output: dimension info, or variable info
Data mode APIs -- read/write a variable (access methods: single value, whole array, subarray, strided subarray, sampled subarray)
  Input: opened dataset ID; variable ID; element start index, count, stride, index map
7
Design of Parallel APIs
Two file descriptors
– NetCDF file descriptor: for header I/O (reuse of old code); performed only by process 0
– MPI_File handle: for data array I/O; performed by all processes
Implicit MPI file handle and communicator
– Added to the internal data structure
– MPI communicator passed as an argument to create/open
I/O implementation using MPI-IO
– File views and offsets are computed from the metadata in the header and the user-provided arguments (start, count, stride); see the sketch below
– Users choose either collective or non-collective I/O calls
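To illustrate the file-view point, the sketch below turns a (start, count) request on a fixed-size 3-D integer variable into an MPI-IO file view and a collective read. It is only a sketch under stated assumptions (var_offset, gsizes, starts, and counts are assumed inputs derived from the header and user arguments), not the project's actual implementation:

#include <mpi.h>

/* Sketch: collectively read a subarray of a 3-D integer variable.
 * var_offset is the variable's byte offset in the file (from the header),
 * gsizes the global array shape, starts/counts the per-process subarray. */
static int read_subarray(MPI_File fh, MPI_Offset var_offset,
                         const int gsizes[3], const int starts[3],
                         const int counts[3], int *buf)
{
    MPI_Datatype filetype;
    MPI_Type_create_subarray(3, (int *)gsizes, (int *)counts, (int *)starts,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    /* The file view skips the header and preceding variables via var_offset. */
    MPI_File_set_view(fh, var_offset, MPI_INT, filetype, "native",
                      MPI_INFO_NULL);
    int err = MPI_File_read_all(fh, buf, counts[0] * counts[1] * counts[2],
                                MPI_INT, MPI_STATUS_IGNORE);
    MPI_Type_free(&filetype);
    return err;
}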
8
Collective/Non-collective APIs
Dataset APIs
– Collective calls over the communicator passed into the create or open call
– All processes collectively switch between define and data mode
Define mode, attribute, and inquiry APIs
– Collective or non-collective calls
– Operate in local memory (all processes have identical header structures)
Data mode APIs
– Collective or non-collective calls
– Access methods: single value, whole array, subarray, strided subarray
9
Changes in High-level Parallel APIs

Original netCDF API            Parallel API                 Argument change   Needs MPI-IO
Dataset
  nc_create                    nc_create                    Add MPI_Comm      yes
  nc_open                      nc_open                      Add MPI_Comm      yes
  nc_enddef, nc_redef,         (same)                       No change         yes
  nc_close, nc_sync
Define mode, Attribute, Inquiry
  all                          (same)                       No change         no
Data mode
  nc_put_var_<type>*           nc_put_var_<type>,           No change         yes
                               nc_put_var_<type>_all
  nc_get_var_<type>*           nc_get_var_<type>,           No change         yes
                               nc_get_var_<type>_all

* <type> = text | uchar | schar | short | int | long | float | double
10
Example Code - Write
Create a dataset
– Collective
– The input arguments should be the same on all processes
– The returned ncid differs among processes (but refers to the same dataset)
– All processes are put in define mode
Define dimensions
– Non-collective
– All processes should have the same definitions
Define variables
– Non-collective
– All processes should have the same definitions
Add attributes
– Non-collective
– All processes should put the same attributes
End define
– Collective
– All processes switch from define mode to data mode
Write variable data
– All processes do a number of collective writes to write the data for each variable
– Independent writes can be used instead, if desired
– Each process provides different argument values, which are set locally
Close the dataset
– Collective

/* create the dataset -- the added MPI communicator is the only change from serial netCDF */
status = nc_create(comm, "test.nc", NC_CLOBBER, &ncid);

/* dimensions */
status = nc_def_dim(ncid, "x", 100L, &dimid1);
status = nc_def_dim(ncid, "y", 100L, &dimid2);
status = nc_def_dim(ncid, "z", 100L, &dimid3);
status = nc_def_dim(ncid, "time", NC_UNLIMITED, &udimid);
square_dim[0] = cube_dim[0] = xytime_dim[1] = dimid1;
square_dim[1] = cube_dim[1] = xytime_dim[2] = dimid2;
cube_dim[2] = dimid3;
xytime_dim[0] = udimid;
time_dim[0] = udimid;

/* variables */
status = nc_def_var(ncid, "square", NC_INT, 2, square_dim, &square_id);
status = nc_def_var(ncid, "cube", NC_INT, 3, cube_dim, &cube_id);
status = nc_def_var(ncid, "time", NC_INT, 1, time_dim, &time_id);
status = nc_def_var(ncid, "xytime", NC_INT, 3, xytime_dim, &xytime_id);

/* attributes */
status = nc_put_att_text(ncid, NC_GLOBAL, "title", strlen(title), title);
status = nc_put_att_text(ncid, square_id, "description", strlen(desc), desc);

status = nc_enddef(ncid);

/* variable data (collective writes) */
nc_put_vara_int_all(ncid, square_id, square_start, square_count, buf1);
nc_put_vara_int_all(ncid, cube_id, cube_start, cube_count, buf2);
nc_put_vara_int_all(ncid, time_id, time_start, time_count, buf3);
nc_put_vara_int_all(ncid, xytime_id, xytime_start, xytime_count, buf4);

status = nc_close(ncid);
11
Example Code - Read
Open the dataset
– Collective
– The input arguments should be the same on all processes
– The returned ncid differs among processes (but refers to the same dataset)
– All processes are put in data mode
Dataset inquiries
– Non-collective
– Count, name, len, datatype
Read variable data
– All processes do a number of collective reads to read the data of each variable in a (B, *, *) manner
– Independent reads can be used instead, if desired
– Each process provides different argument values, which are set locally
Close the dataset
– Collective

/* open the dataset -- the added MPI communicator is the only change from serial netCDF */
status = nc_open(comm, filename, 0, &ncid);
status = nc_inq(ncid, &ndims, &nvars, &ngatts, &unlimdimid);

/* global attributes */
for (i = 0; i < ngatts; i++) {
    status = nc_inq_attname(ncid, NC_GLOBAL, i, name);
    status = nc_inq_att(ncid, NC_GLOBAL, name, &type, &len);
    status = nc_get_att_text(ncid, NC_GLOBAL, name, valuep);
}

/* variables */
for (i = 0; i < nvars; i++) {
    status = nc_inq_var(ncid, i, name, vartypes+i, varndims+i, vardims[i], varnatts+i);
    /* variable attributes */
    for (j = 0; j < varnatts[i]; j++) {
        status = nc_inq_attname(ncid, varids[i], j, name);
        status = nc_inq_att(ncid, varids[i], name, &type, &len);
        status = nc_get_att_text(ncid, varids[i], name, valuep);
    }
}

/* variable data, partitioned on the first dimension: (B, *, *) */
for (i = 0; i < NC_MAX_VAR_DIMS; i++)
    start[i] = 0;
for (i = 0; i < nvars; i++) {
    varsize = 1;
    /* dimensions */
    for (j = 0; j < varndims[i]; j++) {
        status = nc_inq_dim(ncid, vardims[i][j], name, shape + j);
        if (j == 0) {
            shape[j] /= nprocs;
            start[j] = shape[j] * rank;
        }
        varsize *= shape[j];
    }
    status = nc_get_vara_int_all(ncid, i, start, shape, (int *)valuep);
}

status = nc_close(ncid);
12
Non-contiguous Data Access on PVFS
Problem definition
Design approaches
– Multiple I/O
– Data sieving
– PVFS list_io
Integration into MPI-IO
Experimental results
– Artificial benchmark
– FLASH application I/O
– Tile visualization
13
Non-contiguous Data Access
Data access that is not adjacent in memory or in the file (see the datatype sketch below)
– Non-contiguous in memory, contiguous in file
– Non-contiguous in file, contiguous in memory
– Non-contiguous in file, non-contiguous in memory
Two applications
– FLASH astrophysics application
– Tile visualization
(Figure: the three cases of non-contiguous access between memory and file.)
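In MPI-IO such patterns are typically described with derived datatypes (one datatype for the memory layout, one for the file layout) rather than as many separate requests. A minimal sketch, assuming a regularly strided pattern:

#include <mpi.h>

/* Sketch: describe a strided, non-contiguous region (nblocks blocks of
 * blocklen ints, with a distance of stride ints between block starts)
 * as a single derived datatype.  The same datatype can describe either
 * the memory buffer or, via MPI_File_set_view, the file layout. */
static MPI_Datatype strided_type(int nblocks, int blocklen, int stride)
{
    MPI_Datatype t;
    MPI_Type_vector(nblocks, blocklen, stride, MPI_INT, &t);
    MPI_Type_commit(&t);
    return t;
}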
14
Multiple I/O Requests
Intuitive strategy
– One I/O request per contiguous data segment (sketched below)
Large number of I/O requests to the file system
– Communication costs between the application and the I/O servers become significant and can dominate the I/O time
(Figure: each contiguous data region in the application generates a separate I/O request to the I/O servers; the file is accessed with a first and a second read request.)
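At the file-system level the naive strategy amounts to a loop like the one below. This is a hedged POSIX-level sketch (not the PVFS client code), with the memory buffer assumed contiguous: every contiguous file segment becomes its own request, and hence its own round trip to the servers.

#include <unistd.h>     /* pread */
#include <sys/types.h>

/* Sketch: one I/O request per contiguous file segment.  Each pread()
 * is a separate request to the underlying file system. */
static ssize_t multiple_io_read(int fd, char *buf,
                                const off_t *file_offsets,
                                const size_t *lengths, int nsegments)
{
    ssize_t total = 0;
    for (int i = 0; i < nsegments; i++) {
        ssize_t got = pread(fd, buf + total, lengths[i], file_offsets[i]);
        if (got < 0)
            return -1;
        total += got;
    }
    return total;
}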
15
Data Sieving I/O
Read a contiguous chunk from the file into a temporary buffer
Extract/update the requested portions (read side sketched below)
– Number of requests reduced
– I/O amount increased
– Number of I/O requests depends on the size of the sieving buffer
Write back to the file (for write operations)
(Figure: several contiguous data regions are covered by a first and a second I/O request to the I/O servers.)
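A simplified sketch of the read side of data sieving (the real ROMIO implementation also caps the sieving buffer size and, for writes, performs a read-modify-write): read one chunk that spans the requested segments, then copy the wanted pieces out.

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

/* Sketch: read one contiguous chunk covering all requested segments,
 * then extract the requested pieces into a contiguous user buffer.
 * Assumes offsets[] is sorted in ascending order. */
static int sieve_read(int fd, char *user_buf,
                      const off_t *offsets, const size_t *lengths,
                      int nsegments)
{
    if (nsegments == 0)
        return 0;
    off_t first = offsets[0];
    off_t last  = offsets[nsegments - 1] + (off_t)lengths[nsegments - 1];
    size_t span = (size_t)(last - first);

    char *sieve_buf = malloc(span);           /* temporary sieving buffer */
    if (!sieve_buf)
        return -1;
    if (pread(fd, sieve_buf, span, first) < 0) {   /* one large request */
        free(sieve_buf);
        return -1;
    }
    size_t copied = 0;
    for (int i = 0; i < nsegments; i++) {     /* extract requested pieces */
        memcpy(user_buf + copied, sieve_buf + (offsets[i] - first),
               lengths[i]);
        copied += lengths[i];
    }
    free(sieve_buf);
    return 0;
}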
16
PVFS List_io
Combine non-contiguous I/O requests into a single request
Client support
– APIs pvfs_list_read, pvfs_list_write
– I/O request -- a list of file offsets and file lengths
I/O server support
– Wait for the trailing list of file offsets and lengths following the I/O request
17
Artificial Benchmark
Contiguous in memory, non-contiguous in file
Parameters:
– Number of accesses
– Number of processors
– Stride size = file size / number of accesses
– Block size = stride size / number of processors
(Figure: each process accesses one block within every stride of the file; 4 accesses shown for processes 0, 1, and 2.)
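The per-process access pattern follows directly from these parameters; a small sketch of how the benchmark's file offsets could be generated (the function and variable names are ours, not the benchmark's):

#include <stdio.h>

/* Sketch: each process performs `naccesses` block accesses; block i of
 * process `rank` starts at i*stride + rank*block, where
 * stride = file_size / naccesses and block = stride / nprocs. */
static void print_offsets(long file_size, int naccesses,
                          int nprocs, int rank)
{
    long stride = file_size / naccesses;
    long block  = stride / nprocs;
    for (int i = 0; i < naccesses; i++)
        printf("access %d: offset %ld, length %ld\n",
               i, (long)i * stride + (long)rank * block, block);
}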
18
Benchmark Results
Parameter configurations
– 8 clients
– 8 I/O servers
– 1 Gigabyte file size
(Charts: read time versus number of accesses for multiple I/O, data sieving, and list_io; write time versus number of accesses for multiple I/O and list_io.)
Avoiding caching effects at the I/O servers
– Read/write 4 files alternately, since each I/O server has 512 MB of memory
19
FLASH Application
An astrophysics application developed at the University of Chicago
– Simulates the accretion of matter onto a compact star and the subsequent stellar evolution, including nuclear burning either on the surface of the compact star or in its interior
The I/O benchmark measures the performance of the FLASH output: it produces checkpoint files and plot-files
– A typical large production run generates ~0.5 TBytes (100 checkpoint files and 1,000 plot-files)
(Image: the interior of an exploding star, depicting the distribution of pressure during a star explosion.)
20
FLASH -- I/O Access Pattern
Each processor has 80 cubes
– Each has guard cells and a sub-cube which holds the data to be output
Each element in the cube contains 24 variables, each of type double (8 bytes)
– Each variable is partitioned among all processors
Output pattern
– All variables are saved into a single file, one after another
21
FLASH I/O Results
Access patterns:
In memory
– Each contiguous segment is small, 8 bytes
– The stride between two segments is small, 192 bytes
From memory to file
– Multiple I/O: 8*8*8*80*24 = 983,040 requests per processor
– Data sieving: 24 requests per processor
– List_io: 8*8*8*80*24/64 = 15,360 requests per processor (64 is the maximum number of offset-length pairs)
In file
– Each contiguous segment written by a processor is of size 8*8*8*8 = 4096 bytes (an 8x8x8 sub-cube of one 8-byte variable)
– The output file is of size 8 MB * number of processors
22
Tile Visualization
Preprocess "frames" into streams of tiles by staging tile data on visualization nodes
– Read operations only
– Each node reads one sub-tile
– Each sub-tile has ghost regions overlapped with other sub-tiles
– The non-contiguous nature of this file access becomes apparent in its logical file representation (offset list sketched below)
Example layout
– 3x2 display (tiles 1-6)
– Frame size of 2532x1408 pixels
– Tile size of 1024x768 with overlap
– 3-byte RGB pixels
– Each frame is stored as a file of size 10 MB
(Figure: a single node's file view; the sub-tile rows of processes 0, 1, and 2 interleave in the frame file.)
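The non-contiguity comes from reading a 2-D sub-tile out of a row-major RGB frame: each sub-tile row is one contiguous segment in the frame file. A sketch of the resulting offset/length list (the coordinate parameters are illustrative assumptions, not the project's code):

#include <stddef.h>

/* Sketch: build the (offset, length) list for one node's sub-tile in a
 * row-major RGB frame file.  frame_w is the frame width in pixels,
 * (tile_x, tile_y) the sub-tile's top-left pixel, tile_w x tile_h its
 * size; every pixel is 3 bytes.  One contiguous segment per tile row,
 * so offsets[] and lengths[] must hold tile_h entries. */
static void tile_segments(int frame_w, int tile_x, int tile_y,
                          int tile_w, int tile_h,
                          long *offsets, long *lengths)
{
    for (int row = 0; row < tile_h; row++) {
        offsets[row] = ((long)(tile_y + row) * frame_w + tile_x) * 3L;
        lengths[row] = (long)tile_w * 3L;
    }
}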
23
Integrate List_io into ROMIO
– ROMIO uses the internal ADIO function "flatten" to break both the filetypes and datatypes down into lists of offset and length pairs
– Using these lists, ROMIO steps through both the file and memory addresses
– ROMIO generates memory and file offsets and lengths to pass to pvfs_list_io
– ROMIO calls pvfs_list_io after all data has been read, or when the set maximum array size has been reached, in which case a new list is generated (batching sketched below)
(Figure: filetype offsets & lengths and datatype offsets & lengths are paired up in the call pvfs_read_list(memory offsets/lengths, file offsets/lengths).)
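A simplified sketch of that batching step is shown below. The pvfs_read_list prototype here is assumed for illustration only (the real PVFS list-I/O interface differs), the maximum list length of 64 pairs matches the FLASH numbers given earlier, and memory and file segments are assumed to pair up one-to-one:

/* Hypothetical prototype, for illustration only -- not the actual PVFS API. */
int pvfs_read_list(int fd,
                   int mem_count,  char *mem_offsets[], int *mem_lengths,
                   int file_count, long *file_offsets,  int *file_lengths);

#define MAX_LIST 64   /* assumed maximum offset-length pairs per call */

/* Sketch: walk the flattened memory and file offset/length lists in
 * chunks of at most MAX_LIST pairs, issuing one list-I/O call per chunk. */
static int batched_list_read(int fd, char **mem_off, int *mem_len,
                             long *file_off, int *file_len, int npairs)
{
    for (int done = 0; done < npairs; done += MAX_LIST) {
        int n = (npairs - done < MAX_LIST) ? npairs - done : MAX_LIST;
        if (pvfs_read_list(fd, n, mem_off + done, mem_len + done,
                           n, file_off + done, file_len + done) < 0)
            return -1;
    }
    return 0;
}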
24
Tile I/O Results
(Charts: accumulated time versus number of I/O nodes (4, 8, 12, 16) for 4, 8, and 16 compute nodes and frame sizes of 40 MB and 108 MB, comparing collective data sieving, collective read_list, non-collective data sieving, and non-collective read_list.)
25
Analysis of Tile I/O Results
Collective operations theoretically should be faster, but ...
Hardware problem
– Fast Ethernet: the overhead of the collective I/O takes too long to catch back up with the independent I/O requests
Software problem
– A lot of extra data movement in the ROMIO collectives -- the aggregation isn't as smart as it could be
Plans
– Use the MPE logging facilities to figure out the problem
– Study the ROMIO implementation, find bottlenecks in the collectives, and try to weed them out
26
High-level Data Access Patterns
Study of the file access patterns of astrophysics applications
– FLASH from the University of Chicago
– ENZO from NCSA
Design of a data management framework using XML and a database
– Essential metadata collection
– Trigger rules for automatic I/O optimization
27
ENZO Application
Simulates the formation of a cluster of galaxies, starting near the big bang and continuing until the present day
It is used to test theories of how galaxies form by comparing the results with what is actually observed in the sky today
File I/O using HDF-4
Dynamic load balancing using MPI
Data partitioning: Adaptive Mesh Refinement (AMR)
28
AMR Data Access Pattern
Adaptive Mesh Refinement partitions the problem domain into sub-domains recursively and dynamically
A grid is owned by exactly one processor, but one processor can own many grids
Check-pointing
– Each grid is written to a separate file (independent writes)
During re-start
– The sub-domain hierarchy need not be re-constructed
– Grids at the same time stamp are read all together
During visualization
– All grids are combined into a top grid
29
AMR Hierarchy Represented in XML
The AMR hierarchy maps naturally onto an XML hierarchy
The XML is embedded in a relational database
Metadata queries/updates go through the database
The database can handle multiple queries simultaneously -- ideal for parallel applications
(Figure: an example grid.xml document mapped to a relational table whose rows hold element/attribute nodes with key, type, name, value, and parent-key columns, e.g. Grid, GridRank, level, DataSet, Producer = "ENZO".)
30
File System Based XML
The file system is used to support the decomposition of XML documents into files and directories
This representation consists of an arbitrary hierarchy of directories and files; it preserves the XML philosophy of being textual in representation, but requires no further use of an XML parser to process the document
Metadata is located near the scientific data
31
Summary
List_io API incorporated into PVFS for non-contiguous data access
– Read operation is completed
– Write operation is in progress
Parallel netCDF APIs
– High-level APIs --- will be completed soon
– Low-level APIs --- interfaces already defined
– Validator
High-level data access patterns
– Access patterns of AMR applications
– Other types of applications