
1 DoE I/O characterizations, infrastructures for wide-area collaborative science, and future opportunities
Jeffrey S. Vetter, Micah Beck, Philip Roth
Future Technologies Group @ ORNL
University of Tennessee, Knoxville

2 NCCS Resources (September 2005 Summary)
Seven systems totaling 7,622 CPUs, 16 TB of memory, and 45 TFlops: Cray XT3 Jaguar (5,294 processors at 2.4 GHz, 11 TB memory), Cray X1E Phoenix (1,024 processors, 2 TB memory), SGI Altix Ram (256 processors at 1.5 GHz, 2 TB memory), IBM SP4 Cheetah (864 processors at 1.3 GHz, 1.1 TB memory), IBM Linux NSTG cluster (56 processors at 3 GHz, 76 GB memory), and a visualization cluster (128 processors at 2.2 GHz, 128 GB memory). Storage: 238.5 TB of disk (including 120 TB of shared disk) and 5 PB of IBM HPSS backup storage supporting many device types. Also: a scientific visualization lab with a 27-projector power wall, network routers (UltraScience, 10 GigE, 1 GigE control network), test systems (96-processor Cray XT3, 32-processor Cray X1E, 16-processor SGI Altix), and evaluation platforms (144-processor Cray XD1 with FPGAs, SRC MapStation, ClearSpeed, BlueGene at ANL).

3 Implications for Storage, Data, Networking
• Decommissioning of important GPFS-based storage systems
– IBM SP3 decommissioned in June
– IBM p690 cluster disabled on October 4
• Next-generation systems bring new storage architectures
– Lustre on the 25 TF XT3 (lightweight kernel with a limited set of OS services)
– ADIC StorNext on the 18 TF X1E
• Visualization and analysis systems
– SGI Altix, visualization cluster, 35-megapixel power wall
• Networking
– Connected to ESnet, Internet2, TeraGrid, UltraNet, etc.
– ORNL internal network being upgraded now
• End-to-end solutions underway – see Klasky's talk

4 Initial Lustre Performance on Cray XT3
• The acceptance criterion for the XT3 is 5 GB/s of I/O bandwidth
– A 32-OSS, 64-OST configuration hit 6.7 GB/s write and 6.2 GB/s read
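
Aggregate figures like these are typically measured with a simple parallel microbenchmark. The sketch below is a minimal shared-file version in C with MPI-IO, not the actual XT3 acceptance test; the file path and the 64 MB per-rank transfer size are assumptions for illustration.

```c
/* Minimal aggregate-write-bandwidth sketch (not the actual XT3 acceptance test).
 * Each rank writes BLOCK bytes to its own offset in one shared file with
 * collective MPI-IO; aggregate BW = total bytes / slowest rank's elapsed time. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK (64 * 1024 * 1024)   /* 64 MB per rank: an assumed transfer size */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(BLOCK);
    for (int i = 0; i < BLOCK; i++) buf[i] = (char)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/lustre/scratch/bw_test.dat",  /* hypothetical path */
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_File_write_at_all(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK,
                          MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);             /* include the close so data reaches the servers */
    double elapsed = MPI_Wtime() - t0, slowest;
    MPI_Reduce(&elapsed, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("aggregate write BW: %.2f GB/s\n",
               (double)nprocs * BLOCK / slowest / 1e9);

    free(buf);
    MPI_Finalize();
    return 0;
}
```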

5 Preliminary I/O Survey (Spring 2005)
GYRO 3.0.0
– Read input files: Fortran I/O; rank 0; < 1 MB; at initialization
– Write checkpoint files: Fortran I/O and MPI-IO; rank 0; 87.5 MB; once per 1,000 time-steps
– Write logging/debug files: Fortran I/O; rank 0; ~150 KB; file-dependent
POP (standalone) 1.4.3 / 2.0
– Read input files: Fortran I/O, NetCDF (2.0 only); parallel and rank-0 schemes; at initialization
– Read forcing files: Fortran I/O, NetCDF; parallel and rank-0 schemes; every few time-steps
– Write 3-D field files: Fortran I/O, NetCDF; parallel and rank-0 schemes; 1.4 GB; several per simulation-month
CAM (standalone) 3.0
– Read input files: NetCDF; rank 0; ~300 MB; at initialization
– Write checkpoint files: rank 0; once per simulation-day
– Write output files: NetCDF; rank 0; ~110 MB; at termination
AORSA2D
– Read input files: Fortran I/O; all ranks; ~26 MB; at initialization
– Write output files: Fortran I/O; rank 0; ~10 MB; at termination
TSI
– Read input files: Fortran I/O; rank 0; at initialization
– Write timestep files: Fortran I/O and MPI-IO (VH-1); all ranks, post-processed into a single NetCDF file; 28 GB per timestep; hundreds per run
Need I/O access patterns.
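
One low-overhead way to capture such access patterns is to interpose on the POSIX layer that Fortran I/O, MPI-IO, and the NetCDF libraries ultimately call. The sketch below is a minimal LD_PRELOAD shim that logs the size of every write(); it is an illustration of the technique, not a tool used in this survey.

```c
/* io_trace.c -- minimal LD_PRELOAD shim that logs the size of every write().
 * Illustrative sketch only, not a tool referenced in the survey above.
 *   build: gcc -shared -fPIC -o io_trace.so io_trace.c -ldl
 *   run:   LD_PRELOAD=./io_trace.so ./app
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

static ssize_t (*real_write)(int, const void *, size_t);

ssize_t write(int fd, const void *buf, size_t count)
{
    if (!real_write)   /* resolve the real libc write() once */
        real_write = (ssize_t (*)(int, const void *, size_t))
                     dlsym(RTLD_NEXT, "write");

    ssize_t n = real_write(fd, buf, count);

    /* Emit the log record with the real write() on stderr, so the shim does
     * not recurse through stdio; skip the application's own stderr traffic. */
    if (fd != STDERR_FILENO) {
        char line[96];
        int len = snprintf(line, sizeof line,
                           "write fd=%d size=%zu ret=%zd\n", fd, count, n);
        if (len > 0)
            real_write(STDERR_FILENO, line, (size_t)len);
    }
    return n;
}
```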

6 Initial Observations
• Most users have limited I/O capability because the libraries and runtime systems are inconsistent across platforms
– Little use of [Parallel] NetCDF or HDF5
– Seldom direct use of MPI-IO
• Widely varying file size distribution: 1 MB, 10 MB, 100 MB, 1 GB, 10 GB
• Comments from the community
– POP: baseline parallel I/O works; not clear the new decomposition scheme is easily parallelized
– TSI: would like to use higher-level libraries (Parallel NetCDF, HDF5), but they are not implemented or perform poorly on the target architectures
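
As an illustration of the higher-level route mentioned above, a collective Parallel NetCDF write of a single 3-D field might look like the following sketch; the variable name, dimensions, and one-dimensional decomposition are invented for the example, and error checking is omitted.

```c
/* Sketch of a collective Parallel NetCDF (PnetCDF) write of one 3-D field.
 * Variable name, dimensions, and the 1-D decomposition along z are invented
 * for illustration; error checking omitted for brevity. */
#include <mpi.h>
#include <pnetcdf.h>

void write_field(MPI_Comm comm, const double *rho,
                 MPI_Offset nx, MPI_Offset ny, MPI_Offset nz_local)
{
    int rank, nprocs, ncid, dimids[3], varid;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* One shared, portable file created collectively. */
    ncmpi_create(comm, "timestep_0001.nc", NC_CLOBBER | NC_64BIT_OFFSET,
                 MPI_INFO_NULL, &ncid);

    ncmpi_def_dim(ncid, "z", nz_local * nprocs, &dimids[0]);
    ncmpi_def_dim(ncid, "y", ny, &dimids[1]);
    ncmpi_def_dim(ncid, "x", nx, &dimids[2]);
    ncmpi_def_var(ncid, "density", NC_DOUBLE, 3, dimids, &varid);
    ncmpi_enddef(ncid);

    /* Each rank owns a contiguous slab of z planes and writes it collectively. */
    MPI_Offset start[3] = { rank * nz_local, 0, 0 };
    MPI_Offset count[3] = { nz_local, ny, nx };
    ncmpi_put_vara_double_all(ncid, varid, start, count, rho);

    ncmpi_close(ncid);
}
```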

7 Preliminary I/O Performance of VH-1 on X1E using Different I/O Strategies

8 Learning from Experience with VH-1
• The fast way to write (today)
– Each rank writes a separate architecture-specific file: native-mode MPI-IO, file per rank, O(1000) MB/s
– Manipulating a dataset in this form is awkward
• Two solutions
– Write one portable file using collective I/O: native-mode MPI-IO to a single file, < 100 MB/s (no external32 support on Phoenix)
– Post-process sequentially, rewriting in a portable format
– In either case the computing platform does a lot of work just to generate a convenient metadata structure in the file system
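
The two write paths discussed above can be sketched as follows, assuming one contiguous block of doubles per rank; this is an illustration of the trade-off, not VH-1's actual I/O code.

```c
/* Two MPI-IO write paths for a per-rank buffer `buf` of `count` doubles.
 * A sketch of the trade-off discussed above, not VH-1's actual I/O code. */
#include <mpi.h>
#include <stdio.h>

/* Fast path: one architecture-specific file per rank, native data representation. */
void write_file_per_rank(const double *buf, int count, int rank)
{
    char name[64];
    MPI_File fh;
    snprintf(name, sizeof name, "dump_rank%05d.dat", rank);
    MPI_File_open(MPI_COMM_SELF, name, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_write(fh, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}

/* Portable path: one shared file written collectively; "external32" requests a
 * canonical, architecture-independent encoding, while "native" keeps the fast
 * machine-specific layout. */
void write_single_file(const double *buf, int count, int rank, const char *datarep)
{
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "dump_all.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset disp = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_set_view(fh, disp, MPI_DOUBLE, MPI_DOUBLE, datarep, MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```

Where external32 is supported, calling write_single_file(buf, n, rank, "external32") produces one portable file directly; on a system like Phoenix without it, the choice is between the fast native file-per-rank path and sequential post-processing into a portable format.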

9 Future Opportunities

10 Future Directions
• Exposing structure to applications
• High performance and fault tolerance in data-intensive scientific work
• Analysis and benchmarking of data-intensive scientific codes
• Advanced object storage devices
• Adapting parallel I/O and parallel file systems to wide-area collaborative science

11 Exposing File Structure to Applications
• The VH-1 experience motivates several issues
• Expensive compute platforms perform I/O optimized to their own architecture and configuration
– Metadata describes organization and encoding, using a portable schema
– Processing is performed between writer and reader
• File systems are local managers of structure and metadata
– File-description metadata can be managed outside a file system
– In some cases it may be possible to bypass file systems and access object storage devices directly from the application
– Exposing resources enables application autonomy

12 High Performance and Fault Tolerance in Data-Intensive Scientific Work
• In complex workflows, performance and fault tolerance are defined in terms of global results, not local ones
• Logistical Networking has proved valuable to application and tool builders
• We will continue collaborating on and investigating the design of SDM tools that incorporate LN functionality
– Caching and distribution of massive datasets
– Interoperation with GridFTP-based infrastructure
– Robustness through the use of widely distributed resources

13 Science Efforts Currently Leveraging Logistical Networking
• Storing and forwarding at an intermediate point reduces the latency seen by TCP flows on long-haul transfers
– Producer and consumer are buffered, and the transfer is accelerated
• Data movement within the Terascale Supernova Initiative and Fusion Energy Simulation projects
• Storage model for an implementation of the SRM interface for the Open Science Grid project at Vanderbilt's ACCRE
• AmericaView distribution of MODIS satellite data to a national audience

14 Continue Analysis and Benchmarking of Data-Intensive Scientific Codes
• Scientists may (and should) have an abstract idea of their codes' I/O characteristics, especially access patterns and access sizes, but these are sensitive to systems and libraries
• Instrumentation at the source level is difficult and may be misleading
• Standard benchmarks and tools must be developed as a basis for comparison (see the sketch after this list)
– Some tools exist: ORNL's mpiP provides statistics about MPI-IO at runtime
– I/O performance metrics should be routinely collected
– Parallel I/O behavior should be accessible to users
– Non-determinism in performance must be addressed
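
For a flavor of how such tools gather their statistics, mpiP works through the MPI profiling interface (PMPI); the fragment below wraps one MPI-IO call to accumulate per-rank call counts, bytes, and time. It is an illustrative sketch, not mpiP's implementation.

```c
/* Sketch of MPI profiling-interface (PMPI) instrumentation for one MPI-IO call.
 * Link this ahead of the MPI library; it illustrates the technique and is not
 * mpiP's actual implementation. */
#include <mpi.h>
#include <stdio.h>

static long      nwrites;        /* number of collective writes seen */
static double    write_seconds;  /* total time spent in them */
static long long write_bytes;    /* total bytes requested */

int MPI_File_write_all(MPI_File fh, const void *buf, int count,
                       MPI_Datatype datatype, MPI_Status *status)
{
    int size;
    MPI_Type_size(datatype, &size);

    double t0 = MPI_Wtime();
    int rc = PMPI_File_write_all(fh, buf, count, datatype, status);
    write_seconds += MPI_Wtime() - t0;
    write_bytes   += (long long)count * size;
    nwrites++;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Per-rank summary; a real tool would aggregate across ranks first. */
    fprintf(stderr, "rank %d: %ld collective writes, %lld bytes, %.3f s\n",
            rank, nwrites, write_bytes, write_seconds);
    return PMPI_Finalize();
}
```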

15 Move Toward Advanced Object Storage Devices
• OSD is an emerging trend in file and storage systems (e.g., Lustre, Panasas)
• The current architecture is an evolutionary step from SCSI
– Advanced forms of OSD that can be accessed directly by multiple users need to be developed
– Active OSD technology may enable pre- and post-processing at network storage nodes to offload hosts
• These nodes must fit into a larger middleware framework and workflow scheme

16 Adapt Parallel I/O and Parallel File Systems to Wide-Area Collaborative Science
• The emergence of massive digital libraries and shared workspaces requires common tools that provide more control than file transfer or distributed file system solutions
• Direct control over tape, wide-area transfer, and local caching is an important element of application optimization
• New standards are required for expressing file system concepts interoperably in a heterogeneous wide-area environment

17 Enable Uniform Component Coupling
• Major components of scientific workflows interact through asynchronous file I/O interfaces (see the sketch below)
– The granularity was traditionally the output of a complete run
– Today, as in Klasky and Bhat's data streaming, the granularity is one time step because of the increased scale of computation
• Flexible management of state is required for customization of component interactions
– Localization (e.g., caching)
– Fault tolerance (e.g., redundancy)
– Optimization (e.g., point-to-multipoint)
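
A minimal sketch of this file-based, per-timestep coupling pattern follows: a downstream component polls for a marker file written by the producer when each timestep file is complete. The file names and the ".done" marker convention are assumptions for illustration, not part of any of the systems named above.

```c
/* Sketch of file-based component coupling at per-timestep granularity: the
 * consumer waits for a ".done" marker per timestep before processing the
 * corresponding data file. Names and marker convention are invented. */
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

static int file_exists(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0;
}

int main(void)
{
    for (int step = 1; step <= 100; step++) {
        char data[64], done[64];
        snprintf(data, sizeof data, "timestep_%04d.dat", step);
        snprintf(done, sizeof done, "timestep_%04d.done", step);

        /* Wait until the producer signals that this timestep is complete. */
        while (!file_exists(done))
            sleep(5);

        printf("consuming %s\n", data);
        /* ... read and post-process the timestep file here ... */
    }
    return 0;
}
```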

18 Questions?

19 Bonus Slides

20 Important Attributes of a Common Interface
• Primitive and generic, in order to serve many purposes
– Sufficient to implement application requirements
• Well layered, in order to allow for diversity
– Not imposing the costs of complex higher layers on users of the lower-layer functionality
• Easily ported to new platforms and widely acceptable within the developer community
– Who are the designers? What is the process?

21 Developer Conversations / POP
• Parallel Ocean Program
– Already has a working parallel I/O scheme
– An initial look at MPI-IO seemed to indicate an impedance mismatch with POP's decomposition scheme
– The NetCDF option doesn't use parallel I/O
– I/O is a low priority compared to other performance issues

22 Developer Conversations / TSI
• Terascale Supernova Initiative
– Would like to use Parallel NetCDF or HDF5, but they are unavailable or perform poorly on the platform of choice (Cray X1)
– Negligible performance impact of each rank writing an individual timestep file, at least up to 140 PEs
• Closer investigation of VH-1 shows
– The performance impact of writing a file per rank is not negligible
– Major costs are imposed by writing architecture-independent files and by forming a single timestep file
– Parallel file systems address only some of these issues

23 Interfaces Used in Logistical Networking
• On the network side
– Sockets/TCP link clients to servers
– XML metadata schema
• On the client side
– Procedure calls in C/Fortran/Java/…
– Application-layer I/O libraries
• End-user tools
– Command line
– GUI implemented in Tcl

