Scalable Parallel I/O Alternatives for Massively Parallel Partitioned Solver Systems
Jing Fu, Ning Liu, Onkar Sahni, Ken Jansen, Mark Shephard, Chris Carothers
Computer Science Department and Scientific Computation Research Center (SCOREC), Rensselaer Polytechnic Institute
chrisc@cs.rpi.edu
Acknowledgments:
– Partners: Simmetrix, Acusim, Kitware, IBM
– Funding: NSF PetaApps, DOE INCITE, ITR, CTS; DOE SciDAC-ITAPS, NERI; AFOSR
– Industry: IBM, Northrop Grumman, Boeing, Lockheed Martin, Motorola
– Computer resources: TeraGrid, ANL, NERSC, RPI-CCNI
Outline
– Motivating application: CFD
– Blue Gene platforms
– I/O alternatives: POSIX, PMPIO, syncIO, and "reduced blocking" rbIO
– Blue Gene results
– Summary
PHASTA Flow Solver Parallel Paradigm
– Time-accurate, stabilized FEM flow solver
– Input is partitioned on a per-processor basis; unstructured mesh "parts" are mapped to cores
– Two types of work:
  – Equation formation: O(40) peer-to-peer non-blocking communications, overlapping communication with computation; scales well on many machines
  – Implicit, iterative equation solution: the matrix is assembled on-processor ONLY; each Krylov vector is q = Ap (matrix-vector product), requiring the same peer-to-peer communication of q PLUS orthogonalization against prior vectors, which requires norms and hence MPI_Allreduce
– This sets up a cycle of global communications separated by a modest amount of work (see the sketch below)
(Figure: unstructured mesh partitioned into parts on processors P1, P2, P3)
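A minimal sketch of the communication pattern described above, assuming illustrative names (krylov_norm_step, local_matvec, n_local) rather than PHASTA's actual routines:

    /* Sketch: local mat-vec plus the global norm that forces an
       MPI_Allreduce every solver iteration. */
    #include <mpi.h>
    #include <math.h>

    double krylov_norm_step(const double *p, double *q, int n_local,
                            void (*local_matvec)(const double *, double *, int))
    {
        /* q = A p: on-processor assembly; the peer-to-peer exchange of q
           (non-blocking sends/receives) is omitted for brevity. */
        local_matvec(p, q, n_local);

        /* Orthogonalization needs norms: local dot product, then a
           global sum -- the collective that punctuates each iteration. */
        double local_dot = 0.0, global_dot = 0.0;
        for (int i = 0; i < n_local; i++)
            local_dot += q[i] * q[i];
        MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        return sqrt(global_dot);
    }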
Parallel Implicit Flow Solver – Incompressible Abdominal Aorta Aneurysm (AAA)
IBM BG/L at RPI-CCNI:

  Cores (avg. elems./core)   t (secs.)   scale factor
  512   (204,800)            2119.7      1 (base)
  1024  (102,400)            1052.4      1.01
  2048  (51,200)              529.1      1.00
  4096  (25,600)              267.0      0.99
  8192  (12,800)              130.5      1.02
  16384 (6,400)                64.5      1.03
  32768 (3,200)                35.6      0.93

The 32K-part case shows modest degradation due to a 15% node imbalance.
AAA Adapted to 10^9 Elements: Scaling on Blue Gene/P

  # of cores   Rgn imb   Vtx imb   Time (s)   Scaling
  32k          1.72%      8.11%    112.43     0.987
  128k         5.49%     17.85%     31.35     0.885

New: 82% scaling at 294,912 cores. But getting I/O done is a challenge...
Blue Gene/L Layout – CCNI "fen"
– 32K cores / 16 racks
– 12 TB RAM (8 TB usable)
– ~1 PB of disk over GPFS
– Custom OS kernel
Blue Gene/P Layout – ALCF/ANL "Intrepid"
– 163K cores / 40 racks
– ~80 TB RAM
– ~8 PB of disk over GPFS
– Custom OS kernel
Blue Gene/P (vs. BG/L)
Blue Gene I/O Architectures
Blue Gene/L @ CCNI:
– One 2-core I/O node per 32 compute nodes
– The 32K-core system has 512 1-Gbit/sec network interfaces
– I/O nodes connect to 48 GPFS file servers
– Servers 0, 2, 4, and 6 are metadata servers; server 0 also handles RAS and other duties
– 800 TB of storage from 26 IBM DS4200 storage arrays
– Split into 240 LUNs; each server has 10 LUNs (7 @ 1 MB and 3 @ 128 KB)
– Peak bandwidth is ~8 GB/sec read and ~4 GB/sec write
Blue Gene/P @ ALCF:
– Similar I/O-node-to-compute-node ratio
– 128 dual-core file servers over Myrinet with a 4 MB GPFS block size
– Metadata can be handled by any server
– 16 DDN 9900 arrays, 7.5 PB (raw) of storage, with a peak bandwidth of 60 GB/sec
Non-Parallel I/O: A Bad Approach
– Sequential I/O: all processes send their data to rank 0, and rank 0 writes it to the file
(Figure: Data 0 ... Data N-1 on ranks 0 ... N-1 funneled to rank 0, which writes Block 0 ... Block N-1 to the file)
– This lacks scaling and results in excessive memory use on rank 0 (see the sketch below)
– Must think parallel from the start, but that implies data/file partitioning...
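A minimal sketch of this antipattern, assuming equal-sized blocks per rank; the function name and file path are illustrative:

    /* Antipattern: every rank ships its block to rank 0, which writes
       serially. Memory on rank 0 grows with the number of ranks, and
       the writes form one serial stream. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    void write_through_rank0(const char *path, const double *local, int n_local)
    {
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) {
            /* O(nprocs) memory on a single rank */
            double *all = malloc((size_t)nprocs * n_local * sizeof(double));
            MPI_Gather(local, n_local, MPI_DOUBLE,
                       all, n_local, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            FILE *f = fopen(path, "wb");
            fwrite(all, sizeof(double), (size_t)nprocs * n_local, f);
            fclose(f);
            free(all);
        } else {
            /* recv arguments are ignored on non-root ranks */
            MPI_Gather(local, n_local, MPI_DOUBLE,
                       NULL, 0, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        }
    }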
One POSIX File Per Processor (1PFPP)
Pros:
– Parallelism; high performance at small core counts
Cons:
– Lots of small files to manage
– LOTS OF METADATA, which stresses the parallel filesystem
– Difficult to read back data from a different number of processes
– At 300K cores this yields 600K files; at JSC this caused a kernel panic!
– PHASTA currently uses this approach (see the sketch below)
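A sketch of the 1PFPP pattern; the file-naming scheme is an assumption, not PHASTA's actual naming:

    /* 1PFPP sketch: each rank writes its own file, named by rank.
       N ranks -> N files -> N sets of metadata operations. */
    #include <mpi.h>
    #include <stdio.h>

    void write_1pfpp(const double *local, int n_local)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char path[256];
        snprintf(path, sizeof(path), "restart.%d.dat", rank);  /* one file per rank */

        FILE *f = fopen(path, "wb");
        fwrite(local, sizeof(double), (size_t)n_local, f);
        fclose(f);
    }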
New Partitioned Solver Parallel I/O Format
– Assumes data is accessed in a coordinated manner
– File: a master header plus a series of data blocks; each data block has its own header and data
– Example: 4 parts with 2 fields per part
– Allows different processor configurations: (1 core @ 4 parts), (2 cores @ 2 parts), (4 cores @ 1 part)
– Allows 1 to many files to control metadata overheads (see the sketch below)
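A hypothetical sketch of what such a layout could look like; the struct fields and sizes are assumptions, not the actual on-disk format:

    /* Hypothetical master header + per-block headers. */
    #include <stdint.h>

    #define MAX_TAG 32

    struct master_header {
        char     magic[8];         /* format identifier */
        uint32_t num_parts;        /* e.g., 4 parts */
        uint32_t fields_per_part;  /* e.g., 2 fields per part */
        uint64_t block_dir_offset; /* where the table of block offsets starts */
    };

    struct block_header {
        char     tag[MAX_TAG];     /* field name, e.g. "solution" */
        uint32_t part_id;          /* which mesh part this block belongs to */
        uint64_t data_bytes;       /* payload size following this header */
    };

    /* A reader on 1, 2, or 4 cores can seek directly to its parts'
       blocks via the directory, so the same file serves any of the
       processor configurations listed above. */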
MPI_File Alternatives: PMPIO
– PMPIO ("poor man's parallel I/O") from the "silo" mesh and field library
– Divides the application into groups of writers; within a group, only one writer at a time writes to a file
– Passing of a "token" ensures synchronization within a group (see the sketch below)
– Supports the HDF5 file format
– Uses the MPI_File_read_at/MPI_File_write_at routines
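A sketch of the token-passing idea; this shows the pattern, not the silo library's PMPIO API, and the tag and offsets are illustrative:

    /* Ranks in a group take turns: each waits for a baton from the
       previous group member, writes its block, then hands the baton on. */
    #include <mpi.h>

    void pmpio_style_write(MPI_Comm group_comm, MPI_File fh,
                           const char *buf, int nbytes, MPI_Offset my_offset)
    {
        int grank, gsize, token = 0;
        MPI_Comm_rank(group_comm, &grank);
        MPI_Comm_size(group_comm, &gsize);

        if (grank > 0)  /* wait for the baton from the previous writer */
            MPI_Recv(&token, 1, MPI_INT, grank - 1, 0, group_comm,
                     MPI_STATUS_IGNORE);

        /* independent write: only one rank per group is here at a time */
        MPI_File_write_at(fh, my_offset, buf, nbytes, MPI_BYTE,
                          MPI_STATUS_IGNORE);

        if (grank < gsize - 1)  /* pass the baton to the next writer */
            MPI_Send(&token, 1, MPI_INT, grank + 1, 0, group_comm);
    }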
MPI_File Alternatives: syncIO
– Flexible design allows a variable number of files and procs/writers per file
– Within a file, can be configured to write on "block size boundaries", which are typically 1 to 4 MB
– Implemented using collective I/O routines, e.g., MPI_File_write_at_all_begin (see the sketch below)
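A sketch of a collective, block-aligned write in this spirit; the group communicator, block size, and the assumption of equal-sized regions per rank are illustrative:

    #include <mpi.h>

    void syncio_style_write(MPI_Comm file_comm, const char *fname,
                            const char *buf, int nbytes, MPI_Offset block_size)
    {
        int grank;
        MPI_Comm_rank(file_comm, &grank);

        /* Round each rank's region up to a filesystem block
           (e.g., 4 MB GPFS blocks on BG/P). Offsets below assume every
           rank writes the same nbytes. */
        MPI_Offset aligned = ((MPI_Offset)nbytes + block_size - 1)
                             / block_size * block_size;
        MPI_Offset offset  = grank * aligned;

        MPI_File fh;
        MPI_File_open(file_comm, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        /* split-collective write on block-aligned boundaries */
        MPI_File_write_at_all_begin(fh, offset, buf, nbytes, MPI_BYTE);
        /* ... other work could overlap here ... */
        MPI_File_write_at_all_end(fh, buf, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
    }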
MPI_File Alternatives: rbIO ("reduced blocking" I/O)
– Targets checkpointing
– Divides the application into workers and writers, with one writer MPI task per group of workers
– Workers send their I/O to the writers via MPI_Isend and are free to continue, e.g., hiding the latency of blocking parallel I/O
– Writers then perform a blocking MPI_File_write_at operation using the MPI_COMM_SELF communicator (see the sketch below)
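A sketch of the worker/writer split; the tag, counts, and contiguous offsets are illustrative, not rbIO's actual protocol:

    /* Workers hand checkpoint data to a dedicated writer rank with a
       non-blocking send and return to computation; the writer does the
       blocking file I/O on its own, using MPI_COMM_SELF. */
    #include <mpi.h>
    #include <stdlib.h>

    #define CKPT_TAG 99

    /* Worker side: fire and continue. Complete the request (MPI_Wait)
       before reusing 'buf', e.g., at the next checkpoint. */
    void rbio_worker_send(const double *buf, int count, int writer_rank,
                          MPI_Request *req)
    {
        MPI_Isend(buf, count, MPI_DOUBLE, writer_rank, CKPT_TAG,
                  MPI_COMM_WORLD, req);
    }

    /* Writer side: receive each worker's block and write it with an
       independent file handle opened on MPI_COMM_SELF. */
    void rbio_writer_drain(const char *fname, const int *workers,
                           int nworkers, int max_count)
    {
        MPI_File fh;
        MPI_File_open(MPI_COMM_SELF, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        double *tmp = malloc((size_t)max_count * sizeof(double));
        MPI_Offset offset = 0;
        for (int i = 0; i < nworkers; i++) {
            MPI_Status st;
            int count;
            MPI_Recv(tmp, max_count, MPI_DOUBLE, workers[i], CKPT_TAG,
                     MPI_COMM_WORLD, &st);
            MPI_Get_count(&st, MPI_DOUBLE, &count);

            MPI_File_write_at(fh, offset, tmp, count, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
            offset += (MPI_Offset)count * sizeof(double);
        }
        free(tmp);
        MPI_File_close(&fh);
    }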
BG/L: 1PFPP w/ 7.7 GB data
BG/L: PMPIO w/ 7.7 GB data
– HDF5 peak: 600 MB/sec
– Raw MPI-IO peak: 900 MB/sec
BG/L: syncIO w/ 7.7 GB data
– Read performance peak: 6.6 GB/sec
– Write performance peak: 1.3 GB/sec
BG/P: syncIO w/ ~60 GB data
BG/L: rbIO actual BW w/ 7.7 GB data
BG/L: rbIO perceived BW w/ 7.7 GB data
– Peaks: ~22 TB/sec and ~11 TB/sec
BG/P: rbIO actual BW w/ ~60 GB data
– Peak: ~17.9 GB/sec
BG/P: rbIO perceived BW w/ ~60 GB data
– Peak: ~21 TB/sec
Related Work
– A. Nisar, W. Liao, and A. Choudhary, "Scaling Parallel I/O Performance through I/O Delegate and Caching System," in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008.
  – Performs an "rbIO"-like delegation inside MPI via threads, using up to 10% of the compute cores as I/O workers
– Benchmark studies (highlighting just a few):
  – H. Yu et al. [18] – BG/L: 2 GB/sec @ 1K cores
  – Saini et al. [19] – 512 NEC SX-8 cores: I/O was not scalable when all processors accessed a shared file
  – Larkin et al. [17] – large performance drop at 2K core counts on the Cray XT3/XT4
  – Lang et al. [30] – large I/O study across many benchmarks on Intrepid (BG/P); found 60 GB/s read and 45 GB/s write. In practice, Intrepid has a peak I/O rate of around 35 GB/sec.
Summary and Future Work
We examined several parallel I/O approaches:
– 1 POSIX file per processor: < 1 GB/sec on BG/L
– PMPIO: < 1 GB/sec on BG/L
– syncIO – all processors write as groups to different files
  – BG/L: 6.6 GB/sec read, 1.3 GB/sec write
  – BG/P: 11.6 GB/sec read, 25 GB/sec write
– rbIO – gives up 3 to 6% of compute nodes to hide the latency of blocking parallel I/O
  – BG/L: 2.3 GB/sec actual write, 22 TB/sec perceived write
  – BG/P: ~18 GB/sec actual write, ~22 TB/sec perceived write
  – A good trade-off on Blue Gene
Writing from all processors to one file does not yield good performance, even when aligned.
The performance "sweet spot" for syncIO depends significantly on the I/O architecture, so the file format must be tuned accordingly:
– BG/L @ CCNI has a metadata bottleneck, and the number of files must be adjusted accordingly, e.g., 32 to 128 writers
– BG/P @ ALCF can sustain much higher performance but requires more files, e.g., 1024 writers
– This suggests collective I/O is sensitive to the underlying file system's performance
For rbIO, 1024 writers gave the best performance observed so far on both the BG/L and BG/P platforms.
Future work – the impact of different filesystems on performance:
– Leverage Darshan logs @ ALCF to better understand Intrepid performance
– More experiments on Blue Gene/P under PVFS and on the Cray XT5 under Lustre