Hiding Periodic I/O Costs in Parallel Applications
Xiaosong Ma
Department of Computer Science, University of Illinois at Urbana-Champaign
Spring 2003
3 Roadmap Introduction Active buffering: hiding recurrent output cost Ongoing work: hiding recurrent input cost Conclusions
4 Introduction
Fast-growing technology propels high-performance applications
–Scientific computation
–Parallel data mining
–Web data processing
–Games, movie graphics
Growth of individual components is uncoordinated
–Manual performance tuning needed
5 We Need Adaptive Optimization Flexible and automatic performance optimization desired Efficient high-level buffering and prefetching for parallel I/O in scientific simulations
6 Scientific Simulations Important –Detail and flexibility –Save money and lives Challenging –Multi-disciplinary –High performance crucial
7 Parallel I/O in Scientific Simulations
Write-intensive
Collective and periodic
“Poor stepchild”
Bottleneck-prone
Existing collective I/O work focused on data transfer
[Timeline diagram: computation phases alternating with periodic I/O phases]
8 My Contributions Idea: I/O optimizations in larger scope –Parallelism between I/O and other tasks –Individual simulation’s I/O need –I/O related self-configuration Approach: hide the I/O cost Results –Publications, technology transfer, software
9 Roadmap Introduction Active buffering: hiding recurrent output cost Ongoing work: hiding recurrent input cost Conclusions
10 Latency Hierarchy on Parallel Platforms
Along the path of data transfer:
–Smaller throughput
–Lower parallelism, less scalable
[Hierarchy diagram: local memory access → inter-processor communication → disk I/O → wide-area transfer]
11 Basic Idea of Active Buffering Purpose: maximize overlap between computation and I/O Approach: buffer data as early as possible
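The core move can be sketched in a few lines of C: absorb the output at memory-copy speed and return control to the computation immediately, deferring the real write to a background agent. The `ab_buffer` type and function names here are illustrative, not the library's actual API.

```c
#include <string.h>

/* Sketch of the basic active-buffering step: the compute process copies
 * its output into a local buffer at memory speed and resumes computing;
 * a background agent (I/O thread or server) drains the buffer later. */
typedef struct {
    char  *mem;      /* buffer space reserved for snapshots       */
    size_t cap;      /* total capacity in bytes                   */
    size_t used;     /* bytes currently buffered, not yet written */
} ab_buffer;

/* Returns 1 if the data was absorbed at memcpy speed, 0 on overflow
 * (the caller must then fall back to a slower transfer path). */
int ab_write(ab_buffer *b, const void *data, size_t len)
{
    if (b->used + len > b->cap)
        return 0;                    /* overflow: buffer space exhausted */
    memcpy(b->mem + b->used, data, len);
    b->used += len;
    return 1;                        /* computation can resume right away */
}
```

Overflow handling is exactly what the state machines on the following slides manage.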
12 Challenges Accommodate multiple I/O architectures No assumption on buffer space Adaptive –Buffer availability –User request patterns
13 Roadmap Introduction Active buffering: hiding recurrent output cost –With client-server I/O architecture [IPDPS ’02] –With server-less architecture Ongoing work: hiding recurrent input cost Related work and future work Conclusions
14 Client-Server I/O Architecture
[Diagram: compute processors send output to I/O servers, which write to the file system]
15 Client State Machine
[State diagram: on entering the collective write routine, the client prepares, then buffers data while buffer space is available; when out of buffer space, it sends blocks to the servers; once all data is buffered or sent, it exits]
16 Server State Machine
[State diagram: after initialization, the server allocates buffers and receives blocks while enough buffer space remains; when its buffers overflow, or when it is idle with buffered data pending, it fetches and writes blocks; after all data has been received and written and the exit message arrives, it exits]
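The client side of this protocol can be condensed into a small transition function. The state names and parameters below are illustrative, not the implementation's actual identifiers; the point is that the client prefers the cheap buffering path and only falls back to sending when buffer space runs out.

```c
#include <stddef.h>

/* Hypothetical encoding of the client state machine: buffer a block
 * when space allows, send a block to a server otherwise, exit when
 * all data has been handled. */
typedef enum { C_PREPARE, C_BUFFER, C_SEND, C_EXIT } cstate;

cstate client_step(cstate s, size_t data_left, size_t buf_free, size_t block)
{
    switch (s) {
    case C_PREPARE:              /* just entered the collective write */
    case C_BUFFER:               /* buffered one block at memcpy speed */
    case C_SEND:                 /* sent one block to an I/O server    */
        if (data_left == 0)
            return C_EXIT;       /* all data buffered or sent          */
        return buf_free >= block ? C_BUFFER   /* buffer space available */
                                 : C_SEND;    /* out of buffer space    */
    case C_EXIT:
    default:
        return C_EXIT;
    }
}
```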
17 Maximize Apparent Throughput
Ideal apparent throughput per server:
T_ideal = D_total / (D_c-buffered / T_mem-copy + D_c-overflow / T_msg-passing + D_s-overflow / T_write)
More expensive data transfer only becomes visible when overflow happens
Efficiently masks the difference in write speeds
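The formula charges each byte at the speed of the cheapest level that absorbed it: client-buffered data costs a memory copy, client overflow costs message passing, and server overflow costs a disk write. A direct transcription (with the hyphenated subscripts turned into parameter names) might look like:

```c
/* Ideal apparent throughput per server. The D_* arguments are data
 * volumes (bytes) absorbed at each level; the T_* arguments are the
 * corresponding transfer throughputs (bytes/sec). */
double ideal_throughput(double d_total,
                        double d_cbuf, double t_memcopy,
                        double d_covf, double t_msg,
                        double d_sovf, double t_write)
{
    double time = d_cbuf / t_memcopy   /* client-buffered: memory copy  */
                + d_covf / t_msg       /* client overflow: msg passing  */
                + d_sovf / t_write;    /* server overflow: disk write   */
    return d_total / time;
}
```

With no overflow (d_covf = d_sovf = 0), the apparent throughput collapses to the memory-copy rate, which is why the slower levels stay invisible until buffers fill.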
18 Write Throughput without Overflow –Panda Parallel I/O library –SGI Origin 2000, SHMEM –Per client: 16MB output data per snapshot, 64MB buffer –Two servers, each with 256MB buffer
19 Write Throughput with Overflow –Panda Parallel I/O library –SGI Origin 2000, SHMEM, MPI –Per client: 96MB output data per snapshot, 64MB buffer –Two servers, each with 256MB buffer
20 Give Feedback to Application “Softer” I/O requirements Parallel I/O libraries have been passive Active buffering allows I/O libraries to take more active role –Find optimal output frequency automatically
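One concrete form such feedback could take (a hedged sketch, not the library's actual policy): a snapshot's cost stays hidden as long as its buffer drains before the next snapshot arrives, so the library can report the smallest safe interval between snapshots.

```c
/* Illustrative sketch: smallest number of computation timesteps between
 * snapshots such that the buffered output fully drains in the background.
 * Equivalent to ceil(drain_sec / compute_sec_per_step), floored at 1. */
int min_snapshot_interval(double drain_sec, double compute_sec_per_step)
{
    int steps = (int)(drain_sec / compute_sec_per_step);
    if ((double)steps * compute_sec_per_step < drain_sec)
        steps++;                /* round up: a partial step hides nothing */
    return steps < 1 ? 1 : steps;
}
```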
21 Server-side Active Buffering
[State diagram: the server state machine from slide 16, realizing active buffering on the server side]
22 Performance with Real Applications Application overview – GENX –Large-scale, multi-component, detailed rocket simulation –Developed at Center for Simulation of Advanced Rockets (CSAR), UIUC –Multi-disciplinary, complex, and evolving Providing parallel I/O support for GENX –Identification of parallel I/O requirements [PDSECA ’03] –Motivation and test case for active buffering
23 Overall Performance of GEN1 –SDSC IBM SP (Blue Horizon) –64 clients, 2 I/O servers with AB –160MB output data per snapshot (in HDF4)
24 Aggregate Write Throughput in GEN2 –LLNL IBM SP (ASCI Frost) –1 I/O server per 16-way SMP node –Write in HDF4
25 Scientific Data Migration
Output data needs to be moved off the simulation platform
Online migration
Extend active buffering to migration
–Local storage becomes another layer in the buffer hierarchy
[Timeline diagram: computation alternating with I/O, with output migrated over the internet]
26 I/O Architecture with Data Migration
[Diagram: compute processors send data to I/O servers and the file system, and across the Internet to a workstation running a visualization tool]
27 Active Buffering for Data Migration
Avoid unnecessary local I/O
–Hybrid migration approach: memory-to-memory transfer vs. disk staging
Combined with data compression [ICS ’02]
Self-configuration for online visualization
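The hybrid choice above reduces to a simple per-snapshot decision; this sketch (illustrative names, not the library's API) shows the shape of it: migrate memory-to-memory while buffer space lasts, and stage to local disk only on overflow.

```c
#include <stddef.h>

/* Hypothetical sketch of the hybrid migration decision. */
typedef enum { MIGRATE_MEMORY, MIGRATE_DISK_STAGE } migrate_path;

migrate_path choose_path(size_t snapshot_bytes, size_t buf_free)
{
    /* Memory-to-memory transfer avoids local I/O entirely; disk
     * staging is the fallback that makes local storage another
     * layer in the buffer hierarchy. */
    return snapshot_bytes <= buf_free ? MIGRATE_MEMORY
                                      : MIGRATE_DISK_STAGE;
}
```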
28 Roadmap Introduction Active buffering: hiding recurrent output cost –With client-server I/O architecture –With server-less architecture [IPDPS ’03] Ongoing work: hiding recurrent input cost Conclusions
29 Server-less I/O Architecture
[Diagram: compute processors, each running an I/O thread, write directly to the file system]
30 Making ABT Transparent and Portable
Unchanged interfaces
High-level and file-system independent
Design and evaluation [IPDPS ’03]
Ongoing transfer to ROMIO
[Diagram: ABT layered above ROMIO’s ADIO interface, which targets NFS, HFS, NTFS, PFS, PVFS, XFS, and UFS]
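In the server-less scheme, the background agent is a thread inside each compute process. A minimal producer/consumer sketch with POSIX threads conveys the mechanism; the names and fixed-size queue are illustrative, not ROMIO's actual internals.

```c
#include <pthread.h>

/* Minimal ABT-style sketch: the compute thread appends block sizes to a
 * shared queue and returns quickly; one background I/O thread drains the
 * queue, so "writes" overlap with computation without a dedicated server. */
#define CAP 8
static int  queue[CAP];
static int  head, tail, done;
static long bytes_written;               /* stands in for the file write */
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

static void *io_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&mu);
        while (head == tail && !done)    /* nothing buffered yet */
            pthread_cond_wait(&cv, &mu);
        if (head == tail && done) { pthread_mutex_unlock(&mu); break; }
        int block = queue[head % CAP]; head++;
        pthread_cond_signal(&cv);        /* wake a producer waiting on full */
        pthread_mutex_unlock(&mu);
        bytes_written += block;          /* a real version would write() */
    }
    return NULL;
}

/* Called from the compute side: buffers a block and returns quickly. */
void ab_submit(int block_bytes)
{
    pthread_mutex_lock(&mu);
    while (tail - head == CAP)           /* buffer full: wait for drainer */
        pthread_cond_wait(&cv, &mu);
    queue[tail % CAP] = block_bytes; tail++;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mu);
}

void ab_finish(pthread_t t)              /* flush and join at exit */
{
    pthread_mutex_lock(&mu);
    done = 1;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mu);
    pthread_join(t, NULL);
}
```

Because the interface seen by the application is just `ab_submit`, the buffering can sit beneath unchanged MPI-IO calls, which is the transparency point of this slide.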
31 Active Buffering vs. Asynchronous I/O
32 Roadmap Introduction Active buffering: hiding recurrent output cost Ongoing work: hiding recurrent input cost Conclusions
33 I/O in Visualization
Periodic reads
Dual modes of operation
–Interactive
–Batch-mode
Reads are harder to overlap with computation
[Timeline diagram: computation phases alternating with periodic input phases]
34 Efficient I/O Through Data Management In-memory database of datasets –Manage buffers or values Hub for I/O optimization –Prefetching for batch mode –Caching for interactive mode User-supplied read routine
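The dataset manager described above can be sketched as a name-keyed cache in front of a user-supplied read routine; on a miss the reader performs the real I/O, and later requests (interactive revisits, or batch-mode prefetches issued early) hit memory. All names here are illustrative, and the fixed-size table stands in for a real in-memory database.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the in-memory dataset manager. */
typedef void *(*read_fn)(const char *name);   /* user-supplied reader */

#define SLOTS 16
typedef struct { char name[64]; void *data; } entry;
static entry table[SLOTS];
static int   nentries;

/* Look up a dataset by name; on a miss, call the user-supplied read
 * routine and cache the result. *hit reports whether I/O was avoided. */
void *dataset_get(const char *name, read_fn reader, int *hit)
{
    for (int i = 0; i < nentries; i++)
        if (strcmp(table[i].name, name) == 0) {
            *hit = 1;
            return table[i].data;             /* served from memory */
        }
    *hit = 0;
    void *data = reader(name);                /* the periodic read */
    if (nentries < SLOTS) {
        strncpy(table[nentries].name, name, sizeof table[0].name - 1);
        table[nentries].data = data;
        nentries++;
    }
    return data;
}

/* Example reader: fabricates a one-int dataset and counts real "I/O". */
static int reads_performed;
static void *demo_reader(const char *name)
{
    (void)name;
    reads_performed++;
    int *v = malloc(sizeof *v);
    *v = 0;
    return v;
}
```

A prefetcher for batch mode would simply call `dataset_get` ahead of time for datasets known to be needed next.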
35 Related Work Overlapping I/O with computation –Replacing synchronous calls with async calls [Agrawal et al. ICS ’96] –Threads [Dickens et al. IPPS ’99, More et al. IPPS ’97] Automatic performance optimization –Optimization with performance models [Chen et al. TSE ’00] –Graybox optimization [Arpaci-Dusseau et al. SOSP ’01]
36 Roadmap Introduction Active buffering: hiding recurrent output cost Ongoing work: hiding recurrent input cost Conclusions
37 Conclusions
If we can’t shrink it, hide it!
Performance optimization can be done
–more actively
–at a higher level
–in a larger scope
Make I/O part of data management
38 References
[IPDPS ’03] Xiaosong Ma, Marianne Winslett, Jonghyun Lee, and Shengke Yu. Improving MPI-IO Output Performance with Active Buffering Plus Threads. 2003 International Parallel and Distributed Processing Symposium.
[PDSECA ’03] Xiaosong Ma, Xiangmin Jiao, Michael Campbell, and Marianne Winslett. Flexible and Efficient Parallel I/O for Large-Scale Multi-component Simulations. 4th Workshop on Parallel and Distributed Scientific and Engineering Computing with Applications.
[ICS ’02] Jonghyun Lee, Xiaosong Ma, Marianne Winslett, and Shengke Yu. Active Buffering Plus Compressed Migration: An Integrated Solution to Parallel Simulations’ Data Transport Needs. 16th ACM International Conference on Supercomputing.
[IPDPS ’02] Xiaosong Ma, Marianne Winslett, Jonghyun Lee, and Shengke Yu. Faster Collective Output through Active Buffering. 2002 International Parallel and Distributed Processing Symposium.