
Hiding Periodic I/O Costs in Parallel Applications
Xiaosong Ma
Department of Computer Science
University of Illinois at Urbana-Champaign
Spring 2003

Roadmap
– Introduction
– Active buffering: hiding recurrent output cost
– Ongoing work: hiding recurrent input cost
– Conclusions

Introduction
– Fast-growing technology propels high-performance applications
  – Scientific computation
  – Parallel data mining
  – Web data processing
  – Games, movie graphics
– Individual components grow in an uncoordinated fashion
  – Manual performance tuning needed

We Need Adaptive Optimization
– Flexible and automatic performance optimization is desired
– This talk: efficient high-level buffering and prefetching for parallel I/O in scientific simulations

Scientific Simulations
– Important
  – Detail and flexibility
  – Save money and lives
– Challenging
  – Multi-disciplinary
  – High performance crucial

Parallel I/O in Scientific Simulations
– Write-intensive, collective, and periodic
– The "poor stepchild" of simulation performance; bottleneck-prone
– Existing collective I/O work focuses on data transfer
[Figure: timeline of computation phases interleaved with periodic I/O: Computation … I/O … Computation … I/O …]
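To make the pattern concrete, here is a minimal sketch (my own illustrative C/MPI code, not anything from the talk) of a time-stepped simulation whose collective snapshot writes stall computation every K steps; compute_one_step, K, and the file layout are assumptions.

```c
#include <mpi.h>

void compute_one_step(double *grid, int n);   /* application kernel (assumed) */

void run(MPI_Comm comm, double *grid, int nlocal, int nsteps) {
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    MPI_File fh;
    MPI_File_open(comm, "snapshots.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    const int K = 100;                        /* output period (assumed) */
    for (int step = 0; step < nsteps; step++) {
        compute_one_step(grid, nlocal);       /* computation phase */

        if (step % K == 0) {                  /* periodic collective output */
            int snap = step / K;
            MPI_Offset off =
                ((MPI_Offset)snap * nprocs + rank) * nlocal * sizeof(double);
            /* Every process blocks here until its data reaches the
             * servers/disk -- this is the cost active buffering hides. */
            MPI_File_write_at_all(fh, off, grid, nlocal,
                                  MPI_DOUBLE, MPI_STATUS_IGNORE);
        }
    }
    MPI_File_close(&fh);
}
```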

My Contributions
– Idea: perform I/O optimizations in a larger scope
  – Parallelism between I/O and other tasks
  – An individual simulation's I/O needs
  – I/O-related self-configuration
– Approach: hide the I/O cost
– Results: publications, technology transfer, software

Roadmap
– Introduction
– Active buffering: hiding recurrent output cost
– Ongoing work: hiding recurrent input cost
– Conclusions

Latency Hierarchy on Parallel Platforms
– Along the path of data transfer, each successive layer has
  – smaller throughput
  – lower parallelism and less scalability
– The path: local memory access → inter-processor communication → disk I/O → wide-area transfer

Basic Idea of Active Buffering
– Purpose: maximize overlap between computation and I/O
– Approach: buffer data as early as possible
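A minimal single-process sketch of this idea, assuming pthreads and a fixed ring of buffers (the actual Panda/ABT implementations are richer and adaptive): the write call costs only a memory copy, and a background thread drains buffers to the slow device.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NBUF  4
#define BUFSZ (1 << 20)                /* assumes each write fits in BUFSZ */

static char   slots[NBUF][BUFSZ];
static size_t lens[NBUF];
static int    head = 0, tail = 0, count = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  nonfull  = PTHREAD_COND_INITIALIZER;

/* Called on the compute path: one memory copy, then return, so
 * computation resumes while the data is still in flight. */
void buffered_write(const void *data, size_t len) {
    pthread_mutex_lock(&m);
    while (count == NBUF)              /* overflow: block here (the real
                                          scheme falls back to a slower,
                                          visible transfer path instead) */
        pthread_cond_wait(&nonfull, &m);
    memcpy(slots[tail], data, len);
    lens[tail] = len;
    tail = (tail + 1) % NBUF;
    count++;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&m);
}

/* Background flusher (single thread assumed): overlaps the expensive
 * write with the application's computation. */
void *flusher(void *arg) {
    FILE *f = (FILE *)arg;
    for (;;) {
        pthread_mutex_lock(&m);
        while (count == 0)
            pthread_cond_wait(&nonempty, &m);
        int    i   = head;
        size_t len = lens[i];
        pthread_mutex_unlock(&m);

        fwrite(slots[i], 1, len, f);   /* slow write, outside the lock */

        pthread_mutex_lock(&m);
        head = (head + 1) % NBUF;
        count--;
        pthread_cond_signal(&nonfull);
        pthread_mutex_unlock(&m);
    }
    return NULL;
}
```

The compute thread only stalls when every buffer is full, which is exactly the overflow case the following slides handle adaptively.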

Challenges
– Accommodate multiple I/O architectures
– Make no assumptions about buffer space
– Adapt to
  – Buffer availability
  – User request patterns

Roadmap
– Introduction
– Active buffering: hiding recurrent output cost
  – With client-server I/O architecture [IPDPS ’02]
  – With server-less architecture
– Ongoing work: hiding recurrent input cost
– Related work and future work
– Conclusions

Client-Server I/O Architecture
[Figure: compute processors send data to I/O servers, which write to the file system]

Client State Machine
[Figure: state machine. On entering the collective write routine the client prepares, then buffers data while buffer space is available; when out of buffer space it sends blocks to a server; it exits once all data is buffered (no overflow) or sent.]
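Read as code, the diagram above might look like the following hedged sketch; the state names and helper functions are mine, not the library's API.

```c
#include <stddef.h>

int  buffer_space_available(void);                   /* hypothetical helpers */
void enqueue_block(const char *data, size_t i);
void send_block_to_server(const char *data, size_t i);

typedef enum { PREPARE, BUFFER_DATA, SEND_BLOCK, DONE } client_state;

void client_collective_write(const char *data, size_t nblocks) {
    client_state s = PREPARE;
    size_t next = 0;                        /* next block to handle */
    while (s != DONE) {
        switch (s) {
        case PREPARE:                       /* entered collective write */
            s = BUFFER_DATA;
            break;
        case BUFFER_DATA:                   /* cheap path: local buffering */
            if (next == nblocks)
                s = DONE;                   /* all data buffered, no overflow */
            else if (buffer_space_available())
                enqueue_block(data, next++);
            else
                s = SEND_BLOCK;             /* out of buffer space */
            break;
        case SEND_BLOCK:                    /* slow path: ship to a server */
            send_block_to_server(data, next++);
            if (next == nblocks)
                s = DONE;                   /* all data sent */
            break;
        default:
            break;
        }
    }
}
```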

Server State Machine
[Figure: state machine. After init the server allocates buffers and listens. On a write request it receives blocks while there is enough buffer space (busy-listen); when idle or out of buffer space it fetches buffered blocks and writes them to disk; it is done when all data is received and written, and exits on an exit message.]
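A corresponding hedged sketch of the server loop: buffer incoming blocks while space lasts, and drain buffered blocks to disk whenever the network side is idle or space runs out. All helper names are assumptions, not the library's API.

```c
#include <stddef.h>

int  block_incoming(void);          /* hypothetical: a data block arrived */
int  have_buffer_space(void);
int  have_buffered_blocks(void);
int  exit_message_received(void);
void receive_block_into_buffer(void);
void receive_block_and_write(void); /* out-of-space path: write-through  */
void write_one_buffered_block(void);

void server_main_loop(void) {
    while (!exit_message_received()) {
        if (block_incoming()) {
            if (have_buffer_space())
                receive_block_into_buffer();   /* fast path */
            else
                receive_block_and_write();     /* overflow: write is visible */
        } else if (have_buffered_blocks()) {
            write_one_buffered_block();        /* idle: drain to disk */
        }
    }
    while (have_buffered_blocks())             /* flush before exiting */
        write_one_buffered_block();
}
```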

Maximize Apparent Throughput
– Ideal apparent throughput per server:

  $T_{\mathrm{ideal}} = \dfrac{D_{\text{total}}}{\dfrac{D_{\text{c-buffered}}}{T_{\text{mem-copy}}} + \dfrac{D_{\text{c-overflow}}}{T_{\text{msg-passing}}} + \dfrac{D_{\text{s-overflow}}}{T_{\text{write}}}}$

– More expensive data transfers become visible only when overflow happens
– Efficiently masks the difference in write speeds
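As a hedged illustration with invented numbers (not measurements from the talk): 96 MB of output per server, of which 64 MB fits in client buffers (memory copy at 1000 MB/s) and 32 MB overflows to message passing (200 MB/s), with no server-side overflow:

```latex
% Illustrative numbers only, not results from the talk.
\[
T_{\mathrm{ideal}}
  = \frac{96\,\mathrm{MB}}
         {\frac{64\,\mathrm{MB}}{1000\,\mathrm{MB/s}}
        + \frac{32\,\mathrm{MB}}{200\,\mathrm{MB/s}}
        + \frac{0\,\mathrm{MB}}{T_{\mathrm{write}}}}
  = \frac{96}{0.064 + 0.16}\,\mathrm{MB/s} \approx 429\,\mathrm{MB/s}
\]
% far above a raw disk rate, because the write cost stays hidden
% until data actually spills past the server's buffers.
```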

Write Throughput without Overflow
– Panda parallel I/O library
– SGI Origin 2000, SHMEM
– Per client: 16 MB output data per snapshot, 64 MB buffer
– Two servers, each with a 256 MB buffer

Write Throughput with Overflow
– Panda parallel I/O library
– SGI Origin 2000, SHMEM, MPI
– Per client: 96 MB output data per snapshot, 64 MB buffer
– Two servers, each with a 256 MB buffer

Give Feedback to the Application
– "Softer" I/O requirements
– Parallel I/O libraries have traditionally been passive
– Active buffering lets an I/O library take a more active role
  – e.g., find the optimal output frequency automatically

Server-side Active Buffering
[Figure: the same server state machine shown earlier, revisited to highlight the buffering (busy-listen and fetch-and-write) paths that make the server active]

Performance with Real Applications
– Application overview: GENX
  – Large-scale, multi-component, detailed rocket simulation
  – Developed at the Center for Simulation of Advanced Rockets (CSAR), UIUC
  – Multi-disciplinary, complex, and evolving
– Providing parallel I/O support for GENX
  – Identification of parallel I/O requirements [PDSECA ’03]
  – Motivation and test case for active buffering

Overall Performance of GEN1
– SDSC IBM SP (Blue Horizon)
– 64 clients, 2 I/O servers with active buffering
– 160 MB output data per snapshot (in HDF4)

Aggregate Write Throughput in GEN2
– LLNL IBM SP (ASCI Frost)
– 1 I/O server per 16-way SMP node
– Writes in HDF4

Scientific Data Migration
– Output data needs to be moved off the compute platform
– Online migration
– Extend active buffering to migration
  – Local storage becomes another layer in the buffer hierarchy
[Figure: the periodic computation/I/O timeline, now ending in transfer over the internet]

I/O Architecture with Data Migration
[Figure: compute processors send data to servers with a local file system; the servers migrate it over the Internet to a workstation running a visualization tool]

Active Buffering for Data Migration
– Avoid unnecessary local I/O
  – Hybrid migration approach: memory-to-memory transfer when possible, disk staging otherwise (sketched below)
– Combined with data compression [ICS ’02]
– Self-configuration for online visualization
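The hybrid decision might look like this sketch; the helper names and the decision rule are my assumptions, not the paper's actual logic.

```c
#include <stddef.h>

int  buffer_space_available(size_t len);             /* hypothetical helpers */
int  network_ready(void);
void send_over_network(const void *d, size_t len);   /* memory-to-memory */
void stage_to_local_disk(const void *d, size_t len); /* disk staging     */

void migrate_snapshot(const void *data, size_t len) {
    if (network_ready() && buffer_space_available(len)) {
        /* Fast path: ship the buffered snapshot straight from memory,
         * never touching the server's local disk. */
        send_over_network(data, len);
    } else {
        /* Fallback: local storage acts as one more buffer layer; the
         * staged copy is forwarded later, off the critical path. */
        stage_to_local_disk(data, len);
    }
}
```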

Roadmap
– Introduction
– Active buffering: hiding recurrent output cost
  – With client-server I/O architecture
  – With server-less architecture [IPDPS ’03]
– Ongoing work: hiding recurrent input cost
– Conclusions

Server-less I/O Architecture
[Figure: compute processors, each with its own I/O thread, writing directly to the file system]

Making ABT Transparent and Portable
– Unchanged interfaces
– High-level and file-system independent
– Design and evaluation [IPDPS ’03]
– Ongoing transfer to ROMIO
[Figure: ABT sits above ROMIO’s ADIO layer, which targets NFS, HFS, NTFS, PFS, PVFS, XFS, and UFS]
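Concretely, "unchanged interfaces" means the application keeps issuing the same MPI-IO collective as in the earlier loop. A sketch, with an illustrative file layout of my own:

```c
#include <mpi.h>

/* The application-visible call stays the standard MPI-IO collective;
 * an ABT layer under ROMIO's ADIO can buffer the data and complete the
 * physical write on its I/O thread without any source changes here. */
void write_snapshot(MPI_Comm comm, MPI_File fh,
                    const double *buf, int count, int snap) {
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    MPI_Offset off =
        ((MPI_Offset)snap * nprocs + rank) * count * sizeof(double);

    /* With plain ROMIO this blocks for the full write; with ABT
     * underneath it returns after buffering, hiding the disk time. */
    MPI_File_write_at_all(fh, off, buf, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
}
```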

Active Buffering vs. Asynchronous I/O

Roadmap
– Introduction
– Active buffering: hiding recurrent output cost
– Ongoing work: hiding recurrent input cost
– Conclusions

I/O in Visualization
– Periodic reads
– Dual modes of operation
  – Interactive
  – Batch
– Reads are harder to overlap with computation than writes
[Figure: timeline of computation phases interleaved with periodic I/O]

Efficient I/O Through Data Management
– In-memory database of datasets
  – Manages buffers or values
– Hub for I/O optimization
  – Prefetching for batch mode
  – Caching for interactive mode
– User-supplied read routine (see the sketch below)
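A hedged sketch of that design: a name-keyed in-memory cache that falls back to the user-supplied read routine on a miss. The structure and all names are mine, not the system's actual interface.

```c
#include <stdlib.h>
#include <string.h>

typedef void *(*read_fn)(const char *name, size_t *len); /* user routine */

typedef struct entry {
    char *name; void *data; size_t len; struct entry *next;
} entry;

static entry *cache = NULL;

void *get_dataset(const char *name, read_fn user_read) {
    for (entry *e = cache; e; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e->data;                 /* hit: no I/O at all */

    size_t len;
    void *data = user_read(name, &len);     /* miss: user routine does I/O */

    entry *e = malloc(sizeof *e);           /* keep it for interactive
                                               revisits; batch mode could
                                               call this ahead of time to
                                               prefetch the next dataset */
    e->name = strdup(name);
    e->data = data;
    e->len  = len;
    e->next = cache;
    cache   = e;
    return data;
}
```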

Related Work
– Overlapping I/O with computation
  – Replacing synchronous calls with asynchronous calls [Agrawal et al., ICS ’96]
  – Threads [Dickens et al., IPPS ’99; More et al., IPPS ’97]
– Automatic performance optimization
  – Optimization with performance models [Chen et al., TSE ’00]
  – Graybox optimization [Arpaci-Dusseau et al., SOSP ’01]

Roadmap
– Introduction
– Active buffering: hiding recurrent output cost
– Ongoing work: hiding recurrent input cost
– Conclusions

Conclusions
– If we can’t shrink it, hide it!
– Performance optimization can be done
  – more actively
  – at a higher level
  – in a larger scope
– Make I/O part of data management

References
[IPDPS ’03] Xiaosong Ma, Marianne Winslett, Jonghyun Lee, and Shengke Yu. Improving MPI-IO Output Performance with Active Buffering Plus Threads. 2003 International Parallel and Distributed Processing Symposium.
[PDSECA ’03] Xiaosong Ma, Xiangmin Jiao, Michael Campbell, and Marianne Winslett. Flexible and Efficient Parallel I/O for Large-Scale Multi-component Simulations. 4th Workshop on Parallel and Distributed Scientific and Engineering Computing with Applications.
[ICS ’02] Jonghyun Lee, Xiaosong Ma, Marianne Winslett, and Shengke Yu. Active Buffering Plus Compressed Migration: An Integrated Solution to Parallel Simulations’ Data Transport Needs. 16th ACM International Conference on Supercomputing.
[IPDPS ’02] Xiaosong Ma, Marianne Winslett, Jonghyun Lee, and Shengke Yu. Faster Collective Output through Active Buffering. 2002 International Parallel and Distributed Processing Symposium.