Parallel I/O Optimizations

Parallel I/O Optimizations
Sources/Credits: R. Thakur, W. Gropp, E. Lusk. A Case for Using MPI's Derived Datatypes to Improve I/O Performance. Supercomputing '98. URL: http://www.mcs.anl.gov/~thakur/dtype

High Performance I/O with Derived Datatypes
- The potential of parallel file systems is often not realized because of applications' I/O access patterns.
- Parallel file systems are tuned for access to large contiguous blocks.
- User applications issue many small requests to noncontiguous blocks.
- Performance can be improved by making a single I/O call with a derived datatype instead of many small calls.
- Using datatype "templates with holes" (noncontiguous layouts) can improve performance, as sketched below.
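
The slide's point can be made concrete with a minimal sketch. The file name, array sizes, and the strided layout below are illustrative assumptions, not values from the slides: instead of looping over many small writes, a single MPI_File_write call uses an MPI_Type_vector "template with holes" to describe the noncontiguous in-memory data.

```c
/* Hedged sketch: one I/O call with a derived datatype describing a strided
 * in-memory layout, instead of a loop of small writes. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    enum { N = 1000 };
    double a[2 * N];                 /* only every other element is wanted */
    for (int i = 0; i < 2 * N; i++) a[i] = i;

    /* "Template with holes": N doubles taken with a stride of 2 doubles. */
    MPI_Datatype strided;
    MPI_Type_vector(N, 1, 2, MPI_DOUBLE, &strided);
    MPI_Type_commit(&strided);

    MPI_File fh;
    MPI_File_open(MPI_COMM_SELF, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* A single request instead of N small ones; the holes are skipped in
     * memory and the selected data is written contiguously to the file. */
    MPI_File_write(fh, a, 1, strided, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&strided);
    MPI_Finalize();
    return 0;
}
```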

Datatype Constructors in MPI
- contiguous
- vector / hvector
- indexed / hindexed / indexed_block
- struct
- subarray
- darray
[Figure: example layouts built from runs of ints (I), doubles (D), and chars (C)]
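
As a brief illustration of these constructors, the sketch below builds a few of them in C. The block counts, strides, and sizes are arbitrary examples, not values from the slides.

```c
/* Hedged sketch of several MPI datatype constructors. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* contiguous: 4 ints back to back */
    MPI_Datatype contig;
    MPI_Type_contiguous(4, MPI_INT, &contig);

    /* vector: 3 blocks of 2 ints, stride 5 ints (a strided template with holes) */
    MPI_Datatype vec;
    MPI_Type_vector(3, 2, 5, MPI_INT, &vec);

    /* indexed: irregular blocks at given displacements (in units of ints) */
    int blens[2] = {1, 3};
    int disps[2] = {0, 4};
    MPI_Datatype idx;
    MPI_Type_indexed(2, blens, disps, MPI_INT, &idx);

    /* subarray: a 2x2 corner of a 4x4 array of doubles */
    int sizes[2] = {4, 4}, subsizes[2] = {2, 2}, starts[2] = {0, 0};
    MPI_Datatype sub;
    MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C,
                             MPI_DOUBLE, &sub);

    MPI_Type_commit(&contig); MPI_Type_commit(&vec);
    MPI_Type_commit(&idx);    MPI_Type_commit(&sub);

    MPI_Aint lb, extent;
    MPI_Type_get_extent(vec, &lb, &extent);
    printf("vector type extent: %ld bytes\n", (long)extent);

    MPI_Type_free(&contig); MPI_Type_free(&vec);
    MPI_Type_free(&idx);    MPI_Type_free(&sub);
    MPI_Finalize();
    return 0;
}
```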

Different levels of access
[Figure slides: the same noncontiguous access expressed at different levels, from many independent contiguous requests up to a single collective request described with derived datatypes]
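
Based on the access levels described in the cited Thakur/Gropp/Lusk work, the sketch below contrasts the two extremes: level 0, which issues one small independent request per block, and level 3, which expresses the whole noncontiguous access as a derived-datatype file view plus a single collective call. The round-robin block distribution, sizes, and helper function names are illustrative assumptions.

```c
/* Hedged sketch contrasting level-0 and level-3 access to the same
 * round-robin block distribution of a shared file. */
#include <mpi.h>

void level0_read(MPI_File fh, int rank, int nprocs,
                 int nblocks, int blocklen, int *buf)
{
    /* Level 0: many small, independent, contiguous requests. */
    for (int i = 0; i < nblocks; i++) {
        MPI_Offset off = ((MPI_Offset)i * nprocs + rank) *
                         blocklen * sizeof(int);
        MPI_File_read_at(fh, off, buf + i * blocklen,
                         blocklen, MPI_INT, MPI_STATUS_IGNORE);
    }
}

void level3_read(MPI_File fh, int rank, int nprocs,
                 int nblocks, int blocklen, int *buf)
{
    /* Level 3: one collective request over a noncontiguous file view. */
    MPI_Datatype filetype;
    MPI_Type_vector(nblocks, blocklen, blocklen * nprocs, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);
    MPI_File_set_view(fh, (MPI_Offset)rank * blocklen * sizeof(int),
                      MPI_INT, filetype, "native", MPI_INFO_NULL);
    MPI_File_read_all(fh, buf, nblocks * blocklen, MPI_INT,
                      MPI_STATUS_IGNORE);
    MPI_Type_free(&filetype);
}
```

Level 3 gives the MPI-IO implementation the most information: it sees the complete noncontiguous request of every process at once and can apply the optimizations discussed next.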

MPI-IO Optimizations
Two popular optimizations: data sieving and collective I/O.

Optimizations for derived-datatype noncontiguous access: data sieving
- Make a few large, contiguous requests to the file system even if the user's request consists of several small, noncontiguous pieces.
- Extract (pick out) in memory only the data that is actually needed.
- This is straightforward for reads. Writes require a read-modify-write together with file locking.
- A smaller buffer is used for writing with data sieving than for reading: the larger the write buffer, the greater the contention among processes for locks.
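
For concreteness, the sketch below shows how data sieving can be tuned through MPI_Info hints, assuming the ROMIO implementation of MPI-IO. The hint names (romio_ds_read, romio_ds_write, ind_rd_buffer_size, ind_wr_buffer_size) are ROMIO-specific and may be ignored by other implementations; the buffer sizes and file name are illustrative.

```c
/* Hedged sketch: controlling ROMIO's data-sieving behaviour via hints. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);

    /* Enable data sieving for independent reads and writes. */
    MPI_Info_set(info, "romio_ds_read",  "enable");
    MPI_Info_set(info, "romio_ds_write", "enable");

    /* Large read buffer, smaller write buffer: writes hold a lock during
     * the read-modify-write, so a large buffer increases lock contention. */
    MPI_Info_set(info, "ind_rd_buffer_size", "4194304");   /* 4 MB  */
    MPI_Info_set(info, "ind_wr_buffer_size", "524288");    /* 512 KB */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "datafile",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

    /* ... independent noncontiguous reads/writes go here ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```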

Optimizations for derived-datatype noncontiguous access: collective I/O
- During collective I/O functions, the implementation can analyze and merge the requests of different processes.
- The merged request can be large and contiguous even though the individual requests were noncontiguous.
- I/O is performed in two phases:
  - I/O phase: processes perform I/O for the merged request; some of the data each process touches may belong to other processes. If the merged request is still not contiguous, data sieving is used.
  - Communication phase: processes redistribute the data to obtain the desired distribution.
- The additional cost of the communication phase can be offset by the performance gain from contiguous access.
- Data sieving and collective I/O also help improve caching and prefetching in the underlying file system.
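
The sketch below shows a collective write that lets the implementation apply this two-phase optimization. The romio_cb_write and cb_buffer_size hints are ROMIO-specific tuning knobs; the file name, sizes, and interleaved layout are illustrative assumptions. Using MPI_File_write_all rather than independent writes is what exposes all processes' requests to the library so they can be merged.

```c
/* Hedged sketch: collective write with collective-buffering (two-phase) hints. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    enum { NBLOCKS = 64, BLOCKLEN = 1024 };

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");   /* force collective buffering */
    MPI_Info_set(info, "cb_buffer_size", "16777216"); /* 16 MB aggregation buffer  */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "outfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Interleaved (noncontiguous) file view per process. */
    MPI_Datatype filetype;
    MPI_Type_vector(NBLOCKS, BLOCKLEN, BLOCKLEN * nprocs, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);
    MPI_File_set_view(fh, (MPI_Offset)rank * BLOCKLEN * sizeof(int),
                      MPI_INT, filetype, "native", info);

    int *buf = malloc(NBLOCKS * BLOCKLEN * sizeof(int));
    for (int i = 0; i < NBLOCKS * BLOCKLEN; i++) buf[i] = rank;

    /* Collective call: the requests of all processes can be merged and
     * written as large contiguous chunks during the two-phase exchange. */
    MPI_File_write_all(fh, buf, NBLOCKS * BLOCKLEN, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Info_free(&info);
    free(buf);
    MPI_Finalize();
    return 0;
}
```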

Collective I/O Illustration
[Figure: file blocks interleaved between processes P0 and P1 are merged into large contiguous accesses by the collective I/O implementation]