Pursuing Faster I/O in COSMO
POMPA Workshop, May 3rd 2010

Recap: why I/O is such a problem

The I/O problem:
- I/O is the limiting factor for the scaling of COSMO, and a limiting factor for many data-intensive applications.
- The speed of I/O subsystems for writing data is not keeping up with the increases in speed of compute engines.

Idealised 2D grid layout: increasing the number of processors by a factor of four leaves each processor with
- one quarter the number of grid points to compute,
- one half the number of halo points to communicate,
while the same total amount of data needs to be output at each time step.

With P processors, each holds M x N grid points and 2M + 2N halo points; with 4P processors, each holds (M/2) x (N/2) grid points and M + N halo points.
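The arithmetic above can be summarised in one line (a sketch of the standard strong-scaling estimate; the symbol G for the total number of grid points in the global domain is introduced here for illustration and does not appear on the slide):

```latex
% Fixed global grid of G points, decomposed into P blocks of M x N points each (G = P M N)
\underbrace{MN = \tfrac{G}{P}}_{\text{compute per process}} \propto \frac{1}{P},
\qquad
\underbrace{2(M+N)}_{\text{halo per process}} \propto \frac{1}{\sqrt{P}},
\qquad
\underbrace{G}_{\text{output per step}} = \text{constant}.
```

Summing the per-process halo over all P processes gives a total communication volume proportional to √P, which matches the O(√P) figure quoted on the next slide.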

I/O reaches a scaling limit

- Computation: scales O(P) for P processors. A minor scaling problem: issues of halo memory bandwidth, vector lengths, efficiency of the software pipeline, etc.
- Communication: scales O(√P) for P processors. A major scaling problem: the halo region decreases only slowly as you increase the number of processors.
- I/O (mainly "O"): no scaling. The limiting factor for scaling: the same amount of total data is output at each time step.

Current I/O strategies in COSMO

- Two output formats are used: GRIB and NetCDF.
  - GRIB is dominant in operational weather forecasting.
  - NetCDF is the main format used in climate research.
- GRIB output can use asynchronous I/O processes to improve parallel performance.
- NetCDF output is always ultimately serialised through process zero of the simulation.
- In both the GRIB and NetCDF cases, the output uses a multi-level data-collection approach.

Multi-level approach

[Diagram: a 3 x 6 grid of processes, each holding 3 atmospheric levels. The data are collected on atmospheric levels (Proc 0, Proc 1, Proc 2), sent to the I/O process level by level, and written to storage.]
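A minimal MPI sketch of this gather-on-levels stage is given below; it is illustrative only, not the actual COSMO routine, and the communicator, the per-rank counts/displacements and the use of MPI_Gatherv are assumptions.

```c
#include <mpi.h>
#include <stddef.h>

/* Sketch: gather one 3D field onto a single I/O rank, one atmospheric level
 * at a time.  local_field holds nlev levels of npts_local points each;
 * counts/displs describe every rank's contribution (assumed precomputed). */
void gather_on_levels(const float *local_field, int npts_local, int nlev,
                      float *global_level,        /* significant on io_rank only */
                      const int *counts, const int *displs,
                      int io_rank, MPI_Comm comm)
{
    for (int lev = 0; lev < nlev; ++lev) {
        /* Every rank sends its patch of this level; the I/O rank assembles
         * the full 2D level before writing (or forwarding) it. */
        MPI_Gatherv(local_field + (size_t)lev * npts_local, npts_local, MPI_FLOAT,
                    global_level, counts, displs, MPI_FLOAT,
                    io_rank, comm);

        /* ... on io_rank: encode global_level to GRIB/NetCDF and write ... */
    }
}
```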

Performance limitations and constraints

- Both the GRIB and NetCDF paths carry out the gather-on-levels stage.
- For GRIB-based weather simulations, the final collect-and-store stage can deploy multiple I/O processes to deal with the data.
  - This improves performance where real storage bandwidth is the bottleneck.
  - It produces multiple files (one per I/O process) that can easily be concatenated together.
- With NetCDF, only process 0 can currently act as an I/O process for the collect-and-store stage.
  - This serialises the I/O through one compute process.

Possible strategies for fast NetCDF I/O

1. Use a version of parallel NetCDF so that all compute processes write to disk (see the sketch after this list).
   - Eliminates both the gather-on-levels and collect-and-store stages.
2. Use a version of parallel NetCDF on the subset of compute processes that are needed for the gather stage.
   - Eliminates the collect-and-store stage.
3. Use a set of asynchronous I/O processes, as is currently done in the GRIB implementation.
   - If more than one asynchronous process is employed, this requires either parallel NetCDF or post-processing.
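A minimal sketch of what strategy 1 could look like with the PnetCDF library follows. The file name, variable name, dimension names and the per-rank hyperslab are placeholders, and error checking is omitted; the real COSMO output path handles many fields and far more metadata.

```c
#include <mpi.h>
#include <pnetcdf.h>

/* Sketch: every compute process writes its own patch of one 3D field
 * directly into a shared NetCDF file using collective PnetCDF calls.
 * start/count describe this rank's hyperslab (assumed precomputed). */
void write_field_pnetcdf(const float *patch,
                         const MPI_Offset start[3], const MPI_Offset count[3],
                         MPI_Offset nlev, MPI_Offset ny_glob, MPI_Offset nx_glob,
                         MPI_Comm comm)
{
    int ncid, varid, dimids[3];

    ncmpi_create(comm, "cosmo_out.nc", NC_CLOBBER | NC_64BIT_OFFSET,
                 MPI_INFO_NULL, &ncid);

    ncmpi_def_dim(ncid, "level", nlev,    &dimids[0]);
    ncmpi_def_dim(ncid, "y",     ny_glob, &dimids[1]);
    ncmpi_def_dim(ncid, "x",     nx_glob, &dimids[2]);
    ncmpi_def_var(ncid, "T", NC_FLOAT, 3, dimids, &varid);
    ncmpi_enddef(ncid);   /* metadata defined once, collectively */

    /* One collective write: each rank contributes its hyperslab. */
    ncmpi_put_vara_float_all(ncid, varid, start, count, patch);

    ncmpi_close(ncid);
}
```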

Full parallel strategy

- A simple micro-benchmark of 3D data distributed on a 2D process grid showed reasonable results.
- The strategy was implemented in the RAPS code and tested with the IPCC benchmark at ~900 cores.
  - There are no smoothing operations in this benchmark or in the code.
- The results were poor:
  - Much of the I/O in this benchmark is 2D fields.
  - Not much data is written at each timestep.
  - The current I/O performance is already not bad.
  - The parallel strategy became dominated by metadata operations: file writes for 3D fields were reasonably fast (~0.025 s for 50 MB), but opening the file took a long time (0.4 to 0.5 seconds).
- The strategy may still be useful for high-resolution simulations writing large 3D blocks of data.
  - It was originally expected to target grids of 2000 x 1000 x 60+ points.
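The open-versus-write split quoted above can be measured with a small timing harness around the MPI-IO calls. This is only a sketch and not the micro-benchmark used on the slide; the file name, access mode and per-rank offsets are assumptions.

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: time the collective file open separately from the data write,
 * to see whether metadata (open) or bandwidth (write) dominates. */
void time_open_vs_write(const void *buf, int nbytes, MPI_Comm comm)
{
    MPI_File fh;
    MPI_Status status;
    int rank;
    MPI_Comm_rank(comm, &rank);

    double t0 = MPI_Wtime();
    MPI_File_open(comm, "bench.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    double t_open = MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    /* Each rank writes its own contiguous block at a distinct offset. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * nbytes, buf, nbytes,
                          MPI_BYTE, &status);
    double t_write = MPI_Wtime() - t0;

    MPI_File_close(&fh);

    if (rank == 0)
        printf("open: %.3f s   write: %.3f s\n", t_open, t_write);
}
```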

Slowdown from metadata

- The first strategy has problems related to metadata scalability.
- Most modern high-performance file systems use POSIX I/O to open/close/seek etc.
- This is not scalable, as it reduces file access operations to the time taken for metadata operations.

Non-scalable metadata: file open speeds

- Opening a file is not a scalable operation on modern parallel file systems (see the graph of two CSCS filesystems).
- There are some mitigation strategies in MPI's ROMIO/ADIO layer (a sketch follows below):
  - "Delayed open" only makes the POSIX open call when it is actually needed; for MPI-IO collective operations, only a subset of processes actually write the data.
  - There are no mitigation strategies for specific file systems (Lustre and GPFS).
- With current file systems using POSIX I/O calls, full parallel I/O is not scalable.
- We need to pursue the other strategies for COSMO, unless large blocks of data are being written.

[Graph: time in seconds to open a file against the number of MPI processes involved in the file open, for two CSCS filesystems.]
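For reference, the ROMIO-level mitigations mentioned above are usually steered through MPI_Info hints passed at file open. The hint names below (cb_nodes, romio_cb_write, romio_no_indep_rw) are standard ROMIO hints, but the chosen values, and whether they help on a particular Lustre or GPFS installation, are assumptions to be verified.

```c
#include <mpi.h>

/* Sketch: open a shared file with ROMIO hints that limit how many ranks
 * actually touch the file system, so the metadata cost is paid by fewer
 * processes (the "delayed open" behaviour mentioned on the slide). */
MPI_File open_with_hints(const char *path, MPI_Comm comm)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_nodes", "8");             /* 8 collective-buffering aggregators (assumed value) */
    MPI_Info_set(info, "romio_cb_write", "enable");  /* force collective buffering on writes */
    MPI_Info_set(info, "romio_no_indep_rw", "true"); /* only aggregators open the file */

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}
```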

Next steps

- We are looking at all three strategies for improving NetCDF I/O.
- We are investigating the current state of metadata access in the MPI-IO layer and in file systems in general.
  - Particularly Lustre and GPFS, but also others (e.g. OrangeFS).
- ... but for some jobs the individual I/O operations might not be large enough to allow much speedup.