HDF Experiences with I/O Bottlenecks


HDF Experiences with I/O Bottlenecks
Mike Folk, The HDF Group
Collaborative Expedition Workshop: Toward Scalable Data Management, Overcoming I/O Bottlenecks in Full Data Path Processing
National Science Foundation, June 10, 2008

Topics
- What is HDF?
- I/O bottlenecks and HDF

What is HDF?

HDF is…
- A file format for managing any kind of data
- A software system to manage data in the format
- Designed for high-volume or complex data
- Designed for every size and type of system
- An open format and software library, with tools
There are two HDFs: HDF4 and HDF5. For simplicity we focus on HDF5.

HDF5: The Format

An HDF5 “file” is a container… …into which you can put your data objects, for example a small table and a raster image with a palette:

lat | lon | temp
----|-----|-----
 12 |  23 | 3.1
 15 |  24 | 4.2
 17 |  21 | 3.6

Structures to organize objects: “Groups” and “Datasets”
[Figure: a tree rooted at the “/” (root) group, containing a group “/foo” and datasets of different kinds: a 3-D array, a table (lat | lon | temp), a 2-D array, and raster images with palettes.]
This shows that you can mix objects of different types according to your needs. Typically, there will be metadata stored with objects to indicate what type of object they are. Like HDF4, HDF5 has a grouping structure. The main difference is that every HDF5 file starts with a root group, whereas HDF4 doesn’t need any groups at all.

HDF5 model
- Groups: provide structure among objects
- Datasets: where the primary data goes (data arrays, a rich set of datatype options, flexible and efficient storage and I/O)
- Attributes: for metadata
Everything else is built essentially from these parts.
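As a small illustration of this model (not from the original slides), the sketch below uses the h5py Python bindings; the file name, group and dataset names, and values are invented for the example.

```python
# Minimal sketch of the HDF5 model (groups, datasets, attributes) using h5py.
# Names and values are invented for illustration.
import numpy as np
import h5py

with h5py.File("example.h5", "w") as f:          # every HDF5 file starts at the root group "/"
    grp = f.create_group("foo")                  # a group provides structure, like a directory
    temps = grp.create_dataset("temperature",    # a 2-D array dataset
                               data=np.random.rand(100, 100))
    temps.attrs["units"] = "K"                   # an attribute: small metadata attached to an object

    # a table-like dataset built from a compound datatype (lat, lon, temp)
    row_dtype = np.dtype([("lat", "i4"), ("lon", "i4"), ("temp", "f4")])
    rows = np.array([(12, 23, 3.1), (15, 24, 4.2), (17, 21, 3.6)], dtype=row_dtype)
    f.create_dataset("observations", data=rows)
```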

HDF5: The Software

HDF Software
[Figure: layered stack, with Tools, Applications, and Libraries on top, the HDF I/O Library below them, and the HDF File at the bottom.]
It is useful to think about HDF software in terms of layers. At the bottom layer is the HDF5 file or other data source. Above that are two layers corresponding to the HDF library. First there is a low-level interface that concentrates on basic I/O: opening and closing files, reading and writing bytes, seeking, etc. HDF5 provides a public API at this level so that people can write their own drivers for reading and writing to places other than those already provided with the library; those already provided include UNIX stdio and MPI-IO. Then comes the high-level, object-specific interface. This is the API that most people who develop HDF5 applications use; this is where you create a dataset or group, read and write datasets and subsets, etc. At the top are applications, or perhaps APIs used by applications. Examples of the latter are the HDF-EOS API that supports NASA’s EOSDIS datatypes, and the DSL API that supports the ASCI data models.

Users of HDF Software
[Figure: the same layered stack, annotated with who uses each layer.]
- Tools & Applications: most data consumers are here; scientific/engineering applications; domain-specific libraries/APIs and tools.
- HDF5 Application Programming Interface: applications and tools use this API to create, read, write, query, etc. (power users and consumers).
- “Virtual file layer” (VFL): modules to adapt I/O to specific features of a system, or to do I/O in some special way.
- File system, MPI-IO, SAN, other layers: the “file” could be on a parallel system, in memory, on a network, a collection of files, etc.

Philosophy: a single platform with multiple uses
- One general format
- One library, with:
  - Options to adapt I/O and storage to data needs
  - Layers on top and below
  - Ability to interact well with other technologies
  - Attention to past, present, and future compatibility

Who uses HDF?

Who uses HDF5?
- Applications that deal with big or complex data: over 200 different types of applications reported, 2+ million product users worldwide, across academia, government agencies, and industry.
- Uses of HDF are quite varied: sensor data management and acquisition, archiving, image repositories and interchange, storage and retrieval of scalable computational meshes on massively parallel systems, remote-sensed data access and distribution, a container for heterogeneous collections of complex data, and an object store for object-relational databases.
- R&D 100 award in 2002: “one of the 100 most technologically significant products of the year.”
- Serious adoption of and reliance on HDF5:
  - Scientific and engineering disciplines such as physics, cosmology, medicine, and meteorology rely on HDF technologies.
  - Government and quasi-government agencies use it for day-to-day operations and long-term preservation: the next-generation US civil and military weather system will use HDF5 for data distribution; the Aberdeen Test Center uses HDF5 as the object format for a database of 800,000 tests; in Europe, an EU project uses HDF5 for product model data.
  - Companies like Boeing, Agilent, and GE are adopting it for company-wide data management, along with some industries you wouldn’t expect: finance, film-making (Lord of the Rings).
- Increasing need for support, services, and quick response: in 2004 a company wanted to hire about half of our staff to build infrastructure for company-wide use of HDF5; the Aberdeen Test Center needed a quick port of the HDF5 Java library to 64-bit Linux; etc.

Applications with large amounts of data

Large simulations: a simulation can have billions of elements, and each element can have dozens of associated values.

Large images: electron tomography at 25-80 Å resolution; 4k x 4k x 500 images now, 8k x 8k x 1k images soon (256 GB).

It’s not just about size.

Computational fluid dynamics simulation data

Earth Science (EOS): Terra (CERES, MISR, MODIS, MOPITT), Aqua, launched 6/01 (CERES, MODIS, AMSR), and Aura (TES, HIRDLS, MLS, OMI).

Flight test: high-speed, multi-stream, multi-modal data collection; analyze and query specific parameters by time and space.

I/O Bottlenecks and HDF

What is an I/O bottleneck?
An “I/O bottleneck” is a phenomenon where the performance or capacity of an entire system is severely limited by some aspect of I/O.
Two types of bottlenecks:
- Technology: getting the data around quickly
- Usability/accessibility: acquiring the data and making use of it
The role for HDF:
- Try not to cause bottlenecks
- Offer ways to deal with bottlenecks when they occur
What is a bottleneck? Wikipedia: the term is metaphorically derived from the neck of a bottle, where the flow speed of the liquid is limited by its neck. In engineering, a bottleneck is a phenomenon where the performance or capacity of an entire system is severely limited by a single component. Hence “I/O bottleneck” refers to phenomena where the performance or capacity of an entire system is severely limited by I/O. The challenge for HDF: try not to cause an I/O bottleneck, and provide solutions to I/O bottlenecks.

HDF Bottlenecks
[Figure: the layered stack again: Tools & Applications; HDF5 Application Programming Interface; low-level interface; file system, MPI-IO, SAN, and other layers; file.]
By allowing projects like HDF-EOS to build their own layers, they can put their own view on the API and on what happens below it.

Sources of bottlenecks
- Architectural features
- Characteristics of data and information objects
- Accessing and operating on objects
- Usability/accessibility: beyond specialization

Architecture-related I/O bottlenecks: Software that does I/O often needs to operate on different systems. Differences within and among these systems can create I/O bottlenecks, and can also offer solutions.

Architecture I/O bottleneck examples
Bottlenecks:
- Not enough memory, so applications have to swap to disk
- In a cluster, multiple processors doing I/O on the same file simultaneously
- A parallel file system has special features to avoid bottlenecks, which software must exploit
HDF responses:
- Keep an HDF file in core, so I/O goes from memory to memory (a sketch follows this slide)
- Adaptable parallel I/O strategies, such as collective I/O, merging many small accesses into one large one
- Implement special I/O drivers in the virtual file layer to exploit parallel file systems like PVFS, GPFS, and Lustre
It is less important to highlight traditional bottlenecks than the angles people haven’t thought about. For NARA and NASA it’s not just speed, it’s longevity: not a bottleneck at this point in time, but avoid creating one in the future (commitment to an open format, etc.).
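A hedged sketch of the "file in core" response above, using the core virtual-file driver as exposed by the h5py bindings (the file name and sizes are invented):

```python
# The "core" virtual-file driver keeps the whole HDF5 file in memory, so reads
# and writes go memory-to-memory; with backing_store=True the in-memory image
# is written to disk when the file is closed.
import numpy as np
import h5py

with h5py.File("incore.h5", "w", driver="core", backing_store=True) as f:
    d = f.create_dataset("scratch", shape=(1024, 1024), dtype="f8")
    d[:] = np.random.rand(1024, 1024)   # served from memory, not the file system
# on close, the in-memory image is flushed to "incore.h5"
```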

Characteristics of data and information objects: the size of objects, their heterogeneity, and how we represent information are all potential causes of I/O bottlenecks.

Characteristics of data and information objects: heterogeneity
Bottlenecks:
- Need to represent similar data from different sources, but it comes in different formats
- Having to convert data for interoperability
HDF responses:
- Creation of common models and corresponding I/O libraries, avoiding the need to convert
- Add I/O filters to auto-convert data (see the sketch below)
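As one small, hedged illustration of automatic conversion during I/O (again using h5py, with invented names; this shows the library's built-in datatype conversion rather than a custom filter), HDF5 converts between stored and in-memory datatypes as part of the write and the read:

```python
# Sketch of on-the-fly datatype conversion: data held as float64 in memory is
# stored as float32, and can be read back into a float32 buffer, with the
# conversions done by the HDF5 library rather than by application code.
import numpy as np
import h5py

with h5py.File("converted.h5", "w") as f:
    data64 = np.linspace(0.0, 1.0, 1000)             # float64 in memory
    f.create_dataset("x", data=data64, dtype="f4")   # stored as float32; converted on write

with h5py.File("converted.h5", "r") as f:
    dset = f["x"]
    out = np.empty(dset.shape, dtype="f4")           # in-memory buffer
    dset.read_direct(out)                            # converted during the read
```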

Characteristics of data and information objects: size
Bottlenecks:
- Metadata/data differences: it is hard to do both big I/O and small I/O efficiently, especially on high-end systems tuned for big I/O.
HDF responses:
- Metadata caching options: cache metadata and data to avoid re-reading and re-writing
- Let the application control the cache: the application can control when the cache is flushed, and can advise about cache sizes and replacement strategies (see the sketch below)
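A sketch of application-advised caching, under the assumption that the h5py bindings are used: the keywords below size the raw-data chunk cache at file-open time (metadata cache tuning sits in the lower-level C API), and the values are illustrative only, not recommendations.

```python
# Application advises cache sizes when opening the file and decides when to flush.
import h5py

f = h5py.File("cached.h5", "a",
              rdcc_nbytes=64 * 1024 * 1024,  # 64 MiB chunk cache
              rdcc_nslots=100_003,           # hash-table slots (a prime is suggested)
              rdcc_w0=0.75)                  # preemption preference for fully read/written chunks
# ... reads and writes ...
f.flush()    # application controls when cached data reaches the file
f.close()
```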

Characteristics of data and information objects: representation
Bottlenecks:
- Different applications need different views of the information, requiring transformation: changing coordinate systems, ingesting into a database, changing engineering units.
HDF responses:
- Group, index, and reference structures provide different views of the same data at one time
- I/O filters can operate on data during I/O

Accessing and operating on objects: I/O bottlenecks can occur when data is collected, generated, searched, analyzed, converted, and moved.

Accessing and operating on objects: sequential read/write
Bottlenecks:
- Data from a single source at a very high rate
- Data from multiple sources, simultaneously
HDF responses:
- Use different file structures for sequential vs. random access
- Exploit available system optimizations (e.g., direct I/O to bypass system buffers)

Accessing and operating on objects: partial access
Bottlenecks:
- Accessing or operating on part of an object, slicing through an object, etc.
- Access to a compressed object
- Performing a query about an object or collection
HDF responses:
- Offer a rich set of partial I/O operations that recognize access patterns and optimize for them
- Use chunking to enable fast slicing through arrays (see the sketch below)
- Compress in chunks, avoiding the need to uncompress the whole object
- Create and store indexes together with the data
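A minimal sketch of the chunking and compression responses (h5py; the shapes, chunk sizes, and names are invented): a chunked, gzip-compressed dataset can be sliced without decompressing the whole object, because only the chunks that intersect the selection are read and uncompressed.

```python
# Chunked, compressed storage plus partial I/O (a hyperslab read).
import numpy as np
import h5py

with h5py.File("chunked.h5", "w") as f:
    d = f.create_dataset("cube", shape=(1000, 1000, 100), dtype="f4",
                         chunks=(100, 100, 10),            # storage laid out in 100x100x10 tiles
                         compression="gzip", compression_opts=4)
    d[0:100, 0:100, :] = np.random.rand(100, 100, 100).astype("f4")

with h5py.File("chunked.h5", "r") as f:
    slab = f["cube"][:, 500, :]   # partial I/O: one slice through the array
```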

Accessing and operating on objects: remote access
Bottlenecks:
- All of the above are exacerbated when the data is accessed from a distance or over a slow network
HDF responses:
- Avoid moving the data: send the operation to the data rather than the data to the operation
- Put HDF5 software inside the remote data system, such as iRODS
- Implement remote query/access protocols, such as OPeNDAP

Usability/accessibility: beyond specialization. Data is collected for specific purposes, then frequently turns out to have many other uses. Too often only the first users (the specialists) have the knowledge and tools to access the data and interpret it meaningfully.

The gaps between producer and user may be social, political, economic, semantic, or temporal. The greater the gaps between producer and consumer, the greater the challenges to usability and accessibility.

Usability/accessibility bottlenecks
- What data do I need, and where do I find it?
- Now that I have it, what does this data really mean? What is its provenance? Its quality?
- What tools do I need to access the data? Do they exist? How do I use them?
- How do I transform the data into representations that address my information needs?
- How do I integrate and combine this data with my other data to create new information?
- Who can help me?

HDF responses to support usability
- Layer the software to make HDF accessible at different levels of expertise
- Develop and promote standard models and representations in HDF (EOS, netCDF, EXPRESS)
- Develop and promote metadata standards and their representation in HDF
- Provide simple tools to view the data
- Provide tools to export just the data needed to other formats
- Work with tool builders, both open and proprietary

Supporting usability across time
- Export to simple, enduring formats, such as XML
- Create maps to the data
- Define and store Access Information Packages
- Be tenacious about backward compatibility

Philosophy: a single platform with multiple uses
- One general format
- One library, with:
  - Options to adapt I/O and storage to data needs
  - Layers on top and below
  - Ability to interact well with other technologies
  - Attention to past, present, and future compatibility

Thank you
Mike Folk
mfolk@hdfgroup.org