HDF5 Life cycle of data. Boeing, September 19, 2006.

Presentation transcript:

1 HDF5 Life cycle of data Boeing September 19, 2006

2 Overview
- “Life cycle” of HDF5 data
- I/O operations for datasets with different storage layouts
  - Compact dataset
  - Contiguous dataset
    - Datatype conversion
    - Partial I/O for contiguous dataset
  - Chunked dataset
    - I/O for chunked dataset
- Variable length datasets and I/O

3 Life cycle: what happens to data when it is transferred from the application buffer to an HDF5 file?
[Diagram: application data buffer -> H5Dwrite -> Object API -> library internals (the “magic box”) -> virtual file I/O -> unbuffered I/O -> file or other “storage” (data in a file)]

4 “Life cycle” of HDF5 data: inside the magic box
- Operations on data inside the magic box
  - Datatype conversion
  - Scattering - gathering
  - Data transformation (filters, compression)
  - Copying to/from internal buffers
- Concepts involved
  - HDF5 metadata, metadata cache
  - Chunking, chunk cache
- Data structures used
  - B-trees (groups, dataset chunks)
  - Hash tables
  - Local and global heaps (variable length data: link names, strings, etc.)

5 “Life cycle” of HDF5 data: inside the magic box
- Understanding what happens to data inside the magic box helps in writing efficient applications
- The HDF5 library has mechanisms to control behavior inside the magic box
- Goals of this and the next talk:
  - Introduce the basic concepts and internal data structures, and explain how they affect performance and storage sizes
  - Give some “recipes” for improving performance

6 Operations on data inside the magic box
- Datatype conversion. Examples:
  - float to integer
  - little-endian (LE) to big-endian (BE)
  - 64-bit integer to 16-bit integer (overflow may occur!)
- Scattering - gathering
  - Data is scattered/gathered from/to the user’s buffers into internal buffers for datatype conversion and partial I/O
- Data transformation (filters, compression)
  - Checksum on raw data and metadata (in 1.8.0)
  - Algebraic transform
  - GZIP and SZIP compression
  - User-defined filters
- Copying to/from internal buffers
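
The three conversion examples on this slide can be illustrated with a small sketch. It uses only the Python standard library and does not call HDF5; the function names are hypothetical, and it only shows what each conversion means, not how the library implements it or what it does when a value overflows.

```python
import struct

# Range of a signed 16-bit integer.
INT16_MIN, INT16_MAX = -(2 ** 15), 2 ** 15 - 1


def float_to_int(value):
    """Convert a float to an integer by truncating toward zero."""
    return int(value)


def swap_le_to_be_int64(raw):
    """Re-encode an 8-byte little-endian signed integer as big-endian."""
    (value,) = struct.unpack("<q", raw)
    return struct.pack(">q", value)


def overflows_int16(value):
    """Report whether a value is out of range for a signed 16-bit integer."""
    return not (INT16_MIN <= value <= INT16_MAX)
```

A check such as `overflows_int16` is the kind of test an application might run before asking for a narrowing conversion, since the slide warns that overflow may occur.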

7 “Life cycle” of HDF5 data: inside the magic box
- HDF5 metadata
  - Information about HDF5 objects used by the library
  - Examples: object headers, B-tree nodes for groups, B-tree nodes for chunks, heaps, super-block, etc.
  - Usually small compared to raw data sizes (KB vs. MB-GB)
- Metadata cache
  - Space allocated to handle pieces of the HDF5 metadata
  - Allocated by the HDF5 library in the application’s memory space
  - Cache behavior affects overall performance
  - Will be covered in the next talk

8 “Life cycle” of HDF5 data: inside the magic box
- Chunking mechanism
  - Chunking: a storage layout where a dataset is partitioned into fixed-size multi-dimensional tiles, or chunks
  - Used for extendible datasets and datasets with filters applied (checksum, compression)
  - The HDF5 library treats each chunk as an atomic object
  - Greatly affects performance and file sizes
- Chunk cache
  - Created for each chunked dataset
  - Default size: 1 MB
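
As a rough illustration of what "partitioned into fixed-size multi-dimensional chunks" means, the sketch below maps an element index to the chunk that holds it. It is plain Python arithmetic, not HDF5 code; the function names and the example chunk shape are assumptions made for illustration.

```python
from math import ceil, prod


def chunk_grid(dataset_dims, chunk_dims):
    """Number of chunks along each dimension (partial edge chunks count as whole)."""
    return tuple(ceil(n / c) for n, c in zip(dataset_dims, chunk_dims))


def total_chunks(dataset_dims, chunk_dims):
    """Total number of chunks needed to cover the dataset."""
    return prod(chunk_grid(dataset_dims, chunk_dims))


def locate_element(index, chunk_dims):
    """Split an element index into (chunk coordinate, offset inside that chunk)."""
    chunk_coord = tuple(i // c for i, c in zip(index, chunk_dims))
    offset = tuple(i % c for i, c in zip(index, chunk_dims))
    return chunk_coord, offset
```

For example, `chunk_grid((4, 5, 7), (2, 2, 7))` covers a 4 x 5 x 7 dataset with hypothetical 2 x 2 x 7 chunks; because each chunk is handled as a unit, touching one element means bringing its whole chunk into memory.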

9 HDF5 file structure
[Diagram labels: user block; file header info (version #, etc.); root group; symbol table; group; symbol table; object; global heap; local heap]

10 Writing a contiguous dataset of atomic type
[Diagram: a dataset consists of data and metadata]
- Dataspace: rank 3; dimensions Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
- Datatype: IEEE 32-bit float
- Attributes: Time = 32.4, Pressure = 987, Temp = 56
- Storage info: chunked, compressed

11 I/O operations for HDF5 datasets with different storage layouts
- Storage layouts
  - Compact
  - Contiguous
  - Chunked
- I/O performance depends on
  - Dataset storage properties
  - Chunking strategy
  - Metadata cache performance
  - Etc.

12 Writing a compact dataset
- Raw data is stored within the dataset header
[Diagram labels: application memory; metadata cache holding the dataset header (datatype, dataspace, attribute 1, attribute 2, data); file]

13 Writing a contiguous dataset with no datatype conversion
[Diagram labels: application memory with the user buffer (matrix 5x4x7); metadata cache holding the dataset header (datatype, dataspace, attribute 1, attribute 2, ...); file containing the dataset header and the dataset raw data]

14 Writing a contiguous dataset with conversion
[Diagram labels: application memory with the dataset raw data and a 1 MB conversion buffer; metadata cache holding the dataset header (datatype, dataspace, attribute 1, attribute 2, ...); file containing the dataset header and the dataset raw data]

15 Sub-setting of contiguous dataset
Series of adjacent rows
- Application data in memory: M rows
- Data is contiguous in a file
- One I/O operation

16 Sub-setting of contiguous dataset
Adjacent, partial rows
- Application data in memory: M partial rows of N elements each
- Data is scattered in a file in M contiguous blocks
- Several small I/O operations

17 Sub-setting of contiguous dataset
Extreme case: writing a column
- Data is scattered in a file in M contiguous blocks of 1 element each
- Several small I/O operations
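
The three sub-setting cases above (adjacent full rows, adjacent partial rows, a single column) differ only in how many contiguous runs the selection occupies in the file. A minimal sketch, assuming a row-major 2-D dataset and a hypothetical helper name, counts those runs:

```python
def file_runs(n_cols, row_start, row_count, col_start, col_count):
    """Return the (offset, length) runs, in elements, that a rectangular
    selection occupies in a row-major contiguous dataset with n_cols
    elements per row. Runs that touch each other are merged."""
    runs = []
    for row in range(row_start, row_start + row_count):
        offset = row * n_cols + col_start
        if runs and runs[-1][0] + runs[-1][1] == offset:
            # This row starts exactly where the previous run ended: extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + col_count)
        else:
            runs.append((offset, col_count))
    return runs
```

With full rows the list collapses to one run (one I/O operation); with partial rows or a column there is one run per selected row, which is the "several small I/O operations" case.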

18 Sub-setting of contiguous dataset
Data sieve buffer
- Data is scattered in a file in M contiguous blocks of 1 element each
- Data is gathered in a sieve buffer in memory (64K)
- memcopy between the sieve buffer and the application data in memory

19 Performance tuning for contiguous dataset
- Datatype conversion
  - Avoid for better performance
  - Use the H5Pset_buffer function to customize the conversion buffer size
- Partial I/O
  - Write/read in big contiguous blocks (at least the size of a file system block)
  - Use H5Pset_sieve_buf_size to improve performance for complex subsetting

20 Possible tuning work
- Datatype conversion
  - Use of multiple threads for datatype conversion
- Partial I/O
  - OS vector I/O
  - Asynchronous I/O

21 Writing chunked dataset
- Dataset dimension sizes: X x Y x Z
- Dataset is partitioned into fixed-size multi-dimensional chunks of size X/4 x Y/2 x Z

22 Extending chunked dataset in any dimension
- Data can be added in any dimension
- Compression is applied to each chunk
- Datatype conversion is applied to each chunk

23 Writing chunked dataset
- Each chunk is written as a contiguous blob
- Chunks may be scattered all over the file
- Compression is performed when a chunk is evicted from the chunk cache
- Other filters are applied when data goes through the filter pipeline (e.g. encryption)
[Diagram: chunked dataset (chunks A, B, C) -> chunk cache -> filter pipeline -> file]
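
The idea that compression happens on eviction from the chunk cache, not at the time of the write call, can be modelled in a few lines. This is a toy sketch: the class name, the least-recently-used eviction rule and the dictionary standing in for the file are all assumptions, and `zlib` merely stands in for the GZIP filter.

```python
import zlib
from collections import OrderedDict


class ChunkCache:
    """Toy chunk cache: the least recently written chunk is evicted first."""

    def __init__(self, max_chunks):
        self.max_chunks = max_chunks
        self.cached = OrderedDict()  # chunk id -> uncompressed bytes in memory
        self.file = {}               # chunk id -> compressed blob "on disk"

    def write(self, chunk_id, raw):
        """Place a chunk in the cache, evicting older chunks if it is full."""
        self.cached[chunk_id] = raw
        self.cached.move_to_end(chunk_id)
        while len(self.cached) > self.max_chunks:
            evicted_id, evicted_raw = self.cached.popitem(last=False)
            # The filter runs only now, when the chunk leaves the cache.
            self.file[evicted_id] = zlib.compress(evicted_raw)

    def flush(self):
        """Evict everything, e.g. when the dataset is closed."""
        while self.cached:
            chunk_id, raw = self.cached.popitem(last=False)
            self.file[chunk_id] = zlib.compress(raw)
```

Writing chunks A, B and C into a two-chunk cache leaves only A compressed in the "file" until `flush()` is called, which mirrors the picture on the slide.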

24 Writing chunked dataset
[Diagram labels: application memory; metadata cache holding Dataset_1 header ... Dataset_N header and chunking B-tree nodes; chunk cache]
- Chunk cache default size is 1 MB
- Size of the chunk cache is set per file
- Each chunked dataset has its own chunk cache
- A chunk may be too big to fit into the cache
- Memory may grow if the application keeps opening datasets
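
Two of the points above are simple arithmetic, sketched here with hypothetical helper names: whether a chunk fits into a cache of the default size quoted on the slide, and how much memory the per-dataset caches could take in the worst case.

```python
from math import prod

DEFAULT_CHUNK_CACHE_BYTES = 1024 * 1024  # 1 MB default quoted on the slide


def chunk_nbytes(chunk_dims, element_size):
    """Size of one uncompressed chunk in bytes."""
    return prod(chunk_dims) * element_size


def fits_in_cache(chunk_dims, element_size, cache_bytes=DEFAULT_CHUNK_CACHE_BYTES):
    """True if a single chunk is no larger than the chunk cache."""
    return chunk_nbytes(chunk_dims, element_size) <= cache_bytes


def worst_case_cache_memory(open_datasets, cache_bytes=DEFAULT_CHUNK_CACHE_BYTES):
    """Upper bound on memory if every open chunked dataset fills its own cache."""
    return open_datasets * cache_bytes
```

A 1000 x 1000 chunk of 4-byte elements, for instance, is larger than the default cache, which is the "chunk may be too big" situation.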

25 Partial I/O for chunked dataset
- Build a list of chunks and loop through the list
- For each chunk:
  - Bring the chunk into memory
  - Map the selection in memory to the selection in the file
  - Gather elements into the conversion buffer and perform conversion
  - Scatter elements back to the chunk
  - Perform conversion when the chunk is flushed from the chunk cache
- For each element, 3 memcopies are performed
[Diagram: dataset divided into chunks 1, 2, 3, 4]
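
The loop described on this slide can be sketched for the 2-D case as follows. This is an illustrative model only: chunks are kept in a dictionary, `int` stands in for the datatype conversion, and every function and parameter name is hypothetical.

```python
from itertools import product


def chunks_touched(start, count, chunk_dims):
    """Build the list of chunk coordinates that a rectangular selection overlaps."""
    ranges = []
    for s, c, d in zip(start, count, chunk_dims):
        ranges.append(range(s // d, (s + c - 1) // d + 1))
    return list(product(*ranges))


def write_selection(chunks, chunk_dims, start, data, convert=int):
    """Write the 2-D block `data` at `start`, one chunk at a time."""
    rows, cols = len(data), len(data[0])
    ch_r, ch_c = chunk_dims
    for coord in chunks_touched(start, (rows, cols), chunk_dims):
        # Bring the chunk into memory (create it if it does not exist yet).
        chunk = chunks.setdefault(coord, [[0] * ch_c for _ in range(ch_r)])
        r0, c0 = coord[0] * ch_r, coord[1] * ch_c
        # Map the selection onto this chunk: intersect the two rectangles.
        r_lo, r_hi = max(start[0], r0), min(start[0] + rows, r0 + ch_r)
        c_lo, c_hi = max(start[1], c0), min(start[1] + cols, c0 + ch_c)
        # Gather the elements into a conversion buffer and convert them.
        buffer = [convert(data[r - start[0]][c - start[1]])
                  for r in range(r_lo, r_hi) for c in range(c_lo, c_hi)]
        # Scatter the converted elements into the chunk.
        converted = iter(buffer)
        for r in range(r_lo, r_hi):
            for c in range(c_lo, c_hi):
                chunk[r - r0][c - c0] = next(converted)
    return chunks
```

A 2 x 2 block written at position (1, 1) of a dataset with 2 x 2 chunks touches four chunks and places one element in each, which shows how a small selection can still involve several chunks.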

26 Partial I/O for chunked dataset
- Elements participating in I/O are gathered into the corresponding chunk
[Diagram: memcopy from the application buffer in application memory into chunk 3]

27 Partial I/O for chunked dataset
- On eviction from the cache, the chunk is compressed and written to the file
[Diagram: in application memory, data for chunk 3 is gathered into the conversion buffer and scattered into the chunk in the chunk cache; the chunk then goes to the file]

28 Variable length datasets and I/O
Examples of variable-length data
- String
    A[0] “the first string we want to write”
    …
    A[N-1] “the N-th string we want to write”
- Each element is a record of variable length
    A[0] (1,1,0,0,0,5,6,7,8,9)   length of the first record is 10
    A[1] (0,0,110,2005)
    …
    A[N] (1,2,3,4,5,6,7,8,9,10,11,12,…,M)   length of the (N+1)-th record is M

29 Variable length datasets and I/O
- Variable length description in an HDF5 application:
    typedef struct {
        size_t len;   /* number of base-type elements */
        void *p;      /* pointer to the element data */
    } hvl_t;
- Base type can be any HDF5 type: H5Tvlen_create(base_type)
- ~20 bytes overhead for each element
- Raw data cannot be compressed
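
The memory layout of a variable-length element can be mimicked without HDF5. The sketch below is an assumption-laden illustration: it mirrors the two-field structure with `ctypes`, uses `int` as the base type, and the helper names are made up for this example.

```python
import ctypes


class hvl_t(ctypes.Structure):
    """Mirror of the (length, pointer) pair that describes one VL element."""
    _fields_ = [("len", ctypes.c_size_t), ("p", ctypes.c_void_p)]


def make_vl_array(records):
    """Build an array of hvl_t whose pointers refer to separate int buffers.

    The buffers are returned as well, so that they stay alive for as long
    as the caller keeps the array."""
    buffers = []
    array = (hvl_t * len(records))()
    for i, record in enumerate(records):
        buf = (ctypes.c_int * len(record))(*record)
        buffers.append(buf)
        array[i].len = len(record)
        array[i].p = ctypes.addressof(buf)
    return array, buffers


def read_record(element):
    """Follow the pointer of one hvl_t element and copy its values out."""
    ptr = ctypes.cast(element.p, ctypes.POINTER(ctypes.c_int))
    return [ptr[i] for i in range(element.len)]
```

Each array element has a fixed size regardless of the record length; the record data lives elsewhere, which is why elements of different lengths can share one dataset.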

30 Variable length datasets and I/O
- Elements in the application buffer point to global heaps where the actual data is stored
[Diagram labels: application buffer; raw data; global heaps]

31 Writing VL datasets
[Diagram labels: application memory; metadata cache holding the dataset header and B-tree nodes; chunk cache; conversion buffer; raw data; global heap; VL chunked dataset with selected region; filter pipeline; file]

32 VL chunked dataset in a file
[Diagram labels: file; dataset header; chunking B-tree; dataset chunks; raw data]

33 Variable length datasets and I/O
Hints
- Avoid closing/opening a file while writing VL datasets
  - Global heap information is lost
  - Global heaps may have unused space
- Avoid writing VL datasets interchangeably (alternating between datasets)
  - Data from different datasets is written to the same heap
- If the maximum length of the record is known, use fixed-length records and compression
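
The last hint, fixed-length records plus compression, can be tried out on plain bytes. In this sketch the record layout (a 32-bit length followed by `max_len` 32-bit integers), the zero padding and the function names are choices made for the example, and `zlib` stands in for a compression filter.

```python
import struct
import zlib


def pack_fixed(records, max_len, fill=0):
    """Pad each record of 32-bit ints to max_len and pack all records into
    one buffer. The true length is stored in front of each record."""
    out = bytearray()
    for record in records:
        if len(record) > max_len:
            raise ValueError("record longer than max_len")
        padded = list(record) + [fill] * (max_len - len(record))
        out += struct.pack(f"<I{max_len}i", len(record), *padded)
    return bytes(out)


def unpack_fixed(buf, max_len):
    """Recover the original variable-length records from the padded buffer."""
    size = struct.calcsize(f"<I{max_len}i")
    records = []
    for pos in range(0, len(buf), size):
        length, *values = struct.unpack_from(f"<I{max_len}i", buf, pos)
        records.append(values[:length])
    return records
```

Passing the packed buffer through `zlib.compress` shows how much of the padding is squeezed out again, which is the trade-off the hint relies on.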

34 Example: Boeing time-segment library application
- Multiple extendible 1-dim arrays of variable-length records
- Uses HDF5 Packet Table APIs (H5PT)
- HDF5 features used
  - Chunked storage
  - Chunk cache
  - Compound and VL datatypes
  - Datatype conversion
  - Partial I/O
- Complexity affects performance
- Performance tuning is needed

35 Thank you! Questions?