Data Container Study: HDF5 in a POSIX File System, or HDF5 C³: Compression, Chunking, Clusters. The HDF Group (www.hdfgroup.org), 2016 ESIP Winter Meeting, January 8, 2016.


Slide 1: Data Container Study: HDF5 in a POSIX File System, or HDF5 C³: Compression, Chunking, Clusters
The HDF Group, 2016 ESIP Winter Meeting, January 8, 2016
Aleksandar Jelenak, John Readey, H. Joe Lee, Ted Habermann

Slide 2: Hardware
- Using the Open Science Data Cloud Griffin cluster
- Xeon systems with 1-16 cores
- 60 compute nodes
- 10 Gb Ethernet
- Ephemeral local POSIX file system
- Shared persistent storage (Ceph object store, S3 API)

Slide 3: Software
- HDF5 library
- Compression libraries: MAFISC, GZIP, BLOSC
- Operating system: Ubuntu Linux
- Linux development tools
- Any HDF5-supported C compiler
- HDF5 tools: h5dump, h5repack, etc.
- Python 3
- Python packages: h5py, NumPy, ipyparallel, PyTables

Slide 4: Data
- NCEP/DOE Reanalysis II, for GSSTF, Daily Grid, v3
  - 0.25×0.25 deg, global
  - Time span covering 7,850 daily files, 120 GB
- NOAA Coral Reef Temperature Anomaly Database (CoRTAD)
  - ~4 km grid, global
  - Weekly time step
  - 8 files, 253 GB

Slide 5: Workflow

Data Ingest/Preprocessing:
- Download data as HDF5 files from the archive and transfer to the S3 object store
- Repack the original file(s) using HDF5 chunking and compression, transfer to the S3 store
- Collate data from the original files into one file with HDF5 chunking and compression, transfer to the S3 store
- Index the data in the file(s) by collecting descriptive statistics (min, max, etc.) for each HDF5 chunk

Data Analysis (see the sketch after this list):
- Launch a number of VMs and connect them into an ipyparallel cluster
- Distribute the input HDF5 data from the S3 store to the cluster VMs
- Execute the data analysis task on the cluster VMs
- Collect the data analysis results from the cluster VMs and prepare the report
- Shut down the cluster and VMs
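A minimal sketch of the "Data Analysis" fan-out pattern described above, assuming an ipyparallel cluster is already running. The engine setup, file list, dataset name, and the `mean_per_file` task are illustrative placeholders, not the study's actual analysis code.

```python
# Fan a per-file task out to ipyparallel engines and gather the results.
# The task body and file list are placeholders for the study's analysis.
import ipyparallel as ipp

def mean_per_file(path):
    """Compute a simple per-file statistic (placeholder analysis task)."""
    import h5py
    import numpy as np
    with h5py.File(path, "r") as f:
        data = f["/some_dataset"][...]        # hypothetical dataset name
    return path, float(np.nanmean(data))

rc = ipp.Client()                              # connect to the running cluster
view = rc.load_balanced_view()                 # schedule tasks on idle engines
files = ["gsstf_collated.h5"]                  # hypothetical input file(s)
results = view.map_sync(mean_per_file, files)  # distribute, wait, collect
for path, value in results:
    print(path, value)
```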

Slide 6: System Architecture
(architecture diagram)

Slide 7: HDF5 Chunks
- Chunking is one of the storage layouts for HDF5 datasets
- A chunked HDF5 dataset's byte stream is broken up into chunks that are stored at various locations in the file
- Chunks are of equal size in the dataset's dataspace, but may not be of equal byte size in the file
- HDF5 filtering works on chunks only
  - Filters perform compression/decompression, scaling, checksum calculation, etc.
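A minimal h5py sketch of creating a chunked, GZIP-filtered dataset, illustrating the points above. The file name, dataset name, shape, and chunk shape are illustrative assumptions, not values from the study.

```python
# Create a chunked, GZIP-compressed HDF5 dataset with h5py.
# File, dataset name, shape, and chunk shape are illustrative only.
import numpy as np
import h5py

with h5py.File("example.h5", "w") as f:
    dset = f.create_dataset(
        "sst",
        shape=(365, 720, 1440),   # time x lat x lon
        dtype="f4",
        chunks=(1, 72, 144),      # chunk shape in the dataset's dataspace
        compression="gzip",       # the GZIP filter is applied chunk by chunk
        compression_opts=9,       # compression level
        fillvalue=np.nan,
    )
    # Writing one time step touches exactly one row of chunks.
    dset[0] = np.random.rand(720, 1440).astype("f4")
```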

Slide 8: Findings: Chunking
- Two different chunking algorithms:
  - Unidata's optimal chunking formula for 3-D datasets
  - the h5py formula
- Three different chunk sizes chosen for the collated NCEP data set (the corresponding read patterns are sketched below):
  - Synoptic map: 1×72×144
  - Data rod: 7850×1×1
  - Data cube: 25×20×20
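Each chunk shape favors a different access pattern: 1×72×144 chunks suit map-style reads of a single time step, while 7850×1×1 chunks suit reading the full time series at one grid cell. A hedged sketch of the two reads follows; the file and dataset names and index values are hypothetical.

```python
# Two read patterns against the collated (time, lat, lon) cube.
# Which chunk shape performs best depends on which pattern dominates.
import h5py

with h5py.File("ncep_collated.h5", "r") as f:   # hypothetical file name
    dset = f["precip"]                          # hypothetical dataset name

    # "Synoptic map" access: one full time step. Matches 1x72x144 chunks,
    # since only one time slice worth of chunks must be read and decompressed.
    day_map = dset[100, :, :]

    # "Data rod" access: the full time series at one grid cell. Matches
    # 7850x1x1 chunks, which keep a cell's entire history in a single chunk.
    cell_series = dset[:, 360, 720]
```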

Slide 9: Findings: Chunking (continued)
Input was the collated NCEP data file: 7850×720×1440, 5 datasets, 121 gigabytes.
Outputs:

Chunk size    Filter    File size change (%)    Runtime (hours)
1×72×144      GZIP
7850×1×1      GZIP
25×20×20      GZIP

Slide 10: Findings: Compression
- Compression filters: GZIP, SZIP, MAFISC, Blosc
- NCEP data set: 7,850 files
- Chunk size: 45×180
(a repacking sketch follows the table below)

Filter          Total file size change (%)    Runtime (hours)
GZIP, level 9
SZIP
MAFISC
Blosc
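Repacking of this kind can be driven with the h5repack tool listed on the Software slide. A sketch invoking it from Python is below; the file names are placeholders, and only the GZIP case is shown since SZIP, MAFISC, and Blosc availability depends on how the local HDF5 library and filter plugins were built.

```python
# Repack an HDF5 file with a new chunk layout and compression filter by
# calling the h5repack command-line tool. File names are placeholders and
# GZIP level 9 is an example setting, not the study's confirmed choice.
import subprocess

subprocess.run(
    [
        "h5repack",
        "-l", "CHUNK=45x180",    # apply a 45x180 chunk layout
        "-f", "GZIP=9",          # compress every chunk with the GZIP filter
        "gsstf_daily.h5",        # input file (placeholder)
        "gsstf_daily_gzip9.h5",  # output file (placeholder)
    ],
    check=True,
)
```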

Slide 11: Data Indexing
- Value range information (min, max) captured for each HDF5 dataset chunk
- These values, plus each chunk's dataspace coordinates, are stored in a PyTables file
- ~30 minutes to collect index data from the collated NCEP data file
- Work on incorporating this information into processing is ongoing
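A hedged sketch of that indexing step: iterate over a dataset's chunks with h5py, record each chunk's min/max and its dataspace coordinates, and append rows to a PyTables table. The file names, dataset name, and index schema are assumptions, not the study's code.

```python
# Build a per-chunk (min, max) index for one dataset and store it in a
# PyTables file. Dataset/file names and the table schema are assumptions.
import h5py
import numpy as np
import tables

class ChunkIndex(tables.IsDescription):
    start0 = tables.Int64Col()   # chunk start offset along each axis (3-D)
    start1 = tables.Int64Col()
    start2 = tables.Int64Col()
    vmin = tables.Float64Col()   # minimum value in the chunk
    vmax = tables.Float64Col()   # maximum value in the chunk

with h5py.File("ncep_collated.h5", "r") as src, \
     tables.open_file("chunk_index.h5", "w") as idx:
    table = idx.create_table("/", "precip_chunks", ChunkIndex)
    dset = src["precip"]                      # hypothetical dataset name
    row = table.row
    for chunk_slices in dset.iter_chunks():   # one tuple of slices per chunk
        block = dset[chunk_slices]
        row["start0"] = chunk_slices[0].start
        row["start1"] = chunk_slices[1].start
        row["start2"] = chunk_slices[2].start
        row["vmin"] = float(np.nanmin(block))
        row["vmax"] = float(np.nanmax(block))
        row.append()
    table.flush()
```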

Slide 12: Findings: Parallel
- Load time improved with up to 16 nodes
- Run time improved super-linearly with more nodes (up to 64)

Slide 13: Conclusion
- Using a computing environment where the POSIX file system is not the persistent storage poses unique challenges
- Chunk size does influence runtime
- Compression filter performance: Blosc < GZIP9 < MAFISC
- Increasing the number of compute nodes reduces the observed differences in runtime