NEON HDF5 eddy4R-Docker-HDF5 team (IPT-EC): David Durden, Stefan Metzger, Andy Fox, Greg Holling, Hongyan Luo, Natchaya Pingintha-Durden, Cove Sturtevant,

Presentation transcript:

NEON HDF5
eddy4R-Docker-HDF5 team (IPT-EC): David Durden, Stefan Metzger, Andy Fox, Greg Holling, Hongyan Luo, Natchaya Pingintha-Durden, Cove Sturtevant, David Weinstein
Date: 7/19/2016

The National Ecological Observatory Network

Goals
Implement a fast and efficient file format for NEON data
    The HDF5 file format provides high compressibility and fast, efficient reading and writing of large amounts of data
Develop a standardized delivery structure for NEON data
    Structuring files around the NEON data product numbering gives an intuitive way to explore larger data files with interdependent data sets
Provide metadata with NEON data
    HDF5 attributes are a concise way to package metadata with our NEON data
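The three goals above can be sketched in a few lines of Python with h5py. This is only an illustration, not the eddy4R implementation: the group path, the dataset name "co2", the attribute names and the random sample values are all assumptions made for the example, and the file is kept in memory so nothing is written to disk.

```python
import h5py
import numpy as np

# Placeholder values only; these are NOT NEON measurements.
rng = np.random.default_rng(0)
samples = rng.normal(400.0, 5.0, size=72_000).astype("f4")

# driver="core" with backing_store=False keeps the file in memory.
with h5py.File("neon_sketch.h5", "w", driver="core", backing_store=False) as f:
    # Hypothetical hierarchy: site / data product level / product number.
    grp = f.create_group("STER/DP1/00001")
    grp.attrs["site"] = "STER"  # metadata attached to the group

    # Chunked, gzip-compressed dataset (compression requires chunking).
    dset = grp.create_dataset(
        "co2",
        data=samples,
        chunks=True,
        compression="gzip",
        compression_opts=4,
    )
    # Attributes carry metadata such as units alongside the data.
    dset.attrs["units"] = "umol mol-1"

    # Read everything back before the file is closed.
    units = dset.attrs["units"]
    compression = dset.compression
    roundtrip_ok = bool(np.array_equal(dset[:], samples))
```

Random values compress poorly, so this sketch shows the mechanics rather than the compression ratios reported later in the slides.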

TIS example (large datasets): storage exchange assembly, turbulent exchange assembly

eddy-covariance in the CI workflow

eddy4R-Docker-HDF5 workflow

NEON Data Product Naming Convention
NEON.DOM.SITE.DPL.PRNUM.REV.TERMS.HOR.VER.TMI
Where:
    NEON  = NEON
    DOM   = Domain, e.g. D10
    SITE  = Site, e.g. STER
    DPL   = Data product level, e.g. DP1
    PRNUM = Product number: a 5-digit number set in the data products catalog. TIS = 00000-09999
    REV   = Revision, e.g. 001
    TERMS = From NEON’s controlled list of terms. Index is unique across products.
    HOR   = Horizontal index. Semi-controlled; AIS and TIS use different rules. Examples: Tower = 000, Hut = 700, DFIR = 900.
    VER   = Vertical index. Semi-controlled; AIS and TIS use different rules. Examples: Ground level = 000, second tower level = 020.
    TMI   = Temporal index. Examples: 001 = 1 minute, 030 = 30 minute, 999 = irregular intervals.
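As a rough illustration of how the ten dot-separated fields could be handled in code, the following Python sketch splits an identifier into named parts and checks the two constraints stated on the slide (the NEON prefix and the 5-digit product number). The function names, the class and the example identifier are hypothetical; in particular the product number 00001 and the TERMS value "exampleTerm" are made up, not entries from NEON's catalog or controlled term list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductName:
    neon: str
    dom: str
    site: str
    dpl: str
    prnum: str
    rev: str
    terms: str
    hor: str
    ver: str
    tmi: str


def parse_product_name(name: str) -> ProductName:
    """Split NEON.DOM.SITE.DPL.PRNUM.REV.TERMS.HOR.VER.TMI into fields."""
    parts = name.split(".")
    if len(parts) != 10:
        raise ValueError(f"expected 10 dot-separated fields, got {len(parts)}")
    parsed = ProductName(*parts)
    if parsed.neon != "NEON":
        raise ValueError("identifier must start with 'NEON'")
    if len(parsed.prnum) != 5 or not parsed.prnum.isdigit():
        raise ValueError("PRNUM must be a 5-digit number")
    return parsed


def is_tis_product(product: ProductName) -> bool:
    """TIS product numbers fall in the range 00000-09999."""
    return 0 <= int(product.prnum) <= 9999


# Made-up identifier: second tower level (020), 30-minute data (030).
example = parse_product_name(
    "NEON.D10.STER.DP1.00001.001.exampleTerm.000.020.030"
)
```

Keeping the fields as strings preserves leading zeros, which matter for the index fields such as HOR, VER and TMI.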

NEON HDF5 file structure Collocating NEON’s long-term atmospheric measurements and field observations

Example File

Metadata in HDF5

Metadata

NEON’s first fluxes from SERC!
Timeframe: 4/22/2016 - 5/03/2016
File size for 1 day (4/22/2016):
    Compressed = 398 MB
    Uncompressed = 1.84 GB
    Data compression ratio ~ 4.5:1
Metadata: units and variable names

Performance testing
Test datasets approximated 1 day of L0p IRGA data
    “compound”: single dataset with each row having many numeric float values and a single string value
    “simple”: one dataset with each row having many numeric float values, second dataset with each row having a single string value

Results for COMPOUND dataset:
            Compressed    Non-compressed
    Read    45 secs       4.25 secs
    Write   621 secs      11.25 secs
    Size    78 MB         266 MB

Results for SIMPLE dataset:
            Compressed    Non-compressed
    Read    1.45 secs     0.75 secs
    Write   21.45 secs    4 secs
    Size    21 MB         266 MB
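To make the trade-off in these numbers easier to read, the short Python sketch below derives size ratios and compression slowdowns from the reported figures. It assumes the first set of results belongs to the compound dataset and the second to the simple dataset, following the order of the two headings; the variable and key names are illustrative.

```python
# Reported values as (compressed, non-compressed) pairs.
results = {
    "compound": {
        "read_s": (45.0, 4.25),
        "write_s": (621.0, 11.25),
        "size_mb": (78.0, 266.0),
    },
    "simple": {
        "read_s": (1.45, 0.75),
        "write_s": (21.45, 4.0),
        "size_mb": (21.0, 266.0),
    },
}

summary = {}
for layout, r in results.items():
    summary[layout] = {
        # Size ratio: non-compressed size divided by compressed size.
        "size_ratio": r["size_mb"][1] / r["size_mb"][0],
        # Slowdown: compressed time divided by non-compressed time.
        "read_slowdown": r["read_s"][0] / r["read_s"][1],
        "write_slowdown": r["write_s"][0] / r["write_s"][1],
    }

for layout, s in summary.items():
    print(
        f"{layout}: size ratio {s['size_ratio']:.2f}:1, "
        f"read x{s['read_slowdown']:.2f}, write x{s['write_slowdown']:.2f}"
    )
```

Under that assumption, the simple layout compresses the same 266 MB to a smaller file and pays a much smaller time penalty for compression than the compound layout.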

Future work
Implement R code in the eddy4R package to produce NEON-formatted HDF5 files
    Development is currently on GitHub; if interested, you can join our development efforts by signing up for one of our working groups
Is there an easy way to embed EML (Ecological Metadata Language) tags into HDF5?
    There is an ISO tag solution, but nothing for EML
