Active distributed storage for multidimensional arrays. Dmitry Medvedev, IKI RAN (Space Research Institute of the Russian Academy of Sciences).

Presentation transcript:

Active distributed storage for multidimensional arrays. Dmitry Medvedev, IKI RAN

Scientific data arrays
Arrays are widely used in environmental sciences to store modelling results, satellite observations, raster maps, etc. Datasets can be quite large, up to several terabytes. Most data are stored as file collections in proprietary formats or in widely adopted formats such as netCDF, GRIB and HDF5.
File access can be problematic:
- Scientists need to know too many file formats.
- Files usually must be downloaded completely before they can be used.
- A single data request may process thousands of files, yet only a small portion of their contents appears in the result set.
Currently available database solutions do not offer convenient array storage.

ActiveStorage
ActiveStorage is a generic storage for arrays of primitive data types. Its data model is based on Unidata's Common Data Model, used in netCDF, HDF5 and OPeNDAP. In essence, ActiveStorage is a SQL Server database with CLR stored procedures and a client library. The stored procedures and the client library provide an abstraction layer for data access. Large arrays are split into chunks, which can be spread across several parallel database servers for better performance.

[Figure: architecture comparison of ActiveStorage, RasDaMan and SciDB. Labels: RDBMS (binary data, metadata), stored procedures, client library, middleware.]

Common Data Model
This is the Common Data Model (CDM) used in recent versions of OPeNDAP, netCDF and HDF5. Its purpose is to represent multidimensional scientific data.

Database schema

Splitting an array into chunks
[Figure: seeks needed to read part of a non-chunked array and of a chunked array. Labels: 1 seek, 8 seeks, 4 seeks.]
- We store chunks in BLOB fields of a database table (columns: chunk_key, chunk).
- Chunks do not need to be the same size.
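To make the chunk idea concrete, the sketch below lists the chunks that a rectangular request touches. It assumes a regular grid of equally sized chunks, which is a simplification, since the slide notes that chunks may differ in size. The function name `chunks_for_request` and its parameters are hypothetical and are not taken from ActiveStorage.

```python
from itertools import product

def chunks_for_request(start, stop, chunk_shape):
    """Return grid indices of chunks overlapping the half-open box [start, stop).

    Assumption: every chunk has the same shape `chunk_shape`.
    """
    ranges = []
    for lo, hi, size in zip(start, stop, chunk_shape):
        first = lo // size          # chunk holding the first requested index
        last = (hi - 1) // size     # chunk holding the last requested index
        ranges.append(range(first, last + 1))
    return list(product(*ranges))

# Rows 0..7 of column 2, in an array chunked 4 x 4.
column_request = chunks_for_request((0, 2), (8, 3), (4, 4))
print(column_request)  # [(0, 0), (1, 0)]
```

Only the listed chunks need to be read, so the cost of a request depends on how well the chunk shape matches the shape of typical requests.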

Data and directory tables
Two tables are automatically created for each new variable:
- Data table: stores data chunks in BLOB columns. A chunk consists of a header and a data block.
- Directory table: contains information about chunk boundaries.
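A minimal sketch of such a table pair follows. It uses Python's built-in `sqlite3` in place of SQL Server only so that it runs anywhere. Apart from `chunk_key` and `chunk`, every table name, column name and helper here is an assumption made for illustration, and the directory layout (one row per chunk per dimension) is one possible design rather than the actual ActiveStorage schema.

```python
import sqlite3

def create_variable_tables(conn, variable):
    """Create the (hypothetical) data and directory tables for one variable."""
    conn.execute(
        f'CREATE TABLE "{variable}_data" ('
        "chunk_key INTEGER PRIMARY KEY, chunk BLOB NOT NULL)"
    )
    conn.execute(
        f'CREATE TABLE "{variable}_directory" ('
        "chunk_key INTEGER NOT NULL, dim INTEGER NOT NULL, "
        "lo INTEGER NOT NULL, hi INTEGER NOT NULL, "
        "PRIMARY KEY (chunk_key, dim))"
    )

def add_chunk(conn, variable, chunk_key, bounds, payload):
    """Store one chunk and its half-open [lo, hi) bounds per dimension."""
    conn.execute(
        f'INSERT INTO "{variable}_data" VALUES (?, ?)', (chunk_key, payload)
    )
    conn.executemany(
        f'INSERT INTO "{variable}_directory" VALUES (?, ?, ?, ?)',
        [(chunk_key, dim, lo, hi) for dim, (lo, hi) in enumerate(bounds)],
    )

def find_chunks(conn, variable, start, stop):
    """Return keys of chunks overlapping the half-open box [start, stop)."""
    clauses, params = [], []
    for dim, (req_lo, req_hi) in enumerate(zip(start, stop)):
        # A chunk overlaps the request in this dimension if lo < req_hi and hi > req_lo.
        clauses.append("(dim = ? AND lo < ? AND hi > ?)")
        params += [dim, req_hi, req_lo]
    sql = (
        f'SELECT chunk_key FROM "{variable}_directory" WHERE '
        + " OR ".join(clauses)
        # Keep only chunks that overlap in every dimension.
        + " GROUP BY chunk_key HAVING COUNT(*) = ? ORDER BY chunk_key"
    )
    return [row[0] for row in conn.execute(sql, params + [len(start)])]

conn = sqlite3.connect(":memory:")
create_variable_tables(conn, "temp")
add_chunk(conn, "temp", 0, [(0, 4), (0, 4)], b"header+block 0")
add_chunk(conn, "temp", 1, [(4, 8), (0, 4)], b"header+block 1")
print(find_chunks(conn, "temp", (3, 1), (5, 2)))  # [0, 1]
```

Keeping boundaries in a separate directory table means the server can decide which BLOBs to read without opening any of them.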

How it works
[Diagram: Application, Client library, SQL Server DB.]
1. The application passes a multi-dimensional data request to the client library.
2. The client library issues commands to the database server.
3. The database server selects the requested data from several chunks.
4. The database server returns the data parts to the client library.
5. The client library assembles the data parts into one multi-dimensional array.
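The final assembly step can be sketched in a few lines. This is an illustration for the 2D case only, using nested lists; the `assemble` function and the shape of `parts` (absolute offset plus a dense block) are assumptions, not the client library's actual interface.

```python
def assemble(parts, start, stop):
    """Combine data parts into one 2D array covering the box [start, stop).

    `parts` is a list of ((row0, col0), block), where `block` is a list of
    rows and (row0, col0) is its absolute position in the full array.
    Cells not covered by any part stay None.
    """
    n_rows = stop[0] - start[0]
    n_cols = stop[1] - start[1]
    result = [[None] * n_cols for _ in range(n_rows)]
    for (row0, col0), block in parts:
        for i, row in enumerate(block):
            for j, value in enumerate(row):
                r = row0 + i - start[0]
                c = col0 + j - start[1]
                if 0 <= r < n_rows and 0 <= c < n_cols:
                    result[r][c] = value
    return result

# Two parts coming from two neighbouring chunks.
parts = [((0, 0), [[1, 2], [3, 4]]), ((2, 0), [[5, 6], [7, 8]])]
print(assemble(parts, (1, 0), (3, 2)))  # [[3, 4], [5, 6]]
```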

Parallel query processing
[Diagram: Application, Client library, SQL Server DB 1, SQL Server DB 2.]
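One way a client library could fan a request out to several servers is sketched below. The placement rule (integer chunk key modulo the number of servers) and the `fetch` callback are assumptions made for the example; the slides do not say how ActiveStorage assigns chunks to servers.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_parallel(chunk_keys, servers, fetch):
    """Fetch chunks from several servers concurrently.

    `fetch(server, keys)` is a caller-supplied function returning a dict
    {chunk_key: data}. Assumption: integer chunk keys are placed on servers
    round-robin, i.e. key % len(servers).
    """
    by_server = {}
    for key in chunk_keys:
        server = servers[key % len(servers)]
        by_server.setdefault(server, []).append(key)

    results = {}
    with ThreadPoolExecutor(max_workers=len(by_server) or 1) as pool:
        futures = [
            pool.submit(fetch, server, keys) for server, keys in by_server.items()
        ]
        for future in futures:
            results.update(future.result())
    return results

# Stand-in for a real database call.
def fake_fetch(server, keys):
    return {key: f"{server}:{key}" for key in keys}

print(fetch_parallel([0, 1, 2, 3], ["db1", "db2"], fake_fetch))
```

Each server works only on its own chunks, so the servers do not need to coordinate with each other; the client merges the partial results.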

Parallel query performance
[Chart: query performance with 1 database server and with 4 parallel database servers.]

NCEP/NCAR Weather Reanalysis
- Continually updated gridded data set
- Incorporates observations and global climate model output
- 74 weather parameters
- 5000 netCDF files, 30 – 500 MB each
- Time coverage: 1948 – hourly values
- Grids: regular grid, 2.5 x 2.5 degrees; T62 Gaussian grid, 192 x 94 points

Database contents
- ns1: single-layer data on regular grid
- ns2: single-layer data on Gaussian grid
- ns3, ns4, ns5: multi-layer data on regular grid
[Diagram: NCEP/NCAR Weather Reanalysis database with groups "ns1" to "ns5". Labels: "time", "lat", "lon", "level", data.]

NCDC Integrated Surface Database
- Time coverage: 1901 – 2008.
- 30 million sensors.
- 1.7 billion observations.
- Sources: fixed ground stations, ships, mobile stations, buoys.
- ASCII files packed with gzip: 50 GB packed, 400 GB unpacked.
[Figure: sample record (fragment: FM V N N N ADDGA KA1120N). Labels: date, time, lat, lon; control data section; mandatory data section; additional data section; section marker; group marker; parameter group.]
When you've downloaded and unpacked the data...

Fixed stations

ActiveStorage database for NCDC data
The main challenges:
- Observation times are irregular.
- Observations are distributed unevenly in time and space.
- Different stations have different sets of observed parameters.
- The number of observations is huge.

Modifications to ActiveStorage
ActiveStorage was designed to handle dense multidimensional arrays with only a small number of missing values. It works well for regularly gridded data. Some multidimensional data are sparse and cannot be represented by a single data block.

Modifications to ActiveStorage
- Sparse arrays can be represented as a tree hierarchy of dense data blocks.
- Some data blocks can be empty.
- Hierarchy levels are treated as additional dimensions, e.g. (3,0,x,y,z).
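The idea of dense blocks with gaps can be illustrated with a deliberately small sketch: a one-dimensional array with a single hierarchy level. The class `SparseBlocks` and its methods are hypothetical; the real structure is a multi-level tree over several dimensions, which this example does not attempt to reproduce.

```python
class SparseBlocks:
    """1D sparse array stored as dense blocks; absent blocks are empty.

    Assumption: a single hierarchy level with fixed-size blocks.
    """

    def __init__(self, block_size, fill=None):
        self.block_size = block_size
        self.fill = fill
        self.blocks = {}  # block index -> dense list of block_size values

    def address(self, index):
        """Split a flat index into (block index, offset inside the block).

        The block index plays the role of an extra leading dimension.
        """
        return divmod(index, self.block_size)

    def set(self, index, value):
        block_id, offset = self.address(index)
        block = self.blocks.setdefault(block_id, [self.fill] * self.block_size)
        block[offset] = value

    def get(self, index):
        block_id, offset = self.address(index)
        block = self.blocks.get(block_id)
        return self.fill if block is None else block[offset]

arr = SparseBlocks(block_size=4)
arr.set(10, 7.5)
print(arr.address(10), arr.get(10), arr.get(0), len(arr.blocks))
# (2, 2) 7.5 None 1
```

Only blocks that hold at least one value take up space, while each stored block is still a dense array that the existing chunk machinery can handle.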

Modifications to ActiveStorage

Time series representation
[Figure labels: point IDs, time series.]
- Time series are stored as a set of 1D arrays: one array per geographical point.
- One geographical point may have observations from several sensors.
- Sensors can be distinguished by observation parameters (station code, observation type, call letters, etc.).

Buckets
[Figure: the latitude/longitude/time domain divided into buckets of 1° x 1° x 1 month. Labels: bucket, bucket IDs, arrays of point IDs.]
- The whole spatio-temporal domain is divided into buckets.
- Each bucket contains a subset of observations from several geographical points.
- A set of IDs of geographical points is stored as a 1D array.
- For each bucket we store only those points that have observations in this bucket.
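A bucket lookup along these lines might look as follows. The bucket size (1 degree by 1 degree by 1 calendar month) comes from the slide; representing a bucket ID as a tuple, and the function names `bucket_id` and `buckets_for_request`, are assumptions for the sake of the example.

```python
import math
from datetime import datetime

def bucket_id(lat, lon, when):
    """Bucket holding one observation: 1 degree x 1 degree x 1 calendar month."""
    return (math.floor(lat), math.floor(lon), when.year, when.month)

def buckets_for_request(lat_range, lon_range, t_start, t_end):
    """List the bucket IDs that a lat/lon/time box overlaps."""
    lat0, lon0 = math.floor(lat_range[0]), math.floor(lon_range[0])
    # Always include at least one bucket per axis, even for a zero-width range.
    lats = range(lat0, max(math.ceil(lat_range[1]), lat0 + 1))
    lons = range(lon0, max(math.ceil(lon_range[1]), lon0 + 1))
    ids = []
    year, month = t_start.year, t_start.month
    while (year, month) <= (t_end.year, t_end.month):
        for lat in lats:
            for lon in lons:
                ids.append((lat, lon, year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return ids

ids = buckets_for_request((55, 57), (37, 39),
                          datetime(2007, 1, 1), datetime(2007, 12, 31))
print(len(ids))  # 48
```

Under these assumptions, a 2 x 2 degree request aligned to whole degrees and spanning one year touches 4 x 12 = 48 buckets, regardless of how many observations the database holds elsewhere.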

Database contents
[Diagram: NCDC Integrated Surface Database with groups "mandatory" and "additional". Labels in each group: "time", "buckets", data.]
The "coords" table helps to select time series by latitude/longitude.

Request processing chart
[Flowchart; all reads go to the data storage.]
1. Get bucket IDs.
2. For each bucket: read point IDs from the bucket and filter points by coordinates.
3. Read observation times and filter points by time.
4. For each point: read observation data.
5. Return results.
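The flow in this chart can be sketched over an in-memory stand-in for the data storage. The `store` dictionary layout, the function name `process_request` and the use of plain numbers for times are all assumptions chosen to keep the example short; they do not describe the actual database calls.

```python
def process_request(store, bucket_ids, lat_range, lon_range, t_range):
    """Return {point_id: [(time, value), ...]} for points matching the request.

    `store` is a hypothetical in-memory stand-in with four dicts:
    "buckets" (bucket id -> point ids), "coords" (point id -> (lat, lon)),
    "times" and "values" (point id -> parallel lists).
    """
    results = {}
    for bucket in bucket_ids:
        # Read point IDs from the bucket.
        for point in store["buckets"].get(bucket, []):
            if point in results:
                continue  # already collected via another bucket
            # Filter points by coordinates.
            lat, lon = store["coords"][point]
            if not (lat_range[0] <= lat <= lat_range[1]
                    and lon_range[0] <= lon <= lon_range[1]):
                continue
            # Read observation times and filter by time.
            times = store["times"][point]
            hits = [i for i, t in enumerate(times)
                    if t_range[0] <= t <= t_range[1]]
            # Read observation data only for the matching times.
            if hits:
                values = store["values"][point]
                results[point] = [(times[i], values[i]) for i in hits]
    return results

store = {
    "buckets": {"b1": [1, 2], "b2": [2, 3]},
    "coords": {1: (55.5, 37.5), 2: (56.5, 38.5), 3: (60.0, 30.0)},
    "times": {1: [1, 5, 9], 2: [20, 30], 3: [2]},
    "values": {1: [10.0, 11.0, 12.0], 2: [1.0, 2.0], 3: [5.0]},
}
print(process_request(store, ["b1", "b2"], (55, 57), (37, 39), (0, 6)))
# {1: [(1, 10.0), (5, 11.0)]}
```

Observation values are read last, after the cheaper coordinate and time filters have discarded most points.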

Request processing times
[Table: sensors, observations and processing time in seconds for Moscow, Madrid and Gulf of Guinea.]
- Moscow, Madrid: fixed stations. Small number of sensors, large number of observations.
- Gulf of Guinea: buoys, ships. Large number of sensors, small number of observations.
* All requests are 2 x 2 degrees, 01/01/2007 – 12/31/2007

ActiveStorage on Windows Azure

How it works
[Diagram labels: Web Role, Worker Role, Queue1, Queue2, BLOB Storage, raw chunks, processed chunks, result.]

ActiveStorage on Windows Azure
Advantages:
- Easy and natural implementation of parallel query execution.
- BLOB read rates are quite good: 6.5 MB/s s overhead.
- Very scalable.
CTP problem: replication overhead
- BLOB writes are several times slower than SQL Server.
- Message exchange is slow (several seconds).