Storage Structure and Efficient File Access


Overview
- Efficiency aspects
- Factors used for improving efficiency
- Measures of efficiency
- Physical aspects of the storage medium
- Study area
- Synthetic computation comparison test
  - Factors: compression method, chunk sizes, other filters applied to compression
  - Measures: process time, file size

Efficiency aspects
- Speed
  - Interactive user experience
  - Pixel profiles (spatial, spectral, temporal)
- Quotas (inode & storage):
  - RSA (Raster Storage Archive): 1 file per time-slice and spectral band
  - GA v0 & v1: 1 file per time-slice containing multiple spectral bands
  - GA v2: 1 file containing multiple time-slices of multiple spectral bands
- Costs
  - Compute & storage (related to compression)

A large driving force behind speed is the user experience during data exploration: GA is moving towards users being able to interactively interrogate and analyse data. The performance gain in speed for both v0 and v1 over the RSA was immense.

Measures of Efficiency
- Creation times (computational costs)
- Collection management strategies
- Appending new data (version control, creation times)
- Algorithm run times (computational costs), largely dependent on disk I/O

Improving efficiency can aid in collection management strategies.

Physical aspects of the storage medium
- Dimensional ordering (BSQ, BIL, BIP in the traditional 3D sense)
- HDF5 chunking
- Compression filters and filter options:
  - fletcher32 checksums (collection management)
  - shuffle (bit reorganisation)
  - scale-offset
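The chunking and filter options above are all set at dataset-creation time. A minimal sketch using h5py (the file name, dataset name, dtype, and reduced shapes are illustrative, not taken from the study):

```python
import numpy as np
import h5py

# In-memory file (core driver, no disk backing) purely for illustration.
f = h5py.File("example.h5", "w", driver="core", backing_store=False)
data = np.random.randint(0, 10000, size=(22, 400, 400), dtype="int16")
dset = f.create_dataset(
    "reflectance",
    data=data,
    chunks=(10, 200, 200),   # (depth/z, y, x) chunk shape
    compression="gzip",      # deflate; "lzf" is the faster alternative
    compression_opts=4,      # deflate level
    shuffle=True,            # byte reorganisation before compression
    fletcher32=True,         # per-chunk checksums
)
chunks = dset.chunks
roundtrip_ok = bool(np.array_equal(dset[...], data))
f.close()
```

Each filter is applied per chunk, which is why the chunk shape interacts so strongly with both compression ratio and read speed in the results that follow.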

Region of Interest
- 2x WRS2 path/rows: 90/84 & 90/85
- Covers 14x 1°x1° cells of the Australian Reflectance Grid (ARG)
- 1 year's worth of Landsat 5 data from 2008
- Yields datasets containing partial image coverage as well as complete image coverage

This coverage should test a variety of conditions (sparse, dense) as well as a variety of land covers, all of which will have some impact on compression.

Factors
- Compression algorithms:
  - Deflate (gzip), levels 1, 4, 8
  - LZF
- Shuffle filter (on/off)
- HDF5 chunk size:
  - Spatial (square in shape): 50, 75, 100, 150, 200, 300, 400
  - Depth: 5, 10, 20
  - Example chunk size: (10, 200, 200), stored across the entire array dimensions of (22, 4000, 4000)
- Total number of combinations: 168
- Cells: 14
- Total number of files: 2352

Depth (z-axis) chunks play a role in improving the efficiency of a user's experience when querying and retrieving a temporal profile of a dataset.
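The combination and chunk counts above can be checked with a little arithmetic (a sketch; the grouping of factors into the cross-product is inferred from the slide):

```python
import math
from itertools import product

compressions = ["gzip-1", "gzip-4", "gzip-8", "lzf"]  # deflate levels 1, 4, 8 plus LZF
shuffle = [True, False]
spatial = [50, 75, 100, 150, 200, 300, 400]           # square spatial chunk sizes
depth = [5, 10, 20]                                   # z-axis chunk sizes

combinations = len(list(product(compressions, shuffle, spatial, depth)))
print(combinations)        # 4 * 2 * 7 * 3 = 168
print(combinations * 14)   # one file per combination per cell = 2352

# Number of chunks for a (10, 200, 200) chunk over a (22, 4000, 4000) array:
shape, chunk = (22, 4000, 4000), (10, 200, 200)
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunk))
print(n_chunks)            # 3 * 20 * 20 = 1200
```

Note the depth dimension (22 over chunks of 10) is not an exact multiple, so the last depth chunk is only partially filled; this edge effect recurs in the results below.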

Effects across each cell
- There will be variability!
- Sparse arrays tended to compress well, and vice versa for dense arrays
- Cell 149_-34 was the opposite; spatial association of the data?

Read and compression efficiency
- Large chunks give faster reads (but become more impacted by partial chunk reads)
- Smaller chunk sizes resulted in slower read times
- Spatial chunks > 150 tended to lose their compression efficiency
- Minor effects when chunk sizes are not a factor of the total array dimensions (larger file sizes, slightly longer read times)
- LZF vs raw: interestingly, LZF compression resulted in faster read times than the raw data
- Results averaged across each cell
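The partial-chunk read penalty mentioned above is easy to quantify: HDF5 reads (and decompresses) every chunk a window touches in full, so any window not aligned to chunk boundaries inflates the bytes actually read. A sketch of the worst-case read amplification (the window and offset figures are illustrative):

```python
def chunks_touched(offset, length, chunk):
    """Number of chunks a 1-D read of `length` starting at `offset` intersects."""
    first = offset // chunk
    last = (offset + length - 1) // chunk
    return last - first + 1

# A 150x150 window read from 200x200 spatial chunks, at an offset that
# straddles a chunk boundary on both axes:
chunk, window, offset = 200, 150, 100
per_axis = chunks_touched(offset, window, chunk)  # 2 chunks per axis
pixels_read = (per_axis * chunk) ** 2             # full chunks are decompressed
pixels_wanted = window ** 2
print(pixels_read / pixels_wanted)                # ~7.1x read amplification
```

This is why large chunks help only when the access pattern actually consumes most of each chunk.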

Synthetic data simulations
- Dimensions: (90, 4000, 4000); double precision (float64)
- Chunk sizes:
  - Depth: 2, 4, 8, 16, 32, 64
  - Spatial: 128
- Simulate appending newly acquired data (collection management):
  - Append a single acquisition
  - Append multiple acquisitions at once
- Dataset creation times
- Synthetic algorithm computation time
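Appending newly acquired time slices, as in the simulation above, can be done with a resizable HDF5 dataset. A minimal h5py sketch (shapes reduced for illustration); when the depth chunk size is n, appending fewer than n slices at a time forces partially filled chunks to be re-read and re-compressed, which is why chunk-aligned appends come out so much faster in the timings that follow:

```python
import numpy as np
import h5py

depth_chunk = 4
f = h5py.File("append.h5", "w", driver="core", backing_store=False)
dset = f.create_dataset(
    "cube",
    shape=(0, 128, 128),
    maxshape=(None, 128, 128),     # unlimited along the time axis
    chunks=(depth_chunk, 128, 128),
    compression="lzf",
    dtype="float64",
)

def append(dset, block):
    """Grow the dataset along axis 0 and write `block` at the end."""
    n = dset.shape[0]
    dset.resize(n + block.shape[0], axis=0)
    dset[n:] = block

# Chunk-aligned append: one write per full depth chunk, no re-reading.
for _ in range(3):
    append(dset, np.random.rand(depth_chunk, 128, 128))

final_shape = dset.shape
f.close()
```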

Dataset creation times (append)
- GZip: 25 seconds to 7 minutes
- LZF: 12 seconds to 2 minutes
- Each point represents a new file using a depth (z-axis) chunk size of n

Dataset creation times (append in chunks)
- Major improvement: less re-reading
- GZip: 20-26 seconds
- LZF: 12-14 seconds

Dataset creation times (comparison)
- Datasets written using complete (depth/z-axis) chunks

Dataset creation times (chunk factor effects)
- Write a spatial block (e.g. 256x256) of data into the 4000x4000 spatial dimensions
- Dips represent the edge of the spatial array dimensions (less data to write)

Dataset creation times (append, sequentially)
- 136-band image, chunks of (32, 128, 128), appended sequentially
- Each write must rewrite every chunk it touches in full
- Appending sequentially, the write times increase with each band index
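The rising write times can be explained by chunk rewrites: with a depth chunk of 32, appending band i alone re-reads, re-compresses, and re-writes the partially filled chunk holding it, so the i-th append rewrites (i mod 32) + 1 bands of data. A back-of-the-envelope count of the total data volume written (a sketch of the effect, not the measured timings):

```python
depth_chunk, bands = 32, 136

# Appending one band at a time: the i-th append rewrites the partially
# filled chunk, i.e. (i % depth_chunk) + 1 bands' worth of data.
sequential = sum((i % depth_chunk) + 1 for i in range(bands))

# Appending in complete depth chunks writes each band exactly once.
chunked = bands

print(sequential, chunked)   # 2148 vs 136 band-writes
```

The per-append cost grows linearly within each chunk, which matches the saw-tooth increase with band index seen in the plot.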

Synthetic algorithm process times (comparison)

result = log(data) * sqrt(data) - x

where:
- data is a 3D array
- x is a vector of averages (one per time point)

- LZF was the only method comparable to the GTiff file
- HDF5 suffered issues due to re-reading of the data
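The synthetic computation above is a few lines of NumPy (shapes reduced from (90, 4000, 4000) for illustration; broadcasting the per-time-point averages along the spatial axes is an assumption about how x is applied):

```python
import numpy as np

# 3-D (time, y, x) array, shifted > 0 so log is defined
data = np.random.rand(9, 40, 40) + 1.0
x = data.mean(axis=(1, 2))          # vector of per-time-point averages

# result = log(data) * sqrt(data) - x, x broadcast along the time axis
result = np.log(data) * np.sqrt(data) - x[:, None, None]
print(result.shape)                 # (9, 40, 40)
```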

Key points
- Write in chunks that contain all the data required
  - Aids collection management: a file can be aggregated once there is sufficient data to fill the chunk dimensions
- Your main type of analysis should determine how you chunk your data; if there isn't one, go for an in-between solution
- Large chunks can improve read times (but can also be a downside if you require less data than a chunk returns)
- GA currently has no products produced from algorithms requiring 4D data, so there is no point reading 4D chunks only to throw away whole dimensions
- v2 allows a user to configure the chunk sizes during the ingestion phase of a dataset