Validating the Rasdaman Capability for Handling Big Raster Data

A progress report
Presenter: Chaowei Phil Yang, GMU Task PI
Authors: Chris Scheele (1), Fei Hu (2), Manzhu Yu (2), Mengchao Xu (2), Kai Liu (2), Qunying Huang (1), Chaowei Yang (2)
(1) University of Wisconsin – Madison, Madison, WI. (2) NSF Spatiotemporal Innovation Center, George Mason Univ., Fairfax, VA.
Supported by Mike Little through AIST (NNX15AH51G) as part of the AIST data container study led by Kamalika Das/NASA Ames, with participation from Thomas Clune/Goddard, Kamalika Das/Ames, Daniel Duffy/Goddard, Ted Habermann/HDF, Thomas Huang/JPL, Kwo-Sen Kuo/Goddard, Chris Mattmann/JPL, and Chaowei Phil Yang/GMU.

Outline
- Background
- Datasets
- Test Environment and Design
- Results
- Conclusion and Future Work

NASA AIST Data Container Study
Big Earth data are collected and accumulated daily. Grand challenges exist across the data lifecycle: preprocessing, management, publishing and access, analysis, and presentation. The data container project was launched by AIST to capture and validate innovative technologies and methodologies for addressing big Earth data challenges.

Large scale data management solution
The volume, velocity, and variety of spatial data, along with the computationally intensive nature of spatial queries, pose a grand challenge to storage technologies for effective big data management. For example, weather-induced disaster events (hurricanes and dust storms) evolve over time and usually do not have well-defined boundaries. Their features may be captured by multiple satellites and by images from different time series. To process and extract information from large-scale satellite data, a data-intensive framework is needed that draws on distributed storage and computation resources.

Large scale data management solution
Recently, array-based database systems have emerged as a scalable and cost-effective database solution to store and retrieve massive multi-dimensional arrays, such as sensor, image, and statistical data. Rasdaman (raster data manager) is one of them:
- An open-source, distributed, array-based database
- Implements OGC standard interfaces
- Provides a tight integration of raster access into the query language
- Ability to use multiple servers to store and process data

Objective
Evaluate Rasdaman as a container for big Earth science data management and analytics.

Datasets - 1: MODIS
- Terra and Aqua satellites
- 36 bands (atmosphere, ocean, land)
- Gridded Level 2 Daily Surface Reflectance product (MYD09GA) in HDF4 format
- Average file size: 85 MB
- Global coverage collected from Oct. 1 to Nov. 5, 2015, totaling 1 TB
[Figure: Daily Surface Reflectance, 10/30/2015. Source: usgs.gov]
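As a rough cross-check of the MODIS figures, the sketch below estimates how many files a 1 TB collection implies. It assumes decimal units (1 TB = 1,000,000 MB), an inclusive date range, and that the 85 MB average holds for every file; none of these assumptions is stated on the slide.

```python
from datetime import date

# Figures quoted on the slide.
TOTAL_MB = 1_000_000   # 1 TB, assuming decimal units (assumption)
AVG_FILE_MB = 85       # average MYD09GA file size

# Oct. 1 to Nov. 5, 2015, assuming both end dates are included (assumption).
days = (date(2015, 11, 5) - date(2015, 10, 1)).days + 1

approx_files = round(TOTAL_MB / AVG_FILE_MB)
approx_files_per_day = round(approx_files / days)

print(f"{days} days, ~{approx_files} files, ~{approx_files_per_day} files per day")
```

Under these assumptions the collection works out to roughly 11,800 files, or about 330 files per day over 36 days.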

Datasets – 2: Dust Storm Dataset
Non-Hydrostatic Mesoscale Dust Model (NMM-Dust)
- Non-hydrostatic mesoscale model developed by NCEP
- Provides 3-7 day forecasts at the regional level
- NetCDF data format
- Daily output: 30 GB
- 5+ dimensional data

Testing Platform
Test Platforms    Location    Server Size  CPU Core  CPU Speed  Memory  Storage  Network
Test platform 1   UW-Madison  2            8         3.4 GHz    16 GB   2 TB     1G
Test platform 2   GMU         20           24        2.80 GHz   24 GB   4 TB     20G

Testing Matrix
Performance is tested along three dimensions:
- Hardware: CPU/Memory Test, Scalability Test, Data Size Test
- Software: Rasdaman, Hive, Spark
- Application: MODIS Data Access Test, Dust Storm Data Mining Test

Testing Queries: Query Design
Query ID  Description                                              Function
1         Select a single pixel from single image                  Spatial
2         Select a subset from a single image                      Spatial
3         Select a single pixel from multiple images               Temporal
4         Select a subset from multiple images                     Temporal
5         Select mean value of each band of a single image         Statistical
6         Select mean value of each band across multiple images    Statistical
7         Select band 1 - band 2 from single image                 Operational
8         Select band 1 - band 2 from multiple images              Operational
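To make the eight query types concrete, here is a minimal sketch that expresses each one against a small synthetic image stack held in nested Python lists. This is not rasql and not the queries used in the study. The `stack` layout, the helper names, and the array sizes are all assumptions chosen for illustration.

```python
from statistics import mean

# Synthetic stand-in for a stack of multi-band images (assumption):
# stack[t][b][r][c] = value at time t, band b, row r, column c.
T, B, R, C = 3, 2, 4, 4
stack = [[[[t * 100 + b * 10 + r + c for c in range(C)]
           for r in range(R)]
          for b in range(B)]
         for t in range(T)]


def pixel(image, band, row, col):
    """Queries 1 and 3: one pixel value from one image."""
    return image[band][row][col]


def subset(image, band, rows, cols):
    """Queries 2 and 4: rectangular window (half-open ranges) from one band."""
    return [line[cols[0]:cols[1]] for line in image[band][rows[0]:rows[1]]]


def band_means(image):
    """Query 5: mean value of each band of one image."""
    return [mean(v for line in band for v in line) for band in image]


def band_difference(image, b1=0, b2=1):
    """Query 7: per-pixel band b1 minus band b2 for one image."""
    return [[x - y for x, y in zip(l1, l2)]
            for l1, l2 in zip(image[b1], image[b2])]


# Spatial queries on a single image.
q1 = pixel(stack[0], band=0, row=1, col=2)
q2 = subset(stack[0], band=0, rows=(0, 2), cols=(1, 3))
# Temporal queries: the same selection repeated over every image.
q3 = [pixel(img, 0, 1, 2) for img in stack]
q4 = [subset(img, 0, (0, 2), (1, 3)) for img in stack]
# Statistical queries. Averaging per-image means equals the overall mean
# here only because every image has the same number of pixels.
q5 = band_means(stack[0])
q6 = [mean(band_means(img)[b] for img in stack) for b in range(B)]
# Operational queries.
q7 = band_difference(stack[0])
q8 = [band_difference(img) for img in stack]
```

The multi-image variants (3, 4, 6, 8) simply repeat the single-image operation along the time axis, which is why they stress storage layout and indexing more than the single-image queries do.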

Workflow - MODIS

Workflow – Dust Model Output
Dust Model Output in NetCDF -> Extract Variables of Interest -> Dust Concentration in NetCDF -> Import -> Rasdaman -> Query Test -> Result

Testing Results: Dust Model Output - Spatial Query Test

Testing Results: MODIS – Different Query Test, Rasdaman vs. Hive vs. Spark

Testing Results: MODIS – Multi-Threading/Server Test, Rasdaman

Testing Results: MODIS – Spark Test – one request

Conclusion and Future Work
Initial Results and Next Steps
- Hive performs better for single pixel extraction from multiple images.
- Rasdaman has the best performance for queries with statistical and operational functions.
- Except for single pixel extraction, Spark performs better than Hive and close to Rasdaman.
- Rasdaman supports the NetCDF data format better than HDF.
- Rasdaman clustering configuration is complex; we are communicating with Peter Baumann to see whether we can get a testing license for the data container study.

Initial Results and Next Steps
- The optimal configuration (e.g., for scalability) of Rasdaman can be determined based on the number of CPU cores.
- Array-based database systems (e.g., Rasdaman) have the potential to provide a scalable and cost-effective database solution to store and retrieve massive scientific datasets.
- Scalability of Rasdaman on multiple servers, spatiotemporal indexing, and optimization should be further investigated (in touch with Peter Baumann, rasdaman executive director).

Acknowledgements Project is funded by.
