Published by Maryann Chambers
1
Validating the Rasdaman Capability for Handling Big Raster Data
A progress report
Presenter: Chaowei Phil Yang, GMU Task PI
Chris Scheele (1), Fei Hu (2), Manzhu Yu (2), Mengchao Xu (2), Kai Liu (2), Qunying Huang (1), Chaowei Yang (2)
(1) University of Wisconsin – Madison, Madison, WI. (2) NSF Spatiotemporal Innovation Center, George Mason Univ., Fairfax, VA.
Supported by Mike Little through AIST (NNX15AH51G) as part of the AIST data container study led by Kamalika Das/NASA Ames, with participation from Thomas Clune/Goddard, Kamalika Das/Ames, Daniel Duffy/Goddard, Ted Habermann/HDF, Thomas Huang/JPL, Kwo-Sen Kuo/Goddard, Chris Mattmann/JPL, Chaowei Phil Yang/GMU
2
Outline
- Background
- Datasets
- Test Environment and Design
- Results
- Conclusion and Future Work
3
NASA AIST Data Container Study
Big Earth data are collected and accumulated daily. Grand challenges exist across the data lifecycle: preprocessing, management, publishing and access, analysis, and presentation. The Data Container project was launched by AIST to capture and validate innovative technologies and methodologies for addressing big Earth data challenges.
4
Large scale data management solution
The volume, velocity, and variety of spatial data, along with the computationally intensive nature of spatial queries, pose grand challenges to storage technologies for effective big data management. For example, weather-induced disaster events (hurricanes and dust storms) that evolve over time usually do not have well-defined boundaries. Their features may be captured by multiple satellites and in images from different time series. To process and extract information from large-scale satellite data, a data-intensive framework is needed for distributed storage and computation resources.
5
Large scale data management solution
Recently, array-based database systems have emerged as a scalable and cost-effective database solution to store and retrieve massive multi-dimensional arrays, such as sensor, image, and statistical data. Rasdaman (raster data manager) is one of them:
- An open-source, distributed, array-based database
- Implements OGC standard interfaces
- Provides a tight integration of raster access into the query language
- Able to use multiple servers to store and process data
6
Objective
Evaluate Rasdaman as a container for big Earth science data management and analytics.
7
Datasets – 1: MODIS
- Terra and Aqua satellites
- 36 bands (atmosphere, ocean, land)
- Gridded Level 2 Daily Surface Reflectance product (MYD09GA) in HDF4 format
- Average file size: 85 MB
- Global coverage collected from Oct. 1 to Nov. 5, 2015, totaling 1 TB
[Figure: Daily Surface Reflectance, 10/30/2015, usgs.gov]
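The figures on this slide imply a rough file count, which gives a sense of how many objects the import step has to handle. The short calculation below is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 1,000,000 MB) and counts both end dates of the collection period; neither assumption is stated on the slide.

```python
# Back-of-the-envelope file count from the numbers quoted on the slide.
AVG_FILE_MB = 85   # average file size (from the slide)
TOTAL_TB = 1       # total volume (from the slide)
DAYS = 36          # Oct. 1 through Nov. 5, 2015, both days included (31 + 5)

# Assumption: decimal units, 1 TB = 1,000,000 MB.
total_mb = TOTAL_TB * 1_000_000

n_files = total_mb / AVG_FILE_MB   # approximate number of HDF4 files
files_per_day = n_files / DAYS     # approximate files ingested per day

print(f"about {n_files:,.0f} files in total")
print(f"about {files_per_day:.0f} files per day")
```

Under these assumptions the test set is on the order of ten thousand files, a few hundred per day.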
8
Datasets – 2: Dust Storm Dataset
Non-Hydrostatic Mesoscale Dust Model (NMM-Dust)
- Non-hydrostatic mesoscale model developed by NCEP
- Provides 3-7 day forecasts at the regional level
- NetCDF data format
- Daily output: 30 GB
- 5+ dimensional data
9
Outline
- Background
- Datasets
- Test Environment and Design
- Results
- Conclusion and Future Work
10
Testing Platform

| Test Platform | Location | Server Size | CPU Cores | CPU Speed | Memory | Storage | Network |
|---|---|---|---|---|---|---|---|
| Test platform 1 | UW-Madison | 2 | 8 | 3.4 GHz | 16 GB | 2 TB | 1G |
| Test platform 2 | GMU | 20 | 24 | 2.80 GHz | 24 GB | 4 TB | 20G |
11
Testing Matrix
Performance is tested along three dimensions:
- Hardware: CPU/Memory Test, Scalability Test, Data Size Test
- Software: Rasdaman, Hive, Spark
- Application: MODIS Data Access Test, Dust Storm Data Mining Test
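One way to read the testing matrix is as the set of candidate combinations of hardware test, software system, and application. The sketch below simply enumerates that cross product from the labels on the slide. It is an illustration only: the slides do not say that every combination was actually run, and the `candidate_runs` helper and dictionary keys are hypothetical names.

```python
from itertools import product

# Labels taken from the Testing Matrix slide.
HARDWARE_TESTS = ["CPU/Memory", "Scalability", "Data Size"]
SOFTWARE = ["Rasdaman", "Hive", "Spark"]
APPLICATIONS = ["MODIS Data Access", "Dust Storm Data Mining"]


def candidate_runs():
    """Enumerate every (hardware test, software, application) combination.

    Hypothetical helper: this lists candidates, not the runs the study performed.
    """
    return [
        {"hardware_test": h, "software": s, "application": a}
        for h, s, a in product(HARDWARE_TESTS, SOFTWARE, APPLICATIONS)
    ]


runs = candidate_runs()
print(len(runs), "candidate test configurations")
for run in runs[:3]:
    print(run)
```

A real test plan would prune this list, for example by dropping combinations for which a system has no import path for the data format.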
12
Testing Queries: Query Design

| Query ID | Description | Function |
|---|---|---|
| 1 | Select a single pixel from a single image | Spatial |
| 2 | Select a subset from a single image | Spatial |
| 3 | Select a single pixel from multiple images | Temporal |
| 4 | Select a subset from multiple images | Temporal |
| 5 | Select mean value of each band of a single image | Statistical |
| 6 | Select mean value of each band across multiple images | Statistical |
| 7 | Select band 1 - band 2 from a single image | Operational |
| 8 | Select band 1 - band 2 from multiple images | Operational |
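To make the eight query types concrete, the sketch below expresses each one as a plain Python function over a tiny synthetic image stack. This is not the query code used in the study and it does not call Rasdaman, Hive, or Spark; the data layout (`stack[time][band][y][x]`), the pixel values, and all function names are assumptions made up for illustration.

```python
from statistics import mean


def make_image(t, bands=2, height=4, width=4):
    """Synthetic image laid out as [band][y][x]; value = 100*t + 10*band + y + x."""
    return [[[100 * t + 10 * b + y + x for x in range(width)]
             for y in range(height)]
            for b in range(bands)]


stack = [make_image(t) for t in range(3)]  # three images, e.g. three days


def pixel(img, band, y, x):                       # Query 1: spatial
    return img[band][y][x]


def subset(img, band, y0, y1, x0, x1):            # Query 2: spatial (half-open ranges)
    return [row[x0:x1] for row in img[band][y0:y1]]


def pixel_series(images, band, y, x):             # Query 3: temporal
    return [pixel(img, band, y, x) for img in images]


def subset_series(images, band, y0, y1, x0, x1):  # Query 4: temporal
    return [subset(img, band, y0, y1, x0, x1) for img in images]


def band_means(img):                              # Query 5: statistical
    return [mean(v for row in band for v in row) for band in img]


def band_means_multi(images):                     # Query 6: statistical
    n_bands = len(images[0])
    return [mean(v for img in images for row in img[b] for v in row)
            for b in range(n_bands)]


def band_diff(img, b1=0, b2=1):                   # Query 7: operational (band b1 - band b2)
    return [[p - q for p, q in zip(r1, r2)]
            for r1, r2 in zip(img[b1], img[b2])]


def band_diff_multi(images, b1=0, b2=1):          # Query 8: operational
    return [band_diff(img, b1, b2) for img in images]


print(pixel(stack[0], 0, 1, 2))
print(pixel_series(stack, 1, 0, 0))
print(band_means(stack[0]))
```

Queries 3, 4, 6, and 8 are the single-image operations repeated along the time axis, which is why performance on multiple images depends heavily on how each system stores and indexes the temporal dimension.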
13
Workflow - MODIS
14
Workflow – Dust Model Output
Dust Model Output in NetCDF → Extract Variables of Interest → Dust Concentration in NetCDF → Import → Rasdaman → Query Test → Result
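The workflow above can be sketched end to end in a few lines. The example below is a toy stand-in under loud assumptions: a Python dictionary plays the role of the NetCDF file, `InMemoryArrayStore` is a hypothetical placeholder for the array database (it is not the Rasdaman API), and the variable names and values are invented.

```python
import time


def extract_variables(dataset, wanted):
    """Keep only the variables of interest from the model output."""
    missing = [name for name in wanted if name not in dataset]
    if missing:
        raise KeyError(f"variables not in dataset: {missing}")
    return {name: dataset[name] for name in wanted}


class InMemoryArrayStore:
    """Hypothetical stand-in for the array database; NOT the Rasdaman API."""

    def __init__(self):
        self._collections = {}

    def import_array(self, name, array):
        self._collections[name] = array

    def query_point(self, name, index):
        # Walk the nested lists one dimension at a time.
        value = self._collections[name]
        for i in index:
            value = value[i]
        return value


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Invented model output, laid out as (time, y, x).
model_output = {
    "dust_concentration": [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]],
    "wind_speed": [[[3.0, 3.5], [4.0, 4.5]], [[5.0, 5.5], [6.0, 6.5]]],
}

extracted = extract_variables(model_output, ["dust_concentration"])  # extract step
store = InMemoryArrayStore()
for name, array in extracted.items():                                # import step
    store.import_array(name, array)

value, elapsed = timed(store.query_point, "dust_concentration", (1, 0, 1))  # query test
print(f"value={value}, elapsed={elapsed:.6f} s")                            # result
```

Extracting only the variables of interest before import keeps the stored array smaller, which matters when the daily model output is tens of gigabytes.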
15
Outline
- Background
- Datasets
- Test Environment and Design
- Initial Results
- Conclusion and Future Work
16
Testing Results
Dust Model Output - Spatial Query Test
17
Testing Results
MODIS – Different Query Test: Rasdaman vs. Hive vs. Spark
18
Testing Results
MODIS – Multi-Threading/Server Test: Rasdaman
19
Testing Results
MODIS – Spark Test – one request
20
Conclusion and Future Work
Initial Results and Next Steps
- Hive performs better for single-pixel extraction from multiple images.
- Rasdaman has the best performance for queries with statistical and operational functions.
- Except for single-pixel extractions, Spark performs better than Hive and close to Rasdaman.
- Rasdaman supports the NetCDF data format better than HDF.
- Rasdaman clustering configuration is complex; we are communicating with Peter Baumann to see if we can get a testing license for the data container study.
21
Initial Results and Next Steps
- The optimal configuration (e.g., scalability) of Rasdaman can be achieved based on the number of CPU cores.
- Array-based database systems (e.g., Rasdaman) have the potential to provide a scalable and cost-effective database solution to store and retrieve massive scientific datasets.
- Scalability of Rasdaman on multiple servers, spatiotemporal indexing, and optimization should be further investigated (in touch with Peter Baumann, rasdaman executive director).
22
Selected References
- Baumann, P., A. Dehmel, P. Furtado, R. Ritsch and N. Widmann (1998). The multidimensional database system RasDaMan. ACM SIGMOD Record, ACM.
- Baumann, P., A. Dehmel, P. Furtado, R. Ritsch and N. Widmann (1999). Spatio-temporal retrieval with RasDaMan. VLDB.
- Liu, H. (2014). Comparing NetCDF and a multidimensional array database on managing and querying large hydrologic datasets: a case study of SciDB. TU Delft, Delft University of Technology.
- Merticariu, V. and A. Dumitru (2015). Array Processing in the Cloud: the rasdaman Approach. EGU General Assembly Conference Abstracts.
- Li, Z., F. Hu, J. Schnase, D. Duffy, T. Lee, C. Yang and M. Bowen (2016). A Spatiotemporal Indexing Approach for Efficient Process of Big Array-based Climate Data with MapReduce. International Journal of Geographical Information Science (in press).
- Wilson, B. D., C. A. Mattmann, D. E. Waliser, J. Kim, P. Loikith, H. Lee, ... and K. D. Whitehall (2014, December). SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics. AGU Fall Meeting Abstracts (Vol. 1, p. 3772).
23
Acknowledgements Project is funded by.