Validating the Rasdaman Capability for Handling Big Raster Data

Validating the Rasdaman Capability for Handling Big Raster Data: A Progress Report
Presenter: Chaowei Phil Yang, GMU Task PI
Chris Scheele (1), Fei Hu (2), Manzhu Yu (2), Mengchao Xu (2), Kai Liu (2), Qunying Huang (1), Chaowei Yang (2)
(1) University of Wisconsin-Madison, Madison, WI. (2) NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA.
Supported by Mike Little through AIST (NNX15AH51G) as part of the AIST data container study led by Kamalika Das/NASA Ames, with participation from Thomas Clune/Goddard, Kamalika Das/Ames, Daniel Duffy/Goddard, Ted Habermann/HDF, Thomas Huang/JPL, Kwo-Sen Kuo/Goddard, Chris Mattmann/JPL, and Chaowei Phil Yang/GMU.

Outline
- Background
- Datasets
- Test Environment and Design
- Results
- Conclusion and Future Work

NASA AIST Data Container Study
- Big Earth data are collected and accumulated daily.
- Grand challenges exist across the data lifecycle: preprocessing, management, publishing and access, analysis, and presentation.
- AIST launched the data container project to capture and validate innovative technologies and methodologies for addressing big Earth data challenges.

Large scale data management solution
- The volume, velocity, and variety of spatial data, along with the computationally intensive nature of spatial queries, pose a grand challenge to storage technologies for effective big data management.
- E.g., weather-induced disaster events (hurricanes and dust storms) evolve over time and usually do not have well-defined boundaries. Their features may be captured by multiple satellites and by images from different time series.
- To process and extract information from large scale satellite data, a data-intensive framework is needed that uses distributed storage and computation resources.

Large scale data management solution
- Recently, array-based database systems have emerged as a scalable and cost-effective database solution to store and retrieve massive multi-dimensional arrays, such as sensor, image, and statistical data.
- Rasdaman (raster data manager) is one of them:
  - An open-source, distributed, array-based database
  - Implements OGC standard interfaces
  - Provides a tight integration of raster access into the query language
  - Can use multiple servers to store and process data

Objective
Evaluate Rasdaman as a container for big Earth science data management and analytics.

Datasets 1: MODIS
- Terra and Aqua satellites, 36 bands (atmosphere, ocean, land)
- Gridded Level 2 daily surface reflectance product (MYD09GA) in HDF4 format
- Average file size: 85 MB
- Global coverage collected from Oct. 1 to Nov. 5, 2015, totaling 1 TB
Figure: Daily Surface Reflectance, 10/30/2015 (usgs.gov)
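The figures on this slide imply a rough file count and daily ingest rate. The back-of-the-envelope sketch below derives them; it assumes decimal units (1 TB = 1,000,000 MB) and an inclusive date range, neither of which the slide states, and all variable names are illustrative.

```python
from datetime import date

# Figures taken from the slide.
avg_file_mb = 85   # average MYD09GA file size
total_tb = 1       # total collected volume

# Assumption: the collection period is inclusive of both end dates.
days = (date(2015, 11, 5) - date(2015, 10, 1)).days + 1

# Assumption: decimal units, 1 TB = 1,000,000 MB.
total_mb = total_tb * 1_000_000

n_files = total_mb / avg_file_mb      # approximate number of files
files_per_day = n_files / days        # approximate files per day
gb_per_day = total_mb / days / 1000   # approximate daily volume in GB

print(f"{days} days, ~{n_files:.0f} files, "
      f"~{files_per_day:.0f} files/day, ~{gb_per_day:.1f} GB/day")
```

Under these assumptions the collection amounts to roughly 11,800 files over 36 days, or about 330 files and 28 GB per day.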

Datasets 2: Dust Storm Dataset
- Non-Hydrostatic Mesoscale Dust Model (NMM-Dust)
- Non-hydrostatic mesoscale model developed by NCEP
- Provides 3-7 day forecasts at the regional level
- NetCDF data format
- Daily output: 30 GB
- 5+ dimensional data

Testing Platform

Test platform    Location    Server size  CPU cores  CPU speed  Memory  Storage  Network
Test platform 1  UW-Madison  2            8          3.4 GHz    16 GB   2 TB     1G
Test platform 2  GMU         20           24         2.80 GHz   24 GB   4 TB     20G

Testing Matrix
Performance:
- Hardware: CPU/Memory Test, Scalability Test, Data Size Test
- Software: Rasdaman, Hive, Spark
- Application: MODIS Data Access Test, Dust Storm Data Mining Test

Testing Queries: Query Design

Query ID  Description                                            Function
1         Select a single pixel from a single image              Spatial
2         Select a subset from a single image                    Spatial
3         Select a single pixel from multiple images             Temporal
4         Select a subset from multiple images                   Temporal
5         Select mean value of each band of a single image       Statistical
6         Select mean value of each band across multiple images  Statistical
7         Select band 1 - band 2 from a single image             Operational
8         Select band 1 - band 2 from multiple images            Operational
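To make the eight query types concrete, here is a minimal sketch that expresses each one as a NumPy operation on a synthetic image stack. This is not Rasdaman's query language or the queries used in the study; the array layout (time, band, row, column), the array sizes, and every variable name are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic stand-in for an image collection, laid out as
# (time, band, row, col). All sizes are arbitrary placeholders.
rng = np.random.default_rng(0)
stack = rng.integers(0, 10_000, size=(5, 7, 120, 120)).astype(np.int32)

t, r, c = 0, 60, 60                            # one image, one pixel
rows, cols = slice(40, 80), slice(40, 80)      # a 40 x 40 spatial window

queries = {
    # Spatial: a single image
    1: stack[t, :, r, c],            # single pixel, all bands
    2: stack[t, :, rows, cols],      # spatial subset
    # Temporal: multiple images
    3: stack[:, :, r, c],            # single pixel through time
    4: stack[:, :, rows, cols],      # spatial subset through time
    # Statistical: per-band means
    5: stack[t].mean(axis=(1, 2)),   # one image
    6: stack.mean(axis=(0, 2, 3)),   # across all images
    # Operational: band 1 minus band 2 (indices 0 and 1)
    7: stack[t, 0] - stack[t, 1],    # one image
    8: stack[:, 0] - stack[:, 1],    # all images
}

for query_id, result in queries.items():
    print(query_id, result.shape)
```

The result shapes show how the query classes differ in the amount of data touched: the spatial and temporal queries only slice the array, while the statistical and operational ones read whole bands and compute over them.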

Workflow - MODIS

Workflow - Dust Model Output
Dust model output in NetCDF -> extract variables of interest -> dust concentration in NetCDF -> import into Rasdaman -> query test -> result

Testing Results: Dust Model Output - Spatial Query Test

Testing Results: MODIS - Different Query Test (Rasdaman vs. Hive vs. Spark)

Testing Results: MODIS - Multi-Threading/Server Test (Rasdaman)

Testing Results: MODIS - Spark Test (one request)

Conclusion and Future Work: Initial Results and Next Steps
- Hive performs better for single-pixel extraction from multiple images.
- Rasdaman has the best performance for queries with statistical and operational functions.
- Except for single-pixel extractions, Spark performs better than Hive and close to Rasdaman.
- Rasdaman supports the NetCDF data format better than HDF.
- Rasdaman clustering configuration is complex; we are communicating with Peter Baumann to see if we can get a testing license for the data container study.

Initial Results and Next Steps
- The optimal configuration (e.g., for scalability) of Rasdaman can be achieved based on the number of CPU cores.
- Array-based database systems (e.g., Rasdaman) have the potential to provide a scalable and cost-effective database solution to store and retrieve massive scientific datasets.
- Scalability of Rasdaman on multiple servers, spatiotemporal indexing, and optimization should be further investigated (in touch with Peter Baumann, rasdaman executive director).

Acknowledgements Project is funded by.