Ibmdbpy-spatial: An Open-source implementation of in-database geospatial analytics in Python

Avipsa Roy1,2, Edouard Fouché2, Rafael Rodriguez Morales2, and Gregor Moehler2
1 Institute for Geoinformatics, University of Münster
2 IBM Research and Development GmbH, Böblingen

MOTIVATION, DESIGN & IMPLEMENTATION, CONCLUSION & RESULTS

- Perform spatial analysis with open-source, interactive IPython notebooks.
- Connect to the database with a simple ODBC/JDBC connection, with no prior GDAL installation.
- Replace complex SQL queries with simple function calls in Python.
- Perform spatial operations within the database and fetch the results as an IdaGeoSeries/IdaGeoDataFrame within the IPython notebook.
- Eliminate the need to load big datasets into memory all at once by using in-database GeoDataFrames.
- Visualise the results of the analysis with other Python libraries: matplotlib and folium.

This work introduces a new method for fast, efficient exploratory analysis of spatio-temporal data with the help of the Python package ibmdbpy. The core idea is to perform in-database analytics on spatial data stored in a traditional enterprise data warehouse. In most cases, a lot of geospatial data is locked up in a spatial database and requires intensive spatial query processing with complex SQL, which may not be familiar to spatial analysts or geoscience experts. Performing the same tasks with an easy-to-use, Python-like syntax from an open-source Python package addresses this problem, and Python is well suited to wrapping query processing behind simple function calls. In this package we have wrapped the spatial functions of IBM dashDB to ease the use of complex SQL and to speed up the analysis of geospatial data. With the help of an ODBC or JDBC connection it is also possible to avoid the burden of loading large shapefiles into memory. The user can instead keep the data inside IBM dashDB, retrieve only what is needed during execution in the form of a pandas-like DataFrame, and view the results inside Jupyter notebooks. This also makes it easier to combine the results of spatial queries on raw data and to visualise them in a more meaningful fashion with additional Python libraries such as matplotlib and folium.
ibmdbpy-spatial
In-Database Analytics
1. Motivation
2. Design & Architecture
3. Setup & Installation
4. Results
Motivation
- Utilise the analytics capabilities offered by Python.
- Represent data stored in spatial databases as IdaGeoDataFrames.
- Convert spatial queries to simple Python wrapper functions.
- Perform spatial operations within the database and fetch the results as a dataframe into memory.

- The Python ecosystem for analytics is rich:
  - SciPy, NumPy, pandas
  - Jupyter notebooks
  - matplotlib, folium, etc.
- Existing spatial analysis libraries require an additional dependency on GDAL:
  - GeoPandas
  - Shapely
  - Fiona

Performance Limitation
- Loading large datasets into memory often slows down performance.
- Huge volumes of data typically stored in data warehouses are impractical to extract all at once.
In-Database Analytics
- Perform computation efficiently inside the database.
- Fetch only a subset of the entire dataset into memory at any one time.
- Benefit from the columnar storage and parallel processing of the database.
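The difference between extracting rows and computing inside the database can be sketched with Python's built-in sqlite3 module. This is only an illustration: sqlite3 stands in for the remote warehouse, and the table `customers` and both helper functions are hypothetical names, not part of ibmdbpy.

```python
import sqlite3

# Stand-in database: sqlite3 replaces the remote warehouse for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, insurance_value REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, 1000.0 * i) for i in range(1, 1001)],
)


def mean_client_side(conn):
    """Extract every row, then aggregate in Python (the costly pattern)."""
    rows = conn.execute("SELECT insurance_value FROM customers").fetchall()
    return sum(r[0] for r in rows) / len(rows), len(rows)


def mean_in_database(conn):
    """Let the database aggregate; only one row crosses the connection."""
    rows = conn.execute("SELECT AVG(insurance_value) FROM customers").fetchall()
    return rows[0][0], len(rows)


client_mean, client_rows = mean_client_side(conn)
db_mean, db_rows = mean_in_database(conn)
print(client_rows, db_rows)  # number of rows transferred in each case
```

Both functions return the same mean, but the in-database version transfers a single row instead of the whole column.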
In-Database Analytics: IBM dashDB
- Cloud-based data warehousing system
- Optimized for analytics
- Provides Spatial Extender
- Integrates BLU technology
- In-memory column store
- Massively Parallel Processing (MPP)
The SQL-Pushdown approach
- We translate higher-level Python syntax into SQL.
- We push the generated SQL to the underlying database through Python wrappers.
- Everything happens transparently to the user.
What we want to achieve: from SPATIAL QUERIES to PYTHON FUNCTIONS

    SELECT IDA1."OBJECTID" AS "INDEXERIDA1",
           IDA2."OBJECTID" AS "INDEXERIDA2",
           DB2GSE.ST_WITHIN(IDA1.SHAPE,IDA2.SHAPE) AS "RESULT"
    FROM (SELECT * FROM SAMPLES.GEO_CUSTOMER
          WHERE ("INSURANCE_VALUE" > 200000)) AS IDA1,
         (SELECT * , DB2GSE.ST_BUFFER(SHAPE,20,'STATUTE MILE') AS "buffer_20_mile"
          FROM SAMPLES.GEO_TORNADO) AS IDA2
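To make the pushdown idea concrete, the query above can be assembled from three small Python functions. This is a minimal sketch, not ibmdbpy's implementation: `filtered`, `with_buffer` and `within_query` are hypothetical helper names, and plain string formatting is used only for illustration (it does not guard against SQL injection).

```python
def filtered(table, condition):
    """Subquery that keeps only the rows matching a condition."""
    return "(SELECT * FROM {} WHERE ({}))".format(table, condition)


def with_buffer(table, geom, distance, unit, alias):
    """Subquery that adds a buffered geometry as a derived column."""
    return (
        "(SELECT * , DB2GSE.ST_BUFFER({},{},'{}') AS \"{}\" FROM {})"
    ).format(geom, distance, unit, alias, table)


def within_query(left, right, index="OBJECTID", geom="SHAPE"):
    """Pairwise ST_WITHIN test between the geometries of two subqueries."""
    return (
        'SELECT IDA1."{idx}" AS "INDEXERIDA1",'
        'IDA2."{idx}" AS "INDEXERIDA2",'
        'DB2GSE.ST_WITHIN(IDA1.{g},IDA2.{g}) AS "RESULT" '
        "FROM {left} AS IDA1, {right} AS IDA2"
    ).format(idx=index, g=geom, left=left, right=right)


customers = filtered("SAMPLES.GEO_CUSTOMER", '"INSURANCE_VALUE" > 200000')
tornadoes = with_buffer(
    "SAMPLES.GEO_TORNADO", "SHAPE", 20, "STATUTE MILE", "buffer_20_mile"
)
sql = within_query(customers, tornadoes)
print(sql)
```

The user-facing calls stay short while the generated string matches the SQL shown on the slide; only the final string would be sent to the database.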
Design & Architecture
- Spatial data warehouse hosted in the cloud
- Database driver to establish interoperability
- ibmdbpy-spatial running on the local machine
ibmdbpy-spatial
- pandas-like interface for IBM dashDB
- Compatible with Python 2.7 up to 3.5
- ODBC or JDBC connection (cross-platform)
How to install ibmdbpy-spatial?
- 'ibmdbpy' is available for download from PyPI:
    pip install ibmdbpy
- Windows: ODBC connection with dashDB (through pypyodbc)
    >>> from ibmdbpy.base import IdaDataBase
    >>> idadb = IdaDataBase("DASHDB")
- Linux: JDBC connection (through jaydebeapi)
    >>> jdbc = 'jdbc:db2://<HOST>:<PORT>/<DBNAME>:user=<UID>;password=<PWD>'
    >>> idadb = IdaDataBase(jdbc)
- Connect to IBM Bluemix and create a dashDB instance (30 day free trial): https://console.ng.bluemix.net/
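The JDBC connection string follows a fixed pattern, so it can be assembled from its parts. The helper below is a hypothetical convenience function, not part of ibmdbpy; it only reproduces the string layout shown on the slide, and the host, database name and credentials are made-up placeholder values.

```python
def build_jdbc_url(host, port, dbname, uid, pwd):
    """Fill the slide's pattern jdbc:db2://<HOST>:<PORT>/<DBNAME>:user=<UID>;password=<PWD>."""
    port = int(port)
    if not 0 < port < 65536:
        raise ValueError("port out of range: {}".format(port))
    return "jdbc:db2://{}:{}/{}:user={};password={}".format(
        host, port, dbname, uid, pwd
    )


# Placeholder values for illustration only.
jdbc = build_jdbc_url("db.example.com", 50000, "BLUDB", "alice", "s3cret")
print(jdbc)

# With a reachable database, the string would then be passed on as on the slide:
#   idadb = IdaDataBase(jdbc)
```

In practice the password should come from an environment variable or a prompt rather than being written into a notebook.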
In-database GeoDataFrame
An IdaGeoDataFrame instance is a pointer to a table in the database.
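What "a pointer to a table" means can be sketched with a small class that stores only a connection and a table name, and queries the database on demand. This is an assumption-laden illustration: `TablePointer` is a hypothetical class, sqlite3 stands in for dashDB, and none of this is the actual IdaGeoDataFrame code.

```python
import sqlite3


class TablePointer:
    """Hypothetical stand-in: holds a table name, never the rows themselves."""

    def __init__(self, conn, tablename):
        self._conn = conn
        self.tablename = tablename

    def count(self):
        """Row count computed by the database."""
        sql = "SELECT COUNT(*) FROM {}".format(self.tablename)
        return self._conn.execute(sql).fetchone()[0]

    def head(self, n=5):
        """Fetch only the first n rows into memory."""
        sql = "SELECT * FROM {} ORDER BY rowid LIMIT ?".format(self.tablename)
        return self._conn.execute(sql, (n,)).fetchall()


# Stand-in data: 100 rows in an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counties (objectid INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO counties VALUES (?, ?)",
    [(i, "county_{}".format(i)) for i in range(1, 101)],
)

counties = TablePointer(conn, "counties")
preview = counties.head(3)
print(counties.count(), preview)
```

Creating the object costs nothing; data moves only when a method such as `head` asks for it, and then only the requested rows.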
ibmdbpy-spatial: Spatial Functions
- Select the polygon representing 200 miles along a tornado trajectory.
- Find the geographic area of each county in square kilometers.
Results
Spatial analysis of crime hotspots in New York City, based on the 7 Major Felonies dataset.
IBM Bluemix console
Deployment
- Distribution via PyPI:
    pip install ibmdbpy
- Available on GitHub: https://github.com/ibmdbanalytics/ibmdbpy
- License: BSD
Conclusion
- ibmdbpy is an interface for in-database geospatial analysis.
- Relies on the database engine
- No data extraction required
- No additional dependencies on GDAL
- Intuitive: GeoPandas-like syntax
- Documentation: http://pythonhosted.org/ibmdbpy/geospatial.html