Efficiently Managing Large-Scale Raster Species Distribution Data in PostgreSQL
Jianting Zhang, Dept. of Computer Science, The City College of the City University of New York
Michael Gertz, Institute of Computer Science, University of Heidelberg
Le Gruenwald, School of Computer Science, The University of Oklahoma

Outline
Introduction
Background and Related Work
–Species Distribution Data
–Quadtree Indexing and Query Processing
The Proposed Solution
–Database Preparation
–Query Window Decomposition
–Query Optimization and Result Combination
Experiments and Evaluation
Conclusion and Future Work

Introduction
Species
–Taxonomic (Linnaean ranks): Kingdom, Phylum, Class, Order, Family, Genus, Species, SubSpecies
–Phylogenetic
Environment
–Area, Water-Energy, Latitude, Altitude, Productivity, Environmental Gradient
Community – Ecosystem – Biome – Biosphere
Phylogeography

Introduction
Linked Environment for Exploratory Analysis of Large-Scale Species Distribution Data (ACMGIS’08)
–Geographical View
–Taxonomic View
–Environmental View
–Ecoregion View

Introduction
[Diagram labels: Taxonomic; Geographical; Correlation; Distribution; Configuration; Environmental; Phylogenetic; Approximation; Distribution]

Background: Species Distribution Data
Museum Collections or Species Checklists
–Global Biodiversity Information Facility (GBIF) data portal (http://data.gbif.org/): 177,887,193 occurrence records from 294 data providers as of 07/25/2009
–Species 2000 Annual Checklist (www.sp2000.org): 1,160,711 species as of 2009
Species Range Maps
–NatureServe Birds of the Western Hemisphere (http://www.natureserve.org): 4,253 bird species, ~3 Gigabytes
–Little Tree Species (http://esp.cr.usgs.gov/data/atlas/little/): 679 tree species, 137 Megabytes
Compiled Species Databases
–WWF Wildfinder database (http://www.worldwildlife.org/science/data/wildfinder.cfm): 29,112 species, 4,815 genera, 445 families, 69 orders, 350,045 species-ecoregion records, ~80 Megabytes
–USDA Plant databases (http://plants.usda.gov/): 89,759 plant species in 3,141 US counties

Background: Species Distribution Data

Related Work
Quadtree indexing: one of the oldest and most extensively studied indexing and query processing approaches [Gaede and Günther 1998; Samet 2005]
–Research prototypes: QUILT (Shaffer et al 1990) and SAND (Esperanca and Samet 2002, Samet and Webber 2006); these lack full SQL support
–Window query in linear quadtree (Aboulnaga and Aref 2001): key values are stored as Morton codes for B-Tree indexing; implementation is non-trivial
–SP-GIST (Aref and Ilyas 2001, Eltabakh et al 2006): quadtree-based indexing of line segments in PostgreSQL
Commercial products (Oracle/SQL Server) [Kothuri et al 2002, Fang et al 2008]
–Quadtree-based indexing of polygonal data
–Filtering mechanism to facilitate querying spatial relationships at the polygon level
–Not directly accessible to application developers

Related Work
Pieces that contribute to the research
–Quadtree representation of geo-referenced data and linear quadtrees (Hunter et al 1979, Gargantini 1982, Samet et al 1983, Shaffer et al 1990, Esperanca and Samet 2002, Samet and Webber 2006)
–Efficient window query decomposition (Aref and Samet 1993, Aref and Samet 1997, Proietti 1999, Aboulnaga and Aref 2001, Tsai et al 2004)
–Microsoft SQL Server Spatial’s implementation of quadtree indexing based on path query (Fang et al 2008)
–LTREE: tree path indexing module in PostgreSQL
Open source products
–PostgreSQL: quadtree indexing for binary raster data is not available
–Rasdaman: supports storing and querying dense multi-dimensional real-valued arrays based on tiling (chunking) techniques

Overview of the Proposed Approach

Database Preparation
Polygons representing species distributions vary significantly in size and shape
Distributions of a large number of species overlap greatly (cross-layer query)
It is inefficient to create a separate quadtree for each polygon: quadtree paths will be duplicated
Associate a quadtree node with a set of species identifiers instead of a single species identifier
How to combine individual quadtrees for efficient query processing?
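The idea of attaching a set of species identifiers to a shared quadtree node can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each species' quadtree is given as a list of dot-separated node paths (a hypothetical input format), every identifier stays at the node where its own quadtree stores it, and the name `combine_quadtrees` and the sample data are not taken from the talk.

```python
from collections import defaultdict


def combine_quadtrees(species_nodes):
    """Merge per-species quadtree node lists into one path -> id-set mapping.

    species_nodes maps a species identifier to the list of quadtree node
    paths (dot-separated quadrant labels) that make up its distribution.
    """
    combined = defaultdict(set)
    for species_id, paths in species_nodes.items():
        for path in paths:
            # A path shared by several species is stored once,
            # with all of their identifiers attached to it.
            combined[path].add(species_id)
    return dict(combined)


# Hypothetical input: three species with overlapping distributions.
species_nodes = {
    "A": ["3", "1.2"],
    "B": ["3.2", "1.2"],
    "C": ["3.2.0", "3.2.2"],
}
combined = combine_quadtrees(species_nodes)
for path in sorted(combined):
    print(path, sorted(combined[path]))
```

Each distinct path appears once in the result, so a node covered by many species costs one entry plus its identifier set rather than one entry per species.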

Database Preparation Individual Quadtrees for Species Distributions

Database Preparation Classic Combination

Database Preparation Improved Combination

Database Preparation
Each quadtree node (leaf or non-leaf) and its associated species identifiers become a tuple in the PostgreSQL database
Index the table based on the paths of the quadtree nodes (offline)
Sample query:
–SELECT bk_id, sp_ids FROM TB WHERE bk_id <@ '3'
–Selects all the tuples whose paths are descendants of tree path '3'
–Results: A (16), B (4), C (4)
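To make the semantics of the sample query concrete, the following Python sketch mimics the descendant test on an in-memory list of rows. It is not how PostgreSQL evaluates the query; it only reproduces the matching rule of the ltree `<@` operator, which also accepts a path equal to the right-hand side. The rows and the helper name `is_descendant_or_self` are hypothetical.

```python
def is_descendant_or_self(path, ancestor):
    """True if `path` equals `ancestor` or lies below it in the tree."""
    p, a = path.split("."), ancestor.split(".")
    # Compare whole labels, so '30' is not treated as being under '3'.
    return p[:len(a)] == a


# Hypothetical table rows: (bk_id, sp_ids)
rows = [
    ("3", ["A"]),
    ("3.2", ["B"]),
    ("3.2.0", ["C"]),
    ("1.2", ["A", "B"]),
    ("30", ["D"]),
]

selected = [(bk_id, sp_ids) for bk_id, sp_ids in rows
            if is_descendant_or_self(bk_id, "3")]
print(selected)
```

Splitting on the dot before comparing is the important detail: a plain string-prefix test would wrongly accept the row with path `30`.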

Query Window Decomposition
Transform a spatial query window into tree paths to match against the database tuples (decomposition)
–Exact matches + searching for ancestors
–Exact matches + searching for descendants
Condition to match a query window cell C with a database tuple R:
(C.ID is an ancestor of R.path) or (C.ID is a descendant of R.path)
[Figure: cells A, B and C with tree paths 3, 3.2, 3.2.0 and 3.2.2]
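The matching condition can be illustrated with a small Python sketch that collects the species identifiers for one decomposed cell. It treats an exact match, an ancestor and a descendant alike, as the slide describes; the table contents and the function names `related` and `species_for_cell` are assumptions made for the example.

```python
def labels(path):
    """Split a dot-separated tree path into a tuple of labels."""
    return tuple(path.split("."))


def related(cell_id, tuple_path):
    """True if one path is an ancestor of the other, or both are equal."""
    c, r = labels(cell_id), labels(tuple_path)
    shorter = min(len(c), len(r))
    return c[:shorter] == r[:shorter]


def species_for_cell(cell_id, table):
    """Union of the identifiers stored on every tuple related to the cell."""
    found = set()
    for tuple_path, sp_ids in table:
        if related(cell_id, tuple_path):
            found.update(sp_ids)
    return found


# Hypothetical tuples: (path, species identifiers)
table = [
    ("3", {"A"}),
    ("3.2", {"B"}),
    ("3.2.0", {"C"}),
    ("3.2.2", {"C"}),
    ("3.1", {"D"}),
]
print(sorted(species_for_cell("3.2", table)))
```

For the cell `3.2`, the ancestor `3`, the exact match `3.2` and the descendants `3.2.0` and `3.2.2` all contribute, while the sibling `3.1` does not.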

Query Window Decomposition
Complexity analysis
–Decomposition algorithm: O(m) (Tsai et al 2004), where m is the larger of the width and height of a query window
–Converting the outputs of (Tsai et al 2004) to tree paths: O(l*d), where l is the number of decomposed cells and d is the depth of the quadtree for a raster tessellation
–l is proportional to one dimension of the query window, i.e., m (Tsai et al 2004)
–Overall complexity: O(m) + O(l*d) = O(m*d); since d is a relatively small number, this is effectively O(m)
We adopt the approach reported in (Tsai et al 2004) for its efficiency and ease of implementation
Example: m=8, d=3, l=13 (1 level-2 cell and 12 level-3 cells)
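The sketch below shows what a decomposition into tree paths produces. It is a plain top-down recursive decomposition, not the algorithm of Tsai et al 2004, and the quadrant numbering (0 to 3) is an assumption. The sample window is a hypothetical 4-by-4 window offset by one cell on an 8-by-8 grid (d=3), chosen because it yields 1 level-2 cell and 12 level-3 cells, the same counts as the example on the slide.

```python
def decompose(window, depth):
    """Decompose a cell-aligned window into maximal quadtree blocks.

    window: (x0, y0, x1, y1) in cell units, half-open, on a 2**depth grid.
    Returns quadtree paths as tuples of quadrant labels (0-3).
    """
    wx0, wy0, wx1, wy1 = window
    cells = []

    def visit(x, y, size, path):
        # Skip blocks that do not intersect the window.
        if x >= wx1 or y >= wy1 or x + size <= wx0 or y + size <= wy0:
            return
        # Emit blocks that lie entirely inside the window.
        if wx0 <= x and wy0 <= y and x + size <= wx1 and y + size <= wy1:
            cells.append(path)
            return
        half = size // 2
        # Assumed quadrant order: 0 and 1 in the lower-y half, 2 and 3 above.
        offsets = [(0, 0), (half, 0), (0, half), (half, half)]
        for quadrant, (dx, dy) in enumerate(offsets):
            visit(x + dx, y + dy, half, path + (quadrant,))

    visit(0, 0, 2 ** depth, ())
    return cells


cells = decompose((1, 1, 5, 5), depth=3)
for cell in cells:
    print(".".join(str(q) for q in cell))
print(len(cells), "cells")
```

Because the window is aligned to cell boundaries, a single-cell block is always either fully inside or fully outside the window, so the recursion never descends below the finest level.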

Query Optimization
For a large query window that does not align well with quadrant divisions, the decomposition can produce thousands of cells or more, most of them small cells that share the same ancestor nodes
As the queries are sent to the server independently, duplicate tuples may be returned and need to be removed when combining query results
How to minimize duplication while ensuring correctness?
Can we remove duplication and reduce combination to simple additions?
[Figure: cells A, B and C with tree paths 3, 3.2, 3.2.0 and 3.2.2]

Query Optimization
Least Common Ancestor (LCA)
–Retrieve all the tuples whose quadtree paths are the descendants (inclusive) of the nodes below
–Retrieve all the tuples whose quadtree paths match the cell identifiers exactly
–Retrieve all the tuples whose paths are the ancestors of the root identifier
[Figure: diagram annotated with the three retrieval rules above; numeric labels omitted]
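One possible reading of the three retrieval rules is sketched below in Python, since the slide's figure did not survive. It assumes the decomposed cells are disjoint (no cell is an ancestor of another), takes the LCA to be the longest common prefix of the cell paths, and uses a small hypothetical database. Ancestors shared by many cells are fetched once by exact match, and only the cells themselves are queried for descendants, so the three result sets never overlap. A baseline that queries each cell independently is included for comparison.

```python
def common_prefix(paths):
    """Least common ancestor of label tuples: their longest common prefix."""
    first = paths[0]
    n = len(first)
    for p in paths[1:]:
        n = min(n, len(p))
        for i in range(n):
            if p[i] != first[i]:
                n = i
                break
    return first[:n]


def optimized(db, cells):
    """Three non-overlapping retrievals; results can simply be concatenated."""
    lca = common_prefix(cells)
    proper = {c[:k] for c in cells for k in range(len(c))}  # ancestors of cells
    above = {p for p in proper if len(p) <= len(lca)}       # LCA and its ancestors
    between = proper - above                                # nodes under the LCA
    hits = [p for p in db if p in above]
    hits += [p for p in db if p in between]
    for c in cells:
        hits += [p for p in db if p[:len(c)] == c]          # descendants, inclusive
    return hits


def baseline(db, cells):
    """Independent per-cell queries; shared ancestors come back repeatedly."""
    hits = []
    for c in cells:
        for p in db:
            k = min(len(p), len(c))
            if p[:k] == c[:k]:
                hits.append(p)
    return hits


# Hypothetical database paths and decomposed cells.
db = [(3,), (3, 2), (3, 2, 0), (3, 2, 2), (3, 1), (1,)]
cells = [(3, 2, 0), (3, 2, 2), (3, 3)]
opt, base = optimized(db, cells), baseline(db, cells)
print(len(base), "baseline rows,", len(opt), "optimized rows")
```

In this toy case the baseline returns the tuple for path `3` three times and the tuple for `3.2` twice, whereas the grouped retrieval returns every matching tuple exactly once.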

Proposed Approach: Discussions
Summary
–Utilizes existing database storage and indexing functions: no need to define new data types, develop new indexing approaches, modify query syntax or revise database query engines
–User queries are transformed into formats supported by the existing database backend, and the results are combined in the middleware to answer user queries effectively and efficiently
Advantages
–Use SQL query syntax instead of being forced to use APIs
–Use a variety of databases (as long as they support efficient path queries)
–The underlying database systems are left untouched
  –Reduces technical complexity
  –Does not depend on the availability of source code

Experiment Setups
–Dell Precision T5400 workstation, PostgreSQL 8.3.5
–Species distribution data
  –NatureServe: http://www.natureserve.org/getData/animalData.jsp
  –Mammals (1693 species), birds (4148 species) and amphibians (5816 species)
–Quadtree
  –Western Hemisphere, i.e., (-180, -90, 0, 90)
  –Depth=14 (2^14 = 16384)
  –Spatial resolution is finer than 1 arc minute (180*60 = 10800)
–Four query window sizes: 0.1, 0.5, 1 and 5 degrees
–For each query window size: 100 queries with random centers
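The quadtree setup above can be checked with a short Python sketch that maps a longitude/latitude point to a depth-14 tree path over the extent (-180, -90, 0, 90) and computes the resulting cell size. The quadrant numbering and the function name `point_to_path` are assumptions for illustration; the slides do not specify how quadrants are labelled.

```python
def point_to_path(lon, lat, extent=(-180.0, -90.0, 0.0, 90.0), depth=14):
    """Return the dot-separated quadtree path of the cell containing a point."""
    x0, y0, x1, y1 = extent
    labels = []
    for _ in range(depth):
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        quadrant = 0
        # Assumed labelling: +1 for the eastern half, +2 for the northern half.
        if lon >= xm:
            quadrant += 1
            x0 = xm
        else:
            x1 = xm
        if lat >= ym:
            quadrant += 2
            y0 = ym
        else:
            y1 = ym
        labels.append(str(quadrant))
    return ".".join(labels)


cells_per_side = 2 ** 14
cell_arcmin = 180 / cells_per_side * 60  # cell edge length in arc minutes
print(cells_per_side, round(cell_arcmin, 3))
print(point_to_path(-73.95, 40.82))
```

With 16384 cells across 180 degrees, each cell edge is about two thirds of an arc minute, consistent with the stated resolution of finer than 1 arc minute.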

Experiments on Database Preparation
Rasterized bird species distributions (finest resolution)
–46,139,247 cells
–1,318,136,140 pairs of (cell, identifier) combinations
–28.7 species per cell
# of quadtree nodes (database tuples)
–Classic combination: 7,511,823 leaf nodes
–Proposed combination: 4,957,050 leaf and non-leaf nodes
–Proposed combination: 1:9.3 compression ratio
# of species identifiers
–Classic combination: 831,903,250 (110.7 per node)
–Proposed combination: 23,865,343 (4.8 per node)
–34.9 times fewer in total # of species identifiers
–23 times fewer in average # of identifiers per node
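The ratios on this slide can be recomputed from the reported counts with the short Python sketch below. Reading the 1:9.3 compression ratio as raster cells per database tuple in the proposed combination is an assumption, although it matches the numbers; the variable names are illustrative.

```python
# Counts as reported on the slide.
cells = 46_139_247
classic_nodes = 7_511_823
proposed_nodes = 4_957_050
classic_ids = 831_903_250
proposed_ids = 23_865_343

# Assumed meaning of the compression ratio: cells per stored tuple.
cells_per_node = cells / proposed_nodes
classic_ids_per_node = classic_ids / classic_nodes
proposed_ids_per_node = proposed_ids / proposed_nodes
id_reduction = classic_ids / proposed_ids
per_node_reduction = classic_ids_per_node / proposed_ids_per_node

print(f"cells per tuple:          {cells_per_node:.1f}")
print(f"ids per node (classic):   {classic_ids_per_node:.1f}")
print(f"ids per node (proposed):  {proposed_ids_per_node:.1f}")
print(f"total id reduction:       {id_reduction:.1f}x")
print(f"per-node id reduction:    {per_node_reduction:.1f}x")
```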

Experiments on Query Processing
[Figure: axes "# of species" and "Query response time (ms)"; Baseline (lt) and Optimized (op) approaches]

Experiments on Query Processing
Average and Maximum Response Times under Four Query Windows for the Baseline and Optimized Approaches (in seconds)

Window Size   Baseline AVG   Baseline MAX   Optimized AVG   Optimized MAX
0.1           0.14           0.36           0.03            0.06
0.5           0.93           2.39           0.12            0.25
1             1.63           3.93           0.20            0.50
5             9.68           21.13          1.16            3.38

Conclusions and Future Work
This research tackles the problem of storing, indexing and querying large-scale species distribution data in the form of binary rasters in a database environment
The approach does not require any modifications to the database backend and is applicable to many database systems that support tree path matching
A middleware approach has been adopted by utilizing existing PostgreSQL database support for tree paths and by transforming spatial window queries into tree path matching
An end-to-end system to manage large-scale species distribution datasets has been developed, with demonstrated efficiency based on 4000+ bird species distribution data in the Western Hemisphere

Conclusions and Future Work
Future work
–Further extend the solution to manage even larger species distribution datasets. The ultimate goal is to support all known species at the million scale and reduce the average query response time to below one second for realistic query window sizes, to support interactive customer applications
–Possible strategies
  –New efficient data structures and algorithms (e.g. column store, bitmap)
  –Combining pre-computation and on-demand processing (query window decomposition)
  –Using main-memory database techniques (indexing/querying)
  –Using GPU for faster query processing (Fermi, CUDA, Nexus)
  –Using Map-Reduce/Hadoop for parallel/distributed query processing (HadoopDB)

Latest Progress
Main-memory database and query processing algorithm
–Memory consumption: ~200 MB for 4000+ bird species
–Query response time: ~1/4 second for query window sizes as large as 5 by 5 degrees
–Can be used as a cache system for extremely large-scale species distribution data
Web-based tool: http://geoteci.engr.ccny.cuny.edu/geoteci/SPTestMap.html
