Efficiently Managing Large-Scale Raster Species Distribution Data in PostgreSQL Jianting Zhang, Dept. of Computer Science The City College of the City.


Efficiently Managing Large-Scale Raster Species Distribution Data in PostgreSQL Jianting Zhang, Dept. of Computer Science The City College of the City University of New York Michael Gertz, Institute of Computer Science University of Heidelberg Le Gruenwald, School of Computer Science The University of Oklahoma

Outline Introduction Background and Related Work Species Distribution Data Quadtree Indexing and Query Processing The Proposed Solution Database Preparation Query Window Decomposition Query Optimization and Result Combination Experiments and Evaluation Conclusion and Future Work

Introduction Environment Species Taxonomic (Linnaean ranks) Kingdom Phylum Class Order Family Genus Species Subspecies Phylogenetic Area Water-Energy Latitude Altitude Productivity Environmental Gradient Community – Ecosystem – Biome – Biosphere Phylogeography

Introduction Geographical View Taxonomic View Environmental View Ecoregion View Linked Environment for Exploratory Analysis of Large-Scale Species Distribution Data (ACM GIS’08)

Introduction Taxonomic Geographical Correlation Distribution Configuration Environmental Phylogenetic Approximation Distribution

Background: Species Distribution Data Museum collections or species checklists: Global Biodiversity Information Facility (GBIF) data portal, 177,887,193 occurrence records from 294 data providers as of 07/25/2009; Species 2000 Annual Checklist, 1,160,711 species as of 2009. Species range maps: NatureServe Birds of the Western Hemisphere, bird species, ~3 Gigabytes; Little Tree Species, tree species, 137 Megabytes. Compiled species databases: WWF Wildfinder database, /data/wildfinder.cfm species, 4815 genera, 445 families, 69 orders, species-ecoregion records, ~80 Megabytes; USDA Plant databases, plant species in 3141 US counties

Background: Species Distribution Data

Related Work Quadtree indexing: one of the oldest and most extensively studied indexing and query processing approaches [Gaede and Günther 1998; Samet 2005] –Research prototypes: QUILT (Shaffer et al 1990) and SAND (Esperanca and Samet 2002, Samet and Webber 2006), which lack full SQL support –Window query in linear quadtree (Aboulnaga and Aref 2001): key values are stored as Morton codes for B-Tree indexing; implementation is non-trivial –SP-GiST (Aref and Ilyas 2001, Eltabakh et al 2006): quadtree-based indexing of line segments in PostgreSQL Commercial products (Oracle/SQL Server) [Kothuri et al 2002, Fang et al 2008]: quadtree-based indexing of polygonal data; filtering mechanism to facilitate querying spatial relationships at the polygon level; not directly accessible to application developers

Related Work Pieces that contribute to the research –Quadtree representation of geo-referenced data and linear quadtrees (Hunter et al 1979, Gargantini 1982, Samet et al 1983, Shaffer et al 1990, Esperanca and Samet 2002, Samet and Webber 2006) –Efficient window query decomposition (Aref and Samet 1993, Aref and Samet 1997, Proietti 1999, Aboulnaga and Aref 2001, Tsai et al 2004) –Microsoft SQL Server Spatial’s implementation of quadtree indexing based on path query (Fang et al 2008) –LTREE tree path indexing module in PostgreSQL Open source products –PostgreSQL: quadtree indexing for binary raster data is not available –Rasdaman: supports storing and querying dense multi-dimensional real-valued arrays based on tiling (chunking) techniques

Overview of the Proposed Approach

Database Preparation Polygons representing species distributions vary significantly in size and shape. Distributions of a large number of species overlap greatly (cross-layer query). It is inefficient to create a separate quadtree for each polygon because quadtree paths will be duplicated. Associate a quadtree node with a set of species identifiers instead of a single species identifier. How to combine individual quadtrees for efficient query processing?

Database Preparation Individual Quadtrees for Species Distributions

Database Preparation Classic Combination

Database Preparation Improved Combination

Database Preparation Each quadtree node (leaf or non-leaf) and its associated species identifiers become a tuple in the PostgreSQL database. Index the table based on the paths of the quadtree nodes (offline). Sample query: –SELECT bk_id, sp_ids FROM TB WHERE bk_id <@ '3' –Selects all the tuples whose paths are descendants of tree path '3' –Results: A (16), B (4), C (4)
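The tuple layout and the descendant query can be illustrated with a small in-memory sketch. It assumes, as one plausible reading of the improved combination, that each species identifier is attached to the node where that species' own quadtree has a leaf. The species data, the dot-separated quadrant labels (0-3) and all function names are hypothetical; the prefix test only mimics the descendant-or-equal semantics of the ltree operator `<@` and does not talk to PostgreSQL.

```python
from collections import defaultdict

# Hypothetical input: for each species id, the leaf paths of its own quadtree.
# Paths use ltree-style dot-separated quadrant labels (an assumed encoding).
species_leaves = {
    "A": ["3.0", "3.1.2"],
    "B": ["3.0", "1"],
    "C": ["3.1.2", "3.1.3"],
}

def combine(species_leaves):
    """Attach each species id to every node where its own quadtree has a leaf."""
    table = defaultdict(set)
    for sp_id, paths in species_leaves.items():
        for path in paths:
            table[path].add(sp_id)
    return dict(table)

def is_descendant(path, ancestor):
    """Mimic ltree `path <@ ancestor`: descendant or equal, compared label by label."""
    p, a = path.split("."), ancestor.split(".")
    return p[:len(a)] == a

def select_descendants(table, ancestor):
    """Stand-in for: SELECT bk_id, sp_ids FROM TB WHERE bk_id <@ ancestor."""
    return {bk: ids for bk, ids in table.items() if is_descendant(bk, ancestor)}

table = combine(species_leaves)   # one (bk_id, sp_ids) tuple per quadtree node
rows = select_descendants(table, "3")
```

Comparing whole labels rather than raw string prefixes matters: a node labelled `31` is not a descendant of `3`, while `3.1` is.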

Query Window Decomposition Transform a spatial query window into tree paths to match with the database tuples (decomposition). Exact matches + searching for ancestors. Exact matches + searching for descendants. Condition to match a query window cell C with a database tuple R: (C.ID is an ancestor of R.path) or (C.ID is a descendant of R.path)
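The matching condition can be written as a small predicate. This is a sketch only: it assumes dot-separated quadrant labels for both the cell identifier and the tuple path, and it treats an exact match as satisfying both the ancestor and the descendant test, in line with the "exact matches + ..." wording above. The function names are illustrative.

```python
def labels(path):
    """Split a dot-separated tree path into its quadrant labels."""
    return path.split(".")

def is_ancestor_or_equal(a, b):
    """True if path a lies on the route from the root to path b (or equals b)."""
    la, lb = labels(a), labels(b)
    return lb[:len(la)] == la

def matches(cell_id, tuple_path):
    """Cell C matches tuple R if C is an ancestor of R or a descendant of R."""
    return (is_ancestor_or_equal(cell_id, tuple_path)
            or is_ancestor_or_equal(tuple_path, cell_id))
```

A tuple stored at a coarse node matches every finer query cell inside it, and a coarse query cell matches every finer tuple inside it; sibling subtrees never match.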

Query Window Decomposition Complexity analysis –Decomposition algorithm: O(m) (Tsai et al 2004), where m is the larger of the width and height of a query window –Converting the outputs of (Tsai et al 2004) to tree paths: O(l*d), where l is the number of decomposed cells and d is the depth of the quadtree for a raster tessellation; l is proportional to one dimension of the query window, i.e., m (Tsai et al 2004) –The overall complexity is O(m) + O(l*d) = O(m*d); since d is a relatively small number, this is effectively O(m) –We adopt the approach reported in (Tsai et al 2004) for its efficiency and ease of implementation Example: m=8, d=3, l=13 (1 level-2 cell and 12 level-3 cells)
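To make the decomposition output concrete, the sketch below splits a window into maximal quadtree cells with a plain top-down recursion and emits a tree path for each cell. It is not the O(m) algorithm of Tsai et al 2004; it is a simpler stand-in that produces the same kind of output. The quadrant numbering, the half-open integer window and the function name are assumptions made for illustration.

```python
def decompose(window, depth):
    """Split a half-open window (x0, y0, x1, y1), given in finest-level cell
    units, into maximal quadtree cells. Returns a list of (level, path)."""
    x0, y0, x1, y1 = window
    cells = []

    def visit(cx, cy, size, labels):
        # Skip quadrants that do not intersect the window.
        if cx >= x1 or cy >= y1 or cx + size <= x0 or cy + size <= y0:
            return
        # Emit quadrants that lie entirely inside the window.
        if x0 <= cx and y0 <= cy and cx + size <= x1 and cy + size <= y1:
            cells.append((len(labels), ".".join(labels)))
            return
        # Otherwise recurse into the four children (assumed numbering 0-3).
        half = size // 2
        for q, (dx, dy) in enumerate([(0, 0), (1, 0), (0, 1), (1, 1)]):
            visit(cx + dx * half, cy + dy * half, half, labels + [str(q)])

    visit(0, 0, 2 ** depth, [])
    return cells

# Hypothetical 4x4 window, offset by one cell, in an 8x8 grid (d = 3).
cells = decompose((1, 1, 5, 5), 3)
levels = [level for level, _ in cells]
```

For this particular window the recursion yields 13 cells, one at level 2 and twelve at level 3, which is the same split as the l=13 example quoted on the slide; the window itself is an invented one, not necessarily the one in the original figure.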

Query Optimization For a large query window that does not align well with quadrant divisions, the number of decomposed cells can reach thousands or more, and most of them will be small cells sharing the same ancestor nodes. As the queries are sent to the server independently, duplicated tuples may be returned and need to be removed when combining query results. How to minimize duplication while ensuring correctness? Can we remove duplication and reduce combination to simple additions?

Query Optimization Least Common Ancestor (LCA) –Retrieve all the tuples whose quadtree paths are descendants (inclusive) of the nodes below –Retrieve all the tuples whose quadtree paths match the cell identifiers exactly –Retrieve all the tuples whose paths are ancestors of the root identifier
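One way to read the three retrievals is sketched below, under stated assumptions: the LCA is taken as the longest common label prefix of the decomposed cells; strict ancestors of the LCA are fetched once; nodes from the LCA down to (but excluding) each cell are fetched by exact match; and each cell contributes its descendants, itself included. The table, the cells and the function names are hypothetical, and the list comprehensions stand in for path queries against the database.

```python
def split(path):
    """Turn a dot-separated path into a list of quadrant labels."""
    return path.split(".") if path else []

def is_prefix(a, b):
    """True if label list a is an ancestor of, or equal to, label list b."""
    return b[:len(a)] == a

def lca(cells):
    """Least common ancestor = longest common label prefix of all cells."""
    prefix = []
    for column in zip(*(split(c) for c in cells)):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix

def baseline(table, cells):
    """One independent ancestor-or-descendant query per decomposed cell."""
    fetched = []
    for cell in cells:
        c = split(cell)
        fetched += [p for p in table
                    if is_prefix(split(p), c) or is_prefix(c, split(p))]
    return fetched

def optimized(table, cells):
    """Three disjoint retrievals organised around the LCA of the cells."""
    root = lca(cells)
    cell_labels = [split(c) for c in cells]
    # Nodes from the LCA down to (but excluding) each decomposed cell.
    internal = {tuple(c[:n]) for c in cell_labels
                for n in range(len(root), len(c))}
    ancestors = [p for p in table
                 if len(split(p)) < len(root) and is_prefix(split(p), root)]
    exact = [p for p in table if tuple(split(p)) in internal]
    descendants = [p for p in table
                   if any(is_prefix(c, split(p)) for c in cell_labels)]
    return ancestors + exact + descendants

# Hypothetical stored node paths and decomposed query cells.
table = ["1", "2", "2.0", "2.0.1", "2.0.2", "2.0.3.1", "2.0.0.3", "2.0.1.2"]
cells = ["2.0.3", "2.0.0.3", "2.0.1.2", "2.0.1.3"]
base = baseline(table, cells)
opt = optimized(table, cells)
```

In this toy case the per-cell baseline fetches 13 rows for 6 distinct tuples, because the shared ancestors come back once per cell. The LCA-based variant fetches each of the 6 tuples exactly once, so the partial results can simply be concatenated.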

Proposed Approach: Discussion Summary –Utilizes existing database storage and indexing functions: no need to define new data types, develop new indexing approaches, modify query syntax or revise database query engines –User queries are transformed into formats that are supported by the existing database backend, and the results are combined in the middleware to answer users’ queries effectively and efficiently Advantages: –Use SQL query syntax instead of being forced to use APIs –Use a variety of databases (as long as they support efficient path queries) –The underlying database systems are left untouched: this reduces technical complexity and does not depend on the availability of source code

Experiment Setups –Dell Precision T5400 workstation/PostgreSQL –Species distribution data from NatureServe: mammals (1693 species), birds (4148 species) and amphibians (5816 species) –Quadtree: Western Hemisphere, i.e., (-180, -90, 0, 90); depth = 14 (2^14 = 16384); spatial resolution is finer than 1 arc minute (180*60 = 10800) –Four query window sizes: 0.1, 0.5, 1 and 5 degrees –For each query window size: 100 queries with random centers
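The resolution claim follows from a short calculation, sketched here with illustrative variable names: a depth-14 quadtree divides each 180-degree side of the extent into 2^14 cells.

```python
depth = 14
cells_per_side = 2 ** depth            # number of finest-level cells along one side
extent_arcmin = 180 * 60               # the extent spans 180 degrees in each direction
cell_arcmin = extent_arcmin / cells_per_side   # side length of one cell in arc minutes
```

Each finest-level cell is therefore about 0.66 arc minutes on a side, which is why a depth of 14 is enough to be finer than 1 arc minute (a depth of 13 would give about 1.32).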

Experiments on Database Preparation Rasterized bird species distributions (finest resolution) –46,139,247 cells –1,318,136,140 pairs of (cell, identifier) combinations –28.7 species per cell # of quadtree nodes (database tuples) –Classic combination: 7,511,823 leaf nodes –Proposed combination: 4,957,050 leaf and non-leaf nodes –Proposed combination: 1:9.3 compression ratio # of species identifiers –Classic combination: 831,903,250 (110.7 per node) –Proposed combination: 23,865,343 (4.8 per node) –34.9 times fewer with respect to the total # of species identifiers –23 times fewer with respect to the average # of identifiers per node
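The derived ratios can be reproduced from the raw counts. The snippet below is only an arithmetic check with illustrative variable names; reading the 1:9.3 compression ratio as raster cells per stored node is an inference from the numbers, not something the slide states.

```python
cells = 46_139_247            # rasterized cells at the finest resolution
classic_nodes = 7_511_823     # classic combination: leaf nodes
proposed_nodes = 4_957_050    # proposed combination: leaf and non-leaf nodes
classic_ids = 831_903_250     # species identifiers stored, classic
proposed_ids = 23_865_343     # species identifiers stored, proposed

compression = cells / proposed_nodes                 # cells per stored node
classic_per_node = classic_ids / classic_nodes       # identifiers per node, classic
proposed_per_node = proposed_ids / proposed_nodes    # identifiers per node, proposed
id_reduction = classic_ids / proposed_ids            # total identifier reduction
per_node_reduction = classic_per_node / proposed_per_node
```

Rounded, these come out at 9.3, 110.7, 4.8, 34.9 and 23 respectively.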

Experiments on Query Processing [Figure: query response time (ms) versus # of species for the Baseline (lt) and Optimized (op) approaches]

Experiments on Query Processing [Table: Average and Maximum Response Times under Four Query Windows for the Three Approaches (in seconds); columns: Window Size, Baseline Query (AVG, MAX), Optimized Query (AVG, MAX)]

Conclusions and Future Work This research tackles the problem of storing, indexing and querying large-scale species distribution data in the form of binary rasters in a database environment. The approach does not require any modifications to the database backend and is applicable to many database systems that support tree path matching. A middleware approach has been adopted that utilizes existing PostgreSQL support for tree paths and transforms spatial window queries into tree path matching. An end-to-end system to manage large-scale species distribution datasets has been developed, with efficiency demonstrated on bird species distribution data in the Western Hemisphere.

Conclusions and Future Work Future work –Further extend the solution to manage even larger-scale species distribution datasets. The ultimate goal is to support all known species at the million scale and reduce the average query response time to below one second for realistic query window sizes to support interactive customer applications –Possible strategies: new efficient data structures and algorithms (e.g. column store, bitmap); combining pre-computation and on-demand processing (query window decomposition); using main-memory database techniques (indexing/querying); using GPUs for faster query processing (Fermi, CUDA, Nexus); using Map-Reduce/Hadoop for parallel/distributed query processing (HadoopDB)

Latest Progress Main-memory database and query processing algorithm –Memory consumption: ~200M for bird species –Query response time: ~1/4 second for query window sizes as large as 5 by 5 degrees –Can be used as a cache system for extremely large-scale species distribution data Web-based tool: