Tony Rees Divisional Data Centre CSIRO Marine Research, Australia Application of c-squares spatial indexing to an archive of remotely.

Tony Rees Divisional Data Centre CSIRO Marine Research, Australia Application of c-squares spatial indexing to an archive of remotely sensed data

The starting point …
- Archive of c. 60,000 satellite scenes, current and accumulating at another 15+ per day
- Region covered varies for each scene
- There is often a requirement to retrieve scenes that include a specific region or place
- Low-res GIF images (like the examples shown) are available for the last 24 months only
- Polygons can be calculated for each scene, but are not currently stored in any spatially searchable system

Options for spatial indexing - 1
Bounding rectangles
- Mostly simple to generate/store/query
- Frequently a poor fit to the actual dataset extent
- Potentially become massively over-representative as data approaches the pole
- Do not cope well with oblique or curved boundaries to the dataset footprint (common in satellite images)
- Poor discrimination for fine-scale queries (many “false positives” returned)
- Can be problems with footprints which cross the date line, or include a pole
[Figure: scene footprint and its bounding rectangle (follows lat, long grid lines)]

Options for spatial indexing - 2
Bounding polygons
- OK to generate/store
- Best fit to the actual dataset extent
- Need additional points as data approaches the pole (regions of high curvature)
- Computationally expensive to query (often use bounding rectangles as well, as pre-filter)
- Tend to require a dedicated GIS system to handle the spatial queries
[Figure: bounding polygon with inflection points]

Options for spatial indexing - 3
Raster (tiled) approach
- Simple to generate/store (as list of tiles intersected)
- Reasonable fit to the actual dataset extent; improves with decreasing unit square size
- Automatically increases resolution as data approaches the pole (regions of high curvature)
- Simple/rapid to query, no computation or dedicated GIS required
- No ambiguity with footprints which extend across date line, or include a pole
[Figure: boundary of all tiles for this scene]

C-squares basics
- Based on a tiled (gridded) representation of the earth’s surface, at choice of 10 x 10, 5 x 5, 1 x 1, 0.5 x 0.5 degree tiles, etc. (0.5 x 0.5 degree tiles are used in this example)
- Every tile has a unique, hierarchical alphanumeric ID (every child tile ID incorporates the IDs of its parent, grandparent, etc.)
- Dataset (= scene) extents are represented by a list of all the tiles included in, or intersected by, the footprint
- Spatial search comprises looking for one or more tile IDs in the set associated with any dataset (= simple text search)
(more details – see )
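A minimal sketch of how a hierarchical tile ID could be derived from a latitude/longitude pair. The digit layout (global quadrant digit 1/3/5/7, tens of latitude, tens of longitude, then an intermediate quadrant digit 1-4 followed by the units digits) is an assumption based on the general c-squares notation and on the example code 3414:100:1 quoted later in these slides; it is not taken from the conversion code described here, and the function name `csquare` is hypothetical.

```python
def csquare(lat, lon, resolution=0.5):
    """Return a c-squares-style code for a point (assumed digit layout, see lead-in)."""
    if resolution not in (10, 5, 1, 0.5):
        raise ValueError("resolution must be 10, 5, 1 or 0.5 degrees")

    # Global quadrant digit: 1 = NE, 3 = SE, 5 = SW, 7 = NW (assumed convention).
    if lat >= 0:
        quad = 1 if lon >= 0 else 7
    else:
        quad = 3 if lon >= 0 else 5

    # Clamp so that lat = 90 or lon = 180 falls into the last tile instead of a new one.
    alat = min(abs(lat), 89.999999)
    alon = min(abs(lon), 179.999999)
    ilat, ilon = int(alat), int(alon)

    # 10-degree square: quadrant + tens of latitude + tens of longitude (2 digits).
    code = f"{quad}{ilat // 10}{ilon // 10:02d}"
    if resolution == 10:
        return code

    # Units digits within the 10-degree square, and the 5-degree quadrant they fall in.
    lat1, lon1 = ilat % 10, ilon % 10
    inter = 2 * (lat1 // 5) + (lon1 // 5) + 1
    if resolution == 5:
        return f"{code}:{inter}"

    code = f"{code}:{inter}{lat1}{lon1}"
    if resolution == 1:
        return code

    # 0.5-degree quadrant within the 1-degree square.
    hlat = 1 if (alat - ilat) >= 0.5 else 0
    hlon = 1 if (alon - ilon) >= 0.5 else 0
    return f"{code}:{2 * hlat + hlon + 1}"
```

Because each code begins with the code of its parent square, a coarser query can reuse the same index by matching on a prefix.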

Spatial queries supported
- Retrieve all scenes which include all or part of a given tile (at 10 x 10, 5 x 5, 1 x 1, or 0.5 x 0.5 degrees), or set of tiles
  – optionally, filter also by other criteria e.g. date, satellite, etc.
- Retrieve all tiles associated with a given scene, with option to:
  – draw representation on a map (or range of maps)
  – export the list to a file or another database
  – compare the list with other lists (~ = polygon overlays)
NB, choice of initial tile size is important:
  - too few tiles, only coarse spatial queries supported;
  - too many tiles, indexes get very large (and queries potentially slow)
However, compression of blocks of contiguous tiles can be quite effective (1 code can replace 4, 25, 100, or 400 codes in certain cases).
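The compression idea can be sketched for a single level: when all four 0.5 degree children of a 1 degree square are present, they are replaced by one wildcard code. The `*` wildcard notation follows the search SQL shown later in these slides; the function name and the restriction to one level are simplifying assumptions, and this is not the original PL/SQL routine.

```python
from collections import defaultdict


def compress_one_level(codes):
    """Replace complete sets of four 0.5-degree codes by '<parent>:*'."""
    children_by_parent = defaultdict(set)
    for code in codes:
        # '3414:100:1' -> parent '3414:100', child '1'
        parent, _, child = code.rpartition(":")
        children_by_parent[parent].add(child)

    compressed = []
    for parent, children in children_by_parent.items():
        if children == {"1", "2", "3", "4"}:
            compressed.append(parent + ":*")  # 1 code replaces 4
        else:
            compressed.extend(f"{parent}:{c}" for c in children)
    return sorted(compressed)


footprint = ["3414:100:1", "3414:100:2", "3414:100:3", "3414:100:4", "3414:101:1"]
print(compress_one_level(footprint))
```

Repeating the same test one level up (all 25 one-degree squares of a 5 degree square, then all four 5 degree squares of a 10 degree square) would give the 25-, 100- and 400-fold replacements mentioned above.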

A real-world example
Scene: NOAA Jun :23
- 10 x 10 degree squares: 28 (base level of hierarchy, cannot compress)
- 5 x 5 degree squares: 99 = 36 after compression
- 1 x 1 degree squares: 1,982 = 515 after compression
- 0.5 x 0.5 degree squares: 7,691 = 704 after compression
[Figure: 0.5 x 0.5 degree squares - detail]

Practicalities for satellite data
- A single scene may measure (say) 40 x 50 degrees approx., = 2,000 1 x 1 degree squares, or 8,000 0.5 x 0.5 degree squares
- Quadtree-like compression reduces this to (e.g.) 500 codes at 1 x 1 degree resolution, 700 codes at 0.5 x 0.5 degree resolution
- Still require quite a lot of codes (e.g. 42 million) to represent a collection of 60,000 scenes
- Each code is 10 characters long, scene IDs are (say) 6 characters long, thus c. 670 million bytes required for raw index data (before compiling secondary indexes, etc.)
- With secondary indexing, probably need around 2 GB (+) to hold the spatial index
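The storage estimate above can be reproduced with a few lines of arithmetic. All input figures are the ones quoted on this slide; the variable names are illustrative only.

```python
scenes = 60_000          # scenes in the archive
codes_per_scene = 700    # compressed codes per scene at 0.5 x 0.5 degree resolution
code_bytes = 10          # characters per c-square code
scene_id_bytes = 6       # characters per scene ID (the slide's "say 6")

total_codes = scenes * codes_per_scene                    # 42,000,000 scene/code pairs
raw_bytes = total_codes * (code_bytes + scene_id_bytes)   # 672,000,000 bytes

print(f"{total_codes:,} pairs, {raw_bytes:,} bytes of raw index data")
```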

Designing the index
- Option 1: a table of scenes, with string of codes comprising each scene footprint, 1 row per scene
  – comment: fast to retrieve all codes for a scene, slow for a spatial query. List of codes may present storage problems (e.g. in database field/s) owing to length.
- Option 2: a table of c-square codes, with list of scenes containing any code, 1 row per code (= an “inverted index”)
  – comment: fast for a spatial search, slow to retrieve all codes for a scene. List of scenes may present storage problems (e.g. in database field/s) owing to length. Harder to manage with respect to scene addition/deletion etc.
(continues…)

Designing the index (cont’d)
- Option 3: a table of scene/c-square pairs, 1 row per pair, indexed on both columns
  – comment: OK to retrieve all codes for a scene, also for a spatial query. Row length is always constant, new rows simply created as required. However, table potentially becomes very long (e.g. 40 million rows), may become slow to query …
- Option 4: as option (3), however single table replaced by multiple smaller tables, e.g. split by blocks of c-squares, blocks of scenes, or year, or satellite
  – comment: will tend to be “tuned” to favour queries which can be completed by accessing the smallest number of tables. Could overcome this by adding (say) an element of Option 1 in parallel (with the added cost of storing the same information in more than one place).
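A minimal sketch of Option 3, using an in-memory SQLite database as a stand-in for the Oracle tables. The table name `scene_csq`, the index names and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per scene/c-square pair, with an index on each column (Option 3).
conn.execute("create table scene_csq (scene_id text not null, csquare text not null)")
conn.execute("create index idx_scene_csq_scene on scene_csq (scene_id)")
conn.execute("create index idx_scene_csq_code on scene_csq (csquare)")

# Illustrative rows only; not real archive content.
conn.executemany(
    "insert into scene_csq (scene_id, csquare) values (?, ?)",
    [("h1awj", "3414:100:1"), ("h1awj", "3414:100:2"), ("h1awl", "3414:100:1")],
)


def scenes_for_square(csquare):
    """Spatial query: which scenes include this tile?"""
    cur = conn.execute(
        "select distinct scene_id from scene_csq where csquare = ? order by scene_id",
        (csquare,),
    )
    return [row[0] for row in cur]


def squares_for_scene(scene_id):
    """Footprint retrieval: which tiles belong to this scene?"""
    cur = conn.execute(
        "select csquare from scene_csq where scene_id = ? order by csquare",
        (scene_id,),
    )
    return [row[0] for row in cur]
```

Both lookups are served by an index, which is the trade-off the slide describes: neither direction is optimal, but both are acceptable and rows have a constant length.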

Present philosophy …
- Currently going with Option 3: a table of scene/c-square pairs, 1 row per pair, indexed on both columns.
- However, will consider splitting this along spatial or date lines if performance degrades excessively as the index grows (= Option 4); may then also add Option 1 in parallel, for rapid footprint retrieval if needed.

Actual table structure – 2 “search” tables only
- Scene details (1 row per scene)
- Scene/c-square pairs ( rows per scene)

Spatial Search Interface (prototype):

Clickable map interface

Example search result

SQL for spatial search (example for 0.5 degree search square)

    select distinct A.scene_id, B.satellite, B.scene_date_time, B.image_location
    from satdata.satdata_csq_all A, satdata.scene_info B
    where (
            (sqrsize = 0.5
             and (A.csquare = search_csq                         -- e.g. 3414:100:1 (0.5 degree square)
               or A.csquare = substr(search_csq,1,8)||':*'       -- 1 level of compression
               or A.csquare = substr(search_csq,1,6)||'**:*'     -- 2 levels of compression
               or A.csquare = substr(search_csq,1,4)||':***:*')  -- 3 levels of compression
            )
            -- (plus other supported search square size options go here)
          )
      and (startdate is null or B.scene_date_time >= startdate)
      and (enddate is null or B.scene_date_time <= enddate)
      and (sat = 'any' or B.satellite = sat)
      and A.scene_id = B.scene_id
    order by B.scene_date_time, B.satellite;
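The four alternatives tested in the query can be listed explicitly for a given 0.5 degree search square. This sketch mirrors the `substr` expressions above (Oracle `substr(s,1,n)` corresponds to the Python slice `s[:n]`) and assumes wildcard codes are stored without embedded spaces, so that every code stays 10 characters long; the function name is hypothetical.

```python
def candidate_codes(search_csq):
    """Codes that match a 0.5-degree search square, including compressed ancestors."""
    if len(search_csq) != 10:
        raise ValueError("expected a 10-character 0.5-degree code such as '3414:100:1'")
    return [
        search_csq,                 # the 0.5 degree square itself
        search_csq[:8] + ":*",      # whole 1 degree square stored as one code
        search_csq[:6] + "**:*",    # whole 5 degree square stored as one code
        search_csq[:4] + ":***:*",  # whole 10 degree square stored as one code
    ]


print(candidate_codes("3414:100:1"))
```

A scene matches if any one of these codes appears in its list, which is why the search remains a simple equality test on an indexed text column.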

This can be triggered by an HTTP request, e.g. as follows:
sat=any&startday=20&startmon=06&startyr=2003&endday=22&endmon=06&endyr=2003&W_long=150&N_lat=-25&E_long=155&S_lat=-30
…
Generates either an HTML result page to the browser, or a machine-parseable scenes list, e.g.:
h1awj h1awl h1awm h1awo h1awr h1awt h1aww h1awy h1ax0 h1ax2 h1ax5 h1ax7 h1ax9 h1axb h1axd h1axe h1axh h1axj h1axl h1axn h1axo h1axr h1axt h1axw h1axx h1ay0 h1ay1 h1ay4 h1ay8 h1ay9 h1ayb h1ayc h1ayd h1aye h1ayg h1ayh h1ayj h1ayl h1ayo h1ayp h1ayr h1ays h1ayu h1ayw h1az1 h1az3 h1az5 h1az8 h1aza h1azg h1azj h1azl h1azm h1azn h1azp h1azq h1azt h1azw h1azz h1b00 h1b03 h1b04 h1b06 h1b09 h1b0a h1b0c h1b0d h1b0f h1b0i h1b0j h1b0n h1b0q h1b0t h1b0u h1b0w h1b0y h1b10 h1b12 h1b13 h1b15 h1b18 h1b1b h1b1e h1b1h h1b1j h1b1m h1b1o h1b1r h1b1u h1b1w h1b20 h1b21 h1b24 h1b26 h1b28 h1b2a h1b2b h1b2e h1b2g
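A sketch of how a client or server might decode the request parameters shown above. The parameter names are the ones in the example query string; the function name and the shape of the returned dictionary are assumptions made for illustration.

```python
from datetime import date
from urllib.parse import parse_qs

QUERY = (
    "sat=any&startday=20&startmon=06&startyr=2003"
    "&endday=22&endmon=06&endyr=2003"
    "&W_long=150&N_lat=-25&E_long=155&S_lat=-30"
)


def parse_search(query):
    """Turn the query string into typed search criteria."""
    # parse_qs returns a list per key; each key occurs once here.
    p = {key: values[0] for key, values in parse_qs(query).items()}
    return {
        "sat": p["sat"],
        "start": date(int(p["startyr"]), int(p["startmon"]), int(p["startday"])),
        "end": date(int(p["endyr"]), int(p["endmon"]), int(p["endday"])),
        # Bounding box as (west, south, east, north) in decimal degrees.
        "bbox": (float(p["W_long"]), float(p["S_lat"]), float(p["E_long"]), float(p["N_lat"])),
    }


print(parse_search(QUERY))
```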

Automated scene processing and upload
1. An automated routine prepares 2 sets of files: scene information, and scene polygons, in batches of scenes
2. For the first batch, a script initiates file upload to the database, then file conversion (c-square generation) for each scene in the batch (currently takes c. 3 minutes/scene)
3. When all scenes have been converted, the c-squares are added to the master searchable table, and the scenes are flagged as “upload completed”
4. Close to the estimated completion time, a robot wakes up and checks at 10 minute intervals to detect when the job is complete (via an HTTP call to the same procedure which creates the web scene count on the opening screen)
5. When the scene count returns “0 scenes currently being uploaded”, the robot calls the script to repeat stages (2)-(5) with the next batch of scenes
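Stages 4 and 5 amount to a polling loop. The sketch below assumes a caller-supplied function that returns the number of scenes currently being uploaded (in the real system this would wrap the HTTP call); the function name, parameters and the check limit are hypothetical.

```python
import time


def wait_until_idle(get_uploading_count, interval_s=600, max_checks=100, sleep=time.sleep):
    """Poll until no scenes are being uploaded; return the number of checks made."""
    for check in range(1, max_checks + 1):
        if get_uploading_count() == 0:
            return check            # safe to start the next batch
        sleep(interval_s)           # 600 s = the 10 minute interval on the slide
    raise TimeoutError("upload still running after the maximum number of checks")


# Usage with a fake counter in place of the HTTP call, and no real sleeping.
fake_counts = iter([3, 1, 0])
checks = wait_until_idle(lambda: next(fake_counts), sleep=lambda seconds: None)
print(checks)
```

Passing the counter and the sleep function as arguments keeps the loop testable without a database or a ten minute wait.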

C-square encoding algorithm
- “First generation” conversion code constructed January-May 2003, at CMR Hobart
- Currently runs in Oracle PL/SQL, reads/writes from/to 6 Oracle tables (including compression of the c-square string from each scene)
- Includes special treatment for:
  – scenes crossing date line
  – scenes including a pole
- Polar X-Y coordinates used to interpolate between quoted boundary points (gives better representation for southernmost boundary)
- Special “anti-bleed” treatment incorporated (polygons touching, but not entering, a given square do not trigger that square)
- Could probably benefit from re-writing the code in a different language for faster operation, and/or refining method of operation
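The “anti-bleed” rule can be illustrated in one dimension: a footprint edge that lies exactly on a tile boundary should not trigger the neighbouring tile. This is only a sketch of the idea for an axis-aligned extent, not the PL/SQL polygon routine; the function name and the tile numbering (tile i covers i*size to (i+1)*size) are assumptions.

```python
import math


def tile_indices(lo, hi, size):
    """Indices of tiles whose interior overlaps the interval (lo, hi)."""
    first = math.floor(lo / size)
    # ceil() excludes a tile that the interval merely touches at its upper edge.
    last_exclusive = math.ceil(hi / size)
    return list(range(first, last_exclusive))


print(tile_indices(150, 155, 5))    # touches 155 but does not enter the next tile
print(tile_indices(150, 155.1, 5))  # enters the next tile
```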

Next steps …
- Keep a watch on index performance (i.e., retrieval times) as database size increases
- Assess whether this approach is attractive, cf. systems already in use elsewhere, and whether it is of interest to others
- Look into improving the encoding procedure speed (currently a bottleneck for processing a large archive)