Indexing the Sky. Clive Page. 2003 Apr 8.


Formats of Raw Data
Radio:
  – Complex visibility for each polarisation at a set of points sampling the (u,v) plane.
Infra-red, Optical, Ultra-violet:
  – Images from 1k×1k to 18k×20k, collected every few seconds or few minutes.
X-ray, Gamma-ray:
  – Lists of detected photons (x, y, time, energy), typically accumulated for several hours.

Formats of Reduced Data
Images
Time-series
Spectra
Source Catalogues:
  – Vital for cross-identifying sources from different wavebands; the basis for many subsequent data mining investigations.
  – Problem: they can be large. Examples:

      Catalogue                   Rows            Columns
      USNO-B                      1,045,913,669   30
      1st XMM-Newton catalogue    56,711          379

Required Functionality
SELECT sources in a given small patch of sky (circle, rectangle, or polygon).
JOIN two tables, e.g. from different wavebands, to find corresponding sources.
  – The principal matching criterion is positional match, typically overlap of error circles.

Problems handling source catalogues
Positions use spherical-polar coordinates (RA, Dec):
  – Right Ascension corresponds to geographic longitude.
  – Declination corresponds to geographic latitude.
There are singularities at the poles and distortions in the scales everywhere except at the equator.
RA wraps from 24 hours (360 degrees) to zero.
Two-dimensional indexing is really needed.
All source positions are imprecise, so points have an error radius.
Distances between points must use a great-circle distance function, not Cartesian distance.

Indexing Possibilities
1. Use a simple B-tree on one spatial axis only
2. Use a 2-d to 1-d mapping function, then a B-tree
3. Use a spatial index such as an R-tree

(1) Index one spatial axis only
For example, consider USNO-B: a table of a billion rows.
A typical search/join uses a radius of, say, 3 arc-seconds.
The probability of finding a source in a circle of radius 3 arc-seconds at a random position is around 17%, so most searches find 0 or 1 rows.
An index on just one coordinate (say Dec) will effectively search a strip 360° wide by 6 arc-seconds high, and will find some 10,000 matching rows.
These have to be scanned sequentially to find at most one matching row.
Conclusion: a true 2-d index can gain five orders of magnitude in efficiency.
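
The orders of magnitude can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes sources are spread uniformly over the sky (in reality the density rises towards the Galactic plane) and takes the strip at the equator, where it is largest. Under those assumptions the hit probability comes out nearer 5% than the figure quoted on the slide, and the strip holds roughly 15,000 rows. The qualitative conclusion is unchanged: a one-axis index scans thousands of rows to find at most one.

```python
import math

N_ROWS = 1_045_913_669                      # USNO-B row count quoted in the talk
SKY_DEG2 = 4 * math.pi * (180 / math.pi) ** 2   # whole sky, about 41,253 square degrees
RADIUS_ARCSEC = 3.0                         # search radius used on the slide

# Assumption: uniform source density over the whole sky.
density_per_deg2 = N_ROWS / SKY_DEG2

# Expected number of sources inside one search circle.
circle_deg2 = math.pi * (RADIUS_ARCSEC / 3600) ** 2
expected_in_circle = density_per_deg2 * circle_deg2

# Rows returned by a Dec-only index: a strip 360 degrees wide and
# 2 * radius high, evaluated at the equator.
strip_deg2 = 360 * (2 * RADIUS_ARCSEC / 3600)
rows_in_strip = density_per_deg2 * strip_deg2

print(f"expected sources per circle: {expected_in_circle:.3f}")
print(f"rows scanned in the Dec strip: {rows_in_strip:,.0f}")
```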

(2) 2-d to 1-d mapping
Cover the space with cells (pixels) and number them.
Create a conventional B-tree on the resulting set of integers.
Each point maps to an integer.
Areas map to a list of integers:
  – Ideally a small spatial area maps to a small range of integers, so one can do a range search using the B-tree.
  – Various space-filling curves such as the Z-ordering index and the Peano curve have been used in the hope that this works…

Z-order mapping function
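
A minimal sketch of a Z-order (Morton) mapping, for illustration only: the function name `z_order`, the 16-bit default, and the choice of putting x in the even bit positions are assumptions, not details taken from the talk. The idea is simply to interleave the bits of the two cell coordinates into one integer that a B-tree can index.

```python
def z_order(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y: x fills even positions, y odd ones."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


# Visit the cells of a 4x4 grid in increasing Z-order code.
cells = [(x, y) for y in range(4) for x in range(4)]
walk = sorted(cells, key=lambda cell: z_order(*cell))
print(walk[:4])   # the first four cells form the lower-left 2x2 block
```

Sorting by the code visits each 2x2 block before moving to the next, which is what gives the curve its "Z" shape.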

Space-filling Curves
All have the same failing:
  – At some places in the grid a high-order bit flips and the range of integers becomes huge.
  – Tests confirm this defect: the worst-case performance is rather poor.
Simple Cartesian grids are also unsuited to spherical-polar coordinates, as there are too many tiny pixels near the poles.
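
The bit-flip problem can be seen on a small grid. The sketch below uses an illustrative bit-interleaving function (`z_order`, an assumed name) and compares two 2x2 query boxes on a 16x16 grid: one that sits inside a single quadrant, and one that straddles the centre of the grid, where the highest-order bit of both coordinates changes.

```python
def z_order(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y: x fills even positions, y odd ones."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def code_span(x0: int, y0: int, x1: int, y1: int) -> tuple[int, int]:
    """Smallest and largest code among the cells of an inclusive box."""
    codes = [z_order(x, y)
             for x in range(x0, x1 + 1)
             for y in range(y0, y1 + 1)]
    return min(codes), max(codes)


lo_a, hi_a = code_span(2, 2, 3, 3)   # box aligned with the curve
lo_b, hi_b = code_span(7, 7, 8, 8)   # box straddling the grid centre

print(lo_a, hi_a, hi_a - lo_a + 1)   # 12 15 4
print(lo_b, hi_b, hi_b - lo_b + 1)   # 63 192 130
```

Both boxes contain four cells, but a single B-tree range scan covering the second box would have to read 130 codes to find them.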

Covering the sky evenly with pixels
Hierarchical Equal Area iso-Latitude Pixelisation (HEALPix): invented at European Southern Observatory.
Hierarchical Triangular Mesh (HTM): invented at Johns Hopkins University.
Can use either algorithm; call the result a pixel-code, or PCODE for short.
  – Do not try to conduct a spatial range search using a range of PCODE values.

Hierarchical Equal Area iso-Latitude Pixelisation (HEALPix)

Hierarchical Triangular Mesh (HTM)

Spatial Join using PCODE
Table CAT1 has columns: ID1, RA, DEC, POSERR, MAGNITUDE, etc.
Table CAT2 has columns: ID2, RA, DEC, POSERR, FLUX, etc.

Create additional tables with PCODE values
Table CAT1 has columns: ID1 (primary key), RA, DEC, POSERR, MAGNITUDE, etc.
Table P1 has columns: ID1, PCODE1 (primary key)
Table CAT2 has columns: ID2 (primary key), RA, DEC, POSERR, FLUX, etc.
Table P2 has columns: ID2, PCODE2 (primary key)

JOIN the two PCODE tables
Note: tables P1 and P2 have extra rows when error circles overlap more than one pixel.
Join P1 and P2 on PCODE1=PCODE2, making a table PJOIN with just two columns: ID1 and ID2.
Use SELECT DISTINCT to remove any duplicates.
Table PJOIN identifies pairs of sources that share a pixel and so may have overlapping error circles (or they may just be near but not overlapping).
Create a B-tree index on PJOIN(ID1).
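
The PJOIN step can be tried out on a toy example. The sketch below uses SQLite through Python's standard `sqlite3` module purely for convenience; the talk does not say which DBMS was used. The pixel codes are made up, and the PCODE columns are declared here as ordinary indexed columns (an assumption), since one source may occupy several pixels.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE P1 (ID1 INTEGER, PCODE1 INTEGER);
    CREATE TABLE P2 (ID2 INTEGER, PCODE2 INTEGER);
    CREATE INDEX P1_PCODE ON P1(PCODE1);
    CREATE INDEX P2_PCODE ON P2(PCODE2);
""")

# Hypothetical pixel codes. Source 1 of CAT1 and source 7 of CAT2 each
# have an error circle that covers two pixels (100 and 101).
con.executemany("INSERT INTO P1 VALUES (?, ?)", [(1, 100), (1, 101), (2, 205)])
con.executemany("INSERT INTO P2 VALUES (?, ?)", [(7, 100), (7, 101), (8, 300)])

# The plain join yields one row per shared pixel, so the pair appears twice.
raw_count = con.execute(
    "SELECT COUNT(*) FROM P1 JOIN P2 ON P1.PCODE1 = P2.PCODE2"
).fetchone()[0]

# SELECT DISTINCT collapses the duplicates into one candidate pair.
con.executescript("""
    CREATE TABLE PJOIN AS
        SELECT DISTINCT P1.ID1, P2.ID2
        FROM P1 JOIN P2 ON P1.PCODE1 = P2.PCODE2;
    CREATE INDEX PJOIN_ID1 ON PJOIN(ID1);
""")
pairs = con.execute("SELECT ID1, ID2 FROM PJOIN ORDER BY ID1, ID2").fetchall()

print(raw_count, pairs)   # 2 [(1, 7)]
```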

Use PJOIN table to match catalogue rows
A three-way join then produces the required results, e.g.

    SELECT cols
    FROM CAT1, PJOIN, CAT2
    WHERE CAT1.ID1 = PJOIN.ID1
      AND PJOIN.ID2 = CAT2.ID2
      -- great-circle (haversine) separation; all angles in radians
      AND 2 * asin(sqrt(pow(sin((CAT1.DEC - CAT2.DEC) / 2), 2)
                        + cos(CAT1.DEC) * cos(CAT2.DEC)
                          * pow(sin((CAT1.RA - CAT2.RA) / 2), 2)))
          <= CAT1.POSERR + CAT2.POSERR;
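
The distance term in the query is the haversine form of the great-circle separation. As a sketch, here is the same formula in Python; the function names and the choice of radians for every argument are assumptions made for the example. The two checks illustrate points made earlier in the talk: the formula copes with RA wrapping through zero, and a fixed RA offset shrinks to a much smaller true separation near the pole.

```python
import math

ARCSEC = math.radians(1 / 3600)   # one arc-second in radians


def great_circle(ra1: float, dec1: float, ra2: float, dec2: float) -> float:
    """Angular separation in radians (haversine form); inputs in radians."""
    h = (math.sin((dec1 - dec2) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra1 - ra2) / 2) ** 2)
    return 2 * math.asin(math.sqrt(h))


def error_circles_overlap(ra1, dec1, err1, ra2, dec2, err2) -> bool:
    """True when the separation is no larger than the summed error radii."""
    return great_circle(ra1, dec1, ra2, dec2) <= err1 + err2


# Two positions either side of RA = 0 on the equator, 0.0002 degrees apart.
wrap_sep = great_circle(math.radians(359.9999), 0.0, math.radians(0.0001), 0.0)

# Two positions 10 arc-seconds apart in RA at Dec = +89 degrees.
dec = math.radians(89.0)
polar_sep = great_circle(0.0, dec, 10 * ARCSEC, dec)

print(wrap_sep / ARCSEC)    # about 0.72 arc-seconds
print(polar_sep / ARCSEC)   # about 0.17 arc-seconds, not 10
```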

(3) True Multi-dimensional Indexing
A hot topic of research in computer science departments for more than 20 years.
Very many algorithms have been proposed:
  – BANG file, BV-tree, Buddy tree, Cell tree, G-tree, GBD-tree, Gridfile, hB-tree, kd-tree, LSD-tree, P-tree, PK-tree, PLOP hashing, Pyramid tree, Q0-tree, Quadtree, R-tree, SKD-tree, SR-tree, SS-tree, TV-tree, UB-tree, Z-order index.
  – So many alternatives, but none of them provides a good general solution in the way the B-tree does for 1-D indexing.
R-tree indexing is built into several modern DBMS.

Spatial Options in current DBMS
Commercial:
  DB2          Spatial Extender – multi-level grid file
  Ingres       None
  Oracle       Spatial Option – R-tree (?)
  SQL Server   None
  Sybase       Spatial Option (Boeing SQS) – R-tree
Open Source:
  MySQL        R-tree in V4.1 (beta, documentation lacking)
  Interbase    None
  PostgreSQL   R-tree

Using R-trees
Used R-trees in Postgres: does what it says on the box.
Problems/limitations include:
  – The object indexed by an R-tree is a rectangular box, so one must draw a box outside each error circle.
  – Boxes get rather extended (along the RA axis) near the poles.
  – A subsequent filter is needed to remove spurious matches where rectangles overlap but circles do not.
  – R-tree indices are large and creation is slow (2 hours for a table of 3.5 million rows using Postgres).
  – Kalpakis et al. used Informix to load part of USNO-A2 and found that data load and R-tree creation would have taken 39 days for the entire 500M row table.
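
To illustrate why the boxes stretch along the RA axis, here is a small sketch of a bounding box around an error circle. The function name and the degree units are assumptions for the example, and it makes no attempt to handle a box that crosses RA = 0/360. It uses the standard spherical result that a small circle of radius r centred at declination δ extends in RA by asin(sin r / cos δ).

```python
import math


def bounding_box(ra_deg: float, dec_deg: float, radius_deg: float):
    """Return (ra_min, ra_max, dec_min, dec_max) enclosing an error circle.

    Simplification: RA limits are not wrapped into the range 0..360.
    """
    dec_min = dec_deg - radius_deg
    dec_max = dec_deg + radius_deg
    if dec_max >= 90.0 or dec_min <= -90.0:
        # The circle touches or contains a pole, so every RA is possible.
        return 0.0, 360.0, max(dec_min, -90.0), min(dec_max, 90.0)
    s = math.sin(math.radians(radius_deg)) / math.cos(math.radians(dec_deg))
    half_width = math.degrees(math.asin(s))
    return ra_deg - half_width, ra_deg + half_width, dec_min, dec_max


radius = 3 / 3600   # a 3 arc-second error radius, in degrees

eq = bounding_box(180.0, 0.0, radius)
near_pole = bounding_box(180.0, 89.0, radius)

eq_width = eq[1] - eq[0]
pole_width = near_pole[1] - near_pole[0]
print(pole_width / eq_width)   # the box is about 57 times wider at Dec = +89
```

Because the box is so much larger than the circle near the poles, the follow-up filter on true angular separation matters most there.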

Comparison of PCODE and R-tree
Advantages of PCODE:
  – The PCODE join seems to be faster (but not yet benchmarked with identical systems).
  – Takes up less disc space in total.
  – Can use any DBMS, not just those with an R-tree or other spatial data option.
Disadvantages of PCODE:
  – Additional tables and indices have to be created.
  – More complex set of joins.
  – Needs external code, as neither HTM nor HEALPix can be expressed as an SQL-callable function (they return a variable-length array of integers).

Conclusions
Indexing on just one spatial axis is simply too inefficient for large tables.
R-trees are powerful and easy to use, but index creation times are a serious cause for concern.
2-d to 1-d mapping functions such as HTM or HEALPix are more complicated to use, but may be worthwhile for JOINs if they turn out to be faster.