Indexing the Sky
Clive Page
2003 Apr 8

3 Formats of Raw Data
Radio:
  – Complex visibility for each polarisation at a set of points sampling the (u,v) plane.
Infra-red, Optical, Ultra-violet:
  – Images from 1k×1k to 18k×20k, collected every few seconds or few minutes.
X-ray, Gamma-ray:
  – Lists of detected photons (x, y, time, energy), typically accumulated for several hours.

Formats of Reduced Data
Images
Time-series
Spectra
Source Catalogues:
  – Vital to cross-identify sources from different wavebands, basis for many subsequent data mining investigations.
  – Problem: can be large, examples:
      USNO-B: 1,045,913,669 rows, 30 columns
      1st XMM-Newton catalogue: 56,711 rows, 379 columns

Required Functionality
SELECT sources in a given small patch of sky (circle, rectangle, or polygon).
JOIN two tables, e.g. from different wavebands, to find corresponding sources.
  – Principal matching criterion is positional match, typically overlap of error-circles.

Problems handling source catalogues
Positions use spherical-polar coordinates (RA, Dec):
  – Right Ascension corresponds to geographic longitude
  – Declination corresponds to geographic latitude
There are singularities at the poles and distortions in the scales everywhere except at the equator.
RA wraps from 24 hours (360 degrees) to zero.
Two-dimensional indexing is really needed.
All source positions are imprecise: points have an error radius.
Distances between points must use a great-circle distance function, not cartesian distance.
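
The RA wrap and the need for a great-circle distance can be illustrated with a minimal Python sketch. It uses the haversine form of the great-circle separation; the function names and the sample coordinates are illustrative assumptions, not part of the talk.

```python
import math

def great_circle_sep(ra1, dec1, ra2, dec2):
    """Angular separation in degrees between two sky positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec1 - dec2) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra1 - ra2) / 2) ** 2)
    # min() guards against rounding pushing h marginally above 1
    return math.degrees(2 * math.asin(math.sqrt(min(1.0, h))))

def naive_sep(ra1, dec1, ra2, dec2):
    """Cartesian distance in the (RA, Dec) plane, in degrees; shown only for contrast."""
    return math.hypot(ra1 - ra2, dec1 - dec2)

# Two points on the equator, either side of the RA wrap
print(great_circle_sep(359.9995, 0.0, 0.0005, 0.0), naive_sep(359.9995, 0.0, 0.0005, 0.0))

# Two points at Dec = +89 degrees on opposite sides of the pole
print(great_circle_sep(0.0, 89.0, 180.0, 89.0), naive_sep(0.0, 89.0, 180.0, 89.0))
```

In the first case the true separation is about 0.001 degrees while the cartesian value is close to 360; in the second the true separation is 2 degrees while the cartesian value is 180.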

Indexing Possibilities
1. Use simple B-tree on one spatial axis only
2. Use 2-d to 1-d mapping function, then B-tree
3. Use spatial index such as R-tree

(1) Index one spatial axis only
For example, consider USNO-B: a table of a billion rows.
Typical search/join uses a radius of, say, 3 arc-seconds.
Probability of finding a source in a circle of radius 3 arc-seconds in a random position is around 17%, so most searches find 0 or 1 rows.
An index on just one coordinate (say Dec) will effectively search a strip 360° wide by 6 arc-seconds high, and will find some 10,000 rows matching.
These have to be scanned sequentially to find at most one matching row.
Conclusion: a true 2-d index can gain five orders of magnitude in efficiency.
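
As a rough cross-check of the strip estimate, the sketch below assumes that sources are spread uniformly over the sky (an assumption made here, not stated in the talk, and not true near the Galactic plane) and treats the strip as a flat rectangle. The function name rows_in_dec_strip is illustrative only.

```python
import math

N_ROWS = 1_045_913_669     # USNO-B row count quoted in the talk
ARCSEC_PER_RADIAN = 180 / math.pi * 3600
SKY_ARCSEC2 = 4 * math.pi * ARCSEC_PER_RADIAN ** 2   # whole sky, square arc-seconds

# Mean source density, ASSUMING a uniform distribution over the sky
density = N_ROWS / SKY_ARCSEC2

def rows_in_dec_strip(dec_deg, half_height_arcsec):
    """Expected rows in a full-circle RA strip centred on dec_deg (flat-strip approximation)."""
    length = 360 * 3600 * math.cos(math.radians(dec_deg))   # strip length, arc-seconds
    return density * length * 2 * half_height_arcsec

print(round(rows_in_dec_strip(0.0, 3.0)))    # strip on the equator
print(round(rows_in_dec_strip(45.0, 3.0)))   # strip at Dec = 45 degrees
```

Under these assumptions the strip holds on the order of 10,000 rows, consistent with the figure quoted above.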

(2) 2-d to 1-d mapping
Cover the space with cells (pixels) and number them.
Create conventional B-tree on resulting set of integers.
Each point maps to an integer.
Areas map to a list of integers:
  – Ideally a small spatial area maps to a small range of integers, so one can do a range search using the B-tree.
  – Various space-filling curves such as the Z-ordering index and Peano Curve have been used in the hope that this works…

Z-order mapping function
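
A Z-order (Morton) code is usually built by interleaving the bits of the two cell coordinates. The sketch below is one common way to write it; the choice of putting x in the even bit positions and y in the odd ones is an assumption, as is the function name.

```python
def interleave_bits(x, y, nbits=16):
    """Z-order code: bit i of x goes to bit 2i, bit i of y goes to bit 2i+1."""
    code = 0
    for i in range(nbits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

# Number the cells of a 4x4 grid in Z-order, one printed row per value of y
for y in range(4):
    print([interleave_bits(x, y, nbits=2) for x in range(4)])
```

Each 2x2 block of cells receives four consecutive integers, which is the property a B-tree range search would like to exploit.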

Space-filling Curves
All have same failing:
  – At some places in the grid a high-order bit flips and the range of integers becomes huge.
  – Tests confirm this defect: the worst-case performance is rather poor.
Simple cartesian grids are also unsuited to spherical-polar coordinates, as there are too many tiny pixels near the poles.
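
The high-order bit flip can be seen on a toy 16x16 grid. This sketch assumes the bit-interleaving form of Z-ordering (x in even bit positions, y in odd ones) and compares two boxes of four cells each; all names are illustrative.

```python
def z_code(x, y, nbits):
    """Interleave the low nbits of x (even bit positions) and y (odd bit positions)."""
    code = 0
    for i in range(nbits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def z_range_for_box(x0, y0, x1, y1, nbits):
    """Smallest and largest Z-order code among the cells of an inclusive box."""
    codes = [z_code(x, y, nbits)
             for x in range(x0, x1 + 1)
             for y in range(y0, y1 + 1)]
    return min(codes), max(codes)

# 16x16 grid (nbits=4); both boxes cover exactly four cells
aligned = z_range_for_box(4, 4, 5, 5, nbits=4)      # box inside one quadrant
straddling = z_range_for_box(7, 7, 8, 8, nbits=4)   # box across the grid centre

for name, (lo, hi) in (("aligned", aligned), ("straddling", straddling)):
    print(name, lo, hi, "range covers", hi - lo + 1, "codes")
```

In this toy grid the aligned box maps to 4 consecutive codes, while the box across the centre spans 130 codes for the same four cells, so a single B-tree range scan would read many irrelevant entries.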

Covering the sky evenly with pixels
Hierarchical Equal Area iso-Latitude Pixelisation (HEALPix) – invented at European Southern Observatory.
Hierarchical Triangular Mesh (HTM) – invented at Johns Hopkins University.
Can use either algorithm – call it pixel-code or PCODE for short.
  – Do not try to conduct spatial range search using range of PCODE values.

Hierarchical Equal Area iso-Latitude Pixelisation (HEALPix)

Hierarchical Triangular Mesh (HTM)

Spatial Join using PCODE
Table CAT1 has columns: ID1, RA, DEC, POSERR, MAGNITUDE, etc.
Table CAT2 has columns: ID2, RA, DEC, POSERR, FLUX, etc.

Create additional tables with PCODE values
Table CAT1 has columns: ID1 (primary key), RA, DEC, POSERR, MAGNITUDE, etc.
Table P1 has columns: ID1, PCODE1 (primary key)
Table CAT2 has columns: ID2 (primary key), RA, DEC, POSERR, FLUX, etc.
Table P2 has columns: ID2, PCODE2 (primary key)

JOIN the two PCODE tables
Note: tables P1, P2 have extra rows when error-circles overlap more than one pixel.
Join P1 and P2 on PCODE1=PCODE2, making a table PJOIN with just two columns: ID1 and ID2.
Use SELECT DISTINCT to remove any duplicates.
Table PJOIN identifies pairs of sources that share a pixel and so may have overlapping error circles (or they may just be near but not overlapping).
Create B-tree index on PJOIN(ID1).

Use PJOIN table to match catalogue rows
Three-way join then produces required results, e.g.
    -- great-circle separation (haversine form); assumes RA, DEC and POSERR are in radians
    SELECT cols FROM CAT1, PJOIN, CAT2
    WHERE CAT1.ID1 = PJOIN.ID1 AND PJOIN.ID2 = CAT2.ID2
      AND (2 * asin(sqrt(pow(sin((cat1.dec - cat2.dec)/2), 2)
           + cos(cat1.dec) * cos(cat2.dec) * pow(sin((cat1.ra - cat2.ra)/2), 2)))
           <= cat1.poserr + cat2.poserr);
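
The PCODE join can be tried end to end on a toy data set. The sketch below uses Python's built-in sqlite3 module, which is an assumption made for portability and not the DBMS used in the talk. The PCODE values are made-up integers standing in for the output of an HTM or HEALPix library, all angles are stored in radians, and gc_dist is a hypothetical user-defined function wrapping the great-circle expression.

```python
import math
import sqlite3

ARCSEC = math.radians(1 / 3600)   # one arc-second in radians

def gc_dist(ra1, dec1, ra2, dec2):
    """Great-circle distance in radians (haversine form); inputs in radians."""
    h = (math.sin((dec1 - dec2) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra1 - ra2) / 2) ** 2)
    return 2 * math.asin(math.sqrt(min(1.0, h)))

con = sqlite3.connect(":memory:")
con.create_function("gc_dist", 4, gc_dist)   # make the function callable from SQL
con.executescript("""
CREATE TABLE cat1 (id1 INTEGER PRIMARY KEY, ra REAL, dec REAL, poserr REAL);
CREATE TABLE cat2 (id2 INTEGER PRIMARY KEY, ra REAL, dec REAL, poserr REAL);
CREATE TABLE p1 (id1 INTEGER, pcode1 INTEGER);
CREATE TABLE p2 (id2 INTEGER, pcode2 INTEGER);
CREATE INDEX p1_pcode ON p1(pcode1);
CREATE INDEX p2_pcode ON p2(pcode2);
""")

# Source 1 and source 10 lie 1 arc-second apart in RA; 2 and 20 lie 10 arc-seconds apart in Dec
con.executemany("INSERT INTO cat1 VALUES (?, ?, ?, ?)",
                [(1, 1.0, 0.5, 2 * ARCSEC), (2, 2.0, -0.3, 2 * ARCSEC)])
con.executemany("INSERT INTO cat2 VALUES (?, ?, ?, ?)",
                [(10, 1.0 + ARCSEC, 0.5, 2 * ARCSEC), (20, 2.0, -0.3 + 10 * ARCSEC, 2 * ARCSEC)])

# Made-up pixel codes: sources 1 and 10 each straddle pixels 101 and 102
con.executemany("INSERT INTO p1 VALUES (?, ?)", [(1, 101), (1, 102), (2, 205)])
con.executemany("INSERT INTO p2 VALUES (?, ?)", [(10, 101), (10, 102), (20, 205)])

# Candidate pairs: DISTINCT removes the duplicate (1, 10) found via both pixels
con.executescript("""
CREATE TABLE pjoin AS
  SELECT DISTINCT p1.id1, p2.id2 FROM p1 JOIN p2 ON p1.pcode1 = p2.pcode2;
CREATE INDEX pjoin_id1 ON pjoin(id1);
""")
candidates = sorted(con.execute("SELECT id1, id2 FROM pjoin").fetchall())

# Three-way join with the exact great-circle test on the candidates only
matches = con.execute("""
SELECT cat1.id1, cat2.id2 FROM cat1, pjoin, cat2
WHERE cat1.id1 = pjoin.id1 AND pjoin.id2 = cat2.id2
  AND gc_dist(cat1.ra, cat1.dec, cat2.ra, cat2.dec) <= cat1.poserr + cat2.poserr
""").fetchall()

print("candidates:", candidates)
print("matches:", matches)
```

Both pairs share a pixel and so appear in PJOIN, but only the pair whose error circles overlap survives the final filter.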

(3) True Multi-dimensional Indexing
Hot topic of research in computer science departments for more than 20 years.
Very many algorithms have been proposed:
  – BANG file, BV-tree, Buddy tree, Cell tree, G-tree, GBD-tree, Gridfile, hB-tree, kd-tree, LSD-tree, P-tree, PK-tree, PLOP hashing, Pyramid tree, Q0-tree, Quadtree, R-tree, SKD-tree, SR-tree, SS-tree, TV-tree, UB-tree, Z-order index.
  – So many alternatives, but none of them provides a good general solution comparable to the B-tree in 1-d indexing.
R-tree indexing is built into several modern DBMS.

Spatial Options in current DBMS
Commercial:
  DB2          Spatial Extender – multi-level grid file
  Ingres       None
  Oracle       Spatial Option – R-tree (?)
  SQL Server   None
  Sybase       Spatial Option (Boeing SQS) – R-tree
Open Source:
  MySQL        R-tree in V4.1 (beta, documentation lacking)
  Interbase    None
  PostgreSQL   R-tree

Using R-trees
Used R-trees in Postgres – does what it says on the box.
Problems/limitations include:
Object indexed by R-tree is a rectangular box, so must draw a box outside each error circle.
Boxes get rather extended (along RA axis) near poles.
Need a subsequent filter to remove spurious matches where rectangles overlap but circles do not.
R-tree indices are large, creation is slow (2 hours for table of 3.5 million rows using Postgres).
  – Kalpakis et al. used Informix to load part of USNO-A2 and found data load and R-tree creation would have taken 39 days for the entire 500M row table.
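
The stretching of the boxes near the poles follows from the geometry of a circle on the sphere. The sketch below computes the bounding box of an error circle using the standard result that the half-width in RA is asin(sin r / cos Dec). The function name is hypothetical, the 0.5 degree radius is far larger than a real error circle and chosen only to make the effect visible, and RA wrap-around is deliberately left unhandled.

```python
import math

def error_circle_bbox(ra, dec, radius):
    """Bounding box (ra_min, ra_max, dec_min, dec_max), all in degrees, of a circle on the sky.

    ra_min may fall below 0 or ra_max above 360 when the box crosses the RA wrap;
    handling that is left to the caller in this sketch.
    """
    dec_min, dec_max = dec - radius, dec + radius
    if dec_max >= 90.0 or dec_min <= -90.0:
        # Circle touches or contains a pole: every RA is inside the box
        return 0.0, 360.0, max(dec_min, -90.0), min(dec_max, 90.0)
    half_width = math.degrees(
        math.asin(math.sin(math.radians(radius)) / math.cos(math.radians(dec))))
    return ra - half_width, ra + half_width, dec_min, dec_max

for dec in (0.0, 60.0, 89.0):
    ra_min, ra_max, _, _ = error_circle_bbox(180.0, dec, 0.5)
    print(f"dec={dec:5.1f}  RA extent of box = {ra_max - ra_min:.3f} deg")
```

For the same 0.5 degree radius the box is about 1 degree wide in RA at the equator, about 2 degrees at Dec 60, and about 60 degrees at Dec 89, which is why a circle filter is still needed after the R-tree search.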

Comparison of PCODE and R-tree
Advantages
  – PCODE join seems to be faster (but not yet benchmarked with identical systems).
  – Takes up less disc space in total.
  – Can use any DBMS, not just those with an R-tree or other spatial data option.
Disadvantages
  – Additional tables and indices have to be created.
  – More complex set of joins.
  – Needs external code, as neither HTM nor HEALPix can be expressed as an SQL-callable function (they return a variable-length array of integers).

Conclusions
Indexing on just one spatial axis is simply too inefficient for large tables.
R-trees are powerful and easy to use, but index creation times are a serious cause for concern.
2-d to 1-d mapping functions such as HTM or HEALPix are more complicated to use, but may be worthwhile for JOINs if they turn out to be faster.