Sky Query: A distributed query engine for astronomy
László Dobos (1), Tamás Budavári (2), Alex Szalay (2), István Csabai (1)
(1) Eötvös Loránd University, Hungary
(2) Johns Hopkins University, Baltimore
The multiwavelength sky
- infrared (2MASS)
- visible (DSS)
- ultraviolet (GALEX)
Crossmatching
- Astronomical catalogs
  - stored in RDBMS
  - on the order of 100 million objects
  - on the order of 1 TB to 10 TB database size
- Done by coordinates
  - RA, Dec
  - astrometric error
  - different sky coverage
  - different wavelength range
  - moving objects, etc.
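Matching by coordinates comes down to comparing the angular separation of two (RA, Dec) positions with a tolerance set by the astrometric error. The following is a minimal Python sketch of that test, not the engine's implementation. The function names and the fixed match radius are assumptions made for illustration; the haversine form is used because it stays accurate for the very small angles involved.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees; all inputs are in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula: well conditioned for sub-arcsecond separations.
    h = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))

def is_candidate_match(ra1, dec1, ra2, dec2, radius_arcsec):
    """True if the two positions lie within radius_arcsec of each other.

    radius_arcsec is a hypothetical tolerance standing in for the
    combined astrometric error of the two catalogs.
    """
    return angular_separation_deg(ra1, dec1, ra2, dec2) * 3600.0 <= radius_arcsec
```

Note that a given difference in RA corresponds to a smaller angle on the sky at high declination: two objects at Dec = 60 degrees that differ by 1 arcsecond in RA are only about 0.5 arcseconds apart.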
Crossmatching on demand
- Crossmatch any number of catalogs
  - all combinations cannot be precomputed
  - maybe catalog pairs?
- User can specify
  - list of catalogs to match
  - region of interest
  - priors for non-coordinate-based matching
Problem description
- Astronomers "script" what they do
  - multiple re-runs, tweak parameters, etc.
  - huge web forms: no-no
- All data in RDBMS
  - run computation inside the database
  - use multiple servers and parallelize
  - must be transparent for users
- Problem description in SQL
  - functions and language extensions to support astronomy
  - syntax to formulate the coordinate-based probabilistic join
  - spatial constraints: celestial regions
Sample SQL query

    SELECT s.objId, g.objID, t.objID,
           s.ra, s.dec, g.ra, g.dec, t.ra, t.dec,
           x.ra, x.dec
    FROM SDSSDR7:Galaxies AS s
    CROSS JOIN Galex:Galaxies AS g
    CROSS JOIN TwoMASS:ExtendedSources AS t
    XMATCH BAYESIAN AS x
        MUST s ON POINT(s.cx, s.cy, s.cz), 0.1
        MUST g ON POINT(g.ra, g.dec), 0.2
        MAY t ON POINT(t.ra, t.dec), 0.5
        HAVING LIMIT 1e3
    REGION CIRCLE J2000 165.7, 0.3, 60

Parts of the query:
- Standard SQL: the SELECT list and the FROM ... CROSS JOIN clauses
- Probabilistic crossmatch: the XMATCH BAYESIAN clause
- Spatial constraint: the REGION clause
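The sample query passes positions to POINT in two forms: three components (s.cx, s.cy, s.cz) and two angles (g.ra, g.dec). Assuming that cx, cy, cz denote the Cartesian unit vector of the position, which is the convention in SDSS catalogs, the two forms carry the same information. The Python sketch below shows the conversion in both directions; the function names are made up for illustration and are not part of the query language.

```python
import math

def radec_to_unit_vector(ra_deg, dec_deg):
    """Convert equatorial coordinates (degrees) to a Cartesian unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def unit_vector_to_radec(cx, cy, cz):
    """Convert a Cartesian unit vector back to (RA, Dec) in degrees."""
    ra = math.degrees(math.atan2(cy, cx)) % 360.0   # map RA into [0, 360)
    dec = math.degrees(math.asin(max(-1.0, min(1.0, cz))))  # clamp rounding noise
    return ra, dec
```

One practical reason to store the unit vector is that distance tests become dot products, with no special handling at the RA = 0/360 boundary or near the poles.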
Zone algorithms
- Pure SQL: can leverage the query optimizer of SQL Server
- Divide sphere into zones
  - ZoneID: very simple hash on declination
  - indexes built on ZoneID and right ascension help very quick pre-filtering of match candidates
  - very well parallelized on multi-core machines
[Gray, Szalay & Nieto-Santisteban 2006, The Zones Algorithm for Finding Points-Near-a-Point or Cross-Matching Spatial Datasets]
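The zone idea can be sketched outside the database as well. The Python below is an illustrative in-memory version, not the SQL implementation from the cited paper: ZoneID is computed as floor((dec + 90) / zone height), candidates are pre-filtered by zone and by an RA window, and only the survivors get the exact distance test. The zone height, the tuple layout (id, ra, dec) and all function names are assumptions. A dictionary of lists stands in for the index on (ZoneID, RA), so each zone is scanned linearly here, whereas the database would do an index range scan.

```python
import math
from collections import defaultdict

ZONE_HEIGHT_DEG = 0.1  # hypothetical zone height, chosen only for illustration

def zone_id(dec_deg, zone_height=ZONE_HEIGHT_DEG):
    """Simple hash on declination: index of the horizontal stripe."""
    return int(math.floor((dec_deg + 90.0) / zone_height))

def separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))

def ra_window_deg(dec_deg, radius_deg):
    """Largest RA difference any point within radius_deg of dec_deg can have."""
    if abs(dec_deg) + radius_deg >= 90.0:
        return 180.0  # the search circle touches a pole: every RA qualifies
    ratio = math.sin(math.radians(radius_deg)) / math.cos(math.radians(dec_deg))
    return math.degrees(math.asin(min(1.0, ratio)))

def build_zone_index(catalog, zone_height=ZONE_HEIGHT_DEG):
    """Group (id, ra, dec) rows by zone; stands in for an index on ZoneID."""
    zones = defaultdict(list)
    for obj_id, ra, dec in catalog:
        zones[zone_id(dec, zone_height)].append((obj_id, ra, dec))
    return zones

def crossmatch(cat_a, cat_b, radius_deg, zone_height=ZONE_HEIGHT_DEG):
    """Return sorted (id_a, id_b) pairs separated by at most radius_deg."""
    index = build_zone_index(cat_b, zone_height)
    matches = []
    for id_a, ra_a, dec_a in cat_a:
        window = ra_window_deg(dec_a, radius_deg)
        lo = zone_id(max(-90.0, dec_a - radius_deg), zone_height)
        hi = zone_id(min(90.0, dec_a + radius_deg), zone_height)
        for z in range(lo, hi + 1):            # only zones the circle overlaps
            for id_b, ra_b, dec_b in index.get(z, ()):
                dra = abs(ra_a - ra_b) % 360.0
                dra = min(dra, 360.0 - dra)    # handle wrap-around at RA = 0/360
                if dra > window:
                    continue                   # cheap pre-filter on RA
                if separation_deg(ra_a, dec_a, ra_b, dec_b) <= radius_deg:
                    matches.append((id_a, id_b))
    return sorted(matches)
```

For example, `crossmatch(cat_a, cat_b, 2.0 / 3600.0)` pairs objects within 2 arcseconds. Because every zone overlapping the search circle is visited, the zone height affects only how many candidates are examined, not which pairs are found.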