Introduction to Sky Survey Problems Bob Mann
Introduction to sky survey database problems Astronomical data Astronomical databases –The Virtual Observatory – concept & status –Large sky survey databases Spatial indexing in astronomical databases Case Study: SDSS & SkyServer
Observational Astronomy Electromagnetic spectrum IRAS 25 2MASS 2 DSS Optical IRAS 100 NVSS 20cm GB 6cm ROSAT ~keV WENSS 92cm
Astronomical data – in original form Optical –Image: array of pixel values X-ray –Event list: positions, arrival times, energies of all detected photons Radio –Interferometric visibilities: sparse Fourier transform of a region of the sky Very different types of data
Astronomical data – in final form Most research done using catalogue data –i.e. tables of attributes of detected sources – mainly discrete sources (stars, galaxies, etc) –Data compression Catalogue - few% of image data volume –Amenable to representation in relational DB Natural indexing by location in sky
Astronomical Databases Sky survey archives –Homogeneous data, standard reduction pipeline –“Science Archive” – do science on DB Telescope archives –Semi-indexed collections of raw data files from all observations taken – heterogeneous –Download data for reduction and analysis Specialist data centres – collections of catalogues Bibliographic databases– scans of major journals
The Virtual Observatory Concept: –Interoperable federation of all the world’s significant astronomical databases –Facilitate multi-wavelength astronomy Status: –Several projects underway – AstroGrid in UK –5+ years’ work to create a fully working VO The VO sets the context for the design of new sky survey databases
AstroGrid: Consortium: –Edinburgh, Leicester, Cambridge, RAL, MSSL, Jodrell Bank, Queens Belfast 3 year (~£4M) project: –1 yr Phase A Study – finished end of 2002 –2 yr Phase B Implementation – to end 2004 Web (later Grid) service framework; in Java Web (later Grid) service framework; in Java Currently building web services, portals, etc - researching OGSA and OGSA-DAI
Large sky survey databases Major science driver for AstroGrid – and VO –New science – mining multi-wavelength data Largest are optical/near-infrared sky surveys Largest of these hosted in Edinburgh: –current - SuperCOSMOS, SDSS (mirror) –future - WFCAM, VISTA –Each yield 1-10TB of catalogue data in RDBMS
Spatial queries in astronomy Two important types: –Select entries (with predicate) in area of sky –Match entries (esp. between two tables) Second is special case of first –i.e. both boil down to “point-within-distance-of- point” –but distances in two cases can be very different Advantage in using a hierarchical spatial indexing scheme –Perform spatial query at appropriate granularity
Spatial Indexing in Astronomy The Celestial Sphere Many coordinate systems Most common is the equatorial system, with Right Ascension and Declination as analogues Declination as analogues of Longitude & Latitude of Longitude & Latitude
Spatial indexing in astronomical databases Basic DBMS indexes are 1-D – e.g. B-trees Some DBMSs support general 2-D indexing –Usually using R-trees (or variants) – rectangles: astronomical experiments not too successful: [Clive] Some DBMSs have native spatial indexing –Little knowledge of this in astronomy - want to know more But The Celestial Sphere is a sphere(!) –Many geographical spatial DBs use planar projections So, astronomers have felt the need to develop spatial indexing prescriptions of their own
Hierarchical Triangular Mesh - HTM Developed by Sloan survey archive team at JHU Start with projection of octahedron on sphere and subdivide triangles at their midpoints Generate unique pixel ID code based on position in the sky and level in hierarchy – can index that with B-tree
Hierarchical Equal Area Iso- Latitude Pixelisation (HEALPix) Developed by Kris Gorski (now JPL/Caltech) Start with division of sphere into twelve equal area curvilinear quadrilaterals, then divide each into four then divide each into four Like HTM, produces a pixel code on which a pixel code on which a B-tree index can be made B-tree index can be made (Ian – HEALPix in Oracle?)
Sky survey DB case study: SkyServer for SDSS Sloan Digital Sky Survey (SDSS): –first of new generation of sky surveys US-led team, dedicated telescope & camera Image half of northern sky in 5 optical bands Then obtain optical spectra for 1,000,000 galaxies Estimated ~1TB of catalogue data
SDSS Archive First of new generation of sky survey archives –Represents the state-of-the-art in sky survey databases Developed by Alex Szalay’s team at Johns Hopkins Project started in earnest in about 1996 –OODBMSs seen as the coming thing –SDSS chose Objectivity/DB for their archive: ~15 staff-years of effort later, they’d rewritten much of the DBMS themselves…and then jumped ship and started using MS SQL Server! - SkyServer (in collaboration with Jim Gray, MS Research)
SkyServer design considerations Power & flexibility to pose arbitrary queries Simple – astronomers ignorant of SQL! Hide messy spherical trigonometry –Distance on sphere between (a1,d1) and (a2,d2) is given in SQL by 2.0*asin(sqrt(square(sin(0.5*(radians(d1-d2)))) + cos(radians(d1))*cos(radians(d2))* square(sin(0.5*(radians(a1-a2))))) 2.0*asin(sqrt(square(sin(0.5*(radians(d1-d2)))) + cos(radians(d1))*cos(radians(d2))* square(sin(0.5*(radians(a1-a2))))) –Don’t want users typing this –Don’t really want DBMS to evaluate expressions like this often
SkyServer spatial queries Simple table-valued functions exposed to user: –E.g. select count(*) from fGetNearbyObjEq(a,d,radius) from fGetNearbyObjEq(a,d,radius) (a,d)=(Right Ascension, Declination) Functions call SQL Server Extended Stored Procedure –HTM index manipulation routines, implemented in a Dynamically Linked Library (DLL) –DLL generated from HTM package in C++
Lessons from HTM implementation in SkyServer SQL is not great for spherical trigonometry –Messy to write, slow to compute Have to define stored procedures/functions –Expose a clean interface to users –Let them pose queries the way they want to Replace trig operations by integer arithmetic –Library of HTM index operations underneath Precompute tables of neighbouring objects –Far fewer spatial match operations at query time
Problems with this approach How easy to develop stored procedures, etc? –Needs detailed knowledge of DBMS –Extended Stored Procedure calls slow How well will query optimiser use HTM? –…less well than built-in spatial index?… …but that might be poorly suited to astronomical applications… How easy to implement all this in DBMSs other than SQL Server? But this works reasonably well in practice!