LSST, the Spatial Cross-Match Challenge
Maria Nieto-Santisteban, Alexander Szalay, Ani Thakar (The Johns Hopkins University)
Jim Gray (Microsoft Research)
What is Cross-Matching?
- Identify point(s) in A with point(s) in B
- Cones: find points near one point. Distance from a few arcseconds to a few degrees
- Neighborhood: find points near points. Distance from a few arcseconds to a very few arcminutes
- Decide whether those points share more than just their position

Points: a cone search looks around a single position, whereas an all-to-all match pairs points within an arbitrary distance. Radius: cone searches may range from small to very big; all-to-all matches tend to be small, otherwise we face combinatorial explosion. It is not a matter of radius but of density, that is, the number of objects. Whether it is better to use cones or neighbors may depend on the ratio between the cardinalities and on the relative dispersion.
Maria A. Nieto-Santisteban / ADASS 2006
Zones
- Bin the data: ZoneID = floor((dec + 90.0) / zoneHeight)
- Place the data close together on disk: clustered index on (ZoneID, RA)
- A trick is required to handle the RA wraparound at (360, 0)
- Efficient: cones, and neighbors especially
- Useful: partition the data, distribute the workload
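The binning rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming `dec` and the zone height are both in degrees; the function name `zone_id` is hypothetical and not taken from the talk.

```python
import math

def zone_id(dec_deg: float, zone_height_deg: float) -> int:
    """ZoneID = floor((dec + 90.0) / zoneHeight), with both arguments in degrees."""
    return math.floor((dec_deg + 90.0) / zone_height_deg)

# Zones are horizontal stripes of constant height, counted up from the south pole.
half_degree = 0.5
for dec in (-90.0, -89.5, 0.0, 45.0):
    print(dec, zone_id(dec, half_degree))
```

Because the zone number depends only on declination, objects in the same stripe share a ZoneID and can be stored next to each other, ordered by RA.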
Zone Table
ObjID ZoneID* RA Dec CX CY CZ …
1 0.0 -90.0
2 20250 180.0
3 181.0
4 40500 360.0 +90.0
* Using a zone height of 8 arcsec in this example
Declination vs. Zone, RA
[Figure: two panels showing the on-disk order of catalog objects. Left panel: "Order by Dec". Right panel: "Order by RA within Zone".]

This slide shows two ways of ordering ("clustering") catalog data on disk: by DECLINATION or by (ZONE, RA).

(WHAT DO THE NUMBERS AND THEIR LOCATIONS MEAN?) The numbers in the figures are indexes that indicate the order on the disk. The positions of the indexes in the figure indicate the location of the object in a simulated LSST image of the Galactic center. The LSST image extends 3 degrees to the right and up off the top of the screen.

(WHAT PROBLEM ARE WE TRYING TO SOLVE?) Now imagine we want to find neighbors within 8 arcsec of the object in blue.

(WHAT IS THE WORST APPROACH?) The worst approach would be to calculate the distance from the blue object to all 2.5 million sources in the image. This would be very expensive because only 7 out of the 2.5 million distances would be within 8 arcsec.

(WHAT ABOUT ORDERING BY DECLINATION? FEWER CALCULATIONS.) We can do much better by using DECLINATION as a cluster index, as in the figure on the left. The indexes increase from bottom to top in the figure. There are big jumps between neighbors because stars off the right edge have intermediate declinations. Using the declination index, we limit the search range to +/-8 arcsec, as shown by the horizontal red lines. Then we only have to calculate distances for about 8000 sources, which is much better than 2.5 million.

(WHAT ABOUT ORDERING BY ZONE AND RA? THE FEWEST CALCULATIONS.) We can do even better by using ZONE and RA as a cluster index, as in the figure on the right. The zone index increases in coarse bins from the bottom of the figure to the top. Within each zone, the index increases monotonically from left to right. See how the indexes go 1, 2, 3, 4, … in the bottom zone, then 3925, 3926, … in the next zone, and so on. Using the cluster index, we limit the search to a (different) narrow range in right ascension for each zone. We only have to calculate distances for 8 objects, which is much better than 8000 and much, much better than 2.5 million.

(DISK I/O AND SEEK TIME ARE ALSO BETTER WITH ZONES.) The other key point is that with a zone index, data for the neighbors are concentrated in a few disk blocks. With the DECLINATION index, data for the neighbors are spread over many disk blocks, with only one neighbor per disk block. In other words, the ZONE approach requires less CPU, less physical I/O, and less seeking by the disk.
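The candidate reduction described in these notes can be imitated in Python with a sorted list standing in for the clustered (zone, RA) index. This is a toy sketch, not the talk's implementation: the grid catalog, the function names, and the zone height are all assumptions, and it ignores both the RA wraparound at 360/0 and the widening of the RA window away from the equator.

```python
import bisect
import math

def zone_id(dec, zone_height):
    """ZoneID = floor((dec + 90.0) / zoneHeight), all in degrees."""
    return math.floor((dec + 90.0) / zone_height)

def ang_sep(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (spherical law of cosines)."""
    a1, d1, a2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    c = (math.sin(d1) * math.sin(d2)
         + math.cos(d1) * math.cos(d2) * math.cos(a1 - a2))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

def build_zone_index(points, zone_height):
    """Sort (ra, dec) points by (zone, ra), mimicking a clustered index."""
    return sorted((zone_id(dec, zone_height), ra, dec) for ra, dec in points)

def candidates(index, ra, dec, radius, zone_height):
    """Rows in the zones touched by the search circle with RA in [ra - radius, ra + radius].

    Simplification: valid only near the equator and away from RA = 0/360.
    """
    first = zone_id(dec - radius, zone_height)
    last = zone_id(dec + radius, zone_height)
    found = []
    for z in range(first, last + 1):
        lo = bisect.bisect_left(index, (z, ra - radius, -math.inf))
        hi = bisect.bisect_right(index, (z, ra + radius, math.inf))
        found.extend(index[lo:hi])
    return found

def neighbors(index, ra, dec, radius, zone_height):
    """Apply the exact distance test to the candidates only."""
    return [(r, d) for _, r, d in candidates(index, ra, dec, radius, zone_height)
            if ang_sep(ra, dec, r, d) <= radius]

# Toy catalog: a 100 x 100 grid with 0.01 degree spacing around the equator.
points = [(10 + i * 0.01, j * 0.01) for i in range(100) for j in range(-50, 50)]
index = build_zone_index(points, zone_height=0.05)
query_ra, query_dec = 10 + 50 * 0.01, 0.0

cand = candidates(index, query_ra, query_dec, 0.025, 0.05)
near = neighbors(index, query_ra, query_dec, 0.025, 0.05)
print(len(points), len(cand), len(near))
```

The exact distance is computed only for the handful of rows returned by `candidates`, which mirrors the drop from millions of distance calculations to a few in the notes above.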
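“Circular” Regions Near the Poles
$d = \cos^{-1}\{\sin(\delta_1)\sin(\delta_2) + \cos(\delta_1)\cos(\delta_2)\cos(\alpha_1 - \alpha_2)\}$
Here $d$ is the angular distance between points 1 and 2, $\delta$ is declination, and $\alpha$ is right ascension.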
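A short Python sketch can show why a search circle stops looking circular in (RA, Dec) near the poles. The first function is the distance formula above. The second is an assumption added for illustration, not a formula from the talk: it uses the standard spherical-geometry result that a circle of radius r centered at declination dec reaches an RA offset of asin(sin r / cos dec). Both function names are hypothetical.

```python
import math

def angular_distance_deg(ra1, dec1, ra2, dec2):
    """d = acos(sin(dec1) sin(dec2) + cos(dec1) cos(dec2) cos(ra1 - ra2)), in degrees."""
    a1, d1, a2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    c = (math.sin(d1) * math.sin(d2)
         + math.cos(d1) * math.cos(d2) * math.cos(a1 - a2))
    # Clamp to [-1, 1] to guard against rounding just outside the acos domain.
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

def ra_half_width_deg(radius_deg, dec_deg):
    """Largest RA offset reached by a circle of the given radius centered at dec_deg."""
    if abs(dec_deg) + radius_deg >= 90.0:
        return 180.0  # the circle touches or contains a pole: every RA is possible
    r, d = math.radians(radius_deg), math.radians(dec_deg)
    return math.degrees(math.asin(math.sin(r) / math.cos(d)))

# The RA window for a 1 degree circle grows as the center moves toward the pole.
for dec in (0.0, 60.0, 85.0, 89.5):
    print(dec, ra_half_width_deg(1.0, dec))

# Two points on opposite sides of the pole: 180 degrees apart in RA, close on the sky.
print(angular_distance_deg(10.0, 89.0, 190.0, 89.0))
```

A fixed RA window that is adequate at the equator therefore misses neighbors at high declination, which is why the RA range has to be computed per zone.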
SQL CrossNeighbors
    -- @r is the match radius in degrees; ZZ.alpha is the RA half-width for the zone pair
    SELECT *
    FROM prObj1 z1
    JOIN zoneZone ZZ ON ZZ.zoneID1 = z1.zoneID
    JOIN prObj2 z2 ON ZZ.zoneID2 = z2.zoneID
    WHERE z2.ra BETWEEN z1.ra - ZZ.alpha AND z1.ra + ZZ.alpha
      AND z2.dec BETWEEN z1.dec - @r AND z1.dec + @r
      AND (z1.cx*z2.cx + z1.cy*z2.cy + z1.cz*z2.cz) > cos(radians(@r))
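The final predicate of the query compares a dot product of unit vectors with the cosine of the match radius. The Python sketch below illustrates that test in isolation. It assumes the usual convention cx = cos(dec) cos(ra), cy = cos(dec) sin(ra), cz = sin(dec), which the slides do not spell out, and the function names are hypothetical.

```python
import math

def unit_vector(ra_deg, dec_deg):
    """Cartesian unit vector (cx, cy, cz) for a sky position given in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def within_radius(p1, p2, radius_deg):
    """True when the angle between p1 and p2 is smaller than radius_deg.

    The dot product of two unit vectors is the cosine of the angle between
    them, so "angle < r" becomes "dot > cos(r)" without any inverse trig.
    """
    v1, v2 = unit_vector(*p1), unit_vector(*p2)
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot > math.cos(math.radians(radius_deg))

one_arcsec = 1.0 / 3600.0
print(within_radius((10.0, 20.0), (10.0, 20.0002), one_arcsec))
print(within_radius((10.0, 20.0), (10.0, 20.001), one_arcsec))
print(within_radius((359.9999, 0.0), (0.0001, 0.0), one_arcsec))
```

The last call pairs two points on either side of RA = 0. The vector test needs no special case there, whereas the RA range filter in the query does.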
Number of Rows in LSST Catalogs

                         Single Exposure   Single Night   End of Survey
Objects                  N/A                              5×10^10
Variable Objects         10^5              10^8           3×10^8
Source Detections        3×10^6            3×10^9         8×10^12
DIA Source Detections    (10^5)            (10^8)         3×10^11

- Data access will require good data organization
- Data partitioned and placed according to their position in the sky

Sources: Single Night = DR20 / 3000, because there are 300 nights/year × 10 years = 3000 nights in the survey.
Sources: Single Exposure = Single Night / 900, because there are 900 exposures/night. A night has 10 hours = 36000 seconds and each exposure takes 40 seconds (15+3+15+5 [+2]), so there are 36000/40 = 900 exposures/night.
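The scaling in the notes can be checked with a few lines of Python. This sketch reads the end-of-survey Source Detections entry as 8×10^12, which is an interpretation of the extracted table; the variable names are illustrative.

```python
# Survey length in nights.
nights_per_year = 300
survey_years = 10
nights = nights_per_year * survey_years                     # 3000

# Exposures per night: 10 hours of observing, 40 seconds per exposure.
seconds_per_night = 10 * 3600                               # 36000
seconds_per_exposure = 15 + 3 + 15 + 5 + 2                  # 40
exposures_per_night = seconds_per_night // seconds_per_exposure  # 900

# Scale the end-of-survey source count down to one night and one exposure.
sources_end_of_survey = 8e12
sources_per_night = sources_end_of_survey / nights
sources_per_exposure = sources_per_night / exposures_per_night

print(nights, exposures_per_night)
print(f"{sources_per_night:.2e} {sources_per_exposure:.2e}")
```

Both results round to the table's order-of-magnitude entries for Source Detections per night and per exposure.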
LSST Cross-Match Challenges
- Issue alerts within 60 seconds. Challenge: heavily time constrained
- Nightly pipeline @ archive. Challenge: database consistency
- Deep processing. Challenge: volume of data to process, association complexity
- User queries. Challenge: many users, many types of users, many types of queries, a lot of data to look through
Alert Processing
1. Start the alert clock when the 2nd exposure ends. 3-second readout while slewing to the next field
2. Calibrate images (dark subtract, flat field). 201 CCDs = 3.2 Gpixel
3. Difference image analysis. Identify and extract variable sources
4. Cross-match with the object catalog. Distinguish known variables and new objects

Point: the time to do the cross-match and issue the alert is a small fraction of the 60-second budget.
Alerts Data Flow
[Flow diagram: DIA Sources (128K) are checked in sequence against three catalogs, each with a Yes/No decision: "Known Variable?" against the Variable Catalog (125K), "Known Object?" against the Deep Catalog (6M), and "Known Mover?" against Moving Objects (size shown as "?"). The counts 128K, 3K, and 1K appear along the flow, and the branches end in "Alert?" decisions.]
- Alerts trigger on DIA sources
- Variables are a small fraction of all objects
- Examples: cataclysmic variable, supernova, gamma-ray burst
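The sequence of catalog checks in this flow can be sketched as a cascade in which each stage only sees the sources the previous stage failed to match. The Python below is a hypothetical illustration: the function name, the integer IDs, and the set-membership test stand in for a real positional cross-match, and the sketch does not decide which outcomes raise an alert.

```python
def cascade_match(dia_sources, catalogs):
    """Route each source to the first catalog that recognizes it.

    catalogs: list of (name, is_known) pairs, tried in order, where
    is_known(source) returns True on a match.
    Returns a dict of lists keyed by catalog name, plus "unmatched".
    """
    routed = {name: [] for name, _ in catalogs}
    remaining = list(dia_sources)
    for name, is_known in catalogs:
        still_unmatched = []
        for src in remaining:
            if is_known(src):
                routed[name].append(src)
            else:
                still_unmatched.append(src)
        remaining = still_unmatched  # later, larger catalogs see fewer sources
    routed["unmatched"] = remaining
    return routed

# Toy data: integer IDs stand in for sky positions.
known_variables = {1, 2, 3}
deep_catalog = {1, 2, 3, 4, 5, 6, 7}
known_movers = {8}

result = cascade_match(
    [1, 2, 4, 8, 9],
    [("known_variable", known_variables.__contains__),
     ("known_object", deep_catalog.__contains__),
     ("known_mover", known_movers.__contains__)],
)
print(result)
```

Checking the small variable catalog first means the much larger deep catalog is probed only for the few sources that remain.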
Alert Simulation for Galactic Center
- Extrapolate USNO-B
- LSST FOV of 10 deg²
- 6 million stars (DR20)
- 126K variable stars
- 128K DIA sources
- 3200 new variables
- 1000 unmatched: moving objects, new objects, transients (GRB)
- Match distance = 1 arcsec
Alert Cross-Match Performance
- We can detect alerts in 40 seconds on my desktop computer.
- Partition by FOVs
Summary
- Cone search != Neighbors
- Zones efficiently index and "join" spatial data, e.g., SDSS DR5 vs. 2MASS in 80 minutes
- Zones are convenient for partitioning data
- Simulated an LSST FOV in the Galactic Center
- Cross-match catalogs smallest to largest
- Finds possible alerts in 40 sec on a desktop