SkyServer: Public Access to the Sloan Digital Sky Survey Alex Szalay, Jim Gray, Ani Thakar, Peter Kunszt, Tanu Malik, Tamas Budavari, Jordan Raddick, Chris.

Slides:



Advertisements
Similar presentations
Microsoft Research Microsoft Research Jim Gray Distinguished Engineer Microsoft Research San Francisco SKYSERVER.
Advertisements

Trying to Use Databases for Science Jim Gray Microsoft Research
Online Science -- The World-Wide Telescope Archetype
World Wide Telescope mining the Sky using Web Services Information At Your Fingertips for astronomers Jim Gray Microsoft Research Alex Szalay Johns Hopkins.
1 Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research Talk at
1 Online Science -- The World-Wide Telescope as an Archetype Jim Gray Microsoft Research Collaborating with: Alex Szalay, Peter Kunszt, Ani
1 Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
1 Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
Global Hands-On Universe meeting July 15, 2007 Authentic Data in the Classroom with the Sloan Digital Sky Survey Jordan Raddick (Johns Hopkins University)
Scientific Collaborations in a Data-Centric World Alex Szalay The Johns Hopkins University.
Eötvös University Budapest in the Network.  Seniors: István Csabai (node coordinator): »Photometric redshift estimation, virtual observatories, science.
László Dobos 1,2, Tamás Budavári 2, Nolan Li 2, Alex Szalay 2, István Csabai 1 1 Eötvös Loránd University, Budapest,
1 Online Science The World-Wide Telescope Jim Gray Microsoft Research Collaborating with: Alex Szalay, Tamas Budavari, Tanu Malik Ani JHU George.
Jim Gray, Online Science: Onassis Foundation Science Lecture Series, FORTH, Heraklion, Crete, Greece, 17 July Online Science -- The World-Wide.
Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
20 Spatial Queries for an Astronomer's Bench (mark) María Nieto-Santisteban 1 Tobias Scholl 2 Alexander Szalay 1 Alfons Kemper 2 1. The Johns Hopkins University,
A Web service for Distributed Covariance Computation on Astronomy Catalogs Presented by Haimonti Dutta CMSC 691D.
Sloan Digital Sky Survey Astronomy April 2006 Margaret Flynn.
SDSS Web Services Tamás Budavári Johns Hopkins University Coding against the Universe.
Teaching Science with Sloan Digital Sky Survey Data GriPhyN/iVDGL Education and Outreach meeting March 1, 2002 Jordan Raddick The Johns Hopkins University.
Alex Szalay Department of Physics and Astronomy The Johns Hopkins University The Sloan Digital Sky Survey.
Digitized Sky Survey Update Brian McLean : Archive Sciences Branch / Operations and Engineering Division.
Introduction to Sky Survey Problems Bob Mann. Introduction to sky survey database problems Astronomical data Astronomical databases –The Virtual Observatory.
Data-Intensive Science at Johns Hopkins University Institute for Data-Intensive Engineering and Science (IDIES) Johns Hopkins University Jordan Raddick.
Sky Surveys and the Virtual Observatory Alex Szalay The Johns Hopkins University.
Spatial Indexing and Visualizing Large Multi-dimensional Databases I. Csabai, M. Trencséni, L. Dobos, G. Herczegh, P. Józsa, N. Purger Eötvös University,
Supported by the National Science Foundation’s Information Technology Research Program under Cooperative Agreement AST with The Johns Hopkins University.
Hopkins Storage Systems Lab, Department of Computer Science A Workload-Driven Unit of Cache Replacement for Mid-Tier Database Caching Xiaodan Wang, Tanu.
National Center for Supercomputing Applications Observational Astronomy NCSA projects radio astronomy: CARMA & SKA optical astronomy: DES & LSST access:
1 New Frontiers with LSST: leveraging world facilities Tony Tyson Director, LSST Project University of California, Davis Science with the 8-10 m telescopes.
Alex Szalay, Jim Gray Analyzing Large Data Sets in Astronomy.
Spatial Indexing of large astronomical databases László Dobos, István Csabai, Márton Trencséni ELTE, Hungary.
1 The Terabyte Analysis Machine Jim Annis, Gabriele Garzoglio, Jun 2001 Introduction The Cluster Environment The Distance Machine Framework Scales The.
1 Four Talks The article we actually wrote Online Scientific Publication, Curation, Archiving The advertised talk Computer Science Challenges in the VO.
Alex Szalay Department of Physics and Astronomy The Johns Hopkins University and the SDSS Project The Sloan Digital Sky Survey.
1 Where The Rubber Meets the Sky Giving Access to Science Data Talk at National Institute of Informatics, Tokyo, Japan October 2005 Jim Gray Microsoft.
Future Directions of the VO Alex Szalay The Johns Hopkins University Jim Gray Microsoft Research.
Public Access to Large Astronomical Datasets Alex Szalay, Johns Hopkins Jim Gray, Microsoft Research.
EÖTVÖS UNIVERSITY BUDAPEST Department of Physics of Complex Systems VO Spectroscopy Workshop, ESAC Spectrum Services 2007 László Dobos (ELTE)
Designing and Mining Multi-Terabyte Astronomy Archives: The Sloan Digital Sky Survey Alexander S. Szalay, Peter Z. Kunszt, Ani Thakar Dept. of Physics.
LSST: Preparing for the Data Avalanche through Partitioning, Parallelization, and Provenance Kirk Borne (Perot Systems Corporation / NASA GSFC and George.
Issues in Large Online Image Databases Jim Gray Microsoft Research National Cancer Institute Workshop on Cancer Imaging Informatics
Science In An Exponential World Alexander Szalay, JHU Jim Gray, Microsoft Reserach Alexander Szalay, JHU Jim Gray, Microsoft Reserach.
Indexing and Visualizing Multidimensional Data I. Csabai, M. Trencséni, L. Dobos, G. Herczegh, P. Józsa, N. Purger Eötvös University,Budapest.
Source: Alex Szalay. Example: Sloan Digital Sky Survey The SDSS telescope array is systematically mapping ¼ of the entire sky Discoveries are made by.
1 Online Science The World-Wide Telescope Jim Gray Microsoft Research Collaborating with: Alex Szalay, Tamas Budavari, Tanu Malik Ani JHU George.
The Sloan Digital Sky Survey ImgCutout: The universe at your fingertips Maria A. Nieto-Santisteban Johns Hopkins University
Sky Survey Database Design National e-Science Centre Edinburgh 8 April 2003.
Real Web Services Jim Gray Microsoft Research 455 Market St, SF, CA, Talk at Charles Schwab.
Kevin Cooke.  Galaxy Characteristics and Importance  Sloan Digital Sky Survey: What is it?  IRAF: Uses and advantages/disadvantages ◦ Fits files? 
Distributed Data Analysis & Dissemination System (D-DADS ) Special Interest Group on Data Integration June 2000.
Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
1 Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
January 23, 2016María Nieto-Santisteban – AISRP 2003 / Pittsburgh1 High-Speed Access for an NVO Data Grid Node María A. Nieto-Santisteban, Aniruddha R.
William O’Mullane/ Tannu Malik - JHU IVOA Cambridge May 12-16, 2003 SkyQuery.Net SKYQUERY Federated Database Query System (using WebServices)
IPHAS Early Data Release E. A. Gonzalez-Solares IPHAS Consortium AstroGrid National Astronomy Meeting, 2007.
1 Where The Rubber Meets the Sky Giving Access to Science Data Jim Gray Microsoft Research Alex.
Lecture 3 With every passing hour our solar system comes forty-three thousand miles closer to globular cluster 13 in the constellation Hercules, and still.
Building Peta-Byte Data Stores Jim Claus Shira Anniversary European Media Lab 12 February 2001.
Wide-field Infrared Survey Explorer (WISE) is a NASA infrared- wavelength astronomical space telescope launched on December 14, 2009 It’s an Earth-orbiting.
Microsoft Research San Francisco (aka BARC: bay area research center) Jim Gray Researcher Microsoft Research Scalable servers Scalable servers Collaboration.
Spatial Searches in the ODM. slide 2 Common Spatial Questions Points in region queries 1.Find all objects in this region 2.Find all “good” objects (not.
Online Science -- The World-Wide Telescope Archetype
Mining the Sky The World-Wide Telescope
Online Science -- The World-Wide Telescope as an Archetype
Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
BARC Scaleable Servers
Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
Rick, the SkyServer is a website we built to make it easy for professional and armature astronomers to access the terabytes of data gathered by the Sloan.
Presentation transcript:

SkyServer: Public Access to the Sloan Digital Sky Survey Alex Szalay, Jim Gray, Ani Thakar, Peter Kunszt, Tanu Malik, Tamas Budavari, Jordan Raddick, Chris Stoughton, Jan vandenBerg Sigmod 2002, Madison

Outline The Sloan Digital Sky Survey The SDSS database design –HTM – spatial queries –20 queries Demo of the SkyServer The next steps The World-Wide Telescope Web Services –Sky Query/Image Cutout

Goal Create the most detailed map of the Northern sky in 5 years 2.5m telescope, Apache Point, NM 3 degree field of view ¼ of the whole sky Two surveys in one Photometric survey in 5 bands Spectroscopic redshift survey Automated data reduction 150 man-years of development Very high data volume 40 TB of raw data 5 TB processed catalogs Data is public Features of the SDSS The University of Chicago Princeton University The Johns Hopkins University The University of Washington New Mexico State University Fermi National Accelerator Laboratory US Naval Observatory The Japanese Participation Group The Institute for Advanced Study Max Planck Inst, Heidelberg Sloan Foundation, NSF, DOE, NASA The University of Chicago Princeton University The Johns Hopkins University The University of Washington New Mexico State University Fermi National Accelerator Laboratory US Naval Observatory The Japanese Participation Group The Institute for Advanced Study Max Planck Inst, Heidelberg Sloan Foundation, NSF, DOE, NASA

Continuous data rate of 8 Mbytes/sec Northern Galactic Cap drift scan of 10,000 square degrees 24k x 1M pixel “panoramic” images in 5 colors – broad-band filters (u,g,r,i,z) exposure time: 55 sec pixel size: 0.4 arcsec astrometry: 60 mas calibration: 2% done only in best seeing (20 nights/year) Southern Galactic Cap multiple scans (> 30 times) of the same stripe The Imaging Survey

Expanding universe redshift = distance SDSS Redshift Survey 1 million galaxies 100,000 quasars 100,000 stars Two high throughput spectrographs spectral range Å 640 spectra simultaneously R=2000 resolution, 1.3 Å Features Automated reduction of spectra Very high sampling density and completeness The Spectroscopic Survey Elliptical galaxy

Pixel data collected by telescope Data Flow Sent to Fermilab for processing Beowulf Cluster produces catalog Loaded in a SQL database

Object catalog 6000 GB parameters of >10 8 objects Redshift Catalog 1 GB parameters of 10 6 objects Atlas Images1500 GB 5 color cutouts of >10 8 objects Spectra 60 GB in a one-dimensional form Derived Catalogs 20 GB clusters QSO absorption lines 4x4 Pixel All-Sky Map 60 GB heavily compressed Corrected Frames 15 TB Object catalog 6000 GB parameters of >10 8 objects Redshift Catalog 1 GB parameters of 10 6 objects Atlas Images1500 GB 5 color cutouts of >10 8 objects Spectra 60 GB in a one-dimensional form Derived Catalogs 20 GB clusters QSO absorption lines 4x4 Pixel All-Sky Map 60 GB heavily compressed Corrected Frames 15 TB SDSS Data Products

Spatial Data Access – SQL extension Szalay, Kunszt, Brunner Added Hierarchical Triangular Mesh (HTM) table-valued function for spatial joins Every object has a 20-deep Mesh ID Given a spatial definition, routine returns up to 10 covering triangles Spatial query is then up to 10 range queries Very fast: 10,000 triangles / second / cpu 2,2 2,1 2,0 2,3 2,3,0 2,3,1 2,3,22,3,3 2,2 2,1 2,0 2,32,2 2,1 2,0 2,3 2,3,0 2,3,1 2,3,22,3,3 2,3,0 2,3,1 2,3,22,3,3

20 Queries DB design started with “20 queries” in English These then dictated DB design –Spatial extensions, neighbors Then implemented in SQL –Heavy use of SP, UDF –All run in 10 mins, most under 1 min Tag tables –replaced by covering indices Sequential IO –The worst case, a full scan – reached 400MB/sec on Wintel

Q15: Fast Moving Objects Find near earth asteroids: Finds 3 objects in 11 minutes –(or 52 seconds with an index) Ugly, but consider the alternatives (c programs and files and time…) – SELECT r.objID as rId, g.objId as gId, dbo.fGetUrlEq(g.ra, g.dec) as url FROM PhotoObj r, PhotoObj g WHERE r.run = g.run and r.camcol=g.camcol and abs(g.field-r.field)<2 -- nearby -- the red selection criteria and ((power(r.q_r,2) + power(r.u_r,2)) > ) and r.fiberMag_r between 6 and 22 and r.fiberMag_r < r.fiberMag_g and r.fiberMag_r < r.fiberMag_i and r.parentID=0 and r.fiberMag_r < r.fiberMag_u and r.fiberMag_r < r.fiberMag_z and r.isoA_r/r.isoB_r > 1.5 and r.isoA_r> the green selection criteria and ((power(g.q_g,2) + power(g.u_g,2)) > ) and g.fiberMag_g between 6 and 22 and g.fiberMag_g < g.fiberMag_r and g.fiberMag_g < g.fiberMag_i and g.fiberMag_g < g.fiberMag_u and g.fiberMag_g < g.fiberMag_z and g.parentID=0 and g.isoA_g/g.isoB_g > 1.5 and g.isoA_g > the matchup of the pair and sqrt(power(r.cx -g.cx,2)+ power(r.cy-g.cy,2)+power(r.cz-g.cz,2))*(10800/PI())< 4.0 and abs(r.fiberMag_r-g.fiberMag_g)< 2.0

Demo of SkyServer Based on the TerraServer design Designed for high school students –Contains 150 hours of interactive courses Experiment for easy visual interfaces Opened June 5, 2001 After a year: –1.6M page views –60K visitors –4.7M page hits Added Web Services –Cutout –SkyQuery

Public Data Release June 2002: EDR –Early Data Release January 2003: DR1 –Contains 30% of final data –100 million photo objects 4 versions of the data –Target, best, runs, spectro Total catalog volume 1.7TB –See Terascale sneakernet paper… Published releases served forever –EDR, DR1, DR2, …. O(N 2 ) – only possible because of Moore’s Law! EDR DR1 DR2 DR3

Why Is Astronomy Data Special? It has no commercial value –No privacy concerns –Can freely share results with others –Great for experimenting with algorithms It is real and well documented –High-dimensional (with confidence intervals) –Spatial –Temporal Diverse and distributed –Many different instruments from many different places and many different times The questions are interesting There is a lot of it (petabytes) IRAS 100  ROSAT ~keV DSS Optical 2MASS 2  IRAS 25  NVSS 20cm WENSS 92cm GB 6cm

Living in an Exponential World Astronomers have a few hundred TB now –1 pixel (byte) / sq arc second ~ 4TB –Multi-spectral, temporal, … → 1PB They mine it looking for new (kinds of) objects or more of interesting ones (quasars), density variations in 400-D space correlations in 400-D space Data doubles every year Data is public after 1 year So, 50% of the data is public Some have private access to 5% more data So: 50% vs 55% access for everyone

Virtual Observatory Many new surveys are coming –SDSS is a dry run for the next ones –LSST will be 1TB/night All the data will be on the Internet –But how? ftp, webservice… Data and apps will be associated with the instruments –Distributed world wide –Cross-indexed –Federation is a must, but how? Will be the best telescope in the world –World Wide Telescope

SkyQuery: Experimental Federation Federated 5 Web Services –Portal unifies 3 archives and a cutout service to visualize results –Fermilab/SDSS, JHU/FIRST, Caltech/2MASS Archives –Multi-survey spatial join and SQL select –Distributed query optimization (T. Malik, T. Budavari) in 6 weeks Cutout web service: annotated SDSS images – SELECT o.objId, o.ra, o.r, o.type, t.objId FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t WHERE XMATCH(o,t)<3.5 AND AREA(181.3,-0.76,6.5) AND o.type=3 AND o.I – t.m_j > 2

Relevant Papers Data Mining the SDSS SkyServer Database Jim Gray; Peter Kunszt; Donald Slutz; Alex Szalay; Ani Thakar; Jan Vandenberg; Chris Stoughton Jan p. An earlier paper described the Sloan Digital Sky Survey’s (SDSS) data management needs [Szalay1] by defining twenty database queries and twelve data visualization tasks that a good data management system should support. We built a database and interfaces to support both the query load and also a website for ad-hoc access. This paper reports on the database design, describes the data loading pipeline, and reports on the query implementation and performance. The queries typically translated to a single SQL statement. Most queries run in less than 20 seconds, allowing scientists to interactively explore the database. This paper is an in-depth tour of those queries. Readers should first have studied the companion overview paper “The SDSS SkyServer – Public Access to the Sloan Digital Sky Server Data” [Szalay2]. SDSS SkyServer–Public Access to Sloan Digital Sky Server Data Jim Gray; Alexander Szalay; Ani Thakar; Peter Z. Zunszt; Tanu Malik; Jordan Raddick; Christopher Stoughton; Jan Vandenberg November p.: Word 1.46 Mbytes PDF 456 KbytesWordPDF The SkyServer provides Internet access to the public Sloan Digital Sky Survey (SDSS) data for both astronomers and for science education. This paper describes the SkyServer goals and architecture. It also describes our experience operating the SkyServer on the Internet. The SDSS data is public and well- documented so it makes a good test platform for research on database algorithms and performance. The World-Wide Telescope Jim Gray; Alexander Szalay August p.: Word 684 Kbytes PDF 84 KbytesWordPDF All astronomy data and literature will soon be online and accessible via the Internet. The community is building the Virtual Observatory, an organization of this worldwide data into a coherent whole that can be accessed by anyone, in any form, from anywhere. The resulting system will dramatically improve our ability to do multi-spectral and temporal studies that integrate data from multiple instruments. The virtual observatory data also provides a wonderful base for teaching astronomy, scientific discovery, and computational science. Designing and Mining Multi-Terabyte Astronomy Archives Robert J. Brunner; Jim Gray; Peter Kunszt; Donald Slutz; Alexander S. Szalay; Ani Thakar June p.: Word (448 Kybtes) PDF (391 Kbytes)WordPDF The next-generation astronomy digital archives will cover most of the sky at fine resolution in many wavelengths, from X-rays, through ultraviolet, optical, and infrared. The archives will be stored at diverse geographical locations. One of the first of these projects, the Sloan Digital Sky Survey (SDSS) is creating a 5- wavelength catalog over 10,000 square degrees of the sky (see The 200 million objects in the multi-terabyte database will have mostly numerical attributes in a 100+ dimensional space. Points in this space have highly correlated distributions. The archive will enable astronomers to explore the data interactively. Data access will be aided by multidimensional spatial and attribute indices. The data will be partitioned in many ways. Small tag objects consisting of the most popular attributes will accelerate frequent searches. Splitting the data among multiple servers will allow parallel, scalable I/O and parallel data analysis. Hashing techniques will allow efficient clustering, and pair-wise comparison algorithms that should parallelize nicely. Randomly sampled subsets will allow de-bugging otherwise large queries at the desktop. Central servers will operate a data pump to support sweep searches touching most of the data. The anticipated queries will re-quire special operators related to angular distances and complex similarity tests of object properties, like shapes, colors, velocity vectors, or temporal behaviors. These issues pose interesting data management challenges.

References and Links SkyServer – – Virtual Observatory – – World-Wide Telescope –paper in Science V.293 pp Sept (MS-TR word or pdf.)wordpdf.) SDSS DB is a data mining challenge: –Get your personal copy at