Public Access to Large Astronomical Datasets
Alex Szalay, Johns Hopkins
Jim Gray, Microsoft Research



Outline
Trends
The Sloan Digital Sky Survey
  – The ‘Cosmic Genome Project’
The SDSS database design
The World-Wide Telescope
  – Virtual Observatory: Federating archives over the world
Exploring Web Services
  – Sky Query, Image Cutout

Living in an Exponential World
Astronomers have a few hundred TB now
  – 1 pixel (byte) / sq arc second ~ 4TB
  – Multi-spectral, temporal, … → 1PB
They mine it looking for
  – new (kinds of) objects or more of interesting ones (quasars)
  – density variations in 400-D space
  – correlations in 400-D space
Data doubles every year
Data is public after 1 year
So, 50% of the data is public
Some have private access to 5% more data
So: 50% vs 55% access for everyone
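The 50% figure follows directly from the two rates on the slide. A minimal sketch of that arithmetic, assuming the total volume grows as a clean exponential and that every dataset is released exactly one embargo period after collection (the function name and parameters are illustrative, not from the talk):

```python
def public_fraction(doubling_time_years: float, embargo_years: float) -> float:
    """Fraction of all data collected so far that is already public.

    Assumes total volume grows as 2 ** (t / doubling_time_years) and that
    data becomes public embargo_years after it is collected, so the public
    share is the total volume as it stood one embargo period ago.
    """
    return 2.0 ** (-embargo_years / doubling_time_years)


# Slide values: data doubles every year and is public after one year.
share = public_fraction(doubling_time_years=1.0, embargo_years=1.0)
print(f"public share of all data: {share:.0%}")
```

Under the same assumptions, a shorter embargo or a slower doubling rate would raise the public share, since less of the total would be recent.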

Science is hitting a wall
FTP and GREP are not adequate
  – You can GREP 1 MB in a second
  – You can GREP 1 GB in a minute
  – You can GREP 1 TB in 2 days
  – You can GREP 1 PB in 3 years.
  – You can FTP 1 MB in 1 sec
  – You can FTP 1 GB / min (= 1 $/GB)
  – … 2 days and 1K$
  – … 3 years and 1M$
Oh!, and 1PB ~10,000 disks
At some point you need
  – indices to limit search
  – parallel data search and analysis
This is where databases can help
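The scaling behind these numbers can be sketched in a few lines. This is a back-of-envelope illustration only: the 10 MB/s sequential rate is an assumed value chosen to land near the slide's large-volume figures, and the parallel case assumes the data is spread evenly over the disks with no coordination overhead.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

ASSUMED_RATE_BYTES_PER_S = 10e6  # assumption: 10 MB/s sequential scan rate


def scan_seconds(size_bytes: float, bytes_per_second: float, n_disks: int = 1) -> float:
    """Time to read size_bytes once, split evenly across n_disks scanned in parallel."""
    return size_bytes / (bytes_per_second * n_disks)


one_tb = scan_seconds(1e12, ASSUMED_RATE_BYTES_PER_S)
one_pb = scan_seconds(1e15, ASSUMED_RATE_BYTES_PER_S)
one_pb_parallel = scan_seconds(1e15, ASSUMED_RATE_BYTES_PER_S, n_disks=10_000)

print(f"1 TB, one stream:    {one_tb / SECONDS_PER_DAY:.1f} days")
print(f"1 PB, one stream:    {one_pb / SECONDS_PER_YEAR:.1f} years")
print(f"1 PB, 10,000 disks:  {one_pb_parallel / 3600:.1f} hours")
```

The last line is the point of the slide's call for parallel search: a full scan is only practical when every disk is read at once, and an index avoids most of the scan altogether.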

Making Discoveries
When and where are discoveries made?
  – Always at the edges and boundaries
  – Going deeper, using more colors….
Metcalfe’s law
  – Utility of computer networks grows as the number of possible connections: O(N²)
VO: Federation of N archives
  – Possibilities for new discoveries grow as O(N²)
Current sky surveys have proven this
  – Very early discoveries from SDSS, 2MASS, DPOSS
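The O(N²) growth is simply the count of distinct archive pairs that can be cross-compared. A small sketch, where the function name is illustrative and the count covers pairwise combinations only:

```python
def possible_pairings(n_archives: int) -> int:
    """Number of distinct pairs among n_archives federated archives: N(N-1)/2."""
    return n_archives * (n_archives - 1) // 2


# Doubling the number of archives roughly quadruples the pairings.
for n in (3, 6, 12):
    print(f"{n} archives give {possible_pairings(n)} possible pairings")
```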

Publishing Data

Roles        Traditional   Emerging
Authors      Scientists    Collaborations
Publishers   Journals      Project www site
Curators     Libraries     Bigger Archives
Consumers    Scientists    Scientists

Changing Roles
Exponential growth:
  – Projects last at least 3-5 years
  – Data sent upwards only at the end of the project
  – Data will never be centralized
More responsibility on projects
  – Becoming Publishers and Curators
  – Larger fraction of budget spent on software
  – A lot of development duplicated, wasted
  – All documentation is contained in the archive
More standards are needed
  – Easier data interchange, fewer tools
More templates are needed
  – Develop less software on your own

Emerging New Concepts
Standardizing distributed data
  – Web Services, supported on all platforms
  – Custom configure remote data dynamically
  – XML: Extensible Markup Language
  – SOAP: Simple Object Access Protocol
  – WSDL: Web Services Description Language
Standardizing distributed computing
  – Grid Services
  – Custom configure remote computing dynamically
  – Build your own remote computer, and discard
  – Virtual Data: new data sets on demand
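To make the XML and SOAP bullets concrete, here is a minimal sketch of how a SOAP 1.1 request envelope could be assembled with the Python standard library. The service namespace, the operation name `GetImageCutout`, and its parameters are hypothetical placeholders; they are not taken from the talk or from any real SDSS service description.

```python
import xml.etree.ElementTree as ET

# The SOAP 1.1 envelope namespace is standard.
SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace, used for illustration only.
SERVICE_NS = "http://example.org/skyservice"


def build_soap_request(operation: str, params: dict) -> bytes:
    """Wrap one operation call and its parameters in a SOAP 1.1 envelope."""
    ET.register_namespace("soap", SOAP_ENV_NS)
    envelope = ET.Element(f"{{{SOAP_ENV_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        # Each parameter becomes a child element of the operation element.
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


# Hypothetical operation and parameter names.
request = build_soap_request("GetImageCutout", {"ra": 181.3, "dec": -0.76})
print(request.decode("utf-8"))
```

In practice a WSDL document would describe the available operations and their parameter types, so a client toolkit could generate calls like this one automatically instead of building the XML by hand.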

Features of the SDSS
Goal
  – Create the most detailed map of the Northern sky in 5 years
2.5m telescope, Apache Point, NM
  – 3 degree field of view
  – ¼ of the whole sky
Two surveys in one
  – Photometric survey in 5 bands
  – Spectroscopic redshift survey
Automated data reduction
  – 150 man-years of development
Very high data volume
  – 40 TB of raw data
  – 5 TB processed catalogs
  – Data is public

The University of Chicago
Princeton University
The Johns Hopkins University
The University of Washington
New Mexico State University
Fermi National Accelerator Laboratory
US Naval Observatory
The Japanese Participation Group
The Institute for Advanced Study
Max Planck Inst, Heidelberg
Sloan Foundation, NSF, DOE, NASA

The Imaging Survey
Continuous data rate of 8 Mbytes/sec
Northern Galactic Cap
  – drift scan of 10,000 square degrees
  – 24k x 1M pixel “panoramic” images in 5 colors – broad-band filters (u,g,r,i,z)
  – exposure time: 55 sec
  – pixel size: 0.4 arcsec
  – astrometry: 60 mas
  – calibration: 2%
  – done only in best seeing (20 nights/year)
Southern Galactic Cap
  – multiple scans (> 30 times) of the same stripe

The Spectroscopic Survey
Expanding universe
  – redshift = distance
SDSS Redshift Survey
  – 1 million galaxies
  – 100,000 quasars
  – 100,000 stars
Two high throughput spectrographs
  – spectral range Å
  – 640 spectra simultaneously
  – R=2000 resolution, 1.3 Å
Features
  – Automated reduction of spectra
  – Very high sampling density and completeness
Image: Elliptical galaxy
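The "redshift = distance" shorthand rests on Hubble's law, which for small redshifts gives distance as roughly c·z / H0. A minimal sketch, assuming a Hubble constant of 70 km/s/Mpc (an assumed round value, not a figure from the talk) and ignoring the cosmological corrections that matter at larger redshift:

```python
SPEED_OF_LIGHT_KM_S = 299_792.458
ASSUMED_H0_KM_S_MPC = 70.0  # assumption: round value for the Hubble constant


def hubble_distance_mpc(redshift: float, h0: float = ASSUMED_H0_KM_S_MPC) -> float:
    """Approximate distance in megaparsecs from the linear Hubble law.

    Only a reasonable approximation for redshift much smaller than 1.
    """
    return SPEED_OF_LIGHT_KM_S * redshift / h0


for z in (0.01, 0.05, 0.1):
    print(f"z = {z}: about {hubble_distance_mpc(z):.0f} Mpc")
```

This is why a redshift survey yields a three-dimensional map: each spectrum adds a distance estimate to an object's two sky coordinates.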

Data Flow
Pixel data collected by telescope
Sent to Fermilab for processing
Beowulf Cluster produces catalog
Loaded in a SQL database

Public Data Release
June 2002: EDR
  – Early Data Release
January 2003: DR1
  – Contains 30% of final data
  – 200 million photo objects
4 versions of the data
  – Target, best, runs, spectro
Total catalog volume 1.7TB
  – See Terascale sneakernet paper…
Published releases served forever
  – EDR, DR1, DR2, ….
  – Soon to include archives, annotations
  – O(N²) – only possible because of Moore’s Law!
Diagram labels: EDR, DR1, DR2, DR3

Why Is Astronomy Data Special?
It has no commercial value
  – No privacy concerns
  – Can freely share results with others
  – Great for experimenting with algorithms
It is real and well documented
  – High-dimensional (with confidence intervals)
  – Spatial
  – Temporal
Diverse and distributed
  – Many different instruments from many different places and many different times
The questions are interesting
There is a lot of it (petabytes)
Image panels: IRAS 100 µm, ROSAT ~keV, DSS Optical, 2MASS 2 µm, IRAS 25 µm, NVSS 20cm, WENSS 92cm, GB 6cm

Virtual Observatory
Many new surveys are coming
  – SDSS is a dry run for the next ones
  – LSST will be 1TB/night
All the data will be on the Internet
  – But how? ftp, webservice…
Data and apps will be associated with the instruments
  – Distributed world wide
  – Cross-indexed
  – Federation is a must, but how?
Will be the best telescope in the world
  – World Wide Telescope

SkyQuery: Experimental Federation
Federated 5 Web Services
  – Portal unifies 3 archives and a cutout service to visualize results
  – Fermilab/SDSS, JHU/FIRST, Caltech/2MASS Archives
  – Multi-survey spatial join and SQL select
  – Distributed query optimization (T. Malik, T. Budavari) in 6 weeks
Cutout web service: annotated SDSS images

    SELECT o.objId, o.ra, o.r, o.type, o.I, t.objId, t.j_m
    FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
    WHERE XMATCH(o,t)<3.5
      AND AREA(181.3,-0.76,6.5)
      AND o.type=3
      AND o.I - t.j_m > 2
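The heart of the query is the spatial join between two catalogs. The slide does not define how XMATCH scores a pair, so the sketch below substitutes a plain angular-separation cut in arcseconds; that substitution is an assumption made for illustration and should not be read as SkyQuery's actual matching criterion. The catalog rows and identifiers are made up.

```python
import math


def angular_separation_arcsec(ra1: float, dec1: float, ra2: float, dec2: float) -> float:
    """Great-circle separation of two sky positions given in degrees (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2)
    sin_dra = math.sin((ra2 - ra1) / 2)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a)))) * 3600


def cross_match(left, right, max_sep_arcsec):
    """Naive O(n*m) spatial join of two lists of (id, ra, dec) rows."""
    matches = []
    for l_id, l_ra, l_dec in left:
        for r_id, r_ra, r_dec in right:
            if angular_separation_arcsec(l_ra, l_dec, r_ra, r_dec) < max_sep_arcsec:
                matches.append((l_id, r_id))
    return matches


# Made-up rows near the AREA centre used in the slide's query.
optical = [("o1", 181.3, -0.76)]
infrared = [("t1", 181.3, -0.76 + 2 / 3600), ("t2", 181.3, -0.76 + 10 / 3600)]
print(cross_match(optical, infrared, max_sep_arcsec=3.5))
```

The nested loop is only workable for tiny inputs. For catalogs with hundreds of millions of objects, a federated system would need spatial indexing and careful ordering of which archive is queried first, which is presumably where the distributed query optimization bullet comes in.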

Summary
The data is public and largely self-documenting
  – Get your own copy!
The SDSS database and web app are interesting
  – Data mining challenge
  – Data visualization challenge
  – Educational challenge
  – Web services ‘poster-child’
Information at your fingertips
  – Students see the same data as professional astronomers
More data coming
  – 1.7 TB+ public data by Jan 2003, 6TB+ coming
The World-Wide Telescope
  – Federating the astronomy archives is a CS challenge