1 Where The Rubber Meets the Sky Giving Access to Science Data Jim Gray Microsoft Research Alex.

Slides:



Advertisements
Similar presentations
Closing the Gap Between Global Environmental Sensing Needs and Cyber Infrastructure Tools Jim Gray Jeff Burch Mark Ellisman Miron Livny David Maidment.
Advertisements

Microsoft Research Microsoft Research Jim Gray Distinguished Engineer Microsoft Research San Francisco SKYSERVER.
Trying to Use Databases for Science Jim Gray Microsoft Research
Online Science -- The World-Wide Telescope Archetype
World Wide Telescope mining the Sky using Web Services Information At Your Fingertips for astronomers Jim Gray Microsoft Research Alex Szalay Johns Hopkins.
Web Services for the Virtual Observatory Alex Szalay, Tamas Budavari, Tanu Malik, Jim Gray, and Ani Thakar SPIE, Hawaii, 2002 (Living in an exponential.
1 Online Science the New Computational Science Jim Gray Microsoft Research Alex Szalay Johns Hopkins.
1 Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research Talk at
1 Online Science -- The World-Wide Telescope as an Archetype Jim Gray Microsoft Research Collaborating with: Alex Szalay, Peter Kunszt, Ani
1 Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
1 Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
Global Hands-On Universe meeting July 15, 2007 Authentic Data in the Classroom with the Sloan Digital Sky Survey Jordan Raddick (Johns Hopkins University)
John Cunniffe Dunsink Observatory Dublin Institute for Advanced Studies Evert Meurs (Dunsink Observatory) Aaron Golden (NUI Galway) Aus VO 18/11/03 Efficient.
The Australian Virtual Observatory e-Science Meeting School of Physics, March 2003 David Barnes.
Astronomy Data Bases Jim Gray Microsoft Research.
Scientific Collaborations in a Data-Centric World Alex Szalay The Johns Hopkins University.
László Dobos 1,2, Tamás Budavári 2, Nolan Li 2, Alex Szalay 2, István Csabai 1 1 Eötvös Loránd University, Budapest,
1 Online Science The World-Wide Telescope Jim Gray Microsoft Research Collaborating with: Alex Szalay, Tamas Budavari, Tanu Malik Ani JHU George.
Astrophysics with Terabytes of Data Alex Szalay The Johns Hopkins University.
20 Spatial Queries for an Astronomer's Bench (mark) María Nieto-Santisteban 1 Tobias Scholl 2 Alexander Szalay 1 Alfons Kemper 2 1. The Johns Hopkins University,
A Web service for Distributed Covariance Computation on Astronomy Catalogs Presented by Haimonti Dutta CMSC 691D.
Data-Intensive Computing in the Science Community Alex Szalay, JHU.
SDSS Web Services Tamás Budavári Johns Hopkins University Coding against the Universe.
Astro-DISC: Astronomy and cosmology applications of distributed super computing.
Aus-VO: Progress in the Australian Virtual Observatory Tara Murphy Australia Telescope National Facility.
eScience -- A Transformed Scientific Method"
1 Where The Rubber Meets the Sky Giving Access to Science Data Jim Gray Microsoft Research Alex.
The Structure of the Universe AST 112. Galaxy Groups and Clusters A few galaxies are all by themselves Most belong to groups or clusters Galaxy Groups:
Introduction to Sky Survey Problems Bob Mann. Introduction to sky survey database problems Astronomical data Astronomical databases –The Virtual Observatory.
Sky Surveys and the Virtual Observatory Alex Szalay The Johns Hopkins University.
Amdahl Numbers as a Metric for Data Intensive Computing Alex Szalay The Johns Hopkins University.
Hopkins Storage Systems Lab, Department of Computer Science A Workload-Driven Unit of Cache Replacement for Mid-Tier Database Caching Xiaodan Wang, Tanu.
Big Data in Science (Lessons from astrophysics) Michael Drinkwater, UQ & CAASTRO 1.Preface Contributions by Jim Grey Astronomy data flow 2.Past Glories.
National Center for Supercomputing Applications Observational Astronomy NCSA projects radio astronomy: CARMA & SKA optical astronomy: DES & LSST access:
1 New Frontiers with LSST: leveraging world facilities Tony Tyson Director, LSST Project University of California, Davis Science with the 8-10 m telescopes.
Alex Szalay, Jim Gray Analyzing Large Data Sets in Astronomy.
Functions and Demo of Astrogrid 1.1 China-VO Haijun Tian.
Radio Galaxies and Quasars Powerful natural radio transmitters associated with Giant elliptical galaxies Demo.
1 Managing Data for the World Wide Telescope aka: The Virtual Observatory Jim Gray Alex Szalay SLAC Data Management Workshop.
1 Four Talks The article we actually wrote Online Scientific Publication, Curation, Archiving The advertised talk Computer Science Challenges in the VO.
Prototype system of the Japanese Virtual Observatory The Japanese Virtual Observatory (JVO) aims at providing easy access to federated astronomical databases.
1 Where The Rubber Meets the Sky Giving Access to Science Data Talk at National Institute of Informatics, Tokyo, Japan October 2005 Jim Gray Microsoft.
1 Online Science the New Computational Science Jim Gray Microsoft Research Alex Szalay Johns Hopkins.
Public Access to Large Astronomical Datasets Alex Szalay, Johns Hopkins Jim Gray, Microsoft Research.
Making the Sky Searchable: Automatically Organizing the World’s Astronomical Data Sam Roweis, Dustin Lang &
Fourth Paradigm Science-based on Data-intensive Computing.
The Data Avalanche Jim Gray Microsoft Research Talk at HP Labs/MSR: Research Day July 2004.
Designing and Mining Multi-Terabyte Astronomy Archives: The Sloan Digital Sky Survey Alexander S. Szalay, Peter Z. Kunszt, Ani Thakar Dept. of Physics.
Science In An Exponential World Alexander Szalay, JHU Jim Gray, Microsoft Reserach Alexander Szalay, JHU Jim Gray, Microsoft Reserach.
1 Online Science The World-Wide Telescope Jim Gray Microsoft Research Collaborating with: Alex Szalay, Tamas Budavari, Tanu Malik Ani JHU George.
The Sloan Digital Sky Survey ImgCutout: The universe at your fingertips Maria A. Nieto-Santisteban Johns Hopkins University
NVO Review -- San Diego Jan The VO compared to Other O‘s Jim Gray Microsoft T HE US N ATIONAL V IRTUAL O BSERVATORY.
1 Where The Rubber Meets the Sky Giving Access to Science Data Jim Gray Microsoft Research Alex.
AstroGrid NAM 2001 Andy Lawrence Cambridge NAM 2001 Andy Lawrence Cambridge Belfast Cambridge Edinburgh Jodrell Leicester MSSL.
Mireille Louys et al., OV-France Theory WG meeting, LYON, June DALIA : an observation vs simulation comparison frame work What could be a Model.
1 Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
Microsoft “information at your fingertips” for scientists Collaborating with Scientists to build better ways to organize, analyze, and understand.
Gijs Verdoes Kleijn Edwin Valentijn Marjolein Cuppen for the Astro-WISE consortium.
Microsoft Research San Francisco (aka BARC: bay area research center) Jim Gray Researcher Microsoft Research Scalable servers Scalable servers Collaboration.
Online Science -- The World-Wide Telescope Archetype
How much information? Adapted from a presentation by:
Online Science -- The World-Wide Telescope as an Archetype
Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
Jim Gray Alex Szalay SLAC Data Management Workshop
BARC Scaleable Servers
Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
Rick, the SkyServer is a website we built to make it easy for professional and armature astronomers to access the terabytes of data gathered by the Sloan.
Jim Gray Researcher Microsoft Research
Jim Gray Microsoft Research
Presentation transcript:

1 Where The Rubber Meets the Sky Giving Access to Science Data Jim Gray Microsoft Research Alex Szalay Johns Hopkins University

2 New Science Paradigms Thousand years ago: science was empirical describing natural phenomena Last few hundred years: theoretical branch using models, generalizations Last few decades: a computational branch simulating complex phenomena Today: data exploration (eScience) unify theory, experiment, and simulation using data management and statistics –Data captured by instruments Or generated by simulator –Processed by software –Scientist analyzes database / files

3 The Virtual Observatory Premise: most data is (or could be online) The Internet is the world’s best telescope: –It has data on every part of the sky –In every measured spectral band: optical, x-ray, radio.. –As deep as the best instruments (2 years ago). –It is up when you are up –The “seeing” is always great –It’s a smart telescope: links objects and data to literature Software is the capital expense –Share, standardize, reuse..

4 The Big Picture Data ingest Managing a petabyte Common schema How to organize it? How to reorganize it? How to coexist with others? Data Query and Visualization tools Support/training Performance –Execute queries in a minute –Batch (big) query scheduling The Big Problems Experiments & Instruments Simulations facts answers questions ? Literature Other Archives facts

5 What X-info Needs from us (cs) (not drawn to scale) Science Data & Questions Scientists Database To store data Execute Queries Plumbers Data Mining Algorithms Miners Question & Answer Visualization Tools

6 Data Access Hitting a Wall Current science practice based on data download (FTP/GREP) Will not scale to the datasets of tomorrow You can GREP 1 MB in a second You can GREP 1 GB in a minute You can GREP 1 TB in 2 days You can GREP 1 PB in 3 years. Oh!, and 1PB ~5,000 disks At some point you need indices to limit search parallel data search and analysis This is where databases can help You can FTP 1 MB in 1 sec You can FTP 1 GB / min (~1$) … 2 days and 1K$ … 3 years and 1M$

7 Next-Generation Data Analysis Looking for –Needles in haystacks – the Higgs particle –Haystacks: dark matter, dark energy, turbulence, ecosystem dynamics Needles are easier than haystacks Global statistics have poor scaling –Correlation functions are N 2, likelihood techniques N 3 As data and computers grow at Moore’s Law, we can only keep up with N logN A way out? –Relax optimal notion (data is fuzzy, answers are approximate) –Don’t assume infinite computational resources or memory Requires combination of statistics & computer science

8 Smart Data: Unifying DB and Analysis There is too much data to move around Do data manipulations at database –Build custom procedures and functions into DB –Unify data Access & Analysis –Examples Statistical sampling and analysis Temporal and spatial indexing Pixel processing Automatic parallelism Auto (re)organize Scalable to Petabyte datasets Move Mohamed to the mountain, not the mountain to Mohamed.

9 Experiment Budgets ¼…½ Software Software for Instrument scheduling Instrument control Data gathering Data reduction Database Analysis Visualization Millions of lines of code Repeated for experiment after experiment Not much sharing or learning Let’s work to change this Identify generic tools Workflow schedulers Databases and libraries Analysis packages Visualizers … Simulation (computational science) are > ½ software

10 How to Help? Can’t learn the discipline before you start (takes 4 years.) Can’t go native – you are a CS person not a bio,… person Have to learn how to communicate Have to learn the language Have to form a working relationship with domain expert(s) Have to find problems that leverage your skills

11 Working Cross-Culture A Way to Engage With Domain Scientists Find someone who is desperate for help Communicate in terms of scenarios Work on a problem that gives 100x benefit –Weeks/task vs hours/task Solve 20% of the problem –The other 80% will take decades Prototype Go from working-to-working, Always have –Something to show –Clear next steps –Clear goal Avoid death-by-collaboration-meetings.

12 Working Cross-Culture Questions: A Way to Engage With Domain Scientists Astronomers proposed 20 questions Typical of things they want to do Each would require a week or more in old way (programming in tcl / C++/ FTP) Goal, make it easy to answer questions This goal motivates DB and tools design

13 The 20 Queries Q11: Find all elliptical galaxies with spectra that have an anomalous emission line. Q12: Create a grided count of galaxies with u-g>1 and r<21.5 over 60<declination<70, and 200<right ascension<210, on a grid of 2’, and create a map of masks over the same grid. Q13: Create a count of galaxies for each of the HTM triangles which satisfy a certain color cut, like 0.7u-0.5g-0.2i<1.25 && r<21.75, output it in a form adequate for visualization. Q14: Find stars with multiple measurements and have magnitude variations >0.1. Scan for stars that have a secondary object (observed at a different time) and compare their magnitudes. Q15: Provide a list of moving objects consistent with an asteroid. Q16: Find all objects similar to the colors of a quasar at 5.5<redshift<6.5. Q17: Find binary stars where at least one of them has the colors of a white dwarf. Q18: Find all objects within 30 arcseconds of one another that have very similar colors: that is where the color ratios u-g, g-r, r-I are less than 0.05m. Q19: Find quasars with a broad absorption line in their spectra and at least one galaxy within 10 arcseconds. Return both the quasars and the galaxies. Q20: For each galaxy in the BCG data set (brightest color galaxy), in 160<right ascension<170, -25<declination<35 count of galaxies within 30"of it that have a photoz within 0.05 of that galaxy. Q1: Find all galaxies without unsaturated pixels within 1' of a given point of ra=75.327, dec= Q2: Find all galaxies with blue surface brightness between and 23 and 25 mag per square arcseconds, and - 10<super galactic latitude (sgb) <10, and declination less than zero. Q3: Find all galaxies brighter than magnitude 22, where the local extinction is >0.75. Q4: Find galaxies with an isophotal surface brightness (SB) larger than 24 in the red band, with an ellipticity>0.5, and with the major axis of the ellipse having a declination of between 30” and 60”arc seconds. Q5: Find all galaxies with a deVaucouleours profile (r ¼ falloff of intensity on disk) and the photometric colors consistent with an elliptical galaxy. The deVaucouleours profile Q6: Find galaxies that are blended with a star, output the deblended galaxy magnitudes. Q7: Provide a list of star-like objects that are 1% rare. Q8: Find all objects with unclassified spectra. Q9: Find quasars with a line width >2000 km/s and 2.5<redshift<2.7. Q10: Find galaxies with spectra that have an equivalent width in Ha >40Å (Ha is the main hydrogen spectral line.) Also some good queries at:

14 SkyQuery ( Distributed Query tool using a set of web services Many astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England) Has grown from 4 to 15 archives, now becoming international standard Allows queries like: SELECT o.objId, o.r, o.type, t.objId FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t WHERE XMATCH(o,t)<3.5 AND AREA(181.3,-0.76,6.5) AND o.type=3 and (o.I - t.m_j)>2

15 2MASS INT SDSS FIRST SkyQuery Portal Image Cutout SkyQuery Structure Each SkyNode publishes –Schema Web Service –Database Web Service Portal is –Plans Query (2 phase) –Integrates answers –Is itself a web service

16 MyDB: eScience Workbench Prototype of bringing analysis to the data Everybody gets a workspace (database) –Executes analysis at the data –Store intermediate results there –Long queries run in batch –Results shared within groups Only fetch the final results Extremely successful – matches work patterns