Astrophysics with Terabytes of Data
Alex Szalay, The Johns Hopkins University

Living in an Exponential World
Astronomers have a few hundred TB now
– 1 pixel (byte) / sq arc second ~ 4TB
– Multi-spectral, temporal, … → 1PB
They mine it looking for
– new (kinds of) objects, or more of interesting ones (quasars)
– density variations in 400-D space
– correlations in 400-D space
Data doubles every year
Data is public after 1 year
Same access for everyone
But: how long can this continue?

Evolving Science
A thousand years ago: science was empirical
– describing natural phenomena
Last few hundred years: a theoretical branch
– using models and generalizations
Last few decades: a computational branch
– simulating complex phenomena
Today: data exploration (eScience)
– synthesizing theory, experiment and computation with advanced data management and statistics

The Challenges
Data Collection → Discovery and Analysis → Publishing
Exponential data growth:
– distributed collections, soon Petabytes
New analysis paradigm:
– data federations, move analysis to the data
New publishing paradigm:
– scientists are publishers and curators

Publishing Data
Exponential growth:
– Projects last at least 3-5 years
– Data sent upwards only at the end of the project
– Data will never be centralized
More responsibility on projects
– Becoming publishers and curators
Data will reside with projects
– Analyses must be close to the data

Roles        Authors         Publishers        Curators         Consumers
Traditional  Scientists      Journals          Libraries        Scientists
Emerging     Collaborations  Project www site  Bigger Archives  Scientists

Accessing Data
If there is too much data to move around, take the analysis to the data!
Do all data manipulations at the database
– Build custom procedures and functions in the database
– Automatic parallelism guaranteed
– Easy to build in custom functionality
– Databases & procedures being unified
– Examples: temporal and spatial indexing, pixel processing
Easy to reorganize the data
– Multiple views, each optimal for certain analyses
– Building hierarchical summaries is trivial
Scalable to Petabyte datasets: active databases!
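As a concrete illustration of taking the analysis to the data, here is a minimal sketch in SQL Server-style SQL (the SkyServer platform); the function, table, and column names are illustrative assumptions, not the actual SkyServer schema:

    -- Hypothetical server-side function: compute a g-r color index
    -- inside the database instead of shipping raw columns to the client
    CREATE FUNCTION dbo.fColorGR (@g FLOAT, @r FLOAT)
    RETURNS FLOAT
    AS
    BEGIN
        RETURN @g - @r;
    END
    GO

    -- The scan runs where the data lives; only matching rows cross the wire
    SELECT objID, ra, dec, dbo.fColorGR(g, r) AS gr
    FROM PhotoObj
    WHERE dbo.fColorGR(g, r) BETWEEN 0.4 AND 0.6;

Because the function lives in the database, the engine can parallelize the scan, and the same logic is reusable from any query.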

Making Discoveries
Where are discoveries made?
– At the edges and boundaries
– Going deeper, collecting more data, using more colors…
Metcalfe's law
– Utility of computer networks grows as the number of possible connections: O(N²)
Federating data
– A federation of N archives has utility O(N²)
– Possibilities for new discoveries grow as O(N²)
Current sky surveys have proven this
– Very early discoveries from SDSS, 2MASS, DPOSS

Data Federations
Massive datasets live near their owners:
– Near the instrument's software pipeline
– Near the applications
– Near data knowledge and curation
– Super Computer centers become Super Data Centers
Each archive publishes (web) services
– Schema: documents the data
– Methods on objects (queries)
Scientists get "personalized" extracts
Uniform access to multiple archives
– A common "global" schema

The Virtual Observatory
Premise: most data is (or could be) online
Federating the different surveys will provide opportunities for new science
It's a smart telescope: links objects and data to the literature on them
Software became the capital expense
– Share, standardize, reuse…
It has to be SIMPLE
You can form your own small collaborations

Strong International Collaboration
Similar efforts now in 15 countries:
– USA, UK, Canada, France, Germany, Italy, Holland, Japan, Australia, India, China, Russia, Hungary, South Korea, ESO, Spain
Total awarded funding worldwide is over $60M
Active collaboration among projects
– Standards, common demos
– International VO roadmap being developed
– Regular telecons over 10 time zones
Formal collaboration: the International Virtual Observatory Alliance (IVOA)

Boundary Conditions
Dealing with the astronomy legacy
– FITS data format
– Software systems
Standards driven by evolving new technologies
– Exchange of rich and structured data (XML…)
– DB connectivity, Web Services, Grid computing
External funding climate
Application to the astronomy domain
– Data dictionaries (UCDs)
– Data models
– Protocols
– Registries and resource/service discovery
– Provenance, data quality, DATA CURATION!

Current VO Challenges
How to avoid trying to be everything for everybody?
Database connectivity is essential
– Bring the analysis to the data
Core web services, higher-level applications on top
Use the 90/10 rule:
– Define the standards and interfaces
– Build the framework
– Build the 10% of services that are used by 90%
– Let the users build the rest from the components
Rapidly changing "outside world"
Make it simple!

Where are we going?
Relatively easy to predict until 2010
– Exponential growth continues
– Most ground-based observatories join the VO
– More and more sky surveys in different wavebands
– Simulations will have VO interfaces: they can be 'observed'
Much harder beyond 2010
– Peta-surveys are coming online (Pan-STARRS, VISTA, LSST)
– Technological predictions much harder
– Changing funding climate
– Changing sociology

Similarities to HEP
What can astronomy learn from High Energy Physics?

HEP                   Optical Astronomy
Van de Graaff         2.5m telescopes
Cyclotrons            4m telescopes
National Labs         8m class telescopes
International (CERN)  Surveys/Time Domain
SSC vs LHC            …m telescopes

Similar trends with a 20-year delay: fewer and ever bigger projects, an increasing fraction of the cost in software, more conservative engineering. Can the exponential continue, or will it be logistic?

Why Is Astronomy Different?
Especially attractive to the wide public
It has no commercial value
– No privacy concerns; freely share results with others
– Great for experimenting with algorithms
Data has more dimensions
– Spatial, temporal, cross-correlations
Diverse and distributed
– Many different instruments from many different places and many different times
Many different interesting questions

Trends
CMB Surveys
– 1990 COBE
– Boomerang 10,000
– CBI 50,000
– WMAP 1 Million
– 2008 Planck 10 Million
Galaxy Redshift Surveys
– 1986 CfA
– LCRS
– 2dF
– SDSS
Angular Galaxy Surveys
– 1970 Lick 1M
– 1990 APM 2M
– 2005 SDSS 200M
– 2008 VISTA 1000M
– 2012 LSST 3000M
Time Domain
– QUEST
– SDSS extension survey
– Dark Energy Camera
– Pan-STARRS
– SNAP…
– LSST…
Petabytes/year by the end of the decade…

Challenges
Real-Time Detection for 3B objects
Pixels (exponential growth slowing down)
– Size projection: 100PB by 2020
Data Transfer (grows slower than the data)
Data Access (hierarchical usage)
Fault Tolerance and Data Protection
(Diagram: access tiers Tier0 → Tier1 → Tier2, from 100% of the data down to 10% and 1%, with the smaller slices served fast.)

SkyServer
Sloan Digital Sky Survey: pixels + objects
– About 500 attributes per "object", 300M objects
– Spectra for 1M objects
– Currently 2TB, fully public
Prototype eScience lab
– Moving analysis to the data
– Fast searches: color, spatial
Visual tools
– Join pixels with objects
Prototype in data publishing
– 70 million web hits in 3.5 years
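A typical fast color search, as a hedged sketch in SkyServer-style SQL (PhotoObj and the ugriz magnitude columns follow the published SkyServer schema; the specific cuts are illustrative, not the survey's):

    -- Quasar-candidate style color cut: blue, UV-excess objects
    SELECT TOP 100 objID, ra, dec, u, g, r
    FROM PhotoObj
    WHERE u - g < 0.4     -- blue in u-g
      AND g - r < 0.7     -- blue in g-r
      AND r < 19.0;       -- bright enough to trust the photometry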

Public Data Release: Versions!
June 2001: EDR (Early Data Release)
July 2003: DR1
– Contains 30% of the final data
– 150 million photo objects
July 2005: DR4 at 3.5TB
– 60% of the data
4 versions of the data
– Target, best, runs, spectro
Total catalog volume 5TB
– See the Terascale Sneakernet paper…
Published releases served forever
– EDR, DR1, DR2, …
– Soon to include archives, annotations
– Serving every version forever is O(N²): only possible because of Moore's Law!

Spatial Features
Precomputed Neighbors
– All objects within 30"
Boundaries, Masks and Outlines
– 27,000 spatial objects
– Stored as spatial polygons
Time domain: precomputed Match
– All objects within 1", observed at different times
– Found duplicates due to telescope tracking errors
– Manual fix, recorded in the database
MatchHead
– The first observation of the linked list, used as a unique id for the chain of observations of the same object
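A sketch of how the precomputed Neighbors table is used (Neighbors is a real SkyServer table; treat the exact column names and units below as assumptions):

    -- Everything within 10 arcsec of a given object: a simple lookup,
    -- no on-the-fly spherical geometry needed
    SELECT n.neighborObjID, n.distance       -- distance in arcmin
    FROM Neighbors AS n
    WHERE n.objID = 587722984435351558       -- hypothetical example objID
      AND n.distance < 10.0/60.0;            -- 10 arcsec in arcmin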

Things Can Get Complex

3 Ways To Do Spatial
Hierarchical Triangular Mesh (extension to SQL)
– Uses table-valued stored procedures
– Acts as a new "spatial access method"
– Ported to Yukon CLR for a 17x speedup
Zones: fits SQL well
– Surprisingly simple & good on a fixed scale
Constraints: a novel idea
– Lets us do algebra on regions, implemented in pure SQL
Paper: There Goes the Neighborhood: Relational Algebra for Spatial Data Search
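A sketch of the first approach: the HTM machinery is exposed through table-valued functions, so a cone search composes with ordinary joins (fGetNearbyObjEq is a real SkyServer function; treat the exact signature and return columns as assumptions):

    -- Cone search around (ra, dec) = (185.0, -0.5), radius 1 arcmin,
    -- joined back to the photometry table
    SELECT p.objID, p.ra, p.dec, n.distance
    FROM dbo.fGetNearbyObjEq(185.0, -0.5, 1.0) AS n
    JOIN PhotoObj AS p ON p.objID = n.objID;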

Zone Cross-Match: 2MASS × USNOB
Source tables: 2MASS (471 Mrec, 140 GB) and USNOB (1.1 Brec, 233 GB)
Zone:Zone comparison: each 2MASS zone is compared against the USNOB zones 0:-1, 0:0 and 0:+1, then on to the next zone
Intermediate match lists (2MASS→USNOB and USNOB→2MASS, 26-350 Mrec, 1-12 GB each) are merged into the answer
Timing: build index 2 hours, merge 0.5 hour; with pipeline parallelism the whole job runs in 2.5 hours, or as fast as we can read USNOB, plus 0.5 hours
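A hedged sketch of the per-zone join behind this pipeline (table, column, and variable names are illustrative; the production code also widens the ra window by 1/cos(dec) and handles the ra = 0/360 wrap-around):

    DECLARE @r FLOAT;
    SET @r = 1.0/3600.0;      -- 1 arcsec match radius, in degrees

    -- Match each 2MASS object against USNOB objects in its own zone
    -- and the two neighboring zones, then apply the exact distance test
    SELECT t.objID AS twomassID, u.objID AS usnobID
    FROM TwoMassZone AS t
    JOIN UsnobZone  AS u
      ON u.zoneID BETWEEN t.zoneID - 1 AND t.zoneID + 1   -- zones 0:-1, 0:0, 0:+1
     AND u.ra BETWEEN t.ra - @r AND t.ra + @r             -- coarse ra filter
    WHERE (u.cx - t.cx)*(u.cx - t.cx)
        + (u.cy - t.cy)*(u.cy - t.cy)
        + (u.cz - t.cz)*(u.cz - t.cz)
        < POWER(2.0 * SIN(RADIANS(@r) / 2.0), 2);         -- chord-length test
        -- (cx, cy, cz) are unit vectors on the sphere; a chord of
        -- 2*sin(theta/2) corresponds to an angular separation theta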

Next-Generation Data Analysis
Looking for
– Needles in haystacks: the Higgs particle
– Haystacks: dark matter, dark energy
Needles are easier than haystacks
'Optimal' statistics have poor scaling
– Correlation functions are N², likelihood techniques N³
– For large data sets the main errors are not statistical
As data and computers grow with Moore's Law, we can only keep up with N log N
A way out?
– Discard the notion of optimal (data is fuzzy, answers are approximate)
– Don't assume infinite computational resources or memory
Requires a combination of statistics & computer science
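One way to make that trade concrete, as a hedged sketch (TABLESAMPLE is standard SQL Server syntax; the table and columns are SkyServer-style assumptions): estimate a distribution from a small random sample inside the database instead of a full scan.

    -- Approximate g-r color histogram from a 1 percent sample;
    -- scale counts back up by the sampling factor. TABLESAMPLE picks
    -- whole pages, so it is fast but only approximately random.
    SELECT ROUND(g - r, 1) AS gr_bin,
           COUNT(*) * 100  AS estimated_count
    FROM PhotoObj TABLESAMPLE (1 PERCENT)
    GROUP BY ROUND(g - r, 1)
    ORDER BY gr_bin;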

Organization & Algorithms
Use of clever data structures (trees, cubes):
– Up-front creation cost, but only N log N access cost
– Large speedup during the analysis
– Tree-codes for correlations (A. Moore et al 2001)
– Data cubes for OLAP (all vendors)
Fast, approximate heuristic algorithms
– No need to be more accurate than cosmic variance
– Fast CMB analysis by Szapudi et al (2001): N log N instead of N³ => 1 day instead of 10 million years
Take the cost of computation into account
– Controlled level of accuracy
– Best result in a given time, given our computing resources
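A minimal sketch of the data-cube idea in SQL (GROUP BY ROLLUP is standard SQL Server syntax; the SDSS-style grouping columns are assumptions): pay the aggregation cost once, up front, at every level of the hierarchy.

    -- One pass builds the whole hierarchy of summaries:
    -- per (run, camcol, field), per (run, camcol), per run, and the total
    SELECT run, camcol, field,
           COUNT(*) AS n_obj,
           AVG(r)   AS mean_r
    FROM PhotoObj
    GROUP BY ROLLUP (run, camcol, field);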

Today's Questions
Discoveries
– Need fast outlier detection
Spatial statistics
– Fast correlation and power spectrum codes (CMB + galaxies)
– Cross-correlations among different surveys (sky pixelization + fast harmonic transforms on the sphere)
Time domain
– Transients, supernovae, periodic variables
– Moving objects, 'killer' asteroids, Kuiper-belt objects…

Other Challenges
Statistical noise is smaller and smaller
– The error matrix is larger and larger (Planck…)
Systematic errors becoming dominant
– De-sensitize against known systematic errors
– Optimal subspace filtering (…SDSS stripes…)
Comparisons of spectra to models
– 10^6 spectra vs 10^8 models (Charlot…)
Detection of faint sources in multi-spectral images
– How to use all the information optimally (QUEST…)
Efficient visualization of ensembles of 100M+ data points

Systematic Errors
SDSS P(k), main issue:
– Effects of zero points and flat-field vectors result in large-scale, correlated patterns
Two tasks:
– Estimate how large the effect is
– De-sensitize the statistics
Monte-Carlo simulations:
– 100 million random points, assigned to stripes, runs, camcols, fields, x,y positions and redshifts => database
– Build the MC error matrix due to zero-point errors
Include the error matrix in the KL basis
– Some modes are sensitive to zero points (# of free parameters)
– Eliminate those modes from the analysis => projection
– Statistics insensitive to zero points afterwards
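A hedged sketch of the database side of the Monte-Carlo step (the schema is entirely hypothetical): store the random points tagged with survey geometry, so the per-patch sensitivities that feed the MC error matrix come from plain aggregates.

    -- Hypothetical table of Monte-Carlo points tagged with survey geometry
    CREATE TABLE McPoint (
        ptID   BIGINT PRIMARY KEY,
        stripe INT, run INT, camcol INT, field INT,
        x FLOAT, y FLOAT,     -- position within the field
        z FLOAT               -- assigned redshift
    );

    -- How strongly one redshift shell couples to each zero-point
    -- "patch" (run, camcol): a raw ingredient of the MC error matrix
    SELECT run, camcol, COUNT(*) AS n_points
    FROM McPoint
    WHERE z BETWEEN 0.10 AND 0.12
    GROUP BY run, camcol
    ORDER BY run, camcol;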

Simulations
Cosmological simulations have 10^9 particles and produce over 30TB of data (Millennium)
– Build up dark matter halos
– Track the merging history of halos
– Use it to assign star formation history
– Combine with spectral synthesis
Problems:
– Too few realizations
– Hard to analyze the data afterwards
– What is the best way to compare to the real universe?

Summary
Databases have become an essential part of astronomy: most data access will soon be via digital archives
Data is at separate locations, distributed worldwide, and evolving in time: move the analysis, not the data!
Good scaling of statistical algorithms is essential
Many outstanding problems in astronomy are statistical; current techniques are inadequate, and we need help!
The Virtual Observatory is a new paradigm for doing science: the science of data exploration!