Big Data in Science (Lessons from astrophysics) Michael Drinkwater, UQ & CAASTRO

Presentation transcript:

Big Data in Science (Lessons from astrophysics)
Michael Drinkwater, UQ & CAASTRO
1. Preface
-Contributions by Jim Gray
-Astronomy data flow
2. Past Glories
-Why it was easy to be world-leading
3. Future challenges
-Why really big data makes us worry!
CSIRO Parkes radio telescope

1. Preface: Jim Gray (Microsoft eScience)
›Much of what I discuss was already said by the late Jim Gray:
›“I have been hanging out with astronomers for about the last 10 years… I look at their telescopes… $15-20M worth of capital equipment with about […] people operating the instrument… millions of lines of code are needed to analyse all this information. In fact the software cost dominates the capital expenditure!”
›Jim Gray on eScience, in The Fourth Paradigm, eds Hey, Tansley & Tolle (emphasis added), research.microsoft.com
Jim Gray, Microsoft Research

1. Preface: Astronomy Data Flow
Telescope → Raw Images → Output Image → Science → Database → Catalogues

2. Past Glories
›20 years ago
-Easy to lead the world!
›UKST photographic all-sky survey
-1 image = 1 GB
-All-sky image = 1 TB
-All-sky catalogue = 100 MB
-Put online with two summer student projects

2. Past Glories
›Why did astronomy lead the way with (old) big data?
›1) Telescopes are expensive, so there are only a few data sources
-Data are complex, so there are only a few software packages, especially for national projects
-=> easy to adopt a common data file format
›2) Astronomers had strong computing skills
-=> easy to search a relatively large discovery space
CSIRO's ASKAP radio telescope with its innovative phased array receiver technology. (Image: Dragonfly Media)

2. Past Glories
›Problems with the old approach in astronomy
-Most team projects underestimate or ignore the database budget
-Astronomers are too independent and skeptical of computer science expertise
-Bespoke solutions are not scalable or sustainable
The Anglo-Australian Telescope (Image: AAO) – used for many team projects

2. Past Glories
›WiggleZ Dark Energy Survey
-5-year observing project
-$5M facility time + $1.5M grants + 20 team salaries
-Database $40k (donated by host as not funded)
›Success!
-4 tests proving Einstein’s General Relativity correct
-Many other results; […] citations
›Failure!
-Database failed as not supported

3. Future Challenges
›New projects so large astronomy must change…
-Schmidt photographic survey: 1 TB
-Sloan Digital Sky Survey: 25 TB
-…
-Large Synoptic Survey Telescope: 130 PB in 10 years
-? Square Kilometre Array radio telescope: 10 PB per day!
-More data per day than entire internet per year
The LSST: 8.4 m telescope mirror, 3.2 Gpixel camera
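To put the volumes on this slide on a common scale, the sketch below converts them to daily rates. It uses only the figures quoted above; the 10 Gbit/s network link is an assumption added for illustration and does not come from the talk, and decimal units (1 PB = 10^15 bytes) are assumed throughout.

```python
# Data volumes quoted on the slide, in bytes (decimal units assumed).
TB = 10**12
PB = 10**15
DAYS_PER_YEAR = 365.25

schmidt_total = 1 * TB    # Schmidt photographic survey
sdss_total = 25 * TB      # Sloan Digital Sky Survey
lsst_total = 130 * PB     # LSST, accumulated over 10 years
lsst_years = 10
ska_per_day = 10 * PB     # SKA, per day

# Average LSST rate, and one year of SKA data at the quoted daily rate.
lsst_per_day = lsst_total / (lsst_years * DAYS_PER_YEAR)
ska_per_year = ska_per_day * DAYS_PER_YEAR


def transfer_days(n_bytes, link_gbit_per_s):
    """Days needed to move n_bytes over a link of the given speed.

    The link speed is a hypothetical parameter, not a figure from the talk.
    """
    bytes_per_second = link_gbit_per_s * 1e9 / 8
    return n_bytes / bytes_per_second / 86400


if __name__ == "__main__":
    print(f"SDSS / Schmidt: {sdss_total / schmidt_total:.0f}x")
    print(f"LSST average: {lsst_per_day / TB:.1f} TB/day")
    print(f"SKA / LSST daily rate: {ska_per_day / lsst_per_day:.0f}x")
    print(f"SKA per year: {ska_per_year / PB:.0f} PB")
    print(f"One SKA day over 10 Gbit/s: {transfer_days(ska_per_day, 10):.0f} days")
```

Under these assumptions a single day of SKA data would take roughly three months to copy over such a link, which is one way to see why the data cannot simply be shipped to each user.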

3. Future Challenges
›Challenges we know how to solve (Jim Gray predicted most of these)
-Realistic funding
-Scalable database structure: how to avoid I/O limits
-Must move the query to the data
-Efficient database design (Jim’s 20 questions to define functionality)
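"Move the query to the data" can be illustrated with a toy example. The sketch below is not taken from any survey archive: the `objects` table, its columns, the function names and the random catalogue are all hypothetical, and an in-memory SQLite database stands in for a remote archive. It compares how many rows cross the "network" when the client downloads everything and filters locally, versus when the filter runs inside the database.

```python
import random
import sqlite3


def build_archive(n_rows=10_000, seed=1):
    """Create an in-memory table standing in for a remote catalogue (toy data)."""
    rng = random.Random(seed)
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE objects (id INTEGER PRIMARY KEY, ra REAL, dec REAL, mag REAL)"
    )
    rows = [
        (i, rng.uniform(0, 360), rng.uniform(-90, 90), rng.uniform(10, 25))
        for i in range(n_rows)
    ]
    db.executemany("INSERT INTO objects VALUES (?, ?, ?, ?)", rows)
    return db


def data_to_query(db, mag_limit):
    """Download every row, then filter on the client side."""
    all_rows = db.execute("SELECT id, ra, dec, mag FROM objects").fetchall()
    selected = [row for row in all_rows if row[3] < mag_limit]
    return selected, len(all_rows)  # rows moved = whole table


def query_to_data(db, mag_limit):
    """Send the filter to the archive; only matching rows come back."""
    selected = db.execute(
        "SELECT id, ra, dec, mag FROM objects WHERE mag < ?", (mag_limit,)
    ).fetchall()
    return selected, len(selected)  # rows moved = result set only


if __name__ == "__main__":
    archive = build_archive()
    bright_a, moved_a = data_to_query(archive, 12.0)
    bright_b, moved_b = query_to_data(archive, 12.0)
    print(f"same answer: {sorted(bright_a) == sorted(bright_b)}")
    print(f"rows moved, data to query: {moved_a}")
    print(f"rows moved, query to data: {moved_b}")
```

Both approaches return the same objects; the difference is only in how much data has to travel, which is the quantity that becomes prohibitive at petabyte scale.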

3. Future Challenges
›Nasty challenges we are yet to solve…
-Complex data mining way beyond SQL
-“Teaching software engineering to the whole community” [1]
-Real-time analysis for transient events
-Cross-matching different large databases in different locations
“The data collected by the SKA in a single day would take nearly two million years to play back on an iPod.” skatelescope.org
[1] Mario Juric, LSST Data Management Project Scientist
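As a concrete picture of what "cross-matching" means, here is a minimal sketch that pairs sources from two small catalogues by position on the sky. The catalogue layout `(id, ra, dec)` in degrees, the function names and the match radius are assumptions for illustration. The nested loop costs O(N×M) comparisons, so it is only workable for tiny catalogues; at survey scale some form of spatial indexing would be needed, and the slide's harder problem of the two catalogues living in different locations is not addressed here at all.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula, inputs in degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    # Clamp against rounding error before taking the arcsine.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(cat_a, cat_b, radius_arcsec):
    """Map each cat_a id to the id of its nearest cat_b source within the radius.

    Both catalogues are lists of (id, ra_deg, dec_deg) tuples.
    Sources in cat_a with no counterpart inside the radius are omitted.
    """
    radius_deg = radius_arcsec / 3600.0
    matches = {}
    for id_a, ra_a, dec_a in cat_a:
        best = None  # (id_b, separation) of the closest candidate so far
        for id_b, ra_b, dec_b in cat_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= radius_deg and (best is None or sep < best[1]):
                best = (id_b, sep)
        if best is not None:
            matches[id_a] = best[0]
    return matches


if __name__ == "__main__":
    optical = [("a1", 10.0, -30.0), ("a2", 150.0, 2.0), ("a3", 359.9999, 0.0)]
    radio = [("b1", 10.0002, -30.0001), ("b2", 200.0, 45.0), ("b3", 0.0001, 0.0)]
    print(cross_match(optical, radio, radius_arcsec=2.0))
```

The third pair straddles RA = 0°/360°; the haversine form handles that wrap-around without special casing.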

Postscript: Jim Gray (Microsoft eScience)
›Jim Gray’s rules for large data design:
-Scientific computing is increasingly data intensive
-Solution is a “scale-out” architecture
-Bring computations to the data, rather than data to the computations
-Start the design with the 20 top questions
-Go from “working to working”
-From “Gray’s Laws: Database-centric Computing in Science”, Szalay & Blakeley, in The Fourth Paradigm, eds Hey, Tansley & Tolle, research.microsoft.com
Jim Gray, Microsoft Research