Big Data in Science (Lessons from astrophysics)
Michael Drinkwater, UQ & CAASTRO
1. Preface
-Contributions by Jim Gray
-Astronomy data flow
2. Past Glories
-Why it was easy to be world-leading
3. Future Challenges
-Why really big data makes us worry!
CSIRO Parkes radio telescope
1. Preface: Jim Gray (Microsoft eScience)
›Much of what I discuss was already said by the late Jim Gray:
›“I have been hanging out with astronomers for about the last 10 years… I look at their telescopes… $15-20M worth of capital equipment with about people operating the instrument… millions of lines of code are needed to analyse all this information. In fact the software cost dominates the capital expenditure!”
›Jim Gray on eScience, in The Fourth Paradigm, eds Hey, Tansley & Tolle, (emphasis added) research.microsoft.com
Jim Gray, Microsoft Research
1. Preface: Astronomy Data Flow
(Flow diagram; stages shown: Telescope, Raw Images, Output Image, Science, Database, Catalogues)
2. Past Glories
›20 years ago
-Easy to lead the world!
›UKST photographic all-sky survey
-1 image = 1 GB
-All-sky image = 1 TB
-All-sky catalogue = 100 MB
-Put online with two summer student projects
2. Past Glories
›Why did astronomy lead the way with (old) big data?
›1) Telescopes are expensive, so there are only a few data sources
-Data are complex, so there are only a few software packages, especially for national projects
-=> easy to adopt a common data file format
›2) Astronomers had strong computing skills
-=> easy to search a relatively large discovery space
CSIRO's ASKAP radio telescope with its innovative phased array receiver technology. (Image: Dragonfly Media)
2. Past Glories
›Problems with the old approach in astronomy
-Most team projects underestimate or ignore the database budget
-Astronomers too independent – skeptical of computer science expertise
-Bespoke solutions are neither scalable nor sustainable
The Anglo-Australian Telescope (Image: AAO) – used for many team projects
2. Past Glories
›WiggleZ Dark Energy Survey
-5 year observing project
-$5M facility time + $1.5M grants + 20 team salaries
-Database $40k (donated by host as not funded)
›Success!
-4 tests proving Einstein’s General Relativity correct
-Many other results citations
›Failure!
-Database failed as not supported
3. Future Challenges
›New projects so large astronomy must change…
-Schmidt photographic survey: 1 TB
-Sloan Digital Sky Survey: 25 TB
-…
-Large Synoptic Survey Telescope: 130 PB in 10 years ?
-Square Kilometre Array radio telescope: 10 PB per day!
-More data per day than entire internet per year
The LSST: 8.4 m telescope mirror, 3.2 Gpixel camera
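To put the volumes on this slide on a common footing, the rough sketch below converts them into sustained data rates. It is a back-of-envelope calculation only: it assumes decimal units (1 PB = 10^15 bytes), data arriving evenly over time, and a hypothetical 10 Gbit/s network link that is not mentioned in the talk.

```python
# Back-of-envelope rates for the data volumes quoted on the slide.
# Assumptions (not from the talk): decimal units, data spread evenly
# over time, and a hypothetical 10 Gbit/s network link.
TB = 1e12
PB = 1e15
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def sustained_rate_gb_per_s(total_bytes, seconds):
    """Average rate in GB/s needed to absorb total_bytes in the given time."""
    return total_bytes / seconds / 1e9


def transfer_days(total_bytes, link_gbit_per_s):
    """Days needed to move total_bytes over a link of the given speed."""
    seconds = total_bytes * 8 / (link_gbit_per_s * 1e9)
    return seconds / SECONDS_PER_DAY


lsst_rate = sustained_rate_gb_per_s(130 * PB, 10 * SECONDS_PER_YEAR)
ska_rate = sustained_rate_gb_per_s(10 * PB, SECONDS_PER_DAY)
ska_day_over_link = transfer_days(10 * PB, link_gbit_per_s=10)
schmidt_surveys_per_ska_day = (10 * PB) / (1 * TB)

print(f"LSST average rate: {lsst_rate:.2f} GB/s")
print(f"SKA average rate:  {ska_rate:.1f} GB/s")
print(f"One SKA day over a 10 Gbit/s link: {ska_day_over_link:.1f} days")
print(f"Schmidt surveys per SKA day: {schmidt_surveys_per_ska_day:.0f}")
```

Under these assumptions, one day of SKA data would take about three months to copy over the hypothetical link, so the data cannot simply be shipped to each user.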
3. Future Challenges
›Challenges we know how to solve (Jim Gray predicted most of these)
-Realistic funding
-Scalable database structure: how to avoid I/O limits
-Must move the query to the data
-Efficient database design (Jim’s 20 questions to define functionality)
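"Move the query to the data" can be illustrated with a toy example. The sketch below uses Python's built-in sqlite3 module and an invented catalogue (the table name, column names, and random values are all hypothetical) to compare how many rows leave the database when the filter runs on the client versus inside the database.

```python
import random
import sqlite3

# Toy catalogue with hypothetical columns and random values; a real survey
# database would be far larger and would live on remote servers.
rng = random.Random(42)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalogue (id INTEGER PRIMARY KEY, ra_deg REAL, dec_deg REAL, mag REAL)"
)
rows = [
    (i, rng.uniform(0, 360), rng.uniform(-90, 90), rng.uniform(10, 25))
    for i in range(10_000)
]
conn.executemany("INSERT INTO catalogue VALUES (?, ?, ?, ?)", rows)

# Data to the query: every row is shipped to the client, then filtered there.
all_rows = conn.execute("SELECT id, ra_deg, dec_deg, mag FROM catalogue").fetchall()
bright_client = [r for r in all_rows if r[3] < 12.0]

# Query to the data: the filter runs where the data live; only matches travel.
bright_server = conn.execute(
    "SELECT id, ra_deg, dec_deg, mag FROM catalogue WHERE mag < ?", (12.0,)
).fetchall()

rows_moved_client = len(all_rows)
rows_moved_server = len(bright_server)
print(f"rows moved, client-side filter: {rows_moved_client}")
print(f"rows moved, server-side filter: {rows_moved_server}")
```

Both approaches return the same objects; the difference is how much data crosses the I/O boundary, which is the limit that matters at petabyte scale.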
3. Future Challenges
›Nasty challenges we are yet to solve…
-Complex data mining way beyond SQL
-“Teaching software engineering to the whole community” 1
-Real-time analysis for transient events
-Cross-matching different large databases in different locations
“The data collected by the SKA in a single day would take nearly two million years to play back on an iPod.” skatelescope.org
1. Mario Juric, LSST Data Management Project Scientist
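For readers unfamiliar with cross-matching, the sketch below shows the basic operation on two tiny invented catalogues: for each source in one list, find the nearest source in the other within a matching radius. The catalogue entries, identifiers, and the 2 arcsecond radius are hypothetical, and the brute-force double loop is chosen for clarity, not as a method used by any survey.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(cat_a, cat_b, radius_arcsec):
    """Map each id in cat_a to its nearest cat_b id within the radius."""
    radius_deg = radius_arcsec / 3600
    matches = {}
    for id_a, ra_a, dec_a in cat_a:
        best = None  # (id_b, separation) of the closest candidate so far
        for id_b, ra_b, dec_b in cat_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= radius_deg and (best is None or sep < best[1]):
                best = (id_b, sep)
        if best is not None:
            matches[id_a] = best[0]
    return matches


# Hypothetical catalogues: (id, RA in degrees, Dec in degrees).
optical = [("opt-1", 150.0000, 2.0000), ("opt-2", 150.1000, 2.1000), ("opt-3", 10.0, -45.0)]
radio = [("rad-1", 150.0002, 2.0001), ("rad-2", 150.1000, 2.1003), ("rad-3", 200.0, 30.0)]

matches = cross_match(optical, radio, radius_arcsec=2.0)
print(matches)
```

The double loop costs one comparison per pair of sources, so it does not scale to billions of objects, and it assumes both catalogues sit in the same place. Both limits are part of what makes this challenge hard.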
Postscript: Jim Gray (Microsoft eScience)
›Jim Gray’s rules for large data design:
-Scientific computing is increasingly data intensive
-Solution is a “scale-out” architecture
-Bring computations to the data, rather than data to the computations
-Start the design with the 20 top questions
-Go from "working to working"
-From “Gray’s Laws: Database-centric Computing in Science”, Szalay & Blakeley, in The Fourth Paradigm, eds Hey, Tansley & Tolle, research.microsoft.com
Jim Gray, Microsoft Research
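The "scale-out" rule can be sketched in a few lines: the data are split into partitions, each partition computes a small partial result where it sits, and only those summaries are combined. In the sketch below, threads stand in for separate storage nodes and the magnitude values are invented; it illustrates the pattern and is not code from the cited chapter.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(partition):
    """Runs next to one partition; returns a small summary (count, sum)."""
    return len(partition), sum(partition)


def scale_out_mean(partitions):
    """Combine per-partition summaries into a global mean."""
    # Threads are a stand-in for independent nodes (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(partial_stats, partitions))
    count = sum(n for n, _ in partials)
    total = sum(s for _, s in partials)
    return total / count


# Hypothetical magnitudes, already split across three "nodes".
partitions = [[18.2, 19.1, 20.4], [17.5, 21.0], [19.9, 18.8, 20.1, 22.3]]
mean_mag = scale_out_mean(partitions)
print(f"mean magnitude: {mean_mag:.3f}")
```

Adding capacity then means adding partitions, and the amount of data exchanged between nodes stays at two numbers per partition however large each partition grows.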