Desperately Trying to Cope with the Data Explosion in Astronomical Sciences
Ray Norris
CSIRO Australia Telescope National Facility
Overview
Background: astronomical data
Good news
Bad news
Data Manifesto
Astronomical Data
Q: How did the first galaxies in the Universe form?
Need many wavelengths:
Source “c” at 3 cm wavelength
The mysterious “source c”
WFPC2 image 2 arcsec
The hard questions:
“Give me the WFPC image to normalise my spectral line cube”
– Obviously best to do the computation locally
“Give me every source in NED with J-K>4”
– Obviously best to do the computation at the host
“Give me the radio spectral indices (using ATCA data) of all the objects in SLOAN which have J-K>4 in available ESO/STScI databases”
– Some computations local, some on hosts
– VO needs to make sensible decisions
– VO needs grid computing standards
Map labels: local megabyte dataset; terabyte database in Baltimore; NASA Extragalactic Database in Pasadena; terabyte database in Sydney; terabyte database in New Mexico; multi-terabyte databases in Europe & US
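One way to picture the "sensible decisions" the VO has to make is as a cost comparison: run the computation at whichever site requires the least data to be moved. The sketch below is only an illustration under that single assumption. The `Dataset` class, the `choose_compute_site` function, and all sizes are hypothetical, and a real planner would also weigh bandwidth, CPU availability, and access rights.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    site: str        # where the data physically lives
    size_bytes: int  # approximate volume that would have to be moved


def choose_compute_site(datasets: list[Dataset]) -> tuple[str, int]:
    """Return (site, bytes_moved) for the site that minimises data transfer.

    Running at a given site means every dataset held elsewhere is shipped there.
    """
    if not datasets:
        raise ValueError("need at least one dataset")

    def bytes_moved(site: str) -> int:
        return sum(d.size_bytes for d in datasets if d.site != site)

    # Sorting makes the choice deterministic when two sites tie.
    sites = sorted({d.site for d in datasets})
    best = min(sites, key=bytes_moved)
    return best, bytes_moved(best)


MB, TB = 10**6, 10**12

# Illustrative sizes only; none of these numbers come from the talk.
cube = Dataset("spectral line cube", "local", 500 * MB)
wfpc_image = Dataset("single WFPC image", "Baltimore", 10 * MB)
colour_query = Dataset("J-K > 4 selection criteria", "local", 1_000)
ned = Dataset("NED catalogue", "Pasadena", 1 * TB)

print(choose_compute_site([cube, wfpc_image]))   # fetch the image, compute locally
print(choose_compute_site([colour_query, ned]))  # ship the query, compute at the host
```

With these made-up sizes the first two questions come out as on the slide: the image is fetched and processed locally, while the colour cut is sent to the host. The third question involves several sites, which is where a single "least data moved" rule stops being sufficient.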
Good News The Virtual Observatory Astronomical Data Centres Public-domain data
The Virtual Observatory (VO)
The FITS standard (~1980) paved the way for interoperability
The International Virtual Observatory Alliance (IVOA) involves all major astronomical observatories worldwide
– IVOA established 2002
The VO is a collection of interoperating data archives and software tools, linked to form an environment in which astronomical research programs can be conducted.
It includes distributed terabyte databases, data dictionaries, standards, protocols, tools, algorithms, web services, etc.
Examples of VO operations
Give me a list of all the objects which satisfy:
– Criterion A in the CDS database (in Strasbourg, France)
– Criterion B in the Parkes HIPASS survey (in Australia)
– Criterion C in the Hubble archive (in Baltimore, USA)
P.S.
– Each of these databases has a different format, coordinate system, and ontology, and each is several Tbyte in size.
– Metadata are of variable quality.
– The object names will be different in each database.
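Because object names differ between databases, a query like this usually has to identify common objects by sky position rather than by name. The following is a minimal sketch of such a positional cross-match, assuming both catalogues have already been converted to the same coordinate system (RA and Dec in degrees). The function names, catalogue entries, and the 2 arcsec radius are hypothetical, and the brute-force loop is for illustration only; it would not scale to terabyte archives.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp guards against rounding pushing sqrt(h) slightly above 1.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(cat_a, cat_b, radius_arcsec=2.0):
    """Pair each cat_a source with its nearest cat_b source within the radius.

    Each catalogue is a list of (name, ra_deg, dec_deg) tuples.
    Returns a list of (name_a, name_b, separation_arcsec).
    """
    radius_deg = radius_arcsec / 3600.0
    matches = []
    for name_a, ra_a, dec_a in cat_a:
        best = None
        for name_b, ra_b, dec_b in cat_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= radius_deg and (best is None or sep < best[1]):
                best = (name_b, sep)
        if best is not None:
            matches.append((name_a, best[0], best[1] * 3600.0))
    return matches


# Made-up entries: the same object appears under different names.
radio_catalogue = [("RADIO-1", 53.1000, -27.8000), ("RADIO-2", 53.2000, -27.9000)]
optical_catalogue = [("OPT-17", 53.1002, -27.8001), ("OPT-42", 10.0000, 5.0000)]

for name_a, name_b, sep in cross_match(radio_catalogue, optical_catalogue):
    print(f"{name_a} <-> {name_b}: {sep:.2f} arcsec")
```

Only RADIO-1 and OPT-17 fall within the matching radius here; RADIO-2 has no counterpart and is simply left unmatched.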
VO Status
VO is not a centrally managed project: it is a collaboration of different groups, with different drivers, but united by a common goal.
Several groups worldwide are now defining standards, tools, protocols, etc.
Some prototype tools and web services are already available (e.g. )
More will become available over the next 1-2 years
See
Astronomical Data Centres
Centre de Données astronomiques de Strasbourg, France (CDS)
– attempts to hold electronic copies of all published astronomical data, surveys, etc.
NASA Astronomical Data Centre (ADC), Baltimore, USA
NASA Extragalactic Database (NED)
– interprets and combines extragalactic data
Astrophysics Data System (ADS)
– all published astronomical literature
Others
Good News: Public-domain data
Security, confidentiality, and IP protection are not major issues in astronomy: most data are in the public domain, hence VO is interesting to Microsoft etc.
Bad News Intellectual Property controls. Journal data Bad planning of new instruments Digital Divide Legacy data Lack of awareness "Why should I share my data with my competitors?"
Intellectual Property Protection
Patents
– protect inventions
Copyright
– protects written work and creative work
Proposed database protection
– protects information (about anything)
– No “fair use” provisions
– You cannot cite someone else’s data without obtaining their permission
– Each paper will need a paper-trail showing rights to cite data
Organisation chart (diagram labels): ICSU (International Council of Science), with IAU, IUGG, etc. and National Representatives; CODATA (Committee on Data for Science and Technology); United Nations; WIPO (World Intellectual Property Organisation)
Journal Data
Most data published in journals never make it to the data centres
When they do appear in data centres, they rarely carry the metadata or ontology that enable machine understanding
Journals need to impose standards (e.g. VOTable) on authors
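To make "machine understanding" concrete: a format such as VOTable attaches a datatype, a unit, and a semantic tag (a UCD) to every column, so software can interpret a table without reading the paper. The sketch below writes a small table as simplified VOTable-style XML using only the Python standard library. It is an illustration rather than a schema-valid writer: the `table_to_votable` helper is hypothetical, the XML namespace is omitted, and the source name, position, and flux are placeholder values, not data from the talk.

```python
import xml.etree.ElementTree as ET


def table_to_votable(fields, rows):
    """Serialise a small table to a simplified VOTable-style XML string.

    fields: list of dicts of FIELD attributes (name, datatype, unit, ucd, ...).
    rows:   list of tuples, one value per field.
    """
    votable = ET.Element("VOTABLE", version="1.3")
    table = ET.SubElement(ET.SubElement(votable, "RESOURCE"), "TABLE")
    # Column descriptions: this is the metadata that makes the table machine-readable.
    for field in fields:
        ET.SubElement(table, "FIELD", {key: str(val) for key, val in field.items()})
    tabledata = ET.SubElement(ET.SubElement(table, "DATA"), "TABLEDATA")
    for row in rows:
        if len(row) != len(fields):
            raise ValueError("row length does not match number of fields")
        tr = ET.SubElement(tabledata, "TR")
        for value in row:
            ET.SubElement(tr, "TD").text = str(value)
    return ET.tostring(votable, encoding="unicode")


# Placeholder column definitions and values, for illustration only.
FIELDS = [
    {"name": "name", "datatype": "char", "arraysize": "*", "ucd": "meta.id"},
    {"name": "ra", "datatype": "double", "unit": "deg", "ucd": "pos.eq.ra"},
    {"name": "dec", "datatype": "double", "unit": "deg", "ucd": "pos.eq.dec"},
    {"name": "flux_3cm", "datatype": "double", "unit": "mJy",
     "ucd": "phot.flux.density;em.radio"},
]
ROWS = [("example-1", 53.1, -27.8, 0.35)]

xml_text = table_to_votable(FIELDS, ROWS)
print(xml_text)
```

In practice an author or journal would more likely rely on an existing VOTable library than hand-written XML; the point here is only that units and column meanings travel with the numbers.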
Bad News: Bad planning of new instruments
Many new instruments are planned without adequate provision or funding for data management, which decreases scientific productivity
Bad News: Digital Divide
We take for granted instant access to literature and databases. Our colleagues in developing countries still dream of it, which disadvantages them even further
Bad News: Legacy data
Digitising old data competes for funding with new instruments
Bad News: Lack of awareness
BORING!
The Data Manifesto
Astronomers' Manifesto
We, the global community of astronomy, aspire to the following guidelines for managing astronomical data, believing that this would maximise the rate and cost-effectiveness of scientific discovery…
1. All major tables, images, and spectra published in journals should appear in the astronomical data centres.
Journals should, in collaboration with data centres, define formats, table descriptions, and metadata that are easy for authors to adhere to and that can be translated automatically into a format (e.g. VOTable or FITS) that the data centre can load into its database.
2. All data obtained with publicly funded observatories should, after appropriate proprietary periods, be placed in the public domain.
Consistent with ICSU and OECD recommendations
…to which Australia is a signatory
3. In any new major astronomical construction project, the data processing, storage, migration, and management requirements should be built in at an early stage of the project plan, and costed along with other parts of the project.
Isn’t this obvious?
– Apparently not!
4. Astronomers in all countries should have the same access to astronomical data and information.
5. Legacy astronomical data can be valuable, and high-priority legacy data should be preserved and stored in digital form in the data centres.
How do you prioritise?
6. The IAU should work with other international organisations to achieve our common goals and learn from our colleagues in other fields.
Use bodies such as CODATA to cross-fertilise
But the major challenge to coping with the data explosion remains…
Why can’t someone else do it?