Data provenance in astronomy Bob Mann Wide-Field Astronomy Unit University of Edinburgh



2/24 Outline  Data and databases in astronomy  Case Study: UK Infrared Deep Sky Survey  Conclusions


4/24 Astronomers observe across the whole electromagnetic spectrum  Galaxy images look different across spectrum, due to:  Inherent angular resolution of the telescope  Different emission processes

5/24 Astronomical data: original form  Different detector technologies used across the spectrum, yielding different types of data: e.g.  Ultraviolet/optical/infrared  Image: array of pixel values  X-ray  Event list: positions, arrival times, energies of all detected photons  Radio  Interferometric visibilities: sparse Fourier transform of a region of the sky

6/24 Astronomical data: final form  Most research done using catalogue data  i.e. tables of attributes of detected sources – mainly discrete sources (stars, galaxies, etc)  Data compression  Catalogue – few % of image data volume  Amenable to representation in relational DB  Natural indexing by location in sky  …but original data products (images, spectra, event lists) sometimes needed

7/24 Astronomical databases  Telescope archives  Heterogeneous collections of raw data files from all observations taken  Download data for reduction and analysis  Sky survey archives  Homogeneous data and pipeline reduction  “Science Archive” – do science on DB  Bibliographic archives – scans of journals

8/24 Astronomical data processing  Data reduction  Remove instrumental signatures from raw data and produce “science-ready” data  Software packages written for specific instruments  Data analysis  Derive scientific results from science-ready data products – e.g. statistical analyses  Some astro-specific packages/environments – e.g. IRAF  Some use of programming languages  Fortran, C/C++, Python, Java  Some use of commercial packages  e.g. Interactive Data Language (IDL)

9/24 Outline  Data and databases in astronomy  Case Study: UKIDSS  Introduction to UKIDSS  Data life-cycle in UKIDSS  Provenance in UKIDSS  Conclusions

10/24 UK Infrared Deep Sky Survey  Set of five infrared sky surveys  Covering ~1/6 of the sky  From large/shallow to very small/very deep  See  Observations: using Wide Field Camera (WFCAM) on UK Infrared Telescope (UKIRT) in Hawaii

11/24 UKIDSS data life-cycle (1)  Summit of Mauna Kea  Data acquired from 4 WFCAM detectors  Summit pipeline: instrument health  Data written to LTO tape in NDF format  Tapes couriered to Cambridge weekly  Cambridge  Raw data converted from NDF to FITS  Data reduction pipeline run on nightly basis: ~100Gb/night  Remove instrumental signatures, combine images, detect and classify objects, calibrate positions & fluxes

12/24 UKIDSS data life-cycle (2)  Edinburgh  Ingest data from Cambridge: catalogues into RDBMS; image metadata into RDBMS; images on disk  Combine data from multiple nights: generate new catalogues from stacked images  Prepare release databases for WFCAM Science Archive (WSA): see  Users worldwide  Extract raw images from Cambridge  Extract images and catalogues in FITS files from Edinburgh  Run queries on catalogues & image metadata in WSA

13/24 Provenance in UKIDSS  Why is provenance important in UKIDSS?  What provenance information is recorded?  How will this be used?...and by whom?  …and is this adequate?

14/24 Importance of provenance  Much UKIDSS science is rare object search  [Figure: plot with axes "Ratio of fluxes in H & K bands" and "Ratio of fluxes in J & H bands"]  Objects with these colours would be very unusual, and possibly very interesting. Are they real?  Need ability to trace back to the reduced image within which the object was detected, and maybe back to the raw image.

15/24 Structure of a FITS file  Primary: Primary Header, Primary Data Array  Extensions: (Header, Data) pairs  Header: composed of 80-character ASCII records  Data units can be images or tables

16/24 FITS header records  Almost all records of the form KEYWORD = 'value' / COMMENT  Some standard keywords defined, but considerable freedom to define new ones  Relevant metadata for particular instruments  Amongst standard set is HISTORY  Format: HISTORY free text  Provenance information can be stored in a series of HISTORY records
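The record layout described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a full FITS reader: it assumes fixed-format records (keyword in columns 1-8, "= " in columns 9-10), and the function name `parse_card` and the sample keywords and values in the usage note are invented for the example.

```python
def parse_card(card):
    """Split one 80-character FITS header record into (keyword, value, comment)."""
    card = card.ljust(80)[:80]          # records are fixed-width
    keyword = card[:8].strip()          # keyword occupies columns 1-8
    if card[8:10] != "= ":
        # No value indicator: commentary record such as HISTORY (free text)
        return keyword, card[8:].strip(), ""
    body = card[10:]
    stripped = body.lstrip()
    if not stripped.startswith("'"):
        # Unquoted value: everything before the first slash
        value, _, comment = body.partition("/")
        return keyword, value.strip(), comment.strip()
    # Quoted string: scan to the closing quote, treating '' as an escaped quote
    chars, i = [], 1
    while i < len(stripped):
        if stripped[i] == "'":
            if stripped[i + 1:i + 2] == "'":
                chars.append("'")
                i += 2
                continue
            break
        chars.append(stripped[i])
        i += 1
    value = "".join(chars).rstrip()
    comment = stripped[i + 1:].partition("/")[2].strip()
    return keyword, value, comment
```

For example, `parse_card("FILTER  = 'K       '           / Filter name")` separates the keyword, the string value and the comment, while `parse_card("HISTORY flat-field applied")` returns the free text as the value. Values are returned as strings; converting them to numbers is left to the caller.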

17/24 UKIDSS FITS files (1)  Raw image files  Primary header: telescope/instrument set-up, observing conditions, target, observational parameters  Primary data array: empty  Extensions: (header,data) pairs for each of four detectors: header has detector-specific metadata; data is compressed image  Header keywords defined in Interface Control Document between Hawaii & Cambridge

18/24 UKIDSS FITS files (2)  Reduced image files  Primary header & data array: metadata propagated from raw data file  Headers of extensions include HISTORY records for data reduction steps run at Cambridge, e.g.
HISTORY :30:02
HISTORY $Id: cir_stage1.c,v /12/15 14:44:04 jim Exp $
HISTORY :31:04
HISTORY $Id: cir_qblkmed.c,v /08/12 14:35:19 jim Exp $
HISTORY :32:36
HISTORY $Id: cir_xtalk.c,v /10/17 14:58:50 jim Exp $
HISTORY :01:58
HISTORY $Id: cir_arith.c,v /02/25 10:14:55 jim Exp $
Labels on the slide: What, When, Who
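The `$Id: ... $` strings above look like RCS/CVS identification keywords, which have the general form `$Id: file,v revision date time author state $`. Assuming that format, the "what, when, who" fields could be pulled out as sketched below. The pattern, the function name `parse_id_record` and the sample record in the usage note are illustrative assumptions, not part of the UKIDSS pipeline; the records on the slide are truncated in this transcript and would not match as they stand.

```python
import re

# Assumed RCS/CVS layout: $Id: <file>,v <revision> <yyyy/mm/dd> <hh:mm:ss> <author> <state> $
ID_PATTERN = re.compile(
    r"\$Id: (?P<module>\S+),v (?P<revision>\S+) "
    r"(?P<date>\d{4}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<author>\S+) (?P<state>\S+) \$"
)

def parse_id_record(text):
    """Return the fields of an $Id$ keyword as a dict, or None if the text does not match."""
    match = ID_PATTERN.search(text)
    return match.groupdict() if match else None
```

Called on an invented record such as `"HISTORY $Id: cir_example.c,v 1.1 2000/01/01 12:00:00 someone Exp $"`, this returns the module name, revision, date, time, author and state. Note that under this assumption the date and time in an `$Id$` string describe the last check-in of the source file, not the moment the pipeline ran.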

19/24 UKIDSS FITS files (3)  Catalogue files  Primary header: metadata propagated from raw image  Primary data array: empty  Headers of extensions include metadata for catalogue generation process – invocations of software modules in HISTORY records, with parameter values in separate records  Header keywords for both reduced images and catalogues are defined in an Interface Control Document between Cambridge & Edinburgh
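One possible way to record a module invocation and its parameter values as header records is sketched below. The layout (one record for the invocation, one per parameter, all written as HISTORY records) is an assumption made for illustration; the actual UKIDSS layout is defined in the Interface Control Document and is not reproduced here. The function name, module name and parameters are hypothetical.

```python
def history_cards(module, timestamp, params):
    """Build 80-character HISTORY records for one module invocation.

    The first record names the module and when it ran; each parameter
    value goes in its own record (an assumed layout, for illustration only).
    """
    lines = [f"{timestamp} {module}"]
    lines += [f"  {name} = {value}" for name, value in sorted(params.items())]
    cards = []
    for line in lines:
        # HISTORY text occupies columns 9-80, so wrap long lines at 72 characters
        for start in range(0, max(len(line), 1), 72):
            cards.append(("HISTORY " + line[start:start + 72]).ljust(80))
    return cards
```

For instance, `history_cards("detect_sources", "2000-01-01T00:00:00", {"threshold": 1.5, "minpix": 4})` yields three records, each padded to 80 characters, with the parameters listed in alphabetical order.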

20/24 User access to provenance info  All header records from all FITS files ingested into WSA except HISTORY records  So, users can track provenance through queries against WSA, and can get HISTORY records by downloading files  Hopefully enough to determine whether an unusual object is real, but is this good enough?
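The kind of query meant here can be sketched with an in-memory SQLite database. The two-table schema, the column names, the file names and the flux values below are all hypothetical and are not the WSA schema; the point is only that a join from a catalogue row to its image metadata lets a user trace an unusual object back to the reduced and raw image files.

```python
import sqlite3

def build_demo_db():
    """Create a toy database: one row per image, one row per detected source."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE image (
            image_id INTEGER PRIMARY KEY,
            file_name TEXT,          -- reduced image file (hypothetical name)
            raw_file_name TEXT       -- raw image it was derived from
        );
        CREATE TABLE source (
            source_id INTEGER PRIMARY KEY,
            image_id INTEGER REFERENCES image(image_id),
            flux_j REAL, flux_h REAL, flux_k REAL
        );
        INSERT INTO image VALUES
            (1, 'reduced_001.fit', 'raw_001.fit'),
            (2, 'reduced_002.fit', 'raw_002.fit');
        INSERT INTO source VALUES
            (10, 1, 10.0, 12.0, 13.0),
            (11, 2, 1.0, 9.0, 9.5);
    """)
    return conn

# Select sources with an unusually low J/H flux ratio and trace them to their images
QUERY = """
    SELECT s.source_id, i.file_name, i.raw_file_name
    FROM source AS s JOIN image AS i ON s.image_id = i.image_id
    WHERE s.flux_j / s.flux_h < ?
    ORDER BY s.source_id
"""

def trace_unusual(conn, max_jh):
    """Return (source_id, reduced file, raw file) for sources below the J/H ratio cut."""
    return conn.execute(QUERY, (max_jh,)).fetchall()
```

With the toy data, `trace_unusual(build_demo_db(), 0.5)` picks out source 11 and reports the reduced and raw files it came from. In this sketch the HISTORY records are not in the database, matching the slide: the user would still need to download the named file to read them.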

21/24 Recap: Astronomical data processing  Data reduction  Remove instrumental signatures from raw data and produce “science-ready” data  Software packages written for specific instruments  Data analysis  Derive scientific results from science-ready data products – e.g. statistical analyses  Some astro-specific packages/environments – e.g. IRAF  Some use of programming languages  Fortran, C/C++, Python, Java  Some use of commercial packages  e.g. Interactive Data Language (IDL) ?

22/24 Provenance in data analysis: Two main problems  Less controlled software environment  Little bits of code written for a specific analysis, not tried and tested pipeline modules  Use of data from many sources  UKIDSS/WSA is state-of-the-art for provenance  Many (esp. older) data resources not so good  Provenance of combined dataset only as good as provenance of worst constituent dataset?

23/24 Does this matter?  Provenance information for data analysis is recorded in the journal paper (sort of)  Improving links between online literature and data sources  Increasing importance of large sky surveys with well controlled environments  Moving more of the data analysis from the user’s desktop to the data centre

24/24 Conclusions  Modern sky survey systems record & publish extensive provenance for data reduction  Very little provenance recorded from data analysis – except description in journal paper  More could surely be done – but would researchers support overhead of doing so?  Improvements as more analysis in data centre  Could/should we be doing more?