Data Management Challenges in Gaia. Jose Hernandez, Alexander Hutton. Gaia Science Operations Centre, ESA/ESAC/SRE-OOO
Outline: Gaia Observing strategy; Data flow and Pipelines; Data Challenges; Data Tracking Tools; Examples
Gaia Observing strategy: Survey mission at L2; scan the sky along great circles; accumulate the data on board; download to Earth every night; full sky observed every 6 months; repeat for at least 5 years => 10 full-sky maps
Gaia Observing strategy
Gaia: Some numbers after 5 years: 100 TB of raw data; we expect to observe 10^9 sources (could end up being 2x10^9); spectra for 2x10^8 sources; 80 observations per source on average: 10^11 Astro/Photo observations, 2x10^10 spectra
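The totals on this slide follow from the source counts and the average number of observations per source. A quick back-of-the-envelope check, using only the figures quoted above (the variable names are illustrative):

```python
# Back-of-the-envelope check of the five-year totals quoted on the slide.
sources = 1e9            # expected sources (the slide notes it could reach 2e9)
spectra_sources = 2e8    # sources for which spectra are expected
obs_per_source = 80      # average observations per source

astro_photo_obs = sources * obs_per_source   # 8e10, i.e. of order 1e11
spectra = spectra_sources * obs_per_source   # 1.6e10, i.e. roughly 2e10

print(f"Astro/Photo observations: {astro_photo_obs:.1e}")
print(f"Spectra: {spectra:.1e}")
```

The slide rounds both products up to the nearest convenient figure.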
Operations: Launcher; Satellite; ground stations New Norcia and Cebreros; Mission Operation Centre (MOC) at ESOC; Science Operation Centre (SOC) at ESAC; Data Processing & Analysis Consortium (DPAC)
Data flow: Launcher; Satellite; ground stations New Norcia, Cebreros and Malargüe; Mission Operation Centre (MOC) at ESOC; Science Operation Centre (SOC) at ESAC; Data Processing & Analysis Consortium (DPAC)
Data flow (figure courtesy A. Brown, DPAC)
Data Processing Cycles (diagram labels): Daily Pipelines; MOC; SOC; MDB-00, MDB-01, MDB-02; DPCs; <=8.5 Mbit/s
Some Challenges: Sheer number of observations; ensuring no data loss; managing the daily data flow; data tracking; DPCs are autonomous and geographically distributed
Tools: Data Modeling: Single Data Model/ICD with DPCs. MDB Dictionary Tool on-line: keeps track of versions, changes,…; immediate visibility; automatic generation of DM classes, DB schema, Data Consumers…; DM evolution controlled by CCB
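To illustrate what generating DM classes and a DB schema from a single dictionary entry could look like, here is a minimal sketch. It is not the MDB Dictionary Tool: the entry layout, the names `build_class` and `build_ddl`, the table `AstroObservation` and its fields, and the SQL type mapping are all assumptions made for the example.

```python
from dataclasses import make_dataclass, fields

# Hypothetical dictionary entry: one table with typed fields.
entry = {
    "name": "AstroObservation",
    "fields": [("solutionId", int), ("obmt", int), ("flux", float)],
}

# Assumed mapping from Python types to SQL column types.
SQL_TYPES = {int: "BIGINT", float: "DOUBLE PRECISION", str: "TEXT"}

def build_class(entry):
    """Generate a data-model class from the dictionary entry."""
    return make_dataclass(entry["name"], entry["fields"])

def build_ddl(entry):
    """Generate a CREATE TABLE statement from the same entry."""
    columns = ", ".join(
        f"{name} {SQL_TYPES[typ]}" for name, typ in entry["fields"]
    )
    return f"CREATE TABLE {entry['name']} ({columns});"

AstroObservation = build_class(entry)
obs = AstroObservation(solutionId=1, obmt=1000, flux=3.5)
print(obs)
print(build_ddl(entry))
```

Because the class and the schema come from the same entry, they should not drift apart when the data model changes.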
Data Management and Tracking: All data is tagged with a barcode named the “Solution Identifier”; it is just a long (64-bit) number; each solutionId has some metadata
Data Tracking: solutionId: Used to identify data: who generated the data, when and where; what SW version, environment, run number, at what time. We also use it to manage the daily data flow: related data gets the same solutionId, which is a form of “data binning”
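A minimal sketch of the idea: a registry that hands out 64-bit solutionIds and keeps the metadata for each one. The class names, field names, sample values and the sequential numbering are assumptions for illustration; the slides do not say how the real identifiers are allocated or encoded.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class SolutionMetadata:
    """Hypothetical metadata record attached to one solutionId."""
    producer: str        # who generated the data
    site: str            # where it was generated
    created: datetime    # when it was generated
    sw_version: str
    environment: str
    run_number: int

class SolutionRegistry:
    """Toy in-memory registry mapping solutionId -> metadata."""

    def __init__(self, first_id=1):
        self._next_id = first_id
        self._records = {}

    def new_solution(self, metadata):
        solution_id = self._next_id
        if solution_id >= 2**63:
            raise OverflowError("solutionId must fit in a signed 64-bit long")
        self._next_id += 1
        self._records[solution_id] = metadata
        return solution_id

    def lookup(self, solution_id):
        return self._records[solution_id]

registry = SolutionRegistry()
meta = SolutionMetadata(
    producer="daily-pipeline", site="SOC",
    created=datetime(2015, 1, 1, tzinfo=timezone.utc),  # made-up date
    sw_version="1.0.0", environment="operations", run_number=42,
)
sid = registry.new_solution(meta)

# Every object written by this run carries the same solutionId.
objects = [{"solutionId": sid, "payload": n} for n in range(3)]
print(sid, registry.lookup(sid).sw_version)
```

Tagging every object from one run with the same identifier is what lets related data be handled as a single bin.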
Data Tracking: solutionId: Track data provenance; verify that the correct calibrations get used; find what was affected by incorrect data; remove incorrect data from the pipelines
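One way to find what was affected by incorrect data is to record, for each solutionId, the solutionIds it consumed as inputs, and then walk that graph downstream. The sketch below assumes such an input record exists; the `inputs_of` mapping, the function name and the sample identifiers are hypothetical.

```python
from collections import deque

def affected_solutions(inputs_of, bad_id):
    """Return every solutionId derived, directly or indirectly, from bad_id."""
    # Invert "which inputs did I use" into "who consumed me".
    consumers = {}
    for sid, inputs in inputs_of.items():
        for parent in inputs:
            consumers.setdefault(parent, set()).add(sid)

    # Breadth-first walk over the consumers of the bad solution.
    affected, queue = set(), deque([bad_id])
    while queue:
        current = queue.popleft()
        for child in consumers.get(current, ()):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

# Hypothetical provenance records: solutionId -> input solutionIds.
inputs_of = {
    101: [],           # raw data
    102: [],           # a calibration
    201: [101, 102],   # processed with calibration 102
    202: [101],        # processed without it
    301: [201],        # derived from 201
}

tainted = affected_solutions(inputs_of, 102)
print(sorted(tainted))
```

The returned set is the list of solutions that would need to be removed from the pipelines or reprocessed.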
Data Integrity and Completeness: Current numbers: 10.4x10^9 Astro/Photo observations; 1.3x10^9 spectra; received 6.3 TB of raw science data; 144 GB of housekeeping data; 21 TB generated in the processing. Typically the daily pipelines write thousands of objects per second
Data Integrity and Completeness: Challenges: ensuring there are no data leakages; data consistency and completeness, within the pipelines and with respect to the MDB. (Diagram labels: MOC, SOC, MDB, DPCB, DPCC, DPCG, DPCT, DPCI)
Time Data Binning: All Gaia data can be related to On Board Time, for example: at time x the source image crosses a CCD; at time y charge injections occur; spacecraft attitude. Use OBMT to collapse records of the same time together and count the number of objects per bin
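The binning step can be sketched as follows: each record's on-board time is mapped to a bin index and a counter per bin is incremented. The function name, the bin width and the integer time values are assumptions for illustration.

```python
from collections import Counter

def bin_counts(obmt_values, bin_width):
    """Count records per time bin; bin index = obmt // bin_width."""
    counts = Counter()
    for t in obmt_values:
        counts[t // bin_width] += 1
    return counts

# Hypothetical on-board times in integer ticks, binned in steps of 1000.
obmt = [5, 999, 1_000, 1_500, 2_999, 3_000]
counts = bin_counts(obmt, 1_000)

for bin_index in sorted(counts):
    print(bin_index, counts[bin_index])
```

Only one counter per bin is kept, so the timeline stays small however many objects fall into each bin.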
Time Data Binning
Time Data Binning (plot labels: Galactic Centre crossings; FOV-P; FOV-F; Galaxy tail)
Time Data Binning: Data binning gets done on the fly as the pipeline stores the data, with no overhead. We can then compare the timeline data at different points. We can also check data consistency. All the checks can be automated and alarms raised if problems are found
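Comparing timelines taken at two points in the data flow then reduces to comparing per-bin counts. A minimal sketch, assuming each timeline is a mapping from bin index to object count (the function name and the sample counts are invented):

```python
def compare_timelines(upstream, downstream):
    """Return {bin: (upstream_count, downstream_count)} for bins that disagree."""
    mismatches = {}
    for b in sorted(set(upstream) | set(downstream)):
        up = upstream.get(b, 0)
        down = downstream.get(b, 0)
        if up != down:
            mismatches[b] = (up, down)
    return mismatches

# Hypothetical per-bin counts at two points of the pipeline.
received = {0: 120, 1: 95, 2: 130}
stored = {0: 120, 1: 90, 2: 130}

alarms = compare_timelines(received, stored)
for b, (up, down) in alarms.items():
    print(f"ALARM bin {b}: expected {up} objects, found {down}")
```

An empty result means the two points agree; any entry marks a time bin where objects were lost or duplicated and can trigger an automatic alarm.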
Examples: Omega Centauri
Time Data Binning (plot labels: Galactic Plane; Omega Centauri crossing (FOV-P))
Omega Centauri observation: 100,000 observations; 50 sec
Omega Centauri observation
Questions? (Image: NGC 1818 in LMC)