Operations Hub data flow tools and concepts Reinhard Hanuschik Head, QC and Data Processing Group (QCG)
OPSHUB August 2018: New data processing unit in Vitacura: operations hub OPSHUB several workstations in RAF room, access to opshub1@godot.sc.eso.org multi-processor number cruncher, 250 TB storage powerful platform for data processing projects and tasks
OPSHUB Task: execute data processing project in an efficient and easy way Primary use cases for data processing @OPSHUB: UC1: Process SCIENCE data for defined dates, any instrument UC2: Process recent CALIB data on the HC monitor, to follow up red scores UC3: Compare a new prototype of a recipe, or a new pipeline version, to the operational version
OPSHUB tool: design drivers We have at ESO: Automatic pipelines @ Paranal (but no way to interact) Esorex and reflex support @ Vitacura (but no easy data access, scripting required) QC loop @ Garching [automatic, CALIBs (only)] Easy-to-use general tool to process data? None.
OPSHUB tool: design drivers We want: data access & data delivery classification and association rules & tools pipeline access & interfaces database access easy overview of tasks & products
OPSHUB tool: design drivers We have at QCG: DFS tools providing fundamental tasks like data access, pipeline interfacing, database queries DFOS tools: Designed and maintained by QC group Daily QC workflow on CALIBs: part 1: automatic (24/7) daily CALIB processing workflow up to scores and HC monitor part 2: interactive certification workflow incl. archiving of master calibrations Processing of SCIENCE data (IDPs), for selected instruments: SCIENCE processing including QC and scores
OPSHUB tool: design drivers Existing workflow tools: ‘autoDaily’ for QC: focusses on current CALIB data no SCIENCE, no historical data ‘phoenix’ for SCIENCE (IDPs): fine-tuned for specific IDPs selected instruments (science-grade pipelines) None of them provides what OPSHUB needs → new workflow tool for OPSHUB workflow
OPSHUB tool: design choices Principles: Take what we have (DFS tools, QC tools = “DFOS”) Add a wrapper for the workflow Design a workflow script that encodes the workflow combines existing components knows all subtleties of where to find what … … while the user can focus on results and make decisions Shell script, config files, project files, documentation Easy maintenance by QCG/OPSHUB
Data flows: existing and new [Diagram: flows of RAW files and MCALIBs between Paranal, DTS, NGAS, QCG, OPSHUB and storage]
Data flows: existing and new QC data flow: Once per hour, 24/7: query for new CALIB data in NGAS process them, leave in certification area Review, certify, ingest into NGAS (5/7) Paranal data flow: New raw data flow to DTS and NGAS Pipeline workstation: automatic processing, products left until deletion OPSHUB data flow: No need for a new data channel (historical data: not an option; new data: gains just a few minutes) Get all data from NGAS, on demand (no automatic download)
Data access Data required for association: Headers, no fits files Rules Association tools work on header keywords We download the required headers (instrument, dates) from header repository SAFIQ in Garching On demand only (no automatic process needed) Headers get updated if necessary (hotfly) FITS files downloaded from NGAS upon processing
Creating ABs AB = Association Block Generalized Reduction Block Contains: Grouped RAW FILEs (single, all from template, …) Required MCALIBs (master calibrations) Recipe name, RAW_TYPE etc. AB = the fundamental unit for association and processing
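As a minimal sketch of what an AB carries, the three ingredients listed on the slide can be written as a small record. The class and field names are hypothetical and chosen for illustration only; they are not the actual AB file format used by the DFOS tools.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AssociationBlock:
    """Hypothetical in-memory view of one AB (names are illustrative)."""
    recipe: str                                          # pipeline recipe to call
    raw_type: str                                        # e.g. "WAVE" or "SCIENCE"
    raw_files: List[str] = field(default_factory=list)   # grouped RAW files
    mcalibs: List[str] = field(default_factory=list)     # required master calibrations

    def input_files(self) -> List[str]:
        """All files that must be on disk before the AB can be processed."""
        return self.raw_files + self.mcalibs


# Example with made-up file and recipe names.
ab = AssociationBlock(
    recipe="example_recipe",
    raw_type="SCIENCE",
    raw_files=["raw_1.fits", "raw_2.fits"],
    mcalibs=["master_flat.fits"],
)
print(ab.input_files())
```

Keeping raw files and mcalibs in separate lists mirrors the slide: the raw files define the dataset, the mcalibs are what association has to find for it.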
OPSHUB tool Process CALIB data: Extract instrument and atmospheric conditions Check quality Process SCIENCE data: Apply correction for instrument and atmospheric conditions: mcalibs Extract (concentrate) signal Map from pixels to physical units Data processing in an abstract sense: correct, extract, improve signal From grapes to wine or pisco: distillery
OPSHUB tool distillery Supports various data association schemes Delivers data (headers, fits) in the background Connects to the instrument pipelines, without requiring expert knowledge Provides performant data reduction Parallel processing of 60 or more ABs MUSE processing in 24 parallel threads Data download in up to 8 parallel threads Supports all current VLT instruments Supports multiple pipeline versions Offers storage or comfortable data cleanup Built on data expertise of QC group
Association I: DOWNLOAD Three association methods supported (called AB_METHODs): DOWNLOAD AB download from qcweb server, no AB creation CALIB ABs: the ones executed by QCG Used for certification Have QC information, scores, comments Fine-tuned parameters Also linked to the HC monitor SCIENCE ABs: produced by QCG but never executed Produced to check for completeness of OCA rules Fastest method
Association II: CALSELECTOR Powerful association engine OCA rules: database, archive-based, automatic versioning Performance: runs as a local tool Driven by SCIENCE (not usable for most CALIB projects) No choice of OCA rule 3 CALSEL_MODEs: MASTER (Raw2Master using mcalibs, always certified) RAW_CERTIF (Raw2Raw, using certified raw calibs) RAW_ANY (Raw2Raw, ignoring certification flag) Certification flag used in MASTER and RAW_CERTIF
Association III: CREATE_CASCADE All data of a given DATE are associated Delivers complete ABs only if the cascade for the chosen DATE is complete Useful for testing new pipeline versions, or OCA rules Choice of standard OCA rule (QC_DEFAULT), or modified, local one No certification information evaluated
Association IV: data types Data types can be: CALIB, SCIENCE, ALL CALIB: AB_METHOD DOWNLOAD or CREATE_CASCADE CALIB data defined by DPR.CATG=CALIB/TEST/TECHNICAL SELECT option for DPR_TYPE (e.g. WAVE) SCIENCE: All 3 AB_METHODs: DOWNLOAD, CALSELECTOR, CREATE_CASCADE SELECT option for OBS_PROG_ID, OBS_ID, RAW_TYPE, TPL_ID ALL: First CALIB, then SCIENCE
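The combinations of data type and AB_METHOD listed above can be checked with a small lookup. This is only a sketch: the function name is hypothetical, and the data type ALL is deliberately left out because the slide does not state which methods it accepts.

```python
# Allowed AB_METHODs per data type, as listed on the slide.
ALLOWED_METHODS = {
    "CALIB": {"DOWNLOAD", "CREATE_CASCADE"},
    "SCIENCE": {"DOWNLOAD", "CALSELECTOR", "CREATE_CASCADE"},
}


def method_allowed(mode: str, ab_method: str) -> bool:
    """Return True if ab_method is listed for the given data type.

    Hypothetical helper; raises ValueError for data types that this
    sketch does not cover (for example ALL).
    """
    if mode not in ALLOWED_METHODS:
        raise ValueError(f"data type not covered by this sketch: {mode}")
    return ab_method in ALLOWED_METHODS[mode]


print(method_allowed("SCIENCE", "CALSELECTOR"))
print(method_allowed("CALIB", "CALSELECTOR"))
```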
AB_METHOD=CALSELECTOR CALSEL_MODE=MASTER SCIENCE Mcalibs taken from NGAS
AB_METHOD=CALSELECTOR CALSEL_MODE=RAW_CERTIF or RAW_ANY CALIB SCIENCE
End of part 1: motivation data flow association schemes and cascades OPSHUB tool distillery Part 2: Processing Workflow Projects Data downloads
Processing Cascade: needed also for processing (dependencies, efficiency) configuration of DRS (data reduction system): CON (HTCondor): system for (massively) parallel execution, respecting dependencies (cascade) CPL: simple serial processing (one after the other, still needs dependencies) [name chosen for history only] INT: internal parallelization (for the tools this is like CPL, but for the pipeline this is a mode different from CPL and CON) Standard case: CONDOR Up to N parallel jobs, N being limited by the number of cores, memory etc. (godot: about 60)
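The idea of running ABs in parallel while respecting the cascade can be illustrated with a plain topological sort. This sketch uses Python's standard graphlib (Python 3.9 or later) and is not how HTCondor schedules jobs; the AB names and the dependency graph are made up for the example.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def execution_waves(deps: Dict[str, Set[str]], max_parallel: int) -> List[List[str]]:
    """Group jobs into waves that can run in parallel.

    deps maps each job to the set of jobs it depends on. A job enters a
    wave only after all its dependencies sit in earlier waves, and no wave
    holds more than max_parallel jobs.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all jobs whose inputs exist
        for i in range(0, len(ready), max_parallel):
            waves.append(ready[i:i + max_parallel])
        sorter.done(*ready)
    return waves


# Illustrative cascade: SCIENCE needs FLAT and WAVE, which both need BIAS.
cascade = {
    "BIAS": set(),
    "FLAT": {"BIAS"},
    "WAVE": {"BIAS"},
    "SCIENCE": {"FLAT", "WAVE"},
}
print(execution_waves(cascade, max_parallel=60))
print(execution_waves(cascade, max_parallel=1))
```

With max_parallel=1 the same function yields a serial order that still respects the dependencies, which corresponds to the point made for the CPL setting.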
PROJECTS I Typical job: Process all SCIENCE KMOS data from 2018-07-01, with the entire calibration cascade required for the science data, since we suspect a quality issue with IFU illumination Required specifications: INSTRUMENT=KMOS MODE=SCIENCE DATE=2018-07-01 AB_METHOD=CALSELECTOR (a choice) CALSEL_MODE=RAW_CERTIF (a choice) SELECT=ALL (no specific OB or PROG_ID) Specifying all these parameters on the command line? Better: define a PROJECT
PROJECTS II distillery project definition file: One or several lines in ~/config/projects.distillery PROJECT_NAME unique at runtime Tool takes all required information from this entry Can be one or several dates Can be a full month
One date:
PROJECT_NAME INSTRUMENT MODE DATE SELECT AB_METHOD CALSEL_MODE OCA_RLS_METHOD ACCEPT_060
KMOS_PROJECT1 KMOS SCIENCE 2018-07-01 ANY CALSELECTOR RAW_CERTIF NONE NO
Several dates:
PROJECT_NAME INSTRUMENT MODE DATE SELECT AB_METHOD CALSEL_MODE OCA_RLS_METHOD ACCEPT_060
KMOS_PROJECT1 KMOS SCIENCE 2018-07-01 ANY CALSELECTOR RAW_CERTIF NONE NO
(additional DATE: 2018-07-03)
Full month:
PROJECT_NAME INSTRUMENT MODE DATE SELECT AB_METHOD CALSEL_MODE OCA_RLS_METHOD ACCEPT_060
KMOS_PROJECT1 KMOS SCIENCE 2018-07 ANY CALSELECTOR RAW_CERTIF NONE NO
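To make the structure of a project entry concrete, the sketch below reads one complete entry into named fields and expands a full-month DATE into single dates. It assumes whitespace-separated values with all nine columns filled, which is an assumption about the file format and not taken from the tool; both function names are hypothetical.

```python
import calendar
from datetime import date
from typing import Dict, List

# Column names as shown in the project table.
COLUMNS = [
    "PROJECT_NAME", "INSTRUMENT", "MODE", "DATE", "SELECT",
    "AB_METHOD", "CALSEL_MODE", "OCA_RLS_METHOD", "ACCEPT_060",
]


def parse_project_line(line: str) -> Dict[str, str]:
    """Split one entry into named fields (assumes all columns are filled)."""
    values = line.split()
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} values, got {len(values)}")
    return dict(zip(COLUMNS, values))


def expand_date(spec: str) -> List[str]:
    """Turn YYYY-MM-DD into one date and YYYY-MM into every date of that month."""
    parts = [int(p) for p in spec.split("-")]
    if len(parts) == 3:
        return [date(*parts).isoformat()]
    if len(parts) == 2:
        year, month = parts
        ndays = calendar.monthrange(year, month)[1]
        return [date(year, month, day).isoformat() for day in range(1, ndays + 1)]
    raise ValueError(f"unsupported DATE value: {spec}")


project = parse_project_line(
    "KMOS_PROJECT1 KMOS SCIENCE 2018-07 ANY CALSELECTOR RAW_CERTIF NONE NO"
)
dates = expand_date(project["DATE"])
print(project["AB_METHOD"], len(dates), dates[0], dates[-1])
```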
PROJECTS III Mapping of use cases into project files: UC1: Process SCIENCE data for defined dates for a given instrument Standard AB_METHOD is CALSELECTOR DOWNLOAD is also OK if MCALIBs are to be used, and it is faster
PROJECT_NAME INSTRUMENT MODE DATE SELECT AB_METHOD CALSEL_MODE OCA_RLS_METHOD ACCEPT_060
KMOS_PROJECT1 KMOS SCIENCE 2018-07-01 ANY CALSELECTOR MASTER NONE NO
PROJECT_NAME INSTRUMENT MODE DATE SELECT AB_METHOD CALSEL_MODE OCA_RLS_METHOD ACCEPT_060
UVES_PROJECT1 UVES SCIENCE 2016-06-02 ANY DOWNLOAD NONE NO
Another example, with OBS_ID filtering:
PROJECT_NAME INSTRUMENT MODE DATE SELECT AB_METHOD CALSEL_MODE OCA_RLS_METHOD ACCEPT_060
GIRAFFE_GAIA1 GIRAFFE SCIENCE 2016-11-15 OBS_ID=2005621 DOWNLOAD NONE NO
PROJECTS IV UC2: Process recent CALIB data with red scores on the HC monitor Standard AB_METHOD is DOWNLOAD CALSELECTOR cannot be used for CALIB data Works for data for which QCG has done certification already AB_METHOD CREATE_CASCADE is the only method for very recent data Limited by completeness of daytime calibrations
PROJECT_NAME INSTRUMENT MODE DATE SELECT AB_METHOD CALSEL_MODE OCA_RLS_METHOD ACCEPT_060
XSHOOTER_PROJECT1 XSHOOTER CALIB 2018-08-01 ANY DOWNLOAD NONE NO
PROJECT_NAME INSTRUMENT MODE DATE SELECT AB_METHOD CALSEL_MODE OCA_RLS_METHOD ACCEPT_060
MUSE_PROJECT1 MUSE CALIB 2018-10-11 (today) ANY CREATE_CASCADE NONE QC_DEFAULT NO
PROJECTS V UC3: Compare a new prototype of a recipe, or a new pipeline version, to the operational version Standard AB_METHOD is DOWNLOAD Do the processing with pipe_v3.1, then store products under this PROJECT_NAME Then define a second project, configure pipe_v3.5, execute Products are now ready to be compared
PROJECT_NAME INSTRUMENT MODE DATE SELECT AB_METHOD CALSEL_MODE OCA_RLS_METHOD ACCEPT_060
VISIR_PROJECT_pipe_v3.1 VISIR ANY 2012-11-23 DOWNLOAD NONE NO
PROJECT_NAME INSTRUMENT MODE DATE SELECT AB_METHOD CALSEL_MODE OCA_RLS_METHOD ACCEPT_060
VISIR_PROJECT_pipe_v3.5 VISIR ANY 2012-11-23 DOWNLOAD NONE NO
Calling distillery I distillery -p <project> -C [-i -V] call the first step: creation of ABs -i: interactive (after major steps, the tool waits for confirmation) -V: verbose (detailed logging, e.g. of file names) after executing -C: headers are downloaded, ABs are created, processing queue is created and waits for execution you can check everything without actually downloading fits files and without calling the pipeline
Calling distillery II distillery -p <project> -P starts from -C: downloads fits files processes ABs in proper sequence as many in parallel as possible or as configured after execution: all products are collected in $DFS_PRODUCT available for inspection, collection of QC parameters etc. also possible: no parameters, the same as C+P in sequence distillery -p <project> [-i -V]
Calling distillery III distillery -p <project> -M [-i -V] storing of results for deeper analysis on data storage, under project name all products, logs, monitors on demand also all raw fits files distillery -p <project> -X delete results, no storage distillery -p <project> -Y delete a project that was previously stored with -M
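For scripted use, the calls shown on the three "Calling distillery" slides can be assembled as argument lists. This is a sketch only: the helper and its step names are hypothetical, the meaning of -X and -Y follows the order in which they appear on the slide, and -i/-V are accepted only for the calls where the slides show them. Nothing is executed here.

```python
from typing import List, Optional

# Step flags as they appear on the slides; the keys are illustrative names.
STEP_FLAGS = {
    "create": "-C",         # create ABs and the processing queue
    "process": "-P",        # download fits files and process ABs
    "store": "-M",          # store results under the project name
    "delete": "-X",         # delete results, no storage
    "delete_stored": "-Y",  # delete a project previously stored with -M
}

# Calls for which the slides show the optional [-i -V].
OPTIONS_SHOWN = {None, "create", "store"}


def distillery_argv(project: str, step: Optional[str] = None,
                    interactive: bool = False, verbose: bool = False) -> List[str]:
    """Build the argument list for one distillery call (does not run it)."""
    if (interactive or verbose) and step not in OPTIONS_SHOWN:
        raise ValueError("slides do not show -i/-V for this step")
    argv = ["distillery", "-p", project]
    if step is not None:
        argv.append(STEP_FLAGS[step])
    if interactive:
        argv.append("-i")
    if verbose:
        argv.append("-V")
    return argv


print(distillery_argv("KMOS_PROJECT1", "create", verbose=True))
print(distillery_argv("KMOS_PROJECT1"))
```

Such a list could be handed to subprocess.run on a machine where distillery is installed; whether the tool accepts -i and -V with the other steps is not stated in the slides.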
Workflows I: method DOWNLOAD distillery -p <project> Steps: -C, -P, …, -M/-X/-Y
Workflows II: method CALSELECTOR distillery -p <project> Steps: -C, -P, …, -M/-X/-Y
Workflows III: method CREATE_CASCADE distillery -p <project> Steps: -C, -P, …, -M/-X/-Y
Workflows IV: step by step Example: AB_METHOD=CALSELECTOR
- download all headers for DATE and instrument
- call CalSelector tool
- filter for OBS_ID etc. if configured
- download all headers listed in the CalSelector xml results
- create ABs from CalSelector xml files
- create the processing jobs and queue
- execute the queue (download fits first)
- enjoy products
- clean up or store results
Data downloads: headers All classification done on headers bulk download: All headers for raw files of a specified date and instrument Used for classification File by file: Headers of mcalibs (as read from CalSelector xml files) Headers of static calibrations (as read from isql query) No need for “calibDb” as on Paranal CalSelector works on database with ALL mcalibs ever ingested
Data downloads: fits files Fits files downloaded for processing Initial bulk download: Before start of AB processing queue with N jobs, better to have initial dataset of fits files downloaded in controlled manner Done in batches of 8 download jobs (load balance) The first 50 raw files are downloaded this way (if configured, all raw files are downloaded) Then, all required mcalibs are downloaded later: incremental, file by file download of missing fits files when needed in AB, transparent in background
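The initial bulk download described above can be sketched as a simple planning step: take the first 50 raw files (or all, if configured), split them into batches of 8, and leave the rest for incremental download. The function name and file names are hypothetical; the numbers 8 and 50 come from the slide.

```python
from typing import List, Tuple


def plan_initial_download(raw_files: List[str], batch_size: int = 8,
                          initial_limit: int = 50,
                          download_all: bool = False) -> Tuple[List[List[str]], List[str]]:
    """Return (batches for the initial download, files left for later).

    Each batch holds at most batch_size files, i.e. one round of parallel
    download jobs. Files beyond initial_limit are deferred unless
    download_all is set.
    """
    selected = list(raw_files) if download_all else list(raw_files[:initial_limit])
    batches = [selected[i:i + batch_size] for i in range(0, len(selected), batch_size)]
    deferred = list(raw_files[len(selected):])
    return batches, deferred


files = [f"raw_{i:03d}.fits" for i in range(60)]
batches, deferred = plan_initial_download(files)
print(len(batches), len(batches[-1]), len(deferred))
```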
End of part 2 Processing Workflow Projects Data downloads Part 3: OCA rules Tool monitors Technical aspects
OCA rules I OCA (organization, classification, association) rules: set of rules defining the relationship between raw and product files in the ESO data flow system three steps: Classification: rules to define the types of input (raw) data, based on fits keys (typically the DPR keys DPR.CATG, DPR.TYPE, DPR.TECH) Organization: rules for the datasets (grouping of input files). The most prominent grouping rules are: single, and template (TPL.START) Association: rules about the processing of the datasets. The association rules define which mcalibs have to be found, including validity rules and match rules.
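The first two OCA steps can be pictured with plain dictionaries standing in for FITS headers. This is not OCA rule syntax, only a sketch of the idea: classification reads the DPR keys, organization groups files by TPL.START. File names, header values and function names are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Header = Dict[str, str]


def classify(header: Header) -> Tuple[str, str, str]:
    """Classification: the type of a raw file is read from its DPR keys."""
    return (header["DPR.CATG"], header["DPR.TYPE"], header["DPR.TECH"])


def organize_by_template(headers: Dict[str, Header]) -> Dict[str, List[str]]:
    """Organization: group files that share the same TPL.START."""
    groups = defaultdict(list)
    for name, header in headers.items():
        groups[header["TPL.START"]].append(name)
    return dict(groups)


# Made-up headers for three raw files from two templates.
headers = {
    "a.fits": {"DPR.CATG": "CALIB", "DPR.TYPE": "WAVE", "DPR.TECH": "IFU",
               "TPL.START": "2018-07-01T10:00:00"},
    "b.fits": {"DPR.CATG": "CALIB", "DPR.TYPE": "WAVE", "DPR.TECH": "IFU",
               "TPL.START": "2018-07-01T10:00:00"},
    "c.fits": {"DPR.CATG": "SCIENCE", "DPR.TYPE": "OBJECT", "DPR.TECH": "IFU",
               "TPL.START": "2018-07-01T23:30:00"},
}
print(classify(headers["a.fits"]))
print(organize_by_template(headers))
```

The third step, association, would then look up the required mcalibs for each group, subject to the validity and match rules.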
OCA rules II CALSELECTOR rules come versioned (by epoch) SCIENCE only http://www.eso.org/qc/ALL/OCA/oca_rule_sets.html Always go to certified calibrations (raw or master), unless RAW_ANY is chosen in projects.distillery This is the standard method for most SCIENCE use cases Other: QC_DEFAULT rules (the ones used by QCG) More flexible: can be used to associate any data (e.g. CALIB) Even a local, modified OCA rule can be used, like myXSHOOTER.RLS
TOOL MONITORS I distillery tool: Calls internally some QC monitoring tools giving a process and status overview OPSHUB project monitor (called dfoMonitor internally)
TOOL MONITORS II Project monitor Project overview and status Yellow: waiting (after –C) Green: processed (after –P)
TOOL MONITORS III One project monitor per instrument Usually auto-updated Other instrument: jump menu Call/refresh on command-line: dfoMonitor -i <instr> Related links:
TOOL MONITORS IV Processing monitor (called AB monitor internally): Overview of all current ABs Status and logs Connects to AB-related info Cloned from similar tool for QCG
TOOL MONITORS V Processing monitor: Clickable links to ABs, logs, calibration maps, product directories Collection of historical QC info (for CALIBs) ! This is historical QC content from qcweb server (NOT scored or certified by distillery)
TOOL MONITORS VI Processing monitor, expert mode: Sortable Filter/select Stop auto-refresh if you search
TOOL MONITORS VII Processing monitor, SCIENCE: If CALSELECTOR chosen: checks for completeness, validity, certified flag X – XML file: assocs complete & certified T – TXT file: assocs within calib_plan validity VIRT: warning for VIRT calibrations (ignore)
TOOL MONITORS VIII Processing monitor, DATE selection: SCIENCE ABs always from selected DATE CALIB ABs: related to SCIENCE ABs, they could also come from earlier or later DATEs, marked then If SCIENCE-driven (CALSELECTOR modes): CALIBs in general NOT complete for selected DATE, only for selected SCIENCE In particular: HC etc. might be missing
TOOL MONITORS IX Processing monitor: more links ‘distillery’ processing log file This PROJECT has 3 nights of SCIENCE, uses CALSELECTOR, RAW_ANY mode Data directories, OCA rules …
‘distillery’ processing log file
TOOL MONITORS X Processing monitor: compare OPSHUB and QCG results Link to corresponding page on QCWEB server QCG ‘autoDaily’ results (daily QC processing) OPSHUB ‘distillery’ results
TOOL MONITORS XI Project and proc. monitors: Think of 2 levels Down … … and up again Purpose: giving overview of processing Access to data products
Other functions distillery -h quick help on the command line distillery -E overview of installed & configured pipelines, their recipes, their processing parameters distillery -D menu for special download options headers, raw fits files, master calibration files (latest version of mcalibs for refresh of Paranal calibDB)
TECHNICAL ASPECTS I config file for distillery: $HOME/config/config.distillery Key information for tool: Per instrument Pipeline file name conventions Subtleties Maintained by distillery maintainer (NOT normal user)
TECHNICAL ASPECTS II project file for distillery: $HOME/config/projects.distillery Key information for projects: INSTRUMENT DATEs MODE (CALIB/SCIENCE/ALL) AB_METHOD … Maintained by user
DOCUMENTATION http://www.eso.org/~qc/dfos/opshub.html Which is a branch of the DFOS web site: http://www.eso.org/~qc/dfos/details.html Quick user guide
distillery: what we have data access & data delivery √ classification and association rules & tools √ pipeline access and interfaces √ database access √ overview of tasks and products √ Generic and very flexible platform to process VLT data from raw to products
What distillery can’t do No QC info extracted, no scores (but extraction from QC site is done) All data processing is project driven, no stream (yet, TBD) will be implemented as recurrent projects [Note added: implemented with v1.1] Data are delivered automatically within a project, but not as an external, automatic flow
Known issues CalSelector uses OCA ruleset #1, but for ABs to be created we use OCA ruleset #2; the two are not 100% identical (will disappear soon, with new version of Abbuilder) We cannot currently map complex science processing cascades like for MUSE (just the first 2 steps out of 4 or 5) (simple treatment now with little scripts called PGIs)
Current implementation User opshub1 @ godot.sc.eso.org More users (opshub2,…) planned Also planned: project or personal accounts Thank you and … Happy distilling!
HINTS Choice of AB_METHOD depends on USE case How to get to QC1 parameters Select products Need some knowledge about QC parameter names etc. Best in stored PROJECT but possible as well in $DFS_PRODUCT area
ASSOCIATION V: cascade CALIB SCIENCE raw products
AB_METHOD=CREATE_CASCADE CALIB AB_METHOD=CREATE_CASCADE SCIENCE raw products
AB_METHOD=DOWNLOAD MODE=SCIENCE SCIENCE Mcalibs taken from NGAS