1 Operations Hub data flow tools and concepts
Reinhard Hanuschik Head, QC and Data Processing Group (QCG)

2 OPSHUB
August 2018: New data processing unit in Vitacura: operations hub OPSHUB
- several workstations in RAF room, access to multi-processor number cruncher, 250 TB storage
- powerful platform for data processing projects and tasks

3 OPSHUB
Task: execute data processing project in an efficient and easy way
Primary use cases for data:
- UC1: Process SCIENCE data for defined dates, any instrument
- UC2: Process recent CALIB data on the HC monitor, to follow up red scores
- UC3: Compare a new prototype of a recipe, or a new pipeline version, to the operational version

4 OPSHUB tool: design drivers
We have at ESO:
- Automatic Paranal (but no way to interact)
- Esorex and reflex Vitacura (but no easy data access, scripting required)
- QC Garching [automatic, CALIBs (only)]
Easy-to-use general tool to process data? None.

5 OPSHUB tool: design drivers
We want:
- data access & data delivery
- classification and association rules & tools
- pipeline access & interfaces
- database access
- easy overview of tasks & products

6 OPSHUB tool: design drivers
We have at QCG:
- DFS tools providing fundamental tasks like data access, pipeline interfacing, database queries
- DFOS tools: designed and maintained by QC group
  - Daily QC workflow on CALIBs:
    - part 1: automatic (24/7) daily CALIB processing workflow up to scores and HC monitor
    - part 2: interactive certification workflow incl. archiving of master calibrations
  - Processing of SCIENCE data (IDPs), for selected instruments: SCIENCE processing including QC and scores

7 OPSHUB tool: design drivers
Existing workflow tools:
- ‘autoDaily’ for QC:
  - focusses on current CALIB data
  - no SCIENCE, no historical data
- ‘phoenix’ for SCIENCE (IDPs):
  - fine-tuned for specific IDPs
  - selected instruments (science-grade pipelines)
None of them provides what OPSHUB needs → new workflow tool for OPSHUB workflow

8 OPSHUB tool: design choices
Principles:
- Take what we have (DFS tools, QC tools = “DFOS”)
- Add a wrapper for the workflow
- Design a workflow script that
  - encodes the workflow
  - combines existing components
  - knows all subtleties of where to find what …
  - … while the user can focus on results and make decisions
- Shell script, config files, project files, documentation
- Easy maintenance by QCG/OPSHUB

9 Data flows: existing and new
[Diagram: data flow between Paranal, DTS, NGAS, QCG and OPSHUB storage; labels: RAW, MCALIBs]

10 Data flows: existing and new
QC data flow:
- Once per hour, 24/7: query for new CALIB data in NGAS
- process them, leave in certification area
- Review, certify, ingest into NGAS (5/7)
Paranal data flow:
- New raw data flow to DTS and NGAS
- Pipeline workstation: automatic processing, products left until deletion
OPSHUB data flow:
- No need for a new data channel (historical data: not an option; new data: gains just a few minutes)
- Get all data from NGAS, on demand (no automatic download)

11 Data access
Data required for association:
- Headers, no fits files
- Rules
Association tools work on header keywords
- We download the required headers (instrument, dates) from header repository SAFIQ in Garching
- On demand only (no automatic process needed)
- Headers get updated if necessary (hotfly)
FITS files downloaded from NGAS upon processing

12 Creating ABs
AB = Association Block (generalized Reduction Block)
Contains:
- Grouped RAW FILEs (single, all from template, …)
- Required MCALIBs (master calibrations)
- Recipe name, RAW_TYPE etc.
AB = the fundamental unit for association and processing
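As a rough illustration of what an AB bundles, the sketch below models it as a small Python data class. The class name, field names, recipe name and file names are assumptions made for this example; the slides do not show the actual AB file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AssociationBlock:
    """Illustrative AB model; field names are assumptions, not the real AB format."""
    recipe: str                                           # pipeline recipe to run
    raw_type: str                                         # classification of the raw files
    raw_files: List[str] = field(default_factory=list)    # grouped RAW files
    mcalibs: List[str] = field(default_factory=list)      # required master calibrations

    def inputs(self) -> List[str]:
        """Every file that must be on disk before the recipe can run."""
        return self.raw_files + self.mcalibs


ab = AssociationBlock(
    recipe="example_recipe",                  # hypothetical recipe name
    raw_type="WAVE",
    raw_files=["raw_0001.fits", "raw_0002.fits"],
    mcalibs=["master_bias.fits"],
)
print(ab.inputs())
```

Keeping raw files and master calibrations in one object is what lets a single AB serve as the unit for both association and processing.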

13 OPSHUB tool
Process CALIB data:
- Extract instrument and atmospheric conditions
- Check quality
Process SCIENCE data:
- Apply correction for instrument and atmospheric conditions: mcalibs
- Extract (concentrate) signal
- Map from pixels to physical units
Data processing in an abstract sense: correct, extract, improve signal
From grapes to wine or pisco: distillery

14 OPSHUB tool distillery
- Supports various data association schemes
- Delivers data (headers, fits) in the background
- Connects to the instrument pipelines, without requiring expert knowledge
- Provides performant data reduction
  - Parallel processing of 60 or more ABs
  - MUSE processing in 24 parallel threads
  - Data download in up to 8 parallel threads
- Supports all current VLT instruments
- Supports multiple pipeline versions
- Offers storage or comfortable data cleanup
- Built on data expertise of QC group

15 Association I: DOWNLOAD
Three association methods supported (called AB_METHODs):
DOWNLOAD
- AB download from qcweb server, no AB creation
- CALIB ABs: the ones executed by QCG
  - Used for certification
  - Have QC information, scores, comments
  - Fine-tuned parameters
  - Also linked to the HC monitor
- SCIENCE ABs: produced by QCG but never executed
  - Produced to check for completeness of OCA rules
- Fastest method

16 Association II: CALSELECTOR
CALSELECTOR
- Powerful association engine
- OCA rules: database, archive-based, automatic versioning
- Performance: runs as a local tool
- Driven by SCIENCE (not usable for most CALIB projects)
- No choice of OCA rule
- 3 CALSEL_MODEs:
  - MASTER (Raw2Master using mcalibs, always certified)
  - RAW_CERTIF (Raw2Raw, using certified raw calibs)
  - RAW_ANY (Raw2Raw, ignoring certification flag)
- Certification flag used in MASTER and RAW_CERTIF
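The three CALSEL_MODEs differ only in which calibration candidates they accept. A minimal sketch of that filter follows; the candidate dictionaries and their keys (`name`, `kind`, `certified`) are hypothetical and only serve to illustrate the rule, not how CalSelector is implemented.

```python
from typing import Dict, List


def select_calibs(candidates: List[Dict], calsel_mode: str) -> List[Dict]:
    """Filter calibration candidates by CALSEL_MODE (candidate keys are hypothetical)."""
    if calsel_mode == "MASTER":
        # Raw2Master: master calibrations only, which are always certified
        return [c for c in candidates if c["kind"] == "master" and c["certified"]]
    if calsel_mode == "RAW_CERTIF":
        # Raw2Raw: raw calibrations, certified ones only
        return [c for c in candidates if c["kind"] == "raw" and c["certified"]]
    if calsel_mode == "RAW_ANY":
        # Raw2Raw: raw calibrations, certification flag ignored
        return [c for c in candidates if c["kind"] == "raw"]
    raise ValueError(f"unknown CALSEL_MODE: {calsel_mode}")


candidates = [
    {"name": "master_flat.fits", "kind": "master", "certified": True},
    {"name": "raw_flat_1.fits", "kind": "raw", "certified": True},
    {"name": "raw_flat_2.fits", "kind": "raw", "certified": False},
]
for mode in ("MASTER", "RAW_CERTIF", "RAW_ANY"):
    print(mode, [c["name"] for c in select_calibs(candidates, mode)])
```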

17 Association III: CREATE_CASCADE
CREATE_CASCADE
- All data of a given DATE are associated
- Delivers complete ABs only if the cascade for the chosen DATE is complete
- Useful for testing new pipeline versions, or OCA rules
- Choice of standard OCA rule (QC_DEFAULT), or modified, local one
- No certification information evaluated

18 Association IV: data types
Data types can be: CALIB, SCIENCE, ALL
CALIB:
- AB_METHOD DOWNLOAD or CREATE_CASCADE
- CALIB data defined by DPR.CATG=CALIB/TEST/TECHNICAL
- SELECT option for DPR_TYPE (e.g. WAVE)
SCIENCE:
- All 3 AB_METHODs: DOWNLOAD, CALSELECTOR, CREATE_CASCADE
- SELECT option for OBS_PROG_ID, OBS_ID, RAW_TYPE, TPL_ID
ALL:
- First CALIB, then SCIENCE
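The combinations of data type and AB_METHOD listed on this slide can be written down as a small lookup table. The sketch below is only an illustration: the function names are made up, and treating ALL as "the method must be valid for CALIB and for SCIENCE" is an assumption, since the slide only states the processing order for ALL.

```python
ALLOWED_AB_METHODS = {
    "CALIB": {"DOWNLOAD", "CREATE_CASCADE"},
    "SCIENCE": {"DOWNLOAD", "CALSELECTOR", "CREATE_CASCADE"},
}


def expand_mode(mode):
    """ALL means: first CALIB, then SCIENCE."""
    if mode == "ALL":
        return ["CALIB", "SCIENCE"]
    if mode in ALLOWED_AB_METHODS:
        return [mode]
    raise ValueError(f"unknown MODE: {mode}")


def is_valid(mode, ab_method):
    """Assumption: for ALL, the method must be allowed for both data types."""
    return all(ab_method in ALLOWED_AB_METHODS[m] for m in expand_mode(mode))


print(is_valid("SCIENCE", "CALSELECTOR"))
print(is_valid("CALIB", "CALSELECTOR"))
```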

19 AB_METHOD=CALSELECTOR CALSEL_MODE=MASTER SCIENCE
Mcalibs taken from NGAS

20 AB_METHOD=CALSELECTOR CALSEL_MODE=RAW_CERTIF or RAW_ANY
CALIB SCIENCE

21 End of part 1
- motivation
- data flow
- association schemes and cascades
- OPSHUB tool distillery
Part 2:
- Processing
- Workflow
- Projects
- Data downloads

22 Processing
Cascade: needed also for processing (dependencies, efficiency)
Configuration of DRS (data reduction system):
- CON (HTCondor): system for (massively) parallel execution, respecting dependencies (cascade)
- CPL: simple serial processing (one after the other, still needs dependencies) [name chosen for history only]
- INT: internal parallelization (for the tools this is like CPL, but for the pipeline this is a mode different from CPL and CON)
Standard case: CONDOR
- Up to N parallel jobs, N being limited by the number of cores, memory etc. (godot: about 60)
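To make the idea of "parallel execution, respecting dependencies" concrete, here is a small sketch using Python's standard `graphlib` module (Python 3.9 or later). It is not how HTCondor schedules jobs: it simply groups ABs into waves whose dependencies are already finished and caps each wave at N jobs. The AB names and the dependency map are invented for the example.

```python
from graphlib import TopologicalSorter


def cascade_batches(deps, max_parallel):
    """Group ABs into batches that respect the cascade.

    deps maps each AB name to the set of ABs whose products it needs.
    Each batch holds at most max_parallel ABs that could run side by side.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())          # all ABs whose inputs exist now
        for i in range(0, len(ready), max_parallel):
            batches.append(ready[i:i + max_parallel])
        sorter.done(*ready)                         # mark this wave as processed
    return batches


# Hypothetical cascade: calibrations first, SCIENCE last
deps = {
    "bias": set(),
    "dark": {"bias"},
    "flat": {"bias"},
    "wave": {"flat"},
    "science": {"dark", "flat", "wave"},
}
print(cascade_batches(deps, max_parallel=2))
```

With `max_parallel=1` the same function yields a strictly serial order that still honours the dependencies, which corresponds to the serial case described for CPL.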

23 PROJECTS I
Typical job:
Process all SCIENCE KMOS data from , with the entire calibration cascade required for the science data, since we suspect a quality issue with IFU illumination
Required specifications:
- INSTRUMENT=KMOS
- MODE=SCIENCE
- DATE=
- AB_METHOD=CALSELECTOR (a choice)
- CALSEL_MODE=RAW_CERTIF (a choice)
- SELECT=ALL (no specific OB or PROG_ID)
Specifying all these parameters on the command line? Better: define a PROJECT

24 PROJECTS II
distillery project definition file:
- One or several lines in ~/config/projects.distillery
- PROJECT_NAME unique at runtime
- Tool takes all required information from this entry
- Can be one or several dates
- Can be a full month
PROJECT_NAME INSTRUMENT MODE DATE SELECT AB_METHOD CALSEL_MODE OCA_RLS_METHOD ACCEPT_060
KMOS_PROJECT1 KMOS SCIENCE ANY CALSELECTOR RAW_CERTIF NONE NO
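A project entry can be pictured as one row whose fields line up with the column names above. The parser below is a sketch under two assumptions that the slides do not confirm: that fields are separated by whitespace, and that the DATE is written as a single token. The date in the sample line is a placeholder invented for the example.

```python
PROJECT_COLUMNS = [
    "PROJECT_NAME", "INSTRUMENT", "MODE", "DATE", "SELECT",
    "AB_METHOD", "CALSEL_MODE", "OCA_RLS_METHOD", "ACCEPT_060",
]


def parse_project_line(line):
    """Split one project entry into a dict keyed by column name.

    Assumes whitespace-separated fields, one token per column.
    """
    fields = line.split()
    if len(fields) != len(PROJECT_COLUMNS):
        raise ValueError(
            f"expected {len(PROJECT_COLUMNS)} fields, got {len(fields)}"
        )
    return dict(zip(PROJECT_COLUMNS, fields))


# The date below is a hypothetical placeholder, not taken from the slides
line = "KMOS_PROJECT1 KMOS SCIENCE 2018-08-01 ANY CALSELECTOR RAW_CERTIF NONE NO"
project = parse_project_line(line)
print(project["AB_METHOD"], project["CALSEL_MODE"])
```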

25 PROJECTS III
Mapping of use cases into project files:
UC1: Process SCIENCE data for defined dates for a given instrument
- Standard AB_METHOD is CALSELECTOR
- DOWNLOAD also ok if MCALIBs, faster
- Another example, with OBS_ID filtering
PROJECT_NAME INSTRUMENT MODE DATE SELECT AB_METHOD CALSEL_MODE OCA_RLS_METHOD ACCEPT_060
KMOS_PROJECT1 KMOS SCIENCE ANY CALSELECTOR MASTER NONE NO
UVES_PROJECT1 UVES SCIENCE ANY DOWNLOAD NONE NO
GIRAFFE_GAIA1 GIRAFFE SCIENCE OBS_ID= DOWNLOAD NONE NO

26 PROJECTS IV
UC2: Process recent CALIB data with red scores on the HC monitor
- Standard AB_METHOD is DOWNLOAD
  - CALSELECTOR cannot be used for CALIB data
  - Works for data for which QCG has done certification already
- AB_METHOD CREATE_CASCADE is the only method for very recent data
  - Limited by completeness of daytime calibrations
PROJECT_NAME INSTRUMENT MODE DATE SELECT AB_METHOD CALSEL_MODE OCA_RLS_METHOD ACCEPT_060
XSHOOTER_PROJECT1 XSHOOTER CALIB ANY DOWNLOAD NONE NO
MUSE_PROJECT1 MUSE CALIB (today) ANY CREATE_CASCADE NONE QC_DEFAULT NO

27 PROJECTS V
UC3: Compare a new prototype of a recipe, or a new pipeline version, to the operational version
- Standard AB_METHOD is DOWNLOAD
- Do the processing with pipe_v3.1, then store products under this PROJECT_NAME
- Then define a second project, configure pipe_v3.5, execute
- Products are now ready to be compared
PROJECT_NAME INSTRUMENT MODE DATE SELECT AB_METHOD CALSEL_MODE OCA_RLS_METHOD ACCEPT_060
VISIR_PROJECT_pipe_v3.1 VISIR ANY DOWNLOAD NONE NO
VISIR_PROJECT_pipe_v3.5 VISIR ANY DOWNLOAD NONE NO

28 Calling distillery I
distillery -p <project> -C [-i -V]
- call the first step: creation of ABs
- -i: interactive (after major steps, tool waits for confirmation)
- -V: verbose (detailed logging, e.g. of file names)
After executing -C:
- headers are downloaded, ABs are created, processing queue is created and waits for execution
- you can check everything without actually downloading fits files and without calling the pipeline

29 Calling distillery II
distillery -p <project> -P
- starts from -C: downloads fits files
- processes ABs in proper sequence
- as many in parallel as possible or as configured
After execution:
- all products are collected in $DFS_PRODUCT
- available for inspection, collection of QC parameters etc.
Also possible: no parameters, the same as C+P in sequence
distillery -p <project> [-i -V]

30 Calling distillery III
distillery -p <project> -M [-i -V]
- storing of results for deeper analysis
- on data storage, under project name
- all products, logs, monitors
- on demand also all raw fits files
distillery -p <project> -X
- delete results, no storage
distillery -p <project> -Y
- delete a project that was previously stored with -M
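The calls on these three slides can be strung together into one workflow. The helper below only assembles the command lines as argument lists and runs nothing; the function name is made up, the flags are the ones shown on the slides, and the pairing of -X and -Y with their descriptions follows the order in which they appear. Whether -i and -V are accepted together with -P, -X or -Y is not stated in the slides.

```python
from typing import List, Optional

# Step flags as listed on the slides
STEPS = {
    "C": "create ABs and the processing queue",
    "P": "download fits files and process the ABs",
    "M": "store results under the project name",
    "X": "delete results, no storage",
    "Y": "delete a project previously stored with -M",
}


def distillery_command(project: str, step: Optional[str] = None,
                       interactive: bool = False,
                       verbose: bool = False) -> List[str]:
    """Build (but do not run) a distillery call; step=None means C+P in sequence."""
    cmd = ["distillery", "-p", project]
    if step is not None:
        if step not in STEPS:
            raise ValueError(f"unknown step: {step}")
        cmd.append("-" + step)
    if interactive:
        cmd.append("-i")
    if verbose:
        cmd.append("-V")
    return cmd


# Typical sequence: create ABs, process, then store the results
workflow = [distillery_command("KMOS_PROJECT1", s) for s in ("C", "P", "M")]
for cmd in workflow:
    print(" ".join(cmd))
```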

31 Workflows I: method DOWNLOAD
distillery -p <project>: -C → -P → … → -M/-X/-Y

32 Workflows II: method CALSELECTOR
distillery -p <project>: -C → -P → … → -M/-X/-Y

33 Workflows III: method CREATE_CASCADE
distillery -p <project>: -C → -P → … → -M/-X/-Y

34 Workflows IV: step by step
Example: AB_METHOD=CALSELECTOR
- download all headers for DATE and instr.
- call CalSelector tool
- filter for OBS_ID etc. if configured
- download all hdr in CalSelector xml results
- create ABs from CalSelector xml files
- create the processing jobs&queue
- execute the queue (download fits first)
- enjoy products
- clean up or store results

35 Data downloads: headers
All classification done on headers
Bulk download:
- All headers for raw files of a specified date and instrument
- Used for classification
File by file:
- Headers of mcalibs (as read from CalSelector xml files)
- Headers of static calibrations (as read from isql query)
No need for “calibDb” as on Paranal
- CalSelector works on database with ALL mcalibs ever ingested

36 Data downloads: fits files
Fits files downloaded for processing
Initial bulk download:
- Before start of AB processing queue with N jobs, better to have initial dataset of fits files downloaded in controlled manner
- Done in batches of 8 download jobs (load balance)
- The first 50 raw files are downloaded this way (if configured, all raw files are downloaded)
- Then, all required mcalibs are downloaded
Later:
- incremental, file by file download of missing fits files when needed in AB, transparent in background
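The download strategy on this slide splits the files into an initial bulk set and a remainder that is fetched on demand. The sketch below mirrors that split with a thread pool limited to 8 concurrent jobs. The function names are invented, `fetch` is a stand-in that downloads nothing, and reading "all required mcalibs" as part of the initial bulk set is an interpretation of the slide.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_downloads(raw_files, mcalibs, initial_raw=50, all_raw=False):
    """Split files into the initial bulk set and the ones left for later.

    The first `initial_raw` raw files (or all of them, if configured) plus
    all required mcalibs go into the bulk set; the rest is fetched on demand.
    """
    cut = len(raw_files) if all_raw else initial_raw
    bulk = list(raw_files[:cut]) + list(mcalibs)
    deferred = list(raw_files[cut:])
    return bulk, deferred


def fetch(name):
    """Stand-in for a real download; it only returns the file name."""
    return name


def bulk_download(files, max_jobs=8):
    """Fetch files with at most `max_jobs` downloads running at once."""
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return list(pool.map(fetch, files))


raw = [f"raw_{i:03d}.fits" for i in range(60)]
bulk, deferred = plan_downloads(raw, ["master_flat.fits"])
print(len(bulk_download(bulk)), "files in bulk,", len(deferred), "left for later")
```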

37 End of part 2
- Processing
- Workflow
- Projects
- Data downloads
Part 3:
- OCA rules
- Tool monitors
- Technical aspects

38 OCA rules I
OCA (organization, classification, association) rules: set of rules defining the relationship between raw and product files in the ESO data flow system
Three steps:
- Classification: rules to define the types of input (raw) data, based on fits keys (typically the DPR keys DPR.CATG, DPR.TYPE, DPR.TECH)
- Organization: rules for the datasets (grouping of input files). The most prominent grouping rules are: single, and template (TPL.START)
- Association: rules about the processing of the datasets. The association rules define which mcalibs have to be found, including validity rules and match rules.
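The three OCA steps can be illustrated on plain header dictionaries. This is a toy sketch, not the OCA rule language: the label format returned by `classify`, the use of `PRO.CATG` and `INS.MODE` as matching keys, and the sample headers are all assumptions, and validity rules (time windows) are left out so that only match rules are shown.

```python
from collections import defaultdict
from typing import Dict, List

Header = Dict[str, str]


def classify(header: Header) -> str:
    """Classification: label a raw file from its DPR keys (label format is hypothetical)."""
    return "_".join(header[k] for k in ("DPR.CATG", "DPR.TYPE", "DPR.TECH"))


def organize(headers: List[Header], rule: str) -> List[List[Header]]:
    """Organization: one group per file ('single') or per TPL.START ('template')."""
    if rule == "single":
        return [[h] for h in headers]
    if rule == "template":
        groups = defaultdict(list)
        for h in headers:
            groups[h["TPL.START"]].append(h)
        return list(groups.values())
    raise ValueError(f"unknown grouping rule: {rule}")


def associate(group: List[Header], mcalibs: List[Header],
              needed_type: str, match_keys: List[str]) -> List[Header]:
    """Association (match rules only): mcalibs of the needed type whose
    match keys agree with the first file of the group."""
    ref = group[0]
    return [m for m in mcalibs
            if m["PRO.CATG"] == needed_type
            and all(m.get(k) == ref.get(k) for k in match_keys)]


raws = [
    {"DPR.CATG": "SCIENCE", "DPR.TYPE": "OBJECT", "DPR.TECH": "IFU",
     "TPL.START": "T1", "INS.MODE": "A"},
    {"DPR.CATG": "SCIENCE", "DPR.TYPE": "OBJECT", "DPR.TECH": "IFU",
     "TPL.START": "T1", "INS.MODE": "A"},
    {"DPR.CATG": "SCIENCE", "DPR.TYPE": "OBJECT", "DPR.TECH": "IFU",
     "TPL.START": "T2", "INS.MODE": "B"},
]
mcalibs = [
    {"PRO.CATG": "MASTER_FLAT", "INS.MODE": "A", "name": "flat_A.fits"},
    {"PRO.CATG": "MASTER_FLAT", "INS.MODE": "B", "name": "flat_B.fits"},
]
groups = organize(raws, "template")
print(classify(raws[0]), len(groups))
print([m["name"] for m in associate(groups[0], mcalibs, "MASTER_FLAT", ["INS.MODE"])])
```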

39 OCA rules II
CALSELECTOR rules:
- come versioned (by epoch)
- SCIENCE only
- Always go to certified calibrations (raw or master), unless RAW_ANY is chosen in projects.distillery
- This is the standard method for most SCIENCE use cases
Other: QC_DEFAULT rules (the ones used by QCG)
- More flexible: can be used to associate any data (e.g. CALIB)
- Even a local, modified OCA rule can be used, like myXSHOOTER.RLS

40 TOOL MONITORS I distillery tool:
Calls internally some QC monitoring tools giving a process and status overview OPSHUB project monitor (called dfoMonitor internally)

41 TOOL MONITORS II
Project monitor: project overview and status
- Yellow: waiting (after -C)
- Green: processed (after -P)

42 TOOL MONITORS III
One project monitor per instrument
- Usually auto-updated
- Other instrument: jump menu
- Call/refresh on command-line: dfoMonitor -i <instr>
Related links:

43 TOOL MONITORS IV
Processing monitor (called AB monitor internally):
- Overview of all current ABs
- Status and logs
- Connects to AB-related info
- Cloned from similar tool for QCG

44 TOOL MONITORS V
Processing monitor:
- Clickable links to ABs, logs, calibration maps, product directories
- Collection of historical QC info (for CALIBs)
! This is historical QC content from qcweb server (NOT scored or certified by distillery)

45 TOOL MONITORS VI
Processing monitor, expert mode:
- Sortable
- Filter/select
- Stop auto-refresh if you search

46 TOOL MONITORS VII
Processing monitor, SCIENCE:
If CALSELECTOR chosen: checks for completeness, validity, certified flag
- X: XML file: assocs complete & certified
- T: TXT file: assocs within calib_plan validity
- VIRT: warning for VIRT calibrations (ignore)

47 TOOL MONITORS VIII
Processing monitor, DATE selection:
- SCIENCE ABs always from selected DATE
- CALIB ABs: related to SCIENCE ABs, they could also come from earlier or later DATEs, marked then
- If SCIENCE-driven (CALSELECTOR modes): CALIBs in general NOT complete for selected DATE, only for selected SCIENCE
- In particular: HC etc. might be missing

48 TOOL MONITORS IX
Processing monitor: more links
- ‘distillery’ processing log file
- This PROJECT has 3 nights of SCIENCE, uses CALSELECTOR, RAW_ANY mode
- Data directories, OCA rules …

49 ‘distillery’ processing log file

50 TOOL MONITORS X
Processing monitor: compare OPSHUB and QCG results
- Link to corresponding page on QCWEB server
- QCG ‘autoDaily’ results (daily QC processing)
- OPSHUB ‘distillery’ results

51 TOOL MONITORS XI
Project and proc. monitors:
- Think of 2 levels
- Down … … and up again
Purpose:
- giving overview of processing
- Access to data products

52 Other functions
distillery -h
- quick help on the command line
distillery -E
- overview of installed & configured pipelines, their recipes, their processing parameters
distillery -D
- menu for special download options: headers, raw fits files, master calibration files (latest version of mcalibs for refresh of Paranal calibDB)

53 TECHNICAL ASPECTS I
config file for distillery: $HOME/config/config.distillery
Key information for tool:
- Per instrument
- Pipeline file name conventions
- Subtleties
Maintained by distillery maintainer (NOT normal user)

54 TECHNICAL ASPECTS II
project file for distillery: $HOME/config/projects.distillery
Key information for projects:
- INSTRUMENT
- DATEs
- MODE (CALIB/SCIENCE/ALL)
- AB_METHOD
Maintained by user

55 DOCUMENTATION http://www.eso.org/~qc/dfos/opshub.html
Which is a branch of the DFOS web site: Quick user guide

56 distillery: what we have
- data access & data delivery √
- classification and association rules & tools √
- pipeline access and interfaces √
- database access √
- overview of tasks and products √
Generic and very flexible platform to process VLT data from raw to products

57 What distillery can’t do
- No QC info extracted, no scores (but extraction from QC site is done)
- All data processing is project driven, no stream (yet, TBD)
  - will be implemented as recurrent projects [Note added: implemented with v1.1]
- Data are delivered automatically within a project, but not as an external, automatic flow

58 Known issues
- CalSelector uses OCA ruleset #1, but for ABs to be created we use OCA ruleset #2, and the two are not 100% identical → (will disappear soon, with new version of Abbuilder)
- We cannot currently map complex science processing cascades like for MUSE (just the first 2 steps out of 4 or 5) → (simple treatment now with little scripts called PGIs)

59 Current implementation
- User godot.sc.eso.org
- More users (opshub2,…) planned
- Also planned: project or personal accounts
Thank you and … Happy distilling!

61 HINTS
Choice of AB_METHOD depends on USE case
How to get to QC1 parameters:
- Select products
- Need some knowledge about QC parameter names etc.
- Best in stored PROJECT but possible as well in $DFS_PRODUCT area

62 ASSOCIATION V: cascade
CALIB SCIENCE raw products

63 AB_METHOD=CREATE_CASCADE
CALIB AB_METHOD=CREATE_CASCADE SCIENCE raw products

64 AB_METHOD=DOWNLOAD MODE=SCIENCE SCIENCE
Mcalibs taken from NGAS

