1 Joint Technical Meeting 2/23/2016 L1 Systems Donald Petravick LSST JTM February 23, 2016
Design Articulation, Phasing (software only)
[Diagram. Labels, in extraction order: ConOps; High-Level Functions; Second-Level Functions; Main programs; Program High Level Functions; Program Detail Functions; Common Functionality Concerns; Message Topology; Base-NCSA Coordination; Data-Passing; System Solidification; Comprehensive Test; Delivery Validated; Early or Mock L1 code, L1 DB; Alerts to “Mocked” Brokers; Phased WBS; Nightly Setup; Production OCS, CDS]
High-Level Functions of the L1 System
− We see four distinct functions of the L1 System:
  – Processing (AP is one use case)
  – Archiving
  – EFD Replication
  – Observatory Operations Server
Processing System
Timescales of the Processing System
Time Scale | Major Functions
Daily Cadence | Acquire system from Observing ops. Ingest any remaining L1 processing data into permanent archive. Load/purge caches. Insert and test any changes into system. Yield system to Observing ops, become functional OCS device.
Modal | OCS selects “clusters of use-cases”: Science Observing, Nightly Calibrations, Narrow Band Calibration, Doughnuts, etc.
Logical Visit | Begin: Marry Workers to Forwarders/Distributors, start appropriate science codes. End: Assemble telemetry from individual CCDs. Send visit-level telemetry to OCS. Free Workers for new assignment.
Exposure | Forwarders acquire rafts, forward rafts to Distributors. Strings of Workers pull CCD data from Distributors, load data into Butler repository, yield control to science codes. Exposure-level telemetry is assembled, passed to OCS. Alerts are sent to authors for AP use cases.
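The Logical Visit row above can be illustrated with a small sketch. It is not the L1 design: the class name `LogicalVisit`, the round-robin pairing rule, and the single numeric telemetry value per CCD are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Tuple


@dataclass
class LogicalVisit:
    """Hypothetical bookkeeping for one logical visit."""
    visit_id: int
    # Worker name -> Distributor name
    assignments: Dict[str, str] = field(default_factory=dict)
    # CCD name -> one illustrative telemetry value
    ccd_telemetry: Dict[str, float] = field(default_factory=dict)

    def begin(self, workers: List[str], distributors: List[str]) -> None:
        """Pair each Worker with a Distributor (round-robin is an assumption)."""
        for i, worker in enumerate(workers):
            self.assignments[worker] = distributors[i % len(distributors)]

    def report(self, ccd: str, value: float) -> None:
        """Record per-CCD telemetry produced while the visit is running."""
        self.ccd_telemetry[ccd] = value

    def end(self) -> Tuple[dict, List[str]]:
        """Assemble visit-level telemetry and free the Workers."""
        values = list(self.ccd_telemetry.values())
        summary = {
            "visit_id": self.visit_id,
            "n_ccds": len(values),
            "mean_value": mean(values) if values else None,
        }
        freed = sorted(self.assignments)
        self.assignments.clear()
        return summary, freed
```

In this sketch, `end()` returns both the visit-level summary (which would go to OCS) and the list of Workers that are free for a new assignment.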
Salient Behaviors of the Processing System
− An operational policy provides Observatory operations with a degree of control over the system.
  – If L1 falls behind, the system will keep a list of exposures internally, and dispatch them to processing according to a policy.
  – The system may be run in “scan mode” for processing that is not immediate.
− Daytime processing in a batch system is a backup.
  – Observing operations may exit L1 processing. Unprocessed exposures are marked for daytime processing.
  – Processing may occur in event mode (sensitive to nextVisit events) or scan mode (OCS events are not needed).
  – Certain error cases currently may leave rafts unprocessed.
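The backlog behavior described above could be sketched as follows. The slides do not say what the dispatch policies are, so `oldest_first` and `newest_first` are assumed examples, and the `Backlog` class and its method names are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Optional

# A policy returns the index of the next exposure to dispatch.
# Both policies below are illustrative assumptions, not the L1 design.
Policy = Callable[[Deque[int]], int]


def oldest_first(pending: Deque[int]) -> int:
    return 0


def newest_first(pending: Deque[int]) -> int:
    return len(pending) - 1


class Backlog:
    """Internal list of exposures awaiting L1 processing."""

    def __init__(self, policy: Policy) -> None:
        self.policy = policy
        self.pending: Deque[int] = deque()
        self.daytime: List[int] = []  # marked for daytime batch processing

    def arrive(self, exposure_id: int) -> None:
        self.pending.append(exposure_id)

    def dispatch(self) -> Optional[int]:
        """Hand the next exposure to processing, as chosen by the policy."""
        if not self.pending:
            return None
        index = self.policy(self.pending)
        exposure_id = self.pending[index]
        del self.pending[index]
        return exposure_id

    def exit_processing(self) -> None:
        """Observing operations exit L1; mark the rest for daytime processing."""
        self.daytime.extend(self.pending)
        self.pending.clear()
```

Keeping the policy as a plain function means Observatory operations could swap it without touching the queueing logic, which is one way to read "a degree of control over the system."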
L1 Telemetry
− L1 computes telemetry data that is fed back into the OCS system.
  – Image quality parameters are eventually processed and provide feedback to the scheduler. These parameters need to be computed. We are prepared to process offline in batch if needed. We understand that the parameters are a side effect of the L1 image processing code, not a separate module. Configuration control must provide controls should multiple codes emerge.
  – L1 codes may run out of order, or in batch, after the fact. Operational control of the flow of messages is needed. Concerns: non-real-time WCS feedback, batch access to the OCS bridges, etc.
Details of Alert Distribution
Brokers
− The LSST broker has been described as “primitive” or “basic”.
− The LSST brokers only serve LSST authorized users.
− There are operational distinctions between feeding a community broker vs. running a broker that serves individual users.
− Since we have to deal with community brokers, we see the functional requirement as naturally supporting multiple brokers, and therefore multiple instances.
  – We see the following instances:
    – A reliable feed for EPO
    – A reliable feed for broker awardees
    – A best effort feed accessible by any data rights holder
− We understand PU is delivering a broker.
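One way to picture the difference between the reliable and best effort feed instances is sketched below. The slides do not define what "reliable" means operationally; here it is assumed to mean that undelivered alerts are kept for retry, while a best effort feed simply drops them. The `Feed` class and its fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Feed:
    """Hypothetical alert feed instance."""
    name: str
    reliable: bool
    send: Callable[[dict], bool]  # returns True when delivery succeeded
    retry_queue: List[dict] = field(default_factory=list)
    dropped: int = 0

    def publish(self, alert: dict) -> None:
        if self.send(alert):
            return
        if self.reliable:
            # Assumed meaning of "reliable": keep the alert until delivered.
            self.retry_queue.append(alert)
        else:
            # Assumed meaning of "best effort": count the loss and move on.
            self.dropped += 1

    def retry(self) -> None:
        """Re-send queued alerts; keep only those that fail again."""
        self.retry_queue = [a for a in self.retry_queue if not self.send(a)]
```

Under these assumptions, the three instances listed above would be three `Feed` objects, two with `reliable=True` and one with `reliable=False`.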
Archiving System
Archiving System
− Archiving is now a distinct service offered to Observatory Operations.
  – No longer technically coupled to the L1 processing system.
  – Can be run independently as well.
− Functions:
  – Acquires images from the camera buffer.
  – Ingests into the data backbone.
    – Primary ingest is the Chilean Base Center Archive.
    – Backup ingest is the NCSA Archive.
    – Synchronization policy in the archives assures replication and migration to tape.
Additional L1 System Elements
− The baseline provides for an engine to replicate the Engineering and Facility Database.
  – We understand this is to be made of ~20 independent relational databases and a file archive.
  – We understand this is to require a 25-node cluster.
− Observatory Operations Server
  – Provides a special path into the archive and just-born information in L1 caches.
  – Different authorization and authentication mechanisms are required in the baseline.
Additional and Related Work
− A conops (concept of operations) narrative is
  – a narrative that requires no training in a methodology to understand.
  – accessible to anyone with a basic understanding of the LSST construction project.
  – useful support for operational planning.
− We have found writing a conops for the subsystem to be
  – a helpful check that the context of the system is understood.
  – a basis for use case and functional breakdown.
  – familiar practice; for us, this is old hat.
Conops narratives are underway
− We are currently developing conops for
  – the data backbone
  – the security system, which includes the authentication and authorization work
− We plan on doing this for the remaining infrastructure systems as well
  – Batch
  – L3 hosting
  – The rest
Summary
− We are looking for hallway or organized input on the L1 system briefly presented here.
  – For DM, we are especially interested in
    – Brokers
    – Science code payloads
− The necessary elements of the interface to the project, as we see them, include
  – Conops narrative
  – The V methodology
  – WBS that will be effort-loaded
Phase 1: Basic Message Topology
− Initial implementation of behaviors within a “logical visit,” but not reading out data. Basic messaging and interactions, including data dictionary and message patterns.
  – Common main classes, startup scaffolding, message dictionary, and prototype of main program for all entities, excluding EFD replication and Observatory Operations Server.
  – Minimal framework to do fault injection.
  – Basic framework for health and status display, including status event recorder.
  – Inclusion of telemetry messaging is conditional upon final definition of the requirements specification.
− Use single resource instances to check message flow.
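As a rough illustration of what a Phase 1 message dictionary might look like, the sketch below maps each message type to its required fields and checks incoming messages against it. The message type names and field names are hypothetical; only the existence of a "nextVisit" event is taken from the slides.

```python
from typing import Dict, List, Set

# Hypothetical message dictionary: message type -> required field names.
MESSAGE_DICTIONARY: Dict[str, Set[str]] = {
    "NEXT_VISIT": {"visit_id", "num_exposures"},
    "HEALTH_STATUS": {"entity", "state"},
}


def validate(message: dict) -> List[str]:
    """Return a list of problems; an empty list means the message conforms."""
    msg_type = message.get("type")
    if msg_type not in MESSAGE_DICTIONARY:
        return [f"unknown message type: {msg_type!r}"]
    missing = MESSAGE_DICTIONARY[msg_type] - set(message)
    return [f"missing field: {name}" for name in sorted(missing)]
```

A shared dictionary of this kind lets every entity's main program reject malformed messages the same way, which is useful when checking message flow with single resource instances.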
Phase 2: Coordination of Base and NCSA
− Implementation of internal scoreboards and scoreboard snapshots (Base DMCS (L1 Processing), Base Foreman, Archiver DMCS, NCSA L1 Foreman, Cluster Manager).
  – Message payload processing for “logical visit.”
  – Implementation of stage-in component to stage calibrations and templates from caches to Workers.
  – Initial implementation of Cluster Manager wrapper.
  – Addition of component reliability faults to fault injection framework.
− Use all resources, with appropriate logic applied to messages for their use.
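A scoreboard with snapshots could be as simple as the sketch below. The slides do not describe the scoreboard's contents or storage, so the in-memory dictionary, the `Scoreboard` class, and its method names are assumptions for illustration only.

```python
import copy
from typing import Dict, List


class Scoreboard:
    """Hypothetical in-memory scoreboard tracking the state of each entity."""

    def __init__(self) -> None:
        self._state: Dict[str, dict] = {}
        self._snapshots: List[Dict[str, dict]] = []

    def update(self, entity: str, **fields) -> None:
        """Merge new fields into the record for one entity."""
        self._state.setdefault(entity, {}).update(fields)

    def get(self, entity: str) -> dict:
        """Return a copy of one entity's record (empty if unknown)."""
        return dict(self._state.get(entity, {}))

    def snapshot(self) -> int:
        """Store a deep copy of the current state and return its id."""
        self._snapshots.append(copy.deepcopy(self._state))
        return len(self._snapshots) - 1

    def restore(self, snapshot_id: int) -> None:
        """Replace the current state with a stored snapshot."""
        self._state = copy.deepcopy(self._snapshots[snapshot_id])
```

Deep copies keep a snapshot independent of later updates, so it can serve as a consistent record for review or recovery.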
Phase 3: Data-passing between Camera, Base, and NCSA
− Detailed implementation of behaviors for exposure-level processing.
− Integration of DDS and CDS software deliverables into DM development system.
− Includes marshalling of Workers, data acquisition from the Camera Data Buffer for both archiving and L1 processing, coordination of Archivers, Forwarders and Distributors, and coordination of data passing between Distributors, Workers, and NCSA Archive.
− Analytics framework for telemetry.
− FITS file generation for L1 processing and archiving.
− Functioning (albeit not final) system!
Phase 4: System Solidification
− Initial implementation of behaviors in “scan mode” for archiving and L1 processing.
− Final implementation of Cluster Manager wrapper.
− State machines for commandable entities (L1 processing and archiving entities).
− Configuration manager to assign personalities to machines.
− Complete startup and tear-down implementation through state machine in DMCS, including drain for L1 processing and archiving.
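A table-driven state machine is one common way to implement commandable entities. In the sketch below, only the names OfflineState and StandbyState come from this deck (the OCS concerns slide); EnabledState, DrainingState, every command name, and the transition table itself are hypothetical.

```python
from enum import Enum
from typing import Dict, Tuple


class State(Enum):
    OFFLINE = "OfflineState"
    STANDBY = "StandbyState"
    ENABLED = "EnabledState"    # hypothetical state
    DRAINING = "DrainingState"  # hypothetical state


# (current state, command) -> next state. The table is an assumption.
TRANSITIONS: Dict[Tuple[State, str], State] = {
    (State.OFFLINE, "trigger"): State.STANDBY,
    (State.STANDBY, "enable"): State.ENABLED,
    (State.ENABLED, "drain"): State.DRAINING,
    (State.DRAINING, "drained"): State.STANDBY,
    (State.STANDBY, "shutdown"): State.OFFLINE,
}


class CommandableEntity:
    def __init__(self, name: str) -> None:
        self.name = name
        self.state = State.OFFLINE

    def command(self, name: str) -> State:
        """Apply a command, rejecting any transition not in the table."""
        key = (self.state, name)
        if key not in TRANSITIONS:
            raise ValueError(f"{self.name}: '{name}' not allowed in {self.state.value}")
        self.state = TRANSITIONS[key]
        return self.state
```

Routing tear-down through an explicit draining state, as assumed here, ensures in-flight work finishes before an entity returns to standby.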
Phase 5: Self-Integration and Concentrated, Comprehensive Testing
− End-to-end test of L1 processing and archiving with OCS messages, including fault injection.
− Internal test of Alert Distribution.
− Final health and status display and after-action review (AAR) capabilities.
− Implementation of log recording for Archiving framework and L1 Processing framework.
Phase 6: End-to-End System Integration and Testing
− End-to-end test of L1 processing and Alert Distribution with science algorithms, including fault injection.
  – Final log recording, including science algorithm logs.
Concerns/Questions: intra-DM
− Data coupling between L1 codes generating telemetry and the OCS system.
  – Want to talk to Butler developers.
− Assembly of pixels into file-level packages, especially for archiving. Current thinking is that the granularity is the raft, depending on metadata capabilities.
− How to handle variation of codes as the observing program varies? E.g., essential telemetry as a side effect of science code?
− Catch-up processing possible in the production batch environment.
− Note:
  – Need to model Observatory Operations Server and EFD
  – Haven’t modeled offline L1 processing, e.g., DayMOPS
Concerns/Questions: OCS
− Further understanding of the TBD Trigger, which shifts Base DMCS from OfflineState to StandbyState.
− Protections for the OCS bridge, which penetrates the SCADA enclave.
− How is the number of exposures in a logical visit conveyed to the system? Relationship to the “nextVisit” message?
− Details about required application responses to DDS messages (when to issue acks, maximum timeouts, and similar).
− OCS support for the design concept of “logical visit.”
− How does the metadata arrive from OCS in L1 processing and archiving, including scan mode?
− MEP relating to ending of a major mode.
− Possible concerns w.r.t. readout, pending discussions with CDS.
− Use cases where data persist in camera.
Concerns/Questions: Camera-Related
− Claim management for camera buffer: unknown where this function lives. Granularity amp -> exposure.
− Understand within “logical visit” the parameters that are invariant within each image, and the parameters that vary with the exposure. Sources for each.
− What is the unit of recovery in the face of partial failure, e.g., CCD? Raft? Exposure?
− Review the bandwidth needed, e.g., can the camera CDS handle unsynchronized access by archiving and processing?
− General review of the CDS API w.r.t. the existing L1 design.