DQM Architecture From Online Perspective
EvF wkg 11/10/2006
E. Meschi – CERN PH/CMD

E.M. - DQM Online View

DQM Requirements
1. Primary goal: provide “fast” feedback to shift crew and subsystem experts about the quality of event data being taken
2. Provide global and subsystem-specific “quality flags” for each unit of event data (aka Luminosity Section)
3. Provide a uniform environment and a modular structure for DQM code (DQM code reusability)
4. Provide a common working environment for expert and generic monitoring alike
5. Integrate well into online operations (e.g. core activities started automatically by RunControl)
6. Provide a hierarchical online view of the status of the experiment
7. Provide a uniform look and feel for DQM GUIs
8. Enable seamless integration of offline DQM activities (see 3.)
9. Enable remote DQM shifts

DQM Infrastructure
DQMServices
– Fully integrated with CMSSW
– Modularity of user code imposed by framework
– Uniform interface for creation/management of DQM objects
– Bookkeeping, transport and collation of DQM data
– Quality test and status tracking
– Web interface toolkit, xdaq integration
– Visual client integrated with Iguana
– See C.L. presentation
80% of the requirements in the previous slide are covered
– How to get the remaining 20% is one of the subjects of this workshop.

DQMServices use cases
[Diagram of the ONLINE and QUASI-ONLINE use cases. Recoverable labels: CRATE CONTROLLER PC, FILTER FARM, STORAGE MANAGER, EVENT SERVER / SM (events), directory, Event CONSUMERS, COLLECTOR, DQM CONSUMERS, data subscriptions, Event Data, TCP/TMessage; implementation options FWK, FWK + XDAQ, STANDALONE, XDAQ/WRAPPED.]

Frequent Questions
Which network will I be running on?
Can I / should I use CMSSW?
How is my process going to be started / controlled?
Do I get to access OMDS? ORCON?
Do I have access to DCS data?
Do I have access to DAQ monitoring data?

DQM Modes of Operation
Online at crate controller level
– Input rate: limited by VME access (*)
– Event Building: No
– CPU: crate controller PCs
– Bw: consistent with experiment network
– Delay: virtually 0
[EXP. NETWORK; CAN USE CMSSW; CAN USE RC (SUB-DET); FREE ACCESS TO DB; DCS: via PSX; DAQmon: via DB]
Online in Filter Farm
– Input rate: up to 100 kHz
– Event Building: Yes
– CPU: 10-0% of HLT CPU
– Bw: 5-0% of total bw (1 GB/s)
– Delay: 0
[EXP. NETWORK; MUST USE CMSSW; MUST USE RC; LIMITED ACCESS TO DB; DCS: NO; DAQmon: NO]
Online in Event Consumer
– Input rate: 1-10 Hz aggregate
– Event Building: Yes
– CPU: subsystem CPUs
– Bw: consistent with experiment network
– Delay: seconds
[EXP. OR CAMPUS NETWORK; MUST USE CMSSW; CAN USE RC; FREE ACCESS TO DB (EXP); DCS: via PSX or DB; DAQmon: via DB]

DQM Modes of Operation
Quasi-online processing local file from SM
– Input rate: O(10) Hz aggregate
– Event Building: Yes
– CPU: subsystem CPUs
– Bw: consistent with experiment network
– Delay: minutes
[EXP. OR CAMPUS NETWORK; MUST USE CMSSW; CAN USE RC; FREE ACCESS TO DB (EXP); DCS: via DB; DAQmon: via DB]
Offline processing
– Input rate: virtually all data stored (O(100 Hz))
– Event Building: Yes
– CPU: batch farm
– Bw: consistent with campus network
– Delay: ~ 1 hour
[GRID; MUST USE CMSSW; CANNOT USE RC; ACCESS TO OFFLINE DB ONLY; DCS: indirectly via condDB; DAQmon: NO]
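
The per-mode answers to the "Frequent Questions" can be held in a small lookup table. The following is only a sketch: the class name, field names and mode keys are hypothetical, and the pairing of each constraint box with a mode assumes the boxes appear in the same order as the modes on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DqmMode:
    """Constraints of one DQM mode (field names are illustrative only)."""
    network: str
    cmssw: str        # "can" or "must"
    run_control: str  # "can", "must" or "cannot"
    db_access: str
    dcs: str          # "no" means no access to DCS data
    daqmon: str       # "no" means no access to DAQ monitoring data


# Values transcribed from the two "DQM Modes of Operation" slides.
MODES = {
    "crate_controller": DqmMode("experiment", "can", "can", "free", "via PSX", "via DB"),
    "filter_farm": DqmMode("experiment", "must", "must", "limited", "no", "no"),
    "event_consumer": DqmMode("experiment or campus", "must", "can", "free (exp)",
                              "via PSX or DB", "via DB"),
    "quasi_online": DqmMode("experiment or campus", "must", "can", "free (exp)",
                            "via DB", "via DB"),
    "offline": DqmMode("grid", "must", "cannot", "offline DB only",
                       "indirectly via condDB", "no"),
}


def modes_where(predicate):
    """Return the names of all modes whose constraints satisfy `predicate`."""
    return [name for name, mode in MODES.items() if predicate(mode)]


# Example queries: which modes can see DAQ monitoring data, and which
# modes are not obliged to use CMSSW?
with_daqmon = modes_where(lambda m: m.daqmon != "no")
cmssw_optional = modes_where(lambda m: m.cmssw == "can")
```

With this table, a question such as "do I have access to DAQ monitoring data?" becomes a one-line query over the mode in which the code will run.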

DQM in the FF
The one and only way to get 100% of the events from L1
However, embedding DQM in the HLT has the following disadvantages:
1. It must be accounted for in the HLT CPU budget
2. It affects the robustness of the HLT: DQM code run this way will be subject to much stricter requirements and will not be allowed to change frequently
3. DQM data is scattered over many sources: the bandwidth to the collector is limited, and a standard collation operation must be carried out in the collector to reduce data volume.
It should be reserved for cases where
– The entire L1 accept rate is needed, or
– Big statistics must be accumulated over a short period (e.g. at the beginning of a run)

Filter Farm Data Operation
[Diagram. Recoverable labels: STORAGE MANAGERS, EVENT/DQM SERVER, DATALOGGER, event data, EVENT DATA BUFFERS, DQM data, SPECIAL STREAMS BUFFERS, EVENT/DQM PROXY/CACHING SERVER, DQM SNAPSHOT BUFFERS, EVENT CONSUMERS, DQM CONSUMERS.]

FF DQM Data Handling
First Level of DQM Collection in Storage Manager
– Does collation of many FU copies
Proxy/Caching Server collects collated updates from all SMs
– Does final collation
– Saves snapshot per LS
– Serves individual consumers
– It is the only point of access from outside the experiment network
Consumers of FF DQM
– Can subscribe to individual DQM “folders”
– Only have access to collated information
– Are responsible for processing DQM information (Qtests, status variables, presentation etc.)
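
The two collation levels described above can be sketched as follows. This is an illustration, not the DQMServices implementation: it assumes that a histogram is a plain list of bin contents, that collation means summing identically binned copies, and that folders are path prefixes. All names (`collate`, `ProxyServer`, the folder paths) are hypothetical.

```python
def collate(copies):
    """Merge several {path: bin_contents} dicts by summing bins per path."""
    merged = {}
    for copy in copies:
        for path, bins in copy.items():
            if path not in merged:
                merged[path] = list(bins)
                continue
            if len(bins) != len(merged[path]):
                raise ValueError(f"binning mismatch for {path}")
            merged[path] = [a + b for a, b in zip(merged[path], bins)]
    return merged


class ProxyServer:
    """Final collation stage: one snapshot per luminosity section (LS)."""

    def __init__(self):
        self.snapshots = {}

    def update(self, lumi_section, sm_updates):
        # Final collation of the already-collated updates from all SMs.
        self.snapshots[lumi_section] = collate(sm_updates)

    def serve(self, lumi_section, folder):
        # A consumer subscribed to a folder sees only collated objects
        # whose path lies below that folder.
        prefix = folder.rstrip("/") + "/"
        snapshot = self.snapshots[lumi_section]
        return {p: b for p, b in snapshot.items() if p.startswith(prefix)}


# First level: each Storage Manager collates the copies from its FUs.
sm1 = collate([{"Tracker/occupancy": [1, 2, 0]},
               {"Tracker/occupancy": [0, 1, 3], "Muon/hits": [5]}])
sm2 = collate([{"Tracker/occupancy": [1, 1, 1]}])

# Second level: the proxy collates the SM updates and stores the LS snapshot.
proxy = ProxyServer()
proxy.update(lumi_section=1, sm_updates=[sm1, sm2])
tracker_view = proxy.serve(1, "Tracker")
```

Collating at the Storage Manager first keeps the data volume sent to the proxy independent of the number of FUs, which is the motivation given for a standard collation step.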

Other Online Sources of DQM Data
Event and non-event DQM from crate controllers
– Should be part of the sub-detector online configuration (and thus be controlled by the sub-det FM)
– Including collection and collation
Event Consumers (both using Event Server or disk streams)
– Should be controlled by RunControl
– Should be grouped in few individual processes by functionality and input
– E.g. all DQM modules that use a zero-bias special stream are run by the same process
One or multiple collectors
Collation in case of multiple identical sources is delegated to client
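
Grouping event-consumer modules into a few processes by input can be pictured as a simple partition of the module list. The sketch below assumes each module declares a single input stream; the module and stream names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (module name, input stream) pairs; not taken from the slides.
MODULES = [
    ("pixel_occupancy", "zero_bias"),
    ("ecal_timing", "calibration"),
    ("tracker_noise", "zero_bias"),
]


def plan_processes(modules):
    """Assign all modules reading the same input stream to one process."""
    groups = defaultdict(list)
    for name, stream in modules:
        groups[stream].append(name)
    return dict(groups)


process_plan = plan_processes(MODULES)
```

Here both zero-bias modules end up in the same process, so the stream is read once and RunControl has one process to start and stop for that input.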

DQM Clients
Two types of consumers of DQM information
– Intelligent clients (Superclients)
    Do data manipulation
    Are themselves producers of DQM data
    Can act as servers
    Can write into CondDB
    Can (but do not necessarily) provide graphical feedback
    Can (but do not necessarily) provide interactive control (e.g. switch to expert mode…)
    Should be xdaq applications so they can be best controlled by RunControl
    Can be FW applications to gain access to FW services (e.g. ORCON)
    See S.B. talk
    Can run unattended and provide feedback to operator via warning/error messages
– Dumb clients (e.g. GUI)
    Do not add information or manipulate data
    Cannot act as servers
    Cannot write in CondDB
    Provide interactive feedback

Client Operation
DQM is controlled as a separate sub-system of DAQ (excluding DQM in FF)
– Sources (event consumers)
– Collectors
– Intelligent clients
With full state machine binding for xdaq applications (e.g. derived from DQMBaseClient)
– Get configure, run start/stop commands
Otherwise, if there is no xdaq binding, control is limited to start/stop of processes
As a minimum, each process gives a report line to show that it is alive
Control is on a “best-effort” basis, i.e. DAQ will not stop if a DQM component crashes
Each Superclient must provide a non-graphic synoptic view of the status of the sub-system it monitors
Key plots (used in the status calculation) are stored in a snapshot (at every LS)
Plus a navigable hierarchy of status information based on the folder organization (e.g. one folder per chamber: status calculated based on status of contained histograms, etc.)
[Diagram of the control hierarchy. Recoverable labels: TOP, DAQ, DQM, FF Subsystem, HLT as DQM SOURCE, SM as DQM COLLECTOR, CRATE CONTROLLER DQM SOURCEs, EVENT CONSUMERS, COLLECTORS, SUPERCLIENTS, CLIENT CONCENTRATOR, GLOBAL STATUS DISPLAY.]
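
A folder-based status hierarchy of this kind can be sketched as a recursive roll-up. The combination policy used here (a folder takes the worst status found below it, and an empty folder counts as OK) is an assumption chosen for illustration; the slides leave the policy open. The `Status` levels and all function names are hypothetical.

```python
from enum import IntEnum


class Status(IntEnum):
    """Illustrative status levels, ordered from best to worst."""
    OK = 0
    WARNING = 1
    ERROR = 2


def folder_status(node):
    """Roll up a status tree.

    A node is either a Status (the result for one histogram) or a dict
    mapping child names to nodes (a folder). Assumed policy: a folder is
    as bad as its worst child; an empty folder is OK.
    """
    if isinstance(node, Status):
        return node
    return max((folder_status(child) for child in node.values()),
               default=Status.OK)


def synoptic(node, name, depth=0):
    """Return the lines of a non-graphic, indented status view."""
    lines = [f"{'  ' * depth}{name}: {folder_status(node).name}"]
    if isinstance(node, dict):
        for child_name, child in node.items():
            lines.extend(synoptic(child, child_name, depth + 1))
    return lines


# One folder per chamber, each holding the status of its histograms.
muon = {
    "Chamber1": {"occupancy": Status.OK, "timing": Status.WARNING},
    "Chamber2": {"occupancy": Status.OK},
}
view = synoptic(muon, "Muon")
```

Joining `view` with newlines gives a text summary an operator can read without a GUI, and the same tree can be navigated folder by folder to find which histogram caused a warning.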

Organization of Online DQM
Hardware
– Online DQM PCs must be connected to the experiment network
– They are in general a responsibility of the sub-detector
– System management is carried out centrally by DAQ team
– Disk space for monitor streams and DQM snapshots is managed centrally (as part of the Storage Manager complex)
Software
– XDAQ and CMSSW central installations are provided
– Sub-systems can derive project trees for fast development
– NO flexibility for code running on the filter farm
– SOME flexibility for code to run in “quasi-online” mode (compatible with centralized configuration/control)
– Freedom for applications under sub-system responsibility (e.g. DQM in crate controller under sub-detector FM control)
DB
– Database access by individual DQM processes MUST happen via one of the approved mechanisms (Tstore for OMDS and POOL-ORA for ORCON)
– Database access bandwidth for DQM MUST be negotiated with the DB group
– General rule of thumb is NO DEADTIME due to the DB being stuck on DQM access

Summary
Existing infrastructure covers 80% of DQM requirements
Standardization of DQM data generation is achieved (using DQMServices/FW components)
Standardization of “SuperClients” must be achieved
– Enforce hierarchy of views
– Enforce use of quality test and status tools
– Enforce use of standard entrypoints for data/status manipulation
– Define policies for combining status information
Standardization of control
– Use Run control to drive DQM processes
– DQM becomes a “subsystem”
– Line of reporting for critical errors
Standardization of look and feel
– GUI: development needed for production-level use
– Color codes, etc.