CMS Online Data Quality Monitoring: Real-Time Event Processing Infrastructure
Srećko Morović, Institute Ruđer Bošković, on behalf of the CMS Collaboration
CHEP 2010, October 19, 2010, Academia Sinica, Taipei, Taiwan

Introduction
The CMS Online DQM system facilitates efficient detector operation by providing live data quality and integrity information during data taking (live updates roughly every minute). It provides:
- CMS-wide real-time access (experiment, CERN, remote) through a web server (DQM GUI), with archival and display of past runs
- full integration horizontally (all detector subsystems) and vertically (online, prompt and re-reco, simulation, software-validation DQM), based on a common infrastructure
- online results and an initial data quality assessment made available to offline processing and analysis (Offline DQM)
Related tools:
- DQM GUI: display of online/offline data quality information (→ poster by Valdas Rapsevicius)
- Run Registry: online/offline run book-keeping and summary tool

CMS Data Acquisition System
The DAQ facilitates CMS data taking and is the detector data provider for the DQM (events, plus some histograms).
- L1: hardware first-level trigger selection
- High Level Trigger (HLT): system for online building, reconstruction, analysis and filtering of events passed by the hardware L1 trigger
- Storage Manager (SM) application: stores and distributes HLT-accepted events
  - multiple instances (~16) split the work on the event stream from the HLT
  - supports registration for event/histogram streams (the DQM is a client)
→ Remi Mommsen: The Data Acquisition System of the CMS Experiment at the LHC

Online DQM Server: Overview
The event processing and histogramming system of the DQM; it produces all online data quality and integrity information.
- Input: the DAQ event stream (plus histograms); output: delivery of histograms and quality-test results to the GUI
- Processing is done on a DQM cluster: over 20 different subdetector processing jobs (DQM Applications) run in parallel on several machines (up to 100 Hz event rate per job), producing ~300k histograms, of which ~50k are displayed in the GUI
- Shifters access a small number of relevant summary histograms; quality tests provide automated problem detection and notification (see the sketch below)
- Experts have access to large sets of subdetector histograms for diagnostics
- Streamlined daily testing and deployment of subdetector code updates through a separate Integration System (a replica of the online system)
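As an illustration of the automated quality tests mentioned above, here is a minimal C++ sketch of a dead-channel check that flags a histogram and produces a notification message when too many bins are empty. All class and function names are hypothetical stand-ins; the real CMSSW DQM framework defines its own quality-test interface.

```cpp
// Hypothetical sketch of a DQM quality test with automated notification.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Histogram1D {
    std::vector<double> bins;
};

// A "dead channel" test: fail if the fraction of empty bins is too high.
class DeadChannelTest {
public:
    explicit DeadChannelTest(double maxDeadFraction)
        : maxDeadFraction_(maxDeadFraction) {}

    bool run(const Histogram1D& h, std::string& message) const {
        std::size_t dead = 0;
        for (double b : h.bins)
            if (b == 0.0) ++dead;
        const double fraction =
            h.bins.empty() ? 0.0 : static_cast<double>(dead) / h.bins.size();
        if (fraction > maxDeadFraction_) {
            message = "dead-channel fraction " + std::to_string(fraction) +
                      " exceeds threshold " + std::to_string(maxDeadFraction_);
            return false;  // failure would raise a shifter notification
        }
        return true;
    }

private:
    double maxDeadFraction_;
};

int main() {
    Histogram1D occupancy{{12, 0, 9, 0, 0, 14, 11, 0}};
    DeadChannelTest test(0.25);  // tolerate up to 25% empty channels (assumed)
    std::string msg;
    if (!test.run(occupancy, msg))
        std::cout << "QTest FAILED: " << msg << '\n';  // e.g. flagged in the GUI
}
```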

DQM Server Architecture
(Diagram: events and HLT histograms flow from the SM to the DQM Server; registration and data requests flow back; the DQM Function Manager in Run Control steers the server.)
The server is built out of individual (C++) event-serving and processing applications, tied together by a database configuration.
- DQM Function Manager: a component of the Run Control and Monitoring System (RCMS), the CMS detector control system (a hierarchy of subdetector FSMs)
  - responsible for instantiation and state control of the DQM Server
  - for online data taking, it is itself controlled by the DAQ
  - server configuration: RCMS (XML) configuration stored in the global RCMS DB
- Storage Manager Proxy Server (SMPS): handles registration management towards the DAQ SM cluster, decoupling this task from the DQM Applications
  - registers for all available data (all SMs), giving sensitivity to low-rate events
  - sets the data rate and the event-stream selection (based on HLT triggers)
  - provides intermediate buffering of events and histograms (and their summing)
  - an instance runs on each server machine, local to the subsystem application, avoiding extra network data transfer (see the sketch below)
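A schematic sketch of the SMPS buffering idea referenced above: one local proxy receives events from the Storage Manager instances and buffers them, so DQM Applications consume from the proxy instead of each registering with the DAQ themselves. The types, method names, and the drop-oldest policy are illustrative assumptions, not the real CMS interfaces; thread synchronization is omitted for brevity.

```cpp
// Schematic sketch of the SMPS buffering role (names are illustrative).
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

using Event = std::vector<std::uint8_t>;  // serialized event payload

class StorageManagerProxy {
public:
    explicit StorageManagerProxy(std::size_t capacity) : capacity_(capacity) {}

    // Called for every event arriving from any Storage Manager instance.
    void onEventFromSM(Event ev) {
        if (buffer_.size() == capacity_)
            buffer_.pop_front();  // drop the oldest: monitoring tolerates losses
        buffer_.push_back(std::move(ev));
    }

    // Called by a local DQM Application; non-blocking fetch from the buffer.
    std::optional<Event> nextEvent() {
        if (buffer_.empty())
            return std::nullopt;
        Event ev = std::move(buffer_.front());
        buffer_.pop_front();
        return ev;
    }

private:
    std::size_t capacity_;    // bounded intermediate buffer
    std::deque<Event> buffer_;
};

int main() {
    StorageManagerProxy proxy(1024);       // capacity is an assumed setting
    proxy.onEventFromSM(Event{0x01, 0x02});
    auto ev = proxy.nextEvent();           // the local DQM Application's view
    return ev ? 0 : 1;
}
```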

DQM Server Architecture: Event Processor
(Diagram: the FU Event Processor hosts a DQM Application; events/histograms flow in, and histograms are sent via the DQM Collector to the DQM GUI, under the DQM Function Manager.)
- DQM Applications: subsystem-specific event-processing and histogram-production software (see the sketch below)
  - written in the standard CMS software framework (CMSSW): C++ and Python code
  - detector online/offline code is reused (required to access and process the event information)
  - "InputSource" module: connection to the SMPS
  - DQM Core: implementation of histogramming and quality-test facilities (and much more)
  - DQM Network module: facility for histogram transfer to the Collector/GUI
- Filter Unit (FU) Event Processor: the online equivalent of the CMS execution environment; an HLT component reused to run DQM Applications
- DQM Collector / GUI: the Collector receives processing results and delivers them to the GUI; both are long-lived components, independent of RCMS and of the detector run cycle
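To make the division of labor concrete, here is a minimal sketch of what a subsystem DQM Application's processing module does: book histograms once, then fill them for every event delivered by the input source. `EventData`, `HistogramStore`, and `PixelDQMModule` are hypothetical stand-ins for the CMSSW framework and subsystem classes.

```cpp
// Hypothetical sketch of a subsystem DQM processing module.
#include <map>
#include <string>
#include <vector>

struct Hit { int channel; double charge; };
struct EventData { std::vector<Hit> hits; };

class HistogramStore {
public:
    void book1D(const std::string& name, int nbins) {
        histos_[name].assign(nbins, 0.0);
    }
    void fill(const std::string& name, int bin, double weight = 1.0) {
        auto& h = histos_[name];
        if (bin >= 0 && bin < static_cast<int>(h.size()))
            h[bin] += weight;
    }
private:
    std::map<std::string, std::vector<double>> histos_;
};

class PixelDQMModule {
public:
    explicit PixelDQMModule(HistogramStore& store) : store_(store) {
        store_.book1D("pixel/occupancy", 128);  // one bin per channel (assumed)
        store_.book1D("pixel/chargeSum", 128);
    }
    // Called for every event delivered by the "InputSource".
    void analyze(const EventData& ev) {
        for (const Hit& hit : ev.hits) {
            store_.fill("pixel/occupancy", hit.channel);
            store_.fill("pixel/chargeSum", hit.channel, hit.charge);
        }
    }
private:
    HistogramStore& store_;  // histograms later shipped via the network module
};

int main() {
    HistogramStore store;
    PixelDQMModule module(store);
    module.analyze({{{5, 120.0}, {17, 88.5}}});  // two hits in one event
    return 0;
}
```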

Server Robustness Features
(Diagram: the FU Event Processor forks an FU subprocess that runs the DQM Application.)
Requirement: flexibility and robustness in a fast-changing environment and detector conditions → frequent updates of DQM Application code and framework are needed. The DQM system is therefore designed to be fault-tolerant: it should not go into an error state because of a single Application failure.
- FU Event Processor master/slave model: the Application runs as a forked child process of the FU (see the sketch below)
  - Application crash handling: the Event Processor master instance always remains alive if the sub-process crashes, and can restart the Application automatically if configured to do so
  - it robustly finishes its own Stop transition even when the Application does not stop gracefully (stopping is often the least-tested part of the code)
- The DQM Function Manager is tolerant to individual FU EP instance failures (it stays in the Running state)
  - parallel startup of all FUs, each able to start data processing independently
  - reliable server stopping: a timeout while waiting for the FUs to stop, after which the transition is finished → robustness to sub-component failure and a reliable run-end transition
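A minimal POSIX sketch of the master/slave idea: the master forks the crash-prone application as a child, supervises it, and restarts it on abnormal exit while never entering an error state itself. `runApplication()` and the restart limit are assumptions standing in for the real DQM Application and its configuration.

```cpp
// Minimal POSIX sketch of the master/slave crash-handling model.
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

static void runApplication() {
    // Placeholder for the subsystem DQM Application event loop. A crash here
    // (segfault, abort, ...) kills only this child process, not the master.
    sleep(5);
    std::exit(0);
}

int main() {
    const int maxRestarts = 3;  // assumed policy
    for (int attempt = 0; attempt <= maxRestarts; ++attempt) {
        pid_t child = fork();
        if (child == 0) {              // child: become the application
            runApplication();
            _exit(0);
        }
        int status = 0;
        waitpid(child, &status, 0);    // master stays alive, supervising
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            std::puts("application finished cleanly");
            return 0;
        }
        std::fprintf(stderr, "application died (attempt %d), restarting\n",
                     attempt + 1);
    }
    std::fputs("giving up after repeated crashes; master stays healthy\n",
               stderr);
    return 0;  // the master itself never goes into an error state
}
```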

Parallel Event Processing Model
In the existing system, data processing within a DQM Application is a serial chain of processing modules, executed event by event. An Application can easily become CPU-limited when doing intensive calculations (e.g. track reconstruction), which limits the event rate per subsystem.
- Motivation: a higher rate gives better histogram statistics and better sensitivity to rare events, increasingly needed as the amount of interesting data grows with tighter triggers (higher LHC luminosity)
- Strategy: increase the event rate by splitting the event stream across multiple CPU cores; the parallelization model is carried over from the HLT
- Two approaches: event funneling and histogram summing
  - the FU EP spawns multiple child-process copies; special care is taken when splitting the event stream from the SMPS to avoid data duplication (see the sketch below)
  - DAQ ResourceBroker: data from the FUs are received through a shared-memory buffer
  - a local Storage Manager receives all data and combines it into a single stream, forwarded to a "collector" FU → exporting histograms to the GUI
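The stream splitting can be pictured as follows: a parent distributes serialized events round-robin to N forked worker copies, so each event is processed exactly once (no data duplication). This sketch uses pipes for simplicity, whereas the real system receives data through the DAQ ResourceBroker's shared-memory buffer; all names here are illustrative.

```cpp
// Sketch of splitting an event stream round-robin across forked workers.
#include <sys/wait.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const int nWorkers = 4;           // assumed number of child copies
    std::vector<int> writeEnds;

    for (int w = 0; w < nWorkers; ++w) {
        int fds[2];
        if (pipe(fds) != 0)
            return 1;
        if (fork() == 0) {            // worker: read and "process" events
            for (int fd : writeEnds)
                close(fd);            // drop inherited write ends
            close(fds[1]);
            std::uint32_t ev;
            while (read(fds[0], &ev, sizeof ev) == sizeof ev)
                std::printf("worker %d processed event %u\n", w, ev);
            _exit(0);
        }
        close(fds[0]);
        writeEnds.push_back(fds[1]);
    }

    for (std::uint32_t ev = 0; ev < 12; ++ev)      // round-robin split
        (void)write(writeEnds[ev % nWorkers], &ev, sizeof ev);
    for (int fd : writeEnds)
        close(fd);                                  // signal end of stream
    while (wait(nullptr) > 0) {}                    // reap all workers
    return 0;
}
```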

Parallel Event Processing (continued)
Event funneling: "parallel" FUs do the event processing, the streams are combined, and the histograms are filled in the "collector" FU.
- Advantage: correct output results (up to the exact time ordering of events)
- Drawbacks:
  - only the event-processing chain runs in parallel, not histogram calculation, quality tests, etc.; suitable only when histogram filling is quick and event processing slow
  - high performance cost of (de)serialization of the complex ROOT objects needed for inter-process data transfer
Histogram summing: histograms from the split event stream are summed by the Storage Manager.
- Advantages: the whole DQM Application runs in parallel; no event (de)serialization
- Drawback: possibly incorrect results due to summing
  - correctness issues: there is no general method to combine histograms; summing is implemented only for averages and cumulative data, and is not applicable to non-summable (non-statistical) information, e.g. detector status diagrams (see the illustration below)
Histogram summing is still an experimental feature and needs per-Application performance assessment and "tuning"; the multicore hardware hosting the Online DQM Server is already in place.
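A tiny illustration of the summing-correctness point: cumulative histograms (counts) combine exactly bin by bin, while a status diagram, whose bins encode codes rather than statistics, is corrupted by the same operation. This is a self-contained example, not the Storage Manager's actual summing logic.

```cpp
// Why histogram summing is only correct for statistical content.
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<double> sumHistograms(const std::vector<double>& a,
                                  const std::vector<double>& b) {
    assert(a.size() == b.size());
    std::vector<double> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = a[i] + b[i];
    return out;
}

int main() {
    // Occupancy counts from two parallel processes: summing is exact.
    std::vector<double> occA{10, 4, 7}, occB{3, 5, 1};
    std::vector<double> occ = sumHistograms(occA, occB);  // {13, 9, 8}: correct

    // Status diagram, bin content encoding 0 = OK, 1 = warning, 2 = error.
    std::vector<double> statA{0, 1, 0}, statB{0, 1, 2};
    std::vector<double> stat = sumHistograms(statA, statB);
    // stat == {0, 2, 2}: channel 1 now reads "error" although both inputs
    // only reported "warning"; the summed plot no longer means anything.
    assert(occ[0] == 13.0 && stat[1] == 2.0);
    return 0;
}
```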

Summary
- Online (and offline) DQM was in place and working robustly on day 1 of the LHC, and has proven a cornerstone of CMS data-taking efficiency and data certification (selection of runs for analysis)
- Up-to-date detector information (and a history of runs) is made available to the experiment by real-time event processing
- Infrastructure: application building blocks (FU EP, SMPS, SM, ...) put together by the RCMS configuration
- Built-in robustness (fault-tolerant design) → less problem-solving for shifters and experts (valuable during night hours)
- A fast and flexible update policy allows up-to-date code to be used in production → important in a fast-changing environment (early data-taking period)
- Limitations of the existing system (processing power) are being examined, and extensions are being developed (still experimental)
→ Aaron Soha: Web Based Monitoring in the CMS Experiment at CERN

BACKUP

DQM GUI
Web-based, experiment-wide data quality information display with dynamic, asynchronous UI updates. A variety of information is grouped and presented:
- front page: detector overview
- content: mostly detector data-integrity plots, plus some physics plots
- "quick" plot collections for the shifter
- Provenance application: subdetector on/off status
- detailed subdetector collections available for experts to look for issues