Evolution of Monitoring System for AMS Science Operation Center


Evolution of Monitoring System for AMS Science Operation Center
B. Shan¹, on behalf of the AMS Collaboration
¹Beihang University, China
CHEP 2016, San Francisco

Good afternoon. I am honored to present the "Evolution of the Monitoring System for the AMS Science Operation Center" on behalf of the collaboration.

AMS Experiment
The Alpha Magnetic Spectrometer (AMS) is a high-energy physics experiment on board the International Space Station (ISS), with the following features:
- Geometrical acceptance: 0.5 m²·sr
- Number of readout channels: ~200K
- Main payload of Space Shuttle Endeavour's last flight (May 16, 2011)
- Installed on the ISS on May 19, 2011
- Running 24/7
- Over 60 billion events collected so far
- Maximum event rate: 2 kHz
[Detector layout: TRD, TOF, Tracker (planes 1-9), RICH, ECAL]

The Alpha Magnetic Spectrometer is a high-energy physics experiment on board the ISS. The geometrical acceptance is 0.5 square meter steradian, and the number of readout channels is around 200 thousand. Our detector was the main payload of the Space Shuttle Endeavour's last flight. It was installed on the ISS on May 19, 2011, and has been collecting data without interruption ever since. Up to now we have collected more than 60 billion events, and the maximum event collection rate reaches 2 thousand hertz.

AMS Data Flow
- Data are transferred via relay satellites to the Marshall Space Flight Center, then to CERN, in nearly real time, in the form of one-minute frames
- Preproduction: frames → runs (RAW); 1 run = ¼ orbit (~23 minutes)
- The run number is the UNIX time of the start of the run
[Data-flow diagram: AMS on ISS → TDRS satellites → White Sands, NM, USA → AMS POCC, CERN; ISS astronaut with AMS laptop]

The data collected by the detector are transferred via relay satellites, first to MSFC and then to CERN, in nearly real time, in the form of one-minute frame files. Frame files are then built into runs. To facilitate calibrations, we organize one quarter-orbit of data into one run, around 23 minutes, and call it a RAW file. RAW files are then reconstructed into ROOT files.
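Since a run number is simply the UNIX time of the run's start, decoding it is a plain epoch conversion. A minimal sketch (the function name is mine, not from the SOC code):

```python
from datetime import datetime, timezone

def run_start_time(run_number: int) -> str:
    """A run number is the UNIX time (in seconds) of the run's start,
    so recovering the start time is a plain epoch conversion."""
    ts = datetime.fromtimestamp(run_number, tz=timezone.utc)
    return ts.strftime("%Y-%m-%d %H:%M:%S UTC")

# For example, a run numbered 1305792000 started on installation day:
print(run_start_time(1305792000))  # 2011-05-19 08:00:00 UTC
```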

AMS Flight Data Reconstruction
Standard Production
- Runs 24/7 on freshly arrived data
- Initial data validation and indexing
- Produces Data Summary Files and Event Tags (ROOT) for fast event selection
- Usually available within 2 hours after the flight data arrive
- Used to produce the various calibrations for the second production, as well as for quick performance evaluation
Pass-N Production
- Incremental, every 3-6 months
- Full reconstruction in case of a major software update
- Uses all the available calibrations, alignments, ancillary data from the ISS, and slow-control data to produce a physics-analysis-ready set of reconstructed data
[Diagram: RAW → ROOT (ready for physics analysis), using calibrations/alignments/SlowControlDB]

The data reconstruction has two stages. The first production always runs on freshly arrived data and performs the initial data validation and indexing. It produces Data Summary Files and Event Tags for fast event selection, and the data are usually available within 2 hours after the flight data arrive. Detector experts use it to produce the various calibrations for the second production and to do a quick performance evaluation. The second production runs every three to six months, with a full reconstruction in case of a major software update. It uses all the available calibrations, alignments, ancillary data from the ISS, and slow-control data to produce the final data for physics analysis.

Science Operation Center (SOC) of AMS
Responsibilities
- Processing of the AMS science data: preproduction, standard production, Pass-N production, Monte Carlo simulation
- Management of the SOC's own production farm: 21 hosts, 264 cores, 300 TB storage
- Operations on AMS resources at CERN: LSF, EOS, CASTOR, AFS, CVMFS, ELOG, PDB-R, etc.

The first responsibility of the SOC is to process the AMS science data, which includes the above-mentioned preproduction, standard production, Pass-N production, and Monte Carlo simulation. The second is to manage the SOC's dedicated production farm of 21 hosts, 264 cores, and 300 TB of storage, managed by the Red Hat High Availability cluster service. We also look after the resources provided to us by CERN, such as LSF, EOS, CASTOR, AFS, CVMFS, ELOG, and PDB-R.

Science Operation Center (SOC) of AMS
[Responsibilities slide repeated, with "Frame arriving" highlighted: Frame Monitor]

There is a set of monitoring tools for the different SOC activities, and I will go over them briefly. The first is the Frame Monitor.

Frame Monitor
Monitors the statistics of the frame data arriving at the SOC.

The Frame Monitor tracks the arrival rate of the frame data. Due to the incomplete coverage of the relay satellites, frames do not arrive continuously. The shift taker should usually investigate if no frame has arrived for an hour or more.
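The hour-without-frames check that the shift crew applies can be sketched as follows; the threshold and names are illustrative, not taken from the actual tool:

```python
NO_FRAME_ALERT_SECONDS = 3600  # shift taker investigates after ~1 hour without frames

def frame_delay(last_frame_epoch, now):
    """Seconds elapsed since the last one-minute frame arrived at the SOC."""
    return now - last_frame_epoch

def needs_attention(last_frame_epoch, now):
    """True when no frame has arrived for an hour or more.
    Gaps are expected during relay-satellite coverage holes,
    but long ones are worth a manual check."""
    return frame_delay(last_frame_epoch, now) >= NO_FRAME_ALERT_SECONDS

# Last frame arrived 90 minutes ago: flag it for the shift taker
print(needs_attention(1000, 1000 + 90 * 60))  # True
```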

Science Operation Center (SOC) of AMS
[Responsibilities slide repeated, with "Preproduction" highlighted: data files web page]

For the preproduction, we have a data files web page to monitor the building and validation of the RAW files.

Data Files Web Page
A web page listing the RAW/ROOT files.

The page lists the information by run, including the run number, run tag, path, timestamp, number of events, type, size, and the frame range from which the run was built. It also allows one to check the latency of the preproduction.

Science Operation Center (SOC) of AMS
[Responsibilities slide repeated, with "Standard production" highlighted: Production Monitor]

The standard production is important and runs constantly; the Production Monitor is the tool that shows its detailed status and information.

Production Monitor
Monitoring and managing the standard production.

The Production Monitor uses the CORBA protocol to communicate with all the producer processes and obtain their latest progress. In fact it is more than a monitor: the administrator can configure and manage everything about the standard production, including the participating hosts, the number of threads, the production server parameters, etc.

Science Operation Center (SOC) of AMS
[Responsibilities slide repeated, with "Monte-Carlo simulation" highlighted: MC Production Monitoring Tool]

For the MC production, we developed a web page to monitor the progress.

MC Production Monitoring Tool
Web-based MC progress monitoring (Poster 284).

On this page we can see the progress, the estimated time of completion, and the contributions from the different computing centers.

Science Operation Center (SOC) of AMS
[Responsibilities slide repeated, with the production farm highlighted: NetMonitor]

NetMonitor is the tool that monitors the health of our hardware.

NetMonitor
NetMonitor is a monitoring tool without a GUI that polls the health status of the SOC's own computing facilities, including:
- Production servers/hosts
- NFS servers
- NFS-mounted volumes on NFS clients
- Software: production server, frame decoder, RAW validator
In case of issues, NetMonitor sends warning messages to the system administrators.
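A hung NFS mount is the classic trap for such a poller, so each probe is best run in a child process with a timeout. A minimal sketch of the volume check, with illustrative paths and timeout (the real NetMonitor implementation is not shown here):

```python
import subprocess

def mount_responsive(path, timeout=10):
    """Stat the volume from a child process so that a hung NFS mount
    can only block the probe, never the monitor itself."""
    try:
        subprocess.run(["stat", path], check=True,
                       capture_output=True, timeout=timeout)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False

for volume in ["/", "/no/such/volume"]:
    status = "OK" if mount_responsive(volume) else "WARN: unresponsive or missing"
    # a real poller would send a warning message to the admins on WARN
    print(volume, status)
```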

Integrating into the new monitoring infrastructure of CERN
The monitoring tools described above are:
- Powerful
- But "fragmented", lacking a glance view
- Mostly not directly accessible from outside CERN
Design principles:
- Unified interface: include all the related elements that should be monitored
- Knowledge base: define rules for automatic troubleshooting

The key criterion for production health: delay
- Frame delay (last frame arrived): whether frames are arriving normally
- RAW delay (latest validated RAW): whether the frame decoder and RAW validator work normally
- ROOT delay (latest validated ROOT): whether the production server, producers, and ROOT validator work normally
- Fresh-run delay (largest validated run number): whether job requesting works normally
[Diagram: Frames → RAW → ROOT (ready for physics analysis), using calibrations/alignments/SlowControlDB]
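All four delays can be computed the same way, as "now minus the timestamp of the latest good object"; for the fresh-run delay, the run number itself is that timestamp, since a run number is the UNIX time of the run's start. A sketch with illustrative names:

```python
def production_delays(now, last_frame, last_raw, last_root, largest_fresh_run):
    """The four production-health delays, in seconds.
    Each argument is the UNIX time of the latest validated object;
    for the fresh-run delay the run number itself is that timestamp."""
    return {
        "frame_delay": int(now - last_frame),             # frames arriving normally?
        "raw_delay": int(now - last_raw),                 # frame decoder / RAW validator OK?
        "root_delay": int(now - last_root),               # production server / producers / ROOT validator OK?
        "fresh_run_delay": int(now - largest_fresh_run),  # job requesting OK?
    }

print(production_delays(10000, 9900, 9000, 4000, 2000))
# {'frame_delay': 100, 'raw_delay': 1000, 'root_delay': 6000, 'fresh_run_delay': 8000}
```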

Metric, metric class, sensor
- To describe the delays, a metric class named "amssoc.proddelay" is defined in the Lemon Metrics Manager
- Four integers indicate the four delay times, in seconds
- A metric instance and a sensor are defined based on the metric class
- A Puppet-based virtual host is created to run this sensor
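The payload of one sample is then just four integers in a fixed order. A stand-in for the sensor's output formatting (the real sensor is built on Lemon's own framework, which is not shown here; the field order below is an assumption for illustration):

```python
def format_sample(delays):
    """Emit the four delay integers in a fixed order matching the
    'amssoc.proddelay' metric class: frame, RAW, ROOT, fresh-run.
    (Illustrative only: the real Lemon sensor publishes samples
    through Lemon's sensor framework rather than plain text.)"""
    order = ("frame_delay", "raw_delay", "root_delay", "fresh_run_delay")
    return " ".join(str(int(delays[k])) for k in order)

print(format_sample({"frame_delay": 120, "raw_delay": 900,
                     "root_delay": 5400, "fresh_run_delay": 7200}))
# 120 900 5400 7200
```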

Web monitoring
A Kibana page was created to monitor the delay data.

Notification and troubleshooting
- Thresholds are defined for notifications
- Troubleshooting:
  - Rules are defined in a database, called the rule base
  - The program proceeds step by step according to the rules defined in the rule base
  - Notifications are sent to the experts about the production status, together with the potential reason(s) given by the diagnostics program
  - After each manual intervention, the experts can update the rule base to make the diagnostics program smarter than before, hopefully
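The "program goes step by step according to the rules" idea can be sketched as an ordered list of condition/diagnosis pairs; the thresholds and messages below are illustrative, not the production rule base:

```python
# Each rule pairs a condition on the delays with a suggested diagnosis.
# Rules are walked in order, mimicking the step-by-step diagnostics program.
RULES = [
    (lambda d: d["frame_delay"] > 3600 and d["raw_delay"] > 3600,
     "No frames for over an hour: likely a satellite coverage gap or link issue."),
    (lambda d: d["frame_delay"] < 3600 and d["raw_delay"] > 7200,
     "Frames arrive but RAW is stale: check the frame decoder / RAW validator."),
    (lambda d: d["raw_delay"] < 7200 and d["root_delay"] > 7200,
     "RAW is fresh but ROOT is stale: check the production server / producers."),
]

def diagnose(delays):
    """Walk the rule base in order and collect every diagnosis that fires;
    the result is attached to the notification sent to the experts."""
    return [msg for cond, msg in RULES if cond(delays)]

print(diagnose({"frame_delay": 100, "raw_delay": 8000, "root_delay": 1000}))
```

Keeping the rules as data rather than code is what lets experts extend the rule base after each manual intervention without touching the diagnostics program itself.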

Notification and troubleshooting
Example of a troubleshooting rule base.

Summary
- The legacy SOC monitoring tools provide powerful functions but are "fragmented"
- Based on the monitoring infrastructure provided by CERN, a monitoring system for the SOC has been designed, which provides:
  - A glance view of the SOC status
  - Web accessibility
  - Automatic notification
  - Troubleshooting assistance
- The experts' experience is extracted into rules and stored in a database, which is expandable
- An automatic troubleshooting attempt is carried out in case of issues
- Manual interventions act as feedback to the existing rules

Future work
- Manage the SOC physical boxes with Puppet, monitor their health status with sensors, and integrate the sensor data into the monitoring page
- Add more sensors for SOC programs (RAW/ROOT validators, job requestor, data movers, etc.) and integrate their data into the monitoring page
- Integrate resource-availability data of the CERN services (EOS/CASTOR/AFS/CVMFS...) into the monitoring page
- Improve the rule base to provide better troubleshooting assistance