Evolution of Monitoring System for AMS Science Operation Center

1 Evolution of Monitoring System for AMS Science Operation Center
B. Shan (Beihang University, China), on behalf of the AMS Collaboration
CHEP 2016, San Francisco
Good afternoon. I am honored to present the "Evolution of Monitoring System for AMS Science Operation Center" on behalf of the collaboration.

2 AMS Experiment
The Alpha Magnetic Spectrometer (AMS) is a high-energy physics experiment on board the International Space Station (ISS), featuring:
- Geometrical acceptance: 0.5 m²·sr
- Number of readout channels: ≈200,000
- Main payload of Space Shuttle Endeavour's last flight (May 16, 2011)
- Installed on the ISS on May 19, 2011
- Running 24/7; over 60 billion events collected so far
- Maximum event rate: 2 kHz
[Detector schematic: TRD, TOF, Tracker (layers 1-9), RICH, ECAL]
The detector was installed on the ISS on May 19, 2011, and has been collecting data ever since, without interruption. Up to now we have collected more than 60 billion events; the maximum event collection rate reaches 2 kHz.

3 AMS Data Flow
- Data are transferred via relay satellites to the Marshall Space Flight Center, then to CERN, in near real time, in the form of one-minute frames
- Preproduction: frames → runs (RAW); 1 run = 1/4 orbit (~23 minutes)
- The run number is the UNIX time of the start of the run
[Data-flow diagram: AMS on ISS → TDRS satellites → White Sands, NM, USA → AMS POCC, CERN; photo: ISS astronaut with AMS laptop]
The data collected by the detector are transferred via relay satellites, first to MSFC and then to CERN, in near real time, as one-minute frame files. Frame files are then built into runs; to facilitate calibrations, we organize one quarter orbit of data into one run, about 23 minutes, and call it a RAW file. RAW files are then reconstructed into ROOT files. A sketch of this frame-to-run grouping follows.
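As an illustration, here is a minimal sketch of the frame-to-run grouping described above, assuming frame files carry their acquisition timestamps as UNIX seconds; the function and variable names are hypothetical, not the actual preproduction code.

```python
ORBIT_SECONDS = 92 * 60          # ISS orbital period, roughly 92 minutes
RUN_LENGTH = ORBIT_SECONDS // 4  # 1 run = 1/4 orbit, ~23 minutes

def frames_to_runs(frame_times):
    """Group one-minute frame timestamps (UNIX seconds) into runs.

    Returns {run_number: [frame_time, ...]}, where the run number is
    the UNIX time of the first frame that opens the run.
    """
    runs = {}
    run_start = None
    for t in sorted(frame_times):
        if run_start is None or t - run_start >= RUN_LENGTH:
            run_start = t  # a new run begins; its UNIX time is the run number
        runs.setdefault(run_start, []).append(t)
    return runs
```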

4 AMS Flight Data Reconstruction
Standard Production
- Runs 24/7 on freshly arrived data
- Performs the initial data validation and indexing
- Produces Data Summary Files and Event Tags (ROOT) for fast event selection
- Output is usually available within 2 hours of flight-data arrival
- Used to produce the calibrations for the second production, as well as for quick performance evaluation
Pass-N Production
- Incremental, every 3-6 months; a full reconstruction is done in case of a major software update
- Uses all the available calibrations, alignments, ancillary data from the ISS, and slow-control data to produce a physics-analysis-ready set of reconstructed data
[Diagram: RAW → ROOT, fed by calibrations/alignments/SlowControlDB (ready for physics analysis)]
Data reconstruction has two stages. The first production always runs on freshly arrived data and performs the initial validation and indexing; detector experts use its output to produce the calibrations for the second production and for quick performance evaluation. The second production runs every three to six months and produces the final data for physics analysis.

5 Science Operation Center (SOC) of AMS Responsibilities
Processing of the AMS science data:
- Preproduction
- Standard production
- Pass-N production
- Monte-Carlo simulation
Management of SOC's own production farm:
- 21 hosts, 264 cores
- 300 TB storage
Operations on AMS resources at CERN:
- LSF, EOS, CASTOR, AFS, CVMFS, ELOG, PDB-R, etc.
The first responsibility of the SOC is to process the AMS science data: the preproduction, standard production, pass-N production, and Monte-Carlo simulation mentioned above. The second is to manage the SOC's dedicated production farm of 21 hosts, 264 cores, and 300 TB of storage, managed by the Red Hat High Availability cluster service. We are also responsible for the AMS resources provided by CERN, such as LSF, EOS, CASTOR, AFS, CVMFS, ELOG, and PDB-R.

6 Science Operation Center (SOC) of AMS Responsibilities (recap)
(The responsibilities list from slide 5 is repeated, with "Frame arriving" highlighted and pointing to the Frame Monitor.)
There is a set of monitoring tools covering the different SOC activities; I will go over them briefly. The first is the Frame Monitor.

7 Frame Monitor
Monitors the statistics of the frame data arriving at the SOC.
The Frame Monitor tracks the arrival rate of frame data. Due to the incomplete coverage of the relay satellites, frames do not arrive continuously. The shift taker should normally investigate if no frame has arrived for an hour or more; a minimal version of that freshness check is sketched below.
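The following sketch illustrates the one-hour check a shift taker relies on; the threshold constant and function names are assumptions for illustration, not the actual Frame Monitor code.

```python
import time

FRAME_GAP_ALERT = 3600  # seconds; investigate after ~1 hour without frames

def check_frame_gap(last_frame_time, now=None):
    """Return a warning string if no frame has arrived for an hour or more.

    last_frame_time: UNIX timestamp of the most recent frame seen at the SOC.
    """
    now = now or time.time()
    gap = now - last_frame_time
    if gap >= FRAME_GAP_ALERT:
        return f"WARNING: no frame for {gap / 60:.0f} minutes - please check"
    return None
```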

8 Science Operation Center (SOC) of AMS Responsibilities (recap)
(The responsibilities list from slide 5 is repeated, with "Preproduction" highlighted and pointing to the data files web page.)
For preproduction, we have a data files page to monitor the building and validation of the RAW files.

9 Data files web page
A web page listing the RAW/ROOT files.
The page lists the information per run: run number, run tag, path, timestamp, number of events, type, size, and the frame range from which the run was built. From it one can also check the latency of the preproduction. An illustrative record structure is sketched below.
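To make the record layout concrete, here is a sketch of one row of the page built from the fields listed above; the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    """One row of the data-files page (field names are illustrative)."""
    run_number: int   # UNIX time of the start of the run
    run_tag: str
    path: str
    timestamp: int    # when the file became available/validated
    n_events: int
    file_type: str    # e.g. "RAW" or "ROOT"
    size_bytes: int
    first_frame: str  # frame range the run was built from
    last_frame: str

    def preproduction_latency(self) -> int:
        """Seconds between the run start and file availability."""
        return self.timestamp - self.run_number
```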

10 Science Operation Center (SOC) of AMS Responsibilities (recap)
(The responsibilities list from slide 5 is repeated, with "Standard production" highlighted and pointing to the Production Monitor.)
Standard production is important and runs constantly. The Production Monitor is the tool that tracks the detailed status and information of the standard production.

11 Production Monitor
Monitoring/managing the standard production.
The Production Monitor uses the CORBA protocol to communicate with all the producer processes and obtain their latest progress. It is in fact more than a monitor: the administrator can configure and manage everything about the standard production, including the participating hosts, the number of threads per host, the production server parameters, etc. A sketch of such a CORBA status query follows.
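A minimal sketch of querying one producer over CORBA, using omniORBpy on the client side. The IDL interface, the stub module name, and the `get_status` operation are hypothetical; only the ORB calls are standard omniORBpy usage.

```python
# Assumed IDL, compiled with omniidl into a Python stub module "ProdMon":
#   module ProdMon { interface Producer { string get_status(); }; };
import sys
from omniORB import CORBA
import ProdMon  # hypothetical IDL-generated stubs

orb = CORBA.ORB_init(sys.argv, CORBA.ORB_ID)

def query_producer(ior: str) -> str:
    """Resolve a producer from its stringified IOR and ask for its progress."""
    obj = orb.string_to_object(ior)
    producer = obj._narrow(ProdMon.Producer)
    if producer is None:
        raise RuntimeError("object is not a ProdMon.Producer")
    return producer.get_status()
```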

12 Science Operation Center (SOC) of AMS Responsibilities (recap)
(The responsibilities list from slide 5 is repeated, with "Monte-Carlo simulation" highlighted and pointing to the MC production monitoring tool.)
For MC production monitoring, we developed a web page to follow the progress.

13 MC Production Monitoring Tool
Web-based MC progress monitoring (Poster 284).
This page shows the progress, the estimated completion time, and the contribution of each participating computing center.

14 Science Operation Center (SOC) of AMS Responsibilities (recap)
(The responsibilities list from slide 5 is repeated, with the production-farm management highlighted and pointing to NetMonitor.)
NetMonitor is the tool that monitors our hardware health.

15 NetMonitor
NetMonitor is a monitoring tool without a GUI that polls the health status of the SOC's own computing facilities, including:
- Production servers/hosts
- NFS servers
- NFS-mounted volumes on NFS clients
- Software: production server / frame decoder / RAW validator
In case of issues, NetMonitor sends warning messages to the system administrators. A simplified polling loop is sketched below.
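A minimal sketch in the spirit of NetMonitor: ping the hosts, check the NFS mounts, and mail the administrators on failure. The host names, mount points, and mail addresses are placeholders, not the real SOC setup.

```python
import os
import smtplib
import subprocess
from email.message import EmailMessage

HOSTS = ["prodserver01", "nfsserver01"]  # hypothetical host names
NFS_MOUNTS = ["/ams/data", "/ams/mc"]    # hypothetical mount points

def host_alive(host: str) -> bool:
    """One ICMP ping; True if the host answers within 2 seconds."""
    return subprocess.call(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

def mount_ok(path: str) -> bool:
    """A healthy NFS volume should be mounted and readable."""
    return os.path.ismount(path) and os.access(path, os.R_OK)

def notify_admins(problems):
    msg = EmailMessage()
    msg["Subject"] = "NetMonitor warning"
    msg["From"] = "netmonitor@example.org"   # placeholder
    msg["To"] = "soc-admins@example.org"     # placeholder
    msg.set_content("\n".join(problems))
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

problems = [f"host {h} unreachable" for h in HOSTS if not host_alive(h)]
problems += [f"NFS volume {m} not healthy" for m in NFS_MOUNTS if not mount_ok(m)]
if problems:
    notify_admins(problems)
```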

16 Integrating into the new monitoring infrastructure of CERN
The above-mentioned monitoring tools are powerful but "fragmented": they lack a "glance view", and most have no direct access from outside CERN.
Design principles:
- Unified interface: includes all the related elements that should be monitored
- Knowledge base: defines rules for automatic troubleshooting

17 The key criteria for production health: Delay
- Frame delay (last frame arrived): whether frames are arriving normally
- RAW delay (latest validated RAW): whether the frame decoder / RAW validator works normally
- ROOT delay (latest validated ROOT): whether the production server / producers / ROOT validator work normally
- FRESH RUN delay (largest validated run number): whether job requesting works normally
[Diagram: Frames → RAW → ROOT, fed by calibrations/alignments/SlowControlDB (ready for physics analysis)]

18 Metric, metric class, sensor
- To describe the delays, a metric class named "amssoc.proddelay" is defined in the "Lemon Metrics Manager": 4 integers holding the 4 delay times in seconds
- A metric instance and a sensor are defined based on the metric class
- A Puppet-based virtual host is created to run this sensor
A sketch of the underlying delay computation is shown after this list.
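The following sketch shows how the four integers of "amssoc.proddelay" could be computed; the actual Lemon sensor interface is not reproduced here, and the timestamps are assumed to come from hypothetical helpers elsewhere.

```python
import time

def prod_delays(last_frame, last_raw, last_root, last_fresh_run):
    """Return the 4 delay integers (seconds): frame, RAW, ROOT, fresh run.

    Each argument is the UNIX timestamp of the corresponding latest item;
    the fresh-run delay uses the largest validated run number, which is
    itself the UNIX time of the start of that run.
    """
    now = int(time.time())
    return (now - last_frame,
            now - last_raw,
            now - last_root,
            now - last_fresh_run)
```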

19 Web monitoring
A Kibana page has been created to monitor the delay data.

20 Notification and troubleshooting
- Thresholds are defined for notifications.
- Troubleshooting: rules are defined in a database, called the rule base; the diagnostics program proceeds step by step according to the rules defined there.
- Notifications are sent to the experts about the production status, together with the potential reason(s) given by the diagnostics program.
- After each manual intervention, experts can update the rule base, hopefully making the diagnostics program smarter than before.
A minimal sketch of such a rule-driven diagnostics pass follows.
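This sketch walks a rule base step by step and collects every matching diagnosis, mirroring the delay semantics of slide 17; the rule structure, thresholds, and diagnosis texts are illustrative, not the actual AMS rule base.

```python
RULES = [
    # (rule name, check returning True when the problem is present, diagnosis)
    ("frame_delay_high", lambda d: d["frame"] > 3600,
     "no frames for >1h: check satellite coverage / transfer from MSFC"),
    ("raw_delay_high", lambda d: d["raw"] > 7200,
     "RAW stale: check frame decoder / RAW validator"),
    ("root_delay_high", lambda d: d["root"] > 7200,
     "ROOT stale: check production server / producers / ROOT validator"),
    ("fresh_run_delay_high", lambda d: d["fresh_run"] > 7200,
     "fresh run stale: check job requesting"),
]

def diagnose(delays):
    """Walk the rule base step by step; collect every matching diagnosis."""
    return [diag for name, check, diag in RULES if check(delays)]

# Usage: feed in the four delay metrics (seconds) and report to the experts.
findings = diagnose({"frame": 5400, "raw": 800, "root": 600, "fresh_run": 900})
for f in findings:
    print("potential reason:", f)
```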

21 Notification and troubleshooting
Example of the troubleshooting rule base.

22 Summary
- The legacy SOC monitoring tools provide powerful functions but are "fragmented".
- Based on the monitoring infrastructure provided by CERN, a monitoring system for the SOC has been designed. It provides:
  - a glance view of the SOC status, accessible via the web;
  - automatic notification;
  - a troubleshooting assistant: the experts' experience is extracted into rules and stored in an expandable database, an automatic troubleshooting attempt is carried out in case of issues, and manual interventions act as feedback to the existing rules.

23 Future work
- Manage the SOC physical boxes with Puppet, monitor their health status with sensors, and integrate the sensor data into the monitoring page
- Add more sensors for SOC programs (RAW/ROOT validators, job requestor, data movers, etc.) and integrate their data into the monitoring page
- Integrate resource-availability data of CERN services (EOS/CASTOR/AFS/CVMFS…) into the monitoring page
- Improve the rule base and provide better assistance for troubleshooting

