Published by Jasper Heath. Modified over 8 years ago.
Service Availability Monitor for the LHC experiments
EGEE 07

Simone Campana, Alessandro Di Girolamo, Nicolò Magini, Patricia Mendez Lorenzo, Vincenzo Miccio, Elisa Lanciotti, Roberto Santinelli and Andrea Sciabà

Service Availability Monitor (SAM)

SAM is the EGEE framework developed to provide a global and uniform monitoring tool for Grid services.
- Periodic tests, organized in sensors, run on all Grid services.
- Test results are published in an internal database (SAMDB), exposed via a web interface, and processed to calculate availability metrics for site validation.

SAM Alarm System

SAM offers an alarm system to announce failures to sites and VOs.
- The list of critical tests is defined through the FCR web interface.
- The notification system (email and/or SMS), the frequency of reports and the contact persons are configurable on a site-by-site (and VO-by-VO) basis.

SAM Architecture

Input:
- Site information collection tools: static and dynamic information
- SAM submission framework: test submission, high-level execution workflow

Storage and Processing:
- Web services: query/publishing, programmatic interface
- Oracle Database: storing the test results, test description, test criticality, alarms

Output:
- SAM display: SAM portal and GridView (availability graphs, historical test results, detailed test results)

Experiment monitoring with SAM

The Experiment Integration Support (EIS) team has been active since 2002 in the Worldwide LHC Computing Grid (WLCG) project. The EIS team helps the LHC experiments and other communities to run activities on the Grid as effectively as possible.
EIS activities include:
- Contributing to the integration of the experiment computing framework with the Grid middleware
- Interfacing user communities with the middleware developers and the WLCG infrastructure operations
- Developing new user tools to implement functionalities missing from the Grid middleware
- Testing middleware components as they become available
- Directly participating in the experiment computing activities (data challenges, MC production, etc.)
- Providing end-user documentation

The flexibility of the SAM framework makes it an excellent choice for any Virtual Organization to implement custom tests on existing service types, or even on experiment-specific services. The EIS team is strongly involved in the integration of the LHC experiment monitoring with SAM.

SAM implementation for the ALICE experiment

The ALICE production model requires a dedicated host (VOBOX) at each site to deploy and manage specific long-living agents and to install the ALICE-specific software. More than 60 ALICE VOBOXes are deployed all over the world!

ALICE created a self-contained test suite verifying the correct behavior of these nodes, based on:
- VOBOX services (proxy renewal and delegation)
- VOBOX clients configuration

The SAM framework fulfilled the ALICE requirements on VOBOX monitoring:
- Flexible definition and dynamic configuration of tests
- VO-based definition of the service endpoints to be monitored

The ATLAS experiment is developing dedicated SAM tests to monitor the availability of critical site services, like CE and SE, and to verify the correct installation of the ATLAS software on each site.
- Endpoint definitions are contained in an ATLAS-specific configuration file (TiersOfATLAS).
- Different endpoints might need to be tested using different VOMS credentials.

ATLAS uses the SAM alarm system:
- SE / SRM / CE tests failing: site contact persons will be alerted via the SAM Alarm System.
- Grid Services (FTS, LFC, etc.) tests failing: alarms will be sent to the service responsible and to the ATLAS dedicated services (Distributed Data Management, etc.) that use those services.

ATLAS runs the standard Grid Operations team tests, but using ATLAS Grid credentials.

Under development:
- Storage Element endpoint: test direct access to the SE / SRM via native protocols.
- Computing Element: test all the functionalities needed for production and analysis jobs by sending a special job to the WNs of the CEs to:
  - check the presence of the required version of the ATLAS software
  - compile and execute a real analysis job based on a sample dataset

The SAM test suite is run with the credentials of the LHCb Software Manager and is composed of several critical sensors for LHCb, grouped in a single task. Special agreements have been negotiated with the sites to ensure the highest priority for these jobs. CEs passing these critical sensors are eligible to run LHCb jobs.

The sensor workflows are now constructed using the DIRAC API and submitted to the DIRAC WMS, where LHCb can monitor them. At the end of execution on a worker node, the result is published from the WN to the SAM DB and the output is sent to a special Storage Element accessible via web.

LHCb uses the SAM framework for:
- Checking the availability of Computing Elements
- Detecting the operating system and architecture
- Installing the appropriate versions of the LHCb software if a shared area is provided

Evolution of the usage of the SAM sensor: 50% of sites over the 80% availability mark.

CMS monitoring via SAM

CMS has adopted SAM as the system to implement Grid-wide monitoring of computing and storage elements. CMS contacts at sites must ensure that the CMS tests run successfully.

An interesting side effect of the choice of SAM for the CMS monitoring was a strong push for interoperability between EGEE and OSG: jobs are submitted by SAM to the LCG Resource Broker also for OSG sites.

CMS - Computing Element tests

Name          Checks
basic         CMS software area and CMS site local configuration
swinst        Presence of the required versions of the CMSSW
Monte Carlo   Stage out of a file from the WN to the local SE
Squid         Basic functionality of the closest Squid server
FroNtier

CMS - SRM tests

Name              Checks
get-pfn-from-tfc  Gets the LFN-to-PFN rule from a central CMS DB
put               Copies a test file into the SRM via srmcp
get-metadata      Gets metadata of the remote file
get               Copies back the remote file
advisory-delete   Removes the remote file

Site availability

The outcome of the CMS SAM tests is used to give a measurement of the CMS availability. It is expressed as the fraction of successful tests as a function of time.
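The CMS site availability measure (the fraction of successful tests as a function of time) can be illustrated with a short sketch. This is not the CMS or SAM implementation: the `SamTestResult` record, its `ok` flag, the hourly binning and the site name `SITE_A` are all assumptions made for the example, and the record does not reflect the actual SAMDB schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SamTestResult:
    """Hypothetical record of one test execution (not the SAMDB schema)."""
    site: str
    test: str
    timestamp: datetime
    ok: bool


def availability_by_hour(results):
    """Return {(site, hour): fraction of successful tests in that hour}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        # Assumed binning: truncate each timestamp to the start of its hour.
        hour = r.timestamp.replace(minute=0, second=0, microsecond=0)
        key = (r.site, hour)
        total[key] += 1
        passed[key] += int(r.ok)
    return {key: passed[key] / total[key] for key in total}


if __name__ == "__main__":
    # Made-up sample data; test names are taken from the SRM test table.
    sample = [
        SamTestResult("SITE_A", "put", datetime(2007, 10, 1, 10, 5), True),
        SamTestResult("SITE_A", "get", datetime(2007, 10, 1, 10, 40), False),
        SamTestResult("SITE_A", "put", datetime(2007, 10, 1, 11, 5), True),
    ]
    for (site, hour), fraction in sorted(availability_by_hour(sample).items()):
        print(site, hour.isoformat(), f"{fraction:.2f}")
```

A coarser or finer time bin, or a restriction to tests flagged as critical, would change only the key construction and the filtering of `results`; the fraction itself stays the same ratio of passed to total tests.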