SAM architecture (EGEE 07): Service Availability Monitor for the LHC experiments
Simone Campana, Alessandro Di Girolamo, Nicolò Magini, Patricia Mendez Lorenzo, Vincenzo Miccio, Elisa Lanciotti, Roberto Santinelli and Andrea Sciabà

Presentation transcript:

Service Availability Monitor (SAM)
SAM is the EGEE framework developed to provide a global and uniform monitoring tool for Grid services. It runs periodic tests, organized in sensors, on all Grid services. Test results are:
- published in an internal database (SAMDB)
- exposed via a web interface
- processed to calculate availability metrics for site validation

SAM Alarm System
SAM offers an alarm system to announce failures to sites and VOs. The list of critical tests is defined through the FCR web interface. The notification system (e-mail and/or SMS), the frequency of reports and the contact persons are configurable on a site-by-site (and VO-by-VO) basis.

SAM Architecture
Input:
- Site information collection tools: static and dynamic information
- SAM submission framework: test submission, high-level execution workflow
Storage and processing:
- Web services: query/publishing, programmatic interface
- Oracle database: stores the test results, test descriptions, test criticality and alarms
Output:
- SAM display: SAM portal and GridView (availability graphs, historical test results, detailed test results)

Experiment monitoring with SAM
The Experiment Integration Support (EIS) team has been active since 2002 in the Worldwide LHC Computing Grid project. The EIS team helps the LHC experiments and other communities to run their activities on the Grid as effectively as possible. EIS activities include:
- contributing to the integration of the experiment computing frameworks with the Grid middleware
- interfacing user communities with the middleware developers and the WLCG infrastructure operations
- developing new user tools to implement functionalities missing from the Grid middleware
- testing middleware components as they become available
- directly participating in the experiment computing activities (data challenges, MC production, etc.)
- providing end-user documentation
The flexibility of the SAM framework makes it an excellent choice for any Virtual Organization to implement custom tests on existing service types, or even on experiment-specific services. The EIS team is strongly involved in the integration of the LHC experiment monitoring with SAM.

SAM implementation for the ALICE experiment
The ALICE production model requires a dedicated host (VOBOX) at each site to deploy and manage specific long-living agents and to install the ALICE-specific software. More than 60 ALICE VOBOXes are deployed all over the world! ALICE created a self-contained test suite verifying the correct behaviour of these nodes, based on:
- VOBOX services (proxy renewal and delegation)
- VOBOX client configuration
The SAM framework fulfilled the ALICE requirements on VOBOX monitoring:
- flexible definition and dynamic configuration of tests
- VO-based definition of the service endpoints to be monitored

SAM implementation for the ATLAS experiment
The ATLAS experiment is developing dedicated SAM tests to monitor the availability of critical site services, such as the CE and SE, and to verify the correct installation of the ATLAS software at each site.
- Endpoint definitions are contained in an ATLAS-specific configuration file (TiersOfATLAS).
- Different endpoints might need to be tested using different VOMS credentials.
ATLAS uses the SAM alarm system:
- if SE / SRM / CE tests fail, the site contact persons are alerted via the SAM Alarm System
- if Grid service tests (FTS, LFC, etc.) fail, alarms are sent to the service responsible and to the ATLAS dedicated services (Distributed Data Management, etc.) that use those services
ATLAS runs the standard Grid Operations team tests, but using ATLAS Grid credentials.
Under development:
- Storage Element endpoint: test direct access to the SE / SRM via native protocols
- Computing Element: test all the functionalities needed for production and analysis jobs, by sending to the WNs of the CEs a special job that:
  - checks the presence of the required version of the ATLAS software
  - compiles and executes a real analysis job based on a sample dataset
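To make the worker-node check above concrete, here is a minimal Python sketch of what such a payload could look like. The release number, the VO_ATLAS_SW_DIR variable, the directory layout and the exit-code convention are illustrative assumptions, not details taken from the poster; the real ATLAS sensors may differ.

```python
#!/usr/bin/env python3
# Hypothetical sketch of the special job run on a CE worker node.
# Release tag, software-area variable and exit-code convention are assumptions.

import os
import subprocess
import sys

REQUIRED_RELEASE = "13.0.40"   # example release tag, for illustration only


def software_present(sw_area: str, release: str) -> bool:
    """Check that the required software release is visible from the WN."""
    if not sw_area:
        print("ERROR: software area variable is not set on this WN")
        return False
    release_dir = os.path.join(sw_area, "software", release)
    if not os.path.isdir(release_dir):
        print(f"ERROR: release {release} not found under {sw_area}")
        return False
    print(f"OK: release {release} found")
    return True


def run_analysis_payload() -> bool:
    """Stand-in for compiling and running a real analysis job on a sample dataset."""
    result = subprocess.run([sys.executable, "-c", "print('analysis payload ran')"])
    return result.returncode == 0


if __name__ == "__main__":
    sw_area = os.environ.get("VO_ATLAS_SW_DIR", "")
    ok = software_present(sw_area, REQUIRED_RELEASE) and run_analysis_payload()
    # A wrapper publishing to the SAM DB would translate this exit code
    # into a test status (OK / ERROR).
    sys.exit(0 if ok else 1)
```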
SAM implementation for the LHCb experiment
LHCb uses the SAM framework for:
- checking the availability of Computing Elements
- detecting the operating system and architecture
- installing the appropriate versions of the LHCb software if a shared area is provided
The SAM test suite is run with the credentials of the LHCb Software Manager and is composed of several critical sensors for LHCb, grouped in a single task. Special agreements have been negotiated with the sites to ensure the highest priority for these jobs; CEs passing these critical sensors are eligible to run LHCb jobs. The sensor workflows are now constructed using the DIRAC API and submitted to the DIRAC WMS, where LHCb can monitor them. At the end of the execution on a worker node, the result is published from the WN to the SAM DB and the output is sent to a special Storage Element accessible via the web.
Figure: evolution of the usage of the SAM sensor, with 50% of the sites above the 80% availability mark.

CMS monitoring via SAM
CMS has adopted SAM as the system to implement Grid-wide monitoring of computing and storage elements. CMS contacts at the sites must ensure that the CMS tests run successfully. An interesting side effect of the choice of SAM for the CMS monitoring was a strong push for interoperability between EGEE and OSG: jobs are submitted by SAM to the LCG Resource Broker also for OSG sites.

CMS - Computing Element tests
Name           Checks
basic          CMS software area and CMS site local configuration
swinst         Presence of the required versions of CMSSW
Monte Carlo    Stage-out of a file from the WN to the local SE
Squid          Basic functionality of the closest Squid server
FroNtier

CMS - SRM tests
Name              Checks
get-pfn-from-tfc  Gets the LFN → PFN rule from a central CMS DB
put               Copies a test file into the SRM via srmcp
get-metadata      Gets metadata of the remote file
get               Copies back the remote file
advisory-delete   Removes the remote file

Site availability
The outcome of the CMS SAM tests is used to give a measurement of the CMS site availability, expressed as the fraction of successful tests as a function of time.
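As a rough illustration of this metric, the sketch below computes the fraction of successful test results per day from a list of (timestamp, status) pairs. The input layout, the "OK" status string and the daily binning are assumptions made for the example; they are not the actual SAM/GridView availability algorithm.

```python
# Minimal sketch of an availability metric: fraction of successful tests per day.
# The (timestamp, status) input format and the "OK" convention are assumptions.

from collections import defaultdict
from datetime import datetime


def daily_availability(results):
    """Map each day to the fraction of test results with status 'OK' on that day."""
    counts = defaultdict(lambda: [0, 0])          # day -> [passed, total]
    for timestamp, status in results:
        day = timestamp.date()
        counts[day][1] += 1
        if status == "OK":
            counts[day][0] += 1
    return {day: passed / total for day, (passed, total) in counts.items()}


# Example with made-up test results
sample = [
    (datetime(2007, 10, 1, 6, 0), "OK"),
    (datetime(2007, 10, 1, 18, 0), "ERROR"),
    (datetime(2007, 10, 2, 6, 0), "OK"),
]
print(daily_availability(sample))   # {2007-10-01: 0.5, 2007-10-02: 1.0}
```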