CMS SAM Testing
Andrea Sciabà
Grid Deployment Board, May 14, 2008
CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it

Presentation transcript:

Outline
- Description of the CMS SAM tests
  - CE
  - SRM
- Test criticality and availability calculation
  - Critical tests for WLCG
  - Critical tests for CMS
- Visualisation
  - SAM Dashboard
- Current and future applications
  - Site commissioning
  - Daily checks
- Conclusions

The CMS SAM tests
Goal
- Test the basic functionality of some Grid services
- Verify the correctness of the CMS software installation and site configuration
- Reproduce the operations performed by a typical Monte Carlo or analysis job
- Avoid “false alarms”
- Add tests as more things that can fail are discovered

Test submission
A “canonical” approach
- Private installation of the SAM client
  - SAM code is manually updated from time to time
  - Code of CMS tests is automatically updated
  - Running on the same UI as OPS, very soon moving to an 8-core CMS VOBOX to speed up test submission
- Grid credentials
  - /cms/Role=lcgadmin: used for most of the tests run in Grid jobs, to take advantage of the higher priority
  - /cms/Role=production: used for tests which simulate a MC production job
  - /cms: used for tests which must resemble an operation done by a generic user

Computing Element tests
As for OPS, these tests are run via a Grid job submitted via the EDG Resource Broker
- Need to move to the WMS, the RB is almost deprecated

Test name (role): meaning
- CE-sft-job (lcgadmin): Fails if the job aborts
- CE-cms-prod (production): Fails if the job aborts
- CE-cms-basic (lcgadmin): Checks CMS sw area, CMS site local configuration, Trivial File Catalogue
- CE-cms-swinst (lcgadmin): Checks correct installation of CMSSW, availability of required CMSSW versions
- CE-cms-squid (lcgadmin): Checks the local site configuration for a proxy tag and that the Squid server replies without errors
- CE-cms-frontier (lcgadmin): Using CMSSW, tries to download the ECAL pedestals from FroNtier and checks for errors
- CE-cms-mc (production): Like a MC job, tries to stage out a file to the local SRM as described in the local site config (srmcp, rfio, etc.)
- CE-cms-analysis (lcgadmin): Using CMSSW, tries to read 10 events from a random file from a given dataset and checks for errors

SRM v1 tests
- Try to copy a file between the SAM UI and the remote SRM
- Use srmcp (dCache client)
- LFN: /store/unmerged/SAM/testSRM
- PFN: built from the Trivial File Catalogue (as done by PhEDEx)

Test name (role): meaning
- SRM-v1-get-pfn-from-tfc (production): Looks in the PhEDEx database for the LFN-to-PFN matching according to the TFC rules for the site
- SRM-v1-put (production): srmcp file://...
- SRM-v1-get-metadata (production): Checks remote file size and checksum (if supported)
- SRM-v1-get (production): srmcp file://... then diff
- SRM-v1-advisory-delete (production): srm-advisory-delete
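The LFN-to-PFN step used by the get-pfn-from-tfc test can be illustrated with a minimal Python sketch. It assumes that a TFC rule consists of a protocol, a regular expression matched against the LFN, and a result template in which $1 stands for the first captured group. The rule, host name and storage path below are hypothetical and do not describe any real site.

```python
import re

# Hypothetical TFC-style rules: (protocol, path-match regex, result template).
# The host name and storage path are made up for illustration only.
RULES = [
    ("srm", r"/store/(.*)",
     "srm://se.example.org:8443/pnfs/example.org/cms/store/$1"),
]


def lfn_to_pfn(lfn, protocol, rules=RULES):
    """Return the PFN from the first rule matching protocol and LFN, else None."""
    for rule_protocol, path_match, result in rules:
        if rule_protocol != protocol:
            continue
        match = re.fullmatch(path_match, lfn)
        if match is None:
            continue
        pfn = result
        # Substitute $1, $2, ... with the groups captured from the LFN.
        for index, group in enumerate(match.groups(), start=1):
            pfn = pfn.replace(f"${index}", group)
        return pfn
    return None


# The LFN is the one quoted on the slide; the resulting PFN is illustrative.
pfn = lfn_to_pfn("/store/unmerged/SAM/testSRM", "srm")
```

A test of this kind would fail when no rule matches, which the sketch signals by returning None.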

SRM v2 tests
- Use lcg-util commands (lcg-cp, lcg-del, lcg-ls)
  1) SrmPrepareToPut + gridftp transfer + SrmPutDone
  2) SrmPrepareToGet + gridftp
  3) SrmRm
  4) SrmLs
- Space tokens
  - Only CMS_DEFAULT is tested, but it is not required to work (so far)
- VO independent
  - The test code can be reused by any VO

Test name (role): meaning
- SRMv2-get-pfn-from-tfc (production): Looks in the PhEDEx database for the LFN-to-PFN matching according to the TFC rules for the site
- SRMv2-lcg-cp (production): Copies forth and back and deletes a file (1+2+3)
- SRMv2-lcg-ls (production): As lcg-cp + tries to list the remote file ( )
- SRMv2-lcg-ls-dir (production): Lists the directory with the remote file
- SRMv2-lcg-gt (production): As lcg-cp + tries to get a gsiftp TURL for the remote file
- SRMv2-lcg-gt-rm-gt (production): As lcg-gt + tries to get again a gsiftp TURL after file deletion to verify it was successful
- SRMv2-user-: As lcg-cp but tries to write under the logical path /store/user/test (/store/user for user data)
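The "copy forth and back, then delete" sequence can be sketched in Python as below. This is only an illustration of the control flow: the FakeStorage class is a hypothetical in-memory stand-in for the remote storage, and no real lcg-util command is invoked. The final existence check loosely mirrors the idea of verifying that the deletion was successful.

```python
class FakeStorage:
    """Hypothetical in-memory stand-in for a remote storage element."""

    def __init__(self):
        self.files = {}

    def put(self, path, data):
        self.files[path] = data

    def get(self, path):
        return self.files[path]

    def remove(self, path):
        del self.files[path]

    def exists(self, path):
        return path in self.files


def copy_test(storage, path, payload=b"SAM test file"):
    """Upload, download, compare, delete; return (passed, message)."""
    try:
        storage.put(path, payload)        # copy forth
        returned = storage.get(path)      # copy back
        if returned != payload:
            return False, "downloaded copy differs from the original"
        storage.remove(path)              # delete the remote file
        if storage.exists(path):          # verify the deletion
            return False, "file still present after deletion"
    except Exception as error:            # any failing step fails the test
        return False, f"{type(error).__name__}: {error}"
    return True, "ok"


result = copy_test(FakeStorage(), "/store/unmerged/SAM/testSRM")
```

Keeping the storage operations behind a small interface is one way to reuse the same test logic for different back ends, in the spirit of the "VO independent" remark.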

Test criticality
Test criticality defined in two contexts
- WLCG
  - Set in FCR, determines availability/reliability in GridView
  - Only tests whose failure is a middleware/fabric problem
    - Job submission failures, SRM problems...
- CMS
  - Set and taken into account in the SAM dashboard
  - Also tests specifically related to CMS
    - CMSSW installation, FroNtier, etc.
- The algorithms are very similar

Critical tests

WLCG critical tests
Test name (run by)
- Computing Element
  - CE-sft-job (CMS)
  - CE-sft-caver (OPS)
- SRMv2
  - SRMv2-lcg-cp (CMS)

CMS critical tests
Test name (run by)
- Computing Element
  - CE-sft-job (CMS)
  - CE-cms-prod (CMS)
  - CE-cms-basic (CMS)
  - CE-cms-swinst (CMS)
  - CE-cms-squid (CMS)
  - CE-cms-frontier (CMS)
  - CE-cms-mc (CMS)
  - CE-cms-analysis (CMS)
- SRMv2
  - SRMv2-get-pfn-from-tfc (CMS)
  - SRMv2-lcg-cp (CMS)
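The effect of the two critical-test sets can be illustrated with a short Python sketch. The combination rule is an assumption, not something stated in the slides: a service instance counts as OK only if all its critical tests pass (a missing result counts as a failure), a service type is OK if at least one instance is OK, and a site is OK if every service type is OK. The result values "ok" and "error" are also hypothetical.

```python
# Critical test sets as listed on the slide.
WLCG_CRITICAL = {
    "CE": {"CE-sft-job", "CE-sft-caver"},
    "SRMv2": {"SRMv2-lcg-cp"},
}
CMS_CRITICAL = {
    "CE": {"CE-sft-job", "CE-cms-prod", "CE-cms-basic", "CE-cms-swinst",
           "CE-cms-squid", "CE-cms-frontier", "CE-cms-mc", "CE-cms-analysis"},
    "SRMv2": {"SRMv2-get-pfn-from-tfc", "SRMv2-lcg-cp"},
}


def instance_ok(results, critical_tests):
    """Assumed rule: an instance is OK if every critical test reported 'ok'."""
    return all(results.get(test) == "ok" for test in critical_tests)


def site_ok(site_results, critical):
    """Assumed rule: each service type needs at least one OK instance."""
    for service_type, tests in critical.items():
        instances = site_results.get(service_type, [])
        if not any(instance_ok(results, tests) for results in instances):
            return False
    return True


# Hypothetical site: everything passes except the CMSSW installation check.
ce = {test: "ok" for test in CMS_CRITICAL["CE"] | {"CE-sft-caver"}}
ce["CE-cms-swinst"] = "error"
srm = {"SRMv2-get-pfn-from-tfc": "ok", "SRMv2-lcg-cp": "ok"}
site = {"CE": [ce], "SRMv2": [srm]}

wlcg_view = site_ok(site, WLCG_CRITICAL)  # CMS-specific failure is ignored
cms_view = site_ok(site, CMS_CRITICAL)    # CMS-specific failure counts
```

With these inputs the same site looks fine in the WLCG context and bad in the CMS context, which is the practical difference between the two lists.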

Development
- The test development is decentralized
  - Every test is maintained by somebody who is an “expert” in the area
    - Software installation, FroNtier, SRM, MC production, etc.
  - All tests are thoroughly documented
- One coordinator to decide on test criticality, needed improvements, etc.
- Close contact with the Dashboard team for the visualisation part

Visualisation
- The Dashboard provides all that is needed to examine the output of the SAM tests
- Page developed following CMS requirements, soon to be adopted also by ATLAS

Latest results

Last 48 hours

Test output

Site availability

Ranking by site availability

Service availability

Test history
- Clickable to go to the test output

Applications
- What are the SAM tests used for?
  - To see if something is not working
  - To measure the site availability
  - To rank the sites by availability
- Site commissioning
  - A new activity in CMS to determine if a site is “usable” or not
  - SAM test results are among the different sources of information to rate a site
  - Commissioning criteria still to be agreed, but for sure a site which looks “bad” in SAM will not be used for any “real” work (MC generation, user analysis)
  - Exception: Tier-1 sites will never be “decommissioned”
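Measuring and ranking by availability can be sketched in Python as follows. The sketch assumes that availability is the fraction of periodic samples in which a site was OK. The site names and sample values are hypothetical, and this is not the Dashboard's actual calculation.

```python
def availability(samples):
    """Fraction of samples in which the site was OK (0.0 if no samples)."""
    return sum(samples) / len(samples) if samples else 0.0


def rank_sites(history):
    """Return (site, availability) pairs, best first, ties broken by name."""
    scores = {site: availability(samples) for site, samples in history.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sites, each with four samples (True means the site was OK).
history = {
    "T2_Example_A": [True, True, False, True],
    "T2_Example_B": [True, True, True, True],
    "T1_Example_C": [False, True, False, False],
}
ranking = rank_sites(history)
```

A commissioning decision could then compare each score with an agreed threshold; the slides leave those criteria open.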

Operations (I)
Who should look at the SAM tests?
- The sites! (typically the CMS site contact)
  - It takes just a glance to see if a single site has problems
  - In case there are, action can be taken immediately
- “Backup” solution
  - A small (~6) team of people who daily look at ~1/6 of the CMS sites each and act on errors according to a checklist
    1) Look for errors in the CMS SAM tests
    2) If any, do one’s best to troubleshoot (a “knowledge base” is regularly updated)
    3) Inform the site via a Savannah ticket addressed to the local CMS site contact (as from the CMS SiteDB)
       - File also a GGUS ticket if it is a Grid problem in EGEE
    4) Follow up on previously opened tickets
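The split of the sites among the members of the backup team can be sketched in Python as a simple round-robin assignment. How the sites were really divided is not stated in the slides, so the scheme, the site names and the person names below are all assumptions.

```python
def assign_sites(sites, people):
    """Distribute the sites round-robin so each person gets a similar share."""
    assignment = {person: [] for person in people}
    for index, site in enumerate(sorted(sites)):
        person = people[index % len(people)]
        assignment[person].append(site)
    return assignment


# Hypothetical input: 12 sites shared among 6 people.
sites = [f"T2_Example_{number:02d}" for number in range(12)]
people = ["person1", "person2", "person3", "person4", "person5", "person6"]
assignment = assign_sites(sites, people)
```

Sorting the site names first keeps the assignment stable from one day to the next, so that the same person follows up on the tickets opened earlier.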

“SAM” Savannah

Latest 24 hours

Operations (II)
- Results of the backup solution
  - Significant improvement when the exercise started (more pronounced for Tier-1 sites)
  - Reached a “plateau” far from being satisfactory
- Alarms?
  - It is possible for a site to get alarms if it so desires
    - Only one site did it, Caltech, using the Nagios plugin developed by the WLCG Grid Services Monitoring Working Group
- Conclusions
  - Significant effort required (it should really be just a “backup”)
  - Cannot go beyond a certain level
  - A more proactive attitude from the sites is needed
  - This will probably happen when sites that look bad in SAM are no longer used

Tier-1 sites: before and after

Tier-2 sites: before and after

Conclusions
- CMS has a well developed SAM setup
  - Many use cases covered, still expanding
  - OSG and EGEE sites equally covered, ARC sites (Helsinki) soon to be added
- SAM test results should be checked both by sites (essential) and possibly also centrally (as a backup)
- SAM test results, to be useful at all, must be considered in deciding whether to run on a site