WLCG Operations and Tools TEG Status Report
Maria Girone and Jeff Templon
GDB, 14 December 2011

Introduction
Five areas of work identified:
– WG1: Monitoring
– WG2: Support Tools, Underlying Services and WLCG Operations
– WG3: Application Software Management
– WG4: Operational Requirements on Middleware
– WG5: Middleware Configuration, Deployment and Distribution

Organization
– Monitoring and Metrics: Simone Campana and Pepe Flix
  Costin Grigoras, Andrea Sciabà, Alessandro Di Girolamo; (sites) Ian Collier, David Collados, Marian Babik, Xavier Espinal, Vera Hansper, Alexandre Lossent, Ian Fisk, Alessandra Forti, Gonzalo Merino
– Support Tools + Underlying Services + WLCG Operations: Andrea Sciabà, Maria Dimou (support tools), Lionel Cons (underlying services) and Stefan Roiser (WLCG Operations)
  Simone Campana, Andrea Sciabà, Alessandro Di Girolamo, Joel Closier, Pepe Flix, John Gordon, Alison Packer; (sites) Alexandre Lossent, Xavier Espinal, Nils Hoimyr, Alessandra Forti, Tiziana Ferrari
– Application Software Management: Stefan Roiser
  Marco Cattaneo, Steve Traylen; (sites) Ian Collier, Alexandre Lossent, Alessandra Forti
– Operational Requirements on Middleware: Maarten Litmaath, Tiziana Ferrari (EGI)
  Maria Dimou, Laurence Field (EMI), Doina Cristina Aiftimiei (EMI); (sites) Alexandre Lossent, Jeff Templon, Vera Hansper, Anthony Tiradani, Paolo Veronesi
– Middleware Configuration, Deployment and Distribution: Oliver Keeble, Rob Quick
  Cristina Aiftimiei, Simone Campana; (sites) Ian Collier, Alexandre Lossent, Pablo Fernandez, Felix Lee

Status
Six weeks since kick-off, dedicated to assessing the current situation and issues
– All working groups have received input from sites, experiments and infrastructure providers
Now finalizing the report on the current status:
– Technology and Tools
– Procedures and Policies
– Described Areas of Improvement
  Identify tools or procedures that are out of date or should be rethought
  Identify the largest use of operational effort or areas of potential efficiency gains
– Missing Areas
The current situation has evolved over a number of years and has seen quite a few changes
– Important to document and review inconsistencies and duplication, and to address issues
Workshop held on 12th December to:
– Identify common patterns among the issues raised
– Identify possible strategies
– Prioritize suggested changes

Workshop Agenda (1/2)

Workshop Agenda (2/2)

Priorities from the MB
From the discussions at the last MB:
– "Operations TEG to provide a clear strategy on Availability Reporting which takes account of the needs of all the stakeholders (at least experiments, sites and funding agencies) and reduces the duplication which currently exists"
– "The Experiments agreed to continue their contributions in the Operations TEG who will provide an update at the GDB meeting on 14 Dec 2011"

WG1: Monitoring

Monitoring: Technologies and Tools
Experiment Activity Monitoring
– ALICE and LHCb: complete in-house monitoring
  Workload Management, Data Management, Service Monitoring
  Common framework and common library for all use cases
– ATLAS and CMS: largely relying on Experiment Dashboards
  Based on a common framework
  Nevertheless, several ad-hoc monitoring systems (CMS PhEDEx monitor, CMS Site Readiness, ATLAS Panda monitors)
Site Fabric Monitoring
– All sites (obviously) have instrumented fabric monitoring
  Crashing daemons, black-hole WNs, etc.
  Popular tools: Nagios, Ganglia, Lemon
– No need to converge on one system (unrealistic)
  Make sure the middleware services come with generic probes
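Generic probes for middleware services typically follow the Nagios plugin convention: print a single status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). A minimal sketch of such a probe, with a purely hypothetical endpoint and the simplest possible check, could look like this:

```python
#!/usr/bin/env python
"""Minimal service probe following the Nagios plugin convention.

The endpoint and the check itself are hypothetical placeholders; a real
middleware probe would exercise a service-specific operation.
"""
import socket
import sys

# Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

HOST, PORT, TIMEOUT = "ce.example.org", 8443, 10  # hypothetical service endpoint


def main():
    try:
        # Simplest possible liveness check: can we open a TCP connection?
        sock = socket.create_connection((HOST, PORT), timeout=TIMEOUT)
        sock.close()
    except socket.timeout:
        print("WARNING: %s:%d did not answer within %ds" % (HOST, PORT, TIMEOUT))
        return WARNING
    except socket.error as exc:
        print("CRITICAL: cannot connect to %s:%d (%s)" % (HOST, PORT, exc))
        return CRITICAL
    print("OK: %s:%d is reachable" % (HOST, PORT))
    return OK


if __name__ == "__main__":
    sys.exit(main())
```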

Monitoring: Technologies and Tools (cont.)
SAM is used by all four experiments to monitor services at sites
– Very modular framework: Nagios probes (NGIs and experiments) launch tests and publish results in the Messaging System (see the sketch below)
  Results are stored in the central DB as well as in the NGIs' local DBs
  ACE component: calculates availabilities
– SAM allows the definition of profiles (lists of metrics)
  Very useful to provide views to different communities
– SAM test results can be fetched from messaging and injected into local monitoring
HammerCloud is used by ATLAS, CMS and LHCb for site (stress) testing and for monitoring site performance
– Data Processing (Production and Analysis)
– Sites and experiments like it
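As a hedged illustration of the "publish results in the Messaging System" step, the sketch below sends a test result to a STOMP broker with the stomp.py client; the broker address, topic name, credentials and message fields are assumptions made for illustration, not the actual SAM/MSG schema.

```python
# Sketch only: broker host, destination and message fields are hypothetical,
# not the real SAM/MSG schema. Requires the stomp.py package.
import json
import time

import stomp

BROKER = ("msg-broker.example.org", 61613)      # hypothetical broker
DESTINATION = "/topic/grid.probe.metricOutput"  # hypothetical topic

result = {
    "serviceURI": "ce.example.org",               # tested service instance
    "metricName": "org.example.CE-JobSubmit",     # hypothetical metric name
    "metricStatus": "OK",
    "timestamp": int(time.time()),
    "details": "job submission succeeded",
}

conn = stomp.Connection(host_and_ports=[BROKER])
conn.connect("user", "password", wait=True)       # credentials are placeholders
conn.send(destination=DESTINATION, body=json.dumps(result))
conn.disconnect()
```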

Site Monitoring
The Site Status Board (SSB) is used by ATLAS and CMS for site monitoring
– Visualizes arbitrary metrics for a list of sites (highly configurable)
– Filtering/sorting + visualization
– Offers a programmatic interface to expose current and historical values (sketched below)
Some experiments integrate the SSB with ad-hoc monitoring tools
– For example the CMS Site Readiness
SiteView: tool for aggregating the information from VO-specific monitoring systems and displaying a summary in one view
– Still in production, but would need to be validated
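A hedged sketch of what consuming such a programmatic interface looks like from a site-side script; the URL, query parameters and response layout below are hypothetical placeholders, not the real Dashboard API.

```python
# Sketch of pulling a metric from an SSB-like programmatic interface.
# The URL, query parameters and response layout are hypothetical placeholders.
import json
import urllib.request

SSB_URL = ("https://dashboard.example.org/ssb/request.py/getdata"
           "?metric=SiteReadiness&site=T1_EXAMPLE&hours=24")  # hypothetical

with urllib.request.urlopen(SSB_URL) as response:
    data = json.load(response)

for entry in data:              # assumed layout: list of {site, value, time}
    print(entry["site"], entry["value"], entry["time"])
```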

SSB and SiteView

Site Monitoring: Proposal
We miss the equivalent of today's SSB experiment views tailored for sites
Proposal: use the SSB framework to provide this functionality as well
– Advantages: many metrics are already in the SSB for ATLAS and CMS
  No duplication of effort nor issues of consistency
– Need to agree on a few common metrics between experiments
  Relevant from a site perspective
– Some development needed in the SSB to facilitate the visualization
– Some commitment needed from experiments and sites
  Experiments (support): define and inject metrics, validation
  Sites: validation, feedback

Monitoring: Areas of Improvement
Network Monitoring
– Network issues are difficult to spot and need a lot of operational effort
– Suggest deployment of perfSONAR (PS/MDM) at all WLCG sites
– Complementary to the new FTS monitoring
Need for monitoring coordination at the WLCG level

Site Availability: OPS and VO tests
Standard tests
– Developed by EGI and bundled with the official SAM probes
OPS tests
– Tests run with an OPS proxy; they are all standard tests
Experiment custom tests
– Custom tests developed by an experiment and run with an experiment proxy
  Submitted by a Nagios box, or
  Submitted by other test systems and published in SAM
Critical tests
– Tests whose result is used to determine the state of a service instance (and hence the service and site availability; see the roll-up sketch below)
Other quality metrics outside the SAM framework
– E.g. quality of data transfers, success rate of typical jobs (HammerCloud, Job Robot, DIRAC), etc.
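To make the role of critical tests concrete, here is a simplified sketch of the kind of roll-up used for availability: a service instance is OK only if all its critical tests pass, a service is OK if at least one instance is OK, and the site is OK if all its services are OK. This is an illustration, not the exact GridView/ACE algorithm; the names and results are made up.

```python
# Simplified sketch of an availability roll-up driven by critical tests.
# Illustration only, not the exact GridView/ACE algorithm.

# Latest results per service instance: {instance: {test_name: status}}
results = {
    "ce01.example.org": {"JobSubmit": "OK", "CertLifetime": "OK"},
    "ce02.example.org": {"JobSubmit": "CRITICAL", "CertLifetime": "OK"},
    "srm.example.org":  {"FileCopy": "OK", "LsDir": "WARNING"},
}

# Which tests are critical for each service flavour (the "profile").
critical = {
    "CE":  {"JobSubmit"},
    "SRM": {"FileCopy"},
}

# Which instances implement which flavour at this site.
site_services = {
    "CE":  ["ce01.example.org", "ce02.example.org"],
    "SRM": ["srm.example.org"],
}


def instance_ok(instance, flavour):
    """An instance is OK only if every critical test for its flavour is OK."""
    return all(results[instance].get(t) == "OK" for t in critical[flavour])


def service_ok(flavour):
    """A service is OK if at least one of its instances is OK."""
    return any(instance_ok(i, flavour) for i in site_services[flavour])


site_ok = all(service_ok(f) for f in site_services)      # site: all services OK
print("site status:", "OK" if site_ok else "DEGRADED")   # -> OK (ce01 covers CE)
```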

Availability: what we have and use today
GridView availability
– Used primarily by the WLCG management
– Available for all the WLCG VOs and OPS
Dashboard availability
– Calculated by a Dashboard application (but the algorithm is the same)
– Used by the experiment operations teams
– May use different critical tests
ACE availability
– Not yet in production for the experiments (pending validation), but it is for OPS
– Will soon calculate all availabilities

Schematic view: Today
[Diagram: matrix of SAM tests (standard and experiment custom) and external metrics for Ops, ALICE, ATLAS, CMS and LHCb; the legend distinguishes critical standard tests, critical custom tests submitted by Nagios, and non-critical external tests injected into SAM]

Current Issues
OPS tests
– Sites are very familiar with them and look at them
– Not sufficient for the experiments, as they may test different services than those the experiments use, and test results may depend on the VO of the proxy
GridView availabilities are too good
– Too few tests are critical!
– Largely uncorrelated with the "real" usability
– Included in a monthly report to the WLCG
On the other hand, VO custom tests may be obscure for sites
There are site quality metrics that are not measured by SAM
– E.g. CMS Job Robot, HammerCloud, data transfer quality, etc.

Proposal
Experiments extend their SAM tests to test more site-specific functionality
– Any new test contributing to the availability is properly agreed upon with sites and documented
The SAM framework is extended to properly support external metrics such as those from Panda, DIRAC, ...
The resulting availability will have these properties:
– Takes into account more relevant site functionality
– Is as independent as possible from experiment-side issues
– Is well understood by the sites
Additional experiment-side metrics are nevertheless used by VO computing operations and published, e.g. via the SSB

Schematic view: Proposal
[Diagram: the same matrix of SAM tests and external metrics for Ops, ALICE, ATLAS, CMS and LHCb, updated according to the proposal above]

WG2: Support Tools, Underlying Services, WLCG Operations

Support tools: status, issues and outlook
Tools are generally good and mature enough, well proven in operations
Some issues were identified:
– Improve the reliability of GGUS as a service and of the interfaces with other ticketing systems
– Improve the accuracy and the completeness of accounting data (especially CPU scaling factors; see the illustration below)
– Provide a way to publish VO-specific information (e.g. downtimes) in GOCDB (or elsewhere)
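The CPU scaling factor issue is essentially about normalizing raw CPU and wall-clock time with a per-host benchmark figure (e.g. HEP-SPEC06) before the work is accounted; a small illustration with made-up numbers:

```python
# Illustration of CPU-time normalization with a benchmark scaling factor.
# Numbers are made up; the point is that a wrong or stale per-host factor
# directly distorts the accounted (normalized) CPU time.
raw_cpu_hours = 1000.0   # raw CPU time reported by the batch system
hs06_per_core = 9.5      # published HEP-SPEC06 figure for the worker node core

normalized_cpu = raw_cpu_hours * hs06_per_core    # in HS06-hours
print("%.0f HS06-hours" % normalized_cpu)         # 9500 HS06-hours

# If the site publishes a stale factor (say 7.0), the same work is
# under-accounted by roughly 26%:
print("%.0f HS06-hours" % (raw_cpu_hours * 7.0))  # 7000 HS06-hours
```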

Underlying services: status, issues and outlook
No critical issues here, but significant improvements desired:
– Improve the security, scalability and reliability of the messaging service (MSG)
– Improve the stability of the information system as well as the validity and accuracy of the information
  Suggestion from the developers to achieve this with a redesign of the system that separates static from dynamic data (sketched below)
  Better information validation tools are under way
– Ensure proper support for batch systems and their integration with the middleware
– Investigate new solutions to address current scalability and support shortcomings in some systems (Torque/Maui, SGE) and evaluate potential candidates (Condor, SLURM), in particular for EGI sites
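One way to picture the suggested redesign is to keep slowly changing service descriptions apart from frequently refreshed state, so that a hiccup in the dynamic feed never invalidates the static part. A generic sketch of that split, with purely illustrative names and fields (not a GLUE schema proposal):

```python
# Illustrative sketch of separating static service descriptions from
# dynamic state; names and fields are made up, not a GLUE schema proposal.
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ServiceRecord:
    """Static: changes only when the site reconfigures the service."""
    endpoint: str
    flavour: str          # e.g. "CE", "SRM"
    capabilities: tuple


@dataclass
class ServiceState:
    """Dynamic: refreshed every few minutes, may be missing or stale."""
    waiting_jobs: int
    running_jobs: int
    updated: float = field(default_factory=time.time)

    def is_stale(self, max_age=900):
        return time.time() - self.updated > max_age


static = {"ce01.example.org": ServiceRecord("ce01.example.org", "CE", ("SL6",))}
dynamic = {"ce01.example.org": ServiceState(waiting_jobs=120, running_jobs=800)}

# A consumer can always rely on the static record, and falls back gracefully
# when the dynamic state is stale or absent.
for endpoint, record in static.items():
    state = dynamic.get(endpoint)
    if state is None or state.is_stale():
        print(endpoint, record.flavour, "(no fresh dynamic data)")
    else:
        print(endpoint, record.flavour, "waiting:", state.waiting_jobs)
```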

WLCG operations and procedures: status, issues and outlook
Here the focus is on making experiment-site interactions work better with fewer people:
– Establish a proper channel with the Tier-2 sites
  Communication is now mainly delegated to the experiments
– Strengthen central WLCG operations
– Reduce the need for experiment contact persons while preserving or improving the quality of the link between experiments and sites

WG3: Application Software Management

Application Software Management: Status and Issues
– Software Management
  The LCG Application Area is the established body to coordinate software management between experiments
– Configuration Management
  Many individually developed configuration and build tools
  One candidate could be CMake (used by ALICE, being considered by LHCb)
  Heavy load on the sites' shared software area during runtime environment setup is still an issue for the experiments
– Deployment Management
  Need to agree on one common software deployment tool and process between all experiments
  ATLAS and LHCb use CVMFS, but not at all sites, so they need to maintain an additional deployment system
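As a hedged illustration of the cost of maintaining two deployment systems: a job wrapper has to probe whether the experiment repository is mounted under /cvmfs and otherwise fall back to the site shared software area. The repository path and environment variable name below are indicative only, not an agreed convention.

```python
# Sketch of the dual deployment logic a job wrapper needs while CVMFS is
# not available at every site; the repository path and environment variable
# name are indicative placeholders.
import os

CVMFS_REPO = "/cvmfs/experiment.example.cern.ch"   # hypothetical repository
SHARED_AREA_VAR = "VO_EXPERIMENT_SW_DIR"           # hypothetical variable name


def software_root():
    """Prefer CVMFS when the repository is mounted, else the shared area."""
    if os.path.isdir(CVMFS_REPO):
        return CVMFS_REPO
    shared = os.environ.get(SHARED_AREA_VAR)
    if shared and os.path.isdir(shared):
        # Site shared software area (NFS/AFS), the source of the heavy
        # load during runtime environment setup mentioned above.
        return shared
    raise RuntimeError("no software installation found on this worker node")


print("using software from", software_root())
```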

WG4: Operational Requirements on Middleware; WG5: Middleware Configuration, Deployment and Distribution

Operational Requirements on Middleware
Requirements on services and clients:
– Robustness
  Client retries and failover (see the sketch below)
  Service load balancing
  Design based on long-term scalability and sustainability
– Simple configuration
  E.g. reasonable defaults
– Monitoring
  Error messages: unambiguous and documented, both for client and server
  Logging: proper verbosity depending on the use case
– Proper documentation
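The "client retries and failover" requirement boils down to trying a list of equivalent endpoints a bounded number of times with a back-off between rounds; a minimal, generic sketch (the endpoints and the operation are placeholders, not a specific middleware client):

```python
# Minimal retry-with-failover sketch illustrating the robustness requirement.
# Endpoints and the operation are placeholders, not a specific middleware client.
import time

ENDPOINTS = ["https://svc1.example.org", "https://svc2.example.org"]  # hypothetical


def call_with_failover(operation, endpoints=ENDPOINTS, retries=3, backoff=2.0):
    """Try each endpoint up to `retries` times, backing off between rounds."""
    last_error = None
    for attempt in range(retries):
        for endpoint in endpoints:
            try:
                return operation(endpoint)
            except Exception as exc:           # a real client would catch
                last_error = exc               # specific, documented errors
        if attempt < retries - 1:
            time.sleep(backoff * (attempt + 1))   # linear back-off
    raise RuntimeError("all endpoints failed: %s" % last_error)


# Usage: pass any callable that talks to a single endpoint.
# result = call_with_failover(lambda url: my_client.query(url))
```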

EGI Requirement Gathering Process
EGI collects operational requirements on ARC, dCache, gLite, UNICORE and Globus
– Requirements can be submitted through the EGI Request Tracker system
The same process applies to the tools maintained by EGI, namely: accounting, accounting portal, GGUS, GOCDB, Operations portal and dashboard, SAM
Middleware requirements are collected, discussed and prioritized at the Operations Management Board (OMB)
Operational tool requirements are collected, discussed and prioritized at the Operational Tool Advisory Group (OTAG)

Middleware Configuration, Deployment and Distribution
Configuration
– Improved documentation, packaging, simplification (e.g. files in standard locations, ...)
– Ongoing discussions on YAIM, Puppet, Quattor
Deployment
– Pre-release validations and pilots (need commitment from sites, experiments, developers)
– Alternative client distribution strategies (e.g. CVMFS)
Distribution
– Improve the responsiveness of the release process
– Resolve repository proliferation: EMI, UMD (EGI), gLite, OSG, RC, dCache, EPEL, ...
– Maintain application area releases

Next Steps
Prepare the table of contents of the final report
F2F meeting in Amsterdam to finalize the proposals on strategy and review the draft document
– In conjunction with the DM and Storage TEG ones
Deliver the document on schedule in February
Some decisions can be made internally to WLCG (priorities, resources)
Others will require negotiation with external bodies, e.g. EGI

Summary
Very active participation from service and infrastructure providers, experiments and (some) sites: many thanks to all!
Well advanced with:
– Assessment of the current situation
– Identification of problem areas
– Discussions on proposals to address these, just started at this week's workshop
Need to understand timelines:
– Things that can be addressed for the 2012 run (few)
– LS1: extensive analysis and reprocessing
– Post-LS1: a new phase in WLCG

Additional information
– Mailing list:
– Twiki: GTEGOperations
– Indico agenda pages: egId=3770