Report from the WLCG Operations and Tools TEG. Maria Girone / CERN & Jeff Templon / NIKHEF. TEG Workshop, 7th February 2012.


Operations & Tools TEG

- Composition: about 40 people overall
  - Experiments (~15), sites (~20), “WLCG operations”, plus representatives from EMI, EGI-InSPIRE, OSG
  - Full list on Twiki
  - Good representation: UK, ES, US, NL, CH, CERN, EGI, EMI
  - Not all regions / tiers equally represented
- Regular meetings (6), including 2 workshops
  - CERN & NIKHEF
- Organized from beginning in 5 working groups

Working Groups

As part of the work of the TEG, subgroups were formed on the various topics within the overall charge:

- WG1: Monitoring and Metrics (Simone Campana, Pepe Flix)
- WG2: Support Tools, Underlying Services and WLCG Operations (Andrea Sciabà, Maria Dimou, Lionel Cons, Stefan Roiser)
- WG3: Application Software Management (Stefan Roiser)
- WG4: Operational Requirements on Middleware (Maarten Litmaath, Tiziana Ferrari)
- WG5: Middleware Configuration, Deployment and Distribution (Oliver Keeble, Rob Quick)

Each WG produced a detailed report, included in the overall document.

Methodology

- Detailed assessments of the current status
  - Which things work well;
  - What things are missing;
  - Which things are operationally intensive;
  - Areas of potential savings in operational effort and/or improvements in site reliability and efficiency
- Generally acknowledged that things work but neither optimally nor sustainably
  - Need to revisit given our concrete experience and adapt to future needs
  - We must assume that resources will be less but service demands will grow
- All recommendations and observations indicated the impact, effort and timeline involved
- January workshop defined global priorities across all WGs based on these criteria
  - Only those of high impact will be presented; all are in the document

Where we would like to be

- A small number of well-defined common services would be needed per site;
- Installing, configuring and upgrading these would be “trivial”;
- All services would comply with standards, e.g. for error messages and monitoring;
- Services would be resilient to glitches and highly available;
- In case of load (or unexpected “user behaviour”) they would react gracefully;
- In case of problems, diagnosis and remedy should be straightforward and rapid.

These were not necessarily the agreed goals at the design and implementation stage. How closely can we approach them retroactively?

Guiding Principles

1. Reduce operations effort
2. Reduce complexity
3. Minimize dependencies between sites and services (reduce reliance on actions of others)
4. Reduce effort to upgrade and reconfigure
5. Improve access to information
6. Improve reaction to service/hardware failures
7. Deploy scalable services (able to handle up to 2-3 times the average load)

Global Recommendations

#    Title                          Area                  Timeline
R1   WLCG Service Coordination      Operations            From 2012
R2   WLCG Service Commissioning     Operations            From 2012
R3   WLCG Availability Monitoring   Monitoring            2012
R4   WLCG Site Monitoring           Monitoring            2012
R5   WLCG Network Monitoring        Monitoring            LS1
R6   Software deployment            S/W                   2012/LS1
R7   Information System (WM TEG)    Underlying Services   2012/LS1
R8   Middleware Services            M/W                   2012/LS1
R9   Middleware Deployment          M/W                   2012/LS1

Recommendations: Monitoring

- R3: WLCG Availability Monitoring: streamline availability calculation and visualization
  - Converge on one system for availability calculation and for visualization
  - Review/add critical tests for VO availability calculation to better match site usability
    - Expose usability also in regular reports (monthly, MB)
- R4: WLCG Site Monitoring: deploy a common multi-VO tool to be used by sites to locally display the site performance
  - Sites and experiments should agree on a few common metrics between experiments, relevant from a site perspective
- R5: WLCG Network Monitoring: deploy a WLCG-wide and experiment-independent monitoring system for network connectivity

Extensively covered in the TEG Status Report at the 14th December GDB.

Software Deployment

- R6: Software deployment
  - Adopt CVMFS for use as shared software area at all WLCG sites (Tier-1 and Tier-2)
  - Deploy a robust and redundant infrastructure for CVMFS
    - Complete the deployment and test the implemented resilience

Information System

- R7: Information System (consistent with the recommendations of the GDB from June 2011)
  - Short term: improve the Information System via full deployment of the cached BDII and a strengthening of information validation (for instance via Nagios probes)
  - Long term: split the information into optimized tools focused on providing structural data (static), meta data, and state data (transient)
    - During refactoring the information elements in the BDII should be reviewed and unnecessary elements dropped

Recommendations: Operations

- R1: WLCG Service Coordination: improve the computing service(s) provided by the sites
  - Clarify scope, frequency and outcome of current meetings;
  - Address specific Tier-2 communication needs
    - Dedicated service coordination meetings
    - Evolve to “Computing as a Service at Tier-2s”: less experiment-specific services and interactions
  - Organize common site administrator training with EGI, NDGF and OSG
- R2: WLCG Service Commissioning: establish a core team of experts (from sites and experiments) to validate, commission and troubleshoot services

Middleware Services

- R8: Review site (middleware) services
  - Refactor existing middleware configurations to establish consistent procedures and remove unnecessary complexity
  - Assess services on scalability, load balancing and high availability aspects
  - Assess clients on retry and fail-over behaviors
  - Team of experts to prioritize open bugs and RFEs
  - Improve documentation based on input from service administrators and users
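To make the "retry and fail-over" client behaviour in R8 concrete, here is a minimal Python sketch of what such an assessment might look for. It is an illustration only, not taken from any WLCG middleware client: the function name `call_with_failover`, its parameters and the use of `ConnectionError` as the "transient failure" signal are all assumptions.

```python
import time


def call_with_failover(endpoints, request, retries=2, backoff=0.0, sleep=time.sleep):
    """Try each endpoint in order, retrying transient failures before failing over.

    `endpoints` is an ordered list of service instances (hypothetical names),
    `request` is a callable taking one endpoint and returning a result.
    """
    errors = []
    for endpoint in endpoints:
        for attempt in range(retries + 1):
            try:
                return request(endpoint)
            except ConnectionError as exc:
                # Record the failure so the final error explains what was tried.
                errors.append((endpoint, attempt, str(exc)))
                if attempt < retries:
                    # Exponential backoff between retries on the same endpoint.
                    sleep(backoff * (2 ** attempt))
        # All retries on this endpoint failed: fail over to the next one.
    raise RuntimeError(f"all endpoints failed: {errors}")
```

A client behaving like this puts bounded load on a struggling service (a fixed number of retries with backoff) and does not depend on a single instance, which is in line with the guiding principles on reaction to failures and reduced inter-dependencies.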

Middleware Deployment

- R9: Middleware Distribution, Configuration, and Deployment
  - Middleware configuration should be improved and should not be bound to a particular configuration management tool
  - Endorse middleware distribution via EPEL repository for additions to the RHEL/SL operating system family
    - Opportunity to optimize release process
  - Encourage sites and experiments to actively participate in the commissioning and validation of middleware components and services
  - Maintain compatible middleware clients in the Application Area repository. Establish a compatible UI/WN release in rpm and tar format
  - Possibility to produce targeted updates which fix individual problems on request

Order by Time

- Short term, specific time-bounded and well-defined targets
  - Availability, Site & Network monitoring
  - Software deployment
- Medium term, require a WG and need goals and metrics
  - Information system, Middleware Services and Deployment
- Long term, requires coordination and communication
  - Service Coordination and Commissioning

Ordering by Principle

- Reduce operations effort
  - Service Coordination and Commissioning, Site and Network Monitoring, Software deployment, Middleware Services
- Reduce complexity
  - Software deployment, Middleware Services and Deployment
- Minimize inter-dependencies (sites, experiments, services)
  - Software deployment, Information System
- Reduce effort to upgrade and reconfigure
  - Middleware deployment
- Improve access to information
  - Information System, Availability, Site and Network monitoring
- Improve reaction to service/hardware failures
  - Site Monitoring
- Deploy scalable services (2-3 times above the average load)
  - Middleware Services

Areas not covered

- Model for middleware and infrastructure support after EGI-InSPIRE, EMI, … have ended
  - WLCG has benefitted for more than a decade from a series of externally supported grid projects
  - In the absence of a new project the transition to internally sustained infrastructure will require effort and strategic planning
  - Subject of the User and General EGI Sustainability Workshop
- Network operations

Conclusions and Outlook

- Ops & Tools TEG has documented strategy and recommendations for the suggested topics and scope
- TEG report document already in good shape and soon available
  - Draft version at …perations
- Hard to get active participation from the sites
  - Rapid feedback on the document is welcome
- Looks possible to conclude according to schedule

Thank you

- Working Group coordinators: Simone Campana, Pepe Flix, Andrea Sciabà, Maria Dimou, Lionel Cons, Stefan Roiser, Maarten Litmaath, Tiziana Ferrari (EGI), Oliver Keeble, Rob Quick (OSG)
- The LHC experiments: Marco Cattaneo, Joel Closier, Ian Fisk, Costin Grigoras, Stephane Jezequel, I. Ueda
- Sites: Ian Collier, Xavier Espinal, Alessandra Forti, John Gordon, Pablo Hernandez, Gonzalo Merino, Anthony Tiradani
- CERN-IT: Julia Andreeva, Marian Babik, David Collados, Laurence Field, Alessandro Di Girolamo, Elisa Lanciotti, Wojciech Lapka, Alexandre Lossent, Pablo Saiz, Steve Traylen
- EMI: Cristina Aiftimiei
- OSG: Miron Livny

Backup Slides

WG1: Recommendations

WG2 Recommendations: Support Tools

WG2 Recommendations: Underlying Services

WG2 Recommendations: WLCG Operations

WG3 Recommendations

WG4 Recommendations

WG5 Recommendations: Middleware Configuration

WG5 Recommendations: Middleware Deployment

WG5 Recommendations: Middleware Distribution

Availability Monitoring Proposal (GDB, October)

- Experiments extend their SAM tests to test more site-specific functionality
  - Any new test contributing to the availability is properly agreed upon with sites and documented
- The SAM framework is extended to properly support external metrics such as from Panda, DIRAC, …
- The resulting availability will have these properties:
  - Takes into account more relevant site functionality
  - Is as independent as possible from experiment-side issues
  - Is well understood by the sites
- Additional experiment-side metrics are nevertheless used by VO computing operations and published e.g. via SSB

Schematic view: Proposal

[Figure: SAM tests for Ops, ALICE, ATLAS, CMS and LHCb, split into standard tests and experiment custom tests, plus external metrics for ALICE, ATLAS, CMS and LHCb. Legend: critical standard test; critical custom test, submitted by Nagios; non-critical test, external and injected to SAM.]
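The distinction between critical and non-critical tests in the proposal can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the SAM algorithm: the `ProbeResult` type, the `source` labels and the rule "available when every critical test passed" are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    name: str
    passed: bool
    critical: bool  # only critical tests count towards availability
    source: str     # "standard", "custom" or "external" (hypothetical labels)


def site_available(results):
    """Return True when every critical test passed, None if there are none."""
    critical = [r for r in results if r.critical]
    if not critical:
        return None  # availability is left undefined in this sketch
    return all(r.passed for r in critical)


def availability_fraction(samples):
    """Fraction of sampling intervals in which the site counted as available."""
    known = [site_available(s) for s in samples]
    known = [k for k in known if k is not None]
    return sum(known) / len(known) if known else None
```

In this sketch a failing non-critical external metric (for example one injected from an experiment framework) is still recorded but does not lower the availability figure, which mirrors the stated goal of keeping availability as independent as possible from experiment-side issues.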

Site Monitoring Proposal (GDB, October)

- We miss the equivalent of today’s SSB experiment views tailored for sites
- Proposal to use the SSB framework to provide this functionality as well
  - Advantages: many metrics already in the SSB for ATLAS and CMS
    - No duplication of effort nor issues of consistency
  - Need to agree on a few common metrics between experiments
    - Relevant from a site perspective
  - Some development needed in SSB to facilitate the visualization
  - Some commitment needed from experiments and sites
    - Experiment (support): define and inject metrics, validation
    - Sites: validation, feedback