WLCG operations A. Sciabà, M. Alandes, J. Flix, A. Forti WLCG collaboration workshop July 7-9 2014, Barcelona.

Presentation transcript:

WLCG operations
A. Sciabà, M. Alandes, J. Flix, A. Forti
WLCG collaboration workshop, July 7-9 2014, Barcelona

Outline
– The WLCG operations group
– Experience before and during Run1
– Achievements and current work
– Issues
– Priorities beyond Run2

WLCG operations
Evolved from a decade-long experience in prototypes, service and data challenges
– The result of the effort of many individuals
– Also uses procedures and tools provided by the federated Grid projects: GGUS, GOCDB, OIM, EGI portals, etc.
– Initial focus on delivering a stable service
– Very successful, but at the expense of a high manpower cost
In 2011 WLCG decided that it was the right time for an internal review
– The Technical Evolution Groups

Operations before and during Run1
WLCG operations ran from 2008 as a loosely coordinated effort
– Effective, but manpower intensive
The 2012 TEG review identified several weak points:
– No effective communication with the Tier-2s
– No central operations team
– No central body where operational decisions could be taken
– Fragile experiment software installation procedures
– Poor documentation and logging for services
– Insufficient middleware validation

The WLCG operations coordination working group
Established in October 2012
Core operations and deployment coordination team
– Manages operational issues and service deployment in synergy with EGI and OSG
– Discusses experiment plans and needs
– Defines actions and work plans
– Forms working groups and time-limited task forces on specific issues
– Ensures communication among experiments and sites
All stakeholders are represented
– LHC experiments, Tier-0/1/2s, Grid projects
– Largely based on voluntary effort from the entire WLCG community

Organisation
[Organisation chart] The operations coordination meeting reports to the Management Board and the Grid Deployment Board and brings together the chairpersons, the experiments, the sites and the Grid projects (EGI, OSG, GridPP)
– Task forces: gLExec, Tracking tools, HTTP proxy discovery, FTS-3, SHA-2, Machine/job features, IPv6, Multicore, WMS decommissioning
– Working groups: Middleware readiness, Network and transfer metrics

Operations during Run1
Main goals:
– Implement the recommendations given by the TEG
– Manage the daily operations
– Solve specific problems requiring complex validation or deployment activities
TEG recommendations and their status (completed or work in progress):
– Establish a core team for coordinating WLCG operations → the WLCG operations coordination group
– Expand the scope of existing meetings to fully involve Tier-2 sites → T2 regional representatives, T2 participation in task forces
– Adopt CVMFS to distribute experiment software (and middleware) at sites → CVMFS used by all LHC experiments
– Simplify the middleware stack and improve documentation and procedures → LFC and WMS decommissioned, FTS-2 soon to be
– Improve middleware distribution and configuration mechanisms → rationalise repositories, Puppet and virtualisation
– Strengthen the participation of sites and experiments in middleware validation → Middleware Readiness validation working group

Achievements (1/2)
Several other goals were accomplished
The gLExec deployment is basically complete
– Mandatory at CMS sites (affects their availability)
CVMFS deployed at all sites and adopted by all experiments (a minimal client-side check is sketched below)
– A much more robust and scalable software distribution mechanism
Migration of all compute resources to Scientific Linux 6
– Defined procedures for sites, a WLCG repository and a “HEP” metapackage for experiment software dependencies
– Completed in the agreed timescale thanks to the considerable effort spent in keeping experiments and sites focused
– As a bonus, validated the EMI-3 WNs for WLCG ahead of staged rollout
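As an illustration of what the CVMFS rollout means on a worker node, here is a minimal sanity-check sketch (not official WLCG tooling): the repository names are examples, and it assumes the standard cvmfs client with its `cvmfs_config` utility is installed.

```python
#!/usr/bin/env python
# Minimal sketch: verify that the CVMFS repositories a site is expected to
# provide are mounted and readable. Repository names are examples only.
import os
import subprocess

REPOS = ["atlas.cern.ch", "cms.cern.ch", "grid.cern.ch"]  # example repositories

def repo_readable(repo):
    """Listing the mount point triggers the autofs mount and checks readability."""
    try:
        return len(os.listdir(os.path.join("/cvmfs", repo))) > 0
    except OSError:
        return False

if __name__ == "__main__":
    for repo in REPOS:
        print("%-20s %s" % (repo, "OK" if repo_readable(repo) else "NOT ACCESSIBLE"))
    # 'cvmfs_config probe' is the client's own self-test over the configured
    # repositories (assumes the cvmfs client package is installed).
    subprocess.call(["cvmfs_config", "probe"])
```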

Achievements (2/2)
Centralised squid monitoring for Frontier and CVMFS
– Harmonised the squid installations of the different experiments and set up central monitoring pages
WLCG-wide perfSONAR deployment
– perfSONAR instances at almost all sites and central monitoring
– The first important step towards network-aware applications
Xrootd deployment
– Enabled detailed monitoring of all Xrootd traffic for ATLAS and CMS
– Tested and distributed different plugins
Validation of SHA-2 certificates (see the check sketched below)
– Validated all middleware and experiment software with SHA-2 certificates and proxies
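To give a concrete flavour of the SHA-2 work, the sketch below reports the signature hash algorithm of an X.509 certificate, the kind of check used to spot certificates still signed with SHA-1. It is not the campaign's actual tooling: it assumes the third-party Python `cryptography` package, and the certificate path is only illustrative.

```python
# Minimal sketch: report the signature hash algorithm of an X.509 certificate.
# Requires the third-party 'cryptography' package; the path is illustrative.
from cryptography import x509
from cryptography.hazmat.backends import default_backend

CERT_PATH = "/etc/grid-security/hostcert.pem"  # example location

with open(CERT_PATH, "rb") as f:
    cert = x509.load_pem_x509_certificate(f.read(), default_backend())

algo = cert.signature_hash_algorithm.name  # e.g. 'sha256' or 'sha1'
print("Signature hash algorithm:", algo)
if algo == "sha1":
    print("WARNING: certificate is still SHA-1 signed")
```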

Current activities
Covering many high-priority objectives for Run2:
– Enabling multicore resources
– Commissioning of FTS-3
– Enabling network-aware applications
– Efficient job-compute node interaction in batch and cloud resources
– Collaborative middleware validation using experiment applications
– Testing and deployment of IPv6-enabled services, in collaboration with the HEPiX IPv6 working group (see the connectivity check sketched below)
Many of these topics have dedicated talks in this workshop
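For the IPv6 item, the sketch below shows the kind of basic dual-stack reachability test involved in commissioning a service over IPv6. The host name and port are hypothetical placeholders, not a real endpoint.

```python
# Minimal sketch: check whether a service endpoint is reachable over IPv6.
# Host and port are illustrative placeholders.
import socket

HOST, PORT = "fts3.example.org", 8446  # hypothetical service endpoint

try:
    infos = socket.getaddrinfo(HOST, PORT, socket.AF_INET6, socket.SOCK_STREAM)
except socket.gaierror as exc:
    print("No IPv6 address for %s: %s" % (HOST, exc))
else:
    family, socktype, proto, _name, sockaddr = infos[0]
    s = socket.socket(family, socktype, proto)
    s.settimeout(5)
    try:
        s.connect(sockaddr)
        print("IPv6 TCP connection to %s:%d succeeded" % (HOST, PORT))
    except OSError as exc:
        print("IPv6 TCP connection failed: %s" % exc)
    finally:
        s.close()
```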

Task force progress summary
[Table of task forces and their progress; not captured in the transcript]

Recent changes and new roles
Recently reorganised the structure of the meetings and the reports
– To make them more effective and useful, by focusing on the most relevant issues and leaving more time for discussion
– Task forces now maintain lists of specific goals and measure their progress
– Eliminated the quarterly planning meeting, deemed redundant
Created the role of WLCG middleware officer
– Tracks all relevant middleware issues
– Decides which versions are the baseline (a bookkeeping sketch follows below)
– Works with the Middleware Readiness WG to decide what to test and to assess the results
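A minimal sketch of the bookkeeping a baseline-versions table implies, i.e. flagging deployed middleware versions that are below the agreed baseline. All package names, versions and site names below are invented for illustration and are not taken from the actual baseline.

```python
# Minimal sketch: flag sites running middleware older than the agreed baseline.
# All package names, versions and site names are invented examples.

BASELINE = {"fts3": (3, 2, 26), "dpm": (1, 8, 8), "argus": (1, 6, 0)}

SITE_REPORTS = {
    "SITE-A": {"fts3": (3, 2, 26), "dpm": (1, 8, 7)},
    "SITE-B": {"argus": (1, 5, 0), "dpm": (1, 8, 10)},
}

def outdated(reports, baseline):
    """Yield (site, package, deployed, required) for versions below baseline."""
    for site, packages in sorted(reports.items()):
        for pkg, version in sorted(packages.items()):
            required = baseline.get(pkg)
            if required is not None and version < required:
                yield site, pkg, version, required

for site, pkg, have, want in outdated(SITE_REPORTS, BASELINE):
    print("%s: %s %s < baseline %s" % (site, pkg, have, want))
```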

Issues during LS1
Site participation can be improved
– Unfortunately, many Tier-1 sites and Tier-2 regional representatives do not follow the ops coordination meetings regularly
– Some steps have already been taken, but it remains to be seen whether they are effective
– It is also hard to get feedback
Long tails in deployment campaigns
– It is very hard to reach 100% completion
– Tickets are a must, but often not enough
– Sometimes a huge effort is needed from the task force coordinators to keep things moving
Uncertainty in middleware support
– After the end of EMI, the continuation of middleware support is a concern and we already face problems (e.g. ARGUS)
Lack of risk analysis
– Do we need it?
Manpower
– Not so severe, but contributions from sites are absolutely essential

Priorities for the next years
Several task forces close to completion
– SHA-2, gLExec, WMS, tracking tools
– No new task forces on the horizon
Future evolution will heavily impact operations
– Changes in the experiment computing models
– Changes in the infrastructure

Priorities for the experiments?
Some clear commonalities
– Funded CPU and storage resources will become increasingly scarce with respect to needs: aggressive storage space management; adapting to new types of resources beyond the traditional tiers
– Should the scope of WLCG operations change accordingly? E.g. by providing VO-independent tools, procedures and recipes appropriate also for new types of resources
– Commissioning of new services, protocols, etc.: here we can assume that WLCG operations will always play a role

Priorities from ATLAS (from the ATLAS computing management)
Reduce the overall effort from sites and experiments
– The operations effort is still too heavy: sites need very skilled people to run services that are still far from IT standards
– Need to find a suitable model
Fully enable the usage of cloud resources
– If 50% of the Tier-2s moved to OpenStack today, the experiments would have a hard time
– Need to address the operational side, where the experiment becomes the sysadmin of the resources

Priorities from CMS (from the CMS computing management)
Continue to operate and support existing services, and actively participate in their evolution during Run2
– Central role in mediating between middleware developer teams and experiment requests
Provide solutions for availability monitoring, accounting and resource utilisation in OpenStack and other cloud resources
– Also covering other resource provisioning interfaces such as BOSCO
– To help enable opportunistic resources
Usage and overload monitoring and service health for storage federations
– Global storage usage accounting is a long-standing request from the C-RSG
Support in introducing new standards
– Machine/job features, IPv6, multicore provisioning, etc. (a sketch of reading machine/job features follows this slide)
Operations of common services
– MyProxy, gLExec, CVMFS, OS upgrades, perfSONAR, etc.
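For the machine/job features item, the sketch below shows how a job could read the interface, which publishes one small file per feature under directories pointed to by the $MACHINEFEATURES and $JOBFEATURES environment variables. The specific key names follow the proposal as commonly described and may differ at a given site; treat them as assumptions.

```python
# Minimal sketch: read the machine/job features interface, where each feature
# is a small file inside the directories pointed to by $MACHINEFEATURES and
# $JOBFEATURES. Key names ('hs06', 'allocated_cpu', 'wall_limit_secs') are
# assumptions based on the proposal and may vary per site.
import os

def read_feature(env_var, key):
    """Return the value of one feature file, or None if unavailable."""
    directory = os.environ.get(env_var)
    if not directory:
        return None
    try:
        with open(os.path.join(directory, key)) as f:
            return f.read().strip()
    except (IOError, OSError):
        return None

hs06 = read_feature("MACHINEFEATURES", "hs06")                 # node power
allocated_cpu = read_feature("JOBFEATURES", "allocated_cpu")   # cores for this job
wall_limit = read_feature("JOBFEATURES", "wall_limit_secs")    # wall-clock limit

print("HS06: %s, allocated CPUs: %s, wall limit (s): %s"
      % (hs06, allocated_cpu, wall_limit))
```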

Priorities for WLCG (input from WLCG management)
Cloud computing
– Not an issue today, but it will become an operations issue if sites start moving to clouds
Understanding performance
– Benchmarks, I/O metrics on storage federations, global monitoring
Optimising the system for performance and cost
– What can WLCG do to help the experiments get the most out of the resources?
“Virtual” sites
– Will it be possible to have sites that run almost unattended, saving on operational effort?

Conclusions
WLCG Operations coordinates most activities of common interest for the experiments
Site participation is mostly via task forces
– But it still needs to improve
Long-term middleware support is already a problem
Current activities mostly target Run2; we need to look beyond
– Good agreement in expectations between experiments and WLCG
– Manpower, cloud resources, monitoring, service commissioning and operations are the main areas
Questions?