LHC(b) Grid operations
Roberto Santinelli, CERN IT/ES
5th User Forum, Uppsala, 12-16 April 2010

Outline
LHCb Grid operations
– Structure
– Services criticality
– Monitoring tools (activities, infrastructure)
– Communications and issue tracking
Outlook to the other experiments
– ALICE
– CMS
– ATLAS
Conclusion

LHCb Grid operations structure
(Organisation chart.) Problems reach the operations team through the mailing list and the monitoring; they are handled by the GEOC shifters, the DIRAC experts, the Grid contact and the Production Manager. The activities covered are real data distribution and reconstruction, MC production and user analysis, with 6 contact persons at the European T1 sites.

LHCb Critical Services

Monitoring the activities

Monitoring the infrastructure

Communications and tracking
Channels: Elogbook, meetings, GGUS (policies and escalation), contact persons at sites, escalation at the various LCG bodies.
– Massive usage of GGUS TEAM tickets; priority set according to the *real* severity; different GEOCs can act on the same ticket
– GGUS ALARM tickets only for show-stoppers
– Escalation of tickets via GGUS and through the local contact person
– Regular discussion at the WLCG daily meeting
– Regular discussion (for real data) at the daily production meeting in LHCb
– Long-standing issues: weekly at the T1 coordination meeting and the GDB
– Savannah used for internal operational tasks, reviewed on a weekly basis in LHCb production
– Development discussions: weekly (PASTE)

The DIRAC solution
– Failover for all operations
– Automatic job resubmission (in some cases)
– Pilot job submission and monitoring
– Integrity checks per production, with asynchronous automatic fixes
– Redundancy and resilience of services: DNS load balancing, hot spares, ...
– Fault tolerance in all clients
– Alarms and notifications
– Self-recovering services
All of this minimizes the interventions needed from the very small production crew and offers users a more stable system; a sketch of the failover-and-resubmission pattern follows.
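As an illustration of the failover and automatic-resubmission points above, here is a minimal, self-contained sketch. It is not DIRAC code: the endpoints, the submit_job function and the retry limits are all hypothetical.

```python
# Illustrative sketch of the failover + automatic-resubmission pattern
# described on the slide. All service names and functions are hypothetical;
# DIRAC's real implementation differs.
import random
import time

PRIMARY = "wms-primary.example.org"    # hypothetical primary service
FAILOVER = "wms-failover.example.org"  # hypothetical hot spare
MAX_RETRIES = 3

def submit_job(endpoint: str, payload: dict) -> str:
    """Pretend submission: fails randomly to exercise the failover path."""
    if random.random() < 0.3:
        raise ConnectionError(f"{endpoint} unavailable")
    return f"job-{random.randint(1000, 9999)}@{endpoint}"

def submit_with_failover(payload: dict) -> str:
    """Try the primary endpoint, fall back to the spare, retry with backoff."""
    for attempt in range(1, MAX_RETRIES + 1):
        for endpoint in (PRIMARY, FAILOVER):
            try:
                return submit_job(endpoint, payload)
            except ConnectionError:
                continue  # try the next endpoint
        time.sleep(2 ** attempt)  # back off before automatic resubmission
    raise RuntimeError("all endpoints failed after retries")

if __name__ == "__main__":
    print(submit_with_failover({"executable": "analysis.sh"}))
```

The design point is the same as on the slide: clients tolerate the failure of any single service instance, so the small production crew only intervenes when every redundant path is exhausted.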

RSS (Resource Status System)
1. High-level view: the global status as the result of many parameters
2. Easy investigation: top-down
3. Combines multiple, scattered sources of information: GGUS, GOCDB, SAM, SLS, DIRAC, Lemon
4. Elaborate policies for site/service management; notification mechanism
5. Possibility to define custom metrics (e.g. CPU usage compared with expectations, efficiencies, etc.)
6. Flexibility for adding more plug-ins (e.g. Nagios)
7. Automatic actions for reliable and trusted metrics
A sketch of the aggregation idea is given below.
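To make point 3 concrete, a minimal sketch of status aggregation across scattered sources: the source names match the slide, but the fetch function, the status scale and the combination policy are all hypothetical stand-ins, not the real RSS logic.

```python
# Minimal sketch of the status-aggregation idea behind RSS: several
# independent information sources each report a status for a site, and a
# policy combines them into one global state. Everything below the source
# names is illustrative.
from enum import IntEnum

class Status(IntEnum):
    BANNED = 0
    BAD = 1
    PROBING = 2
    ACTIVE = 3

def fetch_statuses(site: str) -> dict:
    """Stand-in for real queries to GGUS, GOCDB, SAM, SLS, DIRAC, Lemon."""
    return {
        "GGUS": Status.ACTIVE,   # e.g. no open alarm tickets
        "GOCDB": Status.ACTIVE,  # e.g. no scheduled downtime
        "SAM": Status.PROBING,   # e.g. one availability test degraded
        "SLS": Status.ACTIVE,
        "DIRAC": Status.ACTIVE,
        "Lemon": Status.ACTIVE,
    }

def combine(statuses: dict) -> Status:
    """Pessimistic policy: the site is only as good as its worst source."""
    return min(statuses.values())

if __name__ == "__main__":
    site = "LCG.CERN.ch"
    print(site, combine(fetch_statuses(site)).name)
```

Custom metrics (point 5) and extra plug-ins (point 6) would slot in as further entries of the dictionary, feeding the same policy.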

ALICE: principle of operations
– Uniform deployment of WLCG services at all sites: same behaviour at T1 and T2 in terms of production; the differences between T1 and T2 are a matter of QoS
– WLCG entry point: VO-boxes deployed at all T0/T1/T2 sites providing resources for ALICE; a mandatory requirement to enter the production, required in addition to all standard WLCG services; they run standard WLCG components plus ALICE-specific services
– Installation and maintenance of the experiment-specific services on the local VO-boxes is entirely ALICE's responsibility, organised on a regional principle: a set of ALICE experts is matched to groups of sites
– Site-related problems (services and operation of the generic services) are handled by the site administrators
– WLCG service problems are reported via GGUS; not many of them, since ALICE has experts at almost all sites

ALICE: operation procedures
– A core team placed at CERN coordinates the operations and the status of the services at all sites (an IT-ES/experiment collaboration)
– Dedicated persons at CERN-IT take care of the experiment operations, storage solutions and experiment-specific software developments; their high level of expertise allows problems at any site to be identified quickly, in close collaboration with the service developers and the Grid deployment team
– Thanks to this close collaboration, ALICE is the first WLCG experiment to have migrated its whole production, at all sites, to the CREAM-CE as the production backend
– ALICE is represented at all Grid forums: EGEE TMB, daily operations meeting, WLCG GDB, WLCG MB, T1 service operations meeting
– The good level of operations procedures established for and with ALICE is enabling smooth data taking: no fundamental issues have been reported that could damage the experiment's data-taking regime

CMS Computing Ops
CMS Computing Operations is steered by 3 projects:
– Data Operations: responsible for central data processing and production transfers: RAW data repacking and prompt reconstruction at the T0, RAW and MC re-reco/skimming at the T1s, MC production at the T2s
– Facilities Operations: responsible for providing and maintaining a working distributed computing fabric at the WLCG Tiers, with a consistent working environment for Data Operations and end users
– Analysis Operations: responsible for central data placement at the T2 level, CRAB server operations, validation, support, and for the metrics, monitoring and evaluation of the distributed analysis system
Strong central teams are complemented by CMS contacts at the Tiers, working in sync. Regular CMS Ops meetings:
– a weekly general meeting (check of the status of all Tier-0/1/2 sites and of the progress on all activities)
– a weekly T2-only meeting
Stable contact with WLCG Ops: daily attendance and daily reports, no exceptions, plus a weekly-scope 'special' report on one day.

The CMS Computing Ops daily checks
– Check the SAM tests, and the CMS-specific SAM tests, on all Tier-0/1/2 sites
– Check that each tier satisfies the overall availability thresholds: the goal for CMS {T1, T2} is {90%, 80%}; follow up on issues
– Check the Site Readiness estimators: a boolean 'AND' of uptime, JobRobot, SAM, number of commissioned links, quality of transfers, ... (both daily and in their historical evolution); a sketch of this 'AND' is given below
– Open tickets and follow them up at meetings: Savannah (more) and GGUS (less, but increasing)
– And more...
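A minimal sketch of the Site Readiness boolean 'AND' described above: a site is "ready" only if every estimator passes. The field names and most thresholds are illustrative, not the real CMS definitions; only the 90%/80% availability goals come from the slide.

```python
# Sketch of the Site Readiness idea: readiness is the boolean AND of
# several estimators. Thresholds marked "illustrative" are assumptions.
from dataclasses import dataclass

@dataclass
class SiteMetrics:
    uptime: float             # fraction of the day the site was up
    job_robot_success: float  # JobRobot test success rate
    sam_availability: float   # SAM availability
    commissioned_links: int   # number of commissioned transfer links
    transfer_quality: float   # fraction of successful transfers

def is_ready(m: SiteMetrics, tier: int) -> bool:
    # Availability goal from the slide: 90% for T1, 80% for T2.
    goal = 0.90 if tier == 1 else 0.80
    return (m.uptime >= goal
            and m.job_robot_success >= 0.90  # illustrative threshold
            and m.sam_availability >= goal
            and m.commissioned_links >= 2    # illustrative threshold
            and m.transfer_quality >= 0.80)  # illustrative threshold

if __name__ == "__main__":
    print(is_ready(SiteMetrics(0.95, 0.97, 0.93, 4, 0.90), tier=1))  # True
```

Because the estimators are AND-ed, a single failing metric marks the whole site as not ready, which is what makes the daily per-metric checks worthwhile.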

CMS Computing shifts
– A few primary centres in multiple time zones, connected via a Tandberg video system; many existing secondary CMS centres worldwide, with a permanent EVO room
– Started in Fall 2008; now ~50 people in 3 time zones, including non core-computing experts
– Shift roles and responsibilities: 24/7 coverage with a Computing Run Coordinator plus a shifter; shift procedures and checklists constantly improving
– Several tools: the EVO room (see above), a shift sign-up tool (with credits), an IM account for shifters, the Computing Plan of the Day, large use of the shifter ELOG, Savannah and GGUS

ATLAS Distributed Computing shift operations

Role | Location | FTE effort | Comments
ADC Manager on Duty | CERN | 1 | In overall charge of distributed computing operations for 1 week; reports ATLAS issues to the WLCG daily meeting
ADC Point 1 Shifter | CERN | 3 (data taking); 1 (non data taking) | Monitor data export from CERN to the Tier-1 sites; monitor the health of the ATLAS central services (e.g. DDM)
ADC Operations Shifters | Distributed (EU, US and AP shifters) | 5 (2 EU, 2 US, 1 AP) | Monitor central production and the data flows between T1s and T2s
Distributed Analysis Shifters | Distributed (usually EU + US) | 2 (1 EU, 1 US) | Respond to user questions and problems on distributed analysis

In addition we have a high level of support from system experts in, e.g., DDM and PanDA, and cloud squads who deal with data management and other issues within the clouds (site setup, cleaning lost files, etc.).

ATLAS: monitoring tools
– DDM Transfer Dashboard for monitoring all data movement
– SLS monitoring for the health of the ATLAS central services
– Production system monitoring
– Analysis job monitoring
We have many very useful monitoring tools, but shifters do have to look in many places.

Communications and issue tracking
There is a wide diversity of tools, suitable for different types of issues and for communication between different actors; however, the links between the tools are manual and too time-consuming for both shifters and experts.

Tool | Use
Jabber chatroom | Instant communication between the different shift teams and the experts
eLog | Global logbook for the shift teams
GGUS | Communication about problems with sites; Team and Alarm tickets are essential for us
Savannah | ATLAS-internal communication and issue tracking

Some general remarks on ATLAS operations
– The system takes a great deal of effort to run, mainly because sites are still unstable and we generally notice problems before they do
– Storage system stability, reliability and performance are still lacking; ATLAS, as a heavy user of the system, probes sites far more deeply than any automated tests
– We are trying to automate as much as possible, but this has to be done carefully to avoid false positives
– There is little notion of site criticality in the grid tools: whether a site is a T0, T1, T2 or T3 for ATLAS makes a big difference to our operations
– Managing change during LHC running is the big challenge, e.g. the continued service migration to SL5

Conclusion
From SC3/SC4 and the data challenges, through CCRC'08 and STEP09, to the first collisions and the 2010 real data:
– Experiment operations differ (size of the collaboration, manpower, time zones), but the experiments also have many common needs: scattered information, manual interventions still required, procedures to be improved, managing client/service changes during data taking
– They also share many common solutions, thanks to a collaborative attitude of sharing best practices
– The QoS offered by WLCG has improved dramatically over the years, thanks to many coordinated activities, the experience gained in managing emergencies, and improved communication among the various stakeholders

Many thanks to: Graeme Stewart / Alessandro Di Girolamo (ATLAS), Daniele Bonacorsi and Andrea Sciabà (CMS), Patricia Mendez Lorenzo (ALICE).