Published by Griffin Ramsey. Modified over 9 years ago.
1
CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it DBES
LHC(b) Grid operations
Roberto Santinelli, IT/ES
5th User Forum – Uppsala, 12-16 April 2010
2
Outline
LHCb Grid operations
– Structure
– Services criticality
– Monitoring tools: activities, infrastructure
– Communications and issue tracking
Outlook to the other experiments
– ALICE
– CMS
– ATLAS
Conclusion
3
LHCb Grid Operations
Shifters, GEOC (161914), DIRAC Experts, Grid Contact, Production Manager
Problem (mailing list, monitoring)
Real data distribution, reconstruction; MC production; user analysis
6 contact persons at T1 (Europe)
4
LHCb Critical Services
5
Monitoring the activities
6
Monitoring the infrastructure
7
Communications and tracking
Elogbook; meetings; GGUS policies and escalation; contact person at sites; escalation at various LCG bodies
Massive usage of GGUS TEAM tickets
Priority set according to the *real* severity
Different GEOCs can act on the same ticket
GGUS ALARM tickets only for show stoppers
Escalation of tickets via GGUS and through the local contact person
Regular discussion at the WLCG daily meeting
Regular discussion (for real data) at the daily production meeting in LHCb
Long-standing issues: weekly at the T1 coordination meeting and GDB
Savannah used for internal operational tasks, reviewed on a weekly basis in LHCb production
Development discussions: weekly (PASTE)
8
DIRAC Solution
Failover for all operations
Automatic job resubmission (in some cases)
Pilot job submission and monitoring
Integrity checks per production
– Asynchronous automatic fix
Redundancy and resilience of services
– DNS load balancing, hot spares, ...
Fault tolerance on all clients
Alarms and notification
Self-recovery services
All of this minimizes intervention by the very small production crew and offers users a more stable system
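The "failover for all operations" and "hot spares" points can be illustrated with a minimal sketch. This is not DIRAC code: `call_with_failover`, `AllEndpointsFailed`, `fake_upload` and the `example.org` endpoint names are all hypothetical, and the retry count is an assumed parameter. The sketch only shows the general pattern of retrying an operation and then moving on to a spare endpoint.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when every endpoint in the failover list has failed."""


def call_with_failover(
    endpoints: Sequence[str],
    operation: Callable[[str], T],
    retries_per_endpoint: int = 2,
) -> T:
    """Try each endpoint in order, retrying before moving to the next one."""
    errors: List[str] = []
    for endpoint in endpoints:
        for attempt in range(1, retries_per_endpoint + 1):
            try:
                return operation(endpoint)
            except Exception as exc:  # broad on purpose: any failure triggers failover
                errors.append(f"{endpoint} attempt {attempt}: {exc}")
    # Nothing worked: report every failure so an operator can investigate.
    raise AllEndpointsFailed("; ".join(errors))


def fake_upload(endpoint: str) -> str:
    """Hypothetical operation: the primary endpoint is down, the spare works."""
    if endpoint == "primary.example.org":
        raise ConnectionError("service unavailable")
    return f"stored via {endpoint}"


if __name__ == "__main__":
    print(call_with_failover(["primary.example.org", "spare.example.org"], fake_upload))
```

Collecting the individual errors instead of discarding them keeps the failure visible to the production crew even when the spare endpoint succeeds only after several attempts.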
9
RSS
1. High-level view: the global status as the result of many parameters
2. Easy investigation: top-down
3. Combines multiple, scattered sources of information
   1. GGUS
   2. GOCDB
   3. SAM
   4. SLS
   5. DIRAC
   6. Lemon
4. Elaborate policies for site/service management; notification mechanism
5. Possibility to define custom metrics (e.g. CPU usage compared with expectation, efficiencies, etc.)
6. Flexibility for adding more plug-ins (e.g. Nagios)
7. Automatic actions for reliable and trusted metrics
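A small sketch can make the "global status from many sources" and "top-down investigation" ideas concrete. The status names, the "worst source wins" rule and the example reports below are assumptions made for illustration only; the slide does not describe how RSS actually combines its inputs.

```python
from enum import IntEnum
from typing import Dict, List


class Status(IntEnum):
    """Hypothetical status levels, ordered from best to worst."""
    OK = 0
    DEGRADED = 1
    BAD = 2


def global_status(reports: Dict[str, Status]) -> Status:
    """Assumed policy: the global status is the worst status from any source."""
    if not reports:
        raise ValueError("no monitoring source reported a status")
    return max(reports.values())


def sources_to_investigate(reports: Dict[str, Status]) -> List[str]:
    """Top-down drill: names of the sources responsible for the global status."""
    worst = global_status(reports)
    if worst == Status.OK:
        return []
    return sorted(name for name, status in reports.items() if status == worst)


# Invented example reports for one site, keyed by the sources listed on the slide.
site_reports = {
    "GGUS": Status.OK,
    "GOCDB": Status.DEGRADED,
    "SAM": Status.BAD,
    "SLS": Status.OK,
    "DIRAC": Status.BAD,
    "Lemon": Status.OK,
}

if __name__ == "__main__":
    print(global_status(site_reports).name, sources_to_investigate(site_reports))
```

A new plug-in such as Nagios would, in this sketch, simply be one more key in the reports dictionary.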
10
ALICE: Principle of operations
Uniform deployment of WLCG services at all sites
– Same behavior for T1 and T2 in terms of production
– Differences between T1 and T2: a matter of QoS
WLCG entry point: VO-boxes deployed at all T0-T1-T2 sites providing resources for ALICE
– Mandatory requirement to enter production
– Required in addition to all standard WLCG services
– Run standard WLCG components and ALICE-specific services
Installation and maintenance of the experiment-specific services at the local VO-boxes are entirely ALICE's responsibility
– Based on a regional principle
– Set of ALICE experts matched to groups of sites
Site-related problems (services and operations of the generic services) handled by site administrators
WLCG service problems reported via GGUS
– Not too many: ALICE has experts at almost all sites
11
ALICE: Operation procedures
Core team at CERN coordinates the operations and the status of services at all sites
– IT-ES/experiment collaboration
Dedicated persons at CERN-IT engaged in the experiment operations, storage solutions and experiment-specific software development
– High level of expertise, able to identify problems at any site in a short time
– Close collaboration with the service developers and the Grid deployment team
   Thanks to this close collaboration, ALICE is the first WLCG experiment to have migrated its whole production at all sites to the CREAM-CE as the production backend
– ALICE is represented at all Grid forums: EGEE TMB, daily operations meeting, WLCG GDB, WLCG MB, T1 service operations meeting
The good level of operations procedures established for and with ALICE is enabling smooth data taking
– No fundamental issues have been reported which could damage the data-taking regime of the experiment
13
CMS Computing Ops
CMS Computing Operations steered by 3 projects
– Data Operations
   Responsible for central data processing and production transfers: RAW data repacking + prompt reco at T0, RAW and MC re-reco/skimming at T1s, MC production at T2s
– Facilities Operations
   Responsible for providing and maintaining a working distributed computing fabric at WLCG Tiers, with a consistent working environment for Data Operations and end users
– Analysis Operations
   Responsible for central data placement at T2 level, CRAB server operations, validation, support, and for metrics, monitoring and evaluation of the distributed analysis system
Strong central teams complemented by CMS contacts at Tiers, working in sync
Regular CMS Ops meetings
– Weekly general meeting (check of status of all Tier-0/1/2 + check of progress on all activities)
– Weekly T2-only meeting
Stable contact with WLCG Ops
– Daily attendance and daily reports, no exceptions
– Weekly-scope 'special' report on one day
https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports
14
The CMS Computing Ops daily checks
Check SAM tests
– Check CMS-specific SAM tests on all Tier-0/1/2
– Check that Tier-X sites satisfy the overall availability thresholds
   Goal for CMS {T1, T2} is {90%, 80%}; follow up on issues
Check the Site Readiness estimators
– Boolean 'AND' of: uptime, JobRobot, SAM, number of commissioned links, quality of transfers, ...
   (both daily and in historical evolution)
Open tickets, follow up at meetings
– Savannah (more) + GGUS (less, but increasing)
And more...
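The Site Readiness estimator described above, a Boolean AND over several per-site metrics, can be sketched as follows. Only the SAM availability goals (90% for T1, 80% for T2) come from the slide. The `SiteDay` record, the field names and the other thresholds (`min_job_robot`, `min_links`, `min_transfer_quality`) are hypothetical values chosen for illustration, not the CMS definitions.

```python
from dataclasses import dataclass

# Availability goals quoted on the slide for CMS T1 and T2 sites.
SAM_AVAILABILITY_GOAL = {"T1": 0.90, "T2": 0.80}


@dataclass
class SiteDay:
    """One day of metrics for one site (field names are hypothetical)."""
    tier: str                    # "T1" or "T2"
    up: bool                     # site was up, i.e. not in downtime
    job_robot_efficiency: float  # fraction of JobRobot test jobs that succeeded
    sam_availability: float      # fraction of CMS-specific SAM tests passed
    commissioned_links: int      # number of commissioned transfer links
    transfer_quality: float      # fraction of successful transfers


def is_site_ready(
    day: SiteDay,
    min_job_robot: float = 0.80,         # assumed threshold
    min_links: int = 2,                  # assumed threshold
    min_transfer_quality: float = 0.50,  # assumed threshold
) -> bool:
    """Boolean AND of all individual checks: one failing check marks the site not ready."""
    checks = (
        day.up,
        day.job_robot_efficiency >= min_job_robot,
        day.sam_availability >= SAM_AVAILABILITY_GOAL[day.tier],
        day.commissioned_links >= min_links,
        day.transfer_quality >= min_transfer_quality,
    )
    return all(checks)


if __name__ == "__main__":
    t2 = SiteDay("T2", True, 0.95, 0.85, 4, 0.90)
    print(is_site_ready(t2))
```

With these assumed thresholds, the same metrics that pass for a T2 site fail for a T1 site, because 85% SAM availability is below the 90% T1 goal.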
15
CMS Computing shifts
Few primary centres in multiple time zones
– Connected via Tandberg video system
Many existing secondary CMS centres worldwide
– Permanent EVO room
Started in fall 2008, now ~50 people in 3 time zones
– Also non-core-computing experts
Shift roles and responsibilities
– 24/7 with a Computing Run Coordinator + a shifter
Shift procedures and checklists constantly improving
Several tools:
– EVO room (see above)
– Shift sign-up tool (+ credits)
– IM account for shifters
– Computing Plan of the Day
– Large use of shifter ELOG
– Savannah + GGUS
16
ATLAS Distributed Computing Shift Operations
Role | Location | FTE effort | Comments
ADC Manager on Duty | CERN | 1 | In overall charge of distributed computing operations for 1 week; reports ATLAS issues to the WLCG daily meeting
ADC Point 1 Shifter | CERN | 3 (data taking); 1 (non data taking) | Monitors data export from CERN to Tier-1 sites; monitors health of ATLAS central services (e.g. DDM)
ADC Operations Shifters | Distributed (EU, US and AP shifters) | 5 (2 EU, 2 US, 1 AP) | Monitor central production and data flows between T1s and T2s
Distributed Analysis Shifters | Distributed (usually EU + US) | 2 (1 EU, 1 US) | Respond to user questions and problems on distributed analysis
In addition we have
– High level of support from system experts in, e.g., DDM, PanDA
– Cloud squads who deal with data management and other issues within clouds (site setup, cleaning lost files, etc.)
17
ATLAS: Monitoring Tools
SLS monitoring for health of ATLAS central services
Production system monitoring
Analysis job monitoring
DDM Transfer Dashboard for monitoring all data movement
We have many very useful monitoring tools, but shifters do have to look in many places
18
Communications and Issue Tracking
There is a wide diversity of tools suitable for different types of issues and for communication between different actors; however, links between tools are manual and too time-consuming for both shifters and experts
Tool | Use
Jabber chatroom | Instant communication between different shift teams and experts
eLog | Global logbook for shift teams
GGUS | Communication about problems with sites; Team and Alarm tickets are essential for us
Savannah | ATLAS internal communication and issue tracking
19
Some general remarks on ATLAS Operations
The system takes a great deal of effort to run
– This is mainly because sites are still unstable and we generally notice problems before they do
Storage system stability, reliability and performance are still lacking
– ATLAS, as a heavy user of the system, probes sites far more deeply than any automated tests
We are trying to automate systems as much as possible, but this has to be done carefully to avoid false positives
There is little notion of site criticality in the grid tools: whether a site is a T0, T1, T2 or T3 makes a big difference to ATLAS operations
Managing change during LHC running is the big challenge, e.g. continued service migration to SL5
20
Conclusion
Timeline: 2004-07 SC3/SC4, DCXX; 2008 CCRC08; 2009 STEP09 / first collisions; 2010... real data
Experiment operations differ (size of the collaboration, manpower, time zones)
...but also have many common solutions, thanks to a collaborative attitude to sharing best practices
...and also many common needs (scattered information, manual interventions still required, procedures to be improved, managing changes on the client/service side during data taking)
QoS offered by WLCG has improved dramatically over the years thanks to many coordinated activities, experience gained in managing emergencies and improved communication among the various stakeholders
21
Many thanks to:
Graeme Stewart / Alessandro Di Girolamo (ATLAS)
Daniele Bonacorsi and Andrea Sciabà (CMS)
Patricia Mendez Lorenzo (ALICE)