Experiment Operations
Simone Campana
CERN IT Department, CH-1211 Genève 23, Switzerland (www.cern.ch/it)

Outline
Try to answer the following questions:
– How are experiment operations organized?
– Which communication channels are used?
– What are the commonalities?
– What are the differences?
Thanks to Patricia Mendez Lorenzo, Roberto Santinelli and Andrea Sciabà + many others from the experiments

CMS Computing Operations
Computing Shift Person (CSP), at the CMS centre at CERN or FNAL
– Monitors the computing infrastructure and services, going through a checklist
– Identifies problems, triggers actions and calls
– Creates eLog reports and support tickets
– Reacts to unexpected events
Computing Run Coordinator (CRC), at CERN
– Maintains an overview of offline computing plans and status, acts as the operational link with online, keeps track of open computing issues
– Is a computing expert
Expert On Call (EOC), physically located anywhere in the world
– Highly expert in one or more aspects of the computing system (there can be more than one)
– Must be on call
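The CSP workflow above amounts to looping over a checklist of service checks and reporting anything broken. The sketch below is purely illustrative: the endpoints, the checklist entries and the reporting step are assumptions for the example, not the actual CMS shift tooling.

```python
# Illustrative sketch of a shift checklist run.
# Endpoints and checklist entries are hypothetical, not the real CMS shift tools.
import urllib.request

CHECKLIST = {
    "T0 workflows":   "https://example.cern.ch/t0-status",        # hypothetical URL
    "T1 transfers":   "https://example.cern.ch/transfer-status",  # hypothetical URL
}

def check(url: str) -> bool:
    """Return True if the status page answers, False if it should be flagged."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False

def run_shift_checklist() -> None:
    report = []
    for name, url in CHECKLIST.items():
        ok = check(url)
        report.append(f"{name}: {'OK' if ok else 'PROBLEM -> eLog entry + ticket / call expert'}")
    # In reality the shifter would post this to the eLog and open Savannah/GGUS tickets.
    print("\n".join(report))

if __name__ == "__main__":
    run_shift_checklist()
```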

CMS Computing Operations
Data Operations expert on call
– Runs the T0 workflows and the T1 transfers
– Monitors the above workflows
Time coverage
– During global runs:
  Computing Shift Person: 8-hour shifts, 16/7 coverage
  DataOps expert: 16/7 mandatory, 24/7 voluntary
– Otherwise (local runs):
  CSP: 8/5 coverage
  DataOps expert: on call only

LHCb Computing Operations
Grid Shifters (a.k.a. production shifters)
– Run production and data handling activities
– Identify and escalate problems
– Have some not-so-basic knowledge of Grid services and the LHCb framework
– See the tick list (PDF) for more information
Grid Expert on call
– Addresses problems
– Defines/improves operational procedures
Production Manager (based at CERN)
– Organizes the overall production
DIRAC developers/experts
– Dedicate a fraction of their time to running Grid operations
All Grid operations are run from CERN
– With the exception of some contact persons at T1s, whose role also fits one of the above

LHCb Time Coverage
For more information, please check the production operations web page
While the LHC is down: decided to move to one shifter during working hours

ALICE Computing Operations
ALICE computing operations are a joint effort between:
– The ALICE core offline team running ALICE operations, centralized at CERN
– WLCG ALICE experiment support, i.e. people offering Grid expertise to ALICE
Production manager organizing the overall activity
– With workflow and component experts behind: data expert, workload expert, AliEn expert, etc.
Offline shifts in the ALICE control room (P2)
– Support the central Grid services and management tasks:
  RAW data registration (T0) and replication to T1s
  Conditions data gathering, storage and replication
  Quasi-online first-pass reconstruction at T0, and asynchronous second pass at T1s
  ALICE central services status
  ALICE site services (VO-box/WMS/storage) status

ALICE Time Coverage
Offline shifts 24/7 during data taking
First-line support at CERN provided by IT/GS
Site support is tiered and assured by regional experts
– One per country/region, in contact with site experts
– Supported by the core offline team and/or by the WLCG experts for high-level or complex Grid issues
– Support at T2 sites is also very important

ATLAS Computing Operations
ATLAS computing shift at P1: 24(16)/7 during data taking
– T0 shifter: monitors data collection and recording from P1 to T0, and first processing at T0
– Distributed computing shifter: monitors T0-T1 and T1-T1 data distribution
– Database shifter
ATLAS Distributed Computing Shifts (ADCoS)
– Several levels of expertise: Trainee, Senior, Expert, Coordinator
– Monitor Monte Carlo production and T2 transfer activities
ATLAS expert on-call: 24/7
– Offers expertise for data distribution activities
Developers and single-component experts: best effort
– Offer third-level support

ADCoS Time Coverage
Europe: 5 experts + 10 seniors + 5 trainees
Asia: 4 seniors + 1 trainee
America: 2 experts + 5 seniors + 3 trainees
Covering 24h/day and 6 days/week, with people in three time zones (no need for night shifts)
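To make the "no night shifts" point concrete, here is a minimal sketch of how three regional daytime blocks tile a 24-hour UTC day. The shift boundaries are assumptions chosen for illustration; the real ADCoS rota may differ.

```python
# Illustrative only: assumed local daytime shift blocks expressed in UTC hours.
# The actual ADCoS shift boundaries may differ.
shifts_utc = {
    "Asia":    range(0, 8),    # e.g. 08:00-16:00 local around UTC+8
    "Europe":  range(8, 16),   # e.g. 09:00-17:00 local around UTC+1/+2
    "America": range(16, 24),  # e.g. 10:00-18:00 local around UTC-6
}

covered = set()
for region, hours in shifts_utc.items():
    covered.update(hours)

# The union of the three daytime blocks covers the full UTC day.
print(sorted(covered) == list(range(24)))  # True
```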

CMS Communication Channels
eLog (using the DAQ eLog + FNAL eLog; will have a dedicated CERN box)
“Computing plan of the day” (by the CRC)
AIM accounts for shifters
Savannah
– + GGUS for EGEE sites
Sites → Operations: Savannah + HN
Operations → Sites: Savannah, GGUS (+ HN)
Users → Operations: CMS user support (Savannah + …)

LHCb Communication Channels
Internally to LHCb:
– Elog book
– 14x7 expert cell-phone number
– Daily meeting (14:30-15:??)
– Mailing lists: one for ops matters, one for dev matters, plus a mailing list for each contact
Reaching out to services and sites:
– GGUS and/or Remedy; ALARM tickets just for test, TEAM tickets not extensively used yet
– WLCG daily and weekly meetings
– IT/LHCb coordination meeting, SCM meeting
– Higher-level meetings (GDB/MB)
– Local contact person and central Grid coordinator, useful for speeding up the resolution of problems
Being reached by users and sites:
– Support unit defined in GGUS
– Mailing lists
– Contact persons acting as liaison/reference for many site admins and service providers

ALICE Communication Channels
Internal ALICE communication
– Mailing list
– ALICE-LCG-EGEE Task Force
Communication with users and user support
– Mailing list for operational problems and Savannah tracker for bugs
– Monthly User Forums (EVO) for dissemination of new Grid-related information and analysis news, plus monthly Grid training for new users
Communication with sites and Grid operations support
– Task force mailing list for operational problems
– GGUS
– Daily WLCG ops meetings
– Weekly ALICE-LCG task force meetings
– Dedicated contacts with many sites

ATLAS Communication Channels
Internal communication
– ADCoS ELOG + T0 ELOG + ELOG
– Savannah for DDM problem tracking
Communication with sites
– Mainly GGUS: TEAM tickets for all shifts + ALARM tickets for a restricted list of experts
– Support mailing lists, mostly for CERN (CASTOR, FTS, LFC)
– Cloud mailing lists, informational only
– Many sites read the ELOG
– No clear site-to-ATLAS channel: there is an ATLAS operations mailing list, but something better should be devised
Communication with users
– Mostly HN for operations-to-users
– GGUS + Savannah for users-to-operations
… and meetings: daily WLCG meeting, weekly ATLAS ops
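One way to read the channel matrix above is as a simple routing rule per direction and severity. The toy function below uses the channel names from the slide; the selection logic itself is an assumption for the sketch, not an ATLAS procedure.

```python
# Toy channel-selection sketch based on the ATLAS channel list above.
# The mapping logic is an assumption for illustration, not an ATLAS procedure.
def pick_channel(direction: str, critical: bool = False) -> str:
    if direction == "shift-to-site":
        # Shifters use TEAM tickets; ALARMs are reserved for a restricted list of experts.
        return "GGUS ALARM ticket" if critical else "GGUS TEAM ticket"
    if direction == "operations-to-users":
        return "HN (HyperNews) post"
    if direction == "users-to-operations":
        return "GGUS or Savannah ticket"
    return "ATLAS operations mailing list"

print(pick_channel("shift-to-site"))                 # GGUS TEAM ticket
print(pick_channel("shift-to-site", critical=True))  # GGUS ALARM ticket
print(pick_channel("users-to-operations"))           # GGUS or Savannah ticket
```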

Conclusions (I)
Experiment operations rely on a multilevel operation mode:
– First line: shift crew
– Second line: experts on-call
– Third line: developers, not necessarily on-call
Experiment operations are strongly integrated with WLCG operations and Grid service support:
– Expert support
– Escalation procedures, especially for critical or long-standing issues; incident post-mortems
– Communications and notifications; I personally like the daily 15:00 meeting
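As a purely illustrative sketch of the three-tier model above: the thresholds and the mapping below are assumptions chosen for the example, not WLCG or experiment policy.

```python
# Illustrative three-tier escalation mirroring the model above.
# Thresholds and labels are assumptions for the sketch, not WLCG/experiment policy.
def escalation_tier(hours_unresolved: float, is_critical: bool) -> str:
    """Pick the support tier for an open issue (illustrative thresholds)."""
    if hours_unresolved > 24:
        return "third line: developers / component experts (best effort)"
    if is_critical or hours_unresolved > 4:
        return "second line: expert on-call (e.g. GGUS ALARM for critical services)"
    return "first line: shift crew (TEAM ticket, eLog entry)"

print(escalation_tier(1, False))   # first line: shift crew (TEAM ticket, eLog entry)
print(escalation_tier(8, False))   # second line: expert on-call (...)
print(escalation_tier(30, False))  # third line: developers / component experts (best effort)
```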

Conclusions (II)
ATLAS and CMS rely on a more distributed operation model:
– Worldwide shifts and experts on call, with central coordination always at CERN
– Possibly due to the geographical distribution of partner sites, especially in the US and Asia regions
All experiments recognize the importance of experiment-dedicated support at sites:
– CMS can rely on contacts at every T1 and T2
– ATLAS and ALICE can rely on contacts per region/cloud: contacts at all T1s, usually dedicated, and some dedicated contacts also at some T2s
– LHCb can rely on contacts at some T1s