1 LHCC RRB SG 16 Sep P. Vande Vyvre CERN-PH On-line Computing M&O LHCC RRB SG 16 Sep 2004 P. Vande Vyvre CERN/PH for 4 LHC DAQ project leaders

Introduction
Questions raised by the RRB Scrutiny Group:
– System managers profiles
– Number of system managers
– M&O budget category
– Replacement profile of computer/network equipment
Common answer from the 4 LHC experiments
See also the presentation by A. Ceccucci to the RRB SG in April 2003 on M&O for Online Computing

System managers profiles
Category | Function | Qualification
Level-1 | React to alarms; follow predefined procedures; 24/7 operational | Experience and knowledge of computers
Level-2 | Install/update systems/services; configure/monitor; piquet service | Same as above + 2 years experience, Unix shell scripts, etc.
Supervisor | Overall supervision and direction of these tasks | Informatics professional
Continuity is needed for Level-2 and supervisor personnel

System management effort (1)
Estimates based on LCG guidelines: fixed number of boxes (PC, network switch, storage element) per system manager
Differences between online and offline systems:
– Wide variety of equipment used as a single system
– Various PCs with different configurations (trigger farms, dataflow, control, monitoring, file servers)
– Variety compounded by staged procurements
– Very large and highly loaded network (e.g. event building)
– Failure of any part of the online system either reduces data-taking efficiency (e.g. loss of an HLT sub-farm) or interrupts data taking (failure of the central controller), i.e. we have to run a complete, coherent system
Dedicated team with appropriate skills needed to ensure reliability and optimal capacity of the online systems

System management effort (2)
Manpower from the collaborations?
– LHC collaborations are very large, but attempts to find suitably qualified effort for system management have failed to meet even today's needs
– Most people (physicists, engineers) do not have the right profile
– Institutes that have people with the proper qualifications are not prepared to locate them at CERN for adequate periods
Full operation
– 24/7 cover at Level-1, normal working hours at Level-2 + service piquet
– At least 5 people at Level-1 and 5 people at Level-2, reduced by some overlap
– Shift crew will contribute to Level-1
Provisional estimates, to be adapted (2008-9) following experience of running the system and a better knowledge of the system reliability
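The figure of at least 5 people for 24/7 Level-1 cover is consistent with simple shift arithmetic. A rough sketch, assuming a 40-hour working week and about 85% availability after leave and training; both values are illustrative assumptions, not figures from the experiments:

```python
import math

HOURS_PER_WEEK = 24 * 7  # 168 hours of continuous cover per week


def people_for_continuous_cover(weekly_hours_per_person=40.0, availability=0.85):
    """Minimum head count so that one person is on duty at all times.

    weekly_hours_per_person and availability are assumed values,
    chosen only to illustrate the estimate.
    """
    effective_hours = weekly_hours_per_person * availability
    return math.ceil(HOURS_PER_WEEK / effective_hours)


level1_minimum = people_for_continuous_cover()
print(level1_minimum)
```

Under these assumptions the result matches the minimum quoted above; the contribution of the shift crew and the Level-1/Level-2 overlap would lower the dedicated head count.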

System management effort (3)
Total effort in FTEs, shown as total (Level-1 and Level-2 + Supervisor)
ALICE: 2 (1+1), 3 (2+1), 4 (3+1), 5 (4+1)
ATLAS: 2 (1+1), 3 (2+1), 5 (4+1), 8 (6+2), 9 (7+2)
CMS: 1.5 (1+0.5), 3 (2+1), 7 (5+2), 8 (6+2), 10 (8+2)
LHCb: 2 ( ), 3 ( ), 5 (4+1), 6 (5+1), 6 (5+1)

M&O budget category
M&O A
– Request of CERN management and RRB
– No other identified source

Replacement of equipment (1)
Equipment: PCs, network, and storage used for dataflow and online trigger
Motivations:
– Reliability of equipment as it ages
– Maintainability after a few years (3-year warranty)
– Suitability of old equipment to follow the evolution of the operating system and to work with new equipment
– Need to follow Operating System (OS) evolution:
  - Security patches
  - New PCs (staged installation) not supported by old OS versions
  - Old OS versions not supported
  - Code will continue to be developed with dependencies on the OS and compiler versions
  - Online trigger code is based on, or uses, offline code developed for the current OS version

Replacement of equipment (2)
Categories
– Disk and fileservers: 3 years (lower reliability and very rapid evolution)
– PCs: 4 years
  - Replacement cost will not directly follow Moore's Law: I/O performance limitations, and new multi-core architectures might require a major increase in system memory
– Network
  - Central switch: 5 years (= period of maintenance by manufacturer)
  - Smaller peripheral switches: 4 years (shorter warranty but less critical)
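The replacement periods above can be turned into a steady-state yearly replacement budget. A minimal sketch, assuming each category is renewed evenly over its replacement period at constant prices (a simplification, since replacement cost is not expected to follow Moore's Law directly); the installed values in the example are placeholders in arbitrary units, not figures from the experiments:

```python
# Replacement periods in years, as listed in the categories above.
REPLACEMENT_PERIOD_YEARS = {
    "disk_and_fileservers": 3,
    "pcs": 4,
    "central_switch": 5,
    "peripheral_switches": 4,
}


def annual_replacement_cost(installed_value):
    """Yearly cost per category: installed value divided by replacement period."""
    return {
        category: value / REPLACEMENT_PERIOD_YEARS[category]
        for category, value in installed_value.items()
    }


# Placeholder installed values (arbitrary units), for illustration only.
example_installed = {
    "disk_and_fileservers": 300.0,
    "pcs": 1000.0,
    "central_switch": 500.0,
    "peripheral_switches": 200.0,
}

yearly = annual_replacement_cost(example_installed)
total_yearly = sum(yearly.values())
print(yearly, total_yearly)
```

Shorter replacement periods weigh more heavily per unit of installed value, which is why disk and fileservers contribute disproportionately in such an estimate.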

Previous practice
LEP and fixed target era:
– Computers were complete systems qualified by a commercial company
– Maintenance contract paid by experiments
– System managers in experiments (some CERN staff)
– CERN had operator staff in the computing center and in groups giving support to experiments
LHC era:
– Components tested, qualified and assembled into complete systems by the experiments
– Overall system much larger and more complex than previously
– Very few operators at CERN directly employed by CERN