Service Operations at the T0/T1 for the ALICE Experiment

Slides:



Advertisements
Similar presentations
Preservasi Informasi Digital.  It will never happen here!  Common Causes of Loss of Data  Accidental Erasure (delete, power, backup)  Viruses and.
Advertisements

Data Center and Network Planning and Services Mark Redican IET CCFIT Update Feb 13, 2012.
CERN IT Department CH-1211 Genève 23 Switzerland t Next generation of virtual infrastructure with Hyper-V Michal Kwiatek, Juraj Sucik, Rafal.
CERN IT Department CH-1211 Genève 23 Switzerland t Some Hints for “Best Practice” Regarding VO Boxes Running Critical Services and Real Use-cases.
Disaster Recovery as a Cloud Service Chao Liu SUNY Buffalo Computer Science.
Chapter 10 : Designing a SQL Server 2005 Solution for High Availability MCITP Administrator: Microsoft SQL Server 2005 Database Server Infrastructure Design.
Maintaining a Microsoft SQL Server 2008 Database SQLServer-Training.com.
Chapter 16 Designing Effective Output. E – 2 Before H000 Produce Hardware Investment Report HI000 Produce Hardware Investment Lines H100 Read Hardware.
CERN IT Department CH-1211 Genève 23 Switzerland t EIS section review of recent activities Harry Renshall Andrea Sciabà IT-GS group meeting.
Preventing Common Causes of loss. Common Causes of Loss of Data Accidental Erasure – close a file and don’t save it, – write over the original file when.
Data Synchronization by Dennis Sparrow. Infrastructure  New York Board of Trade (NYBOT) maintains 3 sites –Primary Trading Floor –Primary Data Center.
Status of the production and news about Nagios ALICE TF Meeting 22/07/2010.
WLCG Service Requirements WLCG Workshop Mumbai Tim Bell CERN/IT/FIO.
CCRC’08 Weekly Update Jamie Shiers ~~~ LCG MB, 1 st April 2008.
CERN Physics Database Services and Plans Maria Girone, CERN-IT
HalFILE 2.1 Network Protection & Disaster Recovery.
CMS Proposal for a Maintenance & Operation agreement with EN/CV/DC for fluorocarbon detector cooling circuits v.2 P. Tropea For the CMS experiment.
Handling ALARMs for Critical Services Maria Girone, IT-ES Maite Barroso IT-PES, Maria Dimou, IT-ES WLCG MB, 19 February 2013.
GGUS summary (4 weeks) VOUserTeamAlarmTotal ALICE1102 ATLAS CMS LHCb Totals
Infrastructure as code. “Enable the reconstruction of the business from nothing but a source code repository, an application data backup, and bare metal.
Computer Centre Shutdown Post-Mortem Tim Smith FIO/IS (Presented at HEPiX by A.Silverman)
High Availability Technologies for Tier2 Services June 16 th 2006 Tim Bell CERN IT/FIO/TSI.
CERN IT Department CH-1211 Genève 23 Switzerland t Experiment Operations Simone Campana.
Update of SAM Implementation ALICE TF Meeting 18/10/07.
WLCG Service Report ~~~ WLCG Management Board, 16 th September 2008 Minutes from daily meetings.
CNAF Database Service Barbara Martelli CNAF-INFN Elisabetta Vilucchi CNAF-INFN Simone Dalla Fina INFN-Padua.
OCLC ILLiad hosted service March 18 th, 2010 The who, what, why, when and how of ILLiad hosting.
Current status WMS and CREAM CE deployment Patricia Mendez Lorenzo ALICE TF Meeting (CERN, 02/04/09)
LCG Tier1 Reliability John Gordon, STFC-RAL CCRC09 November 13 th, 2008.
WLCG Service Report ~~~ WLCG Management Board, 10 th November
GRID interoperability and operation challenges under real load for the ALICE experiment F. Carminati, L. Betev, P. Saiz, F. Furano, P. Méndez Lorenzo,
Reaching MoU Targets at Tier0 December 20 th 2005 Tim Bell IT/FIO/TSI.
SAM architecture EGEE 07 Service Availability Monitor for the LHC experiments Simone Campana, Alessandro Di Girolamo, Nicolò Magini, Patricia Mendez Lorenzo,
DB Questions and Answers open session (comments during session) WLCG Collaboration Workshop, CERN Geneva, 24 of April 2008.
Disaster Recovery Prepared by Mark Lomas Mark Lomas IT Infrastructure Consultant Storage & Servers.
CommVault Architecture
Committee – June 30, News from CERN Erik Gottschalk June 30, 2005.
GGUS summary ( 9 weeks ) VOUserTeamAlarmTotal ALICE2608 ATLAS CMS LHCb Totals
Servizi core INFN Grid presso il CNAF: setup attuale
Maintaining a Microsoft SQL Server 2008 Database
Information Systems Security
WLCG IPv6 deployment strategy
Planning for Application Recovery
Server Upgrade HA/DR Integration
Performance monitoring framework for the technical infrastructure
N-Tier Architecture.
LCG 3D Distributed Deployment of Databases
Cross-site problem resolution Focus on reliable file transfer service
Status of the Production
Summary on PPS-pilot activity on CREAM CE
Database Services at CERN Status Update
Data Management and Database Framework for the MICE Experiment
Patricia Méndez Lorenzo ALICE Offline Week CERN, 13th July 2007
Castor services at the Tier-0
1 VO User Team Alarm Total ALICE ATLAS CMS
IT Services Portfolio Todd Endicott – Senior Network and System Engineer Mary Monroe – Implementation Engineer.
A Messaging Infrastructure for WLCG
ALICE – FAIR Offline Meeting KVI (Groningen), 3-4 May 2010
Take the summary from the table on
SAM Alarm Triggering and Masking
Disaster Recovery Services
Leigh Grundhoefer Indiana University
Klopotek is transitioning to a Global Organization
Pierre Girard ATLAS Visit
Kashif Mohammad Deputy Technical Co-ordinator (South Grid) Oxford
EGEE Operation Tools and Procedures
IT OPERATIONS Session 7.
Deploying Production GRID Servers & Services
The LHCb Computing Data Challenge DC06
Designing Database Solutions for SQL Server
Presentation transcript:

Service Operations at the T0/T1 for the ALICE Experiment Patricia Méndez Lorenzo IT/ES

ALICE Critical services: update Rank Severity Comments CASTOR2+xrootd@T0 10 3 If restored within 2h no loss of jobs Site VOBOXES 7 2 IAliEnv2.18 establishes a failover mechanism. In addition most of the sites are providing at least 2 voboxes Transfer T0 T1 infras. Affects Pass2 reconstruction MSS@T1 5 1 Files are replicated to any of the available T1 sites Site WM system LCG-CE/CREAM-CE duality at all sites Proof@CAF Specially relevant during daytime (DB Service still not included, waiting for ALICE input) ALICE retains the highest criticality for CASTOR@T0 only Ranking in VOBOXEs and Site WM system decreases thanks to the failover mechanisms defined by the experiment Best effort support desirable Services ran by IT/PES/PS (MyProxy, BDII, FTS, CE, WMS): Support declared on best effort basis: NO ADDITIONAL SUPPORT REQUIRED

Support/Operations infrastructure for the T0

CASTOR@T0: Interventions End of 2009, ALICE DAQ group (responsible for the RAW data transfers to CC) established with FIO the following breakdown procedure: ALICE control room operator together with on-call expert confirms the nature of the breakdown Specific ALICE test suite identifies all possible sources of errors If the error is associated with the data transfer to the CC (CASTOR), the ALICE operator will call the CC, identifying himself as “ALICE control room operator”

CASTOR@T0: Interventions The ALICE operator describes briefly the problem and sends an email to the following lists: alice-operator-alarm@cern.ch (CC operator email included), alice-shift-alarms@cern.ch, alice-datesupport@cern.ch SUBJECT: CASTOR@T0 ALICE => Computer Center problem report from the ALICE CR operator BODY: Details of the CASTOR service failing, description of the problem, phone number of the ALICE CR operator and trace of the test transfer in attachment CC operator (experts taking charge of issue) can call back the ALICE CR operator for identification purposes, updates or further information as required CASTOR Requirements

CASTOR@T0: Interventions ALICE CR operator console is on the technical network: allows e-mails but no web access On-call experts can evaluate the problem without accessing directly the computers in the ALICE CR (firewall protected) ALICE has defined the above operation protocol for Severity 3 category Lower severities (1-2) will be followed during working hours basis through GGUS, Remedy, ops. meeting

DB@T0: Interventions DCS (Detector Control System) running an ORACLE cluster with: 6 nodes, SAN infrastructure, 3 RAID arrays for live data and one array for backup IT service (piquet service coverage): setup of the DB cluster monitoring, maintenance, patching, backups, disaster recovery Consultancy IT DB supports the Oracle DB (on-line DB of ALICE) for production database backup purposes ALICE is responsible of the hardware procurement ALICE provides an on-call service (1/2FTE) Further contacts are now required between ALICE and the IT/DB group to tune the current operation procedures