24 x 7 support in Amsterdam
Jeff Templon, NIKHEF
GDB, 05 September 2006
Jeff Templon – Amsterdam 24x7 support, GDB, BNL
Main Principle: avoid needing it
- Basic Infrastructure
  - Power: on-site emergency generators
  - Network: SURFnet console staffed 24 x 7
  - Guard informs all relevant people in case of ‘calamities’
  - Real people watching all services (and support / ticket systems) closely during working hours
- Critical Servers: redundant failover
  - DNS server for farm networks
  - Databases (FTS, LFC, 3D, etc.)
  - Pnfs server for dCache
  - dCache server itself
More avoidance
- Computing services: NIKHEF and SARA share computing, hence a complete interruption of service means either the network into Amsterdam is down, or something beyond our control
- Tape Robot: dimension the incoming disk cache to hold several days of data, hence can survive a weekend without tape if need be
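
As a rough illustration of the tape-robot point: the disk cache needed is simply the incoming data rate times the outage to be bridged. The sketch below assumes a constant ingest rate; the 100 MB/s figure and the 2.5-day weekend are hypothetical numbers chosen for the example, not values from the talk.

```python
def cache_size_tb(ingest_mb_per_s: float, outage_days: float) -> float:
    """Disk cache (decimal TB) needed to absorb incoming data while tape is down.

    Assumes a constant ingest rate for the whole outage (a simplification).
    """
    seconds = outage_days * 24 * 3600
    return ingest_mb_per_s * seconds / 1e6  # MB -> TB


# Hypothetical: 100 MB/s sustained, tape unavailable for about 2.5 days
print(f"{cache_size_tb(100, 2.5):.1f} TB")  # prints "21.6 TB"
```

Sizing for "several days" rather than exactly one weekend leaves headroom for a higher-than-average rate or a slow restart on Monday.
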
Monitoring
- Create Dashboard (pieces exist)
- Pool of people who agree to watch things and alert the relevant person in case of problems; check at least once every 12 hours
- Look into a system à la IN2P3 for giving restart privileges to this team via special account and scripts
- Already have SMS service in place for some critical components
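
The "check at least once every 12 hours" rule could be enforced by a small helper that flags services nobody in the pool has looked at recently. This is only a sketch: the service names, the timestamps, and the idea of recording a last-checked time per service are assumptions for illustration, not part of the existing setup.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(hours=12)  # from the slide: check at least once every 12 hours


def overdue_services(last_checked: dict[str, datetime], now: datetime) -> list[str]:
    """Return the services whose last check is more than MAX_GAP before `now`."""
    return sorted(name for name, ts in last_checked.items() if now - ts > MAX_GAP)


# Hypothetical service names and timestamps
now = datetime(2007, 1, 15, 8, 0)
last_checked = {
    "dcache": datetime(2007, 1, 14, 18, 0),  # checked 14 hours ago
    "fts": datetime(2007, 1, 15, 1, 0),      # checked 7 hours ago
}
print(overdue_services(last_checked, now))  # prints ['dcache']
```

A list like this could feed the dashboard, or trigger the SMS service already in place for critical components.
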
Plan
- Put this into place early 2007
- No formal 24 x 7 or on-call system
- See how it goes
  - If we reach targets and don’t miss response deadlines, OK
  - If we miss targets and deadlines, start hard discussions
  - Note that 24 x 7 would depend on other pieces (like NIKHEF mail server) which themselves don’t have 24 x 7!
Open Questions
- What about dynamic redistribution from source in case of problems? Increases site load by 1 / (N(N-1)) naively
- How big is CERN’s data buffer?
- What to do with externally identified problems? GGUS will not get our on-call support number
- Cost choices: what is the cheapest road? We expect that paying staff for 24 x 7 is not the cheapest. Grid is about distribution and redundancy, we should exploit it.
- Are we making best choices? (push vs. pull?)
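
The naive 1 / (N(N-1)) figure in the first question can be reproduced under one reading, which is an assumption and not stated on the slide: N sites each take an equal 1/N share of the data, and when one site drops out its share is split evenly over the remaining N-1 sites. The value N = 10 below is purely illustrative.

```python
from fractions import Fraction


def extra_load(n_sites: int) -> tuple[Fraction, Fraction]:
    """Extra load per surviving site when one of n_sites equal-share sites drops out.

    Returns (extra as a fraction of the total data flow,
             extra relative to the site's own normal share).
    """
    if n_sites < 2:
        raise ValueError("need at least two sites to redistribute")
    normal_share = Fraction(1, n_sites)
    extra_of_total = normal_share / (n_sites - 1)          # 1 / (N(N-1))
    return extra_of_total, extra_of_total / normal_share   # second value is 1 / (N-1)


print(extra_load(10))  # prints (Fraction(1, 90), Fraction(1, 9))
```

In this reading, 1 / (N(N-1)) is measured against the total flow; relative to a site's own normal share the increase is 1 / (N-1), which is the number a site would need for dimensioning its own capacity.
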