1
24x7 support status and plans at PIC
Gonzalo Merino
WLCG MB, 13-10-2006
2
Basic Infrastructure
Power supply
  – 200 kVA UPS
  – 500 kVA diesel generator
300 kW air conditioning
  – Survived last summer, which was exceptionally hot
Network
  – Spanish NREN (RedIRIS), same level of support as GÉANT
  – Support at the level of 24x7 (emergency telephone)
3
High Availability in Critical Servers
Today many servers are still running on “WN-like” h/w
  – Many new services in the last years
  – Urgency to deploy/test/run them
Currently moving critical services to a standardized “server-like” building block h/w
  – Dual power supply
  – Mirrored system disk; high-quality standard HDs, hot swappable
  – Dual ethernet (using 2 separate switches)
4
High Availability in Critical Servers
Basic infrastructure
  – DNS:
      Use secondary server in case of primary failure
      h/w: move to robust platform in the near future
Databases
  – FTS (Oracle) and LFC (MySQL):
      RAID1 system and DB disk
      Regular hot backup (FTS: 24h; LFC: 1h)
      h/w: move to robust platform in the near future
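The hot-backup intervals quoted above (FTS: 24h; LFC: 1h) bound how much data could be lost if a database fails just before its next backup. The following is a minimal sketch of that reasoning, not a description of the scripts in use at PIC: the function names, the dictionary layout and the 10-minute slack are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Backup intervals as quoted on the slide; the dict itself is illustrative.
BACKUP_INTERVAL = {
    "FTS": timedelta(hours=24),
    "LFC": timedelta(hours=1),
}


def worst_case_loss(service: str) -> timedelta:
    """Upper bound on lost updates if the DB fails just before the next backup."""
    return BACKUP_INTERVAL[service]


def backup_is_stale(service: str, last_backup: datetime, now: datetime,
                    slack: timedelta = timedelta(minutes=10)) -> bool:
    """True when the latest backup is older than its interval plus some slack.

    The slack value is an assumption; it only avoids flagging a backup
    that is still running.
    """
    return now - last_backup > BACKUP_INTERVAL[service] + slack
```

A check like `backup_is_stale` could feed an alarm, so that a missed backup is noticed before it is needed.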
5
High Availability in Critical Servers
Storage
  – Castor:
      Still using castor1 in production. Servers are not HA. Need to recover from backup in case of disaster.
      Now migrating to castor2. Production servers will be deployed on reliable h/w and (as much as possible) in HA configuration.
  – dCache:
      Core services already deployed on 5 servers with reliable h/w
      Deployment schema already has some HA
        – 2 servers for PnfsManager (PostgreSQL replicated with Slony)
        – 2 servers for dcache-core services
        – 1 server for SRM service and its associated PostgreSQL DB
6
Monitoring: sensors
Currently using
  – Nagios: for alarm handling
      Operator also watches the SAM monitoring pages. Currently in the process of interfacing these as a local Nagios alarm
  – Ganglia: for monitoring how metrics evolve over time
Plan to evaluate other tools, like lemon, with integrated capabilities and the possibility of archiving the full monitoring history.
Missing a dashboard that lets the MoD check the global status
  – Planning to create one that integrates the different views
  – Interested in sharing information with other sites
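Interfacing SAM results as a local Nagios alarm amounts to translating each SAM test result into the state codes that Nagios plugins return (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal sketch of such a translation is given below. The SAM status strings and the function names are hypothetical, since the slide does not say how SAM results are obtained or encoded.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = ("OK", "WARNING", "CRITICAL", "UNKNOWN")

# Hypothetical SAM result strings; the real SAM vocabulary may differ.
SAM_TO_NAGIOS = {"ok": OK, "warn": WARNING, "error": CRITICAL}


def nagios_state(sam_status: str) -> int:
    """Map a SAM result string to a Nagios state; anything unrecognised is UNKNOWN."""
    return SAM_TO_NAGIOS.get(sam_status.strip().lower(), UNKNOWN)


def plugin_result(service: str, sam_status: str):
    """Return (exit_code, status_line) in the shape a Nagios plugin reports."""
    code = nagios_state(sam_status)
    line = f"SAM {service} {LABELS[code]} - last SAM result: {sam_status}"
    return code, line
```

A wrapper script would print the status line and exit with the returned code, which is how Nagios reads a plugin's state.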
7
Monitoring: actions
Two engineers from a collaborating company (TID) are developing INGRID
INGRID:
  – framework for implementing an “expert system” that takes recovery actions depending on the alarms raised by the given services
Not yet in production. Plan to deploy it for the most critical services by 2007.
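The idea of an “expert system” that picks recovery actions from service alarms can be illustrated with a small rule table. This is a sketch under stated assumptions and says nothing about how INGRID is actually designed: the rule keys, the action names and the host names are all invented for illustration, and the actions only return a description instead of touching any service.

```python
from typing import Callable, Dict, List, Tuple

Action = Callable[[str], str]


# Hypothetical recovery actions; each returns a description of what it would do.
def restart_service(host: str) -> str:
    return f"restart service on {host}"


def failover_to_secondary(host: str) -> str:
    return f"fail over from {host} to secondary"


def notify_mod(host: str) -> str:
    return f"notify Manager on Duty about {host}"


# (service, alarm) -> ordered list of actions. The keys are made up.
RULES: Dict[Tuple[str, str], List[Action]] = {
    ("dns", "no_response"): [failover_to_secondary, notify_mod],
    ("srm", "process_down"): [restart_service],
}


def recovery_plan(service: str, alarm: str, host: str) -> List[str]:
    """Look up the actions for an alarm; unknown alarms go to a human."""
    actions = RULES.get((service, alarm), [notify_mod])
    return [action(host) for action in actions]
```

Falling back to a notification for any alarm without a rule keeps the automation conservative: only failures with a known recovery are handled without an operator.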
8
Manager on Duty
In charge of:
  – Monitoring: support mailing list + alarms for critical services
  – Redirecting issues to relevant experts
  – Tracking the problem until its resolution
      Internal ticketing system in place for follow-up, also used as a “knowledge database”
  – Contacting the user back
  – Writing a daily logbook/report with the main incidents
9
Manager on Duty
Pool of 7 people (will be 10 in 2007), weekly shifts (Wed-Wed)
Today: MoD only active during working hours
24x7 Plan
  – Implement SMS service for critical service alarms
  – ADSL@home for all PIC employees
  – MoD on-call during non-working hours
      Will act as 1st line support for alarms. Will be able to call 2nd line expert for escalation if needed.
      On-call system being developed now (formal issues with contracts, paid extra hours vs. extra holidays?, voluntary scheme)
  – Plan to finalise definition of 24x7 procedures by Dec-2006 and start operating them by March-2007.
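Weekly Wednesday-to-Wednesday shifts over a fixed pool can be expressed as a simple rotation. The sketch below shows one way to compute who is MoD on a given day. The engineer names, the rotation start date and the choice to give the whole handover Wednesday to the incoming MoD are assumptions, not details from the slides.

```python
from datetime import date, timedelta

# Hypothetical pool of 7 engineers; the names are placeholders.
POOL = ["eng1", "eng2", "eng3", "eng4", "eng5", "eng6", "eng7"]

# Assumed start of the rotation; 11 Oct 2006 was a Wednesday.
ROTATION_START = date(2006, 10, 11)


def shift_start(day: date) -> date:
    """Return the Wednesday on or before `day` (Monday is 0, so Wednesday is 2)."""
    return day - timedelta(days=(day.weekday() - 2) % 7)


def manager_on_duty(day: date, pool=POOL) -> str:
    """Pick the MoD for `day` by counting whole weeks since the rotation start."""
    weeks = (shift_start(day) - ROTATION_START).days // 7
    return pool[weeks % len(pool)]
```

Growing the pool to 10 people only changes the list; the same function then cycles over ten weeks instead of seven.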
10
Summary
We are not planning to have staff on site 24x7 => emphasis put on:
  – Deploying services in a reliable/robust way
  – Monitoring + automating recovery actions as much as possible
Pool of engineers taking Manager on Duty shifts
  – will evolve to cover non-working hours through an on-call schema
We are clearly not there yet, but are targeting to have it by end of Q1-2007.