BEST PRACTICES FOR RELIABLE CARRIER GRADE TELEPHONY Alistair Cunningham, Integrics Ltd.
Reliability Think people and culture, not technology. Complexity is the enemy. Discipline is the answer. Management must be willing to sacrifice features. Reliability for all customers is more important than winning one new customer.
Staff Responsibilities Assign a senior engineer as system manager. System manager has ultimate responsibility for whole system. Can delegate tasks to others.
Cluster Architecture Duplicate all important functions. Use heartbeat, DRBD/GFS, application level load balancing. Remember utilities. Consistency between machines is vital. Virtual machines have more outages. Monitor all machines, services, and resources. Daily and monthly backups.
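As a rough illustration of "monitor all machines, services, and resources", the sketch below checks whether a set of TCP services accepts connections. The service names, addresses, and ports are hypothetical placeholders, and the function names are invented for this example. A real deployment would more likely use an established monitoring system, and SIP over UDP would need an application-level probe rather than a TCP connect.

```python
import socket


def service_is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def failed_services(services, probe=service_is_up):
    """Return the names of services whose probe failed.

    `services` maps a name to a (host, port) pair. `probe` is injectable
    so the logic can be exercised without touching the network.
    """
    return [name for name, (host, port) in services.items()
            if not probe(host, port)]
```

Calling `failed_services({"sip-proxy-a": ("192.0.2.10", 5060), "sip-proxy-b": ("192.0.2.11", 5060)})` from a scheduled job, and alerting when the result is non-empty, would be one simple way to use it. Listing both halves of a duplicated function separately makes a silent loss of redundancy visible.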
Upgrades and Changes Risk is unpredictable and cumulative. Many small changes are riskier than a few large changes. Test all changes on a staging machine first. Keep records of changes. Consider a change management system. Keep customizations to a minimum.
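Where a full change management system is not yet in place, keeping records of changes can start as an append-only log file. The sketch below is one possible shape, not a recommended product. The field names, the JSON Lines format, and the function names are all assumptions made for illustration.

```python
import json
from datetime import datetime, timezone


def record_change(log_path, author, machine, summary, tested_on_staging):
    """Append one change record to log_path as a JSON line and return it."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "author": author,
        "machine": machine,
        "summary": summary,
        # Hypothetical field: records whether the staging rule was followed.
        "tested_on_staging": bool(tested_on_staging),
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def read_changes(log_path):
    """Return all change records in the order they were written."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def untested_changes(log_path):
    """Return the records that were not marked as tested on staging."""
    return [c for c in read_changes(log_path) if not c["tested_on_staging"]]
```

After an outage, `read_changes` gives the system manager a time-ordered list of what changed and where, and `untested_changes` highlights the entries most worth examining first.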
Dealing with Vendors Vendors can never substitute for system manager. Give vendors access to staging machines but not production. Your staff must have debugging skills. Subscribe to security mailing lists.
Causes of Outages Most outages are caused by one of: Untested changes – use staging. Hard disks filling up – use monitoring. Power and network outages – redundancy or split cluster. Avoiding these three is usually sufficient to achieve good reliability.
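The "hard disks filling up" case lends itself to a very small check. The sketch below uses Python's standard `shutil.disk_usage`. The 80 percent threshold, the example paths, and the function names are assumptions for illustration, and the percentage may differ slightly from what `df` reports because of filesystem reserved space.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem holding `path` that is used."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a zero size; treat them as empty.
        return 0.0
    return 100.0 * usage.used / usage.total


def disks_over_threshold(paths, threshold_percent=80.0,
                         usage_fn=disk_usage_percent):
    """Return (path, percent) pairs for paths at or above the threshold.

    `usage_fn` is injectable so the logic can be tested with fixed values.
    """
    results = []
    for path in paths:
        percent = usage_fn(path)
        if percent >= threshold_percent:
            results.append((path, round(percent, 1)))
    return results
```

Running `disks_over_threshold(["/", "/var"])` on every machine at a regular interval, and alerting on any non-empty result, gives warning before a disk is actually full. Alerting well below 100 percent leaves time to act.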