Clustering Technology For Fault Tolerance
  Jim Gray
  Microsoft Research
  http://www.research.Microsoft.com/~Gray
What is Wolfpack?
  A consortium of 60 HW & SW vendors (everybody who is anybody)
  A set of APIs for clustering and fault tolerance
  An enhancement to NT™ Server (in beta test)
  Key concepts
    System: a particular node
    Cluster: a collection of systems working together
    Resource: a hardware or software module
    Resource dependency: one resource needs another
    Resource group: fails over as a unit; dependencies do not cross group boundaries
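The key concepts above can be sketched as a small data model. This is a minimal illustration in Python, not the Wolfpack API: the class names `Resource` and `ResourceGroup`, their fields, and the methods `crossing_dependencies` and `start_order` are all hypothetical names chosen for this example. It shows one way to check the rule that dependencies do not cross group boundaries, and to derive a start order in which each resource comes up after the resources it needs.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Resource:
    """A hardware or software module (disk, IP address, service, ...)."""
    name: str
    depends_on: list[str] = field(default_factory=list)  # names of other resources


@dataclass
class ResourceGroup:
    """The unit of failover: all member resources move together."""
    name: str
    owner: str = ""  # system (node) currently hosting the group
    resources: dict[str, Resource] = field(default_factory=dict)

    def add(self, resource: Resource) -> None:
        self.resources[resource.name] = resource

    def crossing_dependencies(self) -> list[tuple[str, str]]:
        """Return (resource, dependency) pairs that point outside this group."""
        return [
            (res.name, dep)
            for res in self.resources.values()
            for dep in res.depends_on
            if dep not in self.resources
        ]

    def start_order(self) -> list[str]:
        """Order resources so each one starts after everything it depends on."""
        if self.crossing_dependencies():
            raise ValueError("dependencies must not cross group boundaries")
        graph = {res.name: res.depends_on for res in self.resources.values()}
        return list(TopologicalSorter(graph).static_order())


# Hypothetical example: a web-server group hosted on one node of a cluster.
web = ResourceGroup("web", owner="node-a")
web.add(Resource("disk"))
web.add(Resource("ip-address"))
web.add(Resource("net-name", depends_on=["ip-address"]))
web.add(Resource("web-server", depends_on=["disk", "net-name"]))
print(web.start_order())
```

Failing the group over then amounts to changing `owner` to a surviving system and restarting the resources in `start_order()`; the whole group moves, so no dependency is ever left pointing at a resource on another node.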
What Wolfpack Supports in V1
  Two-node failover (twin-tail SCSI)
  Apps: File, Print, web server, IP address, Net Name
  Most of Microsoft BackOffice (SQL, Exchange, Viper, Falcon,…)
  Oracle
  SAP
  many others
  Easy to program, operate, use
Cluster Advantages
  Clients and servers made from the same stuff
  Inexpensive: built with commodity components
  Fault tolerance: spare modules mask failures
  Modular growth: grow by adding small modules
  Parallel data search: use multiple processors and disks
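The parallel data search point can be illustrated with a short sketch. It assumes the data is already split into partitions, one per disk or node, and uses Python threads as stand-ins for the cluster's processors; the function names `search_partition` and `parallel_search` are hypothetical. In a real cluster each partition would be scanned by a separate machine, so this shows only the shape of the idea: scan every partition independently, then merge the hits.

```python
from concurrent.futures import ThreadPoolExecutor


def search_partition(partition, predicate):
    """Scan one partition (one disk's worth of records) for matching records."""
    return [record for record in partition if predicate(record)]


def parallel_search(partitions, predicate):
    """Scan all partitions concurrently and merge the hits in partition order."""
    workers = max(1, len(partitions))  # one worker per partition, at least one
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_partition = pool.map(
            search_partition, partitions, [predicate] * len(partitions)
        )
        return [hit for hits in per_partition for hit in hits]


# Hypothetical data: three partitions, as if spread over three disks.
partitions = [[1, 5, 9], [2, 6, 10], [3, 7, 11]]
print(parallel_search(partitions, lambda record: record > 6))
```

Modular growth follows the same pattern: adding a node adds one more partition and one more worker, and the search code itself does not change.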
What Happens When a Component Fails?
  Redundant disk or path: configure around it
  Non-redundant software: restart
  Non-redundant hardware: migrate software to surviving nodes
  Fault detection: 1 ms to 10 sec
  Failover: 0.1 sec to 1 min
  This is standard in Tandem, Teradata, VMScluster
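The detect-then-migrate sequence above can be sketched as a heartbeat monitor plus a placement update. This is an assumption-laden illustration rather than how Wolfpack, Tandem, Teradata, or VMScluster implement it: the names `HeartbeatMonitor` and `fail_over` are hypothetical, heartbeats are one common detection technique that the slide does not name, and the 1-second timeout is simply a value picked from the slide's 1 ms to 10 sec detection range. Time is passed in explicitly so the example runs deterministically.

```python
class HeartbeatMonitor:
    """Declare a node failed when no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of its last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def failed_nodes(self, now):
        return [
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        ]


def fail_over(placement, failed, survivors):
    """Return a new placement with groups on failed nodes moved to survivors."""
    if not survivors:
        raise RuntimeError("no surviving node to fail over to")
    new_placement = dict(placement)
    moved = 0
    for group, node in placement.items():
        if node in failed:
            # Spread the migrated groups round-robin over the surviving nodes.
            new_placement[group] = survivors[moved % len(survivors)]
            moved += 1
    return new_placement


# Hypothetical two-node cluster: node-a stops sending heartbeats after t = 0.
monitor = HeartbeatMonitor(timeout=1.0)
monitor.heartbeat("node-a", now=0.0)
monitor.heartbeat("node-b", now=0.0)
monitor.heartbeat("node-b", now=1.5)

failed = monitor.failed_nodes(now=2.0)
placement = {"sql": "node-a", "web": "node-b"}
survivors = [n for n in monitor.last_seen if n not in failed]
print(failed, fail_over(placement, failed, survivors))
```

The timeout sets the trade-off behind the slide's detection range: a short timeout detects failures quickly but risks declaring a slow node dead, while a long one delays the start of failover.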