Clustering: The Next Wave in PC Computing
Cluster Concepts 101
This section covers clusters in general; we'll get to Microsoft's Wolfpack cluster implementation in the next section.
Why Learn About Clusters?
- Today clusters are a niche Unix market, but Microsoft will bring clusters to the masses
- Last October, Microsoft announced NT clusters, SCO announced UnixWare clusters, Sun announced Solaris/Intel clusters, and Novell announced Wolf Mountain clusters
- In 1998, 2M Intel servers will ship, 100K of them in clusters
- In 2001, 3M Intel servers will ship, 1M of them in clusters (IDC's forecast)
- Clusters will be a huge market, and RAID is essential to clusters
What Are Clusters?
- A group of independent systems that function as a single system, appear to users as a single system, and are managed as a single system
- Clusters are "virtual servers"
Why Clusters?
1. Clusters improve system availability (the primary value in Wolfpack-I clusters)
2. Clusters enable application scaling
3. Clusters simplify system management
4. Clusters (with Intel servers) are cheap
Why Clusters: #1
- Clusters improve system availability
- When a networked server fails, the service it provided is down
- When a clustered server fails, the service it provided "fails over" and downtime is avoided
[Diagram: networked servers (Mail Server, Internet Server) vs. clustered servers (Mail & Internet)]
Why Clusters: #2
- Clusters enable application scaling
- With networked SMP servers, application scaling is limited to a single server
- With clusters, applications scale across multiple SMP servers (typically up to 16 servers)
Why Clusters: #3
- Clusters simplify system management
- Clusters present a single system image; the cluster looks like a single server to management applications
- Hence, clusters reduce system management costs
[Diagram: three management domains vs. one management domain]
Why Clusters: #4
- Clusters (with Intel servers) are cheap
- Essentially no additional hardware costs
- Microsoft charges an extra $3K per node: Windows NT Server is $1,000; Windows NT Server, Enterprise Edition is $4,000
- Note: proprietary Unix cluster software costs $10K to $25K per node
An Analogy to RAID
- RAID makes disks fault tolerant; clusters make servers fault tolerant
- RAID increases I/O performance; clusters increase compute performance
- RAID makes disks easier to manage; clusters make servers easier to manage
Two Flavors of Clusters
1. High availability clusters: Microsoft's Wolfpack 1, Compaq's Recovery Server
2. Load balancing clusters (a.k.a. parallel application clusters): Microsoft's Wolfpack 2, Digital's VAXclusters
- Note: load balancing clusters are a superset of high availability clusters
High Availability Clusters
- Two-node clusters (node = server)
- During normal operations, both servers do useful work
- Failover: when a node fails, its applications fail over to the surviving node, which assumes the workload of both nodes
[Diagram: Mail and Web servers; after failover one node runs Mail & Web]
High Availability Clusters (cont'd)
- Failback: when the failed node is returned to service, the applications fail back
Load Balancing Clusters
- Multi-node clusters (two or more nodes)
- Load balancing clusters typically run a single application (e.g. a database) distributed across all nodes
- Cluster capacity is increased by adding nodes (but, like SMP servers, scaling is less than linear)
[Diagram: adding a node grows throughput from 3,000 TPM to 3,600 TPM]
Load Balancing Clusters (cont'd)
- The cluster rebalances the workload when a node dies
- If different apps are running on each server, they fail over to the least busy server, or as directed by predefined failover policies (see the sketch below)
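To make the failover choice concrete, here is a minimal sketch of picking a failover target from a predefined policy, falling back to the least busy survivor. This is not MSCS code; the node names, fields, and policy format are hypothetical.

```python
# Illustrative only: choose which surviving node inherits a failed node's work.
def pick_failover_target(failed_node, nodes, policy=None):
    """Return the node that should inherit the failed node's applications."""
    survivors = [n for n in nodes if n["name"] != failed_node and n["alive"]]
    if not survivors:
        raise RuntimeError("no surviving nodes; the cluster is down")
    if policy and policy.get(failed_node) in [n["name"] for n in survivors]:
        # A predefined failover policy wins when its target is still alive.
        return next(n for n in survivors if n["name"] == policy[failed_node])
    # Otherwise fail over to the least busy survivor.
    return min(survivors, key=lambda n: n["load"])

nodes = [
    {"name": "node1", "alive": False, "load": 0.0},   # the failed node
    {"name": "node2", "alive": True,  "load": 0.7},
    {"name": "node3", "alive": True,  "load": 0.3},
]
print(pick_failover_target("node1", nodes)["name"])   # -> node3 (least busy)
```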
Two Cluster Models
1. The "shared nothing" model: Microsoft's Wolfpack cluster
2. The "shared disk" model: VAXclusters
#1. The "Shared Nothing" Model
- At any moment in time, each disk is owned and addressable by only one server
- The "shared nothing" terminology is confusing: access to the disks is shared (they sit on the same bus), but at any moment in time the disks themselves are not shared
#1. The "Shared Nothing" Model (cont'd)
- When a server fails, the disks it owns "fail over" to the surviving server, transparently to the clients
#2. The "Shared Disk" Model
- Disks are not owned by individual servers but shared by all servers
- At any moment in time, any server can access any disk
- A Distributed Lock Manager arbitrates disk access so apps on different servers don't step on one another (corrupt data); a minimal sketch follows
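The following is a single-process sketch of the lock-manager idea only; a real distributed lock manager (such as the one in VAXclusters) runs across nodes and supports many lock modes. The class and method names are invented for illustration.

```python
# Core idea: before a server touches a shared disk block it must hold the lock;
# another server is refused (or must wait) until the lock is released.
import threading

class LockManager:
    def __init__(self):
        self._locks = {}                 # (disk, block) -> owning server
        self._guard = threading.Lock()

    def acquire(self, server, disk, block):
        with self._guard:
            owner = self._locks.get((disk, block))
            if owner is None:
                self._locks[(disk, block)] = server
                return True              # lock granted
            return owner == server       # re-entrant for the owner, else refused

    def release(self, server, disk, block):
        with self._guard:
            if self._locks.get((disk, block)) == server:
                del self._locks[(disk, block)]

dlm = LockManager()
assert dlm.acquire("serverA", "disk0", 42)       # A may update block 42
assert not dlm.acquire("serverB", "disk0", 42)   # B is held off, so data stays consistent
dlm.release("serverA", "disk0", 42)
assert dlm.acquire("serverB", "disk0", 42)       # now B gets its turn
```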
Cluster Interconnect
This section is about how servers are tied together and how disks are physically connected to the cluster.
Cluster Interconnect (cont'd)
- Clustered servers always have a client network interconnect, typically Ethernet, to talk to users
- And at least one cluster interconnect to talk to the other nodes and to the disks
[Diagram: client network and cluster interconnect, with HBAs and a RAID array]
Cluster Interconnects (cont'd)
- Or they can have two cluster interconnects:
- One for nodes to talk to each other, the "heartbeat interconnect", typically Ethernet
- And one for nodes to talk to disks, the "shared disk interconnect", typically SCSI or Fibre Channel
[Diagram: heartbeat interconnect (NIC) and shared disk interconnect (HBA) to a RAID array]
Microsoft's Wolfpack Clusters
Clusters Are Not New
- Clusters have been around since 1985; most Unix systems are clustered
- What's new is Microsoft clusters
- Code-named "Wolfpack"; officially named Microsoft Cluster Server (MSCS), the software that provides clustering
- MSCS is part of Windows NT, Enterprise Server
Microsoft Cluster Rollout
- Wolfpack-I: in Windows NT, Enterprise Server 4.0 (NT/E 4.0), which also includes Transaction Server and Reliable Message Queue
- Two-node "failover cluster"; shipped October 1997
- Wolfpack-II: in Windows NT, Enterprise Server 5.0 (NT/E 5.0)
- "N"-node (probably up to 16) "load balancing cluster"; beta in 1998, shipping in 1999
MSCS (NT/E 4.0) Overview
- Two-node "failover" cluster
- "Shared nothing" model: at any moment in time, each disk is owned and addressable by only one server
- Two cluster interconnects: a "heartbeat" cluster interconnect (Ethernet) and a shared disk interconnect (any flavor of SCSI, or Fibre Channel carrying the SCSI protocol)
- Each node has a "private system disk" (boot disk)
MSCS (NT/E 4.0) Topologies
1. Host-based (PCI) RAID arrays
2. External RAID arrays
NT Cluster with a Host-Based RAID Array
- Each node has an Ethernet NIC (heartbeat), a private system disk (generally on an HBA), and a PCI-based RAID controller (SCSI or Fibre)
- Nodes share access to the data disks but do not share data
[Diagram: shared disk interconnect and "heartbeat" interconnect]
NT Cluster with a SCSI External RAID Array
- Each node has an Ethernet NIC (heartbeat)
- Multi-channel HBAs connect the boot disk and the external array
- A shared external RAID controller sits on the SCSI bus (DAC SX)
[Diagram: shared disk interconnect and "heartbeat" interconnect]
NT Cluster with a Fibre External RAID Array
- DAC SF or DAC FL (SCSI to the disks)
- DAC FF (Fibre to the disks)
[Diagram: shared disk interconnect and "heartbeat" interconnect]
MSCS -- A Few of the Details
Cluster Interconnect & Heartbeats
- Cluster interconnect: a private Ethernet between the nodes, used to transmit "I'm alive" heartbeat messages
- Heartbeat messages: when a node stops getting heartbeats, it assumes the other node has died and initiates failover
- In some failure modes, both nodes stop getting heartbeats (a NIC dies, or someone trips over the cluster cable): both nodes are still alive, but each thinks the other is dead
- Split-brain syndrome: both nodes initiate failover -- who wins? (see the sketch below)
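As a rough illustration of the heartbeat logic (the timing constants and names are invented, not MSCS internals): a node timestamps the last heartbeat it heard and presumes its peer dead after a timeout, which is exactly where the split-brain hazard comes from.

```python
# Simplified heartbeat monitor. If the peer is silent longer than the timeout,
# this node presumes the peer dead and starts failover. If only the heartbeat
# link failed, BOTH nodes reach that conclusion -- hence the quorum disk.
import time

HEARTBEAT_TIMEOUT = 5.0            # seconds of silence before "peer presumed dead" (illustrative)

class HeartbeatMonitor:
    def __init__(self):
        self.last_heard = time.monotonic()

    def on_heartbeat(self):
        self.last_heard = time.monotonic()     # called for each "I'm alive" message

    def peer_presumed_dead(self):
        return time.monotonic() - self.last_heard > HEARTBEAT_TIMEOUT

monitor = HeartbeatMonitor()
monitor.last_heard -= 10           # simulate 10 seconds of silence from the peer
if monitor.peer_presumed_dead():
    print("no heartbeats -- presume peer dead, begin failover / quorum arbitration")
```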
Quorum Disk
- A special cluster resource that stores the cluster log
- When a node joins a cluster, it attempts to reserve the quorum disk
- If the quorum disk does not have an owner, the node takes ownership and forms a cluster
- If the quorum disk has an owner, the node joins the cluster
Quorum Disk (cont'd)
- If the nodes cannot communicate (no heartbeats), only one is allowed to continue operating
- They use the quorum disk to decide which one lives
- Each node waits, then tries to reserve the quorum disk; the last owner waits the shortest time, and if it's still alive it will take ownership of the quorum disk
- When the other node attempts to reserve the quorum disk, it finds that it's already owned; the node that doesn't own the quorum disk then fails over
- This is called the Challenge/Defense protocol (sketched below)
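Here is a conceptual sketch of the challenge/defense outcome. The QuorumDisk class and its try_reserve method are stand-ins for the real SCSI reserve mechanism, and the call order stands in for the different wait times.

```python
class QuorumDisk:
    """Stand-in for the shared quorum disk; try_reserve mimics a reserve operation."""
    def __init__(self, owner=None):
        self.owner = owner

    def try_reserve(self, node):
        if self.owner is None or self.owner == node:
            self.owner = node
            return True                    # reservation succeeds
        return False                       # someone else already holds the disk

# Heartbeats have stopped. The last owner waits the shortest time, so it gets
# to challenge first; here the call order represents those different waits.
quorum = QuorumDisk(owner=None)            # reservation dropped when contact was lost
print(quorum.try_reserve("node_A"))        # True:  last owner defends and keeps running
print(quorum.try_reserve("node_B"))        # False: challenger loses and fails over
```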
Microsoft Cluster Server (MSCS)
- MSCS objects: there are lots of MSCS objects, but only two we care about -- resources and groups
- Resources: applications, data files, disks, IP addresses, ...
- Groups: an application and its related resources, such as data on disks
Microsoft Cluster Server (MSCS) (cont'd)
- When a server dies, groups fail over
- When a server is repaired and returned to service, groups fail back
- Since data on disks is included in groups, disks fail over and fail back
[Diagram: Mail and Web groups with their resources]
Groups Failover
- Groups are the entities that fail over, and they take their disks with them (a sketch follows)
[Diagram: the Mail group and its resources moving to the node that hosts the Web group]
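A toy sketch of the group concept: a group bundles an application with its dependent resources, and the whole group, disks included, changes owner on failover and failback. The group names, resources, and fields here are hypothetical, not MSCS data structures.

```python
# Illustrative only: groups move between nodes as indivisible units.
groups = {
    "Mail": {"owner": "node1", "resources": ["mail service", "IP 10.0.0.5", "disk D:"]},
    "Web":  {"owner": "node2", "resources": ["web service",  "IP 10.0.0.6", "disk E:"]},
}

def fail_over(groups, failed_node, surviving_node):
    for name, group in groups.items():
        if group["owner"] == failed_node:
            group["owner"] = surviving_node        # the disks move with the group
            print(f"group {name} failed over to {surviving_node}")

def fail_back(groups, repaired_node, preferred_groups):
    for name in preferred_groups:
        groups[name]["owner"] = repaired_node
        print(f"group {name} failed back to {repaired_node}")

fail_over(groups, "node1", "node2")                # node1 dies: Mail moves to node2
fail_back(groups, "node1", ["Mail"])               # node1 repaired: Mail returns
```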
Microsoft Cluster Certification
- Two levels of certification
- Cluster component certification: HBAs and RAID controllers must be certified; when they pass, they're listed on the Microsoft web site and are eligible for inclusion in cluster system certification
- Cluster system certification: a complete two-node cluster; when it passes, it's listed on the Microsoft web site and will be supported by Microsoft
- Each certification takes days
Mylex NT Cluster Solutions
Target Markets
[Positioning chart: market segments from the commodity PC market (including mobile), performance desktop PCs, and PC-based workstations up through entry-level servers, mid-range servers, and enterprise servers, mapped to AcceleRAID 150, AcceleRAID 200, AcceleRAID 250, DAC-PG, DAC-PJ, eXtremeRAID 1100, DAC-SX, DAC-SF, DAC-FL, and DAC-FF]
Internal vs. External RAID Positioning
- Internal RAID: lower cost solution; higher performance in read-intensive applications; proven TPC-C performance enhances cluster performance
- External RAID: higher performance in write-intensive applications (the write-back cache is turned off in PCI-RAID controllers); higher connectivity (attach more disk drives); greater footprint flexibility -- until PCI-RAID implements Fibre
Why We're Better -- External RAID
- Robust active-active Fibre implementation: shipping active-active for over a year; it works in NT (certified) and Unix environments; Fibre on the back end coming soon
- Mirrored cache architecture: without a mirrored cache, data is inaccessible or dropped on the floor when a controller fails -- unless you turn off the write-back cache, which degrades write performance by 5x to 30x (see the sketch below)
- Four to six disk channels: I/O bandwidth and capacity scaling
- Dual Fibre host ports: NT expects to access data over pre-configured paths; if it doesn't find the data over the expected path, I/Os don't complete and applications fail
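To illustrate the mirrored-cache point, here is a conceptual sketch, not Mylex firmware; the classes and acknowledgment flow are invented. A write is acknowledged only once it sits in both controllers' caches, so a controller failure does not strand dirty data.

```python
class Controller:
    def __init__(self, name):
        self.name = name
        self.cache = {}            # block number -> data not yet flushed to disk
        self.alive = True

def cached_write(primary, partner, block, data):
    """Write-back cache write: mirror to the partner before acknowledging the host."""
    primary.cache[block] = data
    if partner.alive:
        partner.cache[block] = data
    return "ack"                   # host sees a fast, cache-speed completion

a, b = Controller("A"), Controller("B")
cached_write(a, b, block=7, data=b"payload")
a.alive = False                    # controller A fails with dirty data in its cache
print(7 in b.cache)                # True: the mirrored copy lets B flush it to disk
```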
SX Active/Active Duplex
[Diagram: HBAs, dual DAC SX controllers, Ultra SCSI disk interconnect, cluster interconnect]
SF (or FL) Active/Active Duplex -- Single FC Array Interconnect
[Diagram: FC HBAs, dual DAC SF controllers, single Fibre Channel array interconnect]
SF (or FL) Active/Active Duplex -- Dual FC Array Interconnect
[Diagram: FC HBAs, dual DAC SF controllers, dual Fibre Channel array interconnects, FC disk interconnect]
FF Active/Active Duplex -- Single FC Array Interconnect
[Diagram: FC HBAs, dual DAC FF controllers, single Fibre Channel array interconnect]
FF Active/Active Duplex -- Dual FC Array Interconnect
[Diagram: FC HBAs, dual DAC FF controllers, dual Fibre Channel array interconnects]
Why We'll Be Better -- Internal RAID
- Deliver auto-rebuild
- Deliver RAID expansion: MORE-I adds logical units online; MORE-II adds or expands logical units online
- Deliver RAID level migration: 0 -> 1, 1 -> 0, 0 -> 5, 5 -> 0, 1 -> 5, 5 -> 1
- And (of course) award-winning performance
NT Cluster with a Host-Based RAID Array
- Nodes have: an Ethernet NIC (heartbeat), private system disks (HBA), and a PCI-based RAID controller (eXtremeRAID)
[Diagram: shared disk interconnect, "heartbeat" interconnect, eXtremeRAID controllers, HBA, NIC]
Why eXtremeRAID & DAC960PJ Clusters
- Typically four or fewer processors
- Offers a less expensive, integrated RAID solution
- Can combine clustered and non-clustered applications in the same enclosure
- Uses today's readily available hardware
TPC-C Performance for Clusters: DAC960PJ
- Two external Ultra channels at 40 MB/sec
- Three internal Ultra channels at 40 MB/sec
- 32-bit PCI bus between the controller and the server, providing burst data transfer rates up to 132 MB/sec
- 66 MHz i960 processor off-loads RAID management from the host CPU
eXtremeRAID™: Blazing Clusters
- eXtremeRAID™ achieves a breakthrough in RAID technology, eliminates storage bottlenecks, and delivers scalable performance for NT clusters
- 233 MHz StrongARM RISC processor off-loads RAID management from the host CPU
- 64-bit PCI bus doubles data bandwidth between the controller and the server, providing burst data transfer rates up to 266 MB/sec (see the arithmetic below)
- Three Ultra2 SCSI LVD channels at 80 MB/sec each, for up to 42 shared storage devices with connectivity up to 12 meters
- Mylex's new firmware is optimized for performance and manageability
- Supports up to 42 drives per cluster, as much as 810 GB of capacity per controller; performance increases as you add drives
[Board diagram: LEDs, serial port, CPU, NVRAM, SCSI channels (Ch 0/Ch 1; Ch 0 bottom, Ch 2 top), PCI bridge, BASS DAC, memory module with BBU]
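The quoted burst rates follow from bus width times clock rate. The short calculation below assumes a conventional 33 MHz PCI clock for both buses; the slide's own figures round slightly differently.

```python
# Burst bandwidth = bus width in bytes x bus clock.
PCI_CLOCK_HZ = 33e6                    # assumed 33 MHz PCI clock
pci_32_bit = 4 * PCI_CLOCK_HZ          # 132 MB/s: the DAC960PJ 32-bit host bus figure
pci_64_bit = 8 * PCI_CLOCK_HZ          # 264 MB/s, commonly quoted as ~266 MB/s at the nominal 33.33 MHz
ultra2_aggregate = 3 * 80e6            # 240 MB/s across the three Ultra2 LVD channels
print(int(pci_32_bit / 1e6), int(pci_64_bit / 1e6), int(ultra2_aggregate / 1e6))
```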
eXtremeRAID™ 1100 NT Clusters
- Nodes have: an Ethernet NIC (heartbeat), private system disks (HBA), and a PCI-based RAID controller
- Nodes share access to the data disks but do not share data
- Three shared Ultra2 disk interconnects
[Diagram: shared Ultra2 interconnects, "heartbeat" interconnect, eXtremeRAID controllers, HBA, NIC]
Cluster Support Plans
- Internal RAID: Windows NT; Novell Orion -- Q4 '98; SCO -- TBD; Sun -- TBD
- External RAID: Windows NT; Novell Orion -- TBD; SCO -- TBD
Plans for NT Cluster Certification
Microsoft clustering submission dates:
- DAC SX -- completed (simplex); July (duplex)
- DAC SF -- completed (simplex); July (duplex)
- DAC FL -- August (simplex); August (duplex)
- DAC960PJ -- Q4 '99
- eXtremeRAID™ 1164 -- Q4 '99
- AcceleRAID™ -- Q4 '99
What RAID Arrays Are Right for Clusters?
- eXtremeRAID™
- AcceleRAID™ 200
- AcceleRAID™ 250
- DAC SF
- DAC FL
- DAC FF