MSCS Clustering Implementation
Mylex eXtremeRAID 1100 PCI-to-Ultra2 SCSI RAID Controllers
Clustering: Basics
What Are Clusters?
Group of independent systems that:
– Function as a single system
– Appear to users as a single system
– Are managed as a single system
Clusters are "virtual servers".
Why Clusters?
Clusters improve system availability
– This is the primary value in Wolfpack-I clusters
Clusters enable application scaling
Clusters simplify system management
Clusters (with Intel servers) are cheap
System Availability
Clusters improve system availability:
– When a networked server fails, the service it provided is down
– When a clustered server fails, the service it provided "fails over" and downtime is avoided
[Diagram: separate networked Mail and Internet servers vs. a two-node cluster serving Mail & Internet]
Application Scaling
Clusters enable application scaling:
– With networked SMP servers, application scaling is limited to a single server
– With clusters, applications scale across multiple SMP servers (typically up to 16 servers)
Simple Systems Management
Clusters simplify system management:
– Clusters present a Single System Image; the cluster looks like a single server to management applications
– Hence, clusters reduce system management costs
[Diagram: three management domains collapsing into one management domain]
Inexpensive
Clusters (with Intel servers) are cheap:
– Essentially no additional hardware costs; readily available, high-volume server hardware
– Microsoft charges an extra $3K per node:
    Windows NT Server: $1,000
    Windows NT Server, Enterprise Edition: $4,000
Note: Proprietary Unix cluster software costs $10K to $25K per node.
An Analogy to RAID
RAID makes disks fault tolerant – clusters make servers fault tolerant
RAID increases I/O performance – clusters increase compute performance
RAID makes disks easier to manage – clusters make servers easier to manage
Two Flavors of Clusters
High Availability Clusters
– Microsoft's Wolfpack 1
– Compaq's Recovery Server
Load Balancing Clusters (a.k.a. Parallel Application Clusters)
– Microsoft's Wolfpack 2
– Digital's VAXClusters
Note: Load balancing clusters are a superset of high availability clusters.
High Availability Clusters
Two-node clusters (node = server)
During normal operations, both servers do useful work
Failover
– When a node fails, applications fail over to the surviving node, which assumes the workload of both nodes
[Diagram: Mail and Web nodes; after failover, one node runs Mail & Web]
High Availability Clusters (Contd.)
Failback
– When the failed node is returned to service, the applications fail back
[Diagram: the Mail workload moving back to the repaired node]
Load Balancing Clusters
Multi-node clusters (two or more nodes)
Load balancing clusters typically run a single application, e.g. a database, distributed across all nodes
Cluster capacity is increased by adding nodes (but like SMP servers, scaling is less than linear)
[Diagram: adding a node raises cluster throughput from 3,000 TPM to 3,600 TPM]
Load Balancing Clusters (Contd.)
The cluster rebalances the workload when a node dies
If different apps are running on each server, they fail over to the least busy server, or as directed by predefined failover policies
Two Cluster Models
"Shared Nothing" Model
– Microsoft's Wolfpack clusters
"Shared Disk" Model
– VAXClusters
"Shared Nothing" Model
At any moment in time, each disk is owned and addressable by only one server
The "shared nothing" terminology is confusing:
– Access to disks is shared -- they sit on the same bus
– But at any moment in time, disks are not shared
"Shared Nothing" Model (Contd.)
When a server fails, the disks that it owns "fail over" to the surviving server, transparently to the clients
"Shared Disk" Model
Disks are not owned by servers but shared by all servers
At any moment in time, any server can access any disk
A Distributed Lock Manager arbitrates disk access so apps on different servers don't step on one another (corrupt data)
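To make the arbitration idea concrete, here is a toy sketch of DLM-style locking around a shared disk block. This is a deliberate simplification under stated assumptions: a process-local mutex stands in for a cluster-wide lock, and all names are invented; it is not the VAXCluster DLM.

```c
/* Toy model of lock-manager arbitration: before touching a shared
   block, a server must hold the lock that covers it. In a real DLM the
   lock state is coordinated across nodes; a mutex stands in here. */
#include <pthread.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t lock;       /* stand-in for a cluster-wide lock */
    char            data[512];  /* the shared disk block's contents */
} shared_block;

void update_block(shared_block *blk, const char *msg)
{
    pthread_mutex_lock(&blk->lock);    /* acquire before writing... */
    snprintf(blk->data, sizeof blk->data, "%s", msg);
    pthread_mutex_unlock(&blk->lock);  /* ...release so others can proceed */
}
```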
Cluster Interconnect
This is about how servers are tied together and how disks are physically connected to the cluster
Clustered servers always have a client network interconnect, typically Ethernet, to talk to users
And at least one cluster interconnect to talk to other nodes and to disks
[Diagram: client network (Ethernet) plus a cluster interconnect from each node's HBA to the RAID storage]
Cluster Interconnect (Contd.)
Or they can have two cluster interconnects:
– One for nodes to talk to each other -- the "Heartbeat Interconnect", typically Ethernet
– And one for nodes to talk to disks -- the "Shared Disk Interconnect", typically SCSI or Fibre Channel
[Diagram: NIC-to-NIC heartbeat interconnect and HBA-to-RAID shared disk interconnect]
Microsoft Cluster Server (MSCS)
Wolfpack
Clusters Are Not New
Clusters have been around since 1985
Most UNIX systems are clustered
What's new is Microsoft clusters:
– Code-named "Wolfpack"
– Named Microsoft Cluster Server (MSCS), the software that provides clustering
– MSCS is part of Windows NT Server, Enterprise Edition 4.0
Microsoft Cluster Rollout
Wolfpack-I
– In Windows NT Server, Enterprise Edition 4.0 (NT/E 4.0), which also includes Transaction Server and Reliable Message Queue
– Two-node "failover cluster"
– Shipped October 1997
Wolfpack-II
– In (or after) Windows 2000, Advanced Server
– Borrows components from more robust Tandem and Digital cluster technology (Compaq technology sharing)
– "N"-node (probably up to 16) "load balancing cluster"
– Beta in 1998, shipping in 1999?
MSCS (NT/E 4.0) Overview
Two-node "failover" cluster
"Shared nothing" model
– At any moment in time, each disk is owned and addressable by only one server
Two cluster interconnects
– "Heartbeat" cluster interconnect: Ethernet
– Shared disk interconnect: SCSI (any flavor) or Fibre Channel (SCSI protocol over Fibre Channel)
Each node has a "private system disk" (boot disk)
MSCS (NT/E 4.0) Topologies
– Host-based (PCI) RAID arrays
– External RAID arrays
NT Cluster With Host-Based RAID Array
Each node has:
– Ethernet NIC -- heartbeat
– Private system disk (generally on an HBA)
– PCI-based RAID controller -- SCSI or Fibre Channel
Nodes share access to data disks but do not share data
[Diagram: heartbeat interconnect between NICs; each node's RAID controller on the shared disk interconnect]
NT Cluster With External RAID Array
Each node has:
– Ethernet NIC -- heartbeat
– Multi-channel HBAs connecting the boot disk and the external array
Shared external RAID controller on the SCSI or FC bus -- Mylex's DAC-SX, DAC-FL, and DAC-FF products
[Diagram: heartbeat interconnect between NICs; both nodes' HBAs on the shared disk interconnect to the external RAID array]
Cluster Interconnect and Heartbeats
Cluster interconnect
– Private Ethernet between nodes
– Used to transmit "I'm alive" heartbeat messages
Heartbeat messages
– When a node stops getting heartbeats, it assumes the other node has died and initiates failover
– In some failure modes, both nodes stop getting heartbeats (a NIC dies, or someone trips over the cluster cable):
    Both nodes are still alive, but each thinks the other is dead -- "split brain" syndrome
    Both nodes initiate failover. Who wins?
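A minimal sketch of the heartbeat-loss detection described above. The timeout constant and function names are assumptions for illustration, not MSCS internals:

```c
/* Sketch of heartbeat monitoring: record each "I'm alive" message and
   presume the partner dead after a silence longer than the timeout.
   The 5-second value is a hypothetical choice, not the MSCS default. */
#include <stdbool.h>
#include <time.h>

#define HEARTBEAT_TIMEOUT_SEC 5

static time_t last_heartbeat;

void on_heartbeat_received(void)
{
    last_heartbeat = time(NULL);
}

/* Polled periodically; true means this node should start quorum
   arbitration (next slide) rather than failing over immediately,
   because the partner may still be alive (split brain). */
bool partner_presumed_dead(void)
{
    return time(NULL) - last_heartbeat > HEARTBEAT_TIMEOUT_SEC;
}
```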
Quorum Disk
A special cluster resource that stores the cluster log
When a node joins a cluster, it attempts to reserve the quorum disk:
– If the quorum disk does not have an owner, the node takes ownership and forms a cluster
– If the quorum disk has an owner, the node joins the cluster
[Diagram: two nodes on the shared disk interconnect, with the quorum disk inside the RAID array]
Quorum Disk (Contd.)
If the nodes cannot communicate (no heartbeats), only one is allowed to continue operating
They use the quorum disk to decide which one lives:
– Each node waits, then tries to reserve the quorum disk
– The last owner waits the shortest time; if it's still alive, it will take ownership of the quorum disk
– When the other node attempts to reserve the quorum disk, it finds that it's already owned
– The node that doesn't own the quorum disk then fails over
This is called the Challenge/Defense Protocol
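The timing asymmetry is the whole trick: the defender (last owner) retries first, so a live defender always wins the race. A minimal sketch under stated assumptions -- scsi_reserve() and the wait constants are hypothetical stand-ins, not the MSCS implementation:

```c
/* Challenge/Defense sketch: both nodes race for the quorum disk, but
   the last owner waits less, so it wins whenever it is still alive. */
#include <stdbool.h>
#include <unistd.h>

#define DEFENDER_WAIT_SEC   3   /* last owner: shortest wait */
#define CHALLENGER_WAIT_SEC 7   /* challenger waits longer */

/* Hypothetical wrapper around a SCSI RESERVE of the quorum disk; returns
   false (RESERVATION CONFLICT) if another initiator already holds it. */
extern bool scsi_reserve(int quorum_disk_id);

/* Returns true if this node survives; false means it must fail over. */
bool arbitrate_quorum(int quorum_disk_id, bool was_last_owner)
{
    sleep(was_last_owner ? DEFENDER_WAIT_SEC : CHALLENGER_WAIT_SEC);
    return scsi_reserve(quorum_disk_id);
}
```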
Microsoft Cluster Server (MSCS)
MSCS objects
– There are lots of MSCS objects, but only two we care about: Resources and Groups
Resources
– Applications, data files, disks, IP addresses, ...
Groups
– An application and its related resources, such as data on disks (see the sketch below)
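As a rough picture of the containment relationship, here is an illustrative sketch; the type and field names are invented for clarity and are not the actual MSCS structures:

```c
/* A group is the unit of failover; it bundles an application with the
   resources it depends on, and the bundle has exactly one owner node. */
typedef enum { RES_APPLICATION, RES_PHYSICAL_DISK, RES_IP_ADDRESS } resource_type;

typedef struct {
    resource_type type;
    const char   *name;        /* e.g. "Mail service", "Disk E:", "10.0.0.5" */
} resource;

typedef struct {
    const char     *name;        /* e.g. "Mail" */
    int             owner_node;  /* the one node that owns the group right now */
    const resource *resources;   /* app + its disks + its IP address move together */
    int             n_resources;
} group;

/* Failover in this model is just reassigning ownership of the bundle. */
static void fail_over(group *g, int surviving_node)
{
    g->owner_node = surviving_node;
}
```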
Microsoft Cluster Server (MSCS)
When a server dies, groups fail over
When a server is repaired and returned to service, groups fail back
Since data on disks is included in groups, disks fail over and fail back
[Diagram: Mail and Web groups, each containing its resources, one group per node]
Groups Failover
Groups are the entities that fail over, and they take their disks with them
[Diagram: the Mail group, with its resources, moving to the surviving node alongside the Web group]
Microsoft Cluster Certification
Two levels of certification:
– Cluster Component Certification
    HBAs and RAID controllers must be certified
    When they pass, they're listed on the Microsoft web site (www.microsoft.com/hwtest/hcl/) and are eligible for inclusion in cluster system certification
– Cluster System Certification
    Complete two-node cluster
    When they pass, they're listed on the Microsoft web site and will be supported by Microsoft
Each certification takes 30-60 days
Mylex's Clustering Implementation
eXtremeRAID 1100 PCI-to-Ultra2 SCSI RAID
NT Cluster With Host-Based RAID Array
Nodes have:
– Ethernet NIC -- heartbeat
– Private system disks (HBA)
– PCI-based RAID controller
Nodes share access to data disks but do not share data
[Diagram: heartbeat interconnect between NICs; two eXtremeRAID controllers with three shared Ultra2 interconnects]
MSCS Requirements for the Shared Storage Bus
A local drive is needed for the boot OS and file system
At any time, only one node has sole ownership of a shared drive
MSCS only supports the SCSI protocol for the shared bus
Specific SCSI commands are required for clustered shared devices (sketched below):
– Reserve, Release, Test Unit Ready, Inquiry
– Support for DPO (Disable Page Out) and FUA (Force Unit Access) in read/write commands
Support for multiple initiators, and the ability to handle SCSI Bus Reset and Bus Device Reset
Controller ability to handle cluster partner node shutdown or removal -- SCSI bus transition, reset, and termination control
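For concreteness, here is how the cluster-related CDBs named above look on the wire, with opcodes and bit positions per the SCSI-2 spec. This is an illustrative sketch, not Mylex driver code:

```c
/* Building the raw 6- and 10-byte CDBs that MSCS relies on. RESERVE(6)
   and RELEASE(6) implement exclusive ownership; DPO/FUA in WRITE(10)
   bypass caching so committed data is really on the media. */
#include <stdint.h>
#include <string.h>

void build_reserve6(uint8_t cdb[6])  { memset(cdb, 0, 6); cdb[0] = 0x16; } /* RESERVE(6) */
void build_release6(uint8_t cdb[6])  { memset(cdb, 0, 6); cdb[0] = 0x17; } /* RELEASE(6) */
void build_tur(uint8_t cdb[6])       { memset(cdb, 0, 6); cdb[0] = 0x00; } /* TEST UNIT READY */

void build_write10(uint8_t cdb[10], uint32_t lba, uint16_t nblocks, int dpo, int fua)
{
    memset(cdb, 0, 10);
    cdb[0] = 0x2A;                                   /* WRITE(10) */
    cdb[1] = (uint8_t)((dpo ? 0x10 : 0) | (fua ? 0x08 : 0));  /* DPO bit 4, FUA bit 3 */
    cdb[2] = (uint8_t)(lba >> 24);                   /* logical block address */
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)(lba);
    cdb[7] = (uint8_t)(nblocks >> 8);                /* transfer length */
    cdb[8] = (uint8_t)(nblocks);
}
```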
Mylex RAID Products for MSCS Clustering
Controllers supported -- LVD-based:
– eXtremeRAID - DAC1164P
LVD mode is recommended for long cabling distances (12m). Single-ended mode is limited to 3m and requires a SCSI bus extender for longer distances
[Diagram: heartbeat interconnect between NICs; two eXtremeRAID controllers on the shared disk interconnect]
eXtremeRAID 1100: Technology
eXtremeRAID 1100: Architecture
[Block diagram: three 16-bit LVD SCSI channels (80 MB/s each) on SCSI ASICs attached to a secondary 32-bit/33 MHz PCI bus; a 40 MHz RISC CPU with CPU bridge, flash, NVRAM, and SDRAM; and a host PCI-to-PCI bridge to the host's 33 MHz PCI bus]
Mylex PCI RAID's Two-Node Cluster
Emulates the SCSI shared-bus requirements through the NT miniport driver and RAID firmware:
– Treats RAID volume drives as physical disk drives
– Supports Reserve/Release and other cluster-related SCSI commands in the firmware through a volume reservation table (sketched below)
– Honors DPO, FUA, and flush operations in the firmware
RAID configuration, fault management, enclosure management, and volume Reserve/Release are administered by a master/slave mechanism
Communication between the RAID controllers in the two nodes is established through the back-end SCSI bus -- heartbeat, cluster commands, RAID configuration, and fault management
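A minimal sketch of what such a volume reservation table might look like in firmware; the names and the flat-array layout are assumptions for illustration, not the eXtremeRAID implementation:

```c
/* Per-volume reservation state: a RESERVE from one initiator causes a
   RESERVATION CONFLICT for the other until a RELEASE (or bus reset). */
#define MAX_VOLUMES 32
#define NO_OWNER    (-1)

static int reservation_owner[MAX_VOLUMES];  /* initiator ID, or NO_OWNER */

void fw_reservations_init(void)
{
    for (int v = 0; v < MAX_VOLUMES; v++)
        reservation_owner[v] = NO_OWNER;
}

int fw_reserve(int volume, int initiator_id)
{
    if (reservation_owner[volume] != NO_OWNER &&
        reservation_owner[volume] != initiator_id)
        return -1;                           /* RESERVATION CONFLICT */
    reservation_owner[volume] = initiator_id;
    return 0;
}

void fw_release(int volume, int initiator_id)
{
    if (reservation_owner[volume] == initiator_id)
        reservation_owner[volume] = NO_OWNER;
}

/* A SCSI Bus Reset clears all reservations, per the SCSI-2 rules. */
void fw_bus_reset(void)
{
    fw_reservations_init();
}
```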
Master-Slave Concept
Master/slave is a controller concept and is transparent to the host system and OS
Master/slave is independent of the server cluster-node status
The first established node acts as master; the later one becomes the slave
If one node fails or goes offline, the surviving node becomes master
Node discovery is initiated by a SCSI Bus Reset and kept alive by heartbeat communication through the back-end shared SCSI bus
[Diagram: Node A (master) and Node B (slave) eXtremeRAID controllers exchanging RAID heartbeat and communication over the back-end SCSI buses]
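The role rule reduces to two cases, sketched below; this is illustrative, not the actual firmware logic:

```c
/* First controller to establish itself is master; a controller that
   discovers an existing partner comes up as slave; losing the partner's
   heartbeat promotes the survivor. */
#include <stdbool.h>

typedef enum { ROLE_MASTER, ROLE_SLAVE } ctrl_role;

ctrl_role role_on_startup(bool partner_already_established)
{
    return partner_already_established ? ROLE_SLAVE : ROLE_MASTER;
}

void role_on_partner_lost(ctrl_role *role)
{
    *role = ROLE_MASTER;   /* the surviving controller always becomes master */
}
```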
Master/Slave Perspective
Only the master manages RAID configuration changes and the fault rebuild process:
– RAID configuration and fault management can be initiated from either node or invoked from DACCF/GAM
– COD updates are done by the master, but it informs the slave to update its NVRAM information
– The master manages the rebuild process and can delegate tasks to the slave
Enclosure management (SAF-TE) is administered by the master
Logical volume Reserve/Release is communicated between master and slave through the back-end shared SCSI bus
Termination Control and Bus Isolation
In a cluster setup, a server node can be powered on, shut down, or removed for upgrade or maintenance
Mylex-supplied terminator switch box:
– Contains an LVD/SE terminator and fast silicon switches
– When server node power is on, the terminator is off and the SCSI signal passes through
– When server node power is off or the node is removed, the terminator is on and the SCSI signal is isolated from the server node
[Diagram: terminator switch boxes between each server node's 1164P controller and the disk box]
Mylex's Clustering Support Elements
Two-node NT 4.0 clustering only (MSCS):
– FW 5.07C for eXtremeRAID
– BIOS support for the cluster nexus establishment message
– DACCF/BCU modifications for initiator ID and clustering support
– NT miniport driver modifications to support cluster-related SCSI commands
– GAM driver, server, and clients: no changes
[Diagram: software stack -- FW, BIOS, BCU, DACCF, miniport, GAM driver, GAM server, GAM client over TCP/IP]
Global Array Management (GAM)
GAM: client/server RAID management tool using the TCP/IP protocol
– Uses a virtual IP for viewing a single RAID subsystem image (a physical IP can be used to view the two physical node images if needed)
– Either master or slave will be viewed, depending on the current cluster group
GAM task requests are communicated through the back-end SCSI bus and administered by the master controller
[Diagram: GAM client talking over TCP/IP to the GAM servers on both nodes through a virtual IP, presenting a single system image]
Mylex Clustering Approach
Same FW, BIOS, driver, and utilities for clustering and non-clustering support
Supports full-featured Mylex RAID controller functions:
– Full RAID configuration through DACCF and GAM
– Hot swap, hot spare, RAID rebuild
– Background consistency check
– Background initialization
– SAF-TE enclosure management
MORE -- Mylex online capacity expansion and RAID migration -- is not supported in a cluster configuration
Maintains TPC-C world-record-leading performance:
– Minimal impact from master/slave heartbeat monitoring
– Write-back caching is disabled for cluster data availability and integrity
WHQL Clustering Certification
Passed Microsoft SDG 1.0 (Server Design Guide); submitted to the WHQL certification queue
Passed MSCS HCT 8.0 and the clustering certification pre-submission test:
– MSCS System Validation -- Phases 1-3 tested
– Tested on Intel Madrona, Nightshade, and Sitka based systems
Will submit the test log to Microsoft in early December 1998
[Diagram: two-node cluster with Cluster Admin and clients attached]
Mylex Clustering Restrictions
Only two-node MSCS clustering is supported
The boot and file system disk needs to be a local drive, separate from the shared bus -- per MSCS requirements
The shared bus includes all SCSI channels on both controllers; all shared devices should be on the same channel for the two clustered controllers
Only SCSI hard disks and SAF-TE devices are allowed on the shared bus
Write-back caching is disabled
MORE is not supported
SCSI devices must be capable of supporting multiple initiators, SCSI Bus Reset, and Bus Device Reset
Mylex: Recommended Installation
– Set the controller initiator ID and enable cluster support for each node through DACCF while the two nodes are still separate
– Disable the RAID controller BIOS on both nodes, since the RAID controller is not controlling the boot device
– Run the RAID configuration, using DACCF, on one node
– Connect the two nodes together using the Mylex terminator switch box and cabling
– Ready to go -- just follow the Microsoft Cluster Server Administrator's Guide for the clustering installation
Mylex's Installation Tips
– Disable termination on all of the drives and the drive box
– Be sure there are no SCSI ID conflicts among the drives and SAF-TE processors
– Use LVD (Low Voltage Differential) rather than SE (Single Ended) drives and enclosures because of SE cable length restrictions; if using SE, consider using repeaters
– For optimum performance, create two packs, one pack per controller
– Do not create multiple partitions on a shared drive; MSCS can only fail over a physical drive
– MSCS only supports NTFS partitions
– Failback needs to be set manually within MSCS; otherwise the server that loads the MSCS services first will get ALL of the resources