Clustering Technology In Windows NT Server, Enterprise Edition Jim Gray Microsoft Research research.Microsoft.com/~gray
Today's Agenda: Windows NT® clustering; MSCS (Microsoft Cluster Server) demo; MSCS background; Design goals; Terminology; Architectural details; Setting up an MSCS cluster; Hardware considerations; Cluster application issues; Q&A
Extra Credit: Included in your presentation materials but not covered in this session; Reference materials; SCSI primer; Speaker's notes included; Hardware Certification
MSCS In Action
High Availability Versus Fault Tolerance: High Availability masks outages through service restoration; Fault-Tolerance masks local faults (RAID disks, Uninterruptible Power Supplies, Cluster Failover); Disaster Tolerance masks site failures (protects against fire, flood, sabotage, and the like; redundant system and service at remote site)
Windows NT Clusters: What is clustering to Microsoft? A group of independent systems that appear as a single system; Managed as a single system; Common namespace; Services are cluster-wide; Ability to tolerate component failures; Components can be added transparently to users; Existing client connectivity is not affected by clustered applications
Microsoft Cluster Server: 2-node available 97Q3; Commoditize fault-tolerance (high availability); Commodity hardware (no special hardware); Easy to set up and manage; Lots of applications work out of the box; Multi-node scalability in NT5 timeframe
MSCS Initial Goals: Manageability (manage nodes as a single system; perform server maintenance without affecting users; mask faults, so repair is non-disruptive); Availability (restart failed applications and servers; un-availability ~ MTTR / MTBF, so quick repair is what counts; detect/warn administrators of failures); Reliability (accommodate hardware and software failures; redundant system without mandating a dedicated standby solution)
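The relation un-availability ~ MTTR / MTBF from the availability goal can be worked through numerically. The following is a minimal sketch: the function name and the sample figures (an MTBF of 1,000 hours and the two repair times) are illustrative assumptions, not measurements from this talk.

```python
HOURS_PER_YEAR = 24 * 365


def unavailability(mttr_hours: float, mtbf_hours: float) -> float:
    """Approximate fraction of time a service is down, as MTTR / MTBF.

    This is the approximation quoted on the slide; it is reasonable
    when the repair time is much smaller than the time between failures.
    """
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return mttr_hours / mtbf_hours


if __name__ == "__main__":
    # Hypothetical numbers: same failure rate, two different repair times.
    scenarios = [
        ("manual repair, 4 hours", 4.0),
        ("automatic restart, 1 minute", 1.0 / 60.0),
    ]
    for label, mttr in scenarios:
        u = unavailability(mttr, mtbf_hours=1000.0)
        print(f"{label}: unavailability {u:.6f}, "
              f"about {u * HOURS_PER_YEAR:.2f} hours down per year")
```

With the failure rate held constant, shrinking the repair time is the only lever left, which is why the goals emphasize restarting failed applications and servers quickly.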
MSCS Cluster (figure): Client PCs; Server A and Server B; Disk cabinet A and Disk cabinet B; Heartbeat; Cluster management
Failover Example (figure): Browser; Server 1 and Server 2; Web site and Database; Web site files and Database files
Basic MSCS Terms: Resource - basic unit of failover; Group - collection of resources; Node - Windows NT® Server running cluster software; Cluster - one or more closely-coupled nodes, managed as a single entity
MSCS Namespace, Cluster view (figure): Cluster name; Node name; Virtual server name
MSCS Namespace, Outside world view (figure): each entry has an IP address and a network name; Cluster (network name WHECCLUS); Node 1 (WHECNode1); Node 2 (WHECNode2); Virtual server 1 (WHEC-VS1); Virtual server 2 (WHEC-VS2); Virtual server 3 (WHEC-VS3); services shown: Internet Information Server, SQL, MTS, Falcon, Microsoft Exchange
Windows NT Clusters, Target applications: Application and database servers; Groupware and productivity applications servers; Transaction processing servers; Internet Web servers; File and print servers
MSCS Design Philosophy: Shared nothing; Simplified hardware configuration; Remotable tools; Windows NT manageability enhancements; Never take a cluster down: shell game rolling upgrade; Microsoft® BackOffice product support; Provide clustering solutions for all levels of customer requirements; Eliminate cost and complexity barriers
MSCS Design Philosophy: Availability is core for all releases; Single server image for administration, client interaction; Failover provided for unmodified server applications, unmodified clients (cluster-aware server applications get richer features); Failover for file and print is the default; Scalability is phase 2 focus
Non-Features Of MSCS: Not lock-step/fault-tolerant; Not able to move running applications (MSCS restarts applications that are failed over to other cluster members); Not able to recover shared state between client and server (e.g., file position); All client/server transactions should be atomic; Standard client/server development rules still apply; ACID always wins
Setting Up MSCS Applications
Attributes Of Cluster-Aware Applications: A persistence model that supports orderly state transition (database example: ACID transactions, database log recovery); Client application support (IP clients only; how are retries supported?); No name service location dependencies; Custom resource DLL is a good thing
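The question "How are retries supported?" matters because a failed-over application is restarted, so a client sees a broken connection and must try again. The following is a minimal sketch of one way a client could do that. `call_with_retry` and `FlakyServer` are hypothetical names invented for illustration, not part of MSCS, and the sketch assumes the operation is atomic and safe to repeat.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(operation: Callable[[], T],
                    attempts: int = 5,
                    delay_s: float = 0.0) -> T:
    """Run an atomic, repeatable operation, retrying on connection errors."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return operation()
        except ConnectionError as err:
            # The service may be restarting on another node; wait and retry.
            last_error = err
            time.sleep(delay_s)
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error


class FlakyServer:
    """Stand-in for a service that is unreachable while it fails over."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def request(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("service restarting on the other node")
        return "ok"


if __name__ == "__main__":
    server = FlakyServer(failures=2)
    print(call_with_retry(server.request), "after", server.calls, "calls")
```

Retrying is only safe because the request is treated as a complete unit of work; a client that depends on shared state such as a file position cannot simply repeat its last call.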
MSCS Services For Application Support: Name service mapper (GetComputerName resolves to virtual server name); Registry replication (a key and its underlying keys and values are replicated to the other node; atomic; logged to ensure partitions in time are handled)
Application Deployment Planning: System configuration is crucial; Adequate hardware configuration (you can't run Microsoft BackOffice on a 32-MB, 75-MHz Pentium); Planning of preferred group owners; Good understanding of single-server performance is critical (see the Windows NT Resource Kit performance planning section; understand working set size); What is acceptable performance to the business units?
Evolution Of Cluster-Aware Applications: Active/passive - general out-of-the-box applications; Active/active - applications that can run simultaneously on multiple nodes; Highly scalable - extending active/active through I/O shipping, process groups, and other techniques
Application Evolution (table; columns: Application, Node 1, Node 2; rows: Microsoft SQL Server, Microsoft Transaction Server (MTS), Internet Information Server (IIS), Microsoft Exchange Server)
Evolution Of Cluster-Aware Applications (table; columns: Application, Node 1, Node 2, Node 3, Node 4; rows: Microsoft SQL Server, Microsoft Transaction Server (MTS), Internet Information Server (IIS), Microsoft Exchange Server)
Resources: What are they? Resources are basic system components such as physical disks, processes, databases, and IP addresses that provide a service to clients in a client/server environment; They are online in only one place in the cluster at a time; They can fail over from one system in the cluster to another system in the cluster
Resources: MSCS includes resource DLL support for physical and logical disk, IP address and network name, generic service or application, file share, print queue, Internet Information Server virtual roots, Distributed Transaction Coordinator (DTC), and Microsoft Message Queue (MSMQ); Supports resource dependencies; Controlled via well-defined interface; Group: offers a virtual server
Cluster Service To Resource (figure): Windows NT cluster service; Resource monitor; Physical disk resource DLL (Disk); IP address resource DLL (Network); Generic app resource DLL (App); Database resource DLL (Database); arrows labeled "Resource events" and "Initiate changes"
Cluster Abstractions: Resource: program or device managed by a cluster (e.g., file service, print service, database server); can depend on other resources (startup ordering); can be online, offline, paused, failed. Resource Group: a collection of related resources; hosts resources; belongs to a cluster; unit of co-location; involved in naming resources. Cluster: a collection of nodes, resources, and groups; cooperation for authentication, administration, naming
Resources: Resources have a type: what it does (file, DB, print, Web…); An operational state (online/offline/failed); Current and possible nodes; Containing Resource Group; Dependencies on other resources; Restart parameters (in case of resource failure)
Resource: Fails over (moves) from one machine to another (logical disk, IP address, server application, database); May depend on another resource; Well-defined properties controlling its behavior
Resource Dependencies: A resource may depend on other resources; A resource is brought online after any resources it depends on; A resource is taken offline before any resources it depends on; All dependent resources must fail over together
Dependency Example (figure): Drive E: resource DLL; Drive F: resource DLL; IP address resource DLL; Generic application resource DLL; Database
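The ordering rules for resource dependencies (online after what you depend on, offline before it) amount to a topological sort. The following is a minimal sketch using Python's standard `graphlib` module. The resource names come from the dependency example, but the specific edges between them are assumptions made for illustration, since the figure's arrows did not survive, and the function names are hypothetical.

```python
from graphlib import TopologicalSorter

# Maps each resource to the resources it depends on.
# These edges are assumed for illustration only.
EXAMPLE_DEPENDS_ON = {
    "Database": {"Drive E:", "Drive F:", "IP address"},
    "Generic application": {"Database", "IP address"},
}


def online_order(depends_on: dict) -> list:
    """Order in which to bring resources online: dependencies come first."""
    return list(TopologicalSorter(depends_on).static_order())


def offline_order(depends_on: dict) -> list:
    """Order in which to take resources offline: dependents go first."""
    return list(reversed(online_order(depends_on)))


if __name__ == "__main__":
    print("online: ", online_order(EXAMPLE_DEPENDS_ON))
    print("offline:", offline_order(EXAMPLE_DEPENDS_ON))
```

`TopologicalSorter` raises `graphlib.CycleError` if two resources depend on each other, which matches the intuition that a dependency cycle has no valid startup order.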
Group Example (figure): Payroll group containing Drive E: resource DLL; Drive F: resource DLL; IP address resource DLL; Generic application resource DLL; Database
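A group is the set of resources that fail over together, and a failed-over application is restarted on the surviving node rather than moved while running. The following is a minimal sketch of that bookkeeping. The `Group` and `Cluster` classes, the node names, and the rule for picking a survivor are hypothetical simplifications, not the MSCS Failover Manager's actual logic.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    name: str
    resources: list
    owner: str  # node currently hosting every resource in the group


@dataclass
class Cluster:
    nodes: set
    groups: list = field(default_factory=list)

    def fail_node(self, failed: str) -> list:
        """Reassign every group owned by the failed node to a survivor.

        Returns the names of the groups that moved. In a real cluster the
        group's resources would then be restarted on the new owner.
        """
        self.nodes.discard(failed)
        if not self.nodes:
            raise RuntimeError("no surviving node to take over")
        survivor = sorted(self.nodes)[0]  # simplistic choice for the sketch
        moved = []
        for group in self.groups:
            if group.owner == failed:
                group.owner = survivor
                moved.append(group.name)
        return moved


if __name__ == "__main__":
    cluster = Cluster(
        nodes={"Server 1", "Server 2"},
        groups=[
            Group("Payroll", ["Drive E:", "IP address", "Database"], "Server 1"),
            Group("Web", ["Drive F:", "Web site"], "Server 2"),
        ],
    )
    print("moved:", cluster.fail_node("Server 1"))
    for g in cluster.groups:
        print(g.name, "is now owned by", g.owner)
```

Because ownership is tracked per group rather than per resource, a disk, an IP address, and the application that needs both can never end up on different nodes.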
MSCS Architecture (figure): Cluster administrator and Cluster.Exe, each with a Cluster API DLL; Cluster API stub; Cluster service components: Database Manager, Membership Manager, Global Update Manager, Failover Manager, Event Processor, Node Manager, Resource Manager, Object Manager, Log Manager, Checkpoint Manager; Reliable Cluster Transport + Heartbeat; Network; Resource monitors; Resource API; Physical resource DLL, Logical resource DLL, Application resource DLLs
The Cluster service comprises the following objects: Failover Manager (FM); Resource Manager (RM); Node Manager (NM); Membership Manager (MM); Event Processor (EP); Database Manager (DM); Object Manager (OM); Global Update Manager (GUM); Log Manager (LM); Checkpoint Manager (CM). More about these in the next session
Setting Up An MSCS Cluster
MSCS Key Components: Two servers (multi versus uniprocessor; heterogeneous servers); Shared SCSI bus (SCSI HBAs, SCSI RAID HBAs, HW RAID boxes); Interconnect (many types can be supported; remember, two NICs per node; PCI for cluster interconnect); Complete MSCS HCL configuration
MSCS Setup: Most common problems: duplicate SCSI IDs on adapters; incorrect SCSI cabling; SCSI card order on PCI bus; configuration of SCSI firmware. Let's walk through getting a cluster operational
Test Before You Build: Bring each system up independently; Network adapters (cluster interconnect; organization interconnect); SCSI and disk function; NTFS volume(s)
Top Ten Setup Concerns: 10. SCSI is not well known. Please use the MSCS and IHV setup documentation, and consider the SCSI book reference for this session. 9. Build a support model that will support clustering requirements. For example, in clustering, components are paired exactly (e.g., SCSI BIOS revision levels); include this in your plans. 8. Build extra time into your deployment planning to accommodate cluster setup, both for hardware and software. Hardware examples include SCSI setup; software issues include installation across cluster nodes. 7. Know the certification process and its support implications.
Top Ten Setup Concerns: 6. Applications will become more cluster-aware over time. This will include better setup, diagnostics, and documentation. In the meantime, plan and test accordingly. 5. Clustering will impact your server maintenance and upgrade methodologies. Plan accordingly. 4. Use multiple network adapters and hubs to eliminate single points of failure (everywhere possible). 3. Today's clustering solutions are more complex to install and configure than single servers. Plan your deployments accordingly. 2. Make sure that your cabinet solutions and peripherals both fit and function well. Consider the serviceability implications. 1. Cabling is a nightmare. Color-coded, heavily documented, Y-cable-inclusive, maintenance-designed products are highly desirable.
Cluster Management Tools: Cluster administrator (monitor and manage cluster); Cluster CLI/COM (command line and COM interface); Minor modifications to existing tools: Performance monitor (add ability to watch entire cluster); Disk administrator (add understanding of shared disks); Event logger (broadcast events to all nodes)
MSCS Reference Materials: In Search of Clusters: The Coming Battle in Lowly Parallel Computing, Gregory F. Pfister, ISBN; The Book of SCSI, Peter M. Ridge, ISBN
The Basics Of SCSI: Why SCSI?; Types of interfaces?; Caching and performance…; RAID; The future…
Why SCSI? Faster than IDE - intelligent card/drive; Uses less processor time; Can transfer data up to 100 MB/sec; More devices on a single chain - up to 15; Wider variety of devices (DASD; scanners; CD-ROM writers and optical drives; tape drives)
Types Of Interfaces: SCSI and SCSI II: 50-pin, 8-bit, max transfer = 10 MB/s (early 1.5 to 5 MB/s), internal transfer rate = 4 to 8 MB/s; Wide SCSI: 68-pin, 16-bit, max transfer = 20 MB/s, internal transfer rate = 7 to 15.5 MB/s; Ultra SCSI: 50-pin, 8-bit, higher transfer rate, max transfer = 20 MB/s, internal transfer rate = 7 to 15.5 MB/s; Ultra wide: 68-pin, 16-bit, max transfer = 40 MB/s, internal transfer rate = 7 to 30 MB/s
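The maximum transfer figures for each interface follow from bus width multiplied by transfer rate. The following is a minimal sketch of that arithmetic. The transfer rates (10 million transfers per second for the fast variants, 20 million for the Ultra variants) are not stated on the slide; they are the commonly cited values and are an assumption here, as is the function name.

```python
def max_transfer_mb_s(bus_width_bits: int, mega_transfers_per_s: float) -> float:
    """Peak bus throughput in MB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * mega_transfers_per_s


# (interface, bus width in bits, assumed millions of transfers per second)
INTERFACES = [
    ("SCSI II (fast)", 8, 10),
    ("Wide SCSI", 16, 10),
    ("Ultra SCSI", 8, 20),
    ("Ultra wide", 16, 20),
]

if __name__ == "__main__":
    for name, width, rate in INTERFACES:
        print(f"{name}: {max_transfer_mb_s(width, rate):.0f} MB/s peak")
```

These are bus ceilings. The internal transfer rates listed for each interface are lower because a drive's media rate, not the bus, usually limits sustained throughput.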
Performance Factors: Cache on the drive or controller; Caching in the OS; Different variables (seek time; transfer rates)
Redundant Array Of Inexpensive Disks (RAID): Developed from a paper published in 1987 at the University of California, Berkeley; The idea is to combine multiple inexpensive drives (eliminate SLED - single large expensive drive); Provides redundancy by storing parity information
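Storing parity information lets an array rebuild the contents of one failed drive from the survivors. The following is a minimal sketch of XOR parity over equal-sized blocks. The function name and the sample block contents are illustrative assumptions, and real RAID implementations add striping and parity placement that this sketch ignores.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks: list) -> bytes:
    """XOR equal-length byte blocks together, byte position by byte position."""
    if not blocks:
        raise ValueError("need at least one block")
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(xor, column) for column in zip(*blocks))


if __name__ == "__main__":
    # Three hypothetical data drives holding one block each.
    data = [b"disk", b"fail", b"over"]
    parity = xor_blocks(data)  # stored on a fourth drive

    # Drive 1 is lost; XOR of the survivors and the parity restores it.
    survivors = [data[0], data[2]]
    rebuilt = xor_blocks(survivors + [parity])
    print("rebuilt block matches the lost block:", rebuilt == data[1])
```

The same function computes the parity and performs the rebuild, because XOR is its own inverse: combining the parity with every surviving block cancels them out and leaves the missing one.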
The Future For SCSI: Faster interfaces - why?; Fibre Channel (optical standard; proposed as part of SCSI III, not final; up to 100 MB/s transfer; still using ultra-wide SCSI inside enclosures; drives with optical interfaces not yet available in quantity, higher cost than SCSI)
The Future Of SCSI: Fibre Channel-arbitrated loop (ring instead of bus architecture; can support up to 126 devices/hosts; hot pluggable through the use of a port bypass circuit; no disruption of the loop as devices are added/removed; generally implemented using a backplane design)
HCL List For MSCS: Servers on normal Windows NT HCL (self-test of MP machines soon); MSCS SCSI component HCL (tested by WHQL; must pass Windows NT HCT as well); MSCS interconnect HCL (tested by WHQL; not required to pass 100% of HCT, e.g., point-to-point adapters)
MSCS System Certification Process (figure): Windows NT 4.0+ Server HCL; SCSI HCL; Network HCL; MSCS SCSI HCL; Complete MSCS configuration ready for self-test
Testing Phases: HW compatibility (24 hours): SCSI and interconnect testing; One-node testing (24 hours): eight clients; Two-node with failover (72 hours): eight clients with asynchronous failovers; Stress testing (24 hours): dual initiator I/O, split-brain problems, simultaneous reboots
Final MSCS HCL: Only complete configurations are supported; Self-test results sent to Microsoft; Logs checked and configuration reviewed; HCL updated on Web and for next major Windows NT release; For more details see the MSCS Certification document