System and Communication Faults


1 System and Communication Faults

2 Topic 1: Ensuring Data Integrity
After completing this topic, you will be able to describe VERITAS recommendations for ensuring data integrity.

3 Ensuring Data Integrity
For VCS 4.x, use I/O fencing to protect data on shared storage that supports SCSI-3 persistent reservation (PR). For environments that do not have SCSI-3 PR support, VCS supports additional protection mechanisms for membership arbitration: redundant communication links; separate heartbeat infrastructures; jeopardy cluster membership; autodisabled service groups.

4 System Failure Example
[Diagram: systems S1, S2, and S3.] S3 faults; service group C is started on S1 or S2. Regular membership: S1, S2. No membership: S3.

5 Failover Duration on a System Failure
In the case of a system failure, service group failover time is the sum of the durations of these tasks: (1) detect the system failure (21 seconds for heartbeat timeouts); (2) select a failover target (less than one second); (3) bring the service group online on another system in the cluster.
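The sum described on this slide can be sketched as a small calculation. This is an illustration only: the function name and the 40-second online time are hypothetical, and the one-second selection time is used as an upper bound for "less than one second".

```python
# Rough estimate of service group failover time after a system failure.
# The 21 s detection time comes from the slide; everything else is assumed.
HEARTBEAT_TIMEOUT_S = 21.0  # detect the system failure (heartbeat timeouts)
TARGET_SELECTION_S = 1.0    # upper bound for "less than one second"


def failover_duration(online_time_s: float,
                      detect_s: float = HEARTBEAT_TIMEOUT_S,
                      select_s: float = TARGET_SELECTION_S) -> float:
    """Return detect + select + online time, in seconds."""
    if online_time_s < 0:
        raise ValueError("online_time_s must be non-negative")
    return detect_s + select_s + online_time_s


if __name__ == "__main__":
    # Hypothetical service group that needs 40 s to come online.
    print(failover_duration(40.0))
```

The online time usually dominates, since it depends on the resources in the service group rather than on the cluster itself.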

6 Topic 2: Cluster Interconnect Failures
After completing this topic, you will be able to describe how VCS responds to cluster interconnect failures.

7 Single LLT Link Remaining
[Diagram: systems S1, S2, and S3.] Regular membership: S1, S2, S3. Jeopardy membership: S3.

8 Jeopardy Membership
A special type of cluster membership called jeopardy is formed when one or more systems have only a single LLT link. Service groups continue to run, and the cluster functions normally. Failover due to resource faults and switching at operator request are unaffected. The service groups running on a system in jeopardy cannot fail over to another system if that system then faults or loses its last link. Reconnect the link to recover from the jeopardy condition.
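As a simplified model of the rule above, membership can be derived from the number of working LLT links per system. The function and key names are hypothetical, and this is not how GAB computes membership internally; it only mirrors the slide's description.

```python
from __future__ import annotations


def classify_membership(llt_links: dict[str, int]) -> dict[str, set[str]]:
    """Classify systems by their count of working LLT links.

    A system with at least one link is a regular member; a system with
    exactly one link is additionally in jeopardy membership.
    """
    if any(n < 0 for n in llt_links.values()):
        raise ValueError("link counts must be non-negative")
    return {
        "regular": {s for s, n in llt_links.items() if n >= 1},
        "jeopardy": {s for s, n in llt_links.items() if n == 1},
        "none": {s for s, n in llt_links.items() if n == 0},
    }


if __name__ == "__main__":
    # S3 is down to a single link, as in the "Single LLT Link Remaining" slide.
    print(classify_membership({"S1": 2, "S2": 2, "S3": 1}))
```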

9 Transition from Jeopardy to Network Partition
[Diagram: systems S1, S2, and S3 with service groups A, B, and C.] (1) Jeopardy membership: S3. (2) Mini-cluster with regular membership: S1, S2; mini-cluster with regular membership: S3; no jeopardy membership. (3) SGs autodisabled: A and B are autodisabled for S3; C is autodisabled for S1 and S2.
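The autodisable outcome in step 3 can be reproduced with a small model. This sketch assumes that A runs on S1, B on S2, and C on S3 (the slide text does not state the placement explicitly), and it treats a group as autodisabled for every system in a mini-cluster that cannot see the system where the group runs. The function name is hypothetical.

```python
from __future__ import annotations


def autodisabled_after_partition(running_on: dict[str, str],
                                 mini_clusters: list[set[str]]
                                 ) -> dict[str, set[str]]:
    """Map each system to the service groups autodisabled for it.

    running_on maps a service group to the system it is online on.
    A group is autodisabled for all systems in a mini-cluster that
    does not contain the system the group is running on.
    """
    result: dict[str, set[str]] = {}
    for cluster in mini_clusters:
        blocked = {g for g, s in running_on.items() if s not in cluster}
        for system in cluster:
            result[system] = set(blocked)
    return result


if __name__ == "__main__":
    placement = {"A": "S1", "B": "S2", "C": "S3"}  # assumed placement
    print(autodisabled_after_partition(placement, [{"S1", "S2"}, {"S3"}]))
```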

10 Recovering from a Network Partition
[Diagram: systems S1, S2, and S3 with service groups A, B, and C.] (1) Stop HAD on S3. (2) Fix LLT links. (3) Start HAD on S3. (4) A, B, and C are autoenabled by HAD: A and B are autoenabled for S3; C is autoenabled for S1 and S2. The mini-cluster with S1 and S2 continues to run.

11 Recovery Behavior
If you did not stop HAD before reconnecting the cluster interconnect after a network partition, VCS is automatically stopped and restarted as follows. Two-system cluster: the system with the lowest LLT node number continues to run VCS; VCS is stopped on the higher-numbered system. Multisystem cluster: the mini-cluster with the most systems running continues to run VCS; VCS is stopped on the systems in the smaller mini-clusters. If the cluster is split into two equal-size mini-clusters, the mini-cluster containing the lowest node number continues to run VCS.
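The selection rule above can be expressed compactly. This is a sketch of the rule as stated on the slide, not of the GAB implementation; the function name is hypothetical, and applying the lowest-node-number tiebreak to more than two equal-size mini-clusters is an assumption.

```python
from typing import Iterable, Set


def choose_survivor(mini_clusters: Iterable[Set[int]]) -> Set[int]:
    """Return the mini-cluster (set of LLT node numbers) that keeps running VCS.

    The largest mini-cluster wins; on a size tie, the one containing
    the lowest node number wins.
    """
    clusters = [set(c) for c in mini_clusters]
    if not clusters or any(not c for c in clusters):
        raise ValueError("need at least one non-empty mini-cluster")
    # Sort key: bigger size first, then smaller minimum node number.
    return max(clusters, key=lambda c: (len(c), -min(c)))


if __name__ == "__main__":
    print(choose_survivor([{0}, {1}]))        # two-system cluster
    print(choose_survivor([{0, 1}, {2}]))     # larger mini-cluster wins
    print(choose_survivor([{2, 3}, {0, 1}]))  # equal size: lowest node number
```

The two-system case falls out of the same rule: both mini-clusters have one system, so the tiebreak on the lowest node number decides.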

12 Modifying the Default Recovery Behavior
You can configure GAB to force an immediate reboot without a system shutdown when LLT links are reconnected after a network partition. Modify gabtab to start GAB with the -j option. For example:
gabconfig -c -n 2 -j
This causes the higher-numbered node to shut down if GAB tries to start after all LLT links simultaneously stop and then restart.

13 Potential Split Brain Condition
[Diagram: systems S1, S2, and S3.] (1) S1 and S2 determine that S3 is faulted; S3 determines that S1 and S2 are faulted. No jeopardy occurs, so no SGs are autodisabled. (2) If all systems are in the SystemList of every SG, VCS tries to bring the SGs online on a failover target.

14 Topic 3: Changing the Interconnect Configuration
After completing this topic, you will be able to change the cluster interconnect configuration.

15 Example Scenarios
These are some examples where you may need to change the cluster interconnect configuration: adding or removing cluster nodes; merging clusters; changing communication parameters, such as the heartbeat time interval; changing recovery behavior; changing or adding interfaces used for the cluster interconnect; configuring additional disk or network heartbeat links to increase heartbeat redundancy.

16 Example LLT Link Specification
/etc/llttab (platform labels on the slide: Solaris, AIX, HP-UX, Linux):
set-node S1
set-cluster 10
# Solaris example
link qfe0 /dev/qfe:0 - ether - -
link qfe4 /dev/qfe:4 - ether - -
link qfe5 /dev/qfe:5 - ether - -
link-lowpri eri0 /dev/eri:0 - ether - -
Fields of each link line, in order: tag name, device:unit, range (all), link type, SAP, MTU.
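To make the field layout concrete, here is a minimal parser sketch for the example above. It handles only the four directives shown on the slide (set-node, set-cluster, link, link-lowpri) and rejects anything else; the class and field names are hypothetical and are not part of any VCS tooling.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LltLink:
    tag: str          # tag name, e.g. qfe0
    device: str       # device:unit, e.g. /dev/qfe:0
    node_range: str   # "-" as shown on the slide under "Range (all)"
    link_type: str    # e.g. ether
    sap: str          # "-" in the slide example
    mtu: str          # "-" in the slide example
    low_priority: bool = False


@dataclass
class LltConfig:
    node: Optional[str] = None
    cluster: Optional[int] = None
    links: List[LltLink] = field(default_factory=list)


def parse_llttab(text: str) -> LltConfig:
    """Parse the subset of llttab directives used in the slide example."""
    cfg = LltConfig()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        directive, *args = line.split()
        if directive == "set-node" and len(args) == 1:
            cfg.node = args[0]
        elif directive == "set-cluster" and len(args) == 1:
            cfg.cluster = int(args[0])
        elif directive in ("link", "link-lowpri") and len(args) == 6:
            cfg.links.append(
                LltLink(*args, low_priority=(directive == "link-lowpri")))
        else:
            raise ValueError(f"unsupported llttab line: {raw!r}")
    return cfg


SAMPLE = """\
set-node S1
set-cluster 10
# Solaris example
link qfe0 /dev/qfe:0 - ether - -
link qfe4 /dev/qfe:4 - ether - -
link qfe5 /dev/qfe:5 - ether - -
link-lowpri eri0 /dev/eri:0 - ether - -
"""

if __name__ == "__main__":
    for link in parse_llttab(SAMPLE).links:
        print(link.tag, link.device, "low-priority" if link.low_priority else "")
```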

