FailSafe: SGI’s High Availability Solution
Mayank Vasa, MTS, Linux FailSafe Gatekeeper

FailSafe - What is it?
High availability for business-critical applications at a low cost
User-level software running in a clustered environment, providing
  – single point of failure recovery
  – cluster administration services GUI
  – a simple way to make applications HA aware

FailSafe - What it looks like

FailSafe - Terminology
Node : a single Linux image
Cluster : one or more nodes connected via some interconnect
Pool : entire set of nodes involved with a group of clusters
Node Membership : list of nodes in a cluster on which FailSafe can allocate resource groups

FailSafe - Terminology (contd.)
Process Membership : list of process instances in a cluster which form a process group
Resource : a single physical or logical entity
Resource Group : collection of interdependent resources
  – Cannot overlap
  – Behaves like an atomic unit of failover
  – Must have a unique name throughout the cluster
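
The two resource group rules above (no overlap, unique names) can be checked mechanically. The following is a minimal sketch, not FailSafe code: the function name `validate_resource_groups` and the list-of-pairs input format are assumptions made for illustration.

```python
from collections import Counter


def validate_resource_groups(groups):
    """Check (group_name, resources) pairs against the two rules.

    Hypothetical helper: returns a list of human-readable error strings,
    empty when every group name is unique and no resource is shared.
    """
    errors = []

    # Rule: a resource group name must be unique throughout the cluster.
    name_counts = Counter(name for name, _ in groups)
    for name, count in sorted(name_counts.items()):
        if count > 1:
            errors.append(f"duplicate group name: {name}")

    # Rule: resource groups cannot overlap (a resource has one owner).
    owner = {}
    for name, resources in groups:
        for res in resources:
            if res in owner and owner[res] != name:
                errors.append(
                    f"resource {res} is in both {owner[res]} and {name}"
                )
            else:
                owner.setdefault(res, name)
    return errors
```

A configuration tool could run such a check before accepting a new group definition, so that a failover never has to move a resource that two groups claim.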

FailSafe - Terminology (contd.)
Failover : process of moving a resource group from one node to another
Failover Policy : method used by FailSafe to determine the destination node of a failover
Failover Domain : ordered list of nodes on which a given resource group can be allocated

FailSafe - Terminology (contd.)
Failover Attributes : Auto Failback, Controlled Failback, InPlace Recovery
Failover policy script : shell script which generates an ordered set of node names on which the resource group can be placed
Action scripts : scripts which determine how a resource is started, stopped and monitored
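
To illustrate how a failover policy script could turn a failover domain into a destination node, here is a minimal sketch. Real policy scripts are shell scripts; this Python stand-in, the names `ordered_policy` and `pick_destination`, and the "first surviving node in domain order" rule are all assumptions for illustration.

```python
def ordered_policy(failover_domain, members, failed_node=None):
    """Return candidate nodes in failover-domain order.

    failover_domain: ordered list of node names for the resource group.
    members: set of nodes currently in the cluster membership.
    failed_node: node the group is being moved away from, if any.
    """
    return [
        node
        for node in failover_domain
        if node in members and node != failed_node
    ]


def pick_destination(failover_domain, members, failed_node=None):
    """Pick the first candidate, or None if no node can host the group."""
    candidates = ordered_policy(failover_domain, members, failed_node)
    return candidates[0] if candidates else None
```

For example, with a domain of `["n1", "n2", "n3"]`, all three nodes up, and `n1` failing, the sketch picks `n2`.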

FailSafe - Architecture
Cluster Administration services (CAS) {CAD, CDBD, CDB}
Cluster Infrastructure (CI) {CMS, GCS, SRM, CRS}
FailSafe Cluster Manager GUI and CLI

FailSafe - Acronyms (so many!)
CMS = Cluster Membership Service
GCS = Group Communication Service
SRM = System Resource Manager
CRS = Cluster Reset Service
CAD = Cluster Administration Daemon
CDB = Cluster Database
CDBD = Cluster Database Daemon

FailSafe - Cluster Database
Repository for all cluster configuration
Dynamic changes supported
Consistency is automatically supported
Replicated in all nodes of the pool
Provides read and write transactional semantics

FailSafe - Cluster Database Daemon
Controls read and write accesses to the CDB
Notifies clients of dynamic changes to the CDB
Keeps global portions of the CDB consistent across the pool
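
The "notifies clients of dynamic changes" role can be pictured as a publish/subscribe store. This is a single-process sketch only: the class `ConfigStore`, its method names, and the key format are hypothetical, and the replication across the pool that CDBD performs is not modelled.

```python
class ConfigStore:
    """Toy configuration store that tells subscribers about changes."""

    def __init__(self):
        self._data = {}
        self._subscribers = []

    def subscribe(self, callback):
        """Register callback(key, old_value, new_value) for every change."""
        self._subscribers.append(callback)

    def read(self, key):
        """Return the stored value, or None if the key is unset."""
        return self._data.get(key)

    def write(self, key, value):
        """Store value; notify subscribers only if it actually changed."""
        old = self._data.get(key)
        if old == value:
            return False
        self._data[key] = value
        for callback in self._subscribers:
            callback(key, old, value)
        return True
```

A GUI-updating client in the style of CAD would subscribe once and redraw whenever its callback fires, instead of polling the database.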

FailSafe - Cluster Administration Daemon
Daemon responsible for dynamically updating the GUI
CAD is a client of CDBD
CDBD notifies CAD of any changes
Provides notification (default = ) of status changes in node, cluster or resource groups

FailSafe - Cluster Membership Service
Provides cluster node membership information to its clients
Node membership information includes
  – Nodes that are currently part of the cluster
  – Node status, i.e. up, down or unknown
  – Node name
  – IP address currently being used for inter-CMSD communication
Inactive cluster node membership information is also provided

FailSafe - Cluster Membership Service (contd.)
Any change in cluster status results in a node membership message issued by CMSD to its clients on all nodes of the cluster
CMSD implements failstop and quorum policy
CMSDs monitor each other by exchanging heartbeat messages directly with each other
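
Heartbeat monitoring and a quorum check can be sketched in a few lines. The slides do not state the timeout or the quorum rule, so the strict-majority rule and the names `live_nodes` and `has_quorum` below are assumptions, not a description of CMSD.

```python
def live_nodes(last_heartbeat, now, timeout):
    """Return the nodes whose last heartbeat is at most `timeout` old.

    last_heartbeat: dict mapping node name -> time of last heartbeat seen.
    now, timeout: numbers in the same time unit as the dict values.
    """
    return {
        node
        for node, seen_at in last_heartbeat.items()
        if now - seen_at <= timeout
    }


def has_quorum(live, configured):
    """Assumed rule: strictly more than half of the configured nodes are live."""
    return len(live & configured) > len(configured) / 2
```

Under this assumed rule, a two-way split of a four-node cluster leaves neither half with quorum, which is the situation a failstop policy is meant to handle safely.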

FailSafe - Group Communication Service
Provides a consistent view of process group memberships in presence of process failures, new processes joining, and changing node memberships
Provides a reliable ordered atomic messaging service to members of the process group under changing node and group memberships
GCS operates in the context of a cluster as defined by CMS
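
The "ordered" part of ordered messaging can be illustrated with a hold-back queue on the receiving side. This sketch assumes every message already carries a global sequence number, which the slides do not say; it shows ordering and duplicate suppression only, not reliability or atomic delivery across members. The class name `OrderedReceiver` is hypothetical.

```python
class OrderedReceiver:
    """Deliver messages strictly in sequence-number order."""

    def __init__(self):
        self.next_seq = 0      # next sequence number we may deliver
        self.pending = {}      # out-of-order messages held back, by seq
        self.delivered = []    # messages handed to the application, in order

    def receive(self, seq, msg):
        # Ignore duplicates: already delivered or already held back.
        if seq < self.next_seq or seq in self.pending:
            return
        self.pending[seq] = msg
        # Deliver every message that is now contiguous with what we have.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
```

If every group member applies the same rule to the same numbered stream, all members see the messages in the same order even when the network reorders them.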

FailSafe - System Resource Manager
Manages the resources and resource groups in a cluster
Coordinates access to physically shared resources
Monitors availability of resources
Performs local failover of resources
Maps a set of resources into a resource group
Atomically allocates resource groups
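
One way to read "atomically allocates resource groups" is all-or-nothing startup: either every resource in the group starts, or the ones already started are stopped again. The sketch below shows that pattern under this assumption; `allocate_group` and its `start`/`stop` callables are hypothetical and do not reflect SRM's actual interface.

```python
def allocate_group(resources, start, stop):
    """Start all resources in order, or roll back and report failure.

    resources: ordered list of resource names in one resource group.
    start(res): returns True if the resource came up.
    stop(res): releases a resource that was started earlier.
    """
    started = []
    for res in resources:
        if start(res):
            started.append(res)
        else:
            # Undo in reverse order so dependencies are released last.
            for done in reversed(started):
                stop(done)
            return False
    return True
```

Rolling back in reverse order matters when later resources depend on earlier ones, for example a filesystem that sits on a volume.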

FailSafe - Failsafe Daemon
A policy implementor for Resource Groups (RG)
Provides the ability to enable/disable monitoring an application dynamically
Provides ability to failover an application if monitoring fails
Failover can be either local (restart) or remote
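
The choice between local (restart) and remote failover can be sketched as a small decision function. The retry count, the "offline" outcome, and the name `handle_monitor_failure` are assumptions for illustration; the slides do not describe how the daemon actually decides.

```python
def handle_monitor_failure(restart_locally, candidates, max_local_restarts=1):
    """Decide what to do after a monitor failure.

    restart_locally(): returns True if restarting on the same node worked.
    candidates: ordered list of other nodes that could host the group.
    Returns ("local", None), ("remote", node) or ("offline", None).
    """
    # Try the cheap option first: restart in place.
    for _ in range(max_local_restarts):
        if restart_locally():
            return ("local", None)
    # Fall back to moving the group to the first candidate node.
    if candidates:
        return ("remote", candidates[0])
    # Nothing left to try (hypothetical terminal state).
    return ("offline", None)
```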

FailSafe - Failsafe Daemon (contd.)
Failover Policy Module (PM)
PM’s components
  – Failover script
  – Initial Failover Domain
  – Attributes

FailSafe - Cluster Reset Service
Provides reset facility in a cluster upon request from one of its clients
Provides facility to monitor each reset line that connects to a machine that it is expected to reset
Special reset network to ensure connectivity for resetting remote machines

FailSafe - Agents
Glue between a resource type and the Failsafe daemon
Collection of action scripts and binaries that the action scripts could be calling
Goal : make a resource a highly available service
Examples : a file server agent, a web server agent, an agent for making an IP address, a filesystem or a volume highly available

FailSafe - Action Scripts
Determine how a resource is started, stopped and monitored
Action scripts are per resource type
Types : start, stop, monitor, exclusive, restart
Returns status for each resource acted on
Called by SRM
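
The per-resource status reporting described above can be sketched as a small dispatcher. Real action scripts are separate scripts per resource type; this Python stand-in, the `handlers` mapping, and the "ok"/"failed" status strings are assumptions for illustration.

```python
ACTIONS = ("start", "stop", "monitor", "exclusive", "restart")


def run_action(handlers, action, resources):
    """Run one action over several resources of the same type.

    handlers: dict mapping an action name to a callable taking a resource
              name and returning True on success (hypothetical interface).
    Returns a dict mapping each resource to "ok" or "failed".
    """
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    handler = handlers[action]
    statuses = {}
    for res in resources:
        try:
            statuses[res] = "ok" if handler(res) else "failed"
        except Exception:
            # A crashing handler counts as a failure for that resource only.
            statuses[res] = "failed"
    return statuses
```

Reporting a status per resource, rather than one status per call, lets the caller react to the specific resource that failed.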

FailSafe - Related HA Technologies
A journaled file system for fast recovery
FailSafe can support multiple journaled filesystems such as XFS, GFS, ext3fs
Volume manager for disk failures (lvm)
Network mirroring
Monitoring tool (mon)

FailSafe - Docs, Contacts
Documentation :
Contact :

FailSafe - Q & A
Questions - Sure!
Answers …. Well maybe :)