ENABLING HIGHLY AVAILABLE GRID SITES
Michael Bryant, Louisiana Tech University

Section One: Grid Computing and High Availability
• What is High Availability?
• Current State of High Availability in Grid Computing
• Site-level High Availability of Grid Resources and Services
4/5/2007 Louisiana Tech University

What exactly is High Availability?

High availability is a system design protocol and associated implementation that ensures a certain degree of operational continuity during a given measurement period.

Simply put, high availability means increasing the availability of the resources and services that a system provides. To increase availability, you must first eliminate single points of failure…
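The "degree of operational continuity" is usually quantified as the fraction of time a system is up, and each additional "nine" of availability cuts the allowed downtime by a factor of ten. The short Python calculation below is illustrative only (it is not part of the original slides) and shows the yearly downtime budget for common availability targets.

# Illustrative only: downtime budget per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(availability: float) -> float:
    """Return the allowed downtime (minutes/year) for an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.3%} availability -> "
          f"{downtime_minutes_per_year(target):8.1f} minutes of downtime per year")

# Sample output:
# 99.000% availability ->   5256.0 minutes of downtime per year
# 99.900% availability ->    525.6 minutes of downtime per year
# 99.990% availability ->     52.6 minutes of downtime per year
# 99.999% availability ->      5.3 minutes of downtime per year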

Single Points of Failure: Single headnode scenario
(Diagram: one headnode running Condor, Globus, MonALISA, SSH, etc.)
4/5/2007 Louisiana Tech University

Single Points of Failure: Dual headnode scenario
(Diagram: primary and standby headnodes, with optional image servers)
4/5/2007 Louisiana Tech University

Current State of High Availability in Grid Computing

Where are HA solutions used, and why not in grids?
Commonplace:
• Clusters
  - Many commercial solutions
  - Few open source solutions
• Servers
  - Web/mail/storage/etc.
  - Downtime is costly
• Mission-critical applications
• HA is becoming more important with the increased usage of Linux, which decreases cost
HA in Grids:
• Most grid sites are not HA, but many do use some kind of fault-tolerant technique (RAID, SAN)
• The complexity of grids slows the adoption of HA
• Common HA techniques don't translate well to grids
• Site-to-site failovers are problematic
  - Each site is different (hardware, networks, policies, etc.)
• Grids connect over WANs
  - Short network timeouts are possible (triggering false replication of services)
  - IP failover techniques don't work (different subnets)
4/5/2007 Louisiana Tech University

So, are highly available grids even possible?

Most certainly.

Site-level High Availability of Grid Resources and Services

Highly Available Grid Resources and Services
• Using component redundancy and self-healing capabilities, we can eliminate single points of failure at grid sites.
• A site refers to an organization, such as an institution or corporation, that lends its computing resources to a grid.
• These resources are accessible through grid services set up at the site.
• By eliminating single points of failure, we can make the grid resources and services of a particular site highly available.
4/5/2007 Louisiana Tech University

What is site- vs. grid-level HA?
• Grid-level high availability
  - Site-to-site failovers (Linux-HA claims it can do this in v2.0.8)
  - Failover of idle/running jobs to designated grid sites
  - Allowing new jobs to be submitted to other sites (fault-tolerant grid resource selector)
• Site-level high availability
  - Removes single points of failure at cluster-based grid sites
  - Deals directly with the site's resources and services, making them highly available
(Diagram: a grid site and its resources and services)
4/5/2007 Louisiana Tech University

Thus, if grid sites are made highly available, it is then possible to have a grid that is highly available.

Section Two: High Availability Techniques
• Mon and Net-SNMP
• Linux-HA
• Application HA (Condor, PBS)
4/5/2007 Louisiana Tech University

HA Technique: Mon and Net-SNMP

What are Mon and Net-SNMP?
• Underneath, HA-OSCAR uses two main components:
  - mon – "a general-purpose scheduler and alert management tool used for monitoring service availability and triggering alerts upon failure detection"
  - Net-SNMP – a suite of applications implementing SNMP, "a widely used protocol for monitoring the health and welfare of network equipment (e.g. routers), computer equipment and even devices like UPSs"
4/5/2007 Louisiana Tech University

How Mon and Net-SNMP work together… (see the sketch after this list)
• On the standby headnode:
  - Mon uses fping every 15 seconds to check whether the primary is down
  - Failover: if the primary is down, mon calls a set of alert scripts that confirm and act on the failure
      The standby takes over the primary's IP address
      All services that were running on the primary are started on the standby
      The PBS/Torque queue is transferred over using rsync, and jobs are restarted from the last checkpoint
  - Failback: when the primary is back online, mon reverses the process
• On the primary headnode:
  - Mon uses Net-SNMP to check whether each service is running
  - Mon automatically restarts any service that is not running
4/5/2007 Louisiana Tech University
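The Python sketch below illustrates the standby-side loop described above. It is not mon or HA-OSCAR code (mon is driven by its own configuration and alert scripts); the hostnames, service IP, network interface, and service names are assumptions made only to show the failover/failback flow.

"""Illustrative sketch of the standby headnode's monitoring/failover loop."""
import subprocess
import time

PRIMARY_HOST = "primary.example.edu"   # hypothetical primary headnode
SERVICE_IP = "192.168.1.10"            # hypothetical IP the standby claims on failover
CHECK_INTERVAL = 15                    # seconds, per the slide

def primary_is_alive() -> bool:
    """Check the primary with fping, as mon does from the standby."""
    return subprocess.run(["fping", "-q", PRIMARY_HOST], capture_output=True).returncode == 0

def fail_over():
    """Rough equivalent of the alert scripts mon runs once a failure is confirmed."""
    # Take over the primary's service IP (interface name is an assumption).
    subprocess.run(["ip", "addr", "add", f"{SERVICE_IP}/24", "dev", "eth0"], check=False)
    # Start the services that normally run on the primary (example names only).
    for service in ("pbs_server", "condor", "globus-gatekeeper"):
        subprocess.run(["service", service, "start"], check=False)
    # The PBS/Torque queue, kept in sync with rsync, would now be restarted
    # from the last checkpoint (see the PBS/Torque failover slide).

def fail_back():
    """Reverse the failover once the primary returns, as the slide describes."""
    subprocess.run(["ip", "addr", "del", f"{SERVICE_IP}/24", "dev", "eth0"], check=False)
    for service in ("pbs_server", "condor", "globus-gatekeeper"):
        subprocess.run(["service", service, "stop"], check=False)

failed_over = False
while True:
    if not primary_is_alive() and not failed_over:
        fail_over()
        failed_over = True
    elif primary_is_alive() and failed_over:
        fail_back()
        failed_over = False
    time.sleep(CHECK_INTERVAL)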

HA Technique: Linux-HA

What is Linux-HA?
• A high-availability package for UNIX/Linux called Heartbeat, which consists of many HA components:
  - Heartbeat program
  - Local resource manager
  - Cluster resource manager
  - Cluster information base
  - STONITH daemon (restarts a failed node)
• Works similarly to Mon/Net-SNMP, except it has more features and is more widely known.
• Comparable to most commercial HA solutions in terms of ease of use and features.
• Researching how it compares to the Mon/Net-SNMP technique
4/5/2007 Louisiana Tech University

HA Technique: Application HA (Condor, PBS)

Highly Available Condor Daemons (see the lock-file sketch after this list)
• Condor has built-in HA mechanisms (the condor_had daemon)
• Condor can fail over and fail back both the central manager daemons (condor_negotiator and condor_collector) and the job queue (condor_schedd).
• Central manager failover:
  - The CM daemons are the heart of the Condor matchmaking system and are critical to the pool's functionality.
  - Can handle network partitioning
  - Note: does not work with flocking!
• Job queue failover:
  - Jobs that are stopped without finishing can be restarted from the beginning, or can continue execution from the most recent checkpoint.
  - New jobs can enter the job queue.
  - Uses a lock file on a shared file system to prevent multiple condor_schedd daemons from running at the same time.
• Researching the possibility of using this in the OSG
4/5/2007 Louisiana Tech University
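The shared-file-system lock mentioned above can be pictured with a generic sketch: whichever headnode holds an exclusive lock on a file in the shared spool is allowed to run the queue daemon, and the standby takes over only when that lock becomes free. This is not Condor's actual implementation (condor_had and the schedd HA configuration handle this internally); the lock path and the foreground daemon command below are placeholders.

import fcntl
import subprocess
import time

LOCK_PATH = "/shared/spool/schedd.lock"   # placeholder: lock file on a shared file system

def run_when_lock_acquired():
    """Run the queue daemon only while this node holds the exclusive lock."""
    lock_file = open(LOCK_PATH, "w")
    while True:
        try:
            # Non-blocking exclusive lock: raises if the other headnode already holds it.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
            break
        except BlockingIOError:
            time.sleep(30)   # the other headnode is healthy; keep waiting
    # Lock acquired: this node may now run the single allowed instance of the daemon.
    # Running it in the foreground keeps this process (and therefore the lock) alive.
    subprocess.run(["condor_master", "-f"], check=False)   # placeholder daemon command
    # When this process exits or the node dies, the lock is released and the
    # standby headnode can acquire it and take over the job queue.

run_when_lock_acquired()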

PBS/Torque Job Failover (see the sketch after this list)
• Developed here at Tech by graduate students
• Currently polishing the package for inclusion in HA-OSCAR
• Job queue failover:
  - The PBS job queue is rsync'd from the primary to the standby headnode periodically, preserving both idle and running jobs.
  - The standby's copies of the jobs are put in the held state, while the primary's jobs remain running.
  - When the primary fails, the jobs on the standby are changed to a running state.
• Job checkpointing and failover:
  - Jobs are checkpointed on worker nodes and rsync'd to a file server.
  - If a worker node fails, the running jobs on that node are started on a spare node from the last checkpoint, or restarted from the beginning if no checkpoint exists.
  - Uses the BLCR (Berkeley Lab Checkpoint/Restart) checkpointing software
4/5/2007 Louisiana Tech University
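A minimal sketch of the queue-failover scheme described above: the standby periodically mirrors the primary's queue, keeps its copies held, and releases them when the primary stops responding. This is not the LTU/HA-OSCAR package itself; the hostnames, spool path, sync interval, and the use of the standard PBS commands qselect, qhold, and qrls are assumptions made for illustration.

import subprocess
import time

PRIMARY = "primary.example.edu"                 # hypothetical primary headnode
SPOOL = "/var/spool/torque/server_priv/"        # typical Torque spool path (assumption)
SYNC_INTERVAL = 60                              # seconds between queue mirrors (assumption)

def mirror_queue():
    """Copy the primary's PBS queue to this standby node."""
    subprocess.run(["rsync", "-a", "--delete", f"{PRIMARY}:{SPOOL}", SPOOL], check=False)

def hold_all_local_jobs():
    """Keep the standby's copies of the jobs from running while the primary is healthy."""
    job_ids = subprocess.run(["qselect"], capture_output=True, text=True).stdout.split()
    for job_id in job_ids:
        subprocess.run(["qhold", job_id], check=False)

def release_all_local_jobs():
    """On failover, let the standby's copies of the jobs run."""
    job_ids = subprocess.run(["qselect"], capture_output=True, text=True).stdout.split()
    for job_id in job_ids:
        subprocess.run(["qrls", job_id], check=False)

def primary_alive() -> bool:
    return subprocess.run(["fping", "-q", PRIMARY], capture_output=True).returncode == 0

while True:
    if primary_alive():
        mirror_queue()
        hold_all_local_jobs()
    else:
        release_all_local_jobs()   # failover: standby jobs go from held to runnable
        break
    time.sleep(SYNC_INTERVAL)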

Section Three: High Availability Solutions
• HA-OSCAR
• HA-ROCKS
• Creating a grid-centric HA OSG package
4/5/2007 Louisiana Tech University

HA Solution: HA-OSCAR

HA-enabled OSCAR Clusters
• Built on top of OSCAR (Open Source Cluster Application Resources), an open source cluster management tool (similar to ROCKS)
• Developed by Dr. Box and his students
• Provides an open source HA solution that is easy to install and set up
• OSCAR installs on RHEL, Fedora, SUSE, and others
• Component redundancy is adopted to eliminate single points of failure
• HA-OSCAR incorporates a self-healing mechanism: failure detection and recovery, automatic failover and failback
• Stable release 1.2 works with OSCAR 5.0
4/5/2007 Louisiana Tech University

HA-OSCAR (continued)
• Self-healing, with a 5-20 second automatic failover time
• The first known field-grade open source HA Beowulf cluster release
4/5/2007 Louisiana Tech University

HA Solution: HA-ROCKS

HA-enabled ROCKS Clusters
• HA-ROCKS was in development by Dr. Box and students, but the focus shifted back to OSCAR because of a lack of interest.
• Uses SIS, the System Installation Suite, for cloning the primary headnode. This is what OSCAR uses, too.
• Plans to resume development when time permits
4/5/2007 Louisiana Tech University

Creating a grid-centric HA OSG package

Bringing HA to all OSG sites
• Our first step toward making a truly highly available grid is to develop an HA solution for the OSG, which will contain:
  - An easy-to-use installer to build an HA cluster
  - A standard HA package that can be installed on any Linux distribution (ROCKS, RHEL, Debian, SUSE, etc.)
  - All HA functionality included in HA-OSCAR
  - Documentation on how to set up an HA cluster using OSCAR, ROCKS, RH Cluster Manager, etc.
• This HA solution will not depend on a cluster suite, so building highly available OSG headnodes without a cluster, such as LTU_CCT, will be possible.
4/5/2007 Louisiana Tech University

Section Four: In closing…
• Conclusion
• Questions/Comments?
• Links
4/5/2007 Louisiana Tech University

Conclusion
• Highly available grids can be enabled by reducing single points of failure at grid sites.
• At the same time, eliminating many of these single points of failure will make grids more reliable.
• Open source HA tools that target mission-critical applications, such as HA-OSCAR, already exist.
• In the coming months, expect to see Louisiana Tech playing a bigger role in high availability and grid computing.
4/5/2007 Louisiana Tech University

Questions/Comments?
4/5/2007 Louisiana Tech University

Links
• HA-OSCAR
• Linux-HA
• Condor-HAD
• Mon
• Net-SNMP
4/5/2007 Louisiana Tech University