Adding High Availability to Condor Central Manager
Gabi Kliot
Technion – Israel Institute of Technology
Condor Week, March 2005 © Gabriel Kliot, Technion

Outline
- Current Condor pool
- Motivation for Highly Available Central Manager
- The solution - HA Daemon
- Performance impact
- Testing
- Future work

Current Condor Pool
[Diagram: the Central Manager runs the Collector and Negotiator; the other machines in the pool run Startd and Schedd]

Why a Highly Available Central Manager?
- The central manager is a single point of failure
  - If the Negotiator fails, no additional matches are possible
  - If the Collector fails, the Negotiator has nothing to match against, tools that query the Collector stop working, etc.
- Our goal: allow the pool to keep functioning in case of a failure

Highly Available Condor Pool
[Diagram: the Highly Available Central Manager consists of one Active Central Manager and two Idle (backup) Central Managers; the other machines in the pool run Startd and Schedd]

Highly Available Central Manager
Our solution, the Highly Available Central Manager, provides:
- Automatic failure detection
- Transparent failover to a backup matchmaker (no global configuration change for the pool entities)
- "Split brain" reconciliation after network partitions
- State replication between the active and the backups
- No changes to Negotiator/Collector code

How it works
- The Collector's HA is provided by redundancy (see the illustrative sketch below)
- The Negotiator's HA is provided by the HA daemons (HADs)
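The two parts of the central manager get their high availability in different ways: the Collector is simply replicated (the pool daemons advertise to all of the redundant collectors, and a client that cannot reach one collector falls back to another), while the Negotiator must run in exactly one place at a time, so it is brought up and down by the winning HA daemon. Below is a minimal, purely illustrative Python sketch of the redundancy side; the host names, port, and raw-socket request are made up for the example and are not how Condor tools actually talk to a collector.

```python
import socket

# Hypothetical addresses of the redundant collectors (one per central-manager
# candidate); in a real pool these would come from the Condor configuration.
COLLECTORS = ["cm1.example.org:9618", "cm2.example.org:9618", "cm3.example.org:9618"]

def query_any_collector(request: bytes, timeout_s: float = 2.0) -> bytes:
    """Try each redundant collector in turn; the query succeeds as long as
    at least one of them is still reachable."""
    for address in COLLECTORS:
        host, port = address.rsplit(":", 1)
        try:
            with socket.create_connection((host, int(port)), timeout=timeout_s) as conn:
                conn.sendall(request)
                return conn.recv(4096)
        except OSError:
            continue  # this collector is down or unreachable -- try the next one
    raise RuntimeError("no collector reachable")
```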

How it works – Election
[Diagram: the HAD on each central-manager candidate exchanges Election messages with the others; the workstations run Startd and Schedd]

How it works – Election
[Diagram: the elected leader HAD starts the Negotiator on its machine and broadcasts "I'm alive" messages to the other HADs]

How it works – basic scenario
[Diagram: the Active CM runs the leader HAD plus the Negotiator and keeps sending "I'm alive" messages; the Idle CMs run backup HADs and Collectors; the workstations run Startd and Schedd]

How it works – crash event
[Diagram: the Active CM crashes, so its "I'm alive" messages stop reaching the Idle CMs]

How it works – crash event (continued)
[Diagram: the backup HADs detect the missing "I'm alive" messages and start a new Election]

How it works – crash event (continued)
[Diagram: the winning HAD on one of the formerly Idle CMs becomes the new leader, starts a Negotiator there, and broadcasts "I'm alive"; that machine is now the Active CM]

High Availability Daemon – state machine
[State-machine diagram of the HAD; not reproduced in the transcript. A rough sketch follows below.]
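The sketch below is a rough Python reconstruction of how such a daemon could behave, combining the election, the "I'm alive" heartbeats, and the crash and split-brain handling described on the preceding slides. It is an illustration only, not the actual HAD implementation: the state names, the rank-based tie-breaking, and the 10-second heartbeat timeout are all assumptions made for the sketch.

```python
import time
from enum import Enum, auto

class HadState(Enum):
    PASSIVE = auto()    # a backup: watch the leader's "I'm alive" messages
    ELECTION = auto()   # no live leader known: compete for leadership
    LEADER = auto()     # the active CM: run the Negotiator, send heartbeats

class HighAvailabilityDaemon:
    """Illustrative sketch of one HAD instance.  `rank` stands in for whatever
    total order the real protocol uses to break ties (for example, the position
    of this machine in the configured list of central managers)."""

    def __init__(self, rank: int, peers: list, heartbeat_timeout_s: float = 10.0):
        self.rank = rank
        self.peers = peers                       # the other HAD instances
        self.heartbeat_timeout_s = heartbeat_timeout_s
        self.state = HadState.ELECTION           # no leader known at start-up
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self, leader_rank: int) -> None:
        """Another HAD claims leadership with an "I'm alive" message."""
        self.last_heartbeat = time.monotonic()
        if self.state is HadState.LEADER and leader_rank < self.rank:
            # Split-brain reconciliation after a partition heals:
            # the lower-priority of the two leaders steps down.
            self.stop_negotiator()
            self.state = HadState.PASSIVE
        elif self.state is not HadState.LEADER:
            self.state = HadState.PASSIVE

    def tick(self) -> None:
        """Periodic timer: detect a dead leader, re-elect, send heartbeats."""
        silence = time.monotonic() - self.last_heartbeat
        if self.state is HadState.PASSIVE and silence > self.heartbeat_timeout_s:
            self.state = HadState.ELECTION       # the active CM has crashed
        if self.state is HadState.ELECTION and self.wins_election():
            self.start_negotiator()              # failover: this becomes the Active CM
            self.state = HadState.LEADER
        if self.state is HadState.LEADER:
            for peer in self.peers:
                peer.on_heartbeat(self.rank)     # broadcast "I'm alive"

    def wins_election(self) -> bool:
        # Placeholder rule for the sketch: the lowest rank wins; a real
        # election would only count the HADs that actually answer.
        return all(self.rank <= peer.rank for peer in self.peers)

    def start_negotiator(self) -> None: ...      # in practice delegated to the local condor_master
    def stop_negotiator(self) -> None: ...
```

Because only the machine whose HAD currently holds leadership runs a Negotiator, the rest of the pool never needs to be reconfigured when the active central manager changes, which is what makes the failover transparent.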

Performance impact
- Stabilization time: the time it takes the HA daemons to detect a failure and recover from it
  - Depends on the number of CMs and on network performance
- HAD_CONNECT_TIMEOUT: the time it takes to establish a TCP connection (depends on the network type, the presence of encryption, etc.)
- Assuming it takes up to 2 seconds to establish a TCP connection and 2 CMs are used, a new Negotiator is up and running after 48 seconds (see the arithmetic sketch below)
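The 48-second figure is consistent with a stabilization time that grows linearly in both HAD_CONNECT_TIMEOUT and the number of central managers. The constant factor of 12 in the sketch below is inferred purely from the numbers quoted on this slide (12 × 2 CMs × 2 s = 48 s); it is an assumption, not something stated in the talk.

```python
def stabilization_time_s(num_cms: int, had_connect_timeout_s: float,
                         rounds: int = 12) -> float:
    """Rough upper bound on failover time: the HADs exchange several
    connect-timeout-bounded messages per central manager before a new
    leader (and hence a new Negotiator) is up.  The factor `rounds` is an
    assumption chosen to reproduce the figure quoted on the slide."""
    return rounds * num_cms * had_connect_timeout_s

print(stabilization_time_s(num_cms=2, had_connect_timeout_s=2.0))  # 48.0 seconds
```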

Testing
- A dedicated automatic distributed testing framework was built, simulating:
  - node crashes
  - network disconnections
  - network partitions and merges
- Extensive testing effort:
  - distributed testing on 5 Linux machines at the Technion
  - interactive distributed testing in the Wisconsin pool
  - automatic testing with the NMI framework
- Already deployed and fully functioning for 3 weeks on our production pool at the Technion

Future development
- HAD publishing in the Collectors (condor_status -had)
- Accounting file replication (the current solution relies on NFS)
- Software High Availability

Collaboration with the Condor team
- Compliance with the high Condor coding standards: peer-reviewed code
- Integration with the NMI framework: automation of testing
- Open-minded attitude of the Condor team to numerous requests and questions
- A unique experience of working with a large peer-managed group of talented programmers

Collaboration with the Condor team
This work was a collaborative effort of:
- the Distributed Systems Laboratory at the Technion: Prof. Assaf Schuster, Mark Silberstein, Gabi Kliot, Svetlana Kantorovitch, Dedi Carmeli, Artiom Sharov
- the Condor team: Prof. Miron Livny, Nick, Todd, Derek, Erik, Carey, Peter, Becky, Parag, Zack, Dan

- Part of the official development release
- Full support by the Technion team
- More information: s/ha/ha.html
- More details and configuration in my tutorial tomorrow
- Contact:
- You should definitely try it!

If time permits

Condor week – March 2005©Gabriel Kliot, Technion22 How it works – basic scenario

Condor week – March 2005©Gabriel Kliot, Technion23 How it works – crash event

Usability and administration
- Configuration sanity-check Perl script
- Disable-HAD Perl script
- Detailed manual section
- Full support by the Technion team