Adding High Availability to Condor Central Manager
Artyom Sharov, Technion – Israel Institute of Technology, Haifa
Condor Week, April 2006

Slide 2: Condor Pool without High Availability
[Diagram: a single Central Manager runs the Collector and Negotiator; the pool's Startd and Schedd daemons all depend on it.]

Slide 3: Why a Highly Available CM?
- The Central Manager is a single point of failure; while it is down:
  - no additional matches are possible
  - Condor tools do not work
  - resource sharing and user priorities become unfair
- Our goal: continuous pool functioning in case of a failure

Slide 4: Highly Available Condor Pool
[Diagram: the same pool of Startd and Schedd daemons, now served by a Highly Available Central Manager.]

Slide 5: Solution Requirements
- Automatic failure detection
- Transparent failover
- "Split brain" reconciliation
- Persistence of CM state
- No changes to CM code

Slide 6: Condor Pool with HA
[Diagram: each Central Manager replica runs a Collector, a HAD, and a Replicator; the single active replica also runs the Negotiator.]

Slide 7: HA – Election
[Diagram: Backups 1–3 exchange election messages (#1, #2); the winner announces "I win", raises the Negotiator, and becomes Active, periodically sending "I am alive" messages; the others announce "I lose" and remain backups.]
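The election in this diagram picks a single winner by comparing node identities. A minimal Python sketch of such an ID-based election (illustrative only; the class and method names are hypothetical, not the actual HAD implementation):

```python
from dataclasses import dataclass

@dataclass
class HADNode:
    node_id: int            # unique identity; the highest ID wins the election
    is_active: bool = False

    def run_election(self, peers):
        # Exchange election messages carrying IDs with every peer.
        higher = [p for p in peers if p.node_id > self.node_id]
        if not higher:
            self.is_active = True      # "I win"
            self.raise_negotiator()
        else:
            self.is_active = False     # "I lose": remain a backup

    def raise_negotiator(self):
        # In the real system, HAD asks the condor_master to start the Negotiator.
        print(f"node {self.node_id}: I win - raising Negotiator")

nodes = [HADNode(1), HADNode(2), HADNode(3)]
for n in nodes:
    n.run_election([p for p in nodes if p is not n])
assert sum(n.is_active for n in nodes) == 1   # exactly one Active
```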

Slide 8: HA – Crash of the Active
[Diagram: the Active crashes and its "I am alive" messages stop; the remaining backups exchange election messages (#3, #4); the winner announces "I win", raises the Negotiator, and becomes the new Active.]
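Failure detection here rests on the Active's periodic "I am alive" messages: once they stop, the backups hold a new election. A small sketch of heartbeat-based crash detection, with assumed interval and threshold values (all names hypothetical):

```python
import time

HEARTBEAT_INTERVAL = 5.0   # assumed seconds between "I am alive" messages
FAILURE_THRESHOLD = 3      # missed heartbeats before declaring a crash

class BackupMonitor:
    def __init__(self):
        self.last_alive = time.monotonic()

    def on_alive_message(self):
        # Called whenever an "I am alive" message arrives from the Active.
        self.last_alive = time.monotonic()

    def active_has_crashed(self):
        silence = time.monotonic() - self.last_alive
        return silence > FAILURE_THRESHOLD * HEARTBEAT_INTERVAL

# A backup would poll active_has_crashed() and, once it returns True,
# start a new election exactly as on the previous slide.
```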

Slide 9: Replication – Main + Joining
[Diagram: a joining replica solicits the state version from the Active and the Backup (#1, #2), receives the "solicit version" replies, and issues a downloading request (#3); thereafter the Active keeps it current with state updates.]
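The joining protocol amounts to: ask every running replica for its state version, then download the state from the most up-to-date one. A sketch under that reading (names and wire format are hypothetical, not the Replicator's actual protocol):

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    version: int          # monotonically increasing state version
    state: bytes = b""

    def solicit_version(self):
        return self.version           # the "solicit version" reply

    def download_state(self):
        return self.state             # served in response to a downloading request

def join(peers):
    # Ask everyone for their version, pick the most up-to-date replica...
    newest = max(peers, key=lambda r: r.solicit_version())
    # ...and issue a downloading request to it.
    return Replica("joining", newest.version, newest.download_state())

pool = [Replica("active", 7, b"accountant-log-v7"), Replica("backup", 6)]
print(join(pool))   # the joiner now holds version 7
```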

Slide 10: Replication – Crash
[Diagram: the Active crashes (#4); after failover, the new Active resumes sending state updates to the remaining backup (#5).]

Slide 11: Configuration
- Stabilization time depends on the number of CMs and on network performance
- HAD_CONNECT_TIMEOUT – upper bound on the time to establish a TCP connection
- Example: with HAD_CONNECT_TIMEOUT = 2 and 2 CMs, a new Negotiator is guaranteed to be up and running within 48 seconds
- Replication frequency is controlled by REPLICATION_INTERVAL (a configuration sketch follows below)
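For concreteness, a minimal two-CM configuration along the lines documented in the Condor manual's high-availability section; the hostnames and values are placeholders, and the macro spellings should be checked against the release in use:

```
## Shared by both central manager machines.
CENTRAL_MANAGER1 = cm1.example.com
CENTRAL_MANAGER2 = cm2.example.com
CONDOR_HOST      = $(CENTRAL_MANAGER1), $(CENTRAL_MANAGER2)

## Run HAD and the Replication daemon alongside the usual CM daemons;
## HAD decides on which machine the Negotiator runs.
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION
MASTER_NEGOTIATOR_CONTROLLER = HAD

HAD_PORT         = 51450
HAD_LIST         = $(CENTRAL_MANAGER1):$(HAD_PORT), $(CENTRAL_MANAGER2):$(HAD_PORT)
REPLICATION_PORT = 41450
REPLICATION_LIST = $(CENTRAL_MANAGER1):$(REPLICATION_PORT), $(CENTRAL_MANAGER2):$(REPLICATION_PORT)

## Upper bound (seconds) on establishing a TCP connection.
HAD_CONNECT_TIMEOUT = 2

## Replicate the accountant state (negotiator priorities) every 5 minutes.
HAD_USE_REPLICATION  = TRUE
STATE_FILE           = $(SPOOL)/Accountantnew.log
REPLICATION_INTERVAL = 300
```

The slide's 48-second guarantee is consistent with a stabilization bound of 12 × HAD_CONNECT_TIMEOUT × (number of CMs): 12 × 2 × 2 = 48 seconds.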

Slide 12: Testing
- Automatic distributed testing framework: simulates node crashes, network disconnections, and network partitions and merges
- Extensive testing:
  - distributed testing on 5 machines at the Technion
  - interactive distributed testing in the Wisconsin pool
  - automatic testing with the NMI framework

Slide 13: HA in Production
- Deployed and fully functional for more than a year at the Technion
- GLOW, University of Wisconsin
- California Department of Water Resources, Delta Modeling Section, Sacramento, CA
- Hartford Life
- Cycle Computing
- Additional commercial users

Slide 14: Usability and Administration
- HAD monitoring system
- Configuration/administration utilities
- Detailed manual section
- Full support by the Technion team

Slide 15: Future Work
- HA in WAN
- HAIFA – High Availability Is For Anyone:
  - HA for any Condor service (e.g., HA for the schedd)
  - more consistency schemes and HA semantics
  - dynamic registration of services requiring HA
  - dynamic addition/removal of replicas
- More details in "Materializing Highly Available Grids", a hot-topic paper to appear at HPDC

Slide 16: Collaboration with the Condor Team
- Ongoing collaboration for 3 years
- Compliance with Condor coding standards
- Peer-reviewed code
- Integration with the NMI framework
- Automated testing
- Open-minded attitude of the Condor team toward numerous requests and questions
- Unique experience of working with a large, peer-managed group of talented programmers

Slide 17: Collaboration with the Condor Team (cont.)
This work was a collaborative effort of:
- Distributed Systems Laboratory at the Technion: Prof. Assaf Schuster, Gabi Kliot, Mark Zilberstein, Artyom Sharov
- The Condor team: Prof. Miron Livny, Nick, Todd, Derek, Greg, Anatoly, Peter, Becky, Bill, Tim

Slide 18: You Should Definitely Try It
- Part of the official development release
- Will soon appear in the stable 6.8 release
- More information:
  - ages/ha/ha.html
  - ages/replication/replication.html
  - more details and configuration in my tutorial
- Contact:

Slide 19: In Case of Time

Slide 20: Replication – "Split Brain"
[Diagram: after a merge of networks, Active 1 and Active 2 hear each other's "I am alive" messages; each decides by comparing IDs: "my ID > Active 2's ID, I am the leader" versus "my ID < Active 1's ID, give up".]

Slide 21: Replication – "Split Brain" (cont.)
[Diagram: the demoted node becomes a Backup, acknowledges "You're leader", and sends Active 2's last version from before the merge; the leader merges the versions from the two pools and resumes state updates.]
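A compact sketch of the reconciliation rule in these two diagrams: the node with the higher ID stays leader, the other demotes itself and ships its pre-merge state, and the leader merges the two pools' versions. The dictionary merge below is a placeholder policy, not the Replicator's actual merge of replicated state files:

```python
def reconcile(a, b):
    """a, b: (node_id, state_dict) pairs; returns (leader_id, merged_state)."""
    # Higher ID wins the leadership, as in the slide's decision making.
    (lead_id, lead_state), (_, lose_state) = (a, b) if a[0] > b[0] else (b, a)
    # Loser's entries first so the leader's values win on conflict.
    return lead_id, {**lose_state, **lead_state}

leader, state = reconcile((2, {"userA": 10.0}), (1, {"userB": 3.5}))
print(leader, state)   # 2 {'userB': 3.5, 'userA': 10.0}
```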

Slide 22: HAD State Diagram
[Diagram: state diagram of the HAD daemon.]

Slide 23: RD State Diagram
[Diagram: state diagram of the Replication daemon (RD).]