Setting up the Condor scheduler on a computing cluster. Raman Sehgal, NPD-BARC.



Outline
 Software required for the cluster
 Network topology
 Condor and the different roles of machines in a Condor pool
 Various Condor universes
 Prerequisites for Condor
 Configuration of Condor on our LAN
 Running jobs using Condor and some commonly used Condor commands
 MPI
 Conclusion

Software required for the cluster
The cluster requires the following software:
Operating System: Scientific Linux CERN 5.4 (64-bit)
Cluster Management: IPMI (built in)
Cluster Usage and Statistics: Ganglia
Cluster Middleware: Condor
Parallel Programming Environment: MPI

Network Topology
One head node holding all users' home directories.
16 worker nodes that provide the computational power.
The head node is connected to both the public and the private network.
Public network: allows users to log in to the head node.
Private network: connects all worker nodes to the head node over Gigabit Ethernet and InfiniBand; used for job submission and execution.
File system: a network file system (NFS) provides a shared area between the head node and the worker nodes.

Prototype distributed and parallel computing environment
A prototype distributed and parallel computing environment for the cluster has been set up on a LAN of 4 computers.
Distributed computing environment: Condor
Parallel computing environment: MPI

CONDOR
Condor is an open-source high-throughput computing software package for distributed parallelization of computationally intensive tasks.
It is used to manage the workload on a cluster of computing nodes.
It can integrate both dedicated resources (rack-mounted clusters) and non-dedicated desktop machines by making use of cycle scavenging.
It can run both sequential and parallel jobs.
It provides different universes to run jobs (vanilla, standard, MPI, Java, etc.).

Condor's exceptional features:
Checkpointing and migration
Remote system calls
No changes are necessary to the user's source code
Sensitive to the wishes of the machine owner (in the case of non-dedicated machines)

Different roles of a machine in a Condor pool
Central Manager: the main administration machine.
Execute: the machines where jobs execute.
Submit: the machines used to submit jobs.

Various Condor daemons
The following Condor daemons run on the different machines in the Condor pool:
condor_master: takes care of the rest of the daemons running on a machine.
condor_collector: responsible for collecting information about the status of the pool.
condor_negotiator: responsible for all the match-making within the Condor system.
condor_schedd: represents resource requests to the Condor pool.
condor_startd: represents a given resource to the Condor pool.

Various Condor universes to run different types of jobs
Condor provides several universes to run different types of jobs; some are as follows:
Standard: This universe provides Condor's full power. It provides the following features:
1. Checkpointing
2. Migration
3. Remote system calls
The job needs to be relinked with the Condor libraries in order to run in the standard universe. This is easily achieved by putting condor_compile in front of the usual link command.
E.g. normal linking of a job:
gcc -o my_prog my_prog.c
For the standard universe the job is prepared with condor_compile:
condor_compile gcc -o my_prog my_prog.c
Now this job can use the full power of the standard universe.

Vanilla: This universe is intended for programs which cannot be successfully relinked with the Condor libraries.
1. Shell scripts are one example of jobs where the vanilla universe is useful.
2. Jobs that run under the vanilla universe cannot use checkpointing or remote system calls.
3. Since the remote system call feature is not available, a shared file system such as NFS or AFS is needed. (A sample vanilla submit file is sketched after this slide.)
Parallel: This universe allows parallel programs, such as MPI jobs, to be run in the Condor environment.
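For illustration, a minimal vanilla-universe submit description file for a shell-script job. This is only a sketch: the script name run_analysis.sh and the output file names are hypothetical, and the script is assumed to sit in the NFS shared area so every execute node can read it.

# Hypothetical vanilla-universe submit description file
Executable = run_analysis.sh
Universe   = vanilla
Output     = run_analysis.out
Error      = run_analysis.err
Log        = run_analysis.log
Queue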

Prerequisites of Condor configuration
 Setup of a private network of the machines in the computing pool
 Passwordless login from the submit machines to all execute machines (rsh or ssh)

Configuration of Condor on our small LAN of 4 computers
On our LAN of four machines we have one head node and the remaining 3 are worker nodes. Condor is installed and configured on this pool, and the role of each machine is given below (a configuration sketch follows this slide):
1. Head Node: Central Manager, Submit
2. Worker Nodes: Execute
The home directories of all users reside on the head node, in a shared area (exported via NFS) that can be accessed by all the worker nodes (required for the vanilla universe). Users can therefore submit jobs from their home directories.
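One common way to express these roles is through the DAEMON_LIST setting in each machine's Condor configuration file. The lines below are a minimal sketch of that idea, not necessarily the exact configuration used on this pool.

# Head node (central manager + submit machine): run the collector,
# negotiator and schedd in addition to the master
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD

# Worker nodes (execute machines): run only the master and the startd
DAEMON_LIST = MASTER, STARTD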

Running jobs using Condor
The following are the steps to run a Condor job:
1. Prepare the code.
2. Choose the Condor universe.
3. Make the submit description file (submit.ip); a sample file is shown below:

#############################
# Sample Submit Description file
#############################
Executable = getIp
Universe   = standard
Output     = getIp.out
Error      = getIp.err
Log        = hello.log
Queue 15

4. Submit the job with the following Condor command:
condor_submit submit.ip
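After submission the job can be followed from the command line. A minimal sketch of such a session, using the submit file from the example above (commands only):

condor_submit submit.ip    # queue the 15 jobs described in submit.ip
condor_q                   # list jobs currently in the local job queue
condor_q -analyze          # show why idle jobs have not been matched yet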

Commonly used Condor commands
condor_submit: used to submit a job.
condor_q: displays information about jobs in the Condor job queue.

condor_status: used to monitor, query and display the status of the Condor pool.

condor_history: lets users view the log of Condor jobs completed to date.
condor_rm: removes one or more jobs from the Condor job queue.
condor_compile: used to relink a job with the Condor libraries so that it can be executed in the standard universe.
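A few typical invocations of these commands, as a sketch; the job id 23.0 (cluster 23, process 0) is hypothetical:

condor_status        # show the machines in the pool and their current states
condor_history       # list jobs that have already left the queue
condor_rm 23.0       # remove job 23.0 from the job queue
condor_rm -all       # remove all jobs from the queue (needs suitable privileges)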

MPI
MPI is a language-independent communication protocol used for parallel programming. Different languages provide wrapper compilers for MPI programming. Here we have installed MPICH, which allows us to do parallel programming in C, C++, Fortran, etc.
Computation vs. communication.
SISD, SIMD, MISD, MIMD (Flynn's classification).
MPI requires the executable to be present on all the machines in the pool; this is achieved via the NFS shared area.
Testing was done with a matrix multiplication program, showing a considerable reduction in execution time.
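For illustration, a minimal MPI program in C and the MPICH commands to build and run it. This is only a sketch (not the matrix multiplication program used for testing); the file name hello_mpi.c and the process count of 4 are hypothetical.

/* hello_mpi.c: every process reports its rank */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);                /* start the MPI runtime       */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of this process        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes   */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                        /* shut down the MPI runtime   */
    return 0;
}

Build and run with the MPICH wrappers; the executable should sit in the NFS shared area so that all nodes can see it:
mpicc -o hello_mpi hello_mpi.c
mpirun -np 4 ./hello_mpi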

Conclusion:
 Condor is installed and configured on a small LAN of 4 computers; it is working properly and giving the expected results.
 Later this prototype setup will be replicated on a computing cluster with 16 worker nodes, providing a processing power of 1.3 TFlops and a storage of 20 TB.
 The setup is also ready to run parallel jobs, so future parallel applications can be supported as well.