First steps implementing a High Throughput workload management system Massimo Sgaravatto INFN Padova


High throughput workload management system architecture (simplified design)
[Architecture diagram: the Master performs resource discovery against the GIS, which publishes information on the characteristics and status of local resources; jobs described with Class-Ads are submitted through Condor-G (condor_submit, Globus Universe) and globusrun to GRAM services fronting Condor, LSF, and PBS at Site1, Site2, and Site3.]

Overview
- PC farms in different sites, possibly managed by different local resource management systems
- GRAM as uniform interface to these different local resource management systems
- Condor-G to provide robustness and reliability
- A Master smart enough to decide to which Globus resources the jobs should be submitted
- The Master uses the information on the characteristics and status of resources published in the GIS

First step: evaluation of GRAM
[Diagram: jobs submitted using the Globus tools to GRAM services fronting Condor, LSF, and PBS at Site1, Site2, and Site3; the GIS publishes information on the characteristics and status of local resources.]

Evaluation of GRAM Service
- Job submission tests using the Globus tools (globusrun, globus-job-run, globus-job-submit); see the sketch below
- GRAM as uniform interface to different underlying resource management systems
- "Cooperation" between GRAM and GIS
- Evaluation of RSL as a uniform language to specify resources
- Tests performed with Globus and Linux machines
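As a minimal sketch, one submission with each of the three tools (hostname and jobmanager names follow the examples on the later slides; paths are illustrative):

pc1% globus-job-run pc2.pd.infn.it/jobmanager-condor /bin/hostname
pc1% globus-job-submit pc2.pd.infn.it/jobmanager-lsf /diskCms/startcmsim.sh
pc1% globusrun -b -r pc2.pd.infn.it/jobmanager-pbs -f file.rsl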

GRAM & fork system call
[Diagram: a client submits through Globus to a server where the jobmanager forks the job locally.]

GRAM & Condor
[Diagram: a client submits through Globus to a server (the Condor front-end machine), which hands the job to a Condor pool.]

GRAM & Condor
Tests considering:
- Standard Condor jobs (relinked with the Condor library; the distinction is sketched below)
  - INFN WAN Condor pool configured as a Globus resource
  - ~200 machines spread across different sites
  - Heterogeneous environment
  - No single file system and UID domain
- Vanilla jobs ("normal" jobs)
  - PC farm configured as a Globus resource
  - Single file system and UID domain
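A hedged illustration of the standard vs. vanilla distinction (compiler invocation and file names are illustrative): a standard job is relinked with the Condor library via condor_compile, gaining checkpointing and remote system calls, while a vanilla job is an ordinary, unmodified binary:

pc1% condor_compile gcc -o cmsim_std cmsim.c      (standard job: relinked with the Condor library)
pc1% gcc -o cmsim_vanilla cmsim.c                 (vanilla job: no relinking, fewer guarantees)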

GRAM & LSF
[Diagram: a client submits through Globus to the server (LSF front-end machine), which dispatches to the LSF cluster.]

GRAM & PBS (by F. Giacomini, INFN Cnaf)
[Diagram: a client submits through Globus to a PBS server (a 4-processor Linux server).]

Results
- Some bugs found and fixed:
  - Standard output and error for vanilla Condor jobs
  - globus-job-status
  - …
- Some bugs seem solvable without major re-design and/or re-implementation:
  - For LSF the RSL parameter (count=x) is translated into: bsub -n x …
    - This just allocates x processors and dispatches the job to the first one (it is meant for parallel applications)
    - It should instead be: bsub … repeated x times (see the sketch below)
- Two major problems:
  - Scalability
  - Fault tolerance
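A minimal sketch of the difference, assuming count=3 and the startcmsim.sh executable from the later examples:

# current translation: one parallel allocation; the job runs once, on the first of 3 slots
bsub -n 3 /diskCms/startcmsim.sh
# intended behaviour for independent jobs: 3 separate submissions
for i in 1 2 3; do bsub /diskCms/startcmsim.sh; done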

Globus GRAM Architecture
[Diagram: the client (pc1) runs globusrun against the Globus front-end machine (pc2); a jobmanager is started there and hands the job to LSF/Condor/PBS/….]

pc1% globusrun -b -r pc2.pd.infn.it/jobmanager-xyz \
     -f file.rsl

file.rsl:
& (executable=/diskCms/startcmsim.sh)
  (stdin=/diskCms/PythiaOut/filename)
  (stdout=/diskCms/Cmsim/filename)
  (count=1)

Scalability
- One jobmanager for each globusrun
- If I want to submit 1000 jobs: 1000 globusrun invocations → 1000 jobmanagers running on the front-end machine!

pc1% globusrun -b -r pc2.infn.it/jobmanager-xyz -f file.rsl

file.rsl:
& (executable=/diskCms/startcmsim.sh)
  (stdin=/diskCms/PythiaOut/filename)
  (stdout=/diskCms/CmsimOut/filename)
  (count=1000)

- It is not possible to specify in the RSL file 1000 different input files and 1000 different output files … (Condor solves this with its $(Process) macro; see the sketch below)
- Problems with job monitoring (globus-job-status)
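For comparison, a minimal Condor submit description (paths illustrative) that parameterizes the per-job files with $(Process):

executable = /diskCms/startcmsim.sh
input      = /diskCms/PythiaOut/file.$(Process)
output     = /diskCms/CmsimOut/file.$(Process)
queue 1000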

Fault tolerance
- The jobmanager is not persistent
- If the jobmanager can't be contacted, Globus assumes that the job(s) have been completed
- Example of the problem:
  - Submission of n jobs on a cluster managed by a local resource management system
  - Reboot of the front-end machine
  - The jobmanager(s) don't restart
  - Orphan jobs → Globus assumes that the jobs have been successfully completed

GRAM & GIS
- How do the local GRAMs provide the GIS with the characteristics and status of local resources? (a query sketch follows below)
- The Master will need this (and other) information
- Tests performed considering:
  - Condor pool
  - LSF cluster
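Since the GIS is an LDAP-based directory, a hedged sketch of such a query with a generic LDAP client (hostname, base DN, and object class are illustrative, not the actual MDS schema):

pc1% ldapsearch -h pc2.pd.infn.it -p 2135 \
     -b "ou=Padova, o=Grid" "(objectclass=GlobusResource)"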

GRAM & Condor & GIS

GRAM & LSF & GIS

Jobs & GIS
Info on Globus jobs published in the GIS:
- User
  - Subject of certificate
  - Local user name
- RSL string
- Globus job id
- LSF/Condor/… job id
- Status: Run/Pending/…

GRAM & GIS
- The information on the characteristics and status of local resources and on jobs is not enough
- As local resources we must consider farms, not single workstations
- Other information (e.g. total and available CPU power) is needed
- Fortunately the default schema can be integrated with other info provided by specific agents
- We need to identify which other information is necessary
  - This will become much clearer during the Master design

RSL
- We need a uniform language to specify resources across different resource management systems
- The RSL syntax model seems suitable for defining even complicated resource specification expressions
- The common set of RSL attributes is often not sufficient
- Attributes not belonging to the common set are ignored (see the example below)
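An illustrative RSL fragment: the first three attributes belong to the common set, while a site-specific attribute such as the hypothetical cms_sw_installed would simply be ignored by a GRAM that does not know it:

& (executable=/diskCms/startcmsim.sh)
  (count=10)
  (maxtime=120)
  (cms_sw_installed=true)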

RSL
- More flexibility is required
- Resource administrators should be allowed to define new attributes, and users should be allowed to use them in resource specification expressions (the Condor Class-Ads model)
- Using the same language to describe the offered resources and the requested resources (the Condor Class-Ads model) seems a better approach (see the sketch below)
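A minimal Class-Ads sketch under these assumptions (DiskCmsInstalled is a hypothetical attribute defined by a resource administrator): the machine ad advertises it, and the job's Requirements expression, written in the same language, can match against it:

Machine ad (offered resource):
  OpSys            = "LINUX"
  Arch             = "INTEL"
  DiskCmsInstalled = True

Job ad (requested resource):
  Requirements = (OpSys == "LINUX") && (Arch == "INTEL") && (DiskCmsInstalled =?= True)
  Rank         = Mips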

Second step: Condor-G
[Diagram: jobs submitted to Condor-G using condor_submit (Globus Universe) and managed with condor_q, condor_rm, …; Condor-G runs globusrun against GRAM services fronting Condor, LSF, and PBS at Site1, Site2, and Site3.]

Condor-G?
- Condor schedd + GridManager
Why Condor-G?
- Use of the Condor architecture and mechanisms, able to provide robustness and reliability
  - The user can submit his 10,000 jobs and be sure that they will be completed (even if there are problems in the submitting machine, in the executing machines, in the network, …) without human intervention
- Use of the Condor interface and tools to "manage" the jobs
  - "Robust" tools with all the required capabilities (monitoring, logging, …)

Condor-G (Globus Universe)
Condor-G tested considering as Globus resource:
- Workstation using the fork system call
- LSF cluster
- Condor pool
Submission (condor_submit), monitoring (condor_q), and removal (condor_rm) seem to work fine (a typical session is sketched below), but…
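A typical management session, as a sketch (the job id 42.0 is illustrative; file.cnd is shown on the architecture slide below):

pc1% condor_submit file.cnd     (submit the job)
pc1% condor_q                   (monitor the queue)
pc1% condor_rm 42.0             (remove job 42.0)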

Condor-G
- The Globus Universe architecture is only a prototype
  - Documentation not available
- Scalability problems on the submitting side
  - One shadow for each globusrun
- Very difficult to understand errors
  - Only some errors appear in the log files
- Some improvements foreseen in the near future
  - Scalability, …
- The problems of scalability and fault tolerance in the Globus resources are not solved
  - Fault tolerance only on the submitting side

Condor-G Architecture
[Diagram: Condor-G (the Globus client, pc1) runs condor_submit → globusrun against the Globus front-end machine (pc2), where the jobmanager hands jobs to LSF/Condor/PBS/…; job status is obtained by polling (globus-job-status).]

pc1% condor_submit file.cnd

file.cnd:
Universe        = globus
Executable      = /diskCms/startcmsim.sh
GlobusRSL       = (stdin=/diskCms/PythiaOut/filename)(stdout=/diskCms/CmsimOut/filename)
log             = /diskCms/log.$(Process)
GlobusScheduler = pc2.pd.infn.it/jobmanager-xyz
queue 1

Condor GlideIn
- Submission of Condor jobs to Globus resources
- Condor daemons (master, startd) run on the Globus resources
  - These resources temporarily become part of the Condor pool
- Condor-G is used to run the Condor daemons
- The local resource management systems (LSF, PBS, …) of the Globus resources are used only to run the Condor daemons
- For a cluster it is necessary to install Globus only on one front-end machine, while the Condor daemons will run on each workstation of the cluster

GlideIn
[Diagram: a Personal Condor on pc1 glides in pc2 (a Globus workstation) and pc3 (the Globus front end of a cluster managed by LSF/Condor/…).]

pc1% condor_glidein pc2.pd.infn.it …
pc1% condor_glidein pc3.pd.infn.it …

Condor GlideIn
- Use of all the Condor mechanisms and capabilities
  - Robustness and fault tolerance
- The only "ready-to-use" solution if we want to use Globus tools
- Also provides Master functionality (the Condor matchmaking system)
  - A viable solution if the goal is just to find idle CPUs
  - The architecture must be integrated/modified if we have to take other parameters into account (e.g. location of input files)

Condor GlideIn
GlideIn tested (considering standard and vanilla jobs) with:
- Workstation using the fork system call as job manager
  - Seems to work
- Condor pool
  - Seems to work
  - Condor flocking is a better solution if authentication is not required
- LSF cluster
  - Problems gliding in multiple nodes with a single condor_glidein command (because of the problem with the LSF translation of the (count=x) parameter)
    → multiple condor_glidein commands → scalability problems (number of jobmanagers)
    → or modification of the Globus scripts for LSF

Conclusions (problems)
- Major problems related to scalability and fault tolerance with Globus GRAM
  - The Globus team is going to re-implement the Globus GRAM service. When? How?
- The local GRAMs do not provide the GIS with enough information → the default schema must be integrated
  - We must identify which information is necessary
- RSL is not flexible enough
  - Condor Class-Ads seem a much better solution for specifying resources
- Condor-G can provide robustness only on the submitting side

Future activities
- Complete the on-going GRAM evaluations (e.g. PBS)
- Bug fixes
  - Modification of the Globus LSF scripts
  - …
- Tests with real applications
- Solve the scalability and robustness problems
  - Not so simple and straightforward!
  - Possible collaboration between WP1, the Globus team, and the Condor team

Other info
- INFN-GRID
- INFN-GRID/Globus