Improving ARC backends: Condor and SGE/GE LRMS interface


Improving ARC backends: Condor and SGE/GE LRMS interface
Adrian Taga, University of Oslo

LRMS backends in ARC
Supported: PBS and variants, LSF, Condor, SGE, LL.
The backends are perl/sh wrappers around the command-line interface of the LRMS.
Advantage: easy to modify. Drawback: no clear structure.
Scripts for: populating the InfoSystem; job control (submit, get status, kill).
An automated system like ATLAS production needs to clearly distinguish failures of the job itself from failures related to the LRMS, in particular failures due to exceeded resource limits.
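A minimal sketch of what such a job-control wrapper could look like for the Condor backend, assuming only the standard Condor command-line tools condor_submit, condor_q and condor_rm; the function names are illustrative, not the actual ARC script interface:

#!/bin/sh
# Illustrative job-control wrapper around the Condor CLI (not the real ARC backend).

submit_job () {   # submit a job description file, print the Condor cluster id
    condor_submit "$1" | sed -n 's/.*submitted to cluster \([0-9]*\).*/\1/p'
}

job_status () {   # print the numeric JobStatus of a cluster (1=idle, 2=running, ...)
    condor_q -format "%d\n" JobStatus "$1"
}

cancel_job () {   # remove the job from the pool
    condor_rm "$1"
}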

Queues in Condor
Why have queues? Inhomogeneous clusters.
Problem: Condor has no notion of queues.
Solution: use the ClassAd mechanism to partition the cluster.
Example: queues defined based on memory size.

So far the Condor backend did not support queues. Why do we need queues? Condor pools tend to be quite heterogeneous. The way heterogeneity is handled in NorduGrid is by dividing the cluster into separate queues of more or less identical machines. However, Condor does not have the notion of queues that other batch systems have. Still, it is possible to divide the Condor pool into virtual "queues" by exploiting the flexible ClassAd mechanism:

[queue/large]
requirements="(Opsys == "linux" && Arch == "intel")"
requirements=" && (Disk > 30000000 && Memory > 2000)"

[queue/small]
requirements="(Opsys == "linux" && Arch == "intel")"
requirements=" && (Disk > 30000000 && Memory <= 2000 && Memory > 1000)"

A submit-side sketch of how such a queue definition could be applied is shown below.
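A minimal sketch, assuming a generated submit description job.submit and a wrapper executable runjob.sh (both hypothetical names): the backend could append the chosen queue's requirements expression to the job's Requirements before calling condor_submit.

#!/bin/sh
# Illustrative routing of a job into the "large" virtual queue; file and
# script names are placeholders, not the actual ARC backend layout.
QUEUE_REQUIREMENTS='(Opsys == "linux" && Arch == "intel") && (Disk > 30000000 && Memory > 2000)'

cat > job.submit <<EOF
universe     = vanilla
executable   = runjob.sh
output       = job.out
error        = job.err
log          = job.log
requirements = $QUEUE_REQUIREMENTS
queue
EOF

condor_submit job.submit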

Error reporting
Many LRMS backends, inconsistent reporting of LRMS errors.
A clear diagnosis is needed for the ATLAS production system.
Scenario: the job exceeded its resource limits.
PBS and its variants natively report WallTime, CpuTime and Memory limit exceeded: an exit code < 256 comes from the job itself, an exit code ≥ 256 means a resource limit was hit.
Condor and SGE provide no clear diagnosis, so workarounds are needed.
An automated system like ATLAS production needs to clearly distinguish failures of the job from failures related to the LRMS, in particular failures due to exceeded resource limits. A sketch of such a classification is given below.
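A minimal sketch, assuming the PBS-style exit-code convention stated above; this only illustrates the classification rule, it is not the actual ARC diagnostic code:

#!/bin/sh
# Classify an LRMS-reported exit code: < 256 is the job's own exit code,
# >= 256 means the batch system terminated the job (e.g. a resource limit was hit).
classify_exit () {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "job finished successfully"
    elif [ "$code" -lt 256 ]; then
        echo "job failure: job exited with code $code"
    else
        echo "LRMS failure: job terminated by the batch system (code $code), likely a resource limit"
    fi
}

# Example: classify_exit 271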

Thank you