Improving ARC backends: Condor and SGE/GE LRMS interface


Improving ARC backends: Condor and SGE/GE LRMS interface
Adrian Taga, University of Oslo

LRMS backends in ARC
Supported: PBS and variants, LSF, Condor, SGE, LL.
The backends are perl/sh wrappers around the command-line interface of the LRMS.
Advantage: easy to modify. Drawback: no clear structure.
Scripts for: populating the InfoSystem; job control (submit, get status, kill).
An automated system like ATLAS production needs to clearly distinguish failures of the job itself from failures related to the LRMS, in particular failures due to exceeded resource limits.
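A minimal sketch of what such a job-control wrapper could look like for the Condor backend, assuming only the standard Condor command-line tools condor_submit, condor_q and condor_rm; the function names are illustrative, not the actual ARC script interface:

#!/bin/sh
# Illustrative job-control wrapper around the Condor CLI (not the real ARC backend).

submit_job () {   # submit a job description file, print the Condor cluster id
    condor_submit "$1" | sed -n 's/.*submitted to cluster \([0-9]*\).*/\1/p'
}

job_status () {   # print the numeric JobStatus of a cluster (1=idle, 2=running, ...)
    condor_q -format "%d\n" JobStatus "$1"
}

cancel_job () {   # remove the job from the pool
    condor_rm "$1"
}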

Queues in Condor
Why have queues? Inhomogeneous clusters.
Problem: Condor has no notion of queues.
Solution: use the ClassAd mechanism to partition the cluster.
Example: queues defined based on memory size.

So far the Condor backend did not support queues. Why do we need queues? Condor pools tend to be quite heterogeneous. The way heterogeneity is handled in NorduGrid is by dividing the cluster into separate queues of more or less identical machines. However, Condor does not have the notion of queues that other batch systems have. Still, it is possible to divide the Condor pool into virtual "queues" by exploiting the flexible ClassAd mechanism:

[queue/large]
requirements="(Opsys == "linux" && Arch == "intel")"
requirements=" && (Disk > 30000000 && Memory > 2000)"

[queue/small]
requirements="(Opsys == "linux" && Arch == "intel")"
requirements=" && (Disk > 30000000 && Memory <= 2000 && Memory > 1000)"

A submit-side sketch of how such a queue definition could be applied is shown below.
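A minimal sketch, assuming a generated submit description job.submit and a wrapper executable runjob.sh (both hypothetical names): the backend could append the chosen queue's requirements expression to the job's Requirements before calling condor_submit.

#!/bin/sh
# Illustrative routing of a job into the "large" virtual queue; file and
# script names are placeholders, not the actual ARC backend layout.
QUEUE_REQUIREMENTS='(Opsys == "linux" && Arch == "intel") && (Disk > 30000000 && Memory > 2000)'

cat > job.submit <<EOF
universe     = vanilla
executable   = runjob.sh
output       = job.out
error        = job.err
log          = job.log
requirements = $QUEUE_REQUIREMENTS
queue
EOF

condor_submit job.submit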

Error reporting
Many LRMS backends, inconsistent reporting of LRMS errors.
A clear diagnosis is needed for the ATLAS production system.
Scenario: the job exceeded its resource limits.
PBS and its variants natively report WallTime, CpuTime and Memory limit exceeded: an exit code < 256 comes from the job itself, an exit code ≥ 256 means a resource limit was hit.
Condor and SGE provide no clear diagnosis, so workarounds are needed.
An automated system like ATLAS production needs to clearly distinguish failures of the job from failures related to the LRMS, in particular failures due to exceeded resource limits. A sketch of such a classification is given below.
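A minimal sketch, assuming the PBS-style exit-code convention stated above; this only illustrates the classification rule, it is not the actual ARC diagnostic code:

#!/bin/sh
# Classify an LRMS-reported exit code: < 256 is the job's own exit code,
# >= 256 means the batch system terminated the job (e.g. a resource limit was hit).
classify_exit () {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "job finished successfully"
    elif [ "$code" -lt 256 ]; then
        echo "job failure: job exited with code $code"
    else
        echo "LRMS failure: job terminated by the batch system (code $code), likely a resource limit"
    fi
}

# Example: classify_exit 271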

Thank you