Condor and Multi-core Scheduling

Condor and Multi-core Scheduling
Dan Bradley, dan@hep.wisc.edu
University of Wisconsin
With generous support from the National Science Foundation and productive collaboration with Red Hat

What Exists Today
- Parallel scheduling
  - Designed for MPI-type jobs
  - No way to limit matchmaking to slots on the same machine
- Custom batch slots
  - One "slot" need not equal one core
  - Slot policy can depend on the type of job
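For context, this is roughly what a parallel-universe submit file looks like (the executable name and machine count here are hypothetical); note that nothing in it constrains the matched slots to a single machine:

    # Sketch of a Condor parallel-universe submit file.
    # "my_mpi_job" is a hypothetical wrapper script; machine_count
    # asks for 4 slots, which the matchmaker may spread across machines.
    universe      = parallel
    executable    = my_mpi_job
    machine_count = 4
    output        = out.$(NODE)
    error         = err.$(NODE)
    queue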

Example: Custom Batch Slots
- Slots 1-3: only accept 1-core jobs
- Slot 4:
  - when claimed by a normal 1-core job, behaves normally
  - when claimed by a 4-core job: stops accepting jobs for slots 1, 2, and 3, and suspends the 4-core job until slots 1, 2, and 3 drain
- Related Condor How-to: http://nmi.cs.wisc.edu/node/1482
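A rough sketch of how such a policy can be expressed in startd configuration. This simplifies the linked how-to rather than reproducing it; the job attribute RequiresWholeMachine is an assumed convention (multi-core jobs would set +RequiresWholeMachine = True in their submit files), republished across slots via the standard cross-slot mechanism:

    # Hedged sketch of the custom-slot policy on a 4-core machine;
    # not a drop-in recipe.
    NUM_SLOTS = 4

    # Copy the job attribute into the slot ad, then cross-publish each
    # slot's attributes to the others (visible as slot1_State, etc.).
    STARTD_JOB_EXPRS = RequiresWholeMachine
    STARTD_SLOT_ATTRS = State, Activity, RequiresWholeMachine

    SLOT4_HAS_WHOLE_MACHINE_JOB = (slot4_RequiresWholeMachine =?= True)

    # Slots 1-3 accept only 1-core jobs, and nothing at all while
    # slot 4 is claimed by a whole-machine job; slot 4 accepts either.
    START = (SlotID <= 3 && TARGET.RequiresWholeMachine =!= True && \
             $(SLOT4_HAS_WHOLE_MACHINE_JOB) == False) || SlotID == 4

    # Slot 4 suspends a whole-machine job until slots 1-3 have drained.
    SUSPEND = SlotID == 4 && $(SLOT4_HAS_WHOLE_MACHINE_JOB) && \
              (slot1_Activity =!= "Idle" || slot2_Activity =!= "Idle" || \
               slot3_Activity =!= "Idle")
    CONTINUE = ($(SUSPEND)) == False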

Shortcomings
- Awkward to extend to N-core jobs, but doable
- Very awkward to extend to N-core jobs while also supporting dynamic partitioning of memory, etc.
- Requires custom configuration by admins; no standard JDL for submitting grid jobs to sites
- Accounting does not charge a multi-core job at a higher rate than a 1-core job

In Development: Dynamic Slots
- Start with one "batch slot" representing the whole machine
- Trim the slot to what the job requires (cores, memory, disk, network, etc.)
- Leftovers are assigned to a new slot
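A minimal sketch of both sides of this mechanism, using the knob and submit-file names under which it surfaced in Condor as "partitionable slots" (the executable name is hypothetical):

    # startd side: one partitionable slot owning all machine resources.
    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = 100%
    SLOT_TYPE_1_PARTITIONABLE = TRUE

    # submit side: the job declares what to trim off; the remainder
    # stays in the partitionable slot for other jobs.
    universe       = vanilla
    executable     = my_mcore_app   # hypothetical
    request_cpus   = 4
    request_memory = 4096           # MB
    request_disk   = 2000000        # KB
    queue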

TODO
- Improve accounting and other monitoring to support multi-core slots
- Support a standard JDL for requesting multiple cores
- Deal with preemption and/or a mechanism for preventing starvation of multi-core jobs
- Speed up dynamic slot creation

Questions
- Is it desirable to have the capability to lock a job to a CPU?
- Does Globus provide RSL attributes with well-defined semantics for multi-core jobs?
  - Example: (count=N)(jobtype=single)
  - In my experience this is not well-defined: it can mean one N-core job or N 1-core jobs.
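For reference, a minimal GRAM RSL fragment of the kind the last question refers to (the executable path is hypothetical); nothing in these attributes says whether the request is one 4-core job or four 1-core processes:

    &(executable=/home/user/my_app)
     (count=4)
     (jobtype=single)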