Condor and Multi-core Scheduling

Condor and Multi-core Scheduling
Dan Bradley, dan@hep.wisc.edu
University of Wisconsin
With generous support from the National Science Foundation and productive collaboration with Red Hat

What Exists Today
- Parallel scheduling
  - Designed for MPI-type jobs
  - No way to limit matchmaking to slots on the same machine
- Custom batch slots
  - One "slot" need not equal one core
  - Slot policy can depend on the type of job
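For context, this is roughly what a parallel-universe submit file looks like (the executable name and machine count here are hypothetical); note that nothing in it constrains the matched slots to a single machine:

    # Sketch of a Condor parallel-universe submit file.
    # "my_mpi_job" is a hypothetical wrapper script; machine_count
    # asks for 4 slots, which the matchmaker may spread across machines.
    universe      = parallel
    executable    = my_mpi_job
    machine_count = 4
    output        = out.$(NODE)
    error         = err.$(NODE)
    queue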

Example: Custom Batch Slots
- Slots 1-3: only accept 1-core jobs
- Slot 4:
  - when claimed by a normal 1-core job, behaves normally
  - when claimed by a 4-core job: stops accepting jobs for slots 1, 2, and 3, and suspends the 4-core job until slots 1, 2, and 3 drain
- Related Condor How-to: http://nmi.cs.wisc.edu/node/1482
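A rough sketch of how such a policy can be expressed in startd configuration. This simplifies the linked how-to rather than reproducing it; the job attribute RequiresWholeMachine is an assumed convention (multi-core jobs would set +RequiresWholeMachine = True in their submit files), republished across slots via the standard cross-slot mechanism:

    # Hedged sketch of the custom-slot policy on a 4-core machine;
    # not a drop-in recipe.
    NUM_SLOTS = 4

    # Copy the job attribute into the slot ad, then cross-publish each
    # slot's attributes to the others (visible as slot1_State, etc.).
    STARTD_JOB_EXPRS = RequiresWholeMachine
    STARTD_SLOT_ATTRS = State, Activity, RequiresWholeMachine

    SLOT4_HAS_WHOLE_MACHINE_JOB = (slot4_RequiresWholeMachine =?= True)

    # Slots 1-3 accept only 1-core jobs, and nothing at all while
    # slot 4 is claimed by a whole-machine job; slot 4 accepts either.
    START = (SlotID <= 3 && TARGET.RequiresWholeMachine =!= True && \
             $(SLOT4_HAS_WHOLE_MACHINE_JOB) == False) || SlotID == 4

    # Slot 4 suspends a whole-machine job until slots 1-3 have drained.
    SUSPEND = SlotID == 4 && $(SLOT4_HAS_WHOLE_MACHINE_JOB) && \
              (slot1_Activity =!= "Idle" || slot2_Activity =!= "Idle" || \
               slot3_Activity =!= "Idle")
    CONTINUE = ($(SUSPEND)) == False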

Shortcomings
- Awkward to extend to N-core jobs, but doable
- Very awkward to extend to N-core jobs while also supporting dynamic partitioning of memory, etc.
- Requires custom configuration by admins; no standard JDL for submitting grid jobs to sites
- Accounting does not charge a multi-core job at a higher rate than a 1-core job

In Development: Dynamic Slots
- Start with one "batch slot" representing the whole machine
- Trim the slot to what the job requires (cores, memory, disk, network, etc.)
- Leftovers are assigned to a new slot
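A minimal sketch of both sides of this mechanism, using the knob and submit-file names under which it surfaced in Condor as "partitionable slots" (the executable name is hypothetical):

    # startd side: one partitionable slot owning all machine resources.
    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = 100%
    SLOT_TYPE_1_PARTITIONABLE = TRUE

    # submit side: the job declares what to trim off; the remainder
    # stays in the partitionable slot for other jobs.
    universe       = vanilla
    executable     = my_mcore_app   # hypothetical
    request_cpus   = 4
    request_memory = 4096           # MB
    request_disk   = 2000000        # KB
    queue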

TODO
- Improve accounting and other monitoring to support multi-core slots
- Support a standard JDL for requesting multiple cores
- Deal with preemption and/or a mechanism for preventing starvation of multi-core jobs
- Speed up dynamic slot creation

Questions
- Is it desirable to have the capability to lock a job to a CPU?
- Does Globus provide RSL attributes with well-defined semantics for multi-core jobs?
  - Example: (count=N)(jobtype=single)
  - In my experience this is not well-defined: it can mean one N-core job or N 1-core jobs.
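For reference, a minimal GRAM RSL fragment of the kind the last question refers to (the executable path is hypothetical); nothing in these attributes says whether the request is one 4-core job or four 1-core processes:

    &(executable=/home/user/my_app)
     (count=4)
     (jobtype=single)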