No Idle Cores: Dynamic Load Balancing in ATLAS & Other News
William Strecker-Kellogg
RHIC/ATLAS Computing Facility Overview
- Main HTCondor pools: STAR, PHENIX, ATLAS
  - Each just over 15k CPUs
  - Running stably
- RHIC pools: individual + special users; workload management done by the experiments
- ATLAS (the focus of this talk): local batch system driven by an external workload manager (PANDA)
  - Jobs are pilots
  - Scheduling vs. provisioning
[Facility diagram: BCF, CDCE, Sigma-7]
ATLAS Configuration
- Use hierarchical group quotas + partitionable slots
- My HTCondor Week talk last year was all about this; a short recap:
  - PANDA queues map to groups in a hierarchical tree
  - Leaf nodes have jobs
  - Surplus sharing is selectively allowed
  - Group allocation is controlled via a web interface to a DB; the config file is rewritten when the DB changes (a sketch follows below)
- The whole farm has one STARTD config:
    SLOT_TYPE_1=100%
    NUM_SLOTS=1
    NUM_SLOTS_TYPE_1=1
    SLOT_TYPE_1_PARTITIONABLE=True
    SlotWeight=Cpus
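For reference, a minimal sketch of what the generated group-quota config could look like on the negotiator side; the group names and quota numbers here are illustrative, not our actual allocation, and the real file lists many more groups:

    # Hierarchical groups (dot-separated names); list truncated for illustration
    GROUP_NAMES = group_atlas, group_atlas.prod, group_atlas.prod.mcore, group_atlas.analysis.short
    # Static quotas, in units of SlotWeight (CPUs)
    GROUP_QUOTA_group_atlas.prod.mcore = 1000
    GROUP_QUOTA_group_atlas.analysis.short = 1000
    # Surplus flags: these are what the balancing program toggles
    GROUP_ACCEPT_SURPLUS_group_atlas.prod.mcore = True
    GROUP_ACCEPT_SURPLUS_group_atlas.analysis.short = False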
Partitionable Slots
- Each batch node is configured to be partitioned into arbitrary slices of CPUs
  - Condor terminology: partitionable slots are automatically sliced into dynamic slots
- Multicore jobs are thus accommodated with no administrative effort
- Only minimal (~1-2%) defragmentation is necessary
  - Determined empirically on our farm; factors include cores/node, job sizes and proportions, and runtimes
  - Something like: draining = (job-length * job-size^2) / (machine-size * %mcore * occupancy)
Defragmentation Policy
- Defragmentation daemon (config sketched below):
  - Start defragmentation: (PartitionableSlot && !Offline && TotalCpus > 12)
  - End defragmentation: (Cpus >= 10)
  - Rate: max 4 machines/hr
- Key change: negotiator policy, via NEGOTIATOR_POST_JOB_RANK
  - The default policy is breadth-first filling of equivalent machines: (Kflops - SlotId)
  - Depth-first filling preserves continuous blocks longer: (-Cpus)
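As a hedged sketch (our exact knob values and expressions may differ from this), the policy above maps onto condor_config roughly like so:

    # Run the defrag daemon on the central manager
    DAEMON_LIST = $(DAEMON_LIST) DEFRAG
    # Only drain partitionable, online machines with more than 12 cores
    DEFRAG_REQUIREMENTS = PartitionableSlot && Offline =!= True && TotalCpus > 12
    # Consider a machine "whole" (stop draining) once it has >= 10 free cores
    DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 10
    # Cap the drain rate
    DEFRAG_DRAINING_MACHINES_PER_HOUR = 4
    # Depth-first filling: pack new dynamic slots onto already-busy machines
    NEGOTIATOR_POST_JOB_RANK = -Cpus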
PANDA Queues
- One species of job per queue
- Queues map to groups in our tree
- Currently two non-single-core queues:
  - 8-core (ATHENA-MP)
  - 2-core (actually high-memory)
    - No support yet for SlotWeight != Cpus
    - We have 2 GB/core, so 4 GB jobs get 2 cores
ATLAS Tree Structure
[Tree diagram of the group hierarchy: atlas at the root; analysis with short and long; prod with himem, single, and mcore; grid; sw]
Surplus Sharing
- Surplus sharing is controlled by a boolean accept_surplus flag on each queue
- Quotas are normalized in units of SlotWeight (CPUs)
- Groups with the flag set to True can take unused slots from their siblings
- Parent groups with the flag allow surplus to "flow down" the tree from their siblings to their children
- Parent groups without the accept_surplus flag constrain surplus sharing to within their own children
Surplus Sharing
- Scenario: analysis has a quota of 2000 and no accept_surplus; short and long have a quota of 1000 each and accept_surplus on
  - short=1600, long=400 ... possible
  - short=1500, long=700 ... impossible (violates the analysis quota)
Where's the problem? (it's starvation)
- Everything works perfectly with all single-core jobs; just set accept_surplus everywhere! However...
- Multicore jobs cannot compete fairly for surplus resources
  - Negotiation is greedy: if 7 slots are free, they won't match an 8-core job, but they will match 7 single-core jobs in the same cycle
  - If any multicore queue competes for surplus with single-core queues, the multicore queue will always lose
- A solution outside Condor is needed
- The ultimate goal is to maximize farm utilization: No Idle Cores!
Dynamic Allocation
- A program that looks at the current state of demand in the various queues and sets the surplus flags appropriately
- Based on comparing the "weight" of queues
  - Weight is defined as the size of the jobs in the queue (# cores)
- Able to cope with any combination of demands
  - Prevents starvation by allowing surplus into the "heaviest" queues first
  - Avoids single-core and multicore queues competing for the same resources
- The same algorithm extends up the tree to allow sharing between entire subtrees
- Much credit to Mark Jensen (summer student in '14)
Balancing Algorithm
- Groups have the following properties pertinent to the algorithm:
  - Surplus flag
  - Weight
  - Threshold
  - Demand
- If demand > threshold, a queue is considered for sharing
Balancing: Demand
- PANDA queues are monitored for "activated" jobs
  - Polling every 2 minutes
- The last hour is analyzed (see the sketch below)
  - Midpoint sampling
  - Moving average
  - Spikes smoothed out
- A queue is considered "loaded" if its calculated demand > threshold
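A minimal sketch of the kind of smoothing this describes; the class name, the way samples arrive, and the exact smoothing parameters are hypothetical, not the production monitor:

    from collections import deque

    POLL_INTERVAL_MIN = 2                 # poll PANDA every 2 minutes
    WINDOW = 60 // POLL_INTERVAL_MIN      # keep one hour of samples

    class DemandTracker:
        """Track 'activated' job counts for one queue and smooth out spikes."""
        def __init__(self, threshold):
            self.threshold = threshold
            self.samples = deque(maxlen=WINDOW)

        def add_sample(self, activated_jobs):
            self.samples.append(activated_jobs)

        def demand(self):
            if not self.samples:
                return 0.0
            # Midpoint sampling + moving average over the last hour:
            # average adjacent samples, then average the midpoints.
            points = list(self.samples)
            midpoints = [(a + b) / 2.0 for a, b in zip(points, points[1:])] or points
            return sum(midpoints) / len(midpoints)

        def loaded(self):
            return self.demand() > self.threshold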
Extending Weight & Demand
- Leaf groups' weights are the cores they need (8, 2, 1)
- How to extend beyond leaf groups?
  1. Define in a custom priority order
  2. Define as the sum() or avg() of the child groups' weights
- The problem with 1. is that you can't guarantee freedom from starvation
- For now, weights are set manually to match what option 2 would give
- For demand and threshold it's easy: the sum of the child groups' values (sketched below)
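In code form, option 2 and the demand aggregation might look like this; a hedged sketch that assumes a group object with weight, demand, and children attributes (the real tool sets non-leaf weights manually, as noted above):

    def subtree_weight(group):
        # Option 2: a parent's weight is the average of its children's weights
        if not group.children:
            return group.weight
        return sum(subtree_weight(c) for c in group.children) / len(group.children)

    def subtree_demand(group):
        # Demand (and threshold) aggregate as a simple sum over children
        if not group.children:
            return group.demand
        return sum(subtree_demand(c) for c in group.children)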
Balancing: Weight
[Tree diagram annotated with weight/demand for each group, e.g. 8/50 for mcore; leaf weights are 8, 2, or 1]
Priority order == desired order
Algorithm
The following algorithm is implemented:
1. For each sibling group, in DFS order:
   1. For each member, in descending weight order:
      1. Set accept_surplus to TRUE unless the member has no demand while lower-weight groups do
      2. Break if set to TRUE
In other words: in each group of siblings, set accept_surplus to TRUE for the highest-weighted groups that have demand (see the sketch below)
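A minimal Python sketch of that pass, assuming a simple Group node type; the names and structure are illustrative, not the production code:

    class Group:
        def __init__(self, name, weight, threshold=0, children=None):
            self.name = name
            self.weight = weight
            self.threshold = threshold
            self.children = children or []
            self.demand = 0
            self.accept_surplus = False

        def has_demand(self):
            return self.demand > self.threshold

    def balance(group):
        """Set accept_surplus among each set of siblings, heaviest-with-demand first."""
        siblings = sorted(group.children, key=lambda g: g.weight, reverse=True)
        for i, g in enumerate(siblings):
            lower_have_demand = any(s.has_demand() for s in siblings[i + 1:])
            # Skip this group only if it is idle while lighter siblings want the surplus
            if not g.has_demand() and lower_have_demand:
                g.accept_surplus = False
                continue
            g.accept_surplus = True
            # Once the heaviest demanding group accepts surplus, lighter siblings do not
            for s in siblings[i + 1:]:
                s.accept_surplus = False
            break
        # Recurse down the tree (DFS) to balance each subtree's siblings
        for g in group.children:
            balance(g)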
Balancing: All Full
[Diagram: with every queue loaded, surplus is ON only for the heaviest groups; this is the same as having no surplus anywhere!]
Balancing: No mc/hi
[Diagram: with no mcore/himem demand, surplus ON shifts to the other queues]
Balancing: No prod
[Diagram: with no prod demand, surplus ON shifts to the other subtrees]
Results
[Plot: wasted slots over time]
Results & Context
- Multicore is ready to take slack from the other production queues
- Spotty analysis demand over the past few months has let many millions of CPU-hours go to use instead of being wasted
- If all of ATLAS has a lull in demand, OSG jobs can fill the farm
  - Caveat: preemption!
- Fast negotiation: averages for the last 3 days are matches 14.99, duration 7.05 s
- Who is this useful for?
  - The algorithm works for any tree
  - Extensible beyond ATLAS, wherever work is structured outside of the batch system
  - e.g. a multi-tenant service provider with a hierarchy of priorities
  - Really a problem of efficient provisioning, not scheduling
- Constraints:
  - Workflow defined outside of HTCondor
  - Segregation of multicore into separate queues for scheduling
Desired Features & Future Work
- Preemption
  - Wish to maintain a reasonable minimum runtime for grid jobs
  - When ATLAS demand comes back, OSG jobs need to be evicted
  - Requires preempting the dynamic slots created under the partitionable one
  - Work is progressing along these lines, although the final state is not clear
- SlotWeight != Cpus (see the sketch below)
  - Would like to "value" RAM less than CPUs for jobs
  - The high-memory kludge is inelegant
  - Not extensible to differently shaped jobs (high-RAM/low-CPU, or vice versa)
  - Tricky because the total slot weight of the farm needs to be constant to give meaning to the quota allocation
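A purely illustrative sketch of the kind of expression that could value RAM at a fraction of a CPU; the numbers are hypothetical, this is not a deployed policy, and keeping the farm's total weight constant remains the hard part:

    # Weight a slot by its cores plus a discounted contribution from memory
    # (2048 MB is roughly one core's share here, counted at 25% of a CPU)
    SLOT_WEIGHT = Cpus + 0.25 * (Memory / 2048)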
The End
Questions? Comments?
Thanks to Mark Jensen, and the HTCondor team!