Late Materialization has (lately) materialized

Presentation transcript:

Late Materialization has (lately) materialized John (TJ) Knoeller Condor Week 2018

How long can this go on? How long would this take to submit?

    executable = /bin/echo
    args = Hello World
    queue 10*1000*1000

    ...........................................................................
    (condor_submit prints one dot per job -- ten million of them)

We want this to work. Our solution is "Late Materialization": just-in-time creation of job ClassAds in the Schedd.

First shown in 8.5 (2017)
Lots of limitations:
- Worked only with Queue <N>
- No real error checking
- Not actually included in a release
But it worked! As jobs finished, new jobs materialized. It showed where we were going...

Is useful in 8.7 (2018)
- Works with all submit Queue options
- Survives restart of the Schedd
- Respects Schedd limits (max jobs per owner)
- Can replace DAGMan submit throttling:
  - Keep a fixed number of jobs materialized
  - Keep a fixed number of idle jobs (actually non-running jobs, like DAGMan)

Why just-in-time? The number of jobs in the queue impacts:
- Building the "priority list" for negotiation
- Recalculation of autoclusters for negotiation
- condor_q/hold/qedit/etc, which usually scan all materialized job ads
(The number of running jobs matters more, but...)

You can throttle with DAGMan, but:
- It is a comparatively expensive way to do it
- It hides job pressure from the Schedd (and from Glide-in factories and Annexes)

Enough about why, let's talk about how. But first, I have to explain some things...

What the job "queue" looks like Cluster Not a queue, order is random Schedd operates on Job ads Cluster ad has common attrs (Introduced to save memory) Job ad is overlay of Cluster ad All changes go into job ad Cluster ad is invisible to clients Cluster Cluster Jobs

What submit actually does (send mostly identical jobs)
- Make job <Cluster>.0
- Send 80ish attributes as <Cluster>.-1
- Send 2 attributes as <Cluster>.0
- For proc = 1 to <N>:
  - Ask permission to add a proc to the cluster
  - Send 2 attributes as <Cluster>.<proc>, plus any attributes that differ from <Cluster>.-1
  - Print a dot
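
A minimal sketch of that loop; send_ad() and request_proc() are hypothetical stand-ins for the schedd calls, stubbed out here so the sketch runs. This is not condor_submit's real code, only the shape of the protocol described above.

    def send_ad(cluster, proc, attrs):
        pass  # in reality: ship these attributes to the schedd as <cluster>.<proc>

    def request_proc(cluster):
        pass  # in reality: ask the schedd for permission to add another proc

    def classic_submit(cluster, common_ad, per_proc_ads):
        send_ad(cluster, -1, common_ad)              # the ~80 common attributes, once
        for proc, ad in enumerate(per_proc_ads):
            request_proc(cluster)
            diff = {k: v for k, v in ad.items() if common_ad.get(k) != v}
            send_ad(cluster, proc, diff)             # only what differs for this job
            print(".", end="", flush=True)           # one dot per job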

What the job queue will look like
- The Cluster holds a Submit Digest, used to materialize jobs
- Jobs are created as needed
- Changes might go into the cluster ad
- condor_q/hold/etc may operate on the cluster ad
[Diagram: clusters, each holding a Submit Digest and its materialized jobs]

What late materialization does (send a recipe for making jobs)
- Make the cluster ad from job <Cluster>.0; send 80ish attributes as <Cluster>.-1
- Teach the Schedd to make the job ads:
  - Capture and send the submit itemdata
  - "Digest" and send the submit file
- The Schedd saves these to the $(SPOOL) directory
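
For contrast with the classic loop sketched earlier, a minimal sketch of this hand-off; the send_* helpers are hypothetical stand-ins for the schedd calls, stubbed so the sketch runs.

    def send_ad(cluster, proc, attrs):
        pass  # ship attributes to the schedd as <cluster>.<proc>

    def send_itemdata(cluster, items):
        pass  # the schedd writes these lines to a file under $(SPOOL)

    def send_digest(cluster, digest):
        pass  # the schedd saves the digest and materializes job ads from it later

    def factory_submit(cluster, common_ad, itemdata, digest):
        send_ad(cluster, -1, common_ad)   # cluster ad built from job <Cluster>.0
        send_itemdata(cluster, itemdata)
        send_digest(cluster, digest)      # no per-proc loop: jobs appear on demand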

Submit itemdata
If your submit file uses one of:
- Queue in (a, b, c)
- Queue from <file>
- Queue from <script>
- Queue matching *.dat
then the items are sent to the Schedd as lines, written to a file in $(SPOOL), and the filename is returned.
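
A small example of one such submit file; the executable, the variable name "params", and the file name "params.txt" are made up for illustration, but any "queue ... from <file>" form produces itemdata this way.

    executable = /bin/echo
    args       = $(params)
    queue params from params.txt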

Submit Digest
The submit file, simplified and frozen:
- $ENV() expanded
- if and include are processed
- last keyword wins
- QUEUE items are loaded and counted
- QUEUE statement simplified to one of:
  - Queue <N>
  - Queue <N> from <items-file>
(Even more "digesting" in the future.)
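
As a conceptual illustration (the slides do not show the literal digest text, so the exact shape below is an assumption), a matching-style queue statement would be frozen roughly like this:

    # in the original submit file
    queue params matching *.dat
    # in the frozen digest, after the matches are loaded and counted
    queue <N> from <items-file written into $(SPOOL)>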

How do I enable it?
Configure with:
- SCHEDD_ALLOW_LATE_MATERIALIZE = true
And submit with:
- max_materialize = <n>, or
- materialize_max_idle = <n>, or
- -factory (name subject to change)
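
Putting those together, a minimal sketch (the file name job.sub and the limit of 10000 are made up; the knob names are the ones above):

    # on the submit host's configuration
    SCHEDD_ALLOW_LATE_MATERIALIZE = true

    # job.sub
    executable      = /bin/echo
    args            = Hello World
    max_materialize = 10000
    queue 10*1000*1000

Then condor_submit job.sub works as usual, but the Schedd keeps at most 10000 job ads materialized at any one time.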

Does it work from python? Coming in 8.7.9:

    import htcondor

    schedd = htcondor.Schedd()
    sub = htcondor.Submit("""
        executable = /bin/echo
        materialize_max_idle = 1
        """)
    sayings = [
        {'Args': "Welcome to Wisconsin"},
        {'Args': "Come and freeze in the land of cheese"},
    ]
    with schedd.transaction() as txn:
        sub.queue_from_iter(txn, 1, iter(sayings))

What about tools?
condor_q -factory [-wide]

    ID    OWNER   SUBMITTED   LIMIT PRESNT RUN IDLE HOLD NEXTID MODE DIGEST
    107.  johnkn  5/12 14:40    100     10   4    6    0     75 Norm /var/li

(Otherwise, clusters without materialized jobs are invisible.)
condor_hold <clusterid> holds the jobs and pauses materialization.
condor_qedit <clusterid> edits the job ads and the cluster ad.

More work is needed. Suggestions and feedback are welcome!
What we are thinking about:
- What should normal condor_q output be?
- Should you be able to qedit the ClusterAd?
- What about editing the submit digest? Append items to the itemdata file?
Future work?
- Apply job transforms to the ClusterAd?
- Materialize on match?

Any Questions?