IW2D migration to HTCondor
D. Amorim
Thanks to N. Biancacci, A. Mereghetti
2017-06-02
Outline
- Motivation
- In practice
- What is different for the user
- Monitoring the jobs
- Managing the jobs
- How to get the latest version
- Issues with HTCondor
- Conclusion
Motivation
- ImpedanceWake2D jobs can be run on the batch system from an lxplus machine, allowing multiple computations to run in parallel.
- Extensively used for LHC, HL-LHC and FCC impedance scenarios (~40 jobs for the collimators and ~10 for the different beam screens).
- The batch service has been migrated from LSF (IBM, proprietary) to HTCondor (U. of Wisconsin-Madison, open-source).
- Only 10% of the machines will remain on LSF until the end of 2017; LSF will be shut down in 2018 and the remaining machines transferred to HTCondor.
What is different for the user
- Changes are mostly transparent for the user's workflow:
- Python functions keep the same arguments
- Result files are written in the same folders
- The queue argument used for LSF (1nh, 8nh, 1nd…) is not used by HTCondor
- lxplusbatch = None: run on the local computer
- lxplusbatch = 'launch': submit the jobs to HTCondor
- lxplusbatch = 'retrieve': retrieve the results
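The three lxplusbatch modes above can be sketched as a simple dispatch. This is a minimal illustrative sketch, assuming a hypothetical wrapper function run_iw2d; it is not the actual IW2D API, only the mode-selection logic it describes.

```python
# Hypothetical sketch of how the lxplusbatch argument selects the run mode.
# run_iw2d and the returned action strings are illustrative, not the real IW2D API.

def run_iw2d(lxplusbatch=None):
    """Dispatch an ImpedanceWake2D computation according to lxplusbatch."""
    if lxplusbatch is None:
        return "run locally"          # execute IW2D on the current machine
    elif lxplusbatch == 'launch':
        return "submit to HTCondor"   # e.g. via condor_submit
    elif lxplusbatch == 'retrieve':
        return "retrieve results"     # read back the output files
    else:
        raise ValueError(f"unknown lxplusbatch mode: {lxplusbatch!r}")
```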
What is different for the user
- A job is submitted to a cluster, identified by a unique number.
- There are different ways to monitor the jobs:
- From the command line: condor_q -nobatch shows all the jobs currently running
- From the website https://batch-carbon.cern.ch/grafana/
Monitoring the jobs
From the command line, condor_q -nobatch shows for each job:
- Job cluster
- Run time
- Executable launched
- Job state (R: Running, I: Idle, H: Held)
Use watch condor_q -nobatch to get a live view of the jobs (watch relaunches the command every two seconds).
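To check from a script that all jobs were submitted, the condor_q output can be tallied by state. A minimal sketch; the sample output below is simplified and the column positions are an assumption about the condor_q table layout, which may differ between versions.

```python
# Tally job states (R/I/H) from `condor_q -nobatch` output.
# The sample output and column positions are assumed/simplified.
from collections import Counter

def count_job_states(condor_q_output: str) -> Counter:
    """Count the ST (state) column: R = Running, I = Idle, H = Held."""
    states = Counter()
    for line in condor_q_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric cluster id such as "1234.0"
        if fields and fields[0].replace('.', '').isdigit():
            states[fields[5]] += 1    # assumed position of the ST column
    return states

sample = """\
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
1234.0   damorim 6/2  10:01  0+00:12:34 R  0   0.3  iw2d.exe
1235.0   damorim 6/2  10:01  0+00:00:00 I  0   0.3  iw2d.exe
1236.0   damorim 6/2  10:02  0+00:00:00 H  0   0.3  iw2d.exe
"""
```

For example, count_job_states(sample) reports one job in each of the R, I and H states, which can be compared against the number of jobs the script launched.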
Monitoring the jobs
From the website https://batch-carbon.cern.ch/grafana
Data is refreshed every 5 minutes.
Managing the jobs
- condor_rm is used to delete jobs:
- condor_rm <cluster> deletes a specific job
- condor_rm -all deletes all the user's jobs
- HTCondor generates for each job (cluster) a log file, an output file and an error file:
- The log file contains the submission time, the execution time and machine, information on the job…
- The output file contains the STDOUT of the executable: for IW2D, what is printed on the screen (calculation time)
- The error file contains the errors encountered during execution (wrong input file format…)
- These files are stored along with the resulting impedance files.
- No mail is sent to the user when a job finishes, fails or is removed.
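The three per-job files are declared in the HTCondor submit description. A minimal sketch of such a submit file, where the executable name and file-name pattern are illustrative and not the ones IW2D actually generates:

```
# Minimal HTCondor submit file sketch; executable and file names are illustrative.
executable = iw2d.exe
arguments  = input.txt

# The three per-job files described above:
log    = iw2d_$(ClusterId).log    # submission time, execution machine, job info
output = iw2d_$(ClusterId).out    # STDOUT of the executable
error  = iw2d_$(ClusterId).err    # errors encountered during execution

queue
```

The $(ClusterId) macro expands to the job's cluster number, so each submitted job gets its own set of files.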
How to get the latest version of IW2D
- If git is used to manage the repository (git clone was used to download it):
- Go to the IW2D repository
- Do git pull
- Or download the archive from https://gitlab.cern.ch/IRIS/IW2D
Current issues
- Errors might arise during job submission:
- ERROR: store_cred failed
- ERROR: failed to read any data from /usr/bin/batch_krb5_credential
- Seems to be a credential issue; reported to IT, under investigation
- Job submission is slow: it can take more than 10 minutes to submit 50 jobs
- Check that all the jobs were properly submitted, otherwise relaunch the script
- Update: the problem has been solved by IT: no more credential errors, and job submission is much faster
Conclusions
- HTCondor is now the default batch system at CERN.
- ImpedanceWake2D has been modified to handle HTCondor:
- The change is mostly transparent for the user's workflow
- The Python functions work the same
- The commands to monitor and manage the jobs change
- The IW2D repository on https://gitlab.cern.ch/IRIS/ is up to date.
- The job-submission problems encountered initially have been solved by IT.
- Remarks, suggestions and bug reports on IW2D are welcome!
- The migration of DELPHI is also finished and will be uploaded soon.
References
- A list of useful commands for HTCondor: http://www.iac.es/sieinvens/siepedia/pmwiki.php?n=HOWTOs.CondorUsefulCommands
- CERN documentation for HTCondor: http://batchdocs.web.cern.ch/batchdocs/index.html
- Quick start guide for HTCondor from U. of Wisconsin-Madison: https://research.cs.wisc.edu/htcondor/manual/quickstart.html