Long term job submission and monitoring using grid services
Riccardo Bruno, INFN, Catania Section
23/07/2007, Meeting on the use of parallel applications in PI2S2

Outline
• Long term job submission
  – MyProxyServer
  – Renewal
  – The renewal process and JDL tag
• Long term job monitoring
  – Middleware tools
  – How to do monitoring efficiently
  – The Watchdog
  – Watchdog use example
  – The main script
  – The watchdog flow
  – The main script code
  – Some outputs
  – The future …
• References

Catania, Meeting on the use of parallel applications in PI2S2, 23.07.2007

Long term job submission

MyProxyServer
A proxy has a limited lifetime (the default is 12 h).
• Creating a longer-lived proxy is a bad idea
The MyProxy server commands:
• myproxy-init --voms <voname> -s <host_name>
  – Creates and stores a long term proxy certificate; -s <host_name> specifies the hostname of the MyProxy server
• myproxy-info
  – Gets information about the stored long-lived proxy
• myproxy-get-delegation
  – Gets a new proxy from the MyProxy server
• myproxy-destroy
  – Removes the stored proxy from the server

Renewal
• A dedicated service on the RB can renew the proxy automatically: [edg-wl-renewd] - /etc/init.d/edg-wl-proxyrenewal
• Some dedicated flags are required when creating the long term proxy credential with myproxy-init:
  – -d : use the proxy certificate subject (DN) as the default username, instead of the LOGNAME environment variable
  – -n : don't prompt for a passphrase

    bash-2.05b$ myproxy-init --voms cometa -d -n
    Your identity: /C=IT/O=GILDA/L=INFN Catania/CN=Riccardo Bruno/Email=riccardo.bruno@ct.infn.it
    Enter GRID pass phrase for this identity:
    Creating proxy ......................................... Done
    Proxy Verify OK
    Your proxy is valid until: Fri Jul 23 09:30:33 2007
    A proxy valid for 168 hours (7.0 days) for user /C=IT/O=GILDA/L=INFN Catania/CN=Riccardo Bruno/Email=riccardo.bruno@ct.infn.it now exists on grid001.ct.infn.it.

The renewal process and JDL tag
5 or 10 minutes before the proxy expires, the RB proxy renewal daemon performs the following steps:
• Contacts the MyProxyServer indicated in the JDL and asks for a new delegation
• Contacts the VOMS server to add the ACs
• Transfers the new VOMS-enabled proxy to the WNs running the job
An additional attribute has to be added to the JDL:

    MyProxyServer = "grid001.ct.infn.it";

This attribute tells the RB which MyProxyServer has to be contacted to renew the credentials. If it is missing, a default one is taken from the UI VO configuration settings: glite_wmsui.conf

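As a rough sketch, a JDL carrying the renewal attribute could be generated from a shell script as shown here. The helper name write_renewal_jdl and the output file name renewal_example.jdl are hypothetical, and the executable and sandbox entries are only illustrative; the MyProxyServer attribute and host name are the ones quoted on this slide.

```shell
#!/bin/bash
# Sketch: write a JDL that names the MyProxy server to be used for renewal.
# write_renewal_jdl is a hypothetical helper, not a gLite tool.
write_renewal_jdl() {
    local jdl_file=$1 myproxy_host=$2
    cat > "$jdl_file" <<EOF
Type = "Job";
JobType = "Normal";
Executable = "/bin/bash";
Arguments = "script.sh";
StdOutput = "file.out";
StdError = "file.err";
InputSandbox = {"script.sh"};
OutputSandbox = {"file.out", "file.err"};
MyProxyServer = "$myproxy_host";
EOF
}

# Illustrative call: creates renewal_example.jdl in the current directory
write_renewal_jdl renewal_example.jdl grid001.ct.infn.it
grep MyProxyServer renewal_example.jdl
```

Without the MyProxyServer line the RB would fall back to the default server configured on the UI.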
Long term job submission
• Create the long term proxy on the MyProxy server:
    myproxy-init --voms cometa -d -n
• Create a new proxy, or get a delegation from the MyProxy server:
    voms-proxy-init --voms cometa
    myproxy-get-delegation -d -a $X509_USER_PROXY
  (Note that myproxy-get-delegation requires an already valid proxy on the UI.)
• Submit the job normally:
    edg-job-submit -o jid testmyproxy.jdl

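The sequence above could be wrapped in a small script such as the following sketch. DRY_RUN, run and submit_long_job are hypothetical names introduced only for illustration; the grid commands and options are the ones listed on the slide, and the sketch picks voms-proxy-init for the second step. With DRY_RUN=1 (the default here) the commands are only printed, so the script can be tried on a machine without the middleware.

```shell
#!/bin/bash
# Sketch of the long term submission sequence.
# DRY_RUN, run and submit_long_job are hypothetical helpers.
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

submit_long_job() {
    local vo=$1 jdl=$2
    # 1. store the long term proxy on the MyProxy server
    run myproxy-init --voms "$vo" -d -n || return 1
    # 2. create the (short lived) proxy used for the submission
    run voms-proxy-init --voms "$vo" || return 1
    # 3. submit the job, saving the job identifier in the file "jid"
    run edg-job-submit -o jid "$jdl"
}

submit_long_job cometa testmyproxy.jdl
```

Setting DRY_RUN=0 on a configured UI would execute the three commands in order and stop at the first failure.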
Renewal feedback

    Starting at: 20070720124320
    subject : /C=IT/O=INFN/…/CN=proxy/CN=proxy/CN=proxy/CN=proxy/CN=limited proxy
    …
    type : limited proxy
    strength : 512 bits
    path : /tmp/globus-tmp.unime-wn-03.27834.0
    timeleft : 0:56:58
    === VO cometa extension information ===
    VO : cometa
    subject : /C=IT/O=INFN/OU=Personal Certificate/L=Catania/CN=Riccardo Bruno
    issuer : /C=IT/O=INFN/OU=Host/L=Catania/CN=voms.ct.infn.it
    attribute : /cometa/Role=NULL/Capability=NULL
    timeleft : 11:56:01
    …
    Other output from the job's core execution (just a sleep)
    subject : /C=IT/O=INFN/…/CN=proxy/CN=proxy/CN=proxy/CN=limited proxy
    timeleft : 8:45:18
    timeleft : 10:26:00
    Ending at: 20070720141321

• This job was executed with a delegated proxy valid for 1 hour (myproxy-get-delegation -d -t 1:00 -a $X509_USER_PROXY)
• The first call to voms-proxy-info reports 0:56:58 of time left
• After the job's core execution, the second call to voms-proxy-info reports 8:45:18 of time left
• The job ran for about 90 minutes (12:43:20 to 14:13:21), longer than the initial proxy lifetime, so the proxy was renewed while the job was running
• Note also the different subjects:
    /C=IT/O=INFN/…/CN=proxy/CN=proxy/CN=proxy/CN=proxy/CN=limited proxy
    /C=IT/O=INFN/…/CN=proxy/CN=proxy/CN=proxy/CN=limited proxy

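A check like the one shown above could be scripted roughly as follows. fmt_hms and log_timeleft are hypothetical helper names, and the sketch assumes that voms-proxy-info --timeleft prints the remaining lifetime in seconds; when no proxy or no voms-proxy-info is available it only reports that fact.

```shell
#!/bin/bash
# Sketch: log the proxy time left before and after the job's core execution.
# fmt_hms and log_timeleft are hypothetical helper names.

# Convert a number of seconds into H:MM:SS
fmt_hms() {
    local s=$1
    printf '%d:%02d:%02d\n' $((s / 3600)) $((s % 3600 / 60)) $((s % 60))
}

log_timeleft() {
    local label=$1 secs
    # Assumption: --timeleft prints the remaining proxy lifetime in seconds
    if secs=$(voms-proxy-info --timeleft 2>/dev/null) && [ -n "$secs" ]; then
        echo "$label timeleft : $(fmt_hms "$secs")"
    else
        echo "$label timeleft : unavailable"
    fi
}

log_timeleft "before job"
sleep 1                      # stands in for the job's core execution
log_timeleft "after job"
```

If the value logged after the job is larger than the one logged before it, the proxy was renewed during the run.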
Long term job monitoring

Middleware tools
Currently gLite offers the following ways to monitor job execution:
• Interactive jobs, or direct use of X server communication via SSH tunneling
  – The user is forced to use an interactive JDL
  – The X client must be kept open for the whole job duration
• Use of RGMA
  – Using dedicated producers requires code changes, which are not always possible
  – Code changes are error prone and need to be tested
• Use of AMGA
  – Using the AMGA APIs requires code changes as well

How to do monitoring efficiently
IDEA: perform the job monitoring still using grid services, in the least invasive way possible.
Observations:
• Almost all jobs submitted on the grid are piloted by shell scripts
  – Shell scripting allows getting precious information in case of faults
  – Shell scripting can pilot more complex batch processing
• Both the SE and the file catalog can be used as the simplest IS on the grid
  – The lfc-* and lcg-* tools are already available for file creation and retrieval
  – The latency of the CLI storage tools is very low compared to the duration of long term jobs
Requirements:
• It would be useful to configure the monitoring tool according to the user's needs
  – A few shell environment variables can be used to configure the monitoring tool

The Watchdog
The Watchdog is a shell script to be included in the main script.
Some watchdog features:
• It starts in the background before the long term job runs
• The watchdog runs as long as the main job
• The main script can stop it and wait until the watchdog has finished
• It is easily and highly configurable
• The watchdog does not compromise the CPU power of the WN
• The watchdog is really simple and its behavior can be extended by the user
The best way to explain the watchdog is through a usage example …

Watchdog use example
The simplest use case involves the following:
• The JDL: script.jdl
• The main script file: script.sh
• The watchdog script file: watchdog.sh

script.jdl

    Type = "Job";
    JobType = "Normal";
    Executable = "/bin/bash";
    StdOutput = "file.out";
    StdError = "file.err";
    InputSandbox = {"watchdog.sh", "script.sh"};
    OutputSandbox = {"file.out", "file.err", "watchdog.out"};
    Arguments = "script.sh";

[Diagram: InputSandbox (script.sh, watchdog.sh) and OutputSandbox (file.out, file.err, watchdog.out)]

The main script
It is good practice to give the main script a structure like the following:
• Get information about the WN
• Start the watchdog
• Execute and control the main job
• Stop the watchdog
• Collect information about the job execution

The watchdog flow
[Flowchart]
• Inputs: VO, USERPATH, file catalog, SE, DELAY, list of files, CTRL file, NTFY file
• Initialization (file catalog/SE: USERPATH/JobId)
• Enter the loop, which continues while the CTRL file exists
  – For each file in the list, take a snapshot (only increments are copied): <timestamp>_<file_1>, <timestamp>_<file_2>, …, <timestamp>_<file_n>
• Create the notification (NTFY) file

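The flow above could be sketched as the following set of shell functions. This is not the original watchdog.sh: every WD_* variable and wd_* function is a hypothetical name, the catalog layout is assumed, and only the watchdog.ctrl and watchdog.done file names are taken from the main script. Uploads assume lcg-cr and lfc-mkdir from the gLite UI/WN tools; when WD_SE is empty the sketch only prints what it would upload.

```shell
#!/bin/bash
# Minimal watchdog sketch; all WD_* names are hypothetical configuration knobs.
WD_VO=${WD_VO:-cometa}
WD_USERPATH=${WD_USERPATH:-/grid/$WD_VO/$(whoami)}  # catalog folder (assumed layout)
WD_JOBID=${WD_JOBID:-job_$$}                         # stands in for the job identifier
WD_SE=${WD_SE:-}                                     # storage element; empty = dry run
WD_DELAY=${WD_DELAY:-60}                             # seconds between snapshots
WD_FILES=${WD_FILES:-"file.out file.err"}            # files to watch (in current dir)
WD_CTRL=${WD_CTRL:-watchdog.ctrl}
WD_NTFY=${WD_NTFY:-watchdog.done}
WD_STATE=${WD_STATE:-.wd_offsets}                    # per-file byte offsets

# Copy one snapshot to the SE and register it in the file catalog.
wd_upload() {
    local snap=$1
    if [ -z "$WD_SE" ]; then
        echo "dry run: would upload $snap"
    else
        lcg-cr --vo "$WD_VO" -d "$WD_SE" \
               -l "lfn:$WD_USERPATH/$WD_JOBID/$snap" "file:$PWD/$snap"
    fi
}

# Save only the bytes appended to a file since the previous snapshot.
wd_snapshot() {
    local f=$1 size last snap
    [ -f "$f" ] || return 0
    size=$(( $(wc -c < "$f") ))
    last=$(cat "$WD_STATE/$f" 2>/dev/null || echo 0)
    [ "$size" -gt "$last" ] || return 0
    snap="$(date +%y%m%d%H%M%S)_$f"
    tail -c +"$((last + 1))" "$f" | head -c "$((size - last))" > "$snap"
    echo "$size" > "$WD_STATE/$f"
    wd_upload "$snap"
}

# Initialization, loop while the CTRL file exists, then notify.
wd_main() {
    local f
    echo "Starting watchdog at: $(date +%y%m%d%H%M%S)"
    mkdir -p "$WD_STATE"
    [ -z "$WD_SE" ] || lfc-mkdir -p "$WD_USERPATH/$WD_JOBID"
    touch "$WD_CTRL"
    while [ -e "$WD_CTRL" ]; do
        for f in $WD_FILES; do wd_snapshot "$f"; done
        sleep "$WD_DELAY"
    done
    for f in $WD_FILES; do wd_snapshot "$f"; done   # final snapshot
    touch "$WD_NTFY"
    echo "Ending watchdog at: $(date +%y%m%d%H%M%S)"
}

# wd_main   # uncomment to run the loop when this file is used as watchdog.sh
```

Keeping the byte offset of each file in a small state directory is one simple way to copy only increments; snapshot names use a one-second timestamp, so WD_DELAY should not be shorter than one second.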
The main script code

    #
    # watchdog - Riccardo Bruno 200707
    #
    echo "Starting at: $(date +%y%m%d%H%M%S)"
    HOSTNAME=$(hostname -f)
    USER=$(whoami)
    ARG1=$1
    LOCALDIR=$(pwd)
    echo "*****************************"
    echo "HOST: $HOSTNAME"
    echo "USER: $USER"
    echo "ARGS: $ARG1"
    echo "LOCALDIR is: $LOCALDIR"
    echo "HOMEDIR is: $HOME"
    echo "Content of home:"
    ls -l "$HOME"
    echo "Content of current dir:"
    ls -l .
    echo "******************************"

    # start the watchdog in the background
    chmod +x watchdog.sh
    ./watchdog.sh > watchdog.out &

    # perform 8 iterations, 15 seconds each (2 minutes)
    for i in $(seq 1 8)
    do
        echo "This is my output at: $(date +%y%m%d%H%M%S)"
        echo "This is my error at: $(date +%y%m%d%H%M%S)" 1>&2
        sleep 15
    done

    # stop the watchdog and wait until it has finished
    rm -f watchdog.ctrl
    while [ ! -e watchdog.done ]
    do
        sleep 1
        echo "Waiting for watchdog: $(date +%y%m%d%H%M%S)"
    done
    echo "Watchdog closed"
    echo "done"
    echo "done" 1>&2
    echo "Ending at: $(date +%y%m%d%H%M%S)"

Some outputs

    [brunor@glite-tutor tmp]$ lfc-ls -l /grid/gilda/brunor/2DFfQYycd5guISZSU3ZdOQ
    -rw-rw-r-- 1 1023 102 2211 Jul 18 16:13 070718161318_testmyproxy.out
    -rw-rw-r-- 1 1023 102 85 Jul 18 16:14 070718161347_testmyproxy.err
    …
    [brunor@glite-tutor brunor_2DFfQYycd5guISZSU3ZdOQ]$ cat file.out
    Starting at: 070713155443
    ****************************************
    <WN INFO …>
    This is my output at: 070713155443
    This is my output at: 070713155633
    done
    Ending at: 070713155643
    [brunor@glite-tutor brunor_2DFfQYycd5guISZSU3ZdOQ]$ cat file.err
    This is my error at: 070713155443
    [brunor@glite-tutor brunor_2DFfQYycd5guISZSU3ZdOQ]$ cat watchdog.out
    Starting watchdog at: 070713155443
    guid:205a2902-89e0-4c68-b963-2facf30efb6f
    guid:a21f30b4-46cf-4e63-919b-ceb911bfe710
    Ending watchdog at: 070713155443

The future …
The watchdog can be easily improved:
• Use a special folder in the catalog as a virtual UI on the WN, allowing the user to issue shell commands:

    WD_USER_PATH/<JobId>/
        <timestamp>_file_1
        <timestamp>_file_2
        …
        <timestamp>_file_n
        UI/
            commands
            <timestamp>_cmdresult_1

• Use the AMGA/RGMA CLI tools instead of the catalog

References
• The watchdog wiki: https://grid.ct.infn.it/twiki/bin/view/PI2S2/WatchdogUtility

Questions…