FermiGrid School. FermiGrid 201: Scripting and Running Grid Jobs. Steven Timm.



Course Outline
 Introduction: essential definitions and prerequisites
 Using globus-job-run and globus-url-copy
 Using condor_submit
 Using DAGman
 Use and care of certificates
 Monitoring of grid jobs and problem diagnosis
 Labs

Introduction
This course covers examples of submitting jobs from the client machine fnpcsrv1 to the compute resource fngp-osg.
The examples used here should work on any Open Science Grid site, but how to identify those sites is beyond the scope of this course.
You could install your own client on your own machine; future FermiGrid courses will cover how to do this.
Ask lots of questions; we will answer them and add them to future editions of the course.
By the end of this course, you should be able to submit a simple job to the grid, submit a complex job to the grid, and transfer files to the grid resource.

Introduction: Term Definitions
OSG = Open Science Grid
 Approximately 80 sites, mostly in the United States, that share compute and storage resources with each other. Three of those sites are here at FNAL.
VDT = Virtual Data Toolkit
 Funded and maintained by the Open Science Grid, this is a one-stop collection of all software needed to run on the Grid.
Certificate
 X.509 certificates authenticate you to the grid sites. They are signed by a Certificate Authority.
Proxy
 A short-lived, self-contained representation of your certificate which can be used to submit jobs to the grid.
Globus Toolkit
 A wide set of services for grid job submission, file transfer, and more.

Before you can submit
You need:
 Access to the Open Science Grid (OSG) client software. This software is already installed on fnpcsrv1.
 A personal X.509 certificate. All Fermilab staff already have this via the Kerberos Certificate Authority.
 Membership in a Virtual Organization (VO). All Fermilab staff and users are part of the Fermilab VO automatically.
 Some place that will accept the jobs of your VO. FermiGrid accepts jobs from all VOs in OSG.

Preparing to submit
Log into a machine that has the client software on it:
 ssh -l <username> fnpcsrv1.fnal.gov
Source the setup file:
 source /usr/local/vdt/setup.sh
Obtain a Fermilab KCA certificate:
 kx509
 kxlist -p
Get the certificate signed by the Fermilab VOMS server:
 voms-proxy-init -noregen -voms fermilab:/fermilab
Verify that the voms-proxy-init worked:
 voms-proxy-info -all
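The slides list these commands one at a time; collected into a small script they look like the sketch below. The tool names come from the slides, but the script name get_proxy.sh and the guard logic are assumptions added so the script fails gracefully on a machine where the VDT client tools are not on the PATH.

```shell
#!/bin/sh
# get_proxy.sh -- sketch of the proxy-acquisition sequence (not an official script).
# The guard below is an addition: it checks that the VDT client tools are
# actually available before trying to run them.

have() { command -v "$1" >/dev/null 2>&1; }

if have kx509 && have kxlist && have voms-proxy-init; then
    kx509                                              # obtain a Fermilab KCA certificate
    kxlist -p                                          # print the certificate to verify it
    voms-proxy-init -noregen -voms fermilab:/fermilab  # get it signed by the VOMS server
    voms-proxy-info -all                               # confirm the proxy and its timeleft
else
    echo "VDT client tools not found; run 'source /usr/local/vdt/setup.sh' first."
fi
```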

Preparing to submit: sample output
Comment: the warning about the missing /home/condor directory is routine.
bash-3.00$ source /usr/local/vdt/setup.sh
bash-3.00$ kx509
bash-3.00$ kxlist -p
Service kx509/certificate
issuer= /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA
subject= /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Steven C. Timm/UID=timm
serial=7E6C63 hash=03c202fc
bash-3.00$ voms-proxy-init -noregen -voms fermilab:/fermilab
Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses
Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Steven C. Timm/USERID=timm
Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses
Contacting voms.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov] "fermilab" Done
Creating proxy Done
Your proxy is valid until Tue Feb 26 07:41:

How did you know it worked?
bash-3.00$ voms-proxy-info -all
WARNING: Unable to verify signature! Server certificate possibly not installed.
Error: Cannot find certificate of AC issuer for vo fermilab
subject   : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Steven C. Timm/USERID=timm/CN=proxy
issuer    : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Steven C. Timm/USERID=timm
identity  : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Steven C. Timm/USERID=timm
type      : proxy
strength  : 512 bits
path      : /tmp/x509up_u2904
timeleft  : 10:41:35
=== VO fermilab extension information ===
VO        : fermilab
subject   : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Steven C. Timm/USERID=timm
issuer    : /DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov
attribute : /fermilab/Role=NULL/Capability=NULL
timeleft  : 10:41:35
bash-3.00$
The error message about the server certificate above can be ignored.

What if voms-proxy-init didn't work?
bash-3.00$ voms-proxy-init -noregen -voms cms:/cms
Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses
Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Steven C. Timm/USERID=timm
Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses
Contacting lcg-voms.cern.ch:15002 [/DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch] "cms" Failed
Error: cms: User unknown to this VO. Trying next server for cms.
Contacting voms.cern.ch:15002 [/DC=ch/DC=cern/OU=computers/CN=voms.cern.ch] "cms" Failed
Error: cms: User unknown to this VO.
None of the contacted servers for cms were capable of returning a valid AC for the user.
The above is the error message you get if you are not in the VO.
Check your membership by going to the VOMS server.
voms-proxy-init -debug is your friend.
To join a VO that you're not in now, use VOMRS to request membership.

Lab 1
Use the kx509 / kxlist -p / voms-proxy-init sequence to get a valid VOMS proxy.
Show the instructor when you are ready.

Grid job submission in English
There is a submission machine and a compute element.
 In this example, fnpcsrv1 is the submission machine and fngp-osg is the compute element.
The client side authenticates to the compute resource,
 using your certificate and the machine's certificate to make an SSL connection.
The executable and input files are transferred to the compute resource
 by opening an https connection.
The executable is submitted to the batch system on the compute resource
 using the GRAM interface.
When the job completes, the output files are transferred back,
 again using an https port.

Test submit: globus-job-run
Example:
 globus-job-run fngp-osg.fnal.gov:2119/jobmanager-fork /usr/bin/id
Structure of the example:
 Host and port to submit the job to; 2119 is the default port and can be omitted.
 Which jobmanager to use. jobmanager-fork is usually the default; others are available and will be covered later.
 The command to run. This structure runs the /usr/bin/id that's already on the remote machine.
Comments:
 globus-job-run should be used only for diagnostic purposes.
 One daemon per globus-job-run is launched on the remote machine and stays running until the job exits, or sometimes hangs.

Test transfer: globus-url-copy
globus-url-copy is the command-line client for GridFTP.
Example:
 globus-url-copy file://${HOME}/foo gsiftp://fngp-osg.fnal.gov/grid/data/foo.${USER}
Comments:
 globus-url-copy is for small files and light testing.
 In the above example, the environment variables are evaluated on the submit machine.
 Works for transfers to compute elements or storage elements.
 For big data flows use srmcp, covered this afternoon.

Lab 2
Execute the following sequence:
globus-job-run fngp-osg.fnal.gov:2119/jobmanager-fork /usr/bin/id
globus-url-copy file://${HOME}/helloworld.sh gsiftp://fngp-osg.fnal.gov/grid/data/helloworld.sh.${USER}
globus-job-run fngp-osg.fnal.gov:2119/jobmanager-fork /bin/chmod 755 /grid/data/helloworld.sh.${USER}
globus-job-run fngp-osg.fnal.gov:2119/jobmanager-fork /grid/data/helloworld.sh.${USER}
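Lab 2 assumes a helloworld.sh already exists in your home directory. The slides never show its contents, so here is a minimal sketch of what such a test job could look like; the exact contents are an assumption.

```shell
#!/bin/sh
# helloworld.sh -- minimal grid test job (illustrative sketch).
# Prints enough information to confirm where and as whom the job ran.
echo "Hello world from host: $(hostname)"
echo "Running as user: $(id -un)"
echo "Arguments received: $*"
```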

Condor submission concepts in English
Condor is a comprehensive batch system and grid submission software.
The grid submission client components are called Condor-G.
You have to install all of Condor to use the Condor-G clients.
Condor-G runs on the submission host and:
 Transfers your executable and input files to the remote compute element and gets the job started.
 Monitors the status of the job every minute to see if it is done.
 Transfers the files back when the job is over.

Condor submission: simple example
universe = grid
type = gt2
globusscheduler = fngp-osg.fnal.gov/jobmanager-condor
executable = recon1
transfer_output = true
transfer_error = true
transfer_executable = true
stream_output = false
stream_error = false
log = grid_recon1.log.$(Cluster).$(Process)
notification = NEVER
output = grid_recon1.out.$(Cluster).$(Process)
error = grid_recon1.err.$(Cluster).$(Process)
globusrsl = (jobtype=single)(maxwalltime=999)
queue
Use the grid universe for all grid jobs; type gt2 refers to version 2 of Globus.
recon1 is a binary that will run for 3 minutes.
To submit it: condor_submit grid_recon1

Transferring input and output files
bash-3.00$ more fngp-osg-gridsleep-fourargs
Universe = grid
remote_initialdir = /grid/data/foo
GridResource = gt2 fngp-osg/jobmanager-condor
executable = gridsleep.sh
# Old style of condor arguments
arguments = one two three four
transfer_output = true
transfer_error = true
transfer_executable = true
stream_output = False
stream_error = False
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_input_files = foo
transfer_output_files = bar
log = gridsleep.log.$(Cluster).$(Process)
notification = NEVER
output = gridsleep.out.$(Cluster).$(Process)
error = gridsleep.err.$(Cluster).$(Process)
globusrsl = (condorsubmit=(requirements 'Disk>5000'))
queue 1
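The submit file above marks its arguments line as the old style. For reference, the equivalent new-style Condor syntax (a sketch, not part of the original example) puts the whole argument list in double quotes:

```
# new-style condor arguments: one quoted string, words separated by spaces
arguments = "one two three four"
```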

Lab 3
Submit the jobs grid_recon1 and fngp-osg-gridsleep-fourargs.
Monitor their progress with condor_q and condor_q -globus.
Record any errors.

Globus RSL
RSL = Resource Specification Language.
It is the way to communicate requirements to the remote batch system.
It can be used to set memory, wall time, processor type, architecture, and more.
We have examples.
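As a sketch of what such requirements look like in a Condor-G submit file: the globusrsl attribute carries the RSL string to the remote jobmanager. The maxWallTime and maxMemory values below are illustrative assumptions, not site policy; check what your target site supports.

```
# hypothetical submit-file fragment: RSL attributes for the remote batch system
# (maxWallTime is in minutes, maxMemory in megabytes)
globusrsl = (jobtype=single)(maxWallTime=120)(maxMemory=1024)
```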

Condor DAGman
DAG = Directed Acyclic Graph.
DAGman is used to express dependencies, so that one job does not start until its predecessor has completed.
An example is provided in the example tarball; we will go through it if we have time.
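The tarball example is not reproduced in the slides. As an illustrative sketch (the file names diamond.dag, jobA.sub, and so on are assumptions), a small DAG file expressing dependencies looks like this:

```
# diamond.dag -- B and C run after A finishes; D runs after both B and C
JOB A jobA.sub
JOB B jobB.sub
JOB C jobC.sub
JOB D jobD.sub
PARENT A CHILD B C
PARENT B C CHILD D
```

Submit it with condor_submit_dag diamond.dag; DAGman then submits each node's job once all of its parents have completed.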

Using DOEGrids Certificates
Why get a DOEGrids cert? (See http://security.fnal.gov/pki for a full explanation.)
Store your DOEGrids cert and private key on some non-network-mounted disk.

Monitoring of Grid Jobs
Globus GRAM is meant to hide the remote batch system details from the submitting host. It is very good at this.
 condor_q
 condor_q -globus
 condor_q -held
 CondorView

Problem diagnosis
Globus error 7: authentication failure; at Fermilab this is usually a problem with SAZ or GUMS.
Globus error 10: failure to transfer a file; means something is out of quota somewhere.
Globus error 155: failure to stage out; happens when the proxy expires before the end of the job.
Globus error 17: either the executable isn't there or there is something wrong with the batch system.