HTCondor and LSST
Stephen Pietrowicz, Senior Research Programmer
National Center for Supercomputing Applications
HTCondor Week, May 2-5, 2017
Large Synoptic Survey Telescope
8.4-meter ground-based telescope on Cerro Pachón in Chile
3.2-gigapixel camera
Transferring 15 terabytes of data nightly, for 10 years
Nightly alert processing at NCSA
Yearly data release processing at NCSA and IN2P3 in France
http://www.lsst.org/
LSST Software Stack
Organized into dozens of custom and third-party packages
Applications framework
Data access
Sky tessellation
Managing task execution
How We’ve Used HTCondor So Far
Software stack scaling tests
Alert Processing Simulation
Orchestration/Execution on different sites using templates
Statistics gathering for better insight into how things are running
Alert Processing
The changing sky will produce ~10 million alerts nightly; astronomers will be able to subscribe to the alerts they’re interested in, which will be produced within 60 seconds of observation
Alert Processing Simulation
Proof of concept of data and job workflows
25 VMs simulating 240 nodes
Two HTCondor clusters: task execution and custom transfer
Alert Processing Simulation
Orchestration
Sets up and invokes workflow execution, monitors status
Captures the software environment and records versions
Plugins (e.g., DAGMan, Pegasus, simple ssh invocations); a sketch of the idea follows this slide
User-specified configuration files: /home, software stack locations, scratch directories, etc.
Configuration can be complicated for new users
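The plugin mechanism can be pictured as a simple name-to-launcher dispatch. The sketch below is illustrative only and is not ctrl_orca's actual API; the class names and commands are hypothetical stand-ins for the DAGMan, Pegasus, and ssh plugins.

```python
# Hypothetical sketch of orchestration plugin dispatch; ctrl_orca's real
# classes and configuration format differ.

class DagmanPlugin:
    def submit(self, workflow):
        # A DAGMan plugin would ultimately hand the DAG to condor_submit_dag.
        print("condor_submit_dag", workflow)

class PegasusPlugin:
    def submit(self, workflow):
        # A Pegasus plugin would plan and run the workflow via Pegasus tools.
        print("pegasus-plan / pegasus-run for", workflow)

class SshPlugin:
    def submit(self, workflow):
        # A simple ssh plugin just invokes a script on a remote head node.
        print("ssh headnode ./run.sh", workflow)

PLUGINS = {"dagman": DagmanPlugin, "pegasus": PegasusPlugin, "ssh": SshPlugin}

def launch(plugin_name, workflow):
    """Pick the execution plugin named in the user's configuration and submit."""
    return PLUGINS[plugin_name]().submit(workflow)

launch("dagman", "alert_prod.dag")
```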
Execution
Abstract away the details of glide-ins and execution
Platform-specific configuration
Substitution of common elements into platform-specific templates, to help eliminate user errors (see the sketch after the sites.xml example)
User specifies the minimum amount of information required:
Site name
Time allocation for nodes
Input data and execution script
DAG generator (DAGMan or Pegasus script)
Example from sites.xml
Before:
<profile namespace="pegasus" key="auxillary.local">true</profile>
<profile namespace="condor" key="getEnv">True</profile>
<profile namespace="env" key="PEGASUS_HOME">$PEGASUS_HOME</profile>
<profile namespace="condor" key="requirements">(ALLOCATED_NODE_SET == "$NODE_SET")</profile>
<profile namespace="condor" key="+JOB_NODE_SET">"$NODE_SET"</profile>
<profile namespace="env" key="EUPS_USERDATA">/scratch/$USERNAME/eupsUserData</profile>
After:
<profile namespace="env" key="PEGASUS_HOME">/software/middleware/pegasus/current</profile>
<profile namespace="condor" key="requirements">(ALLOCATED_NODE_SET == "srp_478")</profile>
<profile namespace="condor" key="+JOB_NODE_SET">"srp_478"</profile>
<profile namespace="env" key="EUPS_USERDATA">/scratch/srp/eupsUserData</profile>
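The before/after above shows common elements being filled into a platform template. A minimal sketch of that kind of substitution, assuming Python's standard string.Template and reusing the $NODE_SET, $USERNAME, and $PEGASUS_HOME placeholders from the example; this is not necessarily how ctrl_exec implements it.

```python
from string import Template

# Platform-specific template fragment from sites.xml; the $-placeholders are
# the "common elements" filled in per user and per node allocation.
SITES_TEMPLATE = Template("""\
<profile namespace="env" key="PEGASUS_HOME">$PEGASUS_HOME</profile>
<profile namespace="condor" key="requirements">(ALLOCATED_NODE_SET == "$NODE_SET")</profile>
<profile namespace="condor" key="+JOB_NODE_SET">"$NODE_SET"</profile>
<profile namespace="env" key="EUPS_USERDATA">/scratch/$USERNAME/eupsUserData</profile>
""")

# Values supplied by the user or the site configuration; everything else
# stays fixed in the platform template.
values = {
    "PEGASUS_HOME": "/software/middleware/pegasus/current",
    "NODE_SET": "srp_478",
    "USERNAME": "srp",
}

print(SITES_TEMPLATE.substitute(values))
```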
Statistics
The statistics package contains commands to ingest DAGMan log event records into a database. The package’s utilities group all of these records by HTCondor job ID to give an overview of what happened during each job.
Grouped records
Submitted; began executing on node 198.202.101.206; image size updated to 119476; an exception occurred in the Shadow daemon; execution restarted on 198.202.101.110; image size updated to 102492, then to 682660; disconnected from the node; reconnection to the node failed; execution restarted on 198.202.101.185; the job then terminated.
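A minimal sketch of the grouping step, not the actual ctrl_stats implementation: it scans an HTCondor user/DAGMan log (the file name below is just an example) and collects each job's event header lines by cluster.proc ID, which is enough to reconstruct a per-job history like the one above.

```python
import re
from collections import defaultdict

# Each event in an HTCondor user/DAGMan log begins with a header line like:
#   001 (123.000.000) 05/02 12:00:05 Job executing on host: <198.202.101.206:9618>
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (.*)")

def group_events(log_path):
    """Collect event header lines for each job, keyed by (cluster, proc)."""
    events = defaultdict(list)
    with open(log_path) as log:
        for line in log:
            match = EVENT_HEADER.match(line)
            if match:
                code, cluster, proc, rest = match.groups()
                events[(int(cluster), int(proc))].append((code, rest.strip()))
    return events

if __name__ == "__main__":
    # Example log name; substitute the node log written by DAGMan for the workflow.
    for job_id, job_events in sorted(group_events("workflow.dag.nodes.log").items()):
        print("job %d.%d:" % job_id)
        for code, description in job_events:
            print("  %s %s" % (code, description))
```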
Execution report
More information
Alert Production Simulator
Description: http://dmtn-003.lsst.io/en/master/
Animation: https://lsst-web.ncsa.illinois.edu/~srp/alert/alert.html
Simulator: http://github.com/lsst-dm/ctrl_ap
Statistics: http://github.com/lsst/ctrl_stats
Orchestration: http://github.com/lsst/ctrl_orca
Execution: http://github.com/lsst/ctrl_exec
Platform config: http://github.com/lsst/ctrl_platform_lsstvc