HTCondor and LSST
Stephen Pietrowicz, Senior Research Programmer
National Center for Supercomputing Applications
HTCondor Week, May 2-5, 2017

Large Synoptic Survey Telescope
- 8.4-meter ground-based telescope on Cerro Pachón in Chile
- 3.2-gigapixel camera
- Transferring 15 terabytes of data nightly, for 10 years
- Nightly alert processing at NCSA
- Yearly data release processing at NCSA and IN2P3 in France
http://www.lsst.org/

LSST Software Stack
Organized into dozens of custom and third-party packages:
- Applications framework
- Data Access
- Sky Tessellation
- Managing Task Execution

How We've Used HTCondor So Far
- Software stack scaling tests
- Alert Processing Simulation
- Orchestration/Execution on different sites using templates
- Statistics gathering for better insight into how things are running

Alert Processing
- The changing sky will produce ~10 million alerts nightly; astronomers will be able to subscribe to the alerts they're interested in, which will be produced within 60 seconds of observation
Alert Processing Simulation
- Proof of concept of data and job workflows
- 25 VMs simulating 240 nodes
- Two HTCondor clusters: task execution and custom transfer
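To make the job-workflow side concrete, here is a minimal sketch, assuming the HTCondor Python bindings (9.x-style Schedd.submit), of handing one simulated per-visit job to the task-execution cluster. The worker script, its arguments, and the requirements expression (which mirrors the ALLOCATED_NODE_SET attribute in the sites.xml example later) are illustrative placeholders, not the ctrl_ap simulator's actual code.

    # Illustrative only: submit one simulated "visit" job to the execution cluster.
    # Requires the HTCondor Python bindings; alert_worker.py and its arguments
    # are hypothetical placeholders.
    import htcondor

    submit = htcondor.Submit({
        "executable": "alert_worker.py",           # hypothetical worker script
        "arguments": "--visit 12345 --ccd 42",     # hypothetical arguments
        "requirements": '(ALLOCATED_NODE_SET == "srp_478")',
        "output": "visit_12345_42.out",
        "error": "visit_12345_42.err",
        "log": "visit_12345_42.log",
    })

    schedd = htcondor.Schedd()                     # schedd of the execution cluster
    result = schedd.submit(submit, count=1)        # Schedd.submit() API in bindings >= 9.0
    print("submitted cluster", result.cluster())

Older bindings would instead queue the Submit object inside a schedd.transaction() block; the idea is the same.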

Alert Processing Simulation

Orchestration
- Sets up and invokes workflow execution, monitors status
- Captures the software environment and records versions
- Plugins (e.g., DAGMan, Pegasus, simple ssh invocations)
- User-specified configuration files: /home, software stack locations, scratch directories, etc.
- Configuration can be complicated for new users
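The plugin idea can be illustrated with a small sketch that assumes a simplified interface, not the real ctrl_orca API: the orchestration layer picks a launcher (DAGMan, Pegasus, or plain ssh) named in the user's configuration and hands it the prepared workflow. All class and method names here are hypothetical.

    # Minimal sketch of plugin-style workflow launching (hypothetical interface).
    import subprocess

    class DagmanLauncher:
        def launch(self, workdir, dag_file):
            # condor_submit_dag is the standard DAGMan entry point
            subprocess.check_call(["condor_submit_dag", dag_file], cwd=workdir)

    class PegasusLauncher:
        def launch(self, workdir, dax_file):
            # pegasus-plan can plan and submit in one step
            # (exact flags vary across Pegasus releases)
            subprocess.check_call(["pegasus-plan", "--submit", "--dax", dax_file], cwd=workdir)

    class SshLauncher:
        def launch(self, workdir, script, host="login.example.org"):
            # fall back to a simple remote invocation on a placeholder host
            subprocess.check_call(["ssh", host, script])

    LAUNCHERS = {"dagman": DagmanLauncher, "pegasus": PegasusLauncher, "ssh": SshLauncher}

    def run_workflow(kind, workdir, target):
        """Pick the launcher named in the user's configuration and invoke it."""
        LAUNCHERS[kind]().launch(workdir, target)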

Execution
- Abstracts away the details for glide-ins and for execution
- Platform-specific configuration
- Substitution of common elements into platform-specific templates, to help eliminate user errors
- User specifies the minimum amount of information required:
  - Site name
  - Time allocation for nodes
  - Input data and execution script
  - DAG generator (DAGMan or Pegasus script)

Example from sites.xml

Before:
    <profile namespace="pegasus" key="auxillary.local">true</profile>
    <profile namespace="condor" key="getEnv">True</profile>
    <profile namespace="env" key="PEGASUS_HOME">$PEGASUS_HOME</profile>
    <profile namespace="condor" key="requirements">(ALLOCATED_NODE_SET == "$NODE_SET")</profile>
    <profile namespace="condor" key="+JOB_NODE_SET">"$NODE_SET"</profile>
    <profile namespace="env" key="EUPS_USERDATA">/scratch/$USERNAME/eupsUserData</profile>

After:
    <profile namespace="env" key="PEGASUS_HOME">/software/middleware/pegasus/current</profile>
    <profile namespace="condor" key="requirements">(ALLOCATED_NODE_SET == "srp_478")</profile>
    <profile namespace="condor" key="+JOB_NODE_SET">"srp_478"</profile>
    <profile namespace="env" key="EUPS_USERDATA">/scratch/srp/eupsUserData</profile>
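The Before/After pair above is plain template substitution. A minimal sketch using Python's string.Template shows the idea; the values are taken from the example, but the surrounding code is illustrative rather than the actual execution package's implementation.

    # Minimal sketch of filling a platform template; string.Template understands
    # the same $NAME placeholders used in the sites.xml excerpt above.
    from string import Template

    template = Template(
        '<profile namespace="env" key="PEGASUS_HOME">$PEGASUS_HOME</profile>\n'
        '<profile namespace="condor" key="requirements">(ALLOCATED_NODE_SET == "$NODE_SET")</profile>\n'
        '<profile namespace="condor" key="+JOB_NODE_SET">"$NODE_SET"</profile>\n'
        '<profile namespace="env" key="EUPS_USERDATA">/scratch/$USERNAME/eupsUserData</profile>\n'
    )

    # Values the user or the site configuration would supply; "srp_478" is the
    # node-set name from the example above.
    values = {
        "PEGASUS_HOME": "/software/middleware/pegasus/current",
        "NODE_SET": "srp_478",
        "USERNAME": "srp",
    }

    print(template.substitute(values))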

Statistics
The statistics package contains commands to ingest DAGMan log event records into a database. The package's utilities group these records by HTCondor job ID to give an overview of what happened during each job.
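A minimal sketch of that kind of ingestion, assuming the standard HTCondor user-log event format rather than the actual ctrl_stats schema (the table layout and regular expression here are illustrative):

    # Illustrative only: group HTCondor user-log events by job ID in SQLite.
    # Event headers look like: "001 (8719.000.000) 03/30 14:21:21 Job executing on host: <...>"
    import re
    import sqlite3

    EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\S+ \S+) (.*)$")

    def ingest(log_path, db_path="events.db"):
        db = sqlite3.connect(db_path)
        db.execute("CREATE TABLE IF NOT EXISTS events "
                   "(cluster INTEGER, proc INTEGER, code TEXT, stamp TEXT, message TEXT)")
        with open(log_path) as log:
            for line in log:
                m = EVENT_HEADER.match(line)
                if m:
                    code, cluster, proc, stamp, message = m.groups()
                    db.execute("INSERT INTO events VALUES (?, ?, ?, ?, ?)",
                               (int(cluster), int(proc), code, stamp, message))
        db.commit()
        return db

    def job_history(db, cluster, proc=0):
        """Return one job's event records in the order they were logged."""
        rows = db.execute("SELECT code, stamp, message FROM events "
                          "WHERE cluster = ? AND proc = ? ORDER BY rowid", (cluster, proc))
        return list(rows)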

Grouped records (example): the job is submitted; executes on node 198.202.101.206; updates its image size to 119476; hits an exception in the Shadow daemon; resumes execution on 198.202.101.110; updates its image size to 102492, then to 682660; is disconnected from the node; fails to reconnect; resumes execution on 198.202.101.185; and then terminates.

Execution report

More information
- Alert Production Simulator description: http://dmtn-003.lsst.io/en/master/
- Anim: https://lsst-web.ncsa.illinois.edu/~srp/alert/alert.html
- Simulator: http://github.com/lsst-dm/ctrl_ap
- Statistics: http://github.com/lsst/ctrl_stats
- Orchestration: http://github.com/lsst/ctrl_orca
- Execution: http://github.com/lsst/ctrl_exec
- Platform config: http://github.com/lsst/ctrl_platform_lsstvc