HTCondor and LSST
Stephen Pietrowicz, Senior Research Programmer
National Center for Supercomputing Applications
HTCondor Week, May 2-5, 2017
Large Synoptic Survey Telescope
8.4-meter ground-based telescope on Cerro Pachón in Chile
3.2-gigapixel camera
Transferring 15 terabytes of data nightly, for 10 years
Nightly alert processing at NCSA
Yearly data release processing at NCSA and IN2P3 in France
http://www.lsst.org/
LSST Software Stack
Organized into dozens of custom and third-party packages
Applications framework
Data access
Sky tessellation
Managing task execution
How We’ve Used HTCondor So Far
Software stack scaling tests
Alert Processing Simulation
Orchestration/Execution on different sites using templates
Statistics gathering for better insight into how things are running
Alert Processing
The changing sky will produce ~10 million alerts nightly; astronomers will be able to subscribe to the alerts they’re interested in, which will be produced within 60 seconds of observation
Alert Processing Simulation
Proof of concept of data and job workflows
25 VMs simulating 240 nodes
Two HTCondor clusters: task execution and custom transfer
Alert Processing Simulation
Orchestration
Sets up and invokes workflow execution, monitors status
Captures the software environment and records versions
Plugins (e.g., DAGMan, Pegasus, simple ssh invocations); a sketch of the idea follows this slide
User-specified configuration files: /home, software stack locations, scratch directories, etc.
Configuration can be complicated for new users
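The plugin mechanism can be pictured as a simple name-to-launcher dispatch. The sketch below is illustrative only and is not ctrl_orca's actual API; the class names and commands are hypothetical stand-ins for the DAGMan, Pegasus, and ssh plugins.

```python
# Hypothetical sketch of orchestration plugin dispatch; ctrl_orca's real
# classes and configuration format differ.

class DagmanPlugin:
    def submit(self, workflow):
        # A DAGMan plugin would ultimately hand the DAG to condor_submit_dag.
        print("condor_submit_dag", workflow)

class PegasusPlugin:
    def submit(self, workflow):
        # A Pegasus plugin would plan and run the workflow via Pegasus tools.
        print("pegasus-plan / pegasus-run for", workflow)

class SshPlugin:
    def submit(self, workflow):
        # A simple ssh plugin just invokes a script on a remote head node.
        print("ssh headnode ./run.sh", workflow)

PLUGINS = {"dagman": DagmanPlugin, "pegasus": PegasusPlugin, "ssh": SshPlugin}

def launch(plugin_name, workflow):
    """Pick the execution plugin named in the user's configuration and submit."""
    return PLUGINS[plugin_name]().submit(workflow)

launch("dagman", "alert_prod.dag")
```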
Execution
Abstract away the details of glide-ins and execution
Platform-specific configuration
Substitution of common elements into platform-specific templates, to help eliminate user errors (see the sketch after the sites.xml example)
User specifies the minimum amount of information required:
Site name
Time allocation for nodes
Input data and execution script
DAG generator (DAGMan or Pegasus script)
Example from sites.xml
Before:
<profile namespace="pegasus" key="auxillary.local">true</profile>
<profile namespace="condor" key="getEnv">True</profile>
<profile namespace="env" key="PEGASUS_HOME">$PEGASUS_HOME</profile>
<profile namespace="condor" key="requirements">(ALLOCATED_NODE_SET == "$NODE_SET")</profile>
<profile namespace="condor" key="+JOB_NODE_SET">"$NODE_SET"</profile>
<profile namespace="env" key="EUPS_USERDATA">/scratch/$USERNAME/eupsUserData</profile>
After:
<profile namespace="env" key="PEGASUS_HOME">/software/middleware/pegasus/current</profile>
<profile namespace="condor" key="requirements">(ALLOCATED_NODE_SET == "srp_478")</profile>
<profile namespace="condor" key="+JOB_NODE_SET">"srp_478"</profile>
<profile namespace="env" key="EUPS_USERDATA">/scratch/srp/eupsUserData</profile>
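The before/after above shows common elements being filled into a platform template. A minimal sketch of that kind of substitution, assuming Python's standard string.Template and reusing the $NODE_SET, $USERNAME, and $PEGASUS_HOME placeholders from the example; this is not necessarily how ctrl_exec implements it.

```python
from string import Template

# Platform-specific template fragment from sites.xml; the $-placeholders are
# the "common elements" filled in per user and per node allocation.
SITES_TEMPLATE = Template("""\
<profile namespace="env" key="PEGASUS_HOME">$PEGASUS_HOME</profile>
<profile namespace="condor" key="requirements">(ALLOCATED_NODE_SET == "$NODE_SET")</profile>
<profile namespace="condor" key="+JOB_NODE_SET">"$NODE_SET"</profile>
<profile namespace="env" key="EUPS_USERDATA">/scratch/$USERNAME/eupsUserData</profile>
""")

# Values supplied by the user or the site configuration; everything else
# stays fixed in the platform template.
values = {
    "PEGASUS_HOME": "/software/middleware/pegasus/current",
    "NODE_SET": "srp_478",
    "USERNAME": "srp",
}

print(SITES_TEMPLATE.substitute(values))
```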
Statistics
The statistics package contains commands to ingest DAGMan log event records into a database. The package’s utilities group all of these records by HTCondor job ID to give an overview of what happened during each job.
Grouped records
Submitted; began executing on node 198.202.101.206; image size updated to 119476; an exception occurred in the Shadow daemon; execution restarted on 198.202.101.110; image size updated to 102492, then to 682660; disconnected from the node; reconnection to the node failed; execution restarted on 198.202.101.185; the job then terminated.
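A minimal sketch of the grouping step, not the actual ctrl_stats implementation: it scans an HTCondor user/DAGMan log (the file name below is just an example) and collects each job's event header lines by cluster.proc ID, which is enough to reconstruct a per-job history like the one above.

```python
import re
from collections import defaultdict

# Each event in an HTCondor user/DAGMan log begins with a header line like:
#   001 (123.000.000) 05/02 12:00:05 Job executing on host: <198.202.101.206:9618>
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (.*)")

def group_events(log_path):
    """Collect event header lines for each job, keyed by (cluster, proc)."""
    events = defaultdict(list)
    with open(log_path) as log:
        for line in log:
            match = EVENT_HEADER.match(line)
            if match:
                code, cluster, proc, rest = match.groups()
                events[(int(cluster), int(proc))].append((code, rest.strip()))
    return events

if __name__ == "__main__":
    # Example log name; substitute the node log written by DAGMan for the workflow.
    for job_id, job_events in sorted(group_events("workflow.dag.nodes.log").items()):
        print("job %d.%d:" % job_id)
        for code, description in job_events:
            print("  %s %s" % (code, description))
```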
Execution report
More information
Alert Production Simulator
Description: http://dmtn-003.lsst.io/en/master/
Animation: https://lsst-web.ncsa.illinois.edu/~srp/alert/alert.html
Simulator: http://github.com/lsst-dm/ctrl_ap
Statistics: http://github.com/lsst/ctrl_stats
Orchestration: http://github.com/lsst/ctrl_orca
Execution: http://github.com/lsst/ctrl_exec
Platform config: http://github.com/lsst/ctrl_platform_lsstvc