Job Submission Via File Transfer

Run Everywhere
- Use all available resources
- Submit locally, run globally

Foreign Languages
- Using remote resources:
  - Easy with HTCondor and friendly admins (flocking)
  - Harder otherwise:
    - Lost jobs
    - Unknown queue times
    - Different allocation expectations
    - Different levels of service

Paper Over the Differences
- Don't send user jobs directly
- Make everything act like HTCondor
- Glideins:
  - Run the HTCondor startd as a job/image/container
  - Send jobs to it
  - Turn remote resources into temporary members of your own HTCondor pool
- See also: HTCondor Annex
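
A minimal sketch of the glidein technique described above, assuming HTCondor is already unpacked under ~/condor on the remote cluster and that the home pool's central manager is at cm.example.org (both hypothetical). Production systems such as glideinWMS or condor_annex also handle authentication with the home pool, lifetimes, and cleanup, all of which are omitted here.

#!/usr/bin/env python3
"""Sketch: submit a SLURM batch job that turns the allocated node into a
temporary HTCondor execute node (a glidein). Paths and hostnames are
hypothetical; authentication with the home pool is omitted."""
import subprocess
import textwrap

GLIDEIN_SCRIPT = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=glidein
    #SBATCH --time=12:00:00

    # Minimal execute-only configuration pointing back at the home pool.
    export CONDOR_CONFIG=$HOME/condor/glidein.conf
    cat > "$CONDOR_CONFIG" <<'EOF'
    CONDOR_HOST = cm.example.org
    DAEMON_LIST = MASTER, STARTD
    STARTD_NOCLAIM_SHUTDOWN = 1200
    EOF

    # Run in the foreground so the glidein ends with the SLURM allocation.
    exec "$HOME"/condor/sbin/condor_master -f
    """)

def submit_glidein() -> None:
    with open("glidein.sh", "w") as f:
        f.write(GLIDEIN_SCRIPT)
    # Hand the glidein to the remote batch system.
    subprocess.run(["sbatch", "glidein.sh"], check=True)

if __name__ == "__main__":
    submit_glidein()

With STARTD_NOCLAIM_SHUTDOWN set, the glidein exits on its own if no work arrives, so idle allocations are not wasted.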

Glidein Example
[Diagram: the Home site runs the schedd and shadows*; the Away site runs SLURM, which hosts the glidein startd, starters*, and the job]
* One shadow and one starter per running job

Private Networks
- Machines with no public network access
  - HPC
  - Cloud
- Solution: "split the schedd"

2-Schedd Glidein Example
[Diagram: the Home site runs a schedd and shadow; the Away site (private network) runs a second schedd alongside SLURM, the glidein startd, the starter, and the job]

Really Private Networks
- Centers with limited remote access:
  - Job submission
  - Data transfer
  - No general network communication
- Solution: submit via a shared file service

Talking Via a File System
- Goal: support any file-sharing mechanism
  - NFS, Gluster, GPFS
  - Box.com, Google Drive, Dropbox
  - GridFTP, XRootD, rsync
  - Blog site comments

Job Management Primitives
- Submit: write the job description and input sandbox
- Status (optional): write a status description
- Completion: write the final status description and output sandbox
- Cleanup or removal: delete the job description and input sandbox
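
A minimal sketch of the home side of these primitives, assuming a shared directory visible to both schedds. The JobXXX/request, status.N, input/ and output/ names follow the File-Based Job Submission diagram below and are illustrative only, not the format HTCondor actually uses.

#!/usr/bin/env python3
"""Sketch: the home side of file-based job submission.
All paths and file names are illustrative placeholders."""
import json
import shutil
import time
from pathlib import Path

SHARED = Path("/shared/jobs")          # hypothetical shared mount point

def submit(job_id: str, description: dict, input_sandbox: Path) -> Path:
    """Submit: write the job description and input sandbox."""
    job_dir = SHARED / job_id
    # Copy the sandbox first, then write the request file last, so the
    # remote side never picks up a half-copied sandbox.
    shutil.copytree(input_sandbox, job_dir / "input")
    (job_dir / "request").write_text(json.dumps(description))
    return job_dir

def wait_for_completion(job_dir: Path, poll: int = 30) -> dict:
    """Status/Completion: poll status.N files until one reports a final state."""
    seen = 0
    while True:
        status_file = job_dir / f"status.{seen + 1}"
        if status_file.exists():
            seen += 1
            status = json.loads(status_file.read_text())
            if status.get("state") in ("completed", "failed"):
                return status
        time.sleep(poll)

def cleanup(job_dir: Path, output_dest: Path) -> None:
    """Completion/Cleanup: fetch the output sandbox, then remove the job directory."""
    shutil.copytree(job_dir / "output", output_dest)
    shutil.rmtree(job_dir)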

File-Based Submission Example
[Diagram: same layout as the 2-schedd glidein, with a schedd and shadow at Home and a second schedd, SLURM, the startd, the starter, and the job at the Away site (private network); the two schedds communicate through shared files]

File-Based Job Submission
[Diagram: Schedd A and Schedd B exchange job state through JobXXX directories in Box cloud storage; each directory holds a request file, status.1/status.2/status.3 updates, and input and output sandboxes]
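
A matching sketch of the away side, which scans the shared area for new request files, runs whatever it finds, and reports back through status files and an output sandbox. Again, every name and the JSON job description are illustrative assumptions, not the real protocol.

#!/usr/bin/env python3
"""Sketch: the away side of file-based job submission (illustrative names only)."""
import json
import subprocess
import time
from pathlib import Path

SHARED = Path("/shared/jobs")          # same hypothetical shared mount point

def write_status(job_dir: Path, state: str) -> None:
    """Append the next status.N file so the home side sees progress."""
    n = len(list(job_dir.glob("status.*"))) + 1
    (job_dir / f"status.{n}").write_text(json.dumps({"state": state}))

def run_one(job_dir: Path) -> None:
    """Run a single job described by JobXXX/request inside its input sandbox."""
    request = json.loads((job_dir / "request").read_text())
    write_status(job_dir, "running")
    outdir = job_dir / "output"
    outdir.mkdir(exist_ok=True)
    with (outdir / "stdout").open("w") as out, (outdir / "stderr").open("w") as err:
        result = subprocess.run(request["cmd"],      # e.g. ["./run.sh", "arg1"]
                                cwd=job_dir / "input",
                                stdout=out, stderr=err)
    write_status(job_dir, "completed" if result.returncode == 0 else "failed")

def main_loop(poll: int = 30) -> None:
    """Poll the shared area and run any job that has a request but no status yet."""
    while True:
        for job_dir in SHARED.iterdir():
            if (job_dir / "request").exists() and not list(job_dir.glob("status.*")):
                run_one(job_dir)
        time.sleep(poll)

if __name__ == "__main__":
    main_loop()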

And Along Came CMS
- Stretching the boundaries of Run Everywhere:
  - 250,000 cores
  - 100 sites
- CMS researchers at PIC wanted to run on MareNostrum at BSC

BSC Site Setup
- Execute nodes: no public network access (in or out)
- Login nodes:
  - No outbound network connections
  - Inbound network for ssh and file transfer only
  - No long-lived or CPU-intensive programs
- Shared filesystem: GPFS (IBM General Parallel File System)

Find a New Model
- Can't run a schedd at BSC
  - Maybe run one as part of the glidein?
- CMS likes late binding
  - Jobs stay at the home schedd until a machine is ready to run them
- Let's split the starter in two

Setting It Up
- Run a startd at PIC (close to BSC)
  - Advertises the resources of a set of BSC nodes
  - Won't match until the BSC job starts
- sshfs mount from PIC to BSC's GPFS
- Submit the starter job to BSC's SLURM
- When a job arrives at the PIC startd:
  - The PIC starter writes the job to GPFS
  - The BSC starter reads the job from GPFS and runs it
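
A rough sketch of that plumbing, assuming passwordless ssh from PIC to a BSC login node. The hostname, GPFS paths, and the remote_starter.sh script are hypothetical placeholders for whatever the real deployment uses.

#!/usr/bin/env python3
"""Sketch: wire PIC to BSC's GPFS with sshfs and launch the remote starter
via SLURM. Hostnames, paths, and script names are hypothetical."""
import subprocess

BSC_LOGIN = "user@login.bsc.example"       # hypothetical BSC login node
REMOTE_GPFS = "/gpfs/projects/cms/jobs"    # shared job directory on GPFS
LOCAL_MOUNT = "/mnt/bsc-gpfs"              # where PIC sees the same directory

def mount_gpfs() -> None:
    # sshfs mount from PIC to BSC's GPFS: only ssh/sftp is required,
    # which is all the BSC login nodes allow.
    subprocess.run(
        ["sshfs", f"{BSC_LOGIN}:{REMOTE_GPFS}", LOCAL_MOUNT,
         "-o", "reconnect,ServerAliveInterval=15"],
        check=True,
    )

def submit_remote_starter() -> None:
    # Submit the starter job to BSC's SLURM via ssh; the starter then
    # watches GPFS for job files written by the PIC side.
    subprocess.run(
        ["ssh", BSC_LOGIN, "sbatch", "/gpfs/projects/cms/bin/remote_starter.sh"],
        check=True,
    )

if __name__ == "__main__":
    mount_gpfs()
    submit_remote_starter()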

File System Example
[Diagram: the schedd and shadow run at CERN; the startd and one half of the split starter run at PIC; SLURM, a launcher, the other starter, and the job run at BSC (private network); PIC reaches BSC's GPFS through an sshfs mount]

How Is It Different?
- No changes outside of the starter
  - Other daemons are unaware
- Some features don't work:
  - ssh-to-job
  - Chirp
  - Streaming output
  - Periodic checkpointing

Progress
- Done:
  - Ran sleep jobs on 2 BSC nodes
  - Jobs started at the CERN schedd
- TODO:
  - Use more BSC nodes
  - Larger data transfers
  - Fault tolerance
  - Run the CMS application

Questions?