Running persistent jobs in Condor
Derek Weitzel & Brian Bockelman
Holland Computing Center

Introduction – HCC
- University of Nebraska Holland Computing Center.
- Research groups – not all of them are using Condor; a wide variety of users.
- Condor applications:
  - Bioinformatics: CS-Rosetta, Autodock
  - Math: Graph Transversal
  - Physics: CMS and Dzero

Introduction – HCC
- 7,000 cores, all opportunistic for the grid.
- HCC's use of the Open Science Grid: 2.1 million hours this year.
- Grid usage is based on GlideinWMS and Engage's OSG-MM.

Open Science Non-HEP Grid Usage
(Chart of non-HEP usage on the Open Science Grid; GPN = HCC.)

Persistent Jobs
Today, I'll be discussing a unique way we create backfill at HCC using what we call "persistent jobs." We are sharing it because it's an unexpected combination of several unique-to-Condor capabilities that solves a problem usually tackled by writing a lot of other code.

Persistent Jobs – Motivation
What is a persistent job?
- A single task which lasts longer than a single batch job can run.
- It is thus necessary to keep state between runs.
If your cluster limits batch jobs to 12 hours, a 24-hour task would be a persistent job.
At a minimum, between runs, the persistent job must keep some state: "which job am I?"
- That state could also include a quite large checkpoint!

Persistent Jobs – Examples
A few examples:
- A traditional Condor standard universe job that uses checkpoints.
- Processing streams of data, where there is one job per stream.
Taken to the extreme, a persistent job might never exit.
We will show an example of persistent jobs for grid backfill.

Persistent Jobs – Difficulties
Persistent jobs are difficult because:
- we want no more than one instance running,
- but prefer more than zero instances running.
To satisfy the first, we need some sort of locking and heartbeat mechanism.
To satisfy the second, we need something to automatically re-queue the job.
- And then something to monitor the automatic submission framework.
Oh, and deadlock detection!

"Traditional Approach"
Keep the state in a database.
- Have a locking check-in/check-out mechanism to make sure each checkpoint is used by only one job.
Have a server listening for job heartbeats that updates the DB.
Have a submit daemon.
That's a lot of infrastructure: at least three services you have to write, care for, and feed.

Yuck!
We just invented a ton of architecture.
- Too much for a grad student to write, much less operate.
Distributed architectures are ripe for bugs.
- You can take a semester-long course in this and still not be very good at it.
Having a cron job or daemon that submits jobs is asking for a bug that submits an infinite number of jobs, or deadlocks and submits zero.

Persistent Jobs – the Condor Way!
Can we come up with a solution that:
- uses Condor only,
- runs on the grid (Grid Universe or glide-in), and
- allows us to checkpoint?
Insight: use the Condor Global Job ID as the unique state!
- And Condor can re-run completed jobs!
- Don't do the bookkeeping; let Condor do it for you.

Persistent Jobs – the Condor Way!
The state kept by each job is the Condor Job ID.
Use the fact that Condor can resubmit completed jobs:
- Set "OnExitRemove = FALSE".
- For grid: Globus_Resubmit = TRUE and Globus_Rematch = TRUE.
Use the job ID to uniquely identify a checkpoint file kept in a storage element.
- Periodically overwrite it with new checkpoints.
Sit back, and use the fact that Condor will keep trying to re-run the job indefinitely while making sure that the job has only one running instance.
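A minimal submit description along these lines might look like the following sketch. The executable name, gatekeeper, and grid type are placeholders for illustration, not the actual HCC configuration:

    # Sketch of a persistent grid-universe job (host/file names are placeholders).
    universe        = grid
    grid_resource   = gt2 gatekeeper.example.edu/jobmanager-condor
    executable      = boinc_wrapper.sh
    # Pass the unique Condor job ID to the wrapper; it keys the checkpoint.
    arguments       = $(Cluster).$(Process)
    # Never remove the job on exit: Condor will re-queue and re-run it forever.
    on_exit_remove  = FALSE
    # Allow the completed grid job to be resubmitted and rematched.
    globus_resubmit = TRUE
    globus_rematch  = TRUE
    queue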

Our Use Case
We want to have an infinite amount of work for our clusters.
- Yet people will be mad if we just run sleep().
Solution: BOINC is an acceptable use of our resources.
- BOINC is a persistent job! As long as you feed it the same checkpoint directory(*), it will continue to run indefinitely.
(*) Actually, BOINC gets fed jobs internally, but we don't know anything about the job boundaries.

Our Solution
We take the basic BOINC binary, statically compiled for RHEL4/32-bit for maximum compatibility.
BOINC uses the md5sum of the host's MAC address to verify that the checkpoint directory is correct between executions.
- Replace the use of the MAC address with the md5sum of the Condor Job ID!
- Now, regardless of which worker node runs the job, the correct checkpoint is used.
Checkpoints (100 MB each) are kept on SRM storage. No NFS used!
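In outline, the identity substitution amounts to hashing the job ID where BOINC would hash the MAC address. A minimal Python sketch, assuming the wrapper learns its job ID from an environment variable (the variable name and URL below are illustrative, not the actual wrapper code):

    import hashlib
    import os

    # The wrapper is handed the Condor job ID (e.g. "1234.0") at submit time.
    job_id = os.environ.get("CONDOR_JOB_ID", "1234.0")

    # Hash the job ID exactly where BOINC would hash the MAC address, so the
    # same job always maps to the same checkpoint, on any worker node.
    checkpoint_key = hashlib.md5(job_id.encode()).hexdigest()

    # The checkpoint in the storage element is named by this key (placeholder URL).
    checkpoint_url = "srm://srm.example.edu/boinc/checkpoints/%s.tar.gz" % checkpoint_key
    print(checkpoint_url)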

Data Flow
(Diagram.) The submit host sends the job wrapper to the workers; each worker running BOINC pulls the initial binary/checkpoint from, and pushes checkpoints back to, SRM storage at Nebraska (1,000 jobs ≈ 27 MB/s); job output returns to the submit host.

Workflow
(Flowchart.) The job starts, downloads the checkpoint and binary, and forks BOINC; it then sleeps in a loop, sending a checkpoint whenever it is time to checkpoint, and when the run lifetime is over it kills BOINC and exits, at which point the job is resubmitted. Example values: checkpoint every 1 hour, lifetime of 12 hours.
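The wrapper logic in the flowchart fits in a few lines of Python. This is a sketch, assuming helper functions such as download_checkpoint and send_checkpoint that stand in for the SRM transfers; it is not the actual script:

    import subprocess
    import time

    CHECKPOINT_INTERVAL = 3600   # checkpoint every hour (example value)
    LIFETIME = 12 * 3600         # 12-hour run lifetime (example value)

    def download_checkpoint():
        pass  # placeholder: fetch checkpoint and BOINC binary from SRM storage

    def send_checkpoint():
        pass  # placeholder: upload the current checkpoint to SRM storage

    download_checkpoint()
    boinc = subprocess.Popen(["./boinc_client"])   # fork the BOINC client

    start = time.time()
    last_checkpoint = start
    while time.time() - start < LIFETIME:
        time.sleep(60)                             # the "Sleep" box
        if time.time() - last_checkpoint >= CHECKPOINT_INTERVAL:
            send_checkpoint()                      # "Time to Chkpoint?" -> yes
            last_checkpoint = time.time()

    send_checkpoint()                              # final checkpoint
    boinc.terminate()                              # "Kill BOINC, Exit Job"
    boinc.wait()
    # With OnExitRemove = FALSE, Condor re-queues the job: "RESUBMIT".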

BOINC Results
(Chart.)

Why not Condor backfill?
Why this and not Condor's built-in backfill?
- Not all of our sites run Condor, and we want to be able to utilize an arbitrary site. It's also good exercise for operating grid components!
- Standard backfill is tied to a physical worker node (by MAC address). There is no telling when you will see that node again on the grid, so you might just be throwing away work; BOINC progress expires in 3 days.
- Backfill jobs can't be pushed out using OSG-MM. It could possibly be done with gWMS; I don't know.

GlideinWMS Backfill
Keep a glide-in instance busy with backfill until you get user jobs:
- Preempt backfill jobs for user jobs.
- This removes the user "ramp up" time.
Use flocking to GlideinWMS:
- GlideinWMS is resource hungry; it needs interactive nodes for deployment.
- Run backfill to have running glideins ready for the flocked jobs. (A configuration sketch follows.)
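Flocking itself is ordinary Condor configuration. A minimal sketch of the submit-side settings, assuming hypothetical host names; the GlideinWMS pool's central manager must also admit this schedd via FLOCK_FROM:

    # condor_config on the submit host (host names are placeholders).
    FLOCK_TO = gwms-frontend.example.edu
    # And on the GlideinWMS pool's central manager:
    # FLOCK_FROM = $(FLOCK_FROM), submit.example.edu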

Future Uses
Where else could persistent jobs be useful?
- A glidein pilot factory. This removes the need to have processes that submit jobs. If the thing stops working, then it's a Condor bug, and you have "someone else" to yell at.

Conclusions
"Persistent jobs" can be a nightmare due to the necessary bookkeeping.
Condor jobs can be made persistent: don't go inventing things Condor already does for you.
By utilizing basic Condor and nothing else (the entire project is 2 Python scripts), we can generate an arbitrary amount of backfill for our site or the OSG!

Back Page
Questions?
Code location: svn://t2.unl.edu/brian/osg_boinc

Back Page
Backup Slides

Issues found during implementation
Rematching in the grid universe:
- To use matchmaking in the grid universe, you need the special keyword GlobusRematch = TRUE.
- This was a huge pain to get right.
Job accounting:
- Condor collects accounting data when a job finishes, not for each run.
- Not solved; we rely on grid resource accounting instead.

Issues found during implementation
Need to easily modify existing jobs:
- In "Arguments", use the special variables $$([Variable]).
- The variables are evaluated at the time of the match.
- We used them for the job lifetime, checkpoint location, and binary location. (A sketch follows.)
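For example, a submit file can parameterize the wrapper as in the sketch below. The attribute names (JobLifetime, CheckpointURL) are hypothetical, and the exact scoping of names inside $$([...]) may vary by Condor version:

    # Hypothetical custom job attributes, expanded at match time via $$([...]).
    +JobLifetime   = 43200
    +CheckpointURL = "srm://srm.example.edu/checkpoints/job.tar.gz"
    arguments = --lifetime $$([MY.JobLifetime]) --checkpoint $$([MY.CheckpointURL])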

Results
(Chart: glideins running / jobs queued.)

Glidein Usage by Site
(Chart.)