Experiences with an HTCondor pool: Prepare to be underwhelmed
C. J. Lingwood, Lancaster University
CCB (The Condor Connection Broker) – Dan Bradley

Introduction to HTCondor
High throughput rather than high performance.
Shares spare computing power: only one dedicated machine; each node is an office PC.
This, no doubt, is a condor! Seagull?

What is it for?
Dave has a large simulation which takes too long to run on his desktop. Dave wants a proper cluster.
Julie has lots of little simulations; each one is quick, but together they take too long to run on her desktop. Julie could use a cluster, but the Daves would be cross. She could use a Condor pool instead.
Dave asks “How fast?” Julie asks “How many?”

Why bother? Why don’t we just buy cluster time for Julie?
–We could, but Julie is poor.
Condor runs on desktop machines when they are idle, so a Condor pool is free(ish).
Condor can manage the power
–Turn a machine off if nobody has used it for a while.
It’s a little friendlier than a cluster:
–Launch things from your own machine
–It copies files to and from your machine
–No login
–Mostly Windows nodes (useful for Windows-only codes); see the sketch after this list.
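A minimal sketch of how a submit file could target those Windows nodes. The executable name here is hypothetical, and the exact OpSys value to match depends on the HTCondor version and how the execute nodes advertise themselves:

  universe     = vanilla
  executable   = my_windows_code.exe
  requirements = (OpSys == "WINDOWS")
  queue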

Why bother? It’s free. Apart from the electricity.

ci-condor.dl.ac.uk
A poor, neglected pool (35-ish cores):
–30-ish Windows nodes
–5-ish Linux nodes
Our master server. Powered by coal and steam.
Set up originally by Jonny Smith.

Structure of a pool
(Diagram: a Central Manager acts as the matchmaker. The Job Submit Point and the Execute Nodes advertise themselves to the Central Manager, which negotiates and tells the submit point “you’ve been matched”; the submit point then transfers files to the matched Execute Node and runs the job there.)

Handy commands
condor_submit (submit a job)
condor_status (what machines are there)
condor_q (list the jobs you have queued)
condor_rm (remove a job)
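A typical session might look like this (the submit-file name and the job ID are made up for illustration):

  condor_submit submit_file.sub   # queue the job
  condor_q                        # list your queued and running jobs
  condor_status                   # see which machines are in the pool
  condor_rm 42.0                  # remove job 42.0 if it misbehaves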

Submitting a Job

  Universe    = vanilla
  Executable  = /home/nobody/condor/job.condor
  Input       = job.stdin
  Output      = job.stdout
  Error       = job.stderr
  Arguments   = -arg1 -arg2
  InitialDir  = /home/nobody/condor/run_1
  Queue

condor_submit submit_file.sub

Submitting lots of Jobs

  Universe    = vanilla
  Executable  = this_job
  Arguments   = -arg1 -arg2
  InitialDir  = run_$(Process)
  Queue 600
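Each of the 600 jobs runs in its own run_$(Process) directory, so those directories (and any per-run input files) need to exist before submission. A minimal sketch, assuming bash:

  # create run_0 ... run_599 before calling condor_submit
  for i in $(seq 0 599); do
      mkdir -p "run_$i"
  done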

DAGMan (Directed Acyclic Graph Manager)
DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.
(Diagram: an example DAG with Jobs A, B, C and D.)
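A sketch of what the corresponding DAG input file could look like, assuming the classic diamond dependency (A must finish before B and C, which must both finish before D) and that each job has its own submit file:

  JOB A a.sub
  JOB B b.sub
  JOB C c.sub
  JOB D d.sub
  PARENT A CHILD B C
  PARENT B C CHILD D

The whole workflow is then submitted with condor_submit_dag diamond.dag (the file names here are hypothetical).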

Storage
No storage as such.
–Uses the execute node’s storage while the job is running
–Results come back to your machine when the job finishes
–Can launch a script to copy data from/to a NAS (see the sketch below)
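One way to do the NAS copy is to make the job’s executable a small wrapper script. Everything below (the NAS paths, the simulation binary and the file names) is a hypothetical example:

  #!/bin/bash
  # $1 is the run number passed via Arguments in the submit file
  cp /mnt/nas/klystron/input_$1.dat input.dat       # fetch input from the NAS
  ./run_simulation input.dat > result.dat           # run the real code
  cp result.dat /mnt/nas/klystron/result_$1.dat     # push the result back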

Case Study: Optimising Klystrons
The optimisation of a new klystron interaction structure is a many-dimensional, multi-objective problem. Some decisions are well defined, some are ill defined:
–Cavity frequencies
–Drift lengths between cavities
–External coupling
Objectives:
–High efficiency, output power and gain
–Short interaction length
–Meeting the specification bandwidth
–Avoiding reflected electrons
–Etc.

Case Study: Optimising Klystrons
You could optimise them by hand
–Tedious
You could let the computer do all the hard work for you
–A multi-objective optimiser
Lots of little simulations (it turns out my name is Julie).
Cluster admins like a few long, large jobs; they don’t like short, “high-frequency” job submissions.
Data analysis is done “on the fly” in the EA.

Impact of Condor
10,000 simulations per run (and at least 3 different configurations).
3 minutes per simulation (some take 8-10).
10,000 × 3 minutes is 30,000 minutes, so one machine needs roughly 20 days; spread across 60 machines it takes about 8 hours.

Bandwidth Optimising
Optimise, for instance, the -1 dB bandwidth.
Calculating the actual bandwidth is computationally expensive, so calculate the drop instead: the gain is reported up to the gain limit (-1 dB), and improvements beyond this are not of interest.
(Figure: gain in dB versus frequency, marking the centre frequency, the -1 dB points and a bandwidth of roughly 4 MHz.)

Did we do better?

Demonstration
Find the transfer curve of a klystron.
–Vary the input power (see the submit-file sketch below)
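A sketch of how that sweep might be submitted as a single cluster of jobs; the executable, its command-line flag and the number of power steps are all hypothetical:

  Universe    = vanilla
  Executable  = klystron_sim
  Arguments   = --input-power $(Process)
  InitialDir  = power_$(Process)
  Output      = sim.out
  Error       = sim.err
  Log         = sim.log
  Queue 50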

Help! Clearly we need more execute nodes. More users probably wouldn’t hurt either.