High-Throughput Computing With Condor
Peter Couvares
Computer Sciences Department, University of Wisconsin-Madison
Who Are We?
The Condor Project (Established ’85)
Distributed systems CS research performed by a team that faces: software engineering challenges in a Unix/Linux/NT environment; active interaction with users and collaborators; the daily maintenance and support challenges of a distributed production environment; and the education and training of students. Funding: NSF, NASA, DoE, DoD, IBM, Intel, Microsoft, and the UW Graduate School.
The Condor System
The Condor System › Unix and NT › Operational since 1986 › More than 1300 CPUs at UW-Madison › Available on the web › More than 150 clusters worldwide in academia and industry
What is Condor? › Condor converts collections of distributively owned workstations and dedicated clusters into a high-throughput computing facility. › Condor uses matchmaking to make sure that everyone is happy.
What is High-Throughput Computing? › High-performance: CPU cycles/second under ideal circumstances. “How fast can I run simulation X on this machine?” › High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances. “How many times can I run simulation X in the next month using all available machines?”
What is High-Throughput Computing? › Condor does whatever it takes to run your jobs, even if some machines… Crash! (or are disconnected) Run out of disk space Don’t have your software installed Are frequently needed by others Are far away & admin’ed by someone else
What is Matchmaking? › Condor uses Matchmaking to make sure that work gets done within the constraints of both users and owners. › Users (jobs) have constraints: “I need an Alpha with 256 MB RAM” › Owners (machines) have constraints: “Only run jobs when I am away from my desk and never run jobs owned by Bob.”
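Both sides of this bargain can be sketched in Condor's own constraint syntax. This is a hedged illustration, not a complete configuration: the job-side line belongs in a submit description file, the owner-side line in the machine's condor_config, and "bob" is a hypothetical user.

```
# Job-side constraints (submit description file):
# "I need an Alpha with 256 MB RAM"
requirements = (Arch == "ALPHA") && (Memory >= 256)

# Owner-side constraints (machine's condor_config):
# "Only run jobs when I am away from my desk, never run Bob's jobs"
START = (KeyboardIdle > 15 * $(MINUTE)) && (Owner != "bob")
```

The matchmaker pairs a job with a machine only when both expressions evaluate to true against the other party's attributes.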
“What can Condor do for me?” Condor can… › …do your housekeeping. › …improve reliability. › …give performance feedback. › …increase your throughput!
Some Numbers: UW-CS Pool, 6/98–6/00 (4,000,000 hours ≈ 450 years)
“Real” Users: 1,700,000 hours (≈ 260 years)
  CS-Optimization: 610,000 hours
  CS-Architecture: 350,000 hours
  Physics: 245,000 hours
  Statistics: 80,000 hours
  Engine Research Center: 38,000 hours
  Math: 90,000 hours
  Civil Engineering: 27,000 hours
  Business: 970 hours
“External” Users: 165,000 hours (≈ 19 years)
  MIT: 76,000 hours
  Cornell: 38,000 hours
  UCSD: 38,000 hours
  CalTech: 18,000 hours
Condor & Physics
Current CMS Activity
› Simulation (CMSIM) for CalTech: provided >135,000 CPU hours to date; peak day ~4,000 CPU hours
› Via the NCSA Alliance, Condor has allocated 1,000,000 hours total to CalTech
› Simulation and Reconstruction (CMSIM + ORCA) for the HEP group at UW-Madison
INFN Condor Pool - Italy › Italian National Institute for Research in Nuclear and Subnuclear Physics › 19 locations, each running a Condor pool › ranging from 1 CPU to >100 CPUs › each locally controlled › each “flocks” jobs to other pools when resources are available
Particle Physics Data Grid › The PPDG Project is a software engineering effort to design, implement, experiment with, evaluate, and prototype HEP-specific data-transfer and caching software tools for Grid environments. › For example...
Condor PPDG Work
› Condor Data Manager: technology to automate & coordinate data movement from a variety of long-term repositories (SRB from SDSC, SAM from Fermi, the PPDG HRM) to available Condor computing resources and back again, keeping the pipeline full!
PPDG Collaborators
National Grid Efforts › GriPhyN (Grid Physics Network) › National Technology Grid - NCSA Alliance (NSF-PACI) › Information Power Grid - IPG (NASA) › close collaboration with the Globus project
I have 600 simulations to run. How can Condor help me?
My Application … Simulate the behavior of F(x,y,z) for 20 values of x, 10 values of y, and 3 values of z (20 × 10 × 3 = 600). F takes on average 3 hours to compute on a “typical” workstation (total = 1800 hours). F requires a “moderate” (128 MB) amount of memory. F performs “moderate” I/O: (x,y,z) is 5 MB and F(x,y,z) is 50 MB.
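These requirements map naturally onto a Condor submit description file. The sketch below is hedged: the executable name F, the file layout, and the assumption that F reads its input on stdin and writes results to stdout are all placeholders for illustration.

```
# Hypothetical submit description file for the 600-instance sweep.
universe     = vanilla
executable   = F
# One input/output pair per job in the cluster: $(Process) runs 0..599.
input        = inputs/in.$(Process)
output       = outputs/out.$(Process)
# Match only machines with at least the "moderate" 128 MB of memory.
requirements = (Memory >= 128)
queue 600
```

A single `queue 600` turns the whole parameter sweep into one cluster of 600 jobs.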
Step I - get organized! › Write a script that creates 600 input files, one for each of the (x,y,z) combinations › Write a script that will collect the data from the 600 output files › Turn your workstation into a “Personal Condor” › Submit a cluster of 600 jobs to your personal Condor › Go on a long vacation … (2.5 months)
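The first script above can be a few lines of Python. The parameter ranges, directory, and input format here are stand-ins for whatever the real simulation sweeps:

```python
import itertools
import os

# Hypothetical parameter grid: 20 x-values, 10 y-values, 3 z-values.
xs = range(20)
ys = range(10)
zs = range(3)

os.makedirs("inputs", exist_ok=True)
for i, (x, y, z) in enumerate(itertools.product(xs, ys, zs)):
    # One input file per (x, y, z) combination: inputs/in.0 ... inputs/in.599
    with open(os.path.join("inputs", "in.%d" % i), "w") as f:
        f.write("%d %d %d\n" % (x, y, z))
```

The collection script is the mirror image: read outputs/out.0 through outputs/out.599 and aggregate.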
[Diagram: your workstation running a personal Condor with 600 Condor jobs queued]
Step II - build your personal Grid › Install Condor on the desktop machine next door › …and on the machines in the classroom. › Install Condor on the department’s Linux cluster or the O2K in the basement. › Configure these machines to be part of your Condor pool. › Go on a shorter vacation...
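Joining a machine to the pool amounts to pointing its Condor configuration at the pool's central manager. A minimal sketch, with a hypothetical hostname:

```
# condor_config excerpt on each newly added machine.
# CONDOR_HOST names the pool's central manager (hostname is hypothetical).
CONDOR_HOST = your-workstation.cs.wisc.edu
# Run the standard daemons: master, execute-side startd, submit-side schedd.
DAEMON_LIST = MASTER, STARTD, SCHEDD
```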
[Diagram: your workstation's personal Condor, with 600 Condor jobs, now backed by a group Condor pool]
Step III - take advantage of your friends › Get permission from “friendly” Condor pools to access their resources › Configure your personal Condor to “flock” to these pools › reconsider your vacation plans...
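Flocking is configured on both sides using Condor's FLOCK_TO and FLOCK_FROM configuration macros. The pool names below are hypothetical:

```
# condor_config excerpt on your submit machine:
# pools to flock jobs to, in order of preference.
FLOCK_TO = condor.friendly-pool-a.edu, condor.friendly-pool-b.edu

# condor_config excerpt on the friendly pool's central manager:
# submit machines allowed to flock in.
FLOCK_FROM = your-workstation.cs.wisc.edu
```

When your own pool has no idle machines, the schedd transparently tries the pools in FLOCK_TO.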
[Diagram: your workstation's personal Condor, with 600 Condor jobs, flocking to the group Condor pool and a friendly Condor pool]
Think BIG. Go to the Grid.
Upgrade to Condor-G
› A Grid-enabled version of Condor that uses the inter-domain services of Globus to bring Grid resources into the domain of your Personal Condor
› Easy to use on different platforms
› Robust
› Supports SMPs & dedicated schedulers
Step IV - Go for the Grid › Get access (account(s) + certificate(s)) to a “Computational” Grid › Submit 599 “Grid Universe” Condor glide-in jobs to your personal Condor › Take the rest of the afternoon off...
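A glide-in submission can be sketched as a Condor-G job sent through the Globus universe; the glide-in payload then starts Condor daemons on the remote resource so your ordinary 600 jobs can match there. Everything below is a hedged illustration: the gatekeeper contact string and the startup script name are placeholders, and the real glide-in tooling handles the details.

```
# Hypothetical Condor-G submit description file for glide-in jobs.
universe        = globus
# Placeholder Globus gatekeeper contact string, not a real host.
globusscheduler = gatekeeper.example.edu/jobmanager
# Placeholder script that would launch Condor daemons on the Grid node.
executable      = my_glidein_startup
queue 599
```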
[Diagram: your workstation's personal Condor, with 600 Condor jobs and 599 glide-ins, reaching the group Condor pool, a friendly Condor pool, and a Globus Grid of PBS, LSF, and Condor resources]
What Have We Done with the Grid Already?
› NUG30 quadratic assignment problem: 30 facilities, 30 locations; minimize the cost of transferring materials between them
› Posed as a challenge in 1968; long unsolved...
› ...but with a good pruning algorithm & high-throughput computing...
NUG30 Personal Condor Grid
For the run we will be flocking to:
› the main Condor pool at Wisconsin (600 processors)
› the Condor pool at Georgia Tech (190 Linux boxes)
› the Condor pool at UNM (40 processors)
› the Condor pool at Columbia (16 processors)
› the Condor pool at Northwestern (12 processors)
› the Condor pool at NCSA (65 processors)
› the Condor pool at INFN (200 processors)
We will be using glide_in to access the Origin 2000 (through LSF) at NCSA. We will use "hobble_in" to access the Chiba City Linux cluster and Origin 2000 here at Argonne.
NUG30 - Solved!!!
Sender:
Subject: Re: Let the festivities begin.
Hi dear Condor Team, you all have been amazing. NUG30 required 10.9 years of Condor Time. In just seven days! More stats tomorrow!!! We are off celebrating! condor rules! cheers, JP.
Conclusion
Computing power is everywhere; we try to make it usable by anyone.
Need more info? › Condor Web Page ( › Peter Couvares