High-Throughput Computing With Condor
Peter Couvares
Computer Sciences Department, University of Wisconsin-Madison
Who Are We?
The Condor Project (Established ’85)
Distributed systems CS research performed by a team that faces: software engineering challenges in a Unix/Linux/NT environment; active interaction with users and collaborators; the daily maintenance and support challenges of a distributed production environment; and the education and training of students. Funding: NSF, NASA, DoE, DoD, IBM, Intel, Microsoft, and the UW Graduate School.
The Condor System
The Condor System › Unix and NT › Operational since 1986 › More than 1300 CPUs at UW-Madison › Available on the web › More than 150 clusters worldwide in academia and industry
What is Condor? › Condor converts collections of distributively owned workstations and dedicated clusters into a high-throughput computing facility. › Condor uses matchmaking to make sure that everyone is happy.
What is High-Throughput Computing? › High-performance: CPU cycles/second under ideal circumstances. “How fast can I run simulation X on this machine?” › High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances. “How many times can I run simulation X in the next month using all available machines?”
What is High-Throughput Computing? › Condor does whatever it takes to run your jobs, even if some machines… Crash! (or are disconnected) Run out of disk space Don’t have your software installed Are frequently needed by others Are far away & admin’ed by someone else
What is Matchmaking? › Condor uses Matchmaking to make sure that work gets done within the constraints of both users and owners. › Users (jobs) have constraints: “I need an Alpha with 256 MB RAM” › Owners (machines) have constraints: “Only run jobs when I am away from my desk and never run jobs owned by Bob.”
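Both sides of this bargain can be sketched in Condor's own constraint syntax. This is a hedged illustration, not a complete configuration: the job-side line belongs in a submit description file, the owner-side line in the machine's condor_config, and "bob" is a hypothetical user.

```
# Job-side constraints (submit description file):
# "I need an Alpha with 256 MB RAM"
requirements = (Arch == "ALPHA") && (Memory >= 256)

# Owner-side constraints (machine's condor_config):
# "Only run jobs when I am away from my desk, never run Bob's jobs"
START = (KeyboardIdle > 15 * $(MINUTE)) && (Owner != "bob")
```

The matchmaker pairs a job with a machine only when both expressions evaluate to true against the other party's attributes.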
“What can Condor do for me?” Condor can… › …do your housekeeping. › …improve reliability. › …give performance feedback. › …increase your throughput!
Some Numbers: UW-CS Pool, 6/98–6/00 (4,000,000 hours ≈ 450 years)
“Real” Users: 1,700,000 hours (≈ 260 years)
  CS-Optimization: 610,000 hours
  CS-Architecture: 350,000 hours
  Physics: 245,000 hours
  Statistics: 80,000 hours
  Engine Research Center: 38,000 hours
  Math: 90,000 hours
  Civil Engineering: 27,000 hours
  Business: 970 hours
“External” Users: 165,000 hours (≈ 19 years)
  MIT: 76,000 hours
  Cornell: 38,000 hours
  UCSD: 38,000 hours
  CalTech: 18,000 hours
Condor & Physics
Current CMS Activity
› Simulation (CMSIM) for CalTech: provided >135,000 CPU hours to date; peak day ~4,000 CPU hours
› Via the NCSA Alliance, Condor has allocated 1,000,000 hours total to CalTech
› Simulation and Reconstruction (CMSIM + ORCA) for the HEP group at UW-Madison
INFN Condor Pool - Italy › Italian National Institute for Research in Nuclear and Subnuclear Physics › 19 locations, each running a Condor pool › ranging from 1 CPU to >100 CPUs › each locally controlled › each “flocks” jobs to other pools when resources are available
Particle Physics Data Grid › The PPDG Project is a software engineering effort to design, implement, experiment with, evaluate, and prototype HEP-specific data-transfer and caching software tools for Grid environments. › For example...
Condor PPDG Work
› Condor Data Manager: technology to automate & coordinate data movement from a variety of long-term repositories (SRB from SDSC, SAM from Fermi, the PPDG HRM) to available Condor computing resources and back again, keeping the pipeline full!
PPDG Collaborators
National Grid Efforts › GriPhyN (Grid Physics Network) › National Technology Grid - NCSA Alliance (NSF-PACI) › Information Power Grid - IPG (NASA) › close collaboration with the Globus project
I have 600 simulations to run. How can Condor help me?
My Application … Simulate the behavior of F(x,y,z) for 20 values of x, 10 values of y, and 3 values of z (20 × 10 × 3 = 600). F takes on average 3 hours to compute on a “typical” workstation (total = 1800 hours). F requires a “moderate” (128 MB) amount of memory. F performs “moderate” I/O: (x,y,z) is 5 MB and F(x,y,z) is 50 MB.
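These requirements map naturally onto a Condor submit description file. The sketch below is hedged: the executable name F, the file layout, and the assumption that F reads its input on stdin and writes results to stdout are all placeholders for illustration.

```
# Hypothetical submit description file for the 600-instance sweep.
universe     = vanilla
executable   = F
# One input/output pair per job in the cluster: $(Process) runs 0..599.
input        = inputs/in.$(Process)
output       = outputs/out.$(Process)
# Match only machines with at least the "moderate" 128 MB of memory.
requirements = (Memory >= 128)
queue 600
```

A single `queue 600` turns the whole parameter sweep into one cluster of 600 jobs.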
Step I - get organized! › Write a script that creates 600 input files, one for each of the (x,y,z) combinations › Write a script that will collect the data from the 600 output files › Turn your workstation into a “Personal Condor” › Submit a cluster of 600 jobs to your personal Condor › Go on a long vacation … (2.5 months)
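The first script above can be a few lines of Python. The parameter ranges, directory, and input format here are stand-ins for whatever the real simulation sweeps:

```python
import itertools
import os

# Hypothetical parameter grid: 20 x-values, 10 y-values, 3 z-values.
xs = range(20)
ys = range(10)
zs = range(3)

os.makedirs("inputs", exist_ok=True)
for i, (x, y, z) in enumerate(itertools.product(xs, ys, zs)):
    # One input file per (x, y, z) combination: inputs/in.0 ... inputs/in.599
    with open(os.path.join("inputs", "in.%d" % i), "w") as f:
        f.write("%d %d %d\n" % (x, y, z))
```

The collection script is the mirror image: read outputs/out.0 through outputs/out.599 and aggregate.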
[Diagram: your workstation running a personal Condor with 600 Condor jobs queued]
Step II - build your personal Grid › Install Condor on the desktop machine next door › …and on the machines in the classroom. › Install Condor on the department’s Linux cluster or the O2K in the basement. › Configure these machines to be part of your Condor pool. › Go on a shorter vacation...
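Joining a machine to the pool amounts to pointing its Condor configuration at the pool's central manager. A minimal sketch, with a hypothetical hostname:

```
# condor_config excerpt on each newly added machine.
# CONDOR_HOST names the pool's central manager (hostname is hypothetical).
CONDOR_HOST = your-workstation.cs.wisc.edu
# Run the standard daemons: master, execute-side startd, submit-side schedd.
DAEMON_LIST = MASTER, STARTD, SCHEDD
```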
[Diagram: your workstation's personal Condor, with 600 Condor jobs, now backed by a group Condor pool]
Step III - take advantage of your friends › Get permission from “friendly” Condor pools to access their resources › Configure your personal Condor to “flock” to these pools › reconsider your vacation plans...
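Flocking is configured on both sides using Condor's FLOCK_TO and FLOCK_FROM configuration macros. The pool names below are hypothetical:

```
# condor_config excerpt on your submit machine:
# pools to flock jobs to, in order of preference.
FLOCK_TO = condor.friendly-pool-a.edu, condor.friendly-pool-b.edu

# condor_config excerpt on the friendly pool's central manager:
# submit machines allowed to flock in.
FLOCK_FROM = your-workstation.cs.wisc.edu
```

When your own pool has no idle machines, the schedd transparently tries the pools in FLOCK_TO.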
[Diagram: your workstation's personal Condor, with 600 Condor jobs, flocking to the group Condor pool and a friendly Condor pool]
Think BIG. Go to the Grid.
Upgrade to Condor-G
› A Grid-enabled version of Condor that uses the inter-domain services of Globus to bring Grid resources into the domain of your Personal Condor
› Easy to use on different platforms
› Robust
› Supports SMPs & dedicated schedulers
Step IV - Go for the Grid › Get access (account(s) + certificate(s)) to a “Computational” Grid › Submit 599 “Grid Universe” Condor glide-in jobs to your personal Condor › Take the rest of the afternoon off...
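A glide-in submission can be sketched as a Condor-G job sent through the Globus universe; the glide-in payload then starts Condor daemons on the remote resource so your ordinary 600 jobs can match there. Everything below is a hedged illustration: the gatekeeper contact string and the startup script name are placeholders, and the real glide-in tooling handles the details.

```
# Hypothetical Condor-G submit description file for glide-in jobs.
universe        = globus
# Placeholder Globus gatekeeper contact string, not a real host.
globusscheduler = gatekeeper.example.edu/jobmanager
# Placeholder script that would launch Condor daemons on the Grid node.
executable      = my_glidein_startup
queue 599
```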
[Diagram: your workstation's personal Condor, with 600 Condor jobs and 599 glide-ins, reaching the group Condor pool, a friendly Condor pool, and a Globus Grid of PBS, LSF, and Condor resources]
What Have We Done with the Grid Already?
› NUG30 quadratic assignment problem: 30 facilities, 30 locations; minimize the cost of transferring materials between them
› Posed as a challenge in 1968; long unsolved...
› ...but with a good pruning algorithm & high-throughput computing...
NUG30 Personal Condor Grid
For the run we will be flocking to:
› the main Condor pool at Wisconsin (600 processors)
› the Condor pool at Georgia Tech (190 Linux boxes)
› the Condor pool at UNM (40 processors)
› the Condor pool at Columbia (16 processors)
› the Condor pool at Northwestern (12 processors)
› the Condor pool at NCSA (65 processors)
› the Condor pool at INFN (200 processors)
We will be using glide_in to access the Origin 2000 (through LSF) at NCSA. We will use "hobble_in" to access the Chiba City Linux cluster and Origin 2000 here at Argonne.
NUG30 - Solved!!!
Sender:
Subject: Re: Let the festivities begin.
Hi dear Condor Team, you all have been amazing. NUG30 required 10.9 years of Condor Time. In just seven days! More stats tomorrow!!! We are off celebrating! condor rules! cheers, JP.
Conclusion
Computing power is everywhere; we try to make it usable by anyone.
Need more info? › Condor Web Page ( › Peter Couvares