Douglas Thain (Miron Livny), Computer Sciences Department, University of Wisconsin-Madison: High-Throughput Computing on Commodity Systems.

The Good News: Raw computing power is everywhere - on desktops, shelves, racks, and in your pockets. It is:  Cheap  Plentiful  Mass-Produced

The Bad News: the GFLOPs you actually get per year ≠ peak GFLOPs per second × 30,000,000 seconds/year

A variation on a chestnut: What is a benchmark?

Answer: The throughput which your system is guaranteed never to exceed!

Why? › A community of commodity computers can be difficult to manage:  Dynamic: state and availability change over time  Evolving: new hardware and software are continuously acquired and installed  Heterogeneous: both hardware and software vary across machines  Distributed ownership: each machine has a different owner with different requirements and preferences.

Why? › Even traditionally “static” systems (such as professionally managed clusters) suffer the same problems when viewed at a yearly scale:  Power failures  Hardware failures  Software upgrades  Load imbalance  Network imbalance

How do we measure computer performance? › High-Performance Computing:  Achieve maximum GFLOPs per second under ideal circumstances. › High-Throughput Computing:  Achieve maximum GFLOPs per month or year under whatever conditions prevail.

High-Throughput Computing › Focuses on maximizing…  simulations run before the paper deadline…  crystal lattices per week…  reconstructions per week…  video frames rendered per year… › …without “babysitting” from the user. › Cannot depend on “ideal” circumstances.

High-Throughput Computing › Is achieved by:  Expanding the number of CPUs available.  Silently adapting to inevitable changes.  Robust software. › Is only marginally affected by:  MB, MHz, MIPS, FLOPS…  Robust hardware.

Solution: Condor › Condor is software for creating a high-throughput computing environment on a community of workstations, ranging from commodity PCs to supercomputers.

Who are we?

The Condor Project (Established ‘85) Distributed systems CS research performed by a team that faces:  software engineering challenges in a UNIX/Linux/NT environment,  active interaction with users and collaborators,  daily maintenance and support challenges of a distributed production environment,  and the education and training of students. Funding - NSF, NASA, DoE, DoD, IBM, INTEL, Microsoft, and the UW Graduate School.

Users and collaborators › Scientists - Biochemistry, high energy physics, computer sciences, genetics, … › Engineers - Hardware design, software building and testing, animation, … › Educators - Hardware design tools, distributed systems, networking, …

National Grid Efforts › National Technology Grid - NCSA Alliance (NSF-PACI) › Information Power Grid - IPG (NASA) › Particle Physics Data Grid - PPDG (DoE) › Grid Physics Network - GriPhyN (NSF-ITR)

Condor CPUs on the UW Campus

Some Numbers: UW-CS Pool, 6/98-6/00: 4,000,000 hours (~450 years)
“Real” Users: 1,700,000 hours (~260 years)
 CS-Optimization: 610,000 hours
 CS-Architecture: 350,000 hours
 Physics: 245,000 hours
 Statistics: 80,000 hours
 Engine Research Center: 38,000 hours
 Math: 90,000 hours
 Civil Engineering: 27,000 hours
 Business: 970 hours
“External” Users: 165,000 hours (~19 years)
 MIT: 76,000 hours
 Cornell: 38,000 hours
 UCSD: 38,000 hours
 CalTech: 18,000 hours

Start slow, but think BIG

Start slow, but think big!  Personal Condor: 1 machine, on your desktop  Condor Pool: 100 machines, in your department  Condor-G: 1000 machines, in the Grid

Start slow, but think big! › Personal Condor:  Manage just your machine with Condor. Fault tolerance, policy control, logging. Sleep soundly at night. › Condor Pool:  Take advantage of your friends and colleagues: share cycles, gain ~ 100x throughput. › Condor-G:  Jobs from your pool migrate to other computational facilities around the world. Gain 1000x throughput. (Record-breaking results!)

Key Condor User Services › Local control - jobs are stored and managed locally by a personal scheduler. › Priority scheduling - execution order controlled by a priority ranking assigned by the user. › Job preemption - re-linked jobs can be checkpointed, suspended, held, and resumed. › Local execution environment preserved - re-linked jobs can have their I/O redirected to the submission site.
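To make these services concrete, here is a minimal sketch of a Condor submit description file, assuming a hypothetical program named simulate and input files input.0 through input.99; the names, resource constraints, and preferences are illustrative, not taken from the talk.

 # simulate.submit - hypothetical submit description file
 universe     = vanilla
 executable   = simulate
 arguments    = input.$(Process)
 requirements = (OpSys == "LINUX") && (Memory >= 64)
 rank         = KFlops
 output       = out.$(Process)
 error        = err.$(Process)
 log          = simulate.log
 queue 100

Running condor_submit on a file like this queues 100 independent jobs with the local scheduler, which logs their lifecycle events to simulate.log and sends each job only to machines satisfying the requirements expression, preferring faster machines via rank.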

More Condor User Services › Powerful and flexible means for selecting the execution site (requirements and preferences). › Logging of job activities. › Management of large numbers (10K+) of jobs per user. › Support for jobs with dependencies - DAGMan (Directed Acyclic Graph Manager). › Support for dynamic master-worker (MW) applications (PVM- and file-based).
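For the dependency support listed above, the jobs and their ordering are described in a DAGMan input file. The sketch below is hypothetical; the node names and per-node submit files (a.submit, b.submit, …) are placeholders.

 # diamond.dag - hypothetical DAGMan input file
 JOB A a.submit
 JOB B b.submit
 JOB C c.submit
 JOB D d.submit
 PARENT A CHILD B C
 PARENT B C CHILD D

Submitting it with condor_submit_dag runs A first, B and C in parallel once A completes successfully, and D only after both B and C finish.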

How does it work?

Basic HTC Mechanisms › Matchmaking - enables requests for services and offers to provide services to find each other (ClassAds). › Fault tolerance - checkpointing enables preemptive-resume scheduling (go ahead and use a resource as long as it is available!). › Remote execution - enables transparent access to resources from any machine in the world. › Asynchronicity - enables management of dynamic (opportunistic) resources.

Every Community needs a Matchmaker!

Why? Because… someone has to bring together community members who have requests for goods and services with members who offer them.  Both sides are looking for each other  Both sides have constraints  Both sides have preferences

ClassAd - Properties
 Type        = "Machine";
 Activity    = "Idle";
 KbdIdle     = '00:22:31';
 Disk        = 2.1G;        // 2.1 Gigs
 Memory      = 64M;         // 64 Megs
 State       = "Unclaimed";
 LoadAverage = 0.042;       // example value
 Arch        = "INTEL";
 OpSys       = "SOLARIS251";

ClassAd - Policy
 RsrchGrp  = { "raman", "miron", "solomon" };
 Friends   = { "dilbert", "wally" };
 Untrusted = { "rival", "riffraff", "TPHB" };
 Tier = member(RsrchGrp, other.Owner) ? 2 :
        ( member(Friends, other.Owner) ? 1 : 0 );
 Requirements = !member(Untrusted, other.Owner) &&
                ( Tier == 2 ? True :
                  Tier == 1 ? ( LoadAvg < 0.3 && KbdIdle > '00:15' ) :
                              ( DayTime() > '18:00' ) );
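The two ads above describe the machine's side of a match. For contrast, a job's ad might look like the sketch below; the owner, command, and resource figures are illustrative and not taken from the talk.

 // Job-side ClassAd (illustrative values)
 Type         = "Job";
 Owner        = "raman";
 Cmd          = "run_sim";
 Requirements = other.Arch == "INTEL" && other.OpSys == "SOLARIS251" &&
                other.Memory >= 32M;
 Rank         = other.Memory;   // prefer machines with more memory

The matchmaker pairs a job ad with a machine ad only when each side's Requirements evaluates to true against the other, and uses Rank to choose among the acceptable candidates.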

Advantages of Matchmaking  Hybrid (centralized + distributed) resource allocation algorithm  End-to-end verification  Bilateral specialization  Weak consistency requirements  Authentication  Fault tolerance  Incremental system evolution

Fault-Tolerance › Condor can checkpoint a program by writing its image to disk. › If a machine should fail, the program may resume from the last checkpoint. › If a job must vacate a machine, it may resume from where it left off.

Remote Execution › Condor might run your jobs on machines spread around the world – not all of them will have your files. › Condor provides an adapter – a library – which converts your job’s I/O operations into remote I/O back to your home machine. › No matter where your job runs, it sees the same environment.

Asynchronicity › A fact of life in a system of 1000s of machines.  Power on/off  Lunch breaks  Jobs start and finish › Condor never depends on a fixed configuration - it works with whatever is available.

Does it work?

An example - NUG28 We are pleased to announce the exact solution of the nug28 quadratic assignment problem (QAP). This problem was derived from the well known nug30 problem using the distance matrix from a 4 by 7 grid, and the flow matrix from nug30 with the last 2 facilities deleted. This is to our knowledge the largest instance from the nugxx series ever provably solved to optimality. The problem was solved using the branch-and-bound algorithm described in the paper "Solving quadratic assignment problems using convex quadratic programming relaxations," N.W. Brixius and K.M. Anstreicher. The computation was performed on a pool of workstations using the Condor high-throughput computing system in a total wall time of approximately 4 days, 8 hours. During this time the number of active worker machines averaged approximately 200. Machines from UW, UNM, and INFN all participated in the computation.

NUG30 Personal Condor … For the run we will be flocking to:
 -- the main Condor pool at Wisconsin (600 processors)
 -- the Condor pool at Georgia Tech (190 Linux boxes)
 -- the Condor pool at UNM (40 processors)
 -- the Condor pool at Columbia (16 processors)
 -- the Condor pool at Northwestern (12 processors)
 -- the Condor pool at NCSA (65 processors)
 -- the Condor pool at INFN (200 processors)
We will be using glide_in to access the Origin 2000 (through LSF) at NCSA. We will use "hobble_in" to access the Chiba City Linux cluster and the Origin 2000 here at Argonne.

It works!!!
Date: Thu, 8 Jun 2000 (CDT)
From: Jeff Linderoth
To: Miron Livny
Subject: Re: Priority
This has been a great day for metacomputing! Everything is going wonderfully. We've had over 900 machines (currently around 890), and all the pieces are working great…
Date: Fri, 9 Jun 2000 (CDT)
From: Jeff Linderoth
Still rolling along. Over three billion nodes in about 1 day!

Up to a Point …
Date: Fri, 9 Jun 2000 (CDT)
From: Jeff Linderoth
Hi Gang, The glory days of metacomputing are over. Our job just crashed. I watched it happen right before my very eyes. It was what I was afraid of -- they just shut down denali, and losing all of those machines at once caused other connections to time out -- and the snowball effect had bad repercussions for the Schedd.

Back in Business
Date: Fri, 9 Jun 2000 (CDT)
From: Jeff Linderoth
Hi Gang, We are back up and running. And, yes, it took me all afternoon to get it going again. There was a (brand new) bug in the QAP "read checkpoint" information that was making the master core dump. (Only with optimization level -O4.) I was nearly reduced to tears, but with some supportive words from Jean-Pierre, I made it through.

The First 600K seconds …

We made it!!!
Subject: Re: Let the festivities begin.
Hi dear Condor Team, you all have been amazing. NUG30 required 10.9 years of Condor Time. In just seven days! More stats tomorrow!!! We are off celebrating! condor rules! cheers, JP.

Do not be picky, be agile!!!