Commodity Computing
Miron Livny, Computer Sciences Department, University of Wisconsin-Madison

Computing power is everywhere; how can we make it usable by anyone?

Here is what empowered scientists can do with commodity computing …

NUG28 - Solved!!!!
We are pleased to announce the exact solution of the nug28 quadratic assignment problem (QAP). This problem was derived from the well-known nug30 problem using the distance matrix from a 4 by 7 grid, and the flow matrix from nug30 with the last 2 facilities deleted. This is, to our knowledge, the largest instance from the nugxx series ever provably solved to optimality. The problem was solved using the branch-and-bound algorithm described in the paper "Solving quadratic assignment problems using convex quadratic programming relaxations" by N.W. Brixius and K.M. Anstreicher. The computation was performed on a pool of workstations using the Condor high-throughput computing system in a total wall time of approximately 4 days, 8 hours. During this time the number of active worker machines averaged approximately 200. Machines from UW, UNM, and INFN all participated in the computation.

NUG30 Personal Condor …
For the run we will be flocking to:
-- the main Condor pool at Wisconsin (600 processors)
-- the Condor pool at Georgia Tech (190 Linux boxes)
-- the Condor pool at UNM (40 processors)
-- the Condor pool at Columbia (16 processors)
-- the Condor pool at Northwestern (12 processors)
-- the Condor pool at NCSA (65 processors)
-- the Condor pool at INFN (200 processors)
We will be using glide_in to access the Origin 2000 (through LSF) at NCSA. We will use "hobble_in" to access the Chiba City Linux cluster and Origin 2000 here at Argonne.

It works!!!
Date: Thu, 8 Jun (CDT)
From: Jeff Linderoth
To: Miron Livny
Subject: Re: Priority
This has been a great day for metacomputing! Everything is going wonderfully. We've had over 900 machines (currently around 890), and all the pieces are working great…

Date: Fri, 9 Jun (CDT)
From: Jeff Linderoth
Still rolling along. Over three billion nodes in about 1 day!

Up to a Point …
Date: Fri, 9 Jun (CDT)
From: Jeff Linderoth
Hi Gang,
The glory days of metacomputing are over. Our job just crashed. I watched it happen right before my very eyes. It was what I was afraid of -- they just shut down denali, and losing all of those machines at once caused other connections to time out -- and the snowball effect had bad repercussions for the Schedd.

Back in Business
Date: Fri, 9 Jun (CDT)
From: Jeff Linderoth
Hi Gang,
We are back up and running. And, yes, it took me all afternoon to get it going again. There was a (brand new) bug in the QAP "read checkpoint" information that was making the master coredump. (Only with optimization level -O4). I was nearly reduced to tears, but with some supportive words from Jean-Pierre, I made it through.

The First 600K seconds …

NUG30 - Solved!!!
Subject: Re: Let the festivities begin.
Hi dear Condor Team, you all have been amazing. NUG30 required 10.9 years of Condor Time. In just seven days! More stats tomorrow!!! We are off celebrating! condor rules! cheers, JP.

What can commodity computing do for you?

I have a job-parallel MW application with 600 workers. How can I benefit from Commodity Computing?

My Application …
Study the behavior of F(x,y,z) for 20 values of x, 10 values of y, and 3 values of z (20*10*3 = 600):
› F takes on average 3 hours to compute on a "typical" workstation (total = 1800 hours)
› F requires a "moderate" (128 MB) amount of memory
› F performs "little" I/O: (x,y,z) is 15 MB and F(x,y,z) is 40 MB

Step I - get organized!
› Write a script that creates 600 input files, one for each of the (x,y,z) combinations
› Write a script that collects the data from the 600 output files
› Turn your workstation into a "Personal Condor"
› Submit a cluster of 600 jobs to your personal Condor (a submit sketch follows this list)
› Go on a long vacation … (2.5 months)
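As a sketch (not from the slides) of what the submit description for that 600-job cluster could look like, assuming the first script wrote each (x,y,z) input into its own directory run_dir.0 through run_dir.599; the names F and run_dir are illustrative, and a fuller submit-file example appears later in the deck:

  executable = F
  initialdir = run_dir.$(Process)   # one directory per (x,y,z) combination
  input      = in                   # the 15 MB (x,y,z) input
  output     = out                  # the 40 MB F(x,y,z) result
  error      = err
  log        = F.log
  queue 600                         # $(Process) runs from 0 to 599

Handing this file to condor_submit places all 600 jobs in your personal Condor's queue as one cluster.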

Your Personal Condor will...
› … keep an eye on your jobs and keep you posted on their progress
› … implement your policy on when the jobs can run on your workstation (a policy sketch follows this list)
› … implement your policy on the execution order of the jobs
› … add fault tolerance to your jobs
› … keep a log of your job activities
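A minimal sketch of how "your policy on when the jobs can run" could be expressed as a startd policy in condor_config; it assumes the stock MINUTE macro (60 seconds), and the thresholds are purely illustrative:

  # Run jobs only when the owner has been away from the keyboard for a
  # while and the machine is otherwise idle; back off as soon as the
  # owner returns.
  START    = KeyboardIdle > 15 * $(MINUTE) && LoadAvg < 0.3
  SUSPEND  = KeyboardIdle < $(MINUTE)
  CONTINUE = KeyboardIdle > 5 * $(MINUTE)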

[Diagram: your workstation running a personal Condor with a queue of 600 Condor jobs]

Step II - build your personal Grid
› Install Condor on the desk-top machine next door.
› Install Condor on the machines in the class room.
› Install Condor on the O2K in the basement.
› Configure these machines to be part of your Condor pool (a configuration sketch follows this list).
› Go on a shorter vacation...
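A minimal configuration sketch, assuming your workstation (here the made-up host my-ws.cs.example.edu) acts as the pool's central manager; the relevant condor_config settings on each machine would look roughly like this:

  # On every machine in the pool: point at the central manager.
  CONDOR_HOST = my-ws.cs.example.edu

  # On machines that only execute jobs (the desk-top next door,
  # the class-room machines, the O2K):
  DAEMON_LIST = MASTER, STARTD

  # On your own workstation, which also matches and submits jobs:
  # DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD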

[Diagram: your workstation's personal Condor with its 600 jobs, now also drawing on the group Condor pool]

Condor Layers
[Diagram: the layered Condor architecture - Application, Application Agent (Tasks), Customer Agent (Jobs), Environment Agent, Owner Agent, Local Resource Management, Resource]

Matchmaking in Condor
[Diagram: a Customer Agent (CA) on the Submit Machine and a Resource Agent (RA) on the Execution Machine send their ClassAds to the Collector; the Negotiator of the Environment Agent matches them]

[Diagram: the Submission side (Customer Agent, Application Agent, Request Queue, Data & Object Files, Ckpt Files) and the Execution side (Owner Agent, Execution Agent, Application Process), connected by Remote I/O & Ckpt]

Step III - Take advantage of your friends
› Get permission from "friendly" Condor pools to access their resources
› Configure your personal Condor to "flock" to these pools (a configuration sketch follows this list)
› Reconsider your vacation plans...
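A sketch of the flocking configuration, with made-up host names; the submit machine lists the friendly pools, and each friendly pool's central manager has to allow you in:

  # In the condor_config of your submit machine:
  FLOCK_TO = condor.friend-a.example.edu, condor.friend-b.example.edu

  # In the condor_config of each friendly pool's central manager:
  # FLOCK_FROM = my-ws.cs.example.edu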

[Diagram: your workstation's personal Condor with its 600 jobs, flocking from the group Condor pool to a friendly Condor pool]

Think big. Go to the Grid

Upgrade to Condor-G
A Grid-enabled version of Condor that uses the inter-domain services of Globus to bring Grid resources into the domain of your Personal Condor:
› Supports Grid Universe jobs
› Uses GSIFTP to move glide-in software
› Uses MDS for submit information

Condor glide-in
Enables an application to dynamically turn allocated Grid resources into members of a Condor pool for the duration of the allocation:
› Easy to use on different platforms
› Robust
› Supports SMPs

Step IV - Go for the Grid
› Get access (account(s) + certificate(s)) to a "Computational" Grid
› Submit 599 "Grid Universe" Condor glide-in jobs to your personal Condor (a submit sketch follows this list)
› Take the rest of the afternoon off...
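A hedged sketch of what one of the 599 glide-in submissions could look like in the Condor-G submit syntax of that era (the "globus" universe); the gatekeeper contact string and the glidein_setup wrapper are made-up placeholders, not the actual glide-in tooling:

  universe        = globus
  globusscheduler = gatekeeper.grid-site.example.edu/jobmanager-lsf
  executable      = glidein_setup    # wrapper that unpacks and starts Condor daemons
  output          = glidein.$(Process).out
  error           = glidein.$(Process).err
  log             = glidein.log
  queue 599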

[Diagram: your workstation's personal Condor with its 600 jobs, backed by the group Condor pool, a friendly Condor pool, and 599 glide-ins on Globus Grid resources managed by PBS, LSF, and Condor]

Driving Concepts

HW is a Commodity
Raw computing power is everywhere - on desk-tops, shelves, and racks. It is:
› cheap,
› dynamic,
› distributively owned,
› heterogeneous, and
› evolving.

“… Since the early days of mankind the primary motivation for the establishment of communities has been the idea that by being part of an organized group the capabilities of an individual are improved. The great progress in the area of inter-computer communication led to the development of means by which stand-alone processing sub-systems can be integrated into multi-computer ‘communities’. …”
Miron Livny, “Study of Load Balancing Algorithms for Decentralized Distributed Processing Systems,” Ph.D. thesis, July 1983.

Every Community needs a Matchmaker!

Why? Because … someone has to bring together community members who have requests for goods and services with members who offer them.
› Both sides are looking for each other
› Both sides have constraints
› Both sides have preferences

We use Matchmakers to build Computing Communities out of Commodity Components

The Matchmaking Process
› Advertising Protocol - each party uses a ClassAd to declare its type, its target type, constraints on the target, a ranking of a new match, a ranking of the current match...
› Matchmaking Algorithm - used by the Matchmaker to create matches
› Match Notification Protocol - used by the Matchmaker to notify the matched parties
› Claiming Protocol - used by the matched parties to claim each other
(A sketch of the two advertisements follows this list.)
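A sketch of the two advertisements in ClassAd notation, written in the style of the Condor matchmaking papers; the attribute values are illustrative:

  // Customer Agent advertises a job:
  [
    Type         = "Job";
    Owner        = "jlinderoth";
    Cmd          = "worker";
    Requirements = other.Type == "Machine" && other.OpSys == "LINUX"
                   && other.Memory >= 64;
    Rank         = other.KFlops       // prefer the fastest match
  ]

  // Resource Agent advertises a machine:
  [
    Type         = "Machine";
    OpSys        = "LINUX";
    Memory       = 128;
    KFlops       = 21893;
    Requirements = other.Type == "Job" && LoadAvg < 0.3
                   && KeyboardIdle > 15 * 60;
    Rank         = 0
  ]

The Matchmaker pairs ads whose Requirements are mutually satisfied, uses Rank to choose among candidates, and then notifies both parties so they can claim each other directly.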

High Throughput Computing
For many experimental scientists, scientific progress and quality of research are strongly linked to computing throughput. In other words, they are less concerned about instantaneous computing power. Instead, what matters to them is the amount of computing they can harness over a month or a year: they measure computing power in units of scenarios per day, wind patterns per week, instruction sets per month, or crystal configurations per year.

High Throughput Computing is a 24-7-365 activity.
FLOPY ≠ (60*60*24*7*52)*FLOPS

Obstacles to HTC
› Ownership Distribution (Sociology)
› Customer Awareness (Education)
› Size and Uncertainties (Robustness)
› Technology Evolution (Portability)
› Physical Distribution (Technology)

Basic HTC Mechanisms
› Matchmaking - enables requests for services and offers to provide services to find each other (ClassAds).
› Checkpointing - enables preemptive-resume scheduling (go ahead and use it as long as it is available!).
› Remote I/O - enables remote (from the execution site) access to local (at the submission site) data.
› Asynchronous API - enables management of dynamic (opportunistic) resources.

Master-Worker (MW) computing is Naturally Parallel. It is by no means Embarrassingly Parallel. Doing it right is by no means trivial.

The Tool

Our Answer to High Throughput MW Computing on commodity resources

The Condor System
A High Throughput Computing system that supports large dynamic MW applications on large collections of distributively owned resources. Developed, maintained, and supported by the Condor Team at the University of Wisconsin-Madison since ‘86.
› Originally developed for UNIX workstations
› Based on matchmaking technology
› Fully integrated NT version is available
› Deployed world-wide by academia and industry
› More than 1300 CPUs at the U of Wisconsin
› Available at

Condor CPUs on the UW Campus

Some Numbers: UW-CS Pool, 6/98 - 6/00

  UW-CS Pool total             4,000,000 hours   ~450 years
  "Real" Users                 1,700,000 hours   ~260 years
    CS-Optimization              610,000 hours
    CS-Architecture              350,000 hours
    Physics                      245,000 hours
    Statistics                    80,000 hours
    Engine Research Center        38,000 hours
    Math                          90,000 hours
    Civil Engineering             27,000 hours
    Business                         970 hours
  "External" Users               165,000 hours   ~19 years
    MIT                           76,000 hours
    Cornell                       38,000 hours
    UCSD                          38,000 hours
    CalTech                       18,000 hours

Key Condor User Services
› Local control - jobs are stored and managed locally by a personal scheduler.
› Priority scheduling - execution order is controlled by a priority ranking assigned by the user.
› Job preemption - re-linked jobs can be checkpointed, suspended, held, and resumed.
› Local execution environment preserved - re-linked jobs can have their I/O redirected to the submission site.
(A sketch of the re-link and submit steps follows this list.)
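A sketch of the steps the last two services rely on: the job is re-linked with condor_compile and submitted to the "standard" universe (the file names are illustrative):

  # Re-link the worker against the Condor libraries:
  #   condor_compile gcc -o worker worker.c
  #
  # Then submit it as a standard-universe job, which can be checkpointed,
  # suspended, held, and resumed, with its I/O redirected back to the
  # submission site:
  universe   = standard
  executable = worker
  input      = in
  output     = out
  log        = worker.log
  queue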

More Condor User Services
› Powerful and flexible means for selecting the execution site (requirements and preferences)
› Logging of job activities
› Management of large numbers (10K) of jobs per user
› Support for jobs with dependencies - DAGMan (Directed Acyclic Graph Manager); a small DAG sketch follows this list
› Support for dynamic MW (PVM and File) applications
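A small DAGMan sketch for the 600-job study; prepare.sub, work.sub, and collect.sub are hypothetical submit files for the three phases from Step I, wired together so the collection step runs only after all the work has finished:

  # study.dag - submitted with: condor_submit_dag study.dag
  # (in a full version, each of the 600 (x,y,z) jobs could be its own node)
  JOB Prepare prepare.sub
  JOB Work    work.sub
  JOB Collect collect.sub
  PARENT Prepare CHILD Work
  PARENT Work    CHILD Collect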

A Condor Job-Parallel Submit File

  executable   = worker
  # Run only on Linux machines (the slide targeted Linux 2.2) with at
  # least 64 MB of memory; prefer the fastest (highest-KFlops) machines.
  requirements = (OpSys == "LINUX") && (Memory >= 64)
  rank         = KFlops
  initialdir   = worker_dir.$(Process)
  input        = in
  output       = out
  error        = err
  log          = Condor_log
  queue 1000

Task Parallel MW Application

  potential = start
  FOR cycle = 1 TO 36
      FOR location = 1 TO 31
          totalEnergy += Energy(location, potential)   // worker tasks
      END
      potential = F(totalEnergy)                       // master task
  END

Implemented as a PVM application with the Condor MW services. Two traces (execution and performance) are visualized by DEVise.

[Figure: DEVise traces of the run (6 hours total); panels show logical worker ID for the 36*31 worker tasks, node utilization, number of workers over time, one cycle (31 worker tasks), and task duration vs. location, with the first, second, and third allocations and a preemption marked]

We have customers who...
› … have job-parallel MW applications with more than 5000 jobs.
› … have task-parallel MW applications with more than 1000 tasks.
› … run their job-parallel MW applications for more than six months.
› … run their task-parallel MW applications for more than four weeks.

Who are we?

The Condor Project (Established ‘85)
Distributed systems CS research performed by a team that faces:
› software engineering challenges in a UNIX/Linux/NT environment,
› active interaction with users and collaborators,
› daily maintenance and support challenges of a distributed production environment,
› and educating and training students.
Funding: NSF, NASA, DoE, DoD, IBM, INTEL, Microsoft, and the UW Graduate School.

Users and collaborators
› Scientists - Biochemistry, high energy physics, computer sciences, genetics, …
› Engineers - Hardware design, software building and testing, animation, ...
› Educators - Hardware design tools, distributed systems, networking, ...

National Grid Efforts
› National Technology Grid - NCSA Alliance (NSF-PACI)
› Information Power Grid - IPG (NASA)
› Particle Physics Data Grid - PPDG (DoE)
› Grid Physics Network - GriPhyN (NSF-ITR)

Do not be picky, be agile!!!