Alain Roy Computer Sciences Department University of Wisconsin-Madison 23-June-2002 Introduction to Condor

Good morning! › Thank you for having me! › I am:  Alain Roy  Computer Science Ph.D. in Quality of Service, with the Globus Project  Working with the Condor Project

Condor Tutorials › Today (Sunday) 10:00-12:30  A general introduction to Condor › Monday 17:00-19:00  Using and administering Condor › Tuesday 17:00-19:00  Using Condor on the Grid

A General Introduction to Condor

The Condor Project (Established 1985) Distributed Computing research performed by a team of about 30 faculty, full-time staff, and students who:  face software engineering challenges in a Unix and Windows environment,  are involved in national and international collaborations,  actively interact with users,  maintain and support a distributed production environment,  and educate and train students.

A Multifaceted Project › Harnessing clusters—opportunistic and dedicated (Condor) › Job management for Grid applications (Condor-G, DaPSched) › Fabric management for Grid resources (Condor, GlideIns, NeST) › Distributed I/O technology (PFS, Kangaroo, NeST) › Job-flow management (DAGMan, Condor) › Distributed monitoring and management (HawkEye) › Technology for Distributed Systems (ClassAd, MW)

Harnessing Computers › We have more than 300 pools with more than 8500 CPUs worldwide. › We have more than 1800 CPUs in 10 pools on our campus. › Established a “complete” production environment for the UW CMS group › Adopted by the “real world” (Galileo, Maxtor, Micron, Oracle, Tigr, … )

The Grid … › Close collaboration and coordination with the Globus Project—joint development, adoption of common protocols, technology exchange, … › Partner in major national Grid R&D² (Research, Development and Deployment) efforts (GriPhyN, iVDGL, IPG, TeraGrid) › Close collaboration with Grid projects in Europe (EDG, GridLab, e-Science)

[Diagram: a layered view with User/Application on top, Grid in the middle, and Fabric (processing, storage, communication) at the bottom.]

[Diagram: the same layers, with Condor sitting both above and below the Globus Toolkit within the Grid layer.]

distributed I/O … › Close collaboration with the Scientific Data Management Group at LBL. › Provide management services for distributed data storage resources › Provide management and scheduling services for Data Placement jobs (DaPs) › Effective, secure and flexible remote I/O capabilities › Exception handling

job flow management … › Adoption of Directed Acyclic Graphs (DAGs) as a common job flow abstraction. › Adoption of the DAGMan as an effective solution to job flow management.

For the Rest of Today › Condor › Condor and the Grid › Related Technologies  DAGMan  ClassAds  Master-Worker  NeST  DaP Scheduler  Hawkeye › Today: Just the “Big Picture”

What is Condor? › Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing facility.  Run lots of jobs over a long period of time,  Not a short burst of “high-performance” › Condor manages both machines and jobs with ClassAd Matchmaking to keep everyone happy

Condor Takes Care of You › Condor does whatever it takes to run your jobs, even if some machines…  Crash (or are disconnected)  Run out of disk space  Don’t have your software installed  Are frequently needed by others  Are far away & managed by someone else

What is Unique about Condor? › ClassAds › Transparent checkpoint/restart › Remote system calls › Works in heterogeneous clusters › Clusters can be:  Dedicated  Opportunistic

What’s Condor Good For? › Managing a large number of jobs  You specify the jobs in a file and submit them to Condor, which runs them all and notifies you when they complete  Mechanisms to help you manage huge numbers of jobs (1000s), all the data, etc.  Condor can handle inter-job dependencies (DAGMan)

What’s Condor Good For? (cont’d) › Robustness  Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion  If an execute machine crashes, you only lose work done since the last checkpoint  Condor maintains a persistent job queue - if the submit machine crashes, Condor will recover  (Story)

What’s Condor Good For? (cont’d) › Giving your job the agility to access more computing resources  Checkpointing allows your job to run on “opportunistic resources” (not dedicated)  Checkpointing also provides “migration” - if a machine is no longer available, move!  With remote system calls, run on systems which do not share a filesystem - You don’t even need an account on a machine where your job executes

Other Condor features › Implement your policy on when the jobs can run on your workstation › Implement your policy on the execution order of the jobs › Keep a log of your job activities

A Condor Pool In Action

A Bit of Condor Philosophy › Condor brings more computing to everyone  A small-time scientist can make an opportunistic pool with 10 machines, and get 10 times as much computing done.  A large collaboration can use Condor to control its dedicated pool with hundreds of machines.

The Condor Idea Computing power is everywhere; we try to make it usable by anyone.

Meet Frieda. She is a scientist. But she has a big problem.

Frieda’s Application … Simulate the behavior of F(x,y,z) for 20 values of x, 10 values of y and 3 values of z (20*10*3 = 600 combinations)  F takes on average 3 hours to compute on a “typical” workstation ( total = 1800 hours )  F requires a “moderate” (128MB) amount of memory  F performs “moderate” I/O - (x,y,z) is 5 MB and F(x,y,z) is 50 MB

I have 600 simulations to run. Where can I get help?

Install a Personal Condor!

Installing Condor › Download Condor for your operating system › Available as a free download from › Not labelled as “Personal” Condor, just “Condor”. › Available for most Unix platforms and Windows NT

So Frieda Installs Personal Condor on her machine… › What do we mean by a “Personal” Condor?  Condor on your own workstation, no root access required, no system administrator intervention needed—easy to set up. › So after installation, Frieda submits her jobs to her Personal Condor…
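To make this concrete, here is a rough sketch (not from the original slides) of what a condor_submit description file for Frieda's 600 runs could look like; the executable name F, the in/out/err file names, and the memory requirement are assumptions based on her problem description:

  # sketch of a submit description file for 600 instances of F
  universe     = vanilla
  executable   = F
  arguments    = $(Process)
  input        = in.$(Process)
  output       = out.$(Process)
  error        = err.$(Process)
  log          = F.log
  requirements = (Memory >= 128)
  queue 600

She would hand this file to condor_submit and watch progress with condor_q; re-linking F with condor_compile and using the standard universe would add checkpointing and remote system calls.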

Personal Condor?! What’s the benefit of a Condor “Pool” with just one user and one machine?

Your Personal Condor will... › Keep an eye on your jobs and will keep you posted on their progress › Keep a log of your job activities › Add fault tolerance to your jobs › Implement your policy on when the jobs can run on your workstation

Frieda is happy until… She realizes she needs to run a post-analysis on each job, after it completes.

Condor DAGMan › Directed Acyclic Graph Manager › DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you. › (e.g., “Don’t run job B until job A has completed successfully.”)

What is a DAG? › A DAG is the data structure used by DAGMan to represent these dependencies. › Each job is a “node” in the DAG. › Each node can have any number of “parent” or “children” nodes – as long as there are no loops! [Diagram: an example DAG with four nodes, Job A through Job D.]
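As an illustration of the input format (a sketch, not taken from the slides), a DAGMan input file for a four-node DAG could look like the following, assuming a diamond-shaped dependency (A before B and C, both before D) and submit-file names of my own choosing:

  # diamond.dag (sketch)
  JOB A a.submit
  JOB B b.submit
  JOB C c.submit
  JOB D d.submit
  PARENT A CHILD B C
  PARENT B C CHILD D

The DAG is started with condor_submit_dag diamond.dag.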

DAGMan Running a DAG › DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies. [Diagram: DAGMan reads the .dag file and places Job A in the Condor job queue.]

DAGMan Running a DAG (cont’d) › DAGMan holds & submits jobs to Condor at the appropriate times. [Diagram: with A complete, DAGMan submits Jobs B and C to the Condor job queue.]

DAGMan Running a DAG (cont’d) › In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG. [Diagram: a job fails (marked with an X); DAGMan writes a Rescue File recording the DAG’s state.]

DAGMan Recovering a DAG › Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG. [Diagram: using the Rescue File, DAGMan resubmits the failed job to the Condor job queue.]

DAGMan Recovering a DAG (cont’d) › Once that job completes, DAGMan will continue the DAG as if the failure never happened. [Diagram: the remaining job, D, is submitted to the Condor job queue.]

DAGMan Finishing a DAG › Once the DAG is complete, the DAGMan job itself is finished, and exits. [Diagram: all four jobs have completed and the Condor job queue is empty.]

Frieda wants more… › She decides to use the graduate students’ computers when they aren’t in use, so she can get done sooner. › In exchange, they can use the Condor pool too.

Frieda’s Condor pool… [Diagram: Frieda’s computer acts as the Central Manager, connected to the graduate students’ desktop computers.]

Frieda’s Pool is Flexible › Since Frieda is a professor, her jobs are preferred. › Frieda doesn’t always have jobs, so now the graduate students have access to more computing power. › Frieda’s pool has enabled more work to be done by everyone.

How does this work? › Frieda submits a job. Condor makes a ClassAd and gives it to the Central Manager:  Owner = “Frieda”  MemoryUsed = 40M  ImageSize=20M  Requirements=(Opsys==“Linux” && Memory > MemoryUsed) › Central Manager collects machine ClassAds:  Memory=128M  Requirements=(ImageSize < 50M)  Rank=(Owner==“Frieda”) › Central Manager finds best match

After a match is found › Central Manager tells both parties about the match › Frieda’s computer and the remote computer cooperate to run Frieda’s job.

Lots of flexibility › Machines can:  Only run jobs when they have been idle for at least 15 minutes—or always run them.  Kick off jobs when someone starts using the computer—or never kick them off. › Jobs can:  Require or prefer certain machines  Use checkpointing, remote I/O, etc…
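As a sketch of how a machine owner expresses such a policy (the thresholds here are illustrative assumptions, not values from the slides), a few lines in that machine's condor_config might read:

  # start jobs only after the keyboard has been idle for 15 minutes
  START   = KeyboardIdle > (15 * 60)
  # suspend a running job as soon as the owner comes back
  SUSPEND = KeyboardIdle < 60
  # a machine that should always run jobs would instead set
  #   START = True and SUSPEND = False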

Happy Day! Frieda’s organization purchased a Beowulf Cluster! › Other scientists in her department have realized the power of Condor and want to share it. › The Beowulf cluster and the graduate student computers can be part of a single Condor pool.

Frieda’s Condor pool… [Diagram: Frieda’s computer as Central Manager, now connected to both the graduate students’ desktop computers and the Beowulf cluster.]

Frieda’s Big Condor Pool › Jobs can prefer to run in the Beowulf cluster by using “Rank”. › Jobs can run just on “appropriate machines” based on:  Memory, disk space, software, etc. › The Beowulf cluster is dedicated. › The student computers are still useful. › Everyone’s computing power is increased.
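In submit-file terms (a sketch; the machine name and memory threshold are assumptions), such preferences and restrictions are expressed with rank and requirements:

  # prefer a dedicated Beowulf node, but accept any machine that qualifies
  rank         = (Machine == "beowulf01.example.edu")
  requirements = (OpSys == "Linux") && (Memory >= 128)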

Frieda collaborates… › She wants to share her Condor pool with scientists from another lab.

Condor Flocking › Condor pools can work cooperatively

Flocking… › Flocking is Condor-specific—you can just link Condor pools together › Jobs usually prefer running in their “native” pool, before running in alternate pools. › What if you want to connect to a non-Condor pool?

Condor-G › Condor-G lets you submit jobs to Grid resources.  Uses Globus job submission mechanisms › You get Condor’s benefits:  Fault tolerance, monitoring, etc. › You get the Grid’s benefits:  Use any Grid resources
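A hedged sketch of what a Condor-G submit file of this era might look like (the gatekeeper host name is an assumption): the job is routed to a Globus GRAM gatekeeper via the globus universe.

  # sketch: submit through Globus GRAM via Condor-G
  universe        = globus
  globusscheduler = gatekeeper.example.edu/jobmanager
  executable      = F
  output          = F.out
  log             = F.log
  queue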

Condor as a Grid Resource › Condor can be a backend for Globus  Submit Globus jobs to Condor resource  The Globus jobs run in the Condor pool

Condor Summary › Condor is useful, even on a single machine or a small pool. › Condor can bring computing power to people that can’t afford a “real” cluster. › Condor can work with dedicated clusters › Condor works with the Grid › Questions so far?

ClassAds › Condor uses ClassAds internally to pair jobs with machines.  Normally, you don’t need to know the details when you use Condor  We saw sample ClassAds earlier. › If you like, you can also use ClassAds in your own projects.

What Are ClassAds? › A ClassAd maps attributes to expressions › Expressions  Constants: strings, numbers, etc.  Expressions: other.Memory > 600M  Lists: { “roy”, “pfc”, “melski” }  Other ClassAds › Powerful tool for grid computing  Semi-structured (you pick your structure)  Matchmaking

ClassAd Example
[
  Type = “Job”;
  Owner = “roy”;
  Universe = “Standard”;
  Requirements = (other.OpSys == “Linux” && other.DiskSpace > 140M);
  Rank = (other.DiskSpace > 300M ? 10 : 1);
  ClusterID = 12314;
  JobID = 0;
  Env = “”;
  …
]
Real ClassAds have more fields than will fit on this slide.

ClassAd Matchmaking
[
  Type = “Job”;
  Owner = “roy”;
  Requirements = (other.OpSys == “Linux” && other.DiskSpace > 140M);
  Rank = (other.DiskSpace > 300M ? 10 : 1);
]
[
  Type = “Machine”;
  OpSys = “Linux”;
  DiskSpace = 500M;
  AllowedUsers = { “roy”, “melski”, “pfc” };
  Requirements = IsMember(other.Owner, AllowedUsers);
]

ClassAds Are Open Source › GNU Library General Public License (LGPL) › Complete source code included  Library code  Test program › Available from: › Version 0.9.3

Who Uses ClassAds? › Condor › European Data Grid › NeST › Web site › …You?

ClassAd User: Condor › ClassAds describe jobs and machines › Matchmaking figures out what jobs run on which machines › DAGMan will soon internally represent DAGs as ClassAds

ClassAd User: EU Datagrid › JDL: ClassAd schema to describe jobs/machines › ResourceBroker: matches jobs to machines

ClassAd User: NeST › NeST is a storage appliance › NeST uses ClassAd collections for persistent storage of:  User Information  File meta-data  Disk Information  Lots (storage space allocations)

ClassAd User: Web Site › Web-based application in Germany › User actions (transitions) are constrained › Constraints expressed through ClassAds

ClassAd Summary › ClassAds are flexible › Matchmaking is powerful › You can use ClassAds independently of Condor:

MW = Master-Worker › Master-Worker Style Parallel Applications  Large problem partitioned into small pieces (tasks);  The master manages tasks and resources (worker pool);  Each worker gets a task, executes it, sends the result back, and repeats until all tasks are done;  Examples: ray-tracing, optimization problems, etc. › On Condor (PVM, Globus, …)  Many opportunities!  Issues (in a Distributed Opportunistic Environment): Resource management, communication, portability; Fault-tolerance, dealing with runtime pool changes.

MW to Simplify the Work! › An OO framework with simple interfaces  3 classes to extend, a few virtual functions to fill;  Scientists can focus on their algorithms. › Lots of Functionality  Handles all the issues in a meta-computing environment;  Provides sufficient info. to make smart decisions. › Many Choices without Changing User Code  Multiple resource managers: Condor, PVM, …  Multiple communication interfaces: PVM, File, Socket, …

MW’s Layered Architecture [Diagram: the application’s classes sit on top of the MW abstract classes; MW talks to a resource manager and a communication layer through its API, and to the underlying infrastructure through the IPI (Infrastructure Provider’s Interface).]

MW’s Runtime Structure 1. User code adds tasks to the master’s Todo list; 2. Each task is sent to a worker (Todo -> Running); 3. The task is executed by the worker; 4. The result is sent back to the master; 5. User code processes the result (can add/remove tasks). [Diagram: a Master Process holding ToDo and Running task lists, connected to a pool of Worker Processes.]

MW Summary › It’s simple:  simple API, minimal user code. › It’s powerful:  works on meta-computing platforms. › It’s inexpensive:  On top of Condor, it can exploit 100s of machines. › It solves hard problems!  Nug30, STORM, …

MW Success Stories › Nug30 solved in 7 days by MW-QAP  Quadratic assignment problem outstanding for 30 years  Utilized 2500 machines from 10 sites (NCSA, ANL, UWisc, Gatech, …)  1009 workers at peak, 11 CPU years › STORM (flight scheduling)  Stochastic programming problem (1000M rows x 13000M cols)  2K times larger than the best sequential program can do  556 workers at peak, 1 CPU year

MW Information ›

Questions So Far?

NeST › Traditional file servers have not evolved  NeST is a 2nd-generation file server › Flexible storage appliance for the grid  Provides local and remote access to data  Easy management of storage resources › User-level software turns machines into storage appliances  Deployable and portable

Research Meets Production › NeST exists at an exciting intersection › Freedom to pursue academic curiosities › Opportunities to discover real user concerns

Very exciting intersection

NeST Supports Lots › A lot is a guaranteed storage allocation. › When you run your large analysis on a Grid, will you have sufficient storage for your results? › Lots ensure you have storage space.

NeST Supports Multiple Protocols › Interoperability between admin domains › NeST currently speaks  GridFTP and FTP  HTTP  NFS (beta)  Chirp › Designed for integration of new protocols

Design structure [Diagram: a common protocol layer (Chirp, FTP, GridFTP, NFS, HTTP) sits above the dispatcher, transfer manager, concurrency management, and storage manager, which in turn sit on the physical network and storage layers; control flow and data flow are shown separately.]

Why not JBOS? › Just a bunch of servers has limitations › NeST advantages over JBOS:  Single config and admin interface  Optimizations across multiple protocols e.g. cache aware scheduling  Management and control of protocols e.g. prefer local users to remote users

Three-Way Matching [Diagram: a Job Ad, a Machine Ad, and a Storage Ad are matched together; the Job Ad refers to NearestStorage, and the Machine Ad knows where NearestStorage is.]

Three-Way ClassAds

Job ClassAd:
[
  Type = “job”;
  TargetType = “machine”;
  Cmd = “sim.exe”;
  Owner = “thain”;
  Requirements = (OpSys == “linux”) && NearestStorage.HasCMSData;
]

Machine ClassAd:
[
  Type = “machine”;
  TargetType = “job”;
  OpSys = “linux”;
  Requirements = (Owner == “thain”);
  NearestStorage = (Name == “turkey”) && (Type == “Storage”);
]

Storage ClassAd:
[
  Type = “storage”;
  Name = “turkey.cs.wisc.edu”;
  HasCMSData = true;
  CMSDataPath = “/cmsdata”;
]

NeST Information ›  Version 0.9 now available (linux only, no NFS)  Solaris and NFS coming soon  Requests welcome

DaP Scheduler › Intelligent scheduling of data transfers

Applications Demand Storage › Database systems › Multimedia applications › Scientific applications  High Energy Physics & Computational Genomics  Currently terabytes, soon petabytes of data

Is Remote Access Good Enough? › Huge amounts of data (mostly in tapes) › Large number of users › Distance / Low Bandwidth › Different platforms › Scalability and efficiency concerns => Middleware is required

Two approaches › Move job/application to the data  Less common  Insufficient computational power on storage site  Not efficient  Does not scale › Move data to the job/application

Move Data to the Job [Diagram: a huge tape library (terabytes) is connected over the WAN to a remote staging area; data then moves to a local storage area (e.g., local disk or a NeST server) on the LAN next to the compute cluster.]

Main Issues › 1. Insufficient local storage area › 2. CPU should not wait much for I/O › 3. Crash Recovery › 4. Different Platforms & Protocols › 5. Make it simple

Data Placement Scheduler (DaPS) › Intelligently Manages and Schedules Data Placement (DaP) activities/jobs › What Condor is to computational jobs, DaPS is to DaP jobs › Just submit a bunch of DaP jobs and then relax.

Supported Protocols › Currently supported:  FTP  GridFTP  NeST (chirp)  SRB (Storage Resource Broker) › Very soon:  SRM (Storage Resource Manager)  GDMP (Grid Data Management Pilot)

Case Study: DAGMan [Diagram: DAGMan reads the .dag file and submits jobs A through D to the Condor job queue.]

Current DAG structure › All jobs are assumed to be computational jobs [Diagram: a DAG of four computational jobs, A through D.]

Current DAG structure › If data transfer to/from remote sites is required, this is performed via pre- and post-scripts attached to each job. [Diagram: the same DAG, with PRE and POST scripts attached to Job B to handle the transfers.]
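In DAGMan's input file, such staging scripts are attached to a node with SCRIPT PRE and SCRIPT POST lines; a small sketch (the script names are assumptions):

  JOB B b.submit
  SCRIPT PRE  B stage_in.sh
  SCRIPT POST B stage_out.sh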

New DAG structure › Add DaP jobs to the DAG structure [Diagram: Job B’s PRE and POST scripts are replaced by explicit DaP nodes: reserve in & out, transfer in, Job B, transfer out, release in, release out.]

New DAGMan Architecture [Diagram: DAGMan reads the .dag file and submits computational jobs to the Condor job queue and DaP jobs to the DaPS job queue.]

DaP Conclusion › More intelligent management of remote data transfer & staging  increase local storage utilization  maximize CPU throughput

Questions So Far?

Hawkeye › Sys admins first need information about what is happening on the machines they are responsible for.  Both current and past  Information must be consolidated and easily accessible  Information must be dynamic

[Diagram: multiple HawkEye Monitoring Agents report to a central HawkEye Manager.]

HawkEye Monitoring Agent [Diagram: on each monitored machine, the Hawkeye_Startup_Agent and Hawkeye_Monitor gather data from /proc, kstat, etc. and send ClassAd updates to the HawkEye Manager.]

Monitor Agent, cont. › Updates are sent periodically  Information does not get stale › Updates also serve as a heartbeat monitor  Know when a machine is down › Out of the box, the update ClassAd has many attributes about the machine of interest for system administration  Current Prototype = about 200 attributes

Custom Attributes [Diagram: the monitoring agent also accepts data from the hawkeye_update_attribute command-line tool.] Create your own HawkEye plugins, or share plugins with others

Role of HawkEye Manager › Store all incoming ClassAds in an indexed resident data structure  Fast response to client tool queries about current state  “Show me all machines with a load average > 10” › Periodically store ClassAd attributes into a Round Robin Database  Store information over time  “Show me a graph with the load average for this machine over the past week” › Speak to clients via CEDAR, HTTP

Web client › Command-line, GUI, Web-based

Running tasks on behalf of the sys admin › Submit your sys admin tasks to HawkEye  Tasks are stored in a persistent queue by the Manager  Tasks can leave the queue upon completion, or repeat after specified intervals  Tasks can have complex interdependencies via DAGMan  Records are kept on which task ran where › Sounds like Condor, eh?  Yes, but simpler…

Run Tasks in Response to Monitoring Information › ClassAd “Requirements” Attribute › Example: Send an alert if a machine is low on disk space or low on swap space  Submit a task with an attribute: Requirements = free_disk < 5 || free_swap < 5 › Example w/ task interdependency: If load average is high and OS=Linux and console is idle, submit a task which runs “top”; if top sees Netscape, submit a task to kill Netscape

Today’s Summary › Condor works on many levels  Small pools can make a big difference  Big pools are for the really big problems  Condor works in the Grid › Condor is assisted by a host of technologies:  ClassAds, Checkpointing, Remote I/O, DAGMan, Master-Worker, NeST, DaPScheduler, Hawkeye

Questions? Comments? › Web: ›