CONDOR CISC 879 Parallel Computation Spring 2003 Preethi Natarajan.

1 CONDOR CISC 879 Parallel Computation Spring 2003 Preethi Natarajan

2 Outline
o Condor: Goals & Overview
o Components
o Matchmaking: ClassAds
o RPC in Condor
o Checkpoint/Restart
o Glance @ APIs

3 Condor: Objectives
o Condor's goal is to hunt for idle resources that can be exploited by user applications
o Performance vs. Throughput
  o High Performance Computing
    o CPU cycles/second under ideal circumstances. "How fast can I run simulation X on this machine?"
  o High Throughput Computing
    o CPU cycles/day (week, month, year?) under non-ideal circumstances. "How many times can I run simulation X in the next month using all available machines?"
o How much computing power is available to me?
o Condor converts collections of distributively owned workstations (different platforms) and dedicated clusters into a distributed high-throughput computing facility

4 Condor - Overview
[Figure labels: Site at which job submitted; Condor Central Manager; Resource found appropriate for the job]
o Customers advertise their job requirements to Condor: Resource Requests
o Resource owners advertise their resource descriptions: Resource Offers
o Condor provides
  o Matchmaking between jobs and resources
  o Notification of matches
  o Transparent access to the job's files during execution
  o Opportunistic scheduling: schedule resources when there is an opportunity
  o Checkpoint (save) job state when the current resource needs to be preempted
  o Restart job from checkpointed state on another available resource

5 Condor Components
CUSTOMER AGENT
o Submits Resource Requests (job requirements) in an application queue ordered by a priority scheme
o Implementation is called the scheduling daemon, schedd
RESOURCE AGENT
o Periodically extracts the resource's state information and updates its Resource Offers
o Implementation is called the startd
[Figure labels: Job submission; Customer Agent (schedd); Resource Agent (startd); Collector, Negotiator, Accountant; Resource Requests; Resource Offers; Notify Match]

6 Condor Components (Cont.)
o CENTRAL MANAGER
  o Is the "kernel" of the Condor pool
  o Collector: periodically collects
    o Resource Offers from startds
    o Resource Requests from schedds
  o Negotiator
    o Matchmaking between Resource Requests and Offers
    o Notification about the match to the entities of the matched pair
    o Claiming Protocol followed between the respective Customer and Resource Agents
  o Accountant: logs resource usage by jobs

7 ClassAds
o A Classified Advertisement (ClassAd) is a flexible and extensible data model used to represent
  o Resource Offers: resource services available
  o Resource Requests: job requirements
  o Access Policies: constraints on resource allocations & requirements
o A ClassAd is a mapping from attribute names to expressions; the model defines the semantics for evaluating the attributes

8 ClassAds - Access Policies
o Resource access policy specifies
  o Who may use the resource
  o How they may use the resource
  o When they may use the resource
o Access Policy Specification in Condor is done using the following ClassAd attributes

Expression Type   Evaluation Semantics for an application
Requirements      True => Application may use resource
Rank              Larger value => Application is highly preferred over others
Suspend           True => Suspend active application
Continue          True => Unsuspend active application
Vacate            True => Active application notified to stop using the resource
Kill              True => Active application should be immediately stopped

[Figure: Policy Specification Example]
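The policy attributes in the table above can be illustrated with a toy model. This is only a sketch: it uses Python callables in place of the ClassAd expression language, and the attribute names KeyboardIdle, LoadAvg and Owner, the thresholds, and the helper `evaluate` are assumptions made for illustration, not taken from the slides.

```python
# Toy model: a ClassAd as a mapping from attribute names to expressions.
# Expressions are Python callables here; real ClassAds use their own
# expression language. All attribute values and thresholds are made up.

def evaluate(ad, name, other):
    """Evaluate attribute `name` of `ad` in the context of a candidate ad `other`."""
    value = ad[name]
    return value(ad, other) if callable(value) else value

machine = {
    "Type": "Machine",
    "KeyboardIdle": 1200,  # seconds since last keystroke (hypothetical attribute)
    "LoadAvg": 0.05,       # hypothetical attribute
    # Requirements: True => application may use the resource
    "Requirements": lambda my, job: job["Owner"] != "untrusted" and my["KeyboardIdle"] > 900,
    # Rank: larger value => application is preferred over others
    "Rank": lambda my, job: 10 if job["Owner"] == "research" else 1,
    # Suspend: True => suspend the active application
    "Suspend": lambda my, job: my["KeyboardIdle"] < 60,
    # Continue: True => unsuspend the active application
    "Continue": lambda my, job: my["KeyboardIdle"] > 300,
    # Vacate: True => notify the application to stop using the resource
    "Vacate": lambda my, job: my["LoadAvg"] > 0.5,
    # Kill: True => stop the application immediately
    "Kill": lambda my, job: my["LoadAvg"] > 2.0,
}

job = {"Type": "Job", "Owner": "research"}

def active_policies(machine_ad, job_ad):
    """Return the names of the policy expressions that currently evaluate to True."""
    names = ("Requirements", "Suspend", "Continue", "Vacate", "Kill")
    return [n for n in names if evaluate(machine_ad, n, job_ad)]

print(active_policies(machine, job))
```

When the owner returns to the keyboard (a small KeyboardIdle value in this sketch), Requirements becomes False and Suspend becomes True, which is the kind of owner-controlled behavior the access policy is meant to express.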

9 Matchmaking
o ClassAd Specification
  o ClassAds describing Resource Requests and Resource Offers with attributes like Type, Rank, Requirements, Vacate, etc.
o Advertising Protocol
  o Entity periodically communicates the ClassAd and "contact address" to the Central Manager (Matchmaker)
o Matchmaking Algorithm
  o Matches based on Requirements specified in the Resource Requests and Offers
  o Match with the highest Rank is selected
  o Use of past resource usage (log) for fair scheduling
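The matchmaking algorithm outlined above can be sketched as follows. This is a simplified illustration, not Condor's negotiator: the ads are plain dictionaries with callable expressions, the attribute names (Arch, Memory, ImageSize, Owner, Name) are hypothetical, and "fair scheduling" is reduced to serving the owner with the least recorded usage first.

```python
def satisfied(a, b):
    """A match requires the Requirements of BOTH ads to hold."""
    return a["Requirements"](a, b) and b["Requirements"](b, a)

def negotiate(requests, offers, usage):
    """Return {job name: machine name}.

    requests / offers: lists of ad dictionaries (hypothetical layout).
    usage: past resource usage per owner; lower usage is served first.
    """
    free = list(offers)
    result = {}
    for req in sorted(requests, key=lambda r: usage.get(r["Owner"], 0)):
        candidates = [o for o in free if satisfied(req, o)]
        if not candidates:
            continue  # no suitable resource in this negotiation cycle
        # Among the compatible offers, select the one the request ranks highest.
        best = max(candidates, key=lambda o: req["Rank"](req, o))
        result[req["Name"]] = best["Name"]
        free.remove(best)
    return result

def machine(name, arch, memory):
    return {"Name": name, "Arch": arch, "Memory": memory,
            "Requirements": lambda my, job: job["ImageSize"] <= my["Memory"]}

def job(name, owner, image_size):
    return {"Name": name, "Owner": owner, "ImageSize": image_size,
            "Requirements": lambda my, m: m["Arch"] == "INTEL",
            "Rank": lambda my, m: m["Memory"]}

offers = [machine("m1", "INTEL", 512), machine("m2", "INTEL", 2048),
          machine("m3", "SPARC", 4096)]
requests = [job("jobA", "alice", 256), job("jobB", "bob", 1024)]

print(negotiate(requests, offers, {"alice": 100, "bob": 5}))
```

With the usage figures shown, bob's job is considered first and receives m2, and alice's job falls back to m1. Swapping the usage figures lets alice's job take m2, leaving bob's job unmatched in that cycle.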

10 Matchmaking (Cont.)
o Matchmaking Protocol
  o Match notified to the two parties that were matched @ their "contact address" along with the matched ClassAd
  o (Possible) Authentication via hand-off of a session key
o Claiming Protocol
  o Match was a mutual introduction of the 2 parties
  o Customer contacts Resource directly to negotiate regarding resource allocation
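A minimal sketch of the notification and claiming steps above, assuming in-process method calls stand in for network messages. The class names, the contact address string, and the way the session key is generated and checked are all assumptions for illustration; the slides only say that a session key may be handed off.

```python
import hmac
import secrets

class Resource:
    """Stand-in for a Resource Agent (hypothetical interface)."""
    def __init__(self, name):
        self.name = name
        self.expected_key = None
        self.claimed_by = None

    def notify_match(self, customer_address, session_key):
        # The matchmaker introduces the customer and hands over the key.
        self.expected_key = session_key

    def claim(self, customer_address, session_key):
        # The customer contacts the resource directly; the resource decides.
        if self.claimed_by is not None or self.expected_key is None:
            return False
        if not hmac.compare_digest(session_key, self.expected_key):
            return False
        self.claimed_by = customer_address
        return True

class Customer:
    """Stand-in for a Customer Agent (hypothetical interface)."""
    def __init__(self, address):
        self.address = address
        self.pending = None

    def notify_match(self, resource, session_key):
        self.pending = (resource, session_key)

    def claim(self):
        resource, key = self.pending
        return resource.claim(self.address, key)

def matchmaker_notify(customer, resource):
    """Notify both parties of the match and hand each the same session key."""
    key = secrets.token_hex(16)
    customer.notify_match(resource, key)
    resource.notify_match(customer.address, key)

customer = Customer("submit.example:9618")  # made-up contact address
resource = Resource("exec1")
matchmaker_notify(customer, resource)
print(customer.claim(), resource.claimed_by)
```

The point of the split is that the matchmaker only introduces the two parties; the allocation itself is settled between them, so the resource can still refuse a claim.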

11 After Match Notification...
1. The schedd on the initiating (submit) machine first spawns a shadow process. The shadow process acts as the shadow of the job that will be executed on the remote machine
2. The shadow negotiates with the startd of the remote machine to run the job
3. If successful, the startd on the remote machine spawns a Starter, which
  o Starts the remote job by spawning it
  o Manages the execution of the remote job by communicating with the shadow

12 Exploiting RPC
o The remote machine agrees to run the submit machine's job on its workstation, but the job's files are physically located at the submit machine
o open(), read(), write() calls in the job's code are executed at the submit machine as RPCs
o condor_syscall_lib has to be linked to these jobs
o If the files can be accessed via NFS/AFS, that is preferred over RPC when it is more efficient. The open() routine in condor_syscall_lib talks with the shadow at the submit machine and makes these decisions
[Figure: on the Remote Machine, the Starter process spawns the remote job's process, which calls open(jobfile1); on the Submit Machine, the Shadow process for the job and the Local File System; 'jobfile1' is accessed via NFS/AFS or RPC]
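The decision made inside open() can be sketched as below. This is a toy illustration, not condor_syscall_lib: the Shadow and RemoteIO classes, the `locate` and `rpc_read` methods, and the notion of "shared roots" are hypothetical, and a direct method call stands in for the RPC over the network.

```python
import os
import tempfile

class Shadow:
    """Runs on the submit machine and serves the job's file requests."""
    def __init__(self, submit_dir, shared_roots=()):
        self.submit_dir = submit_dir
        self.shared_roots = tuple(shared_roots)  # paths also visible remotely

    def locate(self, name):
        """Tell the remote job how it should reach the file."""
        path = os.path.join(self.submit_dir, name)
        shared = any(path.startswith(root) for root in self.shared_roots)
        return ("shared_fs" if shared else "rpc"), path

    def rpc_read(self, path):
        # Executed on the submit machine on behalf of the remote job.
        with open(path, "rb") as f:
            return f.read()

class RemoteIO:
    """Stand-in for the library linked into the job on the remote machine."""
    def __init__(self, shadow):
        self.shadow = shadow
        self.decisions = []

    def read_file(self, name):
        how, path = self.shadow.locate(name)
        self.decisions.append(how)
        if how == "shared_fs":
            # File is reachable through a shared file system: access it directly.
            with open(path, "rb") as f:
                return f.read()
        # Otherwise ship the call back to the shadow.
        return self.shadow.rpc_read(path)

def demo():
    with tempfile.TemporaryDirectory() as submit_dir:
        with open(os.path.join(submit_dir, "jobfile1"), "wb") as f:
            f.write(b"input data")
        via_rpc = RemoteIO(Shadow(submit_dir))
        via_nfs = RemoteIO(Shadow(submit_dir, shared_roots=[submit_dir]))
        return (via_rpc.read_file("jobfile1"), via_rpc.decisions,
                via_nfs.read_file("jobfile1"), via_nfs.decisions)

print(demo())
```

Either way the job's code only asks for "jobfile1"; where the bytes come from is decided behind the call, which is what makes the file access transparent to the job.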

13 Checkpoint
o To checkpoint an executing program is to take a snapshot of its current state in such a way that the program can be restarted from that state at a later time, possibly on a different resource
o Provides
  o Preemptive-resume scheduling
  o Fault tolerance, when checkpointing is done periodically
o In Condor, checkpointing running jobs is optional. If it is needed, the job should be linked with condor_syscall_lib

14 Checkpointing in Condor
o Implemented in condor_syscall_lib as a signal handler
o When Condor sends a signal to checkpoint, the handler saves the process's state information in a checkpoint file
  o From core: contents of the process's uarea, data and stack segments
  o From executable: symbol and debugging info, initialized data, text

15 Checkpointing & Restart
o Shadow sends the latest checkpoint file to the new Starter during restart
o The Starter reads the job state from the checkpoint file and the execution continues
o Starter periodically sends a checkpoint signal to the executing job
o condor_syscall_lib makes the job dump core and saves the job state in the checkpoint file
o Checkpoint file temporarily stored @ Remote Machine
o Starter transfers the latest checkpoint file to the shadow when the job is vacated
[Figure: on the Remote Machine, the Starter process sends a checkpoint signal to the remote job, and code in condor_syscall_lib saves process state information to a checkpoint file; the checkpoint file is transferred to the Shadow process on the Submit Machine (Local File System) when the job is vacated, and transferred back when the job is restarted]
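The signal-driven checkpoint and restart cycle described above can be sketched in a few lines. This is an analogy under stated assumptions, not Condor's mechanism: it saves a small Python dictionary with pickle instead of a core image, it assumes a POSIX system, and the choice of SIGUSR1, the Job class, and the file name are made up for illustration.

```python
import os
import pickle
import signal
import tempfile

class Job:
    """A toy job whose whole state is one dictionary."""
    def __init__(self, ckpt_path, state=None):
        self.ckpt_path = ckpt_path
        self.state = state if state is not None else {"i": 0, "total": 0}
        # Install the checkpoint handler (SIGUSR1 is an arbitrary choice here).
        signal.signal(signal.SIGUSR1, self._on_checkpoint)

    def _on_checkpoint(self, signum, frame):
        # Write to a temporary file first so a partial write never
        # replaces the previous good checkpoint.
        tmp = self.ckpt_path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(self.state, f)
        os.replace(tmp, self.ckpt_path)

    def step(self):
        i = self.state["i"] + 1
        # Rebind the state in one assignment so the handler never
        # observes a half-updated snapshot.
        self.state = {"i": i, "total": self.state["total"] + i}

    @classmethod
    def restart(cls, ckpt_path):
        """Resume from the latest checkpoint file, possibly in a new process."""
        with open(ckpt_path, "rb") as f:
            return cls(ckpt_path, pickle.load(f))

def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "job.ckpt")
        job = Job(path)
        for _ in range(3):
            job.step()
        os.kill(os.getpid(), signal.SIGUSR1)  # the "Starter" asks for a checkpoint
        for _ in range(2):
            job.step()                        # this work is lost if the job is vacated now
        resumed = Job.restart(path)
        resumed.step()                        # execution continues from the snapshot
        return resumed.state

print(demo())
```

Work done after the last checkpoint is repeated after a restart, which is why periodic checkpointing bounds the amount of lost computation.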

16 CONDOR APIs - Glance
o Compile as a Condor job
    gcc -c hello.c -o hello.o
    condor_compile gcc hello.o -o hello
o Submit a Condor job
    cat > submit.hello
    Executable = hello
    Universe = standard
    Output = hello.out
    Log = hello.log
    Queue
    condor_submit submit.hello    (creates the Job ClassAd)

17 CONDOR APIs (Cont.)
o condor_master: starts other daemons
o condor_vacate: vacate jobs running on specified hosts
o condor_status: display status of the Condor pool
o condor_rm: remove a Condor job from the queue
o More commands @ http://www.cs.wisc.edu/condor/manual/v6.4/

18 REFERENCES
o Condor Project Home Page: http://www.cs.wisc.edu/condor/
o Research Publications on Condor: http://www.cs.wisc.edu/condor/publications.html

