Scheduling & Resource Management in Distributed Systems Rajesh Rajamani, May 2001.


1 Scheduling & Resource Management in Distributed Systems Rajesh Rajamani, raj@cs.wisc.edu http://www.cs.wisc.edu/condor May 2001

2 Outline
- High-throughput computing and Condor
- Resource management in distributed systems
- Matchmaking
- Current research/Misc.

3 Power of Computing Environments
- Power = Work / Time
- High Performance Computing
  - Fixed amount of work; how much time?
  - Traditional performance metrics: FLOPS, MIPS
  - Response time/latency oriented
- High Throughput Computing
  - Fixed amount of time; how much work?
  - Application-specific performance metrics
  - Throughput oriented

4 In other words …
- HPC: Enormous amounts of computing power over relatively short periods of time
  (+) Good for applications under sharp time constraints
- HTC: Large amounts of computing power for lengthy periods
  (+) What if you want to simulate 1000 applications on your latest DSP chip design over the next 3 months?

5 The Condor Project
- Goal: To develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources

6 More about Condor
- Started in the late 80’s
- Principal Investigator: Prof. Miron Livny
- Latest version 6.3.0 released
- Supports 14 different platforms (OS + Arch) including Linux, Solaris and WinNT
- Currently employs over 20 students and 5 staff
- We write code, debug, port, publish papers and YES, we also provide support!

7 Distributed ownership of resources
- Underutilized: 70% of CPU cycles in a cluster go to waste
- Fragmented: Resources owned by different people
- Use these resources to provide HTC, BUT without impacting the QoS available to the owner
- Achieved by allowing the owner to set an access policy using control expressions

8 Access policy
- Current state of the resource (e.g., keyboard idle for 15 minutes or load average less than 0.2)
- Characteristics of the request (run only jobs of research associates)
- Time of day/night that jobs can be run
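The three kinds of conditions above can be combined into a single owner policy. The following is a minimal Python sketch of that idea, not Condor's actual control-expression syntax; the class names, the `research_associate` group label, and the 18:00 to 08:00 window are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MachineState:
    keyboard_idle_minutes: float
    load_average: float
    hour_of_day: int  # 0-23, local time


@dataclass
class JobRequest:
    owner: str
    owner_group: str  # hypothetical attribute describing the requester


def owner_policy_allows(state: MachineState, request: JobRequest) -> bool:
    # Current state of the resource: thresholds taken from the slide.
    resource_idle = state.keyboard_idle_minutes >= 15 or state.load_average < 0.2
    # Characteristics of the request: only research associates may run jobs.
    trusted_requester = request.owner_group == "research_associate"
    # Time of day: a hypothetical off-hours window of 18:00 to 08:00.
    off_hours = state.hour_of_day >= 18 or state.hour_of_day < 8
    return resource_idle and trusted_requester and off_hours
```

For example, a research associate's job would be admitted at 22:00 on a machine whose keyboard has been idle for 20 minutes, and refused at noon on the same machine.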

9 What happens when you submit a job
[Diagram: Central Manager, Submitting machine, Available resource]
Resources announce their properties periodically.
1. User submits a job
2. Submitting machine sends the classad of the job
3. Matchmaker notifies the parties of a match
4. Parties negotiate

10 Important Mechanisms
Mechanism        For
Matchmaking      Resource management
Checkpointing    Saving the state of a job
Bypass           Remote system calls
DAGMAN           Automatic job submission based on dependency graph
Master-Worker    Exploiting task-level parallelism
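To illustrate what "automatic job submission based on dependency graph" means, here is a small Python sketch that groups jobs into waves, where each wave can be submitted once the previous one has finished. It uses the standard-library `graphlib` module and is not DAGMAN's input format or implementation; the job names and the `submission_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def submission_waves(dependencies):
    """Group jobs into waves; every job in a wave has all its parents done.

    `dependencies` maps each job name to the set of jobs it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose parents are all done
        waves.append(ready)
        sorter.done(*ready)  # pretend the whole wave ran to completion
    return waves


# Hypothetical diamond-shaped workflow: A feeds B and C, which both feed D.
deps = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
waves = submission_waves(deps)
```

A real scheduler would mark jobs done as they complete individually rather than wave by wave; the grouping here only keeps the sketch short.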

11 Condor Architecture
- Manager
  - Collector: Database of resources
  - Negotiator: Matchmaker
  - Accountant: Priority maintenance
- Startds (represent owners of resources)
  - Implement owner's access control policy
- Schedds (represent customers of the system)
  - Maintain persistent queues of resource requests

12 Condor Architecture, cont.

13 Power of Condor
- Solved the NUG30 quadratic assignment problem, posed in 1968, over a period of 6.9 days, delivering over 96,000 CPU hours by commandeering an average of 650 machines!
- Compare this with the RSA-155 problem, posed in 1977 and solved using 300 computers (over a period of 7 months) in the late 90s. If you were to use the same amount of resources as that used to solve NUG30, this could have been done in 2 weeks!
- “It (Chorus production) was done in parallel on machines in the computer center running XXX, and on the office machines under Condor. The latter did about 90% of the work!” - Helge MEINHARD (EP division, CERN)

14 Resource management using Matchmaking
- Opportunistic resource exploitation
  - Resource availability is unpredictable
  - Exploit resources as soon as they are available
  - Matchmaking performed continuously
- As opposed to a centralized scheduler, which would have to deal with:
  - Heterogeneity of resources
  - Distributed ownership: widely varying allocation policies
  - Dynamic nature of the cluster

15 Classified Advertisements
- A simple language used by resource providers and customers to express their properties/requirements to the Collector
- Uses a semi-structured data model, so no specific schema is required by the matchmaker, allowing it to work naturally in a heterogeneous environment
- The language folds the query language into the data model: constraints may be expressed as attributes of the classad
- Ads should conform to the advertising protocol

16 Matchmaking with Classads
- 4 steps to managing resources:
  1. Parties requiring matchmaking advertise their characteristics, preferences, constraints, etc.
  2. Advertisements are matched by a Matchmaker
  3. Matched entities are notified
  4. Matched entities establish an allocation through a claiming process, which could include authentication, constraint verification, negotiation of terms, etc.
- Method is symmetric

17 Classad example
Sample classad of a workstation:
[
  Type = "Machine";
  OpSys = "Linux";
  Arch = "INTEL";
  Memory = 256 M;
  Constraint = true;
]
Sample classad of a job:
[
  Type = "Job";
  Owner = "run_sim";
  Constraint = other.Type == "Machine" && Arch == "INTEL" && OpSys == "Solaris251" && other.Memory >= Memory;
]

18 Example Classad (workstation)
[
  Type = "Machine";
  Activity = "Idle";
  Name = "crow.cs.wisc.edu";
  Arch = "INTEL";
  OpSys = "Solaris251";
  Kflops = 21893;
  Memory = 64;
  Disk = 323496; // KB
  DayTime = 36107;

19 Example Classad (contd.)
  ResearchGrp = {"miron", "thain", "john"};
  Untrusted = {"bgates", "lalooyadav", "thief"};
  Rank = member(other.Owner, ResearchGrp) * 10;
  Constraint = !member(other.Owner, Untrusted) && Rank >= 10 ? true : false; // To prevent malicious users
]

20 Example Classad (Submitted job)
[
  Type = "Job";
  QDate = 886799469;
  Owner = "raman";
  Cmd = "run_sim";
  Iwd = "/usr/raman/sim2";
  Memory = 31;
  Rank = Kflops/1e3 + other.Memory/32;
  Constraint = other.Type == "Machine" && OpSys == "Solaris251" && Disk >= 10000 && other.Memory >= self.Memory;
]

21 Matchmaking
- Evaluates expressions in an environment that allows each classad to access attributes of the other, e.g. other.Memory >= self.Memory;
- A reference to a non-existent attribute evaluates to undefined
- Considers a pair of ads incompatible unless their Constraint expressions both evaluate to true
- Rank is then used to choose among compatible matches
- Both parties are notified about the match; the matchmaker could generate and hand off a session key for authentication and security
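The evaluation rules above can be sketched in Python using the workstation and job ads from slides 18 to 20. This is a simplified model and not the ClassAd evaluator: expressions are ordinary Python functions of `(me, other)`, `Undefined` only approximates the language's undefined semantics, and the two `.example` machines, the helper names, and the job owned by "thain" are hypothetical additions.

```python
class Undefined:
    """Simplified stand-in for the classad 'undefined' value."""

    def _propagate(self, _other):
        return self  # any comparison involving undefined stays undefined

    __eq__ = __ne__ = __lt__ = __le__ = __gt__ = __ge__ = _propagate

    def __bool__(self):
        return False  # never counts as true


UNDEFINED = Undefined()


class Ad(dict):
    """An ad is a dict; a reference to a missing attribute yields UNDEFINED."""

    def __missing__(self, key):
        return UNDEFINED


def machine_rank(me, other):
    return (other["Owner"] in me["ResearchGrp"]) * 10


def machine_constraint(me, other):
    return other["Owner"] not in me["Untrusted"] and machine_rank(me, other) >= 10


def job_rank(me, other):
    return other["Kflops"] / 1e3 + other["Memory"] / 32


def job_constraint(me, other):
    return (other["Type"] == "Machine" and other["OpSys"] == "Solaris251"
            and other["Disk"] >= 10000 and other["Memory"] >= me["Memory"])


def make_machine(name, **attrs):
    return Ad(Type="Machine", Name=name, Arch="INTEL", OpSys="Solaris251",
              ResearchGrp=["miron", "thain", "john"],
              Untrusted=["bgates", "lalooyadav", "thief"],
              Rank=machine_rank, Constraint=machine_constraint, **attrs)


def make_job(owner):
    return Ad(Type="Job", Owner=owner, Memory=31,
              Rank=job_rank, Constraint=job_constraint)


def compatible(a, b):
    # Both Constraint expressions must evaluate to exactly true.
    return a["Constraint"](a, b) is True and b["Constraint"](b, a) is True


def best_match(job, machines):
    # Among compatible machines, pick the one the job ranks highest.
    candidates = [m for m in machines if compatible(job, m)]
    return max(candidates, key=lambda m: job["Rank"](job, m), default=None)


machines = [
    make_machine("crow.cs.wisc.edu", Kflops=21893, Memory=64, Disk=323496),
    make_machine("big.example", Kflops=15000, Memory=128, Disk=500000),  # hypothetical
    make_machine("nodisk.example", Kflops=30000, Memory=256),  # hypothetical, no Disk
]
```

In this model a job owned by "thain" is matched to crow.cs.wisc.edu, while the job owned by "raman" finds no match because "raman" is not in ResearchGrp and the machine's Constraint requires Rank >= 10. The machine without a Disk attribute is never compatible, since the job's Disk test evaluates to undefined rather than true.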

22 Separation of Matching and Claiming
- Weak consistency requirements: claiming allows provider and customer to verify their constraints with respect to their current state
- Claiming protocol could use cryptographic techniques (authentication)
- Principals involved in a match are themselves responsible for establishing, maintaining and servicing a match
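Because the matchmaker works from advertised, possibly stale state, the claim step re-checks both constraints against current state. The Python sketch below shows only that re-verification under assumed names (`claim`, `constraints_hold`, lowercase attribute keys); it omits authentication and is not Condor's claiming protocol.

```python
def constraints_hold(job_ad, machine_ad):
    """Both parties' constraints must evaluate to true."""
    return (job_ad["constraint"](job_ad, machine_ad) is True
            and machine_ad["constraint"](machine_ad, job_ad) is True)


def claim(job_ad, advertised_machine_ad, current_machine_state):
    """Re-verify a match at claim time.

    The match was made from the advertised ad; `current_machine_state`
    holds whatever has changed on the machine since it was advertised.
    """
    current_ad = {**advertised_machine_ad, **current_machine_state}
    return "claimed" if constraints_hold(job_ad, current_ad) else "rejected"


# Hypothetical ads: the machine only accepts work while it is idle.
job = {"owner": "raman", "memory": 31,
       "constraint": lambda me, other: other["memory"] >= me["memory"]}
advertised = {"memory": 64, "activity": "Idle",
              "constraint": lambda me, other: me["activity"] == "Idle"}
```

If the owner returns between the match and the claim, calling `claim(job, advertised, {"activity": "Busy"})` yields "rejected", and the job would simply go back to matchmaking.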

23 Work outside the Condor kernel: New challenges
- Multilateral matchmaking: Gangmatching
- I/O regulation and disk allocation: Kangaroo
- User interfaces: ClassadView
- Grid applications: Globus
- Security

24 Summary
- Matchmaking provides a scalable and robust resource management solution for HTC environments
- Classads are used by workstations and jobs
- Matchmaker forms the match and informs the parties, who in turn invoke the claiming protocol
- The parties are responsible for establishing, maintaining and servicing a match
- Questions?

25 Gangmatch request
[
  Type = "Job";
  Owner = "raj";
  Cmd = "run_sim";
  Ports = {
    [
      Label = "cpu";
      ImageSize = 28 M;
      // Rank and constraints
    ],
    [
      Label = "License";
      Host = cpu.Name;
      // Rank and constraints
    ]
  }
]
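One way to read the request above is that each port must be filled by a different offer, and a later port may refer to an earlier one (the License port refers to cpu.Name). The Python sketch below models that reading with a small backtracking search. It is an illustration only, not the gangmatching algorithm used by Condor; since the slide elides the ports' rank and constraints, the predicates `wants_cpu` and `wants_license`, the offers, and the mapping of ImageSize to a Memory test are all hypothetical.

```python
def gangmatch(ports, offers):
    """Bind every port to a distinct offer, or return None.

    `ports` is an ordered list of (label, accepts) pairs, where
    accepts(offer, bound) may inspect offers bound to earlier ports.
    """
    def solve(i, bound, used):
        if i == len(ports):
            return dict(bound)
        label, accepts = ports[i]
        for idx, offer in enumerate(offers):
            if idx in used or not accepts(offer, bound):
                continue
            bound[label] = offer
            result = solve(i + 1, bound, used | {idx})
            if result is not None:
                return result
            del bound[label]  # backtrack and try the next offer
        return None

    return solve(0, {}, frozenset())


def wants_cpu(offer, bound):
    # Hypothetical constraint: a machine with room for a 28 MB image.
    return offer.get("Type") == "Machine" and offer.get("Memory", 0) >= 28


def wants_license(offer, bound):
    # The license must be valid on the host chosen for the cpu port.
    return offer.get("Type") == "License" and offer.get("Host") == bound["cpu"]["Name"]


ports = [("cpu", wants_cpu), ("License", wants_license)]
offers = [
    {"Type": "Machine", "Name": "a.example", "Memory": 64},
    {"Type": "Machine", "Name": "b.example", "Memory": 64},
    {"Type": "License", "Host": "b.example"},
]
match = gangmatch(ports, offers)
```

Here the search first tries a.example for the cpu port, finds no license for that host, backtracks, and settles on b.example together with its license.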

