Published by Brent Fox. Modified over 8 years ago.
1
Scheduling & Resource Management in Distributed Systems
Rajesh Rajamani, raj@cs.wisc.edu
http://www.cs.wisc.edu/condor
May 2001
2
Outline
- High-throughput computing and Condor
- Resource management in distributed systems
- Matchmaking
- Current research/Misc.
3
Power of Computing Environments
- Power = Work / Time
- High Performance Computing
  - Fixed amount of work; how much time?
  - Traditional performance metrics: FLOPS, MIPS
  - Response time/latency oriented
- High Throughput Computing
  - Fixed amount of time; how much work?
  - Application-specific performance metrics
  - Throughput oriented
4
In other words …
- HPC: enormous amounts of computing power over relatively short periods of time
  (+) Good for applications under sharp time constraints
- HTC: large amounts of computing power for lengthy periods
  (+) What if you want to simulate 1000 applications on your latest DSP chip design over the next 3 months?
5
The Condor Project
- Goal: to develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources
6
More about Condor
- Started in the late 1980s
- Principal Investigator: Prof. Miron Livny
- Latest version 6.3.0 released
- Supports 14 different platforms (OS + architecture), including Linux, Solaris and WinNT
- Currently employs over 20 students and 5 staff
- We write code, debug, port, publish papers and YES, we also provide support!!!
7
Distributed ownership of resources
- Underutilized: 70% of CPU cycles in a cluster go to waste
- Fragmented: resources are owned by different people
- Use these resources to provide HTC, BUT without impacting the QoS available to the owner
- Achieved by allowing the owner to set an access policy using control expressions
8
Access policy
- Current state of the resource (e.g., keyboard idle for 15 minutes or load average less than 0.2)
- Characteristics of the request (e.g., run only jobs of research associates)
- Time of day/night that jobs can be run
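The three kinds of condition above can be sketched as a single owner-side predicate. This is an illustrative sketch in Python, not Condor's control-expression syntax: the names `MachineState`, `Request`, `owner_allows`, the role string and the `allowed_hours` parameter are all hypothetical, and combining the three conditions with a logical AND is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MachineState:
    """Current state of the resource (hypothetical fields)."""
    keyboard_idle_minutes: float
    load_average: float
    hour: int  # local hour of day, 0-23


@dataclass
class Request:
    """Characteristics of the incoming request (hypothetical fields)."""
    owner: str
    role: str


def owner_allows(state, request,
                 allowed_roles=("research_associate",),
                 allowed_hours=range(24)):
    """Return True only if the owner's policy admits this request right now."""
    # Resource state: thresholds taken from the slide's example.
    resource_idle = state.keyboard_idle_minutes >= 15 or state.load_average < 0.2
    # Request characteristics: only jobs from the allowed roles.
    requester_ok = request.role in allowed_roles
    # Time of day/night at which jobs may run.
    time_ok = state.hour in allowed_hours
    # Assumption: all three conditions must hold together.
    return resource_idle and requester_ok and time_ok


job = Request(owner="raman", role="research_associate")
print(owner_allows(MachineState(20.0, 0.9, 23), job))  # keyboard idle long enough
print(owner_allows(MachineState(1.0, 1.5, 14), job))   # owner is using the machine
```

An owner who only wants to donate cycles at night could pass something like `allowed_hours=range(18, 24)`.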
9
What happens when you submit a job
Participants: Central Manager, Submitting machine, Available resource
(Resources announce their properties periodically.)
1. User submits a job
2. Submitting machine sends the Classad of the job
3. Matchmaker notifies the parties of a match
4. Parties negotiate
10
Important Mechanisms

Mechanism        For
Matchmaking      Resource management
Checkpointing    Saving the state of a job
Bypass           Remote system calls
DAGMAN           Automatic job submission based on a dependency graph
Master-Worker    Exploiting task-level parallelism
11
Condor Architecture
- Manager
  - Collector: database of resources
  - Negotiator: matchmaker
  - Accountant: priority maintenance
- Startds (represent owners of resources)
  - Implement the owner's access control policy
- Schedds (represent customers of the system)
  - Maintain persistent queues of resource requests
12
Condor Architecture, cont.
13
Power of Condor
- Solved the NUG30 quadratic assignment problem (posed in 1968) in 6.9 days, delivering over 96,000 CPU hours by commandeering an average of 650 machines!!!
- Compare this with the RSA-155 problem, posed in 1977 and solved in the late 90s using 300 computers over a period of 7 months. With the same amount of resources as was used to solve NUG30, this could have been done in 2 weeks!!!
- “It (Chorus production) was done in parallel on machines in the computer center running XXX, and on the office machines under Condor. The latter did about 90% of the work!” - Helge MEINHARD (EP division, CERN)
14
Resource management using Matchmaking
- Opportunistic resource exploitation
  - Resource availability is unpredictable
    - Exploit resources as soon as they are available
    - Matchmaking performed continuously
- As opposed to a centralized scheduler, which would have to deal with:
  - Heterogeneity of resources
  - Distributed ownership: widely varying allocation policies
  - Dynamic nature of the cluster
15
Classified Advertisements
- A simple language used by resource providers and customers to express their properties/requirements to the Collector
- Uses a semi-structured data model, so the matchmaker requires no specific schema and works naturally in a heterogeneous environment
- The language folds the query language into the data model: constraints may be expressed as attributes of the classad
- Ads should conform to the advertising protocol
16
Matchmaking with Classads
- 4 steps to managing resources:
  1. Parties requiring matchmaking advertise their characteristics, preferences, constraints, etc.
  2. Advertisements are matched by a Matchmaker
  3. Matched entities are notified
  4. Matched entities establish an allocation through a claiming process, which could include authentication, constraint verification, negotiation of terms, etc.
- The method is symmetric
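The four steps above can be walked through in a few lines of Python. This is a rough sketch under stated assumptions, not Condor's implementation: ads are plain dicts, each `Constraint` is a Python callable taking `(self, other)`, and the function names `matchmake` and `claim` are hypothetical. The greedy first-fit pairing is also an assumption made to keep the example short.

```python
def matchmake(ads):
    """Step 2: pair each job ad with the first free machine ad such that
    BOTH constraints hold (the method is symmetric)."""
    jobs = [ad for ad in ads if ad["Type"] == "Job"]
    machines = [ad for ad in ads if ad["Type"] == "Machine"]
    taken = set()
    matches = []
    for job in jobs:
        for index, machine in enumerate(machines):
            if index in taken:
                continue
            if job["Constraint"](job, machine) and machine["Constraint"](machine, job):
                matches.append((job, machine))
                taken.add(index)
                break
    return matches


def claim(job, machine, current_memory):
    """Step 4: the matched parties re-verify their constraints against the
    machine's current state before establishing the allocation."""
    fresh = dict(machine, Memory=current_memory)
    return job["Constraint"](job, fresh) and fresh["Constraint"](fresh, job)


# Step 1: both sides advertise (attribute values are illustrative).
machine_ad = {
    "Type": "Machine", "Name": "crow.cs.wisc.edu", "Memory": 64,
    "Constraint": lambda self, other: other["Type"] == "Job",
}
job_ad = {
    "Type": "Job", "Owner": "raman", "Memory": 31,
    "Constraint": lambda self, other: (other["Type"] == "Machine"
                                       and other["Memory"] >= self["Memory"]),
}

matches = matchmake([machine_ad, job_ad])

# Step 3: notify the matched entities (here, just record who was paired).
notifications = [(job["Owner"], machine["Name"]) for job, machine in matches]
print(notifications)

# Step 4: claiming succeeds only if the constraints still hold.
print(claim(job_ad, machine_ad, current_memory=64))
print(claim(job_ad, machine_ad, current_memory=16))
```

The last call models an ad that went stale between matching and claiming: the match was valid when it was made, but the claim is refused because the machine no longer has enough memory.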
17
Classad example

Sample classad of a workstation
    [
      Type = "Machine";
      OpSys = "Linux";
      Arch = "INTEL";
      Memory = 256 M;
      Constraint = true;
    ]

Sample classad of a job
    [
      Type = "Job";
      Owner = "run_sim";
      Constraint = other.Type == "Machine" &&
                   Arch == "INTEL" &&
                   OpSys == "Solaris251" &&
                   other.Memory >= Memory;
    ]
18
Example Classad (workstation)
    [
      Type = "Machine";
      Activity = "Idle";
      Name = "crow.cs.wisc.edu";
      Arch = "INTEL";
      OpSys = "Solaris251";
      Kflops = 21893;
      Memory = 64;
      Disk = 323496;  // KB
      DayTime = 36107;
19
Example Classad (contd.)
      ResearchGrp = {"miron", "thain", "john"};
      Untrusted = {"bgates", "lalooyadav", "thief"};
      Rank = member(other.Owner, ResearchGrp) * 10;
      Constraint = !member(other.Owner, Untrusted) && Rank >= 10 ? true : false;  // To prevent malicious users
    ]
20
Example Classad (Submitted job)
    [
      Type = "Job";
      QDate = 886799469;
      Owner = "raman";
      Cmd = "run_sim";
      Iwd = "/usr/raman/sim2";
      Memory = 31;
      Rank = Kflops/1e3 + other.Memory/32;
      Constraint = other.Type == "Machine" &&
                   OpSys == "Solaris251" &&
                   Disk >= 10000 &&
                   other.Memory >= self.Memory;
    ]
21
Matchmaking
- Evaluates expressions in an environment that allows each classad to access attributes of the other, e.g. other.Memory >= self.Memory;
- A reference to a non-existent attribute evaluates to undefined
- Considers a pair of ads incompatible unless their Constraint expressions both evaluate to true
- Rank is then used to choose among compatible matches
- Both parties are notified about the match; the matchmaker could also generate and hand off a session key for authentication and security
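The evaluation rules above can be made concrete with a small sketch in Python. It is not the ClassAd evaluator: `UNDEFINED`, `lookup`, `eq`, `ge`, `and_` and `best_match` are hypothetical helpers, the job constraint and rank are transcribed by hand from the submitted-job ad shown earlier, the machine constraint is simplified to the "untrusted owner" check, and treating a missing attribute as 0 inside the rank is an assumption. The host `big.example` and its values are made up for illustration.

```python
UNDEFINED = object()  # stands in for the classad "undefined" value


def lookup(ad, name):
    """A reference to a non-existent attribute evaluates to undefined."""
    return ad.get(name, UNDEFINED)


def eq(a, b):
    return UNDEFINED if a is UNDEFINED or b is UNDEFINED else a == b


def ge(a, b):
    return UNDEFINED if a is UNDEFINED or b is UNDEFINED else a >= b


def and_(a, b):
    """Three-valued AND (assumed here): false wins, then undefined, then true."""
    if a is False or b is False:
        return False
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return True


def job_constraint(job, machine):
    """other.Type == "Machine" && OpSys == "Solaris251" && Disk >= 10000
    && other.Memory >= self.Memory"""
    return and_(
        and_(eq(lookup(machine, "Type"), "Machine"),
             eq(lookup(machine, "OpSys"), "Solaris251")),
        and_(ge(lookup(machine, "Disk"), 10000),
             ge(lookup(machine, "Memory"), lookup(job, "Memory"))),
    )


def job_rank(job, machine):
    """Kflops/1e3 + other.Memory/32 (missing attributes count as 0 here)."""
    return machine.get("Kflops", 0) / 1e3 + machine.get("Memory", 0) / 32


def machine_constraint(machine, job):
    """Simplified owner policy: reject owners on the Untrusted list."""
    owner = lookup(job, "Owner")
    if owner is UNDEFINED:
        return UNDEFINED
    return owner not in machine.get("Untrusted", ())


def best_match(job, machines):
    """Keep only pairs whose constraints BOTH evaluate to true, then use the
    job's Rank to choose among the compatible machines."""
    compatible = [m for m in machines
                  if job_constraint(job, m) is True
                  and machine_constraint(m, job) is True]
    return max(compatible, key=lambda m: job_rank(job, m)) if compatible else None


untrusted = ["bgates", "lalooyadav", "thief"]
crow = {"Type": "Machine", "Name": "crow.cs.wisc.edu", "OpSys": "Solaris251",
        "Kflops": 21893, "Memory": 64, "Disk": 323496, "Untrusted": untrusted}
big = dict(crow, Name="big.example", Memory=128)                    # hypothetical host
no_disk = {k: v for k, v in crow.items() if k != "Disk"}            # Disk not advertised
job = {"Type": "Job", "Owner": "raman", "Memory": 31}

print(best_match(job, [crow, big, no_disk])["Name"])
print(job_constraint(job, no_disk) is UNDEFINED)
```

The machine that does not advertise `Disk` is never matched: the job's constraint evaluates to undefined rather than true, which is exactly the "incompatible unless both evaluate to true" rule.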
22
Separation of Matching and Claiming
- Weak consistency requirements: claiming allows provider and customer to verify their constraints with respect to their current state
- The claiming protocol could use cryptographic techniques (authentication)
- Principals involved in a match are themselves responsible for establishing, maintaining and servicing a match
23
Work outside the Condor kernel: new challenges
- Multilateral matchmaking: Gangmatching
- I/O regulation and disk allocation: Kangaroo
- User interfaces: ClassadView
- Grid applications: Globus
- Security
24
Summary
- Matchmaking provides a scalable and robust resource management solution for HTC environments
- Classads are used by workstations and jobs
- The Matchmaker forms the match and informs the parties, who in turn invoke the claiming protocol
- The parties are responsible for establishing, maintaining and servicing a match
- Questions?
25
Gangmatch request
    [
      Type = "Job";
      Owner = "raj";
      Cmd = "run_sim";
      Ports = {
        [ Label = "cpu";
          ImageSize = 28 M;
          // Rank and constraints
        ],
        [ Label = "License";
          Host = cpu.Name;
          // Rank and constraints
        ]
      }
    ]
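One way to read the request above is that every port must be bound to a different ad, and that a port's constraints may refer to the ads bound to other ports (here, the license's `Host` must be the matched machine's `Name`). The sketch below illustrates that reading in Python with a brute-force search. It is not the gangmatching algorithm used by Condor: `gangmatch`, the port constraints, the `License` ad and the host `lic-host.example` are all hypothetical, and the search cost grows exponentially with the number of ports.

```python
from itertools import product


def gangmatch(ports, ads):
    """Try every assignment of one distinct ad per port and return the first
    binding {label: ad} that satisfies all port constraints, or None."""
    labels = [port["Label"] for port in ports]
    for combo in product(ads, repeat=len(labels)):
        if len({id(ad) for ad in combo}) < len(combo):
            continue  # the same ad cannot fill two ports
        bound = dict(zip(labels, combo))
        if all(port["Constraint"](bound) for port in ports):
            return bound
    return None


# Hypothetical port constraints standing in for "// Rank and constraints".
ports = [
    {"Label": "cpu",
     "Constraint": lambda b: (b["cpu"]["Type"] == "Machine"
                              and b["cpu"]["Memory"] >= 28)},
    {"Label": "License",
     # Cross-port reference: the license must be valid on the matched machine.
     "Constraint": lambda b: (b["License"]["Type"] == "License"
                              and b["License"]["Host"] == b["cpu"]["Name"])},
]

ads = [
    {"Type": "Machine", "Name": "crow.cs.wisc.edu", "Memory": 64},
    {"Type": "Machine", "Name": "lic-host.example", "Memory": 64},
    {"Type": "License", "App": "run_sim", "Host": "lic-host.example"},
]

binding = gangmatch(ports, ads)
print(binding["cpu"]["Name"], binding["License"]["Host"])
```

Although `crow.cs.wisc.edu` satisfies the cpu port on its own, it is passed over because no license ad names it as host; the match is only valid for the gang as a whole.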