Slot Acquisition. Presenter: Daniel Nurmi.


1 Slot Acquisition
Presenter: Daniel Nurmi

2 Scope
One aspect of VGDL request is the time ‘slot’ when resources are needed
–Earliest time when resource set is needed
–Maximum duration resource set will be used
Three classes of resources
–dedicated: always available
–batch controlled: lag before available
–advanced reservation: guaranteed availability in the future

3 Acquisition Routines
Each class of resource needs the following (logical) routines
–Prob = Query(cluster, nodes, walltime, starttime)
–Id = BindInit(cluster, nodes, walltime, starttime, success_prob)
–Status = Check(id)
–Status = Install(id)
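The four logical routines can be read as a common interface that each resource class implements. The sketch below is one possible rendering in Python, not the presenter's code: the names `SlotAcquirer`, `query`, `bind_init`, `check`, and `install`, the argument types, and the units (seconds) are all assumptions made for illustration.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SlotAcquirer(Protocol):
    """Hypothetical interface mirroring the four logical routines on the slide."""

    def query(self, cluster: str, nodes: int, walltime: int, starttime: int) -> float:
        """Return the probability (0.0 to 1.0) that the slot can be obtained."""
        ...

    def bind_init(self, cluster: str, nodes: int, walltime: int,
                  starttime: int, success_prob: float) -> str:
        """Start binding the slot and return an id for later calls."""
        ...

    def check(self, slot_id: str) -> bool:
        """Report whether the bind identified by slot_id is ready."""
        ...

    def install(self, slot_id: str) -> bool:
        """Install the glide-in for the slot and report success."""
        ...
```

Any class that provides these four methods satisfies the protocol, so a dedicated, batch-controlled, or advanced-reservation backend could be swapped in without changing the caller.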

4 Acquisition Procedure
(Diagram: the Slot Manager calls the acquisition routines in order)
–Query(): is the slot available? Returns a probability
–BindInit(): initiate bind
–Check(): bind yet? Returns true/false/abort
–Install(): install PBS glide-in when time
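The procedure on this slide can be sketched as a small driver loop. This is a minimal illustration under stated assumptions: the `BindStatus` enum, the `acquire_slot` function, the `min_prob` acceptance threshold, and the polling interval are all hypothetical and do not come from the slides, which only name the four routines and the true/false/abort outcome of the check.

```python
import time
from enum import Enum


class BindStatus(Enum):
    """Assumed encoding of the slide's true/false/abort check result."""
    BOUND = "true"
    PENDING = "false"
    ABORT = "abort"


def acquire_slot(routines, request, min_prob, poll_seconds=30.0, sleep=time.sleep):
    """Drive one slot through Query, BindInit, Check, and Install.

    `routines` is any object with query/bind_init/check/install methods.
    `request` is a (cluster, nodes, walltime, starttime) tuple.
    Returns the slot id on success, or None if refused or aborted.
    """
    prob = routines.query(*request)
    if prob < min_prob:  # assumed policy: refuse slots that look too unlikely
        return None
    slot_id = routines.bind_init(*request, prob)
    while True:
        status = routines.check(slot_id)
        if status is BindStatus.ABORT:
            return None
        if status is BindStatus.BOUND:
            break
        sleep(poll_seconds)  # not bound yet: wait, then poll again
    return slot_id if routines.install(slot_id) else None
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.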

5 Dedicated
Query
–NOP (prob = 1)
BindInit
–NOP (always true)
Check
–NOP (always true)
Install
–Installs PBS glide-in

6 Advanced Reservation
Query
–Makes request to advanced reservation system
–Prob = 1 if we can make the reservation
–Prob = 0 if we cannot
BindInit
–Make adv. res. request
Check
–NOP (always return true)
Install
–Submit PBS glide-in installation job to specialized adv. res. queue

7 Batch Controlled
Query
–Performs an algorithm to determine probability of meeting the slot requirement through regular batch queue
BindInit
–Use values calculated from ‘query’ for job dimensions and time to wait before submission
Check
–When ‘time to wait’ has elapsed, return true
Install
–Submit PBS glide-in installation job
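The batch-controlled Check routine is essentially a timer on the wait computed at query time. A minimal sketch, assuming a hypothetical `BatchBinding` object that stores the wait and the adjusted walltime (neither name appears in the slides):

```python
import time


class BatchBinding:
    """Hypothetical record of a batch-controlled bind in progress."""

    def __init__(self, wait_seconds, real_walltime, clock=time.monotonic):
        self._clock = clock
        # Moment at which the glide-in job should be submitted.
        self._ready_at = clock() + wait_seconds
        # Walltime to request, as computed by the query step.
        self.real_walltime = real_walltime

    def check(self):
        """Return True once the 'time to wait' has elapsed."""
        return self._clock() >= self._ready_at
```

The clock is injected so the behaviour can be exercised without sleeping.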

8 The Algorithm
Routines
–‘deadline’ is ‘seconds from now’
–P = bqp_pred(machine, nodes, walltime, deadline)
Algorithm
Preq = 0.75
past = 0
P = bqp_pred(M, N, W+D, D)
While((D-past) > 0) {
  if (P ~ Preq) {
    wait = past
    real_walltime = W+(D-past)
  }
  past += 30
  P = bqp_pred(M, N, W+(D-past), (D-past))
}
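The pseudocode above can be sketched in Python as follows. Two assumptions are made here and are not confirmed by the slides: the test `P ~ Preq` is read as "P is at least Preq", and the predictor is passed in as a plain callable (`predict`) standing in for `bqp_pred`, whose real interface is not shown. The function name `plan_submission` is hypothetical.

```python
def plan_submission(predict, machine, nodes, walltime, deadline,
                    p_req=0.75, step=30):
    """Scan submission delays and keep the last one that is still acceptable.

    predict(machine, nodes, walltime, deadline) -> probability of the job
    starting within `deadline` seconds (stand-in for bqp_pred).
    Returns (wait, real_walltime), or None if no delay meets p_req.
    """
    best = None
    past = 0
    while deadline - past > 0:
        remaining = deadline - past
        # The job must cover the leftover lead time plus the slot itself.
        p = predict(machine, nodes, walltime + remaining, remaining)
        if p >= p_req:  # assumed meaning of "P ~ Preq"
            best = (past, walltime + remaining)
        past += step
    return best
```

Because later iterations overwrite earlier ones, the result is the latest acceptable submission time, which keeps the requested walltime as short as the target probability allows.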

9 Batch Experiment
75% is the target probability
356 total requests
257 total batch submissions
–99 requests resulted in initial ‘not possible’ response
192 slots successfully acquired
257 × 0.75 ≈ 193
Choose last acceptable time to minimize waste
(Timeline figure: now, 0.75, submit time)
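The figures on this slide can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones reported above.

```python
total_requests = 356
batch_submissions = 257
slots_acquired = 192
target_probability = 0.75

# Requests refused up front with a 'not possible' response.
refused = total_requests - batch_submissions

# Successes expected if exactly 75% of submissions meet their slot.
expected_successes = batch_submissions * target_probability

# Fraction of submissions that actually acquired their slot.
observed_rate = slots_acquired / batch_submissions

print(refused, expected_successes, round(observed_rate, 3))
```

The observed success rate sits just under the 75% target, which is consistent with the slide's comparison of 192 acquired slots against roughly 193 expected.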

10 Near Term Experiments
Try other probability levels
Try other deadlines

11 PBS Glide-in
Basic batch queue system assumes one-to-one mapping of job to resource set (slot)
Idea: once a single ‘slot’ has been acquired, install ‘personal’ res. manager and scheduler within it in order to support multiple jobs within single slot
Have instrumented torque (PBS) to fulfill this task
–Plays the role that Condor would play as infrastructure scheduler
–PBS “glide-in”
–Simpler, supports MPI, etc.

12 PBS Overview
(Diagram: PBS Server and PBS Sched; PBS Mom on node1, node2, node3, node4)
qsub ‘scriptA’
Transfer scriptA
scriptA gets node1, node2, and node3

13 PBS Overview
(Diagram: PBS Server and PBS Sched; PBS Mom on node1, node2, node3, node4; labels: scriptA, ssh cmd, ssh cmd)

14 PBS glide-in
(Diagram: PBS Server and PBS Sched; PBS Mom on node1, node2, node3, node4; labels: pglide.pbs, qsub pglide.pbs)

15 PBS glide-in
(Diagram: PBS Server and PBS Sched; PBS Mom on node1, node2, node3, node4; labels: pglide.pbs, pbs_mom, pbs_server, pbs_sched)

16 PBS glide-in
(Diagram: PBS Server and PBS Sched; PBS Mom on node1, node2, node3, node4; labels: pglide.pbs, pbs_mom, pbs_server, pbs_sched, qsub scriptA, qsub scriptB, GRAM, globusrun-ws jobA, globusrun-ws jobB, scriptA, scriptB)

17 PBS glide-in TODO
In order to implement this, needed to disable some of PBS internal security features (drop privs, root check, priv ports, user auth checks, host auth checks)
Streamline installation process (good but not great)
Architecture discussion: one server per slot? One server for all slots on a single machine?
–Requires reworking torque software a bit

18 Slot Acquisition Status
BQP ‘virtual advanced reservation’ system in place
PBS glide-in working on all machines Dan has access to
Need to investigate advanced reservation interface(s)
Need to figure out how to properly submit PBS jobs using GRAM

19 Thanks! Questions?

20 Statistics TODO
More reactive change point detection
–Machine down time constitutes a change point we can detect better
–Better understanding of autocorrelation and quantiles
Non-statistical case
–One user submits 20,000 single processor jobs

21 Current Cluster Status
Columns: Dedicated, Batch Controlled, Advanced Res.
Dante: X ?
NCSA Mercury: X ?
SDSC Teragrid: X ?
ADA: X ?
IU TG: X ?
IU BigRed: X ?
IU Tyr: X ?

