Slot Acquisition
Presenter: Daniel Nurmi

Scope
One aspect of VGDL request is the time ‘slot’ when resources are needed
  – Earliest time when resource set is needed
  – Maximum duration resource set will be used
Three classes of resources
  – dedicated: always available
  – batch controlled: lag before available
  – advanced reservation: guaranteed availability in the future

Acquisition Routines
Each class of resource needs the following (logical) routines
  – Prob = Query(cluster, nodes, walltime, starttime)
  – Id = BindInit(cluster, nodes, walltime, starttime, success_prob)
  – Status = Check(id)
  – Status = Install(id)
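The four logical routines can be pictured as one interface that each resource class implements. The following is a minimal Python sketch only: the class name `SlotAcquirer`, the snake_case method names, and all argument and return types are assumptions made for illustration, not part of the presented system.

```python
from abc import ABC, abstractmethod


class SlotAcquirer(ABC):
    """Hypothetical interface mirroring Query / BindInit / Check / Install."""

    @abstractmethod
    def query(self, cluster: str, nodes: int, walltime: int, starttime: int) -> float:
        """Return the probability that the requested slot can be obtained."""

    @abstractmethod
    def bind_init(self, cluster: str, nodes: int, walltime: int,
                  starttime: int, success_prob: float) -> str:
        """Start binding the slot and return an id used by later calls."""

    @abstractmethod
    def check(self, slot_id: str) -> bool:
        """Report whether the bind identified by slot_id is ready."""

    @abstractmethod
    def install(self, slot_id: str) -> bool:
        """Install the resource manager inside the bound slot."""
```

Each of the three resource classes would then be a subclass that fills in these four methods.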

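Slot Manager Acquisition Procedure (diagram)
Query: slot manager calls Query() (“Is available?”) and gets back a probability
Bind: slot manager calls BindInit() to initiate the bind, then Check() (“Bind yet?”), which answers true/false/abort
Install: slot manager calls Install() to install the PBS glide-in when it is time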
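The procedure in this diagram can be sketched as a small driver loop. This is an illustrative Python sketch under stated assumptions: the names `acquire_slot`, `Status`, and `DedicatedResource`, the `min_prob` threshold, and the polling interval are all hypothetical; only the order of the calls (Query, BindInit, Check, Install) and the true/false/abort answers come from the slides.

```python
import time
from enum import Enum


class Status(Enum):
    TRUE = "true"
    FALSE = "false"
    ABORT = "abort"


def acquire_slot(resource, cluster, nodes, walltime, starttime,
                 min_prob=0.75, poll_seconds=30, sleep=time.sleep):
    """Drive one acquisition: Query, BindInit, poll Check, then Install.

    Returns the Install status, or None if the query probability is below
    min_prob (an assumed threshold) or the bind is aborted.
    """
    prob = resource.query(cluster, nodes, walltime, starttime)
    if prob < min_prob:
        return None
    slot_id = resource.bind_init(cluster, nodes, walltime, starttime, prob)
    while True:
        status = resource.check(slot_id)
        if status is Status.TRUE:
            return resource.install(slot_id)
        if status is Status.ABORT:
            return None
        sleep(poll_seconds)  # not bound yet, ask again later


class DedicatedResource:
    """Dedicated class: every step except Install is a no-op."""

    def query(self, cluster, nodes, walltime, starttime):
        return 1.0

    def bind_init(self, cluster, nodes, walltime, starttime, success_prob):
        return f"{cluster}-slot"

    def check(self, slot_id):
        return Status.TRUE

    def install(self, slot_id):
        return Status.TRUE  # placeholder for installing the PBS glide-in
```

For example, `acquire_slot(DedicatedResource(), "dante", 4, 3600, 0)` walks straight through all four calls without waiting.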

Dedicated
Query
  – NOP (prob = 1)
BindInit
  – NOP (always true)
Check
  – NOP (always true)
Install
  – Installs PBS glide-in

Advanced Reservation
Query
  – Makes request to advanced reservation system
  – Prob = 1 if we can make the reservation
  – Prob = 0 if we cannot
BindInit
  – Make adv. res. request
Check
  – NOP (always return true)
Install
  – Submit PBS glide-in installation job to specialized adv. res. queue

Batch Controlled
Query
  – Performs an algorithm to determine probability of meeting the slot requirement through regular batch queue
BindInit
  – Use values calculated from ‘query’ for job dimensions and time to wait before submission
Check
  – When ‘time to wait’ has elapsed, return true
Install
  – Submit PBS glide-in installation job
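The batch-controlled routines carry state from one call to the next: Query computes values that BindInit reuses, and Check compares the clock against the planned wait. Here is a minimal Python sketch of that hand-off. The class name `BatchControlledSlot`, the injected `planner`, `submit_job`, and `clock` callables, and the id format are hypothetical; the slides do not show how the real implementation stores these values.

```python
import time


class BatchControlledSlot:
    """Sketch of the batch-controlled routines with an injected planner and clock."""

    def __init__(self, planner, submit_job, clock=time.time):
        self._planner = planner        # hypothetical: returns (prob, wait, real_walltime)
        self._submit_job = submit_job  # hypothetical: submits the glide-in installation job
        self._clock = clock
        self._plan = None
        self._binds = {}

    def query(self, cluster, nodes, walltime, starttime):
        # Remember the job dimensions and wait time for bind_init().
        prob, wait, real_walltime = self._planner(cluster, nodes, walltime, starttime)
        self._plan = (wait, real_walltime)
        return prob

    def bind_init(self, cluster, nodes, walltime, starttime, success_prob):
        if self._plan is None:
            raise RuntimeError("query() must run before bind_init()")
        wait, real_walltime = self._plan
        slot_id = f"{cluster}-{len(self._binds)}"
        # Record when the job should be submitted and with what dimensions.
        self._binds[slot_id] = (self._clock() + wait, cluster, nodes, real_walltime)
        return slot_id

    def check(self, slot_id):
        # True once the planned 'time to wait' has elapsed.
        submit_at = self._binds[slot_id][0]
        return self._clock() >= submit_at

    def install(self, slot_id):
        _, cluster, nodes, real_walltime = self._binds[slot_id]
        return self._submit_job(cluster, nodes, real_walltime)
```

Injecting the clock keeps the sketch testable without real waiting; a production version would presumably poll against wall-clock time.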

The Algorithm
Routines
  – ‘deadline’ is ‘seconds from now’
  – P = bqp_pred(machine, nodes, walltime, deadline)
Algorithm
  Preq = 0.75
  past = 0
  P = bqp_pred(M, N, W+D, D)
  while ((D-past) > 0) {
    if (P ~ Preq) {
      wait = past
      real_walltime = W+(D-past)
    }
    past += 30
    P = bqp_pred(M, N, W+(D-past), (D-past))
  }
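The loop above can be written as a runnable Python sketch. Several things here are assumptions: `plan_submission` is a hypothetical name, the predictor is passed in as a callable rather than being the real `bqp_pred`, and the slide's `P ~ Preq` test is interpreted as "P is at least Preq". The sketch also evaluates the predictor at the top of each iteration, which avoids the final call with a zero deadline that the slide's loop makes but never uses.

```python
def plan_submission(bqp_pred, machine, nodes, walltime, deadline,
                    p_req=0.75, step=30):
    """Return (wait, real_walltime) for the latest acceptable submit time.

    bqp_pred(machine, nodes, walltime, deadline) must return a probability;
    deadline is in seconds from now. Returns None if no submit time reaches
    p_req.
    """
    plan = None
    past = 0
    while deadline - past > 0:
        remaining = deadline - past
        # Pad the walltime by the time left until the deadline.
        p = bqp_pred(machine, nodes, walltime + remaining, remaining)
        if p >= p_req:
            # Later acceptable times overwrite earlier ones, so the last
            # acceptable time wins and padding is minimized.
            plan = (past, walltime + remaining)
        past += step
    return plan


def toy_pred(machine, nodes, walltime, deadline):
    """Toy stand-in predictor: more time before the deadline, higher odds."""
    return min(1.0, deadline / 100)


print(plan_submission(toy_pred, "M", 4, 50, 120))  # (30, 140)
```

With the toy predictor, waiting 30 seconds still gives a probability of 0.9, while waiting 60 seconds drops it to 0.6, so the sketch picks a 30 second wait and a padded walltime of 140 seconds.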

Batch Experiment
75% is the target probability
356 total requests
257 total batch submissions
  – 99 requests resulted in initial ‘not possible’ response
192 slots successfully acquired
257 * 0.75 = 192.75, about 193
Choose last acceptable time to minimize waste
(Timeline figure: now, 0.75, submit time)
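The counts on this slide can be checked with a few lines of arithmetic. The Python sketch below uses only the numbers reported above; the variable names are illustrative.

```python
total_requests = 356
not_possible = 99
submissions = total_requests - not_possible   # 257 batch submissions
target = 0.75

expected = submissions * target               # slots expected at the target probability
acquired = 192
observed_rate = acquired / submissions        # fraction of submissions that got a slot

print(submissions)               # 257
print(expected)                  # 192.75
print(round(observed_rate, 3))   # 0.747
```

The observed success rate of roughly 74.7% sits just under the 75% target.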

Near Term Experiments
Try other probability levels
Try other deadlines

PBS Glide-in
Basic batch queue system assumes one-to-one mapping of job to resource set (slot)
Idea: once a single ‘slot’ has been acquired, install a ‘personal’ resource manager and scheduler within it in order to support multiple jobs within a single slot
Have instrumented Torque (PBS) to fulfill this task
  – Plays the role that Condor would play as infrastructure scheduler
  – PBS “glide-in”
  – Simpler, supports MPI, etc.

PBS Overview (diagram): PBS Server and PBS Sched, with a PBS Mom on each of node1, node2, node3, and node4. ‘qsub scriptA’ transfers scriptA; scriptA gets node1, node2, and node3.

PBS Overview (diagram, continued): same layout; scriptA runs and uses ssh to start cmd on its nodes.

PBS glide-in (diagram): PBS Server and PBS Sched, with a PBS Mom on each of node1, node2, node3, and node4. pglide.pbs is submitted with ‘qsub pglide.pbs’.

PBS glide-in (diagram, continued): same layout; pglide.pbs runs pbs_mom, pbs_server, and pbs_sched.

PBS glide-in (diagram, continued): same layout with pglide.pbs running pbs_mom, pbs_server, and pbs_sched; ‘qsub scriptA’ and ‘qsub scriptB’ submit scriptA and scriptB, and GRAM handles ‘globusrun-ws jobA’ and ‘globusrun-ws jobB’.

PBS glide-in TODO
In order to implement this, needed to disable some of PBS's internal security features (drop privs, root check, priv ports, user auth checks, host auth checks)
Streamline installation process (good but not great)
Architecture discussion: one server per slot? One server for all slots on a single machine?
  – Requires reworking Torque software a bit

Slot Acquisition Status
BQP ‘virtual advanced reservation’ system in place
PBS glide-in working on all machines Dan has access to
Need to investigate advanced reservation interface(s)
Need to figure out how to properly submit PBS jobs using GRAM

Thanks! Questions?

Statistics TODO
More reactive change point detection
  – Machine down time constitutes a change point we can detect better
  – Better understanding of autocorrelation and quantiles
Non-statistical case
  – One user submits 20,000 single processor jobs

Current Cluster Status
Columns: Dedicated, Batch Controlled, Advanced Res.
  Dante: X ?
  NCSA Mercury: X ?
  SDSC Teragrid: X ?
  ADA: X ?
  IU TG: X ?
  IU BigRed: X ?
  IU Tyr: X ?