Job Scheduling in a Grid Computing Environment
Colton Lewis
Agenda
  Last presentation: introduced grid computing
  This presentation: job scheduling techniques in detail
    Review of what grid computing is
    Job scheduling challenges in grid computing
    Case study in approaching these challenges
Components of Grid Computing
  Multiple computers
    Independently functioning hardware
    Multiple locations and/or owners
  Shared computational goal
  Distributed resources over a network
    Typically an already existing network
Benefits of Grid Computing
  Large pool of resources
    Large grids are comparable in FLOP/s to TOP500 supercomputers
  Distributed costs
    Administration
    Maintenance
    Electricity
    Space
  Utilize existing infrastructure and avoid specialized hardware
Inherent Parallelism
  Large numbers of computers mean lots of possible parallelism
  Great for handling large numbers of easily separable tasks
    Easily parallelizable problems with little communication
    Signal processing, graphics and animation, search and simulation, etc.
    High volume of very similar tasks
Example Grid Project: SETI@home
Job Scheduling
  NP-hard computer science problem
    Optimality is computationally intractable in general
    Combinatorial optimization
  Grids must consider even more factors
Heterogeneous Machines
  Machines on the grid may have vastly different resources
    Dedicated clusters
    Desktop computers donating spare cycles
    Embedded devices
  Must account for this to balance load
Dynamic Network
  Resources may not be available
    Computers may be shut off, software uninstalled
  Resources may not be reliable
    Hardware errors, malicious participants returning incorrect results
General Strategies
  Know as much as possible
    Job intensity
    Client capabilities
  Use heuristics
Examining the BOINC Scheduler
  Berkeley Open Infrastructure for Network Computing
  Software behind many volunteer computing projects
Terminology
  Host – a worker machine
    May work on multiple projects
  Client – program for fetching jobs from servers
    All server communication is initiated by the client
  Server – a task assignment program
  Project – a long-running computation on the grid
    May have its own server or share one
    SETI@home is a BOINC project
  Job – subtask of a project assigned to a host
  Application – program for performing a job
    Supplied by the project
BOINC Host Architecture
User Preferences
  Inform many scheduling decisions
  Owner of host can specify
    Resource share of projects
    Limits on CPU, RAM, network bandwidth
    Connection interval to server(s)
Credit
  Hosts are assigned credit for jobs completed before deadline
    Based on estimated number of FLOPs
    Each project awards credit
  Provides a way to rank performance of hosts
  Points toward possible grid improvements
Host Perspective
  Each host must solve two related problems
    CPU scheduling – when to run currently assigned jobs
    Work fetch – when to ask a project for more work
  Works to maximize credit subject to constraints
    User preferences
    Hardware
Early Policies
  CPU Scheduling – Weighted Round Robin
    Each project given CPU time according to user-specified percentage
    Does not account for deadlines, may waste lots of work
  Work Fetch Scheduling – Keep enough work for full connection interval for all projects
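The weighted round-robin policy above can be sketched roughly as follows. This is a minimal illustration, not BOINC's actual code: the function name, the one-hour time slice, and the project names are all assumptions made for the example.

```python
def weighted_round_robin(shares, total_hours, slice_hours=1.0):
    """Split CPU time among projects in proportion to their resource shares.

    shares: dict mapping a (hypothetical) project name to a positive share.
    Returns a dict mapping project name to CPU hours received.
    Deadlines are deliberately ignored, as in the early policy.
    """
    received = {name: 0.0 for name in shares}
    elapsed = 0.0
    while elapsed < total_hours:
        # Run the project that is furthest behind relative to its share.
        name = min(shares, key=lambda p: received[p] / shares[p])
        run = min(slice_hours, total_hours - elapsed)
        received[name] += run
        elapsed += run
    return received


if __name__ == "__main__":
    print(weighted_round_robin({"A": 3, "B": 1}, total_hours=8))
```

Because deadlines never enter the decision, a project with a 25% share that holds a job needing 10 CPU hours finishes it only after about 40 wall-clock hours. If that job's deadline were 30 hours (hypothetical figures), the work would be wasted even though the CPU was never idle.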
Example Failure
  Consider the table to the right
  Jobs complete in 250, 20, and 10 hours
  CPU is never idle, but all work is wasted because every job misses its deadline
Earliest Deadline First
  Does not enforce desired resource sharing
  Projects with long jobs will starve
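A minimal sketch of earliest-deadline-first ordering, assuming every job is already on the host at time zero and runs to completion without preemption. The job names and hour values are hypothetical.

```python
def earliest_deadline_first(jobs):
    """Run jobs one at a time in order of deadline.

    jobs: list of (name, cpu_hours, deadline_hours) tuples.
    Returns a list of (name, finish_time, met_deadline) in execution order.
    """
    clock = 0.0
    schedule = []
    for name, cpu_hours, deadline in sorted(jobs, key=lambda job: job[2]):
        clock += cpu_hours
        schedule.append((name, clock, clock <= deadline))
    return schedule


if __name__ == "__main__":
    jobs = [("long", 100, 300), ("short1", 10, 20), ("short2", 10, 40)]
    for entry in earliest_deadline_first(jobs):
        print(entry)
```

Resource shares never appear in the ordering, so the project with the long job gets no CPU time until every shorter-deadline job is done. If short jobs keep arriving, that wait has no upper bound.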
Estimating CPU Time
  Knowing CPU time means knowing which jobs can be completed by deadline
  Project-supplied FLOPs estimate divided by host CPU benchmark
    Can be consistently wrong; real projects need memory, I/O, etc.
  Duration correction factor per project
    How much CPU time did the project's last job take compared to its estimate?
  CPU efficiency factor
    How does actual CPU time compare to wall time?
  Applications may periodically report percentage done
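One plausible way to combine these estimates is sketched below. The slide names the factors but not the exact arithmetic, so the way they are multiplied and divided here, along with every function and parameter name, is an assumption made for illustration.

```python
def estimate_cpu_seconds(flops_estimate, benchmark_flops_per_sec,
                         duration_correction=1.0):
    """Raw estimate (FLOPs / benchmark), scaled by the per-project correction."""
    return flops_estimate / benchmark_flops_per_sec * duration_correction


def estimate_wall_seconds(cpu_seconds, cpu_efficiency):
    """Convert CPU time to wall time; cpu_efficiency = CPU time / wall time."""
    return cpu_seconds / cpu_efficiency


def remaining_cpu_seconds(cpu_seconds_so_far, fraction_done):
    """Extrapolate remaining CPU time from an application's progress report."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return cpu_seconds_so_far / fraction_done - cpu_seconds_so_far


if __name__ == "__main__":
    cpu = estimate_cpu_seconds(3.6e12, 1e9, duration_correction=1.5)
    print(cpu, estimate_wall_seconds(cpu, cpu_efficiency=0.9))
    print(remaining_cpu_seconds(1800.0, 0.25))
```

A job is then considered safe to run only if its estimated wall time fits before the deadline. The progress-based extrapolation can replace the static estimate once an application starts reporting its percentage done.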
Debt
  The amount of work “owed” to a project
  Long-term enforcement of resource shares while still attending to deadlines
  Short-term debt controls CPU scheduling over one connection interval
  Long-term debt controls work fetching
CPU Scheduling
  Periodically calculate debt to each project
    CPU time expected by resource sharing minus CPU time spent
    Deduct expected payoff from currently running jobs
  Run earliest-deadline job from project with most debt
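The debt rule above can be sketched as follows. This is a simplified illustration under stated assumptions: it omits the step that deducts the expected payoff of currently running jobs, and the function names and data layout are hypothetical rather than taken from BOINC.

```python
def project_debts(shares, cpu_spent):
    """Debt = CPU time a project's share entitles it to, minus CPU time it got."""
    total_share = sum(shares.values())
    total_cpu = sum(cpu_spent.values())
    return {
        project: total_cpu * shares[project] / total_share - cpu_spent[project]
        for project in shares
    }


def pick_next_job(shares, cpu_spent, jobs):
    """Pick the earliest-deadline job from the most-owed project that has work.

    jobs: list of (project, job_name, deadline) tuples.
    """
    debts = project_debts(shares, cpu_spent)
    with_work = {job[0] for job in jobs}
    candidates = [p for p in debts if p in with_work]
    if not candidates:
        return None
    project = max(candidates, key=lambda p: debts[p])
    return min((job for job in jobs if job[0] == project), key=lambda job: job[2])


if __name__ == "__main__":
    shares = {"A": 1, "B": 1}
    spent = {"A": 30.0, "B": 10.0}
    jobs = [("A", "a1", 5), ("B", "b1", 50), ("B", "b2", 20)]
    print(project_debts(shares, spent))
    print(pick_next_job(shares, spent, jobs))
```

In this example project A has received more than its share, so project B is chosen even though A holds the job with the nearest deadline. Within B, the earlier-deadline job runs first.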
Work Fetching
  Same general method as CPU scheduling
  Controls new jobs requested rather than CPU time
Server Perspective
  Must ensure correctness of results, if needed
  Must deliver reasonable jobs to hosts requesting work
BOINC Server Architecture
Credit and Redundancy
  Many jobs require error checking
  Solution: assign same job to two or more hosts
    Answers are compared by project server
    If enough hosts agree, answer is accepted
    Credit is awarded to all correct hosts
  When assigning new work, prioritize jobs waiting for an answer
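A minimal sketch of the agreement check, assuming answers can be compared by exact equality and that a fixed quorum of matching answers is enough to accept a result. Real projects may need tolerant comparison for floating-point output. The function name and host names are hypothetical.

```python
from collections import Counter


def validate(results, quorum=2):
    """Accept the most common answer if at least `quorum` hosts returned it.

    results: dict mapping host name to the answer that host returned.
    Returns (accepted_answer, hosts_to_credit), or (None, []) if no quorum yet.
    """
    if not results:
        return None, []
    answer, count = Counter(results.values()).most_common(1)[0]
    if count < quorum:
        return None, []
    credited = sorted(host for host, value in results.items() if value == answer)
    return answer, credited


if __name__ == "__main__":
    print(validate({"h1": 42, "h2": 42, "h3": 7}))
    print(validate({"h1": 1, "h2": 2}))
```

When no quorum is reached, the job stays pending and another copy has to be sent out, which is why jobs waiting for an answer are prioritized when assigning new work.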
Job Size Matching
  Assume jobs can be created in various size classes
  Keep order statistics of known host performance
  When assigning new work, prioritize jobs that are the right size for the requesting host
  If possible, create jobs according to the distribution of known hosts
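One way to turn order statistics into a size class is sketched below. It assumes that host performance is a single number, that the server keeps a sorted list of the speeds it has seen, and that size classes are numbered from 0 (smallest) upward. These names and the quantile mapping are illustrative assumptions, not BOINC's implementation.

```python
from bisect import bisect_left


def size_class_for_host(host_speed, known_speeds, num_classes):
    """Map a host to a size class by its rank among known hosts.

    known_speeds: sorted list of previously observed host speeds.
    """
    if not known_speeds:
        return 0
    rank = bisect_left(known_speeds, host_speed)  # hosts strictly slower
    quantile = rank / len(known_speeds)
    return min(int(quantile * num_classes), num_classes - 1)


def pick_job(queue, host_class):
    """Prefer the queued job whose size class is closest to the host's class.

    queue: list of (job_name, size_class) tuples.
    """
    if not queue:
        return None
    return min(queue, key=lambda job: abs(job[1] - host_class))


if __name__ == "__main__":
    speeds = [1, 2, 3, 4, 5, 6, 7, 8]
    cls = size_class_for_host(5, speeds, num_classes=4)
    print(cls, pick_job([("j_small", 0), ("j_big", 3), ("j_mid", 2)], cls))
```

Falling back to the nearest class rather than refusing work keeps hosts busy when the ideal size is not in the queue.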
Summary
  Effective grid computing must consider both host and server
    Nature of grid means different interests may control each
  Long-running projects allow for predictive statistics
    CPU time, job matching, etc.
  The best known methods use heuristics to decide what to do
    Human-like notions of “credit”, “debt”, etc.
Works Consulted
  D. P. Anderson and J. McLeod, "Local Scheduling for Volunteer Computing," 2007 IEEE International Parallel and Distributed Processing Symposium, Long Beach, CA, 2007, pp. 1-8.
  D. P. Anderson, E. Korpela and R. Walton, "High-performance task distribution for volunteer computing," First International Conference on e-Science and Grid Computing (e-Science'05), Melbourne, Vic., 2005, pp. 196-203.
  E. Korpela, D. Werthimer, D. Anderson, J. Cobb and M. Lebofsky, "SETI@home: massively distributed computing for SETI," Computing in Science & Engineering, vol. 3, no. 1, pp. 78-83, Jan/Feb 2001.
  B. Jacob et al., Introduction to Grid Computing. United States: IBM, International Technical Support Organization, 2005. Web. <https://www.redbooks.ibm.com/redbooks/pdfs/sg246778.pdf>.
  <http://boinc.berkeley.edu/trac/wiki/JobSizeMatching>