Job Scheduling in a Grid Computing Environment


Job Scheduling in a Grid Computing Environment Colton Lewis

Agenda Last presentation: introduce grid computing This presentation: address job scheduling techniques in detail Review what grid computing is Job scheduling challenges in grid computing Case study in approaching these challenges

Components of Grid Computing Multiple computers Independently functioning hardware Multiple locations and/or owners Shared computational goal Distributed resources over a network Typically an already existing network

Benefits of Grid Computing Large pool of resources Large grids are comparable in FLOP/s to top 500 supercomputers Distributed costs Administration Maintenance Electricity Space Utilize existing infrastructure and avoid specialized hardware

Inherent Parallelism Large numbers of computers means lots of possible parallelism Great for handling large numbers of easily separable tasks Easily parallelizable problems with little communication Signal processing, graphics and animation, search and simulation, etc. High volume of very similar tasks

Example Grid Project: SETI@home

Job Scheduling NP-Hard computer science problem Optimality is computationally intractable in general Combinatorial Optimization Grids must consider even more factors

Heterogeneous Machines Machines on the grid may have vastly different resources Dedicated clusters Desktop computer donating spare cycles Embedded devices Must account for this to balance load

Dynamic Network Resources may not be available Computers may be shut off, software uninstalled Resources may not be reliable Hardware errors, malicious participants returning incorrect results

General Strategies Know as much as possible Job intensity Client capabilities Use heuristics

Examining the BOINC Scheduler Berkeley Open Infrastructure for Network Computing Software behind many volunteer computing projects

Terminology Host – a worker machine May work on multiple projects Client – program for fetching jobs from servers All server communication is issued by the client Server – a task assignment program Project – a long-running computation on the grid May have its own server or share SETI is a BOINC project Job – subtask of a project assigned to a host Application – program for performing a job Supplied by project

BOINC Host Architecture

User Preferences Informs many scheduling decisions Owner of host can specify Resource share of projects Limits on CPU, RAM, Network Bandwidth Connection interval to server(s)

Credit Hosts are assigned credit for jobs completed before deadline Based on estimated number of FLOPs Each project awards credit Provides a way to rank performance of hosts Points toward possible grid improvements

Host Perspective Each host must solve two related problems CPU scheduling – when to run currently assigned jobs When to ask a project for more work Works to maximize credit subject to constraints User preferences Hardware

Early Policies CPU Scheduling – Weighted Round Robin Each project given CPU time according to user specified percentage Does not account for deadlines, may waste lots of work Work Fetch Scheduling – Keep enough work for full connection interval for all projects
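
The weighted round robin policy above can be sketched as follows. This is a minimal illustration, not BOINC's actual code: the function name, the project names, and the 24-hour period are all hypothetical.

```python
def weighted_round_robin(shares, period_hours):
    """Split one scheduling period among projects in proportion to resource share.

    shares: dict mapping project name -> user-specified resource share
            (any positive units; only the ratios matter).
    Returns a dict mapping project name -> CPU hours granted in this period.
    """
    total = sum(shares.values())
    return {project: period_hours * share / total
            for project, share in shares.items()}


# Hypothetical example: two projects with a 3:1 resource share over 24 hours.
slices = weighted_round_robin({"project_a": 3, "project_b": 1}, period_hours=24)
print(slices)  # {'project_a': 18.0, 'project_b': 6.0}
```

Nothing in this calculation looks at deadlines. If project_b's job needs 10 CPU hours and is due in 24 hours, it receives only 6 hours and misses its deadline, which is the weakness the slide points out.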

Example Failure Consider the table to the right Jobs complete in 250, 20, and 10 hours CPU is never idle, but all work is wasted

Earliest Deadline First Does not enforce desired resource sharing Projects with long jobs will starve
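
A small sketch of earliest deadline first on a single CPU, assuming jobs run to completion once started. The `Job` fields and the sample numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    project: str
    remaining_hours: float  # CPU time still needed
    deadline_hours: float   # time from now until the deadline


def simulate_edf(jobs):
    """Run jobs one at a time in earliest-deadline order.

    Returns a list of (project, finish_time, met_deadline) tuples.
    """
    clock = 0.0
    results = []
    for job in sorted(jobs, key=lambda j: j.deadline_hours):
        clock += job.remaining_hours
        results.append((job.project, clock, clock <= job.deadline_hours))
    return results


# Hypothetical workload: one project with a long job, two with short jobs.
results = simulate_edf([
    Job("long_project", remaining_hours=100, deadline_hours=300),
    Job("short_a", remaining_hours=20, deadline_hours=50),
    Job("short_b", remaining_hours=10, deadline_hours=30),
])
for project, finish, ok in results:
    print(project, finish, ok)
```

The long job always runs last, and resource shares play no part in the ordering. If short-deadline jobs keep arriving, the project with long jobs keeps getting pushed back, which is the starvation problem noted on the slide.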

Estimating CPU Time Knowing CPU time means knowing which jobs can be completed by deadline Project-supplied FLOPs estimate divided by host CPU benchmark Can be consistently wrong; real projects need memory, I/O, etc. Duration correction factor per project How much CPU time did the project's last job take compared to the estimate? CPU efficiency factor How does actual CPU time compare to wall time? Applications may periodically report percentage done
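
The estimate described above can be written out as a short calculation. This is a sketch under stated assumptions: the function names are hypothetical, the duration correction factor is taken to be the ratio of actual to estimated CPU time, and the efficiency factor is taken to be CPU time divided by wall time. The slides do not give exact formulas.

```python
def estimate_cpu_seconds(flops_estimate, benchmark_flops_per_sec,
                         duration_correction=1.0):
    """Project-supplied FLOPs estimate divided by the host benchmark,
    scaled by a per-project duration correction factor (assumed ratio of
    actual to estimated CPU time on earlier jobs)."""
    return flops_estimate / benchmark_flops_per_sec * duration_correction


def estimate_wall_seconds(cpu_seconds, cpu_efficiency):
    """Convert CPU time to wall time; cpu_efficiency is assumed to be
    CPU seconds obtained per wall-clock second (between 0 and 1)."""
    return cpu_seconds / cpu_efficiency


def remaining_cpu_seconds(elapsed_cpu_seconds, fraction_done):
    """Extrapolate remaining CPU time from an application's progress report."""
    return elapsed_cpu_seconds * (1.0 - fraction_done) / fraction_done


# Hypothetical numbers: a 3.6e12 FLOP job on a host benchmarked at 1e9 FLOP/s.
cpu = estimate_cpu_seconds(3.6e12, 1e9, duration_correction=1.5)
wall = estimate_wall_seconds(cpu, cpu_efficiency=0.75)
left = remaining_cpu_seconds(elapsed_cpu_seconds=1800, fraction_done=0.25)
print(cpu, wall, left)  # 5400.0 7200.0 5400.0
```

A job can be expected to meet its deadline only if the wall-time estimate fits in the time remaining, so each correction factor directly affects which jobs the host should run or request.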

Debt The amount of work “owed” to a project Long term enforcement of resource shares while still attending to deadlines Short term debt controls CPU scheduling over one connection interval Long term debt controls Work Fetching

CPU Scheduling Periodically calculate debt to each project CPU time expected by resource sharing minus CPU time spent Deduct expected payoff from currently running jobs Run earliest deadline job from project with most debt
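
A simplified sketch of debt-based selection, assuming a single CPU. The function names and data layout are hypothetical, and the step that deducts the expected payoff of currently running jobs is left out to keep the example short.

```python
def update_debts(debts, shares, cpu_spent, period):
    """Add (CPU time expected by resource share - CPU time actually spent)
    to each project's debt for the period just ended."""
    total_share = sum(shares.values())
    updated = {}
    for project, share in shares.items():
        expected = period * share / total_share
        updated[project] = (debts.get(project, 0.0)
                            + expected - cpu_spent.get(project, 0.0))
    return updated


def pick_next_job(debts, jobs):
    """jobs: list of (project, deadline) tuples for runnable jobs.

    Chooses the project owed the most CPU time, then that project's
    earliest-deadline job. Returns None when nothing is runnable.
    """
    projects = {project for project, _ in jobs}
    if not projects:
        return None
    most_owed = max(projects, key=lambda p: debts.get(p, 0.0))
    return min((job for job in jobs if job[0] == most_owed),
               key=lambda job: job[1])


# Hypothetical period: equal shares, but project "a" used 8 of the 10 hours.
debts = update_debts({}, {"a": 1, "b": 1}, {"a": 8.0, "b": 2.0}, period=10.0)
print(debts)  # {'a': -3.0, 'b': 3.0}
choice = pick_next_job(debts, [("a", 5.0), ("b", 40.0), ("b", 20.0)])
print(choice)  # ('b', 20.0)
```

Project "b" is owed time, so its earliest-deadline job runs next even though project "a" holds the job with the nearest deadline overall. This is how debt restores the resource shares that pure earliest deadline first ignores.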

Work Fetching Same general method as CPU Scheduling Controls new jobs requested rather than CPU time

Server Perspective Must ensure correctness of results, if needed Must deliver reasonable jobs to hosts requesting work

BOINC Server Architecture

Credit and Redundancy Many jobs require error checking Solution: assign same job to two or more hosts Answers are compared by project server If enough hosts agree, answer is accepted Credit is awarded to all correct hosts When assigning new work, prioritize jobs waiting for an answer
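
The redundancy scheme above can be sketched as a simple quorum check. The function name and the sample hosts are hypothetical. The sketch also assumes answers can be compared for exact equality; real projects may need application-specific comparison, for example with floating-point tolerance.

```python
from collections import Counter


def validate(results, quorum):
    """results: dict mapping host name -> returned answer (hashable).

    Returns (accepted_answer, credited_hosts) when at least `quorum` hosts
    agree on an answer, otherwise (None, []).
    """
    if not results:
        return None, []
    answer, votes = Counter(results.values()).most_common(1)[0]
    if votes < quorum:
        return None, []
    credited = sorted(host for host, value in results.items()
                      if value == answer)
    return answer, credited


# Hypothetical job sent to three hosts; one returns a wrong result.
accepted, credited = validate({"host1": 42, "host2": 42, "host3": 41}, quorum=2)
print(accepted, credited)  # 42 ['host1', 'host2']
```

Until a quorum is reached the job stays open and nobody can be credited, which is one reason for the server to prioritize jobs that are still waiting for an answer when handing out new work.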

Job Size Matching Assume jobs can be created in various size classes Keep order statistics of known host performance When assigning new work, prioritize jobs that are the right size for the requesting host If possible, create jobs according to the distribution of known hosts
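
One possible way to turn order statistics into size classes is to cut the known host speeds at quantiles. This is a sketch of the idea only, not BOINC's implementation: the function names, the three-class split, and the fallback to the nearest class are assumptions.

```python
from bisect import bisect_right
from statistics import quantiles


def size_class_boundaries(host_speeds, n_classes):
    """Cut points that split known host speeds into n_classes groups of
    roughly equal population (needs at least two known hosts)."""
    return quantiles(host_speeds, n=n_classes)


def size_class_for(host_speed, boundaries):
    """Class 0 is the slowest group; higher classes are faster hosts."""
    return bisect_right(boundaries, host_speed)


def pick_job(jobs_by_class, host_class):
    """Prefer a job of the host's own size class, then the nearest class."""
    for cls in sorted(jobs_by_class, key=lambda c: abs(c - host_class)):
        if jobs_by_class[cls]:
            return jobs_by_class[cls][0]
    return None


# Hypothetical benchmark scores for nine known hosts, split into three classes.
bounds = size_class_boundaries([1, 2, 3, 4, 5, 6, 7, 8, 9], n_classes=3)
queue = {0: ["small-1"], 1: [], 2: ["large-1"]}
print(size_class_for(2, bounds), size_class_for(5, bounds), size_class_for(9, bounds))
print(pick_job(queue, size_class_for(9, bounds)))
```

Cutting at quantiles means each class covers about the same number of hosts, so creating jobs in equal numbers per class follows the distribution of known hosts.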

Summary Effective grid computing must consider both host and server Nature of grid means different interests may control each Long running projects allow for predictive statistics CPU time, job matching, etc. The best known methods use heuristics to decide what to do Human-like notions of “credit”, “debt”, etc.

Works Consulted
D. P. Anderson and J. McLeod, "Local Scheduling for Volunteer Computing," 2007 IEEE International Parallel and Distributed Processing Symposium, Long Beach, CA, 2007, pp. 1-8.
D. P. Anderson, E. Korpela and R. Walton, "High-performance task distribution for volunteer computing," First International Conference on e-Science and Grid Computing (e-Science'05), Melbourne, Vic., 2005, pp. 196-203.
E. Korpela, D. Werthimer, D. Anderson, J. Cobb and M. Lebofsky, "SETI@home-massively distributed computing for SETI," Computing in Science & Engineering, vol. 3, no. 1, pp. 78-83, Jan/Feb 2001.
B. Jacob et al., Introduction to Grid Computing. United States: IBM, International Technical Support Organization, 2005. Web. <https://www.redbooks.ibm.com/redbooks/pdfs/sg246778.pdf>
BOINC wiki, "JobSizeMatching." <http://boinc.berkeley.edu/trac/wiki/JobSizeMatching>