Job Scheduling in a Grid Computing Environment

Job Scheduling in a Grid Computing Environment Colton Lewis

Agenda Last presentation: introduced grid computing This presentation: addresses job scheduling techniques in detail Review of what grid computing is Job scheduling challenges in grid computing Case study in approaching these challenges

Components of Grid Computing Multiple computers Independently functioning hardware Multiple locations and/or owners Shared computational goal Distributed resources over a network Typically an already existing network

Benefits of Grid Computing Large pool of resources Large grids are comparable in FLOP/s to top 500 supercomputers Distributed costs Administration Maintenance Electricity Space Utilize existing infrastructure and avoid specialized hardware

Inherent Parallelism Large numbers of computers means lots of possible parallelism Great for handling large numbers of easily separable tasks Easily parallelizable problems with little communication Signal processing, graphics and animation, search and simulation, etc. High volume of very similar tasks

Example Grid Project: SETI@home

Job Scheduling NP-Hard computer science problem Optimality is computationally intractable in general Combinatorial Optimization Grids must consider even more factors

Heterogeneous Machines Machines on the grid may have vastly different resources Dedicated clusters Desktop computer donating spare cycles Embedded devices Must account for this to balance load

Dynamic Network Resources may not be available Computers may be shut off, software uninstalled Resources may not be reliable Hardware errors, malicious participants returning incorrect results

General Strategies Know as much as possible Job intensity Client capabilities Use Heuristics

Examining the BOINC Scheduler Berkeley Open Infrastructure for Network Computing Software behind many volunteer computing projects

Terminology Host – a worker machine; may work on multiple projects. Client – program for fetching jobs from servers; all server communication is initiated by the client. Server – a task assignment program. Project – a long-running computation on the grid; may have its own server or share one; SETI@home is a BOINC project. Job – subtask of a project assigned to a host. Application – program for performing a job; supplied by the project.

BOINC Host Architecture

User Preferences Informs many scheduling decisions Owner of host can specify Resource share of projects Limits on CPU, RAM, Network Bandwidth Connection interval to server(s)

Credit Hosts are assigned credit for jobs completed before deadline Based on estimated number of FLOPs Each project awards credit Provides a way to rank performance of hosts Points toward possible grid improvements

Host Perspective Each host must solve two related problems CPU scheduling – when to run currently assigned jobs When to ask a project for more work Works to maximize credit subject to constraints User preferences Hardware

Early Policies CPU Scheduling – Weighted Round Robin Each project given CPU time according to user specified percentage Does not account for deadlines, may waste lots of work Work Fetch Scheduling – Keep enough work for full connection interval for all projects
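The weighted round-robin policy above can be illustrated with a minimal sketch. The project names, share values, and helper functions here are hypothetical and are not taken from the BOINC source; the point is only to show why proportional time slicing can miss deadlines.

```python
def weighted_round_robin(shares, period_hours):
    """Split one scheduling period among projects in proportion to resource share."""
    total = sum(shares.values())
    return {name: period_hours * share / total for name, share in shares.items()}


def wall_hours_to_finish(cpu_hours, share, total_share):
    """Wall-clock hours a job needs when its project only gets its share of the CPU."""
    return cpu_hours * total_share / share


# Hypothetical host attached to three projects with shares 50/30/20.
shares = {"A": 50, "B": 30, "C": 20}
allocation = weighted_round_robin(shares, period_hours=10)

# A job from project A needs 8 CPU hours and is due in 10 wall-clock hours.
# At a 50% share it needs 16 wall-clock hours, so the deadline is missed.
finish = wall_hours_to_finish(cpu_hours=8, share=50, total_share=100)
missed_deadline = finish > 10
```

Because the policy never looks at deadlines, the job above is worked on steadily and still finishes too late, which is the kind of wasted work the slide describes.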

Example Failure Consider the table to the right Jobs complete in 250, 20, and 10 hours CPU is never idle, but all work is wasted

Earliest Deadline First Does not enforce desired resource sharing Projects with long jobs will starve
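A small sketch of earliest-deadline-first ordering, assuming a single CPU and jobs that run to completion without preemption. The `Job` record and the sample numbers are illustrative assumptions, not data from the presentation.

```python
from dataclasses import dataclass


@dataclass
class Job:
    project: str
    cpu_hours: float
    deadline: float  # hours from now


def simulate_edf(jobs):
    """Run jobs one after another in deadline order; report finish time and whether each met its deadline."""
    clock = 0.0
    results = []
    for job in sorted(jobs, key=lambda j: j.deadline):
        clock += job.cpu_hours
        results.append((job.project, clock, clock <= job.deadline))
    return results


# Hypothetical queue: one long job with a distant deadline, two short urgent jobs.
jobs = [
    Job("long", cpu_hours=100.0, deadline=500.0),
    Job("short", cpu_hours=10.0, deadline=20.0),
    Job("short", cpu_hours=10.0, deadline=40.0),
]
schedule = simulate_edf(jobs)
```

The long job always runs last here. If the "short" project keeps supplying urgent jobs, the long job keeps being pushed back regardless of its project's resource share.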

Estimating CPU Time Knowing CPU time means knowing which jobs can be completed by their deadlines. Baseline: project-supplied FLOPs estimate divided by host CPU benchmark. Can be consistently wrong: real jobs also need memory, I/O, etc. Duration correction factor, per project: how much CPU time did the project's previous jobs take compared to the estimate? CPU efficiency factor: how does actual CPU time compare to wall time? Applications may periodically report percentage done.
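The estimate can be written out as a few lines of arithmetic. How the correction factors are combined below is an assumption made for illustration (multiply by the duration correction factor, divide by CPU efficiency); the function names and sample values are hypothetical.

```python
def estimate_cpu_seconds(flops_estimate, benchmark_flops_per_sec, duration_correction=1.0):
    """Project FLOPs estimate over host benchmark, scaled by the per-project correction factor."""
    return flops_estimate / benchmark_flops_per_sec * duration_correction


def estimate_wall_seconds(cpu_seconds, cpu_efficiency):
    """Convert CPU time to wall time; cpu_efficiency is CPU time per unit of wall time (0 < e <= 1)."""
    return cpu_seconds / cpu_efficiency


def remaining_cpu_seconds(cpu_seconds_so_far, fraction_done):
    """Extrapolate remaining CPU time from an application's reported fraction done (0 < f <= 1)."""
    return cpu_seconds_so_far * (1.0 - fraction_done) / fraction_done


# Hypothetical job: 3.6e13 FLOPs on a host benchmarked at 1e9 FLOP/s.
raw = estimate_cpu_seconds(3.6e13, 1e9)                                # naive estimate
corrected = estimate_cpu_seconds(3.6e13, 1e9, duration_correction=1.5)
wall = estimate_wall_seconds(corrected, cpu_efficiency=0.9)
left = remaining_cpu_seconds(cpu_seconds_so_far=1000.0, fraction_done=0.25)
```

With these numbers the naive estimate is 10 hours of CPU time, the corrected estimate is 15 hours, and the job is expected to occupy the host for roughly 16.7 wall-clock hours.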

Debt The amount of work “owed” to a project Long term enforcement of resource shares while still attending to deadlines Short term debt controls CPU scheduling over one connection interval Long term debt controls Work Fetching

CPU Scheduling Periodically calculate debt to each project CPU time expected by resource sharing minus CPU time spent Deduct expected payoff from currently running jobs Run earliest deadline job from project with most debt
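The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the BOINC client's code: debt is computed over the total CPU time spent so far, each project's queue is represented only by its job deadlines, and all names are hypothetical.

```python
def short_term_debts(shares, cpu_spent, expected_payoff=None):
    """Debt = CPU time a project was entitled to by its share, minus time it actually got,
    minus the payoff expected from its currently running jobs."""
    expected_payoff = expected_payoff or {}
    total_share = sum(shares.values())
    total_cpu = sum(cpu_spent.values())
    return {
        project: total_cpu * share / total_share
        - cpu_spent[project]
        - expected_payoff.get(project, 0.0)
        for project, share in shares.items()
    }


def pick_next_job(shares, cpu_spent, queued_deadlines, expected_payoff=None):
    """Choose the project with the most debt that has queued work, then its earliest deadline."""
    debts = short_term_debts(shares, cpu_spent, expected_payoff)
    candidates = [p for p in debts if queued_deadlines.get(p)]
    if not candidates:
        return None
    project = max(candidates, key=lambda p: debts[p])
    return project, min(queued_deadlines[project])


# Hypothetical host: equal shares, but project A has used 8 of the last 10 CPU hours.
shares = {"A": 1, "B": 1}
cpu_spent = {"A": 8.0, "B": 2.0}
debts = short_term_debts(shares, cpu_spent)
choice = pick_next_job(shares, cpu_spent, {"A": [5.0], "B": [30.0, 12.0]})
```

Project B is owed 3 hours, so its earliest-deadline job runs next even though project A holds the job with the nearest deadline overall. This is how debt restores the resource shares that pure earliest-deadline-first ignores.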

Work Fetching Same general method as CPU Scheduling Controls new jobs requested rather than CPU time

Server Perspective Must ensure correctness of results, if needed Must deliver reasonable jobs to hosts requesting work

BOINC Server Architecture

Credit and Redundancy Many jobs require error checking Solution: assign same job to two or more hosts Answers are compared by project server If enough hosts agree, answer is accepted Credit is awarded to all correct hosts When assigning new work, prioritize jobs waiting for an answer
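A minimal sketch of quorum-based validation, assuming results can be compared by exact equality (real projects may need a tolerance for floating-point output). The function name, host names, and quorum value are hypothetical.

```python
from collections import Counter


def validate(results, quorum):
    """Accept the most common answer if at least `quorum` hosts returned it.

    results maps host name -> returned answer.
    Returns (accepted_answer, hosts_to_credit), or (None, []) if no quorum yet.
    """
    if not results:
        return None, []
    answer, votes = Counter(results.values()).most_common(1)[0]
    if votes < quorum:
        return None, []
    credited = sorted(host for host, value in results.items() if value == answer)
    return answer, credited


# Hypothetical job replicated to three hosts; one returns a wrong result.
accepted, credited = validate({"h1": 42, "h2": 42, "h3": 41}, quorum=2)

# Only one result is back so far: no quorum, so the job still waits for an answer.
pending, _ = validate({"h1": 42}, quorum=2)
```

A job in the pending state is the kind the slide says to prioritize when assigning new work, since one more matching result lets the server accept the answer and award credit.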

Job Size Matching Assume jobs can be created in various size classes Keep order statistics of known host performance When assigning new work, prioritize jobs that are the right size for the requesting host If possible, create jobs according to the distribution of known hosts
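One way to picture size matching is to map a host's position in the observed speed distribution to a size class. The quantile rule below is an illustrative assumption, not the BOINC server's actual algorithm, and the speeds are made-up numbers.

```python
from bisect import bisect_left


def size_class_for_host(host_speed, known_speeds, num_classes):
    """Map a host to a size class (0 = smallest jobs) by its quantile among known hosts."""
    if not known_speeds:
        return 0  # no statistics yet: hand out the smallest jobs
    ranked = sorted(known_speeds)
    quantile = bisect_left(ranked, host_speed) / len(ranked)
    return min(int(quantile * num_classes), num_classes - 1)


def pick_job_class(preferred, available_classes):
    """Prefer the matching class; otherwise fall back to the nearest class with work available."""
    if not available_classes:
        return None
    return min(available_classes, key=lambda c: (abs(c - preferred), c))


# Hypothetical benchmark scores for ten known hosts, and three job size classes.
known = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
slow = size_class_for_host(1, known, num_classes=3)
medium = size_class_for_host(5, known, num_classes=3)
fast = size_class_for_host(20, known, num_classes=3)
fallback = pick_job_class(preferred=fast, available_classes=[0, 1])
```

The fallback step reflects that size matching is a priority rather than a hard rule: a fast host still receives work when no large jobs are queued.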

Summary Effective grid computing must consider both host and server Nature of grid means different interests may control each Long running projects allow for predictive statistics CPU time, job matching, etc. The best known methods use heuristics to decide what to do Human-like notions of “credit”, “debt”, etc.

Works Consulted
D. P. Anderson and J. McLeod, "Local Scheduling for Volunteer Computing," 2007 IEEE International Parallel and Distributed Processing Symposium, Long Beach, CA, 2007, pp. 1-8.
D. P. Anderson, E. Korpela and R. Walton, "High-performance task distribution for volunteer computing," First International Conference on e-Science and Grid Computing (e-Science'05), Melbourne, Vic., 2005, pp. 196-203.
E. Korpela, D. Werthimer, D. Anderson, J. Cobb and M. Lebofsky, "SETI@home - massively distributed computing for SETI," Computing in Science & Engineering, vol. 3, no. 1, pp. 78-83, Jan/Feb 2001.
B. Jacob et al., Introduction to Grid Computing. United States: IBM, International Technical Support Organization, 2005. <https://www.redbooks.ibm.com/redbooks/pdfs/sg246778.pdf>
BOINC wiki, "JobSizeMatching." <http://boinc.berkeley.edu/trac/wiki/JobSizeMatching>