Synchronizing Processes


Synchronizing Processes
Clocks: external clock synchronization (Cristian); internal clock synchronization (Gusella & Zatti); Network Time Protocol (Mills)
Decisions: agreement protocols (Fischer)
Data: distributed file systems (Satyanarayanan)
Memory: distributed shared memory (Nitzberg & Lo)
Schedules: distributed scheduling (Isard et al.)

Distributed Shared Memory
Distributed Shared Memory (DSM): an abstraction that gives the illusion of physically shared memory in a distributed system.
Pros: allows programmers to use the shared-memory paradigm; low cost of distributed-system machines; scalability, since there is no serialization point (i.e., no common bus).

DSM Approaches
Three common approaches:
- Hardware implementations that extend traditional caching techniques to scalable architectures
- Operating-system implementations that achieve sharing and coherence through virtual-memory management
- Compiler implementations that automatically convert shared accesses into synchronization and coherence primitives
Some systems use more than one approach.

DSM Design Choices
Four key aspects: structure and granularity, coherence semantics, scalability, and heterogeneity.

Granularity
A process is likely to access a large region of its shared address space in a small amount of time, so larger page sizes reduce paging overhead. However, larger page sizes increase the likelihood that more than one process will require access to the same page (i.e., contention), and they also increase false sharing.
False sharing: two unrelated variables (each used by a different process) are placed in the same page, so the page bounces between processes even though no data is actually shared.
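To make false sharing concrete, here is a minimal sketch (hypothetical addresses and an assumed 4 KB page size) showing how two unrelated variables can land in the same DSM page, so that one process's writes invalidate the other's copy:

```python
PAGE_SIZE = 4096  # bytes; assumed DSM page granularity

def page_of(addr: int) -> int:
    """Return the index of the DSM page containing a byte address."""
    return addr // PAGE_SIZE

# Two unrelated variables used by different processes (addresses are
# hypothetical, chosen only to illustrate the layout problem):
x_addr = 5000  # written only by process P1
y_addr = 6000  # written only by process P2

# Both fall in page 1, so under a write-invalidate protocol every write
# by P1 invalidates P2's copy of the page, even though no data is shared.
assert page_of(x_addr) == page_of(y_addr) == 1
```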

DSM Implementation
Five key aspects: data location and access, coherence protocol, replacement strategy, thrashing, and related algorithms.

Coherence Protocol
If shared data is replicated, two types of protocols handle replication and coherence:
- Write-invalidate protocol: invalidates all copies of a piece of data except one before a write can proceed
- Write-update protocol: a write updates all copies of a piece of data
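The two protocols can be contrasted with a minimal sketch (the page and copy structures below are assumptions for illustration, not any particular DSM system's API):

```python
class DSMPage:
    """One replicated page; every node holds a copy with a valid bit."""

    def __init__(self, data, nodes):
        self.copies = {n: {"data": data, "valid": True} for n in nodes}

    def write_invalidate(self, writer, data):
        # Invalidate all copies except the writer's before the write proceeds.
        for node, copy in self.copies.items():
            copy["valid"] = (node == writer)
        self.copies[writer]["data"] = data

    def write_update(self, writer, data):
        # Propagate the new value to every copy; all copies stay valid.
        for copy in self.copies.values():
            copy["data"] = data
            copy["valid"] = True

page = DSMPage(0, nodes=["A", "B", "C"])
page.write_invalidate("A", 42)  # B and C must re-fetch on their next read
page.write_update("A", 43)      # B and C see the new value immediately
```

Write-invalidate pays its cost on the next read by another node; write-update pays it on every write, which is why invalidation usually wins when sharing is infrequent.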

Synchronizing Processes: Data-Intensive Scheduling
CS/CE/TE 6378 Advanced Operating Systems

Data-Intensive Computing
Increasingly important for applications such as web-scale data mining, machine learning, and traffic analysis.
Fairness: more than 50% of jobs are small (less than 30 minutes), and a large job should not monopolize the cluster. If job X takes t seconds when it runs exclusively on a cluster, X should take no more than Jt seconds when the cluster has J concurrent jobs; for N computers and J jobs, each job should get at least N/J computers (a worked example follows below).
Data locality: large disks are directly attached to the computers, and network bandwidth is expensive.
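For example, on a 240-computer cluster (the size used in the paper's evaluation) with J = 4 concurrent jobs, each job should receive at least 240/4 = 60 computers, and a job that takes t = 10 minutes running alone should finish within Jt = 40 minutes.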

Distinguishing Feature
Every computer in the cluster has a large disk directly attached to it, which allows application data to be stored on the same computers on which it will be processed. But maintaining high bandwidth between arbitrary pairs of computers becomes increasingly expensive as the cluster grows, so if computations are not placed close to their input data, the network can become a bottleneck. Hence, optimizing the placement of computation to minimize network traffic is a primary goal.

Data-Intensive Scheduling
The primary challenge is to balance the requirements of data locality and fairness (all applications affected equally). A strategy that achieves optimal data locality will typically delay a job until its ideal resources are available, while a strategy that achieves fairness by allocating available resources to a job as soon as possible will typically be forced to choose resources that are not closest to the computation's data.

Fair Scheduling
Most users desire fair sharing of cluster resources. The most common request is that one user's large job should not monopolize the whole cluster, delaying the completion of everyone else's small jobs. It is also important that ensuring low latency for short jobs does not come at the expense of the overall throughput of the system.

Throughput
Throughput: the amount of data transferred from one node to another, or processed, in a specified amount of time. Data transfer rates for disk drives and networks are measured in terms of throughput, normally in kbps, Mbps, or Gbps.
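As a worked example of why this matters for placement: at a sustained throughput of 1 Gbps, transferring 1 TB of input data across the network takes about 8 × 10^12 bits ÷ 10^9 bits/s = 8,000 seconds (over two hours), whereas a task placed on the computer that already stores its data avoids the transfer entirely.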

Computational Model
Each job is managed by a root task: a process, running on one of the cluster computers, that contains a state machine managing the workflow of that job. The computation is executed by worker tasks, which are individual processes. A job's workflow is represented by a directed acyclic graph (DAG) of workers whose edges represent dependencies.

Workflow
[Figure: a workflow DAG in which W6 depends on W1, W2, and W3; W7 depends on W4 and W5; and W8 depends on W6 and W7.]
Initially, the root task requests worker tasks W1 through W5. Once W1, W2, and W3 have finished, the root task requests W6; once W4 and W5 have finished, it requests W7; and once W6 and W7 have finished, it requests W8.

Computational Model
The root process monitors which tasks have completed and which are ready for execution. While running, tasks are independent of each other, so killing one task will not impact another. A worker task may also be executed multiple times, for example to recreate data lost as a result of a failed computer, but a worker task will always generate the same result.

Cluster Architecture
A single centralized scheduling service maintains a batch queue of jobs. Several concurrent jobs can be running and sharing resources within the cluster, while other jobs are queued waiting for admission; each computer is restricted to run only one task at a time. When a job is started, the scheduler allocates a computer for its root task; if that computer fails, the job is re-executed from the beginning. Each running job's root task submits its list of ready workers and their input-data summaries to the scheduler.

Input Data Locality
A worker task may read data from storage attached to computers in the cluster. The data is either an input stored as a partitioned file in a distributed file system (DFS) or an intermediate output file generated by another worker. Consequently, the scheduler can be made aware of detailed information about the data-transfer costs that would result from executing a worker on a given computer.

Input Data Summaries
When the nth worker in job j (denoted wjn) is ready, its root task rj computes, for each computer m, the amount of data that wjn would have to read across the network if executed on m. The root rj then constructs two lists for wjn: a list of preferred computers and a list of preferred racks.

Preferred Localities
Any computer that stores more than a fraction δc of wjn's total input data is added to the list of preferred computers, and any rack whose computers together store more than a fraction δr of wjn's total input data is added to the list of preferred racks. In practice, the authors recommend a value of 0.1 for both δc and δr.
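A minimal sketch of how the root might build the two lists, assuming it already knows how many bytes of wjn's input each computer stores (bytes_on) and which rack each computer belongs to (rack_of); all names and numbers here are hypothetical:

```python
from collections import defaultdict

def preferred_lists(bytes_on, rack_of, total, delta_c=0.1, delta_r=0.1):
    """Return (preferred computers, preferred racks) for one worker."""
    computers = [m for m, b in bytes_on.items() if b > delta_c * total]
    per_rack = defaultdict(int)
    for m, b in bytes_on.items():
        per_rack[rack_of[m]] += b
    racks = [r for r, b in per_rack.items() if b > delta_r * total]
    return computers, racks

bytes_on = {"C1": 600, "C2": 50, "C3": 350}     # bytes of wjn's input per computer
rack_of = {"C1": "R1", "C2": "R1", "C3": "R2"}
print(preferred_lists(bytes_on, rack_of, total=1000))
# (['C1', 'C3'], ['R1', 'R2'])
```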

Cluster Architecture
Each running job's root task submits its list of ready workers and their input-data summaries to the scheduler. The scheduler then matches tasks to computers and instructs the appropriate root task to set them running. When a worker completes, its root task is informed, and this may trigger a new set of ready tasks to be sent to the scheduler.

Monitoring
The root task also monitors the execution time of worker tasks and can submit a duplicate of a task that is taking longer than expected to complete. When a worker task fails because of unreliable cluster resources, its root task is responsible for backtracking through the dependency graph and re-submitting tasks as necessary to regenerate any intermediate data that has been lost.

Termination
The scheduler may decide to kill a worker task before it completes, either to give other jobs access to its resources or to move the worker to a more suitable location. In this case, the scheduler automatically notifies the worker's root task so the task can be restarted on a different computer at a later time.

Fairness Goals
A job that runs for t seconds given exclusive access to a cluster should take no more than Jt seconds when there are J jobs concurrently executing on that cluster. The authors attempt to achieve this through admission control.

Admission Control
Admission control ensures that at most K jobs are executing at any time; when the limit is reached, new jobs are queued and started in order of submission time as running jobs complete. A large K increases the likelihood that several jobs will compete for access to data stored on the same computer, while a small K may leave some cluster computers idle if the running jobs do not submit enough tasks. A minimal sketch follows below.
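A sketch of this policy under assumed job objects (the real scheduler also handles root-task placement):

```python
from collections import deque

class AdmissionControl:
    """Run at most K jobs at once; start queued jobs in submission order."""

    def __init__(self, K: int):
        self.K = K
        self.running = set()
        self.waiting = deque()

    def submit(self, job):
        if len(self.running) < self.K:
            self.running.add(job)      # admitted immediately
        else:
            self.waiting.append(job)   # queued until a slot frees up

    def complete(self, job):
        self.running.discard(job)
        if self.waiting:
            self.running.add(self.waiting.popleft())
```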

Queue-Based Scheduling
The queue-based architecture maintains:
- one queue Cm for each computer m in the cluster,
- one queue Rn for each rack n of computers, and
- one cluster-wide queue X.
When a worker task is submitted to the scheduler, it is added to multiple queues: Cm for each computer m on its preferred-computer list, Rn for each rack n on its preferred-rack list, and X for the entire cluster. When a task starts, it is removed from all queues.

Queue-Based Scheduling
When a new job is started, its root task is assigned a computer at random from among the computers that are not currently executing root tasks; any worker task currently running on that computer is killed and re-entered into the scheduler queues as though it had just been submitted. K must be small enough that there are at least K + 1 working computers in the cluster, so that at least one computer is available to execute worker tasks.

Queue-Based Scheduling
Four queue-based algorithm variants: greedy without fairness, fair greedy, fairness with preemption, and sticky slots.

Greedy Without Fairness
Whenever a computer m becomes free, the first ready task on Cm is dispatched to m. If Cm has no ready tasks, the first ready task on Rn is dispatched to m, where n is m's rack. If neither Cm nor Rn contains a ready task, the first ready task on the cluster-wide queue X is dispatched to m.
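The dispatch order can be sketched as follows (assumed queue representations: C, R, and X hold waiting tasks in submission order; is_ready always returns True for this plain greedy variant, and the fair variants below plug in a job-blocking test):

```python
def dispatch(m, rack_of, C, R, X, is_ready=lambda task: True):
    """Pick the next task when computer m becomes free: m's own queue
    first, then m's rack queue, then the cluster-wide queue."""
    for queue in (C[m], R[rack_of[m]], X):
        for task in queue:
            if is_ready(task):
                return task  # the scheduler then removes it from all queues
    return None              # no runnable task anywhere
```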

Greedy Without Fairness
Cons: if a job submits a large number of tasks to every computer's queue, other jobs will not execute any workers until the first job's tasks have run. Moreover, in a loaded cluster, a task that has no preferred computers or racks may wait a long time before being executed anywhere, since there will always be at least one preferred task ready for every computer or rack.

Fair Greedy
Simple fairness: the first ready task is pulled from a queue, where a task is ready only if its job is not blocked. When a job is blocked, its waiting tasks will not be matched to any computers, allowing unblocked jobs to take precedence when starting new tasks. A job j is blocked when it is running Aj or more tasks, where Aj = min(M/K, Nj), M is the number of computers in the cluster, and Nj is the number of workers that j currently has running or queued.
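The blocking test is simple to state in code (a sketch; the integer division is an assumption, since the slide writes M/K):

```python
def allocation(M: int, K: int, N_j: int) -> int:
    """Job j's quota A_j = min(M/K, N_j)."""
    return min(M // K, N_j)

def is_blocked(running_j: int, M: int, K: int, N_j: int) -> bool:
    """Job j is blocked once it is running A_j or more tasks."""
    return running_j >= allocation(M, K, N_j)

# Example: 240 computers, K = 4 admitted jobs, and job j has 100 workers
# running or queued, so A_j = min(60, 100) = 60.
print(is_blocked(60, 240, 4, 100))  # True: j must yield to other jobs
```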

Fairness With Preemption
When a job j is running more than Aj workers, the scheduler kills its over-quota tasks, starting with the most recently scheduled task, to try to minimize wasted work.
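Sketched below (hypothetical task objects carrying a start_time; kill stands in for the kill-and-requeue path described earlier):

```python
def preempt_over_quota(job, A_j: int, kill) -> None:
    """Kill job's over-quota workers, most recently scheduled first."""
    excess = len(job.running) - A_j
    if excess <= 0:
        return
    newest_first = sorted(job.running, key=lambda t: t.start_time, reverse=True)
    for task in newest_first[:excess]:
        kill(task)  # the task re-enters the scheduler queues
```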

Sticky Slots
Consider the steady state in which each job occupies exactly its allocated quota of computers. Whenever a task from job j completes on computer m, j becomes unblocked while all of the other jobs remain blocked; consequently, m is reassigned to one of j's tasks again. This is the "sticky slot" problem.

Sticky Slots Solution
A job j is not unblocked immediately when its number of running tasks falls below Aj. Instead, the scheduler waits to unblock j until either the number of j's running tasks falls below Aj - H or H seconds have passed, where H is a hysteresis margin. In many cases, this delay is sufficient to allow another job's worker, with better locality, to steal computer m.
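The unblocking rule might look like the following sketch (note that, as stated on the slide, H does double duty as both a task-count margin and a time margin in seconds; job.fell_below_quota_at is an assumed timestamp recorded when j first dropped below Aj):

```python
import time

def should_unblock(job, A_j: int, H: float, now: float = None) -> bool:
    """Unblock job j only when it is well under quota or has waited long enough."""
    now = time.time() if now is None else now
    well_under_quota = len(job.running) < A_j - H
    waited_enough = (now - job.fell_below_quota_at) >= H
    return well_under_quota or waited_enough
```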

Flow-Based Scheduling
The primary data structure in this approach is a graph that encodes both the structure of the cluster's network and the set of waiting tasks along with their locality metadata. By assigning appropriate weights and capacities to the edges of this graph, a standard graph solver can convert it into an instantaneous set of scheduling assignments. Every scheduling decision has a quantifiable cost: a data-transfer cost for running a task on a particular computer, and a time cost for killing a task that has already begun execution.

Flow Networks
In a flow network, each edge has a flow and a capacity.

Minimum-Cost Flow Problem
The minimum-cost flow (MCF) problem is to find the cheapest possible way of sending a certain amount of flow through a flow network. The MCF solution represents the job-scheduling decisions that yield the minimum global cost.

Flow-Based Scheduling
S is the sink node through which all flow exits the graph. There is a node Cm for each computer m, a node Rn for each rack n, and a cluster-wide node X.

Flow-Based Scheduling
Each root task has a single outgoing edge to the computer where it is currently running.

Flow-Based Scheduling
Each worker task in job j has an edge to j's unscheduled node Uj, to the cluster node X, and to every rack and computer on its preferred lists.

Flow-Based Scheduling
Workers that are currently executing (shown in gray in the paper's figure) also have an edge to the computer on which they are running.

Flow-Based Scheduling
The cost on the edge from wjn to Cm is a function of the amount of data that would be transferred across m's rack switch and the core switch if wjn were run on computer m.

Flow-Based Scheduling
The cost on the edge from wjn to Rn is the worst-case cost that would result if the task were run on the least favorable computer in the nth rack.

Flow-Based Scheduling
The cost on the edge from wjn to X is the worst-case cost that would result if the task were run on any computer in the cluster. A minimal end-to-end sketch of the resulting graph follows below.
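Putting the pieces together, here is a minimal sketch of such a graph for one job with two workers, one rack, and two computers, solved with networkx's min-cost-flow routine. All edge weights are made-up illustrative costs, not Quincy's actual cost functions:

```python
import networkx as nx

G = nx.DiGraph()

# Two ready workers of job j; each pushes one unit of flow toward the sink S.
workers = ["w_j1", "w_j2"]
for w in workers:
    G.add_node(w, demand=-1)          # each task is a unit of supply
G.add_node("S", demand=len(workers))  # all flow exits through the sink

# Cluster node X, one rack R1, two computers C1 and C2. Capacity 1 on each
# computer-to-sink edge enforces one running task per computer.
G.add_edge("X", "R1", capacity=len(workers), weight=0)
for c in ("C1", "C2"):
    G.add_edge("R1", c, capacity=1, weight=0)
    G.add_edge(c, "S", capacity=1, weight=0)

# Each worker gets an edge to its preferred computer (cheap), to its rack
# (worst case within the rack), to X (worst case anywhere), and to job j's
# unscheduled node U_j (a high penalty for staying queued).
G.add_edge("U_j", "S", capacity=len(workers), weight=0)
for w, preferred_cost in [("w_j1", 1), ("w_j2", 2)]:
    G.add_edge(w, "C1", capacity=1, weight=preferred_cost)
    G.add_edge(w, "R1", capacity=1, weight=5)
    G.add_edge(w, "X", capacity=1, weight=8)
    G.add_edge(w, "U_j", capacity=1, weight=50)

# The minimum-cost flow is the assignment with minimum global cost: w_j1
# lands on its preferred computer C1, and w_j2 is routed through the rack
# node R1 to the remaining computer C2.
flow = nx.min_cost_flow(G)
for w in workers:
    print(w, "->", [dst for dst, f in flow[w].items() if f > 0])
```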

Experiment
The authors conducted experiments comparing the queue-based algorithms with the flow-based (Quincy) algorithm. Quincy with preemption performed best in both unconstrained and constrained networks. Overall, flow-based scheduling reduced traffic through the core switch of a hierarchical network by a factor of three while simultaneously increasing the throughput of the cluster.

Credit
Modified version of: www.cse.unl.edu/~ylu/csce990/notes/Quincy_Weiyue.ppt

Evaluation
Typical Dryad jobs were used (Sort, Join, PageRank, WordCount, Prime), with Prime serving as a worst-case job that hogs the cluster if started first. The cluster contained 240 computers in 8 racks of 29-31 computers each. More than one metric was used for evaluation.

Experiments
[Figures: five slides of experimental results, including makespan when the network is the bottleneck and total data transfer (TB).]

Conclusion
Quincy contributes a new computational model for data-intensive computing and an elegant mapping of scheduling to a min-cost flow/matching problem.

Discussion
Open issues: the evaluation assumes a homogeneous environment; the centralized Quincy controller is a single point of failure; there is no theoretical stability guarantee; and the cost measure must balance fairness against the cost of killing tasks.