1
Outline for Today's Lecture. Objective for today: real-time scheduling, the Linux scheduler, advanced scheduling topics. Administrative: no class Thursday, but discussion sections will go on as usual.
2
Real Time Schedulers Real-time schedulers must support regular, periodic execution of tasks (e.g., continuous media). CPU Reservations “I need to execute for X out of every Y units.” Scheduler exercises admission control at reservation time: application must handle failure of a reservation request. Time Constraints “Run this before my deadline at time T.” Proportional Sharing “I need X/N of the resources.”
3
Assumptions. Tasks are periodic with a constant interval T_i between requests (request rate 1/T_i). Each task must be completed before its next request occurs. Tasks are independent. The run time for each task is a constant (maximum) C_i. Any non-periodic tasks are special.
4
Task Model. [Timeline figure: two periodic tasks t1 and t2, with periods T1 and T2 and constant run times C1 = 1 and C2 = 1.]
5
Definitions. Deadline: the time of the next request. Overflow at time t: t is the deadline of an unfulfilled request. Feasible schedule: for a given set of tasks, a scheduling algorithm produces a schedule such that no overflow ever occurs. Critical instant for a task: the time at which a request will have the largest response time; it occurs when the task is requested simultaneously with all tasks of higher priority.
6
Rate Monotonic. Assign priorities to tasks according to their request rates, independent of run times. Optimal in the sense that no other fixed-priority assignment rule can schedule a task set that cannot be scheduled by rate monotonic: if a feasible (fixed) priority assignment exists for some task set, rate monotonic is feasible for that task set.
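The assignment rule can be sketched in a few lines; the task names, periods, and run times below are illustrative, not from the slide.

```python
# Rate-monotonic priority assignment: the task with the shorter period
# (higher request rate) gets the higher priority, independent of run time.
def rate_monotonic_order(tasks):
    """Return tasks sorted highest priority first (smallest period T)."""
    return sorted(tasks, key=lambda task: task["T"])

tasks = [{"name": "t2", "T": 5, "C": 1},   # period 5, run time 1
         {"name": "t1", "T": 2, "C": 1}]   # period 2, run time 1
print([t["name"] for t in rate_monotonic_order(tasks)])   # ['t1', 't2']
```

t1 ranks first because its request rate 1/2 exceeds t2's rate 1/5.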
7
Task Model. [Timeline figure: tasks t1 and t2 with C1 = C2 = 1; with periods 2 and 3, rate monotonic favors the higher request rate, 1/2 vs. 1/3.]
8
Earliest Deadline First. A dynamic algorithm: priorities are assigned to tasks according to the deadlines of their current requests. With EDF there is no idle time prior to an overflow. For a given set of m tasks, EDF is feasible iff C_1/T_1 + C_2/T_2 + ... + C_m/T_m <= 1. If a set of tasks can be scheduled by any algorithm, it can be scheduled by EDF.
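The feasibility bound can be checked directly; this is a minimal sketch, with made-up (C, T) pairs.

```python
def edf_feasible(tasks):
    """EDF feasibility test for periodic tasks: sum of C_i / T_i <= 1."""
    return sum(c / t for c, t in tasks) <= 1

# (C, T) pairs: run time and period, illustrative values.
print(edf_feasible([(1, 2), (1, 5)]))   # 0.5 + 0.2 = 0.7 -> True
print(edf_feasible([(1, 2), (2, 3)]))   # 0.5 + 0.667 > 1  -> False
```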
9
Task Model. [Timeline figure: the same tasks t1 and t2 (C1 = C2 = 1) scheduled by EDF; annotated "8 vs. 9".]
10
Proportional Share. Goals: to integrate real-time and non-real-time tasks, to police ill-behaved tasks, and to give every process a well-defined share of the processor. Each client i gets a weight w_i. Instantaneous share: f_i(t) = w_i / sum_j w_j. Service time, when f_i is constant over the interval: S_i(t0, t1) = f_i(t) * dt. The set of active clients varies, so f_i varies over time; in general S_i(t0, t1) = integral from t0 to t1 of f_i(t) dt. The difference is the service time error E_A = S_A(t0, t1) - f_A(t) * dt: positive means a client received more than its share, negative means less.
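The instantaneous-share definition is simple to compute; the client names and weights below are made up for illustration.

```python
# Instantaneous share f_i = w_i / (sum of weights of active clients).
def shares(weights):
    total = sum(weights.values())
    return {client: w / total for client, w in weights.items()}

active = {"A": 3, "B": 1}      # hypothetical weights
f = shares(active)
print(f["A"])                  # 0.75
# Over an interval of length dt with a constant active set, client i is
# entitled to f_i * dt; the error E_i is actual service minus f_i * dt.
entitlement_A = f["A"] * 4     # entitlement over dt = 4 quanta
print(entitlement_A)           # 3.0
```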
11
Common Proportional Share Competitors. Weighted Round Robin: RR with quantum times equal to share. Fair Share: adjustments to priorities to reflect share allocation (compatible with multilevel feedback algorithms, as in Linux). [Figure: RR vs. WRR schedule timelines.]
14
Common Proportional Share Competitors. Fair Queuing / Weighted Fair Queuing, Stride scheduling. VT, virtual time, advances at a rate proportional to share: VT_A(t) = S_A(t) / w_A. VFT, virtual finishing time: the VT a client would have after executing its next time quantum. WFQ schedules the client with the smallest VFT; the error E_A never drops below -1. [Figure: three clients with VFTs 3/3, 3/2, and 2/1.]
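The smallest-VFT rule can be sketched as a toy stride/WFQ scheduler; the weights and quantum count are assumptions, not the slide's example.

```python
# Minimal WFQ/stride-style sketch: each client's virtual time advances by
# quantum/weight when it runs, and the scheduler always picks the client
# with the smallest virtual finishing time (VFT).
def schedule(weights, quanta, quantum=1):
    vt = {c: 0.0 for c in weights}            # virtual time per client
    order = []
    for _ in range(quanta):
        # VFT = virtual time the client would have after one more quantum
        client = min(weights, key=lambda c: vt[c] + quantum / weights[c])
        vt[client] += quantum / weights[client]
        order.append(client)
    return order

order = schedule({"A": 3, "B": 2, "C": 1}, 6)
print(order)   # A runs 3 of 6 quanta, B runs 2, C runs 1
```

Over the six quanta the clients receive service in exact proportion to their weights 3:2:1, which is the point of the technique.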
15
Lottery Scheduling. Lottery scheduling [Waldspurger96] is another scheduling technique: an elegant approach to periodic execution, priority, and proportional resource allocation. Give W_p "lottery tickets" to each process p. GetNextToRun selects the "winning ticket" randomly. If the sum of all W_p is N, then each process gets CPU share W_p/N... probabilistically, and over a sufficiently long time interval. Flexible: tickets are transferable, allowing application-level adjustment of CPU shares. Simple, clean, fast. Random choices are often a simple and efficient way to produce the desired overall behavior (probabilistically).
16
Example: List-based Lottery. Clients hold 10, 2, 5, 1, and 2 tickets, so T = 20. Draw Random(0, 19) = 15. Summing down the list gives running totals 10, 12, 17; the total first exceeds 15 at the third client, which wins.
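The walk down the ticket list can be sketched directly; the client names A through E are made up, only the ticket counts come from the slide.

```python
import random

def draw(tickets, rng=random):
    """List-based lottery: sum ticket counts down the list until the
    running total passes the winning number."""
    total = sum(count for _, count in tickets)
    winner = rng.randrange(total)        # slide's example: Random(0, 19) = 15
    running = 0
    for client, count in tickets:
        running += count
        if winner < running:
            return client

# Ticket list from the slide: 10, 2, 5, 1, 2 (total T = 20).
tickets = [("A", 10), ("B", 2), ("C", 5), ("D", 1), ("E", 2)]

class Fixed:                             # stub RNG to replay the slide's draw
    def randrange(self, n):
        return 15

print(draw(tickets, Fixed()))            # partial sums 10, 12, 17 -> C wins
```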
17
Bells and Whistles. Ticket transfers: objects that can be explicitly passed in messages; can be used to solve priority inversion. Ticket inflation: creating more tickets, used among mutually trusting clients to dynamically adjust ticket allocations. Currencies: "local" control, with exchange rates. Compensation tickets: to maintain share with I/O, a client that uses only a fraction f of its quantum has its tickets inflated by 1/f in the next round.
18
Linux Scheduling
19
Linux Scheduling Policy. The runnable process with the highest priority and timeslice remaining runs (SCHED_OTHER policy). Priority is dynamically calculated: it starts with the nice value, plus a bonus or penalty reflecting whether the process is I/O-bound or compute-bound, determined by tracking sleep time vs. runnable time (sleep_avg, decremented by the timer tick while running).
20
Linux Scheduling Policy. The timeslice is dynamically calculated as well: the higher the dynamic priority, the longer the timeslice, ranging from 10 ms (low priority, less interactive) through 150 ms to 300 ms (high priority, more interactive). Timeslices are recalculated every round, when the "expired" and "active" arrays swap. Exception for expired interactive tasks: they go back on the active array unless there are starving expired tasks.
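The priority-to-timeslice idea can be sketched as follows. This is not the kernel's actual formula: the linear interpolation and the 0 (highest) to 39 (lowest) priority range are assumptions; only the 10 ms and 300 ms endpoints come from the slide.

```python
# Illustrative mapping: interpolate the timeslice linearly between the
# slide's 10 ms (lowest priority) and 300 ms (highest priority) endpoints.
MIN_SLICE_MS, MAX_SLICE_MS = 10, 300

def timeslice_ms(prio, lowest=39):
    scale = (lowest - prio) / lowest      # 1.0 at highest prio, 0.0 at lowest
    return MIN_SLICE_MS + (MAX_SLICE_MS - MIN_SLICE_MS) * scale

print(timeslice_ms(0))    # highest priority -> 300.0
print(timeslice_ms(39))   # lowest priority  -> 10.0
```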
21
Linux task_struct. A process descriptor in kernel memory represents each process (allocated at process creation, deallocated at termination). In Linux this is the task_struct, located via the task pointer in the thread_info structure on the process's kernel stack. Fields include state, prio, policy, static_prio, sleep_avg, time_slice, and so on.
22
Runqueue for O(1) Scheduler. [Figure: per-CPU "active" and "expired" priority arrays, each an array of per-priority queues. Higher priority, more I/O-bound tasks get timeslices up to 300 ms; lower priority, more CPU-bound tasks get down to 10 ms.]
25
Linux Real-time. No guarantees. SCHED_FIFO: static priority, effectively higher than SCHED_OTHER processes (although their priority number ranges overlap); no timeslice, so a task runs until it blocks or yields voluntarily; RR within the same priority level. SCHED_RR: as above, but with a timeslice.
26
Beyond “Ordinary” Uniprocessors Multiprocessors Co-scheduling and gang scheduling Hungry puppy task scheduling Load balancing Networks of Workstations Harvesting Idle Resources - remote execution and process migration Laptops and mobile computers Power management to extend battery life, scaling processor speed/voltage to tasks at hand, sleep and idle modes.
27
Multiprocessor Scheduling. What makes the problem different? The workload consists of parallel programs: multiple processes or threads, synchronized and communicating, with latency defined by the last piece to finish. Time-sharing and/or space-sharing (partitioning up the MP nodes); both when and where a process should run.
28
Architectures. [Figure: a symmetric multiprocessor (four processors, each with a cache, sharing a bus to memory) vs. a NUMA cluster (nodes 0 through 3, each with a processor, cache, memory, and communication assist, connected by an interconnect).]
29
Affinity Scheduling. Where (on which node) should a particular thread run during the next time slice? From the processor's point of view: favor processes that have some residual state locally (e.g., in cache). What is a useful measure of affinity for deciding this? Least intervening time or intervening activity (the number of processes run here since "my" last time), or simply the same place as last time "I" ran. Possible negative effect on load balance.
30
Linux Support for SMP. Every processor has its own private runqueue. Locking: a spinlock protects each runqueue. Load balancing: load_balance pulls tasks from the busiest runqueue into mine; it runs from schedule() when the runqueue is empty, and periodically, especially during idle. It prefers to pull processes that are expired, not cache-hot, high priority, and allowed by affinity. Affinity: the cpus_allowed bitmask constrains a process to a particular set of processors.
31
Processor Partitioning. Static or dynamic. Process Control (Gupta): vary the number of processors available to match the number of processes to processors, adjusting the number at runtime. Works with a task-queue or threads programming model; suspend and resume are the responsibility of the application's runtime package. Impact on the "working set".
32
Process Control Claims. [Figure: typical speed-up profile, speedup vs. number of processes per application. Speedup climbs while parallelism and working sets fit in memory, peaks at a "magic point", then degrades due to lock contention, memory contention, context switching, and cache corruption.]
33
Co-Scheduling John Ousterhout (Medusa OS) Time-sharing model Schedule related threads simultaneously Why? Common state and coordination How? Local scheduling decisions after some global initialization (Medusa) Centralized (SGI IRIX)
34
Effect of Workload. Impact of communication and cooperation. Issues: context switches and lock contention count against co-scheduling; shared common state and coordination count in its favor.
35
CM*'s Version. A matrix of S (slices) x P (processors). Allocate a new set of processes (a task force) to a row with enough empty slots. Schedule: round robin through the rows of the matrix. If during a time slice this processor's element is empty or not ready, run some other task force's entry in this column, going backward in time (for affinity reasons, and as a purely local fall-back decision).
36
Networks of Workstations What makes the problem different? Exploiting otherwise “idle” cycles. Notion of ownership associated with workstation. Global truth is harder to come by in wide area context
37
Harvesting Idle Cycles. Remote execution on an idle processor in a NOW (network of workstations): finding the idle machine and starting execution there (related to load-balancing work), and vacating the remote workstation when its user returns and it is no longer idle (process migration).
38
Issues Why? Which tasks are candidates for remote execution? Where to find processing cycles? What does “idle” mean? When should a task be moved? How?
39
Motivation for Cycle Sharing. Load imbalances: parallel program completion time is determined by the slowest thread, so speedup is limited. Utilization: the trend from shared mainframes to networks of workstations moved us from scheduled cycles to statically allocated cycles, with an "ownership" model. Heterogeneity.
40
Which Tasks? Explicit submission to a "batch" scheduler (e.g., Condor), or transparent to the user. A task should be demanding enough to justify the overhead of moving elsewhere. Properties? Proximity of resources, for example moving query processing to the site of the database records; cache affinity.
41
Finding Destination Defining “idle” workstations Keyboard/mouse events? CPU load? How timely and complete is the load information (given message transit times)? Global view maintained by some central manager with local daemons reporting status. Limited negotiation with a few peers How binding is any offer of free cycles? Task requirements must match machine capabilities
42
When to Move At task invocation. Process is created and run at chosen destination. Process migration, once task is already running at some node. State must move. For adjusting load balance (generally not done) On arrival of workstation’s owner (vacate, when no longer idle)
43
How - Negotiation Phase Condor example: Central manager with each machine reporting status, properties (e.g. architecture, OS). Regular match of submitted tasks against available resources. Decentralized example: select peer and ask if load is below threshold. If agreement to accept work, send task. Otherwise keep asking around (until probe limit reached).
44
How - Execution Phase. Issue: the execution environment. File access, possibly without the user having an account on the destination machine or a network file system to provide access to the user's files. UIDs? Remote system calls (Condor): on the original (submitting) machine, run a "shadow" process (which runs as the user); all system calls made by the task at the remote site are "caught" and a message is sent to the shadow.
45
Remote System Calls. [Figure: on the submitting machine, the shadow process runs user code with regular syscall stubs into the OS kernel; on the executing machine, the remote job's calls go through remote syscall stubs and remote syscall code that forward each call back to the shadow.]
46
How - Process Migration Checkpointing current execution state (both for recovery and for migration) Generic representation for heterogeneity? Condor has a checkpoint file containing register state, memory image, open file descriptors, etc. Checkpoint can be returned to Condor job queue. Mach - package up processor state, let memory working set be demand paged into new site. Messages in-flight?
47
Applying Scheduling to Power Management of the CPU
48
Relationships Power (watts) = Voltage (volts) * Current (amps) Power (watts) = Energy (Joules) / Time (sec) Energy (Joules) = Power (watts) * Time (sec) Energy (Joules) = Voltage (volts) * Charge (coulombs) Current (amps) = Voltage (volts) / Resistance (ohms)
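These identities can be exercised with a quick calculation; the 3.3 V, 0.5 A, 10 s, and 6.6 ohm figures are assumed values for illustration.

```python
# Checking the power/energy relationships with assumed numbers.
voltage, current, seconds = 3.3, 0.5, 10.0
power = voltage * current          # P = V * I, in watts
energy = power * seconds           # E = P * t, in joules
current_check = voltage / 6.6      # I = V / R, with an assumed 6.6 ohm load
print(power, energy, current_check)
```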
49
Dynamic Voltage Scaling. The CPU can run at different clock frequencies/voltages. Voltage-scalable processors: StrongARM SA-2 (500 mW at 600 MHz; 40 mW at 150 MHz), Intel XScale, AMD Mobile K6 Plus, Transmeta. Power is proportional to V^2 x F. Energy is affected favorably by lower power, unfavorably by increased time.
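The power/time trade-off can be seen with the slide's StrongARM-like numbers; the 150 million cycles of work is an assumed figure.

```python
# Energy for a fixed amount of work at two power/frequency operating points.
# Running slower at lower voltage takes longer but can cost less energy,
# since power falls with V^2 * F faster than time grows.
def energy_mj(power_mw, cycles, freq_mhz):
    seconds = cycles / (freq_mhz * 1e6)
    return power_mw * seconds        # mW * s = mJ

work = 150e6                          # cycles of work (assumed)
print(energy_mj(500, work, 600))      # fast mode: 125.0 mJ in 0.25 s
print(energy_mj(40, work, 150))       # slow mode: 40.0 mJ in 1.0 s
```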
50
Dynamic Voltage Scheduling Questions addressed by the scheduler: Which process to run When to run it How long to run it for How fast to run the CPU while it runs Intuitive goal - fill “soft idle” times with slow computation
51
Background Work in DVS Interval scheduling Based on observed processor utilization “general purpose” -- no deadlines assumed by the system Predicting patterns of behavior to squeeze out idle times. Worst-case real-time schedulers (Earliest Deadline First) Stretch the work to smoothly fill the period without missing deadlines (without inordinate transitioning).
52
Interval Scheduling (adjust the clock based on a past window; no process reordering involved). Weiser et al. Algorithms (when to adjust): Past, AVG_N. Stepping (how much): One, Double, Peg (min or max), based on unfinished work during the previous interval. [Figure: CPU load and clock speed over time.]
53
Implementation of Interval Scheduling Algorithms. Issues: capturing a utilization measure; starting with no a priori information about applications and needing to dynamically infer or predict behavior (patterns / "deadlines" / constraints?). Idle process or "real" process: usually each quantum is either 100% idle or busy. AVG_N, the weighted utilization at time t: W_t = (N * W_{t-1} + U_{t-1}) / (N + 1). Inelastic performance constraints: we don't want to allow the user to see any performance degradation.
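The AVG_N update can be sketched directly; starting from W_0 = 0 is an assumption (the slide does not specify the initial value), and the utilization trace is made up.

```python
def avg_n(utilizations, n):
    """Weighted utilization: W_t = (N * W_{t-1} + U_{t-1}) / (N + 1)."""
    w = 0.0                    # assumed initial value W_0 = 0
    history = []
    for u in utilizations:
        w = (n * w + u) / (n + 1)
        history.append(w)
    return history

# A step from idle (0) to busy (1): the average ramps toward 1 slowly,
# illustrating the sluggishness the Results slide complains about.
print(avg_n([0, 0, 1, 1, 1], n=3))
```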
54
Results. It is hard to find any discernible patterns in "real" applications; prediction is better at larger time scales (corresponding to larger windows in AVG_N), but then the system becomes unresponsive. Poor coupling between the adaptive decisions of applications themselves and system decision-making (example: an MPEG player that can either block or spin); we NEED application-supplied information. Simple averaging shows asymmetric behavior: the clock rate drops faster than it ramps up, and AVG_N may not stabilize on the "right" clock speed (oscillations).
55
Earliest Deadline First DVS. [Timeline figure: tasks t1 and t2 (C1 = C2 = 1) under EDF-based voltage scaling; the slide pair shows the schedule at full speed with idle gaps, then with work stretched to fill them.]
57
EDF-based DVS Algorithm. Sort tasks in EDF order. Invoked when a thread is added or removed, or a deadline is reached; includes non-runnable threads in the scheduling decision. Set speed = max over i <= n of ( sum over j <= i of work_j ) / (deadline_i - current_time), where the work estimates use an exponential moving average.
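The speed rule can be sketched as follows; the (work, deadline) pairs are hypothetical, and work is treated as a given estimate rather than an exponential moving average.

```python
def required_speed(tasks, now=0.0):
    """tasks: (work, deadline) pairs. Sort EDF, then take the max over
    prefixes of cumulative work divided by time remaining to that deadline."""
    tasks = sorted(tasks, key=lambda t: t[1])       # earliest deadline first
    total, speed = 0.0, 0.0
    for work, deadline in tasks:
        total += work
        speed = max(speed, total / (deadline - now))
    return speed

# Hypothetical threads: 2 units of work due at t=4, 3 more due at t=10.
print(required_speed([(3, 10), (2, 4)]))   # max(2/4, 5/10) = 0.5
```

Running at half speed meets both deadlines exactly, filling the "soft idle" time with slow computation as the earlier slide suggests.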