1
Linux Kernel 2.6 Scheduler: O(1) from O(n). A B.Sc. Seminar in Software Engineering. Student: Maxim Raskin. Lecturer: Dr. Itzhak Aviv. Tel Aviv Afeka College of Engineering, Semester B, 2010.
2
Created by Finnish computer science student Linus Torvalds (hence, Linux) in 1991 as a hobby project to experiment with the Intel 80386 CPU. Kernel code is mostly C with some assembler. The first version contained 10,239 lines of code; nowadays it spans 12,990,041 (!). 1992 marked the year the code became self-hosted. The X Window System GUI was added to the OS. Linux is distributed under the GPL license to keep it out of commercial clutches.
3
The Linux kernel is a monolithic kernel: the OS operates in a privileged domain called Kernel Mode (or Kernel Space). User code and OS code run in separate spaces, so kernel code is protected using hardware-dependent CPU flags. The kernel handles the following routine tasks: Handling drivers - communicating with the hardware. File system management. Memory management. Scheduling of tasks.
4
Today, the Linux kernel spans a variety of distros for all kinds of architectures: Lightweight Netbooks Mobile Devices Music Players Desktop Computers Servers Cluster Computing Various Other Embedded Systems
5
Programs and Processes A program is a container of data and instructions that give it meaning. A process is an instance of a program (as an object is to a class in OOP). The purpose of a process is to embody the runtime state of a program: its threads and their data, hardware registers, and the contents of the program's address space. (Diagram: a program instantiated as Process 1 through Process N, each holding Thread 0, Thread 1, and Data.)
6
Threads A process can have multiple threads of execution that work together to accomplish its goals. An OS kernel must keep state information for every thread it creates. Threads of the same process usually share one address space, though their views of it may overlap only partially or be completely separate. Only one thread can execute on a given CPU at a time. Threads can be seen in almost every video game: multiple threads handle the various modules of the game, one for user input, one for graphics, one for sound processing, and another for AI.
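To make this concrete, here is a minimal userspace sketch (not from the original deck) of one process running two POSIX threads that share its address space; the thread names and the shared counter are purely illustrative:

#include <pthread.h>
#include <stdio.h>

static int shared_counter = 0;  /* one copy, visible to every thread of the process */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    pthread_mutex_lock(&lock);          /* serialize access to the shared data */
    shared_counter++;
    printf("thread %s sees shared_counter = %d\n", (const char *)arg, shared_counter);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)"input");
    pthread_create(&t2, NULL, worker, (void *)"graphics");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;   /* compile with: gcc threads.c -lpthread */
}

Both threads update the same shared_counter because they live in one address space; two separate processes would each see their own private copy.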
7
Scheduling A multitasking kernel allows multiple processes to run alongside one another: Processes are not aware of each other unless programmed to be so. A scheduler frees the programmer's mind from the tedious task of scheduling and focuses it on actual programming (imagine how "fun" it would have been otherwise). The scheduler's goal is to decide the policy for each thread in the system: How long will it run? On which CPU? At what priority? The scheduler itself is a thread; it is scheduled routinely via a timer interrupt.
8
CPU and I/O Bound Threads Threads divide into two categories: CPU Bound - threads that perform computations heavily dependent on the CPU's resources. I/O Bound - threads that wait on some I/O device; these threads are CPU friendly since they spend most of their time sleeping. A scheduler cannot know for sure which thread is CPU bound and which is I/O bound, but it can estimate with reasonable precision. The common practice is to prioritize I/O bound tasks: since they spend a long time waiting on their resource, it is best to start them as soon as possible.
9
Real World Needs In the real world, beyond theory, a scheduler's goal is not to blindly follow a set of mathematically proven algorithms, but to adjust itself to the needs of its target market through real experience. Linux has gone far beyond its 80386 origins; nowadays its code has to support desktops, servers, embedded systems, cluster computers, and NUMA environments. The Linux scheduler has been tailored accordingly.
10
Core Requirements Efficiency - the fewer context switches, the better. Interactivity - responsiveness. Preventing starvation - fairness among threads. Support for soft RT (real time) tasks.
11
Multiprocessing: SMP, SMT and NUMA The era of multiprocessing is at hand; 2.6 includes support for: SMP - Symmetric MultiProcessing - several CPUs (or cores on the same die) sharing memory. The goals: efficient division of workload, and keeping a thread running on the CPU it started on to avoid re-caching. SMT - Simultaneous MultiThreading - a concept first implemented by Intel, which in turn named it HyperThreading (HT). One physical CPU presents multiple virtual CPUs. The caveat: virtual cores cannot be treated as true separate cores, as they share the same cache resources. NUMA - Non-Uniform Memory Access - a clustering technique involving tight coupling of nodes (node = motherboard) and a high volume of CPUs. The major problem: memory locality - threads should keep residence on the node they started on so efficiency will not suffer.
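Since keeping a thread on one CPU is a stated goal, here is a minimal userspace sketch (for illustration, not from the deck) of pinning the calling process to one CPU with the Linux affinity API; CPU 0 is an arbitrary choice:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(0, &mask);   /* allow only CPU 0 (arbitrary for the example) */

    /* pid 0 means the calling process */
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to CPU 0; the scheduler will not migrate this task\n");
    return 0;
}

The scheduler honors this mask, which is also why the load balancer (discussed later) checks processor affinity before migrating a task.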
12
Introduction: O(1) The 2.6.x scheduler guarantees constant O(1) runtime thanks to the O(1) algorithms it is composed of. This is a vast improvement over the 2.4 scheduler, which ran at O(n): the 2.6 scheduler's runtime is independent of the number of tasks in the system.
13
Introduction: Priority & Timeslices Process priority is dynamic: Threads that use up all their timeslices are considered CPU bound and get little to no priority boost. Threads that sleep a lot are considered nice to the CPU and in turn gain a priority boost. Priority levels are divided into 2 ranges with a total of 140 levels: 1. nice value - [-20,+19], default: 0 (lower number = higher priority). 2. real-time priority - [0,99] (again, lower number = higher priority). *Timeslice - the time in milliseconds for which a given thread may run until it is preempted. Deciding the perfect length of a timeslice is not a trivial task: * The longer the timeslice, the less interactive the system can become. * Too-short timeslices waste heavy resources on context switches. * Timeslices don't have to be exhausted all at once - e.g. a 100ms timeslice can be consumed over 5 different reschedules, 20ms each. * Large timeslices benefit interactive tasks - they run first and foremost, and they won't really use up the whole timeslice anyway. * A process that has exhausted its timeslice is not eligible to run until all other processes have exhausted theirs.
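As a small illustration of the nice range, a task can lower its own priority from userspace (a sketch, not from the deck; the value 10 is arbitrary):

#include <unistd.h>
#include <stdio.h>
#include <errno.h>

int main(void)
{
    /* Add 10 to our nice value (positive = lower priority). Only a
       privileged process may decrease its nice value. */
    errno = 0;
    int new_nice = nice(10);
    if (new_nice == -1 && errno != 0) {  /* -1 is a valid return, so check errno */
        perror("nice");
        return 1;
    }
    printf("now running at nice %d\n", new_nice);
    return 0;
}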
14
Introduction: Preemption The Linux operating system is preemptive. The scheduler executes in the following scenarios: A process enters the TASK_RUNNING state and its priority is higher than that of the current process. A process has exhausted its timeslice (= 0).
15
Code Orientation Code is located in kernel/sched.c and include/linux/sched.h. Interesting facts: Some types are declared within sched.c rather than sched.h; this is done to keep the scheduler's private types abstracted away. System calls that are public are found in sched.h.
16
Runqueue Monitors running/expired tasks for each CPU (1 runqueue per CPU). Declared in kernel/sched.c:

struct runqueue {
    spinlock_t lock;                        /* spin lock that protects this runqueue */
    unsigned long nr_running;               /* number of runnable tasks */
    unsigned long nr_switches;              /* context switch count */
    unsigned long expired_timestamp;        /* time of last array swap */
    unsigned long nr_uninterruptible;       /* uninterruptible tasks */
    unsigned long long timestamp_last_tick; /* last scheduler tick */
    struct task_struct *curr;               /* currently running task */
    struct task_struct *idle;               /* this processor's idle task */
    struct mm_struct *prev_mm;              /* mm_struct of last ran task */
    struct prio_array *active;              /* active priority array */
    struct prio_array *expired;             /* the expired priority array */
    struct prio_array arrays[2];            /* the actual priority arrays */
    struct task_struct *migration_thread;   /* migration thread */
    struct list_head migration_queue;       /* migration queue */
    atomic_t nr_iowait;                     /* number of tasks waiting on I/O */
};
17
Runqueue - Safety A runqueue can be accessed from multiple threads, but only one thread may access it at a time; locks are used to obtain thread safety:

struct runqueue *rq;
rq = this_rq_lock();
/* manipulate this process's current runqueue, rq */
rq_unlock(rq);

To avoid deadlocks when accessing multiple runqueues, a convention is used: locks are always obtained in order of ascending runqueue address:

/* to lock... */
if (rq1 == rq2)
    spin_lock(&rq1->lock);
else {
    if (rq1 < rq2) {
        spin_lock(&rq1->lock);
        spin_lock(&rq2->lock);
    } else {
        spin_lock(&rq2->lock);
        spin_lock(&rq1->lock);
    }
}

/* manipulate both runqueues... */

/* to unlock... */
spin_unlock(&rq1->lock);
if (rq1 != rq2)
    spin_unlock(&rq2->lock);

To automate the steps above, the functions double_rq_lock(rq1, rq2) and double_rq_unlock(rq1, rq2) can be used.
18
Priority Arrays Priority arrays are the data structures that provide O(1) scheduling by mapping each runnable task to a priority queue. Each runqueue contains pointers to 2 priority array objects: active and expired. Priority arrays are defined in kernel/sched.c:

struct prio_array {
    int nr_active;                      /* number of tasks in the queues */
    unsigned long bitmap[BITMAP_SIZE];  /* priority bitmap */
    struct list_head queue[MAX_PRIO];   /* priority queues */
};

queue[MAX_PRIO] - an array of queues (linked lists), 1 queue per priority level (MAX_PRIO == 140 by default). bitmap[BITMAP_SIZE] - a map of bits (of which MAX_PRIO == 140 are used); a bit is set whenever at least one task exists at the given priority level. * The bitmap makes it easy to find the highest-priority runnable task via sched_find_first_bit(), which operates in O(1). nr_active - the total number of tasks across all the priority queues.
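To illustrate why the bitmap lookup is O(1), here is a simplified userspace sketch of the idea behind sched_find_first_bit(). This is not the kernel's implementation; the helper name and the GCC builtin __builtin_ctzl are illustrative assumptions:

#include <stdio.h>

#define MAX_PRIO      140
#define BITS_PER_LONG (8 * sizeof(unsigned long))
#define BITMAP_SIZE   ((MAX_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Return the index of the first set bit: the numerically lowest (i.e.
 * highest) priority with a runnable task. The loop runs at most
 * BITMAP_SIZE (a constant) times, hence O(1). */
static int find_first_bit(const unsigned long *bitmap)
{
    for (unsigned long i = 0; i < BITMAP_SIZE; i++)
        if (bitmap[i])
            return i * BITS_PER_LONG + __builtin_ctzl(bitmap[i]);
    return MAX_PRIO; /* nothing runnable at any priority */
}

int main(void)
{
    unsigned long bitmap[BITMAP_SIZE] = { 0 };

    /* mark priorities 115 and 130 as having runnable tasks */
    bitmap[115 / BITS_PER_LONG] |= 1UL << (115 % BITS_PER_LONG);
    bitmap[130 / BITS_PER_LONG] |= 1UL << (130 % BITS_PER_LONG);

    printf("next priority to serve: %d\n", find_first_bit(bitmap)); /* prints 115 */
    return 0;
}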
19
Recalculating Timeslices – The Past Here is the naive (pre-2.6.x) algorithm:

for (each task in the system) {
    recalculate priority
    recalculate timeslice
}

Issues with the above algorithm: Worst-case O(n) complexity (n = number of tasks in the system). Recalculation must be performed under some kind of lock protection, resulting in high lock contention. Nondeterministic recalculation, which causes problems for deterministic real-time tasks.
20
Recalculating Timeslices – The Present

#define BASE_TIMESLICE(p) (MIN_TIMESLICE + \
        ((MAX_TIMESLICE - MIN_TIMESLICE) * \
        (MAX_PRIO-1 - (p)->static_prio) / (MAX_USER_PRIO-1)))

Table 2: Constants and the task_struct->static_prio range

Constant                        Value
MIN_TIMESLICE                   5 ms
DEF_TIMESLICE                   100 ms
MAX_TIMESLICE                   800 ms
MAX_USER_PRIO                   40
MAX_PRIO                        140
task_struct->static_prio range  [100, 139]

A simple O(1) call to BASE_TIMESLICE(task) does the trick. Timeslice calculation also has unlimited-forking protection: when a child is forked, the parent splits its remaining timeslice with the child.
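To sanity-check the formula, here is a userspace restatement of BASE_TIMESLICE (written for illustration, taking static_prio directly instead of a task pointer); it shows that the interpolation endpoints match the constants in the table:

#include <stdio.h>

#define MIN_TIMESLICE  5    /* ms, from the table above */
#define MAX_TIMESLICE  800  /* ms */
#define MAX_PRIO       140
#define MAX_USER_PRIO  40

/* Same linear interpolation as BASE_TIMESLICE: static_prio 100 (the
 * highest user priority) maps to the longest slice, 139 to the shortest. */
static int base_timeslice(int static_prio)
{
    return MIN_TIMESLICE + (MAX_TIMESLICE - MIN_TIMESLICE) *
           (MAX_PRIO - 1 - static_prio) / (MAX_USER_PRIO - 1);
}

int main(void)
{
    printf("static_prio 100 -> %d ms\n", base_timeslice(100)); /* 800 = MAX_TIMESLICE */
    printf("static_prio 139 -> %d ms\n", base_timeslice(139)); /* 5 = MIN_TIMESLICE */
    return 0;
}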
21
Recalculating Timeslices – The Present The new scheduler does away with the recalculation loop. Instead, it maintains two priority arrays for each CPU (located in its runqueue): active - contains all tasks that still have timeslice left. expired - contains all tasks that have exhausted their timeslice. When a task's timeslice reaches zero, the timeslice is recalculated and the task is moved to the expired array; once all tasks have exhausted their timeslices, the two arrays are simply swapped:

struct prio_array *array = rq->active;
if (!array->nr_active) {
    rq->active = rq->expired;
    rq->expired = array;
}
22
Calculating Priority As you may recall, interactive tasks get priority; how can the scheduler tell the difference? It keeps a per-task sleep average:

task->sleep_avg += sleep_time; /* sleep_avg is bounded by MAX_SLEEP_AVG (10ms by default) */
task->sleep_avg -= run_time;

The scheduler uses this value to reward or penalize a task's dynamic priority; the penalty/reward range is [-5,+5]:

#define CURRENT_BONUS(p) \
    (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / MAX_SLEEP_AVG)

int effective_prio(task_struct *p)
{
    /* compute bonus from sleep_avg (see CURRENT_BONUS above) */
    bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
    prio = p->static_prio - bonus; /* subtracting the bonus raises priority (lower number) */
}
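A worked example with illustrative numbers, assuming MAX_BONUS = 10 (consistent with the ±5 range above): a nice-0 task has static_prio = 120. If it has slept enough that CURRENT_BONUS(p) evaluates to 8, then bonus = 8 - 10/2 = +3 and effective_prio() yields 120 - 3 = 117, a better dynamic priority. A task that never sleeps gets CURRENT_BONUS of 0, so bonus = -5 and it is demoted to 120 - (-5) = 125.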
23
Deciding which task goes next The act of picking the next task and switching to it is implemented via the schedule() function: Called when a task goes to sleep. Called when a task is preempted. Runs independently on each CPU. schedule() is relatively simple for all it must accomplish; here is how it determines the next task to run:

struct task_struct *prev, *next;
struct list_head *queue;
struct prio_array *array;
int idx;

prev = current;
array = rq->active;
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, struct task_struct, run_list);
if (prev != next)
    context_switch();
24
Deciding which task goes next - Illustration
25
Sleeping And Waking Up Sleeping and waking must be implemented properly: when a task needs to sleep, it is marked as sleeping, moved off the runqueue and onto a wait queue, and then calls schedule() so the scheduler can pick something else to run; waking up is the reverse. Sleeping tasks have one of 2 states: TASK_INTERRUPTIBLE - can be awakened prematurely (responds to signals). TASK_UNINTERRUPTIBLE - cannot be disturbed. Wait queues are represented in the kernel code by the wait_queue_head_t struct, which is a linked list.
26
Sleeping And Waking Up: The Code /* 'q' is the wait queue we wish to sleep on */ DECLARE_WAITQUEUE(wait, current); add_wait_queue(q, &wait); while (!condition) { /* condition is the event that we are waiting for */ set_current_state(TASK_INTERRUPTIBLE); /* or TASK_UNINTERRUPTIBLE */ if (signal_pending(current)) /* handle signal */ schedule(); } set_current_state(TASK_RUNNING); remove_wait_queue(q, &wait);
27
Sleeping And Waking Up: Illustration
28
The Load Balancer In a multiprocessor environment, each CPU has its own dedicated runqueue (5). It is imperative to balance the load across the runqueues so that we never end up in a situation where, for example, one CPU has 20 tasks while another has 15. The idea: pull tasks from busy runqueues into less busy ones.
29
The Load Balancer The load balancer is implemented via the load_balance() function. It has two methods of invocation: 1. From the schedule() function, whenever the current runqueue is empty. 2. By a timer interrupt: every 1ms when the system is idle and every 200ms otherwise. Note: on uniprocessor systems it is never called; in fact, it is not even compiled into the kernel.
30
The Load Balancer The load_balance() function and its related methods are fairly large and complicated, but the steps they perform are comprehensible: 1. First, load_balance() calls find_busiest_queue() to determine the busiest runqueue, i.e. the runqueue with the greatest number of processes in it. If no runqueue has at least 25% more processes than the current one, find_busiest_queue() returns NULL and load_balance() returns. Otherwise, the busiest runqueue is returned. 2. Second, load_balance() decides which priority array on the busiest runqueue to pull from. The expired array is preferred because those tasks have not run in a relatively long time and are therefore most likely not in the processor's cache (that is, they are not "cache hot"). If the expired priority array is empty, the active one is the only choice. 3. Next, load_balance() finds the highest-priority (smallest value) list that has tasks, because it is more important to fairly distribute high-priority tasks than lower-priority ones. 4. Each task of the given priority is analyzed to find a task that is not running, is not prevented from migrating via processor affinity, and is not cache hot. If a task meets these criteria, pull_task() is called to pull it from the busiest runqueue to the current runqueue. 5. As long as the runqueues remain imbalanced, the previous two steps are repeated and more tasks are pulled from the busiest runqueue to the current one. Finally, when the imbalance is resolved, the current runqueue is unlocked and load_balance() returns.
31
The Load Balancer Code

static int load_balance(int this_cpu, runqueue_t *this_rq,
                        struct sched_domain *sd, enum idle_type idle)
{
    struct sched_group *group;
    runqueue_t *busiest;
    unsigned long imbalance;
    int nr_moved;

    spin_lock(&this_rq->lock);

    group = find_busiest_group(sd, this_cpu, &imbalance, idle);
    if (!group)
        goto out_balanced;

    busiest = find_busiest_queue(group);
    if (!busiest)
        goto out_balanced;

    nr_moved = 0;
    if (busiest->nr_running > 1) {
        double_lock_balance(this_rq, busiest);
        nr_moved = move_tasks(this_rq, this_cpu, busiest, imbalance, sd, idle);
        spin_unlock(&busiest->lock);
    }
    spin_unlock(&this_rq->lock);
32
The Load Balancer Code (continued)

    if (!nr_moved) {
        sd->nr_balance_failed++;
        if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries + 2)) {
            int wake = 0;
            spin_lock(&busiest->lock);
            if (!busiest->active_balance) {
                busiest->active_balance = 1;
                busiest->push_cpu = this_cpu;
                wake = 1;
            }
            spin_unlock(&busiest->lock);
            if (wake)
                wake_up_process(busiest->migration_thread);
            sd->nr_balance_failed = sd->cache_nice_tries;
        }
    } else
        sd->nr_balance_failed = 0;

    sd->balance_interval = sd->min_interval;
    return nr_moved;

out_balanced:
    spin_unlock(&this_rq->lock);
    if (sd->balance_interval < sd->max_interval)
        sd->balance_interval *= 2;
    return 0;
}
33
Soft Real-Time (RT) Scheduling The scheduler supports soft real-time scheduling quite well: it will do its best to meet predetermined deadlines but does not guarantee them. RT task priorities range over [0,99]; these tasks always preempt user tasks, since user tasks are ranged [100,139] and the lower the number, the higher the priority. Two scheduling schemes are available: 1. SCHED_FIFO - as the name implies, first in, first out. Timeslices are irrelevant in this scheme; the task with the highest priority runs until it finishes. 2. SCHED_RR - Round Robin: tasks are scheduled by priority, and tasks at the same priority run in round-robin fashion, each for a pre-allotted timeslice. (Illustration: a timeline of tasks T1 and T2 interleaving under these policies.)
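A minimal userspace sketch (not from the deck) of requesting a soft-RT policy; priority 50 is an arbitrary value in the RT range, and root privileges (or CAP_SYS_NICE) are required:

#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param param = { .sched_priority = 50 }; /* arbitrary RT priority */

    /* pid 0 = calling process; SCHED_FIFO runs until it blocks, yields or exits */
    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1) {
        perror("sched_setscheduler");
        return 1;
    }
    printf("now SCHED_FIFO at priority %d\n", param.sched_priority);
    return 0;
}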
34
Let’s review the 2.4.x scheduler and its shortcomings to better understand how the 2.6.x scheduler improved performance and scalability.
35
The Algorithm - Timeslices The Linux 2.4.x scheduling algorithm divides time into "epochs" - periods of time in which each task is allowed to use up its timeslice. Timeslices are computed for all tasks when an epoch begins, resulting in an O(n) computation before we even begin scheduling. Each task has a base timeslice determined by a user-assigned nice value (or the default). The nice value is scaled to a certain number of scheduler ticks, with nice value 0 (the default) resolving to a timeslice of about 200ms. During a timeslice recalculation, the base timeslice is modified by how I/O bound the task is: I/O bound tasks receive a bonus to their timeslice at the beginning of an epoch, in the following manner:

p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);

That is, half of the timeslice the task did not consume in the previous epoch is added to its base timeslice.
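Illustrative arithmetic for the recalculation (the tick counts are invented for the example): say NICE_TO_TICKS(p->nice) is 20 ticks and a task ends the epoch with 10 unused ticks in p->counter. It starts the next epoch with (10 >> 1) + 20 = 25 ticks, while a CPU-bound task that drained its counter to 0 restarts at exactly 20. The bonus is also self-limiting: a task that never runs at all accumulates 20, 30, 35, 37, ... ticks, approaching but never exceeding twice its base value (the fixed point of x = x/2 + 20 is 40).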
36
The Algorithm – Which Task Runs Next? To determine the next task to run, the schedule() function iterates through all runnable tasks in the system (again, O(n) complexity) and calls the goodness() function on each:

if (p->policy != SCHED_NORMAL)
    return 1000 + p->rt_priority;
if (p->counter == 0)
    return 0;
if (p->mm == prev->mm)
    return p->counter + p->priority + 1;
return p->counter + p->priority;

Explanation: RT tasks get a boost of 1000 added to their priority, which ensures they run first. Tasks that have consumed their timeslice return 0. A task that shares its memory space with the previously running task gets a small goodness boost, since no address-space switch is needed and the warm cache works to its advantage.
37
Shortcomings 1. Scalability - O(n) execution: much CPU time is spent recalculating the timeslice of every task and picking the next task to run, which hurts interactivity, as mentioned earlier. 2.6.x solves this in O(1) time using constant-time bitmap lookups, with timeslice calculations performed on demand. 2. Large average timeslices - the average timeslice in Linux 2.4.x was 210ms, versus 100ms in 2.6.x. Consider this: with 100 threads running at high priority and 1 thread at low priority, the low-priority thread might wait about 20 seconds to execute (100 x ~200ms), unlikely but feasible. Consider a web server's performance under such conditions. The smaller average timeslice in 2.6.x puts a limit to the manifestation of the problem.
38
Conclusion The Linux scheduler has come a long way toward becoming a reliable, smart and adjustable piece of code that drives the heart of what computers were meant to do: run tasks as efficiently as possible. It gives proper means to the various computing environments in the market, be they embedded systems, personal computers (uniprocessor and SMP), NUMA, or general clustering. It has adapted itself quite well to all of the above, yet still has some open problems that were eased but not eradicated; but, as in engineering, if it is accurate enough and usable for the tasks it is assigned, it achieves its goals.
42
Kernel        Avg CPU utilization                    Avg memory  Avg swap  Total Web pages served  Pages/second  Processing mean time (ms)  Unsuccessful connections
2.4.18-smp    100% (user: 7.38%, system: 92.62%)     6.41%       0%        8,845,147               102.37        294.44                     0
2.6.0-test5   99.42% (user: 39.35%, system: 60.07%)  35.96%      0%        53,827,939              623.00        57.71                      0

(Accompanying chart: pages served over hours; not reproduced.)