In the previous chapter on Tasks, we discussed one of the main responsibilities of an operating system: task management. To be fair, we only covered creating tasks and stopping or killing them. This chapter discusses the scheduler, the component that allows tasks to run on one or multiple processors. Note that “tasks” or “jobs” can refer to processes, threads, or both.
The scheduler has two main responsibilities:
Remember the image below? The first responsibility of the scheduler is the transition between the “ready” and “running” states by interrupting (pausing) and dispatching (starting/resuming) individual tasks:
source: SILBERSCHATZ, A., GALVIN, P.B., and GAGNE, G. Operating System Concepts. 9th ed. Hoboken: Wiley, 2013.
While the image above mainly concerns processes, similar logic exists for threads as well, as they go through conceptually similar lifecycle phases.
For the sake of simplicity, in this chapter we assume a system which has only a single processor with a single CPU core. However, the concepts introduced here also (largely) hold for multi-core systems.
When working with a single core, only a single task can be active at any given time. Say that the scheduler starts with the execution of the first task. It then has two options to determine when the next job is dispatched:
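Cooperative (non-preemptive) scheduling: the running task keeps the processor until it completes or yields. Yielding can be explicit, for example through sleep(), but also pthread_join() or sem_wait() can indicate a task can be paused for the time being.
Preemptive scheduling: the scheduler itself interrupts the running task, for example after a certain amount of time, and dispatches another one.
While the cooperative scheduling approach is the simplest, it also has some severe downsides. If a given task takes a long time to complete or doesn’t properly yield at appropriate intervals, it can end up hogging the CPU for a long time, causing other tasks to stall. As such, most modern OSes employ a form of preemptive scheduling.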
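The cooperative approach and its main downside can be sketched with Python generators, where `yield` plays the role of a voluntary yield. This is only an illustration: the names `task`, `greedy_task` and `run_cooperative` are hypothetical, and a real OS scheduler switches between processes or threads, not generators.

```python
from collections import deque

def task(name, steps, log):
    """A well-behaved task: does one unit of work, then yields the CPU."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # voluntary yield, comparable in spirit to calling sleep()

def greedy_task(name, steps, log):
    """A badly behaved task: does all of its work before yielding once."""
    for i in range(steps):
        log.append(f"{name}:{i}")
    yield

def run_cooperative(tasks):
    """Run all tasks to completion; a switch only happens when a task yields."""
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)          # dispatch: run until the task yields
            ready.append(current)  # not finished: back to the end of the queue
        except StopIteration:
            pass                   # the task has completed

log = []
run_cooperative([task("A", 2, log), task("B", 2, log)])
print(log)      # → ['A:0', 'B:0', 'A:1', 'B:1']

hog_log = []
run_cooperative([greedy_task("H", 3, hog_log), task("B", 2, hog_log)])
print(hog_log)  # → ['H:0', 'H:1', 'H:2', 'B:0', 'B:1']
```

In the second run, task B cannot start until H decides to yield, which is exactly the stalling problem described above.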
If the scheduler needs to preempt jobs after a certain amount of time (or execution ticks), it requires hardware assistance. CPUs contain a specialized timer component that can trigger an “interrupt” after a certain amount of time has passed. This interrupt causes the CPU to run a predefined OS function, the interrupt handler. In this case, the handler pauses the current process and schedules a new one. As such, the hardware timer is a crucial piece for implementing preemptive scheduling!
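To get a feeling for timer-driven interruption, the sketch below uses a user-space analogy: a periodic POSIX interval timer delivers SIGALRM to a busy loop that never yields. This is not how a kernel scheduler is implemented; the function name `count_timer_interrupts` and its parameters are hypothetical, and the code assumes a Unix-like system and must run in the main thread.

```python
import signal
import time

def count_timer_interrupts(wanted=3, interval=0.01, timeout=2.0):
    """Busy-loop until `wanted` timer interrupts have arrived (Unix only)."""
    ticks = 0

    def on_timer(signum, frame):
        # The "interrupt handler": an OS would pick the next task here.
        nonlocal ticks
        ticks += 1

    old_handler = signal.signal(signal.SIGALRM, on_timer)
    # Fire after `interval` seconds, then every `interval` seconds.
    signal.setitimer(signal.ITIMER_REAL, interval, interval)
    deadline = time.monotonic() + timeout
    try:
        # The "task": it never yields, yet the timer still interrupts it.
        while ticks < wanted and time.monotonic() < deadline:
            pass
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # stop the timer
        if old_handler is None:
            old_handler = signal.SIG_DFL
        signal.signal(signal.SIGALRM, old_handler)  # restore the old handler
    return ticks

print(count_timer_interrupts())
```

The loop body contains no call that gives up the processor; it is only interrupted because the timer fires, which mirrors the role of the hardware timer described above.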
Independent of whether cooperative or preemptive scheduling is used, there exist many algorithms the scheduler may use to determine which job is to be scheduled next. A small selection of these algorithms is given here.
To be able to reason about different scheduling algorithms, we need some sort of metric to determine which approach is best. When studying schedulers, the following metrics are typically used:
A simple algorithm that a scheduler can follow is First Come, First Served (FCFS). The order in which the jobs arrive (are started) is the same as the order in which the jobs are allowed on the processor.
The image below shows three tasks that arrive very close to each other. The result of the cooperative scheduler’s job is shown in the image:
Applying the first three metrics on the example above gives the following results:
Average Throughput:
AJWT:
AJCT:
For these examples, the decimal portion can be rounded away. It is only used to make a distinction in the order of arrival.
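The FCFS metrics can also be computed with a small simulation. The sketch below rests on several assumptions: the task durations (1 s, 10 s and 1 s) are inferred from the SJF figures later in this chapter rather than stated in the text, arrival times are rounded to 0, and the metric definitions are the ones the worked results suggest (throughput: completed jobs per second; AJWT: average time from arrival to first dispatch; AJCT: average time from arrival to completion). The function names are hypothetical.

```python
def fcfs(tasks):
    """Non-preemptive FCFS. tasks: list of (name, arrival, duration).
    Returns {name: (arrival, first_start, end)}."""
    clock = 0
    schedule = {}
    # sorted() is stable, so tasks with equal arrival keep their list order.
    for name, arrival, duration in sorted(tasks, key=lambda t: t[1]):
        start = max(clock, arrival)  # wait for the CPU, or for the job itself
        clock = start + duration
        schedule[name] = (arrival, start, clock)
    return schedule

def metrics(schedule):
    """Return (throughput, AJWT, AJCT), measuring time from t = 0."""
    n = len(schedule)
    total_time = max(end for _, _, end in schedule.values())
    throughput = n / total_time
    ajwt = sum(start - arrival for arrival, start, _ in schedule.values()) / n
    ajct = sum(end - arrival for arrival, _, end in schedule.values()) / n
    return throughput, ajwt, ajct

# Assumed task set: (name, arrival in s, duration in s)
tasks = [("T1", 0, 1), ("T2", 0, 10), ("T3", 0, 1)]
print(metrics(fcfs(tasks)))  # → (0.25, 4.0, 8.0)
```

Under these assumptions, the long Task 2 makes Task 3 wait 11 s, which dominates both averages.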
By looking at the FCFS metrics, we can immediately see an easy way to improve the AJWT and AJCT metrics: schedule Task 3 before Task 2!
One algorithm that would allow such an optimization is called Shortest Job First (SJF). With this algorithm the scheduler looks at the tasks that are in the ready state. The shortest job within this queue is allowed first on the processor.
If the scheduler applies the SJF algorithm to the same example, the occupation of the processor looks as shown below.
Calculate the three metrics for the result of the SJF example above: Throughput, AJWT, and AJCT.
Throughput = 3 tasks / 12 s = 0.25 tasks/s
AJWT = ( 0 s + 1 s + 2 s ) / 3 = 1 s
AJCT = ( 1 s + 2 s + 12 s ) / 3 = 5 s
We can see that the AJWT and AJCT metrics are indeed improved considerably for this example using SJF!
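A minimal sketch of non-preemptive SJF reproduces these numbers. As before, the task durations (1 s, 10 s, 1 s) and the rounded arrival times are assumptions inferred from the worked results, and the function names are hypothetical. Ties between equally short jobs are broken by list order, which is one possible choice.

```python
def sjf(tasks):
    """Non-preemptive SJF. tasks: list of (name, arrival, duration).
    Returns {name: (arrival, first_start, end)}."""
    pending = list(tasks)
    clock = 0
    schedule = {}
    while pending:
        ready = [t for t in pending if t[1] <= clock]
        if not ready:
            clock = min(t[1] for t in pending)  # idle until the next arrival
            continue
        job = min(ready, key=lambda t: t[2])    # shortest job in the ready queue
        pending.remove(job)
        name, arrival, duration = job
        schedule[name] = (arrival, clock, clock + duration)
        clock += duration
    return schedule

def metrics(schedule):
    """Return (throughput, AJWT, AJCT), measuring time from t = 0."""
    n = len(schedule)
    total_time = max(end for _, _, end in schedule.values())
    ajwt = sum(start - arrival for arrival, start, _ in schedule.values()) / n
    ajct = sum(end - arrival for arrival, _, end in schedule.values()) / n
    return n / total_time, ajwt, ajct

# Assumed task set: (name, arrival in s, duration in s)
tasks = [("T1", 0, 1), ("T2", 0, 10), ("T3", 0, 1)]
print(metrics(sjf(tasks)))  # → (0.25, 1.0, 5.0)
```

Note that the throughput is unchanged: SJF only reorders the work, it does not reduce the total amount of it.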
Both of the examples for FCFS and SJF have so far been for non-preemptive/cooperative scheduling. Tasks have been allowed to run to their full completion. Let’s now compare this to preemptive scheduling, where the scheduler can pause a task running on the processor. Note that for the practical example we’ve been using, nothing much would change with preemptive scheduling: the selected next job would always be the same (either the first one started that hasn’t finished yet, or the shortest one remaining).
As such, let’s use a slightly more advanced example:
For preemptive scheduling, there are again several options to determine when to preempt a running task, as here we’re no longer waiting for a task to end or yield. You could, for example, switch tasks every x milliseconds or every x processor ticks. In our example, the scheduler preempts only when a new job comes in: it stops the currently running job and starts the most recently added job.
In the example above the following actions are taken:
As such, this example demonstrates a sort-of Last-Come-First-Served (LCFS) approach.
Calculate the three metrics for the result of the preemption example, above: Throughput, AJWT, and AJCT.
Throughput = 3 tasks / 12 s = 0.25 tasks/s
AJWT = ( 0 s + 0 s + 1 s ) / 3 = 0.33 s
AJCT = ( 1 s + 12 s + 2 s ) / 3 = 5 s
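The preempt-on-arrival behaviour can be sketched as a stack of unfinished jobs. The task set below is hypothetical and is not the one from the figure, so the numbers differ from the results above. The sketch also assumes integer arrival times, positive integer durations, and that a preempted job resumes once every more recently arrived job has finished.

```python
def lcfs_preemptive(tasks):
    """Preemptive LCFS in 1 s steps. tasks: list of (name, arrival, duration).
    Returns {name: (arrival, first_start, end)}."""
    remaining = {name: duration for name, _, duration in tasks}
    arrival = {name: arr for name, arr, _ in tasks}
    first_start, end = {}, {}
    stack = []  # arrived, unfinished jobs; the most recent arrival is on top
    clock = 0
    while len(end) < len(tasks):
        # A new arrival goes on top of the stack and so preempts the running job.
        for name, arr, _ in tasks:
            if arr == clock:
                stack.append(name)
        if stack:
            current = stack[-1]
            first_start.setdefault(current, clock)
            remaining[current] -= 1
            if remaining[current] == 0:
                end[current] = clock + 1
                stack.pop()  # the previously preempted job resumes
        clock += 1
    return {n: (arrival[n], first_start[n], end[n]) for n in arrival}

# Hypothetical task set: (name, arrival in s, duration in s)
tasks = [("T1", 0, 3), ("T2", 1, 2), ("T3", 2, 1)]
print(lcfs_preemptive(tasks))
# → {'T1': (0, 0, 6), 'T2': (1, 1, 4), 'T3': (2, 2, 3)}
```

Every job starts the moment it arrives, so the waiting time is 0 for all of them, but T1 is pushed back until both later jobs are done.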
Apply cooperative FCFS and SJF scheduling to the new example tasks and calculate the necessary metrics. Compare the results to the preemptive LCFS.
At this point we could try an SJF approach with preemption (which here would be called shortest-remaining-time-first). Although this is a perfectly fine exercise (wink), in practice estimating the duration of a job is not easy, as even the program itself typically doesn’t know how long it will run for! The OS could base its estimate on earlier runs of the program (or similar programs), or on the length of the program, but it remains guesswork. As such, SJF is rarely used in practice. In our example, it also wouldn’t be the perfect approach: T1 and T3 have equal (estimated) durations, so SJF alone doesn’t tell the OS which of the two should run first. Put differently, the scheduler wouldn’t be deterministic.
A more practical approach is priority-based scheduling. In this setup, you can assign a given priority to each task, and have jobs with higher priority run before those of lower priority. This still leaves some uncertainty/non-determinism for processes with the same priority, but it’s a good first approach.
Let’s assume the priorities as mentioned in the image below. Try to complete the graph with the correct scheduler decision.
As can be seen from the example above, this approach might hold a potential risk: starvation. Some jobs with lower priority (in our case T1) might not get any processor time until all other processes are done: they starve. One solution for starvation is priority ageing. This mechanism allows the priority of a job to increase over time in case of starvation, leading to the job eventually being scheduled. The actual priority thus becomes a function of the original priority and the age of the task. Again, as you can imagine, there are several ways to do this priority ageing (for example at which time intervals to update the priority and by how much, or by how/if you change the priority after the task has been scheduled for its first time slot). We will later see how this is practically approached in Linux.
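Starvation and priority ageing can be illustrated with a small simulation. All choices here are assumptions made for the sake of the example: a higher number means a higher priority, all tasks arrive at time 0, the scheduler re-evaluates every second, the effective priority grows linearly with the time a job has been waiting, and the waiting time resets once the job has run. The task set and the function name are hypothetical, and this is not the mechanism Linux uses.

```python
def priority_schedule(tasks, ageing_rate=0.0):
    """tasks: list of (name, base_priority, duration), all arriving at t = 0.
    Returns the name of the job that runs in each 1 s slot."""
    remaining = {name: duration for name, _, duration in tasks}
    base = {name: prio for name, prio, _ in tasks}
    waited = {name: 0 for name in remaining}  # slots since the job last ran
    timeline = []
    while remaining:
        # Effective priority = base priority + ageing bonus for waiting.
        # On a tie, max() picks the job that comes first in the task list.
        current = max(remaining, key=lambda n: base[n] + ageing_rate * waited[n])
        timeline.append(current)
        for name in remaining:
            waited[name] = 0 if name == current else waited[name] + 1
        remaining[current] -= 1
        if remaining[current] == 0:
            del remaining[current]
    return timeline

# Hypothetical task set: (name, priority, duration in s)
tasks = [("T1", 1, 2), ("T2", 3, 3), ("T3", 2, 2)]
print(priority_schedule(tasks))
# → ['T2', 'T2', 'T2', 'T3', 'T3', 'T1', 'T1']
print(priority_schedule(tasks, ageing_rate=1.0))
# → ['T2', 'T2', 'T3', 'T1', 'T2', 'T3', 'T1']
```

Without ageing, T1 only runs after everything else has finished. With ageing, its effective priority catches up and it first gets the processor in slot 3 instead of slot 5.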
As you can see, scheduling algorithms can get quite complex and it’s not always clear which approach will give the best results for any given job load. As such, it might be easier to just do the simplest preemptive scheduling we can think of: switch between tasks at fixed time intervals in a fixed order (for example ordered by descending task start time). This is called Round-Robin scheduling (RR).
As such, RR allows multiple tasks to effectively time-share the processor. The maximum amount of time that a job can stay on the processor before being preempted is called a time quantum or time slice. Typically, the duration of a time slice in a modern OS is between 10 and 100 ms. All jobs in the ready queue get assigned a time slice in a circular fashion. An (unrealistic) example with a time slice of 1 s is shown below:
Calculate the three metrics for the result of the RR example above: Throughput, AJWT, and AJCT.
Throughput = 3 tasks / 12 s = 0.25 tasks/s
AJWT = ( 0 s + 1 s + 2 s ) / 3 = 1 s
AJCT = ( 10 s + 11 s + 12 s ) / 3 = 11 s
As can be seen from the example above, RR also has some downsides. While each task gets some CPU time very early on (low AJWT), the average completion time (AJCT) is very high, as all tasks are interrupted several times. If we were to compute CPU efficiency, this would also be lowest here, due to the high number of context switches between tasks.
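The RR figures can be reproduced with a short simulation that also counts context switches. The task set is an assumption: three tasks of 4 s each, all arriving at time 0, which is what the results above suggest but is not stated in the text. The function name `round_robin` is hypothetical.

```python
from collections import deque

def round_robin(tasks, quantum=1):
    """tasks: list of (name, duration), all arriving at t = 0 in list order.
    Returns ({name: (first_start, end)}, number_of_context_switches)."""
    ready = deque(tasks)
    clock = 0
    first_start, end = {}, {}
    switches = 0
    previous = None
    while ready:
        name, left = ready.popleft()
        if previous is not None and previous != name:
            switches += 1                     # a different job gets the CPU
        first_start.setdefault(name, clock)
        run = min(quantum, left)              # run for at most one time slice
        clock += run
        if left > run:
            ready.append((name, left - run))  # preempted: back of the queue
        else:
            end[name] = clock                 # finished within this slice
        previous = name
    return {n: (first_start[n], end[n]) for n in first_start}, switches

# Assumed task set: (name, duration in s)
schedule, switches = round_robin([("T1", 4), ("T2", 4), ("T3", 4)], quantum=1)
n = len(schedule)
throughput = n / max(e for _, e in schedule.values())
ajwt = sum(s for s, _ in schedule.values()) / n  # arrival is 0 for all tasks
ajct = sum(e for _, e in schedule.values()) / n
print(throughput, ajwt, ajct, switches)  # → 0.25 1.0 11.0 11
```

Under these assumptions there are 11 context switches in 12 s. Run to completion in FCFS order, the same tasks would need only 2, which illustrates where the efficiency loss comes from.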
A preemptive scheduler does not wait until a task yields the CPU, but interrupts its execution after a single time slice (or, in previous examples, when a new task arrives).
This does NOT mean, however, that a task cannot yield the CPU!
Another way of putting it is: a job can either run until the time slice has run out (this is when the scheduler interrupts the job) or until the job itself yields the processor. In practice, both happen often during normal execution of tasks in an OS.