=====================
Scheduler Nice Design
=====================

This document explains the thinking about the revamped and streamlined
nice-levels implementation in the new Linux scheduler.

Nice levels were always pretty weak under Linux and people continuously
pestered us to make nice +19 tasks use up much less CPU time.

Unfortunately that was not that easy to implement under the old
scheduler (otherwise we'd have done it long ago), because nice level
support was historically coupled to timeslice length, and timeslice
units were driven by the HZ tick, so the smallest timeslice was 1/HZ.

In the O(1) scheduler (in 2003) we changed negative nice levels to be
much stronger than they were before in 2.4 (and people were happy about
that change), and we also intentionally calibrated the linear timeslice
rule so that nice +19 level would be _exactly_ 1 jiffy. To better
understand it, the timeslice graph went like this (cheesy ASCII art
alert!)::


                   A
             \     | [timeslice length]
              \    |
               \   |
                \  |
                 \ |
                  \|___100msecs
                   |^ . _
                   |      ^ . _
                   |            ^ . _
 -*----------------------------------*-----> [nice level]
 -20               |                +19
                   |
                   |

So that if someone wanted to really renice tasks, +19 would give a much
bigger hit than the normal linear rule would do. (The solution of
changing the ABI to extend priorities was discarded early on.)

This approach worked to some degree for some time, but later on with
HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which
we felt to be a bit excessive. Excessive _not_ because it's too small of
a CPU utilization, but because it causes too frequent (once per
millisec) rescheduling, and would thus thrash the cache, etc. (Remember,
this was long ago when hardware was weaker and caches were smaller, and
people were running number crunching apps at nice +19.)

So for HZ=1000 we changed nice +19 to 5msecs, because that felt like the
right minimal granularity - and this translates to 5% CPU utilization.
But the fundamental HZ-sensitive property for nice +19 still remained,
and we never got a single complaint about nice +19 being too _weak_ in
terms of CPU utilization, we only got complaints about it (still) being
too _strong_ :-)

To sum it up: we always wanted to make nice levels more consistent, but
within the constraints of HZ and jiffies and their nasty design-level
coupling to timeslices and granularity it was not really viable.

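
The HZ sensitivity described above can be made concrete with a little
arithmetic. The sketch below is not kernel code. It assumes that the
minimum (nice +19) timeslice is max(5 * HZ / 1000, 1) jiffies, which
reproduces both the "1 jiffy" and the "5msecs at HZ=1000" cases from the
text, and that the nice +19 task alternates with a single CPU-bound
nice 0 task running the 100 msec timeslice shown in the graph. All
function names are made up for illustration.

```c
#include <stdio.h>

/* Assumed rule: minimum timeslice is max(5 * HZ / 1000, 1) jiffies. */
unsigned int min_timeslice_jiffies(unsigned int hz)
{
    unsigned int j = 5 * hz / 1000;

    return j > 0 ? j : 1;
}

/* Length of the nice +19 timeslice in milliseconds. */
double min_timeslice_msecs(unsigned int hz)
{
    return min_timeslice_jiffies(hz) * 1000.0 / hz;
}

/*
 * CPU share (in percent) of a nice +19 task that alternates with one
 * CPU-bound nice 0 task running 100 msec timeslices (assumption).
 */
double nice19_share_percent(unsigned int hz)
{
    double ts = min_timeslice_msecs(hz);

    return 100.0 * ts / (ts + 100.0);
}

/* Print the timeslice and share for a few common HZ settings. */
void print_nice19_shares(void)
{
    static const unsigned int hz[] = { 100, 250, 1000 };
    unsigned int i;

    for (i = 0; i < sizeof(hz) / sizeof(hz[0]); i++)
        printf("HZ=%4u: %4.1f msecs -> %.1f%%\n", hz[i],
               min_timeslice_msecs(hz[i]),
               nice19_share_percent(hz[i]));
}
```

Under these assumptions the share comes out at roughly 9%, 4% and 5% for
HZ=100, 250 and 1000, so the same nice +19 setting means noticeably
different things depending on the kernel configuration.
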
The second (less frequent but still periodically occurring) complaint
about Linux's nice level support was its asymmetry around the origin
(which you can see demonstrated in the picture above), or more
accurately: the fact that nice level behavior depended on the _absolute_
nice level as well, while the nice API itself is fundamentally
"relative"::

   int nice(int inc);

   asmlinkage long sys_nice(int increment)

(the first one is the glibc API, the second one is the syscall API.)
Note that the 'inc' is relative to the current nice level. Tools like
bash's "nice" command mirror this relative API.

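
A minimal userspace sketch of what "relative" means here, using the
standard nice() and getpriority() calls. The helper name renice_self_by()
is hypothetical; it only wraps the two calls so that the nice level
before and after can be compared.

```c
#define _XOPEN_SOURCE 700   /* for nice() and getpriority() */

#include <errno.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Change the calling process's nice level by 'inc' and report the
 * level before and after. Returns 0 on success, -1 on error.
 */
int renice_self_by(int inc, int *before, int *after)
{
    /* getpriority() can legitimately return -1, so check errno. */
    errno = 0;
    *before = getpriority(PRIO_PROCESS, 0);
    if (*before == -1 && errno != 0)
        return -1;

    /* nice() takes an increment, not an absolute level. */
    errno = 0;
    if (nice(inc) == -1 && errno != 0)
        return -1;

    errno = 0;
    *after = getpriority(PRIO_PROCESS, 0);
    if (*after == -1 && errno != 0)
        return -1;

    return 0;
}
```

Calling renice_self_by(1, ...) from a shell that is itself at nice +5
leaves the process at +6, while from a nice 0 shell it ends up at +1
(the level is capped at +19). The call itself only says "one step weaker
than whoever started me".
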
With the old scheduler, if you for example started a niced task with +1
and another task with +2, the CPU split between the two tasks would
depend on the nice level of the parent shell - if it was at nice -10 the
CPU split was different than if it was at +5 or +10.

A third complaint against Linux's nice level support was that negative
nice levels were not 'punchy enough', so lots of people had to resort to
running audio (and other multimedia) apps under RT priorities such as
SCHED_FIFO. But this caused other problems: SCHED_FIFO is not
starvation-proof, and a buggy SCHED_FIFO app can also lock up the system
for good.

The new scheduler in v2.6.23 addresses all three types of complaints:

To address the first complaint (of positive nice levels not being
"punchy" enough, i.e. nice +19 tasks still using too much CPU time), the
scheduler was decoupled from 'time slice' and HZ concepts (and
granularity was made a separate concept from nice levels) and thus it
was possible to implement better and more consistent nice +19 support:
with the new scheduler nice +19 tasks get a HZ-independent 1.5%, instead
of the variable 3%-5%-9% range they got in the old scheduler.

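
As a rough cross-check of the 1.5% figure: the new scheduler gives each
task a load weight and divides CPU time in proportion to those weights.
The weights used below, 1024 for nice 0 and 15 for nice +19, are taken
from the kernel's nice-to-weight table and are an assumption as far as
this document goes, as is the scenario of a single CPU-bound competitor
at nice 0. The names are illustrative only.

```c
/* Assumed load weights (not given in this document). */
#define WEIGHT_NICE_0   1024
#define WEIGHT_NICE_19  15

/*
 * Share of the CPU, in percent, that a task of weight 'w' gets when it
 * competes with one other runnable task of weight 'other'.
 */
double weighted_share_percent(unsigned int w, unsigned int other)
{
    return 100.0 * w / (w + other);
}
```

weighted_share_percent(WEIGHT_NICE_19, WEIGHT_NICE_0) evaluates to about
1.4%, and nothing in the calculation depends on HZ.
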
To address the second complaint (of nice levels not being consistent),
the new scheduler makes nice(1) have the same CPU utilization effect on
tasks, regardless of their absolute nice levels. So on the new
scheduler, running a nice +10 and a nice +11 task has the same CPU
utilization "split" between them as running a nice -5 and a nice -4
task. (One will get 55% of the CPU, the other 45%.) That is why nice
levels were changed to be "multiplicative" (or exponential) - that way
it does not matter which nice level you start out from, the 'relative
result' will always be the same.

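
A small sketch of why a multiplicative rule has this property. It
assumes that each nice step scales a task's weight by a constant factor
of 1.25 and that nice 0 has weight 1024; both numbers come from the
kernel's weight table rather than from this document, and the function
names are illustrative only. The real kernel uses a precomputed integer
table, not a loop like this.

```c
/* Assumed constants (not stated in this document). */
#define NICE_0_WEIGHT   1024.0
#define STEP_FACTOR     1.25

/* Weight of a task at the given nice level, -20..+19. */
double nice_weight(int nice)
{
    double w = NICE_0_WEIGHT;
    int i;

    for (i = 0; i < nice; i++)      /* positive nice: lighter */
        w /= STEP_FACTOR;
    for (i = 0; i > nice; i--)      /* negative nice: heavier */
        w *= STEP_FACTOR;

    return w;
}

/* Fraction of the CPU task 'a' gets when competing only with 'b'. */
double cpu_split(int nice_a, int nice_b)
{
    double wa = nice_weight(nice_a);
    double wb = nice_weight(nice_b);

    return wa / (wa + wb);
}
```

Both cpu_split(10, 11) and cpu_split(-5, -4) come out at about 0.556,
i.e. the roughly 55%/45% split mentioned above, because only the ratio
of the two weights matters and that ratio is 1.25 for any pair of
adjacent nice levels.
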
The third complaint (of negative nice levels not being "punchy" enough
and forcing audio apps to run under the more dangerous SCHED_FIFO
scheduling policy) is addressed by the new scheduler almost
automatically: stronger negative nice levels are an automatic
side-effect of the recalibrated dynamic range of nice levels.