NOTE; all this assumes a linear relation between frequency and work capacity,
we know this is flawed, but it is the best workable approximation.


PELT (Per Entity Load Tracking)
-------------------------------

With PELT we track some metrics across the various scheduler entities, from
individual tasks to task-group slices to CPU runqueues. As the basis for this
we use an Exponentially Weighted Moving Average (EWMA); each period (1024us)
is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute
half, while the rest of history contribute the other half.

Specifically:

  ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...

  ewma(u) = ewma_sum(u) / ewma_sum(1)

Since this is essentially a progression of an infinite geometric series, the
results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property
is key, since it gives the ability to recompose the averages when tasks move
around.
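
As an illustration of the above, a minimal userspace sketch (floating point,
hypothetical names; the kernel uses fixed-point arithmetic, see
kernel/sched/pelt.c) of the decaying sum and of the composability property:

  #include <stdio.h>

  /* y is chosen such that y^32 = 0.5, i.e. y = 0.5^(1/32). */
  static const double y_decay = 0.97857206208;

  struct pelt_sum {
      double sum;     /* ewma_sum(u) = u_0 + u_1*y + u_2*y^2 + ... */
  };

  /* Account one 1024us period with utilization 'u' (0.0 .. 1.0). */
  static void pelt_accumulate(struct pelt_sum *p, double u)
  {
      p->sum = p->sum * y_decay + u;
  }

  /* ewma(u) = ewma_sum(u) / ewma_sum(1); ewma_sum(1) -> 1/(1-y). */
  static double pelt_value(const struct pelt_sum *p)
  {
      return p->sum * (1.0 - y_decay);
  }

  int main(void)
  {
      struct pelt_sum a = { 0 }, b = { 0 }, ab = { 0 };
      int i;

      /* Two signals and their sum: ewma(A) + ewma(B) == ewma(A+B). */
      for (i = 0; i < 256; i++) {
          double ua = (i % 2) ? 1.0 : 0.0;   /* 50% duty cycle */
          double ub = 0.25;                  /* constant 25% */

          pelt_accumulate(&a, ua);
          pelt_accumulate(&b, ub);
          pelt_accumulate(&ab, ua + ub);
      }
      printf("ewma(A)=%.3f ewma(B)=%.3f sum=%.3f ewma(A+B)=%.3f\n",
             pelt_value(&a), pelt_value(&b),
             pelt_value(&a) + pelt_value(&b), pelt_value(&ab));
      return 0;
  }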

Note that blocked tasks still contribute to the aggregates (task-group slices
and CPU runqueues), which reflects their expected contribution when they
resume running.

Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
reflects the time an entity spends on the CPU, while 'runnable' reflects the
time an entity spends on the runqueue. When there is only a single task these
two metrics are the same, but once there is contention for the CPU 'running'
will decrease to reflect the fraction of time each task spends on the CPU
while 'runnable' will increase to reflect the amount of contention.
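
To make the distinction concrete, an illustrative sketch (hypothetical names,
not the kernel's actual data structures) in which one of two always-runnable
tasks sharing a CPU alternates between running and waiting; its 'running'
converges to roughly half a CPU while its 'runnable' converges to a full CPU:

  #include <stdio.h>

  struct entity_avg {
      double running;    /* time spent executing on the CPU */
      double runnable;   /* time spent on the runqueue (running or waiting) */
  };

  enum entity_state { SLEEPING, WAITING, RUNNING };

  /* Decay both signals and account one 1024us period spent in state 's'. */
  static void entity_avg_update(struct entity_avg *avg, enum entity_state s)
  {
      const double y = 0.97857206208;   /* 0.5^(1/32), as above */

      avg->running  = avg->running  * y + (s == RUNNING  ? 1.0 - y : 0.0);
      avg->runnable = avg->runnable * y + (s != SLEEPING ? 1.0 - y : 0.0);
  }

  int main(void)
  {
      struct entity_avg t = { 0 };
      int i;

      /* This task runs every other period and waits otherwise. */
      for (i = 0; i < 512; i++)
          entity_avg_update(&t, (i % 2) ? RUNNING : WAITING);

      printf("running=%.2f runnable=%.2f\n", t.running, t.runnable);
      return 0;
  }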

For more detail see: kernel/sched/pelt.c


Frequency- / CPU Invariance
---------------------------

Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
a big CPU, we allow architectures to scale the time delta with two ratios, one
Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.

For simple DVFS architectures (where software is in full control) we trivially
compute the ratio as:

            f_cur
  r_dvfs := -----
            f_max

For more dynamic systems where the hardware is in control of DVFS we use
hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.
For Intel specifically, we use:

           APERF
  f_cur := ----- * P0
           MPERF

             4C-turbo;  if available and turbo enabled
  f_max := { 1C-turbo;  if turbo enabled
             P0;        otherwise

                    f_cur
  r_dvfs := min( 1, ----- )
                    f_max

We pick 4C turbo over 1C turbo to make it slightly more sustainable.
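
For illustration, a sketch of that arithmetic (names, frequencies and the
fixed-point scale are assumptions of the example; the real computation lives
in arch/x86/kernel/smpboot.c, as referenced below):

  #include <stdint.h>
  #include <stdio.h>

  #define SCHED_CAPACITY_SCALE 1024   /* fixed-point "1.0" */

  /*
   * Turn APERF/MPERF deltas into r_dvfs scaled by SCHED_CAPACITY_SCALE.
   * How P0 and the turbo-derived f_max are discovered is left out here.
   */
  static unsigned int r_dvfs_scale(uint64_t d_aperf, uint64_t d_mperf,
                                   unsigned int p0_khz, unsigned int fmax_khz)
  {
      uint64_t f_cur_khz, scale;

      if (!d_mperf)
          return SCHED_CAPACITY_SCALE;

      f_cur_khz = d_aperf * p0_khz / d_mperf;    /* f_cur = APERF/MPERF * P0 */
      scale = f_cur_khz * SCHED_CAPACITY_SCALE / fmax_khz;

      return scale > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : scale;
  }

  int main(void)
  {
      /* e.g. P0 = 2.0GHz, 4C turbo = 3.0GHz, currently running at ~2.4GHz */
      printf("r_dvfs = %u/1024\n",
             r_dvfs_scale(2400, 2000, 2000000, 3000000));
      return 0;
  }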

r_cpu is determined as the ratio of highest performance level of the current
CPU vs the highest performance level of any other CPU in the system.

  r_tot = r_dvfs * r_cpu

The result is that the above 'running' and 'runnable' metrics become invariant
of DVFS and CPU type. IOW, we can transfer and compare them between CPUs.
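
Combining both ratios, a simplified sketch of how a wall-clock delta could be
scaled before it is accumulated into the PELT sums; the kernel does the
equivalent in update_rq_clock_pelt() via the arch_scale_freq_capacity() and
arch_scale_cpu_capacity() hooks:

  #include <stdint.h>
  #include <stdio.h>

  #define SCHED_CAPACITY_SCALE 1024   /* fixed-point "1.0" */

  /* r_dvfs and r_cpu are both expressed against SCHED_CAPACITY_SCALE. */
  static uint64_t pelt_scale_delta(uint64_t delta_ns,
                                   unsigned int r_dvfs, unsigned int r_cpu)
  {
      delta_ns = delta_ns * r_dvfs / SCHED_CAPACITY_SCALE;
      delta_ns = delta_ns * r_cpu  / SCHED_CAPACITY_SCALE;
      return delta_ns;
  }

  int main(void)
  {
      /*
       * 1ms of wall-clock on a half-capacity CPU (r_cpu = 512) running at
       * half speed (r_dvfs = 512) only counts as 0.25ms of invariant work.
       */
      printf("%llu ns\n",
             (unsigned long long)pelt_scale_delta(1000000, 512, 512));
      return 0;
  }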

For more detail see:

 - kernel/sched/pelt.h:update_rq_clock_pelt()
 - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
 - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"


UTIL_EST / UTIL_EST_FASTUP
--------------------------

Because periodic tasks have their averages decayed while they sleep, even
though their expected utilization while running is the same, they suffer a
(DVFS) ramp-up when they start running again.

To alleviate this (a default-enabled option) UTIL_EST drives an Infinite
Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
highest. A further default-enabled option UTIL_EST_FASTUP modifies the IIR
filter to instantly increase and only decay on decrease.

A further runqueue wide sum (of runnable tasks) is maintained of:

  util_est := \Sum_t max( t_running, t_util_est_ewma )
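
Sketched in plain C (the real fixed-point code is in util_est_dequeue(),
referenced below; the 1/4 filter weight here is an assumption made for the
example):

  #include <stdio.h>

  struct task_util_est {
      double ewma;   /* IIR-filtered utilization estimate */
  };

  /* Called on dequeue with the task's current 'running' value. */
  static void util_est_update(struct task_util_est *ue, double running,
                              int fastup)
  {
      if (fastup && running >= ue->ewma) {
          /* UTIL_EST_FASTUP: increase instantly ... */
          ue->ewma = running;
          return;
      }
      /* ... and only decay (slowly) on decrease. */
      ue->ewma += (running - ue->ewma) * 0.25;
  }

  int main(void)
  {
      struct task_util_est ue = { .ewma = 0.1 };

      util_est_update(&ue, 0.6, 1);   /* busy period: estimate jumps to 0.6 */
      printf("after busy period:  %.2f\n", ue.ewma);

      util_est_update(&ue, 0.2, 1);   /* quiet period: estimate decays slowly */
      printf("after quiet period: %.2f\n", ue.ewma);
      return 0;
  }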

For more detail see: kernel/sched/fair.c:util_est_dequeue()


UCLAMP
------

It is possible to set effective u_min and u_max clamps on each CFS or RT task;
the runqueue keeps a max aggregate of these clamps for all running tasks.
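
For illustration, a task could set such clamps on itself with sched_setattr();
this sketch assumes a kernel built with CONFIG_UCLAMP_TASK and uses the raw
syscall since glibc provides no wrapper. Clamp values are on the 0..1024
capacity scale:

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/sched.h>          /* SCHED_FLAG_UTIL_CLAMP_* */
  #include <linux/sched/types.h>    /* struct sched_attr */

  static int sched_setattr(pid_t pid, struct sched_attr *attr,
                           unsigned int flags)
  {
      return syscall(SYS_sched_setattr, pid, attr, flags);
  }

  int main(void)
  {
      struct sched_attr attr;

      memset(&attr, 0, sizeof(attr));
      attr.size           = sizeof(attr);
      attr.sched_policy   = SCHED_NORMAL;
      attr.sched_flags    = SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX;
      attr.sched_util_min = 256;    /* ask for at least a quarter of capacity */
      attr.sched_util_max = 512;    /* never ask for more than half */

      if (sched_setattr(0, &attr, 0))
          perror("sched_setattr");
      return 0;
  }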

For more detail see: include/uapi/linux/sched/types.h


Schedutil / DVFS
----------------

Every time the scheduler load tracking is updated (task wakeup, task
migration, time progression) we call out to schedutil to update the hardware
DVFS state.

The basis is the CPU runqueue's 'running' metric, which per the above is
the frequency invariant utilization estimate of the CPU. From this we compute
a desired frequency like:

             max( running, util_est );  if UTIL_EST
  u_cfs := { running;                   otherwise

               clamp( u_cfs + u_rt , u_min, u_max );  if UCLAMP_TASK
  u_clamp := { u_cfs + u_rt;                          otherwise

  u := u_clamp + u_irq + u_dl;  [approx. see source for more detail]

  f_des := min( f_max, 1.25 u * f_max )
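
A sketch of this calculation in plain integer arithmetic (illustrative only;
the in-kernel version is get_next_freq() in kernel/sched/cpufreq_schedutil.c,
and the 1.25 headroom is applied here as u + u/4):

  #include <stdio.h>

  #define SCHED_CAPACITY_SCALE 1024

  static unsigned int clampval(unsigned int v, unsigned int lo, unsigned int hi)
  {
      return v < lo ? lo : (v > hi ? hi : v);
  }

  /* Utilizations are against SCHED_CAPACITY_SCALE, frequencies in kHz. */
  static unsigned int desired_freq(unsigned int running, unsigned int util_est,
                                   unsigned int u_rt, unsigned int u_irq,
                                   unsigned int u_dl,
                                   unsigned int u_min, unsigned int u_max,
                                   unsigned int f_max_khz)
  {
      unsigned int u_cfs = running > util_est ? running : util_est; /* UTIL_EST */
      unsigned int u_clamp = clampval(u_cfs + u_rt, u_min, u_max);  /* UCLAMP_TASK */
      unsigned int u = u_clamp + u_irq + u_dl;
      unsigned long long f_des =
              (unsigned long long)(u + u / 4) * f_max_khz / SCHED_CAPACITY_SCALE;

      return f_des > f_max_khz ? f_max_khz : f_des;
  }

  int main(void)
  {
      /* ~40% CFS utilization, no RT/IRQ/DL pressure, no clamps, 3GHz f_max. */
      printf("f_des = %u kHz\n",
             desired_freq(410, 300, 0, 0, 0, 0, SCHED_CAPACITY_SCALE, 3000000));
      return 0;
  }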

XXX IO-wait; when the update is due to a task wakeup from IO-completion we
boost 'u' above.

This frequency is then used to select a P-state/OPP or directly munged into a
CPPC style request to the hardware.

XXX: deadline tasks (Sporadic Task Model) allow us to calculate a hard f_min
required to satisfy the workload.

Because these callbacks are directly from the scheduler, the DVFS hardware
interaction should be 'fast' and non-blocking. Schedutil supports
rate-limiting DVFS requests when hardware interaction is slow and
expensive; this reduces effectiveness.

For more information see: kernel/sched/cpufreq_schedutil.c


NOTES
-----

 - On low-load scenarios, where DVFS is most relevant, the 'running' numbers
   will closely reflect utilization.

 - In saturated scenarios task movement will cause some transient dips;
   suppose we have a CPU saturated with 4 tasks, then when we migrate a task
   to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
   new CPU will gain 0.25. This is inevitable and time progression will
   correct this. XXX do we still guarantee f_max due to no idle-time?

 - Much of the above is about avoiding DVFS dips, and independent DVFS domains
   having to re-learn / ramp-up when load shifts.