======================================
vlocks for Bare-Metal Mutual Exclusion
======================================

Voting Locks, or "vlocks", provide a simple low-level mutual exclusion
mechanism, with reasonable but minimal requirements on the memory
system.

These are intended to be used to coordinate critical activity among CPUs
which are otherwise non-coherent, in situations where the hardware
provides no other mechanism to support this and ordinary spinlocks
cannot be used.

vlocks make use of the atomicity provided by the memory system for
writes to a single memory location. To arbitrate, every CPU "votes for
itself", by storing a unique number to a common memory location. The
final value seen in that memory location when all the votes have been
cast identifies the winner.

In order to make sure that the election produces an unambiguous result
in finite time, a CPU will only enter the election in the first place if
no winner has been chosen and the election does not appear to have
started yet.

Algorithm
---------

The easiest way to explain the vlocks algorithm is with some pseudo-code::

    int currently_voting[NR_CPUS] = { 0, };
    int last_vote = -1; /* no votes yet */

    bool vlock_trylock(int this_cpu)
    {
        /* signal our desire to vote */
        currently_voting[this_cpu] = 1;
        if (last_vote != -1) {
            /* someone already volunteered himself */
            currently_voting[this_cpu] = 0;
            return false; /* not ourself */
        }

        /* let's suggest ourself */
        last_vote = this_cpu;
        currently_voting[this_cpu] = 0;

        /* then wait until everyone else is done voting */
        for_each_cpu(i) {
            while (currently_voting[i] != 0)
                /* wait */;
        }

        /* result */
        if (last_vote == this_cpu)
            return true; /* we won */
        return false;
    }

    void vlock_unlock(void)
    {
        /* reopen the election: no votes yet */
        last_vote = -1;
    }

The currently_voting[] array provides a way for the CPUs to determine
whether an election is in progress, and plays a role analogous to the
"entering" array in Lamport's bakery algorithm [1].

However, once the election has started, the underlying memory system
atomicity is used to pick the winner. This avoids the need for a static
priority rule to act as a tie-breaker, or any counters which could
overflow.

As long as the last_vote variable is globally visible to all CPUs, it
will contain only one value that won't change once every CPU has cleared
its currently_voting flag.
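
The pseudo-code can be turned into compilable C along the following
lines. This is only a minimal sketch, not the kernel implementation: it
assumes a hosted C11 compiler and uses sequentially consistent
<stdatomic.h> operations in place of whatever barriers or uncached
accesses a real platform would need. The names struct vlock and
vlock_init(), the lock argument passed to each function, and the value
NR_CPUS = 4 are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4            /* hypothetical CPU count for this sketch */
#define VLOCK_NO_VOTE (-1)   /* "no votes yet", as in the pseudo-code */

struct vlock {
    atomic_int currently_voting[NR_CPUS];
    atomic_int last_vote;
};

void vlock_init(struct vlock *v)
{
    for (int i = 0; i < NR_CPUS; i++)
        atomic_init(&v->currently_voting[i], 0);
    atomic_init(&v->last_vote, VLOCK_NO_VOTE);
}

bool vlock_trylock(struct vlock *v, int this_cpu)
{
    assert(this_cpu >= 0 && this_cpu < NR_CPUS);

    /* signal our desire to vote */
    atomic_store(&v->currently_voting[this_cpu], 1);
    if (atomic_load(&v->last_vote) != VLOCK_NO_VOTE) {
        /* an election has already started or been decided */
        atomic_store(&v->currently_voting[this_cpu], 0);
        return false;
    }

    /* vote for ourself, then announce that we have finished voting */
    atomic_store(&v->last_vote, this_cpu);
    atomic_store(&v->currently_voting[this_cpu], 0);

    /* wait until every other voter has finished */
    for (int i = 0; i < NR_CPUS; i++)
        while (atomic_load(&v->currently_voting[i]) != 0)
            ; /* spin */

    /* the last vote written identifies the winner */
    return atomic_load(&v->last_vote) == this_cpu;
}

void vlock_unlock(struct vlock *v)
{
    atomic_store(&v->last_vote, VLOCK_NO_VOTE);
}
```

Called in turn from a single thread, CPU 0 wins the first election, CPU 1
is then refused because last_vote is already set, and CPU 1 can win once
vlock_unlock() has been called.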

Features and limitations
------------------------

* vlocks are not intended to be fair. In the contended case, it is the
  *last* CPU which attempts to get the lock which will be most likely
  to win.

  vlocks are therefore best suited to situations where it is necessary
  to pick a unique winner, but it does not matter which CPU actually
  wins.

* Like other similar mechanisms, vlocks will not scale well to a large
  number of CPUs.

  vlocks can be cascaded in a voting hierarchy to permit better scaling
  if necessary, as in the following hypothetical example for 4096 CPUs::

      /* first level: local election among the 16 CPUs of one town */
      my_town = towns[(this_cpu >> 4) & 0xff];
      I_won = vlock_trylock(my_town, this_cpu & 0xf);
      if (I_won) {
          /* we won the town election, let's go for the state */
          my_state = states[(this_cpu >> 8) & 0xf];
          I_won = vlock_trylock(my_state, (this_cpu >> 4) & 0xf);
          if (I_won) {
              /* and so on */
              I_won = vlock_trylock(the_whole_country, (this_cpu >> 8) & 0xf);
              if (I_won) {
                  /* ... */
                  vlock_unlock(the_whole_country);
              }
              vlock_unlock(my_state);
          }
          vlock_unlock(my_town);
      }
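
The cascaded election can be sketched as compilable C as follows. This
is an illustration under stated assumptions rather than code from any
real implementation: 4096 CPUs are split into 256 towns of 16 CPUs and
16 states of 16 towns, each lock takes a voter ID in the range 0 to 15,
a loser releases the lower-level locks it already holds, and the helper
names elect_country_winner() and release_country() are hypothetical.
C11 atomics again stand in for platform-specific memory accesses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define VOTERS 16   /* voters per lock: one hex digit of the CPU number */

struct vlock {
    atomic_int voting[VOTERS];
    atomic_int last_vote;        /* 0 = no votes yet, else voter ID + 1 */
};

/* Static storage is zero-initialised, so every lock starts unlocked. */
static struct vlock towns[256], states[16], the_whole_country;

static bool vlock_trylock(struct vlock *v, int voter)
{
    atomic_store(&v->voting[voter], 1);
    if (atomic_load(&v->last_vote) != 0) {
        atomic_store(&v->voting[voter], 0);
        return false;
    }
    atomic_store(&v->last_vote, voter + 1);
    atomic_store(&v->voting[voter], 0);
    for (int i = 0; i < VOTERS; i++)
        while (atomic_load(&v->voting[i]) != 0)
            ; /* wait for the other voters */
    return atomic_load(&v->last_vote) == voter + 1;
}

static void vlock_unlock(struct vlock *v)
{
    atomic_store(&v->last_vote, 0);
}

/* Returns true if this_cpu (0..4095) wins the town, state and country. */
bool elect_country_winner(int this_cpu)
{
    struct vlock *town = &towns[(this_cpu >> 4) & 0xff];
    struct vlock *state = &states[(this_cpu >> 8) & 0xf];

    /* first level: the CPU votes within its town */
    if (!vlock_trylock(town, this_cpu & 0xf))
        return false;
    /* second level: the town winner votes within its state */
    if (!vlock_trylock(state, (this_cpu >> 4) & 0xf)) {
        vlock_unlock(town);
        return false;
    }
    /* third level: the state winner votes for the whole country */
    if (!vlock_trylock(&the_whole_country, (this_cpu >> 8) & 0xf)) {
        vlock_unlock(state);
        vlock_unlock(town);
        return false;
    }
    return true;
}

/* Releases all three levels; only the overall winner may call this. */
void release_country(int this_cpu)
{
    vlock_unlock(&the_whole_country);
    vlock_unlock(&states[(this_cpu >> 8) & 0xf]);
    vlock_unlock(&towns[(this_cpu >> 4) & 0xff]);
}
```

Each lock is only ever contended by 16 voters, so the wait loop at every
level stays short regardless of the total number of CPUs.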

ARM implementation
------------------

The current ARM implementation [2] contains some optimisations beyond
the basic algorithm:

* By packing the members of the currently_voting array close together,
  we can read the whole array in one transaction (providing the number
  of CPUs potentially contending the lock is small enough). This
  reduces the number of round-trips required to external memory.

  In the ARM implementation, this means that we can use a single load
  and comparison::

      LDR     Rt, [Rn]
      CMP     Rt, #0

  ...in place of code equivalent to::

      LDRB    Rt, [Rn]
      CMP     Rt, #0
      LDRBEQ  Rt, [Rn, #1]
      CMPEQ   Rt, #0
      LDRBEQ  Rt, [Rn, #2]
      CMPEQ   Rt, #0
      LDRBEQ  Rt, [Rn, #3]
      CMPEQ   Rt, #0

  This cuts down on the fast-path latency, as well as potentially
  reducing bus contention in contended cases.

  The optimisation relies on the fact that the ARM memory system
  guarantees coherency between overlapping memory accesses of
  different sizes, similarly to many other architectures. Note that
  we do not care which element of currently_voting appears in which
  bits of Rt, so there is no need to worry about endianness in this
  optimisation.

  If there are too many CPUs to read the currently_voting array in
  one transaction then multiple transactions are still required. The
  implementation uses a simple loop of word-sized loads for this
  case. The number of transactions is still fewer than would be
  required if bytes were loaded individually.

  In principle, we could aggregate further by using LDRD or LDM, but
  to keep the code simple this was not attempted in the initial
  implementation.

* vlocks are currently only used to coordinate between CPUs which are
  unable to enable their caches yet. This means that the
  implementation removes many of the barriers which would be required
  when executing the algorithm in cached memory.

  Packing of the currently_voting array does not work with cached
  memory unless all CPUs contending the lock are cache-coherent, due
  to cache writebacks from one CPU clobbering values written by other
  CPUs. (Though if all the CPUs are cache-coherent, you should
  probably be using proper spinlocks instead anyway.)

* The "no votes yet" value used for the last_vote variable is 0 (not
  -1 as in the pseudocode). This allows statically-allocated vlocks
  to be implicitly initialised to an unlocked state simply by putting
  them in .bss.

  An offset is added to each CPU's ID for the purpose of setting this
  variable, so that no CPU uses the value 0 for its ID.
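
Two of the points above, the packed currently_voting array and the
zero "no votes yet" value, can be illustrated in C. This is a hedged
sketch of the logic only and does not reproduce vlock.S: the structure
layout, the names packed_vlock, vote_value() and anyone_voting(), the
offset of 1 and the limit of 8 voters are all assumptions. A real
uncached implementation would also need to force genuine word-sized
device accesses, which plain C does not guarantee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NR_VOTERS        8    /* hypothetical; must be a multiple of 4 */
#define VLOCK_OWNER_NONE 0u   /* "no votes yet"; matches zeroed .bss */
#define CPU_ID_OFFSET    1u   /* hypothetical offset so no CPU votes 0 */

_Static_assert(NR_VOTERS % 4 == 0, "flags are scanned in 32-bit words");

struct packed_vlock {
    uint32_t owner;                        /* 0, or CPU ID + offset */
    uint8_t  currently_voting[NR_VOTERS];  /* one byte per CPU, packed */
};

/* Value a CPU stores into owner when it votes; never 0. */
uint32_t vote_value(unsigned int cpu)
{
    return cpu + CPU_ID_OFFSET;
}

/* Returns true if any voting flag is set, checking 4 flags per load. */
bool anyone_voting(const struct packed_vlock *v)
{
    for (size_t i = 0; i < NR_VOTERS; i += 4) {
        uint32_t word;

        /* memcpy keeps the word-sized read well defined in C */
        memcpy(&word, &v->currently_voting[i], sizeof(word));

        /* only zero vs. non-zero matters, so endianness is irrelevant */
        if (word != 0)
            return true;
    }
    return false;
}
```

With 8 voters the scan needs two word loads instead of eight byte
loads, and a zero-filled struct packed_vlock is already in the unlocked
state with no explicit initialisation.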

Colophon
--------

Originally created and documented by Dave Martin for Linaro Limited, for
use in ARM-based big.LITTLE platforms, with review and input gratefully
received from Nicolas Pitre and Achin Gupta. Thanks to Nicolas for
grabbing most of this text out of the relevant mail thread and writing
up the pseudocode.

Copyright (C) 2012-2013 Linaro Limited
Distributed under the terms of Version 2 of the GNU General Public
License, as defined in linux/COPYING.

References
----------

[1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
    Problem", Communications of the ACM 17, 8 (August 1974), 453-455.

    https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm

[2] linux/arch/arm/common/vlock.S, www.kernel.org.