mirror of https://github.com/Qortal/Brooklyn
You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
393 lines
15 KiB
393 lines
15 KiB
====================================== |
|
Wound/Wait Deadlock-Proof Mutex Design |
|
====================================== |
|
|
|
Please read mutex-design.txt first, as it applies to wait/wound mutexes too. |
|
|
|
Motivation for WW-Mutexes |
|
------------------------- |
|
|
|
GPU's do operations that commonly involve many buffers. Those buffers |
|
can be shared across contexts/processes, exist in different memory |
|
domains (for example VRAM vs system memory), and so on. And with |
|
PRIME / dmabuf, they can even be shared across devices. So there are |
|
a handful of situations where the driver needs to wait for buffers to |
|
become ready. If you think about this in terms of waiting on a buffer |
|
mutex for it to become available, this presents a problem because |
|
there is no way to guarantee that buffers appear in a execbuf/batch in |
|
the same order in all contexts. That is directly under control of |
|
userspace, and a result of the sequence of GL calls that an application |
|
makes. Which results in the potential for deadlock. The problem gets |
|
more complex when you consider that the kernel may need to migrate the |
|
buffer(s) into VRAM before the GPU operates on the buffer(s), which |
|
may in turn require evicting some other buffers (and you don't want to |
|
evict other buffers which are already queued up to the GPU), but for a |
|
simplified understanding of the problem you can ignore this. |
|
|
|
The algorithm that the TTM graphics subsystem came up with for dealing with |
|
this problem is quite simple. For each group of buffers (execbuf) that need |
|
to be locked, the caller would be assigned a unique reservation id/ticket, |
|
from a global counter. In case of deadlock while locking all the buffers |
|
associated with a execbuf, the one with the lowest reservation ticket (i.e. |
|
the oldest task) wins, and the one with the higher reservation id (i.e. the |
|
younger task) unlocks all of the buffers that it has already locked, and then |
|
tries again. |
|
|
|
In the RDBMS literature, a reservation ticket is associated with a transaction. |
|
and the deadlock handling approach is called Wait-Die. The name is based on |
|
the actions of a locking thread when it encounters an already locked mutex. |
|
If the transaction holding the lock is younger, the locking transaction waits. |
|
If the transaction holding the lock is older, the locking transaction backs off |
|
and dies. Hence Wait-Die. |
|
There is also another algorithm called Wound-Wait: |
|
If the transaction holding the lock is younger, the locking transaction |
|
wounds the transaction holding the lock, requesting it to die. |
|
If the transaction holding the lock is older, it waits for the other |
|
transaction. Hence Wound-Wait. |
|
The two algorithms are both fair in that a transaction will eventually succeed. |
|
However, the Wound-Wait algorithm is typically stated to generate fewer backoffs |
|
compared to Wait-Die, but is, on the other hand, associated with more work than |
|
Wait-Die when recovering from a backoff. Wound-Wait is also a preemptive |
|
algorithm in that transactions are wounded by other transactions, and that |
|
requires a reliable way to pick up the wounded condition and preempt the |
|
running transaction. Note that this is not the same as process preemption. A |
|
Wound-Wait transaction is considered preempted when it dies (returning |
|
-EDEADLK) following a wound. |
|
|
|
Concepts |
|
-------- |
|
|
|
Compared to normal mutexes two additional concepts/objects show up in the lock |
|
interface for w/w mutexes: |
|
|
|
Acquire context: To ensure eventual forward progress it is important the a task |
|
trying to acquire locks doesn't grab a new reservation id, but keeps the one it |
|
acquired when starting the lock acquisition. This ticket is stored in the |
|
acquire context. Furthermore the acquire context keeps track of debugging state |
|
to catch w/w mutex interface abuse. An acquire context is representing a |
|
transaction. |
|
|
|
W/w class: In contrast to normal mutexes the lock class needs to be explicit for |
|
w/w mutexes, since it is required to initialize the acquire context. The lock |
|
class also specifies what algorithm to use, Wound-Wait or Wait-Die. |
|
|
|
Furthermore there are three different class of w/w lock acquire functions: |
|
|
|
* Normal lock acquisition with a context, using ww_mutex_lock. |
|
|
|
* Slowpath lock acquisition on the contending lock, used by the task that just |
|
killed its transaction after having dropped all already acquired locks. |
|
These functions have the _slow postfix. |
|
|
|
From a simple semantics point-of-view the _slow functions are not strictly |
|
required, since simply calling the normal ww_mutex_lock functions on the |
|
contending lock (after having dropped all other already acquired locks) will |
|
work correctly. After all if no other ww mutex has been acquired yet there's |
|
no deadlock potential and hence the ww_mutex_lock call will block and not |
|
prematurely return -EDEADLK. The advantage of the _slow functions is in |
|
interface safety: |
|
|
|
- ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow |
|
has a void return type. Note that since ww mutex code needs loops/retries |
|
anyway the __must_check doesn't result in spurious warnings, even though the |
|
very first lock operation can never fail. |
|
- When full debugging is enabled ww_mutex_lock_slow checks that all acquired |
|
ww mutex have been released (preventing deadlocks) and makes sure that we |
|
block on the contending lock (preventing spinning through the -EDEADLK |
|
slowpath until the contended lock can be acquired). |
|
|
|
* Functions to only acquire a single w/w mutex, which results in the exact same |
|
semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL |
|
context. |
|
|
|
Again this is not strictly required. But often you only want to acquire a |
|
single lock in which case it's pointless to set up an acquire context (and so |
|
better to avoid grabbing a deadlock avoidance ticket). |
|
|
|
Of course, all the usual variants for handling wake-ups due to signals are also |
|
provided. |
|
|
|
Usage |
|
----- |
|
|
|
The algorithm (Wait-Die vs Wound-Wait) is chosen by using either |
|
DEFINE_WW_CLASS() (Wound-Wait) or DEFINE_WD_CLASS() (Wait-Die) |
|
As a rough rule of thumb, use Wound-Wait iff you |
|
expect the number of simultaneous competing transactions to be typically small, |
|
and you want to reduce the number of rollbacks. |
|
|
|
Three different ways to acquire locks within the same w/w class. Common |
|
definitions for methods #1 and #2:: |
|
|
|
static DEFINE_WW_CLASS(ww_class); |
|
|
|
struct obj { |
|
struct ww_mutex lock; |
|
/* obj data */ |
|
}; |
|
|
|
struct obj_entry { |
|
struct list_head head; |
|
struct obj *obj; |
|
}; |
|
|
|
Method 1, using a list in execbuf->buffers that's not allowed to be reordered. |
|
This is useful if a list of required objects is already tracked somewhere. |
|
Furthermore the lock helper can use propagate the -EALREADY return code back to |
|
the caller as a signal that an object is twice on the list. This is useful if |
|
the list is constructed from userspace input and the ABI requires userspace to |
|
not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl):: |
|
|
|
int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) |
|
{ |
|
struct obj *res_obj = NULL; |
|
struct obj_entry *contended_entry = NULL; |
|
struct obj_entry *entry; |
|
|
|
ww_acquire_init(ctx, &ww_class); |
|
|
|
retry: |
|
list_for_each_entry (entry, list, head) { |
|
if (entry->obj == res_obj) { |
|
res_obj = NULL; |
|
continue; |
|
} |
|
ret = ww_mutex_lock(&entry->obj->lock, ctx); |
|
if (ret < 0) { |
|
contended_entry = entry; |
|
goto err; |
|
} |
|
} |
|
|
|
ww_acquire_done(ctx); |
|
return 0; |
|
|
|
err: |
|
list_for_each_entry_continue_reverse (entry, list, head) |
|
ww_mutex_unlock(&entry->obj->lock); |
|
|
|
if (res_obj) |
|
ww_mutex_unlock(&res_obj->lock); |
|
|
|
if (ret == -EDEADLK) { |
|
/* we lost out in a seqno race, lock and retry.. */ |
|
ww_mutex_lock_slow(&contended_entry->obj->lock, ctx); |
|
res_obj = contended_entry->obj; |
|
goto retry; |
|
} |
|
ww_acquire_fini(ctx); |
|
|
|
return ret; |
|
} |
|
|
|
Method 2, using a list in execbuf->buffers that can be reordered. Same semantics |
|
of duplicate entry detection using -EALREADY as method 1 above. But the |
|
list-reordering allows for a bit more idiomatic code:: |
|
|
|
int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) |
|
{ |
|
struct obj_entry *entry, *entry2; |
|
|
|
ww_acquire_init(ctx, &ww_class); |
|
|
|
list_for_each_entry (entry, list, head) { |
|
ret = ww_mutex_lock(&entry->obj->lock, ctx); |
|
if (ret < 0) { |
|
entry2 = entry; |
|
|
|
list_for_each_entry_continue_reverse (entry2, list, head) |
|
ww_mutex_unlock(&entry2->obj->lock); |
|
|
|
if (ret != -EDEADLK) { |
|
ww_acquire_fini(ctx); |
|
return ret; |
|
} |
|
|
|
/* we lost out in a seqno race, lock and retry.. */ |
|
ww_mutex_lock_slow(&entry->obj->lock, ctx); |
|
|
|
/* |
|
* Move buf to head of the list, this will point |
|
* buf->next to the first unlocked entry, |
|
* restarting the for loop. |
|
*/ |
|
list_del(&entry->head); |
|
list_add(&entry->head, list); |
|
} |
|
} |
|
|
|
ww_acquire_done(ctx); |
|
return 0; |
|
} |
|
|
|
Unlocking works the same way for both methods #1 and #2:: |
|
|
|
void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) |
|
{ |
|
struct obj_entry *entry; |
|
|
|
list_for_each_entry (entry, list, head) |
|
ww_mutex_unlock(&entry->obj->lock); |
|
|
|
ww_acquire_fini(ctx); |
|
} |
|
|
|
Method 3 is useful if the list of objects is constructed ad-hoc and not upfront, |
|
e.g. when adjusting edges in a graph where each node has its own ww_mutex lock, |
|
and edges can only be changed when holding the locks of all involved nodes. w/w |
|
mutexes are a natural fit for such a case for two reasons: |
|
|
|
- They can handle lock-acquisition in any order which allows us to start walking |
|
a graph from a starting point and then iteratively discovering new edges and |
|
locking down the nodes those edges connect to. |
|
- Due to the -EALREADY return code signalling that a given objects is already |
|
held there's no need for additional book-keeping to break cycles in the graph |
|
or keep track off which looks are already held (when using more than one node |
|
as a starting point). |
|
|
|
Note that this approach differs in two important ways from the above methods: |
|
|
|
- Since the list of objects is dynamically constructed (and might very well be |
|
different when retrying due to hitting the -EDEADLK die condition) there's |
|
no need to keep any object on a persistent list when it's not locked. We can |
|
therefore move the list_head into the object itself. |
|
- On the other hand the dynamic object list construction also means that the -EALREADY return |
|
code can't be propagated. |
|
|
|
Note also that methods #1 and #2 and method #3 can be combined, e.g. to first lock a |
|
list of starting nodes (passed in from userspace) using one of the above |
|
methods. And then lock any additional objects affected by the operations using |
|
method #3 below. The backoff/retry procedure will be a bit more involved, since |
|
when the dynamic locking step hits -EDEADLK we also need to unlock all the |
|
objects acquired with the fixed list. But the w/w mutex debug checks will catch |
|
any interface misuse for these cases. |
|
|
|
Also, method 3 can't fail the lock acquisition step since it doesn't return |
|
-EALREADY. Of course this would be different when using the _interruptible |
|
variants, but that's outside of the scope of these examples here:: |
|
|
|
struct obj { |
|
struct ww_mutex ww_mutex; |
|
struct list_head locked_list; |
|
}; |
|
|
|
static DEFINE_WW_CLASS(ww_class); |
|
|
|
void __unlock_objs(struct list_head *list) |
|
{ |
|
struct obj *entry, *temp; |
|
|
|
list_for_each_entry_safe (entry, temp, list, locked_list) { |
|
/* need to do that before unlocking, since only the current lock holder is |
|
allowed to use object */ |
|
list_del(&entry->locked_list); |
|
ww_mutex_unlock(entry->ww_mutex) |
|
} |
|
} |
|
|
|
void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) |
|
{ |
|
struct obj *obj; |
|
|
|
ww_acquire_init(ctx, &ww_class); |
|
|
|
retry: |
|
/* re-init loop start state */ |
|
loop { |
|
/* magic code which walks over a graph and decides which objects |
|
* to lock */ |
|
|
|
ret = ww_mutex_lock(obj->ww_mutex, ctx); |
|
if (ret == -EALREADY) { |
|
/* we have that one already, get to the next object */ |
|
continue; |
|
} |
|
if (ret == -EDEADLK) { |
|
__unlock_objs(list); |
|
|
|
ww_mutex_lock_slow(obj, ctx); |
|
list_add(&entry->locked_list, list); |
|
goto retry; |
|
} |
|
|
|
/* locked a new object, add it to the list */ |
|
list_add_tail(&entry->locked_list, list); |
|
} |
|
|
|
ww_acquire_done(ctx); |
|
return 0; |
|
} |
|
|
|
void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) |
|
{ |
|
__unlock_objs(list); |
|
ww_acquire_fini(ctx); |
|
} |
|
|
|
Method 4: Only lock one single objects. In that case deadlock detection and |
|
prevention is obviously overkill, since with grabbing just one lock you can't |
|
produce a deadlock within just one class. To simplify this case the w/w mutex |
|
api can be used with a NULL context. |
|
|
|
Implementation Details |
|
---------------------- |
|
|
|
Design: |
|
^^^^^^^ |
|
|
|
ww_mutex currently encapsulates a struct mutex, this means no extra overhead for |
|
normal mutex locks, which are far more common. As such there is only a small |
|
increase in code size if wait/wound mutexes are not used. |
|
|
|
We maintain the following invariants for the wait list: |
|
|
|
(1) Waiters with an acquire context are sorted by stamp order; waiters |
|
without an acquire context are interspersed in FIFO order. |
|
(2) For Wait-Die, among waiters with contexts, only the first one can have |
|
other locks acquired already (ctx->acquired > 0). Note that this waiter |
|
may come after other waiters without contexts in the list. |
|
|
|
The Wound-Wait preemption is implemented with a lazy-preemption scheme: |
|
The wounded status of the transaction is checked only when there is |
|
contention for a new lock and hence a true chance of deadlock. In that |
|
situation, if the transaction is wounded, it backs off, clears the |
|
wounded status and retries. A great benefit of implementing preemption in |
|
this way is that the wounded transaction can identify a contending lock to |
|
wait for before restarting the transaction. Just blindly restarting the |
|
transaction would likely make the transaction end up in a situation where |
|
it would have to back off again. |
|
|
|
In general, not much contention is expected. The locks are typically used to |
|
serialize access to resources for devices, and optimization focus should |
|
therefore be directed towards the uncontended cases. |
|
|
|
Lockdep: |
|
^^^^^^^^ |
|
|
|
Special care has been taken to warn for as many cases of api abuse |
|
as possible. Some common api abuses will be caught with |
|
CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended. |
|
|
|
Some of the errors which will be warned about: |
|
- Forgetting to call ww_acquire_fini or ww_acquire_init. |
|
- Attempting to lock more mutexes after ww_acquire_done. |
|
- Attempting to lock the wrong mutex after -EDEADLK and |
|
unlocking all mutexes. |
|
- Attempting to lock the right mutex after -EDEADLK, |
|
before unlocking all mutexes. |
|
|
|
- Calling ww_mutex_lock_slow before -EDEADLK was returned. |
|
|
|
- Unlocking mutexes with the wrong unlock function. |
|
- Calling one of the ww_acquire_* twice on the same context. |
|
- Using a different ww_class for the mutex than for the ww_acquire_ctx. |
|
- Normal lockdep errors that can result in deadlocks. |
|
|
|
Some of the lockdep errors that can result in deadlocks: |
|
- Calling ww_acquire_init to initialize a second ww_acquire_ctx before |
|
having called ww_acquire_fini on the first. |
|
- 'normal' deadlocks that can occur. |
|
|
|
FIXME: |
|
Update this section once we have the TASK_DEADLOCK task state flag magic |
|
implemented.
|
|
|