forked from Qortal/Brooklyn
You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
90 lines
4.2 KiB
90 lines
4.2 KiB
.. SPDX-License-Identifier: GPL-2.0 |
|
|
|
===================== |
|
Syscall User Dispatch |
|
===================== |
|
|
|
Background |
|
---------- |
|
|
|
Compatibility layers like Wine need a way to efficiently emulate system |
|
calls of only a part of their process - the part that has the |
|
incompatible code - while being able to execute native syscalls without |
|
a high performance penalty on the native part of the process. Seccomp |
|
falls short on this task, since it has limited support to efficiently |
|
filter syscalls based on memory regions, and it doesn't support removing |
|
filters. Therefore a new mechanism is necessary. |
|
|
|
Syscall User Dispatch brings the filtering of the syscall dispatcher |
|
address back to userspace. The application is in control of a flip |
|
switch, indicating the current personality of the process. A |
|
multiple-personality application can then flip the switch without |
|
invoking the kernel, when crossing the compatibility layer API |
|
boundaries, to enable/disable the syscall redirection and execute |
|
syscalls directly (disabled) or send them to be emulated in userspace |
|
through a SIGSYS. |
|
|
|
The goal of this design is to provide very quick compatibility layer |
|
boundary crosses, which is achieved by not executing a syscall to change |
|
personality every time the compatibility layer executes. Instead, a |
|
userspace memory region exposed to the kernel indicates the current |
|
personality, and the application simply modifies that variable to |
|
configure the mechanism. |
|
|
|
There is a relatively high cost associated with handling signals on most |
|
architectures, like x86, but at least for Wine, syscalls issued by |
|
native Windows code are currently not known to be a performance problem, |
|
since they are quite rare, at least for modern gaming applications. |
|
|
|
Since this mechanism is designed to capture syscalls issued by |
|
non-native applications, it must function on syscalls whose invocation |
|
ABI is completely unexpected to Linux. Syscall User Dispatch, therefore |
|
doesn't rely on any of the syscall ABI to make the filtering. It uses |
|
only the syscall dispatcher address and the userspace key. |
|
|
|
As the ABI of these intercepted syscalls is unknown to Linux, these |
|
syscalls are not instrumentable via ptrace or the syscall tracepoints. |
|
|
|
Interface |
|
--------- |
|
|
|
A thread can setup this mechanism on supported kernels by executing the |
|
following prctl: |
|
|
|
prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector]) |
|
|
|
<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and |
|
disable the mechanism globally for that thread. When |
|
PR_SYS_DISPATCH_OFF is used, the other fields must be zero. |
|
|
|
[<offset>, <offset>+<length>) delimit a memory region interval |
|
from which syscalls are always executed directly, regardless of the |
|
userspace selector. This provides a fast path for the C library, which |
|
includes the most common syscall dispatchers in the native code |
|
applications, and also provides a way for the signal handler to return |
|
without triggering a nested SIGSYS on (rt\_)sigreturn. Users of this |
|
interface should make sure that at least the signal trampoline code is |
|
included in this region. In addition, for syscalls that implement the |
|
trampoline code on the vDSO, that trampoline is never intercepted. |
|
|
|
[selector] is a pointer to a char-sized region in the process memory |
|
region, that provides a quick way to enable disable syscall redirection |
|
thread-wide, without the need to invoke the kernel directly. selector |
|
can be set to SYSCALL_DISPATCH_FILTER_ALLOW or SYSCALL_DISPATCH_FILTER_BLOCK. |
|
Any other value should terminate the program with a SIGSYS. |
|
|
|
Security Notes |
|
-------------- |
|
|
|
Syscall User Dispatch provides functionality for compatibility layers to |
|
quickly capture system calls issued by a non-native part of the |
|
application, while not impacting the Linux native regions of the |
|
process. It is not a mechanism for sandboxing system calls, and it |
|
should not be seen as a security mechanism, since it is trivial for a |
|
malicious application to subvert the mechanism by jumping to an allowed |
|
dispatcher region prior to executing the syscall, or to discover the |
|
address and modify the selector value. If the use case requires any |
|
kind of security sandboxing, Seccomp should be used instead. |
|
|
|
Any fork or exec of the existing process resets the mechanism to |
|
PR_SYS_DISPATCH_OFF.
|
|
|