.. _userfaultfd:

===========
Userfaultfd
===========

Objective
=========

Userfaults allow the implementation of on-demand paging from userland
and more generally they allow userland to take control of various
memory page faults, something otherwise only the kernel code could do.

For example, userfaults allow a proper and more optimal implementation
of the ``PROT_NONE+SIGSEGV`` trick.

Design
======

Userfaults are delivered and resolved through the ``userfaultfd`` syscall.

The ``userfaultfd`` (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:

1) a ``read/POLLIN`` protocol to notify a userland thread of the faults
   happening

2) various ``UFFDIO_*`` ioctls that can manage the virtual memory regions
   registered in the ``userfaultfd``, allowing userland to efficiently
   resolve the userfaults it receives via 1) or to manage the virtual
   memory in the background

The real advantage of userfaults compared to regular virtual memory
management with mremap/mprotect is that userfaults never involve
heavyweight structures like vmas in any of their operations (in fact the
``userfaultfd`` runtime load never takes the mmap_lock for writing).

Vmas are not suitable for page- (or hugepage-) granular fault tracking
when dealing with virtual address spaces that could span
terabytes. Too many vmas would be needed for that.

The ``userfaultfd``, once opened by invoking the syscall, can also be
passed over unix domain sockets to a manager process, so the same
manager process can handle the userfaults of a multitude of
different processes without them being aware of what is going on
(unless, of course, they later try to use the ``userfaultfd``
themselves on the same region the manager is already tracking, which
is a corner case that would currently return ``-EBUSY``).
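
Passing the descriptor to a manager uses the ordinary ``SCM_RIGHTS``
mechanism of unix domain sockets; nothing in it is specific to
``userfaultfd``. The following is a minimal sketch, not taken from any
existing manager: the helper names ``send_fd`` and ``recv_fd`` are
hypothetical, and a connected ``AF_UNIX`` socket is assumed.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send one file descriptor over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on error. (Hypothetical helper.) */
int send_fd(int sock, int fd)
{
    char dummy = 'F';                 /* one byte of ordinary payload */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor sent by send_fd().
 * Returns the new descriptor, or -1 on error. (Hypothetical helper.) */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The monitored process would call ``send_fd(sock, uffd)`` once after
registering its ranges, and the manager would poll the descriptor
returned by ``recv_fd(sock)``.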

API
===

When first opened, the ``userfaultfd`` must be enabled by invoking the
``UFFDIO_API`` ioctl with ``uffdio_api.api`` set to ``UFFD_API`` (or
a later API version), which specifies the ``read/POLLIN`` protocol
userland intends to speak on the ``UFFD``, and with the
``uffdio_api.features`` userland requires. If the ``UFFDIO_API`` ioctl
succeeds (i.e. if the requested ``uffdio_api.api`` is also spoken by the
running kernel and the requested features are going to be enabled), it
returns in ``uffdio_api.features`` and ``uffdio_api.ioctls`` two 64-bit
bitmasks of, respectively, all the available features of the read(2)
protocol and the generic ioctls available.

The ``uffdio_api.features`` bitmask returned by the ``UFFDIO_API`` ioctl
defines what memory types are supported by the ``userfaultfd`` and what
events, other than page fault notifications, may be generated:

- The ``UFFD_FEATURE_EVENT_*`` flags indicate that various events
  other than page faults are supported. These events are described in more
  detail below in the `Non-cooperative userfaultfd`_ section.

- ``UFFD_FEATURE_MISSING_HUGETLBFS`` and ``UFFD_FEATURE_MISSING_SHMEM``
  indicate that the kernel supports ``UFFDIO_REGISTER_MODE_MISSING``
  registrations for hugetlbfs and shared memory (covering all shmem APIs,
  i.e. tmpfs, ``IPCSHM``, ``/dev/zero``, ``MAP_SHARED``, ``memfd_create``,
  etc) virtual memory areas, respectively.

- ``UFFD_FEATURE_MINOR_HUGETLBFS`` indicates that the kernel supports
  ``UFFDIO_REGISTER_MODE_MINOR`` registration for hugetlbfs virtual memory
  areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating
  support for shmem virtual memory areas.

The userland application should set the feature flags it intends to use
when invoking the ``UFFDIO_API`` ioctl, to request that those features be
enabled if supported.

Once the ``userfaultfd`` API has been enabled, the ``UFFDIO_REGISTER``
ioctl should be invoked (if present in the returned ``uffdio_api.ioctls``
bitmask) to register a memory range in the ``userfaultfd`` by setting the
``uffdio_register`` structure accordingly. The ``uffdio_register.mode``
bitmask specifies to the kernel which kinds of faults to track for
the range. The ``UFFDIO_REGISTER`` ioctl returns the
``uffdio_register.ioctls`` bitmask of ioctls that are suitable to resolve
userfaults on the range registered. Not all ioctls will necessarily be
supported for all memory types (e.g. anonymous memory vs. shmem vs.
hugetlbfs), or all types of intercepted faults.

Userland can use the ``uffdio_register.ioctls`` to manage the virtual
address space in the background (to add or potentially also remove
memory from the ``userfaultfd`` registered range). This means a userfault
could trigger just before userland maps the faulted page in the
background.
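
Registration for missing-page tracking might look like the sketch below.
The helper names ``uffd_register_missing`` and ``uffd_range_can_copy`` are
hypothetical; the range is assumed to be page-aligned and to belong to a
mapping type the running kernel supports.

```c
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Register [addr, addr + len) for UFFDIO_REGISTER_MODE_MISSING.
 * On success returns 0 and stores the uffdio_register.ioctls bitmask
 * in *ioctls_out; on failure returns -1 with errno set and leaves
 * *ioctls_out untouched. (Hypothetical helper.) */
int uffd_register_missing(int uffd, void *addr, size_t len,
                          uint64_t *ioctls_out)
{
    struct uffdio_register reg;

    memset(&reg, 0, sizeof(reg));
    reg.range.start = (uint64_t)(uintptr_t)addr;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
        return -1;

    *ioctls_out = reg.ioctls;
    return 0;
}

/* Nonzero if UFFDIO_COPY may be used to resolve faults on the range
 * whose registration returned `ioctls`. */
int uffd_range_can_copy(uint64_t ioctls)
{
    return (ioctls & ((uint64_t)1 << _UFFDIO_COPY)) != 0;
}
```

Checking the returned bitmask before relying on a particular resolving
ioctl matters because, as noted above, support varies with the memory type
and the registered fault mode.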

Resolving Userfaults
--------------------

There are three basic ways to resolve userfaults:

- ``UFFDIO_COPY`` atomically copies some existing page contents from
  userspace.

- ``UFFDIO_ZEROPAGE`` atomically zeros the new page.

- ``UFFDIO_CONTINUE`` maps an existing, previously-populated page.

These operations are atomic in the sense that they guarantee nothing can
see a half-populated page, since readers will keep userfaulting until the
operation has finished.

By default, these wake up userfaults blocked on the range in question.
They support a ``UFFDIO_*_MODE_DONTWAKE`` ``mode`` flag, which indicates
that waking will be done separately at some later time.
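
As an illustration of the ``DONTWAKE`` flags, the sketch below wraps
``UFFDIO_COPY`` and ``UFFDIO_ZEROPAGE`` for a single page and exposes the
separate ``UFFDIO_WAKE`` step. All function names are hypothetical, and
the page size is assumed to be a power of two supplied by the caller.

```c
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Round an address down to the start of its page.
 * page_size must be a power of two. */
uint64_t uffd_page_base(uint64_t addr, uint64_t page_size)
{
    return addr & ~(page_size - 1);
}

/* Populate one page at `dst` with the contents found at `src`.
 * If `dontwake` is nonzero, blocked faulting threads stay asleep
 * until uffd_wake() is called. Returns 0 or -1 (errno set). */
int uffd_copy_page(int uffd, uint64_t dst, const void *src,
                   uint64_t page_size, int dontwake)
{
    struct uffdio_copy copy;

    memset(&copy, 0, sizeof(copy));
    copy.dst = dst;
    copy.src = (uint64_t)(uintptr_t)src;
    copy.len = page_size;
    copy.mode = dontwake ? UFFDIO_COPY_MODE_DONTWAKE : 0;
    return ioctl(uffd, UFFDIO_COPY, &copy) < 0 ? -1 : 0;
}

/* Map a zero page at `dst`. Same wake semantics as above. */
int uffd_zero_page(int uffd, uint64_t dst, uint64_t page_size,
                   int dontwake)
{
    struct uffdio_zeropage zp;

    memset(&zp, 0, sizeof(zp));
    zp.range.start = dst;
    zp.range.len = page_size;
    zp.mode = dontwake ? UFFDIO_ZEROPAGE_MODE_DONTWAKE : 0;
    return ioctl(uffd, UFFDIO_ZEROPAGE, &zp) < 0 ? -1 : 0;
}

/* Wake any faulting threads blocked on [start, start + len). */
int uffd_wake(int uffd, uint64_t start, uint64_t len)
{
    struct uffdio_range range;

    memset(&range, 0, sizeof(range));
    range.start = start;
    range.len = len;
    return ioctl(uffd, UFFDIO_WAKE, &range) < 0 ? -1 : 0;
}
```

A handler that fills several pages around a fault could pass
``dontwake = 1`` for each of them and issue a single ``uffd_wake()`` over
the whole span afterwards.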

Which ioctl to choose depends on the kind of page fault, and what we'd
like to do to resolve it:

- For ``UFFDIO_REGISTER_MODE_MISSING`` faults, the fault needs to be
  resolved by either providing a new page (``UFFDIO_COPY``), or mapping
  the zero page (``UFFDIO_ZEROPAGE``). By default, the kernel would map
  the zero page for a missing fault. With userfaultfd, userspace can
  decide what content to provide before the faulting thread continues.

- For ``UFFDIO_REGISTER_MODE_MINOR`` faults, there is an existing page (in
  the page cache). Userspace has the option of modifying the page's
  contents before resolving the fault. Once the contents are correct
  (modified or not), userspace asks the kernel to map the page and let the
  faulting thread continue with ``UFFDIO_CONTINUE``.

Notes:

- You can tell which kind of fault occurred by examining
  ``pagefault.flags`` within the ``uffd_msg``, checking for the
  ``UFFD_PAGEFAULT_FLAG_*`` flags.

- None of the page-delivering ioctls default to the range that you
  registered with. You must fill in all fields for the appropriate
  ioctl struct including the range.

- The address of the access that triggered the missing page event is in
  the ``struct uffd_msg`` that the handling thread reads from the uffd.
  You can supply as many pages as you want with these ioctls.
  Keep in mind that, unless you used DONTWAKE, the first of
  those ioctls wakes up the faulting thread.

- Be sure to test for all errors including
  (``pollfd[0].revents & POLLERR``). This can happen, e.g. when ranges
  supplied were incorrect.
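
The notes above can be sketched as a small receive-and-classify step. The
names ``fault_kind``, ``uffd_classify`` and ``uffd_wait_msg`` are
hypothetical. The minor-fault flag is guarded with ``#ifdef`` on the
assumption that older installed headers may not define it.

```c
#include <errno.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <unistd.h>

enum fault_kind { FAULT_NONE, FAULT_MISSING, FAULT_WP, FAULT_MINOR };

/* Classify a message read from the uffd by its pagefault flags.
 * Anything that is not a page fault yields FAULT_NONE. */
enum fault_kind uffd_classify(const struct uffd_msg *msg)
{
    if (msg->event != UFFD_EVENT_PAGEFAULT)
        return FAULT_NONE;
    if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)
        return FAULT_WP;
#ifdef UFFD_PAGEFAULT_FLAG_MINOR
    if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR)
        return FAULT_MINOR;
#endif
    return FAULT_MISSING;
}

/* Wait up to timeout_ms for one message on the descriptor.
 * Returns 1 with *msg filled in, 0 if nothing was available
 * (timeout or EAGAIN), and -1 on any error, including POLLERR. */
int uffd_wait_msg(int uffd, struct uffd_msg *msg, int timeout_ms)
{
    struct pollfd pfd = { .fd = uffd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);
    ssize_t r;

    if (n < 0)
        return -1;
    if (n == 0)
        return 0;
    if (pfd.revents & (POLLERR | POLLNVAL))
        return -1;

    r = read(uffd, msg, sizeof(*msg));
    if (r == (ssize_t)sizeof(*msg))
        return 1;
    if (r < 0 && errno == EAGAIN)
        return 0;                     /* fault was already resolved */
    return -1;
}
```

For a ``FAULT_MISSING`` result the handler would take
``msg.arg.pagefault.address``, round it down to a page boundary and fill
in the complete range of the resolving ioctl itself.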

Write Protect Notifications
---------------------------

This is equivalent to (but faster than) using mprotect and a SIGSEGV
signal handler.

First you need to register a range with ``UFFDIO_REGISTER_MODE_WP``.
Instead of using mprotect(2) you use
``ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect)``
with ``mode = UFFDIO_WRITEPROTECT_MODE_WP``
in the struct passed in. The range does not default to and does not
have to be identical to the range you registered with. You can write
protect as many ranges as you like (inside the registered range).
Then, in the thread reading from uffd, the message read will have
``msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP`` set. Now you send
``ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect)``
again, this time with ``UFFDIO_WRITEPROTECT_MODE_WP`` cleared in
``uffdio_writeprotect.mode``. This wakes up the thread, which will
continue to run with writes enabled. This allows you to do the
bookkeeping about the write in the uffd reading thread before the ioctl.
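
The protect, notify, unprotect cycle can be sketched as below. The helper
names and the ``count_dirty`` bookkeeping callback are hypothetical, the
installed headers are assumed to be recent enough to define
``UFFDIO_WRITEPROTECT``, and the range is assumed to lie inside a region
registered with ``UFFDIO_REGISTER_MODE_WP``.

```c
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical bookkeeping: number of write faults observed. */
uint64_t dirty_pages;

void count_dirty(uint64_t page)
{
    (void)page;
    dirty_pages++;
}

/* Set (protect != 0) or clear (protect == 0) write protection on a
 * page-aligned range. Clearing also wakes the blocked writer.
 * Returns 0 or -1 (errno set). */
int uffd_set_wp(int uffd, uint64_t start, uint64_t len, int protect)
{
    struct uffdio_writeprotect wp;

    memset(&wp, 0, sizeof(wp));
    wp.range.start = start;
    wp.range.len = len;
    wp.mode = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
    return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) < 0 ? -1 : 0;
}

/* Handle one message: if it is a write-protect fault, run the
 * bookkeeping callback first and then lift the protection on the
 * faulting page. Returns 1 if the message was not a WP fault,
 * otherwise the result of uffd_set_wp(). */
int uffd_handle_wp_fault(int uffd, const struct uffd_msg *msg,
                         uint64_t page_size,
                         void (*on_write)(uint64_t page))
{
    uint64_t page;

    if (msg->event != UFFD_EVENT_PAGEFAULT ||
        !(msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
        return 1;

    page = msg->arg.pagefault.address & ~(page_size - 1);
    if (on_write)
        on_write(page);               /* bookkeeping before the wakeup */
    return uffd_set_wp(uffd, page, page_size, 0);
}
```

Running the callback before the unprotecting ioctl mirrors the ordering
described above: the writer stays blocked until the bookkeeping is done.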

If you registered with both ``UFFDIO_REGISTER_MODE_MISSING`` and
``UFFDIO_REGISTER_MODE_WP`` then you need to think about the sequence in
which you supply a page and undo write protect. Note that there is a
difference between writes into a WP area and into a !WP area. The
former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
``UFFD_PAGEFAULT_FLAG_WRITE``. The latter did not fail on protection but
you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
used.

QEMU/KVM
========

QEMU/KVM uses the ``userfaultfd`` syscall to implement postcopy live
migration. Postcopy live migration is one form of memory
externalization consisting of a virtual machine running with part or
all of its memory residing on a different node in the cloud. The
``userfaultfd`` abstraction is generic enough that not a single line of
KVM kernel code had to be modified in order to add postcopy live
migration to QEMU.

Guest async page faults, ``FOLL_NOWAIT`` and all other ``GUP*`` features work
just fine in combination with userfaults. Userfaults trigger async
page faults in the guest scheduler so those guest processes that
aren't waiting for userfaults (e.g. network bound ones) can keep running
in the guest vcpus.

It is generally beneficial to run one pass of precopy live migration
just before starting postcopy live migration, in order to avoid
generating userfaults for readonly guest regions.

The implementation of postcopy live migration currently uses one
single bidirectional socket but in the future two different sockets
will be used (to reduce the latency of the userfaults to the minimum
possible without having to decrease ``/proc/sys/net/ipv4/tcp_wmem``).

The QEMU in the source node writes all pages that it knows are missing
in the destination node into the socket, and the migration thread of
the QEMU running in the destination node runs ``UFFDIO_COPY|ZEROPAGE``
ioctls on the ``userfaultfd`` in order to map the received pages into the
guest (``UFFDIO_ZEROPAGE`` is used if the source page was a zero page).

A different postcopy thread in the destination node listens with
poll() to the ``userfaultfd`` in parallel. When a ``POLLIN`` event is
generated after a userfault triggers, the postcopy thread calls read() on
the ``userfaultfd`` and receives the fault address (or ``-EAGAIN`` in case the
userfault was already resolved and woken by a ``UFFDIO_COPY|ZEROPAGE`` run
by the parallel QEMU migration thread).

After the QEMU postcopy thread (running in the destination node) gets
the userfault address, it writes the information about the missing page
into the socket. The QEMU source node receives the information,
roughly "seeks" to that page address and continues sending all
remaining missing pages from that new page offset. Soon after that
(just the time to flush the tcp_wmem queue through the network) the
migration thread in the QEMU running in the destination node will
receive the page that triggered the userfault and map it as
usual with ``UFFDIO_COPY|ZEROPAGE`` (without actually knowing if it
was spontaneously sent by the source or if it was an urgent page
requested through a userfault).

By the time the userfaults start, the QEMU in the destination node
doesn't need to keep any per-page state bitmap relative to the live
migration around; a single per-page bitmap has to be maintained in
the QEMU running in the source node to know which pages are still
missing in the destination node. The bitmap in the source node is
checked to find which missing pages to send in round robin, and we seek
over it when receiving incoming userfaults. After sending each page,
the bitmap is updated accordingly. It's also useful to avoid
sending the same page twice (in case the userfault is read by the
postcopy thread just before ``UFFDIO_COPY|ZEROPAGE`` runs in the migration
thread).
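
The source-side bitmap walk described above can be modelled in a few lines
of plain C. This is a toy model and not QEMU's code: the ``missing_map``
structure and its functions are hypothetical, one bit stands for one guest
page, and "sending" a page is reduced to clearing its bit.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct missing_map {
    uint8_t *bits;      /* 1 = page still missing on the destination */
    size_t npages;
    size_t cursor;      /* next page index to consider */
};

/* Mark every page as missing. Returns 0, or -1 if out of memory. */
int missing_map_init(struct missing_map *m, size_t npages)
{
    size_t nbytes = (npages + 7) / 8;

    m->bits = malloc(nbytes ? nbytes : 1);
    if (!m->bits)
        return -1;
    memset(m->bits, 0xff, nbytes);
    m->npages = npages;
    m->cursor = 0;
    return 0;
}

static int page_missing(const struct missing_map *m, size_t i)
{
    return (m->bits[i / 8] >> (i % 8)) & 1;
}

/* A userfault for `page` arrived: continue the walk from there. */
void missing_map_seek(struct missing_map *m, size_t page)
{
    if (page < m->npages)
        m->cursor = page;
}

/* Pick the next missing page at or after the cursor, wrapping around,
 * and mark it as sent. Returns its index, or -1 when no page is
 * missing any more. A page that was already sent is skipped, so a
 * late userfault never causes a second transfer. */
long missing_map_next(struct missing_map *m)
{
    size_t n;

    for (n = 0; n < m->npages; n++) {
        size_t i = (m->cursor + n) % m->npages;

        if (page_missing(m, i)) {
            m->bits[i / 8] &= (uint8_t)~(1u << (i % 8));
            m->cursor = (i + 1) % m->npages;
            return (long)i;
        }
    }
    return -1;
}
```

With eight pages, sending two pages, seeking to page 5 on a userfault and
draining the map visits the pages in the order 0, 1, 5, 6, 7, 2, 3, 4.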

Non-cooperative userfaultfd
===========================

When the ``userfaultfd`` is monitored by an external manager, the manager
must be able to track changes in the process virtual memory
layout. Userfaultfd can notify the manager about such changes using
the same read(2) protocol as for the page fault notifications. The
manager has to explicitly enable these events by setting appropriate
bits in ``uffdio_api.features`` passed to ``UFFDIO_API`` ioctl:

``UFFD_FEATURE_EVENT_FORK``
        enable ``userfaultfd`` hooks for fork(). When this feature is
        enabled, the ``userfaultfd`` context of the parent process is
        duplicated into the newly created process. The manager
        receives ``UFFD_EVENT_FORK`` with the file descriptor of the new
        ``userfaultfd`` context in the ``uffd_msg.fork``.

``UFFD_FEATURE_EVENT_REMAP``
        enable notifications about mremap() calls. When the
        non-cooperative process moves a virtual memory area to a
        different location, the manager will receive
        ``UFFD_EVENT_REMAP``. The ``uffd_msg.remap`` will contain the old and
        new addresses of the area and its original length.

``UFFD_FEATURE_EVENT_REMOVE``
        enable notifications about madvise(MADV_REMOVE) and
        madvise(MADV_DONTNEED) calls. The event ``UFFD_EVENT_REMOVE`` will
        be generated upon these calls to madvise(). The ``uffd_msg.remove``
        will contain start and end addresses of the removed area.

``UFFD_FEATURE_EVENT_UNMAP``
        enable notifications about memory unmapping. The manager will
        get ``UFFD_EVENT_UNMAP`` with ``uffd_msg.remove`` containing start and
        end addresses of the unmapped area.

Although ``UFFD_FEATURE_EVENT_REMOVE`` and ``UFFD_FEATURE_EVENT_UNMAP``
are similar, they differ considerably in the action expected from the
``userfaultfd`` manager. In the former case, the virtual memory is
removed but the area is not: the area remains monitored by the
``userfaultfd``, and if a page fault occurs in that area it will be
delivered to the manager. The proper resolution for such a page fault is
to zeromap the faulting address. In the latter case, however, when an
area is unmapped, either explicitly (with the munmap() system call) or
implicitly (e.g. during mremap()), the area is removed and in turn the
``userfaultfd`` context for that area disappears too, and the manager will
not get further userland page faults from the removed area. Still, the
notification is required in order to prevent the manager from using
``UFFDIO_COPY`` on the unmapped area.
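
How a manager might fold these events into its own view of a monitored
area is sketched below. The ``tracked_range`` structure and
``uffd_apply_event`` are hypothetical, and the model is deliberately
simplified: it tracks a single area, moves it only when the remap starts
at the tracked address, and treats only a fully covering unmap as ending
the monitoring.

```c
#include <linux/userfaultfd.h>
#include <stdint.h>

/* The manager's view of one monitored area (hypothetical). */
struct tracked_range {
    uint64_t start;
    uint64_t len;
    int monitored;      /* 0 once the area has been unmapped */
};

/* Update *r according to one non-page-fault event.
 * Returns the child's userfaultfd for UFFD_EVENT_FORK and -1 for
 * every other message. */
int uffd_apply_event(struct tracked_range *r, const struct uffd_msg *msg)
{
    switch (msg->event) {
    case UFFD_EVENT_FORK:
        /* New descriptor that the manager should start polling. */
        return (int)msg->arg.fork.ufd;
    case UFFD_EVENT_REMAP:
        /* Area moved: follow it to the new address. */
        if (msg->arg.remap.from == r->start)
            r->start = msg->arg.remap.to;
        break;
    case UFFD_EVENT_REMOVE:
        /* Pages dropped but the area is still monitored; later
         * faults in it are expected to be resolved by zeromapping. */
        break;
    case UFFD_EVENT_UNMAP:
        /* Area gone: UFFDIO_COPY must no longer target it. */
        if (msg->arg.remove.start <= r->start &&
            msg->arg.remove.end >= r->start + r->len)
            r->monitored = 0;
        break;
    default:
        break;
    }
    return -1;
}
```

A real manager would keep one such record per registered area and would
also have to split records on partial unmaps, which this sketch leaves
out.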

Unlike userland page faults, which have to be synchronous and require
an explicit or implicit wakeup, all the events are delivered
asynchronously and the non-cooperative process resumes execution as
soon as the manager executes read(). The ``userfaultfd`` manager should
carefully synchronize calls to ``UFFDIO_COPY`` with the events
processing. To aid the synchronization, the ``UFFDIO_COPY`` ioctl will
return ``-ENOSPC`` (``-ESRCH`` since Linux 4.13) when the monitored
process exits at the time of ``UFFDIO_COPY``, and ``-ENOENT`` when the
non-cooperative process has changed its virtual memory layout
simultaneously with an outstanding ``UFFDIO_COPY`` operation.
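
One way a manager could act on these return codes is sketched below. The
``copy_action`` enumeration and ``uffd_copy_action`` are hypothetical, and
the policy attached to each code is only one reasonable choice. ``ESRCH``
is handled alongside ``ENOSPC`` on the assumption that newer kernels
report an exited process that way.

```c
#include <errno.h>

enum copy_action {
    COPY_OK,             /* page delivered */
    COPY_DROP_PROCESS,   /* monitored process is gone: stop tracking it */
    COPY_RESYNC_EVENTS,  /* layout changed: process pending events first */
    COPY_FAIL            /* any other error: report it */
};

/* Map the result of a UFFDIO_COPY ioctl to a manager action.
 * `ret` is the ioctl return value and `err` the errno it left. */
enum copy_action uffd_copy_action(int ret, int err)
{
    if (ret == 0)
        return COPY_OK;

    switch (err) {
    case ENOSPC:         /* process exited during the copy */
    case ESRCH:          /* assumed equivalent on newer kernels */
        return COPY_DROP_PROCESS;
    case ENOENT:         /* destination area no longer where expected */
        return COPY_RESYNC_EVENTS;
    default:
        return COPY_FAIL;
    }
}
```

On ``COPY_RESYNC_EVENTS`` the manager would read and apply the queued
remap, remove or unmap events before deciding whether the copy should be
retried at a new address or dropped.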

The current asynchronous model of the event delivery is optimal for
single threaded non-cooperative ``userfaultfd`` manager implementations. A
synchronous event delivery model can be added later as a new
``userfaultfd`` feature to facilitate multithreading enhancements of the
non-cooperative manager, for example to allow ``UFFDIO_COPY`` ioctls to
run in parallel to the event reception. Single threaded
implementations should continue to use the current async event
delivery model instead.