forked from Qortal/Brooklyn
You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
256 lines
8.6 KiB
256 lines
8.6 KiB
|
|
============ |
|
MSG_ZEROCOPY |
|
============ |
|
|
|
Intro |
|
===== |
|
|
|
The MSG_ZEROCOPY flag enables copy avoidance for socket send calls. |
|
The feature is currently implemented for TCP and UDP sockets. |
|
|
|
|
|
Opportunity and Caveats |
|
----------------------- |
|
|
|
Copying large buffers between user process and kernel can be |
|
expensive. Linux supports various interfaces that eschew copying, |
|
such as sendpage and splice. The MSG_ZEROCOPY flag extends the |
|
underlying copy avoidance mechanism to common socket send calls. |
|
|
|
Copy avoidance is not a free lunch. As implemented, with page pinning, |
|
it replaces per byte copy cost with page accounting and completion |
|
notification overhead. As a result, MSG_ZEROCOPY is generally only |
|
effective at writes over around 10 KB. |
|
|
|
Page pinning also changes system call semantics. It temporarily shares |
|
the buffer between process and network stack. Unlike with copying, the |
|
process cannot immediately overwrite the buffer after system call |
|
return without possibly modifying the data in flight. Kernel integrity |
|
is not affected, but a buggy program can possibly corrupt its own data |
|
stream. |
|
|
|
The kernel returns a notification when it is safe to modify data. |
|
Converting an existing application to MSG_ZEROCOPY is not always as |
|
trivial as just passing the flag, then. |
|
|
|
|
|
More Info |
|
--------- |
|
|
|
Much of this document was derived from a longer paper presented at |
|
netdev 2.1. For more in-depth information see that paper and talk, |
|
the excellent reporting over at LWN.net or read the original code. |
|
|
|
paper, slides, video |
|
https://netdevconf.org/2.1/session.html?debruijn |
|
|
|
LWN article |
|
https://lwn.net/Articles/726917/ |
|
|
|
patchset |
|
[PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY |
|
https://lkml.kernel.org/netdev/[email protected] |
|
|
|
|
|
Interface |
|
========= |
|
|
|
Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy |
|
avoidance, but not the only one. |
|
|
|
Socket Setup |
|
------------ |
|
|
|
The kernel is permissive when applications pass undefined flags to the |
|
send system call. By default it simply ignores these. To avoid enabling |
|
copy avoidance mode for legacy processes that accidentally already pass |
|
this flag, a process must first signal intent by setting a socket option: |
|
|
|
:: |
|
|
|
if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one))) |
|
error(1, errno, "setsockopt zerocopy"); |
|
|
|
Transmission |
|
------------ |
|
|
|
The change to send (or sendto, sendmsg, sendmmsg) itself is trivial. |
|
Pass the new flag. |
|
|
|
:: |
|
|
|
ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY); |
|
|
|
A zerocopy failure will return -1 with errno ENOBUFS. This happens if |
|
the socket option was not set, the socket exceeds its optmem limit or |
|
the user exceeds its ulimit on locked pages. |
|
|
|
|
|
Mixing copy avoidance and copying |
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
|
|
Many workloads have a mixture of large and small buffers. Because copy |
|
avoidance is more expensive than copying for small packets, the |
|
feature is implemented as a flag. It is safe to mix calls with the flag |
|
with those without. |
|
|
|
|
|
Notifications |
|
------------- |
|
|
|
The kernel has to notify the process when it is safe to reuse a |
|
previously passed buffer. It queues completion notifications on the |
|
socket error queue, akin to the transmit timestamping interface. |
|
|
|
The notification itself is a simple scalar value. Each socket |
|
maintains an internal unsigned 32-bit counter. Each send call with |
|
MSG_ZEROCOPY that successfully sends data increments the counter. The |
|
counter is not incremented on failure or if called with length zero. |
|
The counter counts system call invocations, not bytes. It wraps after |
|
UINT_MAX calls. |
|
|
|
|
|
Notification Reception |
|
~~~~~~~~~~~~~~~~~~~~~~ |
|
|
|
The below snippet demonstrates the API. In the simplest case, each |
|
send syscall is followed by a poll and recvmsg on the error queue. |
|
|
|
Reading from the error queue is always a non-blocking operation. The |
|
poll call is there to block until an error is outstanding. It will set |
|
POLLERR in its output flags. That flag does not have to be set in the |
|
events field. Errors are signaled unconditionally. |
|
|
|
:: |
|
|
|
pfd.fd = fd; |
|
pfd.events = 0; |
|
if (poll(&pfd, 1, -1) != 1 || pfd.revents & POLLERR == 0) |
|
error(1, errno, "poll"); |
|
|
|
ret = recvmsg(fd, &msg, MSG_ERRQUEUE); |
|
if (ret == -1) |
|
error(1, errno, "recvmsg"); |
|
|
|
read_notification(msg); |
|
|
|
The example is for demonstration purpose only. In practice, it is more |
|
efficient to not wait for notifications, but read without blocking |
|
every couple of send calls. |
|
|
|
Notifications can be processed out of order with other operations on |
|
the socket. A socket that has an error queued would normally block |
|
other operations until the error is read. Zerocopy notifications have |
|
a zero error code, however, to not block send and recv calls. |
|
|
|
|
|
Notification Batching |
|
~~~~~~~~~~~~~~~~~~~~~ |
|
|
|
Multiple outstanding packets can be read at once using the recvmmsg |
|
call. This is often not needed. In each message the kernel returns not |
|
a single value, but a range. It coalesces consecutive notifications |
|
while one is outstanding for reception on the error queue. |
|
|
|
When a new notification is about to be queued, it checks whether the |
|
new value extends the range of the notification at the tail of the |
|
queue. If so, it drops the new notification packet and instead increases |
|
the range upper value of the outstanding notification. |
|
|
|
For protocols that acknowledge data in-order, like TCP, each |
|
notification can be squashed into the previous one, so that no more |
|
than one notification is outstanding at any one point. |
|
|
|
Ordered delivery is the common case, but not guaranteed. Notifications |
|
may arrive out of order on retransmission and socket teardown. |
|
|
|
|
|
Notification Parsing |
|
~~~~~~~~~~~~~~~~~~~~ |
|
|
|
The below snippet demonstrates how to parse the control message: the |
|
read_notification() call in the previous snippet. A notification |
|
is encoded in the standard error format, sock_extended_err. |
|
|
|
The level and type fields in the control data are protocol family |
|
specific, IP_RECVERR or IPV6_RECVERR. |
|
|
|
Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero, |
|
as explained before, to avoid blocking read and write system calls on |
|
the socket. |
|
|
|
The 32-bit notification range is encoded as [ee_info, ee_data]. This |
|
range is inclusive. Other fields in the struct must be treated as |
|
undefined, bar for ee_code, as discussed below. |
|
|
|
:: |
|
|
|
struct sock_extended_err *serr; |
|
struct cmsghdr *cm; |
|
|
|
cm = CMSG_FIRSTHDR(msg); |
|
if (cm->cmsg_level != SOL_IP && |
|
cm->cmsg_type != IP_RECVERR) |
|
error(1, 0, "cmsg"); |
|
|
|
serr = (void *) CMSG_DATA(cm); |
|
if (serr->ee_errno != 0 || |
|
serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) |
|
error(1, 0, "serr"); |
|
|
|
printf("completed: %u..%u\n", serr->ee_info, serr->ee_data); |
|
|
|
|
|
Deferred copies |
|
~~~~~~~~~~~~~~~ |
|
|
|
Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy |
|
avoidance, and a contract that the kernel will queue a completion |
|
notification. It is not a guarantee that the copy is elided. |
|
|
|
Copy avoidance is not always feasible. Devices that do not support |
|
scatter-gather I/O cannot send packets made up of kernel generated |
|
protocol headers plus zerocopy user data. A packet may need to be |
|
converted to a private copy of data deep in the stack, say to compute |
|
a checksum. |
|
|
|
In all these cases, the kernel returns a completion notification when |
|
it releases its hold on the shared pages. That notification may arrive |
|
before the (copied) data is fully transmitted. A zerocopy completion |
|
notification is not a transmit completion notification, therefore. |
|
|
|
Deferred copies can be more expensive than a copy immediately in the |
|
system call, if the data is no longer warm in the cache. The process |
|
also incurs notification processing cost for no benefit. For this |
|
reason, the kernel signals if data was completed with a copy, by |
|
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return. |
|
A process may use this signal to stop passing flag MSG_ZEROCOPY on |
|
subsequent requests on the same socket. |
|
|
|
|
|
Implementation |
|
============== |
|
|
|
Loopback |
|
-------- |
|
|
|
Data sent to local sockets can be queued indefinitely if the receive |
|
process does not read its socket. Unbound notification latency is not |
|
acceptable. For this reason all packets generated with MSG_ZEROCOPY |
|
that are looped to a local socket will incur a deferred copy. This |
|
includes looping onto packet sockets (e.g., tcpdump) and tun devices. |
|
|
|
|
|
Testing |
|
======= |
|
|
|
More realistic example code can be found in the kernel source under |
|
tools/testing/selftests/net/msg_zerocopy.c. |
|
|
|
Be cognizant of the loopback constraint. The test can be run between |
|
a pair of hosts. But if run between a local pair of processes, for |
|
instance when run with msg_zerocopy.sh between a veth pair across |
|
namespaces, the test will not show any improvement. For testing, the |
|
loopback restriction can be temporarily relaxed by making |
|
skb_orphan_frags_rx identical to skb_orphan_frags.
|
|
|