.. SPDX-License-Identifier: GPL-2.0

===============================
Software Guard eXtensions (SGX)
===============================

Overview
========

Software Guard eXtensions (SGX) hardware enables user space applications
to set aside private memory regions of code and data:

* Privileged (ring-0) ENCLS functions orchestrate the construction of the
  regions.
* Unprivileged (ring-3) ENCLU functions allow an application to enter and
  execute inside the regions.

These memory regions are called enclaves. An enclave can only be entered at a
fixed set of entry points. Each entry point can hold a single hardware thread
at a time. While the enclave is loaded from a regular binary file by using
ENCLS functions, only the threads inside the enclave can access its memory. The
CPU denies access to the region from outside the enclave, and the region is
encrypted before it leaves the LLC.
|
|
|
Support can be determined by running:

``grep sgx /proc/cpuinfo``

SGX must be both supported by the processor and enabled by the BIOS. If SGX
appears to be unsupported on a system which has hardware support, ensure
support is enabled in the BIOS. If a BIOS presents a choice between "Enabled"
and "Software Enabled" modes for SGX, choose "Enabled".
|
|
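The same check can be scripted. The following is a minimal sketch, assuming the
kernel reports the feature as an ``sgx`` token on the ``flags`` line of
``/proc/cpuinfo``; the helper names are illustrative only. Unlike a plain
``grep``, it matches the exact flag and so does not treat related flags such as
``sgx_lc`` as evidence of SGX support.

```python
def cpu_flags(cpuinfo_text):
    """Return the flags listed on the first "flags" line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def sgx_advertised(cpuinfo_text):
    """True if the exact "sgx" flag is present (assumed flag name)."""
    return "sgx" in cpu_flags(cpuinfo_text)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read cpuinfo, returning an empty string if the file is unavailable."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""


if __name__ == "__main__":
    print("SGX advertised:", sgx_advertised(read_cpuinfo()))
```

A negative result does not distinguish between missing hardware support and
SGX being disabled in the BIOS.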
|
Enclave Page Cache
==================

SGX utilizes an *Enclave Page Cache (EPC)* to store pages that are associated
with an enclave. It is contained in a BIOS-reserved region of physical memory.
Unlike pages used for regular memory, EPC pages can only be accessed from
outside of the enclave during enclave construction with special, limited SGX
instructions.

Only a CPU executing inside an enclave can directly access enclave memory.
However, a CPU executing inside an enclave may access normal memory outside the
enclave.

The kernel manages enclave memory similar to how it treats device memory.
|
|
|
Enclave Page Types
------------------

**SGX Enclave Control Structure (SECS)**
   The enclave's address range, attributes and other global data are defined
   by this structure.

**Regular (REG)**
   Regular EPC pages contain the code and data of an enclave.

**Thread Control Structure (TCS)**
   Thread Control Structure pages define the entry points to an enclave and
   track the execution state of an enclave thread.

**Version Array (VA)**
   Version Array pages contain 512 slots, each of which can contain a version
   number for a page evicted from the EPC.
|
|
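The 512-slot figure follows from simple arithmetic. The sketch below assumes a
4096-byte EPC page and 8-byte version slots (4096 / 8 = 512), and shows how
many VA pages are needed to hold versions for a given number of evicted pages.
The function name is illustrative only.

```python
PAGE_SIZE = 4096       # bytes in one EPC page (assumed)
VA_SLOT_SIZE = 8       # bytes per version slot (assumed: 4096 / 512)
VA_SLOTS_PER_PAGE = PAGE_SIZE // VA_SLOT_SIZE


def va_pages_needed(evicted_pages):
    """Minimum number of VA pages whose slots cover `evicted_pages` versions."""
    if evicted_pages < 0:
        raise ValueError("evicted_pages must not be negative")
    # Ceiling division without floating point.
    return -(-evicted_pages // VA_SLOTS_PER_PAGE)
```

For example, evicting 513 pages needs two VA pages, because the first VA page
runs out of slots after 512 evictions.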
|
Enclave Page Cache Map
----------------------

The processor tracks EPC pages in a hardware metadata structure called the
*Enclave Page Cache Map (EPCM)*. The EPCM contains an entry for each EPC page
which describes the owning enclave, access rights and page type, among other
things.

EPCM permissions are separate from the normal page tables. This prevents the
kernel from, for instance, allowing writes to data which an enclave wishes to
remain read-only. EPCM permissions may only impose additional restrictions on
top of normal x86 page permissions.
|
|
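One way to picture "additional restrictions only" is as an intersection of two
permission sets. This is a conceptual model, not how the hardware is
implemented; the ``Perm`` type and ``effective`` helper are hypothetical names
introduced for illustration.

```python
from enum import IntFlag


class Perm(IntFlag):
    R = 1
    W = 2
    X = 4


def effective(page_table, epcm):
    """An access is allowed only if both the page tables and the EPCM allow it."""
    return page_table & epcm
```

With this model, a kernel mapping of ``Perm.R | Perm.W`` over a page whose EPCM
entry is ``Perm.R`` still yields a read-only page, and widening the page table
permissions can never add a right that the EPCM entry lacks.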
|
For all intents and purposes, the SGX architecture allows the processor to
invalidate all EPCM entries at will. This requires that software be prepared to
handle an EPCM fault at any time. In practice, this can happen on events like
power transitions when the ephemeral key that encrypts enclave memory is lost.
|
|
|
Application interface
=====================

Enclave build functions
-----------------------

In addition to the traditional compiler and linker build process, SGX has a
separate enclave “build” process. Enclaves must be built before they can be
executed (entered). The first step in building an enclave is opening the
**/dev/sgx_enclave** device. Since enclave memory is protected from direct
access, special privileged instructions are then used to copy data into enclave
pages and establish enclave page permissions.

.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
   :functions: sgx_ioc_enclave_create
               sgx_ioc_enclave_add_pages
               sgx_ioc_enclave_init
               sgx_ioc_enclave_provision
|
|
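As a rough illustration of what a loader passes to ``ioctl()`` on a
``/dev/sgx_enclave`` file descriptor, the sketch below derives the request
numbers for the first three build calls. The magic value ``0xA4``, the request
indices and the argument layouts are assumptions recalled from the uapi header
and should be checked against ``arch/x86/include/uapi/asm/sgx.h`` for the
kernel in use. The ``_ioc`` helper and ``BUILD_SEQUENCE`` are illustrative
names.

```python
import struct

SGX_MAGIC = 0xA4                # assumed ioctl type byte of the SGX driver
_IOC_WRITE, _IOC_READ = 1, 2    # direction bits of the generic _IOC() encoding


def _ioc(direction, nr, size):
    """Encode an ioctl number using the generic Linux layout used on x86."""
    return (direction << 30) | (size << 16) | (SGX_MAGIC << 8) | nr


# Assumed argument layouts; every field is a __u64.
CREATE_ARG = struct.Struct("=Q")       # src
ADD_PAGES_ARG = struct.Struct("=6Q")   # src, offset, length, secinfo, flags, count
INIT_ARG = struct.Struct("=Q")         # sigstruct

SGX_IOC_ENCLAVE_CREATE = _ioc(_IOC_WRITE, 0x00, CREATE_ARG.size)
SGX_IOC_ENCLAVE_ADD_PAGES = _ioc(_IOC_WRITE | _IOC_READ, 0x01,
                                 ADD_PAGES_ARG.size)
SGX_IOC_ENCLAVE_INIT = _ioc(_IOC_WRITE, 0x02, INIT_ARG.size)

# Order in which a loader would issue the calls: create the enclave, copy its
# pages in, then initialize it.
BUILD_SEQUENCE = (
    ("create", SGX_IOC_ENCLAVE_CREATE),
    ("add_pages", SGX_IOC_ENCLAVE_ADD_PAGES),
    ("init", SGX_IOC_ENCLAVE_INIT),
)

if __name__ == "__main__":
    for name, request in BUILD_SEQUENCE:
        print(f"{name:10s} {request:#010x}")
```

The sketch only computes request numbers. It does not open the device or build
an enclave, which additionally requires a valid SECS, page contents and a
SIGSTRUCT.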
|
Enclave runtime management
--------------------------

Systems supporting SGX2 additionally support changes to initialized
enclaves: modifying enclave page permissions and type, and dynamically
adding and removing enclave pages. When an enclave accesses an address
within its address range that does not have a backing page, a new
regular page is dynamically added to the enclave. The enclave is
still required to run EACCEPT on the new page before it can be used.

.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
   :functions: sgx_ioc_enclave_restrict_permissions
               sgx_ioc_enclave_modify_types
               sgx_ioc_enclave_remove_pages
|
|
|
Enclave vDSO
------------

Entering an enclave can only be done through SGX-specific EENTER and ERESUME
functions, and is a non-trivial process. Because of the complexity of
transitioning to and from an enclave, enclaves typically utilize a library to
handle the actual transitions. This is roughly analogous to how glibc
implementations are used by most applications to wrap system calls.

Another crucial characteristic of enclaves is that they can generate exceptions
as part of their normal operation that need to be handled in the enclave or are
unique to SGX.

Instead of the traditional signal mechanism to handle these exceptions, SGX
can leverage special exception fixup provided by the vDSO. The kernel-provided
vDSO function wraps low-level transitions to/from the enclave like EENTER and
ERESUME. The vDSO function intercepts exceptions that would otherwise generate
a signal and returns the fault information directly to its caller. This avoids
the need to juggle signal handlers.

.. kernel-doc:: arch/x86/include/uapi/asm/sgx.h
   :functions: vdso_sgx_enter_enclave_t
|
|
|
ksgxd
=====

SGX support includes a kernel thread called *ksgxd*.

EPC sanitization
----------------

ksgxd is started when SGX initializes. Enclave memory is typically ready
for use when the processor powers on or resets. However, if SGX has been in
use since the reset, enclave pages may be in an inconsistent state. This might
occur after a crash and kexec() cycle, for instance. At boot, ksgxd
reinitializes all enclave pages so that they can be allocated and re-used.

The sanitization is done by going through the EPC address space and applying
the EREMOVE function to each physical page. Some enclave pages like SECS pages
have hardware dependencies on other pages which prevent EREMOVE from
functioning. Executing two EREMOVE passes removes the dependencies.
|
|
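Why two passes are sufficient can be seen in a small model. The sketch below
is a simulation only, not the kernel's implementation: it assumes EREMOVE on a
SECS page fails while any of its child pages still exists, and the function
and parameter names are hypothetical.

```python
def sanitize(pages, children_of):
    """Model two EREMOVE passes over the EPC.

    pages:       page ids in EPC address order.
    children_of: maps a SECS page id to the set of its child page ids.
    Returns the pages left over after the first and after the second pass.
    """
    present = set(pages)

    def eremove(page):
        # A SECS page cannot be removed while a child page is still present.
        if children_of.get(page, set()) & present:
            return False
        present.discard(page)
        return True

    def one_pass(todo):
        failed = []
        for page in todo:
            if not eremove(page):
                failed.append(page)
        return failed

    after_first = one_pass(pages)
    after_second = one_pass(after_first)
    return after_first, after_second
```

If a SECS page sits at a lower address than its children, the first pass skips
it and removes the children, and the second pass then removes the SECS page.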
|
Page reclaimer
--------------

Similar to the core kswapd, ksgxd is responsible for managing the
overcommitment of enclave memory. If the system runs out of enclave memory,
*ksgxd* “swaps” enclave memory to normal memory.
|
|
|
Launch Control
==============

SGX provides a launch control mechanism. After all enclave pages have been
copied, the kernel executes the EINIT function, which initializes the enclave.
Only after this can the CPU execute inside the enclave.

The EINIT function takes an RSA-3072 signature of the enclave measurement. The
function checks that the measurement is correct and that the signature is
signed with the key hashed to the four **IA32_SGXLEPUBKEYHASH{0, 1, 2, 3}**
MSRs representing the SHA256 of a public key.

Those MSRs can be configured by the BIOS to be either readable or writable.
Linux supports only the writable configuration in order to give the kernel full
control of the launch control policy. Before calling the EINIT function, the
driver sets the MSRs to match the enclave's signing key.
|
|
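A SHA256 digest is 256 bits, which is why four 64-bit MSRs are needed to hold
it. The sketch below shows the split under two assumptions that are not stated
above: the hash is taken over the 384-byte RSA-3072 modulus of the signing
key, and each MSR holds eight consecutive digest bytes as a little-endian
value. The function name is illustrative.

```python
import hashlib
import struct

MODULUS_BYTES = 3072 // 8   # size of an RSA-3072 modulus in bytes


def lepubkeyhash_words(modulus):
    """Split SHA-256(modulus) into four 64-bit values, one per MSR.

    Assumes `modulus` is the 384-byte encoding of the signer's RSA modulus
    and that the words are little-endian.
    """
    if len(modulus) != MODULUS_BYTES:
        raise ValueError("expected a 384-byte RSA-3072 modulus")
    digest = hashlib.sha256(modulus).digest()
    return struct.unpack("<4Q", digest)
```

Packing the four words back with ``struct.pack("<4Q", *words)`` reproduces the
digest, so no information is lost by the split.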
|
Encryption engines
==================

In order to conceal the enclave data while it is out of the CPU package, the
memory controller has an encryption engine to transparently encrypt and decrypt
enclave memory.

In CPUs prior to Ice Lake, the Memory Encryption Engine (MEE) is used to
encrypt pages leaving the CPU caches. MEE uses an n-ary Merkle tree with root in
SRAM to maintain integrity of the encrypted data. This provides integrity and
anti-replay protection but does not scale to large memory sizes because the time
required to update the Merkle tree grows logarithmically in relation to the
memory size.
|
|
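The logarithmic growth can be made concrete by counting how many tree levels
must be updated when one protected block changes. This is a generic n-ary tree
calculation; the arity and block counts used with it are hypothetical and do
not describe the actual MEE parameters.

```python
def merkle_levels(leaves, arity):
    """Number of levels above the leaves in an n-ary tree over `leaves` blocks."""
    if leaves < 1 or arity < 2:
        raise ValueError("need at least one leaf and an arity of two or more")
    levels = 0
    nodes = leaves
    while nodes > 1:
        nodes = -(-nodes // arity)   # ceiling division: parents of this level
        levels += 1
    return levels
```

With an assumed arity of 8, protecting 8**5 blocks needs 5 levels, and doubling
the protected memory adds one more level to every update.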
|
CPUs starting from Ice Lake use Total Memory Encryption (TME) in the place of
MEE. TME-based SGX implementations do not have an integrity Merkle tree, which
means integrity and replay attacks are not mitigated. However, it includes
additional changes to prevent cipher text from being returned and SW memory
aliases from being created.

DMA to enclave memory is blocked by range registers on both MEE and TME systems
(SDM section 41.10).
|
|
|
Usage Models
============

Shared Library
--------------

Sensitive data and the code that acts on it are partitioned from the application
into a separate library. The library is then linked as a DSO which can be loaded
into an enclave. The application can then make individual function calls into
the enclave through special SGX instructions. A run-time within the enclave is
configured to marshal function parameters into and out of the enclave and to
call the correct library function.

Application Container
---------------------

An application may be loaded into a container enclave which is specially
configured with a library OS and run-time which permits the application to run.
The enclave run-time and library OS work together to execute the application
when a thread enters the enclave.
|
|
|
Impact of Potential Kernel SGX Bugs
===================================

EPC leaks
---------

When EPC page leaks happen, a WARNING like this is shown in dmesg:

"EREMOVE returned ... and an EPC page was leaked. SGX may become unusable..."

This is effectively a kernel use-after-free of an EPC page, and due
to the way SGX works, the bug is detected at freeing. Rather than
adding the page back to the pool of available EPC pages, the kernel
intentionally leaks the page to avoid additional errors in the future.

When this happens, the kernel will likely soon leak more EPC pages, and
SGX will likely become unusable because the memory available to SGX is
limited. However, while this may be fatal to SGX, the rest of the kernel
is unlikely to be impacted and should continue to work.

As a result, when this happens, users should stop running any new
SGX workloads (or any new workloads at all) and migrate all valuable
workloads. Although a machine reboot can recover all EPC memory, the bug
should be reported to Linux developers.
|
|
|
|
|
Virtual EPC
===========

The implementation also has a virtual EPC driver to support SGX enclaves
in guests. Unlike the SGX driver, an EPC page allocated by the virtual
EPC driver doesn't have a specific enclave associated with it. This is
because KVM doesn't track how a guest uses EPC pages.

As a result, the SGX core page reclaimer doesn't support reclaiming EPC
pages allocated to KVM guests through the virtual EPC driver. If the
user wants to deploy SGX applications both on the host and in guests
on the same machine, the user should reserve enough EPC (by subtracting the
total virtual EPC size of all SGX VMs from the physical EPC size) for
host SGX applications so they can run with acceptable performance.
|
|
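The reservation described above is a subtraction. The helper below is a
minimal sketch with hypothetical names and example sizes; it treats a guest
total larger than the physical EPC as a configuration error, which is a choice
made for this illustration.

```python
MIB = 1 << 20


def host_epc_remaining(physical_epc, guest_vepc_sizes):
    """EPC bytes left for host enclaves once all guest virtual EPC is set aside."""
    total_guest = sum(guest_vepc_sizes)
    if total_guest > physical_epc:
        raise ValueError("guest virtual EPC exceeds the physical EPC size")
    return physical_epc - total_guest
```

For instance, with 128 MiB of physical EPC and two guests given 32 MiB and
64 MiB of virtual EPC, 32 MiB remains for host SGX applications. Because guest
pages cannot be reclaimed, that remainder is what the page reclaimer has to
work with on the host.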
|
Architectural behavior is to restore all EPC pages to an uninitialized
state after a guest reboot as well. Because this state can be reached only
through the privileged ``ENCLS[EREMOVE]`` instruction, ``/dev/sgx_vepc``
provides the ``SGX_IOC_VEPC_REMOVE_ALL`` ioctl to execute the instruction
on all pages in the virtual EPC.

``EREMOVE`` can fail for three reasons. Userspace must pay attention
to expected failures and handle them as follows:

1. Page removal will always fail when any thread is running in the
   enclave to which the page belongs. In this case the ioctl will
   return ``EBUSY`` independent of whether it has successfully removed
   some pages; userspace can avoid these failures by preventing execution
   of any vcpu which maps the virtual EPC.

2. Page removal will cause a general protection fault if two calls to
   ``EREMOVE`` happen concurrently for pages that refer to the same
   "SECS" metadata pages. This can happen if there are concurrent
   invocations of ``SGX_IOC_VEPC_REMOVE_ALL``, or if a ``/dev/sgx_vepc``
   file descriptor in the guest is closed at the same time as
   ``SGX_IOC_VEPC_REMOVE_ALL``; it will also be reported as ``EBUSY``.
   This can be avoided in userspace by serializing calls to the ioctl()
   and to close(), but in general it should not be a problem.

3. Finally, page removal will fail for SECS metadata pages which still
   have child pages. Child pages can be removed by executing
   ``SGX_IOC_VEPC_REMOVE_ALL`` on all ``/dev/sgx_vepc`` file descriptors
   mapped into the guest. This means that the ioctl() must be called
   twice: an initial set of calls to remove child pages and a subsequent
   set of calls to remove SECS pages. The second set of calls is only
   required for those mappings that returned a nonzero value from the
   first call. It indicates a bug in the kernel or the userspace client
   if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
   a return code other than 0.
|
|
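The two rounds described in item 3 can be sketched as follows. The
``reset_vepc`` helper is hypothetical, and the request number
``_IO(0xA4, 0x04)`` is an assumption that should be checked against the uapi
header. The sketch expects the caller to have stopped all vcpus and to
serialize calls, as items 1 and 2 require, and it lets an ``EBUSY`` error
surface as ``OSError``.

```python
import fcntl

# Assumed request number: _IO(SGX_MAGIC, 0x04) with SGX_MAGIC == 0xA4.
SGX_IOC_VEPC_REMOVE_ALL = (0xA4 << 8) | 0x04


def ioctl_remove_all(fd):
    """Issue the ioctl; returns 0 or a nonzero count, raises OSError on error."""
    return fcntl.ioctl(fd, SGX_IOC_VEPC_REMOVE_ALL)


def reset_vepc(fds, remove_all=ioctl_remove_all):
    """Run both rounds over every vEPC fd mapped into one guest.

    Returns the fds that needed a second call. Raises RuntimeError if a
    second-round call still reports failures.
    """
    # Round 1: visit every fd first, so that all child pages are gone before
    # any SECS page is retried.
    retry = [fd for fd in fds if remove_all(fd) != 0]
    # Round 2: only the mappings that reported leftover pages.
    for fd in retry:
        if remove_all(fd) != 0:
            raise RuntimeError(f"vEPC fd {fd}: pages left after second round")
    return retry
```

Completing the first round on all file descriptors before starting the second
matters because a SECS page in one mapping may have child pages in another.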
|