.. SPDX-License-Identifier: GPL-2.0

==============================
Running nested guests with KVM
==============================

Nested virtualization is the ability to run a guest inside another guest
(it can be KVM-based or a different hypervisor).  The straightforward
example is a KVM guest that in turn runs on a KVM guest (the rest of
this document is built on this example)::

               .----------------.  .----------------.
               |                |  |                |
               |       L2       |  |       L2       |
               | (Nested Guest) |  | (Nested Guest) |
               |                |  |                |
               |----------------'--'----------------|
               |                                    |
               |       L1 (Guest Hypervisor)        |
               |           KVM (/dev/kvm)           |
               |                                    |
      .------------------------------------------------------.
      |                 L0 (Host Hypervisor)                 |
      |                    KVM (/dev/kvm)                    |
      |------------------------------------------------------|
      |      Hardware (with virtualization extensions)       |
      '------------------------------------------------------'

Terminology:

- L0 – level-0; the bare metal host, running KVM

- L1 – level-1 guest; a VM running on L0; also called the "guest
  hypervisor", as it itself is capable of running KVM.

- L2 – level-2 guest; a VM running on L1, this is the "nested guest"

.. note:: The above diagram is modelled after the x86 architecture;
          s390x, ppc64 and other architectures are likely to have
          a different design for nesting.

          For example, s390x always has an LPAR (LogicalPARtition)
          hypervisor running on bare metal, adding another layer and
          resulting in at least four levels in a nested setup: L0 (bare
          metal, running the LPAR hypervisor), L1 (host hypervisor), L2
          (guest hypervisor), L3 (nested guest).

          This document will stick with the three-level terminology (L0,
          L1, and L2) for all architectures; and will largely focus on
          x86.


Use Cases
---------

There are several scenarios where nested KVM can be useful, to name a
few:

- As a developer, you want to test your software on different operating
  systems (OSes).  Instead of renting multiple VMs from a Cloud
  Provider, using nested KVM lets you rent a large enough "guest
  hypervisor" (level-1 guest).  This in turn allows you to create
  multiple nested guests (level-2 guests), running different OSes, on
  which you can develop and test your software.

- Live migration of "guest hypervisors" and their nested guests, for
  load balancing, disaster recovery, etc.

- VM image creation tools (e.g. ``virt-install``) often run
  their own VM, and users expect these to work inside a VM.

- Some OSes use virtualization internally for security (e.g. to let
  applications run safely in isolation).


Enabling "nested" (x86)
-----------------------

From Linux kernel v4.20 onwards, the ``nested`` KVM parameter is enabled
by default for Intel and AMD.  (Though your Linux distribution might
override this default.)

If you are running a Linux kernel older than v4.20, set the ``nested``
KVM module parameter to ``Y`` or ``1`` to enable nesting.  To persist
this setting across reboots, you can add it in a config file, as shown
below:

1. On the bare metal host (L0), list the kernel modules and ensure that
   the KVM modules are loaded::

    $ lsmod | grep -i kvm
    kvm_intel             133627  0
    kvm                   435079  1 kvm_intel

2. Show information for the ``kvm_intel`` module::

    $ modinfo kvm_intel | grep -i nested
    parm:           nested:bool

3. For the nested KVM configuration to persist across reboots, place the
   below in ``/etc/modprobe.d/kvm_intel.conf`` (create the file if it
   doesn't exist)::

    $ cat /etc/modprobe.d/kvm_intel.conf
    options kvm-intel nested=y

4. Unload and re-load the KVM Intel module::

    $ sudo rmmod kvm-intel
    $ sudo modprobe kvm-intel

5. Verify that the ``nested`` parameter for KVM is enabled::

    $ cat /sys/module/kvm_intel/parameters/nested
    Y

For AMD hosts, the process is the same as above, except that the module
name is ``kvm-amd``.


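As a minimal sketch, the verification in step 5 can be wrapped in a
shell function that handles both vendor modules.  The function name
``nested_status`` and its optional sysfs-root argument are illustrative
assumptions, not part of KVM; the argument exists only so that the
function can be tried against a scratch directory.

.. code-block:: shell

    # Report whether the "nested" parameter of the loaded KVM vendor
    # module is enabled.  $1 is an optional (hypothetical) sysfs root
    # and defaults to the real /sys.
    nested_status() {
        root="${1:-/sys}"
        for mod in kvm_intel kvm_amd; do
            param="$root/module/$mod/parameters/nested"
            if [ -r "$param" ]; then
                value=$(cat "$param")
                # The parameter may read back as Y/N or as 1/0.
                case "$value" in
                    Y|y|1) echo "$mod: nested enabled" ;;
                    *)     echo "$mod: nested disabled" ;;
                esac
                return 0
            fi
        done
        echo "no KVM vendor module loaded"
        return 1
    }

Called without arguments on L0, it reads the real ``/sys`` tree::

    $ nested_status
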
Additional nested-related kernel parameters (x86)
-------------------------------------------------

If your hardware is sufficiently advanced (Intel Haswell processor or
higher, which has newer hardware virt extensions), the following
additional features will also be enabled by default on your bare metal
host (L0): "Shadow VMCS (Virtual Machine Control Structure)" and APIC
Virtualization.  Parameters for Intel hosts::

    $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
    Y

    $ cat /sys/module/kvm_intel/parameters/enable_apicv
    Y

    $ cat /sys/module/kvm_intel/parameters/ept
    Y

.. note:: If you suspect your L2 (i.e. nested guest) is running slower,
          ensure the above are enabled (particularly
          ``enable_shadow_vmcs`` and ``ept``).


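The three checks above can be combined into one pass.  This is only a
sketch: the function name ``check_kvm_intel_params`` and its optional
directory argument are assumptions made for illustration, and the
argument defaults to the real parameter directory on an Intel host.

.. code-block:: shell

    # Print the state of each parameter named in this section and
    # return non-zero if any of them is missing or not enabled.
    check_kvm_intel_params() {
        dir="${1:-/sys/module/kvm_intel/parameters}"
        status=0
        for name in enable_shadow_vmcs enable_apicv ept; do
            if [ ! -r "$dir/$name" ]; then
                echo "$name: not available"
                status=1
                continue
            fi
            value=$(cat "$dir/$name")
            case "$value" in
                Y|y|1) echo "$name: enabled" ;;
                *)     echo "$name: DISABLED"; status=1 ;;
            esac
        done
        return "$status"
    }

A non-zero exit status is a hint to look at the listed parameter before
investigating a slow L2 guest any further.
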
Starting a nested guest (x86)
-----------------------------

Once your bare metal host (L0) is configured for nesting, you should be
able to start an L1 guest with::

    $ qemu-kvm -cpu host [...]

The above will pass through the host CPU's capabilities as-is to the
guest.  Alternatively, for better live migration compatibility, use a
named CPU model supported by QEMU, e.g.::

    $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on

The guest hypervisor will then be capable of running a nested guest
with accelerated KVM.


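One way to confirm from inside L1 that the virtualization extensions
were actually exposed is to look for the ``vmx`` (Intel) or ``svm``
(AMD) flag in the guest's CPU flags.  The sketch below assumes the usual
x86 ``/proc/cpuinfo`` layout with a ``flags`` line; the function name
``has_virt_extensions`` and its optional file argument are illustrative.

.. code-block:: shell

    # Succeed if the "flags" line of the given cpuinfo file lists vmx
    # or svm as a whole word.  $1 defaults to /proc/cpuinfo.
    has_virt_extensions() {
        cpuinfo="${1:-/proc/cpuinfo}"
        grep -Eq '^flags[[:space:]]*:.*[[:space:]](vmx|svm)([[:space:]]|$)' "$cpuinfo"
    }

Run inside the L1 guest, it can be paired with a check that the KVM
device node exists::

    $ has_virt_extensions && [ -c /dev/kvm ] && echo "L1 can run KVM guests"
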
Enabling "nested" (s390x)
-------------------------

1. On the host hypervisor (L0), enable the ``nested`` parameter on
   s390x::

    $ rmmod kvm
    $ modprobe kvm nested=1

   .. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
             with the ``nested`` parameter; i.e. to be able to enable
             ``nested``, the ``hpage`` parameter *must* be disabled.

2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
   feature; with QEMU, this can be done by using "host passthrough"
   (via the command-line ``-cpu host``).

3. Now the KVM module can be loaded in the L1 (guest hypervisor)::

    $ modprobe kvm


Live migration with nested KVM
------------------------------

Migrating an L1 guest, with a *live* nested guest in it, to another
bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for
Intel x86 systems, and even on older versions for s390x.

On AMD systems, once an L1 guest has started an L2 guest, the L1 guest
should no longer be migrated or saved (refer to QEMU documentation on
"savevm"/"loadvm") until the L2 guest shuts down.  Attempting to migrate
or save-and-load an L1 guest while an L2 guest is running will result in
undefined behavior.  You might see a ``kernel BUG!`` entry in ``dmesg``, a
kernel 'oops', or an outright kernel panic.  Such a migrated or loaded L1
guest can no longer be considered stable or secure, and must be restarted.
Migrating an L1 guest merely configured to support nesting, while not
actually running L2 guests, is expected to function normally even on AMD
systems but may fail once guests are started.

Migrating an L2 guest is always expected to succeed, so all the following
scenarios should work even on AMD systems:

- Migrating a nested guest (L2) to another L1 guest on the *same* bare
  metal host.

- Migrating a nested guest (L2) to another L1 guest on a *different*
  bare metal host.

- Migrating a nested guest (L2) to a bare metal host.

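The version thresholds above (kernel 5.3, QEMU 4.2.0) can be checked
with a small comparison helper.  This is a sketch only: ``version_ge``
and ``kernel_version`` are hypothetical helper names, and the comparison
relies on ``sort -V``, which is a GNU coreutils extension and may not be
available everywhere.

.. code-block:: shell

    # Succeed if version $1 is greater than or equal to version $2.
    # "sort -V" orders version strings; the smaller one comes first.
    version_ge() {
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
    }

    # Print the running kernel release with any distribution suffix
    # removed (everything from the first non-digit, non-dot character).
    kernel_version() {
        uname -r | sed 's/[^0-9.].*//'
    }

On an Intel L0 host, the kernel side of the requirement could then be
tested with::

    $ version_ge "$(kernel_version)" 5.3 && echo "kernel is new enough"
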
Reporting bugs from nested setups
-----------------------------------

Debugging "nested" problems can involve sifting through log files across
L0, L1 and L2; this can result in tedious back-and-forth between the bug
reporter and the bug fixer.

- Mention that you are in a "nested" setup.  If you are running any kind
  of "nesting" at all, say so.  Unfortunately, this needs to be called
  out because when reporting bugs, people tend to forget to even
  *mention* that they're using nested virtualization.

- Ensure you are actually running KVM on KVM.  Sometimes people do not
  have KVM enabled for their guest hypervisor (L1), which results in
  them running with pure emulation (what QEMU calls "TCG") while
  thinking they're running nested KVM.  This confuses "nested virt"
  (which could also mean QEMU on KVM) with "nested KVM" (KVM on KVM).

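A rough way to tell the two cases apart is to look at the QEMU
command line of the L2 guest for one of the options that request KVM
(``-enable-kvm``, ``-accel kvm``, or ``accel=kvm`` inside ``-machine``).
The sketch below is a heuristic under that assumption; the function name
``qemu_accel`` and its output strings are made up for illustration, and
a missing option does not prove TCG is in use, because some distribution
wrappers select KVM on their own.

.. code-block:: shell

    # Classify a QEMU command line, passed as a single string.
    # Prints "kvm" if an option requesting KVM is present, otherwise
    # "no-kvm-option".
    qemu_accel() {
        case " $1 " in
            *" -enable-kvm "*|*"accel=kvm"*|*" -accel kvm"*)
                echo kvm ;;
            *)
                echo no-kvm-option ;;
        esac
    }

For a running L2 guest, the command line can be read on L1 from
``/proc/<pid>/cmdline``, where ``<pid>`` is the QEMU process ID::

    $ qemu_accel "$(tr '\0' ' ' < /proc/<pid>/cmdline)"
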
Information to collect (generic)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following is not an exhaustive list, but a very good starting point:

- Kernel, libvirt, and QEMU version from L0

- Kernel, libvirt, and QEMU version from L1

- QEMU command-line of L1 -- when using libvirt, you'll find it here:
  ``/var/log/libvirt/qemu/instance.log``

- QEMU command-line of L2 -- as above, when using libvirt, get the
  complete libvirt-generated QEMU command-line

- ``cat /proc/cpuinfo`` from L0

- ``cat /proc/cpuinfo`` from L1

- ``lscpu`` from L0

- ``lscpu`` from L1

- Full ``dmesg`` output from L0

- Full ``dmesg`` output from L1

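Most of the generic items can be gathered with a short script that is
run once on L0 and once on L1.  The following is a sketch, not an
official tool: the function names, the output file names, and the
default directory are all assumptions, as is the QEMU binary name
``qemu-system-x86_64`` (some distributions ship it under a different
name or path).  The QEMU command lines still have to be copied from the
libvirt log by hand.

.. code-block:: shell

    # Run a command and store its output in a file.  $1 is the output
    # file; the remaining arguments are the command.  A tool that is
    # not installed is recorded as such instead of aborting the run.
    run_if_present() {
        outfile="$1"; shift
        if command -v "$1" >/dev/null 2>&1; then
            "$@" > "$outfile" 2>&1 || echo "command failed: $*" >> "$outfile"
        else
            echo "not installed: $1" > "$outfile"
        fi
    }

    # Collect version, CPU, and kernel log information into directory
    # $1 (default: ./nested-report).  "dmesg" may need root privileges.
    collect_nested_info() {
        out="${1:-nested-report}"
        mkdir -p "$out" || return 1
        run_if_present "$out/kernel.txt"  uname -r
        run_if_present "$out/qemu.txt"    qemu-system-x86_64 --version
        run_if_present "$out/libvirt.txt" virsh --version
        run_if_present "$out/cpuinfo.txt" cat /proc/cpuinfo
        run_if_present "$out/lscpu.txt"   lscpu
        run_if_present "$out/dmesg.txt"   dmesg
    }

Using a separate directory per level keeps the two sets of files apart,
for example::

    $ collect_nested_info report-L0     # on the bare metal host
    $ collect_nested_info report-L1     # inside the guest hypervisor
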
x86-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both the below commands, ``x86info`` and ``dmidecode``, should be
available on most Linux distributions under the same name:

- Output of: ``x86info -a`` from L0

- Output of: ``x86info -a`` from L1

- Output of: ``dmidecode`` from L0

- Output of: ``dmidecode`` from L1

s390x-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Along with the generic details mentioned earlier, the below is
also recommended:

- ``/proc/sysinfo`` from L1; this will also include the info from L0