.. SPDX-License-Identifier: GPL-2.0 |
|
|
|
######### |
|
UML HowTo |
|
######### |
|
|
|
.. contents:: :local: |
|
|
|
************ |
|
Introduction |
|
************ |
|
|
|
Welcome to User Mode Linux |
|
|
|
User Mode Linux is the first Open Source virtualization platform (first |
|
release date 1991) and second virtualization platform for an x86 PC. |
|
|
|
How is UML Different from a VM using Virtualization package X? |
|
============================================================== |
|
|
|
We have come to assume that virtualization also means some level of |
|
hardware emulation. In fact, it does not. As long as a virtualization |
|
package provides the OS with devices which the OS can recognize and |
|
has a driver for, the devices do not need to emulate real hardware. |
|
Most OSes today have built-in support for a number of "fake" |
|
devices used only under virtualization. |
|
User Mode Linux takes this concept to the ultimate extreme - there |
|
is not a single real device in sight. It is 100% artificial or if |
|
we use the correct term 100% paravirtual. All UML devices are abstract |
|
concepts which map onto something provided by the host - files, sockets, |
|
pipes, etc. |
|
|
|
The other major difference between UML and various virtualization |
|
packages is that there is a distinct difference between the way the UML |
|
kernel and the UML programs operate. |
|
The UML kernel is just a process running on Linux - same as any other |
|
program. It can be run by an unprivileged user and it does not require |
|
anything in terms of special CPU features. |
|
The UML userspace, however, is a bit different. The Linux kernel on the |
|
host machine assists UML in intercepting everything the program running |
|
on a UML instance is trying to do and making the UML kernel handle all |
|
of its requests. |
|
This is different from other virtualization packages which do not
distinguish between the guest kernel and guest programs. This difference
results in a number of advantages and disadvantages of UML over, let's say,
QEMU, which we will cover later in this document.
|
|
|
|
|
Why Would I Want User Mode Linux? |
|
================================= |
|
|
|
|
|
* If User Mode Linux kernel crashes, your host kernel is still fine. It |
|
is not accelerated in any way (vhost, kvm, etc) and it is not trying to |
|
access any devices directly. It is, in fact, a process like any other. |
|
|
|
* You can run a usermode kernel as a non-root user (you may need to |
|
arrange appropriate permissions for some devices). |
|
|
|
* You can run a very small VM with a minimal footprint for a specific |
|
task (for example 32M or less). |
|
|
|
* You can get extremely high performance for anything which is a "kernel |
|
specific task" such as forwarding, firewalling, etc while still being |
|
isolated from the host kernel. |
|
|
|
* You can play with kernel concepts without breaking things. |
|
|
|
* You are not bound by "emulating" hardware, so you can try weird and |
|
wonderful concepts which are very difficult to support when emulating |
|
real hardware such as time travel and making your system clock |
|
dependent on what UML does (very useful for things like tests). |
|
|
|
* It's fun. |
|
|
|
Why not to run UML |
|
================== |
|
|
|
* The syscall interception technique used by UML makes it inherently |
|
slower for any userspace applications. While it can do kernel tasks |
|
on par with most other virtualization packages, its userspace is |
|
**slow**. The root cause is that UML has a very high cost of creating |
|
new processes and threads (something most Unix/Linux applications |
|
take for granted). |
|
|
|
* UML is strictly uniprocessor at present. If you want to run an |
|
application which needs many CPUs to function, it is clearly the |
|
wrong choice. |
|
|
|
*********************** |
|
Building a UML instance |
|
*********************** |
|
|
|
There is no UML installer in any distribution. While you can use off |
|
the shelf install media to install into a blank VM using a virtualization |
|
package, there is no UML equivalent. You have to use appropriate tools on |
|
your host to build a viable filesystem image. |
|
|
|
This is extremely easy on Debian - you can do it using debootstrap. It is |
|
also easy on OpenWRT - the build process can build UML images. All other |
|
distros - YMMV. |
|
|
|
Creating an image |
|
================= |
|
|
|
Create a sparse raw disk image:: |
|
|
|
# dd if=/dev/zero of=disk_image_name bs=1 count=1 seek=16G |
|
|
|
This will create a 16G disk image. The OS will initially allocate only one |
|
block and will allocate more as they are written by UML. As of kernel |
|
version 4.19 UML fully supports TRIM (as usually used by flash drives). |
|
Using TRIM inside the UML image by specifying discard as a mount option |
|
or by running ``tune2fs -o discard /dev/ubdXX`` will request UML to |
|
return any unused blocks to the OS. |
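To see the sparse allocation described above, the apparent size of an image
can be compared with the space actually allocated for it. The sketch below is
an illustration only: it assumes GNU coreutils (``truncate`` and ``stat -c``),
uses ``truncate`` in place of the ``dd`` invocation, and works on a temporary
file rather than a real image.

```shell
#!/bin/sh
# Illustration only: compare the apparent size of a sparse image with
# the space the host filesystem has actually allocated for it.
# Assumes GNU coreutils (truncate, stat -c).

make_sparse_image() {
    # $1 = image path, $2 = size as understood by truncate(1), e.g. 16G
    truncate -s "$2" "$1"
}

img=$(mktemp)
make_sparse_image "$img" 16G

apparent=$(stat -c %s "$img")                                # size in bytes
allocated=$(( $(stat -c %b "$img") * $(stat -c %B "$img") )) # bytes on disk

echo "apparent=$apparent allocated=$allocated"
rm -f "$img"
```

On a host filesystem that supports sparse files the allocated figure should
stay far below the apparent size until UML starts writing to the image.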
|
|
|
Create a filesystem on the disk image and mount it:: |
|
|
|
# mkfs.ext4 ./disk_image_name && mount ./disk_image_name /mnt |
|
|
|
This example uses ext4, any other filesystem such as ext3, btrfs, xfs, |
|
jfs, etc will work too. |
|
|
|
Create a minimal OS installation on the mounted filesystem:: |
|
|
|
# debootstrap buster /mnt http://deb.debian.org/debian |
|
|
|
debootstrap does not set up the root password, fstab, hostname or |
|
anything related to networking. It is up to the user to do that. |
|
|
|
Set the root password. The easiest way to do that is to chroot into the
mounted image::

   # chroot /mnt
   # passwd
   # exit
|
|
|
Edit key system files |
|
===================== |
|
|
|
UML block devices are called ubds. The fstab created by debootstrap
will be empty and it needs an entry for the root file system::

   /dev/ubd0   /   ext4    discard,errors=remount-ro  0       1

The image hostname will be set to the same as the host on which you
are creating the image. It is a good idea to change that to avoid
"Oh, bummer, I rebooted the wrong machine".
|
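The fstab and hostname edits can also be scripted. The following is a minimal
sketch, not part of any UML tooling: ``write_guest_config`` is a hypothetical
helper, the hostname ``uml-guest`` is an arbitrary example, and the
demonstration writes into a scratch directory instead of a mounted image.

```shell
#!/bin/sh
# Sketch: write the minimal fstab and hostname into a guest root tree.
# write_guest_config is a hypothetical helper, not part of any UML tool.

write_guest_config() {
    # $1 = root of the mounted image (e.g. /mnt), $2 = guest hostname
    root=$1
    name=$2
    mkdir -p "$root/etc"
    # fields: device, mount point, type, options, dump, pass
    printf '%s\n' '/dev/ubd0 / ext4 discard,errors=remount-ro 0 1' \
        > "$root/etc/fstab"
    printf '%s\n' "$name" > "$root/etc/hostname"
}

# Demonstration against a scratch directory instead of a real image.
# Remove it afterwards with: rm -rf "$demo_root"
demo_root=$(mktemp -d)
write_guest_config "$demo_root" uml-guest
cat "$demo_root/etc/fstab" "$demo_root/etc/hostname"
```

For a real image the call would be ``write_guest_config /mnt uml-guest``, run
as root while the image is mounted.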
|
|
UML supports two classes of network devices. The older uml_net devices,
which are scheduled for obsoletion, are called ethX. The newer vector IO
devices are significantly faster and support some standard virtual
network encapsulations like Ethernet over GRE and Ethernet over L2TPv3.
These are called vecX.

Depending on which one is in use, ``/etc/network/interfaces`` will
need entries like::

   # legacy UML network devices
   auto eth0
   iface eth0 inet dhcp

   # vector UML network devices
   auto vec0
   iface vec0 inet dhcp
|
|
|
We now have a UML image which is nearly ready to run, all we need is a |
|
UML kernel and modules for it. |
|
|
|
Most distributions have a UML package. Even if you intend to use your own |
|
kernel, testing the image with a stock one is always a good start. These |
|
packages come with a set of modules which should be copied to the target |
|
filesystem. The location is distribution dependent. For Debian these |
|
reside under /usr/lib/uml/modules. Copy recursively the content of this |
|
directory to the mounted UML filesystem:: |
|
|
|
# cp -rax /usr/lib/uml/modules /mnt/lib/modules |
|
|
|
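If you have compiled your own kernel, you need to use the usual "install
modules to a location" procedure by running::

   # make ARCH=um modules_install INSTALL_MOD_PATH=/mnt

``INSTALL_MOD_PATH`` is the root of the target filesystem, so the modules
end up under ``/mnt/lib/modules``.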
|
|
|
At this point the image is ready to be brought up. |
|
|
|
************************* |
|
Setting Up UML Networking |
|
************************* |
|
|
|
UML networking is designed to emulate an Ethernet connection. This |
|
connection may be either a point-to-point (similar to a connection |
|
between machines using a back-to-back cable) or a connection to a |
|
switch. UML supports a wide variety of means to build these |
|
connections to all of: local machine, remote machine(s), local and |
|
remote UML and other VM instances. |
|
|
|
|
|
+-----------+--------+------------------------------------+------------+
| Transport | Type   | Capabilities                       | Throughput |
+===========+========+====================================+============+
| tap       | vector | checksum, tso                      | > 8Gbit    |
+-----------+--------+------------------------------------+------------+
| hybrid    | vector | checksum, tso, multipacket rx      | > 6Gbit    |
+-----------+--------+------------------------------------+------------+
| raw       | vector | checksum, tso, multipacket rx, tx  | > 6Gbit    |
+-----------+--------+------------------------------------+------------+
| EoGRE     | vector | multipacket rx, tx                 | > 3Gbit    |
+-----------+--------+------------------------------------+------------+
| Eol2tpv3  | vector | multipacket rx, tx                 | > 3Gbit    |
+-----------+--------+------------------------------------+------------+
| bess      | vector | multipacket rx, tx                 | > 3Gbit    |
+-----------+--------+------------------------------------+------------+
| fd        | vector | dependent on fd type               | varies     |
+-----------+--------+------------------------------------+------------+
| tuntap    | legacy | none                               | ~ 500Mbit  |
+-----------+--------+------------------------------------+------------+
| daemon    | legacy | none                               | ~ 450Mbit  |
+-----------+--------+------------------------------------+------------+
| socket    | legacy | none                               | ~ 450Mbit  |
+-----------+--------+------------------------------------+------------+
| pcap      | legacy | rx only                            | ~ 450Mbit  |
+-----------+--------+------------------------------------+------------+
| ethertap  | legacy | obsolete                           | ~ 500Mbit  |
+-----------+--------+------------------------------------+------------+
| vde       | legacy | obsolete                           | ~ 500Mbit  |
+-----------+--------+------------------------------------+------------+
|
|
|
* All transports which have tso and checksum offloads can deliver speeds |
|
approaching 10G on TCP streams. |
|
|
|
* All transports which have multi-packet rx and/or tx can deliver pps |
|
rates of up to 1Mps or more. |
|
|
|
* All legacy transports are generally limited to ~600-700MBit and 0.05Mps |
|
|
|
* GRE and L2TPv3 allow connections to all of: local machine, remote |
|
machines, remote network devices and remote UML instances. |
|
|
|
* Socket allows connections only between UML instances. |
|
|
|
* Daemon and bess require running a local switch. This switch may be |
|
connected to the host as well. |
|
|
|
|
|
Network configuration privileges |
|
================================ |
|
|
|
The majority of the supported networking modes need ``root`` privileges. |
|
For example, in the legacy tuntap networking mode, users were required |
|
to be part of the group associated with the tunnel device. |
|
|
|
For newer network drivers like the vector transports, ``root`` privilege |
|
is required to fire an ioctl to setup the tun interface and/or use |
|
raw sockets where needed. |
|
|
|
This can be achieved by granting the user a particular capability instead
of running UML as root. In case of vector transport, a user can add the
capability ``CAP_NET_ADMIN`` or ``CAP_NET_RAW`` to the uml binary.
UML can then be run with normal user privileges, along with
full networking.
|
|
|
For example:: |
|
|
|
# sudo setcap cap_net_raw,cap_net_admin+ep linux |
|
|
|
Configuring vector transports |
|
=============================== |
|
|
|
All vector transports support a similar syntax: |
|
|
|
If X is the interface number as in vec0, vec1, vec2, etc, the general |
|
syntax for options is:: |
|
|
|
vecX:transport="Transport Name",option=value,option=value,...,option=value |
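When UML is started from scripts it can be convenient to assemble this
specification from its parts. The sketch below only illustrates the syntax:
``vec_spec`` is a hypothetical helper, not something shipped with UML.

```shell
#!/bin/sh
# Sketch: assemble a vector transport specification from its parts.
# vec_spec is a hypothetical helper used only for illustration.

vec_spec() {
    # $1 = interface number, $2 = transport name, rest = option=value pairs
    spec="vec$1:transport=$2"
    shift 2
    for opt in "$@"; do
        spec="$spec,$opt"
    done
    printf '%s\n' "$spec"
}

vec_spec 0 tap ifname=tap0 depth=128 gro=1
```

The resulting string is passed to the UML kernel as a single command line
argument.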
|
|
|
Common options |
|
-------------- |
|
|
|
These options are common for all transports: |
|
|
|
* ``depth=int`` - sets the queue depth for vector IO. This is the |
|
amount of packets UML will attempt to read or write in a single |
|
system call. The default number is 64 and is generally sufficient |
|
for most applications that need throughput in the 2-4 Gbit range. |
|
Higher speeds may require larger values. |
|
|
|
* ``mac=XX:XX:XX:XX:XX:XX`` - sets the interface MAC address value.
|
|
|
* ``gro=[0,1]`` - sets GRO on or off. Enables receive/transmit offloads. |
|
The effect of this option depends on the host side support in the transport |
|
which is being configured. In most cases it will enable TCP segmentation and |
|
RX/TX checksumming offloads. The setting must be identical on the host side |
|
and the UML side. The UML kernel will produce warnings if it is not. |
|
For example, GRO is enabled by default on local machine interfaces |
|
(e.g. veth pairs, bridge, etc), so it should be enabled in UML in the |
|
corresponding UML transports (raw, tap, hybrid) in order for networking to |
|
operate correctly. |
|
|
|
* ``mtu=int`` - sets the interface MTU |
|
|
|
* ``headroom=int`` - adjusts the default headroom (32 bytes) reserved |
|
if a packet will need to be re-encapsulated into for instance VXLAN. |
|
|
|
* ``vec=0`` - disable multipacket io and fall back to packet at a |
|
time mode |
|
|
|
Shared Options |
|
-------------- |
|
|
|
* ``ifname=str`` Transports which bind to a local network interface |
|
have a shared option - the name of the interface to bind to. |
|
|
|
* ``src, dst, src_port, dst_port`` - all transports which use sockets |
|
which have the notion of source and destination and/or source port |
|
and destination port use these to specify them. |
|
|
|
* ``v6=[0,1]`` to specify if a v6 connection is desired for all
  transports which operate over IP. Additionally, for transports that
  have some differences in the way they operate over v4 and v6 (for example
  EoL2TPv3), sets the correct mode of operation. In the absence of this
  option, the socket type is determined based on what the src and dst
  arguments resolve/parse to.
|
|
|
tap transport |
|
------------- |
|
|
|
Example:: |
|
|
|
vecX:transport=tap,ifname=tap0,depth=128,gro=1 |
|
|
|
This will connect vec0 to tap0 on the host. Tap0 must already exist (for example
created using tunctl) and be UP.
|
|
|
tap0 can be configured as a point-to-point interface and given an ip |
|
address so that UML can talk to the host. Alternatively, it is possible |
|
to connect UML to a tap interface which is connected to a bridge. |
|
|
|
While tap relies on the vector infrastructure, it is not a true vector |
|
transport at this point, because Linux does not support multi-packet |
|
IO on tap file descriptors for normal userspace apps like UML. This |
|
is a privilege which is offered only to something which can hook up |
|
to it at kernel level via specialized interfaces like vhost-net. A |
|
vhost-net like helper for UML is planned at some point in the future. |
|
|
|
Privileges required: tap transport requires either: |
|
|
|
* tap interface to exist and be created persistent and owned by the |
|
UML user using tunctl. Example ``tunctl -u uml-user -t tap0`` |
|
|
|
* binary to have ``CAP_NET_ADMIN`` privilege |
|
|
|
hybrid transport |
|
---------------- |
|
|
|
Example:: |
|
|
|
vecX:transport=hybrid,ifname=tap0,depth=128,gro=1 |
|
|
|
This is an experimental/demo transport which couples tap for transmit |
|
and a raw socket for receive. The raw socket allows multi-packet |
|
receive resulting in significantly higher packet rates than normal tap |
|
|
|
Privileges required: hybrid requires ``CAP_NET_RAW`` capability by |
|
the UML user as well as the requirements for the tap transport. |
|
|
|
raw socket transport |
|
-------------------- |
|
|
|
Example:: |
|
|
|
vecX:transport=raw,ifname=p-veth0,depth=128,gro=1 |
|
|
|
|
|
This transport uses vector IO on raw sockets. While you can bind to any
interface including a physical one, the most common use is to bind to
the "peer" side of a veth pair with the other side configured on the
host.
|
|
|
Example host configuration for Debian: |
|
|
|
**/etc/network/interfaces**:: |
|
|
|
auto veth0 |
|
iface veth0 inet static |
|
address 192.168.4.1 |
|
netmask 255.255.255.252 |
|
broadcast 192.168.4.3 |
|
pre-up ip link add veth0 type veth peer name p-veth0 && \ |
|
ifconfig p-veth0 up |
|
|
|
UML can now bind to p-veth0 like this:: |
|
|
|
vec0:transport=raw,ifname=p-veth0,depth=128,gro=1 |
|
|
|
|
|
If the UML guest is configured with 192.168.4.2 and netmask 255.255.255.0 |
|
it can talk to the host on 192.168.4.1 |
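Whether two addresses can reach each other directly depends on both falling
into the same network once the netmask is applied. The following sketch works
through that arithmetic for the addresses used above; the helper names
``ip_to_int`` and ``same_subnet`` are made up for this illustration.

```shell
#!/bin/sh
# Sketch: check whether two IPv4 addresses fall into the same network
# for a given netmask. Helper names are illustrative only.

ip_to_int() {
    # Convert dotted-quad notation to a single integer.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

same_subnet() {
    # $1 = first address, $2 = second address, $3 = netmask
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    m=$(ip_to_int "$3")
    [ $(( a & m )) -eq $(( b & m )) ]
}

for mask in 255.255.255.252 255.255.255.0; do
    if same_subnet 192.168.4.1 192.168.4.2 "$mask"; then
        echo "192.168.4.1 and 192.168.4.2 share a network with mask $mask"
    fi
done
```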
|
|
|
The raw transport also provides some support for offloading some of the |
|
filtering to the host. The two options to control it are: |
|
|
|
* ``bpffile=str`` filename of raw bpf code to be loaded as a socket filter |
|
|
|
* ``bpfflash=int`` 0/1 allow loading of bpf from inside User Mode Linux. |
|
This option allows the use of the ethtool load firmware command to |
|
load bpf code. |
|
|
|
In either case the bpf code is loaded into the host kernel. While this is |
|
presently limited to legacy bpf syntax (not ebpf), it is still a security |
|
risk. It is not recommended to allow this unless the User Mode Linux |
|
instance is considered trusted. |
|
|
|
Privileges required: raw socket transport requires ``CAP_NET_RAW``
capability.
|
|
|
GRE socket transport |
|
-------------------- |
|
|
|
Example:: |
|
|
|
vecX:transport=gre,src=$src_host,dst=$dst_host |
|
|
|
|
|
This will configure an Ethernet over ``GRE`` (aka ``GRETAP`` or |
|
``GREIRB``) tunnel which will connect the UML instance to a ``GRE`` |
|
endpoint at host dst_host. ``GRE`` supports the following additional |
|
options: |
|
|
|
* ``rx_key=int`` - GRE 32 bit integer key for rx packets, if set,
  ``tx_key`` must be set too

* ``tx_key=int`` - GRE 32 bit integer key for tx packets, if set
  ``rx_key`` must be set too
|
|
|
* ``sequence=[0,1]`` - enable GRE sequence |
|
|
|
* ``pin_sequence=[0,1]`` - pretend that the sequence is always reset |
|
on each packet (needed to interoperate with some really broken |
|
implementations) |
|
|
|
* ``v6=[0,1]`` - force IPv4 or IPv6 sockets respectively |
|
|
|
* GRE checksum is not presently supported |
|
|
|
GRE has a number of caveats: |
|
|
|
* You can use only one GRE connection per ip address. There is no way to |
|
multiplex connections as each GRE tunnel is terminated directly on |
|
the UML instance. |
|
|
|
* The key is not really a security feature. While it was intended as such
  its "security" is laughable. It is, however, a useful feature to
  ensure that the tunnel is not misconfigured.
|
|
|
An example configuration for a Linux host with a local address of |
|
192.168.128.1 to connect to a UML instance at 192.168.129.1 |
|
|
|
**/etc/network/interfaces**:: |
|
|
|
auto gt0 |
|
iface gt0 inet static |
|
address 10.0.0.1 |
|
netmask 255.255.255.0 |
|
broadcast 10.0.0.255 |
|
mtu 1500 |
|
pre-up ip link add gt0 type gretap local 192.168.128.1 \ |
|
remote 192.168.129.1 || true |
|
down ip link del gt0 || true |
|
|
|
Additionally, GRE has been tested versus a variety of network equipment. |
|
|
|
Privileges required: GRE requires ``CAP_NET_RAW`` |
|
|
|
l2tpv3 socket transport |
|
----------------------- |
|
|
|
**Warning**. L2TPv3 has a "bug". It is the "bug" known as "has more
|
options than GNU ls". While it has some advantages, there are usually |
|
easier (and less verbose) ways to connect a UML instance to something. |
|
For example, most devices which support L2TPv3 also support GRE. |
|
|
|
Example:: |
|
|
|
vec0:transport=l2tpv3,udp=1,src=$src_host,dst=$dst_host,srcport=$src_port,dstport=$dst_port,depth=128,rx_session=0xffffffff,tx_session=0xffff |
|
|
|
This will configure an Ethernet over L2TPv3 fixed tunnel which will |
|
connect the UML instance to a L2TPv3 endpoint at host $dst_host using |
|
the L2TPv3 UDP flavour and UDP destination port $dst_port. |
|
|
|
L2TPv3 always requires the following additional options: |
|
|
|
* ``rx_session=int`` - l2tpv3 32 bit integer session for rx packets |
|
|
|
* ``tx_session=int`` - l2tpv3 32 bit integer session for tx packets |
|
|
|
As the tunnel is fixed these are not negotiated and they are |
|
preconfigured on both ends. |
|
|
|
Additionally, L2TPv3 supports the following optional parameters |
|
|
|
* ``rx_cookie=int`` - l2tpv3 32 bit integer cookie for rx packets - same |
|
functionality as GRE key, more to prevent misconfiguration than provide |
|
actual security |
|
|
|
* ``tx_cookie=int`` - l2tpv3 32 bit integer cookie for tx packets |
|
|
|
* ``cookie64=[0,1]`` - use 64 bit cookies instead of 32 bit. |
|
|
|
* ``counter=[0,1]`` - enable l2tpv3 counter |
|
|
|
* ``pin_counter=[0,1]`` - pretend that the counter is always reset on |
|
each packet (needed to interoperate with some really broken |
|
implementations) |
|
|
|
* ``v6=[0,1]`` - force v6 sockets |
|
|
|
* ``udp=[0,1]`` - use raw sockets (0) or UDP (1) version of the protocol |
|
|
|
L2TPv3 has a number of caveats: |
|
|
|
* you can use only one connection per ip address in raw mode. There is |
|
no way to multiplex connections as each L2TPv3 tunnel is terminated |
|
directly on the UML instance. UDP mode can use different ports for |
|
this purpose. |
|
|
|
Here is an example of how to configure a linux host to connect to UML |
|
via L2TPv3: |
|
|
|
**/etc/network/interfaces**:: |
|
|
|
auto l2tp1 |
|
iface l2tp1 inet static |
|
address 192.168.126.1 |
|
netmask 255.255.255.0 |
|
broadcast 192.168.126.255 |
|
mtu 1500 |
|
pre-up ip l2tp add tunnel remote 127.0.0.1 \ |
|
local 127.0.0.1 encap udp tunnel_id 2 \ |
|
peer_tunnel_id 2 udp_sport 1706 udp_dport 1707 && \ |
|
ip l2tp add session name l2tp1 tunnel_id 2 \ |
|
session_id 0xffffffff peer_session_id 0xffffffff |
|
down ip l2tp del session tunnel_id 2 session_id 0xffffffff && \ |
|
ip l2tp del tunnel tunnel_id 2 |
|
|
|
|
|
Privileges required: L2TPv3 requires ``CAP_NET_RAW`` for raw IP mode and |
|
no special privileges for the UDP mode. |
|
|
|
BESS socket transport |
|
--------------------- |
|
|
|
BESS is a high performance modular network switch. |
|
|
|
https://github.com/NetSys/bess |
|
|
|
It has support for a simple sequential packet socket mode which in the |
|
more recent versions is using vector IO for high performance. |
|
|
|
Example:: |
|
|
|
vecX:transport=bess,src=$unix_src,dst=$unix_dst |
|
|
|
This will configure a BESS transport using the unix_src Unix domain |
|
socket address as source and unix_dst socket address as destination. |
|
|
|
For BESS configuration and how to allocate a BESS Unix domain socket port |
|
please see the BESS documentation. |
|
|
|
https://github.com/NetSys/bess/wiki/Built-In-Modules-and-Ports |
|
|
|
BESS transport does not require any special privileges. |
|
|
|
Configuring Legacy transports |
|
============================= |
|
|
|
Legacy transports are now considered obsolete. Please use the vector |
|
versions. |
|
|
|
*********** |
|
Running UML |
|
*********** |
|
|
|
This section assumes that either the user-mode-linux package from the |
|
distribution or a custom built kernel has been installed on the host. |
|
|
|
These add an executable called linux to the system. This is the UML |
|
kernel. It can be run just like any other executable. |
|
It will take most normal linux kernel arguments as command line |
|
arguments. Additionally, it will need some UML specific arguments |
|
in order to do something useful. |
|
|
|
Arguments |
|
========= |
|
|
|
Mandatory Arguments: |
|
-------------------- |
|
|
|
* ``mem=int[K,M,G]`` - amount of memory. By default bytes. It will |
|
also accept K, M or G qualifiers. |
|
|
|
* ``ubdX[s,d,c,t]=`` virtual disk specification. This is not really
  mandatory, but it is likely to be needed in nearly all cases so we can
  specify a root file system.
  The simplest possible image specification is the name of the image
  file for the filesystem (created using one of the methods described
  in `Creating an image`_).

  * UBD devices support copy on write (COW). The changes are kept in
    a separate file which can be discarded allowing a rollback to the
    original pristine image. If COW is desired, the UBD image is
    specified as: ``cow_file,master_image``.
    Example: ``ubd0=Filesystem.cow,Filesystem.img``

  * UBD devices can be set to use synchronous IO. Any writes are
    immediately flushed to disk. This is done by adding ``s`` after
    the ``ubdX`` specification.

  * UBD performs some heuristics on devices specified as a single
    filename to make sure that a COW file has not been specified as
    the image. To turn them off, use the ``d`` flag after ``ubdX``.

  * UBD supports TRIM - asking the Host OS to reclaim any unused
    blocks in the image. To turn it off, specify the ``t`` flag after
    ``ubdX``.
|
|
|
* ``root=`` root device - most likely ``/dev/ubd0`` (this is a Linux |
|
filesystem image) |
|
|
|
Important Optional Arguments |
|
---------------------------- |
|
|
|
If UML is run as "linux" with no extra arguments, it will try to start an |
|
xterm for every console configured inside the image (up to 6 in most |
|
linux distributions). Each console is started inside an |
|
xterm. This makes it nice and easy to use UML on a host with a GUI. It is, |
|
however, the wrong approach if UML is to be used as a testing harness or run |
|
in a text-only environment. |
|
|
|
In order to change this behaviour we need to specify an alternative console |
|
and wire it to one of the supported "line" channels. For this we need to map a |
|
console to use something different from the default xterm. |
|
|
|
Example which will divert console number 1 to stdin/stdout:: |
|
|
|
con1=fd:0,fd:1 |
|
|
|
UML supports a wide variety of serial line channels which are specified using
the following syntax::

   conX=channel_type:options[,channel_type:options]
|
|
|
|
|
If the channel specification contains two parts separated by comma, the first |
|
one is input, the second one output. |
|
|
|
* The null channel - Discard all input or output. Example ``con=null`` will set |
|
all consoles to null by default. |
|
|
|
* The fd channel - use file descriptor numbers for input/output. Example:
  ``con1=fd:0,fd:1``.

* The port channel - listen on TCP port number. Example: ``con1=port:4321``

* The pty and pts channels - use system pty/pts.

* The tty channel - bind to an existing system tty. Example: ``con1=tty:/dev/tty8``
  will make UML use the host 8th console (usually unused).

* The xterm channel - this is the default - bring up an xterm on this channel
  and direct IO to it. Note that in order for xterm to work, the host must
  have the UML distribution package installed. This usually contains the
  port-helper and other utilities needed for UML to communicate with the xterm.
  Alternatively, these need to be compiled and installed from source.

All options applicable to consoles also apply to UML serial lines which are
presented as ttyS inside UML.
|
|
|
Starting UML |
|
============ |
|
|
|
We can now run UML. |
|
:: |
|
|
|
# linux mem=2048M umid=TEST \ |
|
ubd0=Filesystem.img \ |
|
vec0:transport=tap,ifname=tap0,depth=128,gro=1 \ |
|
root=/dev/ubda con=null con0=null,fd:2 con1=fd:0,fd:1 |
|
|
|
This will run an instance with ``2048M RAM`` and try to use the image file
called ``Filesystem.img`` as root. It will connect to the host using tap0.
All consoles except ``con1`` will be disabled and console 1 will
use standard input/output, making it appear in the same terminal it was
started from.
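For repeated runs, the command line can be kept in a small wrapper script.
The sketch below only prints the command it would run; the variable names and
the ``uml_cmdline`` helper are illustrative assumptions, and the values are
the ones from the example above.

```shell
#!/bin/sh
# Sketch: assemble the UML command line from variables and print it.
# The variable names are illustrative; replace "echo" with "exec" to
# actually start the instance.

UML_BIN=linux
MEM=2048M
UMID=TEST
IMAGE=Filesystem.img
NET="vec0:transport=tap,ifname=tap0,depth=128,gro=1"

uml_cmdline() {
    echo "$UML_BIN" "mem=$MEM" "umid=$UMID" "ubd0=$IMAGE" "$NET" \
        root=/dev/ubda con=null con0=null,fd:2 con1=fd:0,fd:1
}

uml_cmdline
```

Printing first makes it easy to review the arguments before an instance is
started for real.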
|
|
|
Logging in |
|
============ |
|
|
|
If you have not set up a password when generating the image, you will have to
shut down the UML instance, mount the image, chroot into it and set it - as
described in `Creating an image`_. If the password is already set,
you can just log in.
|
|
|
The UML Management Console |
|
============================ |
|
|
|
In addition to managing the image from "the inside" using normal sysadmin tools, |
|
it is possible to perform a number of low level operations using the UML |
|
management console. The UML management console is a low-level interface to the |
|
kernel on a running UML instance, somewhat like the i386 SysRq interface. Since |
|
there is a full-blown operating system under UML, there is much greater |
|
flexibility possible than with the SysRq mechanism. |
|
|
|
There are a number of things you can do with the mconsole interface: |
|
|
|
* get the kernel version |
|
* add and remove devices |
|
* halt or reboot the machine |
|
* Send SysRq commands |
|
* Pause and resume the UML |
|
* Inspect processes running inside UML |
|
* Inspect UML internal /proc state |
|
|
|
You need the mconsole client (uml\_mconsole) which is a part of the UML
tools package available in most Linux distributions.
|
|
|
You also need ``CONFIG_MCONSOLE`` (under 'General Setup') enabled in the UML |
|
kernel. When you boot UML, you'll see a line like:: |
|
|
|
mconsole initialized on /home/jdike/.uml/umlNJ32yL/mconsole |
|
|
|
If you specify a unique machine id on the UML command line, i.e.
``umid=debian``, you'll see this::
|
|
|
mconsole initialized on /home/jdike/.uml/debian/mconsole |
|
|
|
|
|
That file is the socket that uml_mconsole will use to communicate with
UML. Run it with either the umid or the full path as its argument::

   # uml_mconsole debian

or::

   # uml_mconsole /home/jdike/.uml/debian/mconsole
|
|
|
|
|
You'll get a prompt, at which you can run one of these commands: |
|
|
|
* version
* help
* halt
* reboot
* config
* remove
* sysrq
* cad
* stop
* go
* proc
* stack
|
|
|
version |
|
------- |
|
|
|
This command takes no arguments. It prints the UML version:: |
|
|
|
(mconsole) version |
|
OK Linux OpenWrt 4.14.106 #0 Tue Mar 19 08:19:41 2019 x86_64 |
|
|
|
|
|
There are a couple of actual uses for this. It's a simple no-op which
can be used to check that a UML is running. It's also a way of
sending a device interrupt to the UML. UML mconsole is treated internally as
a UML device.
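The "is it running" check can be wrapped in a polling loop, for example in a
test harness. This is a sketch under stated assumptions: it assumes the
mconsole client accepts the command as trailing arguments and exits non-zero
when the instance does not answer, which should be verified for the installed
version. ``wait_for_uml`` and the ``MCONSOLE`` variable are hypothetical
names.

```shell
#!/bin/sh
# Sketch: poll a UML instance with the mconsole "version" command until
# it answers. Assumes the mconsole client accepts the command as trailing
# arguments and exits non-zero on failure; verify this for your version.

MCONSOLE=${MCONSOLE:-uml_mconsole}

wait_for_uml() {
    # $1 = umid or socket path, $2 = number of attempts, $3 = delay (s)
    tries=$2
    while [ "$tries" -gt 0 ]; do
        if "$MCONSOLE" "$1" version >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$3"
    done
    return 1
}
```

A call such as ``wait_for_uml debian 30 1 && echo "UML is up"`` would try up
to 30 times, one second apart.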
|
|
|
help |
|
---- |
|
|
|
This command takes no arguments. It prints a short help screen with the |
|
supported mconsole commands. |
|
|
|
|
|
halt and reboot |
|
--------------- |
|
|
|
These commands take no arguments. They shut the machine down immediately, with |
|
no syncing of disks and no clean shutdown of userspace. So, they are |
|
pretty close to crashing the machine:: |
|
|
|
(mconsole) halt |
|
OK |
|
|
|
config |
|
------ |
|
|
|
"config" adds a new device to the virtual machine. This is supported |
|
by most UML device drivers. It takes one argument, which is the |
|
device to add, with the same syntax as the kernel command line:: |
|
|
|
(mconsole) config ubd3=/home/jdike/incoming/roots/root_fs_debian22 |
|
|
|
remove |
|
------ |
|
|
|
"remove" deletes a device from the system. Its argument is just the |
|
name of the device to be removed. The device must be idle in whatever |
|
sense the driver considers necessary. In the case of the ubd driver, |
|
the removed block device must not be mounted, swapped on, or otherwise |
|
open, and in the case of the network driver, the device must be down:: |
|
|
|
(mconsole) remove ubd3 |
|
|
|
sysrq |
|
----- |
|
|
|
This command takes one argument, which is a single letter. It calls the |
|
generic kernel's SysRq driver, which does whatever is called for by |
|
that argument. See the SysRq documentation in |
|
Documentation/admin-guide/sysrq.rst in your favorite kernel tree to |
|
see what letters are valid and what they do. |
|
|
|
cad |
|
--- |
|
|
|
This invokes the ``Ctrl-Alt-Del`` action in the running image. What exactly
|
this ends up doing is up to init, systemd, etc. Normally, it reboots the |
|
machine. |
|
|
|
stop
----

This puts the UML in a loop reading mconsole requests until a 'go'
mconsole command is received. This is very useful as a
debugging/snapshotting tool.

go
--

This resumes a UML after being paused by a 'stop' command. Note that
when the UML has resumed, TCP connections may have timed out and if
the UML is paused for a long period of time, crond might go a little
crazy, running all the jobs it didn't do earlier.

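The ``stop`` and ``go`` pair can also be scripted from the host. The
following is a minimal sketch, assuming that the ``uml_mconsole`` client
from the UML utilities accepts a command after the instance name. The
helper name ``uml_paused`` and the instance id ``debian`` are made up for
illustration. It pauses the instance, runs a host command, and resumes
the instance even if that command fails:

.. code-block:: shell

   #!/bin/sh
   # Assumption: MCONSOLE can be invoked as "<client> <umid> <command>".
   MCONSOLE=${MCONSOLE:-uml_mconsole}

   # uml_paused UMID COMMAND [ARGS...]
   # Hypothetical helper: stop the UML, run COMMAND on the host, then resume.
   uml_paused() {
       umid=$1
       shift
       "$MCONSOLE" "$umid" stop || return 1
       "$@"
       rc=$?
       # Always try to resume, so a failed command does not leave the UML paused.
       "$MCONSOLE" "$umid" go || return 1
       return "$rc"
   }

   # Example (not run here): uml_paused debian cp root_fs_cow /backup/

The exit status of the host command is passed through, so a calling
script can tell whether the work done during the pause succeeded.
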
proc
----

This takes one argument - the name of a file in /proc which is printed
to the mconsole standard output.

stack
-----

This takes one argument - the pid number of a process. Its stack is
printed to standard output.

*******************
Advanced UML Topics
*******************

Sharing Filesystems between Virtual Machines
============================================

Don't attempt to share filesystems simply by booting two UMLs from the
same file. That's the same thing as booting two physical machines
from a shared disk. It will result in filesystem corruption.

Using layered block devices
---------------------------

The way to share a filesystem between two virtual machines is to use
the copy-on-write (COW) layering capability of the ubd block driver.
Any changed blocks are stored in the private COW file, while reads come
from either device - the private one if the requested block is valid in
it, the shared one if not. Using this scheme, the majority of data
which is unchanged is shared between an arbitrary number of virtual
machines, each of which has a much smaller file containing the changes
that it has made. With a large number of UMLs booting from a large root
filesystem, this leads to a huge disk space saving.

Sharing file system data will also help performance, since the host will
be able to cache the shared data using a much smaller amount of memory,
so UML disk requests will be served from the host's memory rather than
its disks. There is a major caveat in doing this on multisocket NUMA
machines. On such hardware, running many UML instances with a shared
master image and COW changes may cause issues like NMIs from excessive
inter-socket traffic.

If you are running UML on high end hardware like this, make sure to
bind UML to a set of logical CPUs residing on the same socket using the
``taskset`` command or have a look at the "tuning" section.

To add a copy-on-write layer to an existing block device file, simply
add the name of the COW file to the appropriate ubd switch::

   ubd0=root_fs_cow,root_fs_debian_22

where ``root_fs_cow`` is the private COW file and ``root_fs_debian_22`` is
the existing shared filesystem. The COW file need not exist. If it
doesn't, the driver will create and initialize it.

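As an illustration, the command lines for several instances sharing one
master image could be generated as follows. This is only a sketch: the
helper name ``uml_cow_cmdline``, the ``umid`` and COW file naming scheme
and the 256M memory size are arbitrary choices, and the commands are
printed rather than executed:

.. code-block:: shell

   #!/bin/sh
   # uml_cow_cmdline BACKING_FILE INDEX
   # Print a UML command line that gives instance INDEX its own private
   # COW file layered over the shared BACKING_FILE.
   uml_cow_cmdline() {
       backing=$1
       index=$2
       echo "linux mem=256M umid=uml$index ubd0=cow_$index,$backing"
   }

   for i in 1 2 3; do
       uml_cow_cmdline root_fs_debian_22 "$i"
   done

Each instance writes only to its own ``cow_N`` file, so the master image
is never modified and stays valid for all of them.
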
Disk Usage
----------

UML has TRIM support which will release any unused space in its disk
image files to the underlying OS. The image files are therefore sparse,
so use either ``ls -ls`` or ``du`` to see the space actually allocated
rather than the apparent file size.

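The difference between apparent and allocated size can be seen with any
sparse file. The sketch below does not involve UML at all. It assumes
GNU coreutils (``truncate`` and ``du --apparent-size``) and creates a
throwaway 100 MiB sparse file to compare the two figures:

.. code-block:: shell

   #!/bin/sh
   # Create a sparse 100 MiB file; no data blocks are written.
   img=$(mktemp)
   truncate -s 100M "$img"

   # Size as reported by "ls -l", in KiB.
   apparent_kb=$(du -k --apparent-size "$img" | cut -f1)
   # Space actually allocated on the host filesystem, in KiB.
   allocated_kb=$(du -k "$img" | cut -f1)

   echo "apparent: ${apparent_kb} KiB, allocated: ${allocated_kb} KiB"
   # The first column of "ls -ls" also shows the allocated size.
   ls -ls "$img"

On a filesystem that supports sparse files, the allocated figure stays
close to zero until data is written into the image.
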
COW validity
------------

Any changes to the master image will invalidate all COW files. If this
happens, UML will *NOT* automatically delete any of the COW files and
will refuse to boot. In this case the only solution is to either
restore the old image (including its last modified timestamp) or remove
all COW files, which will result in their recreation. In the latter case,
any changes stored in the COW files will be lost.

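Since a restored image needs its original last modified timestamp, any
backup of the master image should preserve that timestamp. A minimal
sketch, assuming GNU ``cp`` and ``stat``; the helper name and the file
names are hypothetical:

.. code-block:: shell

   #!/bin/sh
   # backup_image SRC DST
   # Copy SRC to DST keeping the modification time, then verify that the
   # two modification times match.
   backup_image() {
       cp --preserve=timestamps "$1" "$2" || return 1
       [ "$(stat -c %Y "$1")" = "$(stat -c %Y "$2")" ]
   }

   # Example (not run here):
   #   backup_image root_fs_debian_22 root_fs_debian_22.bak

Restoring should preserve the timestamp in the same way, for example by
using the same helper with the arguments swapped.
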
Cows can moo - uml_moo : Merging a COW file with its backing file
-----------------------------------------------------------------

Depending on how you use UML and COW devices, it may be advisable to
merge the changes in the COW file into the backing file every once in
a while.

The utility that does this is uml_moo. Its usage is::

   uml_moo COW_file new_backing_file

There's no need to specify the backing file since that information is
already in the COW file header. If you're paranoid, boot the new
merged file, and if you're happy with it, move it over the old backing
file.

``uml_moo`` creates a new backing file by default as a safety measure.
It also has a destructive merge option (``-d``) which will merge the COW
file directly into its current backing file. This is really only usable
when the backing file only has one COW file associated with it. If
there are multiple COWs associated with a backing file, a ``-d`` merge of
one of them will invalidate all of the others. However, it is
convenient if you're short of disk space, and it should also be
noticeably faster than a non-destructive merge.

``uml_moo`` is installed with the UML distribution packages and is
available as a part of UML utilities.

Host file access
================

If you want to access files on the host machine from inside UML, you
can treat it as a separate machine and either NFS-mount directories
from the host or copy files into the virtual machine with scp.
However, since UML is running on the host, it can access those
files just like any other process and make them available inside the
virtual machine without the need to use the network.
This is possible with the hostfs virtual filesystem. With it, you
can mount a host directory into the UML filesystem and access the
files contained in it just as you would on the host.

*SECURITY WARNING*

Hostfs without any parameters to the UML image will allow the image
to mount any part of the host filesystem and write to it. Always
confine hostfs to a specific "harmless" directory (for example ``/var/tmp``)
if running UML. This is especially important if UML is being run as root.

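One way to apply this, sketched under the assumption that the UML kernel
accepts a ``hostfs=<root dir>`` option on its command line which confines
all hostfs mounts to that directory tree. The directory
``/var/tmp/uml-share`` and the other options shown are placeholders:

.. code-block:: shell

   # Assumption: hostfs=<root dir> limits every hostfs mount made inside
   # this instance to the given host directory tree.
   linux mem=256M ubd0=root_fs hostfs=/var/tmp/uml-share
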
Using hostfs
------------

To begin with, make sure that hostfs is available inside the virtual
machine with::

   # cat /proc/filesystems

``hostfs`` should be listed. If it's not, either rebuild the kernel
with hostfs configured into it or make sure that hostfs is built as a
module and available inside the virtual machine, and insmod it.

Now all you need to do is run mount::

   # mount none /mnt/host -t hostfs

This will mount the host's ``/`` on the virtual machine's ``/mnt/host``.
If you don't want to mount the host root directory, then you can
specify a subdirectory to mount with the ``-o`` switch to mount::

   # mount none /mnt/home -t hostfs -o /home

This will mount the host's ``/home`` on the virtual machine's ``/mnt/home``.

hostfs as the root filesystem
-----------------------------

It's possible to boot from a directory hierarchy on the host using
hostfs rather than using the standard filesystem in a file.
To start, you need that hierarchy. The easiest way is to loop mount
an existing root_fs file::

   # mount root_fs uml_root_dir -o loop

You need to change the filesystem type of ``/`` in ``etc/fstab`` to be
'hostfs', so that line looks like this::

   /dev/ubd/0 / hostfs defaults 1 1

Then you need to chown to yourself all the files in that directory
that are owned by root. This worked for me::

   # find . -uid 0 -exec chown jdike {} \;

Next, make sure that your UML kernel has hostfs compiled in, not as a
module. Then run UML with the boot device pointing at that directory::

   ubd0=/path/to/uml/root/directory

UML should then boot as it does normally.

Hostfs Caveats
--------------

Hostfs does not track changes made to the host filesystem from outside
UML. As a result, if a file is changed on the host without UML's
knowledge, UML will not know about it and its own in-memory cache of
the file may be corrupt. While it is possible to fix this, it is not
something which is being worked on at present.

Tuning UML
==========

UML at present is strictly uniprocessor. It will, however, spin up a
number of threads to handle various functions.

The UBD driver, SIGIO and the MMU emulation do that. If the system is
idle, these threads will be migrated to other processors on an SMP host.
This, unfortunately, will usually result in LOWER performance because of
all of the cache/memory synchronization traffic between cores. As a
result, UML will usually benefit from being pinned on a single CPU,
especially on a large system. This can result in performance differences
of 5 times or higher on some benchmarks.

Similarly, on large multi-node NUMA systems UML will benefit if all of
its memory is allocated from the same NUMA node it will run on. The
OS will *NOT* do that by default. In order to do that, the sysadmin
needs to create a suitable tmpfs ramdisk bound to a particular node
and use that as the source for UML RAM allocation by specifying it
in one of the environment variables UML consults. UML will look at the
values of ``TMPDIR``, ``TMP`` or ``TEMP`` for that. If that fails, it will
look for shmfs mounted under ``/dev/shm``. If everything else fails, it
will use ``/tmp/`` regardless of the filesystem type used for it::

   mount -t tmpfs -ompol=bind:X none /mnt/tmpfs-nodeX
   TEMP=/mnt/tmpfs-nodeX taskset -cX linux options options options..

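The value of ``X`` for ``taskset`` has to be a CPU that belongs to the
chosen node. As a sketch, one could read it from the host's sysfs node
topology. The helper names, the ``mem=256M`` and ``ubd0=root_fs`` options
and the choice of the first CPU of the node are illustrative assumptions,
and the launch command is printed rather than executed:

.. code-block:: shell

   #!/bin/sh
   # node_first_cpu NODE [SYSFS_DIR]
   # Print the first logical CPU listed for NUMA node NODE.
   node_first_cpu() {
       dir=${2:-/sys/devices/system/node}
       # cpulist looks like "8-15,24-31"; keep only the leading number.
       sed -e 's/[-,].*//' "$dir/node$1/cpulist"
   }

   # uml_pinned_cmdline NODE CPU
   # Print the launch command for a UML pinned to CPU, with its RAM
   # taken from the tmpfs bound to NODE.
   uml_pinned_cmdline() {
       echo "TEMP=/mnt/tmpfs-node$1 taskset -c $2 linux mem=256M ubd0=root_fs"
   }

   # Example (not run here): uml_pinned_cmdline 1 "$(node_first_cpu 1)"
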
*******************************************
Contributing to UML and Developing with UML
*******************************************

UML is an excellent platform to develop new Linux kernel concepts -
filesystems, devices, virtualization, etc. It provides unrivalled
opportunities to create and test them without being constrained to
emulating specific hardware.

Example - want to try how Linux will work with 4096 "proper" network
devices?

Not an issue with UML. At the same time, this is something which
is difficult with other virtualization packages - they are
constrained by the number of devices allowed on the hardware bus
they are trying to emulate (for example 16 on a PCI bus in qemu).

If you have something to contribute such as a patch, a bugfix, a
new feature, please send it to ``[email protected]``.

Please follow all standard Linux patch guidelines such as cc-ing
relevant maintainers and running ``./scripts/checkpatch.pl`` on your patch.
For more details see ``Documentation/process/submitting-patches.rst``.

Note - the list does not accept HTML or attachments, all emails must
be formatted as plain text.

Developing always goes hand in hand with debugging. First of all,
you can always run UML under gdb; a later section describes how to do
that. That, however, is not the only way to
debug a Linux kernel. Quite often adding tracing statements and/or
using UML specific approaches such as ptracing the UML kernel process
are significantly more informative.

Tracing UML
===========

When running, UML consists of a main kernel thread and a number of
helper threads. The ones of interest for tracing are NOT the ones
that are already ptraced by UML as a part of its MMU emulation.

These are usually the first three threads visible in a ps display.
The one with the lowest PID number and using most CPU is usually the
kernel thread. The other threads are the disk
(ubd) device helper thread and the sigio helper thread.
Running ``strace`` on the kernel thread usually results in the following
picture::

   host$ strace -p 16566
   --- SIGIO {si_signo=SIGIO, si_code=POLL_IN, si_band=65} ---
   epoll_wait(4, [{EPOLLIN, {u32=3721159424, u64=3721159424}}], 64, 0) = 1
   epoll_wait(4, [], 64, 0) = 0
   rt_sigreturn({mask=[PIPE]}) = 16967
   ptrace(PTRACE_GETREGS, 16967, NULL, 0xd5f34f38) = 0
   ptrace(PTRACE_GETREGSET, 16967, NT_X86_XSTATE, [{iov_base=0xd5f35010, iov_len=832}]) = 0
   ptrace(PTRACE_GETSIGINFO, 16967, NULL, {si_signo=SIGTRAP, si_code=0x85, si_pid=16967, si_uid=0}) = 0
   ptrace(PTRACE_SETREGS, 16967, NULL, 0xd5f34f38) = 0
   ptrace(PTRACE_SETREGSET, 16967, NT_X86_XSTATE, [{iov_base=0xd5f35010, iov_len=2696}]) = 0
   ptrace(PTRACE_SYSEMU, 16967, NULL, 0) = 0
   --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_TRAPPED, si_pid=16967, si_uid=0, si_status=SIGTRAP, si_utime=65, si_stime=89} ---
   wait4(16967, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP | 0x80}], WSTOPPED|__WALL, NULL) = 16967
   ptrace(PTRACE_GETREGS, 16967, NULL, 0xd5f34f38) = 0
   ptrace(PTRACE_GETREGSET, 16967, NT_X86_XSTATE, [{iov_base=0xd5f35010, iov_len=832}]) = 0
   ptrace(PTRACE_GETSIGINFO, 16967, NULL, {si_signo=SIGTRAP, si_code=0x85, si_pid=16967, si_uid=0}) = 0
   timer_settime(0, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=2830912}}, NULL) = 0
   getpid() = 16566
   clock_nanosleep(CLOCK_MONOTONIC, 0, {tv_sec=1, tv_nsec=0}, NULL) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
   --- SIGALRM {si_signo=SIGALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=1631716592, ptr=0x614204f0}} ---
   rt_sigreturn({mask=[PIPE]}) = -1 EINTR (Interrupted system call)

This is a typical picture from a mostly idle UML instance.

* UML interrupt controller uses epoll - this is UML waiting for IO
  interrupts:

  ``epoll_wait(4, [{EPOLLIN, {u32=3721159424, u64=3721159424}}], 64, 0) = 1``

* The sequence of ptrace calls is part of MMU emulation and running the
  UML userspace.
* ``timer_settime`` is part of the UML high res timer subsystem mapping
  timer requests from inside UML onto the host high resolution timers.
* ``clock_nanosleep`` is UML going into idle (similar to the way a PC
  will execute an ACPI idle).

As you can see, UML will generate quite a bit of output even when idle. The
output can be very informative when observing IO. It shows the actual IO
calls, their arguments and return values.

Kernel debugging
================

You can run UML under gdb now, though it will not necessarily agree to
be started under it. If you are trying to track a runtime bug, it is
much better to attach gdb to a running UML instance and let UML run.

Assuming the same PID number as in the previous example, this would be::

   # gdb -p 16566

This will STOP the UML instance, so you must enter ``cont`` at the GDB
command line to request it to continue. It may be a good idea to make
this into a gdb script and pass it to gdb as an argument.

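Such a script could be generated as in the sketch below. The helper name
and the file name ``uml.gdb`` are arbitrary. The two ``handle`` lines are
an assumption rather than something this document prescribes: they tell
gdb to pass ``SIGSEGV`` and ``SIGUSR1`` to UML without stopping, on the
premise that UML uses these signals internally:

.. code-block:: shell

   #!/bin/sh
   # write_gdb_script FILE
   # Write a gdb command file that resumes the UML right after attaching.
   write_gdb_script() {
       cat > "$1" <<'EOF'
   # Assumption: UML uses these signals internally, so do not stop on them.
   handle SIGSEGV pass nostop noprint
   handle SIGUSR1 pass nostop noprint
   cont
   EOF
   }

   # Example (not run here):
   #   write_gdb_script uml.gdb
   #   gdb -p 16566 -x uml.gdb

The ``-x`` option makes gdb execute the commands from the file after
attaching, so the instance resumes without manual input.
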
Developing Device Drivers
=========================

Nearly all UML drivers are monolithic. While it is possible to build a
UML driver as a kernel module, that limits the possible functionality
to in-kernel only and non-UML specific. The reason for this is that
in order to really leverage UML, one needs to write a piece of
userspace code which maps driver concepts onto actual userspace host
calls.

This forms the so-called "user" portion of the driver. While it can
reuse a lot of kernel concepts, it is generally just another piece of
userspace code. This portion needs some matching "kernel" code which
resides inside the UML image and which implements the Linux kernel part.

*Note: There are very few limitations in the way "kernel" and "user" interact*.

UML does not have a strictly defined kernel to host API. It does not
try to emulate a specific architecture or bus. UML's "kernel" and
"user" can share memory, code and interact as needed to implement
whatever design the software developer has in mind. The only
limitations are purely technical. Due to a lot of functions and
variables having the same names, the developer should be careful
which includes and libraries they are trying to refer to.

As a result a lot of userspace code consists of simple wrappers.
For example, ``os_close_file()`` is just a wrapper around ``close()``
which ensures that the userspace function ``close()`` does not clash
with similarly named function(s) in the kernel part.

Security Considerations
-----------------------

Drivers or any new functionality should default to not
accepting arbitrary filenames, BPF code or other parameters
which can affect the host from inside the UML instance.
For example, specifying the socket used for IPC communication
between a driver and the host at the UML command line is OK
security-wise. Allowing it as a loadable module parameter
isn't.

If such functionality is desirable for a particular application
(e.g. loading BPF "firmware" for raw socket network transports),
it should be off by default and should be explicitly turned on
as a command line parameter at startup.

Even with this in mind, the level of isolation between UML
and the host is relatively weak. If the UML userspace is
allowed to load arbitrary kernel drivers, an attacker can
use this to break out of UML. Thus, if UML is used in
a production application, it is recommended that all modules
are loaded at boot and kernel module loading is disabled
afterwards.

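One possible way to follow this recommendation from a boot script inside
the UML instance is sketched below. It assumes that the UML kernel is
built with module support and exposes the standard
``kernel.modules_disabled`` sysctl. The helper name and the module names
in the example are hypothetical:

.. code-block:: shell

   #!/bin/sh
   # Assumption: this sysctl file exists in the UML instance.
   MODULES_DISABLED=${MODULES_DISABLED:-/proc/sys/kernel/modules_disabled}

   # lock_modules MODULE...
   # Load the listed modules, then forbid any further module loading.
   lock_modules() {
       for module in "$@"; do
           modprobe "$module" || return 1
       done
       echo 1 > "$MODULES_DISABLED"
   }

   # Example (not run here), as root inside the UML: lock_modules loop tun

Once the sysctl has been set to 1 it cannot be cleared again without a
reboot, so the call belongs at the very end of the boot sequence.
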