================================== |
|
vfio-ccw: the basic infrastructure |
|
================================== |
|
|
|
Introduction |
|
------------ |
|
|
|
Here we describe the vfio support for I/O subchannel devices for
Linux/s390. The motivation for vfio-ccw is to pass subchannels through
to a virtual machine, with vfio as the means.
|
|
|
Unlike other hardware architectures, s390 defines a unified I/O access
method, the so-called Channel I/O. It has its own access patterns:

- Channel programs run asynchronously on a separate (co)processor.
- The channel subsystem will access any memory designated by the caller
  in the channel program directly, i.e. there is no iommu involved.
|
|
|
Thus when we introduce vfio support for these devices, we realize it
with a mediated device (mdev) implementation. The vfio mdev is added to
an iommu group so that it can be managed by the vfio framework. We also
add read/write callbacks for special vfio I/O regions, which pass the
channel programs from the mdev to its parent device (the real I/O
subchannel device) for further address translation and for performing
the I/O instructions.
|
|
|
This document does not intend to explain the s390 I/O architecture in
every detail. More information and references can be found here:

- A good starting point for Channel I/O in general:
  https://en.wikipedia.org/wiki/Channel_I/O
- s390 architecture:
  s390 Principles of Operation manual (IBM Form. No. SA22-7832)
- The existing QEMU code, which implements a simple emulated channel
  subsystem, is also a good reference and makes it easier to follow
  the flow:
  qemu/hw/s390x/css.c
|
|
|
For vfio mediated device framework: |
|
- Documentation/driver-api/vfio-mediated-device.rst |
|
|
|
Motivation of vfio-ccw |
|
---------------------- |
|
|
|
Typically, a guest virtualized via QEMU/KVM on s390 only sees |
|
paravirtualized virtio devices via the "Virtio Over Channel I/O |
|
(virtio-ccw)" transport. This makes virtio devices discoverable via |
|
standard operating system algorithms for handling channel devices. |
|
|
|
However, this is not enough. On s390, the majority of devices use the
standard Channel I/O based mechanism, and we also need to be able to
pass them through to a QEMU virtual machine. This includes devices that
don't have a virtio counterpart (e.g. tape drives) or that have specific
characteristics which guests want to exploit.
|
|
|
For passing a device to a guest, we want to use the same interface as |
|
everybody else, namely vfio. We implement this vfio support for channel |
|
devices via the vfio mediated device framework and the subchannel device |
|
driver "vfio_ccw". |
|
|
|
Access patterns of CCW devices |
|
------------------------------ |
|
|
|
The s390 architecture implements a so-called channel subsystem, which
provides a unified view of the devices physically attached to the
system. Although the s390 hardware platform knows a huge variety of
peripheral attachments, such as disk devices (aka DASDs), tapes and
communication controllers, they can all be accessed by a well-defined
access method, and they all present I/O completion in a unified way:
I/O interruptions.
|
|
|
All I/O requires the use of channel command words (CCWs). A CCW is an
instruction to a specialized I/O channel processor. A channel program is
a sequence of CCWs which are executed by the I/O channel subsystem. To
issue a channel program to the channel subsystem, it is required to
build an operation request block (ORB), which specifies the format of
the CCWs and other control information. The operating system signals
the I/O channel subsystem to begin executing the channel program with an
SSCH (START SUBCHANNEL) instruction. The central processor is then free
to proceed with non-I/O instructions until interrupted. The I/O
completion result is received by the interrupt handler in the form of an
interrupt response block (IRB).
|
|
|
Back to vfio-ccw, in short: |
|
|
|
- ORBs and channel programs are built in the guest kernel (with guest
  physical addresses).
- ORBs and channel programs are passed to the host kernel.
- The host kernel translates the guest physical addresses to real
  addresses and starts the I/O by issuing a privileged Channel I/O
  instruction (e.g. SSCH).
- Channel programs run asynchronously on a separate processor.
- I/O completion is signaled to the host with I/O interruptions. The
  result is copied as an IRB to user space, which passes it back to the
  guest.
|
|
|
Physical vfio ccw device and its child mdev |
|
------------------------------------------- |
|
|
|
As mentioned above, we realize vfio-ccw with an mdev implementation.
|
|
|
Channel I/O does not have IOMMU hardware support, so the physical |
|
vfio-ccw device does not have an IOMMU level translation or isolation. |
|
|
|
Subchannel I/O instructions are all privileged instructions. When
handling the I/O instruction interception, vfio-ccw polices and
translates the channel program in software before it is sent to
hardware.
|
|
|
Within this implementation, we have two drivers for two types of |
|
devices: |
|
|
|
- The vfio_ccw driver for the physical subchannel device.
  This is an I/O subchannel driver for the real subchannel device. It
  implements a group of callbacks and registers with the mdev framework
  as a parent (physical) device. As a consequence, mdev provides vfio_ccw
  with a generic interface (sysfs) to create mdev devices. A vfio mdev
  can then be created by vfio_ccw and added to the mediated bus. It is
  this vfio device that is added to an IOMMU group and a vfio group.
  vfio_ccw also provides an I/O region to accept channel program
  requests from user space and to store the I/O interrupt result for
  user space to retrieve. To notify user space of an I/O completion, it
  offers an interface to set up an eventfd for asynchronous signaling.
|
|
|
- The vfio_mdev driver for the mediated vfio ccw device.
  This is provided by the mdev framework. It is a vfio device driver for
  the mdev that is created by vfio_ccw.
  It implements a group of vfio device driver callbacks, adds itself to a
  vfio group, and registers itself with the mdev framework as an mdev
  driver.
  It uses a vfio iommu backend that uses the existing map and unmap
  ioctls, but rather than programming them into an IOMMU for a device,
  it simply stores the translations for use by later requests. This
  means that for a device programmed in a VM with guest physical
  addresses, the vfio kernel code can convert such an address to a
  process virtual address, pin the page and program the hardware with
  the host physical address in one step.
  For an mdev, the vfio iommu backend will not pin the pages during the
  VFIO_IOMMU_MAP_DMA ioctl. The mdev framework only maintains a database
  of the iova<->vaddr mappings in this operation. The vfio iommu backend
  exports the vfio_pin_pages and vfio_unpin_pages interfaces for the
  physical devices to pin and unpin pages on demand.
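The deferred pinning described for the vfio iommu backend can be pictured with a toy model. The sketch below is an illustration only, not the kernel's implementation: `struct dma_map`, `struct map_db`, `map_dma()` and `iova_to_vaddr()` are hypothetical names, and the fixed-size table merely stands in for whatever data structure the real backend uses. Mapping only records an iova<->vaddr range; the translation happens later, on demand.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one iova<->vaddr mapping recorded at map time. */
struct dma_map {
    uint64_t iova;   /* guest physical (I/O virtual) start address */
    uint64_t vaddr;  /* process virtual start address */
    uint64_t size;   /* length of the range in bytes */
};

#define MAX_MAPS 16

struct map_db {
    struct dma_map maps[MAX_MAPS];
    size_t count;
};

/* Record a mapping without pinning anything. Returns 0, or -1 on error. */
int map_dma(struct map_db *db, uint64_t iova, uint64_t vaddr, uint64_t size)
{
    if (db->count == MAX_MAPS || size == 0)
        return -1;
    db->maps[db->count++] = (struct dma_map){ iova, vaddr, size };
    return 0;
}

/* Translate an iova on demand. Returns 0 and sets *vaddr, or -1 if unmapped. */
int iova_to_vaddr(const struct map_db *db, uint64_t iova, uint64_t *vaddr)
{
    for (size_t i = 0; i < db->count; i++) {
        const struct dma_map *m = &db->maps[i];

        if (iova >= m->iova && iova - m->iova < m->size) {
            *vaddr = m->vaddr + (iova - m->iova);
            return 0;
        }
    }
    return -1;
}
```

In this model, pinning the page behind the returned address would be a separate, later step, which corresponds to the on-demand pin and unpin interfaces mentioned above.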
|
|
|
Below is a high-level block diagram::

 +-------------+
 |             |
 | +---------+ | mdev_register_driver() +--------------+
 | |  Mdev   | +<-----------------------+              |
 | |  bus    | |                        | vfio_mdev.ko |
 | | driver  | +----------------------->+              |<-> VFIO user
 | +---------+ |    probe()/remove()    +--------------+    APIs
 |             |
 |  MDEV CORE  |
 |   MODULE    |
 |   mdev.ko   |
 | +---------+ | mdev_register_device() +--------------+
 | |Physical | +<-----------------------+              |
 | | device  | |                        |  vfio_ccw.ko |<-> subchannel
 | |interface| +----------------------->+              |     device
 | +---------+ |       callback         +--------------+
 +-------------+
|
|
|
These components work together as follows:

1. vfio_ccw.ko drives the physical I/O subchannel, and registers the
   physical device (with callbacks) with the mdev framework.
   When vfio_ccw probes the subchannel device, it registers the device
   pointer and callbacks with the mdev framework. Mdev-related file nodes
   under the device node in sysfs are then created for the subchannel
   device, namely 'mdev_create', 'mdev_destroy' and
   'mdev_supported_types'.
2. Create a mediated vfio ccw device.
   Using the 'mdev_create' sysfs file, we need to manually create one (and
   only one for our case) mediated device.
3. vfio_mdev.ko drives the mediated ccw device.
   vfio_mdev is also the vfio device driver. It will probe the mdev and
   add it to an iommu_group and a vfio_group. Then we can pass the mdev
   through to a guest.
|
|
|
|
|
VFIO-CCW Regions |
|
---------------- |
|
|
|
The vfio-ccw driver exposes MMIO regions to accept requests from and return |
|
results to userspace. |
|
|
|
vfio-ccw I/O region |
|
------------------- |
|
|
|
An I/O region is used to accept channel program requests from user
space and to store the I/O interrupt result for user space to retrieve.
The definition of the region is::
|
|
|
  struct ccw_io_region {
  #define ORB_AREA_SIZE 12
          __u8    orb_area[ORB_AREA_SIZE];
  #define SCSW_AREA_SIZE 12
          __u8    scsw_area[SCSW_AREA_SIZE];
  #define IRB_AREA_SIZE 96
          __u8    irb_area[IRB_AREA_SIZE];
          __u32   ret_code;
  } __packed;
|
|
|
This region is always available. |
|
|
|
When starting an I/O request, orb_area should be filled with the
guest ORB, and scsw_area should be filled with the SCSW of the Virtual
Subchannel.
|
|
|
irb_area stores the I/O result. |
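As a minimal sketch of how a user space program might prepare such a request, the block below mirrors the region layout with `<stdint.h>` types and fills the two input areas. The helper `prepare_io_request()` is a hypothetical name, the ORB and SCSW are treated as opaque 12-byte blobs, and clearing the result fields first is a design choice of this sketch rather than a documented requirement. `__attribute__((packed))` assumes a GCC- or Clang-compatible compiler.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ORB_AREA_SIZE  12
#define SCSW_AREA_SIZE 12
#define IRB_AREA_SIZE  96

/* Local mirror of the region layout shown above, using <stdint.h> types. */
struct ccw_io_region {
    uint8_t  orb_area[ORB_AREA_SIZE];
    uint8_t  scsw_area[SCSW_AREA_SIZE];
    uint8_t  irb_area[IRB_AREA_SIZE];
    uint32_t ret_code;
} __attribute__((packed));

/*
 * Hypothetical helper: fill a request from an opaque guest ORB and the
 * SCSW of the virtual subchannel. irb_area and ret_code are zeroed so
 * that stale results from an earlier request are not mistaken for new ones.
 */
void prepare_io_request(struct ccw_io_region *region,
                        const uint8_t orb[ORB_AREA_SIZE],
                        const uint8_t scsw[SCSW_AREA_SIZE])
{
    memset(region, 0, sizeof(*region));
    memcpy(region->orb_area, orb, ORB_AREA_SIZE);
    memcpy(region->scsw_area, scsw, SCSW_AREA_SIZE);
}
```

The prepared structure would then typically be written to the device file descriptor at the region offset reported by VFIO_DEVICE_GET_REGION_INFO, for example with `pwrite()`. With this layout the structure is 124 bytes long and ret_code sits at offset 120.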
|
|
|
ret_code stores a return code for each access of the region. The following |
|
values may occur: |
|
|
|
``0``
  The operation was successful.

``-EOPNOTSUPP``
  The orb specified transport mode or an unidentified IDAW format, or the
  scsw specified a function other than the start function.

``-EIO``
  A request was issued while the device was not in a state ready to accept
  requests, or an internal error occurred.

``-EBUSY``
  The subchannel was status pending or busy, or a request is already active.

``-EAGAIN``
  A request was being processed, and the caller should retry.

``-EACCES``
  The channel path(s) used for the I/O were found to be not operational.

``-ENODEV``
  The device was found to be not operational.

``-EINVAL``
  The orb specified a chain longer than 255 ccws, or an internal error
  occurred.
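The return codes listed above lend themselves to a small dispatch helper on the user space side. The following is a sketch under assumptions: `enum io_action`, `classify_io_ret_code()` and `describe_io_ret_code()` are hypothetical names, and the only policy taken from the text is that ``-EAGAIN`` asks the caller to retry. Since ret_code is declared as ``__u32``, a caller would cast it to ``int`` before passing it in.

```c
#include <errno.h>

/* Possible follow-up actions; the names are hypothetical. */
enum io_action {
    IO_DONE,   /* the operation was successful */
    IO_RETRY,  /* resubmit the same request */
    IO_FAIL    /* report an error condition */
};

/* Classify a ret_code read back from the I/O region. */
enum io_action classify_io_ret_code(int ret_code)
{
    switch (ret_code) {
    case 0:
        return IO_DONE;
    case -EAGAIN:
        return IO_RETRY;
    default:
        return IO_FAIL;
    }
}

/* Short summary of each value listed in this section. */
const char *describe_io_ret_code(int ret_code)
{
    switch (ret_code) {
    case 0:           return "operation successful";
    case -EOPNOTSUPP: return "transport mode, unidentified IDAW format or non-start function";
    case -EIO:        return "device not ready to accept requests, or internal error";
    case -EBUSY:      return "subchannel status pending or busy, or request already active";
    case -EAGAIN:     return "request being processed, retry";
    case -EACCES:     return "channel path(s) not operational";
    case -ENODEV:     return "device not operational";
    case -EINVAL:     return "chain longer than 255 ccws, or internal error";
    default:          return "value not listed for this region";
    }
}
```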
|
|
|
|
|
vfio-ccw cmd region |
|
------------------- |
|
|
|
The vfio-ccw cmd region is used to accept asynchronous instructions |
|
from userspace:: |
|
|
|
  #define VFIO_CCW_ASYNC_CMD_HSCH (1 << 0)
  #define VFIO_CCW_ASYNC_CMD_CSCH (1 << 1)
  struct ccw_cmd_region {
          __u32 command;
          __u32 ret_code;
  } __packed;
|
|
|
This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD. |
|
|
|
Currently, CLEAR SUBCHANNEL and HALT SUBCHANNEL use this region. |
|
|
|
command specifies the command to be issued; ret_code stores a return code |
|
for each access of the region. The following values may occur: |
|
|
|
``0``
  The operation was successful.

``-ENODEV``
  The device was found to be not operational.

``-EINVAL``
  A command other than halt or clear was specified.

``-EIO``
  A request was issued while the device was not in a state ready to accept
  requests.

``-EAGAIN``
  A request was being processed, and the caller should retry.

``-EBUSY``
  The subchannel was status pending or busy while processing a halt request.
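A user space program can avoid the ``-EINVAL`` case by validating the command before writing the region. The sketch below mirrors the definitions above with `<stdint.h>` types; `prepare_async_cmd()` is a hypothetical helper, and rejecting anything that is not exactly one of the two commands is this sketch's reading of "a command other than halt or clear".

```c
#include <stdint.h>

#define VFIO_CCW_ASYNC_CMD_HSCH (1 << 0)
#define VFIO_CCW_ASYNC_CMD_CSCH (1 << 1)

/* Local mirror of the cmd region layout shown above. */
struct ccw_cmd_region {
    uint32_t command;
    uint32_t ret_code;
} __attribute__((packed));

/*
 * Hypothetical helper: prepare a halt or clear request.
 * Returns 0 on success, -1 if the command is neither halt nor clear.
 */
int prepare_async_cmd(struct ccw_cmd_region *region, uint32_t command)
{
    if (command != VFIO_CCW_ASYNC_CMD_HSCH &&
        command != VFIO_CCW_ASYNC_CMD_CSCH)
        return -1;

    region->command = command;
    region->ret_code = 0;
    return 0;
}
```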
|
|
|
vfio-ccw schib region |
|
--------------------- |
|
|
|
The vfio-ccw schib region is used to return Subchannel-Information |
|
Block (SCHIB) data to userspace:: |
|
|
|
  struct ccw_schib_region {
  #define SCHIB_AREA_SIZE 52
          __u8 schib_area[SCHIB_AREA_SIZE];
  } __packed;
|
|
|
This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_SCHIB. |
|
|
|
Reading this region triggers a STORE SUBCHANNEL to be issued to the |
|
associated hardware. |
|
|
|
vfio-ccw crw region |
|
--------------------- |
|
|
|
The vfio-ccw crw region is used to return Channel Report Word (CRW) |
|
data to userspace:: |
|
|
|
  struct ccw_crw_region {
          __u32 crw;
          __u32 pad;
  } __packed;
|
|
|
This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_CRW. |
|
|
|
Reading this region returns a CRW if one that is relevant for this |
|
subchannel (e.g. one reporting changes in channel path state) is |
|
pending, or all zeroes if not. If multiple CRWs are pending (including |
|
possibly chained CRWs), reading this region again will return the next |
|
one, until no more CRWs are pending and zeroes are returned. This is |
|
similar to how STORE CHANNEL REPORT WORD works. |
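The read-until-zero behaviour described above can be sketched as a simple drain loop. Everything below is illustrative: `drain_crws()` and the `crw_read_fn` callback are hypothetical names, and `fake_crw_read()` is a stand-in for a real read of the region on the device file descriptor, included so that the loop can be exercised without hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the crw region layout shown above. */
struct ccw_crw_region {
    uint32_t crw;
    uint32_t pad;
} __attribute__((packed));

/* Hypothetical reader callback: fills *region, returns 0 on success. */
typedef int (*crw_read_fn)(void *ctx, struct ccw_crw_region *region);

/*
 * Read the region repeatedly until it returns zeroes. Stores at most
 * max CRWs in out and returns how many were stored, or -1 on a read error.
 */
int drain_crws(crw_read_fn read_region, void *ctx, uint32_t *out, size_t max)
{
    size_t n = 0;

    while (n < max) {
        struct ccw_crw_region region;

        if (read_region(ctx, &region) != 0)
            return -1;
        if (region.crw == 0)
            break;              /* no more CRWs pending */
        out[n++] = region.crw;
    }
    return (int)n;
}

/* Stand-in for the device: hands out queued CRWs, then zeroes. */
struct fake_crw_source {
    const uint32_t *crws;
    size_t count;
    size_t pos;
};

int fake_crw_read(void *ctx, struct ccw_crw_region *region)
{
    struct fake_crw_source *src = ctx;

    region->crw = src->pos < src->count ? src->crws[src->pos++] : 0;
    region->pad = 0;
    return 0;
}
```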
|
|
|
vfio-ccw operation details |
|
-------------------------- |
|
|
|
vfio-ccw follows what vfio-pci did on the s390 platform and uses |
|
vfio-iommu-type1 as the vfio iommu backend. |
|
|
|
* CCW translation APIs
  A group of APIs (starting with `cp_`) that do CCW translation. The CCWs
  passed in by a user space program are organized with their guest
  physical memory addresses. These APIs copy the CCWs into kernel
  space and assemble a runnable kernel channel program by replacing the
  guest physical addresses with their corresponding host physical addresses.
  Note that we have to use IDALs even for direct-access CCWs, as the
  referenced memory can be located anywhere, including above 2G.
|
|
|
* vfio_ccw device driver |
|
This driver utilizes the CCW translation APIs and introduces |
|
vfio_ccw, which is the driver for the I/O subchannel devices you want |
|
to pass through. |
|
vfio_ccw implements the following vfio ioctls:: |
|
|
|
    VFIO_DEVICE_GET_INFO
    VFIO_DEVICE_GET_IRQ_INFO
    VFIO_DEVICE_GET_REGION_INFO
    VFIO_DEVICE_RESET
    VFIO_DEVICE_SET_IRQS
|
|
|
  This provides an I/O region, so that the user space program can pass a
  channel program to the kernel, which does further CCW translation
  before issuing it to a real device.
  This also provides the VFIO_DEVICE_SET_IRQS ioctl to set up an event
  notifier that notifies the user space program of the I/O completion
  asynchronously.
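The note on IDALs in the CCW translation bullet above can be made concrete with a small calculation. A buffer that is contiguous in guest physical memory need not be contiguous in host memory, so each block it touches has to be translated on its own and listed as a separate indirect data address word (IDAW). The sketch below counts and enumerates those blocks; it assumes 4 KiB IDA blocks, and `idaw_count()` and `build_guest_idal()` are hypothetical names rather than functions of the driver.

```c
#include <stdint.h>

#define IDA_BLOCK_SHIFT 12                      /* assumption: 4 KiB blocks */
#define IDA_BLOCK_SIZE  (1ULL << IDA_BLOCK_SHIFT)

/* Number of IDAWs needed for a buffer starting at addr with length bytes. */
unsigned int idaw_count(uint64_t addr, uint32_t length)
{
    uint64_t offset = addr & (IDA_BLOCK_SIZE - 1);

    return (unsigned int)((offset + length + IDA_BLOCK_SIZE - 1)
                          >> IDA_BLOCK_SHIFT);
}

/*
 * List the guest addresses of all blocks a buffer touches. The first
 * entry keeps the offset into its block; the following entries are
 * block aligned. Returns the number of entries, or 0 if max is too small.
 * Each entry would still have to be translated to a host address.
 */
unsigned int build_guest_idal(uint64_t addr, uint32_t length,
                              uint64_t *idaws, unsigned int max)
{
    unsigned int n = idaw_count(addr, length);
    uint64_t block = addr & ~(IDA_BLOCK_SIZE - 1);

    if (n > max)
        return 0;
    for (unsigned int i = 0; i < n; i++)
        idaws[i] = (i == 0) ? addr : block + (uint64_t)i * IDA_BLOCK_SIZE;
    return n;
}
```

For example, a 0x20-byte buffer at guest address 0x1ff0 crosses a block boundary and therefore needs two entries, 0x1ff0 and 0x2000.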
|
|
|
The use of vfio-ccw is not limited to QEMU, but QEMU is a good example
for understanding how these patches work. Here is a little more detail
on how an I/O request triggered by the QEMU guest is handled (without
error handling).
|
|
|
Explanation: |
|
|
|
- Q1-Q7: QEMU side process.
- K1-K6: Kernel side process.
|
|
|
Q1.
    Get I/O region info during initialization.

Q2.
    Set up event notifier and handler to handle I/O completion.

... ...

Q3.
    Intercept an ssch instruction.
Q4.
    Write the guest channel program and ORB to the I/O region.

    K1.
        Copy from guest to kernel.
    K2.
        Translate the guest channel program to a host kernel space
        channel program, which becomes runnable for a real device.
    K3.
        With the necessary information contained in the orb passed in
        by QEMU, issue the ccwchain to the device.
    K4.
        Return the ssch CC code.
Q5.
    Return the CC code to the guest.

... ...

    K5.
        Interrupt handler gets the I/O result and writes the result to
        the I/O region.
    K6.
        Signal QEMU to retrieve the result.

Q6.
    Get the signal and event handler reads out the result from the I/O
    region.
Q7.
    Update the irb for the guest.
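The signaling between K6 and Q6 relies on an eventfd, as set up in Q2. The sketch below shows only the eventfd mechanics on Linux, under the assumption that the notifier is a plain (non-semaphore) eventfd: `signal_io_completion()` is a hypothetical stand-in for the kernel side, which in reality signals the eventfd from within the driver, and `wait_for_io_completion()` is a hypothetical user space helper. Registering the eventfd with VFIO_DEVICE_SET_IRQS is not shown.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Q6: block until the notifier fires. Returns the number of signals
 * that were coalesced into this wakeup, or -1 on error.
 */
int64_t wait_for_io_completion(int efd)
{
    uint64_t count;

    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return -1;
    return (int64_t)count;
}

/* Stand-in for K6: add one to the eventfd counter. Returns 0 or -1. */
int signal_io_completion(int efd)
{
    uint64_t one = 1;

    if (write(efd, &one, sizeof(one)) != (ssize_t)sizeof(one))
        return -1;
    return 0;
}
```

After a successful wait, the event handler would read the I/O region and pick up irb_area, as described in Q6 and Q7. Because the eventfd counter accumulates, several signals may be delivered in a single wakeup.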
|
|
|
Limitations |
|
----------- |
|
|
|
The current vfio-ccw implementation focuses on supporting the basic
commands needed to implement block device functionality (read/write) for
DASD/ECKD devices only. Some commands may need special handling in the
future, for example, anything related to path grouping.
|
|
|
DASD is a kind of storage device, while ECKD is a data recording format.
More information on DASD and ECKD can be found here:
https://en.wikipedia.org/wiki/Direct-access_storage_device
https://en.wikipedia.org/wiki/Count_key_data
|
|
|
Together with the corresponding work in QEMU, we can now bring a
passed-through DASD/ECKD device online in a guest and use it as a block
device.
|
|
|
The current code allows the guest to start channel programs via |
|
START SUBCHANNEL, and to issue HALT SUBCHANNEL, CLEAR SUBCHANNEL, |
|
and STORE SUBCHANNEL. |
|
|
|
Currently all channel programs are prefetched, regardless of the
p-bit setting in the ORB. As a result, self-modifying channel
programs are not supported. For this reason, IPL has to be handled as
a special case by a userspace/guest program; this has been implemented
in QEMU's s390-ccw bios as of QEMU 4.1.
|
|
|
vfio-ccw supports classic (command mode) channel I/O only. Transport |
|
mode (HPF) is not supported. |
|
|
|
QDIO subchannels are currently not supported. Classic devices other than |
|
DASD/ECKD might work, but have not been tested. |
|
|
|
Reference |
|
--------- |
|
1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832) |
|
2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204) |
|
3. https://en.wikipedia.org/wiki/Channel_I/O |
|
4. Documentation/s390/cds.rst |
|
5. Documentation/driver-api/vfio.rst |
|
6. Documentation/driver-api/vfio-mediated-device.rst
|
|
|