.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================================
The PCI Express Advanced Error Reporting Driver Guide HOWTO
===========================================================

:Authors: - T. Long Nguyen <[email protected]>
          - Yanmin Zhang <[email protected]>

:Copyright: |copy| 2006 Intel Corporation

Overview
===========

About this guide
----------------

This guide describes the basics of the PCI Express Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of endpoint devices to conform with
the PCI Express AER driver.

What is the PCI Express AER Driver?
-----------------------------------

PCI Express error signaling can occur on the PCI Express link itself
or on behalf of transactions initiated on the link. PCI Express
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCI Express components and provides a minimum defined
set of error reporting requirements. The Advanced Error Reporting
capability is implemented with a PCI Express Advanced Error Reporting
extended capability structure and provides more robust error reporting.

The PCI Express AER driver provides the infrastructure to support the
PCI Express Advanced Error Reporting capability. The driver provides
three basic functions:

- Gathers comprehensive error information when errors occur.
- Reports errors to the users.
- Performs error recovery actions.

The AER driver only attaches to root ports that support the PCI Express
AER capability.


User Guide
==========

Include the PCI Express AER Root Driver into the Linux Kernel
-------------------------------------------------------------

The PCI Express AER Root driver is a Root Port service driver attached
to the PCI Express Port Bus driver. To use it, the driver has to be
compiled in. The option CONFIG_PCIEAER enables this capability. It
depends on CONFIG_PCIEPORTBUS, so set both CONFIG_PCIEPORTBUS=y and
CONFIG_PCIEAER=y.

Load PCI Express AER Root Driver
--------------------------------

Some systems have AER support in firmware. Enabling Linux AER support at
the same time the firmware handles AER may result in unpredictable
behavior. Therefore, Linux does not handle AER events unless the firmware
grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0
Specification for details regarding _OSC usage.

AER error output
----------------

When a PCIe AER error is captured, an error message is output to the
console. If it is a correctable error, it is output as a warning.
Otherwise, it is printed as an error, so users can choose a different
log level to filter out correctable error messages.

An example is shown below::

  0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
  0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
  0000:50:00.0: [20] Unsupported Request (First)
  0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100

In the example, 'Requester ID' means the ID of the device that sent
the error message to the root port. Please refer to the PCI Express
specifications for the other fields.
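
As an illustration of how the ``id`` and ``status/mask`` fields in the
example can be read, the following user-space sketch splits a requester ID
into bus, device and function numbers and names the bits set in an
uncorrectable error status value. The function names and the bit-name
table are hypothetical helpers written for this guide, not kernel APIs;
the bit positions are assumed to follow the Uncorrectable Error Status
register layout in the PCI Express specification.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One entry per named bit of the Uncorrectable Error Status register. */
struct aer_bit_name {
	unsigned int bit;
	const char *name;
};

/* Illustrative table; names are descriptive, not the kernel's strings. */
static const struct aer_bit_name uncor_bits[] = {
	{ 4,  "Data Link Protocol Error" },
	{ 5,  "Surprise Down Error" },
	{ 12, "Poisoned TLP" },
	{ 13, "Flow Control Protocol Error" },
	{ 14, "Completion Timeout" },
	{ 15, "Completer Abort" },
	{ 16, "Unexpected Completion" },
	{ 17, "Receiver Overflow" },
	{ 18, "Malformed TLP" },
	{ 19, "ECRC Error" },
	{ 20, "Unsupported Request" },
};

/* Return the name of a status bit, or NULL if the table does not list it. */
const char *aer_uncor_bit_name(unsigned int bit)
{
	size_t i;

	for (i = 0; i < sizeof(uncor_bits) / sizeof(uncor_bits[0]); i++)
		if (uncor_bits[i].bit == bit)
			return uncor_bits[i].name;
	return NULL;
}

/* Split a 16-bit requester ID: bus [15:8], device [7:3], function [2:0]. */
void aer_split_requester_id(uint16_t id, unsigned int *bus,
			    unsigned int *dev, unsigned int *fn)
{
	*bus = id >> 8;
	*dev = (id >> 3) & 0x1f;
	*fn = id & 0x7;
}

/* Print each bit that is set and not masked; return how many there were. */
int aer_print_uncor_status(uint32_t status, uint32_t mask)
{
	uint32_t active = status & ~mask;
	unsigned int bit;
	int count = 0;

	for (bit = 0; bit < 32; bit++) {
		const char *name;

		if (!(active & (UINT32_C(1) << bit)))
			continue;
		name = aer_uncor_bit_name(bit);
		printf("[%2u] %s\n", bit, name ? name : "(not in this table)");
		count++;
	}
	return count;
}
```

With the values from the example, ``aer_split_requester_id(0x0500, ...)``
yields bus 05, device 00, function 0, and
``aer_print_uncor_status(0x00100000, 0x00000000)`` reports bit 20,
Unsupported Request.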

AER Statistics / Counters
-------------------------

When PCIe AER errors are captured, the counters / statistics are also exposed
in the form of sysfs attributes, which are documented in
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats.
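
A tool that consumes these counters might look up a single value in the
text read from one of the attributes. The sketch below assumes each line
has the form ``<error name> <count>`` (with names that may contain
spaces, and a final ``TOTAL_...`` line); check the ABI document for the
authoritative format. ``aer_stat_lookup`` is a hypothetical helper, not
part of any kernel or library API.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Find the line "<name> <count>" in text and return the count.
 * Returns -1 if no line starts with name followed by a number.
 */
long aer_stat_lookup(const char *text, const char *name)
{
	size_t name_len = strlen(name);
	const char *line = text;

	while (line && *line) {
		const char *eol = strchr(line, '\n');
		size_t len = eol ? (size_t)(eol - line) : strlen(line);

		/* The name must be followed by a space and then the count. */
		if (len > name_len + 1 &&
		    strncmp(line, name, name_len) == 0 &&
		    line[name_len] == ' ') {
			const char *num = line + name_len + 1;
			char *end;
			long val = strtol(num, &end, 10);

			if (end != num)
				return val;
		}
		line = eol ? eol + 1 : NULL;
	}
	return -1;
}
```

In practice the text would come from reading the attribute file of the
device of interest into a buffer and passing that buffer to the helper.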

Developer Guide
===============

Enabling AER-aware support requires a software driver to configure
the AER capability structure within its device and to provide callbacks.

To support AER better, developers first need to understand how AER
works.

PCI Express errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impact
of those errors, which may result in degraded performance or function
failure.

Correctable errors have no impact on the functionality of the
interface. The PCI Express protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware. Unlike correctable errors, uncorrectable
errors impact the functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCI Express link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.
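
The split of uncorrectable errors into non-fatal and fatal can be
sketched in a few lines. The example assumes the convention of the AER
Uncorrectable Error Severity register, where a set bit marks the
corresponding status bit as fatal and a clear bit marks it as non-fatal.
The enum and function names are hypothetical and exist only for this
illustration.

```c
#include <stdint.h>

enum aer_uncor_class {
	AER_UNCOR_NONE,		/* no uncorrectable status bit set */
	AER_UNCOR_NONFATAL,	/* transaction unreliable, link still usable */
	AER_UNCOR_FATAL,	/* link itself unreliable */
};

/*
 * Classify an uncorrectable error status value against the severity
 * register: any reported bit whose severity bit is set makes it fatal.
 */
enum aer_uncor_class aer_classify_uncor(uint32_t status, uint32_t severity)
{
	if (status == 0)
		return AER_UNCOR_NONE;
	if (status & severity)
		return AER_UNCOR_FATAL;
	return AER_UNCOR_NONFATAL;
}
```

For instance, a status of ``0x00100000`` is classified as fatal when bit
20 is also set in the severity value and as non-fatal when it is clear.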

When AER is enabled, a PCI Express device will automatically send an
error message to the PCIe root port above it when the device captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its PCI Express
capability structure. Logging the error includes storing the error
reporting agent's requester ID in the Error Source Identification
Registers and setting the error bits of the Root Error Status Register
accordingly. If AER error reporting is enabled in the Root Error
Command Register, the Root Port generates an interrupt when an error
is detected.

Note that the errors described above are related to the PCI Express
hierarchy and links. They do not include any device-specific errors,
because device-specific errors are still sent directly to the device
driver.

Configure the AER capability structure
--------------------------------------

AER-aware drivers of PCI Express components need to change the device
control registers to enable AER. They can also change AER registers,
including the mask and severity registers. The helper function
pci_enable_pcie_error_reporting can be used to enable AER. See the
"helper functions" section.

Provide callbacks
-----------------

callback reset_link to reset PCI Express link
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This callback is used to reset the PCI Express physical link when a
fatal error happens. The root port AER service driver provides a
default reset_link function, but different upstream ports might
have different specifications for resetting the PCI Express link, so
all upstream ports should provide their own reset_link functions.

The "Non-correctable (non-fatal and fatal) errors" section provides
more detail on when reset_link is called.

PCI error-recovery callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The PCI Express AER Root driver uses error callbacks to coordinate
with downstream device drivers associated with the hierarchy in question
when performing error recovery actions.

The struct pci_driver has a pointer, err_handler, that points to a
pci_error_handlers structure, which consists of several callback
function pointers. The AER driver follows the rules defined in
pci-error-recovery.txt except for the PCI Express-specific parts (e.g.
reset_link). Please refer to pci-error-recovery.txt for detailed
definitions of the callbacks.

The sections below specify when the error callback functions are called.

Correctable errors
~~~~~~~~~~~~~~~~~~

Correctable errors have no impact on the functionality of
the interface. The PCI Express protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.

Non-correctable (non-fatal and fatal) errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If an error message indicates a non-fatal error, performing a link reset
upstream is not required. The AER driver calls error_detected(dev,
pci_channel_io_normal) for all drivers associated with the hierarchy in
question. For example::

  EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort

If Upstream port A captures an AER error, the hierarchy consists of
Downstream port B and EndPoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover without a reset, considers the device unusable,
or needs a reset. When the result is PCI_ERS_RESULT_CAN_RECOVER, the
AER driver calls mmio_enabled next.

If an error message indicates a fatal error, the kernel broadcasts
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. Then, performing a link reset upstream is
necessary. As different kinds of devices might use different approaches
to reset the link, the AER port service driver is required to provide
the function to reset the link via the callback parameter of the
pcie_do_recovery() function. If reset_link is not NULL, the recovery
function will use it to reset the link. If error_detected returns
PCI_ERS_RESULT_CAN_RECOVER and reset_link returns
PCI_ERS_RESULT_RECOVERED, the error handling goes to mmio_enabled.
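
The shape of a driver's error callbacks can be sketched as follows. To
keep the example compilable in user space, it declares simplified
stand-ins for the kernel's pci_dev, pci_channel_state_t,
pci_ers_result_t and pci_error_handlers definitions; a real driver would
take them from <linux/pci.h> instead. The demo_* functions, the
io_stopped field and the choice of return values are assumptions made for
illustration (one common pattern), not requirements of the AER driver.

```c
/* Simplified stand-ins for kernel definitions (assumption, see above). */
typedef enum {
	pci_channel_io_normal = 1,
	pci_channel_io_frozen = 2,
	pci_channel_io_perm_failure = 3,
} pci_channel_state_t;

typedef enum {
	PCI_ERS_RESULT_NONE = 1,
	PCI_ERS_RESULT_CAN_RECOVER = 2,
	PCI_ERS_RESULT_NEED_RESET = 3,
	PCI_ERS_RESULT_DISCONNECT = 4,
	PCI_ERS_RESULT_RECOVERED = 5,
} pci_ers_result_t;

struct pci_dev {
	int io_stopped;		/* hypothetical per-device driver state */
};

struct pci_error_handlers {
	pci_ers_result_t (*error_detected)(struct pci_dev *dev,
					   pci_channel_state_t state);
	pci_ers_result_t (*mmio_enabled)(struct pci_dev *dev);
	pci_ers_result_t (*slot_reset)(struct pci_dev *dev);
	void (*resume)(struct pci_dev *dev);
};

/* Called first for both non-fatal (normal) and fatal (frozen) errors. */
static pci_ers_result_t demo_error_detected(struct pci_dev *dev,
					    pci_channel_state_t state)
{
	switch (state) {
	case pci_channel_io_normal:
		/* Non-fatal: the link works, try to recover without reset. */
		return PCI_ERS_RESULT_CAN_RECOVER;
	case pci_channel_io_frozen:
		/* Fatal: stop touching the device and ask for a reset. */
		dev->io_stopped = 1;
		return PCI_ERS_RESULT_NEED_RESET;
	default:
		/* Permanent failure: give up on the device. */
		dev->io_stopped = 1;
		return PCI_ERS_RESULT_DISCONNECT;
	}
}

/* MMIO is usable again; a real driver would check device state here. */
static pci_ers_result_t demo_mmio_enabled(struct pci_dev *dev)
{
	(void)dev;
	return PCI_ERS_RESULT_RECOVERED;
}

/* Called after a reset; a real driver would re-initialize the device. */
static pci_ers_result_t demo_slot_reset(struct pci_dev *dev)
{
	(void)dev;
	return PCI_ERS_RESULT_RECOVERED;
}

/* Recovery finished: normal I/O may restart. */
static void demo_resume(struct pci_dev *dev)
{
	dev->io_stopped = 0;
}

const struct pci_error_handlers demo_err_handler = {
	.error_detected = demo_error_detected,
	.mmio_enabled = demo_mmio_enabled,
	.slot_reset = demo_slot_reset,
	.resume = demo_resume,
};
```

In a real driver, a structure like demo_err_handler would be assigned to
the err_handler pointer of the driver's struct pci_driver.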

helper functions
----------------
::

  int pci_enable_pcie_error_reporting(struct pci_dev *dev);

pci_enable_pcie_error_reporting enables the device to send error
messages to the root port when an error is detected. Note that devices
do not enable error reporting by default, so device drivers need to
call this function to enable it.

::

  int pci_disable_pcie_error_reporting(struct pci_dev *dev);

pci_disable_pcie_error_reporting stops the device from sending error
messages to the root port when an error is detected.

::

  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);

pci_aer_clear_nonfatal_status clears non-fatal errors in the uncorrectable
error status register.
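
A typical place to call the enable and disable helpers is the driver's
probe and remove paths. The sketch below shows only that call order. So
that it can be compiled in user space, struct pci_dev is a hypothetical
stand-in and the two helpers are stubs with the prototypes listed above;
in a driver, the kernel provides both. The demo_probe and demo_remove
names and their simplified signatures are also assumptions, as is the
choice to treat a failure to enable reporting as non-fatal.

```c
#include <stdio.h>

/* Hypothetical stand-in for the kernel's struct pci_dev. */
struct pci_dev {
	int aer_reporting;
};

/* User-space stubs with the same prototypes as the kernel helpers. */
int pci_enable_pcie_error_reporting(struct pci_dev *dev)
{
	dev->aer_reporting = 1;
	return 0;
}

int pci_disable_pcie_error_reporting(struct pci_dev *dev)
{
	dev->aer_reporting = 0;
	return 0;
}

/* Probe path: enable error reporting once the device is set up. */
int demo_probe(struct pci_dev *dev)
{
	int err = pci_enable_pcie_error_reporting(dev);

	/* Assumed policy: the device still works without AER reporting. */
	if (err)
		fprintf(stderr, "AER reporting not enabled (%d)\n", err);
	return 0;
}

/* Remove path: undo what probe enabled. */
void demo_remove(struct pci_dev *dev)
{
	pci_disable_pcie_error_reporting(dev);
}
```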

Frequently Asked Questions
--------------------------

Q:
  What happens if a PCI Express device driver does not provide an
  error recovery handler (pci_driver->err_handler is equal to NULL)?

A:
  The devices attached to the driver won't be recovered. If the
  error is fatal, the kernel will print warning messages. Please refer
  to the Developer Guide section for more information.

Q:
  What happens if an upstream port service driver does not provide
  the reset_link callback?

A:
  Fatal error recovery will fail if the errors are reported by the
  upstream ports that the service driver is attached to.

Q:
  How does this infrastructure deal with a driver that is not PCI
  Express aware?

A:
  This infrastructure calls the error callback functions of the
  driver when an error happens. But if the driver is not aware of
  PCI Express, the device might not report its own errors to the root
  port.

Q:
  What modifications will that driver need to make it compatible
  with the PCI Express AER Root driver?

A:
  It could call the helper functions to enable AER in devices and
  clean up the uncorrectable status register. Please refer to the
  "helper functions" section.


Software error injection
========================

Debugging PCIe AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software-based error
injection can be used to fake various kinds of PCIe errors.

First, enable PCIe AER software error injection in the kernel
configuration; that is, the following item should be in your .config.

CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m

After rebooting with the new kernel or inserting the module, a device
file named /dev/aer_inject should be created.

Then, you need a user space tool named aer-inject, which can be obtained
from:

    https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/

More information about aer-inject can be found in the documentation that
comes with its source code.