.. SPDX-License-Identifier: GPL-2.0

=============
Devlink DPIPE
=============

Background
==========

While performing the hardware offloading process, much of the hardware
specifics cannot be presented. These details are useful for debugging, and
``devlink-dpipe`` provides a standardized way to provide visibility into the
offloading process.

For example, the routing longest prefix match (LPM) algorithm used by the
Linux kernel may differ from the hardware implementation. The pipeline debug
API (DPIPE) is aimed at providing the user visibility into the ASIC's
pipeline in a generic way.

The hardware offload process is expected to be done in a way that the user
should not be able to distinguish between the hardware vs. software
implementation. In this process, hardware specifics are neglected. In
reality those details can have lots of meaning and should be exposed in some
standard way.

This problem is made even more complex when one wishes to offload the
control path of the whole networking stack to a switch ASIC. Due to
differences in the hardware and software models some processes cannot be
represented correctly.

One example is the kernel's LPM algorithm, which in many cases differs
greatly from the hardware implementation. The configuration API is the same,
but one cannot rely on the Forwarding Information Base (FIB) in hardware to
look like the kernel's Level Path Compression trie (LPC-trie).

In many situations, analyzing a system failure solely on the basis of the
kernel's dump may not be enough. Combining this data with complementary
information about the underlying hardware makes debugging easier; the
information is also useful when debugging performance issues.

Overview
========

The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
modeled as a graph of match/action tables. Each table represents a specific
hardware block. This model is not new, first being used by the P4 language.

Traditionally it has been used as an alternative model for hardware
configuration, but the ``devlink-dpipe`` interface uses it for visibility
purposes as a standard complementary tool. The system's view from
``devlink-dpipe`` should change according to the changes done by the
standard configuration tools.

For example, it is quite common to implement Access Control Lists (ACL)
using Ternary Content Addressable Memory (TCAM). The TCAM can be divided
into TCAM regions. Complex TC filters can have multiple rules with different
priorities and different lookup keys. Hardware TCAM regions, on the other
hand, have a predefined lookup key. Offloading the TC filter rules using a
TCAM engine can result in multiple TCAM regions being interconnected in a
chain (which may affect the data path latency). In response to a new TC
filter, new tables should be created describing those regions.

Model
=====

The ``DPIPE`` model introduces several objects:

* headers
* tables
* entries

A ``header`` describes packet formats and provides names for fields within
the packet. A ``table`` describes hardware blocks. An ``entry`` describes
the actual content of a specific table.

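The relationship between the three objects can be pictured with a small
data-structure sketch. It is not the kernel's representation: the class and
attribute names (``Header``, ``Table``, ``Entry``, ``match_values`` and so
on) are hypothetical and chosen only to mirror the terms above, and the
addresses are documentation placeholders.

.. code:: python

    from dataclasses import dataclass, field
    from typing import Dict, List, Optional


    @dataclass
    class Header:
        """Names a packet format and the fields inside it."""
        name: str
        fields: List[str]


    @dataclass
    class Entry:
        """One row of a table: match values, action values, optional counter."""
        index: int
        match_values: Dict[str, object]
        action_values: Dict[str, object]
        counter: Optional[int] = None


    @dataclass
    class Table:
        """Describes one hardware block of the pipeline."""
        name: str
        size: int
        matches: Dict[str, str]   # "header.field" -> match type
        actions: Dict[str, str]   # "header.field" -> action type
        counters_enabled: bool = False
        entries: List[Entry] = field(default_factory=list)

        def entries_dump(self) -> List[Entry]:
            """Return a copy of the table's current content."""
            return list(self.entries)


    # Hypothetical instance modeled on the local host table shown later.
    ipv4 = Header("ipv4", ["dst_addr"])
    local_host = Table(
        name="local_host",
        size=4096,
        matches={"meta.rif_port": "exact", "ipv4.dst_addr": "exact"},
        actions={"ethernet.daddr": "set"},
        counters_enabled=True,
    )
    local_host.entries.append(
        Entry(0,
              {"meta.rif_port": 1, "ipv4.dst_addr": "192.0.2.1"},
              {"ethernet.daddr": "00:00:5e:00:53:01"},
              counter=0)
    )

Keeping the entries separate from the table description reflects the split
in the model: the description changes rarely, while the entries follow the
configuration applied with the standard tools.
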
The hardware pipeline is not port specific, but rather describes the whole
ASIC. Thus it is tied to the top of the ``devlink`` infrastructure.

Drivers can register and unregister tables at run time, in order to support
dynamic behavior. This dynamic behavior is mandatory for describing hardware
blocks like TCAM regions which can be allocated and freed dynamically.

``devlink-dpipe`` generally is not intended for configuration. The exception
is hardware counting for a specific table.

The following commands are used to obtain the ``dpipe`` objects from
userspace:

* ``table_get``: Receive a table's description.
* ``headers_get``: Receive a device's supported headers.
* ``entries_get``: Receive a table's current entries.
* ``counters_set``: Enable or disable counters on a table.

Table
-----

The driver should implement the following operations for each table:

* ``matches_dump``: Dump the supported matches.
* ``actions_dump``: Dump the supported actions.
* ``entries_dump``: Dump the actual content of the table.
* ``counters_set_update``: Synchronize hardware with counters enabled or
  disabled.

Header/Field
------------

In a similar way to P4, headers and fields are used to describe a table's
behavior. There is a slight difference between the standard protocol headers
and specific ASIC metadata. The protocol headers should be declared in the
``devlink`` core API. ASIC metadata, on the other hand, is driver specific
and should be defined in the driver. Additionally, each driver-specific
devlink documentation file should document the driver-specific ``dpipe``
headers it implements. The headers and fields are identified by enumeration.

In order to provide further visibility, some ASIC metadata fields could be
mapped to kernel objects. For example, internal router interface indexes can
be directly mapped to the net device ifindex. FIB table indexes used by
different Virtual Routing and Forwarding (VRF) tables can be mapped to
internal routing table indexes.

Match
-----

Matches are kept primitive and close to hardware operation. Match types like
LPM are not supported, because LPM is exactly the kind of process that
should be described in full detail. Examples of matches:

* ``field_exact``: Exact match on a specific field.
* ``field_exact_mask``: Exact match on a specific field after masking.
* ``field_range``: Match on a specific range.

The IDs of the header and the field should be specified in order to
identify the specific field. Furthermore, the header index should be
specified in order to distinguish multiple headers of the same type in a
packet (tunneling).

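The three match primitives and the role of the header index can be sketched
as follows. This is an illustration only, under stated assumptions: a packet
is modeled as an ordered list of ``(header name, fields)`` pairs, the range
is treated as inclusive, the mask is applied to both the field and the key,
and ``get_field`` is a hypothetical helper rather than part of any kernel or
userspace API.

.. code:: python

    import ipaddress
    from typing import Dict, List, Tuple

    Packet = List[Tuple[str, Dict[str, int]]]


    def get_field(packet: Packet, header: str, header_index: int,
                  field_name: str) -> int:
        """Return a field from the header_index-th occurrence of a header."""
        seen = 0
        for name, fields in packet:
            if name == header:
                if seen == header_index:
                    return fields[field_name]
                seen += 1
        raise KeyError(f"{header}[{header_index}].{field_name}")


    def field_exact(value: int, key: int) -> bool:
        """Exact match on a field."""
        return value == key


    def field_exact_mask(value: int, key: int, mask: int) -> bool:
        """Exact match after masking (mask applied to both sides here)."""
        return (value & mask) == (key & mask)


    def field_range(value: int, low: int, high: int) -> bool:
        """Match when the value lies in [low, high] (inclusive by assumption)."""
        return low <= value <= high


    def ip(addr: str) -> int:
        """Convert a dotted IPv4 address to an integer."""
        return int(ipaddress.IPv4Address(addr))


    # A tunneled packet carries two IPv4 headers: index 0 is the outer one,
    # index 1 the inner one.
    tunneled: Packet = [
        ("ethernet", {"daddr": 0x00005E005301}),
        ("ipv4", {"dst_addr": ip("198.51.100.1")}),
        ("udp", {"dst_port": 4789}),
        ("ipv4", {"dst_addr": ip("192.0.2.7")}),
    ]

    inner_dst = get_field(tunneled, "ipv4", 1, "dst_addr")
    inner_in_subnet = field_exact_mask(inner_dst, ip("192.0.2.0"), 0xFFFFFF00)

Without the header index, a match on ``ipv4.dst_addr`` would be ambiguous
for this packet, since both the outer and the inner header carry that field.
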
Action
------

Similar to match, the actions are kept primitive and close to hardware
operation. For example:

* ``field_modify``: Modify the field value.
* ``field_inc``: Increment the field value.
* ``push_header``: Add a header.
* ``pop_header``: Remove a header.

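A minimal sketch of these four primitives, assuming a packet is an ordered
list of ``(header name, fields)`` pairs. The function signatures are
hypothetical, and two behaviors are assumptions made for the example:
``push_header`` adds the new header as the outermost one, and the other
actions operate on the first (outermost) occurrence of the named header.

.. code:: python

    from typing import Dict, List, Tuple

    Packet = List[Tuple[str, Dict[str, int]]]


    def _find(packet: Packet, header: str) -> Dict[str, int]:
        """Return the fields of the first occurrence of a header."""
        for name, fields in packet:
            if name == header:
                return fields
        raise KeyError(header)


    def field_modify(packet: Packet, header: str, field_name: str,
                     value: int) -> None:
        """Overwrite a field with a new value."""
        _find(packet, header)[field_name] = value


    def field_inc(packet: Packet, header: str, field_name: str,
                  amount: int = 1) -> None:
        """Add amount to a field (a negative amount decrements it)."""
        _find(packet, header)[field_name] += amount


    def push_header(packet: Packet, header: str,
                    fields: Dict[str, int]) -> None:
        """Add a header in front of the existing ones (assumed placement)."""
        packet.insert(0, (header, dict(fields)))


    def pop_header(packet: Packet, header: str) -> Dict[str, int]:
        """Remove the first occurrence of a header and return its fields."""
        for i, (name, _) in enumerate(packet):
            if name == header:
                return packet.pop(i)[1]
        raise KeyError(header)

For instance, rewriting the destination MAC is a ``field_modify`` on
``ethernet.daddr``, and a TTL decrement can be expressed as ``field_inc``
with an amount of ``-1``.
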
Entry
-----

Entries of a specific table can be dumped on demand. Each entry is
identified with an index, and its properties are described by a list of
match/action values and a specific counter. By dumping the tables' content,
the interactions between tables can be resolved.

Abstraction Example
===================

The following is an example of the abstraction model of the L3 part of the
Mellanox Spectrum ASIC. The blocks are described in the order they appear in
the pipeline. The table sizes in the following examples are not real
hardware sizes and are provided for demonstration purposes.

LPM
---

The LPM algorithm can be implemented as a list of hash tables. Each hash
table contains routes with the same prefix length. The root of the list is
/32, and in case of a miss the hardware will continue to the next hash
table. The depth of the search will affect the data path latency.

In case of a hit, the entry contains information about the next stage of the
pipeline, which resolves the MAC address. The next stage can be either the
local host table for directly connected routes, or the adjacency table for
next-hops. The ``meta.lpm_prefix`` field is used to connect two LPM tables.

.. code::

    table lpm_prefix_16 {
      size: 4096,
      counters_enabled: true,
      match: { meta.vr_id: exact,
               ipv4.dst_addr: exact_mask,
               ipv6.dst_addr: exact_mask,
               meta.lpm_prefix: exact },
      action: { meta.adj_index: set,
                meta.adj_group_size: set,
                meta.rif_port: set,
                meta.lpm_prefix: set },
    }

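The list-of-hash-tables lookup described above can be sketched in a few
lines. This is a software illustration, not the hardware algorithm: the
``LpmChain`` class and its method names are hypothetical, only IPv4 is
handled, the virtual router ID is ignored, and the result of a hit is an
opaque string standing in for the next pipeline stage. The returned depth
counts how many hash tables were probed, which is the quantity the text
relates to data path latency.

.. code:: python

    import ipaddress
    from typing import Dict, Optional, Tuple


    class LpmChain:
        """One hash table per prefix length, probed from /32 downwards."""

        def __init__(self) -> None:
            # prefix length -> {masked network address -> result}
            self.tables: Dict[int, Dict[int, str]] = {}

        def add_route(self, prefix: str, result: str) -> None:
            net = ipaddress.IPv4Network(prefix)
            table = self.tables.setdefault(net.prefixlen, {})
            table[int(net.network_address)] = result

        def lookup(self, addr: str) -> Tuple[Optional[str], int]:
            """Return (result, number of hash tables probed)."""
            ip = int(ipaddress.IPv4Address(addr))
            depth = 0
            # Longest prefix first; on a miss continue to the next table.
            for plen in sorted(self.tables, reverse=True):
                depth += 1
                mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
                hit = self.tables[plen].get(ip & mask)
                if hit is not None:
                    return hit, depth
            return None, depth


    lpm = LpmChain()
    lpm.add_route("192.0.2.1/32", "local_host")
    lpm.add_route("192.0.2.0/24", "adjacency:4")
    lpm.add_route("0.0.0.0/0", "adjacency:0")

With these three routes, ``lpm.lookup("192.0.2.1")`` hits in the first table
probed, while ``lpm.lookup("203.0.113.5")`` falls through to the default
route after three probes.
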
Local Host
----------

In the case of local routes the LPM lookup already resolves the egress
router interface (RIF), yet the exact MAC address is not known. The local
host table is a hash table keyed on the combination of the output interface
ID and the destination IP address. The result is the MAC address.

.. code::

    table local_host {
      size: 4096,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               ipv4.dst_addr: exact },
      action: { ethernet.daddr: set }
    }

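In software terms this table behaves like a dictionary with a composite
key. The sketch below assumes as much; the name ``resolve_local`` and the
addresses are placeholders invented for the example.

.. code:: python

    from typing import Dict, Optional, Tuple

    # (egress RIF, destination IP) -> destination MAC
    local_host: Dict[Tuple[int, str], str] = {
        (1, "192.0.2.10"): "00:00:5e:00:53:0a",
        (2, "198.51.100.7"): "00:00:5e:00:53:07",
    }


    def resolve_local(rif_port: int, dst_addr: str) -> Optional[str]:
        """Return the destination MAC, or None when no entry matches."""
        return local_host.get((rif_port, dst_addr))

Because the RIF is part of the key, the same destination address reached
through a different interface does not match.
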
Adjacency
---------

In case of remote routes this table performs ECMP. The LPM lookup yields the
ECMP group size and an index that serves as a global offset into this table.
Concurrently a hash of the packet is generated. Based on the ECMP group size
and the packet's hash a local offset is generated. Multiple LPM entries can
point to the same adjacency group.

.. code::

    table adjacency {
      size: 4096,
      counters_enabled: true,
      match: { meta.adj_index: exact,
               meta.adj_group_size: exact,
               meta.packet_hash_index: exact },
      action: { ethernet.daddr: set,
                meta.erif: set }
    }

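The offset computation can be illustrated as below. The text does not say
how the local offset is derived from the hash and the group size, so the
modulo used here is an assumption, as is the CRC32 over a flow tuple that
stands in for the hardware's packet hash. The function names are
hypothetical.

.. code:: python

    import zlib


    def packet_hash(src: str, dst: str, sport: int, dport: int) -> int:
        """Stand-in flow hash; real hardware uses its own function."""
        return zlib.crc32(f"{src}|{dst}|{sport}|{dport}".encode())


    def adjacency_offset(adj_index: int, adj_group_size: int,
                         pkt_hash: int) -> int:
        """Global offset of the group plus a hash-derived local offset."""
        local_offset = pkt_hash % adj_group_size  # assumed reduction
        return adj_index + local_offset

With a group that starts at index 8 and has four members, every flow lands
on one of the entries 8 to 11, and packets of the same flow always select
the same entry because the hash is deterministic.
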
ERIF
----

Once the egress RIF and destination MAC have been resolved by the previous
tables, this table performs several operations such as the TTL decrement and
the MTU check. The forward/drop decision is then taken and the port L3
statistics are updated based on the packet's type (broadcast, unicast,
multicast).

.. code::

    table erif {
      size: 800,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               meta.is_l3_unicast: exact,
               meta.is_l3_broadcast: exact,
               meta.is_l3_multicast: exact },
      action: { meta.l3_drop: set,
                meta.l3_forward: set }
    }

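A rough software analogue of this stage is sketched below. Several details
are assumptions rather than documented hardware behavior: a packet is
dropped when its TTL is 1 or less or when it exceeds the RIF's MTU, and the
per-type counters are incremented only for forwarded packets. ``Rif`` and
``erif_process`` are hypothetical names.

.. code:: python

    from dataclasses import dataclass, field
    from typing import Dict, Tuple


    def _new_counters() -> Dict[str, int]:
        return {"unicast": 0, "broadcast": 0, "multicast": 0}


    @dataclass
    class Rif:
        """Egress router interface with an MTU and per-type L3 counters."""
        mtu: int
        counters: Dict[str, int] = field(default_factory=_new_counters)


    def erif_process(rif: Rif, ttl: int, length: int,
                     pkt_type: str) -> Tuple[bool, int]:
        """Return (forward, ttl after processing)."""
        # Assumed drop conditions: expired TTL or packet larger than MTU.
        if ttl <= 1 or length > rif.mtu:
            return False, ttl
        rif.counters[pkt_type] += 1  # assumed: count forwarded packets only
        return True, ttl - 1

In the table above, the two outcomes correspond to the ``meta.l3_forward``
and ``meta.l3_drop`` actions, and the packet type corresponds to the three
``meta.is_l3_*`` match fields.
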