=============================
Per-task statistics interface
=============================


Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.

Taskstats was designed for the following benefits:

- efficiently provide statistics during the lifetime of a task and on its exit
- unified interface for multiple accounting subsystems
- extensibility for use by future accounting patches

Terminology
-----------

"pid", "tid" and "task" are used interchangeably and refer to the standard
Linux task defined by struct task_struct. Per-pid stats are the same as
per-task stats.

"tgid", "process" and "thread group" are used interchangeably and refer to the
tasks that share an mm_struct, i.e. the traditional Unix process. Despite the
use of tgid, there is no special treatment for the task that is the thread
group leader: a process is deemed alive as long as it has any task belonging
to it.

Usage
-----

To get statistics during a task's lifetime, userspace opens a unicast netlink
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
The response contains statistics for a task (if pid is specified) or the sum of
statistics for all tasks of the process (if tgid is specified).

To obtain statistics for tasks which are exiting, the userspace listener
sends a register command and specifies a cpumask. Whenever a task exits on
one of the cpus in the cpumask, its per-pid statistics are sent to the
registered listener. Using cpumasks limits the data received by any one
listener and assists in flow control over the netlink interface, as
explained in the section on flow control.
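
The cpumask passed with a register command is an ASCII string of
comma-separated cpu ranges. As a minimal sketch, the helper below turns a
sorted, duplicate-free list of cpu numbers into such a string. The function
name format_cpumask() and its signature are illustrative assumptions and are
not part of the taskstats interface or of getdelays.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a sorted, duplicate-free list of cpu numbers as comma-separated
 * ranges, e.g. {1, 2, 3, 5, 7, 8} becomes "1-3,5,7-8".
 * Returns the string length, or -1 if out is too small.
 */
static int format_cpumask(const int *cpus, size_t n, char *out, size_t outlen)
{
    size_t used = 0;
    size_t i = 0;

    assert(out != NULL && outlen > 0);
    out[0] = '\0';

    while (i < n) {
        size_t j = i;
        int w;

        /* Extend the run while the next cpu number is consecutive. */
        while (j + 1 < n && cpus[j + 1] == cpus[j] + 1)
            j++;

        if (j > i)
            w = snprintf(out + used, outlen - used, "%s%d-%d",
                         used ? "," : "", cpus[i], cpus[j]);
        else
            w = snprintf(out + used, outlen - used, "%s%d",
                         used ? "," : "", cpus[i]);

        /* snprintf reports the length it wanted; reject truncation. */
        if (w < 0 || (size_t)w >= outlen - used)
            return -1;
        used += (size_t)w;
        i = j + 1;
    }
    return (int)used;
}
```

The resulting string would then be carried as the payload of a
TASKSTATS_CMD_ATTR_REGISTER_CPUMASK attribute.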

If the exiting task is the last thread exiting its thread group,
an additional record containing the per-tgid stats is also sent to userspace.
The latter contains the sum of per-pid stats for all threads in the thread
group, both past and present.

getdelays.c is a simple utility demonstrating usage of the taskstats interface
for reporting delay accounting statistics. Users can register cpumasks,
send commands and process responses, listen for per-tid/tgid exit data,
write the data received to a file and do basic flow control by increasing
receive buffer sizes.

Interface
---------

The user-kernel interface is encapsulated in include/linux/taskstats.h.

To avoid this documentation becoming obsolete as the interface evolves, only
an outline of the current version is given. taskstats.h always overrides the
description here.

struct taskstats is the common accounting structure for both per-pid and
per-tgid data. It is versioned and can be extended by each accounting subsystem
that is added to the kernel. The fields and their semantics are defined in the
taskstats.h file.

The data exchanged between user and kernel space is a netlink message belonging
to the NETLINK_GENERIC family and using the netlink attributes interface.
The messages are in the format::

    +----------+- - -+-------------+-------------------+
    | nlmsghdr | Pad |  genlmsghdr | taskstats payload |
    +----------+- - -+-------------+-------------------+
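
To make the layout concrete, the following sketch fills a buffer with a
request for the statistics of one pid: an nlmsghdr, a genlmsghdr and a single
attribute carrying a u32. The helper name build_pid_query() is hypothetical.
The family_id parameter is assumed to have been resolved already (getdelays.c
looks it up through the generic netlink controller using the "TASKSTATS"
family name), and socket setup and sending are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>
#include <linux/taskstats.h>

/*
 * Fill buf with a TASKSTATS_CMD_GET request for one pid.
 * Returns the total message length, or 0 if buf is too small.
 * buf must be suitably aligned for struct nlmsghdr.
 */
static size_t build_pid_query(void *buf, size_t buflen, uint16_t family_id,
                              uint32_t pid)
{
    const size_t attr_len = NLA_HDRLEN + sizeof(pid);
    const size_t total = NLMSG_LENGTH(GENL_HDRLEN) + NLA_ALIGN(attr_len);
    struct nlmsghdr *nlh = buf;
    struct genlmsghdr *genl;
    struct nlattr *na;

    assert(buf != NULL);
    if (buflen < total)
        return 0;
    memset(buf, 0, total);

    /* nlmsghdr: generic netlink dispatches on the family id. */
    nlh->nlmsg_len = (uint32_t)total;
    nlh->nlmsg_type = family_id;
    nlh->nlmsg_flags = NLM_F_REQUEST;

    /* genlmsghdr: placed right after the aligned nlmsghdr. */
    genl = NLMSG_DATA(nlh);
    genl->cmd = TASKSTATS_CMD_GET;
    genl->version = TASKSTATS_GENL_VERSION;

    /* taskstats payload: one attribute holding the u32 pid. */
    na = (struct nlattr *)((char *)genl + GENL_HDRLEN);
    na->nla_type = TASKSTATS_CMD_ATTR_PID;
    na->nla_len = (uint16_t)attr_len;
    memcpy((char *)na + NLA_HDRLEN, &pid, sizeof(pid));

    return total;
}
```

Swapping TASKSTATS_CMD_ATTR_PID for TASKSTATS_CMD_ATTR_TGID would request the
per-process sum instead. The returned length is what a caller would pass to
sendto() on its NETLINK_GENERIC socket.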

The taskstats payload is one of the following three kinds:

1. Commands: Sent from user to kernel. Commands to get data on
   a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID,
   containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes
   the task/process for which userspace wants statistics.

   Commands to register/deregister interest in exit data from a set of cpus
   consist of one attribute, of type
   TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK, and contain a cpumask in the
   attribute payload. The cpumask is specified as an ASCII string of
   comma-separated cpu ranges, e.g. to listen to exit data from cpus 1,2,3,5,7,8
   the cpumask would be "1-3,5,7-8". If userspace forgets to deregister interest
   in cpus before closing the listening socket, the kernel cleans up its interest
   set over time. However, for the sake of efficiency, an explicit deregistration
   is advisable.

2. Response for a command: sent from the kernel in response to a userspace
   command. The payload is a series of three attributes of type:

   a) TASKSTATS_TYPE_AGGR_PID/TGID: attribute containing no payload, which
      indicates that a pid/tgid will be followed by some stats.

   b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose
      stats are being returned.

   c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
      same structure is used for both per-pid and per-tgid stats.

3. New message sent by kernel whenever a task exits. The payload consists of a
   series of attributes of the following types:

   a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
   b) TASKSTATS_TYPE_PID: contains exiting task's pid
   c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
   d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
   e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
   f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process
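
The attribute sequences above can be decoded with a small walker over the
attribute area that follows the genlmsghdr. The sketch below is one possible
reading, not the reference decoder. The outline describes the AGGR attribute
as a marker, while getdelays.c reads it as a container whose payload holds the
pid/tgid and stats attributes, so the walker accepts either layout. struct
ts_record and the function names are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>
#include <linux/taskstats.h>

/* One decoded record. Illustrative type, not part of taskstats.h. */
struct ts_record {
    int is_tgid;            /* 0: per-pid record, 1: per-tgid record */
    uint32_t id;            /* the pid or tgid */
    int have_stats;         /* set once a TASKSTATS_TYPE_STATS was seen */
    struct taskstats stats;
};

/* Read the attribute header at p + off; fail if it would overrun len. */
static int get_attr(const char *p, size_t off, size_t len, struct nlattr *na)
{
    if (off + NLA_HDRLEN > len)
        return 0;
    memcpy(na, p + off, sizeof(*na));
    return na->nla_len >= NLA_HDRLEN && (size_t)na->nla_len <= len - off;
}

/* Walk one level of attributes; AGGR payloads are walked recursively. */
static void walk_attrs(const char *p, size_t len, struct ts_record *out,
                       size_t max, size_t *n)
{
    struct nlattr na;
    size_t off = 0;

    while (get_attr(p, off, len, &na)) {
        const char *data = p + off + NLA_HDRLEN;
        size_t dlen = (size_t)na.nla_len - NLA_HDRLEN;
        int type = na.nla_type & NLA_TYPE_MASK;
        struct ts_record *cur = *n ? &out[*n - 1] : NULL;

        if (type == TASKSTATS_TYPE_AGGR_PID ||
            type == TASKSTATS_TYPE_AGGR_TGID) {
            if (*n == max)
                return;
            /* Start a new record; its fields come next or nested. */
            cur = &out[(*n)++];
            memset(cur, 0, sizeof(*cur));
            cur->is_tgid = (type == TASKSTATS_TYPE_AGGR_TGID);
            walk_attrs(data, dlen, out, max, n);
        } else if (cur && dlen >= sizeof(cur->id) &&
                   (type == TASKSTATS_TYPE_PID ||
                    type == TASKSTATS_TYPE_TGID)) {
            memcpy(&cur->id, data, sizeof(cur->id));
        } else if (cur && type == TASKSTATS_TYPE_STATS) {
            /* Copy no more than this build's struct can hold. */
            if (dlen > sizeof(cur->stats))
                dlen = sizeof(cur->stats);
            memcpy(&cur->stats, data, dlen);
            cur->have_stats = 1;
        }
        /* Attributes of any other type are skipped. */
        off += NLA_ALIGN(na.nla_len);
    }
}

/* Decode up to max records from the attribute area; returns the count. */
static size_t parse_taskstats_attrs(const void *attrs, size_t len,
                                    struct ts_record *out, size_t max)
{
    size_t n = 0;

    assert(attrs != NULL && out != NULL);
    walk_attrs(attrs, len, out, max, &n);
    return n;
}
```

A reply to a command would yield one record. An exit message for the last
thread of a thread group would yield two, the per-pid record followed by the
per-tgid record.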

per-tgid stats
--------------

Taskstats provides per-process stats, in addition to per-task stats, since
resource management is often done at a process granularity and aggregating task
stats in userspace alone is inefficient and potentially inaccurate (due to lack
of atomicity).

However, maintaining per-process, in addition to per-task stats, within the
kernel has space and time overheads. To address this, the taskstats code
accumulates each exiting task's statistics into a process-wide data structure.
When the last task of a process exits, the process level data accumulated also
gets sent to userspace (along with the per-task data).

When a user queries per-tgid data, the stats of all live threads in the group
are summed and added to the accumulated total for previously exited threads of
the same thread group.
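
The accumulation scheme can be modelled in a few lines. This is a toy
illustration only: struct toy_stats and struct toy_process are hypothetical
stand-ins with two counters, not the kernel's data structures, and locking is
ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two counters standing in for the many fields of struct taskstats. */
struct toy_stats {
    uint64_t cpu_delay_total;
    uint64_t blkio_delay_total;
};

/* Hypothetical process-wide state. */
struct toy_process {
    struct toy_stats exited;    /* sum over threads that already exited */
};

static void toy_add(struct toy_stats *dst, const struct toy_stats *src)
{
    dst->cpu_delay_total += src->cpu_delay_total;
    dst->blkio_delay_total += src->blkio_delay_total;
}

/* On thread exit: fold the thread's stats into the process-wide total. */
static void toy_thread_exit(struct toy_process *proc,
                            const struct toy_stats *t)
{
    assert(proc != NULL && t != NULL);
    toy_add(&proc->exited, t);
}

/* On a per-tgid query: the exited total plus every thread still alive. */
static struct toy_stats toy_query_tgid(const struct toy_process *proc,
                                       const struct toy_stats *live,
                                       size_t nr_live)
{
    struct toy_stats sum = proc->exited;
    size_t i;

    for (i = 0; i < nr_live; i++)
        toy_add(&sum, &live[i]);
    return sum;
}
```

Only one extra structure per process is kept up to date, and only at exit
time. The cost of summing live threads is paid when somebody actually asks
for per-tgid data.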

Extending taskstats
-------------------

There are two ways to extend the taskstats interface to export more
per-task/process stats as patches to collect them get added to the kernel
in the future:

1. Adding more fields to the end of the existing struct taskstats. Backward
   compatibility is ensured by the version number within the
   structure. Userspace will use only the fields of the struct that correspond
   to the version it is using.

2. Defining separate statistic structs and using the netlink attributes
   interface to return them. Since userspace processes each netlink attribute
   independently, it can always ignore attributes whose type it does not
   understand (because it is using an older version of the interface).


Choosing between 1. and 2. is a matter of trading off flexibility and
overhead. If only a few fields need to be added, then 1. is the preferable
path since the kernel and userspace don't need to incur the overhead of
processing new netlink attributes. But if the new fields expand the existing
struct too much, requiring disparate userspace accounting utilities to
unnecessarily receive large structures whose fields are of no interest, then
extending the attributes structure would be worthwhile.
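
For option 1, a userspace reader might guard each field access with a check
like the one sketched here. The helper ts_field_usable() and the macro
TS_FIELD_END() are hypothetical names. The sketch deliberately does not encode
which version introduced which field: the caller supplies min_version after
consulting taskstats.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/taskstats.h>

/* Offset one past the end of a field within struct taskstats. */
#define TS_FIELD_END(field) \
    (offsetof(struct taskstats, field) + \
     sizeof(((struct taskstats *)0)->field))

/*
 * Return 1 if a field ending at field_end may be read from a payload of
 * payload_len bytes whose sender used at least min_version, else 0.
 */
static int ts_field_usable(const struct taskstats *ts, size_t payload_len,
                           size_t field_end, uint16_t min_version)
{
    assert(ts != NULL);

    /* The version field itself must have arrived before it is read. */
    if (payload_len < TS_FIELD_END(version))
        return 0;

    return ts->version >= min_version && payload_len >= field_end;
}
```

A caller would write something like
``ts_field_usable(ts, len, TS_FIELD_END(cpu_count), 1)`` before reading
``ts->cpu_count``, where ``len`` is the payload length of the
TASKSTATS_TYPE_STATS attribute.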

Flow control for taskstats
--------------------------

When the rate of task exits becomes large, a listener may not be able to keep
up with the kernel's rate of sending per-tid/tgid exit data, leading to data
loss. This possibility gets compounded when the taskstats structure gets
extended and the number of cpus grows large.

To avoid losing statistics, userspace should do one or more of the following:

- increase the receive buffer sizes for the netlink sockets opened by
  listeners to receive exit data.

- create more listeners and reduce the number of cpus being listened to by
  each listener. In the extreme case, there could be one listener for each cpu.
  Users may also consider setting the cpu affinity of the listener to the subset
  of cpus to which it listens, especially if they are listening to just one cpu.

Despite these measures, if userspace receives ENOBUFS error messages
indicating overflow of receive buffers, it should take measures to handle the
loss of data.
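
A listener could combine the first measure with ENOBUFS handling roughly as
follows. This is a sketch under stated assumptions: the helper names are
made up, the loss handling is reduced to a counter, and other errors such as
EINTR are left to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Request a receive buffer of `bytes` and return the size the kernel
 * reports afterwards, or -1 on error. The reported size may differ from
 * the request because the kernel applies its own limits and overhead.
 */
static int grow_rcvbuf(int fd, int bytes)
{
    int granted = 0;
    socklen_t len = sizeof(granted);

    assert(bytes > 0);
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) < 0)
        return -1;
    return granted;
}

/*
 * Receive one message. An ENOBUFS error means the receive buffer
 * overflowed and some exit records were dropped: count the event in
 * *lost and keep listening instead of giving up.
 */
static ssize_t recv_counting_loss(int fd, void *buf, size_t len,
                                  unsigned long *lost)
{
    for (;;) {
        ssize_t n = recv(fd, buf, len, 0);

        if (n >= 0 || errno != ENOBUFS)
            return n;
        (*lost)++;
    }
}
```

A non-zero loss counter tells the listener that its totals are incomplete,
which it could report alongside the statistics or use as a signal to register
fewer cpus per listener.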