mirror of https://github.com/Qortal/Brooklyn
You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
495 lines
23 KiB
495 lines
23 KiB
.. _numa_memory_policy: |
|
|
|
================== |
|
NUMA Memory Policy |
|
================== |
|
|
|
What is NUMA Memory Policy? |
|
============================ |
|
|
|
In the Linux kernel, "memory policy" determines from which node the kernel will |
|
allocate memory in a NUMA system or in an emulated NUMA system. Linux has |
|
supported platforms with Non-Uniform Memory Access architectures since 2.4.?. |
|
The current memory policy support was added to Linux 2.6 around May 2004. This |
|
document attempts to describe the concepts and APIs of the 2.6 memory policy |
|
support. |
|
|
|
Memory policies should not be confused with cpusets |
|
(``Documentation/admin-guide/cgroup-v1/cpusets.rst``) |
|
which is an administrative mechanism for restricting the nodes from which |
|
memory may be allocated by a set of processes. Memory policies are a |
|
programming interface that a NUMA-aware application can take advantage of. When |
|
both cpusets and policies are applied to a task, the restrictions of the cpuset |
|
takes priority. See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>` |
|
below for more details. |
|
|
|
Memory Policy Concepts |
|
====================== |
|
|
|
Scope of Memory Policies |
|
------------------------ |
|
|
|
The Linux kernel supports _scopes_ of memory policy, described here from |
|
most general to most specific: |
|
|
|
System Default Policy |
|
this policy is "hard coded" into the kernel. It is the policy |
|
that governs all page allocations that aren't controlled by |
|
one of the more specific policy scopes discussed below. When |
|
the system is "up and running", the system default policy will |
|
use "local allocation" described below. However, during boot |
|
up, the system default policy will be set to interleave |
|
allocations across all nodes with "sufficient" memory, so as |
|
not to overload the initial boot node with boot-time |
|
allocations. |
|
|
|
Task/Process Policy |
|
this is an optional, per-task policy. When defined for a |
|
specific task, this policy controls all page allocations made |
|
by or on behalf of the task that aren't controlled by a more |
|
specific scope. If a task does not define a task policy, then |
|
all page allocations that would have been controlled by the |
|
task policy "fall back" to the System Default Policy. |
|
|
|
The task policy applies to the entire address space of a task. Thus, |
|
it is inheritable, and indeed is inherited, across both fork() |
|
[clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task |
|
to establish the task policy for a child task exec()'d from an |
|
executable image that has no awareness of memory policy. See the |
|
:ref:`Memory Policy APIs <memory_policy_apis>` section, |
|
below, for an overview of the system call |
|
that a task may use to set/change its task/process policy. |
|
|
|
In a multi-threaded task, task policies apply only to the thread |
|
[Linux kernel task] that installs the policy and any threads |
|
subsequently created by that thread. Any sibling threads existing |
|
at the time a new task policy is installed retain their current |
|
policy. |
|
|
|
A task policy applies only to pages allocated after the policy is |
|
installed. Any pages already faulted in by the task when the task |
|
changes its task policy remain where they were allocated based on |
|
the policy at the time they were allocated. |
|
|
|
.. _vma_policy: |
|
|
|
VMA Policy |
|
A "VMA" or "Virtual Memory Area" refers to a range of a task's |
|
virtual address space. A task may define a specific policy for a range |
|
of its virtual address space. See the |
|
:ref:`Memory Policy APIs <memory_policy_apis>` section, |
|
below, for an overview of the mbind() system call used to set a VMA |
|
policy. |
|
|
|
A VMA policy will govern the allocation of pages that back |
|
this region of the address space. Any regions of the task's |
|
address space that don't have an explicit VMA policy will fall |
|
back to the task policy, which may itself fall back to the |
|
System Default Policy. |
|
|
|
VMA policies have a few complicating details: |
|
|
|
* VMA policy applies ONLY to anonymous pages. These include |
|
pages allocated for anonymous segments, such as the task |
|
stack and heap, and any regions of the address space |
|
mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is |
|
applied to a file mapping, it will be ignored if the mapping |
|
used the MAP_SHARED flag. If the file mapping used the |
|
MAP_PRIVATE flag, the VMA policy will only be applied when |
|
an anonymous page is allocated on an attempt to write to the |
|
mapping-- i.e., at Copy-On-Write. |
|
|
|
* VMA policies are shared between all tasks that share a |
|
virtual address space--a.k.a. threads--independent of when |
|
the policy is installed; and they are inherited across |
|
fork(). However, because VMA policies refer to a specific |
|
region of a task's address space, and because the address |
|
space is discarded and recreated on exec*(), VMA policies |
|
are NOT inheritable across exec(). Thus, only NUMA-aware |
|
applications may use VMA policies. |
|
|
|
* A task may install a new VMA policy on a sub-range of a |
|
previously mmap()ed region. When this happens, Linux splits |
|
the existing virtual memory area into 2 or 3 VMAs, each with |
|
it's own policy. |
|
|
|
* By default, VMA policy applies only to pages allocated after |
|
the policy is installed. Any pages already faulted into the |
|
VMA range remain where they were allocated based on the |
|
policy at the time they were allocated. However, since |
|
2.6.16, Linux supports page migration via the mbind() system |
|
call, so that page contents can be moved to match a newly |
|
installed policy. |
|
|
|
Shared Policy |
|
Conceptually, shared policies apply to "memory objects" mapped |
|
shared into one or more tasks' distinct address spaces. An |
|
application installs shared policies the same way as VMA |
|
policies--using the mbind() system call specifying a range of |
|
virtual addresses that map the shared object. However, unlike |
|
VMA policies, which can be considered to be an attribute of a |
|
range of a task's address space, shared policies apply |
|
directly to the shared object. Thus, all tasks that attach to |
|
the object share the policy, and all pages allocated for the |
|
shared object, by any task, will obey the shared policy. |
|
|
|
As of 2.6.22, only shared memory segments, created by shmget() or |
|
mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared |
|
policy support was added to Linux, the associated data structures were |
|
added to hugetlbfs shmem segments. At the time, hugetlbfs did not |
|
support allocation at fault time--a.k.a lazy allocation--so hugetlbfs |
|
shmem segments were never "hooked up" to the shared policy support. |
|
Although hugetlbfs segments now support lazy allocation, their support |
|
for shared policy has not been completed. |
|
|
|
As mentioned above in :ref:`VMA policies <vma_policy>` section, |
|
allocations of page cache pages for regular files mmap()ed |
|
with MAP_SHARED ignore any VMA policy installed on the virtual |
|
address range backed by the shared file mapping. Rather, |
|
shared page cache pages, including pages backing private |
|
mappings that have not yet been written by the task, follow |
|
task policy, if any, else System Default Policy. |
|
|
|
The shared policy infrastructure supports different policies on subset |
|
ranges of the shared object. However, Linux still splits the VMA of |
|
the task that installs the policy for each range of distinct policy. |
|
Thus, different tasks that attach to a shared memory segment can have |
|
different VMA configurations mapping that one shared object. This |
|
can be seen by examining the /proc/<pid>/numa_maps of tasks sharing |
|
a shared memory region, when one task has installed shared policy on |
|
one or more ranges of the region. |
|
|
|
Components of Memory Policies |
|
----------------------------- |
|
|
|
A NUMA memory policy consists of a "mode", optional mode flags, and |
|
an optional set of nodes. The mode determines the behavior of the |
|
policy, the optional mode flags determine the behavior of the mode, |
|
and the optional set of nodes can be viewed as the arguments to the |
|
policy behavior. |
|
|
|
Internally, memory policies are implemented by a reference counted |
|
structure, struct mempolicy. Details of this structure will be |
|
discussed in context, below, as required to explain the behavior. |
|
|
|
NUMA memory policy supports the following 4 behavioral modes: |
|
|
|
Default Mode--MPOL_DEFAULT |
|
This mode is only used in the memory policy APIs. Internally, |
|
MPOL_DEFAULT is converted to the NULL memory policy in all |
|
policy scopes. Any existing non-default policy will simply be |
|
removed when MPOL_DEFAULT is specified. As a result, |
|
MPOL_DEFAULT means "fall back to the next most specific policy |
|
scope." |
|
|
|
For example, a NULL or default task policy will fall back to the |
|
system default policy. A NULL or default vma policy will fall |
|
back to the task policy. |
|
|
|
When specified in one of the memory policy APIs, the Default mode |
|
does not use the optional set of nodes. |
|
|
|
It is an error for the set of nodes specified for this policy to |
|
be non-empty. |
|
|
|
MPOL_BIND |
|
This mode specifies that memory must come from the set of |
|
nodes specified by the policy. Memory will be allocated from |
|
the node in the set with sufficient free memory that is |
|
closest to the node where the allocation takes place. |
|
|
|
MPOL_PREFERRED |
|
This mode specifies that the allocation should be attempted |
|
from the single node specified in the policy. If that |
|
allocation fails, the kernel will search other nodes, in order |
|
of increasing distance from the preferred node based on |
|
information provided by the platform firmware. |
|
|
|
Internally, the Preferred policy uses a single node--the |
|
preferred_node member of struct mempolicy. When the internal |
|
mode flag MPOL_F_LOCAL is set, the preferred_node is ignored |
|
and the policy is interpreted as local allocation. "Local" |
|
allocation policy can be viewed as a Preferred policy that |
|
starts at the node containing the cpu where the allocation |
|
takes place. |
|
|
|
It is possible for the user to specify that local allocation |
|
is always preferred by passing an empty nodemask with this |
|
mode. If an empty nodemask is passed, the policy cannot use |
|
the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags |
|
described below. |
|
|
|
MPOL_INTERLEAVED |
|
This mode specifies that page allocations be interleaved, on a |
|
page granularity, across the nodes specified in the policy. |
|
This mode also behaves slightly differently, based on the |
|
context where it is used: |
|
|
|
For allocation of anonymous pages and shared memory pages, |
|
Interleave mode indexes the set of nodes specified by the |
|
policy using the page offset of the faulting address into the |
|
segment [VMA] containing the address modulo the number of |
|
nodes specified by the policy. It then attempts to allocate a |
|
page, starting at the selected node, as if the node had been |
|
specified by a Preferred policy or had been selected by a |
|
local allocation. That is, allocation will follow the per |
|
node zonelist. |
|
|
|
For allocation of page cache pages, Interleave mode indexes |
|
the set of nodes specified by the policy using a node counter |
|
maintained per task. This counter wraps around to the lowest |
|
specified node after it reaches the highest specified node. |
|
This will tend to spread the pages out over the nodes |
|
specified by the policy based on the order in which they are |
|
allocated, rather than based on any page offset into an |
|
address range or file. During system boot up, the temporary |
|
interleaved system default policy works in this mode. |
|
|
|
NUMA memory policy supports the following optional mode flags: |
|
|
|
MPOL_F_STATIC_NODES |
|
This flag specifies that the nodemask passed by |
|
the user should not be remapped if the task or VMA's set of allowed |
|
nodes changes after the memory policy has been defined. |
|
|
|
Without this flag, any time a mempolicy is rebound because of a |
|
change in the set of allowed nodes, the node (Preferred) or |
|
nodemask (Bind, Interleave) is remapped to the new set of |
|
allowed nodes. This may result in nodes being used that were |
|
previously undesired. |
|
|
|
With this flag, if the user-specified nodes overlap with the |
|
nodes allowed by the task's cpuset, then the memory policy is |
|
applied to their intersection. If the two sets of nodes do not |
|
overlap, the Default policy is used. |
|
|
|
For example, consider a task that is attached to a cpuset with |
|
mems 1-3 that sets an Interleave policy over the same set. If |
|
the cpuset's mems change to 3-5, the Interleave will now occur |
|
over nodes 3, 4, and 5. With this flag, however, since only node |
|
3 is allowed from the user's nodemask, the "interleave" only |
|
occurs over that node. If no nodes from the user's nodemask are |
|
now allowed, the Default behavior is used. |
|
|
|
MPOL_F_STATIC_NODES cannot be combined with the |
|
MPOL_F_RELATIVE_NODES flag. It also cannot be used for |
|
MPOL_PREFERRED policies that were created with an empty nodemask |
|
(local allocation). |
|
|
|
MPOL_F_RELATIVE_NODES |
|
This flag specifies that the nodemask passed |
|
by the user will be mapped relative to the set of the task or VMA's |
|
set of allowed nodes. The kernel stores the user-passed nodemask, |
|
and if the allowed nodes changes, then that original nodemask will |
|
be remapped relative to the new set of allowed nodes. |
|
|
|
Without this flag (and without MPOL_F_STATIC_NODES), anytime a |
|
mempolicy is rebound because of a change in the set of allowed |
|
nodes, the node (Preferred) or nodemask (Bind, Interleave) is |
|
remapped to the new set of allowed nodes. That remap may not |
|
preserve the relative nature of the user's passed nodemask to its |
|
set of allowed nodes upon successive rebinds: a nodemask of |
|
1,3,5 may be remapped to 7-9 and then to 1-3 if the set of |
|
allowed nodes is restored to its original state. |
|
|
|
With this flag, the remap is done so that the node numbers from |
|
the user's passed nodemask are relative to the set of allowed |
|
nodes. In other words, if nodes 0, 2, and 4 are set in the user's |
|
nodemask, the policy will be effected over the first (and in the |
|
Bind or Interleave case, the third and fifth) nodes in the set of |
|
allowed nodes. The nodemask passed by the user represents nodes |
|
relative to task or VMA's set of allowed nodes. |
|
|
|
If the user's nodemask includes nodes that are outside the range |
|
of the new set of allowed nodes (for example, node 5 is set in |
|
the user's nodemask when the set of allowed nodes is only 0-3), |
|
then the remap wraps around to the beginning of the nodemask and, |
|
if not already set, sets the node in the mempolicy nodemask. |
|
|
|
For example, consider a task that is attached to a cpuset with |
|
mems 2-5 that sets an Interleave policy over the same set with |
|
MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the |
|
interleave now occurs over nodes 3,5-7. If the cpuset's mems |
|
then change to 0,2-3,5, then the interleave occurs over nodes |
|
0,2-3,5. |
|
|
|
Thanks to the consistent remapping, applications preparing |
|
nodemasks to specify memory policies using this flag should |
|
disregard their current, actual cpuset imposed memory placement |
|
and prepare the nodemask as if they were always located on |
|
memory nodes 0 to N-1, where N is the number of memory nodes the |
|
policy is intended to manage. Let the kernel then remap to the |
|
set of memory nodes allowed by the task's cpuset, as that may |
|
change over time. |
|
|
|
MPOL_F_RELATIVE_NODES cannot be combined with the |
|
MPOL_F_STATIC_NODES flag. It also cannot be used for |
|
MPOL_PREFERRED policies that were created with an empty nodemask |
|
(local allocation). |
|
|
|
Memory Policy Reference Counting |
|
================================ |
|
|
|
To resolve use/free races, struct mempolicy contains an atomic reference |
|
count field. Internal interfaces, mpol_get()/mpol_put() increment and |
|
decrement this reference count, respectively. mpol_put() will only free |
|
the structure back to the mempolicy kmem cache when the reference count |
|
goes to zero. |
|
|
|
When a new memory policy is allocated, its reference count is initialized |
|
to '1', representing the reference held by the task that is installing the |
|
new policy. When a pointer to a memory policy structure is stored in another |
|
structure, another reference is added, as the task's reference will be dropped |
|
on completion of the policy installation. |
|
|
|
During run-time "usage" of the policy, we attempt to minimize atomic operations |
|
on the reference count, as this can lead to cache lines bouncing between cpus |
|
and NUMA nodes. "Usage" here means one of the following: |
|
|
|
1) querying of the policy, either by the task itself [using the get_mempolicy() |
|
API discussed below] or by another task using the /proc/<pid>/numa_maps |
|
interface. |
|
|
|
2) examination of the policy to determine the policy mode and associated node |
|
or node lists, if any, for page allocation. This is considered a "hot |
|
path". Note that for MPOL_BIND, the "usage" extends across the entire |
|
allocation process, which may sleep during page reclaimation, because the |
|
BIND policy nodemask is used, by reference, to filter ineligible nodes. |
|
|
|
We can avoid taking an extra reference during the usages listed above as |
|
follows: |
|
|
|
1) we never need to get/free the system default policy as this is never |
|
changed nor freed, once the system is up and running. |
|
|
|
2) for querying the policy, we do not need to take an extra reference on the |
|
target task's task policy nor vma policies because we always acquire the |
|
task's mm's mmap_lock for read during the query. The set_mempolicy() and |
|
mbind() APIs [see below] always acquire the mmap_lock for write when |
|
installing or replacing task or vma policies. Thus, there is no possibility |
|
of a task or thread freeing a policy while another task or thread is |
|
querying it. |
|
|
|
3) Page allocation usage of task or vma policy occurs in the fault path where |
|
we hold them mmap_lock for read. Again, because replacing the task or vma |
|
policy requires that the mmap_lock be held for write, the policy can't be |
|
freed out from under us while we're using it for page allocation. |
|
|
|
4) Shared policies require special consideration. One task can replace a |
|
shared memory policy while another task, with a distinct mmap_lock, is |
|
querying or allocating a page based on the policy. To resolve this |
|
potential race, the shared policy infrastructure adds an extra reference |
|
to the shared policy during lookup while holding a spin lock on the shared |
|
policy management structure. This requires that we drop this extra |
|
reference when we're finished "using" the policy. We must drop the |
|
extra reference on shared policies in the same query/allocation paths |
|
used for non-shared policies. For this reason, shared policies are marked |
|
as such, and the extra reference is dropped "conditionally"--i.e., only |
|
for shared policies. |
|
|
|
Because of this extra reference counting, and because we must lookup |
|
shared policies in a tree structure under spinlock, shared policies are |
|
more expensive to use in the page allocation path. This is especially |
|
true for shared policies on shared memory regions shared by tasks running |
|
on different NUMA nodes. This extra overhead can be avoided by always |
|
falling back to task or system default policy for shared memory regions, |
|
or by prefaulting the entire shared memory region into memory and locking |
|
it down. However, this might not be appropriate for all applications. |
|
|
|
.. _memory_policy_apis: |
|
|
|
Memory Policy APIs |
|
================== |
|
|
|
Linux supports 3 system calls for controlling memory policy. These APIS |
|
always affect only the calling task, the calling task's address space, or |
|
some shared object mapped into the calling task's address space. |
|
|
|
.. note:: |
|
the headers that define these APIs and the parameter data types for |
|
user space applications reside in a package that is not part of the |
|
Linux kernel. The kernel system call interfaces, with the 'sys\_' |
|
prefix, are defined in <linux/syscalls.h>; the mode and flag |
|
definitions are defined in <linux/mempolicy.h>. |
|
|
|
Set [Task] Memory Policy:: |
|
|
|
long set_mempolicy(int mode, const unsigned long *nmask, |
|
unsigned long maxnode); |
|
|
|
Set's the calling task's "task/process memory policy" to mode |
|
specified by the 'mode' argument and the set of nodes defined by |
|
'nmask'. 'nmask' points to a bit mask of node ids containing at least |
|
'maxnode' ids. Optional mode flags may be passed by combining the |
|
'mode' argument with the flag (for example: MPOL_INTERLEAVE | |
|
MPOL_F_STATIC_NODES). |
|
|
|
See the set_mempolicy(2) man page for more details |
|
|
|
|
|
Get [Task] Memory Policy or Related Information:: |
|
|
|
long get_mempolicy(int *mode, |
|
const unsigned long *nmask, unsigned long maxnode, |
|
void *addr, int flags); |
|
|
|
Queries the "task/process memory policy" of the calling task, or the |
|
policy or location of a specified virtual address, depending on the |
|
'flags' argument. |
|
|
|
See the get_mempolicy(2) man page for more details |
|
|
|
|
|
Install VMA/Shared Policy for a Range of Task's Address Space:: |
|
|
|
long mbind(void *start, unsigned long len, int mode, |
|
const unsigned long *nmask, unsigned long maxnode, |
|
unsigned flags); |
|
|
|
mbind() installs the policy specified by (mode, nmask, maxnodes) as a |
|
VMA policy for the range of the calling task's address space specified |
|
by the 'start' and 'len' arguments. Additional actions may be |
|
requested via the 'flags' argument. |
|
|
|
See the mbind(2) man page for more details. |
|
|
|
Memory Policy Command Line Interface |
|
==================================== |
|
|
|
Although not strictly part of the Linux implementation of memory policy, |
|
a command line tool, numactl(8), exists that allows one to: |
|
|
|
+ set the task policy for a specified program via set_mempolicy(2), fork(2) and |
|
exec(2) |
|
|
|
+ set the shared policy for a shared memory segment via mbind(2) |
|
|
|
The numactl(8) tool is packaged with the run-time version of the library |
|
containing the memory policy system call wrappers. Some distributions |
|
package the headers and compile-time libraries in a separate development |
|
package. |
|
|
|
.. _mem_pol_and_cpusets: |
|
|
|
Memory Policies and cpusets |
|
=========================== |
|
|
|
Memory policies work within cpusets as described above. For memory policies |
|
that require a node or set of nodes, the nodes are restricted to the set of |
|
nodes whose memories are allowed by the cpuset constraints. If the nodemask |
|
specified for the policy contains nodes that are not allowed by the cpuset and |
|
MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes |
|
specified for the policy and the set of nodes with memory is used. If the |
|
result is the empty set, the policy is considered invalid and cannot be |
|
installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped |
|
onto and folded into the task's set of allowed nodes as previously described. |
|
|
|
The interaction of memory policies and cpusets can be problematic when tasks |
|
in two cpusets share access to a memory region, such as shared memory segments |
|
created by shmget() of mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and |
|
any of the tasks install shared policy on the region, only nodes whose |
|
memories are allowed in both cpusets may be used in the policies. Obtaining |
|
this information requires "stepping outside" the memory policy APIs to use the |
|
cpuset information and requires that one know in what cpusets other task might |
|
be attaching to the shared region. Furthermore, if the cpusets' allowed |
|
memory sets are disjoint, "local" allocation is the only valid policy.
|
|
|