.. SPDX-License-Identifier: GPL-2.0

=============
Kernel Stacks
=============

Kernel stacks on x86-64 bit
===========================

Most of the text from Keith Owens, hacked by AK

x86_64 page size (PAGE_SIZE) is 4K.

Like all other architectures, x86_64 has a kernel stack for every
active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big.
These stacks contain useful data as long as a thread is alive or a
zombie. While the thread is in user space the kernel stack is empty
except for the thread_info structure at the bottom.

In addition to the per thread stacks, there are specialized stacks
associated with each CPU. These stacks are only used while the kernel
is in control on that CPU; when a CPU returns to user space the
specialized stacks contain no useful data. The main CPU stacks are:

* Interrupt stack. IRQ_STACK_SIZE

  Used for external hardware interrupts. If this is the first external
  hardware interrupt (i.e. not a nested hardware interrupt) then the
  kernel switches from the current task to the interrupt stack. Like
  the split thread and interrupt stacks on i386, this gives more room
  for kernel interrupt processing without having to increase the size
  of every per thread stack.

  The interrupt stack is also used when processing a softirq.

Switching to the kernel interrupt stack is done by software based on a
per CPU interrupt nest counter. This is needed because x86-64 "IST"
hardware stacks cannot nest without races.

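The nest-counter based switch described above can be sketched in plain C.
This is only a user-space model under stated assumptions, not the kernel's
entry code: ``struct cpu_model``, ``irq_enter_stack()`` and
``irq_exit_stack()`` are hypothetical names, and the "stack switch" is
reduced to recording which stack is in use.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stack identifiers, for illustration only. */
enum stack_id { STACK_TASK, STACK_IRQ };

/* Hypothetical per-CPU state; not a real kernel structure. */
struct cpu_model {
    int irq_nest;           /* hardware interrupts currently in progress */
    enum stack_id current;  /* stack this CPU is running on */
};

/* Interrupt entry: switch stacks only for the outermost interrupt. */
static bool irq_enter_stack(struct cpu_model *cpu)
{
    if (cpu->irq_nest++ == 0) {
        cpu->current = STACK_IRQ;
        return true;        /* switched from the task stack */
    }
    return false;           /* nested: already on the interrupt stack */
}

/* Interrupt exit: switch back only when the outermost interrupt ends. */
static bool irq_exit_stack(struct cpu_model *cpu)
{
    assert(cpu->irq_nest > 0);
    if (--cpu->irq_nest == 0) {
        cpu->current = STACK_TASK;
        return true;        /* back on the task stack */
    }
    return false;
}
```

In this model, two nested entries followed by two exits switch stacks only
on the first entry and on the last exit; the nested interrupt keeps running
on the interrupt stack below the frames of the outer one.
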
x86_64 also has a feature which is not available on i386, the ability
to automatically switch to a new stack for designated events such as
double fault or NMI, which makes it easier to handle these unusual
events on x86_64. This feature is called the Interrupt Stack Table
(IST). There can be up to 7 IST entries per CPU. The IST code is an
index into the Task State Segment (TSS). The IST entries in the TSS
point to dedicated stacks; each stack can be a different size.

An IST is selected by a non-zero value in the IST field of an
interrupt-gate descriptor. When an interrupt occurs and the hardware
loads such a descriptor, the hardware automatically sets the new stack
pointer based on the IST value, then invokes the interrupt handler. If
the interrupt came from user mode, then the interrupt handler prologue
will switch back to the per-thread stack. If software wants to allow
nested IST interrupts then the handler must adjust the IST values on
entry to and exit from the interrupt handler. (This is occasionally
done, e.g. for debug exceptions.)

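The selection rule above can be modelled in a few lines of C. This is a
simplified sketch of the behaviour, not hardware-accurate code and not a
kernel API: ``struct tss_model`` and ``select_stack()`` are hypothetical
names, and ``default_sp`` stands for whatever stack pointer the non-IST
path would have used.

```c
#include <assert.h>
#include <stdint.h>

#define IST_ENTRIES 7   /* up to 7 IST entries per CPU */

/* Simplified model of the IST slots in the TSS (not the real layout). */
struct tss_model {
    uint64_t ist[IST_ENTRIES];  /* ist[0] belongs to IST code 1 */
};

/*
 * Return the stack pointer an event starts on. ist_field is the IST
 * value from the interrupt-gate descriptor; 0 means "no IST".
 */
static uint64_t select_stack(const struct tss_model *tss,
                             unsigned int ist_field, uint64_t default_sp)
{
    assert(ist_field <= IST_ENTRIES);
    if (ist_field == 0)
        return default_sp;          /* no dedicated stack requested */
    return tss->ist[ist_field - 1]; /* always the same top-of-stack value */
}
```

Because a given IST code always yields the same top-of-stack value, a second
event with that code arriving while the first is still being handled would
start writing at the same address. That is why a handler that wants same-code
nesting has to adjust the IST entry itself.
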
Events with different IST codes (i.e. with different stacks) can be
nested. For example, a debug interrupt can safely be interrupted by an
NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack
pointers on entry to and exit from all IST events, in theory allowing
IST events with the same code to be nested. However, in most cases the
stack size allocated to an IST assumes no nesting for the same code.
If that assumption is ever broken then the stacks will become corrupt.

The currently assigned IST stacks are:

* ESTACK_DF. EXCEPTION_STKSZ (PAGE_SIZE).

  Used for interrupt 8 - Double Fault Exception (#DF).

  Invoked when handling one exception causes another exception. Happens
  when the kernel is very confused (e.g. kernel stack pointer corrupt).
  Using a separate stack allows the kernel to recover from it well enough
  in many cases to still output an oops.

* ESTACK_NMI. EXCEPTION_STKSZ (PAGE_SIZE).

  Used for non-maskable interrupts (NMI).

  NMI can be delivered at any time, including when the kernel is in the
  middle of switching stacks. Using IST for NMI events avoids making
  assumptions about the previous state of the kernel stack.

* ESTACK_DB. EXCEPTION_STKSZ (PAGE_SIZE).

  Used for hardware debug interrupts (interrupt 1) and for software
  debug interrupts (INT3).

  When debugging a kernel, debug interrupts (both hardware and
  software) can occur at any time. Using IST for these interrupts
  avoids making assumptions about the previous state of the kernel
  stack.

  To handle nested #DB correctly there exist two instances of DB stacks. On
  #DB entry the IST stack pointer for #DB is switched to the second instance
  so a nested #DB starts from a clean stack. The nested #DB switches
  the IST stack pointer to a guard hole to catch triple nesting.

* ESTACK_MCE. EXCEPTION_STKSZ (PAGE_SIZE).

  Used for interrupt 18 - Machine Check Exception (#MC).

  MCE can be delivered at any time, including when the kernel is in the
  middle of switching stacks. Using IST for MCE events avoids making
  assumptions about the previous state of the kernel stack.

For more details see the Intel IA32 or AMD AMD64 architecture manuals.

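The nested #DB handling described for ESTACK_DB above can be pictured as an
IST slot that is retargeted on every entry and restored on every exit. The
following is a rough model only; ``enum db_target``, ``db_entry()`` and
``db_exit()`` are hypothetical names chosen for illustration, and the real
entry code works on stack addresses rather than an enum.

```c
#include <assert.h>

/* Hypothetical targets for the #DB IST slot, in nesting order. */
enum db_target { DB_STACK, DB1_STACK, DB_GUARD_HOLE };

/* Where the next #DB delivered on this CPU would land. */
struct db_ist_model {
    enum db_target next;
};

/*
 * #DB entry: report the target this event lands on, then point the IST
 * slot at the next one so that a nested #DB does not reuse this stack.
 */
static enum db_target db_entry(struct db_ist_model *m)
{
    enum db_target cur = m->next;

    if (cur != DB_GUARD_HOLE)
        m->next = (enum db_target)(cur + 1);
    return cur;     /* DB_GUARD_HOLE means triple nesting was caught */
}

/* #DB exit: point the IST slot back at the previous target. */
static void db_exit(struct db_ist_model *m)
{
    assert(m->next != DB_STACK);
    m->next = (enum db_target)(m->next - 1);
}
```

In this model the first #DB runs on the first stack instance, a nested one on
the second, and a third level of nesting lands on the guard hole, which in a
real system would fault instead of silently corrupting a stack.
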
Printing backtraces on x86
==========================

The question about the '?' preceding function names in an x86 stacktrace
keeps popping up, so here is an in-depth explanation. It helps if the reader
stares at print_context_stack() and the whole machinery in and around
arch/x86/kernel/dumpstack.c.

Adapted from Ingo's mail, Message-ID: <[email protected]>:

We always scan the full kernel stack for return addresses stored on
the kernel stack(s) [1]_, from stack top to stack bottom, and print out
anything that 'looks like' a kernel text address.

If it fits into the frame pointer chain, we print it without a question
mark, knowing that it's part of the real backtrace.

If the address does not fit into our expected frame pointer chain we
still print it, but we print a '?'. It can mean two things:

 - either the address is not part of the call chain: it's just stale
   values on the kernel stack, from earlier function calls. This is
   the common case.

 - or it is part of the call chain, but the frame pointer was not set
   up properly within the function, so we don't recognize it.

This way we will always print out the real call chain (plus a few more
entries), regardless of whether the frame pointer was set up correctly
or not - but in most cases we'll get the call chain right as well. The
entries printed are strictly in stack order, so you can deduce more
information from that as well.

The most important property of this method is that we _never_ lose
information: we always strive to print _all_ addresses on the stack(s)
that look like kernel text addresses, so if debug information is wrong,
we still print out the real call chain as well - just with more question
marks than ideal.

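The scan described above can be illustrated with a small user-space model.
This is a sketch under stated assumptions, not the code in dumpstack.c: the
stack is an array of words with index 0 as the stack top, the text range
``TEXT_START``/``TEXT_END`` is made up, a saved frame pointer is stored as
the array index of the next frame, and ``scan_stack()`` is a hypothetical
helper.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/* Made-up bounds standing in for the kernel text section. */
#define TEXT_START UINT64_C(0x1000)
#define TEXT_END   UINT64_C(0x2000)

struct scan_result {
    size_t reliable;    /* entries printed without '?' */
    size_t guessed;     /* entries printed with '?' */
};

static int looks_like_text(uint64_t word)
{
    return word >= TEXT_START && word < TEXT_END;
}

/*
 * Scan stack[0..n) from stack top to stack bottom. fp is the index of
 * the slot holding the innermost saved frame pointer; in this model the
 * matching return address sits in the slot right after it.
 */
static struct scan_result scan_stack(const uint64_t *stack, size_t n,
                                     size_t fp, FILE *out)
{
    struct scan_result res = { 0, 0 };

    assert(stack != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        uint64_t word = stack[i];

        if (!looks_like_text(word))
            continue;
        if (fp < n && i == fp + 1) {
            /* Fits the frame pointer chain: part of the real backtrace. */
            fprintf(out, " 0x%" PRIx64 "\n", word);
            res.reliable++;
            fp = (size_t)stack[fp];     /* follow the chain */
        } else {
            /* Stale value, or a frame without a proper frame pointer. */
            fprintf(out, " ? 0x%" PRIx64 "\n", word);
            res.guessed++;
        }
    }
    return res;
}
```

Every word that looks like a text address is printed, so nothing is lost
even when the frame pointer chain is broken; the chain only decides which
entries get the '?' prefix.
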
.. [1] For things like IRQ and IST stacks, we also scan those stacks, in
       the right order, and try to cross from one stack into another
       reconstructing the call chain. This works most of the time.