forked from Qortal/Brooklyn
You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
198 lines
8.9 KiB
198 lines
8.9 KiB
============================================================= |
|
An ad-hoc collection of notes on IA64 MCA and INIT processing |
|
============================================================= |
|
|
|
Feel free to update it with notes about any area that is not clear. |
|
|
|
--- |
|
|
|
MCA/INIT are completely asynchronous. They can occur at any time, when |
|
the OS is in any state. Including when one of the cpus is already |
|
holding a spinlock. Trying to get any lock from MCA/INIT state is |
|
asking for deadlock. Also the state of structures that are protected |
|
by locks is indeterminate, including linked lists. |
|
|
|
--- |
|
|
|
The complicated ia64 MCA process. All of this is mandated by Intel's |
|
specification for ia64 SAL, error recovery and unwind, it is not as |
|
if we have a choice here. |
|
|
|
* MCA occurs on one cpu, usually due to a double bit memory error. |
|
This is the monarch cpu. |
|
|
|
* SAL sends an MCA rendezvous interrupt (which is a normal interrupt) |
|
to all the other cpus, the slaves. |
|
|
|
* Slave cpus that receive the MCA interrupt call down into SAL, they |
|
end up spinning disabled while the MCA is being serviced. |
|
|
|
* If any slave cpu was already spinning disabled when the MCA occurred |
|
then it cannot service the MCA interrupt. SAL waits ~20 seconds then |
|
sends an unmaskable INIT event to the slave cpus that have not |
|
already rendezvoused. |
|
|
|
* Because MCA/INIT can be delivered at any time, including when the cpu |
|
is down in PAL in physical mode, the registers at the time of the |
|
event are _completely_ undefined. In particular the MCA/INIT |
|
handlers cannot rely on the thread pointer, PAL physical mode can |
|
(and does) modify TP. It is allowed to do that as long as it resets |
|
TP on return. However MCA/INIT events expose us to these PAL |
|
internal TP changes. Hence curr_task(). |
|
|
|
* If an MCA/INIT event occurs while the kernel was running (not user |
|
space) and the kernel has called PAL then the MCA/INIT handler cannot |
|
assume that the kernel stack is in a fit state to be used. Mainly |
|
because PAL may or may not maintain the stack pointer internally. |
|
Because the MCA/INIT handlers cannot trust the kernel stack, they |
|
have to use their own, per-cpu stacks. The MCA/INIT stacks are |
|
preformatted with just enough task state to let the relevant handlers |
|
do their job. |
|
|
|
* Unlike most other architectures, the ia64 struct task is embedded in |
|
the kernel stack[1]. So switching to a new kernel stack means that |
|
we switch to a new task as well. Because various bits of the kernel |
|
assume that current points into the struct task, switching to a new |
|
stack also means a new value for current. |
|
|
|
* Once all slaves have rendezvoused and are spinning disabled, the |
|
monarch is entered. The monarch now tries to diagnose the problem |
|
and decide if it can recover or not. |
|
|
|
* Part of the monarch's job is to look at the state of all the other |
|
tasks. The only way to do that on ia64 is to call the unwinder, |
|
as mandated by Intel. |
|
|
|
* The starting point for the unwind depends on whether a task is |
|
running or not. That is, whether it is on a cpu or is blocked. The |
|
monarch has to determine whether or not a task is on a cpu before it |
|
knows how to start unwinding it. The tasks that received an MCA or |
|
INIT event are no longer running, they have been converted to blocked |
|
tasks. But (and its a big but), the cpus that received the MCA |
|
rendezvous interrupt are still running on their normal kernel stacks! |
|
|
|
* To distinguish between these two cases, the monarch must know which |
|
tasks are on a cpu and which are not. Hence each slave cpu that |
|
switches to an MCA/INIT stack, registers its new stack using |
|
set_curr_task(), so the monarch can tell that the _original_ task is |
|
no longer running on that cpu. That gives us a decent chance of |
|
getting a valid backtrace of the _original_ task. |
|
|
|
* MCA/INIT can be nested, to a depth of 2 on any cpu. In the case of a |
|
nested error, we want diagnostics on the MCA/INIT handler that |
|
failed, not on the task that was originally running. Again this |
|
requires set_curr_task() so the MCA/INIT handlers can register their |
|
own stack as running on that cpu. Then a recursive error gets a |
|
trace of the failing handler's "task". |
|
|
|
[1] |
|
My (Keith Owens) original design called for ia64 to separate its |
|
struct task and the kernel stacks. Then the MCA/INIT data would be |
|
chained stacks like i386 interrupt stacks. But that required |
|
radical surgery on the rest of ia64, plus extra hard wired TLB |
|
entries with its associated performance degradation. David |
|
Mosberger vetoed that approach. Which meant that separate kernel |
|
stacks meant separate "tasks" for the MCA/INIT handlers. |
|
|
|
--- |
|
|
|
INIT is less complicated than MCA. Pressing the nmi button or using |
|
the equivalent command on the management console sends INIT to all |
|
cpus. SAL picks one of the cpus as the monarch and the rest are |
|
slaves. All the OS INIT handlers are entered at approximately the same |
|
time. The OS monarch prints the state of all tasks and returns, after |
|
which the slaves return and the system resumes. |
|
|
|
At least that is what is supposed to happen. Alas there are broken |
|
versions of SAL out there. Some drive all the cpus as monarchs. Some |
|
drive them all as slaves. Some drive one cpu as monarch, wait for that |
|
cpu to return from the OS then drive the rest as slaves. Some versions |
|
of SAL cannot even cope with returning from the OS, they spin inside |
|
SAL on resume. The OS INIT code has workarounds for some of these |
|
broken SAL symptoms, but some simply cannot be fixed from the OS side. |
|
|
|
--- |
|
|
|
The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer |
|
violations. Unfortunately MCA/INIT start off as massive layer |
|
violations (can occur at _any_ time) and they build from there. |
|
|
|
At least ia64 makes an attempt at recovering from hardware errors, but |
|
it is a difficult problem because of the asynchronous nature of these |
|
errors. When processing an unmaskable interrupt we sometimes need |
|
special code to cope with our inability to take any locks. |
|
|
|
--- |
|
|
|
How is ia64 MCA/INIT different from x86 NMI? |
|
|
|
* x86 NMI typically gets delivered to one cpu. MCA/INIT gets sent to |
|
all cpus. |
|
|
|
* x86 NMI cannot be nested. MCA/INIT can be nested, to a depth of 2 |
|
per cpu. |
|
|
|
* x86 has a separate struct task which points to one of multiple kernel |
|
stacks. ia64 has the struct task embedded in the single kernel |
|
stack, so switching stack means switching task. |
|
|
|
* x86 does not call the BIOS so the NMI handler does not have to worry |
|
about any registers having changed. MCA/INIT can occur while the cpu |
|
is in PAL in physical mode, with undefined registers and an undefined |
|
kernel stack. |
|
|
|
* i386 backtrace is not very sensitive to whether a process is running |
|
or not. ia64 unwind is very, very sensitive to whether a process is |
|
running or not. |
|
|
|
--- |
|
|
|
What happens when MCA/INIT is delivered what a cpu is running user |
|
space code? |
|
|
|
The user mode registers are stored in the RSE area of the MCA/INIT on |
|
entry to the OS and are restored from there on return to SAL, so user |
|
mode registers are preserved across a recoverable MCA/INIT. Since the |
|
OS has no idea what unwind data is available for the user space stack, |
|
MCA/INIT never tries to backtrace user space. Which means that the OS |
|
does not bother making the user space process look like a blocked task, |
|
i.e. the OS does not copy pt_regs and switch_stack to the user space |
|
stack. Also the OS has no idea how big the user space RSE and memory |
|
stacks are, which makes it too risky to copy the saved state to a user |
|
mode stack. |
|
|
|
--- |
|
|
|
How do we get a backtrace on the tasks that were running when MCA/INIT |
|
was delivered? |
|
|
|
mca.c:::ia64_mca_modify_original_stack(). That identifies and |
|
verifies the original kernel stack, copies the dirty registers from |
|
the MCA/INIT stack's RSE to the original stack's RSE, copies the |
|
skeleton struct pt_regs and switch_stack to the original stack, fills |
|
in the skeleton structures from the PAL minstate area and updates the |
|
original stack's thread.ksp. That makes the original stack look |
|
exactly like any other blocked task, i.e. it now appears to be |
|
sleeping. To get a backtrace, just start with thread.ksp for the |
|
original task and unwind like any other sleeping task. |
|
|
|
--- |
|
|
|
How do we identify the tasks that were running when MCA/INIT was |
|
delivered? |
|
|
|
If the previous task has been verified and converted to a blocked |
|
state, then sos->prev_task on the MCA/INIT stack is updated to point to |
|
the previous task. You can look at that field in dumps or debuggers. |
|
To help distinguish between the handler and the original tasks, |
|
handlers have _TIF_MCA_INIT set in thread_info.flags. |
|
|
|
The sos data is always in the MCA/INIT handler stack, at offset |
|
MCA_SOS_OFFSET. You can get that value from mca_asm.h or calculate it |
|
as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct |
|
ia64_sal_os_state), with 16 byte alignment for all structures. |
|
|
|
Also the comm field of the MCA/INIT task is modified to include the pid |
|
of the original task, for humans to use. For example, a comm field of |
|
'MCA 12159' means that pid 12159 was running when the MCA was |
|
delivered.
|
|
|