mirror of https://github.com/Qortal/Brooklyn
.. SPDX-License-Identifier: GPL-2.0

===============================
Kernel level exception handling
===============================

Commentary by Joerg Pommnitz <[email protected]>
|
When a process runs in kernel mode, it often has to access user
mode memory whose address has been passed by an untrusted program.
To protect itself the kernel has to verify this address.

In older versions of Linux this was done with the
int verify_area(int type, const void * addr, unsigned long size)
function (which has since been replaced by access_ok()).

This function verified that the memory area starting at address
'addr' and of size 'size' was accessible for the operation specified
in type (read or write). To do this, verify_area() had to look up the
virtual memory area (vma) that contained the address addr. In the
normal case (correctly working program), this test was successful.
It only failed for a few buggy programs. In some kernel profiling
tests, this normally unneeded verification used up a considerable
amount of time.

To overcome this situation, Linus decided to let the virtual memory
hardware present in every Linux-capable CPU handle this test.
|
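What stays in software once the hardware does the real test is only a cheap
range check against the top of user space; the 0xC0000000UL comparison in the
preprocessor output later in this document is such a check. A minimal sketch
of that idea in plain C follows. The names sketch_range_ok() and
SKETCH_USER_LIMIT are made up for illustration and are not kernel APIs, and
the limit value is simply the constant from the i386 example, so treat it as
an assumption for any other configuration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed user/kernel split, taken from the 0xC0000000UL constant that
 * appears in this document's i386 preprocessor output. */
#define SKETCH_USER_LIMIT UINT32_C(0xC0000000)

/* Hypothetical helper, not a kernel API: true if [addr, addr + size)
 * lies entirely below SKETCH_USER_LIMIT. It is written as two comparisons
 * so that addr + size is never computed and therefore cannot wrap around. */
static bool sketch_range_ok(uint32_t addr, uint32_t size)
{
    return size <= SKETCH_USER_LIMIT &&
           addr <= SKETCH_USER_LIMIT - size;
}
```

Note that such a check says nothing about whether the address is actually
mapped; that question is exactly what is left to the page fault machinery.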
|
How does this work?

Whenever the kernel tries to access an address that is currently not
accessible, the CPU generates a page fault exception and calls the
page fault handler::

    void exc_page_fault(struct pt_regs *regs, unsigned long error_code)

in arch/x86/mm/fault.c. The parameters on the stack are set up by
the low level assembly glue in arch/x86/entry/entry_32.S. The parameter
regs is a pointer to the saved registers on the stack; error_code
contains a reason code for the exception.

exc_page_fault() first obtains the inaccessible address from the CPU
control register CR2. If the address is within the virtual address
space of the process, the fault probably occurred because the page
was not swapped in, was write protected, or something similar. However,
we are interested in the other case: the address is not valid, there
is no vma that contains this address. In this case, the kernel jumps
to the bad_area label.

There it uses the address of the instruction that caused the exception
(i.e. regs->eip) to find an address where the execution can continue
(fixup). If this search is successful, the fault handler modifies the
return address (again regs->eip) and returns. The execution will
continue at the address in fixup.
|
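The lookup just described can be sketched in user-space C as follows. This is
only an illustration under stated assumptions: struct fix_regs,
struct fix_entry and sketch_fixup_exception() are hypothetical stand-ins, not
the kernel's definitions, and a linear search is used for brevity.

```c
#include <stddef.h>

/* Hypothetical stand-in for the saved registers (struct pt_regs). */
struct fix_regs {
    unsigned long ip;    /* saved instruction pointer, regs->eip in the text */
};

/* Hypothetical stand-in for one exception table entry. */
struct fix_entry {
    unsigned long insn;  /* address of an instruction that may fault */
    unsigned long fixup; /* address where execution should continue */
};

/* Linear search for brevity; returns NULL if addr has no entry. */
static const struct fix_entry *
sketch_search(const struct fix_entry *table, size_t n, unsigned long addr)
{
    for (size_t i = 0; i < n; i++)
        if (table[i].insn == addr)
            return &table[i];
    return NULL;
}

/* Returns 1 and redirects regs->ip if a fixup exists, 0 otherwise.
 * In the kernel the latter case would end in an oops. */
static int sketch_fixup_exception(struct fix_regs *regs,
                                  const struct fix_entry *table, size_t n)
{
    const struct fix_entry *e = sketch_search(table, n, regs->ip);

    if (!e)
        return 0;
    regs->ip = e->fixup;
    return 1;
}
```

With a table that contains the pair (c017e7a5, c0199ff5) from the example
examined below, a fault at c017e7a5 would leave regs.ip at c0199ff5, while a
fault at an address without an entry leaves regs.ip unchanged.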
|
Where does fixup point to?

Since we jump to the contents of fixup, fixup obviously points
to executable code. This code is hidden inside the user access macros.
I have picked the get_user() macro defined in arch/x86/include/asm/uaccess.h
as an example. The definition is somewhat hard to follow, so let's peek at
the code generated by the preprocessor and the compiler. I selected
the get_user() call in drivers/char/sysrq.c for a detailed examination.

The original code in sysrq.c line 587::

    get_user(c, buf);

The preprocessor output (edited to become somewhat readable)::

    (
      {
        long __gu_err = - 14 , __gu_val = 0;
        const __typeof__(*( ( buf ) )) *__gu_addr = ((buf));
        if (((((0 + current_set[0])->tss.segment) == 0x18 ) ||
            (((sizeof(*(buf))) <= 0xC0000000UL) &&
             ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
          do {
            __gu_err = 0;
            switch ((sizeof(*(buf)))) {
              case 1:
                __asm__ __volatile__(
                  "1: mov" "b" " %2,%" "b" "1\n"
                  "2:\n"
                  ".section .fixup,\"ax\"\n"
                  "3: movl %3,%0\n"
                  " xor" "b" " %" "b" "1,%" "b" "1\n"
                  " jmp 2b\n"
                  ".section __ex_table,\"a\"\n"
                  " .align 4\n"
                  " .long 1b,3b\n"
                  ".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)
                            ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ;
                break;
              case 2:
                __asm__ __volatile__(
                  "1: mov" "w" " %2,%" "w" "1\n"
                  "2:\n"
                  ".section .fixup,\"ax\"\n"
                  "3: movl %3,%0\n"
                  " xor" "w" " %" "w" "1,%" "w" "1\n"
                  " jmp 2b\n"
                  ".section __ex_table,\"a\"\n"
                  " .align 4\n"
                  " .long 1b,3b\n"
                  ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
                            ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err ));
                break;
              case 4:
                __asm__ __volatile__(
                  "1: mov" "l" " %2,%" "" "1\n"
                  "2:\n"
                  ".section .fixup,\"ax\"\n"
                  "3: movl %3,%0\n"
                  " xor" "l" " %" "" "1,%" "" "1\n"
                  " jmp 2b\n"
                  ".section __ex_table,\"a\"\n"
                  " .align 4\n" " .long 1b,3b\n"
                  ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
                            ( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err));
                break;
              default:
                (__gu_val) = __get_user_bad();
            }
          } while (0) ;
        ((c)) = (__typeof__(*((buf))))__gu_val;
        __gu_err;
      }
    );
|
WOW! Black GCC/assembly magic. This is impossible to follow, so let's
see what code gcc generates::

    >         xorl %edx,%edx
    >         movl current_set,%eax
    >         cmpl $24,788(%eax)
    >         je .L1424
    >         cmpl $-1073741825,64(%esp)
    >         ja .L1423
    > .L1424:
    >         movl %edx,%eax
    >         movl 64(%esp),%ebx
    > #APP
    > 1:      movb (%ebx),%dl                /* this is the actual user access */
    > 2:
    > .section .fixup,"ax"
    > 3:      movl $-14,%eax
    >         xorb %dl,%dl
    >         jmp 2b
    > .section __ex_table,"a"
    >         .align 4
    >         .long 1b,3b
    > .text
    > #NO_APP
    > .L1423:
    >         movzbl %dl,%esi

The optimizer does a good job and gives us something we can actually
understand. Can we? The actual user access is quite obvious. Thanks
to the unified address space we can just access the address in user
memory. But what does the .section stuff do?????
|
To understand this we have to look at the final kernel::

    > objdump --section-headers vmlinux
    >
    > vmlinux:     file format elf32-i386
    >
    > Sections:
    > Idx Name          Size      VMA       LMA       File off  Algn
    >   0 .text         00098f40  c0100000  c0100000  00001000  2**4
    >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
    >   1 .fixup        000016bc  c0198f40  c0198f40  00099f40  2**0
    >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
    >   2 .rodata       0000f127  c019a5fc  c019a5fc  0009b5fc  2**2
    >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
    >   3 __ex_table    000015c0  c01a9724  c01a9724  000aa724  2**2
    >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
    >   4 .data         0000ea58  c01abcf0  c01abcf0  000abcf0  2**4
    >                   CONTENTS, ALLOC, LOAD, DATA
    >   5 .bss          00018e21  c01ba748  c01ba748  000ba748  2**2
    >                   ALLOC
    >   6 .comment      00000ec4  00000000  00000000  000ba748  2**0
    >                   CONTENTS, READONLY
    >   7 .note         00001068  00000ec4  00000ec4  000bb60c  2**0
    >                   CONTENTS, READONLY

There are obviously two non-standard ELF sections in the generated object
file, .fixup and __ex_table. But first we want to find out what happened
to our code in the final kernel executable::

    > objdump --disassemble --section=.text vmlinux
    >
    > c017e785 <do_con_write+c1> xorl   %edx,%edx
    > c017e787 <do_con_write+c3> movl   0xc01c7bec,%eax
    > c017e78c <do_con_write+c8> cmpl   $0x18,0x314(%eax)
    > c017e793 <do_con_write+cf> je     c017e79f <do_con_write+db>
    > c017e795 <do_con_write+d1> cmpl   $0xbfffffff,0x40(%esp,1)
    > c017e79d <do_con_write+d9> ja     c017e7a7 <do_con_write+e3>
    > c017e79f <do_con_write+db> movl   %edx,%eax
    > c017e7a1 <do_con_write+dd> movl   0x40(%esp,1),%ebx
    > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
    > c017e7a7 <do_con_write+e3> movzbl %dl,%esi

The whole user memory access is reduced to 10 x86 machine instructions.
The instructions bracketed in the .section directives are no longer
in the normal execution path. They are located in a different section
of the executable file::

    > objdump --disassemble --section=.fixup vmlinux
    >
    > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax
    > c0199ffa <.fixup+10ba> xorb   %dl,%dl
    > c0199ffc <.fixup+10bc> jmp    c017e7a7 <do_con_write+e3>
|
And finally::

    > objdump --full-contents --section=__ex_table vmlinux
    >
    > c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0  ................
    > c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0  ................
    > c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0  ................

or in human readable byte order::

    > c01aa7c4 c017c093 c0199fe0 c017c097 c017c099  ................
    > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
                                 ^^^^^^^^^^^^^^^^^
                                 this is the interesting part!
    > c01aa7e4 c0180a08 c019a001 c0180a0a c019a004  ................
|
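The conversion between the two dumps is just the little-endian byte order of
i386: each group of four bytes is stored least significant byte first. A small
helper shows the arithmetic; sketch_le32() is a made-up name for this
illustration.

```c
#include <stdint.h>

/* Assemble a 32-bit value from four bytes that are stored least
 * significant byte first, as in the raw __ex_table dump. */
static uint32_t sketch_le32(const unsigned char b[4])
{
    return (uint32_t)b[0] |
           ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) |
           ((uint32_t)b[3] << 24);
}
```

Applied to the bytes a5 e7 17 c0 and f5 9f 19 c0 from the first dump, it
yields c017e7a5 and c0199ff5, the pair marked in the second dump.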
|
What happened? The assembly directives::

    .section .fixup,"ax"
    .section __ex_table,"a"

told the assembler to move the following code to the specified
sections in the ELF object file. So the instructions::

    3:      movl $-14,%eax
            xorb %dl,%dl
            jmp 2b

ended up in the .fixup section of the object file and the addresses::

            .long 1b,3b

ended up in the __ex_table section of the object file. 1b and 3b
are local labels. The local label 1b (1b stands for next label 1
backward) is the address of the instruction that might fault, i.e.
in our case the address of the label 1 is c017e7a5::

    the original assembly code: > 1:      movb (%ebx),%dl
    and linked in vmlinux     : > c017e7a5 <do_con_write+e1> movb (%ebx),%dl
|
The local label 3 (backwards again) is the address of the code to handle
the fault, in our case the actual value is c0199ff5::

    the original assembly code: > 3:      movl $-14,%eax
    and linked in vmlinux     : > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax

If the fixup was able to handle the exception, control flow may be returned
to the instruction after the one that triggered the fault, i.e. local label 2b.

The assembly code::

    > .section __ex_table,"a"
    >         .align 4
    >         .long 1b,3b

becomes the value pair::

    > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
                                 ^this is ^this is
                                 1b       3b

c017e7a5,c0199ff5 in the exception table of the kernel.
|
So, what actually happens if a fault from kernel mode with no suitable
vma occurs?

#. access to invalid address::

       > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl

#. MMU generates exception
#. CPU calls exc_page_fault()
#. exc_page_fault() calls do_user_addr_fault()
#. do_user_addr_fault() calls kernelmode_fixup_or_oops()
#. kernelmode_fixup_or_oops() calls fixup_exception() (regs->eip == c017e7a5);
#. fixup_exception() calls search_exception_tables()
#. search_exception_tables() looks up the address c017e7a5 in the
   exception table (i.e. the contents of the ELF section __ex_table)
   and returns the address of the associated fault handling code c0199ff5.
#. fixup_exception() modifies its own return address to point to the fault
   handling code and returns.
#. execution continues in the fault handling code.
#. a) EAX becomes -EFAULT (== -14)
   b) DL becomes zero (the value we "read" from user space)
   c) execution continues at local label 2 (address of the
      instruction immediately after the faulting user access).

Steps 11a to 11c in a certain way emulate the faulting instruction.
|
That's it, mostly. If you look at our example, you might ask why
we set EAX to -EFAULT in the exception handler code. Well, the
get_user() macro actually returns a value: 0, if the user access was
successful, -EFAULT on failure. Our original code did not test this
return value; however, the inline assembly code in get_user() tries to
return -EFAULT. GCC selected EAX to return this value.
|
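A caller that does care about the result checks it and propagates the error.
The following user-space sketch mimics that pattern. mock_get_user_byte() is
a hypothetical stand-in in which a NULL pointer plays the role of a faulting
user address, so it illustrates only the calling convention, not the fault
handling itself.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for get_user() on a single byte.
 * Assumption: a NULL pointer represents an address that would fault. */
static int mock_get_user_byte(unsigned char *val, const unsigned char *uaddr)
{
    if (uaddr == NULL) {
        *val = 0;        /* corresponds to "xorb %dl,%dl" in the fixup code */
        return -EFAULT;  /* corresponds to "movl $-14,%eax" */
    }
    *val = *uaddr;
    return 0;
}

/* A caller that, unlike the sysrq.c example, tests the return value. */
static int sketch_read_command(const unsigned char *ubuf, unsigned char *out)
{
    int err = mock_get_user_byte(out, ubuf);

    if (err)
        return err;      /* hand -EFAULT on to our own caller */
    return 0;
}
```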
|
NOTE:
   Due to the way that the exception table is built and needs to be ordered,
   only use exceptions for code in the .text section. Any other section
   will cause the exception table to not be sorted correctly, and the
   exceptions will fail.
|
Things changed when 64-bit support was added to x86 Linux. Rather than
double the size of the exception table by expanding the two entries
from 32 bits to 64 bits, a clever trick was used to store addresses
as relative offsets from the table itself. The assembly code changed
from::

    .long 1b,3b

to::

    .long (from) - .
    .long (to) - .

and the C-code that uses these values converts back to absolute addresses
like this::

    static inline unsigned long
    ex_insn_addr(const struct exception_table_entry *x)
    {
        return (unsigned long)&x->insn + x->insn;
    }
|
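The encoding and decoding can be tried out in user space. The sketch below
assumes that the table and the addresses it refers to lie within 2 GiB of each
other, which should hold here because both are static objects in the same
program. All names (struct rel_entry, sketch_encode() and so on) are invented
for the illustration.

```c
#include <stdint.h>

/* Same layout idea as described above: two 32-bit offsets, each
 * relative to the address of the field that stores it. */
struct rel_entry {
    int32_t insn;
    int32_t fixup;
};

/* Stand-in for kernel text; only the addresses of its bytes are used. */
static char sketch_text[64];

/* Static table, so that it sits close to sketch_text in memory
 * (assumption: the distance fits into a signed 32-bit offset). */
static struct rel_entry sketch_table[1];

/* Encode: store "target - address of the field", like ".long (from) - .". */
static void sketch_encode(struct rel_entry *e,
                          const char *insn, const char *fixup)
{
    e->insn  = (int32_t)((intptr_t)insn  - (intptr_t)&e->insn);
    e->fixup = (int32_t)((intptr_t)fixup - (intptr_t)&e->fixup);
}

/* Decode: add the field's own address back, as ex_insn_addr() does. */
static uintptr_t sketch_insn_addr(const struct rel_entry *e)
{
    return (uintptr_t)&e->insn + e->insn;
}

static uintptr_t sketch_fixup_addr(const struct rel_entry *e)
{
    return (uintptr_t)&e->fixup + e->fixup;
}
```

Each entry stays at 8 bytes even on a 64-bit build, which is the point of the
trick.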
|
In v4.6 the exception table entry was expanded with a new field "handler".
This is also 32 bits wide and contains a third relative function
pointer which points to one of:

1) ``int ex_handler_default(const struct exception_table_entry *fixup)``
     This is the legacy case that just jumps to the fixup code.

2) ``int ex_handler_fault(const struct exception_table_entry *fixup)``
     This case provides the fault number of the trap that occurred at
     entry->insn. It is used to distinguish page faults from machine
     checks.

More functions can easily be added.
|
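How such a per-entry handler might be dispatched is sketched below. For
simplicity the sketch stores an ordinary function pointer rather than a 32-bit
relative one, and it passes the saved registers and the trap number as extra
arguments. These signatures and all hnd_* names are assumptions made for the
illustration and differ from the prototypes quoted above.

```c
/* Hypothetical stand-in for the saved registers. */
struct hnd_regs {
    unsigned long ip;   /* where execution resumes */
    long ax;            /* register used to report the trap number */
};

struct hnd_entry;
typedef int (*hnd_fn)(const struct hnd_entry *e,
                      struct hnd_regs *regs, int trapnr);

/* Hypothetical entry with a third field that selects the handler. */
struct hnd_entry {
    unsigned long insn;
    unsigned long fixup;
    hnd_fn handler;
};

/* Legacy behaviour: just continue at the fixup code. */
static int hnd_default(const struct hnd_entry *e,
                       struct hnd_regs *regs, int trapnr)
{
    (void)trapnr;
    regs->ip = e->fixup;
    return 1;
}

/* Additionally report which trap occurred, so that the fixup code
 * can tell a page fault from a machine check. */
static int hnd_fault(const struct hnd_entry *e,
                     struct hnd_regs *regs, int trapnr)
{
    regs->ip = e->fixup;
    regs->ax = trapnr;
    return 1;
}

/* The generic code does not need to know which handler it calls. */
static int hnd_dispatch(const struct hnd_entry *e,
                        struct hnd_regs *regs, int trapnr)
{
    return e->handler(e, regs, trapnr);
}
```

On x86 a page fault is vector 14 and a machine check is vector 18, so these
are the values the fixup code would see in the second case.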
|
CONFIG_BUILDTIME_TABLE_SORT allows the __ex_table section to be sorted after
the kernel image has been linked, via the host utility scripts/sorttable. It
will set the symbol main_extable_sort_needed to 0, avoiding sorting the
__ex_table section at boot time. With the exception table sorted, at runtime
when an exception occurs we can quickly look up the __ex_table entry via
binary search.
|
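The benefit of a sorted table can be illustrated with the C library's qsort()
and bsearch(). This is a user-space sketch only: struct sorted_entry uses
absolute addresses for readability, all names are made up, and the kernel has
its own sort and search routines.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical table entry with absolute addresses. */
struct sorted_entry {
    unsigned long insn;
    unsigned long fixup;
};

/* Order entries by the address of the possibly faulting instruction. */
static int cmp_insn(const void *a, const void *b)
{
    const struct sorted_entry *x = a;
    const struct sorted_entry *y = b;

    return (x->insn > y->insn) - (x->insn < y->insn);
}

/* Done once: at build time by a host tool, or otherwise at boot. */
static void sketch_sort_table(struct sorted_entry *t, size_t n)
{
    qsort(t, n, sizeof(*t), cmp_insn);
}

/* Done on every fault: binary search by faulting instruction address.
 * Returns NULL if the address has no entry. Requires a sorted table. */
static const struct sorted_entry *
sketch_lookup(const struct sorted_entry *t, size_t n, unsigned long addr)
{
    struct sorted_entry key = { addr, 0 };

    return bsearch(&key, t, n, sizeof(*t), cmp_insn);
}
```

The lookup is only correct on a sorted table, which is why entries that would
end up out of order make the exceptions fail.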
|
This is not just a boot time optimization; some architectures require this
table to be sorted in order to handle exceptions relatively early in the boot
process. For example, i386 makes use of this form of exception handling before
paging support is even enabled!