Re: git: 95f773e59482 - main - i386 copyout_fast: improve detection of a fault on accessing userspace
- Reply: Tijl Coosemans : "Re: git: 95f773e59482 - main - i386 copyout_fast: improve detection of a fault on accessing userspace"
- Reply: Tijl Coosemans : "Re: git: 95f773e59482 - main - i386 copyout_fast: improve detection of a fault on accessing userspace"
- In reply to: Konstantin Belousov : "Re: git: 95f773e59482 - main - i386 copyout_fast: improve detection of a fault on accessing userspace"
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Thu, 08 Sep 2022 12:49:09 UTC
On Wed, 7 Sep 2022 23:18:32 +0300 Konstantin Belousov <kostikbel@gmail.com> wrote: > On Wed, Sep 07, 2022 at 06:38:04PM +0200, Tijl Coosemans wrote: >> So interrupts must have been reenabled somehow, probably by the page >> fault handler, and this allows context switches and then another process >> can call copyout or copyin and corrupt the trampoline stack and >> copyout_buf. > I do not see where the interrupts could be reenabled in copyout_fast path, > without or with page fault on the userspace access. The problem is not with userspace page faults. Those are treated specially by the page fault handler in exception.s causing copyout_fast and copyin_fast to return immediately with EFAULT so copyout and copyin fall back to doing a slow copy. The problem is with page faults on the kernel space accesses. Before this commit they were also treated specially, and now they are not. Now the page fault handler in exception.s calls trap() which calls trap_pfault() etc. > Also I am not sure what should your patch demonstrate. My patch adds a trylock after cli and an unlock before sti. If the trylock fails panic("") is called. I put the lock at the base of the trampoline stack so it's per cpu. The code between cli and sti is supposed to be uninterruptible so the trylock should never fail, but it does! Somehow there's a context switch to another thread that calls copyout or copyin while the first thread was still busy copying when it was switched out. This corrupts the trampoline stack and/or copyout_buf of the first thread. If this isn't caused by interrupts being reenabled then the context switch must be happening in some other way. And this must be happening because page faults on kernel space accesses call trap() now. > Also, pmap_trm_alloc() puts guard page between consequent allocations, > so trampoline stack overflow must be very careful to not really overflow, > but just touch enough bytes to give the effect. I no longer think stack overflow happens. I tried increasing the trampoline stack to 4 pages (=KSTACK_PAGES) and the panic still looked the same, only with stack addresses shifted by 3 pages. > But lets check the hypothesis. If interrupts are enabled somehow, then > processor would execute interrupt/fault handler on the same stack, which > is trampoline stack and not the kstack. I added an INVARIANTS check that > verifies that both trap() and syscall() use kstack. Could you, please, > fetch my ast branch and see what it the outcome? I have not done this yet, but I don't think this will ever trigger. For copyin_fast, interrupts are guaranteed to be disabled for copying from user to copyout_buf. In your ast branch subsequent copying from copyout_buf to kernel space runs on the kstack so any interrupts here won't use the trampoline stack. For copyout_fast, page faults when copying from the kernel to copyout_buf must be extremely rare because the kernel just prepared the data to be copied. And without page faults interrupts stay disabled for subsequent copying from copyout_buf to user space. I've never seen the problem with copyout_fast, only with copyin_fast.