Re: pondering pi futexes

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Fri, 02 Jul 2021 10:51:21 UTC
On Fri, Jul 02, 2021 at 12:53:01AM +0300, Dmitry Chagin wrote:
> On Mon, Jun 28, 2021 at 02:20:25PM +0300, Konstantin Belousov wrote:
> > On Sun, Jun 27, 2021 at 10:39:35PM +0300, Dmitry Chagin wrote:
> > > 
> > > Hi,
> > > some time ago I have changed Linuxulator futexes from sx lock to mtx.
> > > sx was used as it allows copyin/copyout with sx lock held.
> > > to use mtx I have changed the code like:
> > > 1. lock mtx;
> > > 2. disable_pagefaults;
> > > 3. copyin()
> > > 4. enable_pagefaults;
> > > 5. if error
> > >    - unlock mtx;
> > >    copyin();
> > >    if error == 0 goto 1.
> > > 
> > > it works (needto replace copyin() by fueword32()), but pondering pi futexes
> > > imlementation, I see that it is not possible to drop the futex lock on a return
> > > from msleep() path. 
> > > 
> > > below a simplified FUTEX_LOCK_PI operation, where on enter to the kernel current thread:
> > > 
> > > 0. acquire futex lock (which is mtx)
> > > 1. cmpset(0 -> current thread TID), return (0) on success;
> > > 2. fetch() from futex *uaddr (for TID of owner):
> > >    - check EDEADLK case (the futex word at *uaddr is already locked by the caller);
> > >    - check that is no waiters on *uaddr exists which is waiting via FUTEX_WAIT or
> > >      FUTEX_WAIT_BITSET, return (EINVAL) if so;
> > >    - cmpset(TID -> (FUTEX_WAITERS|TID));
> > >       - on error, the futex owner changed in user-space, repeat from 1.
> > > 3. Here we have: the owner, one waiter (current thread) and 0 or more waiters
> > >    sleeping on a waiting_proc. FUTEX_WAITERS bit is set, so any new waiters go to
> > >    the kernel and owner should unlock futex via the FUTEX_UNLOCK_PI op;
> > > 4. Try to find the thread which is associated with the owner’s TID:
> > >    - on error, something bad happened, owner died? Clean owner state link?
> > >      return (ESRCH). Or if no other waiters? Check this...
> > >    - on success:
> > >       - save owner state link to the struct futex (save priority);
> > >       - check the owner's priority, bump it if needed;
> > >       - put the current thread to the waiters list in descending priority order;
> > >       - change priority of all waiters if needed;
> > >       - msleep on a futex waiting_proc; come back with futex lock held;
> > >          - restore own priority? If last waiter?; [ponders..]
> > >          - on timeout return (ETIMEDOUT);
> > >          - the current thread is the new owner:
> > > bah!!    - store() the owner TID to *uaddr; [check what should I do on error..]
> > >          - release futex lock;
> > >          - return (0).
> > > 
> > > is it possible to hold *uaddr page to prevent page faults?
> > 
> > I did not followed exact algorithm you trying to describe.  Still, I can make
> > two points which could be useful for formulation of the working solution.
> > 
> > 1. Umtx AKA FreeBSD native implementation of something very similar to
> > futex, has a concept of the umtx queue chain. The chain owns the mutex
> > lock used for 'fast' ops, but for situations like accesses to userspace,
> > we 'busy' the umtxq chain. De-facto busy state is the hand-rolled
> > sleepable lock, with usual interlocking against chain mutex, and
> > msleep/wakeup inter-thread notifications.
> > 
> reading umtx impl I see umtxq_sleep drop lock (PDROP bit is set) and, in case
> of a signal, reacquire lock and breaks from loop. is there a possible
> small window for wakeup loss? as ERESTART returned, or the caller should
> to take care of this? Or I missing something?
> Im trying to reuse the umtx code for futexes, at least queue chain.
> Thanks in advance!

What is the race you see there?  If we get EINTR/ERESTART from msleep,
the loop inside umtxq_sleep() is not re-entered, and we do not sleep waiting
for the chain to become unbusy anymore.  Even if there is a wakeup on the
chain, before or after we acquired the lock, it does not matter for us
since we are not going to sleep on it anymore.

Yes, it is up to the caller to decide what to do with EINTR/ERESTART.  Some
umtx ops are specified to not react abruptly to signals, they should finish
the op anyway.  For such ops, specific checks for stops and exit requests
are still needed.

> 
> > 2. If you insist on accessing userspace pages while owning non-sleepable
> > locks, or even sleepable locks which do not compose cleanly with whole
> > VM+VFS+network (the later due to NFS) locking protocols, we do have a
> > technique that can be used.  Althought it is complicated and sometimes
> > slow.
> > 
> > You already noted half of it above, namely disable of the page faults and
> > copyin_nofault-like operations.  Second half is prefaulting, which is
> > done by vm_fault_quick_hold_pages().  There are two ways you can use it.
> > 
> > Either you hold the target userspace page, then remap it into KVA and
> > access userspace object through the kernel mapping.  This assumes that
> > the object fits into page, also it does not interact with parallel forks
> > and CoW, which you either have to accept to recheck after.  The usual
> > races, e.g. userspace remapping the held (actually wired) page under us,
> > is there, of course, but it is a race anyway.
> > 
> > Or, you hold the page but still access it using copyin_nofault.  Then any
> > reported error from copyin becomes fatal instead of an indicator to retry.
> > Error can happen for the same reason, because userspace could unmap or
> > remap this address under us.
> > 
> > Umtx implementation only uses busy on chains, might be because _nofault+
> > vm_fault_quick_hold() appeared later.