Re: pondering pi futexes

From: Dmitry Chagin <dchagin_at_freebsd.org>
Date: Thu, 01 Jul 2021 21:53:01 UTC
On Mon, Jun 28, 2021 at 02:20:25PM +0300, Konstantin Belousov wrote:
> On Sun, Jun 27, 2021 at 10:39:35PM +0300, Dmitry Chagin wrote:
> > 
> > Hi,
> > some time ago I have changed Linuxulator futexes from sx lock to mtx.
> > sx was used as it allows copyin/copyout with sx lock held.
> > to use mtx I have changed the code like:
> > 1. lock mtx;
> > 2. disable_pagefaults;
> > 3. copyin()
> > 4. enable_pagefaults;
> > 5. if error
> >    - unlock mtx;
> >    copyin();
> >    if error == 0 goto 1.
> > 
> > it works (needto replace copyin() by fueword32()), but pondering pi futexes
> > imlementation, I see that it is not possible to drop the futex lock on a return
> > from msleep() path. 
> > 
> > below a simplified FUTEX_LOCK_PI operation, where on enter to the kernel current thread:
> > 
> > 0. acquire futex lock (which is mtx)
> > 1. cmpset(0 -> current thread TID), return (0) on success;
> > 2. fetch() from futex *uaddr (for TID of owner):
> >    - check EDEADLK case (the futex word at *uaddr is already locked by the caller);
> >    - check that is no waiters on *uaddr exists which is waiting via FUTEX_WAIT or
> >      FUTEX_WAIT_BITSET, return (EINVAL) if so;
> >    - cmpset(TID -> (FUTEX_WAITERS|TID));
> >       - on error, the futex owner changed in user-space, repeat from 1.
> > 3. Here we have: the owner, one waiter (current thread) and 0 or more waiters
> >    sleeping on a waiting_proc. FUTEX_WAITERS bit is set, so any new waiters go to
> >    the kernel and owner should unlock futex via the FUTEX_UNLOCK_PI op;
> > 4. Try to find the thread which is associated with the owner’s TID:
> >    - on error, something bad happened, owner died? Clean owner state link?
> >      return (ESRCH). Or if no other waiters? Check this...
> >    - on success:
> >       - save owner state link to the struct futex (save priority);
> >       - check the owner's priority, bump it if needed;
> >       - put the current thread to the waiters list in descending priority order;
> >       - change priority of all waiters if needed;
> >       - msleep on a futex waiting_proc; come back with futex lock held;
> >          - restore own priority? If last waiter?; [ponders..]
> >          - on timeout return (ETIMEDOUT);
> >          - the current thread is the new owner:
> > bah!!    - store() the owner TID to *uaddr; [check what should I do on error..]
> >          - release futex lock;
> >          - return (0).
> > 
> > is it possible to hold *uaddr page to prevent page faults?
> 
> I did not followed exact algorithm you trying to describe.  Still, I can make
> two points which could be useful for formulation of the working solution.
> 
> 1. Umtx AKA FreeBSD native implementation of something very similar to
> futex, has a concept of the umtx queue chain. The chain owns the mutex
> lock used for 'fast' ops, but for situations like accesses to userspace,
> we 'busy' the umtxq chain. De-facto busy state is the hand-rolled
> sleepable lock, with usual interlocking against chain mutex, and
> msleep/wakeup inter-thread notifications.
> 
reading umtx impl I see umtxq_sleep drop lock (PDROP bit is set) and, in case
of a signal, reacquire lock and breaks from loop. is there a possible
small window for wakeup loss? as ERESTART returned, or the caller should
to take care of this? Or I missing something?
Im trying to reuse the umtx code for futexes, at least queue chain.
Thanks in advance!

> 2. If you insist on accessing userspace pages while owning non-sleepable
> locks, or even sleepable locks which do not compose cleanly with whole
> VM+VFS+network (the later due to NFS) locking protocols, we do have a
> technique that can be used.  Althought it is complicated and sometimes
> slow.
> 
> You already noted half of it above, namely disable of the page faults and
> copyin_nofault-like operations.  Second half is prefaulting, which is
> done by vm_fault_quick_hold_pages().  There are two ways you can use it.
> 
> Either you hold the target userspace page, then remap it into KVA and
> access userspace object through the kernel mapping.  This assumes that
> the object fits into page, also it does not interact with parallel forks
> and CoW, which you either have to accept to recheck after.  The usual
> races, e.g. userspace remapping the held (actually wired) page under us,
> is there, of course, but it is a race anyway.
> 
> Or, you hold the page but still access it using copyin_nofault.  Then any
> reported error from copyin becomes fatal instead of an indicator to retry.
> Error can happen for the same reason, because userspace could unmap or
> remap this address under us.
> 
> Umtx implementation only uses busy on chains, might be because _nofault+
> vm_fault_quick_hold() appeared later.