Re: git: 5b353925ff61 - main - vnode read(2)/write(2): acquire rangelock regardless of do_vn_io_fault()

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Fri, 28 Jul 2023 13:37:07 UTC
On Fri, Jul 28, 2023 at 02:17:51AM +0200, Mateusz Guzik wrote:
> On 7/25/23, Konstantin Belousov <kib@freebsd.org> wrote:
> > The branch main has been updated by kib:
> >
> > URL:
> > https://cgit.FreeBSD.org/src/commit/?id=5b353925ff61b9ddb97bb453ba75278b578ed7d9
> >
> > commit 5b353925ff61b9ddb97bb453ba75278b578ed7d9
> > Author:     Konstantin Belousov <kib@FreeBSD.org>
> > AuthorDate: 2023-07-23 15:55:50 +0000
> > Commit:     Konstantin Belousov <kib@FreeBSD.org>
> > CommitDate: 2023-07-24 22:02:59 +0000
> >
> >     vnode read(2)/write(2): acquire rangelock regardless of do_vn_io_fault()
> >
> >     To ensure atomicity of reads against parallel writes and truncates,
> >     the vnode lock has not been enough at least since the introduction of
> >     vn_io_fault().  That code only took the rangelock when it was possible
> >     that vn_read() and vn_write() could drop the vnode lock.
> >
> >     At least since the introduction of VOP_READ_PGCACHE(), which generally
> >     does not lock the vnode at all, rangelocks have become required even
> >     for filesystems that do not need the vn_io_fault() workaround, for
> >     instance tmpfs.
> >
> 
> Is there a bug with pgcache reads disabled (as in, when the vnode lock
> is held for reads)?
> 
> Note this patch adds two lock trips which were previously not present,
> which has to slow things down single-threaded, but I did not bother
> measuring that part.
> 
> As this adds vnode-wide *lock* acquires, it is bound to affect
> scalability very negatively.
> 
> This time around I ran: ./readseek3_processes -t 10 (10 workers
> reading from *disjoint* offsets of the same vnode; in principle this
> can scale perfectly).
> 
> I observed a 90% drop in performance:
> before: total:25723459 ops/s
> after: total:2455794 ops/s
> 
> Going to an unpatched kernel and disabling pgcache reads instead:
> disabled: total:6522480 ops/s
> 
> or about 2.6x the performance of the current kernel.
> 
> In other words I think the thing to do here is to revert the patch and
> instead flip pgcache reads to off by default until a better fix can be
> implemented.

The purpose of the rangelock is to ensure atomicity of reads in the
presence of writes.  In other words, taking the rangelock there is
architecturally right.  It also fixes issues with truncation that are
not fixable with the vnode lock on tmpfs vnodes anyway.
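
To make the atomicity argument concrete, here is a minimal sketch (an
illustration only, not the kernel's rangelock API) of the conflict rule
a rangelock enforces: two I/O requests on the same vnode conflict only
if their byte ranges overlap and at least one of them is a write, so
concurrent readers of any ranges never block each other.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical type for illustration: an I/O request covering the
 * half-open byte range [start, end). */
struct io_range {
	off_t	start;
	off_t	end;
	bool	is_write;
};

/* Reader/writer conflict rule a rangelock enforces: disjoint ranges
 * are always safe, and overlapping ranges are safe only if both
 * sides are reads. */
static bool
ranges_conflict(const struct io_range *a, const struct io_range *b)
{
	if (a->end <= b->start || b->end <= a->start)
		return (false);		/* disjoint: always safe */
	return (a->is_write || b->is_write);
}
```

This is exactly why the rangelock, unlike the vnode lock, can in
principle let disjoint readers and writers proceed in parallel.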

That said, disabling the pgcache VOP on tmpfs means that the regular
read VOP is always used, which takes the vnode lock around reads.  So I
doubt that the changed disposition would gain much in your test.

The proper future fix is to improve the scalability of the rangelocks;
the naive stop-gap implementation I wrote initially, around the time of
CURRENT-7 or -8, has not been changed at all since.
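
The scalability problem of a naive rangelock can be sketched as follows
(a toy model assumed for illustration, not the FreeBSD implementation):
every acquire and release funnels through one global interlock, so even
threads locking fully disjoint ranges contend on the same mutex and
cache line.  That serialization is what a scalable rangelock would have
to avoid.

```c
#include <pthread.h>
#include <stdbool.h>
#include <sys/types.h>

#define MAXHELD	64

struct held_range {
	off_t	start;		/* half-open range [start, end) */
	off_t	end;
	bool	is_write;
	bool	active;
};

/* The single global interlock is the bottleneck: every lock and
 * unlock, for any range, takes this mutex. */
static pthread_mutex_t rl_interlock = PTHREAD_MUTEX_INITIALIZER;
static struct held_range rl_held[MAXHELD];

static bool
rl_conflicts(const struct held_range *h, off_t start, off_t end,
    bool is_write)
{
	if (!h->active || h->end <= start || end <= h->start)
		return (false);
	return (h->is_write || is_write);
}

/* Returns a slot id on success, -1 if the range is busy or the
 * table is full. */
static int
rl_tryacquire(off_t start, off_t end, bool is_write)
{
	int i, slot = -1;

	pthread_mutex_lock(&rl_interlock);
	for (i = 0; i < MAXHELD; i++) {
		if (rl_conflicts(&rl_held[i], start, end, is_write)) {
			pthread_mutex_unlock(&rl_interlock);
			return (-1);
		}
		if (slot == -1 && !rl_held[i].active)
			slot = i;
	}
	if (slot != -1)
		rl_held[slot] =
		    (struct held_range){ start, end, is_write, true };
	pthread_mutex_unlock(&rl_interlock);
	return (slot);
}

static void
rl_release(int slot)
{
	pthread_mutex_lock(&rl_interlock);
	rl_held[slot].active = false;
	pthread_mutex_unlock(&rl_interlock);
}
```

Even though the conflict check lets disjoint lockers through, every one
of them still bounces the same mutex, which matches the disjoint-offset
regression seen in the benchmark above.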

BTW, it seems that the file offset locks are no longer needed, but I
need to recheck that.  Removing them should shave four atomics off the
read and write paths.
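
For a sense of the cost being discussed: a lock/unlock pair around the
file offset costs atomic read-modify-write operations on both sides of
the I/O.  If the offset no longer needs that protection, a sequential
read could advance it with a single atomic RMW, as in this sketch (an
illustration with C11 atomics, not the kernel's foffset_* code):

```c
#include <stdatomic.h>
#include <sys/types.h>

/* Shared file offset, illustrative stand-in for a struct file
 * member. */
static _Atomic off_t f_offset;

/* Advance the offset past a completed read of nread bytes and
 * return the offset at which that read started.  One atomic RMW,
 * versus a lock/unlock pair (two atomics each) per I/O. */
static off_t
offset_advance(size_t nread)
{
	return (atomic_fetch_add(&f_offset, (off_t)nread));
}
```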