Re: git: 5b353925ff61 - main - vnode read(2)/write(2): acquire rangelock regardless of do_vn_io_fault()

From: Mateusz Guzik <mjguzik_at_gmail.com>
Date: Mon, 31 Jul 2023 02:28:37 UTC
On 7/28/23, Konstantin Belousov <kostikbel@gmail.com> wrote:
> On Fri, Jul 28, 2023 at 02:17:51AM +0200, Mateusz Guzik wrote:
>> On 7/25/23, Konstantin Belousov <kib@freebsd.org> wrote:
>> > The branch main has been updated by kib:
>> >
>> > URL:
>> > https://cgit.FreeBSD.org/src/commit/?id=5b353925ff61b9ddb97bb453ba75278b578ed7d9
>> >
>> > commit 5b353925ff61b9ddb97bb453ba75278b578ed7d9
>> > Author:     Konstantin Belousov <kib@FreeBSD.org>
>> > AuthorDate: 2023-07-23 15:55:50 +0000
>> > Commit:     Konstantin Belousov <kib@FreeBSD.org>
>> > CommitDate: 2023-07-24 22:02:59 +0000
>> >
>> >     vnode read(2)/write(2): acquire rangelock regardless of
>> > do_vn_io_fault()
>> >
>> >     To ensure atomicity of reads against parallel writes and truncates,
>> >     vnode lock was not enough at least since introduction of
>> > vn_io_fault().
>> >     That code only take rangelock when it was possible that vn_read()
>> > and
>> >     vn_write() could drop the vnode lock.
>> >
>> >     At least since the introduction of VOP_READ_PGCACHE() which
>> > generally
>> >     does not lock the vnode at all, rangelocks become required even
>> >     for filesystems that do not need vn_io_fault() workaround.  For
>> >     instance, tmpfs.
>> >
>>
>> Is there a bug with pgcache reads disabled (as in when the vnode lock
>> is held for reads?)
>>
>> Note this patch adds 2 lock trips which were previously not present,
>> which has to slow things down single-threaded, but I did not bother
>> measuring that part.
>>
>> As this adds to vnode-wide *lock* acquires this has to very negatively
>> affect scalability.
>>
>> This time around I ran: ./readseek3_processes -t 10 (10 workers
>> reading from *disjoint* offsets from the same vnode. this in principle
>> can scale perfectly)
>>
>> I observed a 90% drop in performance:
>> before: total:25723459 ops/s
>> after: total:2455794 ops/s
>>
>> Going to an unpatched kernel and disabling pgcache reads instead:
>> disabled: total:6522480 ops/s
>>
>> or about 2.6x of performance of the current kernel
>>
>> In other words I think the thing to do here is to revert the patch and
>> instead flip pgcache reads to off by default until a better fix can be
>> implemented.
>
> The rangelock purpose is to ensure atomicity of reads in presence of
> writes.  In other words, taking the rangelock there is architecturally
> right.  Also, it fixes issues with truncation that are not fixable with
> the vnode lock on tmpfs vnodes anyway.
>

How come? I see vn_truncate xlocking the vnode across the entire
thing. If all tmpfs reads slock and all writes xlock the vnode,
everyone gets protection against truncation as is.

> That said, disabling pgcache vop on tmpfs means that the regular read vop
> is always used, which takes the vnode lock around reads.  So I doubt that
> the changed disposition would gain much in your test.
>

I pasted a benchmark number without pgcache and without the change,
clearly showing that rangelock + pgcache loses big time to mere vnode
locking.

> The proper future fix would be to improve scalability of the rangelocks,
> whose naive stop-gap implementation I did initially in time of current-7
> or -8 was not changed at all.
>

I agree rangelocks should be fixed. But in the meantime, the patch at
hand seems to be too heavy handed to fix the issue (and disabling
pgcache reads would do it just fine instead).

> BTW, it seems that file offset locks are no longer needed, but I need to
> recheck it.  This should shave off four atomics on read and write path.
>


-- 
Mateusz Guzik <mjguzik gmail.com>