Re: Speed improvements in ZFS

From: Alexander Leidinger <Alexander_at_Leidinger.net>
Date: Mon, 28 Aug 2023 20:33:48 UTC
On 2023-08-22 18:59, Mateusz Guzik wrote:
> On 8/22/23, Alexander Leidinger <Alexander@leidinger.net> wrote:
>> On 2023-08-21 10:53, Konstantin Belousov wrote:
>>> On Mon, Aug 21, 2023 at 08:19:28AM +0200, Alexander Leidinger wrote:
>>>> On 2023-08-20 23:17, Konstantin Belousov wrote:
>>>> > On Sun, Aug 20, 2023 at 11:07:08PM +0200, Mateusz Guzik wrote:
>>>> > > On 8/20/23, Alexander Leidinger <Alexander@leidinger.net> wrote:
>>>> > > > On 2023-08-20 22:02, Mateusz Guzik wrote:
>>>> > > >> On 8/20/23, Alexander Leidinger <Alexander@leidinger.net> wrote:
>>>> > > >>> On 2023-08-20 19:10, Mateusz Guzik wrote:
>>>> > > >>>> On 8/18/23, Alexander Leidinger <Alexander@leidinger.net> wrote:
>>>> > > >>>
>>>> > > >>>>> I have a 51MB text file, compressed to about 1MB. Are you
>>>> > > >>>>> interested in getting it?
>>>> > > >>>>>
>>>> > > >>>>
>>>> > > >>>> Your problem is not the vnode limit, but nullfs.
>>>> > > >>>>
>>>> > > >>>> https://people.freebsd.org/~mjg/netchild-periodic-find.svg
>>>> > > >>>
>>>> > > >>> 122 nullfs mounts on this system. And every jail I set up has
>>>> > > >>> several null mounts. One basesystem mounted into every jail, and
>>>> > > >>> then shared ports (packages/distfiles/ccache) across all of them.
>>>> > > >>>
>>>> > > >>>> First, some of the contention is notorious VI_LOCK in order to
>>>> > > >>>> do anything.
>>>> > > >>>>
>>>> > > >>>> But more importantly the mind-boggling off-cpu time comes from
>>>> > > >>>> exclusive locking which should not be there to begin with -- as
>>>> > > >>>> in that xlock in stat should be a slock.
>>>> > > >>>>
>>>> > > >>>> Maybe I'm going to look into it later.
>>>> > > >>>
>>>> > > >>> That would be fantastic.
>>>> > > >>>
>>>> > > >>
>>>> > > >> I did a quick test, things are shared locked as expected.
>>>> > > >>
>>>> > > >> However, I found the following:
>>>> > > >>         if ((xmp->nullm_flags & NULLM_CACHE) != 0) {
>>>> > > >>                 mp->mnt_kern_flag |=
>>>> > > >>                     lowerrootvp->v_mount->mnt_kern_flag &
>>>> > > >>                     (MNTK_SHARED_WRITES | MNTK_LOOKUP_SHARED |
>>>> > > >>                     MNTK_EXTENDED_SHARED);
>>>> > > >>         }
>>>> > > >>
>>>> > > >> Are you using the "nocache" option? It has a side effect of
>>>> > > >> xlocking.
>>>> > > >
>>>> > > > I use noatime, noexec, nosuid, nfsv4acls. I do NOT use nocache.
>>>> > > >
>>>> > >
>>>> > > If you don't have "nocache" on null mounts, then I don't see how
>>>> > > this could happen.
>>>> >
>>>> > There is also MNTK_NULL_NOCACHE on lower fs, which is currently set
>>>> > for fuse and nfs at least.
>>>> 
>>>> 11 of those 122 nullfs mounts are ZFS datasets which are also NFS
>>>> exported. 6 of those nullfs mounts are also exported via Samba. The
>>>> NFS exports shouldn't be needed anymore, I will remove them.
>>> By nfs I meant nfs client, not nfs exports.
>> 
>> No NFS client mounts anywhere on this system. So where is this
>> exclusive lock coming from then...
>> This is a ZFS system. 2 pools: one for the root, one for anything I
>> need space for. Both pools reside on the same disks. The root pool is
>> a 3-way mirror, the "space-pool" is a 5-disk raidz2. All jails are on
>> the space-pool. The jails are all basejail-style jails.
>> 
> 
> While I don't see why xlocking happens, you should be able to dtrace
> or printf your way into finding out.

dtrace looks to me like a faster way to get to the root cause than
printf... My first naive attempt is to detect exclusive locks. I'm not
100% sure I got it right, but at least dtrace doesn't complain about it:
---snip---
#pragma D option dynvarsize=32m

/* 0x080000 is meant as LK_EXCLUSIVE; parenthesized since != binds tighter than & */
fbt:nullfs:null_lock:entry
/(args[0]->a_flags & 0x080000) != 0/
{
         stack();
}
---snip---

In which direction should I look with dtrace if this works in tonight's
run of periodic? I don't know enough about VFS to come up with any
immediate ideas.

Bye,
Alexander.

-- 
http://www.Leidinger.net Alexander@Leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org    netchild@FreeBSD.org  : PGP 0x8F31830F9F2772BF