Re: 13-stable NFS server hang

From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Sun, 03 Mar 2024 21:17:30 UTC

On Sat, Mar 2, 2024 at 8:28 PM Garrett Wollman <wollman@bimajority.org> wrote:
>
> I wrote previously:
> > PID    TID COMM                TDNAME              KSTACK
> > 997 108481 nfsd                nfsd: master        mi_switch sleepq_timedwait _sleep nfsv4_lock nfsrvd_dorpc nfssvc_program svc_run_internal svc_run nfsrvd_nfsd nfssvc_nfsd sys_nfssvc amd64_syscall fast_syscall_common
> > 997 960918 nfsd                nfsd: service       mi_switch sleepq_timedwait _sleep nfsv4_lock nfsrv_setclient nfsrvd_exchangeid nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start fork_exit fork_trampoline
> > 997 962232 nfsd                nfsd: service       mi_switch _cv_wait txg_wait_synced_impl txg_wait_synced dmu_offset_next zfs_holey zfs_freebsd_ioctl vn_generic_copy_file_range vop_stdcopy_file_range VOP_COPY_FILE_RANGE vn_copy_file_range nfsrvd_copy_file_range nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start fork_exit fork_trampoline
>
> I spent some time this evening looking at this last stack trace, and
> stumbled across the following comment in
> sys/contrib/openzfs/module/zfs/dmu.c:
>
> | /*
> |  * Enable/disable forcing txg sync when dirty checking for holes with lseek().
> |  * By default this is enabled to ensure accurate hole reporting, it can result
> |  * in a significant performance penalty for lseek(SEEK_HOLE) heavy workloads.
> |  * Disabling this option will result in holes never being reported in dirty
> |  * files which is always safe.
> |  */
> | int zfs_dmu_offset_next_sync = 1;
>
> I believe this explains why vn_copy_file_range sometimes takes much
> longer than a second: our servers often have lots of data waiting to
> be written to disk, and if the file being copied was recently modified
> (and so is dirty), this might take several seconds.  I've set
> vfs.zfs.dmu_offset_next_sync=0 on the server that was hurting the most
> and am watching to see if we have more freezes.
>
> If this does the trick, then I can delay deploying a new kernel until
> April, after my upcoming vacation.

Interesting. Please let us know how it goes.
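
For anyone who wants to check whether the hole lookup is the slow step on a particular file, here is a minimal sketch (an illustration only, not taken from the ZFS or NFS sources) that times a single lseek(SEEK_HOLE) from Python. The helper name time_seek_hole is made up, and it assumes a platform where os.SEEK_HOLE is available, as it is on FreeBSD and Linux. Run it on the server itself against a recently modified file. With vfs.zfs.dmu_offset_next_sync=1 the elapsed time would presumably include the txg_wait_synced wait seen in the stack trace above, but that is something to verify rather than a measured result.

```python
import os
import time


def time_seek_hole(path):
    """Return (first_hole_offset, elapsed_seconds) for one lseek(SEEK_HOLE).

    Hypothetical helper for illustration; not part of any FreeBSD tool.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.monotonic()
        # SEEK_HOLE returns the offset of the first hole at or after offset 0.
        # For a file with no holes, that is the implicit hole at end of file.
        hole = os.lseek(fd, 0, os.SEEK_HOLE)
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    return hole, elapsed


if __name__ == "__main__":
    import sys

    # Usage: python3 seekhole.py /path/to/file [more files...]
    for name in sys.argv[1:]:
        offset, seconds = time_seek_hole(name)
        print(f"{name}: first hole at {offset}, lseek took {seconds:.3f}s")
```

Comparing the timing with the sysctl set to 1 and then to 0 on the same freshly written file should show whether the forced sync accounts for the multi-second stalls.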

And enjoy your vacation, rick

>
> -GAWollman
>