Re: Did something change with ZFS and vnode caching?
Date: Sun, 10 Sep 2023 02:28:04 UTC
<<On Fri, 1 Sep 2023 01:04:56 +0200, Mateusz Guzik <mjguzik@gmail.com> said:

> zfs lock arrays are a known problem, bumping them is definitely an option.

This is the thing I tried next. It took a few attempts (mostly, I think, due to my own errors), but I'm now running with 512 (instead of 64) and plan to deploy 1024 soon. The results are significant: while we still see heavy load and kmem pressure during the backup window, backups complete some 5 to 8 hours sooner, and nfsd remains responsive.

> dtrace is rather funky with stack unwinding sometimes, hence possibly
> misplaced lock_delay.
>
> What you should do here is recompile the kernel with LOCK_PROFILING.
> Then:
> sysctl debug.lock.prof.contested_only=1
> sysctl debug.lock.prof.enable=1
> And finally sysctl debug.lock.prof.stats > out.lockprof

I do not know if I will get around to doing this, since my users have a limited tolerance for outages and I'm probably nearing the end of it, but it does seem likely to be the next step if we continue to have problems. [1]

Last night I was able to get a new dtrace capture, and I made a flame graph *excluding* stacks involving sleepq_catch_signals (which indicate idle threads). I redid the previous flame graph with a similar filter. [2] Unfortunately, this cannot be an entirely apples-to-apples comparison, because the traces ran for different lengths of time and the NFS clients do different (unpredictable) work every night. Last night's trace was taken during a period when four backup processes were running simultaneously, each with up to 110 threads, but it was a much shorter capture overall than the previous trace.

Last week: <https://people.csail.mit.edu/wollman/dtrace-both-2r.svg>
Yesterday: <https://people.csail.mit.edu/wollman/dtrace-both-11r.svg>

The one thing that stands out to me is that _sx_xlock_hard *barely* shows up in yesterday's graph -- it's there, but you have to know where to search for it. On the other hand, lock_delay is still there, still missing its immediate caller's stack frame, and the vnode list mutex is still obviously quite contended. The fact that __mtx_lock_hard is now a relatively larger fraction of zfs_zget suggests that increasing the number of ZFS object locks has substantially reduced the amount of self-contention the backup client creates during its scan.

Tonight, I took a similar trace on a stock 13.2-RELEASE system: <https://people.csail.mit.edu/wollman/dtrace-14.svg>. Note that this system has very different activity patterns: it is much newer and higher-capacity, but has a different (capacity-optimized) zpool setup; at times there were as many as eight backup processes running, and on this machine it takes about 21 hours to complete nightly incrementals. What stands out, aside from the additional time waiting for I/O to complete, is the appearance of rms_rlock_fallback. The path for this is the ZFS_ENTER macro, which is called at ZFS vnode entry points to interlock with unmount operations. [3]

Once I have the new kernel deployed on this server (and 15 others), I'll be able to collect more data and see whether it's worth investigating those lock_delay() stacks.

-GAWollman
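
[1] For reference, a rough sketch of the LOCK_PROFILING workflow Mateusz describes. The sysctl names are taken from his message; the kernel configuration name (MYKERNEL here) is a placeholder, and the build/reboot steps are just the standard buildkernel/installkernel sequence:

    # add "options LOCK_PROFILING" to the kernel config, then rebuild:
    cd /usr/src
    make buildkernel KERNCONF=MYKERNEL
    make installkernel KERNCONF=MYKERNEL
    shutdown -r now

    # after reboot, profile only contended acquisitions over the busy window:
    sysctl debug.lock.prof.contested_only=1
    sysctl debug.lock.prof.enable=1
    # ... let the backup workload run ...

    # stop collecting and dump the statistics:
    sysctl debug.lock.prof.enable=0
    sysctl debug.lock.prof.stats > out.lockprof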
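
[2] The "excluding" step can be done on the folded stacks before rendering. A minimal sketch, assuming the raw dtrace stack() aggregation output is in out.stacks and Brendan Gregg's FlameGraph scripts (stackcollapse.pl, flamegraph.pl) are on PATH -- the file names are placeholders, and the actual capture script isn't shown here:

    # fold the dtrace output into one line per unique stack
    stackcollapse.pl out.stacks > out.folded

    # drop any stack that passes through sleepq_catch_signals (idle threads)
    # and render the flame graph
    grep -v sleepq_catch_signals out.folded | flamegraph.pl > dtrace-filtered.svg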
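
[3] One way to double-check that attribution, assuming the fbt probe for rms_rlock_fallback is available on the running kernel (i.e. the function isn't inlined):

    # aggregate kernel stacks leading into rms_rlock_fallback (^C to print)
    dtrace -n 'fbt::rms_rlock_fallback:entry { @[stack()] = count(); }'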