Re: Did something change with ZFS and vnode caching?
- In reply to: Garrett Wollman: "Re: Did something change with ZFS and vnode caching?"
Date: Fri, 01 Sep 2023 21:22:23 UTC
On Thu, Aug 31, 2023 at 12:05 PM Garrett Wollman <wollman@bimajority.org> wrote:
>
> <<On Thu, 24 Aug 2023 11:21:59 -0400, Garrett Wollman
> <wollman@bimajority.org> said:
>
> > Any suggestions on what we should monitor or try to adjust?

I remember you mentioning that you tried increasing kern.maxvnodes,
but I was wondering if you've tried bumping it way up (like 10X what
it currently is)?

You could try decreasing the max nfsd threads (the --maxthreads
command line option for nfsd). That would at least limit the # of
vnodes used by the nfsd.

rick

>
> To bring everyone up to speed: earlier this month we upgraded our NFS
> servers from 12.4 to 13.2 and found that our backup system was
> absolutely destroying NFS performance, which had not happened before.
>
> With some pointers from mjg@ and the thread relating to ZFS
> performance on current@, I built a stable/13 kernel
> (b5a5a06fc012d27c6937776bff8469ea465c3873) and installed it on one of
> our NFS servers for testing, then removed the band-aid on our backup
> system and allowed it to go as parallel as it wanted.
>
> Unfortunately, we do not control the scheduling of backup jobs, so
> it's difficult to tell whether the changes made any difference. Each
> backup job does a parallel breadth-first traversal of a given
> filesystem, using as many as 150 threads per job (the backup client
> auto-scales itself), and we sometimes see as many as eight jobs
> running in parallel on one file server. (There are 17, soon to be
> 18, file servers.)
>
> When the performance of NFS's backing store goes to hell, the NFS
> server is not able to put back-pressure on the clients hard enough
> to stop them from writing, and eventually the server runs out of 4k
> jumbo mbufs and crashes. This at least is a known failure mode,
> going back a decade. Before it gets to this point, the NFS server
> also auto-scales itself, so it's in competition with the backup
> client over who can create the most threads and ultimately allocate
> the most vnodes.
>
> Last night, while I was watching, the first dozen or so backups went
> fine, with no impact to NFS performance, until the backup server
> decided to schedule two, and then three, parallel scans of
> filesystems containing about 35 million files each. These scans tend
> to take an hour or four, depending on how much changed data is
> identified during the scan, but most of the time the client is just
> sitting in a readdir()/fstatat() loop with a shared work queue for
> parallelism. (That's my interpretation based on its activity; we do
> not have source code.)
>
> Once these scans were underway, I observed the same symptoms as on
> releng/13.2, with lots of lock contention and the vnlru process
> running almost constantly (95% CPU, so most of a core on this
> 20-core/40-thread server). From our monitoring, the server was
> recycling about 35k vnodes per second during this period. I wasn't
> monitoring these statistics before, so I don't have historical
> comparisons. My working assumption, such as it is, is that the
> switch from OpenSolaris ZFS to OpenZFS in 13.x moved some
> bottlenecks around, so that the backup client previously got tangled
> higher up in the ZFS code and now can put real pressure on the vnode
> allocator.
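The recycling rate mentioned above can be watched from userland with
plain sysctl polling; below is a minimal sketch in C, assuming the
stock kern.maxvnodes, vfs.numvnodes, vfs.freevnodes and vfs.recycles
OIDs as found on 13.x (it is not the monitoring setup described
above, just one way to collect the same numbers):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Read a numeric sysctl.  The counters involved are int, long or
 * uint64_t depending on the OID; a zeroed 8-byte buffer covers all
 * of them on little-endian amd64.
 */
static long
get_ctr(const char *name)
{
	long v = 0;
	size_t len = sizeof(v);

	if (sysctlbyname(name, &v, &len, NULL, 0) == -1)
		err(1, "sysctlbyname(%s)", name);
	return (v);
}

int
main(void)
{
	long prev = get_ctr("vfs.recycles");

	for (;;) {
		sleep(1);
		long cur = get_ctr("vfs.recycles");

		printf("numvnodes %ld free %ld max %ld recycled/s %ld\n",
		    get_ctr("vfs.numvnodes"), get_ctr("vfs.freevnodes"),
		    get_ctr("kern.maxvnodes"), cur - prev);
		prev = cur;
	}
}

vfs.recycles_free, where present, is worth logging alongside it.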
> During the hour that the three backup clients were running, I was
> able to run mjg@'s dtrace script and generate a flame graph, which
> is viewable at
> <https://people.csail.mit.edu/wollman/dtrace-terad.2.svg>.
> This just shows what the backup clients themselves are doing, and
> not what's going on in the vnlru or nfsd processes. You can ignore
> all the umtx stacks since that's just coordination between the
> threads in the backup client.
>
> On the "oncpu" side, the trace captures a lot of time spent spinning
> in lock_delay(), although I don't see where the alleged call site
> acquires any locks, so there must have been some inlining. On the
> "offcpu" side, it's clear that there's still a lot of time spent
> sleeping on vnode_list_mtx in the vnode allocation pathway, both
> directly from vn_alloc_hard() and also from vnlru_free_impl() after
> the mutex is dropped and then needs to be reacquired.
>
> In ZFS, there's also a substantial number of waits (shown as
> sx_xlock_hard stack frames), in both the easy case (a free vnode was
> readily available) and the hard case where vn_alloc_hard() calls
> vnlru_free_impl() and eventually zfs_inactive() to reclaim a vnode.
> Looking into the implementation, I noted that ZFS uses a 64-entry
> hash lock for this, and I'm wondering if there's an issue with false
> sharing. Can anyone with ZFS experience speak to that? If I
> increased ZFS_OBJ_MTX_SZ to 128 or 256, would it be likely to hurt
> something else (other than memory usage)? Do we even know that the
> low-order 6 bits of ZFS object IDs are actually uniformly
> distributed?
>
> -GAWollman
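For anyone who hasn't looked at the code being discussed: the
per-object hold mutexes are a fixed-size array indexed by the
low-order bits of the object number. The sketch below is an
illustrative model, not the OpenZFS source (only ZFS_OBJ_MTX_SZ is a
name taken from the discussion; the struct and function names here
are made up), but it shows why both the distribution of those low
bits and false sharing between adjacent array entries matter:

#include <stdint.h>
#include <pthread.h>

#define ZFS_OBJ_MTX_SZ	64	/* the 64-entry table in question */
#define CACHE_LINE	64	/* assumed cache-line size */

/*
 * Illustrative stand-in for the per-filesystem object-mutex table:
 * the bucket is the object number masked with (table size - 1),
 * i.e. its low-order 6 bits when the table has 64 entries.
 */
struct obj_mtx_table {
	pthread_mutex_t	locks[ZFS_OBJ_MTX_SZ];
};

static inline pthread_mutex_t *
obj_mutex(struct obj_mtx_table *t, uint64_t obj)
{
	return (&t->locks[obj & (ZFS_OBJ_MTX_SZ - 1)]);
}

/*
 * Two separate effects fall out of this layout:
 *  - collisions: if object numbers are not uniform in their low
 *    bits, hot objects pile onto a few buckets, and doubling
 *    ZFS_OBJ_MTX_SZ only helps if the extra object-number bit
 *    actually varies;
 *  - false sharing: mutexes packed back to back can share a cache
 *    line, so even uncontended buckets bounce the line between CPUs.
 *    Padding each entry to a cache line addresses that without
 *    growing the number of buckets.
 */
struct obj_mtx_padded {
	pthread_mutex_t	lock;
	char		pad[CACHE_LINE -
			    (sizeof(pthread_mutex_t) % CACHE_LINE)];
};

Whether the low 6 bits of real object IDs are uniform is an empirical
question; a histogram of (object number & 63) over one of the busy
datasets would answer it directly, and the padding shown above
addresses the false-sharing side independently of the collision side.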