Re: Kernel panics with vfs.nfsd.enable_locallocks=1 and nfs clients doing hdf5 file operations
Date: Tue, 01 Oct 2024 00:10:03 UTC
On Wed, Aug 21, 2024 at 8:02 AM Matthew L. Dailey
<Matthew.L.Dailey@dartmouth.edu> wrote:
>
> Hi Rick,
>
> Done - https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=280978
Just fyi for everyone, the bugzilla PR now has a patch that has been
committed to main as eb345e05ac66. Early indications are that it fixes
the race that was causing this problem.
Although testing is still in progress, I committed it so that it can be
MFC'd to stable/14 in time for 14.2.

Thanks go to Matt for reporting this and testing the patch, rick

>
> Thanks!
>
> -Matt
>
> On 8/21/24 10:45 AM, Rick Macklem wrote:
> > Please create a PR for this and include at least
> > one backtrace. I will try and figure out how
> > locallocks could cause it.
> >
> > I suspect few use locallocks=1.
> >
> > rick
> >
> > On Wed, Aug 21, 2024 at 7:29 AM Matthew L. Dailey
> > <Matthew.L.Dailey@dartmouth.edu> wrote:
> >
> > Hi all,
> >
> > I posted messages to this list back in February and March
> > (https://lists.freebsd.org/archives/freebsd-current/2024-February/005546.html)
> > regarding kernel panics we were having with nfs clients doing hdf5 file
> > operations. After a hiatus in troubleshooting, I had more time this
> > summer and have found the cause - the vfs.nfsd.enable_locallocks sysctl.
> >
> > When this is set to 1, we can induce either a panic or hung nfs server
> > (more rarely) usually within a few hours, but sometimes within several
> > days to a week. We have replicated this on 13.0 through 15.0-CURRENT
> > (20240725-82283cad12a4-271360). With this set to 0 (default), we are
> > unable to replicate the issue, even after several weeks of 24/7 hdf5
> > file operations.
> >
> > One other side-effect of these panics is that on a few occasions it has
> > corrupted the root zpool beyond repair. This makes sense since kernel
> > memory is getting corrupted, but obviously makes this issue more
> > impactful.
> >
> > I'm hoping this is enough information to start narrowing down this
> > issue. We are specifically using this sysctl because we are also serving
> > files via samba and want to ensure consistent locking.
> >
> > I have provided some core dumps and backtraces previously, but am happy
> > to provide more as needed. I also have a writeup of exactly how to
> > reproduce this that I can send directly to anyone who is interested.
> >
> > Thanks so much for any and all help with this tricky problem. I'm happy
> > to do whatever I can to help get this squashed.
> >
> > Best,
> > Matt