nfsd kernel threads won't die via SIGKILL
Rick Macklem
rmacklem at uoguelph.ca
Wed Jun 27 01:05:12 UTC 2018
Konstantin Belousov wrote:
On Mon, Jun 25, 2018 at 02:04:32AM +0000, Rick Macklem wrote:
> Konstantin Belousov wrote:
> >On Sat, Jun 23, 2018 at 09:03:02PM +0000, Rick Macklem wrote:
> >> During testing of the pNFS server I have been frequently killing/restarting the nfsd.
> >> Once in a while, the "slave" nfsd process doesn't terminate and a "ps axHl" shows:
> >> 0 48889 1 0 20 0 5884 812 svcexit D - 0:00.01 nfsd: server
> >> 0 48889 1 0 40 0 5884 812 rpcsvc I - 0:00.00 nfsd: server
> >> ... more of the same
> >> 0 48889 1 0 40 0 5884 812 rpcsvc I - 0:00.00 nfsd: server
> >> 0 48889 1 0 -8 0 5884 812 rpcsvc I - 1:51.78 nfsd: server
> >> 0 48889 1 0 -8 0 5884 812 rpcsvc I - 2:27.75 nfsd: server
> >>
> >> You can see that the top thread (the one that was created with the process) is
> >> stuck in "D" on "svcexit".
> >> The rest of the threads are still servicing NFS RPCs.
[lots of stuff snipped]
>Signals sit on a signal queue from the time the signal is
>generated until the thread actually consumes it. I.e. the signal queue
>is a container for the signals which are not yet acted upon. There is
>one signal queue per process, and one signal queue for each thread
>belonging to the process. When you signal the process, the signal is
>put into some thread's signal queue, where the only criterion for the
>selection of the thread is that the signal is not blocked. Since
>SIGKILL is never blocked, it is put anywhere.
>
>Until the signal is delivered by the cursig()/postsig() loop, typically in the
>AST handler, the only consequences of its presence are the EINTR/ERESTART
>errors returned from the PCATCH-enabled sleeps.
Ok, now I think I understand how this works. Thanks a lot for the explanation.
> >Your description at the start of the message of the behaviour after
> >SIGKILL, where other threads continued to serve RPCs, exactly matches
> >above explanation. You need to add some global 'stop' flag, if it is not
I looked at the code and there is already basically a "global stop flag".
It's done by setting the sg_state variable to CLOSING for all thread groups
in a function called svc_exit(). (I missed this when I looked before, so I
didn't understand how all the threads normally terminate.)
In svc_run_internal(), there is a loop "while (state != closing)" that calls
cv_wait_sig()/cv_timedwait_sig(). When these return EINTR/ERESTART, svc_exit()
is called so that all the threads return from the function.
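
To illustrate the pattern, here is a rough userland sketch using pthreads. It is not the kernel code: struct grp_model, worker(), model_exit() and run_model() are names made up for this example, and pthread_cond_wait() stands in for cv_wait_sig() without the EINTR/ERESTART behaviour. It only shows how setting one state variable and waking the sleepers makes every thread leave its loop.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model of the per-group state; not the kernel structure. */
enum grp_state { GRP_ACTIVE, GRP_CLOSING };

struct grp_model {
	pthread_mutex_t	lock;
	pthread_cond_t	cond;
	enum grp_state	state;
	int		exited;		/* workers that have left their loop */
};

/* Worker: sleep until the group is marked closing, then return. */
static void *
worker(void *arg)
{
	struct grp_model *grp = arg;

	pthread_mutex_lock(&grp->lock);
	while (grp->state != GRP_CLOSING)
		pthread_cond_wait(&grp->cond, &grp->lock);
	grp->exited++;
	pthread_mutex_unlock(&grp->lock);
	return (NULL);
}

/* Plays the role svc_exit() has in the text: set the flag, wake everyone. */
static void
model_exit(struct grp_model *grp)
{
	pthread_mutex_lock(&grp->lock);
	grp->state = GRP_CLOSING;
	pthread_cond_broadcast(&grp->cond);
	pthread_mutex_unlock(&grp->lock);
}

/* Start up to 64 workers, request exit, return how many workers exited. */
static int
run_model(int nthreads)
{
	struct grp_model grp;
	pthread_t tids[64];
	int i, started = 0;

	if (nthreads > 64)
		nthreads = 64;
	pthread_mutex_init(&grp.lock, NULL);
	pthread_cond_init(&grp.cond, NULL);
	grp.state = GRP_ACTIVE;
	grp.exited = 0;

	for (i = 0; i < nthreads; i++)
		if (pthread_create(&tids[started], NULL, worker, &grp) == 0)
			started++;

	model_exit(&grp);
	for (i = 0; i < started; i++)
		pthread_join(tids[i], NULL);

	pthread_cond_destroy(&grp.cond);
	pthread_mutex_destroy(&grp.lock);
	return (grp.exited);
}
```

If model_exit() is never called, the workers in this sketch sleep forever, which is the userland analogue of threads that keep servicing RPCs because the state was never set to closing.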
--> The only way it can get into the broken situation I see sometimes is if the
top thread (called "ismaster" by the code) somehow returns from
svc_run_internal() without calling svc_exit(), so that the state isn't set to
"closing".
Turns out there is only one place this can happen. It's this line:
    if (grp->sg_threadcount > grp->sg_maxthreads)
        break;
I wouldn't have thought that sg_threadcount could become greater than
sg_maxthreads, but when I looked at the output of "ps" that I pasted into
the first message, there are 33 threads. (When I started the nfsd, I specified
32 threads, so I think it did the "break;" at this place to get out of the loop
and return from svc_run_internal() without calling svc_exit().)
I think changing the above line to:
    if (!ismaster && grp->sg_threadcount > grp->sg_maxthreads)
will fix it.
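
As a sanity check of the logic, the two conditions can be compared in a small standalone model. This is only a sketch: struct grp_counts and the two function names are invented for the example, and only the field names sg_threadcount and sg_maxthreads come from the code quoted above.

```c
#include <stdbool.h>

/* Hypothetical stand-in holding just the two counters from the condition. */
struct grp_counts {
	int sg_threadcount;
	int sg_maxthreads;
};

/* Current check: any thread, master included, breaks when over the limit. */
static bool
breaks_out_current(const struct grp_counts *grp, bool ismaster)
{
	(void)ismaster;		/* the current check ignores who is asking */
	return (grp->sg_threadcount > grp->sg_maxthreads);
}

/* Proposed check: only non-master threads break when over the limit. */
static bool
breaks_out_proposed(const struct grp_counts *grp, bool ismaster)
{
	return (!ismaster && grp->sg_threadcount > grp->sg_maxthreads);
}
```

With sg_threadcount = 33 and sg_maxthreads = 32 (the numbers from the "ps" output), breaks_out_current() is true for the master, while breaks_out_proposed() is false for the master and still true for any other thread.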
I'll test this and see if I can get it to fail.
Thanks again for your help, rick