debugging frequent kernel panics on 8.2-RELEASE

Jamie Gritton jamie at FreeBSD.org
Sun Aug 21 07:26:40 UTC 2011


On 08/20/11 19:19, Steven Hartland wrote:
> ----- Original Message ----- From: "Andriy Gapon" <avg at FreeBSD.org>
>
>> on 20/08/2011 23:24 Steven Hartland said the following:
>>> ----- Original Message ----- From: "Steven Hartland"
>>>> Looking through the code I believe I may have noticed a scenario
>>>> which could
>>>> trigger the problem.
>>>>
>>>> Given the following code:-
>>>>
>>>> static void
>>>> prison_deref(struct prison *pr, int flags)
>>>> {
>>>> struct prison *ppr, *tpr;
>>>> int vfslocked;
>>>>
>>>> if (!(flags & PD_LOCKED))
>>>> mtx_lock(&pr->pr_mtx);
>>>> /* Decrement the user references in a separate loop. */
>>>> if (flags & PD_DEUREF) {
>>>> for (tpr = pr;; tpr = tpr->pr_parent) {
>>>> if (tpr != pr)
>>>> mtx_lock(&tpr->pr_mtx);
>>>> if (--tpr->pr_uref > 0)
>>>> break;
>>>> KASSERT(tpr != &prison0, ("prison0 pr_uref=0"));
>>>> mtx_unlock(&tpr->pr_mtx);
>>>> }
>>>> /* Done if there were only user references to remove. */
>>>> if (!(flags & PD_DEREF)) {
>>>> mtx_unlock(&tpr->pr_mtx);
>>>> if (flags & PD_LIST_SLOCKED)
>>>> sx_sunlock(&allprison_lock);
>>>> else if (flags & PD_LIST_XLOCKED)
>>>> sx_xunlock(&allprison_lock);
>>>> return;
>>>> }
>>>> if (tpr != pr) {
>>>> mtx_unlock(&tpr->pr_mtx);
>>>> mtx_lock(&pr->pr_mtx);
>>>> }
>>>> }
>>>>
>>>> If you take a scenario of a simple one level prison setup running a
>>>> single
>>>> process
>>>> where a prison has just been stopped.
>>>>
>>>> In the above code pr_uref of the processes prison is decremented. As
>>>> this is the
>>>> last process then pr_uref will hit 0 and the loop continues instead
>>>> of breaking
>>>> early.
>>>>
>>>> Now at the end of the loop iteration the mtx is unlocked so other
>>>> process can
>>>> now manipulate the jail, this is where I think the problem may be.
>>>>
>>>> If we now have another process come in and attach to the jail but
>>>> then instantly
>>>> exit, this process may allow another kernel thread to hit this same
>>>> bit of code
>>>> and so two process for the same prison get into the section which
>>>> decrements
>>>> prison0's pr_uref, instead of only one.
>>>>
>>>> In essence I think we can get the following flow where 1# = process1
>>>> and 2# = process2
>>>> 1#1. prison1.pr_uref = 1 (single process jail)
>>>> 1#2. prison_deref( prison1,...
>>>> 1#3. prison1.pr_uref-- (prison1.pr_uref = 0)
>>>> 1#3. prison1.mtx_unlock <-- this now allows others to change
>>>> prison1.pr_uref
>>>> 1#3. prison0.pr_uref--
>>>> 2#1. process1.attach( prison1 ) (prison1.pr_uref = 1)
>>>> 2#2. process1.exit
>>>> 2#3. prison_deref( prison1,...
>>>> 2#4. prison1.pr_uref-- (prison1.pr_uref = 0)
>>>> 2#5. prison1.mtx_unlock <-- this now allows others to change
>>>> prison1.pr_uref
>>>> 2#5. prison0.pr_uref-- (prison1.pr_ref has now been decremented
>>>> twice by prison1)
>>>>
>>>> It seems like the action on the parent prison to decrement the
>>>> pr_uref is
>>>> happening too early, while the jail can still be used and without
>>>> the lock on
>>>> the child jails mtx, so causing a race condition.
>>>>
>>>> I think the fix is to the move the decrement of parent prison
>>>> pr_uref's down
>>>> so it only takes place if the jail is "really" being removed. Either
>>>> that or
>>>> to change the locking semantics so that once the lock is aquired in
>>>> this
>>>> prison_deref its not unlocked until the function completes.
>>>>
>>>> What do people think?
>>>
>>> After reviewing the changes to prison_deref in commit which added
>>> hierarchical
>>> jails, the removal of the lock by the inital loop on the passed in
>>> prison may
>>> be unintentional.
>>> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/kern_jail.c.diff?r1=1.101;r2=1.102;f=h
>>>
>>>
>>>
>>> If so the following may be all that's needed to fix this issue:-
>>>
>>> diff -u sys/kern/kern_jail.c.orig sys/kern/kern_jail.c
>>> --- sys/kern/kern_jail.c.orig 2011-08-20 21:17:14.856618854 +0100
>>> +++ sys/kern/kern_jail.c 2011-08-20 21:18:35.307201425 +0100
>>> @@ -2455,7 +2455,8 @@
>>> if (--tpr->pr_uref > 0)
>>> break;
>>> KASSERT(tpr != &prison0, ("prison0 pr_uref=0"));
>>> - mtx_unlock(&tpr->pr_mtx);
>>> + if (tpr != pr)
>>> + mtx_unlock(&tpr->pr_mtx);
>>> }
>>> /* Done if there were only user references to remove. */
>>> if (!(flags & PD_DEREF)) {
>>
>> Not sure if this would fly as is - please double check the later block
>> where
>> pr->pr_mtx is re-locked.
>
> Your right, and its actually more complex than that. Although changing
> it to
> not unlock in the middle of prison_deref fixes that race condition it
> doesn't
> prevent pr_uref being incorrectly decremented each time the jail gets into
> the dying state, which is really the problem we are seeing.
>
> If hierarchical prisons are used there seems to be an additional problem
> where the counter of all prisons in the hierarchy are decremented, but as
> far as I can tell only the immediate parent is ever incremented, so another
> reference problem there as well I think.
>
> The following patch I believe fixes both of these issues.
>
> I've testing with debug added and confirmed prison0's pr_uref is maintained
> correctly even when a jail hits dying state multiple times.
>
> It essentially reverts the changes to the "if (flags & PD_DEUREF)" by
> 192895 and moves it to after the jail has been actually removed.
>
> diff -u sys/kern/kern_jail.c.orig sys/kern/kern_jail.c
> --- sys/kern/kern_jail.c.orig 2011-08-20 21:17:14.856618854 +0100
> +++ sys/kern/kern_jail.c 2011-08-21 01:56:58.429894825 +0100
> @@ -2449,27 +2449,16 @@
> mtx_lock(&pr->pr_mtx);
> /* Decrement the user references in a separate loop. */
> if (flags & PD_DEUREF) {
> - for (tpr = pr;; tpr = tpr->pr_parent) {
> - if (tpr != pr)
> - mtx_lock(&tpr->pr_mtx);
> - if (--tpr->pr_uref > 0)
> - break;
> - KASSERT(tpr != &prison0, ("prison0 pr_uref=0"));
> - mtx_unlock(&tpr->pr_mtx);
> - }
> + pr->pr_uref--;
> /* Done if there were only user references to remove. */
> if (!(flags & PD_DEREF)) {
> - mtx_unlock(&tpr->pr_mtx);
> + mtx_unlock(&pr->pr_mtx);
> if (flags & PD_LIST_SLOCKED)
> sx_sunlock(&allprison_lock);
> else if (flags & PD_LIST_XLOCKED)
> sx_xunlock(&allprison_lock);
> return;
> }
> - if (tpr != pr) {
> - mtx_unlock(&tpr->pr_mtx);
> - mtx_lock(&pr->pr_mtx);
> - }
> }
>
> for (;;) {
> @@ -2525,6 +2514,8 @@
> /* Removing a prison frees a reference on its parent. */
> pr = ppr;
> mtx_lock(&pr->pr_mtx);
> + /* Ensure user reference added on create is removed */
> + pr->pr_uref--;
> flags = PD_DEREF;
> }
> }
>
> Jamie from what I can tell you where the original committer of hierarchical
> prisons and this in the following svn change set so would really appreciate
> your feedback on this.
> http://svnweb.freebsd.org/base?view=revision&revision=192895

The problem isn't with the conditional locking of tpr in prison_deref.
That locking is actually correct, and there's no race condition. The
trouble lies in the resurrection of dead jails, as Andriy has noted
(though not just attaching, but also by setting its persist flag causes
the same problem).

There are two possible fixes to this. One is the patch you've given,
which only decrements a parent jail's pr_uref when the child jail
completely goes away (as opposed to when it loses its last uref). This
provides symmetry with the current way pr_uref is incremented on the
parent, which is only when a jail is created.

The other fix is to increment a parent's pr_uref when a jail is
resurrected, which will match the current logic in prison_deref. I like
the external semantics of this solution: a jail isn't visible if it is
not persistent and has no processes and no *visible* sub-jails, as
opposed to having no sub-jails at all. But this solution ends up pretty
complicated - there are a few places where pr_uref is incremented, where
I might need to increment parent jails' pr_uref as well, much like the
current tpr loop in prison_deref decrements them.

Your solution removes code instead of adding it, which is generally a
good thing. While it does change the semantics of pr_uref in the
hierarchical case at least from what I thought it was, those semantics
haven't been working properly anyway.

Bjoern, I'm adding you to the CC list for this because the whole pr_uref
thing was your idea (though it was pr_nprocs at the time), so you might
care about the hierarchical semantics of it - or you may not. Also, this
is a panic-inducing bug in current and may interest you for that reason.

- Jamie


More information about the freebsd-stable mailing list