Race between arptimer() and lle removal [WAS: panic in arptimer in r289937]

Hans Petter Selasky hps at selasky.org
Fri Dec 11 11:14:30 UTC 2015


On 12/11/15 11:12, Alexander V. Chernikov wrote:
> 11.12.2015, 12:15, "Hans Petter Selasky" <hps at selasky.org>:
>> Hi,
>>
>> Pulling the nail out of the haystack hopefully.
>>
>>>>   Any ideas on where next to look?
>>
>> Adrian: In your dump aswell I see:
>>
>> la_flags = 1
>>
>> That means there was a race calling arptimer() and removing the "lle".
> Yes. The interesting part here is why lle is removed. There are quite a few reasons: either interface address deleted or interface going down, or explicit delete request.
> That's why I asked Adrian about interface stuff (and haven't got a reply).
>>
>> Alexander: Can you comment on the following patch:
>>
>>   > Index: netinet/if_ether.c
>>   > ===================================================================
>>   > --- netinet/if_ether.c (revision 291256)
>>   > +++ netinet/if_ether.c (working copy)
>>   > @@ -185,7 +185,13 @@
>>   > LLE_WUNLOCK(lle);
>>   > return;
>>   > }
>>   > - ifp = lle->lle_tbl->llt_ifp;
>>   > + if (lle->la_flags & LLE_LINKED) {
>>   > + ifp = lle->lle_tbl->llt_ifp;
>>   > + } else {
>>   > + /* XXX RACE entry has been freed */
>>   > + llentry_free(lle);
>>   > + return;
>>   > + }
>>   > CURVNET_SET(ifp->if_vnet);
>>   >
>>   > if ((lle->la_flags & LLE_DELETED) == 0) {
>>
>> We need a check in arptimer() that the lle is still linked before
> Yes, I had exactly that approach in mind. (And nd6_llinfo_timer() needs the same fix).
> So, would you commit it or should I?
>> proceeding, in there from what I can see. Because the callback is not
>> protected by a mutex, it is not atomically stopped by callout_stop().

Hi,

Talking to Randall offlist, I see some more trouble. Let's get 
everything straight before making a fix. There is one more race I see:

The start of the arptimer() callback looks like this:

 > static void
 > arptimer(void *arg)
 > {
POINT0
 >         LLE_WLOCK(lle);
 >         if (callout_pending(&lle->lle_timer)) {
POINT1
 >                 LLE_WUNLOCK(lle);
 >                 return;
 >         }

The code starting the callback looks like this:

 >                 LLE_ADDREF(la);
 >                 la->la_expire = time_uptime;
 >                 canceled = callout_reset(&la->lle_timer, hz * 
V_arpt_down,
 >                     arptimer, la);
 >                 if (canceled)
 >                         LLE_REMREF(la);

Which can be written like this:

 > la->la_expire = time_uptime;
 > canceled = callout_reset(&la->lle_timer, hz * V_arpt_down,
 >     arptimer, la);
 > if (canceled == 0)
 >     LLE_ADDREF(la);

In case we are at POINT0 in arptimer, callout_reset() will not be able 
to cancel the callout and will return 0. We should also drop one ref at 
POINT1, or rewrite the code a bit, which might need Randall's help in 
the callout subsystem area.

--HPS


More information about the freebsd-net mailing list