NFS-related hang in 5.4?
Eirik Øverby
ltning at anduin.net
Mon Jun 20 12:12:38 GMT 2005
On 20. jun. 2005, at 10.38, Robert Watson wrote:
>
> On Mon, 20 Jun 2005, Eirik Øverby wrote:
>
>
>
>>> Hmm. Looks like a bug in dummynet. ipfw should not be directly
>>> re-injecting UDP traffic back into the input path from an
>>> outbound path, or it risks re-entering, generating lock order
>>> problems, etc. It should be getting dropped into the netisr queue
>>> to be processed from the netisr context.
>>>
>>>
>>
>> This problem would exist across all 5.4 installations, both i386
>> and amd64? Would it depend on heavy load, or could it
>> theoretically happen at any time when there's traffic? All three
>> of my fbsd5 servers (dual opteron, dual p3-1ghz, dual p3-700mhz)
>> are experiencing random hangs a few weeks apart; my impression
>> is that when running in single-cpu mode they are all stable.
>> All using dummynet in a comparable manner. Ideas?
>>
>>
>
> Yes. Basically, the network stack avoids recursion in processing
> for "complicated" packets by deferring processing an offending
> packet to a thread called the 'netisr'. Whenever the stack reaches
> a possible recursion point on a packet, it's supposed to queue the
> packet for processing 'later' in a per-protocol queue, unwind, and
> then when the netisr runs, pick up and continue processing. In the
> stack trace you provide, dummynet appears to immediately
> invoke the in-bound network path from the out-bound
> network path, walking back into the network stack from the outbound
> path. This is generally forbidden, for a variety of reasons:
>
> - We do allow the in-bound path to call the out-bound path, so that
>   protocols like TCP, and services like NFS can turn around packets
>   without a context switch. If further recursion is permitted, the
>   stack may overflow.
>
> - Both paths may hold network stack locks over calls in either
>   direction -- specifically, we allow protocol locks to be held over
>   calls into the socket layer, as the protocol layer drives operation;
>   if a recursive call is made, deadlocks can occur due to violating
>   the lock order. This is what is happening in your case.
>
> Pretty much all network code is entirely architecture-independent,
> so bugs typically span architectures, although race conditions can
> sometimes be hard to reproduce if they require precise timing and
> multiple processors.
>
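For anyone reading along, here is a toy model of the deferral described above. It is only a sketch in Python, not kernel code: the class ToyStack, its methods, and the "hops" counter are all made up for illustration, and the single deque merely stands in for a per-protocol netisr queue. It shows why handing a packet straight back to the input path nests calls ever deeper, while queueing it and unwinding keeps the nesting bounded.

```python
from collections import deque


class ToyStack:
    """Toy model of a network stack; every name here is hypothetical."""

    def __init__(self, defer):
        self.defer = defer      # True: queue for later; False: re-inject directly
        self.queue = deque()    # stands in for a per-protocol netisr queue
        self.depth = 0          # current nesting of input/output calls
        self.max_depth = 0      # deepest nesting observed so far

    def _enter(self):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def ip_input(self, hops):
        """In-bound path; allowed to turn a packet around via the out-bound path."""
        self._enter()
        if hops > 0:
            self.ip_output(hops - 1)
        self.depth -= 1

    def ip_output(self, hops):
        """Out-bound path; a shaper-like hook hands the packet back to input."""
        self._enter()
        if hops > 0:
            if self.defer:
                self.queue.append(hops - 1)   # enqueue, then unwind
            else:
                self.ip_input(hops - 1)       # direct re-injection: recursion
        self.depth -= 1

    def run_netisr(self):
        """Drain the queue from a fresh context, where depth is back at zero."""
        while self.queue:
            self.ip_input(self.queue.popleft())


def max_nesting(defer, hops=10):
    """Return the deepest call nesting seen while one packet bounces `hops` times."""
    stack = ToyStack(defer)
    stack.ip_input(hops)
    stack.run_netisr()
    return stack.max_depth
```

In this model max_nesting(defer=False) grows with the number of turnarounds, while max_nesting(defer=True) never exceeds two levels (one input call plus the output call it makes). The lock-order problem is not modelled here; the sketch covers only the recursion half of the argument.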
So I'm lucky to have seen this one... Great ;)
>>> Is it possible to configure dummynet out of your configuration,
>>> and see if the problem goes away?
>>>
>>>
>>
>> I'm running a test right now, will let you know in the morning.
>>
>>
>
> Thanks.
>
I know enough not to call this a "confirmation", but disabling
dummynet did indeed allow me to finish the backup. I never made it
past 15GB before; now the full 19GB tar.gz file is done, and the
boxes are both still running. The funny thing is - I only disabled
dummynet on one of the boxes now - the source of the backup, the box
that pushes data. The other box has pretty much 100% the same setup,
and is also i386. But as traffic shaping can only happen on outgoing
packets, I suppose that makes sense.
I can try re-running the test if you wish, in order to gain
more statistics. It's just too bad it takes a while ;)
/Eirik
>
> Robert N M Watson
>