NFS-related hang in 5.4?
Robert Watson
rwatson at FreeBSD.org
Mon Jun 20 08:35:43 GMT 2005
On Mon, 20 Jun 2005, Eirik Øverby wrote:
>> Hmm. Looks like a bug in dummynet. ipfw should not be directly re-
>> injecting UDP traffic back into the input path from an outbound path,
>> or it risks re-entering, generating lock order problems, etc. It should
>> be getting dropped into the netisr queue to be processed from the
>> netisr context.
>
> This problem would exist across all 5.4 installations, both i386 and
> amd64? Would it depend on heavy load, or could it theoretically happen
> at any time when there's traffic? All three of my fbsd5 servers (dual
> opteron, dual p3-1ghz, dual p3-700mhz) are experiencing random hangs
> with ~a few weeks between, impression is that if running single-cpu mode
> they are all stable. All using dummynet in a comparable manner. Ideas?
Yes. Basically, the network stack avoids recursion in processing for
"complicated" packets by deferring processing of an offending packet to a
thread called the 'netisr'. Whenever the stack reaches a possible
recursion point on a packet, it's supposed to queue the packet for
processing 'later' in a per-protocol queue, unwind, and then when the
netisr runs, pick up and continue processing. In the stack trace you
provided, dummynet appears to invoke the in-bound network path directly
from the out-bound network path, walking back into the network stack.
This is generally forbidden, for a variety of reasons:
- We do allow the in-bound path to call the out-bound path, so that
protocols like TCP, and services like NFS can turn around packets
without a context switch. If further recursion is permitted, the stack
may overflow.
- Both paths may hold network stack locks over calls in either direction
-- specifically, we allow protocol locks to be held over calls into the
socket layer, as the protocol layer drives operation; if a recursive
call is made, deadlocks can occur due to violating the lock order. This
is what is happening in your case.
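
To make the first point concrete, here is a toy user-space model (Python,
not kernel code) of the difference between re-entering the input path
directly and queueing the packet for later. Every name in it (ToyStack,
simulate, the "hops" field) is made up for illustration, and the single
deque only stands in for the per-protocol netisr queue; it is a sketch of
the idea, not of how dummynet or the netisr is implemented.

```python
from collections import deque


class ToyStack:
    """Toy model of a stack with a deferred-input queue (hypothetical names)."""

    def __init__(self):
        self.deferred = deque()  # stands in for a per-protocol netisr queue
        self.depth = 0           # current nesting of input/output calls
        self.max_depth = 0       # deepest nesting observed so far
        self.delivered = []      # packet ids seen by the input path

    def _enter(self):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def _leave(self):
        self.depth -= 1

    def input(self, pkt, defer):
        """In-bound path: may call the out-bound path directly (allowed)."""
        self._enter()
        try:
            self.delivered.append(pkt["id"])
            if pkt["hops"] > 0:
                reply = {"id": pkt["id"] + 1, "hops": pkt["hops"] - 1}
                self.output(reply, defer)
        finally:
            self._leave()

    def output(self, pkt, defer):
        """Out-bound path that loops the packet back to the input side."""
        self._enter()
        try:
            if defer:
                # Queue the packet and unwind; it is picked up later.
                self.deferred.append(pkt)
            else:
                # Direct re-entry: the call chain grows with every hop.
                self.input(pkt, defer)
        finally:
            self._leave()

    def run_deferred(self):
        """Plays the role of the netisr: drain the queue from the top level."""
        while self.deferred:
            self.input(self.deferred.popleft(), defer=True)


def simulate(hops, defer):
    """Return (deepest call nesting, delivered packet ids) for one run."""
    stack = ToyStack()
    stack.input({"id": 0, "hops": hops}, defer)
    stack.run_deferred()
    return stack.max_depth, stack.delivered
```

In this model both modes deliver the same packets, but with direct
re-entry the nesting depth grows with the number of hops, while with
deferral it stays at two (one input call plus one output call).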
Pretty much all network code is entirely architecture-independent, so bugs
typically span architectures, although race conditions can sometimes be
hard to reproduce if they require precise timing and multiple processors.
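
Because a lock order reversal only turns into an actual deadlock under the
right timing, it can help to record the order in which locks are taken and
flag a reversal even on a run that did not hang. The sketch below is a toy
version of that idea in Python. The class and the lock names "protocol"
and "socket" are illustrative assumptions loosely based on the description
above; it only checks direct pairs of locks and is not how any kernel lock
checker is implemented.

```python
class LockOrderChecker:
    """Toy lock-order recorder: flags A-then-B followed later by B-then-A."""

    def __init__(self):
        self.before = set()   # (a, b) means a was held when b was acquired
        self.held = []        # locks currently held, in acquisition order
        self.violations = []  # (held_lock, new_lock) pairs that reverse an order

    def acquire(self, name):
        for h in self.held:
            # A reversal: we previously saw `name` held while taking `h`.
            if (name, h) in self.before:
                self.violations.append((h, name))
            self.before.add((h, name))
        self.held.append(name)

    def release(self, name):
        self.held.remove(name)


def check(sequence):
    """Replay ("acquire" | "release", lock_name) steps; return violations."""
    checker = LockOrderChecker()
    for op, name in sequence:
        getattr(checker, op)(name)
    return checker.violations
```

For example, a run that takes "protocol" then "socket", releases both, and
later takes "socket" then "protocol" is reported as a reversal, even
though a single thread replaying those steps never blocks.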
>> Is it possible to configure dummynet out of your configuration, and see
>> if the problem goes away?
>
> I'm running a test right now, will let you know in the morning.
Thanks.
Robert N M Watson