NFE adapter 'hangs'
Melissa Jenkins
melissa-freebsd at littlebluecar.co.uk
Fri Oct 15 12:59:59 UTC 2010
On 4 Sep 2010, at 01:53, Pyun YongHyeon wrote:
> On Fri, Sep 03, 2010 at 07:59:26AM +0100, Melissa Jenkins wrote:
>>
>> Thank you for your very quick response :)
>>
>
> [...]
>
>>> Also I'd like to know whether both RX and TX are dead or only one
>>> RX/TX path is hung. Can you see incoming traffic with tcpdump when
>>> you think the controller is in stuck?
>>
>> Yes, though not very much. The traffic to 4800 is every second so you can see in the following trace when it stops
>>
>> 07:10:42.287163 IP 192.168.1.203 > 224.0.0.240: pfsync 108
>> 07:10:42.911995
>> 07:10:43.112073 STP 802.1d, Config, Flags [Topology change], bridge-id 8000.c4:7d:4f:a9:ac:30.8008, length 43
>> 07:10:43.148659 IP 192.168.1.203.57026 > 192.168.1.255.4800: UDP, length 60
>> 07:10:43.148684 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP 192.168.1.203.57026 > 192.168.1.129.4800: UDP, length 60
>> 07:10:43.148689 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP 192.168.1.203.57026 > 192.168.1.1.4800: UDP, length 60
>> 07:10:43.148918 IP 192.168.1.213.40677 > 192.168.1.255.4800: UDP, length 48
>
> [...]
>
>> a bit later on, still broken, a slight odd message:
>> 07:11:43.079720 IP 172.31.1.129 > 172.31.1.213: GREv0, length 52: IP 192.168.1.129.60446 > 192.168.1.213.179: tcp 12 [bad hdr length 16 - too short, < 20]
>> 07:11:44.210794 IP 172.31.1.129 > 172.31.1.203: GREv0, length 84: IP 192.168.1.129.64744 > 192.168.1.203.4800: UDP, length 52
>> 07:11:44.210831 IP 172.31.1.129 > 172.31.1.213: GREv0, length 84: IP 192.168.1.129.64744 > 192.168.1.213.4800: UDP, length 52
>>
>> Now this really is odd, I don't recognise either of those MAC addresses, though the SQL shown is used on this machine (
>> 07:12:13.054393 45:43:54:20:41:63 > 00:00:03:53:45:4c, ethertype Unknown (0x6374), length 60:
>> 0x0000: 556e 6971 7565 4964 2046 524f 4d20 7261 UniqueId.FROM.ra
>> 0x0010: 6461 6363 7420 2057 4845 5245 2043 616c dacct..WHERE.Cal
>> 0x0020: 6c69 6e67 5374 6174 696f 6e49 6420 lingStationId.
>
> Hmm, it seems you're using really complex setup. It's very hard to
> narrow down guilty ones under these environments. Could you setup
> simple network configuration that reproduces the issue? One of
> possible cause would be wrong(garbled) data might be passed up to
> upper stack. But I have no idea why you see GRE packets with
> truncated TCP header(172.31.1.129 > 172.31.1.213).
> How about disabling TX/RX checksum offloading as well as TSO?
>
> [...]
>
>>
>> I then restarted the interface (nfe down/up, route restart)
>>
>> From dmesg at the time (slight obfuscated)
>> Sep 3 07:10:19 manch2 bgpd[89612]: neighbor XX: received notification: HoldTimer expired, unknown subcode 0
>> Sep 3 07:10:49 manch2 bgpd[89612]: neighbor XX connect: Host is down
>> # at this point I took the interface down & up and reloaded the routing tables
>> Sep 3 07:12:07 manch2 kernel: carp0: link state changed to DOWN
>> Sep 3 07:12:07 manch2 kernel: carp0: link state changed to DOWN
>> Sep 3 07:12:07 manch2 kernel: nfe0: link state changed to DOWN
>> Sep 3 07:12:07 manch2 kernel: carp0: link state changed to DOWN
>> Sep 3 07:12:11 manch2 kernel: nfe0: link state changed to UP
>> Sep 3 07:12:11 manch2 kernel: carp0: link state changed to DOWN
>> Sep 3 07:12:14 manch2 kernel: carp0: link state changed to UP
>
> Hmm, it does not look right, carp0 showed link DOWN message four
> times in a row.
> By the way, are you using IPMI on MCP55? nfe(4) is not ready to
> handle MAC operation with IPMI.
Turning off TX & RX checksum offloading seems to have resolved the problem:
ifconfig nfe0 -txcsum -rxcsum
This seems to have stopped both the corruption and the interface hanging. I ran it for about 16 hours on the FreeBSD 8 box. It also appears to have fixed the problem on my FreeBSD 7 machine.
I didn't try turning off TSO.
Thank you for your suggestion & help!
Mel
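
A note on the odd frame with the unrecognised MAC addresses quoted above: the "addresses" look like payload bytes rather than real hardware addresses. On the wire an Ethernet header is the destination MAC, then the source MAC, then the ethertype, so putting tcpdump's fields back in that order and reading them as ASCII recovers the start of the SQL statement. The following is only a minimal sketch of that check; the byte values are copied from the trace in this thread, and the function names are made up for illustration.

```python
# Values copied from the tcpdump output quoted in this thread.
DST_MAC = "00:00:03:53:45:4c"
SRC_MAC = "45:43:54:20:41:63"
ETHERTYPE = 0x6374
PAYLOAD_HEX = (
    "556e 6971 7565 4964 2046 524f 4d20 7261 "
    "6461 6363 7420 2057 4845 5245 2043 616c "
    "6c69 6e67 5374 6174 696f 6e49 6420"
)


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' to its 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def reassemble_frame() -> bytes:
    """Rebuild the frame in wire order: dst MAC, src MAC, ethertype, payload."""
    return (
        mac_to_bytes(DST_MAC)
        + mac_to_bytes(SRC_MAC)
        + ETHERTYPE.to_bytes(2, "big")
        + bytes.fromhex(PAYLOAD_HEX.replace(" ", ""))
    )


def printable(data: bytes) -> str:
    """Render bytes the way tcpdump does: printable ASCII as-is, the rest as '.'."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in data)


frame = reassemble_frame()
text = printable(frame)
print(len(frame), repr(text))
```

The reassembled frame is 60 bytes, matching the length tcpdump reported, and apart from three leading non-printable bytes it reads as "SELECT AcctUniqueId FROM radacct  WHERE CallingStationId ". That would be consistent with garbled data being handed up the stack as if it were a frame, as suggested earlier in the thread, though it does not by itself show where the corruption happens.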