mlx4en, timer irq @100%... (11.0 stuck on high network load ???)
Ben RUBSON
ben.rubson at gmail.com
Mon Aug 28 08:25:34 UTC 2017
> On 16 Aug 2017, at 11:02, Ben RUBSON <ben.rubson at gmail.com> wrote:
>
>> On 15 Aug 2017, at 23:33, Julien Charbon <jch at freebsd.org> wrote:
>>
>> On 8/11/17 11:32 AM, Ben RUBSON wrote:
>>>> On 08 Aug 2017, at 13:33, Julien Charbon <jch at freebsd.org> wrote:
>>>>
>>>> On 8/8/17 10:31 AM, Hans Petter Selasky wrote:
>>>>>
>>>>> Suggested fix attached.
>>>>
>>>> I agree with your conclusion. Just for the record, more precisely this
>>>> regression seems to have been introduced with:
>>>> (...)
>>>> Thus good catch, and your patch looks good. I am going to just verify
>>>> the other in_pcbrele_wlocked() calls in TCP stack.
>>>
>>> Julien, do you plan to make this fix reach 11.0-p12 ?
>>
>> I am checking if your issue is another flavor of the issue fixed by:
>>
>> https://svnweb.freebsd.org/base?view=revision&revision=307551
>> https://reviews.freebsd.org/D8211
>>
>> This fix is not in 11.0 but is in 11.1. So far I have not found how an
>> inp in INP_TIMEWAIT state could have been INP_FREED without its tw
>> already being set to NULL, except through the issue fixed by r307551.
>>
>> Thus could you try to apply this patch:
>>
>> https://github.com/freebsd/freebsd/commit/acb5bfda99b753d9ead3529d04f20087c5f7d0a0.patch
>>
>> and see if you can still reproduce this issue?
>
> Thank you for your answer Julien.
> Unfortunately, I'm not sure at all how to reproduce the issue.
> I have other servers which are 100% identical to this one, same workload,
> same uptime of a few months, but they have not triggered the bug yet.
>
> If other network stack experts (I'm not) agree with your analysis,
> we could then certainly go further with D8211 / r307551.
>
> One thing that perhaps might help :
> # netstat -an | grep TIME_WAIT$ | wc -l
> 468
>
> Note that while this bug is active, sendmail has great difficulty sending outgoing mail.
> As soon as I run the above netstat command, I receive a lot of queued mails (more than 20 this time).
> As if netstat was able to somehow help...
>
> The number of TIME_WAIT connections, however, does not decrease; it increases.
>
>> And in the spirit of the r307551 fix and based on Hans' patch, I will also
>> propose adding a kernel log message describing the issue instead of entering an
>> infinite loop when INVARIANTS is not set.
>
> Which should then never be triggered :)
> Good idea I think !
What about :
D8211/r307551
+ Hans' patch
+ Julien's idea of a kernel log (sort of "We should not be here but we are")
And backporting all this to 11.0 (and so to 11.1 too) ?
As this bug can affect any FreeBSD machine / server,
leading to an unavailable / unreachable system (this is how mine ended up),
it sounds like this would be a good thing for production stability.
Thank you very much !
Ben
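
PS: to follow the TIME_WAIT count over time rather than with a one-off
netstat, something like the sketch below might help. It only wraps the same
grep filter as the command above; the helper name count_time_wait and the
60-second interval are assumptions chosen for illustration, not an existing tool.

```sh
#!/bin/sh
# Count sockets in TIME_WAIT from `netstat -an`-style output read on stdin.
# count_time_wait is a hypothetical helper name, not an existing utility.
count_time_wait() {
    # Same filter as the one-off command: the state is the last field.
    grep -c 'TIME_WAIT$'
}

# Example sampling loop (commented out so the file can be sourced safely).
# The 60-second interval is an arbitrary choice.
# while :; do
#     printf '%s %s\n' "$(date -u +%H:%M:%S)" "$(netstat -an | count_time_wait)"
#     sleep 60
# done
```

Logging the count with a timestamp would show whether it grows steadily or
in bursts, which may be useful when comparing with the identical servers
that have not hit the bug.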
More information about the freebsd-net mailing list