vmx0: watchdog timeout on queue 2, no interrupts on BSP
Patrick Kelsey
pkelsey at freebsd.org
Sun Jul 21 20:32:10 UTC 2019
> On Jul 21, 2019, at 4:17 PM, Andriy Gapon <avg at freebsd.org> wrote:
>
>> On 20/07/2019 20:08, Patrick Kelsey wrote:
>>
>>
>> On Fri, Jul 19, 2019 at 10:07 AM Andriy Gapon <avg at freebsd.org> wrote:
>>
>>
>> Recently we experienced a strange problem.
>> We noticed a lot of these messages in the logs:
>> vmx0: watchdog timeout on queue 2
>> (always queue 2)
>> Also, we noticed that connections to some end points did not work at all
>> while others worked without problems. I assume that that was because
>> specific flows got assigned to that queue 2.
>>
>> Further investigation has shown that none of the interrupts assigned to the
>> BSP has ever fired (since boot, of course). That included vmx0:rx2 and
>> vmx0:tx2, but also interrupts for other drivers.
>>
>> Trying to get more information I rebooted the system and the problem
>> disappeared.
>>
>> Has anyone seen anything like that?
>> Any thoughts on possible causes?
>> Any suggestions what to check if/when the problem reoccurs?
>>
>> Thanks!
>>
>>
>> If you are running head at or after r347221 or stable/12 at or after
>> r349112, then this could be due to
>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239118 (see Comment 4
>> - short story is that an iflib change has broken the vmx driver).
>
> I am not sure if that bug could lead to all interrupts on the core
> getting disabled (for all drivers), and right at the boot time.
I am not sure either, but it’s the kind of bug that breaks the design of the vmx driver in such a way that its state can become corrupted to the point where the kernel can panic. I haven’t fully analyzed the potential scope of the memory corruption / hardware state corruption that can occur (because the fix for the issue is already apparent), so I am not ruling out effects beyond the device and driver itself.
If you are saying that zero vmx queue interrupts have occurred anywhere in the system, then I would rule out any connection to this bug, because a prerequisite for the corruption is that at least one such interrupt has occurred.
-Patrick