Infiniband: Mellanox MT26418 in ethernet mode causes crash on shutdown

Sun Feb 24 00:23:25 UTC 2019

Hello,

When running a Mellanox MT26418 in ethernet mode, the kernel crashes
with the following stack trace on system shutdown:

> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address   = 0x0
> fault code      = supervisor read data, page not present
> instruction pointer = 0x20:0xffffffff80e3f5f4
> stack pointer           = 0x28:0xfffffe064abec6e0
> frame pointer           = 0x28:0xfffffe064abec700
> code segment        = base 0x0, limit 0xfffff, type 0x1b
>             = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags    = interrupt enabled, resume, IOPL = 0
> current process     = 1 (init)
> trap number     = 12
> panic: page fault
> cpuid = 0
> KDB: stack backtrace:
> #0 0xffffffff80b4c5b7 at kdb_backtrace+0x67
> #1 0xffffffff80b05b57 at vpanic+0x177
> #2 0xffffffff80b059d3 at panic+0x43
> #3 0xffffffff8106efdf at trap_fatal+0x35f
> #4 0xffffffff8106f039 at trap_pfault+0x49
> #5 0xffffffff8106e807 at trap+0x2c7
> #6 0xffffffff8104f03c at calltrap+0x8
> #7 0xffffffff80e3fae2 at mlx4_en_stop_port+0x3d2
> #8 0xffffffff80e40ff6 at mlx4_en_destroy_netdev+0x1e6
> #9 0xffffffff80e3e47d at mlx4_en_remove+0xcd
> #10 0xffffffff80e1ab01 at mlx4_remove_device+0xb1
> #11 0xffffffff80e1b0b8 at mlx4_unregister_device+0x98
> #12 0xffffffff80e1c5c5 at mlx4_unload_one+0x85
> #13 0xffffffff80e23543 at mlx4_shutdown+0x83
> #14 0xffffffff80d6b6e9 at linux_pci_shutdown+0x39
> #15 0xffffffff80b4004a at bus_generic_shutdown+0x5a
> #16 0xffffffff80b4004a at bus_generic_shutdown+0x5a
> #17 0xffffffff80b4004a at bus_generic_shutdown+0x5a

I've traced the issue to the following lines of code in
sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c in mlx4_en_destroy_netdev():
>     /* Unregister device - this will close the port if it was up */
>     if (priv->registered) {
>         mutex_lock(&mdev->state_lock);
>         ether_ifdetach(dev);
>         mutex_unlock(&mdev->state_lock);
>     }
>
>     mutex_lock(&mdev->state_lock);
>     mlx4_en_stop_port(dev);
>     mutex_unlock(&mdev->state_lock);
> 

The issue is that mlx4_en_stop_port() follows the call chain below and
tries to fetch the MAC address of the device in mlx4_en_put_qp():
mlx4_en_destroy_netdev->mlx4_en_stop_port->mlx4_en_put_qp

This makes the kernel panic because the MAC address has already been
freed by the earlier call to ether_ifdetach(), in if_detach_internal(),
via the following call chain:
mlx4_en_destroy_netdev->ether_ifdetach->if_detach->if_detach_internal

I've written a small workaround that works on our test machine, although
I suspect it could cause issues of its own, since we're now destroying
the port before we destroy the interface. Please see the attached patch
for the workaround.
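
To make the ordering problem concrete, here is a small user-space model
of the two teardown orders discussed above. It is only a sketch and is
not the attached patch: struct model_ifnet and the model_* functions are
hypothetical stand-ins I made up for illustration, and the NULL check
exists only so the model can report the problem instead of faulting the
way the real driver does.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the interface; NOT the real struct ifnet. */
struct model_ifnet {
    unsigned char *lladdr;  /* link-level (MAC) address owned by the interface */
    int port_up;            /* non-zero while the port is running */
};

/* Models ether_ifdetach(): the address goes away with the interface. */
static void model_ifdetach(struct model_ifnet *ifp)
{
    free(ifp->lladdr);
    ifp->lladdr = NULL;
}

/*
 * Models mlx4_en_stop_port() -> mlx4_en_put_qp(), which needs the MAC.
 * Returns 0 on success and -1 if the address is already gone. The check
 * is an assumption of this model only; it is not in the driver.
 */
static int model_stop_port(struct model_ifnet *ifp)
{
    unsigned char mac[6];

    if (!ifp->port_up)
        return 0;
    if (ifp->lladdr == NULL)
        return -1;          /* the real code dereferences here and panics */
    memcpy(mac, ifp->lladdr, sizeof(mac));
    (void)mac;              /* a real driver would unregister this MAC */
    ifp->port_up = 0;
    return 0;
}

/*
 * Runs one teardown. stop_port_first != 0 stops the port before the
 * detach; 0 reproduces the order currently in mlx4_en_destroy_netdev().
 * Returns the result of model_stop_port(), or -2 if allocation fails.
 */
static int model_destroy(int stop_port_first)
{
    /* Arbitrary locally administered example address. */
    static const unsigned char addr[6] = { 0x02, 0, 0, 0, 0, 0x01 };
    struct model_ifnet ifp;
    int rc;

    ifp.lladdr = malloc(sizeof(addr));
    if (ifp.lladdr == NULL)
        return -2;
    memcpy(ifp.lladdr, addr, sizeof(addr));
    ifp.port_up = 1;

    if (stop_port_first) {
        rc = model_stop_port(&ifp);
        model_ifdetach(&ifp);
    } else {
        model_ifdetach(&ifp);
        rc = model_stop_port(&ifp);
    }
    return rc;
}
```

In this model, model_destroy(0) reports the missing address and
model_destroy(1) completes cleanly. It says nothing about whether
stopping the port first is safe with respect to the rest of the network
stack, which is the part I am unsure about.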

Cordially,
Andreas Kempe
Lysator ACS
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mlx_destroy_work_around.patch
Type: text/x-patch
Size: 831 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-net/attachments/20190224/56bc497e/attachment.bin>