Infiniband: Mellanox MT26418 in ethernet mode causes crash on shutdown
Hans Petter Selasky
hps at selasky.org
Sun Feb 24 09:57:23 UTC 2019
On 2/24/19 1:23 AM, Andreas Kempe wrote:
> Hello,
>
> When running a Mellanox MT26418 in ethernet mode, the kernel crashes
> with the following stack trace on system shutdown:
>
>> Fatal trap 12: page fault while in kernel mode
>> cpuid = 0; apic id = 00
>> fault virtual address = 0x0
>> fault code = supervisor read data, page not present
>> instruction pointer = 0x20:0xffffffff80e3f5f4
>> stack pointer = 0x28:0xfffffe064abec6e0
>> frame pointer = 0x28:0xfffffe064abec700
>> code segment = base 0x0, limit 0xfffff, type 0x1b
>> = DPL 0, pres 1, long 1, def32 0, gran 1
>> processor eflags = interrupt enabled, resume, IOPL = 0
>> current process = 1 (init)
>> trap number = 12
>> panic: page fault
>> cpuid = 0
>> KDB: stack backtrace:
>> #0 0xffffffff80b4c5b7 at kdb_backtrace+0x67
>> #1 0xffffffff80b05b57 at vpanic+0x177
>> #2 0xffffffff80b059d3 at panic+0x43
>> #3 0xffffffff8106efdf at trap_fatal+0x35f
>> #4 0xffffffff8106f039 at trap_pfault+0x49
>> #5 0xffffffff8106e807 at trap+0x2c7
>> #6 0xffffffff8104f03c at calltrap+0x8
>> #7 0xffffffff80e3fae2 at mlx4_en_stop_port+0x3d2
>> #8 0xffffffff80e40ff6 at mlx4_en_destroy_netdev+0x1e6
>> #9 0xffffffff80e3e47d at mlx4_en_remove+0xcd
>> #10 0xffffffff80e1ab01 at mlx4_remove_device+0xb1
>> #11 0xffffffff80e1b0b8 at mlx4_unregister_device+0x98
>> #12 0xffffffff80e1c5c5 at mlx4_unload_one+0x85
>> #13 0xffffffff80e23543 at mlx4_shutdown+0x83
>> #14 0xffffffff80d6b6e9 at linux_pci_shutdown+0x39
>> #15 0xffffffff80b4004a at bus_generic_shutdown+0x5a
>> #16 0xffffffff80b4004a at bus_generic_shutdown+0x5a
>> #17 0xffffffff80b4004a at bus_generic_shutdown+0x5a
>
> I've traced the issue to the following lines of code in
> sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c in mlx4_en_destroy_netdev():
>> /* Unregister device - this will close the port if it was up */
>> if (priv->registered) {
>> mutex_lock(&mdev->state_lock);
>> ether_ifdetach(dev);
>> mutex_unlock(&mdev->state_lock);
>> }
>> mutex_lock(&mdev->state_lock);
>> mlx4_en_stop_port(dev);
>> mutex_unlock(&mdev->state_lock);
>>
>
> The issue is that mlx4_en_stop_port() follows the call chain below and
> tries to fetch the MAC address of the device in mlx4_en_put_qp().
> mlx4_en_destroy_netdev->mlx4_en_stop_port->mlx4_en_put_qp
>
> The sequence above causes the kernel to choke because the MAC address
> was freed in the previous call to ether_ifdetach in if_detach_internal
> with the following call chain:
> mlx4_en_destroy_netdev->ether_ifdetach->if_detach->if_detach_internal
>
> I've written a small workaround that works on our test machine, although
> I suspect this could potentially cause issues as we're destroying the
> port before we destroy the interface. Please see the attached patch for
> the workaround.
>
> Cordially,
> Andreas Kempe
> Lysator ACS
CC'ing FreeBSD-drivers at Mellanox.
Thank you for your patch. We'll have a look at it.
--HPS
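
The ordering problem described in the quoted report can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the driver code and not the attached patch: struct model_ifnet and every model_* and teardown_* name below are hypothetical stand-ins, and the model returns -1 at the point where the report describes the kernel faulting.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the network interface; NOT the real struct ifnet. */
struct model_ifnet {
	unsigned char *lladdr;	/* link-level (MAC) address owned by the interface */
	int port_up;
};

static struct model_ifnet *
model_ifnet_create(void)
{
	struct model_ifnet *ifp = calloc(1, sizeof(*ifp));

	if (ifp == NULL)
		abort();
	ifp->lladdr = calloc(6, 1);
	if (ifp->lladdr == NULL)
		abort();
	ifp->port_up = 1;
	return (ifp);
}

/* Models the detach path: the link-level address is released here. */
static void
model_ifdetach(struct model_ifnet *ifp)
{
	free(ifp->lladdr);
	ifp->lladdr = NULL;
}

/*
 * Models the stop-port path, which needs the MAC address to release the QP.
 * Returns 0 on success, or -1 if the address is already gone.
 */
static int
model_stop_port(struct model_ifnet *ifp)
{
	unsigned char mac[6];

	if (ifp->lladdr == NULL)
		return (-1);
	memcpy(mac, ifp->lladdr, sizeof(mac));
	(void)mac;
	ifp->port_up = 0;
	return (0);
}

/* Order from the report: detach first, then stop the port. */
int
teardown_detach_first(void)
{
	struct model_ifnet *ifp = model_ifnet_create();
	int rc;

	model_ifdetach(ifp);
	rc = model_stop_port(ifp);	/* address already freed */
	free(ifp);
	return (rc);
}

/* Order of the described workaround: stop the port, then detach. */
int
teardown_port_first(void)
{
	struct model_ifnet *ifp = model_ifnet_create();
	int rc;

	rc = model_stop_port(ifp);	/* address still valid */
	model_ifdetach(ifp);
	free(ifp);
	return (rc);
}
```

In this model, teardown_detach_first() reports failure and teardown_port_first() succeeds. Whether stopping the port before detaching the interface is safe in the real driver is exactly the open question raised in the report, and the model does not answer it.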