Infiniband: Mellanox MT26418 in ethernet mode causes crash on shutdown

Sun Feb 24 09:57:23 UTC 2019

On 2/24/19 1:23 AM, Andreas Kempe wrote:
> Hello,
> 
> When running a Mellanox MT26418 in ethernet mode, the kernel crashes
> with the following stack trace on system shutdown:
> 
>> Fatal trap 12: page fault while in kernel mode
>> cpuid = 0; apic id = 00
>> fault virtual address   = 0x0
>> fault code      = supervisor read data, page not present
>> instruction pointer = 0x20:0xffffffff80e3f5f4
>> stack pointer           = 0x28:0xfffffe064abec6e0
>> frame pointer           = 0x28:0xfffffe064abec700
>> code segment        = base 0x0, limit 0xfffff, type 0x1b
>>              = DPL 0, pres 1, long 1, def32 0, gran 1
>> processor eflags    = interrupt enabled, resume, IOPL = 0
>> current process     = 1 (init)
>> trap number     = 12
>> panic: page fault
>> cpuid = 0
>> KDB: stack backtrace:
>> #0 0xffffffff80b4c5b7 at kdb_backtrace+0x67
>> #1 0xffffffff80b05b57 at vpanic+0x177
>> #2 0xffffffff80b059d3 at panic+0x43
>> #3 0xffffffff8106efdf at trap_fatal+0x35f
>> #4 0xffffffff8106f039 at trap_pfault+0x49
>> #5 0xffffffff8106e807 at trap+0x2c7
>> #6 0xffffffff8104f03c at calltrap+0x8
>> #7 0xffffffff80e3fae2 at mlx4_en_stop_port+0x3d2
>> #8 0xffffffff80e40ff6 at mlx4_en_destroy_netdev+0x1e6
>> #9 0xffffffff80e3e47d at mlx4_en_remove+0xcd
>> #10 0xffffffff80e1ab01 at mlx4_remove_device+0xb1
>> #11 0xffffffff80e1b0b8 at mlx4_unregister_device+0x98
>> #12 0xffffffff80e1c5c5 at mlx4_unload_one+0x85
>> #13 0xffffffff80e23543 at mlx4_shutdown+0x83
>> #14 0xffffffff80d6b6e9 at linux_pci_shutdown+0x39
>> #15 0xffffffff80b4004a at bus_generic_shutdown+0x5a
>> #16 0xffffffff80b4004a at bus_generic_shutdown+0x5a
>> #17 0xffffffff80b4004a at bus_generic_shutdown+0x5a
> 
> I've traced the issue to the following lines of code in
> sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c in mlx4_en_destroy_netdev():
>>      /* Unregister device - this will close the port if it was up */
>>      if (priv->registered) {
>>          mutex_lock(&mdev->state_lock);
>>          ether_ifdetach(dev);
>>          mutex_unlock(&mdev->state_lock);
>>      }
>>      mutex_lock(&mdev->state_lock);
>>      mlx4_en_stop_port(dev);
>>      mutex_unlock(&mdev->state_lock);
>>
> 
> The issue is that mlx4_en_stop_port() follows the call chain below and
> tries to fetch the MAC address of the device in mlx4_en_put_qp.
> mlx4_en_destroy_netdev->mlx4_en_stop_port->mlx4_en_put_qp
> 
> The sequence above causes the kernel to choke because the MAC address
> was freed in the previous call to ether_ifdetach in if_detach_internal
> with the following call chain:
> mlx4_en_destroy_netdev->ether_ifdetach->if_detach->if_detach_internal
> 
> I've written a small workaround that works on our test machine, although
> I suspect this could potentially cause issues as we're destroying the
> port before we destroy the interface. Please see the attached patch for
> the workaround.
> 
> Cordially,
> Andreas Kempe
> Lysator ACS

CC'ing FreeBSD-drivers at Mellanox.

Thank you for your patch. We'll have a look at it.

--HPS