[PATCH]: ipoib with mlx4 initialisation ordering

Wed Feb 26 23:00:26 UTC 2020

On Wed, Feb 26, 2020 at 10:52:56PM +0100, Hans Petter Selasky wrote:
> On 2020-02-26 22:30, Andreas Kempe wrote:
> > Index: sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > ===================================================================
> > --- sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_main.c	(revision 356611)
> > +++ sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_main.c	(working copy)
> > @@ -1739,7 +1739,7 @@
> >   }
> >   module_init(ipoib_init_module);
> > -module_exit(ipoib_cleanup_module);
> > +module_exit_order(ipoib_cleanup_module, SI_ORDER_FOURTH);
> >   static int
> >   ipoib_evhand(module_t mod, int event, void *arg)
> > 
> 
> I haven't yet found time to reproduce this issue.
> 

No worries, there is absolutely no rush from my side. We can patch our
machines ourselves with the initial patch until some sort of solution
gets adopted upstream.

> Possibly you're right that the list order matters.
> 

I would also have guessed that the patch above would have solved the
issue. When the ipoib module is torn down, it should, as far as I can
tell from only reading the code, remove all the multicast groups.
Without hooking up the kernel debugger again, I can't say for sure why
it would still hang.

I'm providing the wall of text below in hopes it can help you or
anyone that wishes to debug this issue further.

The only reason I really said that the list ordering matters is that
mlx4_ib_remove calls ib_unregister_device which in turn walks the
client list in the reverse order. Printing each list element as the
list is iterated during shutdown yields the following client order
(the first client to be removed at the top of the list):

ib_unregister_device: ib_client->name = uverbs  
ib_unregister_device: ib_client->name = ucm
ib_unregister_device: ib_client->name = umad
ib_unregister_device: ib_client->name = cm
ib_unregister_device: ib_client->name = ib_multicast                                                
ib_unregister_device: ib_client->name = sa                                                          
ib_unregister_device: ib_client->name = mad
ib_unregister_device: ib_client->name = cma
ib_unregister_device: ib_client->name = ipoib                           
ib_unregister_device: ib_client->name = sdp

If the interface is up and running and has sent data when the machine
is shut down, it hangs on list index 4,
i.e. ib_unregister_device: ib_client->name = ib_multicast.
The reason it hangs is the wait in mcast_remove_one, see below:

sys/ofed/drivers/infiniband/core/ib_multicast.c:
> static void mcast_remove_one(struct ib_device *device, void *client_data)
> {
>     struct mcast_device *dev = client_data;
>     struct mcast_port *port;
>     int i;
> 
>     if (!dev)
>         return;
> 
>     ib_unregister_event_handler(&dev->event_handler);
>     flush_workqueue(mcast_wq);
> 
>     for (i = 0; i <= dev->end_port - dev->start_port; i++) {
>         if (rdma_cap_ib_mcast(device, dev->start_port + i)) {
>             port = &dev->port[i];
>             deref_port(port);
>             wait_for_completion(&port->comp);
>         }
>     }
> 
>     kfree(dev);
> }
>
> [...]
>
> tatic void deref_port(struct mcast_port *port)
> {
>     if (atomic_dec_and_test(&port->refcount))
>         complete(&port->comp);
> }

The crucial logic is in the deref_port(port) function call. It does a
check whether the reference counter for port is zero after
decrementing the count. If it is zero, complete is called on
port->comp.

In the case where the reference count is larger than zero after the
decrement, complete is never called and we hang forever on
wait_for_completion in mcast_remove_one.

By moving the initialisation of ipoib, we got it to be removed before
the ib_multicast client, causing the reference count to be exactly 1
going into mcast_remove_one, preventing the hang.

Cordially,
Andreas Kempe
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-infiniband/attachments/20200227/31acacec/attachment.sig>