lockmgr() in network driver context
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Tue, 01 Aug 2023 14:47:55 UTC
Hi all, I’m encountering i2c bus failures that need more graceful error handling. The driver operates its management threads in polling mode within callout context. As a hardware constraint, communication via said i2c bus can only happen for a single network port at a time and thus needs to be mtx_lock()'d. With this lock held, i2c transfers are initiated by the driver and the done event is signaled via an ISR. Since we cannot sleep both in callout (not allowed according to [1]) and mutex context (will panic), the second thread waiting on the bus will spin if it does not live on the same CPU[1], and the i2c transfer itself is handled via a DELAY() busy loop. If the bus fails, we have 2 cores spinning at 100%, causing connections to drop and all sorts of other carnage to happen. Reducing the timeout is not an option, as the transfers might take a while. Sleeping under a MTX_DEF is not possible, instant panic. However, if I replace the first mtx_lock() with a lockmgr() call that was initialized with LK_TIMELOCK | LK_CANRECURSE and a given timeout, I'm able to put the thread waiting on the comms bus to be available to sleep. If I replace the hardcoded DELAY()s further down the line during the i2c transfers with mtx_sleep() (and wakeup_one() in the ISR), no CPU cycles are wasted, this is also the case for normal operation, improving the current situation by a mile. As to the nature of these bus failures, they seem to happen very infrequently and nigh on impossible to reproduce on a production system, but they always seem to recover after a certain amount of time. To reproduce the issue, I have wired some physical switches to the i2c bus to simulate the outages. From a design perspective, the i2c bus should be a singleton and should be treated as such via the queueing of transfer tasks including metadata to link such transfers back to specific port ids, however, a change like this would require significant effort with a large risk of regressions, and I'm not even sure if sleeping in such a scenario would be considered a valid thing to do in FreeBSD context, so I would prefer sticking to the battle-hardened code. The documentation for lockmgr() tells me to stay away from it, but I seem to have little options left. Also, the solution seems to violate the FreeBSD guidelines for sleeping in callout context. Any thoughts? Thanks in advance. [1] https://man.freebsd.org/cgi/man.cgi?locking(9)