Upper limit for bwait()

From: Kumara Babu <nkumarababu_at_gmail.com>
Date: Thu, 30 May 2024 05:53:35 UTC
Hello,

We have had a few incidents reported on Juniper devices running FreeBSD
where buffer I/O operations sleep for more than 30 minutes. In theory this
can happen because of faulty hardware or, on virtual platforms, a faulty
connection between guest and host, filesystem corruption, too many buffer
I/O operations, and/or the host not responding for various reasons. When
it happens, because these buffer I/O writes hold a lock before going to
sleep, threads waiting for that lock starve for just as long. There is
currently no upper limit on the wait in bwait(). If the wait exceeds 30
minutes for a sleeping thread, or 15 minutes for a thread blocked on a
turnstile, deadlkres panics the kernel on the assumption that there is a
deadlock.

We could perhaps handle such lengthy buffer I/O operations gracefully by
adding a timeout to bwait(), say 10 minutes. If a buffer I/O has not
completed within a few minutes, it will probably never complete, and/or it
is slowing down the entire system. So it is better to stop such faulty
I/O operations.

For now, since we have seen these instances only with BIO operations, I
have a patch that sets this value only from bufwait(). Please find the
patch attached. I am not sure whether 10 minutes is a good upper limit for
every bwait() scenario. If it is, we could simply change the msleep() call
in bwait() to use a 10-minute upper limit by default.
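
For reference, msleep() takes its timeout in ticks, so the limit has to be
scaled by hz. The rough arithmetic below is a sketch with hypothetical
helper names. The two thresholds are the 30 and 15 minute figures quoted
earlier in this mail, and any hz value passed in is only an example.

```c
#include <assert.h>
#include <limits.h>

/* Thresholds quoted in this mail: 30 min sleeping, 15 min blocked. */
#define SLEEP_THRESHOLD_S	(30 * 60)
#define BLOCK_THRESHOLD_S	(15 * 60)

/* Hypothetical helper: convert seconds to ticks for a given tick rate. */
static int
secs_to_ticks(int secs, int hz)
{
	assert(secs >= 0 && hz > 0 && secs <= INT_MAX / hz);
	return (secs * hz);
}

/*
 * Hypothetical helper: seconds of headroom between the proposed timeout
 * and the tighter of the two thresholds. Negative means the timeout
 * would expire too late to help.
 */
static int
timeout_margin_s(int timeout_s)
{
	int tightest = BLOCK_THRESHOLD_S < SLEEP_THRESHOLD_S ?
	    BLOCK_THRESHOLD_S : SLEEP_THRESHOLD_S;

	return (tightest - timeout_s);
}
```

With a 10-minute timeout, timeout_margin_s(600) leaves 300 seconds of
headroom against the 15-minute limit for blocked threads.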

Please let me know whether this approach works for all use cases. If not,
is there a better alternative? And is 10 minutes a reasonable timeout?

Thanks and Regards,

Kumara