Re: Upperlimit for bwait()

From: Warner Losh <imp_at_bsdimp.com>
Date: Thu, 30 May 2024 13:53:59 UTC
On Thu, May 30, 2024 at 1:54 AM Kumara Babu <nkumarababu@gmail.com> wrote:

> Hello,
>
> There have been a few incidents reported on Juniper devices with FreeBSD,
> where buffer IO operations sleep for more than 30 mins. Theoretically, this
> can happen due to faulty hardware or in virtual platforms due to faulty
> connection between guest and host, filesystem corruption, too many buffer
> IO operations, and/or host not responding due to various reasons. When that
> happens, as these buffer IO writes hold a lock before going to sleep, the
> threads waiting for that lock starve for just as long. There is no upper
> limit for this bwait() as of now. If that wait goes beyond 30 mins for a
> sleeping thread OR 15 mins for a thread blocked on a turnstile, deadlkres
> crashes the kernel assuming a possible deadlock.
>
Why isn't the I/O timing out? That's the real problem.

> We perhaps could gracefully handle such lengthy buffer IO operations by
> adding a timeout in bwait() - like say 10 minutes. If the buffer IO is not
> completed in a few mins, it probably will never complete and/or is
> slowing down the entire system. So it is better to stop such
> faulty IO operations.
>
I think that's a terrible idea. Why aren't the I/Os timing out?

> For now, since we have seen these instances only with BIO operations, I
> have a patch to set this value only from bufwait(). Please find the patch
> attached. I am not very sure if 10 mins is a good upper limit for all the
> scenarios for bwait(). If it is, then we could just change msleep() in
> bwait() to set a 10 mins upper limit by default.
>
I've never seen this on any of the thousands of systems I've used.

> Please let me know if this approach works for all the use cases. If not,
> is there a better alternative? And is 10 mins okay for a timeout?
>
Making sure that the I/Os time out.

And by that, I mean doing what we do in CAM. All the SIMs ensure that
transactions posted to the device will time out. Most of the SIMs create a
per-transaction timeout which, when it expires, completes the CCB with a
timeout status. The periph drivers then see this status and fail the I/O
with a timed-out status, or maybe retry it a couple of times first,
depending on the hardware and its recovery methods (e.g., whether the
timeout is due to the drive, the link, the HBA, etc. will result in
different recovery). NVMe (nvd) does similar things: a timeout will cause
the NVMe card to be reset and we try again, but eventually fail.
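
The pattern described above (a deadline on every transaction, a bounded
number of retries, then a hard failure) can be sketched in user-space C.
This is only an illustration under stated assumptions: struct io_req,
submit_with_timeout() and the two device models are hypothetical names
invented for the example, not CAM or nvme(4) interfaces, and the "device"
is modelled as a function that reports how long it took to answer.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical request descriptor; not a CAM CCB or an NVMe tracker. */
struct io_req {
	int timeout_ms;   /* per-transaction deadline */
	int max_attempts; /* total tries before the I/O is failed */
	int attempts;     /* tries actually used */
	int error;        /* 0 on success, errno value on failure */
};

/* Hypothetical device model: latency in ms for an attempt, -1 = no answer. */
typedef int (*device_fn)(int attempt);

int dead_device(int attempt) { (void)attempt; return -1; }
int flaky_device(int attempt) { return attempt < 2 ? -1 : 20; }

/*
 * Issue the request, giving each attempt its own deadline. The caller
 * always gets an answer: success, or ETIMEDOUT once retries run out.
 */
int submit_with_timeout(struct io_req *req, device_fn dev)
{
	assert(req->timeout_ms > 0 && req->max_attempts > 0);

	for (req->attempts = 1; req->attempts <= req->max_attempts;
	    req->attempts++) {
		int latency_ms = dev(req->attempts);

		if (latency_ms >= 0 && latency_ms <= req->timeout_ms) {
			req->error = 0;
			return 0;
		}
		/*
		 * Deadline expired. A real driver would abort the command
		 * or reset the link/controller here before retrying.
		 */
	}
	req->attempts = req->max_attempts;
	req->error = ETIMEDOUT;
	return req->error;
}
```

The point of the sketch is that the upper layer never needs its own
timeout: a request against dead_device comes back with ETIMEDOUT after
max_attempts deadlines instead of sleeping indefinitely.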

One might also wonder why 30s is the timeout for most of the commands. I
get that 'special' commands might need a longer timeout, but we likely
should look at lowering this somewhat. 15s is almost certainly safe. 10s is
probably safe. 5s will work, but you start to get P99.9999 outliers on
popular, completely working spinning rust, and P99.9 on marginal drives, so
it can be a bit tricky to change (we'll have to phase it in). That could
make things a bit better in terms of worst-case recovery time.
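
As a rough illustration of why the per-command timeout matters for
recovery time: if a dead device burns one full timeout on the first attempt
and on every retry, the delay before the I/O is failed upward scales
linearly with the timeout. The helper below is a hypothetical sketch; the
retry count is an assumed input chosen by the caller, and reset or
error-recovery time between attempts is ignored.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Worst-case seconds before an I/O to an unresponsive device is failed:
 * the first attempt plus each retry consume one full timeout.
 */
unsigned worst_case_recovery_s(unsigned timeout_s, unsigned retries)
{
	assert(timeout_s > 0);
	return timeout_s * (retries + 1);
}

/* Print the bound for the timeouts discussed above, for a given retry count. */
void print_recovery_table(unsigned retries)
{
	static const unsigned timeouts_s[] = { 30, 15, 10, 5 };

	for (size_t i = 0; i < sizeof(timeouts_s) / sizeof(timeouts_s[0]); i++)
		printf("timeout %2us, %u retries: %us worst case\n",
		    timeouts_s[i], retries,
		    worst_case_recovery_s(timeouts_s[i], retries));
}
```

With an assumed 4 retries, for example, a 30s timeout gives a 150s bound
and a 10s timeout gives 50s.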

So why aren't the I/Os timing out? That is the real question here.

Warner


> Thanks and Regards,
>
> Kumara
>