[Bug 229745] ahcich: CAM status: Command timeout

From: <bugzilla-noreply_at_freebsd.org>
Date: Thu, 08 Feb 2024 17:43:27 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=229745

--- Comment #76 from Warner Losh <imp@FreeBSD.org> ---
(In reply to Kevin Zheng from comment #75)
> The issue that I'm writing about is the system behavior. It seemed that all I/O (or maybe just writes?) to the ZFS pool were stalled waiting for the disk to time out and reattach, despite the fact that I have other working mirror devices. It seems to me that one hardware issue with one disk shouldn't stall the whole pool.
>
> I'm not actually sure if this problem is happening on the ZFS level or in CAM or the SATA subsystem; if this happens again, what debugging steps would determine what the cause of this is?

Yes. So there are a few things going on here. First, ZFS has ordering requirements
that it enforces by scheduling some I/O (especially writes) only after the I/O it
depends on has completed. The ZFS code ensures by this means that its log is always
in a consistent state. That means that if some I/O hangs for a long period of time
(more than a second or five), then the I/O that depends on it completing is delayed
as well. This could have the effect of causing processes to hang waiting for that
I/O to complete.

So while I'd agree that one misbehaving disk shouldn't hang the pool, I can see
how it might. How can ZFS know what to schedule, consistent with its desire to
keep the log consistent, if any disk could suddenly stop writing? Now, I'm not
enough of a ZFS expert to know whether coping with this situation is one of its
goals. I'd check with the ZFS developers to see whether they'd expect ZFS not to
stall if one disk stalls for a long time. ZFS also tries to pipeline its stream
of I/Os as much as possible, and one stalling disk interferes with that pipeline.

One way to mitigate this, however, could be to set the timeout down from 30s to
something smaller like 3-5s (SSD) or 8-12s (HDD), and the number of retries down
to 2 (it has to be greater than 1 for most controllers due to deficiencies in
their recovery protocols, which are kinda hard to fix). That could help bring the
hangs down from 90s to more like 5-10s (SSD) or 15-20s (HDD), which would be less
noticeable in a wide range of workloads (though certainly not all).
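
For concreteness, here's a sketch of the kind of settings I mean, using the stock
ada(4) tunables (the exact values are a judgment call for your drives, not a
recommendation):

    # /boot/loader.conf (also settable at runtime with sysctl(8))
    # Fail a stuck ATA command after 5 seconds instead of the default 30.
    kern.cam.ada.default_timeout="5"
    # Retry twice; this needs to stay above 1 because of the controller
    # recovery deficiencies mentioned above.
    kern.cam.ada.retry_count="2"

da(4)-attached disks have analogous kern.cam.da.* knobs, if I'm remembering the
names right.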

There may be ZFS-specific tunings that you might be able to try if this happens
often. Maybe smaller (or, paradoxically, larger) I/Os, by creating the pools with
a smaller (or larger) logical block size (ashift); this might help align the I/O
to the physical NAND blocks better (hence maybe bigger is what's needed). Also,
partitioning the drive such that it starts on a good LBA boundary (I often keep
1MB at the start of disks unused because that's still less than physical block
sizes, but also a trivial amount... I expect to bump this to 8MB or 16MB in the
future). That might help keep whatever bug or pathology in the drive that's
leading to the hangs from occurring (though there's no guarantee: maybe it's a
bug in the firmware that's impossible to avoid).
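
A hypothetical sketch of both ideas (the device names, the 1MB alignment, and the
ashift value are placeholders to adjust, not a prescription for this drive):

    # Leave the first 1MB of the disk unused and keep the partition 1MB-aligned.
    gpart create -s gpt da0
    gpart add -t freebsd-zfs -a 1m da0

    # Build the pool with 4k (2^12) blocks regardless of what the drive reports.
    zpool create -o ashift=12 tank mirror da0p1 da1p1

Whether ashift=12 (or something larger) actually matches the drive's NAND geometry
is something you'd have to find out from the vendor, so treat this as illustrative.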

-- 
You are receiving this mail because:
You are the assignee for the bug.