Re: ZFS deadlocks triggered by HDD timeouts

From: Warner Losh <imp_at_bsdimp.com>
Date: Wed, 01 Dec 2021 20:36:58 UTC
On Wed, Dec 1, 2021 at 1:28 PM Alan Somers <asomers@freebsd.org> wrote:

> On Wed, Dec 1, 2021 at 11:25 AM Warner Losh <imp@bsdimp.com> wrote:
> >
> >
> >
> > On Wed, Dec 1, 2021, 11:16 AM Alan Somers <asomers@freebsd.org> wrote:
> >>
> >> On a stable/13 build from 16-Sep-2021 I see frequent ZFS deadlocks
> >> triggered by HDD timeouts.  The timeouts are probably caused by
> >> genuine hardware faults, but they didn't lead to deadlocks in
> >> 12.2-RELEASE or 13.0-RELEASE.  Unfortunately I don't have much
> >> additional information.  ZFS's stack traces aren't very informative,
> >> and dmesg doesn't show anything besides the usual information about
> >> the disk timeout.  I don't see anything obviously related in the
> >> commit history for that time range, either.
> >>
> >> Has anybody else observed this phenomenon?  Or does anybody have a
> >> good way to deliberately inject timeouts?  CAM makes it easy enough to
> >> inject an error, but not a timeout.  If it did, then I could bisect
> >> the problem.  As it is I can only reproduce it on production servers.
> >
> >
> > What SIM? Timeouts are tricky because they have many sources, some of
> > which are nonlocal...
> >
> > Warner
>
> mpr(4)
>

Is this just a single drive that's acting up, or is the controller
initialized as part of the error recovery? If it's a single drive, are
there multiple timeouts that happen at the same time, such that we time
out a request while we're waiting for the abort command we send to the
firmware to be acknowledged? Would you be able to run a kgdb script to
see if you're hitting a situation that I fixed in mpr that would cause
I/O to never complete in this rather odd circumstance? If you can, and
if it is, then there's a change I can MFC :).

Warner