Re: ZFS deadlocks triggered by HDD timeouts

Reply: Warner Losh : "Re: ZFS deadlocks triggered by HDD timeouts"
In reply to: Warner Losh : "Re: ZFS deadlocks triggered by HDD timeouts"
From: Alan Somers <asomers_at_freebsd.org>
Date: Wed, 01 Dec 2021 21:35:50 UTC
On Wed, Dec 1, 2021 at 1:56 PM Warner Losh <imp@bsdimp.com> wrote:
>
>
>
> On Wed, Dec 1, 2021 at 1:47 PM Alan Somers <asomers@freebsd.org> wrote:
>>
>> On Wed, Dec 1, 2021 at 1:37 PM Warner Losh <imp@bsdimp.com> wrote:
>> >
>> >
>> >
>> > On Wed, Dec 1, 2021 at 1:28 PM Alan Somers <asomers@freebsd.org> wrote:
>> >>
>> >> On Wed, Dec 1, 2021 at 11:25 AM Warner Losh <imp@bsdimp.com> wrote:
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Dec 1, 2021, 11:16 AM Alan Somers <asomers@freebsd.org> wrote:
>> >> >>
>> >> >> On a stable/13 build from 16-Sep-2021 I see frequent ZFS deadlocks
>> >> >> triggered by HDD timeouts.  The timeouts are probably caused by
>> >> >> genuine hardware faults, but they didn't lead to deadlocks in
>> >> >> 12.2-RELEASE or 13.0-RELEASE.  Unfortunately I don't have much
>> >> >> additional information.  ZFS's stack traces aren't very informative,
>> >> >> and dmesg doesn't show anything besides the usual information about
>> >> >> the disk timeout.  I don't see anything obviously related in the
>> >> >> commit history for that time range, either.
>> >> >>
>> >> >> Has anybody else observed this phenomenon?  Or does anybody have a
>> >> >> good way to deliberately inject timeouts?  CAM makes it easy enough to
>> >> >> inject an error, but not a timeout.  If it did, then I could bisect
>> >> >> the problem.  As it is I can only reproduce it on production servers.
>> >> >
>> >> >
>> >> > What SIM? Timeouts are tricky because they have many sources, some of which are nonlocal...
>> >> >
>> >> > Warner
>> >>
>> >> mpr(4)
>> >
>> >
>> > Is this just a single drive that's acting up, or is the controller initialized as part of the error recovery?
>>
>> I'm not doing anything fancy with mprutil or sas3flash, if that's what
>> you're asking.
>
>
> No. I'm asking if you've enabled debugging on the recovery messages and see that we enter any kind of
> controller reset when the timeouts occur.

No.  My CAM setup is the default except that I enabled CAM_IO_STATS
and changed the following two sysctls:
kern.cam.da.retry_count=2
kern.cam.da.default_timeout=10
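
As a rough illustration of what those two settings change, the sketch
below estimates the worst-case time one I/O can spend in da(4) before
an error is returned to ZFS.  It assumes that every attempt runs to the
full timeout, that retries are strictly sequential, and that the stock
values are retry_count=4 and default_timeout=60.  It ignores any time
spent in controller error recovery.  The function name is made up for
the example.

```python
def worst_case_seconds(retry_count: int, timeout_s: int) -> int:
    """Upper bound on time spent on one I/O: the first attempt plus each retry.

    Assumption: every attempt takes the full timeout and attempts never overlap.
    """
    attempts = 1 + retry_count
    return attempts * timeout_s


# Assumed stock values versus the values quoted in this message.
stock = worst_case_seconds(retry_count=4, timeout_s=60)
tuned = worst_case_seconds(retry_count=2, timeout_s=10)
print(f"stock: {stock}s, tuned: {tuned}s")
```

Under those assumptions the tuned settings surface a failing request to
the upper layers about ten times sooner than the stock ones.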


>
>>
>> > If a single drive,
>> > are there multiple timeouts that happen at the same time such that we time out a request while we're waiting for
>> > the abort command we send to the firmware to be acknowledged?
>>
>> I don't know.
>
>
> OK.
>
>>
>> > Would you be able to run a kgdb script to see
>> > if you're hitting a situation that I fixed in mpr that would cause I/O to never complete in this rather odd circumstance?
>> > If you can, and if it is, then there's a change I can MFC :).
>>
>> Possibly.  When would I run this kgdb script?  Before ZFS locks up,
>> after, or while the problematic timeout happens?
>
>
> After the timeouts. I've been doing 'kgdb' followed by 'source mpr-hang.gdb' to run this.
>
> What you are looking for is anything with a qfrozen_cnt > 0. The script is imperfect and racy
> with normal operations (but not in a bad way), so you may need to run it a couple of times
> to get consistent data. On my systems, there'd be one or two devices with a frozen count > 1
> and no I/O happened on those drives and processes hung. That might not be any different than
> a deadlock :)
>
> Warner
>
> P.S. here's the mpr-hang.gdb script. Not sure if I can make an attachment survive the mailing lists :)

Thanks, I'll try that.  If this is the problem, do you have any idea
why it wouldn't happen on 12.2-RELEASE?  (I haven't seen it on
13.0-RELEASE either, but maybe I just don't have enough runtime on
that version.)
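
Because the script is racy, one way to read several runs is to keep
only the devices that show qfrozen_cnt > 0 in every run.  A minimal
sketch, assuming each run has already been reduced by hand to a mapping
of device name to qfrozen_cnt.  The device names and counts below are
invented for illustration and are not output from mpr-hang.gdb.

```python
def consistently_frozen(runs):
    """Return the devices whose qfrozen_cnt is > 0 in every run, sorted by name."""
    frozen_per_run = [
        {dev for dev, cnt in run.items() if cnt > 0} for run in runs
    ]
    if not frozen_per_run:
        return []
    # A device that is frozen in only some runs is probably a race with normal I/O.
    return sorted(set.intersection(*frozen_per_run))


# Hypothetical data: one dict per run, mapping device name to qfrozen_cnt.
runs = [
    {"da0": 0, "da3": 2, "da7": 1},
    {"da0": 0, "da3": 2, "da7": 0},
    {"da0": 0, "da3": 2, "da7": 0},
]
print(consistently_frozen(runs))
```

A device that stays in the result across runs, with no I/O completing
on it, would match the symptom described above.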

>
> Warner