Re: ZFS deadlocks triggered by HDD timeouts
- Reply: Warner Losh : "Re: ZFS deadlocks triggered by HDD timeouts"
- In reply to: Warner Losh : "Re: ZFS deadlocks triggered by HDD timeouts"
Date: Wed, 01 Dec 2021 20:47:01 UTC
On Wed, Dec 1, 2021 at 1:37 PM Warner Losh <imp@bsdimp.com> wrote:
>
> On Wed, Dec 1, 2021 at 1:28 PM Alan Somers <asomers@freebsd.org> wrote:
>>
>> On Wed, Dec 1, 2021 at 11:25 AM Warner Losh <imp@bsdimp.com> wrote:
>> >
>> > On Wed, Dec 1, 2021, 11:16 AM Alan Somers <asomers@freebsd.org> wrote:
>> >>
>> >> On a stable/13 build from 16-Sep-2021 I see frequent ZFS deadlocks
>> >> triggered by HDD timeouts. The timeouts are probably caused by
>> >> genuine hardware faults, but they didn't lead to deadlocks in
>> >> 12.2-RELEASE or 13.0-RELEASE. Unfortunately I don't have much
>> >> additional information. ZFS's stack traces aren't very informative,
>> >> and dmesg doesn't show anything besides the usual information about
>> >> the disk timeout. I don't see anything obviously related in the
>> >> commit history for that time range, either.
>> >>
>> >> Has anybody else observed this phenomenon? Or does anybody have a
>> >> good way to deliberately inject timeouts? CAM makes it easy enough to
>> >> inject an error, but not a timeout. If it did, then I could bisect
>> >> the problem. As it is I can only reproduce it on production servers.
>> >
>> >
>> > What SIM? Timeouts are tricky because they have many sources, some of which are nonlocal...
>> >
>> > Warner
>>
>> mpr(4)
>
>
> Is this just a single drive that's acting up, or is the controller initialized as part of the error recovery?

I'm not doing anything fancy with mprutil or sas3flash, if that's what
you're asking.

> If a single drive,
> are there multiple timeouts that happen at the same time such that we timeout a request while we're waiting for
> the abort command we send to the firmware to be acknowledged?

I don't know.

> Would you be able to run a kgdb script to see
> if you're hitting a situation that I fixed in mpr that would cause I/O to never complete in this rather odd circumstance?
> If you can, and if it is, then there's a change I can MFC :).

Possibly. When would I run this kgdb script? Before ZFS locks up,
after, or while the problematic timeout happens?

>
> Warner