Re: ZFS deadlocks triggered by HDD timeouts
- Reply: Alan Somers : "Re: ZFS deadlocks triggered by HDD timeouts"
- In reply to: Alan Somers : "Re: ZFS deadlocks triggered by HDD timeouts"
Date: Sat, 04 Dec 2021 00:18:51 UTC
Hey Alan,

On Fri, Dec 3, 2021 at 8:38 AM Alan Somers <asomers@freebsd.org> wrote:
> On Wed, Dec 1, 2021 at 3:48 PM Warner Losh <imp@bsdimp.com> wrote:
> > On Wed, Dec 1, 2021, 3:36 PM Alan Somers <asomers@freebsd.org> wrote:
> >> On Wed, Dec 1, 2021 at 2:46 PM Warner Losh <imp@bsdimp.com> wrote:
> >> > On Wed, Dec 1, 2021, 2:36 PM Alan Somers <asomers@freebsd.org> wrote:
> >> >> On Wed, Dec 1, 2021 at 1:56 PM Warner Losh <imp@bsdimp.com> wrote:
> >> >> > On Wed, Dec 1, 2021 at 1:47 PM Alan Somers <asomers@freebsd.org> wrote:
> >> >> >> On Wed, Dec 1, 2021 at 1:37 PM Warner Losh <imp@bsdimp.com> wrote:
> >> >> >> > On Wed, Dec 1, 2021 at 1:28 PM Alan Somers <asomers@freebsd.org> wrote:
> >> >> >> >> On Wed, Dec 1, 2021 at 11:25 AM Warner Losh <imp@bsdimp.com> wrote:
> >> >> >> >> > On Wed, Dec 1, 2021, 11:16 AM Alan Somers <asomers@freebsd.org> wrote:
> >> >> >> >> >> On a stable/13 build from 16-Sep-2021 I see frequent ZFS deadlocks triggered by HDD timeouts. The timeouts are probably caused by genuine hardware faults, but they didn't lead to deadlocks in 12.2-RELEASE or 13.0-RELEASE. Unfortunately I don't have much additional information. ZFS's stack traces aren't very informative, and dmesg doesn't show anything besides the usual information about the disk timeout. I don't see anything obviously related in the commit history for that time range, either.
> >> >> >> >> >>
> >> >> >> >> >> Has anybody else observed this phenomenon? Or does anybody have a good way to deliberately inject timeouts? CAM makes it easy enough to inject an error, but not a timeout. If it did, then I could bisect the problem. As it is I can only reproduce it on production servers.
> >> >> >> >> >
> >> >> >> >> > What SIM? Timeouts are tricky because they have many sources, some of which are nonlocal...
> >> >> >> >> >
> >> >> >> >> > Warner
> >> >> >> >>
> >> >> >> >> mpr(4)
> >> >> >> >
> >> >> >> > Is this just a single drive that's acting up, or is the controller initialized as part of the error recovery?
> >> >> >>
> >> >> >> I'm not doing anything fancy with mprutil or sas3flash, if that's what you're asking.
> >> >> >
> >> >> > No. I'm asking if you've enabled debugging on the recovery messages and see that we enter any kind of controller reset when the timeouts occur.
> >> >>
> >> >> No. My CAM setup is the default except that I enabled CAM_IO_STATS and changed the following two sysctls:
> >> >> kern.cam.da.retry_count=2
> >> >> kern.cam.da.default_timeout=10
> >> >>
> >> >> >> > If a single drive, are there multiple timeouts that happen at the same time such that we timeout a request while we're waiting for the abort command we send to the firmware to be acknowledged?
> >> >> >>
> >> >> >> I don't know.
> >> >> >
> >> >> > OK.
> >> >> >
> >> >> >> > Would you be able to run a kgdb script to see if you're hitting a situation that I fixed in mpr that would cause I/O to never complete in this rather odd circumstance? If you can, and if it is, then there's a change I can MFC :).
> >> >> >>
> >> >> >> Possibly. When would I run this kgdb script? Before ZFS locks up, after, or while the problematic timeout happens?
> >> >> >
> >> >> > After the timeouts. I've been doing 'kgdb' followed by 'source mpr-hang.gdb' to run this.
> >> >> >
> >> >> > What you are looking for is anything with a qfrozen_cnt > 0.. The script is imperfect and racy with normal operations (but not in a bad way), so you may need to run it a couple of times to get consistent data. On my systems, there'd be one or two devices with a frozen count > 1 and no I/O happened on those drives and processes hung. That might not be any different than a deadlock :)
> >> >> >
> >> >> > Warner
> >> >> >
> >> >> > P.S. here's the mpr-hang.gdb script. Not sure if I can make an attachment survive the mailing lists :)
> >> >>
> >> >> Thanks, I'll try that. If this is the problem, do you have any idea why it wouldn't happen on 12.2-RELEASE (I haven't seen it on 13.0-RELEASE, but maybe I just don't have enough runtime on that version).
> >> >
> >> > 9781c28c6d63 was merged to stable/13 as a996b55ab34c on Sept 2nd. I fixed a bug with that version in current as a8837c77efd0, but haven't merged it. I kinda expect that this might be the cause of the problem. But in Netflix's fleet we've seen this maybe a couple of times a week over many thousands of machines, so I've been a little cautious in merging it to make sure that it's really fixed. So far, the jury is out.
> >> >
> >> > Warner
> >>
> >> Well, I'm experiencing this error much more frequently than you then. I've seen it on about 10% of similarly-configured servers and they've only been running that release for 1 week.
> >
> > You can run my script soon then to see if it's the same thing.
> >
> > Warner
> >
> >> -Alan
>
> That confirms it. I hit the deadlock again, and qfrozen_cnt was between 1 and 3 for four devices: two da devices (we use multipath) and their accompanying pass devices. So I should try merging a8837c77efd0 next?

Yes. I'd planned on merging it this weekend, but if you wanted a jump on me, that's the next step.

Warner
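
P.S. If it helps to have the local cherry-pick spelled out, it's roughly the following. This is just a sketch, not something I've run against your tree: 'freebsd' here stands in for whatever you've named the upstream src remote, and it assumes you're building from a stable/13 checkout.

    git fetch freebsd                  # 'freebsd' = your name for the upstream src remote
    git checkout stable/13
    git cherry-pick -x a8837c77efd0    # the unmerged fix from current

Then rebuild and install the kernel (or at least the mpr module) on one of the affected machines and see whether the deadlocks stop.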