[Bug 277499] panic in doneq0 xpt_done_td xpt_done_process after HDD falling off the bus (Periph destroyed)
Date: Tue, 05 Mar 2024 19:30:50 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277499

--- Comment #3 from Warner Losh <imp@FreeBSD.org> ---
commit 6c8ab086fed37a6b44fa84377e48c499f223ae80
Author: Warner Losh <imp@FreeBSD.org>
Date:   Sun May 1 10:39:04 2022 -0600

    ada: Retry commands with retries left on CAM_SEL_TIMEOUT

    The AHCI and ATA SIMs will return CAM_SEL_TIMEOUT when an underlying
    device has stopped responding. This is usually seen after a timeouted
    out command and can be a transient event. Rather than fail the
    peripheral immediately after seeing this, queue a retry. For transient
    events, this allows drives to continue to provide data, though with
    some added latency, just like we do when we have some other kind of
    retriable error. If the error isn't transient (the drive is truly
    gone), then we'll discover that eventually and fail the transaction and
    invalidate the drive like we do today.

    This helps us avoid a panic at the end of camperiphfree when
    CAM_PERIPH_NEW_DEV_FOUND is set. However, the deferred callback should
    be queued to xpt_async_td instead of being made inline there. This
    issue will be solved in a different patch that does that. PR 263703.

    This also helps us avoid another bug where we can drop all references
    to the device (causing us to go through camperiphfree and destroy the
    path) while we have an I/O pending in the ata_da state machine (usually
    in state ADA_STATE_RAHEAD with ATA_SETFEATURES ATA_SF_ENAB_RCACHE
    command). It's not clear why the reference that we take out to do the
    reprobe isn't effective at blocking this. By retrying this condition,
    though we avoid this bug (at least more often, I don't have a good
    reproduction test case, I just see this panic a few times a month at
    work on systems that have transient disk errors on ahci connected SATA
    SSDs). PR 263704. It's too soon to know how much this helps us avoid
    this bug.

    Sponsored by:           Netflix
    Differential Revision:  https://reviews.freebsd.org/D34977
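
The retry policy described in the commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the code from the commit: every name in it (model_status, model_action, model_ccb, model_handle_completion, model_run) is hypothetical, and the real CAM structures, status codes and error-recovery paths are considerably more involved. It shows only the decision the message describes: a selection timeout is retried while retries remain, and the device is given up on once they are exhausted.

```c
#include <assert.h>

/* Hypothetical stand-ins; these are NOT the real CAM status codes. */
enum model_status { MODEL_REQ_CMP, MODEL_SEL_TIMEOUT };

/* What the (hypothetical) error handler decides to do with a command. */
enum model_action { MODEL_DONE, MODEL_RETRY, MODEL_INVALIDATE };

/* Hypothetical, heavily reduced stand-in for a command block. */
struct model_ccb {
	enum model_status status;
	int retries_left;
};

/*
 * Decide what to do with a completed command.  A selection timeout is
 * treated as possibly transient while retries remain; only once they
 * run out is the device considered gone.
 */
static enum model_action
model_handle_completion(struct model_ccb *ccb)
{
	if (ccb->status == MODEL_REQ_CMP)
		return (MODEL_DONE);
	if (ccb->retries_left > 0) {
		ccb->retries_left--;
		return (MODEL_RETRY);
	}
	return (MODEL_INVALIDATE);
}

/*
 * Issue one command against a simulated device that reports a
 * selection timeout for the first `timeouts_before_success` attempts
 * and succeeds afterwards.  Stores the number of attempts made in
 * *attempts and returns the final action.
 */
static enum model_action
model_run(int retries, int timeouts_before_success, int *attempts)
{
	struct model_ccb ccb = { MODEL_REQ_CMP, retries };
	enum model_action act;

	*attempts = 0;
	do {
		ccb.status = (*attempts < timeouts_before_success) ?
		    MODEL_SEL_TIMEOUT : MODEL_REQ_CMP;
		(*attempts)++;
		act = model_handle_completion(&ccb);
	} while (act == MODEL_RETRY);
	return (act);
}
```

In this model a device that times out twice and then recovers completes its command on the third attempt, at the cost of added latency, while a device that never recovers is invalidated once the retry budget is used up, which mirrors the transient and non-transient cases in the commit message.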