git: 6c8ab086fed3 - main - ada: Retry commands with retries left on CAM_SEL_TIMEOUT
Date: Sun, 01 May 2022 17:10:56 UTC
The branch main has been updated by imp:

URL: https://cgit.FreeBSD.org/src/commit/?id=6c8ab086fed37a6b44fa84377e48c499f223ae80

commit 6c8ab086fed37a6b44fa84377e48c499f223ae80
Author:     Warner Losh <imp@FreeBSD.org>
AuthorDate: 2022-05-01 16:39:04 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2022-05-01 17:08:56 +0000

    ada: Retry commands with retries left on CAM_SEL_TIMEOUT

    The AHCI and ATA SIMs will return CAM_SEL_TIMEOUT when an underlying
    device has stopped responding. This is usually seen after a timed-out
    command and can be a transient event. Rather than fail the peripheral
    immediately after seeing this, queue a retry. For transient events,
    this allows drives to continue to provide data, though with some added
    latency, just like we do when we have some other kind of retriable
    error. If the error isn't transient (the drive is truly gone), then
    we'll discover that eventually, fail the transaction, and invalidate
    the drive like we do today.

    This helps us avoid a panic at the end of camperiphfree when
    CAM_PERIPH_NEW_DEV_FOUND is set. However, the deferred callback should
    be queued to xpt_async_td instead of being made inline there. This
    issue will be solved in a different patch that does that. PR 263703.

    This also helps us avoid another bug where we can drop all references
    to the device (causing us to go through camperiphfree and destroy the
    path) while we have an I/O pending in the ata_da state machine (usually
    in state ADA_STATE_RAHEAD with an ATA_SETFEATURES ATA_SF_ENAB_RCACHE
    command). It's not clear why the reference that we take out to do the
    reprobe isn't effective at blocking this. By retrying this condition,
    though, we avoid this bug (at least more often; I don't have a good
    reproduction test case, I just see this panic a few times a month at
    work on systems that have transient disk errors on AHCI-connected SATA
    SSDs). PR 263704. It's too soon to know how much this helps us avoid
    this bug.

    Sponsored by:           Netflix
    Differential Revision:  https://reviews.freebsd.org/D34977
---
 sys/cam/ata/ata_da.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sys/cam/ata/ata_da.c b/sys/cam/ata/ata_da.c
index b82671315138..b76058c8f19d 100644
--- a/sys/cam/ata/ata_da.c
+++ b/sys/cam/ata/ata_da.c
@@ -2872,7 +2872,7 @@ adadone(struct cam_periph *periph, union ccb *done_ccb)
 		cam_periph_lock(periph);
 		bp = (struct bio *)done_ccb->ccb_h.ccb_bp;
 		if ((done_ccb->ccb_h.status & CAM_STATUS_MASK) != CAM_REQ_CMP) {
-			error = adaerror(done_ccb, 0, 0);
+			error = adaerror(done_ccb, CAM_RETRY_SELTO, 0);
 			if (error == ERESTART) {
 				/* A retry was scheduled, so just return. */
 				cam_periph_unlock(periph);
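
For readers unfamiliar with the retry flag, the behavior change described in the commit message can be modeled roughly as follows. This is a simplified userland sketch, not the kernel code: every name prefixed with sketch_ or SKETCH_ is a hypothetical stand-in invented for illustration, and the real error path in CAM does considerably more (delays, sense handling, peripheral invalidation). It only illustrates the idea that a selection timeout is retried while retries remain when the retry flag is passed, and is reported as a failure otherwise.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the real CAM definitions. */
enum sketch_status {
	SKETCH_REQ_CMP,		/* request completed successfully */
	SKETCH_SEL_TIMEOUT	/* device did not respond to selection */
};

#define	SKETCH_RETRY_SELTO	0x01	/* caller wants selection timeouts retried */
#define	SKETCH_ERESTART		(-1)	/* "a retry was scheduled" marker */

struct sketch_request {
	enum sketch_status	status;
	int			retry_count;	/* retries still available */
};

/*
 * Decide what to do with a finished request.
 * Returns 0 on success, SKETCH_ERESTART if a retry was consumed,
 * or ENXIO if the device is treated as gone.
 */
static int
sketch_error(struct sketch_request *req, unsigned int flags)
{
	assert(req != NULL);

	if (req->status == SKETCH_REQ_CMP)
		return (0);

	/* Selection timeout: retry only if asked to and retries remain. */
	if ((flags & SKETCH_RETRY_SELTO) != 0 && req->retry_count > 0) {
		req->retry_count--;
		return (SKETCH_ERESTART);
	}

	/* No retry requested, or retries exhausted: fail the request. */
	return (ENXIO);
}
```

In this model, calling sketch_error() with flags of 0 corresponds to the old call in the diff (fail immediately), while passing SKETCH_RETRY_SELTO corresponds to the new call: a transient timeout costs one retry and some latency, and a device that is truly gone still fails once the retries run out.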