[Bug 229745] ahcich: CAM status: Command timeout
Date: Fri, 22 Jul 2022 15:13:28 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=229745

--- Comment #62 from Warner Losh <imp@FreeBSD.org> ---
This bug should be closed. There are too many different symptoms co-located in this bug that are likely unrelated. There's clearly some bad hardware here. There are clearly some issues with ahci itself that appear to my eye to be 'quirky' resets of the bridge (though w/o traces it will be hard to know). Things that fail to reset on reboot are different from transient errors, which in turn are different from WRITE errors with codes that aren't timeouts. It's hard to sort out all the issues here.

A number of other bugs should be filed to take its place, for real bugs that are reproducible (because otherwise we won't know whether any change fixes the problem or not).

By and large, if a drive hangs, the drive is to blame. If a drive throws write errors, it's always the drive (though the CRC errors might be cabling issues). It makes sense that reducing the write load makes the problem 'disappear': a heavy write load puts a much higher instantaneous load on the drive than it would otherwise see, and drives that have marginal data can cope with retries for a few writes but not with retries on lots of writes all at once. The latter can overwhelm some drives' firmware.

It's also possible that error recovery could be better in ahci, since we do recovery things when we get a timeout. However, those improvements can be hard to roll out and test, because they need real hardware that's basically good but sometimes misbehaves, and most operations will retire / discard such hardware.

--
You are receiving this mail because:
You are the assignee for the bug.