Possible CAM regression: Error handling (retries) change between 13.2 and 14

From: Borja Marcos <borjam_at_sarenet.es>
Date: Thu, 15 Feb 2024 05:29:37 UTC
Hi,

I have updated a system from 13.2 to 14 and I have noticed a change in SATA error handling.

Although I am replacing the troublesome disk, I am not sure whether this is a regression.

Before the update, the server had been logging some SATA errors, each followed by a message that the command was being retried:

Feb  4 20:07:04 micro1 kernel: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 28 01 00 40 00 00 00 00 00 00
Feb  4 20:07:04 micro1 kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Feb  4 20:07:04 micro1 kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
Feb  4 20:07:04 micro1 kernel: (ada0:ahcich0:0:0:0): RES: 41 10 28 01 00 40 00 00 00 00 00
Feb  4 20:07:04 micro1 kernel: (ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
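For readers less familiar with these messages, the status and error bytes in the log above can be decoded bit by bit. The following is only a minimal sketch in Python, assuming the traditional ATA register bit mnemonics (some of them are obsolete in newer ATA standards); the table and function names are made up for illustration and are not taken from the kernel sources.

```python
# Traditional ATA status register bits (assumed mnemonics, MSB first).
STATUS_BITS = [(0x80, "BSY"), (0x40, "DRDY"), (0x20, "DF"), (0x10, "SERV"),
               (0x08, "DRQ"), (0x04, "CORR"), (0x02, "INDEX"), (0x01, "ERR")]

# Traditional ATA error register bits (assumed mnemonics, MSB first).
ERROR_BITS = [(0x80, "ICRC"), (0x40, "UNC"), (0x20, "MC"), (0x10, "IDNF"),
              (0x08, "MCR"), (0x04, "ABRT"), (0x02, "NM"), (0x01, "ILI")]


def decode_bits(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


# Values from the log line "ATA status: 41 (DRDY ERR), error: 10 (IDNF )".
status = decode_bits(0x41, STATUS_BITS)
error = decode_bits(0x10, ERROR_BITS)
print(status, error)
```

Here status 0x41 has the DRDY and ERR bits set, and error 0x10 is IDNF ("ID not found"), which a drive reports when it cannot find the requested sector address.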

However, it seems the command was retried. I am not sure whether there is some backplane issue or it is really a disk problem. That
said, the ZFS pool was scrubbed monthly and there was never any corruption. Moreover, the server does quite a lot of disk I/O and I haven’t seen any
hiccups.

After updating to 14 I have seen a similar pattern, but error handling has changed:

Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 f8 01 00 40 00 00 00 00 00 00
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): CAM status: Auto-Sense Retrieval Failed
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): Error 5, Unretryable error
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 f8 03 00 40 00 00 00 00 00 00
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): CAM status: Auto-Sense Retrieval Failed
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): Error 5, Unretryable error
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 f8 9f 50 40 5d 01 00 00 00 00
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): CAM status: Auto-Sense Retrieval Failed
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): Error 5, Unretryable error
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 f8 a1 50 40 5d 01 00 00 00 00
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): CAM status: Auto-Sense Retrieval Failed
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): Error 5, Unretryable error
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 02 00 40 00 00 00 00 00 00
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): CAM status: Auto-Sense Retrieval Failed
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): Error 5, Unretryable error
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 9e 50 40 5d 01 00 00 00 00
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): CAM status: Auto-Sense Retrieval Failed
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): Error 5, Unretryable error
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 a0 50 40 5d 01 00 00 00 00
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): CAM status: Auto-Sense Retrieval Failed
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): Error 5, Unretryable error
Feb  9 16:59:52 micro1 ZFS[4228]: vdev I/O failure, zpool=unpul path=/dev/ada0 offset=270336 size=8192 error=5
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 02 00 40 00 00 00 00 00 00
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): CAM status: Auto-Sense Retrieval Failed
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): Error 5, Unretryable error
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 9e 50 40 5d 01 00 00 00 00
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): CAM status: Auto-Sense Retrieval Failed
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): Error 5, Unretryable error
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 a0 50 40 5d 01 00 00 00 00
Feb  9 16:59:52 micro1 kernel: (ada0:ahcich0:0:0:0): CAM status: Auto-Sense Retrieval Failed
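The ACB bytes in the log above can be related to the ZFS "vdev I/O failure" line. The sketch below is in Python and rests on an assumption about the order in which the twelve bytes are printed: command, features, LBA low/mid/high, device, LBA low/mid/high (exp), features (exp), sector count, sector count (exp). It also assumes that for the NCQ commands (READ/WRITE_FPDMA_QUEUED) the transfer length is carried in the features fields. The function name `decode_acb` is hypothetical.

```python
SECTOR_SIZE = 512  # logical sector size reported for ada0


def decode_acb(acb_hex):
    """Split an ACB hex dump into (command, lba, sectors) for an NCQ command.

    Assumed byte order: command, features, lba_low, lba_mid, lba_high,
    device, lba_low_exp, lba_mid_exp, lba_high_exp, features_exp,
    sector_count, sector_count_exp.
    """
    (command, features, lba_low, lba_mid, lba_high, device,
     lba_low_exp, lba_mid_exp, lba_high_exp,
     features_exp, sector_count, sector_count_exp) = (
        int(x, 16) for x in acb_hex.split())
    # 48-bit LBA assembled from the six LBA bytes.
    lba = (lba_low | lba_mid << 8 | lba_high << 16 |
           lba_low_exp << 24 | lba_mid_exp << 32 | lba_high_exp << 40)
    # For NCQ commands the sector count is assumed to be in the features fields.
    sectors = features | features_exp << 8
    return command, lba, sectors


# First failing READ_FPDMA_QUEUED from the log.
command, lba, sectors = decode_acb("60 10 10 02 00 40 00 00 00 00 00 00")
offset = lba * SECTOR_SIZE
size = sectors * SECTOR_SIZE
print(hex(command), lba, sectors, offset, size)
```

Under these assumptions the first failing read decodes to LBA 528 and 16 sectors, that is, byte offset 270336 and size 8192, which matches the offset and size in the ZFS message.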

Now, the questions are:
- Were some errors that were considered retryable before FreeBSD 14 made unretryable in FreeBSD 14?
- Is this new FreeBSD 14 behavior a bug or a feature? I mean, was it wrong to consider these errors retryable before?
- Or, of course, is this a coincidence, and the disk just began to show its age right after I updated to FreeBSD 14?


Hardware information:

It is an HP MicroServer Gen8 with its built-in AHCI controller:
ahci0: <Intel Cougar Point AHCI SATA controller> port 0x10c0-0x10c7,0x10c8-0x10cb,0x10d0-0x10d7,0x10d8-0x10db,0x10e0-0x10ff mem 0xfacd0000-0xfacd07ff irq 17 at device 31.2 on pci0
ahci0: AHCI v1.30 with 6 6Gbps ports, Port Multiplier supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ahcich4: <AHCI channel> at channel 4 on ahci0
ahcich5: <AHCI channel> at channel 5 on ahci0
ahciem0: <AHCI enclosure management bridge> on ahci0

The affected disk is a 3 TB Western Digital Red:
ada0: <WDC WD30EFRX-68EUZN0 82.00A82> ACS-2 ATA SATA 3.x device
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 2861588MB (5860533168 512 byte sectors)
ada0: quirks=0x1<4K>
ses0: ada0,pass0 in 'Slot 00', SATA Slot: scbus0 target 0



Thanks!

Borja.