SSD trim crash

From: Lee MATTHEWS via freebsd-drivers <freebsd-drivers_at_FreeBSD.org>
Date: Mon, 03 Jan 2022 16:01:10 UTC
Hello,

I have been tasked with working on an intermittent problem that has been affecting some of our products. During in-house testing, we observe that occasionally our system crashes and reboots. After reboot, the system gets blocked in the BIOS and it does not advance any further because the BIOS does not detect the SSD disk. The crash appears to happen randomly during testing, as it can take anything from several hours of continuous testing to a week of running without stop for the crash to occur. The only way to get the system back up and running again is to power down and then up the machine, at which point the disk once again becomes visible to the BIOS.

Our products are based on FreeBSD version 11.3.

The crash appears to be linked to the type of disk that is being used. We have one brand of disk that has never had a problem (even after a month of continuous testing) and others that, after testing, fail and after reboot get blocked in the BIOS.


Logging via the serial console, I have observed the following crash :

--
[2021-12-10 20:56:10] ahcich1: Timeout on slot 5 port 0
[2021-12-10 20:56:10] ahcich1: is 00000000 cs 00000020 ss 00000000 rs 00000020 tfd c0 serr 00000000 cmd 0000c517
[2021-12-10 20:56:11] ahcich1: (ada0:ahcich1:0:0:0): DSM TRIM. ACB: 06 01 00 00 00 40 00 00 00 00 01 00
[2021-12-10 20:56:11] AHCI reset...
[2021-12-10 20:56:11] (ada0:ahcich1:0:0:0): CAM status: Command timeout
[2021-12-10 20:56:11] ahcich1: (ada0:ahcich1:0:SATA connect time=100us status=00000133
[2021-12-10 20:56:11] 0:0): Retrying command
[2021-12-10 20:56:11] ahcich1: AHCI reset: device found
[2021-12-10 20:56:42] ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
[2021-12-10 20:57:15] ahcich1: Timeout on slot 6 port 0
[2021-12-10 20:57:15] ahcich1: is 00000000 cs 00000040 ss 00000000 rs 00000040 tfd 80 serr 00000000 cmd 0000c617
[2021-12-10 20:57:15] ahcich1: (aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
[2021-12-10 20:57:16] AHCI reset...
[2021-12-10 20:57:16] (aprobe0:ahcich1:0:0:0): CAM status: Command timeout
[2021-12-10 20:57:16] ahcich1: (aprobe0:SATA connect time=100us status=00000133
[2021-12-10 20:57:16] ahcich1:0:ahcich1: AHCI reset: device found
[2021-12-10 20:57:16] 0:0): Retrying command
[2021-12-10 20:57:47] ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
[2021-12-10 20:58:20] ahcich1: Timeout on slot 7 port 0
[2021-12-10 20:58:20] ahcich1: is 00000000 cs 00000080 ss 00000000 rs 00000080 tfd 80 serr 00000000 cmd 0000c717
[2021-12-10 20:58:20] ahcich1: (aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
[2021-12-10 20:58:21] AHCI reset...
[2021-12-10 20:58:21] (aprobe0:ahcich1:0:0:0): CAM status: Command timeout
[2021-12-10 20:58:21] ahcich1: (aprobe0:ahcich1:0:SATA connect time=100us status=00000133
[2021-12-10 20:58:21] 0:0): ahcich1: AHCI reset: device found
[2021-12-10 20:58:21] Error 5, Retries exhausted
[2021-12-10 20:58:52] ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
[2021-12-10 20:59:25] ahcich1: Timeout on slot 8 port 0
[2021-12-10 20:59:25] ahcich1: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd 80 serr 00000000 cmd 0000c817
[2021-12-10 20:59:25] ahcich1: (aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
[2021-12-10 20:59:26] AHCI reset...
[2021-12-10 20:59:26] (aprobe0:ahcich1:0:0:0): CAM status: Command timeout
[2021-12-10 20:59:26] ahcich1: (aprobe0:SATA connect time=100us status=00000133
[2021-12-10 20:59:26] ahcich1:0:ahcich1: 0:AHCI reset: device found
[2021-12-10 20:59:26] 0): Error 5, Retry was blocked
[2021-12-10 20:59:26] ada0 at ahcich1 bus 0 scbus0 target 0 lun 0
[2021-12-10 20:59:26] ada0: <WDC PC SA530 SDASB8Y256G 40103000> s/n 200914802684 detached
[2021-12-10 20:59:26] g_vfs_done():ufs/log[WRITE(offset=5309202432, length=32768)]error = 6
[2021-12-10 20:59:26] pass0 at ahcich1 bus 0 scbus0 target 0 lun 0
[2021-12-10 20:59:26] g_vfs_done():pass0: <WDC PC SA530 SDASB8Y256G 40103000>ufs/log[WRITE(offset=65027883008, length=16384)] s/n 200914802684error = 6
[2021-12-10 20:59:26]  detached
[2021-12-10 20:59:26] g_vfs_done():(pass0:ahcich1:0:ufs/log[WRITE(offset=65124368384, length=32768)]0:error = 6
[2021-12-10 20:59:26] 0): g_vfs_done():Periph destroyed
[2021-12-10 20:59:26] ufs/log[WRITE(offset=65124368384, length=32768)]error = 6
[2021-12-10 20:59:26] g_vfs_done():ufs/log[WRITE(offset=65126465536, length=32768)]error = 6
[2021-12-10 20:59:26] g_vfs_done():ufs/log[WRITE(offset=65126465536, length=32768)]error = 6
[2021-12-10 20:59:26] g_vfs_done():g_vfs_done():ufs/log[WRITE(offset=24702976, length=512)]ufs/log[WRITE(offset=65126662144, length=16384)]error = 6
[2021-12-10 20:59:27] panic: cannot reassign paging buffer
[2021-12-10 20:59:27] cpuid = 0
[2021-12-10 20:59:27] __HardenedBSD_version = 1100056 __FreeBSD_version = 1103500
[2021-12-10 20:59:27] version = NS-BSD 4.2.0.beta--HBSD #0 : Wed Dec  1 14:00:48 CET 2021
[2021-12-10 20:59:27]     build@BuildFreeBSD-11.3-hardened:/home/build/build/kernel/work-OPTIM/sys/amd64/compile/NETASQ.XL.SMP.HW.RELEASE
[2021-12-10 20:59:27] Uptime: 2d0h46m1s
[2021-12-10 20:59:27] Dumping 2004 out of 8064 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
[2021-12-10 21:00:29] Dump complete
[2021-12-10 21:00:31] þVersion 2.15.1236. Copyright (C) 2012 American Megatrends, Inc.      
--

It appears that after sending a SATA TRIM command, that the disk stops responding. I've also observed another crash that happens just after a TRIM command.

According to camcontrol, the disk supports the TRIM command :

--
camcontrol identify /dev/ada0
pass0: <WDC PC SA530 SDASB8Y256G 40103000> ACS-4 ATA SATA 3.x device
pass0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
  
protocol              ATA/ATAPI-11 SATA 3.x
device model          WDC PC SA530 SDASB8Y256G
firmware revision     40103000
serial number         200914802684
WWN                   5001b448b13c08ca
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 512, offset 0
LBA supported         268435455 sectors
LBA48 supported       500118192 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6
media RPM             non-rotating
Zoned-Device Commands no
 
Feature                      Support  Enabled   Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes
overlap                        no
Tagged Command Queuing (TCQ)   no       no
Native Command Queuing (NCQ)   yes              32 tags
NCQ Queue Management           no
NCQ Streaming                  no
Receive & Send FPDMA Queued    no
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      yes      yes     128/0x80
automatic acoustic management  no       no
media status notification      no       no
power-up in Standby            no       no
write-read-verify              no       no
unload                         no       no
general purpose logging        yes      yes
free-fall                      no       no
Data Set Management (DSM/TRIM) yes
DSM - max 512byte blocks       yes              8
DSM - deterministic read       yes              zeroed
Host Protected Area (HPA)      no
--


I found this patch concerning a similar problem : https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=222802

I've modified sys/cam/ata/ata_da.c to deactivate the trim functionality.

--
+diff --git sys/cam/ata/ata_da.c sys/cam/ata/ata_da.c
+index cba52c5458be..ad256da0a495 100644
+--- sys/cam/ata/ata_da.c
++++ sys/cam/ata/ata_da.c
+@@ -1798,7 +1798,7 @@ adaregister(struct cam_periph *periph, void *arg)
+       softc->disk->d_flags = DISKFLAG_DIRECT_COMPLETION | DISKFLAG_CANZONE;
+       if (softc->flags & ADA_FLAG_CAN_FLUSHCACHE)
+               softc->disk->d_flags |= DISKFLAG_CANFLUSHCACHE;
+-      if (softc->flags & ADA_FLAG_CAN_TRIM) {
++      if (0 /*softc->flags & ADA_FLAG_CAN_TRIM */) {
+               softc->disk->d_flags |= DISKFLAG_CANDELETE;
+               softc->disk->d_delmaxsize = softc->params.secsize *
+                                           ATA_DSM_RANGE_MAX *
--

I've been running a set of tests on one of our products for over two weeks with the trim deactivated (using the above patch) and there have been no crashes.

Could this issue be an internal disk firmware problem?

Is it possible that camcontrol reports that the disk supports TRIM yet in reality it isn't supported, or isn't supported fully?

I've observed that this crash doesn't happen on the first TRIM command, is it possible that a set of SATA commands destabilize the disk firmware and cause it to crash?

Is it possible that the timeout isn't long enough for the TRIM command?

Thanks in advance for any help.

Best wishes,
Lee Matthews