From nobody Mon Jan 03 16:01:10 2022 X-Original-To: freebsd-drivers@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 6F8FC1925D44 for ; Mon, 3 Jan 2022 16:01:16 +0000 (UTC) (envelope-from lee.matthews.external@stormshield.eu) Received: from work.stormshield.eu (gwlille.netasq.com [91.212.116.1]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4JSL8C5Cx0z4XPn for ; Mon, 3 Jan 2022 16:01:12 +0000 (UTC) (envelope-from lee.matthews.external@stormshield.eu) Received: from work.stormshield.eu (localhost.localdomain [127.0.0.1]) by work.stormshield.eu (Postfix) with ESMTPS id 9D9833761810 for ; Mon, 3 Jan 2022 17:01:10 +0100 (CET) Received: from localhost (localhost.localdomain [127.0.0.1]) by work.stormshield.eu (Postfix) with ESMTP id 7D6623761E0D for ; Mon, 3 Jan 2022 17:01:10 +0100 (CET) DKIM-Filter: OpenDKIM Filter v2.10.3 work.stormshield.eu 7D6623761E0D DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=stormshield.eu; s=5EC8D4C2-228B-11EC-A8BC-C05B4988C3CE; t=1641225670; bh=FaiTOrLo79MUjAC8lSBvb/losQooFp4SGkzuU2geBY8=; h=Date:From:To:Message-ID:MIME-Version; b=V7cswFymKM8IGwYnlrfbzafwklQj/6+P/A73Zdg7w03syeCDxy5NIiBc2PYCjFAkE Fxg8V9/RrmxErQnKBDZUEPLdTuvKoIOgq1J2IhVyPc1M8RE7pmIy4N0dGYGkOCwNao zCeOHYVZkTid0KyGAa6++WyahgTbGWJpGVHNumkg= Received: from work.stormshield.eu ([127.0.0.1]) by localhost (work.stormshield.eu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id FCk-9Pn_euqA for ; Mon, 3 Jan 2022 17:01:10 +0100 (CET) Received: from work.stormshield.eu (localhost.localdomain [127.0.0.1]) by work.stormshield.eu (Postfix) with ESMTP id 5866A3761810 for ; Mon, 3 Jan 2022 17:01:10 +0100 (CET) Date: Mon, 3 Jan 2022 17:01:10 +0100 (CET) To: freebsd-drivers@FreeBSD.org Message-ID: <823709224.21545967.1641225670315.JavaMail.zimbra@stormshield.eu> Subject: SSD trim crash List-Id: Writing device drivers for FreeBSD List-Archive: https://lists.freebsd.org/archives/freebsd-drivers List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-drivers@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Thread-Index: QHNPJEU4Dwl/WVSUegySx6eItLxaEg== Thread-Topic: SSD trim crash X-Rspamd-Queue-Id: 4JSL8C5Cx0z4XPn X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=pass header.d=stormshield.eu header.s=5EC8D4C2-228B-11EC-A8BC-C05B4988C3CE header.b=V7cswFym; dmarc=pass (policy=quarantine) header.from=stormshield.eu; spf=pass (mx1.freebsd.org: domain of lee.matthews.external@stormshield.eu designates 91.212.116.1 as permitted sender) smtp.mailfrom=lee.matthews.external@stormshield.eu X-Spamd-Result: default: False [-4.00 / 15.00]; ARC_NA(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; RCVD_COUNT_FIVE(0.00)[5]; R_DKIM_ALLOW(-0.20)[stormshield.eu:s=5EC8D4C2-228B-11EC-A8BC-C05B4988C3CE]; FROM_HAS_DN(0.00)[]; TO_MATCH_ENVRCPT_ALL(0.00)[]; R_SPF_ALLOW(-0.20)[+ip4:91.212.116.1]; MIME_GOOD(-0.10)[text/plain]; TO_DN_NONE(0.00)[]; PREVIOUSLY_DELIVERED(0.00)[freebsd-drivers@freebsd.org]; RCPT_COUNT_ONE(0.00)[1]; NEURAL_HAM_LONG(-1.00)[-1.000]; DKIM_TRACE(0.00)[stormshield.eu:+]; DMARC_POLICY_ALLOW(-0.50)[stormshield.eu,quarantine]; NEURAL_HAM_SHORT(-1.00)[-0.999]; FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+]; RCVD_TLS_LAST(0.00)[]; ASN(0.00)[asn:49068, ipnet:91.212.116.0/24, country:FR]; MID_RHS_MATCH_FROM(0.00)[] Reply-To: lee.matthews.external@stormshield.eu From: Lee MATTHEWS via freebsd-drivers X-Original-From: Lee MATTHEWS X-ThisMailContainsUnwantedMimeParts: N Hello, I have been tasked with working on an intermittent problem that has been af= fecting some of our products. During in-house testing, we observe that occa= sionally our system crashes and reboots. After reboot, the system gets bloc= ked in the BIOS and it does not advance any further because the BIOS does n= ot detect the SSD disk. The crash appears to happen randomly during testing= , as it can take anything from several hours of continuous testing to a wee= k of running without stop for the crash to occur. The only way to get the s= ystem back up and running again is to power down and then up the machine, a= t which point the disk once again becomes visible to the BIOS. Our products are based on FreeBSD version 11.3. The crash appears to be linked to the type of disk that is being used. We h= ave one brand of disk that has never had a problem (even after a month of c= ontinuous testing) and others that, after testing, fail and after reboot ge= t blocked in the BIOS. Logging via the serial console, I have observed the following crash : -- [2021-12-10 20:56:10] ahcich1: Timeout on slot 5 port 0 [2021-12-10 20:56:10] ahcich1: is 00000000 cs 00000020 ss 00000000 rs 00000= 020 tfd c0 serr 00000000 cmd 0000c517 [2021-12-10 20:56:11] ahcich1: (ada0:ahcich1:0:0:0): DSM TRIM. ACB: 06 01 0= 0 00 00 40 00 00 00 00 01 00 [2021-12-10 20:56:11] AHCI reset... [2021-12-10 20:56:11] (ada0:ahcich1:0:0:0): CAM status: Command timeout [2021-12-10 20:56:11] ahcich1: (ada0:ahcich1:0:SATA connect time=3D100us st= atus=3D00000133 [2021-12-10 20:56:11] 0:0): Retrying command [2021-12-10 20:56:11] ahcich1: AHCI reset: device found [2021-12-10 20:56:42] ahcich1: AHCI reset: device not ready after 31000ms (= tfd =3D 00000080) [2021-12-10 20:57:15] ahcich1: Timeout on slot 6 port 0 [2021-12-10 20:57:15] ahcich1: is 00000000 cs 00000040 ss 00000000 rs 00000= 040 tfd 80 serr 00000000 cmd 0000c617 [2021-12-10 20:57:15] ahcich1: (aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: = ec 00 00 00 00 40 00 00 00 00 00 00 [2021-12-10 20:57:16] AHCI reset... [2021-12-10 20:57:16] (aprobe0:ahcich1:0:0:0): CAM status: Command timeout [2021-12-10 20:57:16] ahcich1: (aprobe0:SATA connect time=3D100us status=3D= 00000133 [2021-12-10 20:57:16] ahcich1:0:ahcich1: AHCI reset: device found [2021-12-10 20:57:16] 0:0): Retrying command [2021-12-10 20:57:47] ahcich1: AHCI reset: device not ready after 31000ms (= tfd =3D 00000080) [2021-12-10 20:58:20] ahcich1: Timeout on slot 7 port 0 [2021-12-10 20:58:20] ahcich1: is 00000000 cs 00000080 ss 00000000 rs 00000= 080 tfd 80 serr 00000000 cmd 0000c717 [2021-12-10 20:58:20] ahcich1: (aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: = ec 00 00 00 00 40 00 00 00 00 00 00 [2021-12-10 20:58:21] AHCI reset... [2021-12-10 20:58:21] (aprobe0:ahcich1:0:0:0): CAM status: Command timeout [2021-12-10 20:58:21] ahcich1: (aprobe0:ahcich1:0:SATA connect time=3D100us= status=3D00000133 [2021-12-10 20:58:21] 0:0): ahcich1: AHCI reset: device found [2021-12-10 20:58:21] Error 5, Retries exhausted [2021-12-10 20:58:52] ahcich1: AHCI reset: device not ready after 31000ms (= tfd =3D 00000080) [2021-12-10 20:59:25] ahcich1: Timeout on slot 8 port 0 [2021-12-10 20:59:25] ahcich1: is 00000000 cs 00000100 ss 00000000 rs 00000= 100 tfd 80 serr 00000000 cmd 0000c817 [2021-12-10 20:59:25] ahcich1: (aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: = ec 00 00 00 00 40 00 00 00 00 00 00 [2021-12-10 20:59:26] AHCI reset... [2021-12-10 20:59:26] (aprobe0:ahcich1:0:0:0): CAM status: Command timeout [2021-12-10 20:59:26] ahcich1: (aprobe0:SATA connect time=3D100us status=3D= 00000133 [2021-12-10 20:59:26] ahcich1:0:ahcich1: 0:AHCI reset: device found [2021-12-10 20:59:26] 0): Error 5, Retry was blocked [2021-12-10 20:59:26] ada0 at ahcich1 bus 0 scbus0 target 0 lun 0 [2021-12-10 20:59:26] ada0: s/n 2009148= 02684 detached [2021-12-10 20:59:26] g_vfs_done():ufs/log[WRITE(offset=3D5309202432, lengt= h=3D32768)]error =3D 6 [2021-12-10 20:59:26] pass0 at ahcich1 bus 0 scbus0 target 0 lun 0 [2021-12-10 20:59:26] g_vfs_done():pass0: ufs/log[WRITE(offset=3D65027883008, length=3D16384)] s/n 200914802684erro= r =3D 6 [2021-12-10 20:59:26] detached [2021-12-10 20:59:26] g_vfs_done():(pass0:ahcich1:0:ufs/log[WRITE(offset=3D= 65124368384, length=3D32768)]0:error =3D 6 [2021-12-10 20:59:26] 0): g_vfs_done():Periph destroyed [2021-12-10 20:59:26] ufs/log[WRITE(offset=3D65124368384, length=3D32768)]e= rror =3D 6 [2021-12-10 20:59:26] g_vfs_done():ufs/log[WRITE(offset=3D65126465536, leng= th=3D32768)]error =3D 6 [2021-12-10 20:59:26] g_vfs_done():ufs/log[WRITE(offset=3D65126465536, leng= th=3D32768)]error =3D 6 [2021-12-10 20:59:26] g_vfs_done():g_vfs_done():ufs/log[WRITE(offset=3D2470= 2976, length=3D512)]ufs/log[WRITE(offset=3D65126662144, length=3D16384)]err= or =3D 6 [2021-12-10 20:59:27] panic: cannot reassign paging buffer [2021-12-10 20:59:27] cpuid =3D 0 [2021-12-10 20:59:27] __HardenedBSD_version =3D 1100056 __FreeBSD_version = =3D 1103500 [2021-12-10 20:59:27] version =3D NS-BSD 4.2.0.beta--HBSD #0 : Wed Dec 1 1= 4:00:48 CET 2021 [2021-12-10 20:59:27] build@BuildFreeBSD-11.3-hardened:/home/build/buil= d/kernel/work-OPTIM/sys/amd64/compile/NETASQ.XL.SMP.HW.RELEASE [2021-12-10 20:59:27] Uptime: 2d0h46m1s [2021-12-10 20:59:27] Dumping 2004 out of 8064 MB:..1%..11%..21%..31%..41%.= .51%..61%..71%..81%..91% [2021-12-10 21:00:29] Dump complete [2021-12-10 21:00:31] =C3=BEVersion 2.15.1236. Copyright (C) 2012 American = Megatrends, Inc. =20 -- It appears that after sending a SATA TRIM command, that the disk stops resp= onding. I've also observed another crash that happens just after a TRIM com= mand. According to camcontrol, the disk supports the TRIM command : -- camcontrol identify /dev/ada0 pass0: ACS-4 ATA SATA 3.x device pass0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes) =20 protocol ATA/ATAPI-11 SATA 3.x device model WDC PC SA530 SDASB8Y256G firmware revision 40103000 serial number 200914802684 WWN 5001b448b13c08ca cylinders 16383 heads 16 sectors/track 63 sector size logical 512, physical 512, offset 0 LBA supported 268435455 sectors LBA48 supported 500118192 sectors PIO supported PIO4 DMA supported WDMA2 UDMA6 media RPM non-rotating Zoned-Device Commands no =20 Feature Support Enabled Value Vendor read ahead yes yes write cache yes yes flush cache yes yes overlap no Tagged Command Queuing (TCQ) no no Native Command Queuing (NCQ) yes 32 tags NCQ Queue Management no NCQ Streaming no Receive & Send FPDMA Queued no SMART yes yes microcode download yes yes security yes no power management yes yes advanced power management yes yes 128/0x80 automatic acoustic management no no media status notification no no power-up in Standby no no write-read-verify no no unload no no general purpose logging yes yes free-fall no no Data Set Management (DSM/TRIM) yes DSM - max 512byte blocks yes 8 DSM - deterministic read yes zeroed Host Protected Area (HPA) no -- I found this patch concerning a similar problem : https://bugs.freebsd.org/= bugzilla/show_bug.cgi?id=3D222802 I've modified sys/cam/ata/ata_da.c to deactivate the trim functionality. -- +diff --git sys/cam/ata/ata_da.c sys/cam/ata/ata_da.c +index cba52c5458be..ad256da0a495 100644 +--- sys/cam/ata/ata_da.c ++++ sys/cam/ata/ata_da.c +@@ -1798,7 +1798,7 @@ adaregister(struct cam_periph *periph, void *arg) + softc->disk->d_flags =3D DISKFLAG_DIRECT_COMPLETION | DISKFLAG_CANZ= ONE; + if (softc->flags & ADA_FLAG_CAN_FLUSHCACHE) + softc->disk->d_flags |=3D DISKFLAG_CANFLUSHCACHE; +- if (softc->flags & ADA_FLAG_CAN_TRIM) { ++ if (0 /*softc->flags & ADA_FLAG_CAN_TRIM */) { + softc->disk->d_flags |=3D DISKFLAG_CANDELETE; + softc->disk->d_delmaxsize =3D softc->params.secsize * + ATA_DSM_RANGE_MAX * -- I've been running a set of tests on one of our products for over two weeks = with the trim deactivated (using the above patch) and there have been no cr= ashes. Could this issue be an internal disk firmware problem? Is it possible that camcontrol reports that the disk supports TRIM yet in r= eality it isn't supported, or isn't supported fully? I've observed that this crash doesn't happen on the first TRIM command, is = it possible that a set of SATA commands destabilize the disk firmware and c= ause it to crash? Is it possible that the timeout isn't long enough for the TRIM command? Thanks in advance for any help. Best wishes, Lee Matthews