[Bug 283189] Sporadic NVMe DMAR faults since updating to 14.2-STABLE
Date: Wed, 02 Apr 2025 03:06:40 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=283189

--- Comment #5 from Jason A. Harmening <jah@FreeBSD.org> ---
(In reply to Konstantin Belousov from comment #4)

I'm not sure where these small writes are coming from, but they do appear to be happening in service of some file system function, as these faults are always followed by a syslog entry like this:

Mar 28 23:14:52 corona ZFS[29596]: vdev I/O failure, zpool=zroot path=/dev/nda0p4 offset=3014377947136 size=4096 error=5

I'm not sure if this is some sort of metadata write, or if the block layer is somehow splitting up these transfers in a strange way, or if it's something else.

The GAS address reported by the DMAR fault always matches the PRP1 bus address, for example:

Mar 28 23:14:52 corona kernel: DMAR4: nvme0: WRITE sqid:7 cid:119 nsid:1 lba:5892185760 len:8
Mar 28 23:14:52 corona kernel: nvme0: pci7:0:0 sid 700 fault acc 1 adt 0x0 reason 0x6 addr fef05000
Mar 28 23:14:52 corona kernel: nvme0: nsid:0x1 rsvd2:0 rsvd3:0 mptr:0 prp1:0xfef05000 prp2:0
Mar 28 23:14:52 corona kernel: nvme0: cdw10: 0x5f339ea0 cdw11:0x1 cdw12:0x7 cdw13:0 cdw14:0 cdw15:0
Mar 28 23:14:52 corona kernel: nvme0: DATA TRANSFER ERROR (00/04) crd:0 m:1 dnr:1 p:1 sqid:7 cid:119 cdw0:0
Mar 28 23:14:52 corona kernel: (nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=5f339ea0 1 7 0 0 0
Mar 28 23:14:52 corona kernel: (nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
Mar 28 23:14:52 corona kernel: (nda0:nvme0:0:0:1): Error 5, Retries exhausted
Mar 28 23:14:52 corona ZFS[29596]: vdev I/O failure, zpool=zroot path=/dev/nda0p4 offset=3014377947136 size=4096 error=5

This doesn't look to be address space wraparound; the last several faulting GAS addresses from the syslog are (in order):

fef03000
fef03000
fef03000
fef05000
fef05000
a5205000

Since NVMe rapidly sets up and tears down mappings, this just looks more like the expected behavior of the IOMMU repeatedly handing out the same addresses.
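As a cross-check on the log excerpt, the command dwords are consistent with a single small write: for an NVMe WRITE, cdw10 and cdw11 hold the low and high halves of the starting LBA, and the low 16 bits of cdw12 hold the zero-based block count. Below is a minimal Python sketch of that decode. The helper name is hypothetical (it is not taken from the nvme(4) driver), and the 512-byte LBA size is an assumption, since the log does not state the namespace format.

```python
# Hypothetical helper for illustration; not part of the FreeBSD nvme(4) driver.
SECTOR_SIZE = 512  # assumption: namespace is formatted with 512-byte LBAs


def decode_rw_command(cdw10, cdw11, cdw12, sector_size=SECTOR_SIZE):
    """Return (starting LBA, block count, byte length) of an NVMe READ/WRITE."""
    slba = (cdw11 << 32) | cdw10   # cdw11 is the high dword of the starting LBA
    nlb = (cdw12 & 0xFFFF) + 1     # NLB field is zero-based
    return slba, nlb, nlb * sector_size


# Values from the "cdw10: 0x5f339ea0 cdw11:0x1 cdw12:0x7" line in the log.
slba, nlb, nbytes = decode_rw_command(0x5F339EA0, 0x1, 0x7)
print(slba, nlb, nbytes)
```

Under the 512-byte assumption this gives lba 5892185760 and 8 blocks, matching the "lba:5892185760 len:8" line, and a 4096-byte transfer, matching size=4096 in the ZFS message. A transfer of that size also needs only PRP1, which fits prp2 being 0 in the log.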
I'll post the (E)CAP reports when I can get in front of the machine to do a verbose boot.