[Bug 283189] Sporadic NVMe DMAR faults since updating to 14.2-STABLE

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 02 Apr 2025 03:06:40 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=283189

--- Comment #5 from Jason A. Harmening <jah@FreeBSD.org> ---
(In reply to Konstantin Belousov from comment #4)

I'm not sure where these small writes are coming from, but they do appear to
serve some file system function, since each fault is always followed by a
syslog entry like this:

Mar 28 23:14:52 corona ZFS[29596]: vdev I/O failure, zpool=zroot path=/dev/nda0p4 offset=3014377947136 size=4096 error=5

I'm not sure if this is some sort of metadata write, or if the block layer is
somehow splitting up these transfers in a strange way, or if it's something
else.

The GAS address reported by the DMAR fault always matches the PRP1 bus address,
for example:

Mar 28 23:14:52 corona kernel: DMAR4: nvme0: WRITE sqid:7 cid:119 nsid:1 lba:5892185760 len:8
Mar 28 23:14:52 corona kernel: nvme0: pci7:0:0 sid 700 fault acc 1 adt 0x0 reason 0x6 addr fef05000
Mar 28 23:14:52 corona kernel: nvme0: nsid:0x1 rsvd2:0 rsvd3:0 mptr:0 prp1:0xfef05000 prp2:0
Mar 28 23:14:52 corona kernel: nvme0: cdw10: 0x5f339ea0 cdw11:0x1 cdw12:0x7 cdw13:0 cdw14:0 cdw15:0
Mar 28 23:14:52 corona kernel: nvme0: DATA TRANSFER ERROR (00/04) crd:0 m:1 dnr:1 p:1 sqid:7 cid:119 cdw0:0
Mar 28 23:14:52 corona kernel: (nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=5f339ea0 1 7 0 0 0
Mar 28 23:14:52 corona kernel: (nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
Mar 28 23:14:52 corona kernel: (nda0:nvme0:0:0:1): Error 5, Retries exhausted
Mar 28 23:14:52 corona ZFS[29596]: vdev I/O failure, zpool=zroot path=/dev/nda0p4 offset=3014377947136 size=4096 error=5
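
As a cross-check on the log above, the fields of the failed command can be
decoded by hand. The sketch below assumes the standard NVMe read/write command
layout (CDW10 and CDW11 hold the low and high halves of the starting LBA, and
the low 16 bits of CDW12 hold a zero-based block count). The 512-byte LBA size
and 4 KiB page size are assumptions, not something the log states, and the
helper name decode_rw is made up for illustration.

```python
# Values copied from the nvme0 log lines quoted above.
CDW10 = 0x5F339EA0
CDW11 = 0x1
CDW12 = 0x7
PRP1 = 0xFEF05000
FAULT_ADDR = 0xFEF05000  # "addr" from the DMAR fault line

SECTOR_SIZE = 512  # assumption: namespace uses a 512-byte LBA format
PAGE_SIZE = 4096   # assumption: 4 KiB memory page size


def decode_rw(cdw10, cdw11, cdw12):
    """Hypothetical helper: return (starting LBA, block count) of a read/write."""
    slba = (cdw11 << 32) | cdw10   # 64-bit starting LBA split across two dwords
    nlb = (cdw12 & 0xFFFF) + 1     # the block count field is zero-based
    return slba, nlb


slba, nlb = decode_rw(CDW10, CDW11, CDW12)
length = nlb * SECTOR_SIZE

print(f"slba={slba} nlb={nlb} bytes={length}")
print("fault address equals PRP1:", FAULT_ADDR == PRP1)
print("fits in one page from PRP1:",
      PRP1 % PAGE_SIZE == 0 and length <= PAGE_SIZE)
```

Under those assumptions the command decodes to the same lba and len printed on
the WRITE line, and the byte length agrees with the size=4096 in the ZFS
message. A page-aligned PRP1 with a transfer of at most one page would also be
consistent with prp2 being 0.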


This doesn't look to be address space wraparound; the last several faulting GAS
addresses from the syslog are (in order):
fef03000
fef03000
fef03000
fef05000
fef05000
a5205000

Since NVMe rapidly sets up and tears down mappings, this looks more like the
expected behavior of the IOMMU repeatedly handing out the same addresses.
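
The repetition in the address list above can be tallied with a few lines of
Python. This is only a sketch over the six values quoted in this comment; the
4 KiB page size is an assumption, and nothing here models the actual IOMMU
allocator.

```python
from collections import Counter

# Faulting GAS addresses, in the order they were listed above.
FAULTS = ["fef03000", "fef03000", "fef03000",
          "fef05000", "fef05000", "a5205000"]

PAGE_SIZE = 4096  # assumption: 4 KiB pages

addrs = [int(a, 16) for a in FAULTS]
counts = Counter(addrs)

# Show how often each distinct address faulted.
for addr, n in counts.items():
    print(f"{addr:#010x}: {n} fault(s)")

# Every address should sit on a page boundary if these are whole-page mappings.
all_aligned = all(a % PAGE_SIZE == 0 for a in addrs)
print("all page-aligned:", all_aligned)
```

Only three distinct page-aligned addresses account for all six faults, which
fits the picture of a small set of addresses being reused.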

I'll post the (E)CAP reports when I can get in front of the machine to do a
verbose boot.
