[Bug 261324] xhci1 resets on ROCKPro64 when reading from ZFS pool

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 19 Jan 2022 01:08:07 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261324

            Bug ID: 261324
           Summary: xhci1 resets on ROCKPro64 when reading from ZFS pool
           Product: Base System
           Version: 13.0-STABLE
          Hardware: arm64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: arm
          Assignee: freebsd-arm@FreeBSD.org
          Reporter: freebsd@toyingwithfate.com

I have a ROCKPro64 single-board computer with 4GB of RAM that I am attempting
to use as a small NAS. The storage is an encrypted (OpenZFS native encryption,
not GELI) RAID-Z pool built from three 8TB SATA spinning disks plus one 512GB
SATA SSD used as cache, all housed in a simple 4-disk JBOD USB enclosure (the
"OWC Mercury Elite Pro Quad"). My understanding is that the enclosure contains
a USB 3 hub with four SATA-to-USB bridges behind it.
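
For reference, a pool with this layout would be created roughly as follows.
This is an illustrative sketch, not my exact setup commands: the device names
(da0-da3) and the encryption parameters are assumptions.

# Sketch only: device names and crypto parameters are assumptions
zpool create zraid-v5 raidz da0 da1 da2 cache da3   # 3x 8TB raidz + SSD cache
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase zraid-v5/data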

The issue I run into is that after reading ~40GB of data from the ZFS
filesystem, disk I/O stops and I get the following messages in dmesg:

xhci1: Resetting controller
uhub4: at usbus5, port 1, addr 1 (disconnected)
ugen5.7: <OWC Mercury Elite Pro Quad> at usbus5 (disconnected)
uhub7: at uhub4, port 1, addr 6 (disconnected)
uhub7: detached
ugen5.2: <OWC Mercury Elite Pro Quad> at usbus5 (disconnected)
uhub6: at uhub4, port 2, addr 1 (disconnected)
ugen5.3: <OWC Mercury Elite Pro Quad A> at usbus5 (disconnected)
(da0:umass-sim0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5c f8 00 00 00
80 00 00 
umass0: at uhub6, port 1, addr 2 (disconnected)
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command, 3 more tries remain
(da0:umass-sim0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5c f8 00 00 00
80 00 00 
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command, 2 more tries remain
(da0:umass-sim0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5c f8 00 00 00
80 00 00 
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command, 1 more tries remain
da0 at umass-sim0 bus 0 scbus0 target 0 lun 0
da0: <Mercury Elite Pro Quad A 0>  s/n 123456789004 detached
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5a 00 00 00 00
80 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 3 more tries remain
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5d 78 00 00 00
80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Retrying command, 3 more tries remain
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 61 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Retrying command, 3 more tries remain
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5a 00 00 00 00
80 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 2 more tries remain
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5d 78 00 00 00
80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Retrying command, 2 more tries remain
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 61 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Retrying command, 2 more tries remain
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5a 00 00 00 00
80 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 1 more tries remain
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5d 78 00 00 00
80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Retrying command, 1 more tries remain
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 61 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Retrying command, 1 more tries remain
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5a 00 00 00 00
80 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 0 more tries remain
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5d 78 00 00 00
80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Retrying command, 0 more tries remain
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 61 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Retrying command, 0 more tries remain
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5a 00 00 00 00
80 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Error 5, Retries exhausted
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5d 78 00 00 00
80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Error 5, Retries exhausted
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 61 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Error 5, Retries exhausted
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5b 80 00 00 08
00 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 3 more tries remain
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5e f8 00 00 01
80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Retrying command, 3 more tries remain
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 63 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Retrying command, 3 more tries remain
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5b 80 00 00 08
00 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 2 more tries remain
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5e f8 00 00 01
80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Retrying command, 2 more tries remain
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 63 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Retrying command, 2 more tries remain
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5b 80 00 00 08
00 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 1 more tries remain
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5e f8 00 00 01
80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Retrying command, 1 more tries remain
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 63 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Retrying command, 1 more tries remain
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5b 80 00 00 08
00 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 0 more tries remain
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5e f8 00 00 01
80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Retrying command, 0 more tries remain
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 63 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Retrying command, 0 more tries remain
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5b 80 00 00 08
00 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Error 5, Retries exhausted
Jan  1 05:47:13 generic ZFS[5395]: vdev probe failure, zpool=zraid-v5
path=/dev/da1
Jan  1 05:47:13 generic ZFS[5399]: vdev probe failure, zpool=zraid-v5
path=/dev/da1
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5e f8 00 00 01
80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Error 5, Retries exhausted
Solaris: WARNING: Pool 'zraid-v5' has encountered an uncorrectable I/O failure
and has been suspended.

[the above warning repeated 16 more times]

Jan  1 05:47:14 generic ZFS[5423]: vdev probe failure, zpool=zraid-v5
path=/dev/da2
Jan  1 05:47:14 generic ZFS[5447]: catastrophic pool I/O failure,
zpool=zraid-v5
[16 more "catastrophic pool I/O failure" messages from ZFS elided]
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 63 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Error 5, Retries exhausted


...and so on, with further complaints from Solaris, ZFS, and CAM. The system
remains operable and I can SSH in, but the ZFS pool is unresponsive. I can
tell FreeBSD to shut down, and although it starts to do so (disconnecting my
SSH session), it never finishes powering off: new SSH connections are refused,
yet the board still responds to pings. If I cut power to the SBC, disconnect
the enclosure's USB cable (the ROCKPro64 won't boot with it connected, another
issue I'll report separately), boot the SBC, and then reconnect the USB cable,
things start working again. (The pool will sometimes report errors that I need
to clear, but a subsequent scrub confirms that no data was corrupted.)
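
For reference, the ZFS-side recovery after such a power cycle is just the
standard commands, nothing exotic:

zpool clear zraid-v5    # clear any logged errors once the pool is back
zpool scrub zraid-v5    # verify; so far every scrub has reported no corruption
zpool status zraid-v5   # watch scrub progress and vdev states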

Interestingly, I can scrub the pool without problems: despite the pool holding
several TB of data, the issue has never appeared during a scrub. Additionally,
I can use dd to read from the individual block devices for many hours without
incident, even from all four disks simultaneously (averaging ~100MB/s per
disk, so ~400MB/s total). But if I mount the ZFS filesystem (read-only during
my tests, to reduce risk to the data) and start making a tarball of it
(writing the tar stream to /dev/null), it gets about 40GB in and then the
issue appears. The exact amount of data read before failure (measured by
piping the stream through pv) varies a bit: my last three tests failed at
43.3, 40.6, and 54.2 GiB, respectively. The issue is 100% repeatable: the
amount read before failing varies, but it always fails.
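
Concretely, my tests are approximately the following (the exact invocations
are from memory, so treat them as a sketch):

# Raw-device reads: ran for many hours with no failure, even all four at once
dd if=/dev/da0 of=/dev/null bs=1m &
dd if=/dev/da1 of=/dev/null bs=1m &
dd if=/dev/da2 of=/dev/null bs=1m &
dd if=/dev/da3 of=/dev/null bs=1m &

# ZFS read test: reliably fails after ~40-55 GiB
zfs set readonly=on zraid-v5
tar -cf - /zraid-v5 | pv > /dev/null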

I've tried both the FreeBSD-13.0-RELEASE-arm64-aarch64-ROCKPRO64.img.xz and
FreeBSD-13.0-STABLE-arm64-aarch64-ROCKPRO64-20211230-3684bb89d52-248759.img
images; both exhibit the issue. I've tried Linux on the same SBC with the same
ZFS pool, and it had no issue. I tried a Raspberry Pi 4 running FreeBSD
13.0-RELEASE with the same ZFS pool, and it had no issue either. I should note
that both the Linux and Raspberry Pi tests lacked hardware crypto support
(Linux because ZFS on Linux apparently doesn't support ARM hardware crypto
yet, and the Raspberry Pi because it has no crypto hardware), so both relied
on software crypto, which constrained throughput to tens of MB/s rather than
the hundreds I see with FreeBSD on the ROCKPro64 in my problem case. To cover
that angle, I also used the ZFS pool with an Intel system running FreeBSD 13,
which does have working hardware crypto; it drove the disks even faster than
the ROCKPro64, with no issues.

So my suspicion is that the enclosure is not the culprit, since it works fine
on the Raspberry Pi and Intel systems. I also suspect the SBC itself is not
the culprit, since it works fine under Linux, though it's possible the issue
only appears when reading from the disks above a certain speed, and Linux's
constrained crypto performance simply never reached it. I'm also able to work
with other disks just fine under FreeBSD on the ROCKPro64: only this
multi-disk enclosure has trouble. I can create an encrypted ZFS pool on a USB
3 SSD connected directly to the SBC (no intermediary hub) and it works fine,
no trouble at all.

Initially I suspected that my SBC or enclosure might be at fault, but through
this testing I've come to suspect instead that there is some sort of issue in
the xhci driver for the ROCKPro64, perhaps related to the enclosure presenting
a USB hub in front of the disks. But I am only speculating! Could someone help
me troubleshoot this further, to determine why xhci1 is resetting
unexpectedly? Thank you for your help!
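
P.S. In case it's useful, this is the extra information I can gather on
request. Note the xhci debug sysctl only produces output if the kernel was
built with USB_DEBUG, which I have not verified for these images, so treat
this as a sketch:

usbconfig                           # show the USB topology (hub and bridges)
usbconfig -d 5.7 dump_device_desc   # descriptor for the enclosure (ugen5.7 above)
sysctl hw.usb.xhci.debug=1          # verbose xhci logging (requires USB_DEBUG)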

-- 
You are receiving this mail because:
You are the assignee for the bug.