[Bug 253954] kernel: g_access(958): provider da8 has error 6 set

From: <bugzilla-noreply_at_freebsd.org>
Date: Mon, 13 Jun 2022 20:20:06 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253954

jnaughto@ee.ryerson.ca changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jnaughto@ee.ryerson.ca

--- Comment #4 from jnaughto@ee.ryerson.ca ---
Any update on this bug? I just experienced the exact same issue. I have 8
disks (all SATA) connected to a FreeBSD 12.3 system, and the ZFS pool is set
up as a raidz3. I came in today and found one drive was "REMOVED":

# zpool status pool
  pool: pool
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 0 days 02:32:26 with 0 errors on Sat Jun 11 05:32:26 2022
config:

        NAME                     STATE     READ WRITE CKSUM
        pool                     DEGRADED     0     0     0
          raidz3-0               DEGRADED     0     0     0
            ada0                 ONLINE       0     0     0
            ada1                 ONLINE       0     0     0
            ada2                 ONLINE       0     0     0
            ada3                 ONLINE       0     0     0
            ada4                 ONLINE       0     0     0
            8936423309855741075  REMOVED      0     0     0  was /dev/ada5
            ada6                 ONLINE       0     0     0
            ada7                 ONLINE       0     0     0

I assumed that the drive had died and pulled it.  I put a new drive in place
and attempted to replace it:

# zpool replace pool 8936423309855741075 ada5
cannot replace 8936423309855741075 with ada5: no such pool or dataset
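
For reference, before retrying the replace it probably would have made sense
to check whether the new disk had actually attached and whether it carried a
stale label, with something along these lines (stock camcontrol(8) and zdb(8)
on 12.3; ada5 is just the name I expected the new disk to come up as):

# camcontrol devlist
# ls /dev/ada*
# zdb -l /dev/ada5

camcontrol devlist shows what CAM actually attached, ls confirms the device
node exists, and zdb -l dumps any leftover ZFS label on the disk. As the log
entries below show, in my case the new device never attached at all.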


It seems that the old drive is somehow still remembered by the system. I dug
through the logs and found the following occurring when the new drive was
inserted into the system:

Jun 13 13:03:15 server kernel: cam_periph_alloc: attempt to re-allocate valid
device ada5 rejected flags 0x118 refcount 1
Jun 13 13:03:15 server kernel: adaasync: Unable to attach to new device due to
status 0x6
Jun 13 13:04:23 server kernel: g_access(961): provider ada5 has error 6 set
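
I did not try it at the time, but my understanding is that a manual CAM rescan
is the usual way to force the controller to re-probe that slot, roughly:

# camcontrol rescan all
# camcontrol devlist

Whether that would have cleared the g_access error 6 state is exactly what I
am not sure about.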

I did a reboot without the new drive in place. On reboot, the pool status
looked somewhat different:

# zpool status pool
  pool: pool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0 days 02:32:26 with 0 errors on Sat Jun 11 05:32:26 2022
config:

        NAME                      STATE     READ WRITE CKSUM
        pool                      DEGRADED     0     0     0
          raidz3-0                DEGRADED     0     0     0
            ada0                  ONLINE       0     0     0
            ada1                  ONLINE       0     0     0
            ada2                  ONLINE       0     0     0
            ada3                  ONLINE       0     0     0
            ada4                  ONLINE       0     0     0
            8936423309855741075   FAULTED      0     0     0  was /dev/ada5
            ada5                  ONLINE       0     0     0
            diskid/DISK-Z1W4HPXX  ONLINE       0     0     0

errors: No known data errors

I assumed this was because there was one less drive attached and the system
had assigned new adaX values to each drive. When I inserted the new drive at
this point, it appeared as ada9, so I re-issued the zpool replace command, now
with ada9. It did take about 3 minutes before the zpool replace command
responded (which really concerned me). Still, the server has quite a few users
accessing the filesystem, so I figured that as long as the new drive was
resilvering I would be fine....
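
As an aside, the adaX renumbering is what made this confusing. My
understanding is that device numbers can be wired down in /boot/device.hints
so a given AHCI channel always keeps the same adaX name, roughly like this
(syntax per cam(4); the channel/unit numbers here are just an example for my
hardware):

hint.scbus.5.at="ahcich5"
hint.ada.5.at="scbus5"

With something like that in place, the replacement disk in the same bay would
have come back as ada5 rather than ada9.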

I do a weekly scrub of the pool and I believe the error cropped up after the
scrub. At 11am today the logs showed the following:


Jun 13 11:29:15 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): FLUSHCACHE48. ACB:
ea 00 00 00 00 40 00 00 00 00 00 00
Jun 13 11:29:15 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): CAM status: Command
timeout
Jun 13 11:29:15 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): Retrying command, 0
more tries remain
Jun 13 11:30:35 172.16.20.66 kernel: ahcich5: Timeout on slot 5 port 0
Jun 13 11:30:35 172.16.20.66 kernel: ahcich5: is 00000000 cs 00000060 ss
00000000 rs 00000060 tfd c0 serr 00000000 cmd 0004c517
Jun 13 11:30:35 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): FLUSHCACHE48. ACB:
ea 00 00 00 00 40 00 00 00 00 00 00
Jun 13 11:30:35 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): CAM status: Command
timeout
Jun 13 11:30:35 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): Retrying command, 0
more tries remain
Jun 13 11:31:08 172.16.20.66 kernel: ahcich5: AHCI reset: device not ready
after 31000ms (tfd = 00000080)

At 11:39 I believe the following log entries are of note:

Jun 13 11:39:45 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): CAM status:
Unconditionally Re-queue Request
Jun 13 11:39:45 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): Error 5, Periph was
invalidated
Jun 13 11:39:45 172.16.20.66 ZFS[92964]: vdev state changed,
pool_guid=$5100646062824685774 vdev_guid=$8936423309855741075
Jun 13 11:39:45 172.16.20.66 ZFS[92966]: vdev is removed,
pool_guid=$5100646062824685774 vdev_guid=$8936423309855741075
Jun 13 11:39:46 172.16.20.66 kernel: g_access(961): provider ada5 has error 6
set
Jun 13 11:39:47 reactor syslogd: last message repeated 1 times
Jun 13 11:39:47 172.16.20.66 syslogd: last message repeated 1 times
Jun 13 11:39:47 172.16.20.66 kernel: ZFS WARNING: Unable to attach to ada5.
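
For completeness, the pool_guid/vdev_guid values in those ZFS[...] lines can
be mapped back to devices. I believe something like this shows the GUIDs
alongside the vdev paths (zdb(8) reading the cached pool configuration):

# zpool get guid pool
# zdb -C pool | grep -E 'guid|path'

The vdev_guid 8936423309855741075 is the same GUID that zpool status shows for
the old /dev/ada5 entry above.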

Any idea what the issue was?

-- 
You are receiving this mail because:
You are the assignee for the bug.