ZFS: zpool status on degraded pools (FreeBSD12 vs FreeBSD13)

From: Dave Baukus <daveb_at_spectralogic.com>
Date: Wed, 14 Jul 2021 21:10:04 UTC
I'm seeking comments on the following two differences in ZFS behavior between FreeBSD12 and FreeBSD13.
I consider the first a bug; the second could be either a bug or a conscious design choice:

1) Given a two-disk mirror pool and one extra disk identical to the two pool members (with no ZFS labels on the extra disk):
power the box off, replace one pool disk with the extra disk in the same location, and power the box back on.

The pool state is ONLINE on FreeBSD13 but DEGRADED on FreeBSD12:

FreeBSD13# zpool status poolXXX
  pool: poolXXX
 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
config:

        NAME                      STATE     READ WRITE CKSUM
        poolXXX                   ONLINE       0     0     0
          mirror-0                ONLINE       0     0     0
            da3                   ONLINE       0     0     0
            16562597496848792747  UNAVAIL      0     0     0  was /dev/da2

---- On FreeBSD12:

FreeBSD12# zpool status poolXXX
  pool: poolXXX
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 2.01G in 0 days 00:00:36 with 0 errors on Wed Jul 14 17:05:57 2021
config:

        NAME                     STATE     READ WRITE CKSUM
        poolXXX                  DEGRADED     0     0     0
          mirror-0               DEGRADED     0     0     0
            da17                 ONLINE       0     0     0
            5133460438962496754  UNAVAIL      0     0     0  was /dev/da4


errors: No known data errors

In the FreeBSD13 case, zpool_get_status(), as called from status_callback() in zpool_main.c, gets the correct "reason" for the "status:" line, but
health = zpool_get_state_str(zhp) falls back to the state in zhp->zpool_config, which ultimately comes from the kernel. The userland
logic does not appear semantically different to me on FreeBSD12, so the difference seems to come from the state the kernel reports.
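To make the expectation in case 1 concrete, here is a simplified Python model of how a mirror's state could be derived from its children, loosely based on OpenZFS's vdev_mirror_state_change(): all children unusable gives UNAVAIL, any unusable or degraded child gives DEGRADED, otherwise ONLINE. The VdevState enum, the UNUSABLE set, and mirror_state() are hypothetical names for illustration only; the real kernel logic classifies children by readability and writability and has special cases this sketch ignores.

```python
from enum import Enum


class VdevState(Enum):
    """Subset of the device states printed by zpool status."""
    ONLINE = "ONLINE"
    DEGRADED = "DEGRADED"
    FAULTED = "FAULTED"
    OFFLINE = "OFFLINE"
    REMOVED = "REMOVED"
    UNAVAIL = "UNAVAIL"


# Hypothetical simplification: states treated as "this child cannot be used".
UNUSABLE = {VdevState.FAULTED, VdevState.OFFLINE,
            VdevState.REMOVED, VdevState.UNAVAIL}


def mirror_state(children):
    """Derive a mirror vdev's state from its children's states (simplified).

    All children unusable -> UNAVAIL (no replicas left).
    Any child unusable or degraded -> DEGRADED.
    Otherwise -> ONLINE.
    """
    if not children:
        raise ValueError("a mirror needs at least one child")
    faulted = sum(1 for c in children if c in UNUSABLE)
    degraded = sum(1 for c in children if c is VdevState.DEGRADED)
    if faulted == len(children):
        return VdevState.UNAVAIL
    if faulted or degraded:
        return VdevState.DEGRADED
    return VdevState.ONLINE
```

With the case 1 tree, mirror_state([VdevState.ONLINE, VdevState.UNAVAIL]) returns VdevState.DEGRADED, which is what FreeBSD12 reports for mirror-0 and what FreeBSD13's own "status:" text describes, even though FreeBSD13 prints ONLINE.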

--------------------

2) Add a spare to a degraded pool and issue zpool replace to activate the spare.
On FreeBSD13, after the resilver completes, the pool remains DEGRADED until the replaced disk
is removed from the pool with zpool detach; on FreeBSD12, the pool becomes ONLINE as soon as the resilver completes:

FreeBSD13# zpool status poolXXX
  pool: poolXXX
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 2.47G in 00:00:30 with 0 errors on Wed Jul 14 20:23:31 2021
config:

        NAME        STATE     READ WRITE CKSUM
        poolXXX     DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            da3     ONLINE       0     0     0
            da2     REMOVED      0     0     0
        spares
          da20      AVAIL

errors: No known data errors

FreeBSD13# zpool replace poolXXX da2 da20
FreeBSD13# zpool status poolXXX
  pool: poolXXX
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 2.47G in 00:00:24 with 0 errors on Wed Jul 14 20:39:09 2021
config:

        NAME         STATE     READ WRITE CKSUM
        poolXXX      DEGRADED     0     0     0
          mirror-0   DEGRADED     0     0     0
            da3      ONLINE       0     0     0
            spare-1  DEGRADED     0     0     0
              da2    REMOVED      0     0     0
              da20   ONLINE       0     0     0
        spares
          da20       INUSE     currently in use

errors: No known data errors

FreeBSD13# zpool detach poolXXX da2
FreeBSD13# zpool status poolXXX
  pool: poolXXX
 state: ONLINE
  scan: resilvered 2.47G in 00:00:24 with 0 errors on Wed Jul 14 20:39:09 2021
config:

        NAME        STATE     READ WRITE CKSUM
        poolXXX     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da20    ONLINE       0     0     0

errors: No known data errors

----- On FreeBSD12:

FreeBSD12# zpool status poolXXX
  pool: poolXXX
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        poolXXX                  DEGRADED     0     0     0
          mirror-0               DEGRADED     0     0     0
            8657568776772252257  REMOVED      0     0     0  was /dev/da1
            da5                  ONLINE       0     0     0
        spares
          da4                    AVAIL

errors: No known data errors

FreeBSD12# zpool replace poolXXX 8657568776772252257  da4
FreeBSD12# zpool status poolXXX
  pool: poolXXX
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Jul 14 20:58:14 2021
        2.02G scanned at 64.6M/s, 1.70G issued at 54.5M/s, 2.02G total
        1.70G resilvered, 84.21% done, 0 days 00:00:06 to go
config:

        NAME                       STATE     READ WRITE CKSUM
        poolXXX                    DEGRADED     0     0     0
          mirror-0                 DEGRADED     0     0     0
            spare-0                REMOVED      0     0     0
              8657568776772252257  REMOVED      0     0     0  was /dev/da1
              da4                  ONLINE       0     0     0  (resilvering)
            da5                    ONLINE       0     0     0
        spares
          6757612719167571619      INUSE     was /dev/da4

errors: No known data errors

FreeBSD12# zpool status poolXXX
  pool: poolXXX
 state: ONLINE
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 2.01G in 0 days 00:00:43 with 0 errors on Wed Jul 14 20:58:57 2021
config:

        NAME                       STATE     READ WRITE CKSUM
        poolXXX                    ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            spare-0                ONLINE       0     0     0
              8657568776772252257  REMOVED      0     0     0  was /dev/da1
              da4                  ONLINE       0     0     0
            da5                    ONLINE       0     0     0
        spares
          6757612719167571619      INUSE     was /dev/da4

errors: No known data errors
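Both the FreeBSD13 output in case 1 and this final FreeBSD12 output print "state: ONLINE" while the config tree still lists an unusable device (UNAVAIL or REMOVED). Below is a rough Python sketch that flags that mismatch from the text of zpool status. The function name find_state_mismatch(), the UNUSABLE_STATES set, and the line-parsing rules are assumptions about the output layout shown here, not anything libzfs provides. It only surfaces the mismatch; whether FreeBSD12's ONLINE is correct here is the open question.

```python
import re

# Assumption: device states treated as "not usable" for this check.
UNUSABLE_STATES = {"UNAVAIL", "REMOVED", "FAULTED", "OFFLINE"}


def find_state_mismatch(status_text):
    """Return (name, state) pairs for unusable devices under 'config:'
    when the pool-level 'state:' line says ONLINE; otherwise return [].
    """
    pool_state = None
    in_config = False
    unusable = []
    for line in status_text.splitlines():
        stripped = line.strip()
        if pool_state is None:
            # The pool-level line looks like " state: ONLINE".
            m = re.match(r"state:\s+(\S+)", stripped)
            if m:
                pool_state = m.group(1)
                continue
        if stripped == "config:":
            in_config = True
            continue
        if in_config:
            if stripped.startswith("errors:"):
                break
            fields = stripped.split()
            # Device lines look like: NAME STATE [READ WRITE CKSUM] [note].
            # Header, "spares", and blank lines never match a state here.
            if len(fields) >= 2 and fields[1] in UNUSABLE_STATES:
                unusable.append((fields[0], fields[1]))
    if pool_state != "ONLINE":
        return []
    return unusable
```

Fed the FreeBSD12 output above, this returns [("8657568776772252257", "REMOVED")]; fed the FreeBSD13 output from case 1, it returns the UNAVAIL entry for 16562597496848792747.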
------------------------

autoreplace is off on all pools, and zfsd is not running.

Any feedback is appreciated.

--
Dave Baukus