Resolving errors with ZVOLs
Wiktor Niesiobedzki
bsd at vink.pl
Sat Sep 2 17:17:19 UTC 2017
Hi,
I have recently encountered errors on my ZFS pool on my 11.1-R system:
$ uname -a
FreeBSD kadlubek 11.1-RELEASE-p1 FreeBSD 11.1-RELEASE-p1 #0: Wed Aug 9
11:55:48 UTC 2017
root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC
amd64
# zpool status -v tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 5h27m with 0 errors on Sat Sep 2 15:30:59 2017
config:

        NAME               STATE     READ WRITE CKSUM
        tank               ONLINE       0     0    98
          mirror-0         ONLINE       0     0   196
            gpt/tank1.eli  ONLINE       0     0   196
            gpt/tank2.eli  ONLINE       0     0   196

errors: Permanent errors have been detected in the following files:

        dkr-test:<0x1>
dkr-test is a ZVOL that I use within bhyve, and indeed I had noticed I/O
errors on this volume inside the bhyve guest. This ZVOL did not have any
snapshots.
Following the advice given in the action field, I tried to remove the
affected ZVOL:
# zfs destroy tank/dkr-test
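In hindsight, my understanding is that the <0x1> in the error entry is an
object number inside that dataset, so before destroying the volume I could
probably have dumped the affected object with something like the following
(only a guess on my part, I did not actually run it):

# zdb -dddd tank/dkr-test 1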
But errors are still reported in zpool status:
errors: Permanent errors have been detected in the following files:
<0x5095>:<0x1>
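My understanding (and this is only my assumption about how the ZFS error log
works) is that entries pointing at a destroyed dataset should go away once
the error log has been rotated by a completed scrub, so the next thing I was
planning to try is roughly:

# zpool clear tank
# zpool scrub tank
(wait for the scrub to finish)
# zpool status -v tank

i.e. clear the counters, let the scrub rotate the error log, and then check
whether the <0x5095>:<0x1> entry is gone.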
I can't find any reference to this dataset in zdb:
# zdb -d tank | grep 5095
# zdb -d tank | grep 20629
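The second grep is simply the decimal form of the same dataset id,
double-checked with:

# printf '%d\n' 0x5095
20629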
I also tried getting statistics about the metadata in this pool:
# zdb -b tank
Traversing all blocks to verify nothing leaked ...
loading space map for vdev 0 of 1, metaslab 159 of 174 ...
No leaks (block sum matches space maps exactly)
        bp count:        24426601
        ganged count:           0
        bp logical:    1983127334912      avg:  81187
        bp physical:   1817897247232      avg:  74422    compression:  1.09
        bp allocated:  1820446928896      avg:  74527    compression:  1.09
        bp deduped:                0    ref>1:      0    deduplication: 1.00
        SPA allocated: 1820446928896     used: 60.90%

        additional, non-pointer bps of type 0:  57981
        Dittoed blocks on same vdev:           296490
And then zdb got stuck, using 100% CPU.
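Next time it wedges like that I will probably try to see where it is
spinning, along these lines (nothing zdb-specific, just generic poking):

# procstat -kk $(pgrep zdb)
# truss -p $(pgrep zdb)

i.e. look at the kernel stacks of its threads and check whether it is still
making any syscalls at all.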
And now to my questions:
1. Do I interpret this correctly: the problem is probably due to an error
during a write, so that both copies of the block ended up with checksums that
do not match their data? And if it is a hardware problem, it is probably
something other than the disks? (No, I don't use ECC RAM.)
2. Is there any way to remove the offending dataset and clear the errors from
the pool?
3. Is my metadata OK, or should I restore the entire pool from backup?
4. I also tried running zdb -bc tank, but this resulted in a kernel panic. I
might try to get the stack trace once I have physical access to the machine
next week. Also, checksum verification slows the process down from 1000 MB/s
to less than 1 MB/s. Is this expected?
5. When I work with zdb (as above), should I try to limit writes to the pool
(e.g. by unmounting the datasets)? A rough sketch of what I had in mind
follows below.
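For question 5, what I had in mind was roughly the following (just a sketch;
I have not verified that zdb actually behaves any better against a quiesced
or exported pool):

# zfs unmount -a
or, more drastically,
# zpool export tank
# zdb -e -b tank

where -e is supposed to let zdb open the exported pool directly.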
Cheers,
Wiktor Niesiobedzki
PS. Sorry for the previous partial message.