Resolving errors with ZVOLs
Wiktor Niesiobedzki
bsd at vink.pl
Sat Sep 2 17:17:19 UTC 2017
Hi,
I have recently encountered errors on my ZFS pool on my 11.1-R system:
$ uname -a
FreeBSD kadlubek 11.1-RELEASE-p1 FreeBSD 11.1-RELEASE-p1 #0: Wed Aug 9
11:55:48 UTC 2017
root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC
amd64
# zpool status -v tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 5h27m with 0 errors on Sat Sep 2 15:30:59 2017
config:

        NAME               STATE     READ WRITE CKSUM
        tank               ONLINE       0     0    98
          mirror-0         ONLINE       0     0   196
            gpt/tank1.eli  ONLINE       0     0   196
            gpt/tank2.eli  ONLINE       0     0   196

errors: Permanent errors have been detected in the following files:

        dkr-test:<0x1>
dkr-test is a ZVOL that I use within bhyve, and indeed I had noticed I/O
errors on this volume inside the bhyve guest. This ZVOL did not have any
snapshots.
Following the advice given in the action field, I tried to remove the
affected ZVOL:
# zfs destroy tank/dkr-test
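In hindsight, my understanding is that the <0x1> in the error entry is an
object number inside that dataset, so before destroying the volume I could
probably have dumped the affected object with something like the following
(only a guess on my part, I did not actually run it):

# zdb -dddd tank/dkr-test 1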
But errors are still reported in zpool status:
errors: Permanent errors have been detected in the following files:
<0x5095>:<0x1>
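My understanding (and this is only my assumption about how the ZFS error log
works) is that entries pointing at a destroyed dataset should go away once
the error log has been rotated by a completed scrub, so the next thing I was
planning to try is roughly:

# zpool clear tank
# zpool scrub tank
(wait for the scrub to finish)
# zpool status -v tank

i.e. clear the counters, let the scrub rotate the error log, and then check
whether the <0x5095>:<0x1> entry is gone.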
I can't find any reference to this dataset in zdb:
# zdb -d tank | grep 5095
# zdb -d tank | grep 20629
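The second grep is simply the decimal form of the same dataset id,
double-checked with:

# printf '%d\n' 0x5095
20629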
I also tried getting statistics about the metadata in this pool:
# zdb -b tank
Traversing all blocks to verify nothing leaked ...
loading space map for vdev 0 of 1, metaslab 159 of 174 ...
No leaks (block sum matches space maps exactly)
        bp count:        24426601
        ganged count:           0
        bp logical:    1983127334912      avg:  81187
        bp physical:   1817897247232      avg:  74422    compression:  1.09
        bp allocated:  1820446928896      avg:  74527    compression:  1.09
        bp deduped:                0    ref>1:      0    deduplication: 1.00
        SPA allocated: 1820446928896     used: 60.90%

        additional, non-pointer bps of type 0:  57981
        Dittoed blocks on same vdev:           296490
And then zdb got stuck, using 100% CPU.
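Next time it wedges like that I will probably try to see where it is
spinning, along these lines (nothing zdb-specific, just generic poking):

# procstat -kk $(pgrep zdb)
# truss -p $(pgrep zdb)

i.e. look at the kernel stacks of its threads and check whether it is still
making any syscalls at all.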
And now to my questions:
1. Do I interpret this correctly: the problem is probably due to an error
during a write, so that both copies of the block ended up with checksums that
do not match their data? And if it is a hardware problem, it is probably
something other than the disks? (No, I don't use ECC RAM.)
2. Is there any way to remove the offending dataset and clear the errors from
the pool?
3. Is my metadata OK, or should I restore the entire pool from backup?
4. I also tried running zdb -bc tank, but this resulted in a kernel panic. I
might try to get the stack trace once I have physical access to the machine
next week. Also, checksum verification slows the process down from 1000 MB/s
to less than 1 MB/s. Is this expected?
5. When I work with zdb (as above), should I try to limit writes to the pool
(e.g. by unmounting the datasets)? A rough sketch of what I had in mind
follows below.
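For question 5, what I had in mind was roughly the following (just a sketch;
I have not verified that zdb actually behaves any better against a quiesced
or exported pool):

# zfs unmount -a
or, more drastically,
# zpool export tank
# zdb -e -b tank

where -e is supposed to let zdb open the exported pool directly.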
Cheers,
Wiktor Niesiobedzki
PS. Sorry for the previous partial message.