ZFS: Silent/hidden errors, nothing logged anywhere
Thomas Backman
serenity at exscape.org
Fri Jun 12 17:33:37 UTC 2009
OK, so I filed a PR in late May (kern/135050): http://www.freebsd.org/cgi/query-pr.cgi?pr=135050
I don't know if this is a "feature" or a bug, but it really should be
considered the latter. The data could be repaired in the background
without the user ever knowing - until the disk dies completely. I'd
prefer to have warning signs (i.e. checksum errors) so that I can buy
a replacement drive *before* that.
Not only does this mean that errors can go unnoticed, it also means that
once ZFS has *temporarily* repaired the broken data and the counters have
been reset, it's impossible to figure out which disk is broken! THAT is
REALLY bad!
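Until this gets fixed, the best workaround I can come up with is to catch
the per-device counters while they still exist, e.g. with a small script run
from cron every few minutes. Just a sketch - the CKSUM-column position (5th
field) and the mail-to-root part are my own assumptions, not anything ZFS
provides:

#!/bin/sh
# Sketch: warn root if any line of the 'zpool status' config table shows a
# non-zero CKSUM count (5th field - an assumption based on the output below).
if zpool status -v | awk '$5 ~ /^[1-9][0-9]*$/ { found = 1 } END { exit !found }'; then
    zpool status -v | mail -s "ZFS checksum errors on $(hostname)" root
fi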
Is this something that we can expect to see changed before 8.0-RELEASE?
BTW, note that the md5sums always check out (good!), and that zpool status
never mentions "x MB repaired" when the damage is repaired silently on read
(bad!) - only after a scrub. Scrubbing may be a hard task with a dying disk;
I haven't tried it, but I'd guess so.
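And since the repaired amount apparently only shows up after a scrub,
scheduling one regularly from cron might be the least bad option for now.
Again just a sketch - the schedule is arbitrary and 'test' is of course my
throwaway pool:

# /etc/crontab sketch: weekly scrub so repaired data at least gets reported
0   4   *   *   0   root   /sbin/zpool scrub test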
Regards,
Thomas
PS. I'm not subscribed to fs@, so please CC me if you read this
message over there.
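For reference, the whole reproduction below boils down to roughly this
(condensed sketch, not something I ran as an actual script - the device
names and sizes are the ones from my transcript, da1-da3 get wiped, and the
reboot obviously has to happen by hand):

#!/bin/sh
# Repro sketch (da1-da3 must be scratch disks!)
sysctl kern.geom.debugflags=0x10                           # allow writing to in-use disks
zpool create test raidz da1 da2 da3
dd if=/dev/random of=/test/testfile bs=1000k               # fill the pool with random data
dd if=/dev/random of=/dev/da3 bs=1000k count=10 seek=80    # corrupt da3 behind ZFS's back
cat /test/testfile > /dev/null                             # read it back; ZFS repairs silently
zpool status -xv                                           # CKSUM errors are visible here...
# ...reboot, run 'zpool status -xv' again, and they're gone.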
[root at clone ~]# uname -a
FreeBSD clone.exscape.org 8.0-CURRENT FreeBSD 8.0-CURRENT #0 r194059M: Fri Jun 12 18:25:05 CEST 2009     root at clone.exscape.org:/usr/obj/usr/src/sys/DTRACE  amd64
[root at clone ~]# sysctl kern.geom.debugflags=0x10   ### To allow overwriting of the disk
kern.geom.debugflags: 0 -> 16
[root at clone ~]# zpool create test raidz da1 da2 da3
[root at clone ~]# dd if=/dev/random of=/test/testfile bs=1000k
dd: /test/testfile: No space left on device
188+0 records in
187+1 records out
192413696 bytes transferred in 105.004322 secs (1832436 bytes/sec)
[root at clone ~]# dd if=/dev/random of=/dev/da3 bs=1000k count=10 seek=80   ### corrupt ~10 MB of da3 directly, behind ZFS's back
10+0 records in
10+0 records out
10240000 bytes transferred in 0.838391 secs (12213871 bytes/sec)
[root at clone ~]# cat /test/testfile > /dev/null   ### force ZFS to read (and silently repair) the data
[root at clone ~]# zpool status -xv
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0    92

errors: No known data errors
[root at clone ~]# reboot
--- immediately after reboot ---
[root at clone ~]# zpool status -xv
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     1

errors: No known data errors
[root at clone ~]# zpool scrub test
(...)
[root at clone ~]# zpool status -xv
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h0m with 0 errors on Fri Jun 12 19:11:36 2009
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0    88  2.72M repaired

errors: No known data errors
[root at clone ~]# reboot
--- immediately after reboot, again ---
[root at clone ~]# zpool status -xv
all pools are healthy
[root at clone ~]# zpool status -v test
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0

errors: No known data errors
[root at clone ~]#
----------------- even more testing, no scrub this time -----------------
[root at clone ~]# sysctl kern.geom.debugflags=0x10
kern.geom.debugflags: 0 -> 16
[root at clone ~]# md5 /test/testfile && dd if=/dev/random of=/dev/da2 bs=1000k count=10 seek=40 ; md5 /test/testfile   ### checksum, corrupt da2 directly, checksum again
MD5 (/test/testfile) = 510479f16592bf66e7ba63c0a4dda0b6
10+0 records in
10+0 records out
10240000 bytes transferred in 0.901645 secs (11357020 bytes/sec)
MD5 (/test/testfile) = 510479f16592bf66e7ba63c0a4dda0b6
[root at clone ~]# zpool status -xv
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0   104
            da3     ONLINE       0     0     0

errors: No known data errors
[root at clone ~]# reboot
--- immediately after reboot, yet again ---
[root at clone ~]# md5 /test/testfile
MD5 (/test/testfile) = 510479f16592bf66e7ba63c0a4dda0b6
[root at clone ~]# zpool status -xv
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     3
            da3     ONLINE       0     0     0

errors: No known data errors
[root at clone ~]# reboot
--- immediately after reboot, yet *again* ---
[root at clone ~]# md5 /test/testfile
MD5 (/test/testfile) = 510479f16592bf66e7ba63c0a4dda0b6
[root at clone ~]# zpool status -xv
all pools are healthy
[root at clone ~]# zpool status -v test
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0

errors: No known data errors
[root at clone ~]# zpool history -il test
History for 'test':
2009-06-12.19:03:43 zpool create test raidz da1 da2 da3 [user root on clone.exscape.org:global]
2009-06-12.19:10:42 [internal pool scrub txg:160] func=1 mintxg=0 maxtxg=160 [user root on clone.exscape.org]
2009-06-12.19:10:44 zpool scrub test [user root on clone.exscape.org:global]
2009-06-12.19:11:36 [internal pool scrub done txg:162] complete=1 [user root on clone.exscape.org]