ZFS panic under extreme circumstances (2/3 disks corrupted)
Thomas Backman
serenity at exscape.org
Mon May 25 09:13:45 UTC 2009
On May 24, 2009, at 09:02 PM, Thomas Backman wrote:
> So, I was playing around with RAID-Z and self-healing...
Yet another follow-up to this.
It appears that all traces of errors vanish after a reboot. So, say
you have a dying disk: ZFS repairs the data for you, and you don't
notice (unless you check zpool status). Then you reboot, and as far
as I can tell there is NO easy way to find out that something is
wrong with your hardware!
[root at clone ~]# zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h1m with 0 errors on Mon May 25 11:01:22 2009
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     1  64K repaired
            da3     ONLINE       0     0     0

errors: No known data errors
----------- reboot -----------
[root at clone ~]# zpool status test
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0

errors: No known data errors
[root at clone ~]# zpool history -i test
# ... snip ...
# Below is the relevant output from the scrub that found the errors:
2009-05-25.11:00:21 [internal pool scrub txg:118] func=1 mintxg=0 maxtxg=118
2009-05-25.11:00:23 zpool scrub test
2009-05-25.11:01:22 [internal pool scrub done txg:120] complete=1
Nothing there to say that it found errors, right? If there is, it
should be a lot clearer. Also, root should receive automatic mail
when data corruption occurs, IMHO.
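Until something like that exists, a periodic check from cron would do
as a crude stand-in. Below is a minimal sh sketch; it relies only on
'zpool status -x' (which prints "all pools are healthy" when there is
nothing to report) and mail(1). The script name and subject line are
made up, and note that it can only catch errors that are still
visible - i.e. it misses anything zeroed by a reboot before the next
run:

#!/bin/sh
# zpool-check.sh (hypothetical name): mail root if 'zpool status -x'
# reports anything other than healthy pools.
STATUS=$(zpool status -x)
if [ "$STATUS" != "all pools are healthy" ]; then
    echo "$STATUS" | mail -s "ZFS errors on $(hostname)" root
fi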
[root at clone ~]# zpool scrub test
# Wait a while...
[root at clone ~]# zpool status test
  pool: test
 state: ONLINE
 scrub: scrub completed after 0h1m with 0 errors on Mon May 25 11:06:05 2009
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0

errors: No known data errors
I'm guessing this is the case in OpenSolaris as well...? In any case,
it's BAD. Unless you keep checking zpool status over and over, you
could have a disk "failing silently" - which defeats one of the major
purposes of ZFS! Sure, auto-healing is nice, but it should tell you
that it's happening, so that you can prepare to replace the disk
(e.g. order a new one BEFORE it crashes for good).
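For what it's worth, running a check like the script above from
/etc/crontab (hypothetical path and schedule) is the best workaround
I can think of right now:

# Check pool health every 15 minutes and mail root on errors.
*/15    *       *       *       *       root    /usr/local/sbin/zpool-check.sh

But that's a band-aid; the error counters really ought to survive a
reboot, or at least be logged somewhere permanent.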
Regards,
Thomas