ZFS corruption due to lack of space?
Steven Hartland
steven at multiplay.co.uk
Wed Oct 31 17:25:18 UTC 2012
Been running some tests on new hardware here to verify all
is good. One of the tests was to fill the ZFS array, which
seems to have totally corrupted the tank.
The HW is 7 x 3TB disks in RAIDZ2 with dual 13GB ZIL
partitions and dual 100GB L2ARC on enterprise SSDs.
All disks are connected to an LSI 2208 RAID controller
driven by the mfi driver; HDs via a SAS2X28 backplane and
SSDs via a passive backplane.
The file system has 31 test files, most containing random data
from /dev/random and one blank file from /dev/zero.
The test was ~20 concurrent dd's running under screen, all but
one reading from /dev/random and the final one from /dev/zero,
e.g. dd if=/dev/random bs=1m of=/tank2/random10
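For reference, the fill test looked roughly like the sketch below, scaled down so it runs anywhere. The real runs wrote to /tank2 with no count limit, stopping only at ENOSPC; the target directory, file counts and sizes here are illustrative, and /dev/urandom stands in for FreeBSD's non-blocking /dev/random.

```shell
# Scaled-down sketch of the fill test; the real runs wrote to /tank2
# with no count limit, stopping only at "No space left on device".
# TARGET, the file count, and the sizes here are illustrative, and
# /dev/urandom stands in for FreeBSD's non-blocking /dev/random.
# Note: GNU dd spells the block size bs=1M; FreeBSD dd accepts bs=1m.
TARGET=${TARGET:-$(mktemp -d)}
for i in 1 2 3; do
    # each writer fills its own file with 1 MiB blocks in the background
    dd if=/dev/urandom of="$TARGET/random$i" bs=1M count=4 2>/dev/null &
done
# the single zero writer, mirroring the one dd fed from /dev/zero
dd if=/dev/zero of="$TARGET/zero" bs=1M count=4 2>/dev/null &
wait
ls -l "$TARGET"
```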
No hardware errors were raised, so no disk timeouts etc.
On completion, each dd reported no space left, as you would expect,
e.g. dd if=/dev/random bs=1m of=/tank2/random13
dd: /tank2/random13: No space left on device
503478+0 records in
503477+0 records out
527933898752 bytes transferred in 126718.731762 secs (4166187 bytes/sec)
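As a sanity check on dd's summary, 503477 complete 1 MiB (bs=1m) records does match the byte count it printed:

```shell
# 503477 full records of bs=1m (1 MiB = 1048576 bytes) should equal
# the 527933898752 bytes dd reported transferring.
bytes=$((503477 * 1048576))
echo "$bytes bytes"
```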
At that point, with the test seemingly successful, I went
to delete the test files, which resulted in:-
rm random*
rm: random1: Unknown error: 122
rm: random10: Unknown error: 122
rm: random11: Unknown error: 122
rm: random12: Unknown error: 122
rm: random13: Unknown error: 122
rm: random14: Unknown error: 122
rm: random18: Unknown error: 122
rm: random2: Unknown error: 122
rm: random3: Unknown error: 122
rm: random4: Unknown error: 122
rm: random5: Unknown error: 122
rm: random6: Unknown error: 122
rm: random7: Unknown error: 122
rm: random9: Unknown error: 122
Error 122 I assume is ECKSUM; it's outside the errno range
libc knows about, hence rm's "Unknown error".
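That matches the symptom: 122 has no message in FreeBSD 9's libc, so rm falls back to "Unknown error". A quick way to see what a given system's libc thinks of errno 122 follows; the result is OS-dependent, and python3 is used here only as a portable strerror() wrapper.

```shell
# Ask libc for the message behind errno 122. On FreeBSD 9 this is
# beyond ELAST, so there is no text for it; on Linux, 122 happens
# to be EDQUOT. python3 is just a convenient strerror() front end.
python3 -c 'import os; print(os.strerror(122))'
```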
At this point the pool was showing checksum errors:
zpool status

  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/41fb7e5c-21cf-11e2-92a3-002590881138  ONLINE       0     0     0
            gptid/42a1b53c-21cf-11e2-92a3-002590881138  ONLINE       0     0     0

errors: No known data errors

  pool: tank2
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        tank2            ONLINE       0     0 4.22K
          raidz2-0       ONLINE       0     0 16.9K
            mfisyspd0    ONLINE       0     0     0
            mfisyspd1    ONLINE       0     0     0
            mfisyspd2    ONLINE       0     0     0
            mfisyspd3    ONLINE       0     0     0
            mfisyspd4    ONLINE       0     0     0
            mfisyspd5    ONLINE       0     0     0
            mfisyspd6    ONLINE       0     0     0
        logs
          mfisyspd7p3    ONLINE       0     0     0
          mfisyspd8p3    ONLINE       0     0     0
        cache
          mfisyspd9      ONLINE       0     0     0
          mfisyspd10     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        tank2:<0x3>
        tank2:<0x8>
        tank2:<0x9>
        tank2:<0xa>
        tank2:<0xb>
        tank2:<0xf>
        tank2:<0x10>
        tank2:<0x11>
        tank2:<0x12>
        tank2:<0x13>
        tank2:<0x14>
        tank2:<0x15>
So I tried a scrub, which looks like it's going to
take over 5 days to complete and is reporting many, many more
errors:-
  pool: tank2
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub in progress since Wed Oct 31 16:13:53 2012
        118G scanned out of 18.7T at 42.2M/s, 128h19m to go
        49.0M repaired, 0.62% done
config:

        NAME             STATE     READ WRITE CKSUM
        tank2            ONLINE       0     0  596K
          raidz2-0       ONLINE       0     0 1.20M
            mfisyspd0    ONLINE       0     0     0  (repairing)
            mfisyspd1    ONLINE       0     0     0  (repairing)
            mfisyspd2    ONLINE       0     0     0  (repairing)
            mfisyspd3    ONLINE       0     0     2  (repairing)
            mfisyspd4    ONLINE       0     0     1  (repairing)
            mfisyspd5    ONLINE       0     0     0  (repairing)
            mfisyspd6    ONLINE       0     0     1  (repairing)
        logs
          mfisyspd7p3    ONLINE       0     0     0
          mfisyspd8p3    ONLINE       0     0     0
        cache
          mfisyspd9      ONLINE       0     0     0
          mfisyspd10     ONLINE       0     0     0

errors: 596965 data errors, use '-v' for a list
At this point I decided to cancel the scrub, but no joy on that:
zpool scrub -s tank2
cannot cancel scrubbing tank2: out of space
So, questions:-
1. Given the information, does it seem like the multiple writes filling
the disk may have caused metadata corruption?
2. Is there any way to stop the scrub?
3. Surely low space should never prevent stopping a scrub?
Regards
Steve
More information about the freebsd-fs
mailing list