ZFS: 'checksum mismatch' all over the place

Pawel Jakub Dawidek pjd at FreeBSD.org
Mon Aug 20 04:30:54 PDT 2007


On Sat, Aug 18, 2007 at 12:05:27PM +0200, Kenneth Vestergaard Schmidt wrote:
> Hello.
> 
> We've just put a 12x750 GB raidz2 pool into use, but we're seeing
> constant 'checksum mismatch' errors. The drives are brand new.
> 
> 'zpool status' currently lists the following:
> 
>         NAME        STATE     READ WRITE CKSUM
>         pil         ONLINE       0     0 189.9
>           raidz2    ONLINE       0     0 189.9
>             da0     ONLINE       0     0 2.99K
>             da1     ONLINE       0     0   606
>             da2     ONLINE       0     0    75
>             da3     ONLINE       0     0 1.94K
>             da4     ONLINE       0     0   786
>             da5     ONLINE       0     0    88
>             da6     ONLINE       0     0    79
>             da7     ONLINE       0     0    99
>             da8     ONLINE       0     0   533
>             da9     ONLINE       0     0 1.38K
>             da10    ONLINE       0     0    15
>             da11    ONLINE       0     0   628
> 
> da0-da11 are really logical drives on an EonStor SCSI drive-cage. The
> physical disks are SATA, but since our EonStor can't run in JBOD-mode,
> I've had to create a logical drive per physical drive, and map each onto
> a separate SCSI LUN.
> 
> The drive-cage was previously used to expose a RAID-5 array composed of
> the 12 disks. That worked just fine, connected to the same machine and
> controller (i386 IBM xSeries X335, mpt(4)).

How do you know it was fine? Did you have anything that verified
checksums? You could try geli with its integrity verification feature
turned on: fill the disks with random data and then read it back. If
your controller corrupts the data, geli should tell you.

> The EonStor can report SMART-statistics on each SATA-drive, and
> everything looks peachy there.
> 
> What puzzles me is that the drives don't seem to be failing - they just
> develop checksum errors. If they had hard failures, ZFS should mark them
> broken. The errors are also spread across all disks, and I have a hard
> time believing we got 12 bad drives, none of which register as bad to
> the EonStor.
> 
> Has anybody seen something like this? Any pointers on how to debug it?

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!