a strange and terrible saga of the cursed iSCSI ZFS SAN

Peter pmc at citylink.dinoex.sub.org
Sun Aug 6 01:13:23 UTC 2017


Eugene M. Zheganin wrote:
> Hi,
>
> On 05.08.2017 22:08, Eugene M. Zheganin wrote:
>>
>>   pool: userdata
>>  state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>>         corruption.  Applications may be affected.
>> action: Restore the file in question if possible.  Otherwise restore the
>>         entire pool from backup.
>>    see: http://illumos.org/msg/ZFS-8000-8A
>>   scan: none requested
>> config:
>>
>>         NAME               STATE     READ WRITE CKSUM
>>         userdata           ONLINE       0     0  216K
>>           mirror-0         ONLINE       0     0  432K
>>             gpt/userdata0  ONLINE       0     0  432K
>>             gpt/userdata1  ONLINE       0     0  432K
> That would be funny if it weren't so sad, but while writing this message
> the pool started to look like the output below (I just ran zpool status
> twice in a row to compare with what it was before):
>
> [root@san1:~]# zpool status userdata
>   pool: userdata
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://illumos.org/msg/ZFS-8000-8A
>   scan: none requested
> config:
>
>         NAME               STATE     READ WRITE CKSUM
>         userdata           ONLINE       0     0  728K
>           mirror-0         ONLINE       0     0 1,42M
>             gpt/userdata0  ONLINE       0     0 1,42M
>             gpt/userdata1  ONLINE       0     0 1,42M
>
> errors: 4 data errors, use '-v' for a list
> [root@san1:~]# zpool status userdata
>   pool: userdata
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://illumos.org/msg/ZFS-8000-8A
>   scan: none requested
> config:
>
>         NAME               STATE     READ WRITE CKSUM
>         userdata           ONLINE       0     0  730K
>           mirror-0         ONLINE       0     0 1,43M
>             gpt/userdata0  ONLINE       0     0 1,43M
>             gpt/userdata1  ONLINE       0     0 1,43M
>
> errors: 4 data errors, use '-v' for a list
>
> So, you see, the error rate is like the speed of light. And I'm not sure
> the data access rate is really that enormous; it looks like the counters
> are increasing on their own.
> So maybe someone has an idea of what this really means.

It is remarkable that you always have the same error count on both sides
of the mirror.
From what I have seen, such a picture appears when an unrecoverable
error (i.e. one that is present on both sides of the mirror) is read
again and again.
File number 0x1 is probably some important metadata, and since it is not
readable it cannot be cached in the ARC, so the read is retried every
time, incrementing the checksum counters on both devices.
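To see what exactly is affected, you can ask for the verbose error list.
This is just a sketch using the pool name from above; the exact output
will differ:

[root@san1:~]# zpool status -v userdata

Entries printed as <0x...> instead of a path are object numbers that ZFS
could not translate back to a file name (metadata, or files that no
longer exist), which also explains why the same error keeps being hit.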

An error that appears on only one side shows up only once, because it is
auto-corrected from the good copy on the other side. In that case the
figures show some erratic deviation between the two devices.
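A scrub forces a full pass over the pool, so any one-sided errors get
repaired from the good copy in one go instead of only when they happen
to be read; a sketch, again assuming the pool name from above:

[root@san1:~]# zpool scrub userdata
[root@san1:~]# zpool status userdata

The "scan:" line of zpool status then shows the progress and how much
was repaired.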

Therefore it is worthwhile to remove the erroneous data soon, because as
long as it exists the figures tell you nothing useful (such as how many
errors are actually appearing anew).
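Once the damaged data has been restored or removed, clearing the
counters gives a clean baseline. A sketch; whether the damaged objects
are plain files or zvols depends on how the SAN exports them:

[root@san1:~]# zpool clear userdata
[root@san1:~]# zpool status userdata

After the clear, the CKSUM column only counts errors that appear anew,
which is the figure one actually wants to watch.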

