zpool errors

Thu Jul 11 13:18:11 UTC 2019


> On 11 Jul 2019, at 10:39, Daniel Braniss <danny at cs.huji.ac.il> wrote:
> 
> 
> 
>> On 10 Jul 2019, at 20:23, Allan Jude <allanjude at freebsd.org> wrote:
>> 
>> On 2019-07-10 11:37, Daniel Braniss wrote:
>>> 
>>> 
>>>> On 10 Jul 2019, at 18:24, Allan Jude <allanjude at freebsd.org> wrote:
>>>> 
>>>> On 2019-07-10 10:48, Daniel Braniss wrote:
>>>>> hi,
>>>>> i got a degraded pool, but can’t make sense of the file names:
>>>>> 
>>>>> protonew-2# zpool status -vx
>>>>> pool: h
>>>>> state: ONLINE
>>>>> status: One or more devices has experienced an error resulting in data
>>>>>     corruption.  Applications may be affected.
>>>>> action: Restore the file in question if possible.  Otherwise restore the
>>>>>     entire pool from backup.
>>>>> see: http://illumos.org/msg/ZFS-8000-8A
>>>>> scan: scrub repaired 6.50K in 17h30m with 0 errors on Wed Jul 10 12:06:14 2019
>>>>> config:
>>>>> 
>>>>>     NAME          STATE     READ WRITE CKSUM
>>>>>     h             ONLINE       0     0 14.4M
>>>>>       gpt/r5/zfs  ONLINE       0     0 57.5M
>>>>> 
>>>>> errors: Permanent errors have been detected in the following files:
>>>>> 
>>>>>     <0x102>:<0x30723>
>>>>>     <0x102>:<0x30726>
>>>>>     <0x102>:<0x3062a>
>>>>> …
>>>>>     <0x281>:<0x0>
>>>>>     <0x6aa>:<0x305cd>
>>>>>     <0xffffffffffffffff>:<0x305cd>
>>>>> 
>>>>> 
>>>>> any hints as to how I can identify these files?
>>>>> 
>>>>> thanks,
>>>>> 	danny
>>>>> 
>>>>> _______________________________________________
>>>>> freebsd-hackers at freebsd.org mailing list
>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
>>>>> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe at freebsd.org"
>>>>> 
>>>> 
>>>> Once a file has been deleted, ZFS can have a hard time determining its
>>>> filename.
>>>> 
>>>> It is inode 198186 (0x3062a) on dataset 0x102. The file has been
>>>> deleted, but still exists in at least one snapshot.
>>>> 
>>>> Although, 57 million checksum errors suggests there may be some other
>>>> problem. You might look for and resolve the problem with what appears to
>>>> be the RAID5 volume you have built your ZFS pool on top of. Then run 'zpool
>>>> clear' to reset the counters to zero, and 'zpool scrub' to try to read
>>>> everything again.
>>>> 
>>>> -- 
>>>> Allan Jude
>>>> 
>>> I don’t know when the first error was detected, and this host has been up for 367 days!
>>> I did a scrub but no change.
>>> i will remove old snapshots and see if it helps.
>>> 
>>> is it possible to know at least which volume?
>>> 
>>> thanks,
>>> 	danny
>>> 
>>> 
>>> 
>> 
>> zdb -ddddd h 0x102
>> 
>> Should tell you about which dataset that is
>> 
>> -- 
>> Allan Jude
>> 
> 

the above didn’t work for me, but
after removing old snapshots I reduced the problematic files to 1!
     <0xffffffffffffffff>:<0x305cd>
which seems very odd: a dataset id of -1?
so now I have removed more old snapshots and started a new zpool scrub.
what still worries me is the fast-growing checksum count.
thanks,
	danny
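
For anyone decoding these entries by hand: each line has the form
<dataset id>:<object number>, both in hex, so 0x3062a on 0x102 is object
198186 on dataset 258, as Allan noted. A minimal Python sketch of that
conversion follows. The helper names decode_entry and as_signed64 are made up
for illustration, and reading 0xffffffffffffffff as a signed 64-bit -1 is only
an interpretation of the value, not something zpool itself reports.

```python
import re

# Matches one entry from the "Permanent errors" list, e.g. <0x102>:<0x3062a>
ENTRY_RE = re.compile(r"<(0x[0-9a-fA-F]+)>:<(0x[0-9a-fA-F]+)>")


def decode_entry(entry):
    """Return (dataset_id, object_number) as decimal integers."""
    match = ENTRY_RE.fullmatch(entry.strip())
    if match is None:
        raise ValueError("not a <dataset>:<object> entry: %r" % entry)
    dataset_id = int(match.group(1), 16)
    object_number = int(match.group(2), 16)
    return dataset_id, object_number


def as_signed64(value):
    """Reinterpret an unsigned 64-bit value as a signed one."""
    return value - (1 << 64) if value >= (1 << 63) else value


if __name__ == "__main__":
    # Entries copied from the zpool status output earlier in this thread.
    for entry in ("<0x102>:<0x3062a>",
                  "<0x6aa>:<0x305cd>",
                  "<0xffffffffffffffff>:<0x305cd>"):
        dataset_id, object_number = decode_entry(entry)
        print(entry, "-> dataset", as_signed64(dataset_id),
              "object", object_number)
```

The decimal numbers are easier to compare against tools that print dataset and
object IDs in decimal.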


> firstly, thanks for your help!
> now, after doing a zpool clear, I notice that the CKSUM count is growing.
> the pool is on a RAID controller RAID5 volume (PERC from Dell), which is showing
> that it’s correcting the errors (‘Corrected medium error during recovery on PD …’).
> 
> so what can be the cause? btw, the FreeBSD version is 10.3-STABLE.
> 