Confusing smartd messages

Thu Jul 5 19:31:40 UTC 2018

On Thu, Jul 5, 2018 at 12:15 PM, Rodney W. Grimes <
freebsd-rwg at pdx.rh.cn85.dnsmgr.net> wrote:

> > On Thu, Jul 5, 2018 at 11:03 AM, Wojciech Puchar <wojtek at puchar.net> wrote:
> >
> > >
> > >> Rewriting suspicious sectors is useless in this day and age.  HDDs and
> > >> SSDs already do it internally and have for years.  Even healthy sectors get
> > >>
> > >
> > > Unreadable sectors cannot be rewritten by the drive electronics, as they
> > > don't know what to rewrite.  The drive may remap the sector but still
> > > report a read error until some data is written, unless giving no error
> > > and returning meaningless data is accepted behaviour.
> > >
> >
> > But if that disk is already managed by ZFS, the pool is redundant, and the
> > bad sector is allocated by ZFS, then ZFS will immediately rewrite the
> > unreadable sector.
>
> ZFS, if it gets a read error, will rewrite the unreadable sector
> to a DIFFERENT block, not over the top of the bad spot.
>

Are you sure?  For read errors, I think ZFS rewrites the data in-place, so
it doesn't have to rewrite it on all other members of the same mirror/raid
group.  For persistent write errors of course, it would have to move it to
a different LBA as you describe.

>
> > > only on write it can be done properly.
> > >
> > >> that the HDD/SSD won't fix itself would be a checksum error.  Those are
> > >>
> > >
> > > Yes, and this will happen if you power down your disk during a write, or
> > > get a power spike or other source of noise that affects the electronic
> > > components.
> > >
> >
> > It happens surprisingly rarely.  Even on a sudden power loss, the drive is
> > usually able to finish its current write operation.  Where you run into
> > problems is if the power loss is coincident with a mechanical shock
> > that knocks the head off-track, or something like that.
>
> I agree that "power failures" are rare causes of write errors.  To get an
> idea of how often this might have happened, look at the emergency
> retract counter; if you're getting lots of those you should try to find
> out why and stop it.  Vibration has become a serious problem though:
> at today's head flying heights drives are sensitive to it, and you can
> even cause a drive to do retries by yelling at it with a loud
> voice :-)   Look at the "high fly" counter to see if you're getting
> this issue.
>
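
For anyone who wants to watch those two counters, here is a rough sketch that
pulls them out of "smartctl -A" output.  Assumptions on my part, not anything
Rod specified: the drive reports them under the smartmontools names
Power-Off_Retract_Count and High_Fly_Writes (vendors differ, and many drives
report neither), and the SAMPLE text is made up purely to show the table
layout.  The function names are hypothetical.

```python
import subprocess

# Attribute names as smartmontools commonly labels them; other vendors may
# use different names (assumption, adjust for your drive).
WATCHED = ("Power-Off_Retract_Count", "High_Fly_Writes")

# Made-up sample in the layout printed by `smartctl -A`; not real drive data.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       12
"""

def parse_smart_attributes(text):
    """Map attribute name -> integer raw value from `smartctl -A` table text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                pass  # raw value is in a vendor-specific format; skip it
    return attrs

def watched_counters(text, names=WATCHED):
    """Return only the watched counters that the drive actually reports."""
    attrs = parse_smart_attributes(text)
    return {name: attrs[name] for name in names if name in attrs}

def read_attribute_table(device):
    """Run `smartctl -A` on a device and return its output (usually needs root)."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return result.stdout
```

On a live system that would be something like
watched_counters(read_attribute_table("/dev/ada0")), with the device name
adjusted to suit.  What matters is whether the raw values keep climbing, not
their absolute size.
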
> > > Performing a full disk rewrite (so not a ZFS rebuild), THEN looking at
> > > the SMART stats, and THEN performing a regular smartctl -t long will tell
> > > the truth.
> > >
> > > Which usually is "drive is fine" in my practice.  A really faulty drive
> > > will QUICKLY develop new problems.
> > >
> >
> > Yeah, that should make the error go away.  It takes a long time, though.
> > With a SCSI drive, you can get the exact LBAs affected with a "READ
> > DEFECTS" command.  But there isn't a vendor-independent equivalent for
> > SATA, unfortunately.
>
> My gripe exactly about ATA missing this, though there are vendor-specific
> commands to get it.
>
> --
> Rod Grimes
> rgrimes at freebsd.org
>
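
The "rewrite the disk, then look at the SMART stats" check discussed above
comes down to comparing counters between two points in time.  A minimal
sketch, assuming you have already collected the raw attribute values into
dicts keyed by attribute name.  The function name and the choice of
attributes are mine, for illustration only, and the names follow
smartmontools' usual labels, which vary by vendor.

```python
# Counters commonly associated with media trouble (smartmontools names).
# This selection is an assumption for illustration, not from the thread.
MEDIA_ATTRS = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)

def new_problems(before, after, names=MEDIA_ATTRS):
    """Return {name: (old, new)} for watched raw counters that increased."""
    grown = {}
    for name in names:
        old = before.get(name, 0)  # a missing attribute is treated as zero
        new = after.get(name, 0)
        if new > old:
            grown[name] = (old, new)
    return grown
```

An empty result after the rewrite and a long self-test fits the "drive is
fine" outcome.  Counters that keep growing between snapshots fit the "really
faulty drive" case.
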