Confusing smartd messages

Thu Jul 5 01:07:59 UTC 2018

In message <5B3D6975.2060508 at grosbein.net>, Eugene Grosbein writes:
> 05.07.2018 7:03, George Mitchell wrote:
> > Every thirty minutes, smartd is telling me:
> > 
> > Device: /dev/ada1, 2 Currently unreadable (pending) sectors
> > Device: /dev/ada1, 2 Offline uncorrectable sectors
> > 
> > smartctl -a /dev/ada1 seems to be reassuring me that everything is
> > fine (SMART overall-health self-assessment test result: PASSED),
>
> If that said FAILED, you should replace the disk immediately.
> PASSED does not mean it has no problems, only that the problems are
> not fatal (yet).
>
> > though it also says:
> > 
> > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
> > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       2
> > 
> > which sounds like it confirms the log message above.  The disk is
> > part of a zraid pool whose "zpool status" also says everything is
> > okay.  What's the recommended action at this point?     -- George
>
> You need to force the disk to rewrite those two bad sectors.
> There is a possibility they are just "soft bad" sectors, and in that
> event the problem will simply disappear without new remaps; that would
> be the best possible case.
>
> Or the two sectors could be really bad, and the remap will "fix"
> (really hide) the problem. In that case you should be ready for a
> possibly increasing number of bad sectors and have a replacement handy.
>
> The first step is running zpool scrub, or even replacing the disk and
> running "dd if=/dev/zero of=/dev/ada1".

A better option would be to determine which blocks had the issue, then
rewrite just those blocks in place:

  dd if=/dev/ada1 of=/dev/ada1 iseek=<bad block> oseek=<bad block> count=<number of bad blocks>
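
As a rough sketch of the dd approach above, the helper below only prints
the command so it can be checked before being run by hand. The name
rewrite_cmd and the example LBA are made up for illustration, and
bs=512 assumes the block numbers are 512-byte logical sectors (for
example, the LBA_of_first_error that a long SMART self-test log can
report); adjust it if your block numbers use a different unit.

```shell
# Dry-run sketch: print (do not run) the dd command for a span of sectors.
# rewrite_cmd is an illustrative helper, not something from this thread.
# bs=512 is an assumption: block numbers counted in 512-byte sectors.
rewrite_cmd() {
    dev=$1
    lba=$2
    count=${3:-1}    # default to a single block
    printf 'dd if=%s of=%s bs=512 iseek=%s oseek=%s count=%s\n' \
        "$dev" "$dev" "$lba" "$lba" "$count"
}

# Example with a made-up LBA; inspect the output before running it.
rewrite_cmd /dev/ada1 123456 2
```

Printing rather than executing is deliberate: a typo in the device name
or block number here writes to the wrong place on a live disk.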

Alternatively, you can use dd_rescue:

  dd_rescue -d -s <input block #> -S <output block #> /dev/ada1 /dev/ada1

Failing that, run dd_rescue over the whole device. Make sure your zpool
has been exported first. If "repairing" a UFS root filesystem, use
single-user mode or the machine will panic; there is no loss of data,
it is just a PITA.

Rewriting the blocks in place this way avoids loss of data.

Ideally, your best bet would be to back up the data and then overwrite
the disk with zeros, then ones, then some random data. This "exercises"
each sector so that there is less chance of old magnetic transitions
interfering with new ones. The reason is that the actuator never writes
to exactly the same area of the disk twice, because of variations in
actuator movement, so phantom transitions left by earlier writes have a
slight chance of having an effect.
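
The zero, one, and random passes could be sketched roughly as below.
This is only an illustration: pattern_pass and its arguments are names
I made up, and the usage shown writes to a scratch file rather than a
disk. Pointing the target at a real device destroys everything on it,
so back up first as described above.

```shell
# Sketch: write one pass of a given pattern over a target.
# pattern_pass and its arguments are illustrative names only.
# WARNING: if target is a disk device, its contents are destroyed.
pattern_pass() {
    pattern=$1
    target=$2
    blocks=$3    # number of 512-byte blocks to write
    case $pattern in
        zeros)
            dd if=/dev/zero of="$target" bs=512 count="$blocks" 2>/dev/null ;;
        ones)
            # Turn a stream of 0x00 bytes into 0xff bytes; obs=512 makes
            # dd regroup the pipe data into full 512-byte output blocks.
            head -c "$((blocks * 512))" /dev/zero |
                LC_ALL=C tr '\000' '\377' |
                dd of="$target" obs=512 2>/dev/null ;;
        random)
            dd if=/dev/urandom of="$target" bs=512 count="$blocks" 2>/dev/null ;;
        *)
            echo "unknown pattern: $pattern" >&2
            return 1 ;;
    esac
}

# Usage against a scratch file (safe to try):
#   f=$(mktemp)
#   for p in zeros ones random; do pattern_pass "$p" "$f" 2048; done
#   rm -f "$f"
```

For a whole disk you would use a much larger block size and let dd run
to the end of the device; the small fixed block size here just keeps
the scratch-file demonstration simple.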

Finally, if after going through this exercise the bad sectors are not
remapped, or they clear up only to show up as bad again later, replace
the disk. Of course, if your data is critically important, replace the
disk right away. You don't know how quickly your disk is aging or
deteriorating until it's too late.

On the positive side, I've been able to resurrect many disks this way. 
If the disk is in a critical server (my main machine or firewall), I
replace it immediately and move the one experiencing errors to a
testbed machine, one whose data I don't mind losing because it is
easily reproduced or replicated from the main machine. Many times the
flaky disks run in my testbed for years without complaint before dying.

YMMV

-- 
Cheers,
Cy Schubert <Cy.Schubert at cschubert.com>
FreeBSD UNIX:  <cy at FreeBSD.org>   Web:  http://www.FreeBSD.org

	The need of the many outweighs the greed of the few.