Confusing smartd messages
Cy Schubert
Cy.Schubert at cschubert.com
Thu Jul 5 01:07:59 UTC 2018
In message <5B3D6975.2060508 at grosbein.net>, Eugene Grosbein writes:
> 05.07.2018 7:03, George Mitchell wrote:
> > Every thirty minutes, smartd is telling me:
> >
> > Device: /dev/ada1, 2 Currently unreadable (pending) sectors
> > Device: /dev/ada1, 2 Offline uncorrectable sectors
> >
> > smartctl -a /dev/ada1 seems to be reassuring me that everything is
> > fine (SMART overall-health self-assessment test result: PASSED),
>
> If that said FAILED, you should replace the disk immediately.
> PASSED does not mean it has no problems, only that the problems are not
> fatal (yet).
>
> > though it also says:
> >
> > 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always  - 2
> > 198 Offline_Uncorrectable  0x0030 200 200 000 Old_age Offline - 2
> >
> > which sounds like it confirms the log message above. The disk is
> > part of a raidz pool whose "zpool status" also says everything is
> > okay. What's the recommended action at this point? -- George
>
> You need to force the disk to rewrite those two bad sectors.
> There is a possibility they are just "soft bad" sectors, and in that event
> the problem will simply disappear without new remaps; that would be the
> best possible case.
>
> Or the two sectors could turn out to be really bad, and a remap will "fix"
> (really hide) the problem. In that case you should be ready for a possibly
> increasing number of bad sectors and have a replacement handy.
>
> The first step is running zpool scrub, or even replacing the disk and
> running "dd if=/dev/zero of=/dev/ada1".
A better option would be to determine which blocks have the issue, then
rewrite just those blocks in place with
dd if=/dev/ada1 of=/dev/ada1 iseek=<the bad block> oseek=<bad block> count=<number of bad blocks>
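
A minimal sketch of how the dd invocation above might be assembled once a
bad block number is known (for example, from the LBA_of_first_error column
that "smartctl -l selftest" prints after a long self-test). The function
name build_rewrite_cmd and the 512-byte default block size are assumptions
for illustration, not part of the advice above. The function only prints
the command so it can be reviewed before being run by hand.

```shell
# Hypothetical helper: print, but do NOT execute, a dd command that reads
# a range of blocks from a device and writes them back to the same place.
build_rewrite_cmd() {
    brc_dev=$1          # device, e.g. /dev/ada1
    brc_lba=$2          # first bad block, counted in units of brc_bs
    brc_count=${3:-1}   # number of blocks to rewrite (default 1)
    brc_bs=${4:-512}    # ASSUMED logical sector size in bytes
    printf 'dd if=%s of=%s bs=%s iseek=%s oseek=%s count=%s\n' \
        "$brc_dev" "$brc_dev" "$brc_bs" "$brc_lba" "$brc_lba" "$brc_count"
}
```

For instance, "build_rewrite_cmd /dev/ada1 12345 2" prints a dd line covering
blocks 12345 and 12346. On dd implementations that lack iseek and oseek,
skip and seek are the equivalent operands.
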
Alternatively, you can use
dd_rescue -d -s <input block #> -S <output block #> /dev/ada1 /dev/ada1
Failing that, dd_rescue the whole device. Make sure your zpool has been
exported first. If "repairing" a UFS root filesystem, use single user mode
or the machine will panic; there is no loss of data, it's just a PITA.
Rewriting the data in place like this avoids loss of data.
Ideally, your best bet would be to back up the data and then write zeros,
ones, and some random data over the disk. This "exercises" each sector so
that there is less chance of old magnetic transitions interfering with new
ones. The reason is that an actuator never writes to exactly the same area
of the disk twice, because of variations in actuator movement, so phantom
transitions left over from earlier writes have a slight chance of having
an effect.
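
The zeros/ones/random passes could be sketched roughly as below. This is
DESTRUCTIVE and only an illustration: the function names write_pattern and
exercise_target are made up here, and the sketch assumes head -c, tr and
/dev/urandom are available. Try it on a scratch file before going anywhere
near a disk, and only after the backup mentioned above.

```shell
# DESTRUCTIVE sketch: overwrite a target with exactly wp_bytes bytes of
# one fill pattern. All names here are hypothetical.
write_pattern() {
    wp_target=$1   # scratch file, or a disk device once it is backed up
    wp_bytes=$2    # number of bytes to write
    wp_kind=$3     # zeros | ones | random
    case $wp_kind in
        zeros)  head -c "$wp_bytes" /dev/zero > "$wp_target" ;;
        # "ones" means every bit set, i.e. bytes of 0xFF (octal 377)
        ones)   head -c "$wp_bytes" /dev/zero |
                    LC_ALL=C tr '\000' '\377' > "$wp_target" ;;
        random) head -c "$wp_bytes" /dev/urandom > "$wp_target" ;;
        *)      echo "unknown pattern: $wp_kind" >&2; return 1 ;;
    esac
}

# Run the three passes in the order given in the text: zeros, ones, random.
exercise_target() {
    for et_kind in zeros ones random; do
        write_pattern "$1" "$2" "$et_kind" || return 1
    done
}
```

With a scratch file, "exercise_target /tmp/scratch.img 1048576" leaves a
1 MiB file of random data behind. For a real disk the byte count would have
to be the device size.
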
Finally, if after going through this exercise the bad sectors are not
remapped, or they clear up only to show up as bad again later, then replace
the disk. Of course, if your data is critically important, replace the
disk right away. You don't know how quickly your disk is aging or
deteriorating until it's too late.
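
To see whether the pending sectors actually cleared, or came back later,
the two counters from the original report can be pulled out of the
attribute table. A small sketch, assuming the usual ten-column layout of
"smartctl -A" output in which the raw value is the tenth field; the
function name pending_counts is made up for this example.

```shell
# Read "smartctl -A" output on stdin and print NAME=RAW_VALUE for
# attributes 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable).
# ASSUMES the raw value is the 10th whitespace-separated field.
pending_counts() {
    awk '$1 == 197 || $1 == 198 { print $2 "=" $10 }'
}
```

Used as "smartctl -A /dev/ada1 | pending_counts", this would print
Current_Pending_Sector=2 and Offline_Uncorrectable=2 for the disk described
at the top of this thread. Running it again after the rewrite, and from
time to time afterwards, shows whether the counts dropped to zero and
stayed there.
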
On the positive side, I've been able to resurrect many disks this way.
If the disk is in a critical server (my main machine or firewall), I replace
it immediately and move the one experiencing errors to a testbed
machine, one where I don't mind losing data because it's easily reproduced
or replicated from the main machine. Many times the flaky disks run in my
testbed for years without complaint before dying.
YMMV
--
Cheers,
Cy Schubert <Cy.Schubert at cschubert.com>
FreeBSD UNIX: <cy at FreeBSD.org> Web: http://www.FreeBSD.org
The need of the many outweighs the greed of the few.