bad sector in gmirror HDD

Sat Aug 20 20:19:20 UTC 2011

A follow-up given that I just viewed the SMART attribute data at the
very bottom of this page as of this writing (Sat Aug 20 13:00:09 PDT
2011):

http://beta.freebsddiary.org/smart-fixing-bad-sector.php

And I see this:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always   -           2
  9 Power_On_Hours          0x0012   059   059   001    Old_age   Always   -           27440
196 Reallocated_Event_Count 0x0010   099   099   020    Old_age   Offline  -           1
197 Current_Pending_Sector  0x0032   100   100   020    Old_age   Always   -           2
198 Offline_Uncorrectable   0x0010   100   253   000    Old_age   Offline  -           0

These attributes USUALLY mean:

1) Reallocated_Sector_Ct   == There are 2 remapped LBAs.
2) Reallocated_Event_Count == There is 1 remapping event which has been
                              noticed (either failure or success).
3) Current_Pending_Sector  == There are 2 LBAs which are suspect.

Now, given my previous statement about this particular model of drive,
Maxtor may have a firmware quirk or other oddity that keeps
Current_Pending_Sector from dropping to 0 or Reallocated_Event_Count
from matching reality.  I simply don't know.  But keep reading.

And remember, this is what we started with:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always   -           1
  9 Power_On_Hours          0x0012   059   059   001    Old_age   Always   -           27416
196 Reallocated_Event_Count 0x0010   100   100   020    Old_age   Offline  -           0
197 Current_Pending_Sector  0x0032   100   100   020    Old_age   Always   -           1
198 Offline_Uncorrectable   0x0010   100   253   000    Old_age   Offline  -           0
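
For anyone who wants to compare the two snapshots without eyeballing
them, here's a rough sketch.  raw_value is a helper name I made up; it
assumes the 10-column "smartctl -A" layout shown above, and the rows
are pasted in by hand rather than read from the drive:

```sh
#!/bin/sh
# raw_value ID: read "smartctl -A"-style attribute rows on stdin and
# print the RAW_VALUE (10th column) of the row whose ID# matches.
# (Hypothetical helper, not part of smartmontools.)
raw_value() {
    awk -v id="$1" '$1 == id { print $10 }'
}

# Rows copied from the "what we started with" table.
before='  5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always   -           1
196 Reallocated_Event_Count 0x0010   100   100   020    Old_age   Offline  -           0
197 Current_Pending_Sector  0x0032   100   100   020    Old_age   Always   -           1'

# Rows copied from the current table.
after='  5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always   -           2
196 Reallocated_Event_Count 0x0010   099   099   020    Old_age   Offline  -           1
197 Current_Pending_Sector  0x0032   100   100   020    Old_age   Always   -           2'

for id in 5 196 197; do
    b=$(printf '%s\n' "$before" | raw_value "$id")
    a=$(printf '%s\n' "$after"  | raw_value "$id")
    echo "attribute $id: $b -> $a (delta $((a - b)))"
done
```

On a live system the same helper should work as
"smartctl -A /dev/ad2 | raw_value 197".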

Anyway, in the SMART error log, I see 3 entries (2 new ones since the
last time I saw the web page):

* Error 3 occurred at disk power-on lifetime: 27422 hours (1142 days + 14 hours)
  40 59 18 e8 ef 54 e0  Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440
* Error 2 occurred at disk power-on lifetime: 27421 hours (1142 days + 13 hours)
  40 59 18 e8 ef 54 e0  Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440
* Error 1 occurred at disk power-on lifetime: 27400 hours (1141 days + 16 hours)
  40 59 18 e8 ef 54 e0  Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440

These are all for the same LBA -- 5566440.
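
The LBA and sector count on each of those lines can be rebuilt from
the register bytes themselves.  A small sketch, assuming the usual
smartctl column order (ER ST SC SN CL CH DH) and 28-bit LBA
addressing; decode_unc is a made-up helper name:

```sh
#!/bin/sh
# decode_unc ER ST SC SN CL CH DH   (hex bytes, no 0x prefix)
# Assumes 28-bit LBA: SN = bits 0-7, CL = bits 8-15, CH = bits 16-23,
# low nibble of DH = bits 24-27.  SC is the sector count.
decode_unc() {
    count=$((0x$3))
    lba=$(( ((0x$7 & 0x0f) << 24) | (0x$6 << 16) | (0x$5 << 8) | 0x$4 ))
    echo "$count sectors at LBA $lba"
}

# Register bytes copied from the error log entries above.
decode_unc 40 59 18 e8 ef 54 e0
```

For the bytes above this prints "24 sectors at LBA 5566440", which
matches what smartctl reports.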

"Error 1" was something we already saw on the page the first time.  So
where did the other two come from?  Earlier on the web page I saw these
commands being executed:

sh ./bad_block_scan /dev/ad2 5566400 5566500   <-- will hit bad LBA
sh ./bad_block_scan /dev/ad2 5566000 5566500   <-- will hit bad LBA
sh ./bad_block_scan /dev/ad2 5560000 5566000   <-- will not hit bad LBA
sh ./bad_block_scan /dev/ad2 5560000 5566000   <-- will not hit bad LBA
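
A quick way to double-check those "will hit / will not hit"
annotations.  This is only a sketch: covers is a name I made up, and
it assumes both ends of a range are scanned inclusively, which I have
not verified against bad_block_scan itself (it makes no difference for
these particular ranges):

```sh
#!/bin/sh
# covers START END LBA: succeed if START <= LBA <= END.
# (Hypothetical helper; inclusive bounds are an assumption.)
covers() {
    [ "$3" -ge "$1" ] && [ "$3" -le "$2" ]
}

bad=5566440
# The three distinct ranges from the commands above.
while read -r start end; do
    if covers "$start" "$end" "$bad"; then
        echo "$start-$end: will hit LBA $bad"
    else
        echo "$start-$end: will not hit LBA $bad"
    fi
done <<EOF
5566400 5566500
5566000 5566500
5560000 5566000
EOF
```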

So there's the explanation for the two newly-added entries in the SMART
error log.  I'd be very surprised if bad_block_scan did not echo that it
had encountered read errors on LBA 5566440.  It should have, unless I
left the script in some weird state.  The commands to use to verify
would be:

dd if=/dev/ad2 of=/dev/null bs=512 count=1 skip=5566439
dd if=/dev/ad2 of=/dev/null bs=512 count=1 skip=5566440
dd if=/dev/ad2 of=/dev/null bs=512 count=1 skip=5566441

(I tend to check "around" that LBA area as well, just to make sure;
that's why there are 3 commands, with -1 and +1 LBAs.)  One of these
should return an I/O error, unless the LBA has already been remapped,
in which case none of them should.

Finally, there's this very interesting piece of information in the SMART
self-test log (not selective scan log, but the self-test log; meaning
this was the result of "smartctl -t long /dev/ad2" at some point):

Num  Test_Description    Status                  Remaining     LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     27416            786767

So it seems this is one of those drives which does do a surface scan on
a long test.

But that's interesting -- LBA 786767.

If that's true, then issuing the same dd commands as above (but with
"skip" changed appropriately) should return an I/O error as well.
Naturally, check the SMART error log afterwards for verification.
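
To avoid typos when adjusting "skip" by hand, the three reads can be
generated.  A sketch only: probe_cmds is a made-up name, and it just
prints the dd commands (same form as the ones used for LBA 5566440)
rather than running them:

```sh
#!/bin/sh
# probe_cmds DEVICE LBA [RADIUS]
# Print one single-sector dd read per LBA in LBA-RADIUS .. LBA+RADIUS.
# RADIUS defaults to 1.  Nothing is executed; this is a dry run.
probe_cmds() {
    dev=$1 lba=$2 radius=${3:-1}
    i=$((lba - radius))
    while [ "$i" -le $((lba + radius)) ]; do
        echo "dd if=$dev of=/dev/null bs=512 count=1 skip=$i"
        i=$((i + 1))
    done
}

probe_cmds /dev/ad2 786767
```

Once the printed commands look right, piping them into sh
("probe_cmds /dev/ad2 786767 | sh") runs them.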

So, it's possible that there are actually two bad LBAs on this drive --
LBA 5566440 and LBA 786767.  I simply don't know about the latter, but
the former is confirmed in the SMART error log.

If either of these LBAs is among the ones which Current_Pending_Sector
is referring to, then writes to them should be sufficient to induce
re-analysis.  E.g.:

dd if=/dev/zero of=/dev/ad2 bs=512 count=1 seek=5566440
dd if=/dev/zero of=/dev/ad2 bs=512 count=1 seek=786767

The offsets for seek (not skip!!!) should probably be based on which of
the dd reads done earlier actually return an I/O error.  Unless of
course what we're seeing is just a batch of LBAs in a small region that
are getting worse the more they're read from (possible).
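
One way to tie the write to the result of the read, so that only an
LBA which really fails to read gets zeroed.  This is a hedged sketch,
not something I've run against this drive: rewrite_if_unreadable is a
made-up name, and the write destroys whatever was in that sector:

```sh
#!/bin/sh
# rewrite_if_unreadable DEVICE LBA
# Read the single 512-byte sector at LBA (skip=); only if that read
# fails, overwrite the same sector with zeros (seek=).
# WARNING: the write path is destructive.  Hypothetical helper.
rewrite_if_unreadable() {
    dev=$1 lba=$2
    if dd if="$dev" of=/dev/null bs=512 count=1 skip="$lba" 2>/dev/null; then
        echo "LBA $lba reads fine, leaving it alone"
    else
        echo "LBA $lba returned an error, overwriting it with zeros"
        # conv=notrunc is harmless on a device node and prevents
        # truncation if this is ever pointed at a disk image file.
        dd if=/dev/zero of="$dev" bs=512 count=1 seek="$lba" conv=notrunc
    fi
}

# Example (deliberately left commented out):
# rewrite_if_unreadable /dev/ad2 5566440
# rewrite_if_unreadable /dev/ad2 786767
```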

No idea if LBA 5566440 and LBA 786767 are anywhere near one another on
the physical media.  I don't have a way to determine that (way too
complex).

That's about all the light I can shed on this for now.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |