bad sector in gmirror HDD
Jeremy Chadwick
freebsd at jdc.parodius.com
Sat Aug 20 20:19:20 UTC 2011
A follow-up given that I just viewed the SMART attribute data at the
very bottom of this page as of this writing (Sat Aug 20 13:00:09 PDT
2011):
http://beta.freebsddiary.org/smart-fixing-bad-sector.php
And I see this:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always       -       2
  9 Power_On_Hours          0x0012   059   059   001    Old_age   Always       -       27440
196 Reallocated_Event_Count 0x0010   099   099   020    Old_age   Offline      -       1
197 Current_Pending_Sector  0x0032   100   100   020    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   253   000    Old_age   Offline      -       0
These attributes USUALLY mean:
1) Reallocated_Sector_Ct == There are 2 remapped LBAs.
2) Reallocated_Event_Count == There is 1 remapping attempt which has
been recorded (whether it failed or succeeded).
3) Current_Pending_Sector == There are 2 LBAs which are suspect.
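
If you want to watch just those counters between tests, something like the following should do. It's a rough sketch, not anything from the web page: the function name is made up, and it assumes the usual "smartctl -A" column layout where RAW_VALUE is the 10th field.

```sh
#!/bin/sh
# Print "ATTRIBUTE_NAME RAW_VALUE" for the sector-health attributes
# (IDs 5, 196, 197, 198). Reads `smartctl -A` output on stdin.
# Assumption: RAW_VALUE is the 10th whitespace-separated field.
smart_sector_counts() {
    awk '$1 == 5 || $1 == 196 || $1 == 197 || $1 == 198 { print $2, $10 }'
}

# Example (not run here):
#   smartctl -A /dev/ad2 | smart_sector_counts
```

Saving the output before and after each test makes it easy to see which counter moved.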
Now, given my previous statement about this particular model of drive,
Maxtor may have a firmware quirk or other oddity that keeps
Current_Pending_Sector from dropping to 0, or Reallocated_Event_Count
from matching reality. I simply don't know. But keep reading.
And remember, this is what we started with:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always       -       1
  9 Power_On_Hours          0x0012   059   059   001    Old_age   Always       -       27416
196 Reallocated_Event_Count 0x0010   100   100   020    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0032   100   100   020    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   253   000    Old_age   Offline      -       0
Anyway, in the SMART error log, I see 3 entries (2 new ones since the
last time I saw the web page):
* Error 3 occurred at disk power-on lifetime: 27422 hours (1142 days + 14 hours)
40 59 18 e8 ef 54 e0 Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440
* Error 2 occurred at disk power-on lifetime: 27421 hours (1142 days + 13 hours)
40 59 18 e8 ef 54 e0 Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440
* Error 1 occurred at disk power-on lifetime: 27400 hours (1141 days + 16 hours)
40 59 18 e8 ef 54 e0 Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440
These are all for the same LBA -- 5566440.
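
As a sanity check, the LBA can be rebuilt from the raw register bytes in each entry. This is only a sketch: the function name is mine, and it assumes the bytes are in smartctl's usual ER ST SC SN CL CH DH order with 28-bit LBA addressing (low nibble of DH holds LBA bits 24-27).

```sh
#!/bin/sh
# Rebuild a 28-bit LBA from the register bytes of a SMART error log entry.
# Assumed argument order: ER ST SC SN CL CH DH (hex, without the 0x prefix).
lba_from_regs() {
    echo $(( ((0x$7 & 0x0f) << 24) | (0x$6 << 16) | (0x$5 << 8) | 0x$4 ))
}

# Bytes taken from the three log entries above; prints 5566440.
lba_from_regs 40 59 18 e8 ef 54 e0
```

The third byte (SC, 0x18) is the sector count, which is where the "24 sectors" in the log line comes from.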
"Error 1" was something we already saw on the page the first time. So
where did the other two come from? Earlier on the web page I saw these
commands being executed:
sh ./bad_block_scan /dev/ad2 5566400 5566500 <-- will hit bad LBA
sh ./bad_block_scan /dev/ad2 5566000 5566500 <-- will hit bad LBA
sh ./bad_block_scan /dev/ad2 5560000 5566000 <-- will not hit bad LBA
sh ./bad_block_scan /dev/ad2 5560000 5566000 <-- will not hit bad LBA
So there's the explanation for the two newly-added entries in the SMART
error log: the first two scans read across LBA 5566440. I'd be very
surprised if bad_block_scan did not report that it had encountered read
errors on that LBA. It should have, unless I left the script in some
weird state. The commands to use to verify would be:
dd if=/dev/ad2 of=/dev/null bs=512 count=1 skip=5566439
dd if=/dev/ad2 of=/dev/null bs=512 count=1 skip=5566440
dd if=/dev/ad2 of=/dev/null bs=512 count=1 skip=5566441
(I tend to check "around" that LBA as well, just to make sure; that's
why there are 3 commands, covering the LBA itself plus the ones at -1
and +1.) One of these should return an I/O error, unless the LBA has
already been remapped, in which case none of them should.
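
The same three reads can be wrapped up so they're easy to repeat for another LBA. A minimal sketch (probe_lba is a made-up name, not part of bad_block_scan); it only reads, and throws the data away:

```sh
#!/bin/sh
# Read the sector before, at, and after a given LBA and report each result.
# Read-only: data goes to /dev/null, nothing is written to the device.
probe_lba() {
    dev=$1
    lba=$2
    for n in $((lba - 1)) "$lba" $((lba + 1)); do
        if dd if="$dev" of=/dev/null bs=512 count=1 skip="$n" 2>/dev/null; then
            echo "LBA $n: read OK"
        else
            echo "LBA $n: read FAILED"
        fi
    done
}

# Example (not run here):
#   probe_lba /dev/ad2 5566440
```

dd's own error text is hidden here, so check the kernel messages and the SMART error log for the details of any failure.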
Finally, there's this very interesting piece of information in the SMART
self-test log (not selective scan log, but the self-test log; meaning
this was the result of "smartctl -t long /dev/ad2" at some point):
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     27416         786767
So it seems this is one of those drives which does do a surface scan on
a long test.
But that's interesting -- LBA 786767, which is not the LBA shown in the
SMART error log.
If that's true, then issuing the same dd commands as above (but with
"skip" changed appropriately) should return an I/O error as well.
Naturally check the SMART error log for verification.
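
To pull the failing LBA out of the self-test log without eyeballing it, something along these lines should work. It's a sketch under assumptions: the function name is mine, and it assumes failed tests are logged with "read failure" in the Status column and the LBA as the last field, as in the entry above.

```sh
#!/bin/sh
# Print LBA_of_first_error for every self-test entry that ended in a read
# failure. Reads `smartctl -l selftest` output on stdin.
selftest_error_lbas() {
    awk '/^# *[0-9]/ && /read failure/ { print $NF }'
}

# Example (not run here):
#   smartctl -l selftest /dev/ad2 | selftest_error_lbas
```

Each number it prints is a candidate for the dd read check, with "skip" set to that value.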
So, it's possible that there are actually two bad LBAs on this drive --
LBA 5566440 and LBA 786767. I simply don't know about the latter, but
the former is confirmed in the SMART error log.
If these LBAs are the ones which Current_Pending_Sector is referring
to, then writing to them should be sufficient to make the drive
re-analyse them. E.g.:
dd if=/dev/zero of=/dev/ad2 bs=512 count=1 seek=5566440
dd if=/dev/zero of=/dev/ad2 bs=512 count=1 seek=786767
The offsets for seek (not skip!!!) should be based on whichever LBAs the
dd reads done earlier actually fail on. Unless of course what we're
seeing is just a batch of LBAs in a small region that are getting worse
the more they're read from (possible).
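
If you'd rather not type the destructive dd by hand, here's a guarded sketch that only writes when the read fails first. The function name is made up, and this is an illustration of the read-then-write idea, not a tested tool. Whatever was in the sector is lost once it's zeroed.

```sh
#!/bin/sh
# Overwrite one 512-byte sector with zeros, but only if reading it fails.
# DESTRUCTIVE for that sector: its previous contents are gone afterwards.
rewrite_if_unreadable() {
    dev=$1
    lba=$2
    if dd if="$dev" of=/dev/null bs=512 count=1 skip="$lba" 2>/dev/null; then
        echo "LBA $lba: readable, left alone"
        return 0
    fi
    # seek= positions the OUTPUT; skip= would only move within /dev/zero.
    # conv=notrunc should make no difference on a disk device, but it keeps
    # dd from truncating a regular file (e.g. a disk image).
    dd if=/dev/zero of="$dev" bs=512 count=1 seek="$lba" conv=notrunc 2>/dev/null &&
        echo "LBA $lba: rewritten with zeros"
}

# Example (not run here):
#   rewrite_if_unreadable /dev/ad2 5566440
```

After a write, look at Current_Pending_Sector and Reallocated_Sector_Ct again to see whether the drive reacted.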
No idea if LBA 5566440 and LBA 786767 are anywhere near one another on
the physical media. I don't have a way to determine that (way too
complex).
That's about all the light I can shed on this for now.
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |
More information about the freebsd-stable
mailing list