bad sector in gmirror HDD
Dan Langille
dan at langille.org
Sat Aug 20 17:34:45 UTC 2011
On Aug 19, 2011, at 11:24 PM, Jeremy Chadwick wrote:
> On Fri, Aug 19, 2011 at 09:39:17PM -0400, Dan Langille wrote:
>>
>> On Aug 19, 2011, at 7:21 PM, Jeremy Chadwick wrote:
>>
>>> On Fri, Aug 19, 2011 at 04:50:01PM -0400, Dan Langille wrote:
>>>> System in question: FreeBSD 8.2-STABLE #3: Thu Mar 3 04:52:04 GMT 2011
>>>>
>>>> After a recent power failure, I'm seeing this in my logs:
>>>>
>>>> Aug 19 20:36:34 bast smartd[1575]: Device: /dev/ad2, 2 Currently unreadable (pending) sectors
>>>
>>> I doubt this is related to a power failure.
>>>
>>>> Searching on that error message, I was led to believe that identifying the bad sector and
>>>> running dd to read it would cause the HDD to reallocate that bad block.
>>>>
>>>> http://smartmontools.sourceforge.net/badblockhowto.html
>>>
>>> This is incorrect (meaning you've misunderstood what's written there).
>>>
>>> Unreadable LBAs can be a result of the LBA being actually bad (as in
>>> uncorrectable), or the LBA being marked "suspect". In either case the
>>> LBA will return an I/O error when read.
>>>
>>> If the LBAs are marked "suspect", the drive will perform re-analysis of
>>> the LBA (to determine if the LBA can be read and the data re-mapped, or
>>> if it cannot then the LBA is marked uncorrectable) when you **write** to
>>> the LBA.
>>>
>>> The above smartd output doesn't tell me much. Providing actual SMART
>>> attribute data (smartctl -a) for the drive would help. The brand of the
>>> drive, the firmware version, and the model all matter -- every drive
>>> behaves a little differently.
>>
>> Information such as this? http://beta.freebsddiary.org/smart-fixing-bad-sector.php
>
> Yes, perfect. Thank you. First things first: upgrade smartmontools to
> 5.41. Your attributes will be the same after you do this (the drive is
> already in smartmontools' internal drive DB), but I often have to remind
> people that they really need to keep smartmontools updated as often as
> possible. The changes between versions are vast; this is especially
> important for people with SSDs (I'm responsible for submitting some
> recent improvements for Intel 320 and 510 SSDs).
Done.
> Anyway, the drive (albeit an old PATA Maxtor) appears to have three
> anomalies:
>
> 1) One confirmed reallocated LBA (SMART attribute 5)
>
> 2) One "suspect" LBA (SMART attribute 197)
>
> 3) A very high temperature of 51C (SMART attribute 194). If this drive
> is in an enclosure or in a system with no fans this would be
> understandable, otherwise this is a bit high. My home workstation which
> has only one case fan has a drive with more platters than your Maxtor,
> and it idles at ~38C. Possibly this drive has been undergoing constant
> I/O recently (which does greatly increase drive temperature)? Not sure.
> I'm not going to focus too much on this one.
This is an older system. I suspect insufficient ventilation. I'll look at getting
a new case fan, if not some HDD fans.
> The SMART error log also indicates an LBA failure at the 26000 hour mark
> (which is 16 hours prior to when you did smartctl -a /dev/ad2). Whether
> that LBA is the remapped one or the suspect one is unknown. The LBA was
> 5566440.
>
> The SMART tests you did didn't really amount to anything; no surprise.
> short and long tests usually do not test the surface of the disk. There
> are some drives which do it on a long test, but as I said before,
> everything varies from drive to drive.
>
> Furthermore, on this model of drive, you cannot do a surface scan via
> SMART. Bummer. That's indicated in the "Offline data collection
> capabilities" section at the top, where it reads:
>
> No Selective Self-test supported.
>
> So you'll have to use the dd method. This takes longer than if surface
> scanning was supported by the drive, but is acceptable. I'll get to how
> to go about that in a moment.
FWIW, I've done a dd read of the entire suspect disk already. Just two errors.
From the URL mentioned above:
[root@bast:~] # dd of=/dev/null if=/dev/ad2 bs=1m conv=noerror
dd: /dev/ad2: Input/output error
2717+0 records in
2717+0 records out
2848980992 bytes transferred in 127.128503 secs (22410246 bytes/sec)
dd: /dev/ad2: Input/output error
38170+1 records in
38170+1 records out
40025063424 bytes transferred in 1544.671423 secs (25911701 bytes/sec)
[root@bast:~] #
That seems to indicate two problems. Are those the values I should be using
with dd?
I did some more precise testing:
# time dd of=/dev/null if=/dev/ad2 bs=512 iseek=5566440
dd: /dev/ad2: Input/output error
9+0 records in
9+0 records out
4608 bytes transferred in 5.368668 secs (858 bytes/sec)
real 0m5.429s
user 0m0.000s
sys 0m0.010s
NOTE: that's 9 blocks later than the LBA reported by smartctl (5566449 vs. 5566440).
The above generated this in /var/log/messages:
Aug 20 17:29:25 bast kernel: ad2: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=5566449
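To tie the two dd runs together, here is a small arithmetic sketch using only the figures shown above. It assumes 512-byte sectors and that the first error in the bs=1m run was hit on the block right after the 2717 complete records; the variable names are mine.

```sh
#!/bin/sh
# Figures taken from the dd output and kernel message above.
sector=512
bs=1048576           # bs=1m
full_records=2717    # complete records before the first error

# First and last LBA covered by the 1 MiB block that failed to read.
sectors_per_block=$(( bs / sector ))
first_lba=$(( full_records * sectors_per_block ))
last_lba=$(( first_lba + sectors_per_block - 1 ))
echo "failing 1 MiB block covers LBA $first_lba through $last_lba"

# The 512-byte run started at iseek=5566440 and the kernel reported
# the failure at LBA 5566449.
start_lba=5566440
kernel_lba=5566449
good_sectors=$(( kernel_lba - start_lba ))
echo "good sectors before the error: $good_sectors"
```

This prints a range of 5564416 through 5566463, which contains LBA 5566449, and a count of 9, which matches the "9+0 records in" from the second run.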
> [stuff snipped]
> That said:
>
> http://jdc.parodius.com/freebsd/bad_block_scan
>
> If you run this on your ad2 drive, I'm hoping what you'll find are two
> LBAs which can't be read -- one will be the remapped LBA and one will be
> the "suspect" LBA. If you only get one LBA error then that's fine too,
> and will be the "suspect" LBA.
> Once you have the LBA(s), you can submit writes to them to get the drive
> to re-analyse them (assuming they're "suspect"):
>
> dd if=/dev/zero of=/dev/XXX bs=512 count=1 seek=NNNNN
>
> Where XXX is the device and NNNNN is the LBA number.
>
> If this works properly, the dd command should sit there for a little bit
> (as the drive does its re-analysis magic) and then should complete.
ad2 is part of a gmirror with ad0. Does this change things?
I haven't tried the dd yet.
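For the record, a dry-run sketch of the read check and the write Jeremy describes. The helper name lba_commands is hypothetical, and it only prints the commands; it does not touch the disk. Whether the write should go to the raw device at all while it is a gmirror member is exactly the open question above, so treat this as a note of what would be run, not a recommendation.

```sh
#!/bin/sh
# Print (do not run) the commands for checking and rewriting one LBA.
# $1 is the device, $2 is the LBA; both are supplied by the caller.
lba_commands() {
    dev=$1
    lba=$2
    # Read one 512-byte sector; an I/O error confirms the LBA is unreadable.
    echo "dd if=$dev of=/dev/null bs=512 count=1 skip=$lba"
    # Overwrite that one sector with zeros so the drive re-analyses it.
    echo "dd if=/dev/zero of=$dev bs=512 count=1 seek=$lba"
}

lba_commands /dev/ad2 5566449
```

The write destroys whatever was stored in that sector, so the LBA should be double-checked against the kernel message before running the second command.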
>
> You'll want to check SMART stats after that; you should see
> Current_Pending_Sector drop to 0. If Offline_Uncorrectable incremented
> then the LBA could not be re-read/remapped.
Current_Pending_Sector did increment:
197 Current_Pending_Sector 0x0032 100 100 020 Old_age Always - 2
[was 1]
> If Reallocated_Sector_Ct
> incremented then you now have a total of 2 LBAs which are remapped.
It did increment:
$ diff smarctl.1 smarctl.3 | grep Reallocated_Sector_Ct
< 5 Reallocated_Sector_Ct 0x0033 100 100 020 Pre-fail Always - 1
> 5 Reallocated_Sector_Ct 0x0033 100 100 020 Pre-fail Always - 2
Full output of smartctl has been appended to http://beta.freebsddiary.org/smart-fixing-bad-sector.php
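A minimal sketch of pulling the raw value of one attribute out of saved smartctl output, as an alternative to the diff and grep above. The function name attr_raw is hypothetical, and it assumes the attribute name is the second column and the raw value is the last column, as in the lines shown here. The sample lines are copied from the diff; temporary files stand in for the saved output.

```sh
#!/bin/sh
# Print the raw value (last column) of one SMART attribute from a file
# containing smartctl attribute lines. $1 is the name, $2 is the file.
attr_raw() {
    awk -v name="$1" '$2 == name { print $NF }' "$2"
}

before=$(mktemp) && after=$(mktemp)
# Attribute lines copied from the diff above.
echo '  5 Reallocated_Sector_Ct 0x0033 100 100 020 Pre-fail Always - 1' > "$before"
echo '  5 Reallocated_Sector_Ct 0x0033 100 100 020 Pre-fail Always - 2' > "$after"

old=$(attr_raw Reallocated_Sector_Ct "$before")
new=$(attr_raw Reallocated_Sector_Ct "$after")
echo "Reallocated_Sector_Ct: $old -> $new"
rm -f "$before" "$after"
```

The same function works for Current_Pending_Sector or Offline_Uncorrectable by changing the name passed in.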
> In
> the case of remapping, you get to deal with the UFS/FFS thing above.
> To get the stats to update in this situation you *might* (but probably
> not) have to run "smartctl -t offline /dev/XXX".
I didn't try that...
>
> You might also be wondering "that dd command writes 512 bytes of zero to
> that LBA; what about the old data that was there, in the case that the
> drive remaps the LBA?" This is a great question, and one I've never
> actually taken the time to answer because at this present time I have
> absolutely *no* bad disks in my possession. I'm under the impression
> that the write does in fact write zeros if the LBA is remapped, but that
> might not be true at all. I've been waiting to test this for quite some
> time and document it/write about it.
>
> I still suggest you replace the drive, although given its age I doubt
> you'll be able to find a suitable replacement. I tend to keep disks
> like this around for testing/experimental purposes and not for actual
> use.
I have several unused 80GB HDDs I can place into this system. I think that's
what I'll wind up doing. But I'd like to follow this process through and get it
documented for future reference.
--
Dan Langille - http://langille.org