Read / write timeouts on SATA disks connected to ICH9

Sat May 15 20:39:18 UTC 2010

Hi,

<SNIP: disk without errors timing out>
> That could be caused by a multitude of other known things.  For
> example, some Western Digital "Green" drives (including the
> Enterprise class ones) are known to perform head parking/offloading
> excessively, which could result in the drive spending more time doing
> that than actually serving overall I/O requests.  There are some
> other reports of Samsung Spinpoint drives experiencing other issues
> (I've since forgotten and would have to dig up the threads).

> If you could provide full SMART stats for that drive, it might help.
Attached is the SMART output of both disks I replaced about a month ago.
It appears I replaced perfectly fine drives with the current disks, which
do show errors ;(  One of the old disks is in a USB enclosure now, hence
'da0'.

<SNIP: enabling TLER>
> Yes, it's a DOS-based utility (like most firmware upgraders these
> days). I can provide it if you'd like.  I've been meaning to spend
> some time trying to reverse-engineer the binary to figure out what
> ATA commands it sends to the disk to toggle/adjust the feature (so
> that one could do it in real-time rather than have to boot into DOS).
> 
I'd like to try that tool. Since the old WD disks are now lying around
at home, I have some time to get a DOS boot working to try it out. A
FreeBSD implementation of the WD tool, and possibly of other brands'
tools, would be really useful indeed.

>> At a certain point in time I had read errors from specific LBA's on
>>  ad4. Using dd I was able to pinpoint those to single sectors.

> This isn't very effective (dd will read large chunks/amounts of data 
> (read: multiple LBAs) from the underlying disk at once, rather than
> the disk itself performing a per-LBA test).  My opinion is that the
> "dd method" should only be used on drives which don't support
> selective LBA scanning via SMART.
Will dd read multiple LBAs even when using 'bs=512'? The process I used
was to read with bs=8192 first, then zoom in on the LBAs mentioned in
the errors in dmesg with bs=512 to find the actual failing LBA.
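
A minimal sketch of that zoom-in step, for illustration only. It is not
the dd invocation that was actually used. It assumes a 512-byte logical
sector size, and the names find_bad_lbas and device_reader are
hypothetical. Each sector is requested on its own, so a failing read
maps to exactly one LBA as far as the host is concerned; whether the
drive reads ahead internally is not something this can show.

```python
import os

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def find_bad_lbas(read_sector, first_lba, count):
    """Return the LBAs in [first_lba, first_lba + count) whose read fails."""
    bad = []
    for lba in range(first_lba, first_lba + count):
        try:
            read_sector(lba)
        except OSError:
            # The read of this single sector failed; record it and move on.
            bad.append(lba)
    return bad


def device_reader(path):
    """Build a read_sector(lba) callable for a disk device or image file.

    The file descriptor stays open for the life of the process, which is
    acceptable for a one-off diagnostic script.
    """
    fd = os.open(path, os.O_RDONLY)

    def read_sector(lba):
        data = os.pread(fd, SECTOR_SIZE, lba * SECTOR_SIZE)
        if len(data) != SECTOR_SIZE:
            raise OSError("short read at LBA %d" % lba)
        return data

    return read_sector


# Hypothetical usage (the LBA is a placeholder, not taken from dmesg):
#   find_bad_lbas(device_reader("/dev/ad4"), 123456, 16)
```

A count of 16 covers one 8192-byte block, i.e. the coarse bs=8192 step
narrowed down to single sectors.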

A selective scan on ad4 did not reveal any errors today: it 'completed 
without error'. On ad6 it's a whole lot slower; at the time of writing 
it's at 2/3.

> All HD vendors have their own quirks/ordeals right now.  You
> basically just have to go with one who works well for you, then if
> things start going downhill, switch to another.  None of them are
> perfect.
I figured as much. What irritates me, though, is that I've had consistent 
problems with 4 disks in this specific system, but no such issues 
with any disk in the other systems I've had. I generally replace disks 
when I outgrow them, not because they break down.

> What this indicates to me is that if a disk falls off the bus on an
> ICH9 controller in Enhanced (non-AHCI) mode, FreeBSD starts seeing an
> absurd number of interrupts generated from the ICH9.  My guess is
> FreeBSD isn't doing something correctly with the controller when this
> happens; maybe certain commands aren't being sent back to the
> controller or handling of certain events is being done improperly
> when it comes to ICH9 (or possibly earlier ICH revisions too).  This
> should be *very* easy to reproduce.

Unfortunately I'm not really in a position to help reproduce this or 
test possible fixes; downtime is currently very unwelcome. Although 
one of the previous disks did indeed fall off the bus entirely (I
couldn't get it back with atacontrol either), that hasn't happened again
so far. I only see timeouts (and, a few days ago, read errors on ad4),
which gmirror doesn't like. I guess those aren't that simple to
reproduce (apart from on my system ;).

> If you see any of your disks on the ICH9 controller fall off the bus
> or report ATA errors (doesn't matter what kind), please make note of
> the timestamp (should be in the kernel log), and ASAP run "smartctl
> -a" on the disk.  You should compare attributes before and after the
> event.
> You might also want to consider using smartd, which can log SMART 
> attribute changes on its own.  Note that you might have to tune the 
> arguments in smartd.conf to ignore some attributes which fluctuate 
> naturally (such as drive temperature and seek error rate).

I've configured smartd to poll both disks every 5 minutes. I -think- the 
issues happen specifically under load: the periodic scripts of the host 
and its 4 jails appear to trigger it sometimes. At that time I'm 
normally trying to get some sleep, so smartd will have to do for now. 
Although I'll run a "smartctl -a" asap anyway.
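
The before/after comparison suggested above could be sketched roughly
like this. It is a hedged illustration, not an existing tool: it assumes
the usual ten-column attribute table that smartctl prints (header line
starting with "ID#", raw value in the last column), and the function
names are hypothetical. Attributes 7 (Seek_Error_Rate) and 194
(Temperature_Celsius) are skipped by default as the naturally
fluctuating ones; the exact IDs may differ per drive.

```python
IGNORED_IDS = frozenset({7, 194})  # assumption: seek error rate, temperature


def parse_attributes(smartctl_output):
    """Map attribute ID -> (name, raw value) from smartctl attribute text."""
    attrs = {}
    in_table = False
    for line in smartctl_output.splitlines():
        if line.lstrip().startswith("ID#"):
            in_table = True  # header row of the attribute table
            continue
        if not in_table:
            continue
        fields = line.split()
        # Rows: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) < 10 or not fields[0].isdigit():
            break  # first non-attribute line ends the table
        attrs[int(fields[0])] = (fields[1], " ".join(fields[9:]))
    return attrs


def changed_attributes(before, after, ignored=IGNORED_IDS):
    """Return {name: (old_raw, new_raw)} for raw values that differ."""
    old = parse_attributes(before)
    new = parse_attributes(after)
    changes = {}
    for attr_id, (name, raw) in new.items():
        if attr_id in ignored or attr_id not in old:
            continue
        if old[attr_id][1] != raw:
            changes[name] = (old[attr_id][1], raw)
    return changes
```

Feeding it the saved text of a "smartctl -a" run from before an event
and one from right after it gives just the attributes whose raw value
moved.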

-- 
Pieter