A failed drive causes system to hang
Jeremy Chadwick
jdc at koitsu.org
Sun Apr 14 19:44:41 UTC 2013
On Sun, Apr 14, 2013 at 02:58:15PM -0400, Zaphod Beeblebrox wrote:
> I'd like to throw in my two cents here. I've seen this (drives in RAID-1
> configuration) hanging whole systems. Back in the IDE days, two drives
> were connected with one cable --- I largely wrote it off as a deficiency of
> IDE hardware and resolved to buy SCSI hardware for more important systems.
> Of late, the physical hardware for SCSI (SAS) and SATA drives has
> converged. I'm willing to accept that SAS hardware may be built to a
> different standard, but I'm suspicious of the fact that a bad SATA drive on
> an ACH* controller can hang the whole system.
Note to readers: this is borderline off-topic and is going to confuse
the thread even more. I will respond to this ONLY ONCE, and WILL NOT be
responding to this part of the thread past this point.
I have only seen this happen on very specific controllers (JMicron for
example), where either the AHCI driver was broken/badly written, or the
underlying AHCI option ROM/firmware code was broken/badly written.
> ... it's not complete, however. Often pulling the drive's cable will
> unfreeze things. It's also not entirely consistent. Drives I have
> behind 4:1 port multipliers haven't (so far) hung the system that
> they're on (which uses ACH10). Right now, I have a remote ACH10
> system that's hung hard a couple of times --- and it passes both its
> short and long SMART tests on both drives.
PMPs (port multipliers) are a *completely* separate beast, where some
AHCI controllers (at a silicon level) screw up/break. In fact, the
IXP600/700 is one such controller, and workarounds had to be put into
FreeBSD and Linux for them. I can dig up the commits if need be.
Rule of thumb (which you know -- this is for other readers): when using
a PMP, it's VERY IMPORTANT that this be disclosed up front. PMPs add a
serious complication to analysis of the SATA subsystem as a whole, and
in a lot of cases visibility into details is lost as a result. PMPs in
general are "bleh".
> Is there no global timeout we can depend on here?
Please see kern.cam.ada.default_timeout (for adaX devices) and
kern.cam.pmp.default_timeout (for I/O requests going across a PMP).
Otherwise, Alexander Motin (mav@) would be the person to ask about PMP
issues; ideally, get him the hardware and a reliable reproduction
methodology for the issue.
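For anyone who wants to check what those two timeouts are currently set
to, here is a minimal sketch that shells out to sysctl(8) and parses the
"name: value" lines it prints. The helper names (parse_sysctl_output,
read_cam_timeouts) are made up for illustration, and the script assumes
a FreeBSD system where both OIDs exist. As far as I know the values are
in seconds, but verify that against the man pages for your release.

```python
import subprocess

# The two tunables mentioned above.
CAM_TIMEOUT_OIDS = (
    "kern.cam.ada.default_timeout",
    "kern.cam.pmp.default_timeout",
)


def parse_sysctl_output(text):
    """Turn 'name: value' lines, as printed by sysctl(8), into a dict of ints."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'name: value' line
        try:
            values[name.strip()] = int(value.strip())
        except ValueError:
            continue  # value is not an integer; skip it
    return values


def read_cam_timeouts():
    """Query the running system (hypothetical helper; needs the OIDs to exist)."""
    result = subprocess.run(
        ["sysctl", *CAM_TIMEOUT_OIDS],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if sysctl reports a failure
    )
    return parse_sysctl_output(result.stdout)
```

Calling read_cam_timeouts() on the affected machine returns a dict keyed
by OID name, which is handy to paste into a report alongside the
controller and drive details.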
All the above said:
Respectfully, please do not conflate your issue with this one.
Please start a new thread (do not reply to this thread and change the
Subject line; please actually start a brand new email to ensure no
References or In-Reply-To headers are retained) about this issue if you
wish.
There is already too much crap going on in this thread with 4 different
people with what are 4 different issues, and nobody at this point is
able to keep track of it all (including the participants).
This situation happens way, WAY too often with storage-related matters
on the list. ANYTHING ZFS-related and ANYTHING storage-related results
in bandwagon-jumping and threads that spiral out of control/become
almost useless and certainly impossible to follow. It needs to stop.
--
| Jeremy Chadwick jdc at koitsu.org |
| UNIX Systems Administrator http://jdc.koitsu.org/ |
| Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |