A failed drive causes system to hang

Sun Apr 14 19:44:41 UTC 2013

On Sun, Apr 14, 2013 at 02:58:15PM -0400, Zaphod Beeblebrox wrote:
> I'd like to throw in my two cents here.  I've seen this (drives in RAID-1
> configuration) hanging whole systems.  Back in the IDE days, two drives
> were connected with one cable --- I largely wrote it off as a deficiency of
> IDE hardware and resolved to buy SCSI hardware for more important systems.
> Of late, the physical hardware for SCSI (SAS) and SATA drives has
> converged.  I'm willing to accept that SAS hardware may be built to a
> different standard, but I'm suspicious of the fact that a bad SATA drive on
> an ACH* controller can hang the whole system.

Note to readers: this is borderline off-topic and is going to confuse
the thread even more.  I will respond to this ONLY ONCE, and WILL NOT be
responding to this part of the thread past this point.

I have only seen this happen on very specific controllers (JMicron for
example), where either the AHCI driver was broken/badly written, or the
underlying AHCI option ROM/firmware code was broken/badly written.

> ... it's not complete, however.  Often pulling the drive's cable will
> unfreeze things.  It's also not entirely consistent.  Drives I have
> behind 4:1 port multipliers haven't (so far) hung the system that
> they're on (which uses ACH10).  Right now, I have a remote ACH10
> system that's hung hard a couple of times --- and it passes both its
> short and long SMART tests on both drives.

PMPs (port multipliers) are a *completely* separate beast, where some
AHCI controllers (at a silicon level) screw up/break.  In fact, the
IXP600/700 is one such controller, and workarounds had to be put into
FreeBSD and Linux for them.  I can dig up the commits if need be.

Rule of thumb (which you know -- this is for other readers): when using
a PMP, it's VERY IMPORTANT that this be disclosed up front.  PMPs add a
serious complication to analysis of the SATA subsystem as a whole, and
in a lot of cases visibility into details is lost as a result.  PMPs in
general are "bleh".

> Is there no global timeout we can depend on here?

Please see kern.cam.ada.default_timeout (for adaX devices) and
kern.cam.pmp.default_timeout (for I/O requests going across a PMP).
Otherwise Alexander Motin (mav@) would be the guy to ask about PMP
issues, and/or get him hardware + provide a reliable reproduction
methodology for the issue.
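
For readers who want to check these two timeouts on their own system, a
minimal sketch follows.  It assumes a FreeBSD host where `sysctl` prints
each entry as "name: value" and that both values are integers (seconds).
The helper names parse_sysctl and read_cam_timeouts are made up for this
example, and the sample text in the last block is illustrative, not
captured from a real machine.

```python
import subprocess

# Sysctl names taken from the message above.
TIMEOUT_OIDS = (
    "kern.cam.ada.default_timeout",
    "kern.cam.pmp.default_timeout",
)


def parse_sysctl(text):
    """Parse `sysctl` output lines of the form 'name: value' into a dict of ints."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'name: value' line
        try:
            values[name.strip()] = int(value.strip())
        except ValueError:
            continue  # skip non-integer values
    return values


def read_cam_timeouts():
    """Query the live values; only meaningful on a FreeBSD host."""
    out = subprocess.run(
        ["sysctl", *TIMEOUT_OIDS],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sysctl(out)


if __name__ == "__main__":
    # Illustrative sample text only (hypothetical values).
    sample = "kern.cam.ada.default_timeout: 30\nkern.cam.pmp.default_timeout: 30\n"
    for name, seconds in parse_sysctl(sample).items():
        print(f"{name} = {seconds}")
```

On a real FreeBSD machine, call read_cam_timeouts() instead of feeding in
the sample text.  Parsing is kept separate from the subprocess call so it
can be exercised on any platform.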

All the above said:

Respectfully, please do not conflate your issue with this one.

Please start a new thread about this issue if you wish.  Do not reply to
this thread and change the Subject line; please actually start a brand
new Email to ensure no References headers are retained.

There is already too much crap going on in this thread with 4 different
people with what are 4 different issues, and nobody at this point is
able to keep track of it all (including the participants).

This situation happens way, WAY too often with storage-related matters
on the list.  ANYTHING ZFS-related and ANYTHING storage-related results
in bandwagon-jumping and threads that spiral out of control/become
almost useless and certainly impossible to follow.  It needs to stop.

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |