sporadic CAM (all devices) outage on 11-stable, mps(4), ahci(4) and bhyve(8) involved. [Was: Re: mps(4) blocks panic-reboot]

Wed Jun 7 08:18:33 UTC 2017

 Regarding Harry Schmalzbauer's message from 01.06.2017 21:10 (localtime):
> Regarding Stephen Mcconnell's message from 01.06.2017 20:55 (localtime):
>> Take a look at PR 212914. Could that be the issue? It was MFC'd to stable/11
>> with r309273 on Nov 28th, 2016.
> Thanks a lot, but that's unrelated.

Unfortunately, a similar lockup occurred today :-(

I was informed by mps(4):

(da1:mps0:0:3:0): READ(10). CDB: 28 00 06 7e 4d 53 00 00 10 00
(da1:mps0:0:3:0): CAM status: Unrecoverable Host Bus Adapter Error
(da1:mps0:0:3:0): Retrying command
(da1:mps0:0:3:0): WRITE(10). CDB: 2a 00 06 f8 c5 1f 00 00 38 00
(da1:mps0:0:3:0): CAM status: Unrecoverable Host Bus Adapter Error
(da1:mps0:0:3:0): Retrying command
(da1:mps0:0:3:0): WRITE(10). CDB: 2a 00 06 f8 c5 1f 00 00 38 00
(da1:mps0:0:3:0): CAM status: SCSI Status Error
(da1:mps0:0:3:0): SCSI status: Check Condition
(da1:mps0:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset,
or bus device reset occurred)
(da1:mps0:0:3:0): Error 6, Retries exhausted
(da1:mps0:0:3:0): Invalidating pack
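
For anyone who wants to map the CDBs above to on-disk locations, here is
a minimal sketch of decoding them. It assumes the standard 10-byte
READ(10)/WRITE(10) layout (opcode, flags, 4-byte big-endian LBA, group
number, 2-byte big-endian transfer length, control). The helper name
decode_cdb10 is hypothetical and not part of any FreeBSD tool.

```python
# Hypothetical helper: decode a 10-byte READ(10)/WRITE(10) CDB as printed
# by CAM (space-separated hex bytes). Assumes the standard SBC layout.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}

def decode_cdb10(hex_string):
    cdb = bytes.fromhex(hex_string)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    if cdb[0] not in OPCODES:
        raise ValueError("not a READ(10)/WRITE(10) opcode")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length in blocks
    return OPCODES[cdb[0]], lba, blocks

if __name__ == "__main__":
    for line in ("28 00 06 7e 4d 53 00 00 10 00",
                 "2a 00 06 f8 c5 1f 00 00 38 00"):
        print(decode_cdb10(line))
```

With the two CDBs from the log, this gives a 16-block read at LBA
108940627 and a 56-block write at LBA 116966687.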

But it seems all drives got lost again (although the kernel could no
longer print the corresponding messages): from another, still responsive
session (memory-disk rootfs) I could get the zpool status, and ZFS
reported all members having outstanding requests:
  pool: cetusPsys
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool
clear'.
   see: http://illumos.org/msg/ZFS-8000-JQ
  scan: none requested
config:

        NAME                    STATE     READ WRITE CKSUM
        cetusPsys               ONLINE     370    13     0
          mirror-0              ONLINE      40    12     0
            gpt/cetusSYSzd1of4  ONLINE       3    26     0
            da2                 ONLINE       3    16     0
          mirror-1              ONLINE     700     9     0
            gpt/cetusSYSzd2of4  ONLINE       3     9     0
            da3                 ONLINE       3    54     0

I'll do anything I can to help track down this problem, since the one
thing happened that I have taken massive precautions to prevent... a
freezing hypervisor :-(

Thanks,

-harry

(In case anyone is following any of my other recent PRs: this time, no
passthru-enabled VM was involved. The latter causes some very serious
memory corruption IMHO... This machine is a Xeon E3 with ECC; neither
MBC nor MCE reports ECC errors...)