multipath device never failing - loops over providers instead

Jan Bramkamp crest at rlwinm.de
Tue Feb 14 14:19:21 UTC 2017


On 11/02/2017 05:56, John wrote:
> Hi Folks,
>
>    Running 10.3-STABLE  r308246 from Nov 3, 2016
>
>    I thought I saw a commit in this area a while back but I
> cannot seem to find it nor is google helping..
>
>    I have SAS drives behind 2 multiplexers (4 paths total) which
> are all configured similar to the following:
>
> # gmultipath status Z76
>          Name   Status  Components
> multipath/Z76  OPTIMAL  da92 (ACTIVE)
>                         da236 (PASSIVE)
>                         da428 (PASSIVE)
>                         da572 (PASSIVE)
>
>    For each path on the components above, the following sequence occurs:
>
> kernel: (da92:mpr0:0:399:0): READ(10). CDB: 28 00 0b a7 20 c0 00 00 10 00
> kernel: (da92:mpr0:0:399:0): CAM status: SCSI Status Error
> kernel: (da92:mpr0:0:399:0): SCSI status: Check Condition
> kernel: (da92:mpr0:0:399:0): SCSI sense: HARDWARE FAILURE asc:32,0 (No defect spare location available)
> kernel: (da92:mpr0:0:399:0): Info: 0xba720c0
> kernel: (da92:mpr0:0:399:0): Field Replaceable Unit: 157
> kernel: (da92:mpr0:0:399:0): Command Specific Info: 0x80010000
> kernel: (da92:mpr0:0:399:0): Actual Retry Count: 255
> kernel: (da92:mpr0:0:399:0): Retrying command (per sense data)
>
>    After each path has failed, the following is seen:
>
> kernel: GEOM_MULTIPATH: Error 5, da92 in Z76 marked FAIL
> kernel: GEOM_MULTIPATH: all paths in Z76 were marked FAIL, restore da572
> kernel: GEOM_MULTIPATH: all paths in Z76 were marked FAIL, restore da428
> kernel: GEOM_MULTIPATH: all paths in Z76 were marked FAIL, restore da236
> kernel: GEOM_MULTIPATH: da572 is now active path in Z76
>
>    and the entire failure loop occurs again. The multipath device
> itself is never failed (so the zfs pool can never go into degraded mode,
> the faulty drive replaced with a spare, etc).
>
>    Once I pulled the drive the multipath device Z76 fails and
> things went as expected.
>
>    It seems g_multipath_fault() in this instance should just fail the device.
>
>    Does anyone have any pointers on this issue?

This is a known bug in GEOM multipath. There are at least two open PRs
mentioning exactly this problem. When I encountered it, it even prevented
my system from booting into single user mode. As soon as GEOM multipath
found the metadata over one path it consumed that path. Too bad that the
tasting of the new multipath provider triggered a read error, because
GEOM multipath just entered an infinite retry loop over all known paths.
Because of this bug GEOM multipath is unusable for production. I suspect
that it wouldn't be too hard to fix if there is a way to attach some
state (e.g. a bitmap of failed paths) to each BIO request.
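
To make the bitmap idea a bit more concrete, here is a rough userland
sketch in C. It is not GEOM code and not a patch: struct mp_request,
struct mp_device, mp_next_path() and mp_submit() are all made-up names
for illustration, and the real struct bio has no such field as far as
this sketch assumes. The point is only that once every path has its bit
set for a given request, the request is failed with EIO (the "Error 5"
from the log above) instead of restoring the paths and starting over.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PATHS 32	/* one bit per path in a 32-bit map (assumption) */

/* Hypothetical per-request state; NOT the real struct bio. */
struct mp_request {
	uint32_t failed_paths;	/* bit i set: path i already failed this request */
};

/* Hypothetical device model: path_ok[i] says whether I/O on path i succeeds. */
struct mp_device {
	int  npaths;		/* must be <= MAX_PATHS */
	bool path_ok[MAX_PATHS];
};

/* Return the first path this request has not failed on yet, or -1 if none. */
int
mp_next_path(const struct mp_device *dev, const struct mp_request *req)
{
	for (int i = 0; i < dev->npaths; i++) {
		if ((req->failed_paths & (UINT32_C(1) << i)) == 0)
			return (i);
	}
	return (-1);
}

/*
 * Try the request on each untried path in turn.  Returns 0 on the first
 * success, or EIO once every path has failed for this particular request.
 * *attempts is set to the number of paths that were tried.
 */
int
mp_submit(const struct mp_device *dev, struct mp_request *req, int *attempts)
{
	int path;

	*attempts = 0;
	while ((path = mp_next_path(dev, req)) != -1) {
		(*attempts)++;
		if (dev->path_ok[path])
			return (0);
		/* Remember the failure so this path is never retried. */
		req->failed_paths |= UINT32_C(1) << path;
	}
	return (EIO);
}
```

With four paths that all return errors, as in John's case, mp_submit()
tries each path once and then gives up with EIO, so the number of
retries is bounded by the number of paths. Whether the state would live
in the BIO itself or in a structure hanging off it is a design question
this sketch does not try to answer.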


More information about the freebsd-scsi mailing list