multipath problem: active provider chosen on passive FC path?
patpro at patpro.net
Mon Dec 8 10:14:43 UTC 2014
Hello,
I'm not sure this is the best place to raise my problem; let me know if another mailing list would be more appropriate.
I've installed FreeBSD 9.3 on two HP blade servers (G6) in an HP C7000 chassis. The chassis uses two Brocade FC switches (active/passive, if I'm not mistaken). The blade servers use QLogic HBAs:
isp0: <Qlogic ISP 2432 PCI FC-AL Adapter> port 0x4000-0x40ff mem 0xfbff0000-0xfbff3fff irq 30 at device 0.0 on pci6
isp1: <Qlogic ISP 2432 PCI FC-AL Adapter> port 0x4400-0x44ff mem 0xfbfe0000-0xfbfe3fff irq 37 at device 0.1 on pci6
A SAN array presents a dedicated logical unit to each FreeBSD server. On a given server I see 4 paths to the presented LU, which I use to create a GEOM_MULTIPATH device:
(from dmesg)
GEOM_MULTIPATH: SPLUNK_1 created
GEOM_MULTIPATH: da2 added to SPLUNK_1
GEOM_MULTIPATH: da2 is now active path in SPLUNK_1
GEOM_MULTIPATH: da3 added to SPLUNK_1
GEOM_MULTIPATH: da6 added to SPLUNK_1
GEOM_MULTIPATH: da7 added to SPLUNK_1
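For reference, the multipath device was created the usual way, roughly like this (quoting from memory, so the exact invocation may differ slightly; the module is loaded at boot via loader.conf, and the other paths to the same LU get attached automatically once the label is written):
# echo 'geom_multipath_load="YES"' >> /boot/loader.conf
# gmultipath label SPLUNK_1 da2 da3 da6 da7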
# camcontrol devlist | grep VRAID
<DGC VRAID 0532> at scbus0 target 2 lun 0 (pass4,da2)
<DGC VRAID 0532> at scbus0 target 3 lun 0 (pass5,da3)
<DGC VRAID 0532> at scbus1 target 4 lun 0 (pass12,da6)
<DGC VRAID 0532> at scbus1 target 5 lun 0 (pass13,da7)
# gmultipath status
              Name   Status  Components
multipath/SPLUNK_1  OPTIMAL  da2 (ACTIVE)
                             da3 (PASSIVE)
                             da6 (PASSIVE)
                             da7 (PASSIVE)
Unfortunately, during boot and during normal operation, the first provider (da2 here) seems faulty:
isp0: Chan 0 Abort Cmd for N-Port 0x0008 @ Port 0x090a00
(da2:isp0:0:2:0): Command Aborted
(da2:isp0:0:2:0): READ(6). CDB: 08 00 03 28 02 00
(da2:isp0:0:2:0): CAM status: CCB request aborted by the host
(da2:isp0:0:2:0): Retrying command
../..
isp0: Chan 0 Abort Cmd for N-Port 0x0008 @ Port 0x090a00
(da2:isp0:0:2:0): Command Aborted
(da2:isp0:0:2:0): WRITE(10). CDB: 2a 00 00 50 20 21 00 00 05 00
(da2:isp0:0:2:0): CAM status: CCB request aborted by the host
(da2:isp0:0:2:0): Retrying command
../..
These errors make booting really slow (10-15 minutes), but the device is not deactivated. On both servers it is always the first provider of the multipath device (always the first one on scbus0) that seems faulty, so I guess scbus0 is connected to the passive FC switch.
If I put sustained I/O on the multipath device, the faulty provider is eventually marked FAIL and another one, chosen on scbus1, is marked ACTIVE. As soon as a provider on scbus1 is ACTIVE, read/write throughput comes back to the expected values.
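Once the system is up I can check and switch the active path by hand with the standard gmultipath subcommands (getactive shows the currently active provider, rotate moves the active role to the next one):
# gmultipath getactive SPLUNK_1
# gmultipath rotate SPLUNK_1
but rotating by hand after every reboot is obviously not a real solution.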
For example, diskinfo(8) shows horrendous performance (240+ ms seek times...):
# diskinfo -t /dev/multipath/SPLUNK_1
/dev/multipath/SPLUNK_1
512 # sectorsize
107374181888 # mediasize in bytes (100G)
209715199 # mediasize in sectors
0 # stripesize
0 # stripeoffset
13054 # Cylinders according to firmware.
255 # Heads according to firmware.
63 # Sectors according to firmware.
CKM00114800912 # Disk ident.
Seek times:
Full stroke: 250 iter in 1.172849 sec = 4.691 msec
Half stroke: 250 iter in 2.499101 sec = 9.996 msec
Quarter stroke: 500 iter in 124.113431 sec = 248.227 msec
Short forward: 400 iter in 62.483828 sec = 156.210 msec
Short backward: 400 iter in 62.844187 sec = 157.110 msec
Seq outer: 2048 iter in 240.999614 sec = 117.676 msec
Seq inner: 2048 iter in 121.210282 sec = 59.185 msec
(during this test, da2 is marked FAIL:
GEOM_MULTIPATH: Error 5, da2 in SPLUNK_1 marked FAIL
GEOM_MULTIPATH: da7 is now active path in SPLUNK_1
and the transfer-rate test then goes well:)
Transfer rates:
outside: 102400 kbytes in 1.023942 sec = 100006 kbytes/sec
middle: 102400 kbytes in 1.104299 sec = 92729 kbytes/sec
inside: 102400 kbytes in 1.137533 sec = 90019 kbytes/sec
# gmultipath status
              Name    Status  Components
multipath/SPLUNK_1  DEGRADED  da2 (FAIL)
                              da3 (PASSIVE)
                              da6 (PASSIVE)
                              da7 (ACTIVE)
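(I can then bring the failed path back with, I believe, a simple restore:
# gmultipath restore SPLUNK_1 da2
which puts the array back to OPTIMAL, at least until da2 gets marked FAIL again.)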
Is there any way I can tell GEOM to pick its active provider on scbus1 at boot time? Or is there any chance I'm totally misunderstanding the problem?
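(The only boot-time workaround I can think of is to fail the two scbus0 providers from an rc script early during boot, something like:
# gmultipath fail SPLUNK_1 da2
# gmultipath fail SPLUNK_1 da3
but that feels like papering over the real problem rather than fixing it.)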
(Other blades in the same chassis have been running VMware ESXi in production for years without any problem, so I guess the switches and the SAN are correctly configured.)
thanks,
Patrick
--
# sysctl -a | grep dev.isp
dev.isp.0.%desc: Qlogic ISP 2432 PCI FC-AL Adapter
dev.isp.0.%driver: isp
dev.isp.0.%location: slot=0 function=0 handle=\_SB_.PCI0.PT07.SLT0
dev.isp.0.%pnpinfo: vendor=0x1077 device=0x2432 subvendor=0x103c subdevice=0x1705 class=0x0c0400
dev.isp.0.%parent: pci6
dev.isp.0.wwnn: 5764963215108688473
dev.isp.0.wwpn: 5764963215108688472
dev.isp.0.loop_down_limit: 60
dev.isp.0.gone_device_time: 30
dev.isp.1.%desc: Qlogic ISP 2432 PCI FC-AL Adapter
dev.isp.1.%driver: isp
dev.isp.1.%location: slot=0 function=1 handle=\_SB_.PCI0.PT07.SLT1
dev.isp.1.%pnpinfo: vendor=0x1077 device=0x2432 subvendor=0x103c subdevice=0x1705 class=0x0c0400
dev.isp.1.%parent: pci6
dev.isp.1.wwnn: 5764963215108688475
dev.isp.1.wwpn: 5764963215108688474
dev.isp.1.loop_down_limit: 60
dev.isp.1.gone_device_time: 30