ZFS and 3ware controller resets

David P Discher dpd at bitgravity.com
Tue Sep 27 19:17:43 UTC 2011


We use a lot of this exact 3ware controller (and firmware) with ZFS and 8.1-RELEASE.  Though I have seen controller resets, I have not seen this exact error with ZFS and 3ware.  We run 2x RAID-1 plus a 14-disk RAID-5, -50, or -10, and the controller seems to survive disk failures in RAID configurations with ZFS.  However, we do sometimes hit the "calcru: ... went backwards" messages while the controller resets and the kernel tries to sort things out.  Of course, this is likely to be service impacting.

When multiple controller resets are detected, we have typically declared the card bad and RMA'd or replaced it.  So far, our VAR has not refused to replace a card within the standard 3-year warranty.
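When we suspect a card, we look at the controller's event (AEN) log first.  With the 3ware CLI that looks something like the following (syntax from memory for our 9650SE, so check it against tw_cli's own documentation for your firmware):

```shell
# Dump the controller's AEN/event log; repeated controller-reset
# entries there are what we treat as grounds for an RMA.
tw_cli /c0 show alarms

# Overall unit and port status on the same controller.
tw_cli /c0 show
```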

I would recommend replacing the controller. 

HOWEVER - I have seen this ZFS behavior with a different controller/HBA setup.  We have older Xyratex 5400-series 48-bay what-evers connected to the FreeBSD host via fibre channel and an LSI 7404EP HBA (mpt).  Legacy setups exported LUNs/arrays from the Xyratex as RAID-5, which were then gstripe'd together into single volumes.  Setups upgraded to ZFS of course do away with the gstripe.
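For reference, the legacy layout was built roughly like this (device names are made up for illustration; each da(4) device here stands in for one Xyratex RAID-5 LUN seen over FC/mpt):

```shell
# Concatenate the exported LUNs into one striped GEOM volume,
# then put UFS2 (with soft updates) on top of it.
gstripe label -v vol0 /dev/da2 /dev/da3 /dev/da4
newfs -U /dev/stripe/vol0
mount /dev/stripe/vol0 /data
```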

With gstripe (and UFS2), when a Xyratex controller crashes and resets, GEOM gets confused, produces read/write errors, and eventually panics.   In the ZFS world, these failures are almost silent: zpool never reports an error (we're striping the LUNs in the zpool, no raidz or raidz2).  Eventually all the processes accessing disk hang in D state, and the machine grinds to a halt.
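The hang is easy to spot once you know to look for it; a quick way to list processes stuck in uninterruptible disk wait (state D) is:

```shell
# Print the ps header plus every process whose state field begins
# with "D" (uninterruptible disk wait).  A steadily growing list of
# these after a controller reset is the symptom described above.
ps -ax -o pid,stat,comm | awk 'NR == 1 || $2 ~ /^D/'
```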

The recommendation from the community was to use gmountver(8) from -head and use those vdevs in the zpool.  We got it backported to 8.1.  However, there were some issues with GEOM tasting order and which vdevs get picked up by the zpool.  I have since abandoned this testing.  We were never able to get multipathing working under FreeBSD.
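The gmountver setup we tested looked roughly like this (device names hypothetical; this is a sketch of what we backported, not a recommendation, given the tasting-order problems mentioned above):

```shell
# Load the mount-verification GEOM class (from -head at the time,
# backported into our 8.1 tree).
kldload geom_mountver

# Wrap each LUN in a mountver provider; I/O to the .mountver device
# is suspended and retried when the underlying provider disappears,
# instead of failing outright.
gmountver create da2 da3 da4

# Build the pool on the wrapped providers rather than the raw disks.
zpool create tank /dev/da2.mountver /dev/da3.mountver /dev/da4.mountver
```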


---
David P. Discher
dpd at bitgravity.com * AIM: bgDavidDPD
BITGRAVITY * http://www.bitgravity.com

On Sep 25, 2011, at 10:15 AM, Adam Nowacki wrote:

> On 2011-09-25 18:59, Jeremy Chadwick wrote:
>> On Sun, Sep 25, 2011 at 05:32:55PM +0200, Adam Nowacki wrote:
>>> I have a 20 disk storage system, every now and then a disk dies and
>>> causes 3ware controller to reset because of disk timeouts. This cuts
>>> out ZFS from all disks, even healthy ones and the system requires a
>>> hard reset.
>>> Two issues here:
>>> 1) Why does the controller have to reset? That's a completely insane
>>> way of dealing with a drive timeout.
>>> 2) ZFS not reopening the disk after controller reset.
>>> 
>>> FreeBSD version: 8.1-RELEASE-p1
>>> 
>>> /c0 Driver Version = 3.80.06.003
>>> /c0 Model = 9650SE-16ML
>>> /c0 Available Memory = 224MB
>>> /c0 Firmware Version = FE9X 4.10.00.007
>>> /c0 Bios Version = BE9X 4.08.00.002
>>> /c0 Boot Loader Version = BL9X 3.08.00.001

...

> 
> I mean that not only the timed-out disk is affected, but all disks on the controller. Every single one stops working for ZFS; you can see that in the zpool status output, where each disk reports read and write errors. zpool clear won't fix it. ZFS simply loses access to all disks on the controller, while, for example, dd can read from each disk just fine. Also on the same controller I have a disk with a UFS filesystem, mounted when the controller resets, and it survives the reset as if it didn't even happen. For ZFS the only fix is to hard reset the whole system.



More information about the freebsd-fs mailing list