lsi1064e

Eugene M. Zheganin eugene at zhegan.in
Thu Jun 2 12:58:49 UTC 2011


Hi.

I'm using FreeBSD 8.2 on IBM System x3250 servers, which come with an
onboard LSI 1064e controller.
I'm using them with geom_mirror and zfs (I have about a dozen of these).

Last time I noticed a weird thing on a server with gmirror: one drive
died and the server hung until it was rebooted. This week I was
examining some zfs-related freezes (I guess it's about ARC size, but
someone on IRC told me that disk timeouts can be the reason too) and I
was experimenting on my test server (which is waiting to be put into
production). And I noticed some wrong (at least I think it's wrong)
behaviour: keeping in mind that last time I got a freeze when a drive
died, I pulled out one of the two drives in a zfs mirrored pool. I got
an immediate freeze: all disk operations were frozen, but the system
was alive. I entered the kernel debugger and saw a bunch of processes
in D state, including some of the zfs threads.

I updated the LSI1064e firmware (the latest 1.30.xx found on the IBM
site) and the BIOS, but nothing helped. When one of the disks is pulled
out (there's no need to do that in production, but I guess the exact
same thing happens when a drive dies along with all of its electric
circuits) the system waits indefinitely, until the drive is pushed back
in or the server is rebooted. Then (if the drive is pushed back in) the
mpt driver realises that either the drive was reset or the device was
lost (I don't know what this depends on).

Funny thing: after the drive is pulled out and pushed back in, and a
camcontrol rescan is issued, you can pull it out again, and this time
(and any time after that) the system will detect that the drive is
gone quite fast, and no disk operations will freeze.

You can imagine that this behaviour is not what anyone expects when a
drive dies. So I want to ask: can this be tuned, so that the system
keeps running and detects that the drive has failed within some short
time, like 3-15 seconds? Or is this a bug, and I need to file a PR?
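
To put a number on how long the stall lasts, a read probe with a
deadline might look like the sketch below. It is an illustration only:
probe_read, the default timeout and the device path in the usage note
are made-up names, not anything from the mpt driver or FreeBSD, and the
probe only reports the hang, it does not fix it.

```python
import os
import threading


def probe_read(path, timeout_s=5.0, nbytes=4096):
    """Try one read of `path`; return "ok", "error: ..." or "timeout".

    Hypothetical helper for illustration. The read runs in a worker
    thread so the caller gets control back after `timeout_s` seconds
    even if the read itself never returns.
    """
    result = {}

    def worker():
        try:
            fd = os.open(path, os.O_RDONLY)
            try:
                os.read(fd, nbytes)
            finally:
                os.close(fd)
            result["status"] = "ok"
        except OSError as exc:
            result["status"] = "error: " + str(exc)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)  # wait at most timeout_s seconds
    if thread.is_alive():
        # The worker is still blocked in the read. It cannot be
        # cancelled from here; we only report that the deadline passed.
        return "timeout"
    return result.get("status", "error: unknown")
```

Calling probe_read("/dev/da0", timeout_s=15.0) in a loop (da0 is an
assumed device name) would log "timeout" for as long as reads stall.
Note that a thread stuck in an uninterruptible read can still keep the
process itself from exiting cleanly.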

Thanks.
Eugene.


More information about the freebsd-scsi mailing list