3Ware 9000 series hangs under load
Scott Long
scottl at samsco.org
Thu Oct 30 09:34:19 PDT 2008
Oliver Lehmann wrote:
> Hi,
>
> I have problems with my 3ware controller. Under heavy I/O load (e.g.
> running 40 port builds the day over with tinderbox which involves
> un-taring a whole FreeBSD tree 40 times), my system hangs with the well
> known
>
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2, size: 4096
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2, size: 4096
>
> error. I've opened a ticket at 3ware and after half a month of
> dummy testing (are your drives fine, can you run a stress test), it
> looks like I was redirected to someone from second-level support and he
> told me:
>
> There are 2 things that you can try,
> 1, disable apic in your bootloader.conf file, or RMA the controller.
>
> The error that you have is generally caused by an interrupt problem,
> defective backplane, bad drive or bad controller.
>
> and after I told him that I intend to use the 2 CPUs I have and not
> fall back to one CPU forever, he responded:
>
> Yes I do understand about disabling APIC, but the feature is sometimes
> not stable in all dual proc systems. There are many variables, the
> CPU's have to be matched down to the Lot #, the motherboard must have a
> good design and the kernel supporting APIC must be stable. But, it is a
> good test to see if it is software or hardware.
>
> So what I did now was compile a kernel w/o apic/smp, and I've been running
> this configuration for 3 days, stressing the system w/o running into
> the swap_pager problem. Can it still be a controller problem, or is it
> more likely a problem with FreeBSD's smp/apic implementation or the board
> I'm using (Intel L440GX)?
>
> I'm asking because I'm not sure which problem it is now, and before
> telling 3ware and having them respond "ok it is a FreeBSD problem"
> or "ok it is a board problem" I'd like to know what can be the case here.
>
> (please keep me CCed, I'm not subscribed to smp@)
>
> Further information (and the history) on this topic can be found here
> (and following):
>
> http://lists.freebsd.org/pipermail/freebsd-stable/2008-September/045500.html
>
>
The probability that it's a problem in the generic interrupt/APIC code
in FreeBSD is low. That code has matured quite well over the last 5
years, and it is very solid for just about every other hardware
configuration out there. I'd suspect the following things in the
following order:
1. Driver bug. Driver might be losing an interrupt, or might be
deadlocking due to coding/design problems.
2. Defective controller
3. Buggy firmware on the controller. FreeBSD does tend to push I/O
controllers a lot harder than other OS's, resulting in strange bugs
sometimes being found.
4. Defective motherboard.
The fact that it's running fine with SMP/APIC disabled could easily mean
that it's not taking as high a load, and is thus avoiding problems.
It could also mean that latent bugs in the driver are not being exposed.
I don't have a lot of time to spend debugging this, but I'd suggest that
you either take up AMCC's offer to RMA the board, or put a spare ATA
drive in the chassis and set it up as a dump partition, then get a
crashdump of the system when it gets into this state.
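A minimal sketch of the dump setup suggested above, assuming the spare ATA drive shows up as ad4 and carries a swap-type partition ad4s1b at least as large as physical RAM. Both device names are hypothetical, so substitute your own. The two variables are the usual dumpdev and dumpdir settings in /etc/rc.conf:

```shell
# /etc/rc.conf fragment (sketch only; the device name below is an assumption)

# Partition the kernel writes the crash dump to when it panics.
dumpdev="/dev/ad4s1b"

# Directory where savecore(8) stores the recovered dump on the next boot.
dumpdir="/var/crash"
```

Running "dumpon /dev/ad4s1b" by hand should enable the dump device without a reboot. Since the machine hangs rather than panics, a kernel built with options KDB and DDB is probably needed so you can break into the debugger and force the dump.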
Scott