Only 0.44 (always) days of uptime with ciss (w/HP SA P812)
Sean Bruno
sbruno at ignoranthack.me
Mon Mar 24 12:32:59 UTC 2014
On Mon, 2014-03-24 at 10:48 +0100, Nagy, Attila wrote:
> Hi,
>
> I have an HP DL360G7 with an HP SmartArray P812 in it, which crashes
> at almost exactly 0.44 days of uptime (give or take a few minutes; on
> the graph it's nearly the same every time) no matter what I do, whether
> I load the machine until it's too hot to touch or just leave it idle.
> The P812 has an HP MDS600 connected to it with 70 1TB disks, configured
> as 6-disk RAID6 (ADG) volumes. The volumes have a 128k stripe size,
> because I use ZFS on top of them.
> The zpool is simply a stripe of the RAID6 volumes.
> What may be important: the controller's RAID6 initialization is still
> ongoing.
>
> By idle I mean the zpool/zfs is just mounted with only some stat()s
> happening on it (crashes after 0.44 days); by fully loaded I mean
> gstat shows around 100% utilization on the disks nearly all the time
> (also crashes after 0.44 days).
>
> I've already tried stable/9 at r260621 and stable/10 at r262152; the
> result is the same.
> I've also tried Linux (Ubuntu 13.10, hpsa driver, zfs on linux
> 0.6.2), and it doesn't crash (neither idle nor loaded).
> Already swapped the machine and the P812 to a different one, no effect.
> Everything (DL360, P812, MDS600, disks) has the latest firmware.
>
> The zpool currently in use was created under Linux to see whether the
> pool itself causes the problems, but of course many things differ
> between the two OSes (kernel, HP SA driver, block/SCSI layer, and even
> ZFS is somewhat different).
> Linux works, FreeBSD crashes no matter what I do.
>
> The exact message I can see is (ciss0 is the built-in P411):
> ciss1: ADAPTER HEARTBEAT FAILED
>
>
> Fatal trap 1: privileged instruction fault while in kernel mode
> cpuid = 0; apic id = 00
> instruction pointer = 0x20:0xfffffe0c59ff795d
> stack pointer = 0x28:0xfffffe0baf1ab9d0
> frame pointer = 0x28:0xfffffe0baf1aba20
> code segment = base 0x0, limit 0xfffff, type 0x1b
> = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags = interrupt enabled, resume, IOPL = 0
> current process = 12 (swi4: clock)
> trap number = 1
> panic: privileged instruction fault
> cpuid = 0
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0baf1ab560
> kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0baf1ab610
> panic() at panic+0x155/frame 0xfffffe0baf1ab690
> trap_fatal() at trap_fatal+0x3a2/frame 0xfffffe0baf1ab6f0
> trap() at trap+0x794/frame 0xfffffe0baf1ab910
> calltrap() at calltrap+0x8/frame 0xfffffe0baf1ab910
> --- trap 0x1, rip = 0xfffffe0c59ff795d, rsp = 0xfffffe0baf1ab9d0, rbp = 0xfffffe0baf1aba20 ---
> (null)() at 0xfffffe0c59ff795d/frame 0xfffffe0baf1aba20
> softclock_call_cc() at softclock_call_cc+0x16c/frame 0xfffffe0000e77120
> kernphys() at 0xffffffff/frame 0xfffffe0000e778a0
> kernphys() at 0xffffffff/frame 0xfffffe0000e78aa0
> kernphys() at 0xffffffff/frame 0xfffffe0000e78c20
> Uptime: 10h18m12s
> (da4:ciss1:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da4:ciss1:0:0:0): CAM status: Command timeout
> (da4:ciss1:0:0:0): Error 5, Retries exhausted
> (da4:ciss1:0:0:0): Synchronize cache failed
> (da5:ciss1:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da5:ciss1:0:1:0): CAM status: Command timeout
> (da5:ciss1:0:1:0): Error 5, Retries exhausted
> (da5:ciss1:0:1:0): Synchronize cache failed
> (da6:ciss1:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da6:ciss1:0:2:0): CAM status: Command timeout
> (da6:ciss1:0:2:0): Error 5, Retries exhausted
> (da6:ciss1:0:2:0): Synchronize cache failed
> (da7:ciss1:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da7:ciss1:0:3:0): CAM status: Command timeout
> (da7:ciss1:0:3:0): Error 5, Retries exhausted
> (da7:ciss1:0:3:0): Synchronize cache failed
> (da8:ciss1:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da8:ciss1:0:4:0): CAM status: Command timeout
> (da8:ciss1:0:4:0): Error 5, Retries exhausted
> (da8:ciss1:0:4:0): Synchronize cache failed
> (da9:ciss1:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da9:ciss1:0:5:0): CAM status: Command timeout
> (da9:ciss1:0:5:0): Error 5, Retries exhausted
> (da9:ciss1:0:5:0): Synchronize cache failed
> (da10:ciss1:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da10:ciss1:0:6:0): CAM status: Command timeout
> (da10:ciss1:0:6:0): Error 5, Retries exhausted
> (da10:ciss1:0:6:0): Synchronize cache failed
> (da11:ciss1:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da11:ciss1:0:7:0): CAM status: Command timeout
> (da11:ciss1:0:7:0): Error 5, Retries exhausted
> (da11:ciss1:0:7:0): Synchronize cache failed
> (da12:ciss1:0:8:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da12:ciss1:0:8:0): CAM status: Command timeout
> (da12:ciss1:0:8:0): Error 5, Retries exhausted
> (da12:ciss1:0:8:0): Synchronize cache failed
> (da13:ciss1:0:9:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da13:ciss1:0:9:0): CAM status: Command timeout
> (da13:ciss1:0:9:0): Error 5, Retries exhausted
> (da13:ciss1:0:9:0): Synchronize cache failed
> (da14:ciss1:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da14:ciss1:0:10:0): CAM status: Command timeout
> (da14:ciss1:0:10:0): Error 5, Retries exhausted
> (da14:ciss1:0:10:0): Synchronize cache failed
> Automatic reboot in 15 seconds - press a key on the console to abort
> Rebooting...
>
> Dmesg says:
> ciss1: <HP Smart Array P812> port 0x5000-0x50ff mem 0xfbe00000-0xfbffffff,0xfbdf0000-0xfbdf0fff irq 24 at device 0.0 on pci9
> ciss1: PERFORMANT Transport
> da5 at ciss1 bus 0 scbus2 target 1 lun 0
> da4 at ciss1 bus 0 scbus2 target 0 lun 0
> da6 at ciss1 bus 0 scbus2 target 2 lun 0
> da7 at ciss1 bus 0 scbus2 target 3 lun 0
> da8 at ciss1 bus 0 scbus2 target 4 lun 0
> da9 at ciss1 bus 0 scbus2 target 5 lun 0
> da10 at ciss1 bus 0 scbus2 target 6 lun 0
> da11 at ciss1 bus 0 scbus2 target 7 lun 0
> da12 at ciss1 bus 0 scbus2 target 8 lun 0
> da13 at ciss1 bus 0 scbus2 target 9 lun 0
> da14 at ciss1 bus 0 scbus2 target 10 lun 0
>
> I also find it interesting that the machine's IML (Integrated Management
> Log) contains this message after every crash:
> POST Error: 1719 - A controller failure event occurred prior to this power-up
>
> This suggests that the controller does indeed lock up, but why does it
> do this under FreeBSD and not under Linux?
> I've already tried
> hw.ciss.nop_message_heartbeat=1;ciss_force_transport=1;ciss_force_interrupt=1
> without any effect (it freezes after the same time).
>
> Last time during the POST the controller said:
> Slot 2 HP Smart Array P812 Controller (1024MB, v6.40) 11 Logical Drives
> 1719-Slot 2 Drive Array - A controller failure event occurred prior to this power-up. (Previous lock up code = 0x13)
>
> Any ideas on what could cause this?
>
> Thanks,
Can you open a p/r on this? I'd like to keep tracking ciss(4) issues.
It seems like there is something odd with our driver when using multiple
controllers.
sean
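
For anyone comparing the reported 0.44 days against the panic output, the "Uptime:" line in the quoted trace can be converted to fractional days. The following is only a minimal Python sketch, assuming the uptime string has the NdNhNmNs shape seen above; the helper name uptime_to_days is made up for illustration and is not part of any FreeBSD tool.

```python
import re

# Matches the uptime string printed at panic/shutdown, e.g. "Uptime: 10h18m12s".
# The day, hour and minute fields are treated as optional (an assumption).
_UPTIME_RE = re.compile(r"(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?(\d+)s")


def uptime_to_days(line):
    """Return the uptime found in `line` as a fractional number of days."""
    match = _UPTIME_RE.search(line)
    if match is None:
        raise ValueError("no uptime string found in: %r" % line)
    # Missing fields come back as None and count as zero.
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return total_seconds / 86400.0


if __name__ == "__main__":
    # Value taken from the panic output quoted in this message.
    print("%.3f days" % uptime_to_days("Uptime: 10h18m12s"))
```

For the quoted panic, 10h18m12s is 37092 seconds, or about 0.43 days, which is roughly a quarter of an hour short of 0.44 days and so consistent with the "some minutes plus or minus" in the report.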