Only 0.44 (always) days of uptime with ciss (w/HP SA P812)

Mon Mar 24 12:32:59 UTC 2014

On Mon, 2014-03-24 at 10:48 +0100, Nagy, Attila wrote:
> Hi,
> 
> I have an HP DL360G7 with an HP SmartArray P812 in it, which crashes 
> at almost exactly 0.44 days of uptime (give or take a few minutes; on 
> the graph it looks nearly the same every time), whether I load the 
> machine until it is too hot to touch or just leave it idle.
> The P812 has an HP MDS600 attached with 70 1 TB disks, configured as 
> 6-disk RAID6 (ADG) volumes. The volumes use a 128k stripe size because I 
> run ZFS on top of them.
> The zpool is simply a stripe across the RAID6 volumes.
> What may be important: the controller's RAID6 initialization is still 
> ongoing.
> 
> Above, "idle" means the zpool/zfs is just mounted with only some 
> stat()s happening on it (crashes after 0.44 days), and "loaded" means 
> gstat shows around 100% utilization on the disks nearly all the time 
> (also crashes after 0.44 days).
> 
> I've already tried stable/9 at r260621 and stable/10 at r262152; the 
> result is the same.
> I've also tried Linux (Ubuntu 13.10, hpsa driver, ZFS on Linux 
> 0.6.2), and it doesn't crash (neither idle nor loaded).
> I have already swapped both the machine and the P812 for different 
> units, with no effect. 
> Everything (DL360, P812, MDS600, disks) has the latest firmware.
> 
> The zpool currently in use was created under Linux to see whether that 
> makes a difference, but of course many things differ between the two 
> OSes (kernel, HP SA driver, block/SCSI layer, and even ZFS is somewhat 
> different).
> Linux works; FreeBSD crashes no matter what I do.
> 
> The exact message I can see is (ciss0 is the built-in P411):
> ciss1: ADAPTER HEARTBEAT FAILED
> 
> 
> Fatal trap 1: privileged instruction fault while in kernel mode
> cpuid = 0; apic id = 00
> instruction pointer     = 0x20:0xfffffe0c59ff795d
> stack pointer           = 0x28:0xfffffe0baf1ab9d0
> frame pointer           = 0x28:0xfffffe0baf1aba20
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                          = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 12 (swi4: clock)
> trap number             = 1
> panic: privileged instruction fault
> cpuid = 0
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 
> 0xfffffe0baf1ab560
> kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0baf1ab610
> panic() at panic+0x155/frame 0xfffffe0baf1ab690
> trap_fatal() at trap_fatal+0x3a2/frame 0xfffffe0baf1ab6f0
> trap() at trap+0x794/frame 0xfffffe0baf1ab910
> calltrap() at calltrap+0x8/frame 0xfffffe0baf1ab910
> --- trap 0x1, rip = 0xfffffe0c59ff795d, rsp = 0xfffffe0baf1ab9d0, rbp = 
> 0xfffffe0baf1aba20 ---
> (null)() at 0xfffffe0c59ff795d/frame 0xfffffe0baf1aba20
> softclock_call_cc() at softclock_call_cc+0x16c/frame 0xfffffe0000e77120
> kernphys() at 0xffffffff/frame 0xfffffe0000e778a0
> kernphys() at 0xffffffff/frame 0xfffffe0000e78aa0
> kernphys() at 0xffffffff/frame 0xfffffe0000e78c20
> Uptime: 10h18m12s
> (da4:ciss1:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 
> 00 00
> (da4:ciss1:0:0:0): CAM status: Command timeout
> (da4:ciss1:0:0:0): Error 5, Retries exhausted
> (da4:ciss1:0:0:0): Synchronize cache failed
> (da5:ciss1:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 
> 00 00
> (da5:ciss1:0:1:0): CAM status: Command timeout
> (da5:ciss1:0:1:0): Error 5, Retries exhausted
> (da5:ciss1:0:1:0): Synchronize cache failed
> (da6:ciss1:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 
> 00 00
> (da6:ciss1:0:2:0): CAM status: Command timeout
> (da6:ciss1:0:2:0): Error 5, Retries exhausted
> (da6:ciss1:0:2:0): Synchronize cache failed
> (da7:ciss1:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 
> 00 00
> (da7:ciss1:0:3:0): CAM status: Command timeout
> (da7:ciss1:0:3:0): Error 5, Retries exhausted
> (da7:ciss1:0:3:0): Synchronize cache failed
> (da8:ciss1:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 
> 00 00
> (da8:ciss1:0:4:0): CAM status: Command timeout
> (da8:ciss1:0:4:0): Error 5, Retries exhausted
> (da8:ciss1:0:4:0): Synchronize cache failed
> (da9:ciss1:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 
> 00 00
> (da9:ciss1:0:5:0): CAM status: Command timeout
> (da9:ciss1:0:5:0): Error 5, Retries exhausted
> (da9:ciss1:0:5:0): Synchronize cache failed
> (da10:ciss1:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 
> 00 00
> (da10:ciss1:0:6:0): CAM status: Command timeout
> (da10:ciss1:0:6:0): Error 5, Retries exhausted
> (da10:ciss1:0:6:0): Synchronize cache failed
> (da11:ciss1:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 
> 00 00
> (da11:ciss1:0:7:0): CAM status: Command timeout
> (da11:ciss1:0:7:0): Error 5, Retries exhausted
> (da11:ciss1:0:7:0): Synchronize cache failed
> (da12:ciss1:0:8:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 
> 00 00
> (da12:ciss1:0:8:0): CAM status: Command timeout
> (da12:ciss1:0:8:0): Error 5, Retries exhausted
> (da12:ciss1:0:8:0): Synchronize cache failed
> (da13:ciss1:0:9:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 
> 00 00
> (da13:ciss1:0:9:0): CAM status: Command timeout
> (da13:ciss1:0:9:0): Error 5, Retries exhausted
> (da13:ciss1:0:9:0): Synchronize cache failed
> (da14:ciss1:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 
> 00 00
> (da14:ciss1:0:10:0): CAM status: Command timeout
> (da14:ciss1:0:10:0): Error 5, Retries exhausted
> (da14:ciss1:0:10:0): Synchronize cache failed
> Automatic reboot in 15 seconds - press a key on the console to abort
> Rebooting...
> 
> Dmesg says:
> ciss1: <HP Smart Array P812> port 0x5000-0x50ff mem 
> 0xfbe00000-0xfbffffff,0xfbdf0000-0xfbdf0fff irq 24 at device 0.0 on pci9
> ciss1: PERFORMANT Transport
> da5 at ciss1 bus 0 scbus2 target 1 lun 0
> da4 at ciss1 bus 0 scbus2 target 0 lun 0
> da6 at ciss1 bus 0 scbus2 target 2 lun 0
> da7 at ciss1 bus 0 scbus2 target 3 lun 0
> da8 at ciss1 bus 0 scbus2 target 4 lun 0
> da9 at ciss1 bus 0 scbus2 target 5 lun 0
> da10 at ciss1 bus 0 scbus2 target 6 lun 0
> da11 at ciss1 bus 0 scbus2 target 7 lun 0
> da12 at ciss1 bus 0 scbus2 target 8 lun 0
> da13 at ciss1 bus 0 scbus2 target 9 lun 0
> da14 at ciss1 bus 0 scbus2 target 10 lun 0
> 
> I also find it interesting that the machine's IML (Integrated Management 
> Log) contains this message after every crash:
> POST Error: 1719 - A controller failure event occurred prior to this 
> power-up
> 
> This suggests that the controller really does lock up, but why does it 
> do so under FreeBSD and not under Linux?
> I've already tried
> hw.ciss.nop_message_heartbeat=1;ciss_force_transport=1;ciss_force_interrupt=1
> without any effect (it freezes after the same time).
> 
> During the most recent POST the controller reported:
> Slot 2  HP Smart Array P812 Controller       (1024MB, v6.40)  11 Logical 
> Drives
> 1719-Slot 2 Drive Array - A controller failure event occurred prior to this
>       power-up.  (Previous lock up code = 0x13)
> 
> Any ideas on what could cause this?
> 
> Thanks,

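As a quick cross-check of the reported figure against the `Uptime:` line in the quoted panic output, here is a minimal Python sketch. The helper name `uptime_to_seconds` is purely illustrative, and the parser only assumes the `NdNhNmNs` style seen in the log above.

```python
import re

SECONDS_PER_DAY = 86400


def uptime_to_seconds(text):
    """Convert an uptime string such as '10h18m12s' or '1d2h3m4s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if not text or match is None:
        raise ValueError(f"unrecognized uptime string: {text!r}")
    # Missing fields (e.g. no day component) count as zero.
    days, hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


observed = uptime_to_seconds("10h18m12s")   # value from the quoted panic
claimed = 0.44 * SECONDS_PER_DAY            # figure from the report

print(f"observed: {observed} s = {observed / SECONDS_PER_DAY:.3f} days")
print(f"0.44 days = {claimed:.0f} s; difference = {(claimed - observed) / 60:.1f} min")
```

For this particular crash the uptime works out to about 0.43 days, roughly a quarter of an hour short of 0.44 days, which fits the "some minutes plus or minus" in the report.
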
Can you open a p/r on this?  I'd like to keep tracking ciss(4) issues.
It seems like there is something odd with our driver when using multiple
controllers.

sean
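
When putting a report together, it may help to summarize which devices failed their final cache flush rather than pasting every line. The following is only a sketch: the function name `failed_flushes` is made up for illustration, and the sample is a two-device excerpt of the console output quoted above with the mail-wrapped CDB lines rejoined.

```python
import re

# Two-device excerpt of the quoted console output (CDB lines rejoined).
LOG = """\
(da4:ciss1:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da4:ciss1:0:0:0): CAM status: Command timeout
(da4:ciss1:0:0:0): Error 5, Retries exhausted
(da4:ciss1:0:0:0): Synchronize cache failed
(da5:ciss1:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da5:ciss1:0:1:0): CAM status: Command timeout
(da5:ciss1:0:1:0): Error 5, Retries exhausted
(da5:ciss1:0:1:0): Synchronize cache failed
"""

# Matches "(daN:cissM:bus:target:lun): Synchronize cache failed".
PATTERN = re.compile(
    r"^\((da\d+):(ciss\d+):\d+:\d+:\d+\): Synchronize cache failed$",
    re.MULTILINE,
)


def failed_flushes(log):
    """Group devices whose cache flush failed by the controller they sit on."""
    by_controller = {}
    for device, controller in PATTERN.findall(log):
        by_controller.setdefault(controller, []).append(device)
    return by_controller


for controller, devices in failed_flushes(LOG).items():
    print(f"{controller}: {len(devices)} failed flushes ({', '.join(devices)})")
```

Grouping by controller makes it easy to see at a glance whether only the P812 (ciss1 here) is affected or whether the other controller shows the same timeouts.
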