Only 0.44 (always) days of uptime with ciss (w/HP SA P812)
Nagy, Attila
bra at fsn.hu
Mon Mar 24 09:54:09 UTC 2014
Hi,
I have an HP DL360 G7 with an HP Smart Array P812 in it, which crashes
at almost exactly 0.44 days of uptime (give or take a few minutes; on the
graph it's practically the same point every time), no matter what I do:
load the machine until it's too hot to touch, or just leave it idle.
The P812 has an HP MDS600 connected to it with 70 1TB disks, configured
into 6-disk RAID6 (ADG) volumes. The volumes have a 128k stripe size,
because I run ZFS on top of them.
The zpool is simply a stripe of those RAID6 volumes.
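For reference, the pool layout under FreeBSD is roughly the following (the
pool name is just an example; the pool itself was last created under Linux,
where the devices show up under different names, and da4 through da14 are
what the P812 volumes appear as here, see the dmesg below):

# zpool create tank da4 da5 da6 da7 da8 da9 da10 da11 da12 da13 da14
# zpool status tank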
What may be important: the controller's RAID6 initialization is still
ongoing.
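For what it's worth, the initialization status can be checked with HP's
hpacucli, roughly like this (the slot number comes from the POST message
further down, and the exact syntax may differ between hpacucli versions):

# hpacucli ctrl slot=2 ld all show status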
By 'idle' above I mean that the zpool/ZFS is just mounted and only some
stat()s happen on it (crashes after 0.44 days); 'fully loaded' means
gstat shows around 100% utilization on the disks nearly all the time
(also crashes after 0.44 days).
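The utilization figure comes from gstat, limited to the disk devices with
something like:

# gstat -f da

(the -f argument is a regular expression that filters the provider names;
the P812 volumes are da4 through da14 here).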
I've already tried stable/9 at r260621 and stable/10 at r262152; it's
the same.
I've also tried Linux (Ubuntu 13.10, hpsa driver, ZFS on Linux 0.6.2);
it doesn't crash, neither idle nor loaded.
I've already swapped both the machine and the P812 for different ones,
with no effect.
Everything (DL360, P812, MDS600, disks) has the latest firmware.
The currently used zpool was created under Linux to see whether that
causes the problems, but of course many things differ between the two
OSes (kernel, HP SA driver, block/SCSI layer, and even ZFS itself is
somewhat different).
Linux works; FreeBSD crashes no matter what I do.
The exact message I can see is (ciss0 is the built-in P411):
ciss1: ADAPTER HEARTBEAT FAILED
Fatal trap 1: privileged instruction fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xfffffe0c59ff795d
stack pointer = 0x28:0xfffffe0baf1ab9d0
frame pointer = 0x28:0xfffffe0baf1aba20
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi4: clock)
trap number = 1
panic: privileged instruction fault
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0baf1ab560
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0baf1ab610
panic() at panic+0x155/frame 0xfffffe0baf1ab690
trap_fatal() at trap_fatal+0x3a2/frame 0xfffffe0baf1ab6f0
trap() at trap+0x794/frame 0xfffffe0baf1ab910
calltrap() at calltrap+0x8/frame 0xfffffe0baf1ab910
--- trap 0x1, rip = 0xfffffe0c59ff795d, rsp = 0xfffffe0baf1ab9d0, rbp = 0xfffffe0baf1aba20 ---
(null)() at 0xfffffe0c59ff795d/frame 0xfffffe0baf1aba20
softclock_call_cc() at softclock_call_cc+0x16c/frame 0xfffffe0000e77120
kernphys() at 0xffffffff/frame 0xfffffe0000e778a0
kernphys() at 0xffffffff/frame 0xfffffe0000e78aa0
kernphys() at 0xffffffff/frame 0xfffffe0000e78c20
Uptime: 10h18m12s
(da4:ciss1:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da4:ciss1:0:0:0): CAM status: Command timeout
(da4:ciss1:0:0:0): Error 5, Retries exhausted
(da4:ciss1:0:0:0): Synchronize cache failed
(da5:ciss1:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da5:ciss1:0:1:0): CAM status: Command timeout
(da5:ciss1:0:1:0): Error 5, Retries exhausted
(da5:ciss1:0:1:0): Synchronize cache failed
(da6:ciss1:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da6:ciss1:0:2:0): CAM status: Command timeout
(da6:ciss1:0:2:0): Error 5, Retries exhausted
(da6:ciss1:0:2:0): Synchronize cache failed
(da7:ciss1:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da7:ciss1:0:3:0): CAM status: Command timeout
(da7:ciss1:0:3:0): Error 5, Retries exhausted
(da7:ciss1:0:3:0): Synchronize cache failed
(da8:ciss1:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da8:ciss1:0:4:0): CAM status: Command timeout
(da8:ciss1:0:4:0): Error 5, Retries exhausted
(da8:ciss1:0:4:0): Synchronize cache failed
(da9:ciss1:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da9:ciss1:0:5:0): CAM status: Command timeout
(da9:ciss1:0:5:0): Error 5, Retries exhausted
(da9:ciss1:0:5:0): Synchronize cache failed
(da10:ciss1:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da10:ciss1:0:6:0): CAM status: Command timeout
(da10:ciss1:0:6:0): Error 5, Retries exhausted
(da10:ciss1:0:6:0): Synchronize cache failed
(da11:ciss1:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da11:ciss1:0:7:0): CAM status: Command timeout
(da11:ciss1:0:7:0): Error 5, Retries exhausted
(da11:ciss1:0:7:0): Synchronize cache failed
(da12:ciss1:0:8:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da12:ciss1:0:8:0): CAM status: Command timeout
(da12:ciss1:0:8:0): Error 5, Retries exhausted
(da12:ciss1:0:8:0): Synchronize cache failed
(da13:ciss1:0:9:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da13:ciss1:0:9:0): CAM status: Command timeout
(da13:ciss1:0:9:0): Error 5, Retries exhausted
(da13:ciss1:0:9:0): Synchronize cache failed
(da14:ciss1:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da14:ciss1:0:10:0): CAM status: Command timeout
(da14:ciss1:0:10:0): Error 5, Retries exhausted
(da14:ciss1:0:10:0): Synchronize cache failed
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...
Dmesg says:
ciss1: <HP Smart Array P812> port 0x5000-0x50ff mem 0xfbe00000-0xfbffffff,0xfbdf0000-0xfbdf0fff irq 24 at device 0.0 on pci9
ciss1: PERFORMANT Transport
da5 at ciss1 bus 0 scbus2 target 1 lun 0
da4 at ciss1 bus 0 scbus2 target 0 lun 0
da6 at ciss1 bus 0 scbus2 target 2 lun 0
da7 at ciss1 bus 0 scbus2 target 3 lun 0
da8 at ciss1 bus 0 scbus2 target 4 lun 0
da9 at ciss1 bus 0 scbus2 target 5 lun 0
da10 at ciss1 bus 0 scbus2 target 6 lun 0
da11 at ciss1 bus 0 scbus2 target 7 lun 0
da12 at ciss1 bus 0 scbus2 target 8 lun 0
da13 at ciss1 bus 0 scbus2 target 9 lun 0
da14 at ciss1 bus 0 scbus2 target 10 lun 0
I also find it interesting that the machine's IML (Integrated Management
Log) contains this message after every crash:
POST Error: 1719 - A controller failure event occurred prior to this power-up
This might show that the controller indeed locks up, but why does it do
so under FreeBSD and not under Linux?
I've already tried
hw.ciss.nop_message_heartbeat=1, hw.ciss.force_transport=1 and hw.ciss.force_interrupt=1
without any effect (it freezes after the same amount of time).
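These are loader tunables, so they went into /boot/loader.conf:

hw.ciss.nop_message_heartbeat=1
hw.ciss.force_transport=1
hw.ciss.force_interrupt=1

(force_transport=1 should select the simple transport and force_interrupt=1
legacy INTx interrupts, if I'm reading ciss(4) correctly.)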
Last time during the POST the controller said:
Slot 2 HP Smart Array P812 Controller (1024MB, v6.40) 11 Logical Drives
1719-Slot 2 Drive Array - A controller failure event occurred prior to this power-up. (Previous lock up code = 0x13)
Any ideas on what could cause this?
Thanks,