Outage related to hard drive failure
krichy at cflinux.hu
krichy at cflinux.hu
Sat Feb 8 21:28:24 UTC 2014
Dear Brian,
Unfortunately I just can report about same issue I ran into a few weeks
ago. We run a FreeNAS server which hosts the VM images, and serves them
over NFS. One time I began receiving notifications that the virtual
hosts served by this NAS went down. I checked the NAS, found that one
drive attached to a mps/lsi HBA stopped responding to the HBA at all,
and thus blocked the whole pool. That was also strange to me that
neither a timeout event happened, so actually zfs thought that all
drives are fine, just one blocked the whole pool IOs. And unfortunately
I even could not offline that drive, only a hard reset helped. The drive
was so unresponsive that the bootup for FreeNAS also took long, but at
least that time zfs somehow noticed the drive is missing, and removed it
from the pool. And after that the pool worked fine, of course in a
degraded state, but healthy, we could initiate a replacement.
I only have the logs for the bootup:
mps0: mpssas_scsiio_timeout checking sc 0xffffff8000fd1000 cm
0xffffff800100c0a0
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID
500 command timeout cm 0xffffff800100c0a0 ccb 0xfffffe0010f03000
mps0: mpssas_alloc_tm freezing simq
mps0: timedout cm 0xffffff800100c0a0 allocated tm 0xffffff8000fe47b0
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID
500 completed timedout cm 0xffffff800100c0a0 ccb 0xfffffe0010f03000
during recovery ioc 8048 scsi 0 state c xfer 0
(noperiph:mps0:0:14:0): SMID 6 abort TaskMID 500 status 0x0 code 0x0
count 1
(noperiph:mps0:0:14:0): SMID 6 finished recovery after aborting TaskMID
500
mps0: mpssas_free_tm releasing simq
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe14:mps0:0:14:0): CAM status: Command timeout
(probe14:mps0:0:14:0): Retrying command
mps0: mpssas_scsiio_timeout checking sc 0xffffff8000fd1000 cm
0xffffff8001032f50
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID
986 command timeout cm 0xffffff8001032f50 ccb 0xfffffe0010f03000
mps0: mpssas_alloc_tm freezing simq
mps0: timedout cm 0xffffff8001032f50 allocated tm 0xffffff8000fe48f8
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID
986 completed timedout cm 0xffffff8001032f50 ccb 0xfffffe0010f03000
during recovery ioc 8048 scsi 0 state c xfer 0
(noperiph:mps0:0:14:0): SMID 7 abort TaskMID 986 status 0x0 code 0x0
count 1
(noperiph:mps0:0:14:0): SMID 7 finished recovery after aborting TaskMID
986
mps0: mpssas_free_tm releasing simq
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe14:mps0:0:14:0): CAM status: Command timeout
(probe14:mps0:0:14:0): Retrying command
mps0: mpssas_scsiio_timeout checking sc 0xffffff8000fd1000 cm
0xffffff8001010c38
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID
559 command timeout cm 0xffffff8001010c38 ccb 0xfffffe0010f03000
mps0: mpssas_alloc_tm freezing simq
mps0: timedout cm 0xffffff8001010c38 allocated tm 0xffffff8000fe4a40
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID
559 completed timedout cm 0xffffff8001010c38 ccb 0xfffffe0010f03000
during recovery ioc 8048 scsi 0 state c xfer 0
(noperiph:mps0:0:14:0): SMID 8 abort TaskMID 559 status 0x0 code 0x0
count 1
(noperiph:mps0:0:14:0): SMID 8 finished recovery after aborting TaskMID
559
mps0: mpssas_free_tm releasing simq
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe14:mps0:0:14:0): CAM status: Command timeout
(probe14:mps0:0:14:0): Retrying command
mps0: mpssas_scsiio_timeout checking sc 0xffffff8000fd1000 cm
0xffffff8001007278
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID
439 command timeout cm 0xffffff8001007278 ccb 0xfffffe0010f03000
mps0: mpssas_alloc_tm freezing simq
mps0: timedout cm 0xffffff8001007278 allocated tm 0xffffff8000fe4b88
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID
439 completed timedout cm 0xffffff8001007278 ccb 0xfffffe0010f03000
during recovery ioc 8048 scsi 0 state c xfer 0
(noperiph:mps0:0:14:0): SMID 9 abort TaskMID 439 status 0x0 code 0x0
count 1
(noperiph:mps0:0:14:0): SMID 9 finished recovery after aborting TaskMID
439
mps0: mpssas_free_tm releasing simq
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe14:mps0:0:14:0): CAM status: Command timeout
(probe14:mps0:0:14:0): Retrying command
A side note is that the drive was removed from its hot-swap bay, then
reinserted, and since then it is working fine. That is a seagate
ST32000645SS. And no errors reported.
Regards,
2014-02-08 21:02 időpontban Brian Gardner ezt írta:
> Last year upgraded my production servers from ufs on adaptec raid
> controllers to zfs raidz on lsi controllers. Last week I had an
> outage, the culprit being a failed hard in my raidz. My log was
> littered with kernel messages such as the ones below. About 20
> minutes after the first message some aspects of my host where hung,
> and I was notified that my site was down. I noticed these messages in
> my log but running a zfs status however showed only a few checksum
> errors and the failed drive didn't get degraded. Manually degrading
> the drive solved my problems. This seems very odd to me. It's almost
> as if zfs wasn't getting messages from the lsi driver regarding these
> read/write failures. Do I need to tune something in the mps/lsi or
> zfs drivers to help it deal with failures? I'm running FreeBSD
> 8.3-Release-p3 with the generic kernel, nothing unusual about my setup
> other than I use jails extensively. There are four drives in the
> raidz configuration in question:
>
> zpool status
> pool: storage
> state: ONLINE
> scan: resilvered 67.6G in 2h21m with 0 errors on Tue Aug 6 00:44:14
> 2013
> config:
>
> NAME STATE READ WRITE CKSUM
> storage ONLINE 0 0 0
> raidz1-0 ONLINE 0 0 0
> da0 ONLINE 0 0 0
> da1 ONLINE 0 0 0
> da2 ONLINE 0 0 0
> da3 ONLINE 0 0 0
>
> Jan 29 19:03:18 host2 kernel: (da0:mpslsi0:0:0:0): WRITE(10). CDB: 2a
> 0 9 72 d8 13 0 0 1d 0
> Jan 29 19:03:18 host2 kernel: (da0:mpslsi0:0:0:0): CAM status: SCSI
> Status Error
> Jan 29 19:03:18 host2 kernel: (da0:mpslsi0:0:0:0): SCSI status: Check
> Condition
> Jan 29 19:03:18 host2 kernel: (da0:mpslsi0:0:0:0): SCSI sense:
> Deferred error: MEDIUM ERROR info:97ef13b asc:15,1 (Mechanical
> positioning error) actual retry count: 15
> Jan 29 19:04:00 host2 kernel: (da0:mpslsi0:0:0:0): READ(10). CDB: 28 0
> 9 75 b9 96 0 0 b 0
> Jan 29 19:04:00 host2 kernel: (da0:mpslsi0:0:0:0): CAM status: SCSI
> Status Error
> Jan 29 19:04:00 host2 kernel: (da0:mpslsi0:0:0:0): SCSI status: Check
> Condition
> Jan 29 19:04:00 host2 kernel: (da0:mpslsi0:0:0:0): SCSI sense: MEDIUM
> ERROR info:975b99b asc:15,1 (Mechanical positioning error) actual
> retry count: 15
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_scsiio_timeout checking
> sc 0xffffff80005ee000 cm 0xffffff800060d408
> Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): READ(10). CDB: 28 0
> 2 e3 6a 1a 0 0 1 0 length 512 SMID 153 command timeout cm
> 0xffffff800060d408 ccb 0xffffff000a1ce800
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_alloc_tm freezing simq
> Jan 29 19:04:54 host2 kernel: mpslsi0: timedout cm 0xffffff800060d408
> allocated tm 0xffffff8000604718
> Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): READ(10). CDB: 28 0
> 2 e3 6a 1a 0 0 1 0 length 512 SMID 153 completed timedout cm
> 0xffffff800060d408 ccb 0xffffff000a1ce800 during recovery ioc 8048
> scsi 0 state c xfer 0
> Jan 29 19:04:54 host2 kernel: (noperiph:mpslsi0:0:0:0): SMID 43 abort
> TaskMID 153 status 0x0 code 0x0 count 1
> Jan 29 19:04:54 host2 kernel: (noperiph:mpslsi0:0:0:0): SMID 43
> finished recovery after aborting TaskMID 153
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_free_tm releasing simq
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_scsiio_timeout checking
> sc 0xffffff80005ee000 cm 0xffffff800060d928
> Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): READ(10). CDB: 28 0
> 9 41 c7 53 0 0 b 0 length 5632 SMID 157 command timeout cm
> 0xffffff800060d928 ccb 0xffffff0176ccb000
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_alloc_tm freezing simq
> Jan 29 19:04:54 host2 kernel: mpslsi0: timedout cm 0xffffff800060d928
> allocated tm 0xffffff8000604860
> Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): READ(10). CDB: 28 0
> 9 41 c7 53 0 0 b 0 length 5632 SMID 157 completed timedout cm
> 0xffffff800060d928 ccb 0xffffff0176ccb000 during recovery ioc 8048
> scsi 0 state c xfer 0(noperiph:mpslsi0:0:0:0): SMID 44 abort TaskMID
> 157 status 0x0 code 0x0 count 1
> Jan 29 19:04:54 host2 kernel: (noperiph:mpslsi0:0:0:0): SMID 44
> finished recovery after aborting TaskMID 157
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_free_tm releasing simq
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_scsiio_timeout checking
> sc 0xffffff80005ee000 cm 0xffffff80006235a8
> Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): WRITE(10). CDB: 2a
> 0 9 72 e0 d3 0 1 0 0 length 131072 SMID 429 command timeout cm
> 0xffffff80006235a8 ccb 0xffffff000c600000
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_alloc_tm freezing simq
> Jan 29 19:04:54 host2 kernel: mpslsi0: timedout cm 0xffffff80006235a8
> allocated tm 0xffffff80006049a8
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_scsiio_timeout checking
> sc 0xffffff80005ee000 cm 0xffffff800063e2e0
> Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): WRITE(10). CDB: 2a
> 0 9 72 e1 d3 0 0 3 0 length 1536 SMID 764 command timeout cm
> 0xffffff800063e2e0 ccb 0xffffff01dfa19800
> Jan 29 19:04:54 host2 kernel: mpslsi0: queued timedout cm
> 0xffffff800063e2e0 for processing by tm 0xffffff80006049a8
> Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): WRITE(10). CDB: 2a
> 0 9 72 e0 d3 0 1 0 0 length 131072 SMID 429 completed timedout cm
> 0xffffff80006235a8 ccb 0xffffff000c600000 during recovery ioc 8048
> scsi 0 state c xfe(noperiph:mpslsi0:0:0:0): SMID 45 abort TaskMID 429
> status 0x0 code 0x0 count 1
> Jan 29 19:04:54 host2 kernel: (noperiph:mpslsi0:0:0:0): SMID 45
> continuing recovery after aborting TaskMID 429
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_scsiio_timeout checking
> sc 0xffffff80005ee000 cm 0xffffff8000610890
> Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): WRITE(10). CDB: 2a
> 0 9 72 e3 31 0 0 be 0 length 97280 SMID 194 command timeout cm
> 0xffffff8000610890 ccb 0xffffff013d0f3800
> Jan 29 19:04:54 host2 kernel: mpslsi0: queued timedout cm
> 0xffffff8000610890 for processing by tm 0xffffff80006049a8
> Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): WRITE(10). CDB: 2a
> 0 9 72 e1 d3 0 0 3 0 length 1536 SMID 764 completed timedout cm
> 0xffffff800063e2e0 ccb 0xffffff01dfa19800 during recovery ioc 8048
> scsi 0 state c xfer (noperiph:mpslsi0:0:0:0): SMID 45 abort TaskMID
> 764 status 0x0 code 0x0 count 1
> _______________________________________________
> zfs-devel at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/zfs-devel
> To unsubscribe, send any mail to "zfs-devel-unsubscribe at freebsd.org"
More information about the freebsd-scsi
mailing list