mps0-troubles
Alexander Motin
mav at FreeBSD.org
Wed Jan 12 13:19:35 UTC 2011
Joachim Tingvold wrote:
> I'm not sure if this is the proper place to ask for help regarding this,
> but here it goes;
>
> I've got 17 disks connected to a HP SAS expander, which again is
> connected to a LSI SAS 9211-8i HBA. I also have 1 system-disk that's
> connected directly to the SATA-controller on the motherboard. This is
> running on FreeBSD 9.0-CURRENT-201012.
>
> I'm running ZFS on root (referred to as "zroot"), and also on the 17
> disks connected to the LSI-controller (6x2TB raid-z2 + 10x1TB raid-z + 1
> hot-spare, referred to as "storage").
>
> This setup has been running fine since around christmas, but today, when
> I was moving some files from the zroot to storage, it failed. First, the
> moving went just fine (I was looking at gstat while it was copying), but
> then no activity (even though I knew it wasn't done -- there was a lot
> of large files). Trying to list any files on the storage-volume didn't
> work (CTRL+C didn't work either, I had to quit the terminal). The
> mv-process was still running, even though there was no disk-activity;
>
> [jocke at filserver ~]$ ps aux | grep mv
> root 33698 0,0 0,1 10048 2132 0- D+ 11:35am
> 0:01,66 mv -PRp -- JAG /storage/series/JAG (cp)
>
> I've extracted the relevant lines from dmesg since the machine booted on
> sunday; <http://home.komsys.org/~jocke/dmesg_mps0_freebsd-scsi.txt>.
>
> After a while (couple of minutes), I could list files on the
> storage-volume, and ZFS reported no problems. Then, after a few new
> minutes, I could not list anything on the storage-volume, and it's been
> like that since (ZFS and dmesg reports no further errors, though).
>
> I mentioned that the mv-process is still running; it won't die, but I
> guess that's because it has the D-flag (disk wait).
>
> [root at filserver ~]# kill -9 33698
> [root at filserver ~]# ps aux | grep mv
> root 33698 0,0 0,1 10048 2132 0- D+ 11:35am 0:01,66 mv -PRp -- JAG
> /storage/series/JAG (cp)
>
> This isn't really my field of expertise, so I'm hoping that someone here
> on the list might enlighten me. (-:
The dmesg you've shown contains many command timeouts on multiple
devices. Since the default ATA timeout is about 30 seconds, each one
may cause a significant delay before the recovery sequence handles it.
That could result in the delays you observed.
What's more suspicious is that the timeouts happened at the same time
on the AHCI-attached disk and on several disks on the mps controller. I
can hardly assume that two completely different controllers and drivers
triggered unrelated problems simultaneously. I would suggest checking
your power supplies, cables, backplanes and other mechanical things.
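To see at a glance whether timeouts really span more than one driver,
something like the following rough sketch could help. It assumes each
kernel message starts with a device name such as "ahcich0:" or
"(da3:mps0:...):"; that pattern, the function name count_timeouts and
the sample lines are illustrative assumptions, not taken from your
actual dmesg.

```python
import re
from collections import Counter

# Matches a leading device name such as "ahcich0:" or "(da3:mps0:0:12:0):".
# This pattern is an assumption about the log format, not a guarantee.
DEVICE_RE = re.compile(r"^\(?([a-z]+)\d+:")


def count_timeouts(lines):
    """Count lines mentioning a timeout, keyed by driver name prefix."""
    counts = Counter()
    for line in lines:
        if "timeout" not in line.lower():
            continue  # only timeout messages are of interest here
        match = DEVICE_RE.match(line.strip())
        key = match.group(1) if match else "unknown"
        counts[key] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample = [
    "ahcich0: Timeout on slot 5",
    "mps0: command timeout on target 12",
    "mps0: command timeout on target 14",
    "da3: unrelated message",
]
print(dict(count_timeouts(sample)))
```

If the counts show more than one driver, a shared cause such as power
or cabling becomes the more likely explanation.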
--
Alexander Motin