[Bug 206448] ZFS hang/stall when drives in ATA mode
bugzilla-noreply at freebsd.org
Wed Jan 20 21:37:02 UTC 2016
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=206448
Bug ID: 206448
Summary: ZFS hang/stall when drives in ATA mode
Product: Base System
Version: 10.2-RELEASE
Hardware: amd64
OS: Any
Status: New
Severity: Affects Only Me
Priority: ---
Component: kern
Assignee: freebsd-bugs at FreeBSD.org
Reporter: danmcgrath.ca at gmail.com
CC: freebsd-amd64 at FreeBSD.org
Created attachment 165888
--> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=165888&action=edit
Screenshot of ata console error
I had a Dell PowerEdge R210 amd64 system that was exhibiting some odd
behaviour. A year or two ago one of the system's two 1TB SATA drives dropped
out of the raid, but surprisingly I simply added it back and it has been fine
ever since. Then this week I installed py27-salt on the servers.
After installing salt everything seemed fine for the first day. After the daily
mails for the machine came in, however, I noticed that the daily periodic run
had got stuck running some smartd checks for the log. I tried to kill the
process but was not able to, which prompted a reboot. After the reboot there
were jails that refused to start, and all of a sudden I found myself unable to
do any writes to the drive, with only the message "ata2: already connected!"
showing up on the console.
After some digging (thanks to auditd and salt and system logs), I was able to
narrow the trigger down to some camcontrol inquiry and identify commands that
would reliably trigger the problem.
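A reproduction along these lines might look like the sketch below. The device names ada0 and ada1 are taken from the GEOM_MIRROR messages in this report; the exact arguments that were passed to camcontrol are not recorded here, so the invocations shown are an assumption, and probe_cmds is a hypothetical helper name. The script only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: the exact camcontrol arguments are an assumption.
# probe_cmds is a hypothetical helper, not part of FreeBSD.
probe_cmds() {
    for dev in "$@"; do
        echo "camcontrol inquiry $dev"
        echo "camcontrol identify $dev"
    done
}

# Dry run by default; set RUN=1 on the affected host to execute.
probe_cmds ada0 ada1 | while read -r cmd; do
    if [ "${RUN:-0}" = "1" ]; then
        $cmd
    else
        echo "would run: $cmd"
    fi
done
```

Keeping the default as a dry run seems prudent, since on the affected machine these commands stalled all writes.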
After some more digging I noticed that only this server (out of several
identical or near-identical ones) was showing the problem, and that for some
strange reason there were /dev/gpt/swap0 (and swap1) files only on this system.
Also odd was that when I tried some tests with stopping swap (`gmirror stop
swap`), the second I stopped the swap mirror it was redetected under different
device names (see the screenshot of the console in the attachments). I also
noticed that the dmesg of this system alone was showing some odd "unmapped"
messages:
GEOM_MIRROR: cancelling unmapped because of ada0p2
GEOM_MIRROR: cancelling unmapped because of ada1p2
GEOM_MIRROR: Device mirror/swap launched (2/2).
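To compare hosts for these messages, a small filter such as the following could be used. It is a minimal sketch: unmapped_lines is a hypothetical helper name, and the sample input is simply the three lines quoted above.

```shell
#!/bin/sh
# unmapped_lines is a hypothetical helper: it reads kernel log text on
# stdin and keeps only the GEOM_MIRROR "cancelling unmapped" lines.
unmapped_lines() {
    grep 'GEOM_MIRROR: cancelling unmapped'
}

# On a live host one might run: dmesg | unmapped_lines
# Here the input is the sample from this report.
unmapped_lines <<'EOF'
GEOM_MIRROR: cancelling unmapped because of ada0p2
GEOM_MIRROR: cancelling unmapped because of ada1p2
GEOM_MIRROR: Device mirror/swap launched (2/2).
EOF
```

With the sample input, only the two "cancelling unmapped" lines are kept.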
As for the ZFS symptoms, when the console showed the "already attached!"
error, ZFS (this was a ZFS install with the mirrored swap option enabled) would
no longer allow writes (or would allow them only very slowly, in the area of
1 IOPS), and reads would eventually fail (when doing a test with `find /`),
which I assume happens when they run out of cache entries.
In the end I stumbled on the BIOS setting having the drives set to ATA mode
instead of AHCI or RAID, and correcting this setting seems to have solved the
problem. While I can't know for sure if this is a "bug" or just a known
limitation of ATA, it would almost seem like camcontrol was somehow briefly
disconnecting the drives when being issued commands, and in turn was causing
the swap device to switch from ada0p2 to gpt/swap0 and vice versa, possibly
causing some sort of bug in ZFS.
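One quick way to tell which mode a machine is in is the channel name in front of console messages. The sketch below assumes the usual FreeBSD naming, where legacy ATA channels appear as ataN (as in the "ata2:" message above) and AHCI channels as ahcichN; channel_mode is a hypothetical helper name.

```shell
#!/bin/sh
# channel_mode is a hypothetical helper. It classifies a channel name
# such as "ata2" or "ahcich0", assuming the usual FreeBSD driver naming.
channel_mode() {
    case "$1" in
        ahcich[0-9]*) echo "ahci" ;;
        ata[0-9]*)    echo "legacy-ata" ;;
        *)            echo "unknown" ;;
    esac
}

channel_mode ata2      # prints legacy-ata
channel_mode ahcich0   # prints ahci
```

On a live system the channel names can be read from `camcontrol devlist -v`, which lists the bus each device sits on.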
Anyway, this is the report, and hopefully it helps fix a possible bug lurking
around the system that could cause problems for other users.
Cheers o/
--
You are receiving this mail because:
You are on the CC list for the bug.