AHCI Time-outs while doing scrub or sysctl off line tests
Gijs
gijsje at heteigenwijsje.nl
Fri Apr 20 20:03:10 UTC 2012
Hey all,
I'm running 9-STABLE with a ZFS pool containing 2 sets of 3 disks in
raidz1.
I added the second set of disks (3x 1.5 TB Samsung F2EG) after my
first 3 got filled up (3x 1 TB with only 50G left, it's baaad I know).
After adding the second set of disks I noticed that during scrubs (which
are basically the highest load the system receives besides some
bittorrent traffic and file serving) I start receiving AHCI
timeouts on ports 3-5, the newly added disks.
On top of that, scrub performance is incredibly bad: it dropped to
below 900kb/s. This might however be a result of ZFS fragmentation,
since the first set was filled way above the advised 80% and was
filled by torrent clients.
After a port starts reporting AHCI errors, the connection to the drive
is dropped after some time. When this happens I have to physically
disconnect and reconnect the drive or do a full system halt
(reboot/reset does not help) to get the functionality back.
The motherboard is an Asus M4A89GTD Pro, which has an 890GX northbridge
and an SB850 southbridge.
I've been searching around a lot and did not find anything conclusive
on how to permanently fix this problem.
Some posts seem to suggest that a "cheap" controller like the onboard
one might suffer from the strain put on it by heavy ZFS workloads,
and thus start randomly dropping connections. Other posts suggest that
the problem is in AHCI/NCQ and that disabling those resolves
the problems. At boot time the drives are indeed
configured with NCQ turned on: camcontrol shows both NCQ and tagged
queuing turned on for the 3 Samsung drives which fail, and the minimum
queue depth for tagged queueing is set at 32.
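For reference, the per-drive NCQ and queue state described above can be
read back roughly as follows. This is only a sketch: the device names
ada3, ada4 and ada5 are an assumption (the post only identifies the
disks as ports 3-5), and the RUN and DISKS variables are hypothetical
helpers. By default the script only prints the commands; run it with
RUN set to an empty value to execute them.

```sh
#!/bin/sh
# Sketch: show NCQ support and the current tag count for each suspect drive.
# ASSUMPTION: the three new disks attach as ada3, ada4 and ada5; check
# `camcontrol devlist` for the real names.
# Dry run by default: RUN=echo prints each command instead of executing it.
: "${RUN=echo}"
DISKS="${DISKS:-ada3 ada4 ada5}"

show_queue_state() {
    for dev in $DISKS; do
        # The feature table printed by identify includes the NCQ line.
        $RUN camcontrol identify "$dev"
        # Verbose tags output shows the current number of tagged openings.
        $RUN camcontrol tags "$dev" -v
    done
}

show_queue_state
```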
Indeed, disabling AHCI made the AHCI time-outs disappear, but
unfortunately it also cost performance and the hot swap
capabilities.
The Samsung F3EG drives had problems with NCQ in combination
with the SB850 southbridge, so this might also be a cause of the
problem. Unfortunately Seagate has not yet responded to my support question.
As losing AHCI functionality is a pretty big deal, I would like to see
whether toggling NCQ per drive is possible, and whether it resolves the
problem.
Any advice on this?
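A minimal sketch of what such a per-drive toggle could look like, using
the tag count that camcontrol exposes. The device names ada3-ada5 and
the RUN, DISKS and TAGS variables are assumptions for illustration, not
something confirmed for this system. Capping a drive at one tag limits
it to a single outstanding command, which should approximate running
without NCQ; whether that is enough to avoid the time-outs is untested
here. The script prints the commands by default; set RUN to an empty
value to apply them.

```sh
#!/bin/sh
# Sketch: cap each suspect drive at one outstanding command.
# ASSUMPTION: the new disks are ada3, ada4 and ada5.
# ASSUMPTION: a tag count of 1 is close enough to "NCQ off" for testing.
# Dry run by default: RUN=echo prints each command instead of executing it.
: "${RUN=echo}"
DISKS="${DISKS:-ada3 ada4 ada5}"
TAGS="${TAGS:-1}"

limit_tags() {
    for dev in $DISKS; do
        # -N sets the number of tags (queue depth) for the device.
        $RUN camcontrol tags "$dev" -N "$TAGS"
    done
}

limit_tags
```

The setting is a runtime one, so as far as I know it would have to be
reapplied after every reboot, and setting TAGS back to 32 should restore
the original depth.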
Cheers,
Gijs
More information about the freebsd-hardware
mailing list