ata timeouts under load
Miroslav Lachman
000.fbsd at quip.cz
Mon Sep 14 15:21:42 UTC 2009
Alexandre Sunny wrote:
> On Sun, 13 Sep 2009 22:02:10 +0100
> Kris Kennaway <kris at FreeBSD.org> wrote:
>
>
>>Alexander Motin wrote:
>>
>>>Kris Kennaway wrote:
[...]
>>>There are two different kinds of timeouts we can see:
>>> - first one, "ad4: WARNING - ..." is just a queue waiting timeout.
>>>It is not the reason, but consequence of the problem. And I have
>>>doubts that it is reasonable to do it.
>>> - second one, "TIMEOUT - WRITE_DMA48 ..." is a real command
>>>execution timeout. I don't know whether this is the result of some
>>>improper error recovery, or your drive has indeed lost required servo
>>>information near LBA=344052040 and takes too long to find it. You
>>>can try to read that sector and nearby ones with dd.
>>>
>>
>>It's always that sequence (with setfeatures timing out first, then
>>the dma later)...and the block number varies widely, also whether
>>it's read/write. The disk itself & the data it contains appears to
>>be OK as far as I have been able to determine so far.
>
>
> Does smartctl -A /dev/ad4 report "Seek Error Rate" and/or "ECC Error
> Rate", and, if so, do those values change while errors are being
> reported?
>
> "Replaced Sector Count" or something similar might give some insight
> too.
I have a very similar problem with one disk in a gmirror, but it is on
7.2, not -CURRENT.
Sep 14 04:48:29 jimi kernel: ad6: timeout waiting to issue command
Sep 14 04:48:29 jimi kernel: ad6: error issuing FLUSHCACHE command
Sep 14 04:48:29 jimi kernel: ad6: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=447001516
Sep 14 04:48:29 jimi kernel: ad6: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=447001516
Sep 14 04:48:29 jimi kernel: GEOM_MIRROR: Request failed (error=5). ad6[READ(offset=228864776192, length=2048)]
Sep 14 04:48:29 jimi kernel: GEOM_MIRROR: Device gm0: provider ad6 disconnected.
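For what it is worth, the LBA in the READ_DMA48 failure and the offset in the GEOM_MIRROR message point at the same place on the disk. Below is a minimal sketch of that arithmetic, assuming 512-byte sectors (the log does not state the sector size). It also builds a dd command line for reading the failing sector and its neighbours, as suggested earlier in the thread. The helper names and the window of 8 sectors on each side are hypothetical choices of mine; the script only prints the command and does not run it.

```python
# Assumption: 512-byte logical sectors (not stated in the log).
SECTOR_SIZE = 512


def lba_to_offset(lba, sector_size=SECTOR_SIZE):
    """Convert a logical block address to a byte offset on the device."""
    return lba * sector_size


def dd_read_command(device, lba, before=8, after=8, sector_size=SECTOR_SIZE):
    """Build a dd command that reads the sector at `lba` plus its neighbours.

    `before` and `after` are illustrative window sizes, not values taken
    from the thread. The data is discarded; only dd's error output matters.
    """
    start = max(lba - before, 0)
    count = (lba - start) + after + 1
    return "dd if=%s of=/dev/null bs=%d skip=%d count=%d" % (
        device, sector_size, start, count)


# Values copied from the kernel log above.
failed_lba = 447001516
print(lba_to_offset(failed_lba))
print(dd_read_command("/dev/ad6", failed_lba))
```

If the computed offset equals the one GEOM_MIRROR reports, both messages describe the same failed request rather than two separate errors.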
But there are no errors in the SMART log:
Device Model: Hitachi HDP725050GLA360
Firmware Version: GM4OA52A
User Capacity: 500,107,862,016 bytes
SMART overall-health self-assessment test result: PASSED
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       1
  2 Throughput_Performance  0x0005   130   130   054    Pre-fail  Offline      -       151
  3 Spin_Up_Time            0x0007   116   116   024    Pre-fail  Always       -       312 (Average 350)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       23
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   129   129   020    Pre-fail  Offline      -       30
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       13911
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       23
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       545
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       545
194 Temperature_Celsius     0x0002   240   240   000    Old_age   Always       -       25 (Lifetime Min/Max 20/34)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
As has been discussed many times, this should be fixed by increasing the
hardcoded timeouts.
Is it time to make the ATA timeouts sysctl tunables?
There were patches from FreeNAS and some PRs about longer timeouts.
kern/136182: [ata] Heavy disk writes (e.g. ZFS resilver to a drive) can
cause "adX: TIMEOUT - FLUSHCACHE retrying (1 retry left)" on console.
http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/136182
kern/111023: [ata] [request] [patch] please expand ata timeouts
http://www.freebsd.org/cgi/query-pr.cgi?pr=111023
ATA/SATA DMA timeout issues
http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting#line-53
HowTo: Fix SATA DMA timeout issues on FreeBSD
http://linux-bsd-sharing.blogspot.com/2009/03/howto-fix-sata-dma-timeout-issues-on.html
Western Digital hard disks and ATA timeouts
http://www.mail-archive.com/freebsd-hardware@freebsd.org/msg03135.html
ata FLUSHCACHE timeout errors? [patch]
http://lists.freebsd.org/pipermail/freebsd-current/2009-April/005939.html
And I am sure you can find many more reports floating around.
Miroslav Lachman