amd64/157615: AHCI device timeouts with ATI IXP700 SATA controller on high IO load
Petteri Valkonen
petteri.valkonen at iki.fi
Sat Jun 4 17:50:10 UTC 2011
>Number: 157615
>Category: amd64
>Synopsis: AHCI device timeouts with ATI IXP700 SATA controller on high IO load
>Confidential: no
>Severity: serious
>Priority: low
>Responsible: freebsd-amd64
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Sat Jun 04 17:50:09 UTC 2011
>Closed-Date:
>Last-Modified:
>Originator: Petteri Valkonen
>Release: 8.2-RELEASE
>Organization:
>Environment:
FreeBSD microserver 8.2-RELEASE FreeBSD 8.2-RELEASE #0: Thu Feb 17 02:41:51 UTC 2011 root at mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64
>Description:
I'm running 8.2-RELEASE with the ahci(4) driver loaded at boot time on an HP ProLiant N36L. Four Samsung HD204UI drives, attached to an ATI IXP700 SATA controller, make up a simple (striped) ZFS pool:
ahci0: <ATI IXP700 AHCI SATA controller> port 0xd000-0xd007,0xc000-0xc003,0xb000-0xb007,0xa000-0xa003,0x9000-0x900f mem 0xfe6ffc00-0xfe6fffff irq 19 at device 17.0 on pci0
ahci0: [ITHREAD]
ahci0: AHCI v1.20 with 4 3Gbps ports, Port Multiplier supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich0: [ITHREAD]
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich1: [ITHREAD]
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich2: [ITHREAD]
ahcich3: <AHCI channel> at channel 3 on ahci0
ahcich3: [ITHREAD]
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
ada2: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
ada3: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
If I attempt to scrub the pool, at some point one of the disks will time out:
Jun 1 22:48:59 microserver kernel: ahcich1: Timeout on slot 1
Jun 1 22:48:59 microserver kernel: ahcich1: is 00000000 cs 000007f8 ss 000007fe rs 000007fe tfd 40 serr 00000000
Jun 1 22:48:59 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun 1 22:49:45 microserver kernel: ahcich1: Timeout on slot 10
Jun 1 22:49:45 microserver kernel: ahcich1: is 00000000 cs 00000400 ss 00000000 rs 00000400 tfd 80 serr 00000000
Jun 1 22:49:45 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun 1 22:50:31 microserver kernel: ahcich1: Timeout on slot 10
Jun 1 22:50:31 microserver kernel: ahcich1: is 00000000 cs 00000400 ss 00000000 rs 00000400 tfd 80 serr 00000000
Jun 1 22:50:31 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun 1 22:50:31 microserver kernel: (ada1:ahcich1:0:0:0): lost device
Jun 1 22:51:34 microserver kernel: ahcich1: Timeout on slot 10
Jun 1 22:51:34 microserver kernel: ahcich1: is 00000000 cs 000ffc00 ss 000ffc00 rs 000ffc00 tfd 80 serr 00000000
Jun 1 22:51:34 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun 1 22:51:34 microserver kernel: ahcich1: Poll timeout on slot 19
Jun 1 22:51:34 microserver kernel: ahcich1: is 00000000 cs 00080000 ss 00000000 rs 00080000 tfd 80 serr 00000000
Jun 1 22:52:36 microserver kernel: ahcich1: Timeout on slot 19
Jun 1 22:52:36 microserver kernel: ahcich1: is 00000000 cs 1ff80000 ss 1ff80000 rs 1ff80000 tfd 80 serr 00000000
Jun 1 22:52:36 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun 1 22:52:36 microserver kernel: ahcich1: Poll timeout on slot 28
Jun 1 22:52:36 microserver kernel: ahcich1: is 00000000 cs 10000000 ss 00000000 rs 10000000 tfd 80 serr 00000000
Jun 1 22:53:38 microserver kernel: ahcich1: Timeout on slot 28
Jun 1 22:53:38 microserver kernel: ahcich1: is 00000000 cs f000003f ss f000003f rs f000003f tfd 80 serr 00000000
Jun 1 22:53:38 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun 1 22:53:38 microserver kernel: ahcich1: Poll timeout on slot 5
Jun 1 22:53:38 microserver kernel: ahcich1: is 00000000 cs 00000020 ss 00000000 rs 00000020 tfd 80 serr 00000000
Jun 1 22:54:41 microserver kernel: ahcich1: Timeout on slot 5
Jun 1 22:54:41 microserver kernel: ahcich1: is 00000000 cs 00007fe0 ss 00007fe0 rs 00007fe0 tfd 80 serr 00000000
Jun 1 22:54:41 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun 1 22:54:41 microserver kernel: ahcich1: Poll timeout on slot 14
Jun 1 22:54:41 microserver kernel: ahcich1: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd 80 serr 00000000
Jun 1 22:54:41 microserver root: ZFS: vdev I/O failure, zpool=backup path=/dev/label/disk2 offset=270336 size=8192 error=6
Jun 1 22:54:41 microserver root: ZFS: vdev I/O failure, zpool=backup path=/dev/label/disk2 offset=2000398327808 size=8192 error=6
Jun 1 22:54:41 microserver root: ZFS: vdev I/O failure, zpool=backup path=/dev/label/disk2 offset=2000398589952 size=8192 error=6
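The register dumps above are easier to read once the hexadecimal masks are expanded. The following is a minimal sketch, not part of the driver, and it rests on two assumptions that the report itself does not state: that cs, ss and rs are 32-bit bitmasks with one bit per command slot (command issue, SATA active, and the driver's own record of running slots), and that the low byte of tfd is the ATA status register (BSY = 0x80, DRDY = 0x40, DRQ = 0x08, ERR = 0x01). The function names are hypothetical.

```python
# Illustrative decoder for the values printed with the ahcich1 timeouts.
# Assumptions (not confirmed by the report): cs/ss/rs are per-slot bitmasks,
# and the low byte of tfd holds the ATA status register.

ATA_STATUS_BITS = {
    0x80: "BSY",   # device busy
    0x40: "DRDY",  # device ready
    0x08: "DRQ",   # data request
    0x01: "ERR",   # error
}


def decode_slots(mask):
    """Return the command slot numbers whose bits are set in a 32-bit mask."""
    return [slot for slot in range(32) if (mask >> slot) & 1]


def decode_status(tfd):
    """Return the names of the ATA status flags set in the low byte of tfd."""
    status = tfd & 0xFF
    return [name for bit, name in ATA_STATUS_BITS.items() if status & bit]


if __name__ == "__main__":
    # Values copied from the first two dumps in the log above.
    print(decode_slots(0x000007F8))  # cs at 22:48:59 -> [3, 4, 5, 6, 7, 8, 9, 10]
    print(decode_slots(0x00000400))  # cs at 22:49:45 -> [10]
    print(decode_status(0x40))       # tfd at 22:48:59 -> ['DRDY']
    print(decode_status(0x80))       # tfd at 22:49:45 -> ['BSY']
```

Under these assumptions, every dump after the first shows the device with BSY set and never clearing it within the 15000 ms wait, which matches the repeated "device is not ready" messages.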
The offending disk is then taken offline:
# camcontrol devlist
<SAMSUNG HD204UI 1AQ10001> at scbus0 target 0 lun 0 (ada0,pass0)
<SAMSUNG HD204UI 1AQ10001> at scbus2 target 0 lun 0 (ada2,pass2)
<SAMSUNG HD204UI 1AQ10001> at scbus3 target 0 lun 0 (ada3,pass3)
I have upgraded the server's BIOS to the latest available version (2011.04.02 (A)), but the problem persists. Furthermore, extended offline SMART self-tests (smartctl -t long) performed on all the disks report no errors.
If I switch to the old ata(4) driver, the scrub job completes without any errors.
Others have reported the same symptoms on similar hardware (an N36L with Samsung disks), and switching drivers remedied the problem for them as well:
http://freebsd.1045724.n5.nabble.com/ahci-ko-and-IXP700-800-gt-no-disk-found-tt3948669.html#a3948673
>How-To-Repeat:
Load the ahci(4) module and start a disk-I/O-intensive process (e.g. a ZFS scrub).
>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted: