9.1-current disk throughput stalls ?
Jeremy Chadwick
jdc at koitsu.org
Mon Jun 3 20:32:09 UTC 2013
On Mon, Jun 03, 2013 at 09:38:45AM -0600, Ross Alexander wrote:
> I wonder if anyone here has insight on a disk throughput problem
> that's come up over the last week or two. Now, I habitually run an
> 'svn up' and then rebuild world + kernel every Saturday morning on the
> home machines. It's all scripted and logged; I've been doing this for
> years and the process is very cut and dried. Saturday AM, I started
> it as usual - today it was still running, but only about 15% done.
> Normally it completes in 39 minutes, +/- 1 minute.
>
> What I've noticed is that disk performance on disk intensive stuff has
> gotten very flaky over the last two or three weeks. A buildworld, to
> pick an example, will run nicely for three to five minutes and then
> bog down. The disks stay busy, but forward progress slows to a crawl
> and then apparently stops. Individual cleandirs are taking five to
> ten seconds each on an otherwise unloaded machine. It feels like
> a vax-11/780 with 30 users and RA-80s, if anyone here remembers those
> days :).
>
> Here's a 'systat -vms':
>
> 5 users Load 0.30 0.30 0.27 Jun 3 09:07
>
> Mem:KB REAL VIRTUAL VN PAGER SWAP PAGER
> Tot Share Tot Share Free in out in out
> Act 84032 13908 1949112 40736 15071k count
> All 671192 16300 1076410k 61416 pages
> Proc: Interrupts
> r p d s w Csw Trp Sys Int Sof Flt cow 630 total
> 113 3573 29 113 630 83 26 26 zfod hdac1 16
> ozfod xhci0 ehci
> 0.9%Sys 0.2%Intr 0.3%User 0.0%Nice 98.6%Idle %ozfod ohci0 ohci
> | | | | | | | | | | | daefr 93 emu10kx0
> + prcfr 178 hpet0:t0
> dtbuf 596 totfr hdac0 259
> Namei Name-cache Dir-cache 329578 desvn react 359 ahci0 260
> Calls hits % hits % 17505 numvn pdwak re0 261
> 475 294 62 14841 frevn pdpgs
> intrn
> Disks ada0 ada1 pass0 pass1 796676 wire
> KB/t 5.42 5.96 0.00 0.00 65484 act
> tps 197 192 0 0 45332 inact
> MB/s 1.04 1.12 0.00 0.00 cache
> %busy 74 82 0 0 15071692 free
> buf
>
> This is taken during the early stages of a buildworld. The cleandir
> job steps are *crawling* along. Rattling the keyboard (USB or serial,
> although an SSH session seems to work sometimes as well) gets the
> buildworld doing some useful work again. Meanwhile, the apps load
> (two instances of WSPR, an instance of baudline, KDE, and a
> vncserver), which is soundcard I/O bound and does little to no disk
> I/O, runs along perfectly happily.
>
> The oldest kernel I have that shows the syndrome is -
>
> FreeBSD aukward.bogons 9.1-STABLE FreeBSD 9.1-STABLE #59 r250498:
> Sat May 11 00:03:15 MDT 2013
> toor at aukward.bogons:/usr/obj/usr/src/sys/GENERIC amd64
>
> H/W info -
>
> hw.machine: amd64
> hw.model: AMD Phenom(tm) II X4 965 Processor
> hw.ncpu: 4
> hw.physmem: 16883937280
> hw.clockrate: 3411
> kern.sched.name: ULE
>
> ahci0: <ATI IXP700 AHCI SATA controller> port 0xa000-0xa007,0x9000-0x9003,\
> 0x8000-0x8007,0x7000-0x7003,0x6000-0x600f mem 0xfe6ffc00-0xfe6fffff \
> irq 19 at device 17.0 on pci0
> ahci0: AHCI v1.20 with 6 6Gbps ports, Port Multiplier supported
> ahcich0: <AHCI channel> at channel 0 on ahci0
> [...]
> ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
> ada0: <WDC WD1200JD-22HBC0 08.02D08> ATA-6 SATA 1.x device
> ada0: 150.000MB/s transfers (SATA 1.x, UDMA6, PIO 8192bytes)
> ada0: 114473MB (234441648 512 byte sectors: 16H 63S/T 16383C)
> ada0: Previously was known as ad4
> ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
> ada1: <WDC WD1200JD-22HBC0 08.02D08> ATA-6 SATA 1.x device
> ada1: 150.000MB/s transfers (SATA 1.x, UDMA6, PIO 8192bytes)
> ada1: 114473MB (234441648 512 byte sectors: 16H 63S/T 16383C)
> ada1: Previously was known as ad8
>
> I'm not paging, I don't have wild interrupt loads (checked with
> 'vmstat -i'), the ZFS pool is not in the middle of a scrub, but the
> machine has bad trivial response and buildworld doesn't get finished.
> I am seeing very similar behaviour on three other 9.1-current
> machines, all of which are AHCI/SATA setups, using both Seagate and WD
> disks (of random sizes and ages). All these boxes ran fine a month
> ago.
>
> BTW, when I do the rattle-keyboard-to-get-disks-going trick, the NFS
> daemon reports that the system clock slews badly - machine time drops
> behind wall clock time. Something is locking the clock update off.
>
> (Hmmm, I see I'm running a pre-5000/feature flags ZFS pool, FWTW.
> I'll run zpool upgrade, my bad.)
1. There is no such thing as 9.1-CURRENT. You meant either 9.1-STABLE
(properly called stable/9) or -CURRENT (properly called head). Your
uname output says 9.1-STABLE.
2. Is there some reason you excluded details of your ZFS setup? "zpool
status" would be a good start.
3. Do any of your filesystems/pools have ZFS compression enabled, or
have in the past?
4. Do any of your filesystems/pools have ZFS dedup enabled, or have in
the past?
5. Does the problem go away after a reboot?
6. Can you provide smartctl -x output for both ada0 and ada1? You will
need to install ports/sysutils/smartmontools for this. I'm asking
because one of your disks may be causing I/O transactions to stall for
the entire pool (i.e. a "single point of annoyance").
7. Can you remove ZFS from the picture entirely (use UFS only) and
re-test? My guess is that this is ZFS behaviour, particularly the ARC
being flushed to disk, and your disks are old/slow. (Meaning: you have
16GB RAM and a 4-core CPU, but very old SATA 1.x disks.)
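
For points 2, 3, 4 and 6, something along these lines would collect
everything in one go. Treat it as a rough sketch, not a tested tool:
diag_cmds and run_diag are names made up for illustration, the disk
names are taken from the dmesg output quoted above, and smartctl is
only present once smartmontools is installed. Run it as root.

```sh
#!/bin/sh
# Sketch only: gather the details asked for in points 2, 3, 4 and 6.
# Disk names are assumed from the quoted dmesg output; adjust as needed.
DISKS="ada0 ada1"

# Print the diagnostic commands, one per line.
diag_cmds() {
    echo "zpool status -v"                           # point 2: pool layout and errors
    echo "zpool list"                                # point 4: DEDUP column shows past dedup
    echo "zfs get compression,compressratio,dedup"   # points 3 and 4, all datasets
    for d in $DISKS; do
        echo "smartctl -x /dev/$d"                   # point 6: full SMART report
    done
}

# Run each command, label its output, and skip tools that are not installed.
run_diag() {
    diag_cmds | while read -r cmd; do
        tool=${cmd%% *}
        echo "=== $cmd ==="
        if command -v "$tool" >/dev/null 2>&1; then
            $cmd </dev/null 2>&1 || echo "(exit status $?)"
        else
            echo "($tool not found; skipped)"
        fi
    done
}

run_diag
```

Redirecting the output to a file (for example "sh diag.sh > diag.txt",
where diag.sh is whatever you name the script) gives you a single
attachment for the list. A compressratio or DEDUP value above 1.00x
suggests compressed or deduplicated data is still on the pool even if
the property is currently off.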
--
| Jeremy Chadwick jdc at koitsu.org |
| UNIX Systems Administrator http://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |