9.1-current disk throughput stalls ?

Ross Alexander rwa at athabascau.ca
Mon Jun 3 16:08:02 UTC 2013


I wonder if anyone here has insight on a disk throughput problem
that's come up over the last week or two.  Now, I habitually run an
'svn up' and then rebuild world + kernel every Saturday morning on the
home machines.  It's all scripted and logged; I've been doing this for
years and the process is very cut and dried.  Saturday AM, I started
it as usual - today it was still running, but only about 15% done.
Normally it completes in 39 minutes, +/- 1 minute.

What I've noticed is that disk performance on disk intensive stuff has
gotten very flaky over the last two or three weeks.  A buildworld, to
pick an example, will run nicely for three to five minutes and then
bog down.  The disks stay busy, but forward progress slows to a crawl
and then apparently stops.  Individual cleandirs are taking five to
ten seconds each on an otherwise unloaded machine.  It feels like
a VAX-11/780 with 30 users and RA80s, if anyone here remembers those
days :).

Here's a 'systat -vms':

     5 users    Load  0.30  0.30  0.27                  Jun  3 09:07

Mem:KB    REAL            VIRTUAL                       VN PAGER   SWAP PAGER
         Tot   Share      Tot    Share    Free           in   out     in   out
Act   84032   13908  1949112    40736  15071k  count
All  671192   16300 1076410k    61416          pages
Proc:                                                            Interrupts
   r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt        cow     630 total
             113      3573   29  113  630   83   26     26 zfod        hdac1 16
                                                           ozfod       xhci0 ehci
  0.9%Sys   0.2%Intr  0.3%User  0.0%Nice 98.6%Idle        %ozfod       ohci0 ohci
|    |    |    |    |    |    |    |    |    |    |       daefr    93 emu10kx0
+                                                         prcfr   178 hpet0:t0
                                            dtbuf      596 totfr       hdac0 259
Namei     Name-cache   Dir-cache    329578 desvn          react   359 ahci0 260
    Calls    hits   %    hits   %     17505 numvn          pdwak       re0 261
      475     294  62                 14841 frevn          pdpgs
Disks  ada0  ada1 pass0 pass1                      796676 wire
KB/t   5.42  5.96  0.00  0.00                       65484 act
tps     197   192     0     0                       45332 inact
MB/s   1.04  1.12  0.00  0.00                             cache
%busy    74    82     0     0                    15071692 free

This is taken during the early stages of a buildworld.  The cleandir
job steps are *crawling* along.  Rattling the keyboard (USB or serial,
although an SSH session seems to work sometimes as well) gets the
buildworld doing some useful work again.  Meanwhile, the apps load
(two instances of WSPR, an instance of baudline, KDE, and a
vncserver), which is soundcard I/O bound and does little to no disk
I/O, runs along perfectly happily.
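
As a rough cross-check of the disk lines in the systat output above, the
MB/s row should be about KB/t times tps, divided by 1024.  The sketch below
is only that arithmetic; the numbers are copied from the systat output, the
helper name throughput_mb_s is made up for illustration, and the "busy time
per transfer" figure assumes %busy is spread evenly over the transfers,
which may not hold.

```python
# Cross-check of the systat disk rows: MB/s ~= KB/t * tps / 1024.
# All figures are copied from the systat output; the helper name is hypothetical.

def throughput_mb_s(kb_per_transfer: float, transfers_per_s: float) -> float:
    """Average throughput in MB/s from transfer size (KB) and transfer rate (1/s)."""
    return kb_per_transfer * transfers_per_s / 1024.0

disks = {
    "ada0": {"kb_t": 5.42, "tps": 197, "busy_pct": 74},
    "ada1": {"kb_t": 5.96, "tps": 192, "busy_pct": 82},
}

for name, d in disks.items():
    mb_s = throughput_mb_s(d["kb_t"], d["tps"])
    # Assumption: %busy is divided evenly across the transfers in the interval.
    ms_per_xfer = (d["busy_pct"] / 100.0) / d["tps"] * 1000.0
    print(f"{name}: {mb_s:.2f} MB/s, ~{ms_per_xfer:.1f} ms busy per transfer")
```

Both drives work out to roughly 1 MB/s, in line with the MB/s row, from
transfers of only 5 to 6 KB each.  That is a small fraction of the 150 MB/s
link rate in the dmesg output, even though the disks report 74% and 82% busy.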

The oldest kernel I have that shows the syndrome is -

     FreeBSD aukward.bogons 9.1-STABLE FreeBSD 9.1-STABLE #59 r250498:
     Sat May 11 00:03:15 MDT 2013
     toor at aukward.bogons:/usr/obj/usr/src/sys/GENERIC  amd64

H/W info -

     hw.machine: amd64
     hw.model: AMD Phenom(tm) II X4 965 Processor
     hw.ncpu: 4
     hw.physmem: 16883937280
     hw.clockrate: 3411
     kern.sched.name: ULE

     ahci0: <ATI IXP700 AHCI SATA controller> port 0xa000-0xa007,0x9000-0x9003,\
         0x8000-0x8007,0x7000-0x7003,0x6000-0x600f mem 0xfe6ffc00-0xfe6fffff \
         irq 19 at device 17.0 on pci0
     ahci0: AHCI v1.20 with 6 6Gbps ports, Port Multiplier supported
     ahcich0: <AHCI channel> at channel 0 on ahci0
     ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
     ada0: <WDC WD1200JD-22HBC0 08.02D08> ATA-6 SATA 1.x device
     ada0: 150.000MB/s transfers (SATA 1.x, UDMA6, PIO 8192bytes)
     ada0: 114473MB (234441648 512 byte sectors: 16H 63S/T 16383C)
     ada0: Previously was known as ad4
     ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
     ada1: <WDC WD1200JD-22HBC0 08.02D08> ATA-6 SATA 1.x device
     ada1: 150.000MB/s transfers (SATA 1.x, UDMA6, PIO 8192bytes)
     ada1: 114473MB (234441648 512 byte sectors: 16H 63S/T 16383C)
     ada1: Previously was known as ad8

I'm not paging, I don't have wild interrupt loads (checked with
'vmstat -i'), the ZFS pool is not in the middle of a scrub, but the
machine has bad trivial response and buildworld doesn't get finished.
I am seeing very similar behaviour on three other 9.1-current
machines, all of which are AHCI/SATA setups, using both Seagate and WD
disks (of random sizes and ages).  All these boxes ran fine a month ago.

BTW, when I do the rattle-keyboard-to-get-disks-going trick, the NTP
daemon reports that the system clock slews badly - machine time drops
behind wall clock time.  Something is locking the clock update off.

(Hmmm, I see I'm running a pre-5000/feature flags ZFS pool, FWIW.
I'll run zpool upgrade, my bad.)

Ross Alexander, (780) 675-6823 / (780) 689-0749, rwa at athabascau.ca

 	"Always do right. This will gratify some people,
 	 and astound the rest."  -- Samuel Clemens

More information about the freebsd-stable mailing list