Re: close() taking minutes (ZFS related)

From: Peter <pmc_at_citylink.dinoex.sub.org>
Date: Sat, 21 Jan 2023 12:53:26 UTC
On Fri, Jan 20, 2023 at 03:55:43PM +0100, Mateusz Guzik wrote:
! offcpu.d:
! 
! #pragma D option dynvarsize=32m
! 
! /*
!  * The p_flag & 0x4 test filters out kernel threads.
!  */
! 
! sched:::sleep
! /(curthread->td_proc->p_flag & 0x4) == 0/
! {
!         self->ts = timestamp;
!         self->mesg = curthread->td_wmesg;
! }
! 
! sched:::wakeup
! /self->ts/
! {
!         @[stack(30), stringof(self->mesg)] = sum(timestamp - self->ts);
!         self->ts = 0;
!         self->mesg = 0;
! }
! 
! dtrace:::END
! {
!         printa("%k\n%s\n%@d\n\n", @);
! }
! 
! dtrace -s offcpu.d -o offcpu.out
! 
! git clone https://github.com/brendangregg/FlameGraph.git
! cat offcpu.out | perl FlameGraph/stackcollapse.pl | \
!     perl FlameGraph/flamegraph.pl --color=io > offcpu.svg

Okay, got it. And now?

I created two flame graphs, one while the machine is idle and one while
it is completely stuck, but I don't see any significant differences.

I dropped them into the ACME directory so you can have a look:
http://flag.daemon.contact/.well-known/acme-challenge/ZFS.idle.svg
http://flag.daemon.contact/.well-known/acme-challenge/ZFS.busy.svg
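
Maybe a differential flame graph would make the delta visible - a rough
sketch of what I have in mind, assuming the difffolded.pl helper that
ships in the same FlameGraph repo (the capture file names are made up):

perl FlameGraph/stackcollapse.pl offcpu.idle.out > idle.folded
perl FlameGraph/stackcollapse.pl offcpu.busy.out > busy.folded
# red frames grew in the busy capture, blue frames shrank
perl FlameGraph/difffolded.pl idle.folded busy.folded | \
    perl FlameGraph/flamegraph.pl > ZFS.diff.svg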

This probably needs more time to really understand - my brain is
currently occupied with the fallout of these table migrations.

I also tried to get a better view of what exactly is happening.
Here is a small database load:

The beginning, everything normal:
 770 85475 83995  0  21  0   17812    6032 select   S+J    2       0:00.18 pg_restore -C -d postgres pgdump_TESTLOAD

The backend starts working, and things start to hang:
 770 85503 44085  4  52  0  156652  109460 zcw->zcw DsJ    -       0:05.78 postgres: postgres testload [local] ALTER TABLE (post
 770 85504 44090  1  20  0     580     384 tx->tx_s DJ     -       0:00.00 sh -c /ext/libexec/RedoLog.copy "00000001000000290000
 770 85475 83995  4  22  0   17812    6036 select   S+J    2       0:00.84 pg_restore -C -d postgres pgdump_TESTLOAD

It's getting bad:
   0 13187     1  1  20  0   13276     768 tx->tx_s DsJ    -       0:14.51 /usr/sbin/blacklistd
 770 44089 44085 16  20  0   80464   12620 tx->tx_s DsJ    -       0:09.90 postgres: walwriter    (postgres)
 770 85503 44085 15  52  0  177132  124516 tx->tx_s DsJ    -       0:06.65 postgres: postgres testload [local] CREATE INDEX (pos
 770 85504 44090 10  20  0     580     384 tx->tx_s DJ     -       0:00.00 sh -c /ext/libexec/RedoLog.copy "00000001000000290000
 770 85475 83995 19  22  0   17812    6036 select   S+J    2       0:00.84 pg_restore -C -d postgres pgdump_TESTLOAD

And worse:
 770 44089 44085 17  20  0   80464   12620 zcw->zcw DsJ    -       0:09.92 postgres: walwriter    (postgres)
 770 85503 44085  2  27  0  177132  124980 usem     SsJ    -       0:07.39 postgres: postgres testload [local] CREATE INDEX (pos
   8 85660 10053  1  52  0     428     252 tx->tx_s DJ     -       0:00.00 expr ## : \\(.\\).*
 770 85661 85504  8  22  0     416     232 zfs      DJ     -       0:00.00 fsync /var/db/pg-int/arch/000000010000002900000014
 770 85475 83995 19  20  0   17812    6036 select   S+J    2       0:00.84 pg_restore -C -d postgres pgdump_TESTLOAD

DB loaded, but things are ugly now:
   8 85660 10053 16  52  0     428     252 tx->tx_s DJ     -       0:00.00 expr ## : \\(.\\).*
 770 85661 85504  8  22  0     416     232 zfs      DJ     -       0:00.00 fsync /var/db/pg-int/arch/000000010000002900000014
5001 85683 85674 16  20  0     580     380 zfs      DsJ    -       0:00.00 /bin/sh -c /ext/libexec/svnsync.cron /ext/SVNR svn+ss
   0 85687 85677  4  20  0     580     380 zfs      DsJ    -       0:00.00 /bin/sh -c /usr/libexec/atrun
   8 85688 85679  5  20  0     580     380 zfs      DsJ    -       0:00.00 /bin/sh -c /usr/local/news/bin/send-uucp
 770 85666 83995 10  20  0    1020     764 zfs      D+J    2       0:00.00 psql -c drop database testload
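
To put numbers on those wait channels, I might try a variant of your
offcpu.d that aggregates by wmesg instead of stack - an untested sketch
(call it wchan.d), pairing each sleep with the thread's next on-cpu:

#pragma D option dynvarsize=32m

sched:::sleep
/(curthread->td_proc->p_flag & 0x4) == 0/
{
        self->ts = timestamp;
        self->mesg = curthread->td_wmesg;
}

sched:::on-cpu
/self->ts/
{
        /* charge the whole off-CPU time to the wait channel */
        @[stringof(self->mesg)] = sum(timestamp - self->ts);
        self->ts = 0;
        self->mesg = 0;
}

dtrace:::END
{
        printa("%-16s %@d\n", @);
}

That should show directly how much time piles up under the tx->tx_s /
zcw->zcw / zfs channels while the load runs (ps truncates the names;
dtrace should print them in full). Run with dtrace -s wchan.d and stop
with ^C once things are stuck.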


Then I tried to do that load again with sync writes disabled, to
see if things change (they do not). But at that point the daily
periodic run appeared, and it does a VACUUM - that is, a first
VACUUM on all the migrated tables. And now I give up, because this is
unusable:

   0  1430  1429  4  20  0   12860   3496 tx->tx_s D      -       0:00.09 / /usr /var /home /ext /j/kerb /media /j/oper /usr/loc
   0  2569 10505 14  20  0   21072   5800 tx->tx_s DJ     -       0:00.03 sendmail: 30L26S6D002569 wand-n.intra.daemon.contact [
   0  2273  2269 13  30  0   13680   3480 tx->tx_s Ds     -       0:00.02 /bin/sh /ext/libexec/Key.renew
 770  2645 44085  2  23  0   83024  23528 tx->tx_s DsJ    -       0:25.15 postgres: postgres bareos [local] VACUUM (postgres)
 770  2924 89818  9  20  0   17176   5308 tx->tx_s DJ     -       0:00.01 fetch --no-verify-peer -1 -T 30 -o /tmp/kurse.89801.ht
   8  2936 10053  2  22  0     428    252 tx->tx_s DJ     -       0:00.00 expr 29 + 1
   0  2947  2845  0  20  0   18384   3736 tx->tx_s DJ     -       0:00.00 sendmail: ./30L26SBv001628 from queue (sendmail)
 770  2949  2668  2  21  0     424    240 zfs      DJ     -       0:00.00 mv /var/db/pg-int/arch/000000010000002900000023 /var/d
   0  3110  3108 12  20  0   18384   3656 zfs      DJ     -       0:00.00 sendmail: ./30L26SDj001446 from queue (sendmail)
   0  3335  3334 11  20  0   18384   3656 zfs      DJ     -       0:00.00 sendmail: ./30L26Sjq001053 from queue (sendmail)
 997  3504  4734  7  20  0     580    380 zfs      DJ     -       0:00.01 sh -c /sbin/ping -c2 wand-n.intra.daemon.contact > /de
 997  3505  4734 15  20  0     580    380 zfs      DJ     -       0:00.01 sh -c sleep `jot -r 1 1 55`
 997  3506  4734 12  20  0     580    380 zfs      DJ     -       0:00.01 sh -c /sbin/ping -c2 pole-n.intra.daemon.contact > /de
 997  3507  4734  8  21  0     580    380 zfs      DJ     -       0:00.01 sh -c /usr/local/bin/xlsclients -display disp-e.intra.
   0  3548  3547  4  21  0   18384   3696 zfs      DJ     -       0:00.00 sendmail: ./30L26SRV001167 from queue (sendmail)
   0  3742  3740  9  20  0   18384   3656 zfs      DJ     -       0:00.00 sendmail: ./30L26SrD001048 from queue (sendmail)
   0  3955  3954  8  20  0   18384   3656 zfs      DJ     -       0:00.00 sendmail: ./30L26Sjs001053 from queue (sendmail)
   0  4155  4154 17  20  0   18384   3656 zfs      DJ     -       0:00.00 sendmail: ./30L26SBx001628 from queue (sendmail)
   0  4375  4374  6  20  0   18384   3656 zfs      DJ     -       0:00.00 sendmail: ./30L26SLE002237 from queue (sendmail)
   0  4575  4574 17  20  0   18384   3656 zfs      DJ     -       0:00.00 sendmail: ./30L26SDl001446 from queue (sendmail)
   0  4753  4752 11  20  0   18384   3656 zfs      DJ     -       0:00.00 sendmail: ./30L26SrF001048 from queue (sendmail)
   0  4952  4951 12  20  0   18384   3696 zfs      DJ     -       0:00.00 sendmail: ./30L26SRX001167 from queue (sendmail)
   0  5170  5169  3  20  0   18384   3700 zfs      DJ     -       0:00.00 sendmail: ./30L26SWb001552 from queue (sendmail)
   0  5383  5382  1  20  0   18384   3660 zfs      DJ     -       0:00.00 sendmail: ./30L26SYD000965 from queue (sendmail)
   0 10486     1  6  20  0   13000    620 tx->tx_s DsJ    -       0:13.29 /usr/sbin/cron -s
 770  3514 83995 16  20  0     596    288 zfs      D+J    2       0:00.00 pg_restore -C -d postgres pgdump_TESTLOAD
1000  3518 83924 18  20  0   13396   3200 zfs      D+     4       0:00.00 ls -la /j/admn/etc/

I attached truss to the VACUUM worker, and it does file reads and
writes in 8k blocks continuously, just as it should, keeping the
disks busy.
But meanwhile, nothing else makes any progress at all.
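
For reference, that was nothing fancy, just plain truss attached to the
backend pid from the listing above, roughly:

truss -o /tmp/vacuum.truss -p 2645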

I had to wait half an hour until the VACUUM was done; then everything
went back to normal, and nothing bad had happened (my scripting is
usually written to cope with indefinite stalls at any point).

The issue is a lot worse than I initially thought; it concerns
everything: sort, reindex, vacuum, ...

It looks as if ZFS allows only a single process to write to the pool
at a time, and everything else is blocked from accessing it meanwhile.
And this seems SSD-specific; the DB was on mechanical disks before
and did fine for the last 15 years, precisely saturating the SCSI bus
and never stalling.
I am inclined to think this could be related to some of the pseudo-AI
that went into ZFS over the years: all the knobs I used to tune have
disappeared, and even arc_min/arc_max seem to be in the process of
disappearing.
Back in 2008 this DB ran on a Pentium-2 with 750 MB RAM, and it did
work, after proper tuning. Now it has 20 cores and 48 GB RAM, and it
doesn't work.
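
If somebody can tell me which of the remaining knobs are worth looking
at, these would be my first guesses - sysctl names as I currently
understand them on 13.x, so no guarantee they are still spelled like
that:

# ARC sizing (what used to be vfs.zfs.arc_min / vfs.zfs.arc_max)
sysctl vfs.zfs.arc.min vfs.zfs.arc.max
# seconds between forced txg syncs
sysctl vfs.zfs.txg.timeout
# dirty data limit before the write throttle kicks in
sysctl vfs.zfs.dirty_data_max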