CURRENT slow and shaky network stability

Tue Mar 29 06:08:49 UTC 2016

On Mon, 28 Mar 2016 14:52:09 -0700 (PDT)
Don Lewis <truckman at FreeBSD.org> wrote:

> On 28 Mar, O. Hartmann wrote:
> > On Sat, 26 Mar 2016 14:26:45 -0700 (PDT)
> > Don Lewis <truckman at FreeBSD.org> wrote:
> >   
> >> On 26 Mar, Michael Butler wrote:  
> >> > -current is not great for interactive use at all. The strategy of
> >> > pre-emptively dropping idle processes to swap is hurting .. big time.
> >> > 
> >> > Compare inactive memory to swap in this example ..
> >> > 
> >> > 110 processes: 1 running, 108 sleeping, 1 zombie
> >> > CPU:  1.2% user,  0.0% nice,  4.3% system,  0.0% interrupt, 94.5% idle
> >> > Mem: 474M Active, 1609M Inact, 764M Wired, 281M Buf, 119M Free
> >> > Swap: 4096M Total, 917M Used, 3178M Free, 22% Inuse
> >> > 
> >> >   PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
> >> >  1819 imb              1  28    0   213M 11284K select  1 147:44   5.97% gkrellm
> >> > 59238 imb             43  20    0   980M   424M select  0  10:07   1.92% firefox
> >> > 
> >> >  .. it shouldn't start randomly swapping out processes because they're
> >> > used infrequently when there's more than enough RAM to spare ..    
> >> 
> >> I don't know what changed, and probably something can use some tweaking,
> >> but paging out idle processes isn't always the wrong thing to do.  For
> >> instance if I'm using poudriere to build a bunch of packages and its
> >> heavy use of tmpfs is pushing the machine into many GB of swap usage, I
> >> don't want interactive use like:
> >> 	vi foo.c
> >> 	cc foo.c
> >> 	vi foo.c
> >> to suffer because vi and cc have to be read in from a busy hard drive
> >> each time while unused console getty and idle sshd processes in a bunch
> >> of jails are still hanging on to memory even though they haven't
> >> executed any instructions since shortly after the machine was booted
> >> weeks ago.
> >>   
> >> > It also shows up when trying to reboot .. on all of my gear, 90 seconds
> >> > of "fail-safe" time-out is no longer enough when a good proportion of
> >> > daemons have been dropped onto swap and must be brought back in to flush
> >> > their data segments :-(    
> >> 
> >> That's a different and known problem.  See:
> >> <https://svnweb.freebsd.org/base/releng/10.3/bin/csh/config_p.h?revision=297204&view=markup>  
> > 
> > CURRENT has been rendered unusable and faulty. Updating ports for poudriere
> > ends in this error (broken pipe) on the remote console:
> > 
> >  [~] poudriere ports -u -p head
> > [00:00:00] ====>> Updating portstree "head"
> > [00:00:00] ====>> Updating the ports tree... done
> > root at gate [~] Fssh_packet_write_wait: Connection to 192.168.250.111 port 22: Broken pipe
> > 
> > 
> > Although the box is not under load, several processes get idled/paged out
> > over time and never recover; the connection is then sabotaged and the whole
> > thing is unusable :-(  
> 
> I'm definitely not seeing that here.  This is getting close to the end
> of a big poudriere run:
> 
> last pid: 82549;  load averages: 20.05, 20.72, 23.51    up 5+12:34:14  12:51:55
> 144 processes: 20 running, 109 sleeping, 15 stopped
> CPU: 85.3% user,  0.0% nice, 14.7% system,  0.0% interrupt,  0.0% idle
> Mem: 1082M Active, 19G Inact, 9718M Wired, 249M Buf, 1095M Free
> ARC: 3841M Total, 2039M MFU, 642M MRU, 3395K Anon, 111M Header, 1044M Other
> Swap: 40G Total, 9691M Used, 31G Free, 23% Inuse, 196K In
> 
> At the moment, openoffice-4, openoffice-devel, libreoffice, and chromium
> are all being built and are using tmpfs for "wrkdir data localbase", so
> there are many GB of data in tmpfs, which is the reason for the high
> inact and swap usage.  I just hit the return key in an idle (for a
> couple of hours) terminal window containing an ssh login session to the
> same machine.  I got a fresh command prompt essentially instantaneously.
> It couldn't have taken more than a couple hundred milliseconds to wake
> up and page in the idle sshd and shell processes on the build server.
> 
> [a couple hours later, after poudriere is done and all tmpfs is gone]
> 
> last pid: 66089;  load averages:  0.13,  1.59,  4.61    up 5+14:14:33  14:32:14
> 71 processes:  1 running, 55 sleeping, 15 stopped
> CPU:  3.1% user,  0.0% nice,  0.0% system,  0.0% interrupt, 96.9% idle
> Mem: 58M Active, 85M Inact, 12G Wired, 249M Buf, 19G Free
> ARC: 6249M Total, 2792M MFU, 2246M MRU, 16K Anon, 133M Header, 1078M Other
> Swap: 40G Total, 81M Used, 40G Free
> 
> [after tracking down and exiting all of those stopped processes]
> 
> last pid: 66103;  load averages:  0.20,  0.99,  3.80    up 5+14:17:18  14:34:59
> 56 processes:  1 running, 55 sleeping
> CPU:  0.0% user,  0.0% nice,  0.1% system,  0.1% interrupt, 99.9% idle
> Mem: 57M Active, 88M Inact, 12G Wired, 249M Buf, 19G Free
> ARC: 6251M Total, 2793M MFU, 2247M MRU, 16K Anon, 133M Header, 1078M Other
> Swap: 40G Total, 63M Used, 40G Free
> 
> The biggest chunk of the 63 MB of swap appears to be nginx.  Its
> process size is 29 MB, but it has zero resident.  It hasn't executed any
> code since it was first started when I booted the system several days
> ago.  Other consumers appear to be getty and sshd and syslogd in various
> untouched jails.
> 
> 
> I've seen reports that r296137 and r297267 show the ssh problem, but
> this machine is in the middle with r297204 and I don't see it.
> 
> As mentioned previously, I'm not running Xorg and a bunch of bloated
> X11 clients on this machine.  Those make fat targets for having RAM
> taken from them, which would probably make my interactive experience
> less pleasant, but that should still not affect ssh.
> 
> On my FreeBSD 10 machine, which has only 8 GB of RAM, my experience is
> that firefox gets pretty bloated after a while.  It's currently at 2.6
> GB (with 2.8 GB of swap currently in use - I've got some other RAM hogs
> running as well) and I'm not seeing any problems, but when it gets up in
> the 4-5 GB range, things can start to get pretty laggy, but I don't see
> problems with ssh.  The biggest problem with firefox seems to be
> javascript, which seems to leak memory like a sieve.  Making heavy use
> of the noscript plugin is the only way to keep Firefox usable.
> 
> The only thing I can think of is that this is triggered by something in
> the machine configuration or the specific hardware.  I'm running a
> GENERIC kernel and the only non-standard modification to /usr/src is the
> dummynet AQM patchset.  The latter should have no effect since I'm not
> using ipfw on this machine.
> 
> If I get a chance, I'll try booting my FreeBSD 11 machine with less RAM to
> see if that is a trigger.

Several of my boxes do not run X11 or "... a bunch of bloated X11 clients",
and they have 8 GB, 16 GB or 32 GB of RAM (only the 32 GB box runs X11).
On all remote systems running the most recent CURRENT (we are talking about
r297237 - r297369 right now) I definitely do not get a fresh prompt
"immediately". It takes up to 60 seconds (and more) to recover, even if the
box is completely idle. In a steadily rising number of cases I now get broken
pipes. This also happens to sessions while running "poudriere options" on
larger installations, and this is completely unacceptable.
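
To make the "compare inactive memory to swap" numbers quoted earlier in this
thread easier to compare across machines, here is a minimal sketch that parses
the "Mem:" and "Swap:" summary lines as top(1) prints them above. It is not
taken from any poster; the function names parse_summary and swap_vs_inactive
are made up for illustration, and it assumes the sizes always carry a K/M/G/T
suffix as in the quoted output.

```python
import re

# Unit suffixes as they appear in the quoted top(1) summary lines.
# Assumption: binary multiples (1K = 1024 bytes).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_summary(line):
    """Turn a top(1) 'Mem:' or 'Swap:' line into a {label: bytes} dict.

    Fields without a size suffix (for example '22% Inuse') are skipped.
    """
    fields = {}
    for number, unit, label in re.findall(r"(\d+)([KMGT])\s+([A-Za-z]+)", line):
        fields[label] = int(number) * UNITS[unit]
    return fields


def swap_vs_inactive(mem_line, swap_line):
    """Return swap in use as a fraction of inactive memory."""
    mem = parse_summary(mem_line)
    swap = parse_summary(swap_line)
    return swap["Used"] / mem["Inact"]


if __name__ == "__main__":
    # The two summary lines from the first top(1) excerpt in this thread.
    mem = "Mem: 474M Active, 1609M Inact, 764M Wired, 281M Buf, 119M Free"
    swap = "Swap: 4096M Total, 917M Used, 3178M Free, 22% Inuse"
    print(f"swap used / inactive = {swap_vs_inactive(mem, swap):.2f}")
```

Feeding it the lines from an affected and an unaffected box gives one ratio per
machine, which may be easier to track across revisions than eyeballing the raw
top(1) output.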