Re: Strange network/socket anomalies since about a month

From: Alexander Leidinger <Alexander_at_Leidinger.net>
Date: Wed, 24 Apr 2024 10:01:26 UTC
Am 2024-04-22 18:12, schrieb Gleb Smirnoff:

> There were several preparatory commits that were not reverted and one 
> of them
> had a bug.  The bug manifested itself as failure to send(2) zero bytes 
> over
> unix/stream.  It was fixed with 
> e6a4b57239dafc6c944473326891d46d966c0264. Can
> you please check you have this revision? Other than that there are no 
> known
> bugs left.

Yes, I have this fix in my running kernel.

> A> Any ideas how to track this down more easily than running the entire
> A> poudriere in ktrace (e.g. a hint/script which dtrace probes to use)?
> 
> I don't have any better idea than ktrace over failing application.  
> Yep, I
> understand that poudriere will produce a lot.  But first we need to 
> determine

Yes, it does. 4.4 GB just for the start of poudriere until the first 
package build fails due to a failed sccache start (luckily in the first 
builder, but I had at least 2 builders automatically spin up by 
poudriere at the time when I validated the failure in the logs and 
disabled the tracing).

> what syscall fails and on what type of socket.  After that we can scope 
> down to
> using dtrace on very particular functions.

I'm not sure I manage to find the cause of the failure... the only thing 
which remotely looks like an issue is "Resource temporarily 
unavailable", but this is from the process which waits for the server to 
have started:
---snip---
  58406 sccache  1713947887.504834367 RET   __sysctl 0
  58406 sccache  1713947887.505521884 CALL  
rfork(0x80000000<><invalid>2147483648)
  58406 sccache  1713947887.505757777 CAP   system call not allowed: 
rfork
  58406 sccache  1713947887.505774176 RET   rfork 58426/0xe43a
  58406 sccache  1713947887.507304865 CALL  
compat11.kevent(0x3,0x371d360f89e8,0x2,0x371d360f89e8,0x2,0)
  58406 sccache  1713947887.507657906 STRU  struct freebsd11_kevent[] = { 
{ ident=11, filter=EVFILT_READ, flags=0x61<EV_ADD|EV_CLEAR|EV_RECEIPT>, 
fflags=0, data=0, udata=0x0 }
              { ident=11, filter=EVFILT_WRITE, 
flags=0x61<EV_ADD|EV_CLEAR|EV_RECEIPT>, fflags=0, data=0, udata=0x0 } }
  58406 sccache  1713947887.507689980 STRU  struct freebsd11_kevent[] = { 
{ ident=11, filter=EVFILT_READ, flags=0x4000<EV_ERROR>, fflags=0, 
data=0, udata=0x0 }
              { ident=11, filter=EVFILT_WRITE, flags=0x4000<EV_ERROR>, 
fflags=0, data=0, udata=0x0 } }
  58406 sccache  1713947887.507977155 RET   compat11.kevent 2
  58406 sccache  1713947887.508015751 CALL  write(0x5,0x371515685bcc,0x1)
  58406 sccache  1713947887.508086434 GIO   fd 5 wrote 1 byte
        0x0000 01                                       |.|

  58406 sccache  1713947887.508145930 RET   write 1
  58406 sccache  1713947887.508183140 CALL  
compat11.kevent(0x7,0,0,0x5a5689ab0c40,0x400,0)
  58406 sccache  1713947887.508396614 STRU  struct freebsd11_kevent[] = { 
  }
  58406 sccache  1713947887.508156537 STRU  struct freebsd11_kevent[] = { 
{ ident=4, filter=EVFILT_READ, flags=0x60<EV_CLEAR|EV_RECEIPT>, 
fflags=0, data=0x1, udata=0xffffffffffffffff } }
  58406 sccache  1713947887.508530888 RET   compat11.kevent 1
  58406 sccache  1713947887.508563736 CALL  read(0x4,0x371d3a2887c0,0x80)
  58406 sccache  1713947887.508729102 GIO   fd 4 read 1 byte
        0x0000 01                                       |.|

  58406 sccache  1713947887.508967661 RET   read 1
  58406 sccache  1713947887.508996753 CALL  read(0x4,0x371d3a2887c0,0x80)
  58406 sccache  1713947887.509028311 RET   read -1 errno 35 Resource 
temporarily unavailable
  58406 sccache  1713947887.509068838 CALL  
compat11.kevent(0x3,0,0,0x5a5689a97540,0x400,0x371d3a2887c8)
...
  58406 sccache  1713947897.514352552 CALL  
_umtx_op(0x5a5689a3d290,0x10<UMTX_OP_WAKE_PRIVATE>,0x7fffffff,0,0)
  58406 sccache  1713947897.514383653 RET   _umtx_op 0
  58406 sccache  1713947897.514421273 CALL  write(0x5,0x371515685bcc,0x1)
  58406 sccache  1713947897.515050967 STRU  struct freebsd11_kevent[] = { 
{ ident=4, filter=EVFILT_READ, flags=0x60<EV_CLEAR|EV_RECEIPT>, 
fflags=0, data=0x1, udata=0xffffffffffffffff } }
  58406 sccache  1713947897.515146151 RET   compat11.kevent 1
  58406 sccache  1713947897.515178978 CALL  read(0x4,0x371d3a2887c0,0x80)
  58406 sccache  1713947897.515368070 GIO   fd 4 read 1 byte
        0x0000 01                                       |.|

  58406 sccache  1713947897.515396600 RET   read 1
  58406 sccache  1713947897.515426523 CALL  read(0x4,0x371d3a2887c0,0x80)
  58406 sccache  1713947897.515457073 RET   read -1 errno 35 Resource 
temporarily unavailable
  58406 sccache  1713947897.515004494 GIO   fd 5 wrote 1 byte
        0x0000 01                                       |.|
---snip---

https://www.leidinger.net/test/sccache.tar.bz2 contains the parts of the 
trace of the sccache processes (in case someone wants to have a look).

The poudriere run had several builders in parallel, at least 2 were 
running at that point in time. What the overlay does is to startup 
(sccache --start-server) the sccache server process (forks to return 
back on the command line) which creates a file system socket, and then 
it queries the stats (sccache --show-stats). So some of the traces in 
the tarball are the server start (those with "Timed out waiting for 
server startup" are maybe the processes which fork to start the server 
and wait for it to be started), some are the stat-query, and some seem 
to be a successful start in another poudriere-builder (those with a 
successful /root/.ccache/sccache/5/4/<hashvalue> access look like from a 
successful start in another jail). Maybe there is also a --stop-server 
from poudriere somewhere.

What I noticed (except that printing the new CAP stuff for non-CAP 
enabled processes by default is disturbing) is, that compat11 stuff is 
called (seems the rust ecosystem is not keeping up with our speed of 
development...). Not sure if it matters here that some compat stuff is 
called.

Bye,
Alexander.

-- 
http://www.Leidinger.net Alexander@Leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org    netchild@FreeBSD.org  : PGP 0x8F31830F9F2772BF