Date: Mon, 22 Apr 2024 09:12:49 -0700
From: Gleb Smirnoff
To: Alexander Leidinger
Cc: Current
Subject: Re: Strange network/socket anomalies since about a month
In-Reply-To: <1fe609f252e7fae6d746530d5035ec0e@Leidinger.net>
List-Id: Discussions about the use of FreeBSD-current

Alexander,

On Mon, Apr 22, 2024 at 09:26:59AM +0200, Alexander Leidinger wrote:
A> I have been seeing a higher failure rate of socket/network related stuff
A> for a while. Those failures are transient. Directly executing the same
A> thing again may or may not result in success/failure. I'm not able to
A> reproduce this at will. Sometimes they show up.
A>
A> Examples:
A> - poudriere runs with the sccache overlay (like ccache but also works for
A>   rust) sometimes fail to create the communication socket and as such the
A>   build fails. I have 3 different poudriere bulk runs after each other in
A>   my build script, and when the first one fails, the second and third
A>   still run. If the first fails due to the sccache issue, the second and
A>   3rd may or may not fail. Sometimes the first fails and the rest is ok.
A>   Sometimes all fail, and if I then run one by hand it works (the script
A>   does the same as the manual run, the script is simply a
A>   "for type in A B C; do poudriere bulk -O sccache -j $type -f ${type}.pkglist; done"
A>   which I execute from the same shell, and the script doesn't do
A>   env-sanitizing).
A> - A webmail interface (inet / local net -> nginx (rev-proxy) -> nginx
A>   (webmail service) -> php -> imap) sees intermittent issues sometimes.
A>   Opening the same email directly again afterwards normally works. I've
A>   also seen transient issues with pgp signing (webmail interface ->
A>   gnupg / gpg-agent on the server); simply hitting send again after a
A>   failure works fine.
A>
A> Gleb, could this be related to the socket stuff you did 2 weeks ago? My
A> world is from 2024-04-17-112537. I have noticed this since at least then,
A> but I'm not sure if they were there before that and I simply didn't
A> notice them. They are surely "new recently"; I hadn't seen that amount of
A> issues in January. The last two updates of current I did before the last
A> one were on 2024-03-31-120210 and 2024-04-08-112551.

The stuff I pushed 2 weeks ago was a large rewrite of unix/stream, but that
was reverted, as it appears to need more work wrt aio(4) and nfs/rpc, and it
also turned out that sendfile(2) over unix(4) has some non-zero use. There
were several preparatory commits that were not reverted, and one of them had
a bug. The bug manifested itself as a failure to send(2) zero bytes over
unix/stream.
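To check a given kernel for this particular regression without involving poudriere, a tiny userland test along the following lines may help. It is a sketch, not a test from the FreeBSD tree, and the function name zero_len_send_ok() is hypothetical: it creates a unix/stream socketpair(2) and expects a zero-length send(2) to return 0. The assumption (not verified here) is that a kernel carrying the buggy preparatory commit but not the fix would fail this call instead.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a zero-length send(2) over a
 * unix/stream socketpair returns 0 as expected, 0 otherwise.
 */
int
zero_len_send_ok(void)
{
	int sv[2];
	char byte = 0;	/* valid buffer, although no data is sent */
	ssize_t n;
	int ok;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) {
		perror("socketpair");
		return (0);
	}

	/* Send zero bytes; a correct kernel returns 0, not -1. */
	n = send(sv[0], &byte, 0, 0);
	ok = (n == 0);
	if (!ok)
		fprintf(stderr, "send() of 0 bytes returned %zd (%s)\n", n,
		    n == -1 ? strerror(errno) : "unexpected length");

	close(sv[0]);
	close(sv[1]);
	return (ok);
}
```

A caller would simply check the return value, e.g. exit(zero_len_send_ok() ? 0 : 1) from main().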
It was fixed with e6a4b57239dafc6c944473326891d46d966c0264. Can you please
check that you have this revision? Other than that, there are no known bugs
left.

A> I could also imagine that some memory related transient failure could
A> cause this, but with >3 GB free I do not expect this. Important here may
A> be that I have https://reviews.freebsd.org/D40575 in my tree, which is
A> memory related, but it's only a metric to quantify memory fragmentation.
A>
A> Any ideas how to track this down more easily than running the entire
A> poudriere in ktrace (e.g. a hint/script which dtrace probes to use)?

I don't have any better idea than ktrace over the failing application. Yep,
I understand that poudriere will produce a lot. But first we need to
determine which syscall fails and on what type of socket. After that we can
narrow it down to using dtrace on very specific functions.

-- 
Gleb Smirnoff