From: Andrea Venturoli <ml@netfence.it>
To: freebsd-questions@freebsd.org
Subject: 13.3 troubles under load
Date: Tue, 2 Apr 2024 09:20:42 +0200
Message-ID: <1ca17a7a-025d-4403-a7f3-2892408ad628@netfence.it>

Hello.

Now that 13.3 is out, and given the relatively short support overlap window, I started upgrading my 13.2 machines as soon as I had the chance. However, I'm experiencing some trouble under load, in cases where every version up to 13.2 has always worked without problems.

Scenario 1: Box A is ZFS/SSD based, but has a UFS HD (holding only specific data) which is exported via NFSv4. Box B mounts that NFSv4 share and backs it up to a UFS/USB disk via rsync. This has always worked fine until I upgraded box A to 13.3.
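To give a clearer picture, the setup is roughly the following (host names, paths and the network below are only placeholders, not my exact configuration):

  # box A, /etc/exports: the UFS HD, exported via NFSv4
  V4: / -sec=sys
  /data -maproot=root -network 10.1.2.0 -mask 255.255.255.0

  # box B: mount the share and copy it to the UFS/USB disk
  mount -t nfs -o nfsv4,ro boxa:/data /mnt/data
  rsync -a --delete /mnt/data/ /usbdisk/backup/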
Now, while rsync does its job, box A starts crawling: Nagios reports several failures (daemons which either die or can no longer answer in time) and logging in via SSH becomes almost impossible (already open sessions are nearly unusable). The system is on ZFS, so it should not be affected by the load on the UFS HD; besides, a single UFS HD should not be able to generate enough load to bring an 8-core system with 32 GiB of RAM to a halt. Is it possible that such modest network traffic (lagg with two em cards) brings this box almost to a standstill? Unfortunately, so far I don't have any useful logs.

Scenario 2: A box is running several services (including two clamd instances in two different jails). Once a week, it connects to a NAS via Bacula and copies ~1 TB of data to an external UFS HD. As in the previous example, after I upgraded to 13.3 this simple operation (which had worked for several years) has become problematic, as daemons are killed all through it:

> Apr 1 20:01:31 xxxxxxx kernel: pid 11753 (clamd), jid 3, uid 26, was killed: a thread waited too long to allocate a page
> Apr 1 20:02:18 xxxxxxx kernel: pid 11720 (clamd), jid 5, uid 26, was killed: a thread waited too long to allocate a page
> Apr 1 20:03:16 xxxxxxx kernel: pid 3707 (squid), jid 3, uid 100, was killed: a thread waited too long to allocate a page
> Apr 1 20:03:54 xxxxxxx kernel: pid 7400 (zeek), jid 7, uid 782, was killed: a thread waited too long to allocate a page
> Apr 1 20:04:25 xxxxxxx kernel: pid 1813 (snort), jid 0, uid 0, was killed: a thread waited too long to allocate a page
> Apr 1 20:05:59 xxxxxxx kernel: pid 7399 (zeek), jid 7, uid 782, was killed: a thread waited too long to allocate a page
> Apr 1 20:05:59 xxxxxxx kernel: pid 1820 (snort), jid 0, uid 0, was killed: a thread waited too long to allocate a page
> Apr 1 20:06:48 xxxxxxx kernel: pid 44493 (perl), jid 5, uid 26, was killed: a thread waited too long to allocate a page
> Apr 1 20:07:22 xxxxxxx kernel: pid 44512 (perl), jid 5, uid 26, was killed: a thread waited too long to allocate a page
> Apr 1 20:09:23 xxxxxxx kernel: pid 7254 (zeek), jid 7, uid 782, was killed: a thread waited too long to allocate a page
> Apr 1 20:10:17 xxxxxxx kernel: pid 14462 (mysqld), jid 11, uid 88, was killed: a thread waited too long to allocate a page
> Apr 1 20:10:17 xxxxxxx kernel: pid 83231 (smbd), jid 8, uid 0, was killed: a thread waited too long to allocate a page
> Apr 1 20:10:17 xxxxxxx kernel: pid 28868 (smbd), jid 8, uid 0, was killed: a thread waited too long to allocate a page
> Apr 1 20:10:17 xxxxxxx kernel: pid 92611 (smbd), jid 8, uid 0, was killed: a thread waited too long to allocate a page
> Apr 1 20:12:20 xxxxxxx kernel: pid 77438 (clamd), jid 3, uid 26, was killed: a thread waited too long to allocate a page
> Apr 1 20:13:47 xxxxxxx kernel: pid 77473 (clamd), jid 5, uid 26, was killed: a thread waited too long to allocate a page

Again, system/swap is on an SSD ZFS RAID pool, so disk load on the UFS USB HD shouldn't hamper its throughput. This time the network is still a lagg, but with igb cards (so a similar driver).

Any hint on what to look for? Is there some known problem with lagg, if_em/if_igb, USB, UFS, or something else?

 bye & Thanks
	av.
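P.S. In case it helps, this is what I plan to collect during the next backup run (standard tools only; the vm.pfault_oom_* sysctls are just my guess at what's involved, prompted by the wording of the kill message):

  # watch disk, memory and per-thread CPU while Bacula/rsync is running
  gstat -p
  vmstat -w 5
  top -SH -o res

  # current values of the page-fault OOM tunables
  sysctl vm.pfault_oom_attempts vm.pfault_oom_wait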